Parallel Algorithms for Near-Realtime Visualization
Two eyes give humans and animals depth perception. The images are obtained from two independent optical systems; the brain correlates the two and provides a 3-D impression.
The Mars Exploration Rover carries several cameras, including two dedicated sets of stereo cameras (navigation and panorama). The stereo images are used to determine distances to objects and to generate 3-D maps.
While optical correlation in the brain appears to function with incredible ease, the digital correlation of two images is a numerically intensive task. This web page briefly describes our work to speed up this task.
The original algorithm was developed in the Multi-mission Image Processing Laboratory (MIPL). It typically consumed 90 minutes on a 450MHz Pentium III CPU running Linux. Under some conditions the algorithm was plagued by convergence problems that could result in computation times of days or longer. This task was initiated to reduce the required time for image correlation by at least one order of magnitude through the use of a Beowulf cluster. We performed algorithmic changes and parallelized the code. The new parallel algorithm partitions the original image into separate segments that are processed on different CPUs, as indicated by the blue lines in the images above. The new algorithm requires about 55 minutes on a single CPU (due to the algorithm changes) and runs in about 4.5 minutes on 16 CPUs. Running the same problem on an 800MHz Pentium III cluster reduces the time further, to about 3 minutes on 16 CPUs. Tests on old inputs that choked the original algorithm with convergence problems showed that the new algorithm no longer appears to be subject to these pathologies. The contoured XYZ data shown above is the essential input to the determination of distances to objects. The color transitions indicate the transitions between a ridge and the valley behind it (the valley is farther away than the ridge, resulting in a depth difference).
The old algorithm:
The numerical correlation algorithm tries to find, for every pixel in the left image, a corresponding pixel in the right image. Some information is available from the camera model, which contains basic parameters such as the distance between the two cameras, the camera orientation, etc. The algorithm is based on the correlation of a variable-size correlation window. A variety of parameterizations are available that work well with different physical features, such as edges and rocks. The correlation starts from a sequence of seed points (fed in as an additional input to the two starting images). A correlation is attempted at a given seed point. Upon success, an ever-increasing window is created and correlations are performed on the edge of this window. A secondary algorithm tries to correlate points that failed in a previous attempt inside the window, using the new information that may be available to later correlations. The correlation stops if none of the pixels inside and on the edge of the window can be correlated. This process continues until all the seed points have been processed.
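The window-growing process above can be sketched as follows. This is a minimal, runnable illustration, not the MIPL code: correlate_pixel is a hypothetical stand-in for the core correlator (here it simply compares toy pixel values rather than using the camera model and window parameterizations), and the secondary pass over failed points inside the window is omitted.

```python
def correlate_pixel(left, right, x, y):
    """Hypothetical stand-in for the core correlator: a pixel pair
    'correlates' when the toy images agree at that position."""
    h, w = len(left), len(left[0])
    return 0 <= x < w and 0 <= y < h and left[y][x] == right[y][x]

def window_edge(sx, sy, r):
    """Yield the pixels on the edge of the square window of radius r
    centered on (sx, sy)."""
    for x in range(sx - r, sx + r + 1):
        yield (x, sy - r)
        yield (x, sy + r)
    for y in range(sy - r + 1, sy + r):
        yield (sx - r, y)
        yield (sx + r, y)

def grow_from_seed(left, right, seed, done):
    """Correlate the seed, then expand an ever-increasing window,
    correlating along its edge; stop once an entire edge fails."""
    sx, sy = seed
    if not correlate_pixel(left, right, sx, sy):
        return
    done.add(seed)
    r = 1
    while True:
        any_success = False
        for (x, y) in window_edge(sx, sy, r):
            if (x, y) not in done and correlate_pixel(left, right, x, y):
                done.add((x, y))
                any_success = True
        if not any_success:   # no edge pixel correlated: stop growing
            break
        r += 1
```

On a pair of identical 5x5 toy images, a single seed at the center correlates every pixel before the growth stops at the image border.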
Before the existing algorithm was parallelized, a timing analysis was performed to determine in which functions most of the time is spent. Most of the time is indeed spent in the core function that performs the actual correlation evaluation of a few pixels (as opposed to set-up functions or the drivers that feed pixel pairs to the core correlator). However, the analysis showed that this function was called several orders of magnitude more often than there are pixels to correlate. Most of the calls turned out to exit immediately, since the particular pixel pair had already been correlated. Such a test-only entry and exit from a function turns out to be very expensive, since it results in register push operations. We moved this check outside the function call to remove this inefficiency. We also reduced the overall number of correlation existence checks through several algorithm changes. One of the major contributors to failed correlations is the attempt to correlate edge pixels that exist only in the left image but not in the right. These changes resulted in a reduction of required CPU time by about 10-20%.
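The call-overhead fix can be illustrated with a toy driver. The names and the call counter below are illustrative, not from the MIPL code; the point is only that hoisting the "already correlated?" test to the call site avoids paying function-call overhead for the common early-exit case.

```python
CALLS = {"naive": 0, "hoisted": 0}

def correlate_core(pixel, done, counter):
    """Stand-in for the expensive core correlator."""
    CALLS[counter] += 1    # count every entry into the function
    if pixel in done:      # the test the original code performed inside
        return
    done.add(pixel)        # stand-in for the real correlation work

def drive_naive(candidates):
    """Original scheme: call the correlator for every candidate, even
    repeats -- each repeat still pays the call overhead just to exit."""
    done = set()
    for p in candidates:
        correlate_core(p, done, "naive")
    return done

def drive_hoisted(candidates):
    """Improved scheme: the already-correlated test is hoisted to the
    call site, so repeated candidates never enter the function at all."""
    done = set()
    for p in candidates:
        if p in done:
            continue
        correlate_core(p, done, "hoisted")
    return done
```

With a candidate stream containing many repeats, both drivers correlate the same pixels, but the hoisted driver enters the core function far fewer times.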
The parallelization scheme:
With these improvements and an analysis of the actual time spent in the core correlation function, we could determine an appropriate path for program parallelization. An individual call to the function consumes about 0.5ms (on an 800MHz Pentium III), and it is called about twice the number of total pixels in the left image (approximately 2 million times). The actual execution time of a call also depends on the correlation input; i.e., the correlation of an initial seed point, the correlation on a window edge, and the correlation of remaining gores take different times. Fine-grain parallelization of this function call would be completely futile for such a fast execution time, because communication costs such as sending a message to another CPU or sharing information between threads cost about that much time (within a factor of 10). We therefore decided to divide the image into segments and perform the correlation of the segments independently on different CPUs (as indicated by the blue bars in the image above).
Significant algorithm difference:
A division of the image into several independent segments is a significant deviation from the original algorithm, which was based on correlation on the edge of an ever-increasing window. Segmenting the image introduces new borders across which correlation cannot proceed. (For this first parallelization attempt we decided not to transmit correlation information between adjacent segments on adjacent CPUs, since doing so carries significantly increased communication and algorithm complexity costs.) Note that right-image pixels that fall formally outside a particular CPU's segment of the left image can still be correlated, since the full left and right images are available to the correlation algorithm on each CPU. To test how tightly correlation information needs to be tied between different correlation segments, we enabled an option that leaves a frame around each image segment uncorrelated; this frame is filled in on a second pass through the correlation algorithm once the data has been gathered from all CPUs. At this time such a single pass through the correlation is performed by the master CPU only. More complicated divisions of the uncorrelated border segments among more CPUs are of course possible.
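The two-pass border scheme can be sketched by shrinking each segment before the first pass. This is an illustration under assumed conventions: row-strip segments, a frame only along internal segment boundaries, and an illustrative border width.

```python
def first_pass_interior(start, end, height, border):
    """Rows a CPU correlates in the first pass: its segment minus a
    `border`-row frame along every internal segment boundary. The true
    image edges have no neighbouring segment, so no frame is left there."""
    lo = start if start == 0 else start + border
    hi = end if end == height else end - border
    return (lo, hi)

def second_pass_rows(segments, height, border):
    """Rows left uncorrelated after the first pass, to be filled in by
    the master CPU once all segment results have been gathered."""
    rows = set(range(height))
    for (s, e) in segments:
        lo, hi = first_pass_interior(s, e, height, border)
        rows -= set(range(lo, hi))
    return sorted(rows)
```

For three segments of a 10-row image with a 1-row border, only the rows straddling the two internal boundaries are deferred to the master's second pass.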
Another algorithm change that was necessary before the program could be parallelized was the implementation of a seed generation algorithm that can be called on an arbitrary segment of the image. The original algorithm fed a list of seed points into the correlator through an input parameter. It can be argued that this procedure enabled the feeding-in of "good" correlation points, which may have been picked by humans or by other algorithms that identify unique/simple features such as a big rock or a distinct edge, to increase the chance of a good correlation. In practice, however, the seeds were always picked the same way: distributed evenly throughout the image, yet uncorrelated / random with respect to the image data. Within the newly implemented seed generation algorithm we pick the first point at the dead center of the segment. Subsequent points are picked at random within the segment. If the new seed pixel has already been correlated, or if the quality of a seed (as evaluated by a single pass through the correlator) is insufficient, a new seed is generated.
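A sketch of this per-segment seed generator follows. seed_quality is a hypothetical stand-in for the single evaluation pass through the correlator, and the quality threshold and attempt cap are illustrative assumptions.

```python
import random

def seed_quality(pixel):
    """Hypothetical stand-in for a single evaluation pass through the
    correlator; in this toy version every pixel is a perfect seed."""
    return 1.0

def generate_seeds(x0, x1, y0, y1, done, n, min_quality=0.5, rng=None):
    """Collect up to n seeds for the segment [x0,x1) x [y0,y1): the first
    candidate is the dead center, later candidates are random; a candidate
    is rejected if already correlated or of insufficient quality."""
    rng = rng or random.Random(0)   # fixed seed keeps the sketch repeatable
    seeds = []
    pixel = ((x0 + x1) // 2, (y0 + y1) // 2)   # first seed: dead center
    for _ in range(100 * n):                   # safety cap on attempts
        if pixel not in done and seed_quality(pixel) >= min_quality:
            seeds.append(pixel)
            done.add(pixel)
            if len(seeds) == n:
                break
        pixel = (rng.randrange(x0, x1), rng.randrange(y0, y1))
    return seeds
```

Accepted seeds are added to the correlated set immediately, so a random candidate that repeats an earlier seed is simply rejected and redrawn.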
Stopping the algorithm and CPU synchronization:
After a seed is generated, the algorithm remains the same within an image segment. Since the correlation of seed points, window edges, and gores can take different amounts of time, one can easily imagine that the different CPUs working on the different image segments may finish at different times. To prevent the run-away correlation delay of a single CPU that would hold up all the others (as experienced in the original single-CPU code), we limit the number of times correlations are attempted. The correlation within the segments on the different CPUs is controlled by a maximum number of (successful) seed generations and a desired filling percentage of correlated pixels within the segment. The correlation is stopped once the maximum number of trials has been executed or the desired filling percentage has been achieved. A time-based cut-off that synchronizes all CPUs against a desired maximum execution time could also be implemented. A master/slave approach could also even out a possible workload difference.
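The stopping rule amounts to a per-segment control loop. In this sketch, grow_one_seed is a toy stand-in that credits a fixed number of newly correlated pixels per seed trial; in the real code each trial would generate a seed and grow a correlation window from it.

```python
def grow_one_seed():
    """Toy stand-in for generating a seed and growing a correlation
    window from it; returns the number of newly correlated pixels."""
    return 100

def run_segment(total_pixels, max_trials, target_fill):
    """Correlate one segment until either the maximum number of seed
    trials is spent or the desired filling fraction is reached."""
    correlated = 0
    trials = 0
    while trials < max_trials and correlated / total_pixels < target_fill:
        trials += 1
        correlated = min(total_pixels, correlated + grow_one_seed())
    return correlated, trials
```

For example, a 1000-pixel segment with a 90% fill target stops after 9 trials, while a 5-trial budget stops the same segment at 50% fill.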
Run-away correlation problem eliminated:
With the algorithm changes described above, we have tested image pairs that caused the original correlation algorithm to run for weeks at a time. Our new algorithm no longer suffers from this pathology and finishes in an average time compared to other image pairs. The number of correlated points was approximately the same as for any other image pair (80-90%). The successful finish of the new algorithm is not due to a premature stop of the correlation, which might leave one segment completely uncorrelated. It turns out that the original code was stuck on gore-point correlations due to bad seeds. The segmentation of the image speeds up the search for the remaining gores and their processing significantly.
This parallel correlator prototype was tested in the most recent field test of the Mars Exploration Rover at the beginning of May 2001.
Other parallelization projects:
We have also worked on the parallelization of other Mars imaging software, such as the generation of large mosaics from many individual images.
This work was sponsored by the TMOD technology program under the Beowulf Application and Networking Environment (BANE) task. The original VICAR-based software is maintained in the Multi-mission Image Processing Laboratory (MIPL). The work was performed in collaboration between Gerhard Klimeck, Myche McAuley, Tom Cwik, Bob Deen, and Eric DeJong.