We present a complete approach to highly efficient image registration for embedded systems, covering all steps from theory to practice. An optimization-based image registration algorithm using a least-squares data term is implemented on an embedded distributed multicore digital signal processor (DSP) architecture. All relevant parts are optimized, ranging from mathematics, algorithmics, and data transfer to hardware architecture and electronic components.

The optimization for the rigid alignment of two-dimensional images is performed in a multilevel Gauss-Newton minimization framework. We propose a reformulation of the necessary derivative computations, which eliminates all sparse matrix operations and allows for parallel, memory-efficient computation. The pixelwise parallelism forms an ideal starting point for our implementation on a multicore, multichip DSP architecture.

The reduction of data transfer to the particular DSP chips is key for an efficient calculation. By determining worst cases for the subimages needed on each DSP, we can substantially reduce data transfer and memory requirements. This is accompanied by a sophisticated padding mechanism that eliminates pipeline hazards and speeds up the generation of the multilevel pyramid.

Finally, we present a reference hardware architecture consisting of four TI C6678 DSPs with eight cores each. We show that it is possible to register high-resolution images within milliseconds on an embedded device. In our example, we register two images with 4096x4096 pixels within 93 ms while offloading the CPU by a factor of 20 and requiring 3.12 times less electrical energy.