We present a CUDA implementation of a complete registration algorithm that aligns two multimodal images using affine linear transformations and normalized gradient fields. Through extensive use of the different GPU memory types, careful thread management, and efficient hardware interpolation, we obtain fast-executing code. Contrary to the common practice of reducing the number of kernel calls, we significantly increased performance by splitting a single kernel into multiple smaller ones. Our GPU implementation achieves a speedup of up to a factor of 11 over parallelized CPU code. Matching two 512 × 512 pixel images takes 37 milliseconds, making state-of-the-art multimodal image registration available in real-time scenarios.