We present a novel parallelized formulation for fast non-linear image registration. By carefully analyzing the mathematical structure of the intensity-independent Normalized Gradient Fields (NGF) distance measure, we obtain a scalable, parallel algorithm that combines fast registration with high accuracy. Starting from an initial formulation as an optimization problem, we derive a per-pixel parallel formulation that drastically reduces computational overhead. The method was evaluated on ten publicly available 4DCT lung datasets, achieving an average registration error of only 0.94 mm with a runtime of about 20 s. When the finest level is omitted, the runtime drops to 6.56 s with a moderate increase in registration error to 1.00 mm. In addition, our algorithm shows excellent scalability on a multi-core system.
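To illustrate why a gradient-based distance of this kind lends itself to per-pixel parallelization, the following minimal sketch evaluates a commonly used form of the NGF term, 1 - ((<grad T, grad R> + eps^2) / (|grad T|_eps |grad R|_eps))^2, one pixel at a time. It is not the formulation derived in this work: the 2D setting, the central-difference gradient with clamped borders, the single edge parameter `eps`, and the function names `gradient`, `ngf_pixel`, and `ngf_distance` are all assumptions made for illustration.

```python
import math


def gradient(img, i, j):
    """Central-difference gradient at pixel (i, j); indices are clamped at the border."""
    h, w = len(img), len(img[0])
    gy = 0.5 * (img[min(i + 1, h - 1)][j] - img[max(i - 1, 0)][j])
    gx = 0.5 * (img[i][min(j + 1, w - 1)] - img[i][max(j - 1, 0)])
    return gx, gy


def ngf_pixel(ref, tmpl, i, j, eps=1e-2):
    """NGF term for a single pixel (assumed form, one shared edge parameter eps).

    The term reads only a small neighbourhood of (i, j) in both images,
    so every pixel can be evaluated independently of all others.
    """
    rx, ry = gradient(ref, i, j)
    tx, ty = gradient(tmpl, i, j)
    dot = rx * tx + ry * ty + eps * eps
    norm_r = math.sqrt(rx * rx + ry * ry + eps * eps)
    norm_t = math.sqrt(tx * tx + ty * ty + eps * eps)
    return 1.0 - (dot / (norm_r * norm_t)) ** 2


def ngf_distance(ref, tmpl, eps=1e-2):
    """Sum of the independent per-pixel terms (written serially here)."""
    h, w = len(ref), len(ref[0])
    return sum(ngf_pixel(ref, tmpl, i, j, eps) for i in range(h) for j in range(w))


# Small usage example: horizontal and vertical intensity ramps on a 4x4 grid.
ramp_x = [[float(j) for j in range(4)] for _ in range(4)]
ramp_y = [[float(i) for _ in range(4)] for i in range(4)]
rescaled = [[3.0 * v + 5.0 for v in row] for row in ramp_x]

d_same = ngf_distance(ramp_x, ramp_x)        # aligned gradients
d_rescaled = ngf_distance(ramp_x, rescaled)  # same edges, different intensities
d_orth = ngf_distance(ramp_x, ramp_y)        # orthogonal gradients
```

In this sketch the distance stays close to zero when the template is an affinely rescaled copy of the reference, which reflects the intensity independence mentioned above, and approaches the pixel count when the gradients are orthogonal. Because the sum has no dependencies between its terms, the loop over pixels could be split across threads or cores without synchronization beyond the final reduction.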