Fast bit-reversals on unitprocessors and shared-memory multiprocessors 

Z. Zhang and X. Zhang

SIAM Journal on Scientific Computing, Vol. 22, No. 6, 2001, pp. 2113-2134. 


In this paper, we examine different methods using techniques of
blocking, buffering, and padding for efficient implementations of
bit-reversals. We
evaluate the merits and limits of each technique and their application and
architecture-dependent conditions for developing cache-optimal methods.
Besides testing the methods on different uniprocessors, we
conducted both simulation and measurements on two commercial
SMP multiprocessors to provide architectural insights into the methods
and their implementations.
We present two contributions in this paper: (1) Our integrated blocking
methods, which match cache associativity and TLB cache size and which
fully use the available registers are cache-optimal and fast.  (2)
We show that our padding methods outperform other software oriented
methods, and believe they are the fastest in terms of minimizing both
CPU and memory access cycles. Since the padding methods are almost
independent of hardware, they could be widely used on many uniprocessor
workstations and multiprocessors.