Note: This article is a bit dated, and as the relative speed of memory access, floating point calculations, and integer/pointer calculations (for array indexing) change; it may be less relavant. However I believe some of the techniques used here (Esp. the 1-multiply,1-add trig generation) may still be relevant to modern FFTs as well.
As shown on the excellent FFTW benchmark page, today, for complex-valued fft's there are a number of algorithms that beat this one (except for the kinda special case of N=2^11 on a power mac;-)). I haven't seen a comprehensive benchmark for real-valued ffts or hartley transforms yet.
Newsgroups: comp.dsp
Subject: Real FFT comparison
References: <[email protected]>
Reply-To: [email protected] ()[changed to [email protected]]
Distribution: world
Organization: ACUSON, Mountain View, CA
Keywords: FFT, Real valued.
[Date: Sometime in 1993, I believe.]
Summary: The Real Valued FFT Henrik Sorensen posted here a few days ago
is considerably faster than any of the other FFT routines I've
seen posted recently. I think I have a FHT which seems
faster on some architectures.
This posting will explain my results, the following posting
will contain my source code, a C translation of Sorensen's
code, and timing routines to compare them.
The code Henrik posted does indeed require less floating point computation than my FHT routine; but it requires almost 3/2 times as many memory accesses. Systems, like a Sun Sparc, with fast floating point arithmetic and slower memory access give my FHT an advantage. We also used different methods of calcuating trig values; but on a Sparc his algorithm was faster. Another difference may have been in my translating his code from Fortran to C; I just did a line-by-line translation, and I'm certain it could have been cleaned up to better suit gcc's optimizer. My apologies if the performance overhead is due to a poor C translation on my part.
I'd guess a radix-4 real-valued FFT should beat them both.
---------------------------------------------------------------------- Timing Results
The tables below contines timing results for various array sizes for each of the a number of different FFT algorithms. The timing values are in units of "microseconds/n/log(n)/number_of_ffts_run". An explanation of each algorithm follows the tables.
The following table, in roughly decreasing order of performance, contains timing results on a Sparcstation2 and compiled with gcc version 2.0.2 with the flag -O4.
fft_algorithm
| trig_algorithm
| |
A: fourea-real ;c_lib
B: fourea ;c_lib
C: duhamel ;c_lib
D: mayer-rad4 ;c_lib
E: numrec ;numrec
F: singleton ;sing
G: mayer-rad4 ;Buneman2
H: mayer-rad4 ;Simple
I: nrec_real ;numrec (real valued fft)
J: sing_real ;sing (real valued fft)
K: mayer-real ;Buneman2 (real valued fft)
L: sorensen-r ;n-rec (real valued fft)
M: mayer-fht ;Buneman2 (hartley transform)
# A B C D E F G H I J K L M
4 : 5.9 5.5 2.3 1.5 5.6 9.4 1.6 1.6 4.0 11. 0.82 0.82 0.63
8 : 5.1 5.0 2.7 1.1 3.3 6.2 1.1 1.1 2.5 3.5 0.58 0.65 0.41
16 : 4.5 4.3 2.8 2.0 2.2 2.6 1.2 1.1 1.6 2.8 0.61 0.71 0.50
32 : 4.0 3.9 2.7 2.0 1.8 1.9 1.0 1.0 1.2 1.4 0.53 0.61 0.42
64 : 3.6 3.5 2.6 2.1 1.5 1.4 1.1 1.0 0.96 1.1 0.53 0.56 0.45
128 : 3.2 3.2 2.4 2.0 1.4 1.6 0.96 0.92 0.81 0.87 0.49 0.48 0.42
256 : 3.0 2.9 2.2 2.0 1.3 1.3 0.98 0.91 0.74 0.89 0.50 0.49 0.43
512 : 2.9 2.8 2.1 1.9 1.2 1.2 0.92 0.90 0.71 0.70 0.46 0.44 0.40
1024 : 2.7 2.8 2.2 1.8 1.2 1.3 0.94 0.92 0.69 0.65 0.46 0.45 0.41
2048 : 2.6 2.9 2.2 1.7 1.2 1.5 0.91 0.87 0.68 0.68 0.44 0.44 0.39
4096 : 2.6 3.0 2.3 1.7 1.2 1.4 0.94 0.91 0.68 0.81 0.46 0.45 0.41
8192 : 2.8 2.9 2.2 1.6 2.2 1.4 0.89 0.86 0.66 0.73 0.44 0.44 0.39
16384: 3.5 3.5 2.6 2.1 2.3 1.7 1.5 1.5 0.99 0.75 0.74 0.80 0.69
32768: 3.5 3.5 2.6 2.1 2.3 1.9 1.5 1.5 1.0 0.88 0.74 0.85 0.69
65536: 3.5 3.5 2.5 2.1 2.4 1.7 1.5 1.5 1.0 0.99 0.77 0.85 0.71
real-value *** *******************
hartley ****
radix2 ********* ****
radix4 **** ************** ********* ****
splitradix **** ****
c-lib-trig ******************
Note: * There's at least a factor of 3 in the time taken by various algorithms
* Real valued transforms are significantly faster than complex ffts
* FFTs which rely on C library calls for all trig values are very slow
* My workstation apparently started paging at about N=16384. This
slowed down memory accesses. It is interesting to note that
the split-radix-real-valued-FFT was more affected by paging
than the radix-4 fht. My complex-FFT is also severely
affected by memory paging.
* Order(N) and Order(log(N)) components are very significant for
transforms smaller than about 256. If FFTs were a purely N*log(N)
process all the numbers in any given column would be constant.
* I suspect these results are highly architecture dependant. On
a 486 system, Henrik's FFT was faster than my FHT for most 2^K
for even K, and slower for most 2^K for odd K. The main
reason for this is probalby because the 486 has relatively few
registers, so my algorithm can't effectively decrease the
memory accesses.
-----------------
Description of FFT algorithms compared.
numrec
A straightforward radix-2 FFT
Source: "Numerical Recipies in C";
duhamel
"A Duhamel-Hollman split-radix dif fft"
by: Dave Edelblute, [email protected], 05 Jan 1993 [...]
Ref: Electronics Letters, Jan. 5, 1984"
Source: periodic posting on comp.benchmarks
fourea
"adapted from subroutine FOUREA listed in Programs for Digital
Signal Processing [...] edited by Digital Signal Processing
Committee [...] IEEE Acoustics Speech and Signal Processing
Committee [...] Chapter 1 Section 1.1 Page 1.1-4,5"
Source: comp.dsp 02/04/93; by [email protected] (Ted Wong)
singleton
"Singleton's mixed radix algorithm"
Source: From edu/math/msdos/modelling on wuarchive.wustl.edu
Translated to C by: Javier Soley, [email protected]
sorensen
my C translation of the code Henrik Sorensen posted here.
Source: "Real-valued fast Fourier transform algorithms", IEEE
Tran. ASSP, Vol. ASSP-35, No. 6, pp. 849-864, June 1987
(My C translation included in the following posting)
mayer-*
All my routines are wrappers around a fast Hartley transform.
Source: the following posting.
---------------------------------------------------------------------------
Trig algorithms
A few people asked me via email about the trig algorithm I used: I'll try to look up the reference to the trig algorithm I was using; but I think I lost it somewhere in storage. It works by manipulating a log2(n) sized table of trig values. Each newly generated value is for an angle which lies exactly halfway between two angles whose trig values are known. I think the identity is something like
sin( (w+v)/2 ) = (sin(w) + sin(v)) * sec(v-w) / 2
although I might be wrong about the argument inside the sec()
function. You can start with a table of cos(pi),cos(pi/2),cos(pi/4),
cos(pi/8)...., and everytime you generate a new value you can
overwrite one which you no longer need (to avoid excessive memory
usage). This takes less floating point math than any of the other
trig generating algorithms I know of; but requires a bunch of integer
math to determine which value can be overwritten. It is probably only
faster on systems which emulate floating point math; but it is
considerably more stable than other common methods.
I'll try to find and mention the reference in a future posting.
--------------------------------------------------------------------------- Accuracy and Noise of FFT algorithms.
An earlier discussion I had on comp.benchmarks people were discussing possible tradeoffs between speed and accuracy. The amount of noise generated by various FFT algorithms varies by a factor of about 1e10! However, as shown in the table below, this is not all that closely correlated to the speed of the algorithm.
In reality, noise added by any of these FFTs probably only matters if you are doing zillions of forward and reverse transforms; but some will argue that a good FFT should have very low noise.
One significant component of errors in FFTs is the trig algorithm used. Avoiding recursively successive calculating trig functions can reduce accumulated errors by about a factor of 5. Unfortunatelly C library functions are too slow for this purpose. See my code in the next posting for a number of compromise trig algorithms.
algo trig N time ssq-errors fourea-real ;c_lib 65536 : 3.48 2.19e-12 fourea ;c_lib 65536 : 3.45 2.64e-11 duhamel ;c_lib 65536 : 2.53 4.63e-21 numrec ;numrec 65536 : 2.35 9.13e-14 mayer-rad4 ;c_lib 65536 : 2.08 1.79e-21 *(all library functions) sing ;sing 65536 : 1.68 5.22e-19 mayer-rad4 ;Buneman2 65536 : 1.52 2.46e-21 *(half-angle algorithm) mayer-rad4 ;Simple 65536 : 1.51 8.16e-21 *(recursively generated) nrec_real ;numrec 65536 : 1.04 5.39e-19 sing_real ;sing 65536 : 0.987 ???????? sorensen-r ;n-rec 65536 : 0.855 1.83e-20 mayer-real ;Buneman2 65536 : 0.765 1.78e-21 mayer-fht ;Buneman2 65536 : 0.707 2.46e-21
Note that the wierd trig algorithm reduces errors by almost a factor of 4 with negligable performance cost.
----------------------------------------------------------------------------
Ron Mayer
[email protected]