Performance and accuracy benchmarking for FFT (Fast Fourier Transform), RFFT(real valued fft) and FHT (Fast Hartley Transform) algorithms.

Note: This article is a bit dated, and as the relative speed of memory access, floating point calculations, and integer/pointer calculations (for array indexing) change; it may be less relavant. However I believe some of the techniques used here (Esp. the 1-multiply,1-add trig generation) may still be relevant to modern FFTs as well.

As shown on the excellent FFTW benchmark page, today, for complex-valued fft's there are a number of algorithms that beat this one (except for the kinda special case of N=2^11 on a power mac;-)). I haven't seen a comprehensive benchmark for real-valued ffts or hartley transforms yet.

For a general purpose systems(PC's and workstations) and general applications I would recommend FFTW .
For embedded systems with certain memory and/or processing constraints this code may be better.
Obviously for DSPs, use the (almost certainly assembly language) FFT recommended by the DSP manufacturer.

Source code for FFT's and FHT's described below.

Newsgroups: comp.dsp
Subject: Real FFT comparison
References: <[email protected]>
Reply-To: [email protected] ()[changed to [email protected]]
Distribution: world
Organization: ACUSON, Mountain View, CA
Keywords: FFT, Real valued.
[Date: Sometime in 1993, I believe.]


Summary:  The Real Valued FFT Henrik Sorensen posted here a few days ago
          is considerably faster than any of the other FFT routines I've
          seen posted recently.  I think I have a FHT which seems
          faster on some architectures.
          This posting will explain my results, the following posting
          will contain my source code, a C translation of Sorensen's
          code, and timing routines to compare them.

The code Henrik posted does indeed require less floating point computation than my FHT routine; but it requires almost 3/2 times as many memory accesses. Systems, like a Sun Sparc, with fast floating point arithmetic and slower memory access give my FHT an advantage. We also used different methods of calcuating trig values; but on a Sparc his algorithm was faster. Another difference may have been in my translating his code from Fortran to C; I just did a line-by-line translation, and I'm certain it could have been cleaned up to better suit gcc's optimizer. My apologies if the performance overhead is due to a poor C translation on my part.

I'd guess a radix-4 real-valued FFT should beat them both.

---------------------------------------------------------------------- Timing Results

The tables below contines timing results for various array sizes for each of the a number of different FFT algorithms. The timing values are in units of "microseconds/n/log(n)/number_of_ffts_run". An explanation of each algorithm follows the tables.

The following table, in roughly decreasing order of performance, contains timing results on a Sparcstation2 and compiled with gcc version 2.0.2 with the flag -O4.


      fft_algorithm
        |            trig_algorithm
        |            |
    A:  fourea-real ;c_lib   
    B:  fourea      ;c_lib   
    C:  duhamel     ;c_lib   
    D:  mayer-rad4  ;c_lib   
    E:  numrec      ;numrec  
    F:  singleton   ;sing    
    G:  mayer-rad4  ;Buneman2
    H:  mayer-rad4  ;Simple  
    I:  nrec_real   ;numrec       (real valued fft)
    J:  sing_real   ;sing         (real valued fft)
    K:  mayer-real  ;Buneman2     (real valued fft)
    L:  sorensen-r  ;n-rec        (real valued fft)
    M:  mayer-fht   ;Buneman2     (hartley transform)
    #      A    B    C    D    E    F    G    H     I    J    K    L    M
    4    : 5.9  5.5  2.3  1.5  5.6  9.4  1.6  1.6   4.0  11.  0.82 0.82 0.63
    8    : 5.1  5.0  2.7  1.1  3.3  6.2  1.1  1.1   2.5  3.5  0.58 0.65 0.41
    16   : 4.5  4.3  2.8  2.0  2.2  2.6  1.2  1.1   1.6  2.8  0.61 0.71 0.50
    32   : 4.0  3.9  2.7  2.0  1.8  1.9  1.0  1.0   1.2  1.4  0.53 0.61 0.42
    64   : 3.6  3.5  2.6  2.1  1.5  1.4  1.1  1.0   0.96 1.1  0.53 0.56 0.45
    128  : 3.2  3.2  2.4  2.0  1.4  1.6  0.96 0.92  0.81 0.87 0.49 0.48 0.42
    256  : 3.0  2.9  2.2  2.0  1.3  1.3  0.98 0.91  0.74 0.89 0.50 0.49 0.43
    512  : 2.9  2.8  2.1  1.9  1.2  1.2  0.92 0.90  0.71 0.70 0.46 0.44 0.40
    1024 : 2.7  2.8  2.2  1.8  1.2  1.3  0.94 0.92  0.69 0.65 0.46 0.45 0.41
    2048 : 2.6  2.9  2.2  1.7  1.2  1.5  0.91 0.87  0.68 0.68 0.44 0.44 0.39
    4096 : 2.6  3.0  2.3  1.7  1.2  1.4  0.94 0.91  0.68 0.81 0.46 0.45 0.41
    8192 : 2.8  2.9  2.2  1.6  2.2  1.4  0.89 0.86  0.66 0.73 0.44 0.44 0.39
    16384: 3.5  3.5  2.6  2.1  2.3  1.7  1.5  1.5   0.99 0.75 0.74 0.80 0.69
    32768: 3.5  3.5  2.6  2.1  2.3  1.9  1.5  1.5   1.0  0.88 0.74 0.85 0.69
    65536: 3.5  3.5  2.5  2.1  2.4  1.7  1.5  1.5   1.0  0.99 0.77 0.85 0.71
real-value ***                                      *******************
hartley                                                                 ****
radix2     *********           ****
radix4                    ****      **************       *********      ****
splitradix           ****                                          ****
c-lib-trig ******************

Note: * There's at least a factor of 3 in the time taken by various algorithms
      * Real valued transforms are significantly faster than complex ffts
      * FFTs which rely on C library calls for all trig values are very slow
      * My workstation apparently started paging at about N=16384.  This
	slowed down memory accesses.  It is interesting to note that
	the split-radix-real-valued-FFT was more affected by paging
	than the radix-4 fht.  My complex-FFT is also severely
	affected by memory paging.
      * Order(N) and Order(log(N)) components are very significant for
	transforms smaller than about 256.  If FFTs were a purely N*log(N)
	process all the numbers in any given column would be constant.
      * I suspect these results are highly architecture dependant.  On
	a 486 system, Henrik's FFT was faster than my FHT for most 2^K
	for even K, and slower for most 2^K for odd K.  The main
	reason for this is probalby because the 486 has relatively few
	registers, so my algorithm can't effectively decrease the
	memory accesses.



-----------------
            Description of FFT algorithms compared.

numrec
    A straightforward radix-2 FFT
    Source: "Numerical Recipies in C";

duhamel
    "A Duhamel-Hollman split-radix dif fft"
     by: Dave Edelblute, [email protected], 05 Jan 1993 [...]
     Ref: Electronics Letters, Jan. 5, 1984"
    Source: periodic posting on comp.benchmarks

fourea
    "adapted from subroutine FOUREA listed in Programs for Digital
     Signal Processing [...] edited by Digital Signal Processing
     Committee [...] IEEE Acoustics Speech and Signal Processing
     Committee [...] Chapter 1 Section 1.1 Page 1.1-4,5"
    Source: comp.dsp 02/04/93; by [email protected] (Ted Wong)

singleton
    "Singleton's mixed radix algorithm"
    Source: From edu/math/msdos/modelling on wuarchive.wustl.edu
            Translated to C by: Javier Soley, [email protected]

sorensen
    my C translation of the code Henrik Sorensen posted here.
    Source: "Real-valued fast Fourier transform algorithms", IEEE
            Tran. ASSP, Vol. ASSP-35, No. 6, pp. 849-864, June 1987
            (My C translation included in the following posting)

mayer-*
    All my routines are wrappers around a fast Hartley transform.
    Source: the following posting.

---------------------------------------------------------------------------

Trig algorithms

A few people asked me via email about the trig algorithm I used: I'll try to look up the reference to the trig algorithm I was using; but I think I lost it somewhere in storage. It works by manipulating a log2(n) sized table of trig values. Each newly generated value is for an angle which lies exactly halfway between two angles whose trig values are known. I think the identity is something like

    sin( (w+v)/2 ) = (sin(w) + sin(v)) * sec(v-w) / 2

although I might be wrong about the argument inside the sec() function. You can start with a table of cos(pi),cos(pi/2),cos(pi/4), cos(pi/8)...., and everytime you generate a new value you can overwrite one which you no longer need (to avoid excessive memory usage). This takes less floating point math than any of the other trig generating algorithms I know of; but requires a bunch of integer math to determine which value can be overwritten. It is probably only faster on systems which emulate floating point math; but it is considerably more stable than other common methods.

I'll try to find and mention the reference in a future posting.

--------------------------------------------------------------------------- Accuracy and Noise of FFT algorithms.

An earlier discussion I had on comp.benchmarks people were discussing possible tradeoffs between speed and accuracy. The amount of noise generated by various FFT algorithms varies by a factor of about 1e10! However, as shown in the table below, this is not all that closely correlated to the speed of the algorithm.

In reality, noise added by any of these FFTs probably only matters if you are doing zillions of forward and reverse transforms; but some will argue that a good FFT should have very low noise.

One significant component of errors in FFTs is the trig algorithm used. Avoiding recursively successive calculating trig functions can reduce accumulated errors by about a factor of 5. Unfortunatelly C library functions are too slow for this purpose. See my code in the next posting for a number of compromise trig algorithms.

 algo        trig      N       time      ssq-errors
fourea-real ;c_lib    65536 :  3.48      2.19e-12
fourea      ;c_lib    65536 :  3.45      2.64e-11
duhamel     ;c_lib    65536 :  2.53      4.63e-21
numrec      ;numrec   65536 :  2.35      9.13e-14
mayer-rad4  ;c_lib    65536 :  2.08      1.79e-21   *(all library functions)
sing        ;sing     65536 :  1.68      5.22e-19
mayer-rad4  ;Buneman2 65536 :  1.52      2.46e-21   *(half-angle algorithm)
mayer-rad4  ;Simple   65536 :  1.51      8.16e-21   *(recursively generated)
nrec_real   ;numrec   65536 :  1.04      5.39e-19           
sing_real   ;sing     65536 :  0.987     ????????
sorensen-r  ;n-rec    65536 :  0.855     1.83e-20
mayer-real  ;Buneman2 65536 :  0.765     1.78e-21
mayer-fht   ;Buneman2 65536 :  0.707     2.46e-21

Note that the wierd trig algorithm reduces errors by almost a factor of 4 with negligable performance cost.

----------------------------------------------------------------------------
Ron Mayer
[email protected]

Source code for FFT's adn FHT's described above.
Various alternatives for FHT and FFT algorithms, including those benchmarkd on FFTW's page.

This page hosted byGeoCities Get your ownFree Home Page

Hosted by www.Geocities.ws