info fftw3

6.2 Cell Caveats

The FFTW benchmark program allocates memory using malloc() or equivalent library calls, reflecting the common usage of the FFTW library. However, you can sometimes improve performance significantly by allocating memory in system-specific large TLB pages. For example, we have seen 39 GFLOPS for a 256 × 256 × 256 problem using large pages, whereas the speed is about 25 GFLOPS with normal pages. Your mileage may vary.
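To put the two figures in perspective, the sketch below converts a GFLOPS rate into time per transform. It assumes the nominal flop count of 5 N log2(N) for a complex transform of N points, the convention commonly used when reporting FFT benchmark speeds; the text above does not say whether the 256 × 256 × 256 problem was complex, so treat the result as a rough illustration. The function names are hypothetical and not part of FFTW.

```c
#include <stddef.h>

/* Integer base-2 logarithm, rounded down (exact for powers of two such as 256). */
unsigned ilog2(size_t n)
{
    unsigned r = 0;
    while (n > 1) {
        n >>= 1;
        ++r;
    }
    return r;
}

/* Nominal flop count 5 * N * log2(N) for an n0 x n1 x n2 transform with
 * power-of-two dimensions. ASSUMPTION: complex data and the usual
 * benchmark convention; this is not the actual operation count. */
double nominal_flops_3d(size_t n0, size_t n1, size_t n2)
{
    double n = (double)n0 * (double)n1 * (double)n2;
    unsigned lg = ilog2(n0) + ilog2(n1) + ilog2(n2);
    return 5.0 * n * (double)lg;
}

/* Seconds per transform implied by a rate given in GFLOPS. */
double seconds_at_rate(double flops, double gflops)
{
    return flops / (gflops * 1e9);
}
```

Under these assumptions, a 256 × 256 × 256 transform counts as about 2.01 × 10^9 flops, so 39 GFLOPS corresponds to roughly 52 ms per transform and 25 GFLOPS to roughly 81 ms, a ratio of 1.56.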
FFTW hoards all available SPEs for itself. You can optionally choose a different number of SPEs by calling the undocumented function fftw_cell_set_nspe(n), where n is the number of desired SPEs. Expect this interface to go away once we figure out how to make FFTW play nicely with other Cell software.
In particular, if you try to link both the single- and double-precision versions of FFTW into the same program (which you can do), each will try to grab all SPEs, and the second one will hang.
The SPEs demand that data be stored in contiguous arrays aligned at 16-byte boundaries. If you instruct FFTW to operate on noncontiguous or nonaligned data, the SPEs will not be used, resulting in slow execution. See section Data Alignment.
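A quick way to check the 16-byte requirement before handing an array to FFTW is to test the pointer value directly. The following is a minimal sketch in standard C11; is_aligned_16 and alloc_aligned_doubles are hypothetical helper names, not FFTW functions, and aligned_alloc stands in for whatever allocator your platform provides.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if p lies on a 16-byte boundary, else 0. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Hypothetical helper: allocates room for n doubles on a 16-byte boundary
 * using C11 aligned_alloc. The byte count is rounded up to a multiple of
 * 16, as aligned_alloc requires. Release the result with free(). */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, bytes);
}
```

Note that a pointer into the middle of an aligned array, such as buf + 1 for an array of doubles, is generally no longer 16-byte aligned, so transforming a sub-array at an odd offset can silently disable the SPE path. In an FFTW program, fftw_malloc is the usual way to obtain suitably aligned arrays.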
The FFTW_ESTIMATE mode may produce seriously suboptimal plans, and it becomes particularly confused if you enable both the SPEs and Altivec. If you care about performance, please use FFTW_MEASURE or FFTW_PATIENT until we figure out a more reliable performance model.