What they say, in practice, is that if you use the GPU (the graphic chip) instead of the CPU (the main brain of the computer)...
In this work we present a novel implementation of FFT on GeForce 8800GTX that achieves 144 Gflop/s that is nearly 3x faster than best rate achieved in the current vendor’s numerical libraries.
Another paper says:
We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU-implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2--4x over CUFFT and 8--40x improvement over MKL for large sizes.
The source code (to be compiled), is available on another site. What upsets me is the ReadMe file:
Currently there are a few known performance issues (bug) that this sample has discovered in rumtime and code generation that are being actively fixed. Hence, for sizes >= 1024, performance is much below the expected peak for any particular size. However, we have internally verified that once these bugs are fixed, performance should be on par with expected peak. Note that these are bugs in OpenCL runtime/compiler and not in this sample.
Maybe they have already found the solution to this problem.
There's another issue, however: how long does it take to move the matrix from the main memory to the GPU and back? I presume that loading the columns will take much more time than loading the rows. In this case it is better to transpose the matrix between the two FFTs (just like when we use the CPU). Eventually, the bottleneck will the the transposition, not the FT.
0 comments:
Post a Comment