API
Creating plans
NonuniformFFTs.PlanNUFFT
— TypePlanNUFFT([T = ComplexF64], dims::Dims; ntransforms = Val(1), backend = CPU(), kwargs...)
Construct a plan for performing non-uniform FFTs (NUFFTs).
The created plan contains all data needed to perform NUFFTs for non-uniform data of type T
(ComplexF64
by default) and uniform data with dimensions dims
.
Optional keyword arguments
ntransforms = Val(1)
: the number of simultaneous transforms to perform. This is useful if one wants to transform multiple scalar quantities at the same non-uniform points.backend::KernelAbstractions.Backend = CPU()
: corresponds to the device type where everything will be executed. This could be e.g.CUDABackend()
if CUDA.jl is loaded.
NUFFT parameters
The following parameters control transform accuracy. The default values give a relative accuracy of the order of $10^{-7}$ for Float64
or ComplexF64
data.
m = HalfSupport(4)
: the half-support of the convolution kernels. Large values increase accuracy at the cost of performance.σ = 2.0
: NUFFT oversampling factor. Typical values are 2.0 (more accurate) and 1.25 (faster), but other values such as 1.5 should also work.kernel::AbstractKernel = BackwardsKaiserBesselKernel()
: convolution kernel used for NUFFTs.
Main performance parameters
kernel_evalmode
: method used for kernel evaluation. The default isFastApproximation
on CPU, which will attempt to use a fast approximation method which greatly speeds up kernel evaluation. On GPUs the default isDirect
, as the fast approximation method is not necessarily faster.block_size
: the block size (in number of elements) when using block partitioning or when sorting is enabled. This enables spatial sorting of points, even whensort_points = False()
(which actually permutes point data for possibly faster memory accesses). The block size can be tuned for maximal performance. It can either be passed as anInt
(linear block size) or as a tuple(B₁, …, Bₙ)
to specify the block size in each Cartesian direction. The current default is 4096 on the CPU and around 1024 on the GPU (but depends on the number of dimensions). These may change in the future or even depend on the actual computing device. On the CPU, using block partitioning is required for running with multiple threads. On the GPU, this option is ignored ifgpu_method = :shared_memory
. Blocking / spatial sorting can be completely disabled by passingblock_size = nothing
(but this is generally slower).gpu_method
: allows to select between different implementations of GPU transforms. Possible options are::global_memory
(default): directly read and write onto arrays in global memory in spreading (type-1) and interpolation (type-2) operations;:shared_memory
: copy data between global memory and shared memory (local to each GPU workgroup) and perform most operations in the latter, which is faster and can help avoid some atomic operations in type-1 transforms. We try to use as much shared memory as is typically available on current GPUs (which is typically 48 KiB on CUDA and 64 KiB on AMDGPU). Note that this method completely ignores theblock_size
parameter, as the actual block size is adjusted to maximise shared memory usage. When this method is enabled, one can play with thegpu_batch_size
parameter (see below) to further tune performance.
For highly dense problems (number of non-uniform points comparable to the total grid size), the
:shared_memory
method can be much faster, especially when theHalfSupport
is 4 or less (accuracies up to1e-7
forσ = 2
).fftw_flags = FFTW.MEASURE
: parameters passed to the FFTW planner whenbackend = CPU()
.
Other performance parameters
These are more advanced performance parameters which may disappear or whose behaviour may change in the future.
sort_points = False()
: whether to internally permute the order of the non-uniform points. This can be enabled by passingsort_points = True()
. Ignored whenblock_size = nothing
(which disables spatial sorting). In this case, more time will be spent inset_points!
and less time on the actual transforms. This can improve performance if executing multiple transforms on the same non-uniform points. Note that, even when enabled, this does not modify thepoints
argument passed toset_points!
.gpu_batch_size = Val(Np)
: minimum batch size used in type-1 transforms whengpu_method = :shared_memory
. The idea is that, to avoid inefficient atomic operations on shared-memory arrays, we process non-uniform points in batches ofNp
points. The actual value ofNp
will typically be larger than the input one in order to maximise shared memory usage within each GPU workgroup. Note that largerNp
also means that less shared memory space is available for local blocks, meaning that the effective block size can get smaller (which is not necessarily bad for performance, and can actually be beneficial). When tuning performance, it is helpful to print the plan (as inprintln(plan)
) to see the actual block and batch sizes.
Other parameters
fftshift = false
: determines the order of wavenumbers in uniform space. Iffalse
(default), the same order used by FFTW is used, with positive wavenumbers first ([0, 1, 2, …, N÷2-1]
for even-size transforms) and negative ones afterwards ([-N÷2, …, -1]
). Otherwise, wavenumbers are expected to be in increasing order ([-N÷2, -kmax, …, -1, 0, 1, …, N÷2-1]
), which is compatible with the output in NFFT.jl and corresponds to applying theAbstractFFTs.fftshift
function to the data. This option also corresponds to themodeord
parameter in FINUFFT. This only affects complex-to-complex transforms.timer = TimerOutput()
: allows to specify aTimerOutput
(from the TimerOutputs.jl package) where timing information will be written to. By default the plan creates its own timer. One can visualise the time spent on different parts of the NUFFT computation usingp.timer
.synchronise = false
: iftrue
, add synchronisation barrier between calls to GPU kernels. Enabling this is needed for accurate timings inp.timer
when computing on a GPU, but may result in reduced performance.
FFT size and performance
For performance reasons, when doing FFTs one usually wants the size of the input along each dimension to be a power of 2 (ideally), or the product of powers of small prime numbers (2, 3, 5, …). The problem is that, with the NUFFT, one does not directly control the FFT size due to the oversampling factor $σ$, which may be any real number $σ > 1$. That is, for an input of size $N$, FFTs are performed on an oversampled grid of size $Ñ ≈ σN$. Note that $σN$ is generally not an integer, hence the $≈$.
The aim of this section is to clarify how $Ñ$ is actually chosen, so that one can predict its value for given inputs $N$ and $σ$. This may be better understood in actual code:
Ñ = nextprod((2, 3, 5), floor(Int, σ * N))
Basically, we truncate $σN$ to an integer, and then we choose $Ñ$ as the next integer that can be written as the product of powers of 2, 3 and 5 (see nextprod
). Most often, the result will be greater than or equal to $σN$.
Using real non-uniform data
In some applications, the non-uniform data to be transformed is purely real. In this case, one may pass Float64
or Float32
as the first argument. This may be faster than converting data to complex types, in particular because the real-to-complex FFTs from FFTW will be used to compute the transforms. Note that, in this case, the dimensions of the uniform data arrays is not exactly dims
, since the size of the first dimension is divided roughly by 2 (taking advantage of Hermitian symmetry). For convenience, one can call size(::PlanNUFFT)
on the constructed plan to know in advance the dimensions of the uniform data arrays.
NonuniformFFTs.NFFTPlan
— TypeNonuniformFFTs.NFFTPlan(xp::AbstractMatrix{T}, dims::Dims{D}; kwargs...) -> plan
Create a NUFFT plan which is compatible with the AbstractNFFTs.jl interface.
The plan follows different rules from plans created by PlanNUFFT
. In particular:
- points are assumed to be in $[-1/2, 1/2)$ instead of $[0, 2π)$;
- the opposite Fourier sign convention is used (e.g. $e^{-i k x_j}$ becomes $e^{+2π i k x_j}$);
- uniform data is in increasing order by default, with frequencies $k = -N/2, …, -1, 0, 1, …, N/2-1$, as opposed to preserving the order used by FFTW (which starts at $k = 0$);
- it only supports complex non-uniform data.
This constructor requires passing the non-uniform locations xp
as the first argument. These should be given as a matrix of dimensions (D, Np)
, where D
is the spatial dimension and Np
the number of non-uniform points.
The second argument is simply the size (N₁, N₂, …)
of the uniform data arrays.
Most keyword arguments from PlanNUFFT
are also accepted here. Moreover, for compatibility reasons, most keyword arguments from the NFFT.jl package are also accepted as detailed below.
This type of plan can also be created via the AbstractNFFTs.plan_nfft
function.
This constructor creates a plan which assumes complex-valued non-uniform data. For real-valued data, the PlanNUFFT
constructor should be used instead.
Compatibility with NFFT.jl
Most of the parameters supported by the NFFT.jl package are also supported by this constructor. The currently supported parameters are reltol
, m
, σ
, window
, blocking
, sortNodes
and fftflags
.
Moreover, unlike PlanNUFFT
, this constructor sets fftshift = true
by default (but can be overridden) so that the uniform data ordering is the same as in NFFT.jl.
Explicitly passing some of these parameters may result in type-unstable code, since the exact type of the returned plan cannot be inferred. This is because, in NonuniformFFTs.jl, parameters such as the kernel size (m
) or the convolution window (window
) are included in the plan type (they are compile-time constants).
GPU usage
To create a GPU-compatible plan, simply pass the locations xp
as a GPU array (e.g. a CuArray
in CUDA). Unlike PlanNUFFT
, the backend
argument is not needed here and will be simply ignored.
Setting non-uniform points
NonuniformFFTs.set_points!
— Functionset_points!(p::PlanNUFFT, points)
Set non-uniform points before executing a NUFFT.
In one dimension, points
is simply a vector of real values (the non-uniform locations).
In multiple dimensions, points
may be passed as:
- a tuple of vectors
(xs::AbstractVector, ys::AbstractVector, …)
; - a vector
[x⃗₁, x⃗₂, x⃗₃, x⃗₄, …]
of tuples or static vectors (typicallySVector
s from the StaticArrays.jl package); - a matrix of size
(d, Np)
whered
is the spatial dimension andNp
the number of non-uniform points.
The points are allowed to be outside of the periodic cell $[0, 2π)^d$, in which case they will be "folded" to that domain.
Executing plans
NonuniformFFTs.exec_type1!
— Functionexec_type1!(ûs::AbstractArray{<:Complex}, p::PlanNUFFT, vp::AbstractVector{<:Number})
exec_type1!(ûs::NTuple{N, AbstractArray{<:Complex}}, p::PlanNUFFT, vp::NTuple{N, AbstractVector{<:Number}})
Perform type-1 NUFFT (from non-uniform points to uniform grid).
Here vp
contains the input values at non-uniform points. The result of the transform is written into ûs
.
One first needs to set the non-uniform points using set_points!
.
To perform multiple transforms at once, both vp
and ûs
should be a tuple of arrays (second variant above). Note that this requires a plan initialised with ntransforms = Val(N)
(see PlanNUFFT
).
See also exec_type2!
.
NonuniformFFTs.exec_type2!
— Functionexec_type2!(vp::AbstractVector{<:Number}, p::PlanNUFFT, ûs::AbstractArray{<:Complex})
exec_type2!(vp::NTuple{N, AbstractVector{<:Number}}, p::PlanNUFFT, ûs::NTuple{N, AbstractArray{<:Complex}})
Perform type-2 NUFFT (from uniform grid to non-uniform points).
Here ûs
contains the input coefficients in the uniform grid. The result of the transform at non-uniform points is written into vp
.
One first needs to set the non-uniform points using set_points!
.
To perform multiple transforms at once, both vp
and ûs
should be a tuple of arrays (second variant above). Note that this requires a plan initialised with ntransforms = Val(N)
(see PlanNUFFT
).
See also exec_type1!
.
Other functions
Base.size
— Methodsize(p::PlanNUFFT) -> (N₁, N₂, ...)
Return the dimensions of arrays containing uniform values.
This corresponds to the number of Fourier modes in each direction (in the non-oversampled grid).
Available spreading kernels
NonuniformFFTs.Kernels.KaiserBesselKernel
— TypeKaiserBesselKernel <: AbstractKernel
KaiserBesselKernel([β])
Represents a Kaiser–Bessel spreading kernel.
Definition
\[ϕ(x) = I₀ \left(β \sqrt{1 - x²} \right) \quad \text{ for } |x| ≤ 1\]
where $I₀$ is the zeroth-order modified Bessel function of the first kind and $β$ is a shape factor.
Fourier transform
\[ϕ̂(k) = 2 \frac{\sinh\left( \sqrt{β² - k²} \right)}{\sqrt{β² - k²}}\]
Parameter selection
By default, the shape parameter is chosen to be [1]
\[β = γ M π \left( 2 - \frac{1}{σ} \right)\]
where $M$ is the kernel half-width and $σ$ the oversampling factor. Moreover, $γ = 0.980$ is an empirical "safety factor", similarly to the one used by FINUFFT [2], which slightly improves accuracy.
This default value can be overriden by explicitly passing a $β$ value.
Implementation details
Since the evaluation of Bessel functions can be costly, this kernel is efficiently evaluated via an accurate piecewise polynomial approximation. We use the same method originally proposed for FINUFFT [2] and later discussed by Shamshirgar et al. [3].
[1] Potts & Steidl, SIAM J. Sci. Comput. 24, 2013 (2003)
[2] Barnett, Magland & af Klinteberg, SIAM J. Sci. Comput. 41, C479 (2019)
[3] Shamshirgar, Bagge & Tornberg, J. Chem. Phys. 154, 164109 (2021)
NonuniformFFTs.Kernels.BackwardsKaiserBesselKernel
— TypeBackwardsKaiserBesselKernel <: AbstractKernel
BackwardsKaiserBesselKernel([β])
Represents a backwards Kaiser–Bessel (KB) spreading kernel.
This kernel basically results from swapping the KaiserBesselKernel
spreading function and its Fourier transform. It has very similar properties to the Kaiser–Bessel kernel.
Definition
\[ϕ(x) = \frac{\sinh \left(β \sqrt{1 - x²} \right)}{π \sqrt{1 - x²}} \quad \text{ for } |x| ≤ 1\]
where $β$ is a shape factor.
Fourier transform
\[ϕ̂(k) = I₀ \left( \sqrt{β² - k²} \right)\]
where $I₀$ is the zeroth-order modified Bessel function of the first kind.
Parameter selection
By default, the shape parameter is chosen to be [1]
\[β = γ M π \left( 2 - \frac{1}{σ} \right)\]
where $M$ is the kernel half-width and $σ$ the oversampling factor. Moreover, $γ = 0.995$ is an empirical "safety factor", similarly to the one used by FINUFFT [2], which slightly improves accuracy.
This default value can be overriden by explicitly passing a $β$ value.
Implementation details
Since the evaluation of the hyperbolic sine functions can be costly, this kernel is efficiently evaluated via an accurate piecewise polynomial approximation. We use the same method originally proposed for FINUFFT [2] and later discussed by Shamshirgar et al. [3].
[1] Potts & Steidl, SIAM J. Sci. Comput. 24, 2013 (2003)
[2] Barnett, Magland & af Klinteberg, SIAM J. Sci. Comput. 41, C479 (2019)
[3] Shamshirgar, Bagge & Tornberg, J. Chem. Phys. 154, 164109 (2021)
NonuniformFFTs.Kernels.GaussianKernel
— TypeGaussianKernel <: AbstractKernel
GaussianKernel([ℓ/Δx])
Represents a truncated Gaussian spreading kernel.
Definition
\[ϕ(x) = e^{-x² / 2ℓ²}\]
where $ℓ$ is the characteristic width of the kernel.
Fourier transform
\[ϕ̂(k) = \sqrt{2πℓ²} e^{-ℓ² k² / 2}\]
Parameter selection
By default, given a kernel half-width $M$, an oversampling factor $σ$ and the oversampling grid spacing $Δx$, the characteristic width $ℓ$ is chosen as [1]
\[ℓ² = Δx² \frac{σ}{2σ - 1} \frac{M}{π}\]
This default value can be overriden by explicitly passing a value of the wanted normalised width $ℓ/Δx$.
Implementation details
In the implementation, this kernel is efficiently evaluated using the fast Gaussian gridding method proposed by Greengard & Lee [2].
[1] Potts & Steidl, SIAM J. Sci. Comput. 24, 2013 (2003)
[2] Greengard & Lee, SIAM Rev. 46, 443 (2004)
NonuniformFFTs.Kernels.BSplineKernel
— TypeBSplineKernel <: AbstractKernel
BSplineKernel()
Represents a B-spline spreading kernel.
The order $n$ of the B-spline is directly related to the kernel half-width $M$ by $n = 2M$ (the polynomial degree is $n - 1$).
Definition
A B-spline of order $n$ may be defined via its Fourier transform:
\[ϕ̂(k) = Δx \operatorname{sinc}^n \left( \frac{k Δx}{2} \right)\]
where $Δx$ is the spacing of the oversampled grid and $\operatorname{sinc}$ is the unnormalised sinc function.
Implementation details
In the implementation, this kernel is evaluated using de Boor's algorithm.
Kernel evaluation methods
NonuniformFFTs.Kernels.Direct
— TypeDirect <: Kernels.EvaluationMode
Directly evaluate kernels using their definition.
NonuniformFFTs.Kernels.FastApproximation
— TypeFastApproximation <: Kernels.EvaluationMode
Use a fast approximation or algorithm to efficiently evaluate a kernel.
The evaluation type depends on the actual kernel:
for
KaiserBesselKernel
andBackwardsKaiserBesselKernel
, this uses a fast piecewise polynomial approximation heavily inspired from FINUFFT;for
GaussianKernel
, this uses the fast Gaussian gridding algorithm proposed by Greengard & Lee (SIAM Rev. 2004);for
BSplineKernel
, this is currently the same asDirect
evaluation.
In the first two cases, this is generally faster than Direct
evaluation for CPU transforms.
On GPUs this is not necessarily the case, and results may depend on the actual GPU used. For example, on an Nvidia A100, Direct
evaluation of the KaiserBesselKernel
(which involves Bessel functions) seems to beat polynomial approximation, and is also a bit faster than the BackwardsKaiserBesselKernel
(involving sinh
functions).
Index
NonuniformFFTs.Kernels.BSplineKernel
NonuniformFFTs.Kernels.BackwardsKaiserBesselKernel
NonuniformFFTs.Kernels.Direct
NonuniformFFTs.Kernels.FastApproximation
NonuniformFFTs.Kernels.GaussianKernel
NonuniformFFTs.Kernels.KaiserBesselKernel
NonuniformFFTs.NFFTPlan
NonuniformFFTs.PlanNUFFT
Base.size
NonuniformFFTs.exec_type1!
NonuniformFFTs.exec_type2!
NonuniformFFTs.set_points!