# Tutorial

The following tutorial shows how to perform a 3D FFT of real periodic data defined on a grid of $N_x × N_y × N_z$ points.

By default, the domain is distributed on a 2D MPI topology of dimensions $N_1 × N_2$. As an example, the above figure shows such a topology with $N_1 = 4$ and $N_2 = 3$, for a total of 12 MPI processes.

## Creating plans

The first thing to do is to create a domain decomposition configuration for the given dataset dimensions $N_x × N_y × N_z$. In the framework of PencilArrays, such a configuration is described by a Pencil object. As described in the PencilArrays docs, we can let the Pencil constructor automatically determine such a configuration. For this, only an MPI communicator and the dataset dimensions are needed:

using MPI
using PencilFFTs

MPI.Init()
comm = MPI.COMM_WORLD

# Input data dimensions (Nx × Ny × Nz)
dims = (16, 32, 64)
pen = Pencil(dims, comm)

By default this creates a 2D decomposition (for the case of a 3D dataset), but one can change this as detailed in the PencilArrays documentation linked above.

We can now create a PencilFFTPlan, which requires information on decomposition configuration (the Pencil object) and on the transforms that will be applied:

# Apply a 3D real-to-complex (r2c) FFT.
transform = Transforms.RFFT()

# Note that, for more control, one can instead separately specify the transforms along each dimension:
# transform = (Transforms.RFFT(), Transforms.FFT(), Transforms.FFT())

# Create plan
plan = PencilFFTPlan(pen, transform)

See the PencilFFTPlan constructor for details on the accepted options, and the Transforms module for the possible transforms. It is also possible to enable fine-grained performance measurements via the TimerOutputs package, as described in Measuring performance.

## Allocating data

Next, we want to apply the plan on some data. Transforms may only be applied on PencilArrays, which are array wrappers that include MPI decomposition information (in some sense, analogous to DistributedArrays in Julia's distributed computing approach). The helper function allocate_input can be used to allocate a PencilArray that is compatible with our plan:

# In our example, this returns a 3D PencilArray of real data (Float64).
u = allocate_input(plan)

# Fill the array with some (random) data
using Random
randn!(u)

PencilArrays are a subtype of AbstractArray, and thus they support all common array operations.

Similarly, to preallocate output data, one can use allocate_output:

# In our example, this returns a 3D PencilArray of complex data (Complex{Float64}).
v = allocate_output(plan)

This is only required if one wants to apply the plans using a preallocated output (with mul!, see right below).

The data types returned by allocate_input and allocate_output are slightly different when working with in-place transforms. See the in-place example for details.

## Applying plans

The interface to apply plans is consistent with that of AbstractFFTs. Namely, * and mul! are respectively used for forward transforms without and with preallocated output data. Similarly, \ and ldiv! are used for backward transforms.

using LinearAlgebra  # for mul!, ldiv!

# Apply plan on u with v as an output
mul!(v, plan, u)

# Apply backward plan on v with w as an output
w = similar(u)
ldiv!(w, plan, v)  # now w ≈ u

Note that, consistently with AbstractFFTs, normalisation is performed at the end of a backward transform, so that the original data is recovered when applying a forward followed by a backward transform.

## Accessing and modifying data

For any given MPI process, a PencilArray holds the data associated to its local partition in the global geometry. PencilArrays are accessed using local indices that start at 1, regardless of the location of the local process in the MPI topology. Note that PencilArrays, being based on regular Arrays, support both linear and Cartesian indexing (see the Julia docs for details).

For convenience, the global_view function can be used to generate an OffsetArray wrapper that takes global indices.

### Output data layout

In memory, the dimensions of the transform output are by default reversed with respect to the input. That is, if the order of indices in the input data is (x, y, z), then the output has order (z, y, x) in memory. This detail is hidden from the user, and output arrays are always accessed in the same order as the input data, regardless of the underlying output dimension permutation. This applies to PencilArrays and to OffsetArrays returned by global_view.

The reasoning behind dimension permutations, is that they allow to always perform FFTs along the fastest array dimension and to avoid a local data transposition, resulting in performance gains. A similar approach is followed by other parallel FFT libraries. FFTW itself, in its distributed-memory routines, includes a flag that enables a similar behaviour. In PencilFFTs, index permutation is the default, but it can be disabled via the permute_dims flag of PencilFFTPlan.

A great deal of work has been spent in making generic index permutations as efficient as possible, both in intermediate and in the output state of the multidimensional transforms. This has been achieved, in part, by making sure that permutations such as (3, 2, 1) are compile-time constants.

For details on working with PencilArrays see the PencilArrays docs.
The examples on the sidebar further illustrate the use of transforms and provide an introduction to working with MPI-distributed data in the form of PencilArrays. In particular, the gradient example illustrates different ways of computing things using Fourier-transformed distributed arrays. Then, the incompressible Navier–Stokes example is a more advanced and complete example of a possible application of the PencilFFTs package.