# Tutorial

The following tutorial shows how to perform a 3D FFT of real periodic data defined on a grid of $N_x × N_y × N_z$ points.

By default, the domain is distributed on a 2D MPI topology of dimensions $N_1 × N_2$. As an example, the above figure shows such a topology with $N_1 = 4$ and $N_2 = 3$, for a total of 12 MPI processes.

## Creating plans

The first thing to do is to create a domain decomposition configuration for the given dataset dimensions $N_x × N_y × N_z$. In the framework of PencilArrays, such a configuration is described by a `Pencil`

object. As described in the PencilArrays docs, we can let the `Pencil`

constructor automatically determine such a configuration. For this, only an MPI communicator and the dataset dimensions are needed:

```
using MPI
using PencilFFTs
MPI.Init()
comm = MPI.COMM_WORLD
# Input data dimensions (Nx × Ny × Nz)
dims = (16, 32, 64)
pen = Pencil(dims, comm)
```

By default this creates a 2D decomposition (for the case of a 3D dataset), but one can change this as detailed in the PencilArrays documentation linked above.

We can now create a `PencilFFTPlan`

, which requires information on decomposition configuration (the `Pencil`

object) and on the transforms that will be applied:

```
# Apply a 3D real-to-complex (r2c) FFT.
transform = Transforms.RFFT()
# Note that, for more control, one can instead separately specify the transforms along each dimension:
# transform = (Transforms.RFFT(), Transforms.FFT(), Transforms.FFT())
# Create plan
plan = PencilFFTPlan(pen, transform)
```

See the `PencilFFTPlan`

constructor for details on the accepted options, and the `Transforms`

module for the possible transforms. It is also possible to enable fine-grained performance measurements via the TimerOutputs package, as described in Measuring performance.

## Allocating data

Next, we want to apply the plan on some data. Transforms may only be applied on `PencilArray`

s, which are array wrappers that include MPI decomposition information (in some sense, analogous to `DistributedArray`

s in Julia's distributed computing approach). The helper function `allocate_input`

can be used to allocate a `PencilArray`

that is compatible with our plan:

```
# In our example, this returns a 3D PencilArray of real data (Float64).
u = allocate_input(plan)
# Fill the array with some (random) data
using Random
randn!(u)
```

`PencilArray`

s are a subtype of `AbstractArray`

, and thus they support all common array operations.

Similarly, to preallocate output data, one can use `allocate_output`

:

```
# In our example, this returns a 3D PencilArray of complex data (Complex{Float64}).
v = allocate_output(plan)
```

This is only required if one wants to apply the plans using a preallocated output (with `mul!`

, see right below).

The data types returned by `allocate_input`

and `allocate_output`

are slightly different when working with in-place transforms. See the in-place example for details.

## Applying plans

The interface to apply plans is consistent with that of `AbstractFFTs`

. Namely, `*`

and `mul!`

are respectively used for forward transforms without and with preallocated output data. Similarly, `\`

and `ldiv!`

are used for backward transforms.

```
using LinearAlgebra # for mul!, ldiv!
# Apply plan on `u` with `v` as an output
mul!(v, plan, u)
# Apply backward plan on `v` with `w` as an output
w = similar(u)
ldiv!(w, plan, v) # now w ≈ u
```

Note that, consistently with `AbstractFFTs`

, normalisation is performed at the end of a backward transform, so that the original data is recovered when applying a forward followed by a backward transform.

## Accessing and modifying data

For any given MPI process, a `PencilArray`

holds the data associated to its local partition in the global geometry. `PencilArray`

s are accessed using local indices that start at 1, regardless of the location of the local process in the MPI topology. Note that `PencilArray`

s, being based on regular `Array`

s, support both linear and Cartesian indexing (see the Julia docs for details).

For convenience, the `global_view`

function can be used to generate an `OffsetArray`

wrapper that takes global indices.

### Output data layout

In memory, the dimensions of the transform output are by default reversed with respect to the input. That is, if the order of indices in the input data is `(x, y, z)`

, then the output has order `(z, y, x)`

in memory. This detail is hidden from the user, and **output arrays are always accessed in the same order as the input data**, regardless of the underlying output dimension permutation. This applies to `PencilArray`

s and to `OffsetArray`

s returned by `global_view`

.

The reasoning behind dimension permutations, is that they allow to always perform FFTs along the fastest array dimension and to avoid a local data transposition, resulting in performance gains. A similar approach is followed by other parallel FFT libraries. FFTW itself, in its distributed-memory routines, includes a flag that enables a similar behaviour. In PencilFFTs, index permutation is the default, but it can be disabled via the `permute_dims`

flag of `PencilFFTPlan`

.

A great deal of work has been spent in making generic index permutations as efficient as possible, both in intermediate and in the output state of the multidimensional transforms. This has been achieved, in part, by making sure that permutations such as `(3, 2, 1)`

are compile-time constants.

## Further reading

For details on working with `PencilArray`

s see the PencilArrays docs.

The examples on the sidebar further illustrate the use of transforms and provide an introduction to working with MPI-distributed data in the form of `PencilArray`

s. In particular, the gradient example illustrates different ways of computing things using Fourier-transformed distributed arrays. Then, the incompressible Navier–Stokes example is a more advanced and complete example of a possible application of the PencilFFTs package.