# Global MPI operations
One of the most time-consuming parts of a large-scale computation involving multidimensional FFTs is the global transposition of data between different MPI decomposition configurations. In PencilArrays, this is performed by the `transpose!` function, which takes two `PencilArray`s, typically associated with two different configurations. The implementation performs comparably to similar implementations in lower-level languages (see the PencilFFTs benchmarks for details).
Also provided is a `gather` function that creates a single global array from decomposed data. This can be useful for tests (in fact, it is used in the PencilArrays test suite to verify the correctness of the transpositions), but it shouldn't be used with large datasets. It is generally useful for small problems where the global data easily fits in the locally available memory.
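As a brief sketch of typical usage (the global dimensions, decomposed dimensions and fill values below are arbitrary illustrative choices), one may create two compatible pencil configurations and redistribute data between them with `transpose!`:

```julia
using MPI
using PencilArrays
using LinearAlgebra: transpose!  # transpose! extends the LinearAlgebra function

MPI.Init()
comm = MPI.COMM_WORLD

dims_global = (16, 32, 64)  # global dataset dimensions (illustrative)

# "x-pencil": decompose dimensions 2 and 3 across the MPI processes
pen_x = Pencil(dims_global, (2, 3), comm)

# "y-pencil": same global size and MPI topology, but decompose dimensions 1 and 3
pen_y = Pencil(pen_x; decomp_dims = (1, 3))

ux = PencilArray{Float64}(undef, pen_x)
fill!(ux, MPI.Comm_rank(comm))  # dummy data: each process writes its rank

uy = PencilArray{Float64}(undef, pen_y)
transpose!(uy, ux)  # redistribute data from the x-pencil to the y-pencil
```

Note that `pen_y` differs from `pen_x` in exactly one decomposed dimension, as required for transposition (see `Transposition` below).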
## Library
### PencilArrays.Transpositions.Transposition — Type

    Transposition

Holds data for transposition between two pencil configurations.

    Transposition(dest::PencilArray{T,N}, src::PencilArray{T,N};
                  method = Transpositions.PointToPoint())
Prepare transposition of arrays from one pencil configuration to the other.
The two pencil configurations must be compatible for transposition:

- they must share the same MPI Cartesian topology;
- they must have the same global data size;
- the decomposed dimensions must be almost the same, with at most one difference. For instance, if the input of a 3D dataset is decomposed in `(2, 3)`, then the output may be decomposed in `(1, 3)` or in `(2, 1)`, but not in `(1, 2)`. Note that the order of the decomposed dimensions (as passed to the `Pencil` constructor) matters. If the decomposed dimensions are the same, then no transposition is performed, and data is just copied if needed.
The `src` and `dest` arrays may be aliased (they can share memory space).
#### Performance tuning
The `method` argument allows choosing between transposition implementations. This can be useful for tuning the performance of MPI data transfers. Two values are currently accepted:

- `Transpositions.PointToPoint()` uses non-blocking point-to-point data transfers (`MPI_Isend` and `MPI_Irecv`). This may be more performant, since data transfers are interleaved with local data transpositions (index permutation of received data). This is the default.
- `Transpositions.Alltoallv()` uses collective `MPI_Alltoallv` for global data transpositions.
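For instance, a transposition can be prepared once with an explicit method and then executed; the sketch below reuses the illustrative `ux` and `uy` arrays defined above:

```julia
using PencilArrays.Transpositions: Transposition, Alltoallv

# Prepare a transposition from the x-pencil to the y-pencil using collective
# MPI_Alltoallv transfers instead of the default point-to-point method.
t = Transposition(uy, ux; method = Alltoallv())
transpose!(t)
```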
### LinearAlgebra.transpose! — Function

    transpose!(t::Transposition; waitall=true)
    transpose!(dest::PencilArray{T,N}, src::PencilArray{T,N};
               method = Transpositions.PointToPoint())
Transpose data from one pencil configuration to the other.
The first variant optionally allows delaying the wait for MPI send operations to complete. This is useful if the caller wants to perform other operations with the already received data. To do this, pass `waitall = false`, and manually invoke `MPI.Waitall` on the `Transposition` object once those operations are done. Note that this option only has an effect when the transposition method is `PointToPoint`.
See `Transposition` for details.
### MPI.Waitall — Function

    MPI.Waitall(t::Transposition)

Wait for completion of all unfinished MPI communications related to the transposition.
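As a sketch of the delayed-wait pattern described above, again using the illustrative `ux` and `uy` arrays from the first example:

```julia
t = Transposition(uy, ux)       # default PointToPoint method
transpose!(t; waitall = false)  # data has been received into uy; sends may still be in flight

# ... perform other operations with the received data in uy ...

MPI.Waitall(t)  # wait for the remaining send operations to complete
```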
### PencilArrays.gather — Function

    gather(x::PencilArray{T, N}, [root::Integer=0]) -> Array{T, N}

Gather data from all MPI processes into one (big) array.

Data is received by the `root` process.

Returns the full array on the `root` process, and `nothing` on the other processes.
Note that `gather` always returns a base `Array`, even when the `PencilArray` wraps a different kind of array (e.g. a `CuArray`).
This function can be useful for testing, but it shouldn't be used with very large datasets!
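As a small sketch of such a test, continuing with the illustrative arrays above (`size_global` returns the global dimensions of the distributed array):

```julia
u_global = gather(uy)  # full Array on the root process, nothing elsewhere

if u_global !== nothing
    @assert size(u_global) == size_global(uy)
end
```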