# Global MPI operations
One of the most time-consuming parts of a large-scale computation involving multidimensional FFTs is the global transposition of data between different MPI decomposition configurations. In PencilArrays, this is performed by the `transpose!` function, which takes two `PencilArray`s, typically associated with two different configurations. The implementation performs comparably to similar implementations in lower-level languages (see the PencilFFTs benchmarks for details).
Also provided is a `gather` function that creates a single global array from decomposed data. This can be useful for tests (in fact, it is used in the PencilArrays test suite to verify the correctness of the transpositions), but it shouldn't be used with large datasets. It is generally useful for small problems where the global data easily fits in the locally available memory.
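As a brief sketch of typical usage (the global dimensions, decomposed dimensions and fill values below are arbitrary illustrative choices), one may create two compatible pencil configurations and redistribute data between them with `transpose!`:

```julia
using MPI
using PencilArrays
using LinearAlgebra: transpose!  # transpose! extends the LinearAlgebra function

MPI.Init()
comm = MPI.COMM_WORLD

dims_global = (16, 32, 64)  # global dataset dimensions (illustrative)

# "x-pencil": decompose dimensions 2 and 3 across the MPI processes
pen_x = Pencil(dims_global, (2, 3), comm)

# "y-pencil": same global size and MPI topology, but decompose dimensions 1 and 3
pen_y = Pencil(pen_x; decomp_dims = (1, 3))

ux = PencilArray{Float64}(undef, pen_x)
fill!(ux, MPI.Comm_rank(comm))  # dummy data: each process writes its rank

uy = PencilArray{Float64}(undef, pen_y)
transpose!(uy, ux)  # redistribute data from the x-pencil to the y-pencil
```

Note that `pen_y` differs from `pen_x` in exactly one decomposed dimension, as required for transposition (see `Transposition` below).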
## Library
### PencilArrays.Transpositions.Transposition — Type

    Transposition

Holds data for transposition between two pencil configurations.

    Transposition(dest::PencilArray{T,N}, src::PencilArray{T,N};
                  method = Transpositions.PointToPoint())
Prepare transposition of arrays from one pencil configuration to the other.
The two pencil configurations must be compatible for transposition:

- they must share the same MPI Cartesian topology;
- they must have the same global data size;
- the decomposed dimensions must be almost the same, with at most one difference. For instance, if the input of a 3D dataset is decomposed in `(2, 3)`, then the output may be decomposed in `(1, 3)` or in `(2, 1)`, but not in `(1, 2)`. Note that the order of the decomposed dimensions (as passed to the `Pencil` constructor) matters. If the decomposed dimensions are the same, then no transposition is performed, and data is just copied if needed.
The `src` and `dest` arrays may be aliased (they can share memory space).
#### Performance tuning
The `method` argument allows choosing between transposition implementations. This can be useful for tuning the performance of MPI data transfers. Two values are currently accepted:

- `Transpositions.PointToPoint()` uses non-blocking point-to-point data transfers (`MPI_Isend` and `MPI_Irecv`). This may be more performant, since data transfers are interleaved with local data transpositions (index permutation of received data). This is the default.
- `Transpositions.Alltoallv()` uses collective `MPI_Alltoallv` for global data transpositions.
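For instance, a transposition can be prepared once with an explicit method and then executed; the sketch below reuses the illustrative `ux` and `uy` arrays defined above:

```julia
using PencilArrays.Transpositions: Transposition, Alltoallv

# Prepare a transposition from the x-pencil to the y-pencil using collective
# MPI_Alltoallv transfers instead of the default point-to-point method.
t = Transposition(uy, ux; method = Alltoallv())
transpose!(t)
```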
### LinearAlgebra.transpose! — Function

    transpose!(t::Transposition; waitall=true)
    transpose!(dest::PencilArray{T,N}, src::PencilArray{T,N};
               method = Transpositions.PointToPoint())
Transpose data from one pencil configuration to the other.
The first variant optionally allows delaying the wait for MPI send operations to complete. This is useful if the caller wants to perform other operations with the already received data. To do this, pass `waitall = false`, and manually invoke `MPI.Waitall` on the `Transposition` object once those operations are done. Note that this option only has an effect when the transposition method is `PointToPoint`.
See `Transposition` for details.
### MPI.Waitall — Function

    MPI.Waitall(t::Transposition)

Wait for completion of all unfinished MPI communications related to the transposition.
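As a sketch of the delayed-wait pattern described above, again using the illustrative `ux` and `uy` arrays from the first example:

```julia
t = Transposition(uy, ux)       # default PointToPoint method
transpose!(t; waitall = false)  # data has been received into uy; sends may still be in flight

# ... perform other operations with the received data in uy ...

MPI.Waitall(t)  # wait for the remaining send operations to complete
```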
### PencilArrays.gather — Function

    gather(x::PencilArray{T, N}, [root::Integer=0]) -> Array{T, N}

Gather data from all MPI processes into one (big) array.

Data is received by the `root` process.

Returns the full array on the `root` process, and `nothing` on the other processes.
Note that `gather` always returns a base `Array`, even when the `PencilArray` wraps a different kind of array (e.g. a `CuArray`).
This function can be useful for testing, but it shouldn't be used with very large datasets!
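As a small sketch of such a test, continuing with the illustrative arrays above (`size_global` returns the global dimensions of the distributed array):

```julia
u_global = gather(uy)  # full Array on the root process, nothing elsewhere

if u_global !== nothing
    @assert size(u_global) == size_global(uy)
end
```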