Load balancing in PPM comprises two main components: domain decomposition and assignment of sub-domains onto processors. While the former has to ensure sufficient granularity and partitioning of the computational cost, the latter has to ensure even distribution of computational load among processors, accounting for possible differences in processor speeds.
In order to account for a large spectrum of possible applications, PPM provides a number of different adaptive domain decomposition techniques for particles, meshes, and volumes, the latter defining geometric sub-domains with neither meshes nor particles present. These decompositions currently include:
Recursive orthogonal bisection is based on an adaptive binary tree, where subdivisions are allowed in all spatial directions.
The user-defined decomposition allows the client program to explicitly specify the sub-domains.
The null decomposition does not perform any domain decomposition. It creates only one sub-domain which is the computational domain itself. This trivial decomposition is used to evenly distribute the particles among processors, irrespective of their spatial location. The resulting special topology is called the ring topology and the sub-domain is assigned to every processor.
The computational cost for each sub-domain (e.g. given by the number of particles, number of mesh points, or the true computational cost) is known from the domain decomposition step. The individual processor speeds are measured using PPM routines. This involves performing a molecular dynamics test simulation of a small Lennard-Jones system using an increasing number of particles until all processors yield sufficient timing statistics. Such speed measurements are important in inhomogeneous machine clusters or under concurrent use of shared processors and typically take less than 1 second of wall-clock time.
Using the computational costs of the sub-domains and the relative processor speeds, PPM provides several methods of assigning the sub-domains to the processors. The PPM-internal method assigns contiguous blocks of sub-domains to processors until the accumulated cost of a processor is greater or equal to the theoretical average cost under uniform load distribution. The average is weighted with the relative processor speeds. In addition, four different Metis-based and a user-defined assignment are available. In conjunction with a user-defined domain decomposition, the user-defined assignment scheme allows the client program to enforce a specific processor affiliation for each sub-domain. For a Metis-assignment, the sub-domain partitioning problem is first translated to the equivalent graph partitioning problem. Two different conversions are supported by \metis: in the primal scheme each sub-domain is represented by a node in the graph and the neighborhood relations by the edges of the graph; the dual scheme represents sub-domains by graph edges. Graph partitioning is then performed such as to minimize either edge cut or communication volume. The relative processor speeds and computational costs of the sub-domains are accounted for by means of weights assigned to the nodes and edges of the graph.