
Launching, Allocation, and Indexing

Launch Commands

Just as jobs on a shared supercomputer must be launched with a command such as Slurm's srun or Cray's aprun, SST/macro requires the user to specify a launch command for the application. Currently, we encourage the user to use the aprun syntax from Cray, for which documentation can easily be found online. In the parameter file, you specify, e.g.

launch_app1 = user_mpiapp_cxx
launch_app1_cmd = aprun -n 8 -N 2

which launches an external user C++ application with eight ranks and two ranks per node. The aprun command has many command line options (see online documentation), some of which may be supported in future versions of SST/macro. In particular, we are in the process of adding support for thread affinity, OpenMP thread allocation, and NUMA containment flags. Most flags, if included, will simply be ignored.

Allocation Schemes

Before a job can launch, it must first be allocated a set of nodes to run on. Here we choose a simple 2D torus

topology_name = torus
topology_geometry = 3 3
network_nodes_per_switch = 1

which has 9 nodes arranged in a 3x3 grid. For the launch command aprun -n 8 -N 2, we must allocate 4 compute nodes from the pool of 9. Our first option is to specify the first available allocation scheme (Figure 8)

launch_allocation = first_available


Figure 8: First Available Allocation of 4 Compute Nodes on a 3x3 2D Torus



In first available, the allocator simply loops through the list of available nodes in the order they are numbered by the topology object. In the case of a 2D torus, the topology numbers nodes in row-major order, looping over the columns within each row. In general, first available will give a contiguous allocation, but it won't necessarily be ideally structured.
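
To make the numbering concrete, here is a minimal sketch of first-available selection, assuming the row-major numbering described above. It is illustrative only, not the SST/macro allocator implementation.

#include <cstdio>
#include <vector>

// Illustrative sketch: number the nodes of an nx x ny torus in row-major
// order, then take the first n_alloc nodes as a first-available allocation.
int main() {
  const int nx = 3, ny = 3;  // topology_geometry = 3 3
  const int n_alloc = 4;     // aprun -n 8 -N 2 needs 4 nodes

  std::vector<int> allocation;
  for (int row = 0; row < ny && (int)allocation.size() < n_alloc; ++row) {
    for (int col = 0; col < nx && (int)allocation.size() < n_alloc; ++col) {
      int node_id = row * nx + col;  // row-major numbering
      allocation.push_back(node_id);
    }
  }

  for (int id : allocation) std::printf("allocated node %d\n", id);
  // With row-major numbering this selects nodes 0,1,2,3: the entire first
  // row plus the first node of the second row.
  return 0;
}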

To give more structure to the allocation, a Cartesian allocator can be used (Figure 9).

launch_allocation = cartesian
cart_launch_sizes = 2 2
cart_launch_offsets = 0 0


Figure 9: Cartesian Allocation of 4 Compute Nodes on a 3x3 2D Torus



Rather than just looping through the list of available nodes, we explicitly allocate a 2x2 block from the torus.
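
A minimal sketch of how such a Cartesian block might map to node IDs follows, again assuming row-major numbering on the 3x3 geometry. The loop structure is illustrative, not the actual SST/macro allocator code.

#include <cstdio>

// Illustrative sketch: enumerate the node IDs covered by a Cartesian block
// on a 3x3 torus with row-major numbering.
int main() {
  const int nx = 3;                  // topology_geometry = 3 3
  const int size_x = 2, size_y = 2;  // cart_launch_sizes = 2 2
  const int off_x = 0, off_y = 0;    // cart_launch_offsets = 0 0

  for (int dy = 0; dy < size_y; ++dy) {
    for (int dx = 0; dx < size_x; ++dx) {
      int node_id = (off_y + dy) * nx + (off_x + dx);
      std::printf("allocated node %d\n", node_id);
    }
  }
  // Prints nodes 0, 1, 3, 4: a structured 2x2 block rather than the
  // first four nodes in numbered order.
  return 0;
}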

If testing how "topology agnostic" your application is, you can also choose a random allocation.

launch_allocation = random


Figure 10: Random Allocation of 4 Compute Nodes on a 3x3 2D Torus



In many use cases, the number of allocated nodes equals the total number of nodes in the machine. In this case, all allocation strategies allocate the same set of nodes, i.e. the whole machine. However, results may still differ slightly since each allocation strategy still assigns an initial ordering to the nodes, which means a random allocation will give different results from Cartesian and first available.

Indexing Schemes

Once nodes are allocated, the MPI ranks (or equivalent) must be assigned to physical nodes, i.e. indexed. The simplest strategies are block and round-robin. If only running one MPI rank per node, the two strategies are equivalent, indexing MPI ranks in the order received from the allocation list. If running multiple MPI ranks per node, block indexing tries to keep consecutive MPI ranks on the same node (Figure 11).

launch_indexing = block


Figure 11: Block Indexing of 8 MPI Ranks on 4 Compute Nodes
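
A minimal sketch of the block mapping, assuming ranks are simply packed onto the allocated nodes in order (illustrative only, not SST/macro source):

#include <cstdio>

// Illustrative sketch of block indexing: consecutive MPI ranks share a node.
int main() {
  const int num_ranks = 8;       // aprun -n 8
  const int ranks_per_node = 2;  // aprun -N 2

  for (int rank = 0; rank < num_ranks; ++rank) {
    // node is an index into the allocation list, not a physical node ID
    int node = rank / ranks_per_node;  // ranks 0,1 -> node 0; 2,3 -> node 1; ...
    std::printf("rank %d -> allocated node %d\n", rank, node);
  }
  return 0;
}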



In contrast, round-robin spreads out MPI ranks by assigning consecutive MPI ranks to different nodes (Figure 12).

launch_indexing = round_robin


Figure 12: Round-Robin Indexing of 8 MPI Ranks on 4 Compute Nodes
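
The corresponding sketch for round-robin, under the same assumptions (illustrative only):

#include <cstdio>

// Illustrative sketch of round-robin indexing: consecutive MPI ranks land on
// different nodes, cycling through the allocation list.
int main() {
  const int num_ranks = 8;  // aprun -n 8
  const int num_nodes = 4;  // 8 ranks at 2 ranks per node

  for (int rank = 0; rank < num_ranks; ++rank) {
    int node = rank % num_nodes;  // ranks 0,4 -> node 0; 1,5 -> node 1; ...
    std::printf("rank %d -> allocated node %d\n", rank, node);
  }
  return 0;
}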



Finally, one may also choose

launch_indexing = random

Random allocation with random indexing is somewhat redundant. Note also that random allocation with block indexing is not the same as Cartesian allocation with random indexing: random indexing on a Cartesian allocation still gives a contiguous block of nodes, even if consecutive MPI ranks are scattered around it, whereas a random allocation (unless it covers the whole machine) will not give a contiguous set of nodes.
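
To make the distinction concrete, the sketch below contrasts the two sources of randomness. It uses std::shuffle with a fixed seed purely for illustration; it does not reflect how SST/macro itself generates random allocations or indexings.

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Illustrative sketch: random allocation decides WHICH nodes we get (likely
// non-contiguous), while random indexing decides HOW ranks are placed on a
// fixed allocation (still two ranks per node here).
int main() {
  const int machine_nodes = 9;   // 3x3 torus
  const int alloc_nodes = 4;
  const int ranks_per_node = 2;
  std::mt19937 rng(42);

  // Random allocation: pick 4 of the 9 node IDs at random.
  std::vector<int> all_nodes(machine_nodes);
  std::iota(all_nodes.begin(), all_nodes.end(), 0);
  std::shuffle(all_nodes.begin(), all_nodes.end(), rng);
  std::vector<int> allocation(all_nodes.begin(), all_nodes.begin() + alloc_nodes);

  // Random indexing: shuffle the rank -> node-slot mapping over that
  // fixed allocation, keeping two ranks per node.
  std::vector<int> slots(alloc_nodes * ranks_per_node);
  for (int i = 0; i < (int)slots.size(); ++i) slots[i] = i / ranks_per_node;
  std::shuffle(slots.begin(), slots.end(), rng);

  for (int rank = 0; rank < (int)slots.size(); ++rank)
    std::printf("rank %d -> node %d\n", rank, allocation[slots[rank]]);
  return 0;
}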