SST/macro
Introduction

Overview

The SST/macro software package provides a simulator for large-scale parallel computer architectures. It permits the coarse-grained study of distributed-memory applications. The simulator is driven from either a trace file or skeleton application. The simulator architecture is modular, allowing it to easily be extended with additional network models, trace file formats, software services, and processor models.

Simulation can be broadly categorized as either off-line or on-line. Off-line simulators typically first run a full parallel application on a real machine, recording certain communication and computation events to a simulation trace. This event trace can then be replayed post-mortem in the simulator. Most common are MPI traces, which record all MPI events, and SST/macro provides the DUMPI utility (Using DUMPI) for collecting and replaying MPI traces. Trace extrapolation can extend the usefulness of off-line simulation by estimating large or untraceable system scales without having to collect a trace, but it is typically limited to strictly weak scaling.
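To make the off-line mode concrete, the following is a minimal sketch of replaying a recorded event sequence for a single rank. The `TraceEvent` record, the `NetworkModel` parameters, the `replay` function, and the latency-plus-bandwidth cost model are all hypothetical illustrations; they are not the DUMPI trace format or any SST/macro API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical trace record (NOT the DUMPI format).
enum class EventKind { Compute, Send };

struct TraceEvent {
  EventKind kind;
  double seconds;     // measured duration, used for Compute events
  std::size_t bytes;  // message size, used for Send events
};

// Hypothetical parameters of the simulated network.
struct NetworkModel {
  double latency_s;      // per-message latency in seconds
  double bandwidth_Bps;  // bandwidth in bytes per second
};

// Replay a fixed event sequence and return the simulated elapsed time.
double replay(const std::vector<TraceEvent>& trace, const NetworkModel& net) {
  assert(net.bandwidth_Bps > 0.0);
  double now = 0.0;
  for (const TraceEvent& ev : trace) {
    if (ev.kind == EventKind::Compute) {
      now += ev.seconds;  // compute time is taken directly from the trace
    } else {
      // communication time is re-estimated for the simulated network
      now += net.latency_s + static_cast<double>(ev.bytes) / net.bandwidth_Bps;
    }
  }
  return now;
}
```

In this sketch the network parameters can be varied between replays, but the order and content of the events cannot, because they were fixed when the trace was recorded.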

We turn to on-line simulation when the hardware or application parameters need to change. On-line simulators instead run real application code, allowing native C/C++/Fortran to be compiled directly into the simulator. SST/macro intercepts certain function calls, estimating how much time passes rather than actually executing the function. In MPI programs, for example, calls to MPI_Send are linked to the simulator instead of being passed to the real MPI library. If desired, SST/macro can act as a full MPI emulator, delivering messages between ranks and replicating the behavior of a full MPI implementation.
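The idea of intercepting a call and charging simulated time instead of executing it can be sketched as follows. This is only an illustration under stated assumptions: `VirtualClock`, `LinkModel`, `intercepted_send`, and `run_halving_exchange` are hypothetical names, the cost model is a simple latency-plus-bandwidth estimate, and none of this reflects how SST/macro actually links or models MPI_Send.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-rank virtual clock (not an SST/macro class).
struct VirtualClock {
  double now_s = 0.0;
};

// Hypothetical link parameters used to estimate message cost.
struct LinkModel {
  double latency_s;
  double bandwidth_Bps;
};

// Stand-in for an intercepted send call: no data is moved, the function
// only advances the caller's virtual clock by an estimated cost.
double intercepted_send(VirtualClock& clock, std::size_t bytes,
                        const LinkModel& link) {
  assert(link.bandwidth_Bps > 0.0);
  const double cost =
      link.latency_s + static_cast<double>(bytes) / link.bandwidth_Bps;
  clock.now_s += cost;
  return cost;
}

// Application code runs natively: the loop and its data-dependent exit
// condition are real, only the communication call is modeled.
double run_halving_exchange(std::size_t bytes, const LinkModel& link) {
  VirtualClock clock;
  while (bytes >= 1) {
    intercepted_send(clock, bytes, link);
    bytes /= 2;  // next message is half the size
  }
  return clock.now_s;
}
```

Because the application's own control flow decides how many sends occur, changing the input size changes the simulated event sequence without collecting anything in advance.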

Although SST/macro supports both on-line and off-line modes, on-line simulation is encouraged because event traces are much less flexible: they contain a fixed sequence of events, so application inputs and the number of nodes cannot be changed. Without flexible control flow, a trace also cannot capture dynamic behavior such as load balancing or faults. On-line simulation can explore a much broader problem space because the application evolves directly in the simulator.

For large, system-level experiments with thousands of network endpoints, cycle-accurate simulation is not possible, or at least not convenient. Simulation requires coarse-grained approximations to be practical. SST/macro is therefore designed for specific cost/accuracy tradeoffs. It should still capture complex cause/effect behavior in applications and hardware, but be efficient enough to simulate at the system level. For speeding up simulator execution, we encourage skeletonization, discussed further in the PDF manual. A high-quality skeleton is an application model that reproduces certain characteristics with only limited computation. We also encourage uncertainty quantification (UQ) for validating simulator results, discussed further in the PDF manual. Skeletonization and UQ are the two main elements in the "canonical" SST/macro workflow (Figure 1).


\image html figures/workflow.png
  <b>Figure 1:</b> SST/macro workflow. 



Currently Supported

Programming APIs

The following sections describe the state of the software APIs (found in sstmac) that are available in SST/macro for use by applications, as of this release. The level of testing indicates how far compliance/functionality tests have been integrated into our make check test suite.

Final and Tested

Some testing complete

In development

Analysis Tools and Statistics

The following analysis tools are currently available in SST/macro. Some are thoroughly tested. Others have undergone some testing, but are still considered Beta. Others have been implemented, but are relatively untested.

Fully tested

Beta

Untested

Known Issues and Limitations

Global Variables

The use of global variables in SST/macro inherently creates a false-sharing scenario because user-space threads are used to model parallel processes. While we do have a mechanism for supporting them (see the PDF manual for more information), the file using them must be compiled as C++. This is somewhat unfortunate, because many C programs use global variables as a convenient means of accessing program data. In almost every case, though, a C program can simply be compiled as C++ by changing the extension to .cc or .cpp.
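The sharing problem can be illustrated with a small sketch in which several simulated ranks live in one address space. Everything here is hypothetical: `PerRankInt`, `run_with_global`, and `run_with_per_rank` are illustrative names, ranks are interleaved with plain loops rather than real user-space threads, and the wrapper is not the mechanism SST/macro itself uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A plain global: every simulated rank in the process sees the same copy.
int shared_counter = 0;

// Hypothetical per-rank replacement for a global int.
class PerRankInt {
 public:
  PerRankInt(int nranks, int init)
      : values_(static_cast<std::size_t>(nranks), init) {}

  // Access the copy that belongs to the given rank.
  int& at(int rank) {
    assert(rank >= 0 && rank < static_cast<int>(values_.size()));
    return values_[static_cast<std::size_t>(rank)];
  }

 private:
  std::vector<int> values_;
};

// Each rank increments "its" counter 'steps' times, but all ranks
// actually update the same memory location.
int run_with_global(int nranks, int steps) {
  shared_counter = 0;
  for (int s = 0; s < steps; ++s)
    for (int r = 0; r < nranks; ++r)
      ++shared_counter;
  return shared_counter;  // the value any rank would observe
}

// Same interleaving, but each rank updates only its own copy.
int run_with_per_rank(int nranks, int steps, int observer) {
  PerRankInt counter(nranks, 0);
  for (int s = 0; s < steps; ++s)
    for (int r = 0; r < nranks; ++r)
      ++counter.at(r);
  return counter.at(observer);
}
```

With four ranks and three steps, the global version reports twelve increments to every rank, whereas each rank in the per-rank version sees only its own three.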

MPI

Everything from MPI 2 is implemented with a few exceptions noted below. The following are not implemented (categorized by MPI concepts):

Communicators

Datatypes and Addressing

Info and Attributes

No MPI_Info_*, MPI_*_keyval, or MPI_Attr_* functions are supported.

Point-to-Point

Collectives

Miscellaneous

OpenSHMEM

All of the API is implemented except the collect and fcollect functions, which will be added in future releases.

As with the handling of global variables discussed in the Global Variables section and in the PDF manual, SHMEM globals need to use a different type. For primitives and primitive arrays, refer to <sstmac/openshmem/shmem/globals.h> for the replacement types. For example, the following code

int my_global_var = 4;
double an_array[6];

needs to become

shmem_int my_global_var(4);
shmem_arr<double, 6> an_array;

Fortran

SST/macro can run Fortran90 applications. However, at least with gfortran, Fortran variables created with allocate() go on the heap. This creates a false-sharing situation nearly everywhere as threads swap in and out. A workaround is to build a large map of data structures that store the needed variables, indexed by a rank that is passed to every function. We are exploring more user-friendly alternatives. The Fortran MPI interface is also still somewhat incomplete. Most functions are just wrappers to the C/C++ implementation, and we are working on adding the bindings.
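The shape of the rank-indexed workaround can be sketched as below. The sketch is written in C++ rather than Fortran purely for illustration, and `RankState`, `StateMap`, `init_rank`, and `step` are hypothetical names, not part of SST/macro.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical bundle of variables that would otherwise be program-wide state.
struct RankState {
  std::vector<double> field;
  int iteration = 0;
};

// One entry per simulated rank, keyed by rank.
using StateMap = std::map<int, RankState>;

// Create the state for one rank; the rank is passed explicitly.
void init_rank(StateMap& states, int rank, std::size_t n) {
  states[rank].field.assign(n, static_cast<double>(rank));
}

// Every routine receives the rank and looks up only that rank's state.
void step(StateMap& states, int rank) {
  auto it = states.find(rank);
  assert(it != states.end());  // rank must have been initialized
  RankState& st = it->second;
  for (double& x : st.field) x += 1.0;
  ++st.iteration;
}
```

The cost of this pattern is that the rank argument has to be threaded through every function that touches the data.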