SST/macro
The SST/macro software package provides a simulator for large-scale parallel computer architectures. It permits the coarse-grained study of distributed-memory applications. The simulator is driven from either a trace file or skeleton application. The simulator architecture is modular, allowing it to easily be extended with additional network models, trace file formats, software services, and processor models.
Simulation can be broadly categorized as either off-line or on-line. Off-line simulators typically first run a full parallel application on a real machine, recording certain communication and computation events to a simulation trace. This event trace can then be replayed post-mortem in the simulator. Most common are MPI traces, which record all MPI events, and SST/macro provides the DUMPI utility (Using DUMPI) for collecting and replaying MPI traces. Trace extrapolation can extend the usefulness of off-line simulation by estimating behavior at system scales too large or impractical to trace directly, but it is typically limited to strictly weak scaling.
We turn to on-line simulation when the hardware or application parameters need to change. On-line simulators instead run real application code, allowing native C/C++/Fortran to be compiled directly into the simulator. SST/macro intercepts certain function calls, estimating how much time passes rather than actually executing the function. In MPI programs, for example, calls to MPI_Send are linked to the simulator instead of passing to the real MPI library. If desired, SST/macro can act as a full MPI emulator, delivering messages between ranks and replicating the behavior of a real MPI implementation.
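As a minimal sketch (ordinary MPI code, nothing SST/macro-specific; the message payload is made up for illustration), a program like the one below can be compiled against the simulator, and its MPI_Send/MPI_Recv calls are intercepted and modeled rather than handed to a real MPI library:

\code
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int payload = rank;
  if (rank == 0 && size > 1) {
    // Inside the simulator, the cost of this send is modeled rather than
    // executed by a real MPI library (unless full emulation is enabled).
    MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank 1 received %d\n", payload);
  }

  MPI_Finalize();
  return 0;
}
\endcode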
Although SST/macro supports both on-line and off-line modes, on-line simulation is encouraged because event traces are much less flexible: they contain a fixed sequence of events, so application inputs and the number of nodes cannot be changed. Without flexible control flow, a trace also cannot capture dynamic behavior such as load balancing or faults. On-line simulation can explore a much broader problem space because applications evolve directly in the simulator.
For large, system-level experiments with thousands of network endpoints, high-accuracy cycle-accurate simulation is not possible, or at least not convenient. Simulation requires coarse-grained approximations to be practical. SST/macro is therefore designed for specific cost/accuracy tradeoffs: it should still capture complex cause/effect behavior in applications and hardware, yet be efficient enough to simulate at the system level. To speed up simulator execution, we encourage skeletonization, discussed further in the PDF manual. A high-quality skeleton is an application model that reproduces certain characteristics of the full application with only limited computation. We also encourage uncertainty quantification (UQ) for validating simulator results, also discussed in the PDF manual. Skeletonization and UQ are the two main elements of the "canonical" SST/macro workflow (Figure 1).
\image html figures/workflow.png <b>Figure 1:</b> SST/macro workflow.
The following sections describe the state of the software APIs (found in sstmac) that are available in SST/macro for use by applications, as of this release. The level of testing indicates the integration of compliance/functionality tests into our make check test suite.
OpenSHMEM: Most of the standard OpenSHMEM tests pass. Those that do not either have not been ported to C++ or exercise the single unsupported feature (collect).
Sockets: The Socket API is mostly implemented, and most basic client/server functionality is available. However, only the default socket options are allowed; in most cases, setsockopt is just a no-op.
Pthreads: The pthread_create(), pthread_join(), and pthread_self() functions are implemented. A basic pthread test validates the core spawn/run/join behavior (see the sketch below).
GNI: Cray's low-level messaging interface, GNI, is being implemented.
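As a sketch of the spawn/run/join pattern exercised by the pthread test (plain pthread code; the worker function is made up for illustration):

\code
#include <pthread.h>
#include <cstdio>

// Worker that simply reports it is running.
static void* worker(void*)
{
  printf("worker thread %lu running\n", (unsigned long) pthread_self());
  return nullptr;
}

int main()
{
  pthread_t thread;
  // Spawn, run, and join -- the core behavior validated by the pthread test.
  if (pthread_create(&thread, nullptr, worker, nullptr) != 0) return 1;
  pthread_join(thread, nullptr);
  return 0;
}
\endcode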
The following analysis tools are currently available in SST/macro. Some are thoroughly tested. Others have undergone some testing, but are still considered Beta. Others have been implemented, but are relatively untested.
Fixed-time quanta (FTQ): Generates a .csv file tabulating the amount of time spent on computation and communication as the application progresses, along with a Gnuplot script for visualizing the data as a histogram. More details are given in Fixed-Time Quanta Charts.
Trace analysis: The traceanalyzer executable outputs fine-grained metrics for characterizing application execution.
Congestion: With the -d "<stats> congestion" command line option, SST/macro will dump statistics for network congestion on individual links (packet train model only).
The use of global variables in SST/macro inherently creates a false-sharing scenario because user-space threads are used to model parallel processes. While we do have a mechanism for supporting them (see the PDF manual for more information), the file using them must be compiled as C++. This is somewhat unfortunate, because many C programs use global variables as a convenient means of accessing program data. In almost every case, though, a C program can simply be compiled as C++ by changing the extension to .cc or .cpp.
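For example, a C source file such as the following made-up snippet compiles unchanged as C++ once its extension is changed from .c to .cc, after which the global-variable support described in the PDF manual can be applied:

\code
#include <stdio.h>

// A typical C-style global holding program data. With user-space threads
// modeling parallel processes, such a global would otherwise be shared
// (falsely) across all simulated ranks.
int iterations = 100;

int main(void)
{
  printf("running %d iterations\n", iterations);
  return 0;
}
\endcode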
Everything from MPI 2 is implemented with a few exceptions noted below. The following are not implemented (categorized by MPI concepts):
Communicators:
- MPI_Intercomm_create()
- Topology communicators

Datatypes:
- MPI_Type_set_name() (MPI test 171)
- MPI_Create_darray(), MPI_Create_subarray(), and MPI_Create_resized()
- MPI_Pack_external(), which is only useful for sending messages across different MPI implementations
- MPI_Type_match_size() (extended Fortran support)
- Using Fortran types (e.g. MPI_COMPLEX) from C

Info and attributes:
- No MPI_Info_*, MPI_*_keyval, or MPI_Attr_* functions are supported.
Requests and probing:
- MPI_Grequest_* functions (generalized requests)
- Busy-waiting on non-blocking test functions in a loop, such as:
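A representative sketch of the problematic pattern (the source and tag arguments are only illustrative):

\code
// Busy-poll for an incoming message until one is available.
int flag = 0;
MPI_Status status;
while (!flag) {
  MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
}
\endcode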
For some configurations, simulation time never advances in the MPI_Iprobe call. This causes an infinite loop that never returns to the discrete event manager. Even if configured so that time progresses, the code will work but will take a very long time to run.
Collectives:
- MPI_Allreduce() (MPI test 22)
- MPI_Reduce() and MPI_Allreduce()
- MPI_Alltoallw() is not implemented
- MPI_Exscan() is not implemented
- MPI_Reduce_Scatter_block() is not implemented
- MPIX_* functions are not implemented (such as non-blocking collectives)
- Calling MPI functions from user-defined reduce operations (MPI test 39; including MPI_Comm_rank)
Miscellaneous:
- MPI_Is_thread_main() is not implemented.
In OpenSHMEM, only the collect and fcollect functions of the API are not implemented (they will be added in a future release).
Also, as with the handling of global variables discussed in the Global Variables section of the PDF manual, SHMEM globals need to use a different type. For primitives and primitive arrays, refer to <sstmac/openshmem/shmem/globals.h> for the replacement types: a plain global declaration must be rewritten using the corresponding type from that header.
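Purely as an illustration of the kind of rewrite involved (the wrapper name global_int below is hypothetical; the actual replacement types are those defined in <sstmac/openshmem/shmem/globals.h>):

\code
// Before: a plain global used as a SHMEM symmetric variable.
int value = 0;

// After (sketch only): declare the global with the primitive wrapper type
// supplied by <sstmac/openshmem/shmem/globals.h>. "global_int" is a
// placeholder, not the real type name -- consult the header.
// global_int value(0);
\endcode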
SST/macro can run Fortran90 applications. However, at least with gfortran, Fortran variables created with allocate() go on the heap, which creates a false-sharing situation nearly everywhere as user-space threads swap in and out. A workaround is to build a map of data structures holding the needed variables, indexed by a rank that is passed to every function. We are exploring more user-friendly alternatives. The Fortran MPI interface is also still somewhat incomplete: most functions are just wrappers to the C/C++ implementation, and we are working on adding the remaining bindings.