SST/macro
The SST/macro software package provides a simulator for large-scale parallel computer architectures. It permits the coarse-grained study of distributed-memory applications. The simulator is driven from either a trace file or skeleton application. The simulator architecture is modular, allowing it to easily be extended with additional network models, trace file formats, software services, and processor models.
Simulation can be broadly categorized as either off-line or on-line. Off-line simulators typically first run a full parallel application on a real machine, recording certain communication and computation events to a simulation trace. This event trace can then be replayed post-mortem in the simulator. Most common are MPI traces, which record all MPI events, and SST/macro provides the DUMPI utility (Using DUMPI) for collecting and replaying MPI traces. Trace extrapolation can extend the usefulness of off-line simulation by estimating behavior at system scales too large or impractical to trace directly, but it is typically limited to strictly weak scaling.
We turn to on-line simulation when the hardware or application parameters need to change. On-line simulators instead run real application code, allowing native C/C++/Fortran to be compiled directly into the simulator. SST/macro intercepts certain function calls, estimating how much time passes rather than actually executing the function. In MPI programs, for example, calls to MPI_Send are linked to the simulator instead of passing to the real MPI library. If desired, SST/macro can act as a full MPI emulator, delivering messages between ranks and replicating the behavior of a real MPI implementation.
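As a minimal sketch (ordinary MPI code, nothing SST/macro-specific; the message payload is made up for illustration), a program like the one below can be compiled against the simulator, and its MPI_Send/MPI_Recv calls are intercepted and modeled rather than handed to a real MPI library:

\code
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int payload = rank;
  if (rank == 0 && size > 1) {
    // Inside the simulator, the cost of this send is modeled rather than
    // executed by a real MPI library (unless full emulation is enabled).
    MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank 1 received %d\n", payload);
  }

  MPI_Finalize();
  return 0;
}
\endcode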
Although SST/macro supports both on-line and off-line modes, on-line simulation is encouraged because event traces are much less flexible: they contain a fixed sequence of events, so application inputs and the number of nodes cannot be changed. Without flexible control flow, a trace also cannot capture dynamic behavior such as load balancing or faults. On-line simulation can explore a much broader problem space because applications evolve directly in the simulator.
For large, system-level experiments with thousands of network endpoints, high-accuracy cycle-accurate simulation is not possible, or at least not convenient. Simulation requires coarse-grained approximations to be practical. SST/macro is therefore designed for specific cost/accuracy tradeoffs: it should still capture complex cause/effect behavior in applications and hardware, yet be efficient enough to simulate at the system level. To speed up simulator execution, we encourage skeletonization, discussed further in the PDF manual. A high-quality skeleton is an application model that reproduces certain characteristics of the full application with only limited computation. We also encourage uncertainty quantification (UQ) for validating simulator results, also discussed in the PDF manual. Skeletonization and UQ are the two main elements of the "canonical" SST/macro workflow (Figure 1).
\image html figures/workflow.png <b>Figure 1:</b> SST/macro workflow.
The following sections describe the state of the software APIs (found in sstmac) that are available in SST/macro for use by applications, as of this release. The level of testing indicates the integration of compliance/functionality tests into our make check test suite.
OpenSHMEM: Most of the standard OpenSHMEM tests pass. Those that do not either have not been ported to C++ or exercise the single unsupported feature (collect).
Sockets: The Socket API is mostly implemented, and most basic client/server functionality is available. However, only the default socket options are allowed; in most cases, setsockopt is just a no-op.
Pthreads: The pthread_create(), pthread_join(), and pthread_self() functions are implemented. A basic pthread test validates the core spawn/run/join behavior (see the sketch below).
GNI: Cray's low-level messaging interface, GNI, is being implemented.
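As a sketch of the spawn/run/join pattern exercised by the pthread test (plain pthread code; the worker function is made up for illustration):

\code
#include <pthread.h>
#include <cstdio>

// Worker that simply reports it is running.
static void* worker(void*)
{
  printf("worker thread %lu running\n", (unsigned long) pthread_self());
  return nullptr;
}

int main()
{
  pthread_t thread;
  // Spawn, run, and join -- the core behavior validated by the pthread test.
  if (pthread_create(&thread, nullptr, worker, nullptr) != 0) return 1;
  pthread_join(thread, nullptr);
  return 0;
}
\endcode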
The following analysis tools are currently available in SST/macro. Some are thoroughly tested. Others have undergone some testing, but are still considered Beta. Others have been implemented, but are relatively untested.
Fixed-time quanta (FTQ): Generates a .csv file tabulating the amount of time spent on computation and communication as the application progresses, along with a Gnuplot script for visualizing the data as a histogram. More details are given in Fixed-Time Quanta Charts.
Trace analysis: The traceanalyzer executable outputs fine-grained metrics for characterizing application execution.
Congestion: With the -d "<stats> congestion" command line option, SST/macro will dump statistics for network congestion on individual links (packet train model only).
The use of global variables in SST/macro inherently creates a false-sharing scenario because user-space threads are used to model parallel processes. While we do have a mechanism for supporting them (see the PDF manual for more information), the file using them must be compiled as C++. This is somewhat unfortunate, because many C programs use global variables as a convenient means of accessing program data. In almost every case, though, a C program can simply be compiled as C++ by changing the extension to .cc or .cpp.
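For example, a C source file such as the following made-up snippet compiles unchanged as C++ once its extension is changed from .c to .cc, after which the global-variable support described in the PDF manual can be applied:

\code
#include <stdio.h>

// A typical C-style global holding program data. With user-space threads
// modeling parallel processes, such a global would otherwise be shared
// (falsely) across all simulated ranks.
int iterations = 100;

int main(void)
{
  printf("running %d iterations\n", iterations);
  return 0;
}
\endcode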
Everything from MPI 2 is implemented with a few exceptions noted below. The following are not implemented (categorized by MPI concepts):
Communicators:
- MPI_Intercomm_create()
- Topology communicators

Datatypes:
- MPI_Type_set_name() (MPI test 171)
- MPI_Create_darray(), MPI_Create_subarray(), and MPI_Create_resized()
- MPI_Pack_external(), which is only useful for sending messages across different MPI implementations
- MPI_Type_match_size() (extended Fortran support)
- Using Fortran types (e.g. MPI_COMPLEX) from C

Info and attributes:
- No MPI_Info_*, MPI_*_keyval, or MPI_Attr_* functions are supported.
Requests and probing:
- MPI_Grequest_* functions (generalized requests)
- Busy-waiting on non-blocking test functions in a loop, such as:
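A representative sketch of the problematic pattern (the source and tag arguments are only illustrative):

\code
// Busy-poll for an incoming message until one is available.
int flag = 0;
MPI_Status status;
while (!flag) {
  MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
}
\endcode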
For some configurations, simulation time never advances in the MPI_Iprobe call. This causes an infinite loop that never returns to the discrete event manager. Even if configured so that time progresses, the code will work but will take a very long time to run.
Collectives:
- MPI_Allreduce() (MPI test 22)
- MPI_Reduce() and MPI_Allreduce()
- MPI_Alltoallw() is not implemented
- MPI_Exscan() is not implemented
- MPI_Reduce_Scatter_block() is not implemented
- MPIX_* functions are not implemented (such as non-blocking collectives)
- Calling MPI functions from user-defined reduce operations (MPI test 39; including MPI_Comm_rank)
Miscellaneous:
- MPI_Is_thread_main() is not implemented.
In OpenSHMEM, only the collect and fcollect functions of the API are not implemented (they will be added in a future release).
Also, as with the handling of global variables discussed in the Global Variables section of the PDF manual, SHMEM globals need to use a different type. For primitives and primitive arrays, refer to <sstmac/openshmem/shmem/globals.h> for the replacement types: a plain global declaration must be rewritten using the corresponding type from that header.
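Purely as an illustration of the kind of rewrite involved (the wrapper name global_int below is hypothetical; the actual replacement types are those defined in <sstmac/openshmem/shmem/globals.h>):

\code
// Before: a plain global used as a SHMEM symmetric variable.
int value = 0;

// After (sketch only): declare the global with the primitive wrapper type
// supplied by <sstmac/openshmem/shmem/globals.h>. "global_int" is a
// placeholder, not the real type name -- consult the header.
// global_int value(0);
\endcode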
SST/macro can run Fortran90 applications. However, at least with gfortran, Fortran variables created with allocate() go on the heap, which creates a false-sharing situation nearly everywhere as user-space threads swap in and out. A workaround is to build a map of data structures holding the needed variables, indexed by a rank that is passed to every function. We are exploring more user-friendly alternatives. The Fortran MPI interface is also still somewhat incomplete: most functions are just wrappers to the C/C++ implementation, and we are working on adding the remaining bindings.