Simple Message API

For more advanced usage, users may wish to code directly with SST/macro C++ functions. If coding directly in MPI, messages can only be sent as void* arrays. Additionally, MPI exposes only basic send/recv functionality, leaving little control over the exact protocol. If using SST/macro to design a code before the code actually exists, the recommended path is the SST/macro simp API (simple messages). Send/recv calls pass C++ objects rather than void* arrays, allowing data structures to be passed directly between parallel processes. Additionally, simple functions for datagram send, RDMA, and polling are available, allowing the user to experiment directly with how a message is sent. The added flexibility of the C++ interface is potentially beneficial for exploring performance tradeoffs when laying out the initial design of a project. In general, the simp API is geared towards asynchronous execution models rather than bulk-synchronous ones.
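
The core calls used throughout this section are sst_init, sst_finalize, sst_rank, sst_nproc, sst_send, sst_rdma_put, and sst_poll. Their exact prototypes are not reproduced in this excerpt; the declarations below are only assumptions inferred from how the calls are used in the example code, and the authoritative signatures should be taken from sstmac/libraries/sstmsg.h.

//assumed prototypes, inferred from usage later in this section
void sst_init();                                      //initialize the runtime, as MPI_Init does
void sst_finalize();                                  //tear down the runtime, as MPI_Finalize does
int sst_rank();                                       //rank of this process
int sst_nproc();                                      //total number of ranks
void sst_send(int dst, const message::ptr& msg);      //datagram send of a C++ message object
void sst_rdma_put(int dst, const message::ptr& msg);  //one-sided put; an ack is generated at the destination
message::ptr sst_poll();                              //block until a message arrives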

Initial Setup

The code for the example can be found in skeletons/sstmsg. To begin, the basic header files must be included.

#include <sstmac/libraries/sstmsg.h>
using namespace sstmac;
using namespace sstmac::sw;
using namespace sstmac::hw;
#define debug_print(...) std::cout << sprockit::printf(__VA_ARGS__)
namespace sstmsg {
sstmac_register_app(sstmsg);

To simplify the use of SST/macro objects, we declare the three major SST/macro namespaces. For debug printing, SST/macro provides a utility function, sprockit::printf, that works exactly like printf but is compatible with C++ output streams (std::cout). To demonstrate slightly more advanced usage, we declare a new application named "sstmsg" using the sstmac_register_app macro; the main routine for this application is then named sstmsg_main (shown below).
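
For example, the debug_print macro defined above can be dropped anywhere a formatted diagnostic is useful; the snippet below is purely illustrative.

int me = sst_rank();
debug_print("rank %d of %d entering setup\n", me, sst_nproc());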

To create and send messages, we need to create a message class.

class test_message :
  public message
{
  ...
};

We must inherit from the message type. Beyond that, the class can be structured however the user wants. In this test case, each message will perform an action associated with some value.

public:
  typedef enum { compute, request, terminate, data } action_t;
private:
  /**
   * Number of bytes requested OR
   * Number of bytes being sent OR
   * Number of microseconds to compute
   */
  int value_;
  action_t action_;

The meaning of value_ depends on the action, being either the number of bytes requested, the number of bytes being sent, or the number of microseconds to compute.

SST/macro is built on top of the Boost smart pointers header library for memory management. To simplify reading of the code and shorten type names, we explicitly typedef all pointer types in the class.
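
A minimal sketch of what this looks like inside test_message is shown below. The pointer type is assumed to be boost::intrusive_ptr, since raw pointers returned by new are assigned directly to ptr variables throughout the example; the constructor and accessors are likewise assumptions reconstructed from how they are called later (tmsg->value(), tmsg->action(), test_message::tostr).

public:
  //assumed typedef; the real pointer type comes from the SST/macro headers
  typedef boost::intrusive_ptr<test_message> ptr;

  //assumed constructor matching the calls new test_message(value, action) below
  test_message(int value, action_t action)
    : value_(value), action_(action)
  {
  }

  int value() const { return value_; }

  action_t action() const { return action_; }

  //assumed helper for debug printing an action as a string
  static const char* tostr(action_t action) {
    switch (action) {
      case compute: return "compute";
      case request: return "request";
      case terminate: return "terminate";
      case data: return "data";
    }
    return "unknown";
  }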

The next part of the code (omitted) implements the various constructors. The most important member function to implement is the clone method

message::ptr
clone(MESSAGE_TYPES ty) const
{
  test_message::ptr cln = new test_message(value_, action_);
  message::clone_into(ty, cln);
  return cln;
}

which overrides a virtual function in the parent class. Messages must be cloned for various subtle reasons throughout the code. To ensure that the message interface is cloned properly, a call must be made to the parent clone_into method.

Now that we have a class for sending messages, we can create a thread that will poll for incoming messages.

class messenger_thread :
  public sstmsg_thread
{
 private:
  sstmsg_queue::ptr work_queue_;
 public:
  messenger_thread(const sstmsg_queue::ptr& queue)
    : work_queue_(queue)
  {
  }
  virtual void run();
  void terminate();
};

As before, the thread object must inherit from a simple message type, here sstmsg_thread. Additionally, we will make use of another type, sstmsg_queue, which can be used for managing work queues between threads. Every thread must implement a run method, which will be invoked by the SST/macro operating system.

void
messenger_thread::run()
{
  int me = sst_rank();
  while (1)
  {
    message::ptr msg = sst_poll();
    test_message::ptr tmsg = safe_cast(test_message, msg);
    debug_print("receiver got message of type %s on %d\n",
      test_message::tostr(tmsg->action()), me);
    work_queue_->put_message(msg);
    if (tmsg->action() == test_message::terminate)
      break;
  }
}

In this case, the loop just runs continuously, polling for messages. Because of subtleties in discrete event simulation, sst_poll blocks until a message is found. As mentioned in the introduction, non-blocking probes/polls are not really amenable to discrete event simulation. The messenger thread performs no real work and simply places each received message into the work queue.
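
The terminate method declared on messenger_thread is not shown in this excerpt. One possible implementation, given here only as an assumption consistent with the run loop above, is for the thread to send a terminate message to its own rank so that the blocking sst_poll returns and the loop exits.

//an assumed implementation: wake the blocked sst_poll by sending
//a terminate message to our own rank
void
messenger_thread::terminate()
{
  int me = sst_rank();
  sst_send(me, new test_message(0, test_message::terminate));
}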

Now we can skip ahead to the actual main routine.

int
sstmsg_main(int argc, char** argv)
{
  sst_init();
  int me = sst_rank();
  int nproc = sst_nproc();
  std::cout << sprockit::printf("Rank %d starting\n", me);
  //spin off a messenger thread
  sstmsg_queue::ptr work_queue = sstmsg_queue::construct();
  messenger_thread::ptr thr = new messenger_thread(work_queue);
  thr->start();
  run_work_loop(me, nproc, work_queue);
  thr->terminate();
  thr->join();
  std::cout << sprockit::printf("Rank %d terminating\n", me);
  sst_finalize();
  return 0;
}

As in MPI, we must call initialize/finalize routines at the beginning and end of main. Additionally, each process is still assigned a process id (rank) and told the total number of ranks. The code first creates a work queue, creates a messenger thread based on that work queue, and then starts the thread running. The start method invokes all the necessary SST/macro operating system routines, shielding the user from details of the discrete event scheduler. Once the messenger thread is running, the main thread enters a work loop.

void
run_work_loop(int me, int nproc, const sstmsg_queue::ptr& work_queue)
{
  // send out some data requests and some tasks
  for (int t=1; t <= max_num_tasks; ++t){
    int dst = (me + t) % nproc;
    sst_send(dst, new test_message(1e4, test_message::compute));
    sst_send(dst, new test_message(1e6, test_message::request));
  }

The work loop begins by sending out (in round-robin fashion) compute tasks that take 10^4 μs. Additionally, it sends out data requests for 10^6 bytes (1 MB) of data. Once all the tasks and data requests have been sent, the main (worker) thread starts pulling work from the queue.

  while (1)
  {
    message::ptr msg = work_queue->poll_until_message();
    test_message::ptr tmsg = safe_cast(test_message, msg);
    debug_print("got work message of type %s on %d\n",
      test_message::tostr(tmsg->action()), me);
    switch (tmsg->action())
    {
      ...
    }

Depending on the type of message received, the worker thread will perform various actions. If a compute message is received:

    case test_message::compute:
      //compute for x microseconds
      compute(tmsg->value()*1e-6);
      ++num_tasks;
      if (terminate(num_tasks, num_sends))
        return;
      break;

The function compute is provided by the header files and simulates the thread computing for a given number of seconds (hence the factor of 10^-6, converting microseconds to seconds). The code tracks the number of tasks completed and the number of data sends received. The function terminate checks whether all the tasks and sends have been completed.
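
Neither the counters nor the terminate helper are shown in this excerpt. A minimal sketch of the check, assuming each rank expects max_num_tasks compute tasks and max_num_tasks data acknowledgments, might look like the following.

//an assumed helper: all work is done once every compute task has run
//and every outstanding data request has been acknowledged
bool
terminate(int num_tasks, int num_sends)
{
  return num_tasks == max_num_tasks && num_sends == max_num_tasks;
}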

If a request message is received

    case test_message::request:
      //send back x bytes of data
      sst_rdma_put(tmsg->src(), new test_message(tmsg->value()));
      break;

the code responds to the request by performing an RDMA put of tmsg->value() bytes to the requesting node. When an RDMA put completes at the destination, an ack is generated to signal the node that data is now available.

    case test_message::rdma_put_ack:
      ++num_sends;
      if (terminate(num_tasks, num_sends))
        return;
      break;

Finally, a terminate message might be sent, causing the thread to quit the work loop.

    case test_message::terminate:
      return;

This example shows usage for a very simple asynchronous execution model. Although the code cannot be compiled and actually run on an HPC machine the way an MPI C code can, the C++ interface provides greater flexibility and potentially makes performance experiments easier.