This document was created using the >e-novative> DocBook Environment (eDE)

Atollrun v1.0 High Level Documentation

Documentation of the Atollrun library and the script 'atollrun.py'

Holger Sattel

19.01.2004

Revision History
Revision 0.1, 19.01.2004, HS
Initial version.
Revision 0.2, 02.02.2004, HS
Renamed ATOLL++ to Atollrun. Added C wrapper. Minor changes.

Abstract

This document describes the Atollrun environment. The Atollrun library is written in C++ and is responsible for initializing all communication at the very beginning of each program that uses the Atollrun networking environment. Its functionality is similar to the well known MPI_Init() function in MPI. The second part of the Atollrun environment is the starter script 'atollrun.py'. It is the counterpart to the library and is comparable to 'mpirun' in MPI.


Table of Contents

Introduction
1. The Initialization Protocol
1.1. Connecting Hosts
1.2. Collecting Port Numbers
1.3. Distributing Environment Informations
1.4. Establishing Connections and Communication Test
1.5. Finish
1.6. Initialization Example
2. The Starter Script 'atollrun.py'
2.1. The Prerequisites
2.2. The command line parameters
2.3. A sample 'atollrun.py' session
3. The Atollrun library
3.1. Classes and Methods
3.2. Atollrun example program ('porttest')
4. The C Wrapper
4.1. The C functions
4.2. Atollrun example program ('porttest') using the C wrapper functions

List of Figures

1.1. UML sequence diagram of the initialization protocol
3.1. UML class diagram of the Atollrun library

List of Examples

1.1. Initialization protocol of process w/ ID 0 (2 processes total)
1.2. Initialization protocol of process w/ ID 1 (2 processes total)
2.1. A sample machinefile
2.2. A sample groupfile
2.3. A sample 'atollrun.py' session
3.1. The source code of 'porttest'
4.1. The source code of 'porttest_c'

Introduction

The Atollrun environment addresses the initial setup that every program must perform before it can communicate over the ATOLL network. Distributing the port numbers of the peer applications is a problem common to networking environments such as ATOLL or Myrinet. The MPI library implements one solution: a helper script distributes the 'addresses' of all members of the parallel application to every member. The counterpart of this script is the MPI_Init() function, which must be executed before any other MPI function. The drawback of this approach is that you are then restricted to the routines provided by the MPI library. You cannot drop down to the lower level, which means you carry a lot of unnecessary overhead.

The goal of the Atollrun environment is to perform only this initialization process and then hand control over the communication back to the low level routines defined in the underlying PALMS library.

The first thing needed to achieve this is a protocol between one central master and one or more processes communicating over an ATOLL network. Chapter 1 covers this protocol:

  • Initialization protocol: The protocol is a simple one based on ASCII text. The detailed description is given in chapter 1.

Next, two parties are needed that speak this protocol: one central master (the starter script) and the processes communicating over an ATOLL network, whose side of the protocol is implemented in the Atollrun library.

  • starter script: The starter script is used to start the parallel application. It connects to the hosts and starts the applications. Then it asks every process for its port and distributes the ports of all processes to every process. A detailed description of how this works is given in chapter 2.

  • MPI_Init() replacement: The counterpart communicating with the starter script is encapsulated in the Atollrun class of the Atollrun library. It also stores the information received from the starter script, such as the port numbers of the peer processes. A detailed description of the classes and methods of the Atollrun library is given in chapter 3.

Chapter 1. The Initialization Protocol

The protocol is divided into 5 parts:

  • Connecting Hosts: In this step the starter script connects to the hosts, either directly (localhost) or via SSH, and starts the processes.

  • Collecting Port Numbers: Every process opens an ATOLL host port and reports the number back to the starter script.

  • Distributing Environment Informations: The starter script now distributes the collected information (i.e. IDs and host ports) to all (remote) processes.

  • Establishing Connections and Communication Test: Now the processes may initialize the connections, perform a small communication test and report the result back to the starter script.

  • Finish: When every process has reported the result of the previous step back, the starter script sends a signal indicating the initialization process has finished. Now the processes are responsible for the further communication using the low level PALMS functions.

Here's a simple UML sequence diagram showing the message flow of such an initialization conversation using this protocol:


Figure 1.1. UML sequence diagram of the initialization protocol

The protocol itself is text based and works over the standard input and output channels of the processes!

1.1. Connecting Hosts

After connecting to the hosts (remote ones via SSH), the starter script starts the processes and waits until it detects the following string on the stdout of each process:

	__ATOLLRUN_INIT__

This means the process has reached the initialization procedure and waits for a command from the starter script. When every process has reached this position, the starter script continues with the next step.
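
The waiting step can be sketched as follows. This is a minimal illustration, assuming the output of a process is available as a line-oriented text stream; the function name wait_for_init_marker is hypothetical and not taken from 'atollrun.py', and the sketch does not use pexpect, which the real script depends on.

```python
import io

INIT_MARKER = "__ATOLLRUN_INIT__"


def wait_for_init_marker(stream):
    """Read lines from `stream` until the init marker appears.

    Returns the lines seen before the marker (ordinary program output).
    Raises EOFError if the stream ends before the marker is seen.
    """
    skipped = []
    for line in stream:
        text = line.rstrip("\r\n")
        if text == INIT_MARKER:
            return skipped
        skipped.append(text)
    raise EOFError("process ended before reaching " + INIT_MARKER)


# In-memory stand-in for the stdout pipe of one process (assumption).
fake_stdout = io.StringIO("loading...\n__ATOLLRUN_INIT__\n")
print(wait_for_init_marker(fake_stdout))
```

A starter would run this once per process and only move on to the next step after every call has returned.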

1.2. Collecting Port Numbers

Now the starter script sends the following string to the processes:

	__YOUR_PORT__

After receiving the message, the process tries to acquire an ATOLL host port and in case of success it sends the following string back (in the example the acquired ATOLL host port is 267365):

	__MY_PORT__267365

In case of a failure it sends back:

	__MY_PORT__FAILURE

There is a special variant of this step when there is only one process. In this case the starter script sends the following string to the single process:

	__YOUR_PORT_SINGLE__

The process then knows that it has no peer and may skip opening a port (the Atollrun library has a special option for this). The answer string is exactly the same as in the normal mode.

When every process has reported an open ATOLL host port and there was no error, the script continues with step 3; otherwise the script exits with an error.
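
As an illustration of this step, the following sketch selects the request string and interprets the reply. The message strings come from the description above; the helper names port_request and parse_port_reply are hypothetical, and returning None for a failure is a choice made for this sketch only.

```python
PORT_REQUEST = "__YOUR_PORT__"
PORT_REQUEST_SINGLE = "__YOUR_PORT_SINGLE__"
PORT_REPLY_PREFIX = "__MY_PORT__"


def port_request(num_processes):
    """Pick the request string: the special variant if there is only one process."""
    return PORT_REQUEST_SINGLE if num_processes == 1 else PORT_REQUEST


def parse_port_reply(line):
    """Return the reported port as int, or None if the process reported FAILURE.

    Raises ValueError for anything that is not a valid __MY_PORT__ reply.
    """
    text = line.strip()
    if not text.startswith(PORT_REPLY_PREFIX):
        raise ValueError("unexpected reply: %r" % text)
    payload = text[len(PORT_REPLY_PREFIX):]
    if payload == "FAILURE":
        return None
    return int(payload)  # int() raises ValueError on a malformed number


print(port_request(2))
print(parse_port_reply("__MY_PORT__267365"))
print(parse_port_reply("__MY_PORT__FAILURE"))
```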

1.3. Distributing Environment Informations

At this point the script knows all the information it needs about the processes, so the next step is to propagate this information to all processes. In detail, it consists of:

  • Number of processes: The total number of processes. Each process has its own ID. The IDs range from 0 to n-1, where n is the total number of processes. The concept of IDs is exactly the same as in MPI.

  • ID of the own process: Every process needs to know its own process ID. Again, this is the same concept as in MPI.

  • Mapping from IDs to ATOLL host ports: This map is the heart of the whole Atollrun environment. With the knowledge of this information the programmer can utilize the low level routines of the PALMS library to communicate with the peer processes.

The script now sends the following strings to every process: First the total number of processes (in this example: 2):

	__NUMBER_OF_PROCESSES__2

The second part is the ID of the process (in this example: ID 0):

	__YOUR_ID__0

The third part is the mapping from IDs to ATOLL host ports. The script sends exactly n strings (where n is the number of processes). The first line is the ATOLL host port of ID 0, the second line the one of ID 1, and so on:

	__REMOTE_PORT__267365
	__REMOTE_PORT__267366

After the process has received this information, it sends the following acknowledgement string back to the master script:

	__PORT_VECTOR_ACCEPTED__

When every process has sent that acknowledgement and there was no error, the script continues with step 4; otherwise the script exits with an error.
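
The messages of this step can be generated mechanically from the collected port list. The sketch below shows one way to do so; the function name environment_messages is hypothetical, and the port list is assumed to be ordered by ID as described above.

```python
def environment_messages(process_id, ports):
    """Build the lines sent to one process in the distribution step.

    `ports` holds the ATOLL host ports ordered by ID, so ports[i]
    belongs to the process with ID i.
    """
    count = len(ports)
    if not 0 <= process_id < count:
        raise ValueError("process ID out of range")
    lines = [
        "__NUMBER_OF_PROCESSES__%d" % count,
        "__YOUR_ID__%d" % process_id,
    ]
    # One __REMOTE_PORT__ line per process, in ID order.
    lines.extend("__REMOTE_PORT__%d" % port for port in ports)
    return lines


for message in environment_messages(0, [267365, 267366]):
    print(message)
```

Only the __YOUR_ID__ line differs between the processes; all other lines are identical for every recipient.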

1.4. Establishing Connections and Communication Test

Now the processes have the chance to establish the connections and perform a small self test. Whether this is done depends on options the programmer can set or unset in the Atollrun library. In every case the script sends the following message (for the script it makes no difference what the processes do, only the answer is important):

	__INIT_CONNECTIONS__

There are 3 possible answers from the processes. The first one indicates a success. This means the processes built up the connections to each other and (maybe) performed a self test:

	__INIT_STATUS__SUCCESS

The second one indicates an error building up the connections or while performing the self test:

	__INIT_STATUS__FAILURE

The third one indicates that the processes did not set up any connections because the programmer wants to do it manually:

	__INIT_STATUS__IGNORED

Again, when every process has reported its status and there was no error, the script continues with step 5; otherwise the script exits with an error.
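
The decision the script takes at the end of this step can be sketched as below. The helper names parse_init_status and may_continue are hypothetical. The sketch treats IGNORED as a non-error, which matches the example transcripts in section 1.6 where initialization continues after an IGNORED status.

```python
STATUS_PREFIX = "__INIT_STATUS__"
KNOWN_STATUSES = ("SUCCESS", "FAILURE", "IGNORED")


def parse_init_status(line):
    """Extract SUCCESS, FAILURE or IGNORED from a status reply."""
    text = line.strip()
    if not text.startswith(STATUS_PREFIX):
        raise ValueError("unexpected reply: %r" % text)
    status = text[len(STATUS_PREFIX):]
    if status not in KNOWN_STATUSES:
        raise ValueError("unknown status: %r" % status)
    return status


def may_continue(replies):
    """True if no process reported FAILURE (IGNORED is not an error)."""
    statuses = [parse_init_status(reply) for reply in replies]
    return "FAILURE" not in statuses


print(may_continue(["__INIT_STATUS__IGNORED", "__INIT_STATUS__IGNORED"]))
print(may_continue(["__INIT_STATUS__SUCCESS", "__INIT_STATUS__FAILURE"]))
```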

1.5. Finish

The last step is very simple and only serves to synchronize the processes. The script sends the following message:

	__FINISH_INIT__

And the processes send the following acknowledgement message back:

	__FINISHED_INIT__

After this step the processes execute their programs and the script just observes the processes until they finish.
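
The synchronization in this step follows a simple pattern: send the message to every process first, then collect one acknowledgement from each. The sketch below illustrates it with an in-memory stand-in for a process; the class FakeProcess, its methods send_line and read_line, and the function synchronize_finish are all hypothetical and exist only for this illustration.

```python
class FakeProcess:
    """In-memory stand-in for one process (hypothetical, for illustration)."""

    def __init__(self, replies):
        self.replies = dict(replies)  # maps received message -> reply
        self.received = []

    def send_line(self, line):
        self.received.append(line)

    def read_line(self):
        # Answer the most recently received message.
        return self.replies[self.received[-1]]


def synchronize_finish(processes):
    """Send __FINISH_INIT__ to all processes, then wait for every acknowledgement."""
    for proc in processes:
        proc.send_line("__FINISH_INIT__")
    for proc_id, proc in enumerate(processes):
        reply = proc.read_line()
        if reply != "__FINISHED_INIT__":
            raise RuntimeError("ID %d sent unexpected reply %r" % (proc_id, reply))


peers = [FakeProcess({"__FINISH_INIT__": "__FINISHED_INIT__"}) for _ in range(2)]
synchronize_finish(peers)
print("all processes finished initialization")
```

Sending to all processes before reading any reply means that no process has to wait for a slower peer to be contacted.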

1.6. Initialization Example

The following 2 printouts show the complete initialization flow using the described protocol. There are 2 processes which are controlled by the master script:

Example 1.1. Initialization protocol of process w/ ID 0 (2 processes total)


__ATOLLRUN_INIT__
__YOUR_PORT__
__MY_PORT__1773329232
__NUMBER_OF_PROCESSES__2
__YOUR_ID__0
__REMOTE_PORT__1773329232
__REMOTE_PORT__1773329233
__PORT_VECTOR_ACCEPTED__
__INIT_CONNECTIONS__
__INIT_STATUS__IGNORED
__FINISH_INIT__
__FINISHED_INIT__
                                        

Example 1.2. Initialization protocol of process w/ ID 1 (2 processes total)


__ATOLLRUN_INIT__
__YOUR_PORT__
__MY_PORT__1773329233
__NUMBER_OF_PROCESSES__2
__YOUR_ID__1
__REMOTE_PORT__1773329232
__REMOTE_PORT__1773329233
__PORT_VECTOR_ACCEPTED__
__INIT_CONNECTIONS__
__INIT_STATUS__IGNORED
__FINISH_INIT__
__FINISHED_INIT__
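
The process side of these transcripts can be modeled as a small responder. The following sketch is a Python model for illustration only (the real process side is the C++ Atollrun library); the class ProcessSide and the function transcript are hypothetical names. It assumes, as in the two examples, that the process does not establish connections and therefore always answers IGNORED.

```python
class ProcessSide:
    """Model of one process answering the starter script (illustrative only)."""

    def __init__(self, my_port):
        self.my_port = my_port
        self.size = None
        self.rank = None
        self.ports = []

    def handle(self, line):
        """Return the reply to one master message, or None if none is due."""
        if line in ("__YOUR_PORT__", "__YOUR_PORT_SINGLE__"):
            return "__MY_PORT__%d" % self.my_port
        if line.startswith("__NUMBER_OF_PROCESSES__"):
            self.size = int(line[len("__NUMBER_OF_PROCESSES__"):])
            return None
        if line.startswith("__YOUR_ID__"):
            self.rank = int(line[len("__YOUR_ID__"):])
            return None
        if line.startswith("__REMOTE_PORT__"):
            self.ports.append(int(line[len("__REMOTE_PORT__"):]))
            # Acknowledge once the whole port vector has arrived.
            if len(self.ports) == self.size:
                return "__PORT_VECTOR_ACCEPTED__"
            return None
        if line == "__INIT_CONNECTIONS__":
            return "__INIT_STATUS__IGNORED"  # assumption: connections are left to the user
        if line == "__FINISH_INIT__":
            return "__FINISHED_INIT__"
        raise ValueError("unknown message: %r" % line)


def transcript(master_lines, my_port):
    """Interleave master messages and process replies like the printouts above."""
    proc = ProcessSide(my_port)
    out = ["__ATOLLRUN_INIT__"]
    for line in master_lines:
        out.append(line)
        reply = proc.handle(line)
        if reply is not None:
            out.append(reply)
    return out


master = [
    "__YOUR_PORT__",
    "__NUMBER_OF_PROCESSES__2",
    "__YOUR_ID__0",
    "__REMOTE_PORT__1773329232",
    "__REMOTE_PORT__1773329233",
    "__INIT_CONNECTIONS__",
    "__FINISH_INIT__",
]
print("\n".join(transcript(master, 1773329232)))
```

With the master messages shown, the printed lines correspond to the transcript of the process with ID 0.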
                                        

Chapter 2. The Starter Script 'atollrun.py'

The starter script 'atollrun.py' connects to the hosts, starts the processes and runs the protocol for exchanging the ATOLL ports. After the protocol has finished, it listens to the stdout of all processes and waits until all of them have terminated. The output of the process with ID 0 is forwarded to the stdout of 'atollrun.py'. This chapter is divided into the following parts:

  • The prerequisites: A description of all software and modules the script depends on.

  • The command line parameters: A description of the command line parameters.

  • A sample 'atollrun.py' session: An example showing an 'atollrun.py' session.

2.1. The Prerequisites

The script depends on the following software:

  • Python 2.2.x or 2.3.x (other versions not tested)

  • pexpect 0.98 or pexpect 0.99 (other versions not tested)

2.2. The command line parameters

The script has the following usage:

	atollrun.py [atollrun_options...] progname [parameters...]

The different parts of the command line have the following meanings:

  • progname: The progname parameter is the name of the executable that the script runs on the remote hosts. The path name may be relative or absolute.

  • [parameters...]: These parameters are passed to the program. They have to be different from the parameters used by 'atollrun.py'!

  • [atollrun_options...]: The 'atollrun.py' script accepts the following 6 options:

    • -p --processes=PROCESSES: The number of parallel processes to be started by the script. The default number of processes is 1.

    • -m --machinefile=MACHINEFILE: The file containing the names of the hosts to connect to. The format of the machinefile is one host per line. The default machine file is the file '~/.atollrun_machinefile'. The host w/ ID 0 is always the machine 'localhost'!

      Example 2.1. A sample machinefile

      
      pcatoll01
      pcatoll02
      pcatoll04
      pcatoll06
                                                                                      
      

    • -g --groupfile=GROUPFILE: The file containing the groups of machines. The concept of machine groups means that every machine in a group has the same password, so you have to enter it only at the first connection to any machine of that group. The default group file is the file '~/.atollrun_groupfile'. The format is shown in the following example (one group per line).

      Example 2.2. A sample groupfile

      
      # atollrun.py groupfile #
      #-----------------------#
      
      # LS Rechnerarchitektur ATOLL Cluster
      LSRA: pcatoll00, pcatoll01, pcatoll02, pcatoll03, pcatoll04, pcatoll05, pcatoll06
                                                                                      
      

    • -r --rsh: With this option set, the script uses rsh instead of ssh to connect to the remote hosts. This option is BETA!

    • -l --log: This option means that the script logs all output from any host to a file. The place for these files is the directory ~/.atollrun.log.

    • -h --help: This option prints a little usage information to stdout and quits.
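
The command line described above can be modeled with a short parser. This is a sketch in present-day Python using argparse, not the parsing code of 'atollrun.py' itself (which targets Python 2.2/2.3); the function name build_parser is hypothetical. The option names and defaults are taken from the list above, and argparse.REMAINDER is used here to collect whatever follows progname.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="atollrun.py")  # -h/--help is added by argparse
    parser.add_argument("-p", "--processes", type=int, default=1)
    parser.add_argument("-m", "--machinefile", default="~/.atollrun_machinefile")
    parser.add_argument("-g", "--groupfile", default="~/.atollrun_groupfile")
    parser.add_argument("-r", "--rsh", action="store_true")
    parser.add_argument("-l", "--log", action="store_true")
    parser.add_argument("progname")
    # Everything after progname is kept for the started program.
    parser.add_argument("parameters", nargs=argparse.REMAINDER)
    return parser


args = build_parser().parse_args(["--log", "-p", "2", "test/porttest"])
print(args.processes, args.log, args.rsh, args.progname, args.parameters)
```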

2.3. A sample 'atollrun.py' session

The following example shows a session of the test program porttest with 2 hosts:

Example 2.3. A sample 'atollrun.py' session


sattel@pcatoll01:~/ATOLL++> python scripts/atollrun.py --log -p 2 test/porttest
[atollrun.py] ==============================================================
[atollrun.py] | atollrun.py v0.7 --- start script for ATOLL++ applications |
[atollrun.py] |------------------------------------------------------------|
[atollrun.py] | (c) 2003 by Holger Sattel <hsattel@rumms.uni-mannheim.de>  |
[atollrun.py] | University of Mannheim, Computer Architecture Group        |
[atollrun.py] ==============================================================
[atollrun.py] >>> SCANNING COMMAND LINE...
[atollrun.py] processes      = 2
[atollrun.py] machinefile    = /home/sattel/.atoll.machinefile
[atollrun.py] groupfile      = /home/sattel/.atoll.groupfile
[atollrun.py] program        = /home/sattel/ATOLL++/test/porttest
[atollrun.py] program args   = 
[atollrun.py] stdout logging = ENABLED
[atollrun.py] >>> READING GROUP FILE...
[atollrun.py] found 7 hosts in 1 group(s)
[atollrun.py] >>> MAPPING ID<->MACHINE...
[atollrun.py] found 1 remote machines in machinefile
[atollrun.py] mapped ID 0 --> localhost
[atollrun.py] mapped ID 1 --> pcatoll01
[atollrun.py] >>> CONNECTING MACHINES...
[atollrun.py] connecting ID 0 ... [local     ] ... NO AUTHENTICATION
[atollrun.py] connecting ID 1 ... [password  ] ... PASSWORD PROTOCOL
[atollrun.py] enter password for pcatoll01: 
[atollrun.py] password ACCEPTED
[atollrun.py] >>> PERFORM INITIALIZATION...
[atollrun.py] requesting ports ........... 0 1 --- OK
[atollrun.py] propagating network info ... 0 1 --- OK
[atollrun.py] initializing connections ... 0 1 --- OK
[atollrun.py] checking results ........... 0 1 --- IGNORED
[atollrun.py] finishing init procedure ... 0 1 --- OK
[atollrun.py] >>> OBSERVING PROGRAMS...
#Processes      = 2
ATOLL++ Rank 0 --> ATOLL Port = 1773329232 <-- thats me :-)
ATOLL++ Rank 1 --> ATOLL Port = 1773329233
[atollrun.py] ID 0 has finished
[atollrun.py] ID 1 has finished
[atollrun.py] >>> CLEANUP CONNECTIONS...
sattel@pcatoll01:~/ATOLL++>
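
The "READING GROUP FILE" and "MAPPING ID<->MACHINE" stages of such a session can be sketched as follows. The function names are hypothetical and the code is an illustration based on the file formats shown in section 2.2, not the code of 'atollrun.py'. Two assumptions are made: lines starting with '#' are comments in the groupfile, and a request for more remote processes than there are machines is rejected (the document does not say how the real script handles that case).

```python
def parse_machinefile(text):
    """One host per line; blank lines are skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_groupfile(text):
    """Parse 'NAME: host, host, ...' lines into a dict of host lists."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: '#' starts a comment
            continue
        name, colon, hosts = line.partition(":")
        if not colon:
            raise ValueError("missing ':' in group line: %r" % line)
        groups[name.strip()] = [h.strip() for h in hosts.split(",") if h.strip()]
    return groups


def map_ids_to_machines(num_processes, machines):
    """ID 0 is always localhost; IDs 1..n-1 take machines in file order."""
    remote_needed = num_processes - 1
    if remote_needed > len(machines):
        raise ValueError("not enough machines in machinefile")  # assumption
    return ["localhost"] + machines[:remote_needed]


groupfile = """# atollrun.py groupfile #
LSRA: pcatoll00, pcatoll01, pcatoll02, pcatoll03, pcatoll04, pcatoll05, pcatoll06
"""
groups = parse_groupfile(groupfile)
print(sum(len(hosts) for hosts in groups.values()), "hosts in", len(groups), "group(s)")
print(map_ids_to_machines(2, parse_machinefile("pcatoll01\n")))
```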
                                        

Chapter 3. The Atollrun library

The Atollrun library consists of two classes. The first class is the important one, Atollrun. It has 9 methods, which will be described below. The other one is the exception class AtollrunException. After the description of the user methods a detailed example using the Atollrun library follows.

Here's a UML class diagram of the Atollrun library for a first look:


Figure 3.1. UML class diagram of the Atollrun library

3.1. Classes and Methods

This section describes only the methods of the class Atollrun, which controls the whole Atollrun environment. The other class, AtollrunException, is derived from the standard C++ exception class std::exception and defines no new methods; it only overrides the base class method what() to provide a detailed error message. The 9 methods of the class Atollrun are:

  • getReference(): The class is designed as a singleton class, so you need a reference to the single instance of the class in order to access the methods of it. This static method returns this reference.

  • Init(): The most important method of the class. It performs the initialization protocol with the 'atollrun.py' script. The next three set-methods can be used to set or unset options that affect Init(). After the call to this method has returned, the 4 get-methods are available.

  • setEstablishConnectionsOnInit(): When this option is turned on, the Init() method builds the ATOLL connections on initialization. If this option is off the user is responsible for this task. The default is On.

  • setPerformConnectionTestOnInit(): When this option is turned on, the Init() method performs a small communication test after building the connections. This option implies the previous option. The default is Off.

  • setAcquirePortOnSingle(): When this option is turned on, the Init() method opens an ATOLL port even when the process is alone, meaning there is no other process to communicate with.

  • getCommRank(): This method returns the Atollrun ID of the process. It is similar to the MPI function MPI_Comm_rank().

  • getCommSize(): This method returns the total number of processes started by 'atollrun.py'. The Atollrun IDs of the n processes are 0 to n-1. It is similar to the MPI function MPI_Comm_size().

  • getPort(): This method converts an Atollrun ID to the corresponding ATOLL host port number.

  • getHandle(): This method returns the ATOLL connection handle for a certain remote host (Atollrun ID). This method only works when the connections were established on Init().

For a more detailed description of the parameters and return values, take a look at the doxygen documentation of the classes.
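
How the two connection options relate to the status reported in step 4 of the protocol can be summarized in a small decision function. This is a Python model for illustration, not code from the C++ library; the function init_status and its parameters are hypothetical. It is derived from two statements in this document: the test option implies the establish option, and IGNORED means that no connections were set up.

```python
def init_status(establish, perform_test, connect_ok=True, test_ok=True):
    """Return the status string a process would report (illustrative model).

    establish / perform_test mirror the two set-methods; connect_ok and
    test_ok stand for the outcome of the respective actions.
    """
    # The connection test implies establishing the connections.
    establish = establish or perform_test
    if not establish:
        return "__INIT_STATUS__IGNORED"
    if not connect_ok:
        return "__INIT_STATUS__FAILURE"
    if perform_test and not test_ok:
        return "__INIT_STATUS__FAILURE"
    return "__INIT_STATUS__SUCCESS"


print(init_status(establish=False, perform_test=False))
print(init_status(establish=True, perform_test=False))
print(init_status(establish=False, perform_test=True, test_ok=False))
```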

3.2. Atollrun example program ('porttest')

The following example shows the source of the 'porttest' test program with some comments:

Example 3.1. The source code of 'porttest'


/*
  porttest.cpp - created 30.08.2003 by Holger Sattel <hsattel@rumms.uni-mannheim.de>
                 University of Mannheim, Department of Computer Engineering, Computer Architecture Group
*/

/* header includes */
#include <iostream>
#include "atollrun.h"

/* ============================================================================ */
/* | Routine: main()                                                          | */
/* ============================================================================ */
int main(int pArgc, char *pArgv[])
{
  try {
    // get reference to Atollrun singleton
    Atollrun &atollrun = Atollrun::getReference();
    // don't establish connections to peers
    atollrun.setEstablishConnectionsOnInit(false);
    // disable self-test
    atollrun.setPerformConnectionTestOnInit(false);
    // enable acquire port on single flag
    atollrun.setAcquirePortOnSingle(true);
    // initialize the network
    atollrun.Init(&pArgc, &pArgv);
    // printout informations about all peers
    std::cout << "#Processes      = " << atollrun.getCommSize() << std::endl; 
    // printout list of ports
    for(int lID = 0; lID < atollrun.getCommSize(); lID++) {
      // printout informations about current peer
      std::cout << "Atollrun ID " << lID << " --> ATOLL Port = " << atollrun.getPort(lID);
      // check if current ID is me
      if(lID == atollrun.getCommRank()) std::cout << " <-- thats me :-)";
      // newline
      std::cout << std::endl;
    }
    // catch exceptions
  } catch(AtollrunException &e) {
    std::cout << "Exception: " << e.what() << std::endl;        
  }
}
                                        

Chapter 4. The C Wrapper

The header file atollrun.h also contains some C functions which can be used in ordinary C programs to access the methods provided by the Atollrun class. Because the wrapper is nearly a 1:1 translation between the C++ methods and the C functions, this chapter contains only the names of the C functions and the previous example as a pure C version.

4.1. The C functions

  • Atollrun_Init(): matches the Init() method.

  • Atollrun_setEstablishConnectionsOnInit(): matches the setEstablishConnectionsOnInit() method.

  • Atollrun_setPerformConnectionTestOnInit(): matches the setPerformConnectionTestOnInit() method.

  • Atollrun_setAcquirePortOnSingle(): matches the setAcquirePortOnSingle() method.

  • Atollrun_getCommRank(): matches the getCommRank() method.

  • Atollrun_getCommSize(): matches the getCommSize() method.

  • Atollrun_getPort(): matches the getPort() method. There is an alias name for this function: ATOLLPORT()

  • Atollrun_getHandle(): matches the getHandle() method. There is an alias name for this function: ATOLLHANDLE()

For a more detailed description of the parameters and return values, take a look at the doxygen documentation of the functions.

4.2. Atollrun example program ('porttest') using the C wrapper functions

The following example shows the source of the 'porttest_c' test program using the C wrapper functions:

Example 4.1. The source code of 'porttest_c'


/*
  porttest_c.c - created 02.02.2004 by Holger Sattel <hsattel@rumms.uni-mannheim.de>
                 University of Mannheim, Department of Computer Engineering, Computer Architecture Group
*/

/* header include */
#include "atollrun.h"

/* printf */
#include <stdio.h>

/* ============================================================================ */
/* | Routine: main()                                                          | */
/* ============================================================================ */
int main(int pArgc, char *pArgv[])
{
  /* some local variables */
  int result, ID;

  /* don't establish connections to peers */
  Atollrun_setEstablishConnectionsOnInit(0);
  /* disable self-test */
  Atollrun_setPerformConnectionTestOnInit(0);
  /* enable acquire port on single flag */
  Atollrun_setAcquirePortOnSingle(1);

  /* initialize the network */
  result = Atollrun_Init(&pArgc, &pArgv);
  /* check the result and exit on error */
  if(result == ATOLLRUN_INIT_ERROR) return 1;

  /* printout informations about all peers */
  printf("#Processes      = %i\n", Atollrun_getCommSize());
  /* printout list of ports */
  for(ID = 0; ID < Atollrun_getCommSize(); ID++) {
    /* printout informations about current peer */
    printf("Atollrun ID %i --> ATOLL Port = %i", ID,  (int)ATOLLPORT(ID));
    /* check if current ID is me */
    if(ID == Atollrun_getCommRank()) printf(" <-- thats me :-)");
    /* newline */
    printf("\n");
  }

  /* return from function */
  return 0;
}
                                        
