Simulation on a Parallel Computer

In this section the scheduling schemes given by the edge-colouring algorithms are applied in organising the communication for an irregular mesh based calculation. Simulations were done on 16 nodes of a Intel iPSC/860 parallel computer using the scheduling schemes to evaluate the practical effect of the edge-colouring algorithms.

On the Intel, there are two ways of sending and receiving messages. When the blocked send/receive subroutines csend/crecv are called, the program does not leave the calls until the sending/receiving action is completed. When the unblocked send/receive subroutines isend/irecv are called, the program immediately leaves with a message id, and carries on executing other parts of the code. A msgwait subroutine with the appropriate message id can be called at a later stage to make sure that the messages have been actually sent/received.

On an Intel iPSC/860, the communication time needed to csend/crecv a message is a linear function of the length. More specifically [6],

where

is the communication time in microseconds,

is the number of bytes in the message. The discontinuity of the communication time is due to the effect of buffering for messages longer than 100 bytes. In the above model the effect of the distance between processors and possible link contentions are not included.

Given a scheduling scheme such as that of Table 5, generated by an edge-colouring algorithm, the pseudo code for the synchronous pair-wise exchange type communication scheme on any processor

The cost given by a colouring scheme, such as those presented in Tables 4 and 5, are just predictions of the actual cost that the scheduling scheme might give. In order to see if these predictions have any significance in practice, simulations were carried out on the Intel i860 hypercube. Two meshes were considered and five partitioning algorithms [8] were used to partition each mesh into 16 subdomains. For each partitioning, three message passing schemes were generated using the three scheduling algorithms COLOUR, REDUCE and REDUCE10. The five partitioning algorithms used were: the recursive coordinate bisection (RCB), the recursive graph bisection (RGB), the recursive spectral bisection (RSB) (see, e.g. [13] for these three algorithms), the KL algorithm [11] and a hybrid algorithm of KL and RGB, called MINGRAPH [8].

For each shared face between two subdomains, it was assumed that 10 double precision numbers need to be exchanged. Thus for example, in Table 4, node 7 exchanges 30 double precision numbers with node 4 at stage 1, and 40 double precision numbers with node 13 at stage 2, etc. After each stage, a global synchronisation step was executed by calling the Intel i860 FORTRAN subroutine gsync(), so as to ensure that each node had finished sending and receiving data, before proceeding to the next stage. The message passing was repeated 1000 times, and the communication time was taken to be the elapsed time between the start and the finish of the message passing. In Table 7, the number of stages, the communication costs (predicted) and the actual communication time (in milliseconds, or ms for short) recorded in the simulation are reported. For the partitioning of the 788 mesh given by the recursive coordinate bisection (RCB), the algorithm COLOUR gave a message passing scheme with 7 stages and a predicted communication cost of 47 units. The actual communication time was about 8.4 seconds. This was reduced to about 7.0 seconds by using the algorithm REDUCE10. In general, Table 7 illustrates that the predicted costs reflect the trend of actual communication time very well, and that the predicted reductions of communication costs by the algorithms REDUCE and REDUCE10 were actually delivered in the measured communication time.

**Table 7:** Predicted and simulated communication costs ${}^*$
Par. methods	Mesh	COLOUR	REDUCE	REDUCE10
RCB	788	7/47/8423	6/35/6958	6/31/6992
RCB	5520	7/842/45582	7/666/40006	7/521/35662
RGB	788	6/47/7580	5/34/6103	5/32/5975
RGB	5520	7/598/33528	7/511/30266	7/427/28418
RSB	788	5/36/6286	5/26/6052	5/25/5634
RSB	5520	5/385/22200	4/216/14778	4/216/15010
KL	788	10/52/10935	9/36/9424	9/36/9366
KL	5520	6/326/20345	6/263/19170	6/256/18471
MINGRAPH	788	6/40/7230	6/29/6838	6/28/6505
MINGRAPH	5520	4/209/14136	3/189/13582	3/189/13020

${}^*$ Simulation: message passing 1000 times using the schemes given by the three scheduling algorithms on the partitions generated with the five partitioning methods. For each shared face between two subdomains 10 double precision numbers are exchanged. The results are reported in the form

number of stages/communication costs (units)/communication time (ms)

It is also interesting to compare the communication time given by various partitioning algorithms. The recursive coordinate bisection algorithm (RCB) on the 5520 mesh needed at least 35 seconds, almost 3 times greater than the best communication time given by the MINGRAPH algorithm. The partitioning algorithm KL performed badly on the 788 mesh, mainly because the algorithm generated disconnected subdomains which bordered a greater number of neighbouring subdomains. This increased the number of stages for the communication, therefore the start-up and synchronisation costs.

It is interesting to relate the communication time in Table 7 to formula (1). The synchronisation costs for one call of the synchronisation routine gsync() on a 16 node configuration of the Intel was found to be about 530 microseconds. Assuming no link contentions, the communication time (ms) for executing

stages of synchronous pair-wise exchange 1000 times was

For large meshes, the start-up and synchronisation costs were less dominant. For example on the partition of the mesh-5520 using RSB, the message passing algorithm COLOUR gave a scheme of 5 stages, with a sum of largest message lengths of 385. This gives a calculated $t=14748\$ ms. The message passing algorithm REDUCE10 gave a scheduling scheme of 4 stages with a sum of largest message lengths of 216, and a calculated $t=9148\$ ms. Again the two calculated timings are proportional to, but under-estimate the actual communication time (which are 22200 ms and 15010 ms respectively, from Table 7). Between $25{\rm -}32\%$ of the time was spent in the start-up and synchronisation.