next up previous
Next: Effect of the Scheduling Up: Algorithms for Scheduling with Previous: Graph Edge-Colouring and Minimum

Simulation on a Parallel Computer

In this section the scheduling schemes given by the edge-colouring algorithms are applied in organising the communication for an irregular mesh based calculation. Simulations were done on 16 nodes of a Intel iPSC/860 parallel computer using the scheduling schemes to evaluate the practical effect of the edge-colouring algorithms.

On the Intel, there are two ways of sending and receiving messages. When the blocked send/receive subroutines csend/crecv are called, the program does not leave the calls until the sending/receiving action is completed. When the unblocked send/receive subroutines isend/irecv are called, the program immediately leaves with a message id, and carries on executing other parts of the code. A msgwait subroutine with the appropriate message id can be called at a later stage to make sure that the messages have been actually sent/received.

On an Intel iPSC/860, the communication time needed to csend/crecv a message is a linear function of the length. More specifically [6],


\begin{displaymath}
t=\left\{
\begin{array}{ll}
73+0.42*n,&\ \ \ {\rm if \ } n\...
...\\
202+0.36*n,&\ \ \ {\rm if \ } n> 100,
\end{array} \right.
\end{displaymath} (1)

where $t$ is the communication time in microseconds, $n$ is the number of bytes in the message. The discontinuity of the communication time is due to the effect of buffering for messages longer than 100 bytes. In the above model the effect of the distance between processors and possible link contentions are not included.

Given a scheduling scheme such as that of Table 5, generated by an edge-colouring algorithm, the pseudo code for the synchronous pair-wise exchange type communication scheme on any processor $me$ is


do $i=1,num\_stages$
csend/crecv message of length $msglen(me,i)$
to/from processor $adjproc(me,i)$
synchronisation
end do

Thus the communication is done in a number of stages, at each stage a processor pair exchanges messages, and this is followed by a synchronisation.

The cost given by a colouring scheme, such as those presented in Tables 4 and 5, are just predictions of the actual cost that the scheduling scheme might give. In order to see if these predictions have any significance in practice, simulations were carried out on the Intel i860 hypercube. Two meshes were considered and five partitioning algorithms [8] were used to partition each mesh into 16 subdomains. For each partitioning, three message passing schemes were generated using the three scheduling algorithms COLOUR, REDUCE and REDUCE10. The five partitioning algorithms used were: the recursive coordinate bisection (RCB), the recursive graph bisection (RGB), the recursive spectral bisection (RSB) (see, e.g. [13] for these three algorithms), the KL algorithm [11] and a hybrid algorithm of KL and RGB, called MINGRAPH [8].

For each shared face between two subdomains, it was assumed that 10 double precision numbers need to be exchanged. Thus for example, in Table 4, node 7 exchanges 30 double precision numbers with node 4 at stage 1, and 40 double precision numbers with node 13 at stage 2, etc. After each stage, a global synchronisation step was executed by calling the Intel i860 FORTRAN subroutine gsync(), so as to ensure that each node had finished sending and receiving data, before proceeding to the next stage. The message passing was repeated 1000 times, and the communication time was taken to be the elapsed time between the start and the finish of the message passing. In Table 7, the number of stages, the communication costs (predicted) and the actual communication time (in milliseconds, or ms for short) recorded in the simulation are reported. For the partitioning of the 788 mesh given by the recursive coordinate bisection (RCB), the algorithm COLOUR gave a message passing scheme with 7 stages and a predicted communication cost of 47 units. The actual communication time was about 8.4 seconds. This was reduced to about 7.0 seconds by using the algorithm REDUCE10. In general, Table 7 illustrates that the predicted costs reflect the trend of actual communication time very well, and that the predicted reductions of communication costs by the algorithms REDUCE and REDUCE10 were actually delivered in the measured communication time.


Table 7: Predicted and simulated communication costs${}^*$
Par. methods Mesh COLOUR REDUCE REDUCE10
RCB 788 7/47/8423 6/35/6958 6/31/6992
RCB 5520 7/842/45582 7/666/40006 7/521/35662
RGB 788 6/47/7580 5/34/6103 5/32/5975
RGB 5520 7/598/33528 7/511/30266 7/427/28418
RSB 788 5/36/6286 5/26/6052 5/25/5634
RSB 5520 5/385/22200 4/216/14778 4/216/15010
KL 788 10/52/10935 9/36/9424 9/36/9366
KL 5520 6/326/20345 6/263/19170 6/256/18471
MINGRAPH 788 6/40/7230 6/29/6838 6/28/6505
MINGRAPH 5520 4/209/14136 3/189/13582 3/189/13020

${}^*$ Simulation: message passing 1000 times using the schemes given by the three scheduling algorithms on the partitions generated with the five partitioning methods. For each shared face between two subdomains 10 double precision numbers are exchanged. The results are reported in the form

number of stages/communication costs (units)/communication time (ms)

It is also interesting to compare the communication time given by various partitioning algorithms. The recursive coordinate bisection algorithm (RCB) on the 5520 mesh needed at least 35 seconds, almost 3 times greater than the best communication time given by the MINGRAPH algorithm. The partitioning algorithm KL performed badly on the 788 mesh, mainly because the algorithm generated disconnected subdomains which bordered a greater number of neighbouring subdomains. This increased the number of stages for the communication, therefore the start-up and synchronisation costs.

It is interesting to relate the communication time in Table 7 to formula (1). The synchronisation costs for one call of the synchronisation routine gsync() on a 16 node configuration of the Intel was found to be about 530 microseconds. Assuming no link contentions, the communication time (ms) for executing $k$ stages of synchronous pair-wise exchange 1000 times was

\begin{displaymath}t=202*k+0.36*\sum_{i=1}^k n_i+530*k,\end{displaymath} (2)

where $n_i$ is the maximum number of bytes of message transferred in stage $i$. For example, in Table 4, the maximum number of shared edges at each stage is 6, 8, 6, 6 and 10 respectively. Thus if 10 double precision numbers are transferred for each shared edge, then the message length will be 480, 640, 480, 480 and 800 bytes respectively, so $t=5226\ $ms. For the communication task given by Table 5, we calculated that $t=4939\ $ms. These figures are proportional but less than the actual communication time (6286 ms and 6052 ms) found in Table 7. The under-prediction is probably due to the link contentions. In both cases, because the message lengths are short, the start-up and synchronisation accounted for the majority of the time.

For large meshes, the start-up and synchronisation costs were less dominant. For example on the partition of the mesh-5520 using RSB, the message passing algorithm COLOUR gave a scheme of 5 stages, with a sum of largest message lengths of 385. This gives a calculated $t=14748\ $ms. The message passing algorithm REDUCE10 gave a scheduling scheme of 4 stages with a sum of largest message lengths of 216, and a calculated $t=9148\ $ms. Again the two calculated timings are proportional to, but under-estimate the actual communication time (which are 22200 ms and 15010 ms respectively, from Table 7). Between $25{\rm -}32\%$ of the time was spent in the start-up and synchronisation.


next up previous
Next: Effect of the Scheduling Up: Algorithms for Scheduling with Previous: Graph Edge-Colouring and Minimum

2000-03-22
Hosted by www.Geocities.ws

1