In this section the effect of the scheduling algorithms on a practical finite element code is examined. To give a more complete view, synchronous and asynchronous communication will also be examined. It is known that in general asynchronous communication is faster, and should be used whenever possible. However there are situations where synchronous communication is more appropriate, mainly due to the fact that it requires less memory. The impact of scheduling algorithms and that of the mode of communication (synchronous or asynchronous) will be compared.
The scheduling algorithms of Section 2 were implemented in an explicit unstructured finite element Euler solver code FELISA [12]. The code was run on a number of parallel platforms, including the Intel iPSC/860 and the Cray T3D.
Apart from the synchronous communication scheme discussed in the previous sections, there are alternative ways of organising the communication. The following is an asynchronous communication scheme, which uses the asynchronous communication subroutines isend/irecv available on the Intel.
The advantage of this asynchronous communication scheme, compared with the synchronous scheme detailed in the last section, is that there is no need to organise the communication in a pair-wise fashion. Furthermore no synchronisation is needed, and a processor can carry on its computation as soon as it has received the messages it needs, without waiting for other processors to finish. The disadvantage is that the receiving/sending buffer space can not be reused. For example the receiving buffer space needed will be the sum of the message lengths from all the neighbouring processors. For applications where a lot of data has to be sent to many processors, for instance when a high order scheme is used, or when the partition of the unstructured grid is not of high quality, this could add up to a significant amount of memory, thus restricting the maximum size of the application one is able to solve.
The asynchronous algorithm, and the synchronous algorithm combined with the two edge-colouring algorithms COLOUR and REDUCE10, were compared as communication schemes for the finite element code. The particular problem solved is the flow field around an aircraft; the mesh has 353710 tetrahedra. The mesh was partitioned using the spectral bisection algorithm into subdomains with equal numbers of tetrahedra. The finite element code took 1173 iterations to converge.
It was found that although each subdomain has an equal number of tetrahedra, the time
spent on computing varied between processors.
This is because,
for the particular finite element code used,
the amount of computation
on each processor is not only related to the number of tetrahedra
on the subdomain, but also to other factors such as the number of boundary points.
The latter
are difficult to control statically. If the communication time
is defined as the total elapsed time
for
the code minus the computation time, then the communication time also varies. This variation is
due to the time spent in waiting. For the
synchronous communication schemes, each processor has to wait for all the processors
to finish computing before communication can begin, and it also has to wait for all processors
to finish communication before further computation can start.
For the asynchronous schemes a processor needs to wait for its neighbouring processors
to finish computation before it is able to receive messages from them, but
can carry on its computation
as soon as it has received the messages it needs, even if other processors are still communicating
or waiting. This ability of overlapping computation
with communication reduced the total execution time of running the program.
| comm. scheme | |||
| 32 | COLOUR | 7922.70 | 3528.76/ 2046.63/ 4052.41 |
| 32 | REDUCE10 | 7255.69 | 2854.82/ 1376.33/ 3406.18 |
| 32 | asynchronous | 6239.69 | 1775.78/ 289.56/ 2334.00 |
| 64 | COLOUR | 5560.26 | 3437.08/ 2624.29/ 3837.24 |
| 64 | REDUCE10 | 4843.32 | 2721.01/ 1919.71/ 3112.97 |
| 64 | asynchronous | 3373.56 | 1169.84/ 357.39/ 1574.72 |
Because communication time (including waiting time)
varied between processors, to
give an indication of the proportion of the communication
(and waiting) time to that
of the total elapsed time, in Table 8
and the
averaged/minimum/maximum
are reported. For the synchronous communication,
the optimised edge-colouring algorithm REDUCE10 was able to reduce the total time of
the finite element code by 8-13%. When asynchronous communication was used,
a further reduction of 14-30% was found, at the expense of higher memory
usage. This confirms that
asynchronous communication is the most efficient when applicable.
On the other hand if
synchronous communication is to be used, the application of the
minimum cost colouring algorithm REDUCE10 results in significant savings.