Effect of the Scheduling Algorithms on the Finite Element Code

In this section the effect of the scheduling algorithms on a practical finite element code is examined. To give a more complete view, synchronous and asynchronous communication will also be examined. It is known that in general asynchronous communication is faster, and should be used whenever possible. However there are situations where synchronous communication is more appropriate, mainly due to the fact that it requires less memory. The impact of scheduling algorithms and that of the mode of communication (synchronous or asynchronous) will be compared.

The scheduling algorithms of Section 2 were implemented in an explicit unstructured finite element Euler solver code FELISA [12]. The code was run on a number of parallel platforms, including the Intel iPSC/860 and the Cray T3D.

Apart from the synchronous communication scheme discussed in the previous sections, there are alternative ways of organising the communication. The following is an asynchronous communication scheme, which uses the asynchronous communication subroutines isend/irecv available on the Intel.

The advantage of this asynchronous communication scheme, compared with the synchronous scheme detailed in the last section, is that there is no need to organise the communication in a pair-wise fashion. Furthermore no synchronisation is needed, and a processor can carry on its computation as soon as it has received the messages it needs, without waiting for other processors to finish. The disadvantage is that the receiving/sending buffer space can not be reused. For example the receiving buffer space needed will be the sum of the message lengths from all the neighbouring processors. For applications where a lot of data has to be sent to many processors, for instance when a high order scheme is used, or when the partition of the unstructured grid is not of high quality, this could add up to a significant amount of memory, thus restricting the maximum size of the application one is able to solve.

The asynchronous algorithm, and the synchronous algorithm combined with the two edge-colouring algorithms COLOUR and REDUCE10, were compared as communication schemes for the finite element code. The particular problem solved is the flow field around an aircraft; the mesh has 353710 tetrahedra. The mesh was partitioned using the spectral bisection algorithm into subdomains with equal numbers of tetrahedra. The finite element code took 1173 iterations to converge.

It was found that although each subdomain has an equal number of tetrahedra, the time spent on computing varied between processors. This is because, for the particular finite element code used, the amount of computation on each processor is not only related to the number of tetrahedra on the subdomain, but also to other factors such as the number of boundary points. The latter are difficult to control statically. If the communication time $t_{\rm comm}$ is defined as the total elapsed time $t_{\rm total}$ for the code minus the computation time, then the communication time also varies. This variation is due to the time spent in waiting. For the synchronous communication schemes, each processor has to wait for all the processors to finish computing before communication can begin, and it also has to wait for all processors to finish communication before further computation can start. For the asynchronous schemes a processor needs to wait for its neighbouring processors to finish computation before it is able to receive messages from them, but can carry on its computation as soon as it has received the messages it needs, even if other processors are still communicating or waiting. This ability of overlapping computation with communication reduced the total execution time of running the program.

**Table 8:** The total elapsed time $t_{\rm total}$ and and communication (including waiting) time $t_{\rm comm}$ for the finite element code to converge on a mesh around an aircraft. Three communication schemes are compared. Here and 64 is the number of processors.
	comm. scheme	$t_{\rm total}$	$t_{\rm comm}$ ( average/min/max)
32	COLOUR	7922.70	3528.76/ 2046.63/ 4052.41
32	REDUCE10	7255.69	2854.82/ 1376.33/ 3406.18
32	asynchronous	6239.69	1775.78/ 289.56/ 2334.00
64	COLOUR	5560.26	3437.08/ 2624.29/ 3837.24
64	REDUCE10	4843.32	2721.01/ 1919.71/ 3112.97
64	asynchronous	3373.56	1169.84/ 357.39/ 1574.72

Because communication time (including waiting time) varied between processors, to give an indication of the proportion of the communication (and waiting) time to that of the total elapsed time, in Table 8 $t_{\rm total}$ and the averaged/minimum/maximum $t_{\rm comm}$ are reported. For the synchronous communication, the optimised edge-colouring algorithm REDUCE10 was able to reduce the total time of the finite element code by 8-13%. When asynchronous communication was used, a further reduction of 14-30% was found, at the expense of higher memory usage. This confirms that asynchronous communication is the most efficient when applicable. On the other hand if synchronous communication is to be used, the application of the minimum cost colouring algorithm REDUCE10 results in significant savings.