Different Types of Calling Sequences a Case Study

                                                                                                          By Maher Alshwaiki

 

 

 

Pascal versus C function calls

 

From Merriam-Webster dictionary, a program is "a sequence of coded instructions that can be inserted into a mechanism (as a computer)"

We, computer programmers, see the word 'program' with less abstract definition. For us there are two distinct meaning for the word. The first, and more generic, form is the binary code that runs on a computer and performs a certain action. The other form is a sequence of characters and tokens organized in lines and columns with a specific meaning.

Tokens are mostly English words whereas characters can be anything of variables, identifiers or compiler commands. We call the collection of tokens and characters that comply with certain rules as a programming language. Since not all programmers can read this language, we add comments and techniques to organize the final text.

One of the techniques used is to group together instructions that are logically related and separate them with an empty line from the others. Whenever we reuse the same group of instructions multiple times we tend to move them aside from the main stream of the code and with a simple label we invoke those commands. We call this practice invoking a macro. The pre-processor of a language will then take the group of instructions and put them in place of the label. The disadvantage of this method is that the code will be repeated as many as the occurrence of the label. Because of memory constraints, another method was developed in which the program execution would jump to the group of instructions and execute the commands, then once reached the end of the instruction group, it returns to the previous execution line and continue from that point on. This method has different names depending on the functionality and language used. In general, and for our use in this text, they will be called functions.

Functions became an integral part of every program. Although there were additional instructions added for passing parameters to the called function and retrieving the values, the advantages gained from function calls surpassed the time spent on those additional instructions. Programming languages had different mythologies on how to pass the parameters and how to retrieve the resulting values. They varied in speed and features. Some languages such as C are capable of using different methods to accomplish the same functionality. This is usually referred to as Calling Convention. In this document I will present and discuss the differences between those different calling conventions. Later, I will write small test programs to measure the time differences between each method and present my conclusion.

A calling convention consists of two parts: The method used to call a function and the method used by the function to return the execution to the caller. Three different methods have been invented to call a function, whereas, two methods were created to return the execution to the caller. The combination of the two parts has resulted in the creation of four calling convention known as C Declaration, Pascal Convention, Standard Call and Fast Call. C and Pascal are the representatives of the main two calling conventions C Declaration and Pascal Convention respectively.

 

1.1  C Declaration

In this method parameters are pushed into the stack starting with the last parameter on the right back to the first parameter on the left of the function declaration. This way, the called function will always know how to set the stack pointer to point to the first parameter passed to it. Remember that the stack pointer is pointing to the address that the control will return to after the function terminates.

Because of the way the parameters are passed, it is possible to create functions with variable amount of parameters that can be passed to it. A typical example of this category is printf function in C language. The function does not know how many parameters are passed to it and only through the context obtained from the string it can determine how many are actually passed.

The other part that characterises this method of calling is the in the steps performed after returning from the called function. The called function does not always know the size of the stack it always returns the stack to the position where it was just after it was called. Thus freeing the memory space, so-called cleaning the stack, is done by the caller.

 

1.2  Pascal Convention

This convention is exactly the opposite of the previous method (C Declaration). Parameters are passed to the function from left to right, that is, the first parameter is pushed to the stack first. This prevents us from creating functions with variable number of parameters because the called function would never know where to find the first parameter. Adding additional parameter at the end of the list that would contain the number of parameters passed adds unnecessary overhead.

Because the called function knows how many parameters are passed to it, it can clean the stack before it returns to control back to the caller. This is useful to do only at processors that have an instruction that combines both the return instruction and the stack cleaning instruction together. Otherwise, moving values around in the stack becomes a necessity, which renders the effectiveness of this method.

 

1.3  Standard Call

Mainly used by Windows 95, Windows NT and its successors. This takes the first option from C Declaration and second option from the Pascal Convention. The idea is to be able to provide both the ability to create variable length parameters when calling a function and the advantage provided by 80x86 processors family.

 

1.4  Fast Call

This is the newest developed method and is used by both Watcom C++ and Borland C++ Builder. The idea is that there are free CPU registers that are unused when a function is called. Access to the registers is many times faster than external memory and could save CPU cycles, especially on relatively small size functions that have very few variables. On Intel platform, data are passed in registers ECX and EDX. Return value in turn is passed back in ??

 

Method of testing:

Some of the tests can be mathematically calculated and using formal methods a proof could be presented to which method would have better results. In this research paper I am concerned with the engineering view to the topic, thus actual tests will run, obtained values will be entered into a table, and a graph will be created from the resulting table. Depending on the results achieved I will present my conclusion on which method is has the best speed.

 

The framework for the each test sequence will look similar to the following:

1.       Get the current system time

2.       Call the required function

3.       Apon return from the function get the current system time

4.       Calculate the difference between the beginning time and ending time

5.       Display the time on the screen.

 

There are three main test sections. A test section consists of multiple test sequences, which consists of one common testing method.

The first testing method is a simple call to each of the calling convention functions. The called method reads the values from the stack to the registers then simply returns control back to the calling program. The second testing method is a recursive call where the read data are again passed in a new call to the same function. The reason for doing this is that the first method created very small timing numbers and the error introduced was high. That’s why I thought of the second method. The function would call itself as many times as specified in one of the passed parameter. Since this function still doesn’t do anything other than repeated calling and thus no real data are being manipulated. I thought of creating the last sequence, which is based on using the addition operator to calculate the multiplication of numbers. Again I used recursive calling to create significant delay to make it possible for me to measure it.

Generally, I am interested in testing 1, 2, 3, and 5 calling parameters to balance between usefulness of the test and the amount of work I will need to do. In this respect, I will not perform the _fastcall test for the five parameter function call because to my knowledge only three registers are usually used to pass the information to the function above that number the parameters are pushed into the stack.

The first set of test I will create will use 16 bit data and will run on Intel processor running in 8086 real mode. Because there might be a difference in timing between the Intel processor 8086 16-bit mode and the Intel processor 80386 running in 32-bit real mode, both tests will be performed. I decided to use real mode for the 32-bit 80386 because of its simplicity and had no reasonable doubt that switching to protected mode could alter the processor instruction timings.

 

In the following section I will describe the tests I have written and included fragments of the corresponding code. Please note, that because there is no significant change in the code written for 16-bit and 32-bit processor mode. I will not discuss the 32-bit code here and will direct you to view the source code of the program.

To make the test comprehensive I created additional functions that target a specific area in the sequence.

The first of such functions is the right-to-left and left-to-right calling sequence.

Since we are looking into the arrangements of the arguments it doesn’t matter what return sequence we use. In both cases we have to use the same return sequence, thus using the C style return sequence offers the ability to use the _cdecl function. This is useful because it implies that we can use the test results obtained from _cdecl test when comparing it with the results from the left-to-right function.

 

Recursive call for data passed from left to right:

             mov     ax, 0

             mov     bx, 0

             push     ax

             push     bx

             call       lr_recurs_call                ; lr_recurs_call(ax, bx)

             add       sp, 0004

 

lr_recurs_call:

             push     bp

             mov     bp, sp

             mov     ax, [bp+6]

             inc        ax

             mov     bx, [bp+4]

             cmp      ax, 0FFFFh

             je          lr_recurs_end

             push     ax

             push     bx

             call       lr_recurs_call

             add       sp, 0004

lr_recurs_end:

             pop       bp

             ret

 

Recursive call for registry passing: 

             mov     ax, 0

             mov     bx, 0

             mov     dx, bx

             mov     cx, ax

             call       reg_recurs_call                          ; reg_recurs_call(ax, bx)

 

reg_recurs_call:

             mov     ax, cx

             inc        ax

             mov     bx, dx

             cmp      ax, 0FFFFh

             je          reg_recurs_end

             mov     dx, bx

             mov     cx, ax

             call       reg_recurs_call

reg_recurs_end:

             ret

 

 

The interesting point will be here in the _fastcall. Since we only pass the last two parameters (the right most parameters) through registers the rest of the parameters will be passed using another method. We will assume the C style of passing parameters through the stack. Here is an example for passing three parameters:

 

             mov     ax, 0

             mov     bx, 0

             mov     cx, 0

             push     cx

             mov     dx, bx

             mov     cx, ax

             call       reg_rl_recurs_call                     ; reg_rl_recurs_call(ax, bx, cx)

 

reg_rl_recurs_call:

             push     bp

             mov     bp, sp

             mov     ax, cx

             inc        ax

             mov     bx, dx

             mov     cx, [bp+4]

             cmp      ax, 0FFFFh

             je          reg_rl_recurs_end

             push     cx

             mov     dx, bx

             mov     cx, ax

             call       reg_rl_recurs_call

reg_rl_recurs_end:

             pop       bp

             ret         0002

 

Recursive call for registry passing and right to left

 

Now we move to the comparison of the standard functions. The listing of the four standard functions _cdecl, _pascal, _stdcall and _fastcall for two passed arguments is below.

 

Recursive call for _cdecl:

             mov     ax, 0

             mov     bx, 0

             push     bx

             push     ax

             call       c_recurs_call                ; c_recurs_call(ax, bx)

             add       sp, 0004

 

c_recurs_call:

             push     bp

             mov     bp, sp

             mov     ax, [bp+4]

             inc        ax

             mov     bx, [bp+6]

             cmp      ax, 0FFFFh

             je          c_recurs_end

             push     bx

             push     ax

             call       c_recurs_call

             add       sp, 0004

c_recurs_end:

             pop       bp

             ret

 

 Recursive call for _pascal:

             mov     ax, 0

             mov     bx, 0

             push     ax

             push     bx

             call       pas_recurs_call                          ; pas_recurs_call(ax, bx)

 

pas_recurs_call:

             push     bp

             mov     bp, sp

             mov     ax, [bp+6]

             inc        ax

             mov     bx, [bp+4]

             cmp      ax, 0FFFFh

             je          pas_recurs_end

             push     ax

             push     bx

             call       pas_recurs_call

pas_recurs_end:

             pop       bp

             ret         0004

 

 Recursive call for _stdcall:

             mov     ax, 0

             mov     bx, 0

             push     ax

             push     bx

             call       std_recurs_call                          ; std_recurs_call(ax, bx)

             add       sp, 0004

 

std_recurs_call:

             push     bp

             mov     bp, sp

             mov     ax, [bp+6]

             inc        ax

             mov     bx, [bp+4]

             cmp      ax, 0FFFFh

             je          std_recurs_end

             push     ax

             push     bx

             call       std_recurs_call

             add       sp, 0004

std_recurs_end:

             pop       bp

             ret

 

 

 Recursive call for _fastcall:

             mov     ax, 0

             mov     bx, 0

             mov     dx, bx

             mov     cx, ax

             call       fc_recurs_call               ; fc_recurs_call(ax, bx)

 

fc_recurs_call:

             mov     ax, cx

             inc        ax

             mov     bx, dx

             cmp      ax, 0FFFFh

             je          fc_recurs_end

             mov     dx, bx

             mov     cx, ax

             call       fc_recurs_call

fc_recurs_end:

             ret

 

The interesting point will be here in the _fastcall. Since we only pass the last two parameters through registers the rest of the parameters will be passed using another method. we will assume the C style of passing parameters to the stack. Here is an example for passing three parameters:

 

             mov     ax, 0

             mov     bx, 0

             mov     cx, 0

             push     cx

             mov     dx, bx

             mov     cx, ax

             call       fc_sp_recurs_call                      ; fc_sp_recurs_call(ax, bx, cx)

 

fc_sp_recurs_call:

             push     bp

             mov     bp, sp

             mov     ax, cx

             inc        ax

             mov     bx, dx

             mov     cx, [bp+4]

             cmp      ax, 0FFFFh

             je          fc_sp_recurs_end

             push     cx

             mov     dx, bx

             mov     cx, ax

             call       fc_sp_recurs_call

fc_sp_recurs_end:

             pop       bp

             ret         0002

 

 

In the third set of test functions, I will focus on return method. I will use one method of passing parameters to all called function. I decided to use right-to-left always for no particular reason. Again in this case, you can notice that I can use the test results from _cdecl because it will be identical with the “Caller cleans the stack” test. The following is a listing only shows the second test “Called cleans the stack”.

 

Called cleans the stack:

 

             mov     ax, 0

             mov     bx, 0

             push     bx

             push     ax

             call       pas_ret_recurs_call                   ; pas_ret_recurs_call(ax, bx)

 

pas_ret_recurs_call:

             push     bp

             mov     bp, sp

             mov     ax, [bp+4]

             inc        ax

             mov     bx, [bp+6]

             cmp      ax, 0FFFFh

             je          pas_ret_recurs_end

             push     bx

             push     ax

             call       pas_ret_recurs_call

pas_ret_recurs_end:

             pop       bp

             ret         0004

 

To make the functions more realistic, I will be applying a uniform operation inside the functions with the same number of calling parameters. Since I will be using recursive calling in some of the tests, the operation selected must have a recursive way of implementation. I selected the multiplication function since it is very simple. Let us look at the example when you use two or three parameters. Let’s start with two parameters: the implemented mathematical operation then is c = a ´ b where a and b are the function parameters and c is the result returned by the function. To implement it in a recursion b will be again passed to the function itself but decremented by 1. We will repeat that until b is equal to 1, in this case we will return the value of a to the caller. The caller  function in the queue will add a to the returned value before returning it again. This sequence will be repeated until the queue is empty. In the end a will have been added b times. The C language code will look similar to the following:

 

int multiplyUsingAdd_2Arg(unsigned int a, unsigned int b)

{

    if (b > 1) {

        return a + multiplyUsingAdd_2Arg(a, b-1);

   } else {

        return a;

   }

}

 

Similarly we will do the same with a function of three parameters. The analogous operation is d = a ´ b ´ c. In this case the original function ‘add’ will have to be modified to reflect the need to add the result of a ´ b, c times. Note that though this is a classical associative property, I will not split the function into two multiplyUsingAdd_2Arg() functions. My aim is to use the three parameter calling convention thus in the fragment below, I recursively called again the same function but set the c parameter to 1.

 

int multiplyUsingAdd_3Arg(unsigned int a, unsigned int b, unsigned int c)

{

    if (c > 1) {

        return multiplyUsingAdd_3Arg(a, b, 1) + multiplyUsingAdd_3Arg(a, b, c-1);

    } else {

        if (b > 1) {

            return a + multiplyUsingAdd_3Arg(a, b-1, c);

        } else {

            return a;

        }

    }

}

 

 

Note that in the examples above I used C programming language to present the idea. The real code will be though written in Assembler. To view the actual code, please refer to the source code provided with this document.

 

 

 

 

Conclusion:

 

There are two reasons that justify me for using Intel 80x86 architecture to do my tests on. The first simply is that Intel is the only computer architecture that I have. The second reason is because not all other processors in the market have an instruction that pops out the data from the stack simultaneously as the function returns control to the caller. 

The test has been run under MS-DOS when rebooted from Windows 98. Windows has many other threads that are concurrently running and this will affect the accuracy of my test. After I have wrote the tests I was not satisfied with the results, so I started inspecting where is something that I can change in order to remove any running process. The only place that seemed to have effect on the results was in the timer interrupt. So I went ahead and created my own timer routine and disabled the system time. Running the test multiple times I gained number that were consistent with respect to a small variations that is within the range of measurement error.

Originally I wanted to run a comprehensive test with different number of parameters ranging from two parameters to nine parameters. In addition to using 16 and 32 bit instruction sets. Unfortunately, I lost the will to continue it all the way through. Especially, when I received the first reliable numbers. For this reason I decided to stop at three parameters test and only tried rewriting the test for 32 bit instruction set.

For easy referral I numbered the tests as follows: 1. is a test from right to left. 2. Test from left to right. 3. Registry passing test. All have the same return sequence. test 4 and 5 are about the return sequence, so test 4 is a C style return sequence and test 5 has a Pascal return sequence. Tests 6 to 9 represent the existing function calls. C style calling sequence is assigned test number 6, test 7 is about Pascal calling sequence. Windows 98, NT calling sequence __stdcall is in test 8 and finally Test 9 is __fastcall which is used by Watcom and Borland.

There were multiple surprises in my test. The first surprise was in test 1 and 2 very close numbers for the recursive calls but had a much bigger difference in a none function calling. This seemed to defy logic because the number of clocks to perform both functions are the same. I then tried switching the order of the BP+x calls. Surprisengly the output number from the timer became the same! I concluded that the sequence:

             mov     ax, [bp+4]

             mov     bx, [bp+6]

is faster than the sequence:

             mov     ax, [bp+6]

             mov     bx, [bp+4]

Although the clock cycles to produce these instructions are the same, but for unexplained reason the first two instructions get executed faster that the second two! Note that this is still inconsiderable difference since the maximum measured difference was only 0.8%!

The next surprising numbers came from test 4 and 5. It was commonly suggested that the Pascal calling sequence is faster because the called function takes advantage of the Intel return instruction that also pops the stack out. Although you could eventually save some bytes using this instruction, but from my test there was no advantage of using it. In fact the contrary was true! The difference in favour of C return convention with up to 2% difference. I could not come with a convincing explanation. Moreover, when counting the cycles produced by ‘ret 000x’ and ‘ret’ and ‘add sp, 000x’ come to be equal, so there doesn’t seem to be any gain there either.

Overall there was no big difference in the speed between C style and Pascal style calling sequence. On the other hand, __fastcall was considerably faster that both mentioned above. In comparison, using the __fastcall was 36.6% faster for recursive calls whereas for non recursive calls it was only 11.6%. In all cases the difference is still significant to consider converting some functions to use it.

In conclusion, I was disappointed with the results of the test. I was hoping there will be significant difference between different calling conventions and that Pascal will be faster that C style or __stdcall. Unfortunately, on the contrary, in many cases it was slower. On the other hand, as expected __fastcall prove to be the fastest calling sequence, that was true for 2 and 3 passed parameters.

 

 

 

Hosted by www.Geocities.ws

1