Different Types of Calling Sequences a
Case Study
By
Maher Alshwaiki
Pascal versus C function calls
From Merriam-Webster dictionary, a program
is "a sequence of coded instructions that can be inserted into a mechanism
(as a computer)"
We, computer programmers, see the word
'program' with less abstract definition. For us there are two distinct meaning
for the word. The first, and more generic, form is the binary code that runs on
a computer and performs a certain action. The other form is a sequence of
characters and tokens organized in lines and columns with a specific meaning.
Tokens are mostly English words whereas
characters can be anything of variables, identifiers or compiler commands. We
call the collection of tokens and characters that comply with certain rules as
a programming language. Since not all programmers can read this language, we
add comments and techniques to organize the final text.
One of the techniques used is to group
together instructions that are logically related and separate them with an
empty line from the others. Whenever we reuse the same group of instructions multiple
times we tend to move them aside from the main stream of the code and with a
simple label we invoke those commands. We call this practice invoking a macro.
The pre-processor of a language will then take the group of instructions and
put them in place of the label. The disadvantage of this method is that the
code will be repeated as many as the occurrence of the label. Because of memory
constraints, another method was developed in which the program execution would
jump to the group of instructions and execute the commands, then once reached
the end of the instruction group, it returns to the previous execution line and
continue from that point on. This method has different names depending on the
functionality and language used. In general, and for our use in this text, they
will be called functions.
Functions became an integral part of every
program. Although there were additional instructions added for passing
parameters to the called function and retrieving the values, the advantages
gained from function calls surpassed the time spent on those additional
instructions. Programming languages had different mythologies on how to pass
the parameters and how to retrieve the resulting values. They varied in speed
and features. Some languages such as C are capable of using different methods
to accomplish the same functionality. This is usually referred to as Calling
Convention. In this document I will present and discuss the differences between
those different calling conventions. Later, I will write small test programs to
measure the time differences between each method and present my conclusion.
A calling convention consists of two
parts: The method used to call a function and the method used by the function
to return the execution to the caller. Three different methods have been
invented to call a function, whereas, two methods were created to return the
execution to the caller. The combination of the two parts has resulted in the
creation of four calling convention known as C Declaration, Pascal Convention,
Standard Call and Fast Call. C and Pascal are the representatives of the main
two calling conventions C Declaration and Pascal Convention respectively.
1.1 C Declaration
In this method parameters are pushed into the stack starting with the last parameter on the right back to the first parameter on the left of the function declaration. This way, the called function will always know how to set the stack pointer to point to the first parameter passed to it. Remember that the stack pointer is pointing to the address that the control will return to after the function terminates.
Because of the way the parameters are passed, it is possible to create functions with variable amount of parameters that can be passed to it. A typical example of this category is printf function in C language. The function does not know how many parameters are passed to it and only through the context obtained from the string it can determine how many are actually passed.
The other part that characterises this method of calling is the in the steps performed after returning from the called function. The called function does not always know the size of the stack it always returns the stack to the position where it was just after it was called. Thus freeing the memory space, so-called cleaning the stack, is done by the caller.
1.2 Pascal Convention
This convention is exactly the opposite of the previous method (C Declaration). Parameters are passed to the function from left to right, that is, the first parameter is pushed to the stack first. This prevents us from creating functions with variable number of parameters because the called function would never know where to find the first parameter. Adding additional parameter at the end of the list that would contain the number of parameters passed adds unnecessary overhead.
Because the called function knows how many parameters are passed to it, it can clean the stack before it returns to control back to the caller. This is useful to do only at processors that have an instruction that combines both the return instruction and the stack cleaning instruction together. Otherwise, moving values around in the stack becomes a necessity, which renders the effectiveness of this method.
1.3 Standard Call
Mainly used by Windows 95, Windows NT and its successors. This takes the first option from C Declaration and second option from the Pascal Convention. The idea is to be able to provide both the ability to create variable length parameters when calling a function and the advantage provided by 80x86 processors family.
1.4 Fast Call
This is the newest developed method and is
used by both Watcom C++ and Borland C++ Builder. The idea is that there are
free CPU registers that are unused when a function is called. Access to the registers
is many times faster than external memory and could save CPU cycles, especially
on relatively small size functions that have very few variables. On Intel
platform, data are passed in registers ECX and EDX. Return value in turn is
passed back in ??
Method of testing:
Some of the tests can be mathematically
calculated and using formal methods a proof could be presented to which method
would have better results. In this research paper I am concerned with the
engineering view to the topic, thus actual tests will run, obtained values will
be entered into a table, and a graph will be created from the resulting table.
Depending on the results achieved I will present my conclusion on which method
is has the best speed.
The framework for the each test sequence
will look similar to the following:
1.
Get
the current system time
2.
Call
the required function
3.
Apon
return from the function get the current system time
4.
Calculate
the difference between the beginning time and ending time
5.
Display
the time on the screen.
There are three main test sections. A test
section consists of multiple test sequences, which consists of one common
testing method.
The first testing method is a simple call
to each of the calling convention functions. The called method reads the values
from the stack to the registers then simply returns control back to the calling
program. The second testing method is a recursive call where the read data are
again passed in a new call to the same function. The reason for doing this is
that the first method created very small timing numbers and the error
introduced was high. That’s why I thought of the second method. The function
would call itself as many times as specified in one of the passed parameter.
Since this function still doesn’t do anything other than repeated calling and
thus no real data are being manipulated. I thought of creating the last
sequence, which is based on using the addition operator to calculate the
multiplication of numbers. Again I used recursive calling to create significant
delay to make it possible for me to measure it.
Generally, I am interested in testing 1,
2, 3, and 5 calling parameters to balance between usefulness of the test and
the amount of work I will need to do. In this respect, I will not perform the
_fastcall test for the five parameter function call because to my knowledge
only three registers are usually used to pass the information to the function
above that number the parameters are pushed into the stack.
The first set of test I will create will
use 16 bit data and will run on Intel processor running in 8086 real mode.
Because there might be a difference in timing between the Intel processor 8086
16-bit mode and the Intel processor 80386 running in 32-bit real mode, both
tests will be performed. I decided to use real mode for the 32-bit 80386
because of its simplicity and had no reasonable doubt that switching to
protected mode could alter the processor instruction timings.
In the following section I will describe
the tests I have written and included fragments of the corresponding code.
Please note, that because there is no significant change in the code written
for 16-bit and 32-bit processor mode. I will not discuss the 32-bit code here
and will direct you to view the source code of the program.
To make the test comprehensive I created
additional functions that target a specific area in the sequence.
The first of such functions is the right-to-left and left-to-right calling sequence.
Since we are looking into the arrangements
of the arguments it doesn’t matter what return sequence we use. In both cases
we have to use the same return sequence, thus using the C style return sequence
offers the ability to use the _cdecl function. This is useful because it
implies that we can use the test results obtained from _cdecl test when
comparing it with the results from the left-to-right function.
Recursive call for data passed from left
to right:
|
mov ax, 0 mov bx, 0 push ax push bx call lr_recurs_call ;
lr_recurs_call(ax, bx) add sp, 0004 lr_recurs_call: push bp mov bp, sp mov ax, [bp+6] inc ax mov bx, [bp+4] cmp ax, 0FFFFh je lr_recurs_end push ax push bx call lr_recurs_call add sp, 0004 lr_recurs_end: pop bp ret |
Recursive call for registry passing:
|
mov ax, 0 mov bx, 0 mov dx, bx mov cx, ax call reg_recurs_call ;
reg_recurs_call(ax, bx) reg_recurs_call: mov ax, cx inc ax mov bx, dx cmp ax, 0FFFFh je reg_recurs_end mov dx, bx mov cx, ax call reg_recurs_call reg_recurs_end: ret |
The interesting point will be here in the
_fastcall. Since we only pass the last two parameters (the right most
parameters) through registers the rest of the parameters will be passed using
another method. We will assume the C style of passing parameters through the
stack. Here is an example for passing three parameters:
|
mov ax, 0 mov bx, 0 mov cx, 0 push cx mov dx, bx mov cx, ax call reg_rl_recurs_call ;
reg_rl_recurs_call(ax, bx, cx) reg_rl_recurs_call: push bp mov bp, sp mov ax, cx inc ax mov bx, dx mov cx, [bp+4] cmp ax, 0FFFFh je reg_rl_recurs_end push cx mov dx, bx mov cx, ax call reg_rl_recurs_call reg_rl_recurs_end: pop bp ret 0002 |
|
Recursive call for
registry passing and right to left |
Now we move to the comparison of the
standard functions. The listing of the four standard functions _cdecl, _pascal,
_stdcall and _fastcall for two passed arguments is below.
Recursive call for _cdecl:
|
mov ax, 0 mov bx, 0 push bx push ax call c_recurs_call ;
c_recurs_call(ax, bx) add sp, 0004 c_recurs_call: push bp mov bp, sp mov ax, [bp+4] inc ax mov bx, [bp+6] cmp ax, 0FFFFh je c_recurs_end push bx push ax call c_recurs_call add sp, 0004 c_recurs_end: pop bp ret |
Recursive call for _pascal:
|
mov ax, 0 mov bx, 0 push ax push bx call pas_recurs_call ;
pas_recurs_call(ax, bx) pas_recurs_call: push bp mov bp, sp mov ax, [bp+6] inc ax mov bx, [bp+4] cmp ax, 0FFFFh je pas_recurs_end push ax push bx call pas_recurs_call pas_recurs_end: pop bp ret 0004 |
Recursive call for _stdcall:
|
mov ax, 0 mov bx, 0 push ax push bx call std_recurs_call ;
std_recurs_call(ax, bx) add sp, 0004 std_recurs_call: push bp mov bp, sp mov ax, [bp+6] inc ax mov bx, [bp+4] cmp ax, 0FFFFh je std_recurs_end push ax push bx call std_recurs_call add sp, 0004 std_recurs_end: pop bp ret |
Recursive call for _fastcall:
|
mov ax, 0 mov bx, 0 mov dx, bx mov cx, ax call fc_recurs_call ;
fc_recurs_call(ax, bx) fc_recurs_call: mov ax, cx inc ax mov bx, dx cmp ax, 0FFFFh je fc_recurs_end mov dx, bx mov cx, ax call fc_recurs_call fc_recurs_end: ret |
The interesting point will be here in the _fastcall.
Since we only pass the last two parameters through registers the rest of the
parameters will be passed using another method. we will assume the C style of
passing parameters to the stack. Here is an example for passing three
parameters:
|
mov ax, 0 mov bx, 0 mov cx, 0 push cx mov dx, bx mov cx, ax call fc_sp_recurs_call ;
fc_sp_recurs_call(ax, bx, cx) fc_sp_recurs_call: push bp mov bp, sp mov ax, cx inc ax mov bx, dx mov cx, [bp+4] cmp ax, 0FFFFh je fc_sp_recurs_end push cx mov dx, bx mov cx, ax call fc_sp_recurs_call fc_sp_recurs_end: pop bp ret 0002 |
In the third set of test functions, I will
focus on return method. I will use one method of passing parameters to all
called function. I decided to use right-to-left always for no particular
reason. Again in this case, you can notice that I can use the test results from
_cdecl because it will be identical with the “Caller cleans the stack” test.
The following is a listing only shows the second test “Called cleans the stack”.
Called cleans the stack:
|
mov ax, 0 mov bx, 0 push bx push ax call pas_ret_recurs_call ;
pas_ret_recurs_call(ax, bx) pas_ret_recurs_call: push bp mov bp, sp mov ax, [bp+4] inc ax mov bx, [bp+6] cmp ax, 0FFFFh je pas_ret_recurs_end push bx push ax call pas_ret_recurs_call pas_ret_recurs_end: pop bp ret 0004 |
To make the functions more realistic, I
will be applying a uniform operation inside the functions with the same number of
calling parameters. Since I will be using recursive calling in some of the
tests, the operation selected must have a recursive way of implementation. I
selected the multiplication function since it is very simple. Let us look at
the example when you use two or three parameters. Let’s start with two
parameters: the implemented mathematical operation then is c = a ´ b where a and b are the function
parameters and c is the result returned by the function. To implement it in a
recursion b will be again passed to the function itself but decremented by 1.
We will repeat that until b is equal to 1, in this case we will return the
value of a to the caller. The caller
function in the queue will add a to the returned value before returning
it again. This sequence will be repeated until the queue is empty. In the end a
will have been added b times. The C language code will look similar to the
following:
|
int
multiplyUsingAdd_2Arg(unsigned int a, unsigned int b) { if (b > 1) { return a + multiplyUsingAdd_2Arg(a,
b-1); } else { return a; } } |
Similarly we will do the same with a
function of three parameters. The analogous operation is d = a ´ b ´ c. In this case the original function
‘add’ will have to be modified to reflect the need to add the result of a ´ b, c times. Note that though this is a
classical associative property, I will not split the function into two
multiplyUsingAdd_2Arg() functions. My aim is to use the three parameter calling
convention thus in the fragment below, I recursively called again the same
function but set the c parameter to 1.
|
int
multiplyUsingAdd_3Arg(unsigned int a, unsigned int b, unsigned int c) { if (c > 1) { return multiplyUsingAdd_3Arg(a, b,
1) + multiplyUsingAdd_3Arg(a, b, c-1); } else { if (b > 1) { return a +
multiplyUsingAdd_3Arg(a, b-1, c); } else { return a; } } } |
Note that in the examples above I used C programming
language to present the idea. The real code will be though written in
Assembler. To view the actual code, please refer to the source code provided
with this document.
|
|
Conclusion:
There are two reasons that justify me for using
Intel 80x86 architecture to do my tests on. The first simply is that Intel is
the only computer architecture that I have. The second reason is because not
all other processors in the market have an instruction that pops out the data
from the stack simultaneously as the function returns control to the
caller.
The test has been run under MS-DOS when
rebooted from Windows 98. Windows has many other threads that are concurrently
running and this will affect the accuracy of my test. After I have wrote the
tests I was not satisfied with the results, so I started inspecting where is
something that I can change in order to remove any running process. The only
place that seemed to have effect on the results was in the timer interrupt. So
I went ahead and created my own timer routine and disabled the system time.
Running the test multiple times I gained number that were consistent with
respect to a small variations that is within the range of measurement error.
Originally I wanted to run a comprehensive
test with different number of parameters ranging from two parameters to nine
parameters. In addition to using 16 and 32 bit instruction sets. Unfortunately,
I lost the will to continue it all the way through. Especially, when I received
the first reliable numbers. For this reason I decided to stop at three
parameters test and only tried rewriting the test for 32 bit instruction set.
For easy referral I numbered the tests as
follows: 1. is a test from right to left. 2. Test from left to right. 3.
Registry passing test. All have the same return sequence. test 4 and 5 are
about the return sequence, so test 4 is a C style return sequence and test 5
has a Pascal return sequence. Tests 6 to 9 represent the existing function
calls. C style calling sequence is assigned test number 6, test 7 is about
Pascal calling sequence. Windows 98, NT calling sequence __stdcall is in test 8
and finally Test 9 is __fastcall which is used by Watcom and Borland.
There were multiple surprises in my test.
The first surprise was in test 1 and 2 very close numbers for the recursive
calls but had a much bigger difference in a none function calling. This seemed
to defy logic because the number of clocks to perform both functions are the
same. I then tried switching the order of the BP+x calls. Surprisengly the
output number from the timer became the same! I concluded that the sequence:
mov ax, [bp+4]
mov bx, [bp+6]
is faster than the sequence:
mov ax, [bp+6]
mov bx, [bp+4]
Although the clock cycles to produce these
instructions are the same, but for unexplained reason the first two
instructions get executed faster that the second two! Note that this is still
inconsiderable difference since the maximum measured difference was only 0.8%!
The next surprising numbers came from test
4 and 5. It was commonly suggested that the Pascal calling sequence is faster
because the called function takes advantage of the Intel return instruction
that also pops the stack out. Although you could eventually save some bytes
using this instruction, but from my test there was no advantage of using it. In
fact the contrary was true! The difference in favour of C return convention
with up to 2% difference. I could not come with a convincing explanation.
Moreover, when counting the cycles produced by ‘ret 000x’ and ‘ret’ and ‘add
sp, 000x’ come to be equal, so there doesn’t seem to be any gain there either.
Overall there was no big difference in the
speed between C style and Pascal style calling sequence. On the other hand,
__fastcall was considerably faster that both mentioned above. In comparison,
using the __fastcall was 36.6% faster for recursive calls whereas for non
recursive calls it was only 11.6%. In all cases the difference is still
significant to consider converting some functions to use it.
In conclusion, I was disappointed with the
results of the test. I was hoping there will be significant difference between
different calling conventions and that Pascal will be faster that C style or
__stdcall. Unfortunately, on the contrary, in many cases it was slower. On the other
hand, as expected __fastcall prove to be the fastest calling sequence, that was
true for 2 and 3 passed parameters.