Structure and Syntax of Assembly Language

The syntax of assembly language is relatively constant across platforms. Of course, this has very little
implication for portability as the actual semantics of assembly vary so greatly, but I suppose it is
some consolation.

A program (assembly or high-level language) is made up of several segments, each with its own
function. The three types of segments are stack segments, data segments, and code segments.

Several different types of components make up an assembly language program. Among these are
comments, instructions and pseudo-instructions, and labels. We will take each of these in turn
and examine their syntax and purpose.


Segments 

A segment is a contiguous section of a program containing related data or code, or a stack. Some
operating systems use segments as the primary unit for swapping programs in and out of memory,
and on these platforms it is important to keep code and data that are usually accessed together
in the same segment. For DOS, you don't have to worry about this.

There are, as I mentioned above, three types of segments: stack, data, and code.


Stack segments 

A stack segment contains, not surprisingly, a stack. The ss register will point to the beginning of the
stack segment. The sp register will point to the offset within the stack segment of the item at the top
of the stack. Any push or pop operation uses the ss and sp registers to determine where to read or
write data in the stack. Normally, you have only one stack segment in a given program, although it is
possible (though a bit tricky) to have more.
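
To make the role of ss and sp concrete, here is a minimal C sketch of a push and a pop. It assumes
8086 real mode (as under DOS), where the address of a stack item is ss * 16 + sp, the stack grows
downward, and words are stored low byte first. The Machine type and the function names are
hypothetical, invented for this illustration.

```c
#include <stdint.h>

/* Real-mode 8086 address arithmetic: segment * 16 + offset. */
uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Hypothetical machine state; mem must cover every address that is touched. */
typedef struct {
    uint8_t *mem;   /* backing store standing in for RAM */
    uint16_t ss;    /* stack segment register */
    uint16_t sp;    /* offset of the item at the top of the stack */
} Machine;

/* push: move sp down by one word, then store the value at ss:sp. */
void push16(Machine *m, uint16_t value)
{
    m->sp = (uint16_t)(m->sp - 2);
    uint32_t addr = linear_address(m->ss, m->sp);
    m->mem[addr]     = (uint8_t)(value & 0xFF);   /* low byte first */
    m->mem[addr + 1] = (uint8_t)(value >> 8);
}

/* pop: read the word at ss:sp, then move sp back up by one word. */
uint16_t pop16(Machine *m)
{
    uint32_t addr = linear_address(m->ss, m->sp);
    uint16_t value = (uint16_t)(m->mem[addr] | (m->mem[addr + 1] << 8));
    m->sp = (uint16_t)(m->sp + 2);
    return value;
}
```

Pushing two values and popping them returns them in reverse order and leaves sp where it started,
which is all a stack really promises.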



Data segments 

A data segment contains data. You can put any kind of data you want here: bytes, words, double
words, quad words, strings (really arrays of bytes), etc. There are two segment registers which are
used to point to data segments, namely the ds and es registers.
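
If you think in C, the sizes above map onto the fixed-width integer types. The sketch below is only
an analogy: the typedef names are made up for this illustration, and the directive names in the
comments are the MASM ones (db, dw, dd, dq).

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  byte_t;    /* 1 byte,  MASM directive: db */
typedef uint16_t word_t;    /* 2 bytes, MASM directive: dw */
typedef uint32_t dword_t;   /* 4 bytes, MASM directive: dd */
typedef uint64_t qword_t;   /* 8 bytes, MASM directive: dq */

/* A string is just consecutive bytes; there is no terminator unless you add one. */
const byte_t greeting[] = { 'H', 'e', 'l', 'l', 'o' };

/* Number of bytes the string occupies in the data segment. */
size_t greeting_length(void)
{
    return sizeof greeting;
}
```
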


Code segments 

A code segment contains code. Are you beginning to see a trend here? The cs segment register
points to the code segment.

Note: although the exigencies of limited resources and/or basic machismo have in the past caused
people to write programs which modify their own code, this is generally considered today as a bad
practice, as (a) such programs are non-reentrant so that recursion becomes a problem and multiple
instances of the program cannot be run from the same memory image and (b) many modern
computer architectures with Harvard-style caches (separate caches for code and for data) require
the entire cache to be flushed if the code is modified, therefore degrading performance. Don't do it.

Comments 

A comment can come anywhere in an assembly language program. It begins with a semicolon and
runs to the end of the line. Because assembly language programs are so low-level they are very
difficult to read once written. Therefore good comments are absolutely vital, far more so than in
higher-level languages which are able to reflect the logical structure of the algorithm in the text of the
program.

Instructions and pseudo-instructions 

An instruction line in an assembly language program is turned into a single machine instruction in the
resulting binary. A pseudo-instruction may generate data in the binary or may be an assembler directive
that controls the assembly process. The syntax for both types of line is <name> <opcode> <arglist>.

Name 

The name associates a symbol with a memory address and is required for some pseudo-instructions
such as those which specify data locations. Usually, case-sensitivity for names can be turned off or
on by giving a command-line argument to the assembler.

Opcode 

The opcode is always required; it is a machine instruction such as an add or a jump, or an assembler
directive like the segment directive. If no name is given, there should be at least one space before
the opcode. Opcodes are case-insensitive.

Arglist 

The arglist is required for some instructions and pseudo-instructions and specifies the argument list.
If there is more than one argument, the arguments are separated by commas. In assembly-language
programs for MS-DOS, the order of the arguments is usually destination and then source. This is
counterintuitive for many people, but it is the order Intel's own syntax uses. If you do Intel
assembly programming under Unix, the assembler typically uses AT&T syntax instead, in which the
order of the operands is source and then destination. Just to confuse you. Arguments are
case-insensitive except inside strings.
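
To see how the three fields fit together, here is a rough C sketch that splits one source line into
name, opcode, and arguments. It is not how any real assembler works. It follows the column rule
described in this section (text in column one is a name), it cuts the line at the first semicolon
without checking for semicolons inside strings, and AsmLine, MAX_ARGS, and parse_line are names
invented for the illustration.

```c
#include <ctype.h>
#include <string.h>

#define MAX_ARGS 4   /* arbitrary limit for this sketch */

typedef struct {
    char *name;             /* NULL when the line has no name */
    char *opcode;           /* NULL for blank or comment-only lines */
    char *args[MAX_ARGS];
    int   argc;
} AsmLine;

/* Skip leading blanks and cut off trailing blanks in place. */
static char *trim(char *s)
{
    while (isspace((unsigned char)*s)) ++s;
    char *end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1])) --end;
    *end = '\0';
    return s;
}

/* Cut off the next whitespace-delimited word; *cursor moves past it. */
static char *next_word(char **cursor)
{
    char *s = *cursor;
    while (isspace((unsigned char)*s)) ++s;
    if (*s == '\0') { *cursor = s; return NULL; }
    char *word = s;
    while (*s != '\0' && !isspace((unsigned char)*s)) ++s;
    if (*s != '\0') *s++ = '\0';
    *cursor = s;
    return word;
}

/* Split one source line into its fields, modifying the line in place. */
AsmLine parse_line(char *line)
{
    AsmLine out = { NULL, NULL, { NULL }, 0 };

    char *semi = strchr(line, ';');   /* naive: ignores ';' inside strings */
    if (semi) *semi = '\0';

    char *cursor = line;
    if (*line != '\0' && !isspace((unsigned char)*line))
        out.name = next_word(&cursor);   /* column one holds the name */
    out.opcode = next_word(&cursor);

    cursor = trim(cursor);               /* what remains is the arglist */
    for (char *arg = strtok(cursor, ","); arg && out.argc < MAX_ARGS;
         arg = strtok(NULL, ","))
        out.args[out.argc++] = trim(arg);
    return out;
}
```

Given a line such as "        mov     ax,bx", the sketch reports no name, the opcode mov, and the
two arguments ax and bx. A label line such as "main:" comes back as a name with the colon still
attached and no opcode.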

Labels 

Labels are the basis of all flow-control in an assembly-language program. They associate an offset in
the code segment with a symbol which can be referred to in jump instructions. There are no
high-level control structures in assembly such as while loops or for loops; instead everything must be
done through unconditional jumps, tests, and conditional jumps. The argument of a jump instruction
is usually given as a label (although masochists and users of the ROM-resident mini-assembler in the
Apple ][ give these as hexadecimal numbers).
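
C's goto statement and labels behave much like jumps and labels in assembly, so you can get a feel
for this style without leaving C. The sketch below writes a counting loop using nothing but a test,
a conditional jump, and an unconditional jump. The function name sum_to and its labels are
hypothetical.

```c
/* Sum the integers 1..n using only tests and jumps, as assembly must. */
int sum_to(int n)
{
    int total = 0;
    int i = 1;
loop_top:
    if (i > n) goto loop_done;   /* test plus conditional jump out of the loop */
    total += i;
    i++;
    goto loop_top;               /* unconditional jump back to the test */
loop_done:
    return total;
}
```

This is the same shape a compiler produces for a while loop: the test sits at the top, and the
bottom of the body jumps back to it.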

An example assembly language program 

(This is not a complete program)


data    segment
count   db      ?               ; counter variable
data    ends

code    segment
        assume cs:code,ds:nothing,es:data,ss:stack
main:
        mov     ax,seg data     ; set es register
        mov     es,ax           ; to point to data segment
        mov     count,10        ; initialize counter to 10
loop_top:
        dec     count           ; decrement counter
        jnz     loop_top        ; if counter is not zero, repeat loop
        mov     ax,4c00h        ; set up for exit
        int     21h             ; call exit interrupt
code    ends


This corresponds roughly to the following C program:


#include <stdlib.h>

int main(void) {
  int a=10;        /* counter variable */

  while( a != 0 )
    a--;
  exit(0);
}


You don't have to worry about the specifics of the assembly-language program above, but I'll point
out some of the features described earlier. Note that almost every instruction line has a comment
along with it. You really ought to comment your programs to this level, as well as using entire
comment lines to mark off and describe sections of the code. The pseudo-instructions include the
segment directives, which declare segments; the assume directive, which tells the assembler which
segments correspond to which segment registers; the db directive, which reserves a space in the
data segment for a byte-sized variable; and the ends directive, which shows the end of the
segment. The instructions include mov, which moves data from one place to another; dec, which
decrements a register or memory location; jnz, a conditional jump instruction; and int, which calls
an operating system function via an interrupt. There are two labels in the program: main, which just
refers to the top of the main procedure, and loop_top, which is given as the target of the
conditional jump and marks, amazingly enough, the top of the loop.


Here again are the main points:

         Comments
                                   Begin in any column with a semicolon and
                                   continue to the end of the line.
         Instructions and Pseudo-Instructions
                                   Name (if any) begins in the first column.
                                   Opcode begins in the second column or
                                   later, followed by some number of spaces
                                   and/or tabs and then a comma-separated
                                   list of the arguments (if any).
         Labels
                                   Begin in the first column. They start with a
                                   letter or special character [?@_$] and
                                   contain letters, numbers, and special
                                   characters. They are followed by a colon.


