                                CHAPTER 8

                          SORTING AND SEARCHING


Two fundamental operations required of many applications are searching and
sorting the data they operate on.  Many different types of data are
commonly sorted, such as customer names, payment due dates, or even a list
of file names displayed in a file selection menu.  If you are writing a
programmer's cross reference utility, you may need to sort a list of
variable names without regard to capitalization.  In some cases, you may
want to sort several pieces of related information based on the contents
of only one of them.  One example of that is a list of names and addresses
sorted in ascending zip code order.
   Searching is equally important; for example, to locate a customer name
in an array or disk file.  In some cases you may wish to search for a
complete match, while in others a partial match is needed.  If you are
searching a list of names for, say, Leonard, you probably would want to
ignore Leonardo.  But when searching a list of zip codes you may need to
locate all that begin with the digits 068.  There are many different ways
sorting and searching can be accomplished, and the subject is by no means
a simple one.
   Most programmers are familiar with the Bubble Sort, because it is the
simplest to understand.  Each adjacent pair of items is compared, and then
exchanged if they are out of order.  This process is repeated over and
over, until the entire list has been examined as many times as there are
items.  Unfortunately, these repeated comparisons make the Bubble Sort an
extremely poor performer.  Similarly, code to perform a linear search that
simply examines each item in succession for a match is easy to grasp, but
it will be painfully slow when there are many items.
   In this chapter you will learn how sophisticated algorithms that handle
these important programming chores operate.  You will also learn how to
sort data on more than one key.  Often, it is not sufficient to merely sort
a list of customers by their last name.  For example, you may be expected
to sort first by last name, then by first name, and finally by balance due. 
That is, all of the last names would first be sorted.  Then within all of
the Smiths you would sort again by first name, and for all of the John
Smiths sort that subgroup based on how much money is owed.
   For completeness I will start each section by introducing sorting and
searching methods that are easy to understand, and then progress to the
more complex algorithms that are much more effective.  Specifically, I will
show the Quick Sort and Binary Search algorithms.  When there are many
thousands of data items, a good algorithm can make the difference between
a sort routine that takes ten minutes to complete, and one that needs only
a few seconds.
   Finally, I will discuss both BASIC and assembly language sort routines. 
As important as the right algorithm is for good performance, an assembly
language implementation will be even faster.  Chapter 12 describes how
assembly language routines are written and how they work, and in this
chapter I will merely show how to use the routines included with this book.


SORTING FUNDAMENTALS
====================

Although there are many different ways to sort an array, the simplest
sorting algorithm is the Bubble Sort.  The name Bubble is used because a
FOR/NEXT loop repeatedly examines each adjacent pair of elements in the
array, and those that have higher values rise to the top like bubbles in
a bathtub.  The most common type of sort is ascending, which means that "A"
comes before "B", which comes before "C", and so forth.  Figure 8-1 shows
how the name Zorba ascends to the top of a five-item list of first names.

Initial array contents:

  Element 4    Kathy
  Element 3    Barbara
  Element 2    Cathy
  Element 1    Zorba <


      After 1 pass:

        Element 4    Kathy
        Element 3    Barbara
        Element 2    Zorba <
        Element 1    Cathy


            After 2 passes:

              Element 4    Kathy
              Element 3    Zorba <
              Element 2    Barbara
              Element 1    Cathy


                After 3 passes:

                  Element 4    Zorba <
                  Element 3    Kathy
                  Element 2    Barbara
                  Element 1    Cathy

Figure 8.1: Data ascending a list during a bubble sort.

The Bubble Sort routine that follows uses a FOR/NEXT loop to repeatedly
examine an array and exchange elements as necessary, until all of the items
are in the correct order.

DEFINT A-Z
DECLARE SUB BubbleSort (Array$())

CONST NumItems% = 20
CONST False% = 0
CONST True% = -1

DIM Array$(1 TO NumItems%)
FOR X = 1 TO NumItems%
  READ Array$(X)
NEXT

CALL BubbleSort(Array$())

CLS
FOR X = 1 TO NumItems%
  PRINT Array$(X)
NEXT

DATA Zorba, Cathy, Barbara, Kathy, Josephine
DATA Joseph, Joe, Peter, Arnold, Glen
DATA Ralph, Elli, Lucky, Rocky, Louis
DATA Paula, Paul, Mary Lou, Marilyn, Keith
END

SUB BubbleSort (Array$()) STATIC

DO
  OutOfOrder = False%                 'assume it's sorted
  FOR X = 1 TO UBOUND(Array$) - 1
    IF Array$(X) > Array$(X + 1) THEN
      SWAP Array$(X), Array$(X + 1)   'if we had to swap
      OutOfOrder = True%              'we may not be done
    END IF
  NEXT
LOOP WHILE OutOfOrder

END SUB

This routine is simple enough to be self-explanatory, and only a few things
warrant discussing.  One is the OutOfOrder flag variable.  When the array
is nearly sorted to begin with, fewer passes through the loop are needed. 
The OutOfOrder variable determines when no more passes are necessary.  It
is cleared at the start of each loop, and set each time two elements are
exchanged.  If, after examining all of the elements in one pass no
exchanges were required, then the sorting is done and there's no need for
the DO loop to continue.
   The other item worth mentioning is that the FOR/NEXT loop is set to
consider one element less than the array actually holds.  This is necessary
because each element is compared to the one above it.  If the last element
were included in the loop, then BASIC would issue a "Subscript out of
range" error on the statement that examines Array$(X + 1).
   There are a number of features you can add to this Bubble Sort routine. 
For example, you could sort without regard to capitalization.  In that case
"adams" would come before "BAKER", even though the lowercase letter "a" has
a higher ASCII value than the uppercase letter "B".  To add that capability
simply use BASIC's UCASE$ (or LCASE$) function as part of the comparisons:

   IF UCASE$(Array$(X)) > UCASE$(Array$(X + 1)) THEN

And to sort based on the eight-character portion that starts six bytes
into each string you would use this:

   IF MID$(Array$(X), 5, 8) > MID$(Array$(X + 1), 5, 8) THEN

Although the comparisons in this example are based on just a portion of
each string, the SWAP statement must exchange the entire elements.  This
opens up many possibilities as you will see later in this chapter.
   If there is a chance that the strings may contain trailing blanks that
should be ignored, you can use RTRIM$ on each pair of elements:

   IF RTRIM$(Array$(X)) > RTRIM$(Array$(X + 1)) THEN

Of course, you can easily combine these enhancements to consider only the
characters in the middle after they have been converted to upper or lower
case.
   Sorting in reverse (descending) order is equally easy; you'd simply
replace the greater-than symbol (>) with a less-than symbol (<).
   Finally, you can modify the routine to work with any type of data by
changing the array type identifier.  That is, for every occurrence of
Array$ you will change that to Array% or Array# or whatever is appropriate. 
If you are sorting a numeric array, then different modifications may be in
order.  For example, to sort ignoring whether the numbers are positive or
negative you would use BASIC's ABS (absolute value) function:

   IF ABS(Array!(X)) > ABS(Array!(X + 1)) THEN

It is important to point out that all of the simple modifications described
here can also be applied to the more sophisticated sort routines we will
look at later in this chapter.


INDEXED SORTS

Besides the traditional sorting methods--whether a Bubble Sort or Quick
Sort or any other type of sort--there is another category of sort routine
you should be familiar with.  Where a conventional sort exchanges elements
in an array until they are in order, an Index Sort instead exchanges
elements in a parallel numeric array of *pointers*.  The original data is left
intact, so it may still be accessed in its natural order.  However, the
array can also be accessed in sorted order by using the element numbers
contained in the index array.
   As with a conventional sort, the comparisons in an indexed sort routine
examine each element in the primary array, but based on the element numbers
in that index array.  If it is determined that the data is out of order,
the routine exchanges the elements in the index array instead of the
primary array.  A modification to the Bubble Sort routine to sort using an
index is shown below.

DEFINT A-Z
DECLARE SUB BubbleISort (Array$(), Index())

CONST NumItems% = 20
CONST False% = 0
CONST True% = -1

DIM Array$(1 TO NumItems%)  'this holds the string data
DIM Ndx(1 TO NumItems%)     'this holds the index

FOR X = 1 TO NumItems%
  READ Array$(X)            'read the string data
  Ndx(X) = X                'initialize the index array
NEXT

CALL BubbleISort(Array$(), Ndx())

CLS
FOR X = 1 TO NumItems%
  PRINT Array$(Ndx(X))      'print based on the index
NEXT

DATA Zorba, Cathy, Barbara, Kathy, Josephine
DATA Joseph, Joe, Peter, Arnold, Glen
DATA Ralph, Elli, Lucky, Rocky, Louis
DATA Paula, Paul, Mary lou, Marilyn, Keith

SUB BubbleISort (Array$(), Index()) STATIC

DO
  OutOfOrder = False%                 'assume it's sorted
  FOR X = 1 TO UBOUND(Array$) - 1
    IF Array$(Index(X)) > Array$(Index(X + 1)) THEN
      SWAP Index(X), Index(X + 1)     'if we had to swap
      OutOfOrder% = True%             'we're not done yet
    END IF
  NEXT
LOOP WHILE OutOfOrder%

END SUB

In this indexed sort, all references to the data are through the index
array.  And when a swap is necessary, it is the index array elements that
are exchanged.  Note that an indexed sort requires that the index array be
initialized to increasing values--even if the sort routine is modified to
be descending instead of ascending.  Therefore, when BubbleISort is called
Ndx(1) must hold the value 1, Ndx(2) is set to 2, and so forth.
   In this example the index array is initialized by the caller.  However,
it would be just as easy to put that code into the subprogram itself. 
Since you can't pass an array that hasn't yet been dimensioned, it makes
the most sense to do both steps outside of the subprogram.  Either way, the
index array must be assigned to these initial values.
   As I mentioned earlier, one feature of an indexed sort is that it lets
you access the data in both its original and sorted order.  But there are
other advantages, and a disadvantage as well.  The disadvantage is that
each comparison takes slightly longer, because of the additional overhead
required to first look up the element number in the index array, to
determine which elements in the primary array will be compared.  In some
cases, though, that can be more than offset by requiring less time to
exchange elements.
   If you are sorting an array of 230-byte TYPE variables, the time needed
for SWAP to exchange the elements can become considerable.  Every byte in
both elements must be read and written, so the time needed increases
linearly as the array elements become longer.  Contrast that with the fixed
two bytes in the integer index array that are swapped.
   Another advantage of an indexed sort is that it lends itself to sorting
more data than can fit in memory.  As you will see later in the section
that shows how to sort files, it is far easier to manipulate an integer
index than an entire file.  Further, sorting the file data using multiple
passes requires twice as much disk space as the file already occupies.


DATA MANIPULATION TECHNIQUES

Before I show the Quick Sort algorithm that will be used as a basis for
the remaining sort examples in this chapter, you should also be aware of
a few simple tricks that can help you maintain and sort your data.  One was
described in Chapter 6, using a pair of functions that pack and unpack
dates such that the year is stored before the month, which in turn is
before the day.  Thus, date strings are reduced to only three characters
each, and they can be sorted directly.
   Another useful speed-up trick is to store string data as integers or
long integers.  If you had a system of four-digit account numbers you could
use an integer instead of a string.  Besides saving half the memory and
disk space, the integer comparisons in a sort routine will be many times
faster than a comparison on string equivalents.  Zip codes are also suited
to this, and could be stored in a long integer.  Even though the space
savings is only one byte, the time needed to compare the values for sorting
will be greatly reduced.
   This brings up another important point.  As you learned in Chapter 2,
all conventional (not fixed-length) strings require more memory than might
be immediately apparent.  Besides the amount of memory needed to hold the
data itself, four additional bytes are used for a string descriptor, and
two more beyond those for a back pointer.  Therefore, a zip code stored as
a string will actually require eleven bytes rather than the five you might
expect.  With this in mind, you may be tempted to think that using a fixed-
length string to hold the zip codes will solve the problem.  Since fixed-
length strings do not use either descriptors or back pointers, they do not
need the memory they occupy.  And that leads to yet another issue.
   Whenever a fixed-length string or the string portion of a TYPE variable
is compared, it must first be converted to a regular descriptored string. 
BASIC has only one string comparison routine, and it expects the addresses
for two conventional string descriptors.  Every time a fixed-length string
is used as an argument for comparison, BASIC must create a temporary copy,
call its comparison routine, and then delete the copy.  This copying adds
code and wastes an enormous amount of time; in many cases the copying will
take longer than the comparison itself.  Therefore, using integers and long
integers for numeric data where possible will provide more improvement than
just the savings in memory use.
   In some cases, however, you must use fixed-length string or TYPE arrays. 
In particular, when sorting information from a random access disk file it
is most sensible to load the records into a TYPE array.  And as you learned
in Chapter 2, the string components of a TYPE variable or array element are
handled by BASIC as a fixed-length string.  So how can you effectively sort
fixed-length string arrays without incurring the penalty BASIC's overhead
imposes?  With assembly language subroutines, of course!
   Rather than ask BASIC to pass the data using its normal methods,
assembly language routines can be invoked passing the data segments and
addresses directly.  When you use SEG, or a combination of VARSEG and
VARPTR with fixed-length and TYPE variables, BASIC knows that you want the
segmented address of the variable or array element.  Thus, you are tricking
BASIC into not making a copy as it usually would when passing such data. 
An assembly language subroutine or function can be designed to accept data
addresses in any number of ways.  As you will see later when we discuss
sorting on multiple keys, extra trickery is needed to do the same thing in
a BASIC procedure.
   The three short assembly language functions that follow compare two
portions of memory, and then return a result that can be tested by your
program.

;COMPARE.ASM - compares two ranges of memory

.Model Medium, Basic
.Code

Compare Proc Uses DS ES DI SI, SegAdr1:DWord, _
  SegAdr2:DWord, NumBytes:Word

    Cld                ;compare in the forward direction
    Mov  SI,NumBytes   ;get the address for NumBytes%
    Mov  CX,[SI]       ;put it into CX for comparing below

    Les  DI,SegAdr1    ;load ES:DI with the first
                       ;  segmented address
    Lds  SI,SegAdr2    ;load DS:SI with the second
                       ;  segmented address

    Repe Cmpsb         ;do the compare
    Mov  AX,0          ;assume the bytes didn't match
    Jne  Exit          ;we were right, skip over
    Dec  AX            ;wrong, decrement AX down to -1

Exit:
    Ret                ;return to BASIC

Compare Endp
End

;COMPARE2.ASM - compares memory case-insensitive

.Model Medium, Basic
.Code

Compare2 Proc Uses DS ES DI SI, SegAdr1:DWord, _
  SegAdr2:DWord, NumBytes:Word

    Cld                ;compare in the forward direction
    Mov  BX,-1         ;assume the ranges are the same

    Mov  SI,NumBytes   ;get the address for NumBytes%
    Mov  CX,[SI]       ;put it into CX for comparing below
    Jcxz Exit          ;if zero bytes were given, they're
                       ;  the same
    Les  DI,SegAdr1    ;load ES:DI with the first address
    Lds  SI,SegAdr2    ;load DS:SI with the second address

Do:
    Lodsb              ;load the current character from
                       ;  DS:SI into AL
    Call Upper         ;capitalize as necessary
    Mov  AH,AL         ;copy the character to AH
    
    Mov  AL,ES:[DI]    ;load the other character into AL
    Inc  DI            ;point at the next one for later
    Call Upper         ;capitalize as necessary

    Cmp  AL,AH         ;now, are they the same?
    Jne  False         ;no, exit now and show that
    Loop Do            ;yes, continue
    Jmp  Short Exit    ;if we get this far, the bytes are
                       ;  all the same
False:
    Inc  BX            ;increment BX to return zero
    
Exit:
    Mov  AX,BX         ;assign the function output
    Ret                ;return to BASIC

Upper:
    Cmp  AL,"a"        ;is the character below an "a"?
    Jb   Done          ;yes, so we can skip it
    Cmp  AL,"z"        ;is the character above a "z"?
    Ja   Done          ;yes, so we can skip that too
    Sub  AL,32         ;convert to upper case

Done:
    Retn               ;do a near return to the caller

Compare2 Endp
End

;COMPARE3.ASM - case-insensitive, greater/less than

.Model Medium, Basic
.Code

Compare3 Proc Uses DS ES DI SI, SegAdr1:DWord, _
  SegAdr2:DWord, NumBytes:Word

    Cld               ;compare in the forward direction
    Xor  BX,BX        ;assume the ranges are the same

    Mov  SI,NumBytes  ;get the address for NumBytes%
    Mov  CX,[SI]      ;put it into CX for comparing below
    Jcxz Exit         ;if zero bytes were given, they're
                      ;  the same
    Les  DI,SegAdr1   ;load ES:DI with the first address
    Lds  SI,SegAdr2   ;load DS:SI with the second address

Do:
    Lodsb             ;load the current character from
                      ;  DS:SI into AL
    Call Upper        ;capitalize as necessary, remove for
                      ;  case-sensitive
    Mov  AH,AL        ;copy the character to AH

    Mov  AL,ES:[DI]   ;load the other character into AL
    Inc  DI           ;point at the next character for later
    Call Upper        ;capitalize as necessary, remove for
                      ;  case-sensitive

    Cmp  AL,AH        ;now, are they the same?
    Loope Do          ;yes, continue
    Je   Exit         ;we exhausted the data and they're
                      ;  the same
    Mov  BL,1         ;assume block 1 was "greater"
    Ja   Exit         ;we assumed correctly
    Dec  BX           ;wrong, bump BX down to -1
    Dec  BX

Exit:
    Mov  AX,BX        ;assign the function output
    Ret               ;return to BASIC

Upper:
    Cmp  AL,"a"       ;is the character below an "a"?
    Jb   Done         ;yes, so we can skip it
    Cmp  AL,"z"       ;is the character above a "z"?
    Ja   Done         ;yes, so we can skip that too
    Sub  AL,32        ;convert to upper case

Done:
    Retn              ;do a near return to the caller

Compare3 Endp
End

The first Compare routine above simply checks if all of the bytes are
identical, and returns -1 (True) if they are, or 0 (False) if they are not. 
By returning -1 or 0 you can use either

   IF Compare%(Type1, Type2, NumBytes%) THEN
or
   IF NOT Compare%(Type1, Type2, NumBytes%) THEN

depending on which logic is clearer for your program.  Compare2 is similar
to Compare, except it ignores capitalization.  That is, "SMITH" and Smith"
are considered equal.  The Compare3 function also compares memory and
ignores capitalization, but it returns either -1, 0, or 1 to indicate if
the first data range is less than, equal to, or greater than the second.
   The correct declaration and usage for each of these routines is shown
below.  Note that Compare and Compare2 are declared and used in the same
fashion.


Compare and Compare2:

   DECLARE FUNCTION Compare%(SEG Type1 AS ANY, SEG Type2 AS ANY, _
     NumBytes%)
   Same = Compare%(Type1, Type2, NumBytes%)

or

   DECLARE FUNCTION Compare%(BYVAL Seg1%, BYVAL Adr1%, BYVAL Seg2%, _
     BYVAL Adr2%, NumBytes%)
   Same = Compare%(Seg1%, Adr1%, Seg2%, Adr2%, NumBytes%)


Here, Same receives -1 if the two TYPE variables or ranges of memory are
the same, or 0 if they are not.  NumBytes% tells how many bytes to compare.


Compare3:

   DECLARE FUNCTION Compare3%(SEG Type1 AS ANY, SEG Type2 AS ANY, _
     NumBytes%)
   Result = Compare3%(Type1, Type2, NumBytes%)

or
   DECLARE FUNCTION Compare3%(BYVAL Seg1%, BYVAL Adr1%, BYVAL Seg2%, _
     BYVAL Adr2%, NumBytes%)
   Result = Compare3%(Seg1%, Adr1%, Seg2%, Adr2%, NumBytes%)


Result receives 0 if the two type variables or ranges of memory are the
same, -1 if the first is less when compared as strings, or 1 if the first
is greater.  NumBytes% tells how many bytes are to be to compared.  In the
context of a sort routine you could invoke Compare3 like this:


IF Compare3%(TypeEl(X), TypeEl(X + 1), NumBytes%) = 1 THEN
  SWAP TypeEl(X), TypeEl(X + 1)
END IF


As you can see, these routines may be declared in either of two ways. 
When used with TYPE arrays the first is more appropriate and results in
slightly less setup code being generated by the compiler.  When comparing
fixed-length strings or arbitrary blocks of memory (for example, when one
of the ranges is on the display screen) you should use the second method. 
Since SEG does not work correctly with fixed-length strings, if you want
to use that more efficient version you must create a dummy TYPE comprised
solely of a single string portion:


TYPE FixedLength
  Something AS STRING * 35
END TYPE


Then simply use DIM to create a single variable or an array based on this
or a similar TYPE, depending on what your program needs.  The requirement
to create a dummy TYPE was discussed in Chapter 2, and I won't belabor the
reasons again here.  These comparison routines will be used extensively in
the sort routines presented later in this chapter; however, their value in
other, non-sorting situations should also be apparent.
   Although these routines are written in assembly language, they are
fairly simple to follow.  It is important to understand that you do not
need to know anything about assembly language to use them.  All of the
files you need to add these and all of the other routines in this book are
contained on the accompanying diskette [here, in the same ZIP file as this
text].  Chapter 12 discusses assembly language in great detail, and you can
refer there for further explanation of the instructions used.
   If you plan to run the programs that follow in the QuickBASIC editor,
you must load the BASIC.QLB Quick Library as follows:

   qb program /l basic

Later when you compile these or other programs you must link with the
parallel BASIC.LIB file:

   bc program [/o];
   link program , , nul , basic;

If you are using BASIC PDS start QBX using the BASIC7.QLB file, and then
link with BASIC7.LIB to produce a stand-alone .EXE program.  [VB/DOS users
will also use the BASIC7 version.]


THE QUICK SORT ALGORITHM
========================

It should be obvious to you by now that a routine written in assembly
language will always be faster than an equivalent written in BASIC. 
However, simply translating a procedure to assembly language is not always
the best solution.  Far more important than which language you use is
selecting an appropriate algorithm.  The best sorting method I know is the
Quick Sort, and a well-written version of Quick Sort using BASIC will be
many times faster than an assembly language implementation of the Bubble
Sort.
   The main problem with the Bubble Sort is that the number of comparisons
required grows exponentially as the number of elements increases.  Since
each pass through the array exchanges only a few elements, many passes are
required before the entire array is sorted.  The Quick Sort was developed
by C.A.R. (Tony) Hoare, and is widely recognized as the fastest algorithm
available.  In some special cases, such as when the data is already sorted
or nearly sorted, the Quick Sort may be slightly slower than other methods. 
But in most situations, a Quick Sort is many times faster than any other
sorting algorithm.
   As with the Bubble Sort, there are many different variations on how a
Quick Sort may be coded.  (You may have noticed that the Bubble Sort shown
in Chapter 7 used a nested FOR/NEXT loop, while the one shown here uses a
FOR/NEXT loop within a DO/WHILE loop.)  A Quick Sort divides the array into
sections--sometimes called partitions--and then sorts each section
individually.  Many implementations therefore use recursion to invoke the
subprogram from within itself, as each new section is about to be sorted. 
However, recursive procedures in any language are notoriously slow, and
also consume stack memory at an alarming rate.
   The Quick Sort version presented here avoids recursion, and instead uses
a local array as a form of stack.  This array stores the upper and lower
bounds showing which section of the array is currently being considered. 
Another refinement I have added is to avoid making a copy of elements in
the array.  As a Quick Sort progresses, it examines one element selected
arbitrarily from the middle of the array, and compares it to the elements
that lie above and below it.  To avoid assigning a temporary copy this
version simply keeps track of the selected element number.
   When sorting numeric data, maintaining a copy of the element is
reasonable.  But when sorting strings--especially strings whose length is
not known ahead of time--the time and memory required to keep a copy can
become problematic.  For clarity, the generic Quick Sort shown below uses
the copy method.  Although this version is meant for sorting a single
precision array, it can easily be adapted to sort any type of data by
simply changing all instances of the "!" type declaration character.

'******** QSORT.BAS, Quick Sort algorithm demonstration

'Copyright (c) 1991 Ethan Winer

DEFINT A-Z
DECLARE SUB QSort (Array!(), StartEl, NumEls)

RANDOMIZE TIMER         'generate a new series each run

DIM Array!(1 TO 21)             'create an array
FOR X = 1 TO 21                 'fill with random numbers
  Array!(X) = RND(1) * 500      'between 0 and 500
NEXT

FirstEl = 6                     'sort starting here
NumEls = 10                     'sort this many elements

CLS
PRINT "Before Sorting:"; TAB(31); "After sorting:"
PRINT "==============="; TAB(31); "=============="

FOR X = 1 TO 21                 'show them before sorting
  IF X >= FirstEl AND X <= FirstEl + NumEls - 1 THEN
    PRINT "==>";
  END IF
  PRINT TAB(5); USING "###.##"; Array!(X)
NEXT

CALL QSort(Array!(), FirstEl, NumEls)

LOCATE 3
FOR X = 1 TO 21                 'print them after sorting
  LOCATE , 30
  IF X >= FirstEl AND X <= FirstEl + NumEls - 1 THEN
    PRINT "==>";                'point to sorted items
  END IF
  LOCATE , 35
  PRINT USING "###.##"; Array!(X)
NEXT

SUB QSort (Array!(), StartEl, NumEls) STATIC

REDIM QStack(NumEls \ 5 + 10)  'create a stack array

First = StartEl                'initialize work variables
Last = StartEl + NumEls - 1

DO
  DO
    Temp! = Array!((Last + First) \ 2)  'seek midpoint
    I = First
    J = Last

    DO     'reverse both < and > below to sort descending
      WHILE Array!(I) < Temp!
        I = I + 1
      WEND
      WHILE Array!(J) > Temp!
        J = J - 1
      WEND
      IF I > J THEN EXIT DO
      IF I < J THEN SWAP Array!(I), Array!(J)
      I = I + 1
      J = J - 1
    LOOP WHILE I <= J

    IF I < Last THEN
      QStack(StackPtr) = I              'Push I
      QStack(StackPtr + 1) = Last       'Push Last
      StackPtr = StackPtr + 2
    END IF

    Last = J
  LOOP WHILE First < Last

  IF StackPtr = 0 THEN EXIT DO          'Done
  StackPtr = StackPtr - 2
  First = QStack(StackPtr)              'Pop First
  Last = QStack(StackPtr + 1)           'Pop Last
LOOP

ERASE QStack               'delete the stack array

END SUB

Notice that I have designed this routine to allow sorting only a portion
of the array.  To sort the entire array you'd simply omit the StartEl and
NumEls parameters, and assign First and Last from the LBOUND and UBOUND
element numbers.  That is, you will change these:

   First = StartEl
and
   Last = StartEl + NumEls - 1

to these:

   First = LBOUND(Array!)
and
   Last = UBOUND(Array!)

As I mentioned earlier, the QStack array serves as a table of element
numbers that reflect which range of elements is currently being considered. 
You will need to dimension this array to one element for every five
elements in the primary array being sorted, plus a few extra for good
measure.  In this program I added ten elements, because one stack element
for every five main array elements is not enough for very small arrays. 
For data arrays that have a large amount of duplicated items, you will
probably need to increase the size of the stack array.
   Note that this ratio is not an absolute--the exact size of the stack
that is needed depends on how badly out of order the data is to begin with. 
Although it is possible that one stack element for every five in the main
array is insufficient in a given situation, I have never seen this formula
fail.  Because the stack is a dynamic integer array that is stored in far
memory, it will not impinge on near string memory.  If this routine were
designed using the normal recursive method, BASIC's stack would be used
which is in near memory.
   Each of the innermost DO loops searches the array for the first element
in each section about the midpoint that belongs in the other section.  If
the elements are indeed out of order (when I is less than J) the elements
are exchanged.  This incrementing and comparing continues until I and J
cross.  At that point, assuming the variable I has not exceeded the upper
limits of the current partition, the partition bounds are saved and Last
is assigned to the top of the next inner partition level.  When the entire
partition has been processed, the previous bounds are retrieved, but as a
new set of First and Last values.  This process continues until no more
partition boundaries are on the stack.  At that point the entire array is
sorted.
   On the accompanying disk you will find a program called SEEQSORT.BAS
that contains an enhanced version of the QSort demo and subprogram.  This
program lets you watch the progress of the comparisons and exchanges as
they are made, and actually see this complex algorithm operate.  Simply
load SEEQSORT.BAS into the BASIC editor and run it.  A constant named
Delay! is defined at the beginning of the program.  Increasing its value
makes the program run more slowly; decreasing it causes the program to run
faster.


AN ASSEMBLY LANGUAGE QUICK SORT

As fast as the BASIC QuickSort routine is, we can make it even faster. 
The listing below shows an assembly language version that is between ten
and twenty percent faster, depending on which compiler you are using and
if the BASIC PDS /fs (far strings) option is in effect.

;SORT.ASM - sorts an entire BASIC string array

.Model Medium, Basic
.Data
    S          DW 0
    F          DW 0
    L          DW 0
    I          DW 0
    J          DW 0
    MidPoint   DW 0

.Code
    Extrn B$SWSD:Proc   ;this swaps two strings
    Extrn B$SCMP:Proc   ;this compares two strings

Sort Proc Uses SI DI ES, Array:Word, Dir:Word

    Cld                 ;all fills and compares are forward
    Push DS             ;set ES = DS for string compares
    Pop  ES
    
    Xor  CX,CX          ;clear CX
    Mov  AX,7376h       ;load AL and AH with the opcodes
                        ;  Jae and Jbe in preparation for
                        ;  code self-modification
    Mov  BX,Dir         ;get the sorting direction
    Cmp  [BX],CX        ;is it zero (ascending sort)?
    Je   Ascending      ;yes, skip ahead
    Xchg AL,AH          ;no exchange the opcodes

Ascending:
    Mov  CS:[X1],AH     ;install correct comparison opcodes
    Mov  CS:[X2],AL     ;  based on the sort direction

    Mov  BX,Array       ;load the array descriptor address
    Mov  AX,[BX+0Eh]    ;save the number of elements
    Dec  AX             ;adjust the number to zero-based
    Jns  L0             ;at least 1 element, continue
    Jmp  L4             ;0 or less elements, get out now!

L0:
    Mov  BX,Array       ;reload array descriptor address
    Mov  BX,[BX]        ;Array$(LBOUND) descriptor address
    Mov  S,SP           ;StackPtr = 0 (normalized to SP)
    Mov  F,CX           ;F = 0
    Mov  L,AX           ;L = Size%

;----- calculate the value of MidPoint
L1:
    Mov  DI,L           ;MidPoint = (L + F) \ 2
    Add  DI,F
    Shr  DI,1
    Mov  MidPoint,DI

    Mov  AX,F           ;I = F
    Mov  I,AX

    Mov  AX,L           ;J = L
    Mov  J,AX

;----- calculate the offset into the descriptor table for Array$(MidPoint)
L1_2:

    Shl  DI,1           ;multiply MidPoint in DI times 4
    Shl  DI,1           ;now DI holds how far beyond Array$(Start)
                        ;  Array$(MidPoint)'s descriptor is
    Add  DI,BX          ;add the array base address to produce the final
                        ;  address for Array$(MidPoint)

;----- calculate descriptor offset for Array$(I)
L2:
    Mov  SI,I           ;put I into SI
    Shl  SI,1           ;as above
    Shl  SI,1           ;now SI holds how far beyond Array$(Start)
                        ;  Array$(I)'s descriptor is
    Add  SI,BX          ;add the base to produce the final descriptor
                        ;  address

    ;IF Array$(I) < Array$(MidPoint) THEN I = I + 1: GOTO L2
    Push BX             ;save BX because B$SCMP trashes it
    Push SI
    Push DI
    Call B$SCMP         ;do the compare
    Pop  BX             ;restore BX

X1 Label Byte           ;modify the code below to "Jbe" if descending sort
    Jae  L2_1           ;Array$(I) isn't less, continue on

    Inc  Word Ptr I     ;I = I + 1
    Jmp  Short L2       ;GOTO L2

;----- calculate descriptor offset for Array$(J)
L2_1:
    Mov  SI,J           ;put J into SI
    Shl  SI,1           ;as above
    Shl  SI,1           ;now SI holds how far beyond Array$(Start)
                        ;  Array$(J)'s descriptor is
    Add  SI,BX          ;add the base to produce the final descriptor
                        ;  address

    ;IF Array$(J) > Array$(MidPoint) THEN J = J - 1: GOTO L2.1
    Push BX             ;preserve BX
    Push SI
    Push DI
    Call B$SCMP         ;do the compare
    Pop  BX             ;restore BX

X2 Label Byte           ;modify the code below to "Jae" if descending sort
    Jbe  L2_2           ;Array$(J) isn't greater, continue on

    Dec  Word Ptr J     ;J = J - 1
    Jmp  Short L2_1     ;GOTO L2.1

L2_2:
    Mov  AX,I           ;IF I > J GOTO L3
    Cmp  AX,J
    Jg   L3             ;J is greater, go directly to L3
    Je   L2_3           ;they're the same, skip the swap

    ;Swap Array$(I), Array$(J)
    Mov  SI,I           ;put I into SI
    Mov  DI,J           ;put J into DI

    Cmp  SI,MidPoint    ;IF I = MidPoint THEN MidPoint = J
    Jne  No_Mid1        ;not equal, skip ahead
    Mov  MidPoint,DI    ;equal, assign MidPoint = J
    Jmp  Short No_Mid2  ;don't waste time comparing again

No_Mid1:
    Cmp  DI,MidPoint    ;IF J = MidPoint THEN MidPoint = I
    Jne  No_Mid2        ;not equal, skip ahead
    Mov  MidPoint,SI    ;equal, assign MidPoint = I

No_Mid2:
    Mov  SI,I           ;put I into SI
    Shl  SI,1           ;multiply times four for the
    Shl  SI,1           ;  for the descriptors
    Add  SI,BX          ;add address for first descriptor

    Mov  DI,J           ;do the same for J in DI
    Shl  DI,1
    Shl  DI,1
    Add  DI,BX

    Push BX             ;save BX because B$SWSD destroys it
    Call B$SWSD         ;and swap 'em good
    Pop  BX

L2_3:
    Inc  Word Ptr I     ;I = I + 1
    Dec  Word Ptr J     ;J = J - 1

    Mov  AX,I           ;IF I <= J GOTO L2
    Cmp  AX,J
    Jg   L3             ;it's greater, skip to L3
    Mov  DI,MidPoint    ;get MidPoint again
    Jmp  L1_2           ;go back to just before L2

L3:
    Mov  AX,I           ;IF I < L THEN PUSH I: PUSH L
    Cmp  AX,L
    Jnl  L3_1           ;it's not less, so skip Pushes

    Push I              ;Push I
    Push L              ;Push L

L3_1:
    Mov  AX,J           ;L = J
    Mov  L,AX

    Mov  AX,F           ;IF F < L GOTO L1
    Cmp  AX,L
    Jnl  L3_2           ;it's not less, jump ahead to L3_2
    Jmp  L1             ;it's less, go to L1

L3_2:
    Cmp  S,SP           ;IF S = 0 GOTO L4
    Je   L4

    Pop  L              ;Pop L
    Pop  F              ;Pop F
    Jmp  L1             ;GOTO L1

L4:
    Ret                 ;return to BASIC

Sort Endp
End

Besides being faster than the BASIC version, the assembly language Sort
routine is half the size.  This version also supports sorting either
forward or backward, but not just a portion of an array.  The general
syntax is:

   CALL Sort(Array$(), Direction)

Where Array$() is any variable-length string array, and Direction is 0 for
ascending, or any other value for descending.  Note that this routine calls
upon BASIC's internal services to perform the actual comparing and
swapping; therefore, the exact same code can be used with either QuickBASIC
or BASIC PDS.  Again, I refer you forward to Chapter 12 for an explanation
of the assembly language commands used in SORT.ASM.


SORTING ON MULTIPLE KEYS

In many situations, sorting based on one key is sufficient.  For example,
if you are sorting a mailing list to take advantage of bulk rates you must
sort all of the addresses in order by zip code.  When considering complex
data such as a TYPE variable, it is easy to sort the array based on one
component of each element.  The earlier Bubble Sort example showed how MID$
could be used to consider just a portion of each string, even though the
entire elements were exchanged.  Had that routine been designed to operate
on a TYPE array, the comparisons would have examined just one component,
but the SWAP statements would exchange entire elements:

   IF Array(X).ZipCode > Array(X + 1).ZipCode THEN
     SWAP Array(X), Array(X + 1)
   END IF

This way, each customer's last name, first name, street address, and so
forth remain connected to the zip codes that are being compared and
exchanged.
   There are several ways to sort on more than one key, and all are of
necessity more complex than simply sorting based on a single key.  One
example of a multi-key sort first puts all of the last names in order. 
Then within each group of identical last names the first names are sorted,
and within each group of identical last and first names further sorting is
performed on yet another key--perhaps Balance Due.  As you can see, this
requires you to sort based on differing types of data, and also to compare
ranges of elements for the subgroups that need further sorting.
   The biggest complication with this method is designing a calling syntax
that lets you specify all of the information.  A table array must be
established to hold the number of keys, the type of data in each key
(string, double precision, and so forth), and how many bytes into the TYPE
element each key portion begins.  Worse, you can't simply use the name of
a TYPE component in the comparisons inside the sort routine--which would
you use: Array(X).LastName, Array(X).FirstName, or Array(X).ZipCode? 
Therefore, a truly general multi-key sort must be called passing the
address where the array begins in memory, and a table of offsets beyond
that address where each component being considered is located.
   To avoid this added complexity I will instead show a different method
that has only a few minor restrictions, but is much easier to design and
understand.  This method requires you to position each TYPE component into
the key order you will sort on.  You will also need to store all numbers
that will be used for a sort key as ASCII digits.  To sort first on last
name, then first name, and then on balance due, the TYPE might be
structured as follows:


TYPE Customer
  LastName   AS STRING * 15
  FirstName  AS STRING * 15
  BalanceDue AS STRING * 9
  Street     AS STRING * 32
  City       AS STRING * 15
  State      AS STRING * 2
  ZipCode    AS STRING * 5
  AnyNumber  AS DOUBLE
END TYPE


In most cases the order in which each TYPE member is placed has no
consequence.  When you refer to TypeVar.LastName, BASIC doesn't care if
LastName is defined before or after FirstName in the TYPE structure. 
Either way it translates your reference to LastName into an address. 
Having to store numeric data as strings is a limitation, but this is needed
only for those TYPE fields that will be used as a sort key.
   The key to sorting on multiple items simultaneously is by treating the
contiguous fields as a single long field.  Since assignments to the string
portion of a TYPE variable are handled internally by BASIC's LSET routine,
the data in each element will be aligned such that subsequent fields can
be treated as an extension of the primary field.  Figure 8-2 below shows
five TYPE array elements in succession, as they would be viewed by a string
comparison routine.  This data is defined as a subset of the name and
address TYPE shown above, using just the first three fields.  Notice that
the balance due fields must be right-aligned (using RSET) for the numeric
values to be considered correctly.


Type.LastName  Type.FirstName Type.BalanceDue
===============---------------=========
Munro          Jay              8000.00
Smith          John              122.03
Johnson        Alfred          14537.89
Rasmussen      Peter             100.90
Hudson         Cindy              21.22
^              ^              ^
Field 1        Field 2        Field 3
starts here    starts here    starts here

Figure 8-2: Multiple contiguous fields in a TYPE can be treated as a single
long field.


Thus, the sort routine would be told to start at the first field, and
consider the strings to be 15 + 15 + 9 = 39 characters long.  This way all
three fields are compared at one time, and treated as a single entity. 
Additional fields can of course follow these, and they may be included in
the comparison or not at your option.
   The combination demonstration and subroutine below sorts such a TYPE
array on any number of keys using this method, and it has a few additional
features as well.  Besides letting you confine the sorting to just a
portion of the array, you may also specify how far into each element the
first key is located.  As long as the key fields are contiguous, they do
not have to begin at the start of each TYPE.  Therefore, you could sort
just on the first name field, or on any other field or group of fields.

'TYPESORT.BAS - performs a multi-key sort on TYPE arrays

'Copyright (c) 1991 Ethan Winer

DEFINT A-Z
DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
  BYVAL Adr2, NumBytes)
DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, BYVAL Adr2, _
  BYVAL Length)
DECLARE SUB TypeSort (Segment, Address, ElSize, Offset, KeySize, NumEls)

CONST NumEls% = 23              'this keeps it all on the screen

TYPE MyType
  LastName  AS STRING * 10
  FirstName AS STRING * 10
  Dollars   AS STRING * 6
  Cents     AS STRING * 2
END TYPE
REDIM Array(1 TO NumEls%) AS MyType

'---- Disable (REM out) all but one of the following blocks to test

Offset = 27                 'start sorting with Cents
ElSize = LEN(Array(1))      'the length of each element
KeySize = 2                 'sort on the Cents only

Offset = 21                 'start sorting with Dollars
ElSize = LEN(Array(1))      'the length of each element
KeySize = 8                 'sort Dollars and Cents only

Offset = 11                 'start sorting with FirstName
ElSize = LEN(Array(1))      'the length of each element
KeySize = 18                'sort FirstName through Cents

Offset = 1                  'start sorting with LastName
ElSize = LEN(Array(1))      'the length of each element
KeySize = ElSize            'sort based on all 4 fields

FOR X = 1 TO NumEls%        'build the array from DATA
  READ Array(X).LastName
  READ Array(X).FirstName
  READ Amount$              'format the amount into money
  Dot = INSTR(Amount$, ".")
  IF Dot THEN
    RSET Array(X).Dollars = LEFT$(Amount$, Dot - 1)
    Array(X).Cents = LEFT$(MID$(Amount$, Dot + 1) + "00", 2)
  ELSE
    RSET Array(X).Dollars = Amount$
    Array(X).Cents = "00"
  END IF
NEXT

Segment = VARSEG(Array(1))      'show where the array is
Address = VARPTR(Array(1))      '  located in memory
CALL TypeSort(Segment, Address, ElSize, Offset, KeySize, NumEls%)

CLS                             'display the results
FOR X = 1 TO NumEls%
  PRINT Array(X).LastName, Array(X).FirstName,
  PRINT Array(X).Dollars; "."; Array(X).Cents
NEXT

DATA Smith, John, 123.45
DATA Cramer, Phil, 11.51
DATA Hogan, Edward, 296.08
DATA Cramer, Phil, 112.01
DATA Malin, Donald, 13.45
DATA Cramer, Phil, 111.3
DATA Smith, Ralph, 123.22
DATA Smith, John, 112.01
DATA Hogan, Edward, 8999.04
DATA Hogan, Edward, 8999.05
DATA Smith, Bob, 123.45
DATA Cramer, Phil, 11.50
DATA Hogan, Edward, 296.88
DATA Malin, Donald, 13.01
DATA Cramer, Phil, 111.1
DATA Smith, Ralph, 123.07
DATA Smith, John, 112.01
DATA Hogan, Edward, 8999.33
DATA Hogan, Edward, 8999.17
DATA Hogan, Edward, 8999.24
DATA Smith, John, 123.05
DATA Cramer, David, 1908.80
DATA Cramer, Phil, 112
END

SUB TypeSort (Segment, Address, ElSize, Displace, KeySize, NumEls) STATIC

REDIM QStack(NumEls \ 5 + 10) 'create a stack array

First = 1                  'initialize working variables
Last = NumEls
Offset = Displace - 1      'decrement once now rather than
                           '  repeatedly later
DO
  DO
    Temp = (Last + First) \ 2   'seek midpoint
    I = First
    J = Last

    DO
      WHILE Compare3%(Segment, Address + Offset + (I - 1) * ElSize, Segment, _
        Address + Offset + (Temp-1) * ElSize, KeySize) = -1 '< 1 for descending
        I = I + 1
      WEND
      WHILE Compare3%(Segment, Address + Offset + (J - 1) * ElSize, Segment, _
        Address + Offset + (Temp-1)  * ElSize, KeySize) = 1 '< -1 for descending
        J = J - 1
      WEND
      IF I > J THEN EXIT DO
      IF I < J THEN
        CALL SwapMem(Segment, Address + (I - 1) * ElSize, Segment, _
          Address + (J - 1) * ElSize, ElSize)
        IF Temp = I THEN
          Temp = J
        ELSEIF Temp = J THEN
          Temp = I
        END IF
      END IF
      I = I + 1
      J = J - 1
    LOOP WHILE I <= J

    IF I < Last THEN
      QStack(StackPtr) = I              'Push I
      QStack(StackPtr + 1) = Last       'Push Last
      StackPtr = StackPtr + 2
    END IF

    Last = J
  LOOP WHILE First < Last

  IF StackPtr = 0 THEN EXIT DO          'Done
  StackPtr = StackPtr - 2
  First = QStack(StackPtr)              'Pop First
  Last = QStack(StackPtr + 1)           'Pop Last
LOOP

ERASE QStack                    'delete the stack array

END SUB

As you can see, this version of the Quick Sort subprogram is derived from
the one shown earlier.  The important difference is that all of the
incoming information is passed as segments, addresses, and bytes, rather
than using an explicit array name.  But before describing the inner details
of the subprogram itself, I'll address the demonstration portion and show
how the routine is set up and called.
   As with some of the other procedures on the disk that comes with this
book, you will extract the TypeSort subprogram and add it to your own
programs by loading it as a module, and then using the Move option of
BASIC's View Subs menu.  You can quickly access this menu by pressing F2,
and then use Alt-M to select Move.  Once this is done you will unload
TYPESORT.BAS using the Alt-F-U menu selection, and answer *No* when asked
if you want to save the modified file.  You could also copy the TypeSort
subprogram into a separate file, and then load that file as a module in
each program that needs it.
   Although the example TYPE definition here shows only four components,
you may of course use any TYPE structure.  TypeSort expects six parameters
to tell it where in memory the array is located, how far into each element
the comparison routines are to begin, the total length of each element, the
length of the key fields, and the number of elements to sort.
   After defining MyType, the setup portion of TYPESORT.BAS establishes the
offset, element size, and key size parameters.  As you can see, four
different sample setups are provided, and you should add remarking
apostrophes to all but one of them.  If the program is left as is, the last
setup values will take precedence.
   The next section reads sample names, addresses and dollar amounts from
DATA statements, and formats the dollar amounts as described earlier.  The
dollar portion of the amounts are right justified into the Dollars field
of each element, and the Cents portion is padded with trailing zeros as
necessary to provide a dollars and cents format.  This way, the value 12.3
will be assigned as 12.30, and 123 will be formatted to 123.00 which gives
the expected appearance.
   The final setup step is to determine where the array begins in memory. 
Since you specify the starting segment and address, it is simple to begin
sorting at any array element.  For example, to sort elements 100 through
200--even if the array is larger than that--you'd use VARSEG(Array(100))
and VARPTR(Array(100) instead of element 1 as shown in this example.
   In addition to the starting segment and address of the array, TypeSort
also requires you to tell it how many elements to consider.  If you are
sorting the entire array and the array starts with element 1, this will
simply be the UBOUND of the array.  If you are sorting just a portion of
the array then you give it only the number of elements to be sorted.  So
to sort elements 100 through 200, the number of elements will be 101.  A
general formula you can use for calculating this based on element numbers
is NumElements = LastElement - FirstElement + 1.
   Now let's consider the TypeSort subprogram itself.  Since it is more
like the earlier QSort program than different, I will cover only the
differences here.  In fact, the primary difference is in the way
comparisons and exchanges are handled.  The Compare3 function introduced
earlier is used to compare the array elements with the midpoint.  Although
QSort made a temporary copy of the midpoint element, that would be
difficult to do here.  Since the routine is designed to work with any type
of data--and the size of each element can vary depending on the TYPE
structure--it is impractical to make a copy.
   While SPACE$ could be used to claim a block of memory into which the
midpoint element is copied, there's a much better way:  the Temp variable
is used to remember the element number itself.  The only complication is
that once elements I and J are swapped, Temp must be reassigned if it was
equal to either of them.  (This happens just below the call to SwapMem.) 
But the simple integer IF test and assignment required adds far less code
and is much faster than making a copy of the element.
   TypeSort is designed to sort the array in ascending order, and comments
in the code show how to change it to sort descending instead.  If you
prefer to have one subprogram that can do both, you should add an extra
parameter, perhaps called Direction.  Near the beginning of the routine
before the initial outer DO you would add this:

   IF Direction = 0 THEN     'sort ascending
     ICompare = -1
     JCompare = 1
   ELSE                      'sort descending
     ICompare = 1
     JCompare = -1
   END IF

Then, where the results from Compare3 are compared to -1 and 1 replace
those comparisons (at the end of each WHILE line) to instead use ICompare
and JCompare:

   WHILE Compare3%(...) = ICompare
     I = I + 1
   WEND
   WHILE Compare3%(...) = JCompare
     J = J - 1
   WEND

This way, you are using variables to establish the sorting direction, and
those variables can be set either way each time TypeSort is called.
   The last major difference is that elements are exchanged using the
SwapMem routine rather than BASIC's SWAP statement.  While it is possible
to call SWAP by aliasing its name as shown in Chapter 5, it was frankly
simpler to write a new routine for this purpose.  Further, BASIC's SWAP is
slower than SwapMem because it must be able to handle variables of
different lengths, and also exchange fixed-length and conventional strings. 
SwapMem is extremely simple, and it works very quickly.
   As I stated earlier, the only way to write a truly generic sort routine
is by passing segments and addresses and bytes, instead of array names. 
Although it would be great if BASIC could let you declare a subprogram or
function using the AS ANY option to allow any type of data, that simply
wouldn't work.  As BASIC compiles your program, it needs to know the size
and type of each parameter.  When you reference TypeVar.LastName, BASIC
knows where within TypeVar the LastName portion begins, and uses that in
its address calculations.  It is not possible to avoid this limitation
other than by using addresses as is done here.
   Indeed, this is the stuff that C and assembly language programs are made
of.  In these languages--especially assembly language--integer pointer
variables are used extensively to show where data is located and how long
it is.  However, the formulas used within the Compare3 and SwapMem function
calls are not at all difficult to understand.
   The formula Address + Offset - (I - 1) * ElSize indicates where the key
field of element I begins.  Address holds the address of the beginning of
the first element, and Offset is added to identify the start of the first
sort key.  (I - 1) is used instead of I because addresses are always zero-
based.  That is, the first element in the array from TypeSort's perspective
is element 0, even though the calling program considers it to be element
1.  Finally, the element number is multiplied times the length of each
element, to determine the value that must be added to the starting address
and offset to obtain the final address for the data in element I.  Please
understand that calculations such as these are what the compiler must do
each time you access an array element.
   Note that if you call TypeSort incorrectly or give it illegal element
numbers, you will not receive a "Subscript out of range" error from BASIC. 
Rather, you will surely crash your PC and have to reboot.  This is the
danger--and fun--of manipulating pointers directly.
   As I stated earlier, the SwapMem routine that does the actual exchanging
of elements is very simple, and it merely takes a byte from one element and
exchanges it with the corresponding byte in the other.  This task is
greatly simplified by the use of the XCHG assembly language command, which
is similar to BASIC's SWAP statement.  Although XCHG cannot swap a word in
memory with another word in memory, it can exchange memory with a register. 
SwapMem is shown in the listing below.

;SWAPMEM.ASM, swaps two sections of memory

.Model Medium, Basic
.Code

SwapMem Proc Uses SI DI DS ES, Var1:DWord, Var2:DWord, NumBytes:Word

    Lds  SI,Var1      ;get the segmented address of the
                      ;  first variable
    Les  DI,Var2      ;and for the second variable
    Mov  CX,NumBytes  ;get the number of bytes to exchange
    Jcxz Exit         ;we can't swap zero bytes!

DoSwap:
    Mov  AL,ES:[DI]   ;get a byte from the second variable
    Xchg AL,[SI]      ;swap it with the first variable
    Stosb             ;complete the swap and increment DI
    Inc  SI           ;point to the next source byte
    Loop DoSwap       ;continue until done

Exit:
    Ret               ;return to BASIC

SwapMem Endp
End

INDEXED SORTING ON MULTIPLE KEYS

Earlier I showed how to modify the simple Bubble Sort routine to sort a
parallel index array instead of the primary array.  One important reason
you might want to do that is to allow access to the primary array in both
its original and sorted order.  Another reason, and one we will get to
shortly, is to facilitate sorting disk files.  Although a routine to sort
the records in a file could swap the actual data, it takes a long time to
read and write that much data on disk.  Further, each time you wanted to
access the data sorted on a different key, the entire file would need to
be sorted again.
   A much better solution is to create one or more sorted lists of record
numbers, and store those on disk each in a separate file.  This lets you
access the data sorted by name, or by zip code, or by any other field,
without ever changing the actual file.  The TypeISort subprogram below is
adapted from TypeSort, and it sorts an index array that holds the element
numbers of a TYPE array.

'TYPISORT.BAS, indexed multi-key sort for TYPE arrays

DEFINT A-Z

DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
  BYVAL Adr2, NumBytes)
DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
  BYVAL Adr2, BYVAL Length)
DECLARE SUB TypeISort (Segment, Address, ElSize, Offset, KeySize, _
  NumEls, Index())

CONST NumEls% = 23              'this keeps it all on the screen

TYPE MyType
  LastName  AS STRING * 10
  FirstName AS STRING * 10
  Dollars   AS STRING * 6
  Cents     AS STRING * 2
END TYPE
REDIM Array(1 TO NumEls%) AS MyType
REDIM Index(1 TO NumEls%)   'create the index array

Offset = 1                  'start sorting with LastName
ElSize = LEN(Array(1))      'the length of each element
KeySize = ElSize            'sort based on all 4 fields

FOR X = 1 TO NumEls%        'build the array from DATA
  READ Array(X).LastName
  READ Array(X).FirstName
  READ Amount$            
   ...                      'this continues as already
   ...                      '  shown in TypeSort
NEXT

FOR X = 1 TO NumEls%            'initialize the index
  Index(X) = X - 1              'but starting with 0
NEXT

Segment = VARSEG(Array(1))      'show where the array is
Address = VARPTR(Array(1))      '  located in memory
CALL TypeISort(Segment, Address, ElSize, Offset, KeySize, NumEls%, Index())

CLS                             'display the results
FOR X = 1 TO NumEls%            '+ 1 adjusts to one-based
  PRINT Array(Index(X) + 1).LastName,
  PRINT Array(Index(X) + 1).FirstName,
  PRINT Array(Index(X) + 1).Dollars; ".";
  PRINT Array(Index(X) + 1).Cents
NEXT

DATA Smith, John, 123.45        'this continues as already
  ...                           '  shown in TypeSort
  ...

END

SUB TypeISort (Segment, Address, ElSize, Displace, KeySize, NumEls, _
  Index()) STATIC

REDIM QStack(NumEls \ 5 + 10) 'create a stack

First = 1                     'initialize working variables
Last = NumEls
Offset = Displace - 1         'make zero-based now for speed later

DO
  DO
    Temp = (Last + First) \ 2 'seek midpoint
    I = First
    J = Last

    DO  'change -1 to 1 and 1 to -1 to sort descending
      WHILE Compare3%(Segment, Address + Offset + (Index(I) * ElSize), _
        Segment, Address + Offset + (Index(Temp) * ElSize), KeySize) = -1
        I = I + 1
      WEND
      WHILE Compare3%(Segment, Address + Offset + (Index(J) * ElSize), _
        Segment, Address + Offset + (Index(Temp) * ElSize), KeySize) = 1
        J = J - 1
      WEND
      IF I > J THEN EXIT DO
      IF I < J THEN
        SWAP Index(I), Index(J)
        IF Temp = I THEN
          Temp = J
        ELSEIF Temp = J THEN
          Temp = I
        END IF
      END IF
      I = I + 1
      J = J - 1
    LOOP WHILE I <= J

    IF I < Last THEN
      QStack(StackPtr) = I              'Push I
      QStack(StackPtr + 1) = Last       'Push Last
      StackPtr = StackPtr + 2
    END IF

    Last = J
  LOOP WHILE First < Last

  IF StackPtr = 0 THEN EXIT DO          'Done
  StackPtr = StackPtr - 2
  First = QStack(StackPtr)              'Pop First
  Last = QStack(StackPtr + 1)           'Pop Last
LOOP

ERASE QStack                    'delete the stack array

END SUB

As with TypeSort, TypeISort is entirely pointer based so it can be used
with any type of data and it can sort multiple contiguous keys.  The only
real difference is the addition of the Index() array parameter, and the
extra level of indirection needed to access the index array each time a
comparison is made.  Also, when a swap is required, only the integer index
elements are exchanged, which simplifies the code and reduces its size. 
Like TypeSort, you can change the sort direction by reversing the -1 and
1 values used with Compare3, or add a Direction parameter to the list and
modify the code to use that.
   As with BubbleISort, the index array is initialized to increasing values
by the calling program; however, here the first element is set to hold a
value of 0 instead of 1.  This reduces the calculations needed within the
routine each time an address must be obtained.  Therefore, when TypeISort
returns, the caller must add 1 to the element number held in each index
element.  This is shown within the FOR/NEXT loop that displays the sorted
results.


SORTING FILES

With the development of TypeISort complete, we can now use that routine
to sort disk files.  The sorting strategy will be to determine how many
records are in the file, to determine how many separate passes are needed
to process the entire file.  TypeISort and TypeSort are restricted to
working with arrays no larger than 64K (32K in the editing environment),
so there is a limit as to how much data may be loaded into memory at one
time.  These sort routines can accommodate more data when compiled because
address calculations that result in values larger than 32767 cause an
overflow error in the QB editor.  This overflow is in fact harmless, and
is ignored in a compiled program unless you use the /d switch.
   Although the routines could be modified to perform segment and address
arithmetic to accommodate larger arrays, that still wouldn't solve the
problem of having more records than can fit in memory at once.  Therefore,
separate passes must be used to sort the file contents in sections, with
each pass writing a temporary index file to disk.  A final merge pass then
reads each index to determine which pieces fits where, and then writes the
final index file.  The program FILESORT.BAS below incorporates all of the
sorting techniques shown so far, and includes a few custom BASIC routines
to improve its performance.

'FILESORT.BAS, indexed multi-key random access file sort

DEFINT A-Z

DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
  BYVAL Adr2, NumBytes)
DECLARE FUNCTION Exist% (FileSpec$)
DECLARE SUB DOSInt (Registers AS ANY)
DECLARE SUB FileSort (FileName$, NDXName$, RecLength, Offset, KeySize)
DECLARE SUB LoadFile (FileNum, Segment, Address, Bytes&)
DECLARE SUB SaveFile (FileNum, Segment, Address, Bytes&)
DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, BYVAL Adr2, _
  BYVAL Length)
DECLARE SUB TypeISort (Segment, Address, ElSize, Offset, KeySize, _
  NumEls, Index())

RANDOMIZE TIMER                 'create new data each run
DEF FnRand% = INT(RND * 10 + 1) 'returns RND from 1 to 10

TYPE RegType                    'used by DOSInt
  AX AS INTEGER
  BX AS INTEGER
  CX AS INTEGER
  DX AS INTEGER
  BP AS INTEGER
  SI AS INTEGER
  DI AS INTEGER
  FL AS INTEGER
  DS AS INTEGER
  ES AS INTEGER
END TYPE

DIM SHARED Registers AS RegType 'share among all subs
REDIM LastNames$(1 TO 10)       'we'll select names at
REDIM FirstNames$(1 TO 10)      '  random from these

NumRecords = 2988               'how many test records to use
FileName$ = "TEST.DAT"          'really original, eh?
NDXName$ = "TEST.NDX"           'this is the index file name

TYPE RecType
  LastName  AS STRING * 11
  FirstName AS STRING * 10
  Dollars   AS STRING * 6
  Cents     AS STRING * 2
  AnyNumber AS LONG         'this shows that only key
  OtherNum  AS LONG         '  information must be ASCII
END TYPE

FOR X = 1 TO 10             'read the possible last names
  READ LastNames$(X)
NEXT

FOR X = 1 TO 10             'and the possible first names
  READ FirstNames$(X)
NEXT

DIM RecordVar AS RecType    'to create the sample file
RecLength = LEN(RecordVar)  'the length of each record
CLS
PRINT "Creating a test file..."

IF Exist%(FileName$) THEN   'if there's an existing file
  KILL FileName$            'kill the old data from prior
END IF                      '  runs to start fresh

IF Exist%(NDXName$) THEN    'same for any old index file
  KILL NDXName$
END IF


'---- Create some test data and write it to the file
OPEN FileName$ FOR RANDOM AS #1 LEN = RecLength
  FOR X = 1 TO NumRecords
    RecordVar.LastName = LastNames$(FnRand%)
    RecordVar.FirstName = FirstNames$(FnRand%)
    Amount$ = STR$(RND * 10000)
    Dot = INSTR(Amount$, ".")
    IF Dot THEN
      RSET RecordVar.Dollars = LEFT$(Amount$, Dot - 1)
      RecordVar.Cents = LEFT$(MID$(Amount$, Dot + 1) + "00", 2)
    ELSE
      RSET RecordVar.Dollars = Amount$
      RecordVar.Cents = "00"
    END IF
    RecordVar.AnyNumber = X
    PUT #1, , RecordVar
  NEXT
CLOSE

'----- Created a sorted index based on the main data file
Offset = 1                  'start sorting with LastName
KeySize = 29                'sort based on first 4 fields
PRINT "Sorting..."
CALL FileSort(FileName$, NDXName$, RecLength, Offset, KeySize)


'----- Display the results
CLS
VIEW PRINT 1 TO 24
LOCATE 25, 1
COLOR 15
PRINT "Press any key to pause/resume";
COLOR 7
LOCATE 1, 1

OPEN FileName$ FOR RANDOM AS #1 LEN = RecLength
OPEN NDXName$ FOR BINARY AS #2
  FOR X = 1 TO NumRecords
    GET #2, , ThisRecord            'get next rec. number
    GET #1, ThisRecord, RecordVar   'then the actual data

    PRINT RecordVar.LastName;       'print each field
    PRINT RecordVar.FirstName;
    PRINT RecordVar.Dollars; ".";
    PRINT RecordVar.Cents

    IF LEN(INKEY$) THEN             'pause on a keypress
      WHILE LEN(INKEY$) = 0: WEND
    END IF
  NEXT
CLOSE

VIEW PRINT 1 TO 24                  'restore the screen
END

DATA Smith, Cramer, Malin, Munro, Passarelli
DATA Bly, Osborn, Pagliaro, Garcia, Winer

DATA John, Phil, Paul, Anne, Jacki
DATA Patricia, Ethan, Donald, Tami, Elli
END


FUNCTION Exist% (Spec$) STATIC  'reports if a file exists

DIM DTA AS STRING * 44          'the work area for DOS
DIM LocalSpec AS STRING * 60    'guarantee the spec is in
LocalSpec$ = Spec$ + CHR$(0)    '  DGROUP for BASIC PDS

Exist% = -1                     'assume true for now

Registers.AX = &H1A00           'assign DTA service
Registers.DX = VARPTR(DTA)      'show DOS where to place it
Registers.DS = VARSEG(DTA)
CALL DOSInt(Registers)

Registers.AX = &H4E00           'find first matching file
Registers.CX = 39               'any file attribute okay
Registers.DX = VARPTR(LocalSpec)
Registers.DS = VARSEG(LocalSpec)
CALL DOSInt(Registers)          'see if there's a match

IF Registers.FL AND 1 THEN      'if the Carry flag is set
  Exist% = 0                    '  there were no matches
END IF

END FUNCTION


SUB FileSort (FileName$, NDXName$, RecLength, Displace, KeySize) STATIC

CONST BufSize% = 32767  'holds the data being sorted
Offset = Displace - 1   'make zero-based for speed later

'----- Open the main data file
FileNum = FREEFILE
OPEN FileName$ FOR BINARY AS #FileNum

'----- Calculate the important values we'll need
NumRecords = LOF(FileNum) \ RecLength
RecsPerPass = BufSize% \ RecLength
IF RecsPerPass > NumRecords THEN RecsPerPass = NumRecords

NumPasses = (NumRecords \ RecsPerPass) - ((NumRecords MOD RecsPerPass) _
  <> 0)
IF NumPasses = 1 THEN
  RecsLastPass = RecsPerPass
ELSE
  RecsLastPass = NumRecords MOD RecsPerPass
END IF

'----- Create the buffer and index sorting arrays
REDIM Buffer(1 TO 1) AS STRING * BufSize
REDIM Index(1 TO RecsPerPass)
IndexAdjust = 1


'----- Process all of the records in manageable groups
FOR X = 1 TO NumPasses

  IF X < NumPasses THEN         'if not the last pass
    RecsThisPass = RecsPerPass  'do the full complement
  ELSE                          'the last pass may have
    RecsThisPass = RecsLastPass '  fewer records to do
  END IF

  FOR Y = 1 TO RecsThisPass     'initialize the index
    Index(Y) = Y - 1            'starting with value of 0
  NEXT

  '----- Load a portion of the main data file
  Segment = VARSEG(Buffer(1))   'show where the buffer is
  CALL LoadFile(FileNum, Segment, Zero, RecsThisPass * CLNG(RecLength))
  CALL TypeISort(Segment, Zero, RecLength, Displace, KeySize, _
    RecsThisPass, Index())

  '----- Adjust the zero-based index to record numbers
  FOR Y = 1 TO RecsThisPass
    Index(Y) = Index(Y) + IndexAdjust
  NEXT

  '----- Save the index file for this pass
  TempNum = FREEFILE
  OPEN "$$PASS." + LTRIM$(STR$(X)) FOR OUTPUT AS #TempNum
  CALL SaveFile(TempNum, VARSEG(Index(1)), Zero, RecsThisPass * 2&)
  CLOSE #TempNum

  '----- The next group of record numbers start this much higher
  IndexAdjust = IndexAdjust + RecsThisPass

NEXT

ERASE Buffer, Index             'free up the memory


'----- Do a final merge pass if necessary
IF NumPasses > 1 THEN

  NDXNumber = FREEFILE
  OPEN NDXName$ FOR BINARY AS #NDXNumber
  REDIM FileNums(NumPasses)        'this holds the file numbers
  REDIM RecordNums(NumPasses)      'this holds record numbers

  REDIM MainRec$(1 TO NumPasses)   'holds main record data
  REDIM Remaining(1 TO NumPasses)  'tracks index files

  '----- Open the files and seed the first round of data
  FOR X = 1 TO NumPasses
    FileNums(X) = FREEFILE
    OPEN "$$PASS." + LTRIM$(STR$(X)) FOR BINARY AS #FileNums(X)
    Remaining(X) = LOF(FileNums(X))   'this is what remains
    MainRec$(X) = SPACE$(RecLength)   'holds main data file

    GET #FileNums(X), , RecordNums(X)     'get the next record number
    RecOffset& = (RecordNums(X) - 1) * CLNG(RecLength) + 1
    GET #FileNum, RecOffset&, MainRec$(X) 'then get the data
  NEXT

  FOR X = 1 TO NumRecords

    Lowest = 1               'assume this is the lowest data in the group
    WHILE Remaining(Lowest) = 0 'Lowest can't refer to a dead index
      Lowest = Lowest + 1       'so seek to the next higher active index
    WEND

    FOR Y = 2 TO NumPasses      'now seek out the truly lowest element
      IF Remaining(Y) THEN      'consider only active indexes
        IF Compare3%(SSEG(MainRec$(Y)), _    '<-- use VARSEG with QB
          SADD(MainRec$(Y)) + Offset,   _
          SSEG(MainRec$(Lowest)),       _    '<-- use VARSEG with QB
          SADD(MainRec$(Lowest)) + Offset, KeySize) = -1 THEN
           Lowest = Y
        END IF
      END IF
    NEXT

    PUT #NDXNumber, , RecordNums(Lowest)     'write the main index
    Remaining(Lowest) = Remaining(Lowest) - 2
    IF Remaining(Lowest) THEN                'if the index is still active
      GET #FileNums(Lowest), , RecordNums(Lowest)
      RecOffset& = (RecordNums(Lowest) - 1) * CLNG(RecLength) + 1
      GET #FileNum, RecOffset&, MainRec$(Lowest)
    END IF

  NEXT

ELSE
  '----- Only one pass was needed so simply rename the index file
  NAME "$$PASS.1" AS NDXName$
END IF

CLOSE                       'close all open files

IF Exist%("$$PASS.*") THEN  'ensure there's a file to kill
  KILL "$$PASS.*"           'kill the work files
END IF

ERASE FileNums, RecordNums  'erase the work arrays
ERASE MainRec$, Remaining

END SUB


SUB LoadFile (FileNum, Segment, Address, Bytes&) STATIC
  IF Bytes& > 32767 THEN Bytes& = Bytes& - 65536
  Registers.AX = &H3F00         'read from file service
  Registers.BX = FILEATTR(FileNum, 2) 'get the DOS handle
  Registers.CX = Bytes&         'how many bytes to load
  Registers.DX = Address        'and at what address
  Registers.DS = Segment        'and at what segment
  CALL DOSInt(Registers)
END SUB


SUB SaveFile (FileNum, Segment, Address, Bytes&) STATIC
  IF Bytes& > 32767 THEN Bytes& = Bytes& - 65536
  Registers.AX = &H4000         'write to file service
  Registers.BX = FILEATTR(FileNum, 2) 'get the DOS handle
  Registers.CX = Bytes&         'how many bytes to load
  Registers.DX = Address        'and at what address
  Registers.DS = Segment        'and at what segment
  CALL DOSInt(Registers)
END SUB


SUB TypeISort (....) STATIC     'as shown in TYPISORT.BAS

END SUB

FILESORT.BAS begins by defining a function that returns a random number
between 1 and 10.  Although the earlier sort demonstrations simply read the
test data from DATA statements, that is impractical when creating thousands
of records.  Instead, two arrays are filled--one with ten last names and
another with ten first names--and these names are drawn from at random.
   The Registers TYPE variable that is defined is used by three of the
supporting routines in this program.  RegType is normally associated with
CALL Interrupt and InterruptX, but I have written a small-code replacement
to mimic InterruptX that works with DOS Interrupt &H21 only.  DOSInt
accepts just a single Registers argument, instead of the three parameters
that BASIC's Interrupt and InterruptX require.  Besides adding less code
each time it is used, the routine itself is smaller and simpler than
InterruptX.
   The remainder of the demonstration program should be easy to follow, so
I won't belabor its operation; the real action is in the FileSort
subprogram.
   Like TypeSort and TypeISort, FileSort is entirely pointer based, to
accommodate TYPE elements of any size and structure.  You provide the name
of the main data file to be sorted, the name of an index file to create,
and the length and offset of the keys within the disk records.  The
Displace parameter tells how far into the TYPE structure the key
information is located.  When calling TypeISort this value is should be
one-based, but in the final merge pass where Compare3 is used, a zero-based
number is required.  Therefore, a copy is made (Offset = Displace - 1) near
the beginning of the routine.  This way, both are available quickly without
having to calculate - 1 repeatedly slowing its operation.
   The initial steps FileSort performs are to determine how many records
are in the data file, and from that how many records can fit into memory
at one time.  Once these are known, the number of passes necessary can be
easily calculated.  An extra step is needed to ensure that RecsPerPass is
not greater than the number of records in the file.  Just because 200
records can fit into memory at once doesn't mean there are that many
records.  In most cases where multiple passes are needed the last pass will
process fewer records than the others.  If there are, say, 700 records and
each pass can sort 300, the last pass will sort only 100 records.
   Once the pass information is determined, a block of memory is created
to hold each portion of the file for sorting.  This is the purpose of the
Buffer array.  REDIM is used to create a 32K chunk of memory that doesn't
impinge on available string space.
   For each pass that is needed, the number of records in the current pass
is determined and the index array is initialized to increasing values. 
Then, a portion of the main data file is read using the LoadFile
subprogram.  BASIC does not allow you to read records from a random access
file directly into a buffer specified by its address.  And even if it did,
we can load data much faster than pure BASIC by reading a number of records
all at once.
   Once the current block of records has been loaded, TypeISort is called
to sort the index array.  The index array is also saved very quickly using
SaveFile, which is the compliment to LoadFile.  A unique name is given to
each temporary index file such that the first one is named $$PASS.1, the
second $$PASS.2, and so forth.  By using dollar signs in the name it is
unlikely that the routine will overwrite an existing file from another
application.  Of course, you may change the names to anything else if you
prefer.
   Notice the extra step that manipulates the IndexAdjust variable.  This
adjustment is needed because each sort pass returns the index array holding
record numbers starting at 0.  The first time through, 1 must be added to
each element to reflect BASIC's use of record numbers that start at 1.  If
the first pass sorts, say, 250 records, then the index values 1 through 250
are saved to disk.  But the second pass is processing records 251 through
500, so an adjustment value of 251 must be added to each element prior to
writing it to disk.
   If the data file is small and only one pass was needed, the $$PASS.1
file is simply renamed to whatever the caller specified.  Otherwise, a
merge pass is needed to determine which record number is the next in
sequence based on the results of each pass.  Believe it or not, this is the
trickiest portion of the entire program.  For the sake of discussion, we'll
assume that four passes were required to sort the file.
   Each of the four index files contains a sequence of record numbers, and
all of the records within that sequence are in sorted order.  However,
there is no relationship between the data records identified in one index
file and those in another.  Thus, each index file and corresponding data
record must be read in turn.  A FOR/NEXT loop then compares each of the
four records, to see which is truly next in the final sequence.  The
complication arises as the merge nears completion, because some of the
indexes will have become exhausted.  This possibility is handled by the
Remaining array.
   Elements in the Remaining array are initialized to the length of each
index file as the indexes are opened.  Then, as each index entry is read
from disk, the corresponding element is decremented by two to show that
another record number was read.  Therefore, the current Remaining element
must be checked to see if that index has been exhausted.  Otherwise, data
that was already processed might be considered in the merge comparisons.
   The final steps are to close all the open files, delete the temporary
index files, and erase the work arrays to free the memory they occupied.
   One important point to observe is the use of SSEG to show Compare3 where
the MainRec$ elements are located.  SSEG is for BASIC 7 only; if you are
using QuickBASIC you must change SSEG to VARSEG.  SSEG can be used with
either near or far strings in BASIC 7, but VARSEG works with near strings
only.  SSEG is used as the default, so an error will be reported if you are
using QuickBASIC.  The cursor will then be placed near the comment in the
program that shows the appropriate correction.


SEARCHING FUNDAMENTALS
======================

As with sorting, searching data effectively also requires that you select
an appropriate algorithm.  There are many ways to search data, and we will
look at several methods here.  The easiest to understand is a linear
search, which simply examines each item in sequence until a match is found:


FoundAt = 0                   'assume no match

FOR X = 1 TO NumElements      'search all elements
  IF Array$(X) = Sought$ THEN
    FoundAt = X               'remember where it is
    EXIT FOR                  'no need to continue
  END IF
NEXT

IF FoundAt THEN               'if it was found
  PRINT "Found at element"; FoundAt
ELSE
  PRINT "Not found"           'otherwise
END IF


For small arrays a linear search is effective and usually fast enough. 
Also, integer and long integer arrays can be searched reasonably quickly
even if there are many elements.  But with string data, as the number of
elements that must be searched increases, the search time can quickly
become unacceptable.  This is particularly true when additional features
are required such as searching without regard to capitalization or
comparing only a portion of each element using MID$.  Indeed, many of the
same techniques that enhance a sort routine can also be employed when
searching.
   To search ignoring capitalization you would first capitalize Sought$
outside of the loop, and then use UCASE$ with each element in the
comparisons.  Using UCASE$(Sought$) repeatedly within the loop is both
wasteful and unnecessary:

   Sought$ = UCASE$(Sought$)
    .
    .
   IF UCASE$(Array$(X)) = Sought$ THEN

Likewise, comparing only a portion of each string will require MID$ with
each comparison, after using MID$ initially to extract what is needed from
Sought$:

   Sought$ = MID$(Sought$, 12, 6)
    .
    .
   IF MID$(Array$(X), 12, 6) = Sought$ THEN

And again, as with sorting, these changes may be combined in a variety of
ways.  You could even use INSTR to see if the string being searched for
is within the array, when an exact match is not needed:

   IF INSTR(UCASE$(Array$(X)), Sought$) THEN

However, each additional BASIC function you use will make the searching
slower and slower.  Although BASIC's INSTR is very fast, adding UCASE$ to
each comparison as shown above slows the overall process considerably.
   There are three primary ways that searching can be speeded up.  One is
to apply simple improvements based on understanding how BASIC works, and
knowing which commands are fastest.  The other is to select a better
algorithm.  The third is to translate selected portions of the search
routine into assembly language.  I will use all three of these techniques
here, starting with enhancements to the linear search, and culminating with
a very fast binary search for use with sorted data.
   One of the slowest operations that BASIC performs is comparing strings. 
For each string, its descriptor address must be loaded and passed to the
comparison routine.  That routine must then obtain the actual data address,
and examine each byte in both strings until one of the characters is
different, or it determines that both strings are the same.  As I mentioned
earlier, if one or both of the strings are fixed-length, then copies also
must be made before the comparison can be performed.
   There is another service that the string comparison routine must
perform, which is probably not obvious to most programmers and which also
impacts its speed.  BASIC frequently creates and then deletes temporary
strings without your knowing it.  One example is the copy it makes of
fixed-length strings before comparing them.  But there are other, more
subtle situations in which this can happen.
   For example, when you use IF X$ + Y$ > Z$ BASIC must create a temporary
string comprised of X$ + Y$, and then pass that to the comparison routine. 
Therefore, that routine is also responsible for determining if the incoming
string is a temporary copy, and deleting it if so.  In fact, all of BASIC's
internal routines that accept string arguments are required to do this.
   Therefore, one good way to speed searching of conventional (not fixed-
length) string arrays is to first compare the lengths.  Since strings whose
lengths are different can't possibly be the same, this will quickly weed
those out.  BASIC's LEN function is much faster than its string compare
routine, and it offers a simple but effective opportunity to speed things
up.  LEN is made even faster because it requires only a single argument,
as opposed to the two required for the comparison routine.


SLen = LEN(Sought$)       'do this once outside the loop
FOR X = 1 TO NumElements
  IF LEN(Array$(X)) = SLen THEN   'maybe...
    IF Array$(X) = Sought$ THEN   'found it!
      FoundAt = X
      EXIT FOR
    END IF
  END IF
NEXT


Similarly, if the first characters are not the same then the strings can't
match either.  Like LEN, BASIC's ASC is much faster than the full string
comparison routine, and it too can improve search time by eliminating
elements that can't possibly match.  Depending on the type and distribution
of the data in the array, using both LEN and ASCII can result in a very
fast linear search:


SLen = LEN(Sought$)
SAsc = ASC(Sought$)
FOR X = 1 TO NumElements
  IF LEN(Array$(X)) = SLen THEN
    IF ASC(Array$(X)) = SAsc THEN
      IF Array$(X) = Sought$ THEN
        ...
      END IF
    END IF
  END IF
NEXT


Notice that the LEN test must always be before the ASC test, to avoid an
"Illegal function call" error if the array element is a null string.  If
all or most of the strings are the same length, then LEN will not be
helpful, and ASC should be used alone.
   As I mentioned before, when comparing fixed-length string arrays BASIC
makes a copy of each element into a conventional string, prior to calling
its comparison routine.  This copying is also performed when using ASC is
used, but not LEN.  After all, the length of a fixed-length never changes,
and BASIC is smart enough to know the length directly.  But then, comparing
the lengths of these string is pointless anyway.
   Because of the added overhead to make these copies, the performance of
a conventional linear search for fixed-length data is generally quite poor. 
This is a shame, because fixed-length strings are often the only choice
when as much data as possible must be kept in memory at once.  And fixed-
length strings lend themselves perfectly to names and addresses.  It should
be apparent by now that the best solution for quickly comparing fixed-
length string arrays--and the string portion of TYPE arrays too--is with
the various Compare functions already shown.
   If you are searching for an exact match, then either Compare or Compare2
will be ideal, depending on whether you want to ignore capitalization.  If
you have only a single string element in each array, you should define a
dummy TYPE.  This avoids the overhead of having to use both VARSEG and
VARPTR as separate arguments.  The short example program and SearchType
functions that follow search a fixed-length string array for a match.

DEFINT A-Z
DECLARE FUNCTION Compare% (SEG Type1 AS ANY, SEG Type2 AS ANY, NumBytes)
DECLARE FUNCTION Compare2% (SEG Type1 AS ANY, SEG Type2 AS ANY, NumBytes)
DECLARE FUNCTION SearchType% (Array() AS ANY, Sought AS ANY)
DECLARE FUNCTION SearchType2% (Array() AS ANY, Sought AS ANY)
DECLARE FUNCTION SearchType3% (Array() AS ANY, Searched AS ANY)

CLS
TYPE FLen                       'this lets us use SEG
  LastName AS STRING * 15
END TYPE

REDIM Array(1 TO 4000) AS FLen  '4000 is a lot of names
DIM Search AS FLen              'best comparing like data

FOR X = 1 TO 4000 STEP 2        'impart some realism
  Array(X).LastName = "Henderson"
NEXT

Array(4000).LastName = "Henson" 'almost at the end
Search.LastName = "Henson"      'find the same name

'----- first time how long it takes using Compare
Start! = TIMER                  'start timing

FOR X = 1 TO 5                  'search five times
   FoundAt = SearchType%(Array(), Search)
NEXT

IF FoundAt >= 0 THEN
  PRINT "Found at element"; FoundAt
ELSE
  PRINT "Not found"
END IF

Done! = TIMER
PRINT USING "##.## seconds with Compare"; Done! - Start!
PRINT


'----- then time how long it takes using Compare2
Start! = TIMER                  'start timing

FOR X = 1 TO 5                  'as above
   FoundAt = SearchType2%(Array(), Search)
NEXT

IF FoundAt >= 0 THEN
  PRINT "Found at element"; FoundAt
ELSE
  PRINT "Not found"
END IF

Done! = TIMER
PRINT USING "##.## seconds with Compare2"; Done! - Start!
PRINT


'---- finally, time how long it takes using pure BASIC
Start! = TIMER

FOR X = 1 TO 5
   FoundAt = SearchType3%(Array(), Search)
NEXT

IF FoundAt >= 0 THEN
  PRINT "Found at element"; FoundAt
ELSE
  PRINT "Not found"
END IF

Done! = TIMER
PRINT USING "##.## seconds using BASIC"; Done! - Start!
END

FUNCTION SearchType% (Array() AS FLen, Sought AS FLen) STATIC

SearchType% = -1                'assume not found

FOR X = LBOUND(Array) TO UBOUND(Array)
  IF Compare%(Array(X), Sought, LEN(Sought)) THEN
    SearchType% = X             'save where it was found
    EXIT FOR                    'and skip what remains
  END IF
NEXT

END FUNCTION


FUNCTION SearchType2% (Array() AS FLen, Sought AS FLen) STATIC

SearchType2% = -1               'assume not found

FOR X = LBOUND(Array) TO UBOUND(Array)
  IF Compare2%(Array(X), Sought, LEN(Sought)) THEN
    SearchType2% = X            'save where it was found
    EXIT FOR                    'and skip what remains
  END IF
NEXT

END FUNCTION


FUNCTION SearchType3% (Array() AS FLen, Searched AS FLen) STATIC

SearchType3% = -1               'assume not found

FOR X = LBOUND(Array) TO UBOUND(Array)
  IF Array(X).LastName = Searched.LastName THEN
    SearchType3% = X            'save where it was found
    EXIT FOR                    'and skip what remains
  END IF
NEXT
  
END FUNCTION

When you run this program it will be apparent that the SearchType function
is the fastest, because it uses Compare which doesn't perform any case
conversions.  SearchType2 is only slightly slower with that added overhead,
and the purely BASIC function, SearchType3, lags far behind at half the
speed.  Note that the array is searched five times in succession, to
minimize the slight errors TIMER imposes.  Longer timings are generally
more accurate than short ones, because of the 1/18th second resolution of
the PC's system timer.


BINARY SEARCHES

This is about as far as we can go using linear searching, and to achieve
higher performance requires a better algorithm.  The Binary Search is one
of the fastest available; however, it requires the data to already be in
sorted order.  A Binary Search can also be used with a sorted index, and
both methods will be described.
   Binary searches are very fast, and also very simple to understand. 
Unlike the Quick Sort algorithm which achieves great efficiency at the
expense of being complicated, a Binary Search can be written using only a
few lines of code.  The strategy is to start the search at the middle of
the array.  If the value of that element value is less than that of the
data being sought, a new halfway point is checked and the process repeated. 
This way, the routine can quickly zero in on the value being searched for. 
Figure 8-3 below shows how this works.


13:  Zambia
12:  Sweden
11:  Peru
10:  Mexico  <-- step 2
 9:  Holland
 8:  Germany
 7:  Finland <-- step 1
 6:  England
 5:  Denmark
 4:  China
 3:  Canada
 2:  Austria
 1:  Australia

Figure 8-3: How a Binary Search locates data in a sorted array.


If you are searching for Mexico, the first element examined is number 7,
which is halfway through the array.  Comparing Mexico to Finland shows
that Mexico is greater, so the distance is again cut in half.  In this
case, a match was found after only two tries--remarkably faster than a
linear search that would have required ten comparisons.  Even when huge
arrays must be searched, data can often be found in a dozen or so tries. 
One interesting property of a binary search is that it takes no longer to
find the last element in the array than the first one.
   The program below shows one way to implement a Binary Search.

DEFINT A-Z
DECLARE FUNCTION BinarySearch% (Array$(), Find$)

CLS
PRINT "Creating test data..."

REDIM Array$(1 TO 1000)         'create a "sorted" array
FOR X = 1 TO 1000
  Array$(X) = "String " + RIGHT$("000" + LTRIM$(STR$(X)), 4)
NEXT

PRINT "Searching array..."

FoundAt = BinarySearch%(Array$(), "String 0987")
IF FoundAt >= 0 THEN
  PRINT "Found at element"; FoundAt
ELSE
  PRINT "Not found"
END IF

END


FUNCTION BinarySearch% (Array$(), Find$) STATIC

BinarySearch% = -1              'no matching element yet
Min = LBOUND(Array$)            'start at first element
Max = UBOUND(Array$)            'consider through last

DO
  Try = (Max + Min) \ 2         'start testing in middle

  IF Array$(Try) = Find$ THEN   'found it!
    BinarySearch% = Try         'return matching element
    EXIT DO                     'all done
  END IF

  IF Array$(Try) > Find$ THEN   'too high, cut in half
    Max = Try - 1
  ELSE
    Min = Try + 1               'too low, cut other way
  END IF
LOOP WHILE Max >= Min

END FUNCTION

The BinarySearch function returns either the element number where a match
was found, or -1 if the search string was not found.  Not using a value of
zero to indicate failure lets you use arrays that start with element number
0.  As you can see, the simplicity of this algorithm belies its incredible
efficiency.  The only real problem is that the data must already be in
sorted order.  Also notice that two string comparisons must be made--one
to see if the strings are equal, and another to see if the current element
is too high.  Although you could use Compare3 which examines the strings
once and tells if the data is the same or which is greater, a Binary Search
is so fast that this probably isn't worth the added trouble.  As you will
see when you run the test program, it takes far longer to create the data
than to search it!
   Besides the usual enhancements that can be applied to the comparisons
using UCASE$ or MID$, this function could also be structured to use a
parallel index array.  Assuming the data is not sorted but the index array
is, the modified Binary Search would look like this:

FUNCTION BinaryISearch% (Array$(), Index(), Find$) STATIC

BinaryISearch% = -1             'assume not found
Min = LBOUND(Array$)            'start at first element
Max = UBOUND(Array$)            'consider through last

DO
  Try = (Max + Min) \ 2         'start testing in middle

  IF Array$(Index(Try)) = Find$ THEN    'found it!
    BinaryISearch% = Try        'return matching element
    EXIT DO                     'all done
  END IF

  IF Array$(Index(Try)) > Find$ THEN    'too high, cut
    Max = Try - 1
  ELSE
    Min = Try + 1               'too low, cut other way
  END IF
LOOP WHILE Max >= Min

END FUNCTION

NUMERIC ARRAYS

All of the searching techniques considered so far have addressed string
data.  In most cases, string array searches are the ones that will benefit
the most from improved techniques.  As you have already seen, BASIC makes
copies of fixed-length strings before comparing them, which slows down
searching.  And the very nature of strings implies that many bytes may have
to be compared before determining if they are equal or which string is
greater.  In most cases, searching a numeric array is fast enough without
requiring any added effort, especially when the data is integer or long
integer.
   However, a few aspects of numeric searching are worth mentioning here. 
One is avoiding the inevitable rounding errors that are sure to creep into
the numbers you are examining.  Another is that in many cases, you may not
be looking for an exact match.  For example, you may need to find the first
element that is higher than a given value, or perhaps determine the
smallest value in an array.
   Unlike strings that are either the same or they aren't, the binary
representation of numeric values is not always so precise.  Consider the
following test which *should* result in a match, but doesn't.


Value! = 1!
Result! = 2!
CLS

FOR X = 1 TO 1000
  Value! = Value! + .001
NEXT

IF Value! = Result! THEN
  PRINT "They are equal"
ELSE
  PRINT "Value! ="; Value!
  PRINT "Result! ="; Result!
END IF


 After adding .001 to Value! 1000 times Value! should be equal to 2, but
instead it is slightly higher.  This is because the binary storage method
used by computers simply cannot represent every possible value with
absolute accuracy.  Even changing all of the single precision exclamation
points (!) to double precision pound signs (#) will not solve the problem. 
Therefore, to find a given value in a numeric array can require some extra
trickery.
   What is really needed is to determine if the numbers are *very close* to
each other, as opposed to exactly the same.  One way to accomplish this is
to subtract the two, and see if the result is very close to zero.  This is
shown below.


Value! = 1!
Result! = 2!
CLS

FOR X = 1 TO 1000
  Value! = Value! + .001
NEXT

IF ABS(Value! - Result!) < .0001 THEN
  PRINT "They are equal"
ELSE
  PRINT "Value! ="; Value!
  PRINT "Result! ="; Result!
END IF


Here, the absolute value of the difference between the numbers is examined,
and if that difference is very small the numbers are assumed to be the
same.  Unfortunately, the added overhead of subtracting before comparing
slows the comparison even further.  There is no simple cure for this, and
an array search must apply this subtraction to each element that is
examined.
   Another common use for numeric array searches is when determining the
largest or smallest value.  Many programmers make the common mistake shown
below when trying to find the largest value in an array.


MaxValue# = 0

FOR X = 1 TO NumElements
  IF Array#(X) > MaxValue# THEN
    MaxValue# = Array#(X)
    Element = X
  END IF
NEXT

PRINT "The largest value found is"; MaxValue#
PRINT "And it was found at element"; Element


The problem with this routine is that it doesn't account for arrays where
all of the elements are negative numbers!  In that case no element will be
greater than the initial MaxValue#, and the routine will incorrectly report
zero as the result.  The correct method is to obtain the lowest element
value, and use that as a starting point:


MaxValue# = Array#(1)

FOR X = 2 TO NumElements
  IF Array#(X) > MaxValue# THEN
    MaxValue# = Array#(X)
  END IF
NEXT

PRINT "The largest value found is"; MaxValue#


Determining the highest value in an array would be handled similarly,
except the greater-than symbol (>) would be replaced with a less-than
operator (<).


SOUNDEX

The final searching technique I will show is Soundex.  It is often useful
to search for data based on its sound, for example when you do not know how
to spell a person's name.  Soundex was invented in the 1920's and has been
used since then by, among others, the U.S. Census Bureau.  A Soundex code
is an alpha-numeric representation of the sound of a word, and it is
surprisingly accurate despite its simplicity.  The classic implementation
of Soundex returns a four-character result code.  The first character is
the same as the first letter of the word, and the other three are numeric
digits coded as shown in Figure 8-4.


    1    B, F, P, V
    2    C, G, J, K, Q, S, X
    3    D, T
    4    L
    5    M, N
    6    R

Figure 8-4: The Soundex code numbers returned for significant letters of
the alphabet.


Letters not shown are simply skipped as being statistically insignificant
to the sound of the word.  In particular, speaking accents often minimize
the importance of vowels, and blur their distinction.  If the string is
short and there are fewer than four digits, the result is simply padded
with trailing zeros.  One additional rule is that a code digit is never
repeated, unless there is an uncoded letter in between.  In the listing
that follows, two different implementations of Soundex are shown.

'SOUNDEX.BAS, Soundex routines and example

DEFINT A-Z

DECLARE FUNCTION ASoundex$ (Word$)
DECLARE FUNCTION ISoundex% (Word$)

CLS
DO
  PRINT "press Enter alone to exit"
  INPUT "What is the first word"; FWord$
  IF LEN(FWord$) = 0 THEN EXIT DO
  INPUT "What is the second word"; SWord$
  PRINT

  'Test by alpha-numeric soundex
  PRINT "Alpha-Numeric Soundex: "; FWord$; " and ";
  PRINT SWord$; " do ";
  IF ASoundex$(FWord$) <> ASoundex$(SWord$) THEN
    PRINT "NOT ";
  END IF
  PRINT "sound the same."
  PRINT

  'Test by numeric soundex
  PRINT "      Numeric Soundex: "; FWord$; " and ";
  PRINT SWord$; " do ";
  IF ISoundex%(FWord$) <> ISoundex%(SWord$) THEN
    PRINT "NOT ";
  END IF
  PRINT "sound the same."
  PRINT
LOOP
END


FUNCTION ASoundex$ (InWord$) STATIC

  Word$ = UCASE$(InWord$)
  Work$ = LEFT$(Word$, 1) + "000"
  WkPos = 2
  PrevCode = 0

  FOR L = 2 TO LEN(Word$)
    Temp = INSTR("BFPVCGJKQSXZDTLMNR", MID$(Word$, L, 1))
    IF Temp THEN
      Temp = ASC(MID$("111122222222334556", Temp, 1))
      IF Temp <> PrevCode THEN
        MID$(Work$, WkPos) = CHR$(Temp)
        PrevCode = Temp
        WkPos = WkPos + 1
        IF WkPos > 4 THEN EXIT FOR
      END IF
    ELSE
      PrevCode = 0
    END IF
  NEXT

  ASoundex$ = Work$

END FUNCTION


FUNCTION ISoundex% (InWord$) STATIC

  Word$ = UCASE$(InWord$)
  Work$ = "0000"
  WkPos = 1
  PrevCode = 0

  FOR L = 1 TO LEN(Word$)
    Temp = INSTR("BFPVCGJKQSXZDTLMNR", MID$(Word$, L, 1))
    IF Temp THEN
      Temp = ASC(MID$("111122222222334556", Temp, 1))
      IF Temp <> PrevCode THEN
        MID$(Work$, WkPos) = CHR$(Temp)
        PrevCode = Temp
        WkPos = WkPos + 1
        IF WkPos > 4 THEN EXIT FOR
      END IF
    ELSE
      PrevCode = 0
    END IF
  NEXT

  ISoundex% = VAL(Work$)

END FUNCTION


The first function, ASoundex, follows the standard Soundex definition and
returns the result as a string.  The ISoundex version cheats slightly by
coding the first letter as a number, but it returns an integer value
instead of a string.  Because integer searches are many times faster than
string searches, this version will be better when thousands--or even
hundreds of thousands--of names must be examined.
   An additional benefit of the integer-only method is that it allows for
variations on the first letter.  For example, if you enter Cane and Kane
in response to the prompts from SOUNDEX.BAS ASoundex will not recognize the
names as sounding alike where ISoundex will.


LINKED DATA
===========

No discussion of searching and sorting would be complete without a mention
of linked lists and other data links.  Unlike arrays where all of the
elements lie in adjacent memory locations, linked data is useful when data
locations may be disjointed.  One example is the linked list used by the
DOS File Allocation Table (FAT) on every disk.  As I described in Chapter
6, the data in each file may be scattered throughout the disk, and only
through a linked list can DOS follow the thread from one sector in a file
to another.
   Another example where linked data is useful--and the one we will focus
on here--is to keep track of memo fields in a database.  A memo field is
a field that can store freeform text such as notes about a sales contact
or a patient's medical history.  Since these fields typically require
varying lengths, it is inefficient to reserve space for the longest one
possible in the main database file.  Therefore, most programs store memo
fields in a separate disk file, and use a *pointer field* in the main data
file to show where the corresponding memo starts in the dedicated memo
file.  Similarly, a back pointer adjacent to each memo identifies the
record that points to it.  This is shown in Figure 8-5 below.


         
MAIN.DAT LastNameFirstName1LastNameFirstName73LastNa...
         
                             ^                    ^
                             |                    |
pointers into memo file -----+--------------------+
(forward pointers)



offsets into --+----------+-------+--------------+
this memo file |          |       |              |
               1          73     126            233
              Ŀ
     MEMO.DAT 1LMemo1   2LMemo23LMemo3       4LMemo4  
              
               ^          ^       ^              ^
               |          |       |              |
record numbers +----------+-------+--------------+
(back pointers)

(L = length of this memo)

Figure 8-5: Pointers relate record numbers to memo file offsets and vice
versa.


Here, the pointer in the main data file record is a long integer that holds
the byte offset into the memo file where the corresponding memo text
begins.  And just before the memo text is an integer record number that
shows which record this memo belongs to.  (If you anticipate more than
65,535 records a long integer must be used instead.)  Thus, these pointers
provide links between the two files, and relate the information they
contain.
   When a new record is added to the main file, the memo that goes with it
is appended to the end of the memo file.  BASIC's LOF function can be used
to determine the current end of the memo file, which is then used as the
beginning offset for the new memo text.  And as the new memo is appended
to MEMO.DAT, the first data actually written is the number of the new
record in the main data file.
   The record number back pointer in the memo file is needed to allow memo
data to be edited.  Since there's no reasonable way to extend memo text
when other memo data follows it, most programs simply abandon the old text,
and allocate new space at the end of the file.  The abandoned text is then
marked as such, perhaps by storing a negative value as the record number. 
Storing a negative version of the abandoned data's length is ideal, because
that both identifies the data as obsolete, and also tells how much farther
into the file the next memo is located.
   The idea here is that you would periodically run a memo file maintenance
program that compacts the file, thus eliminating the wasted space the
abandoned memos occupy.  This is similar to the DBPACK.BAS utility shown
in Chapter 7, and also similar to the way that BASIC compacts string memory
when it becomes full.  But when an existing memo is relocated in the memo
file, the field in the main data file that points to the memo must also be
updated.  And that's why the record number back pointer is needed: so the
compaction program can know which record in the main file must be updated.
   The "L" identifier in the memo file in Figure 8-5, shown between the
record number and memo text, is a length byte or word that tells how long
the text is.  If you plan to limit the memo field lengths to 255 or fewer
characters, then a single byte is sufficient.  Otherwise an integer must
be used.  An example of code that reads a data record and then its
associated memo text is shown below.


GET #MainFile, RecNumber, TypeVar
MemoOffset& = TypeVar.MemoOff
GET #MemoFile, MemoOffset& + 2, MemoLength%
Memo$ = SPACE$(MemoLength%)
GET #MemoFile, , Memo$


The first step reads a record from the main data file into a TYPE variable,
and the second determines where in the memo file the memo text begins.  Two
is added to that offset in the second GET statement, to skip over the
record number back pointer which isn't needed here.  Once the length of the
memo text is known, a string is assigned to that length, and the actual
text is read into it.
   If you are using long integer record numbers you would of course use
MemoOffset& + 4 in the second GET.  And if you're using a single byte to
hold the memo length you would define a fixed-length string to receive
that byte:


DIM Temp AS STRING *1
GET #MemoFile, MemoOffset& + 2, Temp
MemoLength = ASC(Temp)


Since BASIC doesn't offer a byte-sized integer data type, ASC and STR$ can
be used to convert between numeric and string formats.


ARRAY ELEMENT INSERTION AND DELETION
====================================

The last issue related to array and memory manipulation I want to cover
is inserting and deleting elements.  If you intend to maintain file indexes
or other information in memory and in sorted order, you will need some way
to insert a new entry.  By the same token, deleting an entry in a database
requires that the parallel index entry also be deleted.
   The most obvious way to insert or delete elements in an array is with
a FOR/NEXT loop.  The first example below inserts an element, and the
second deletes one.


'----- Insert an element:
Element = 200
InsertValue = 999

FOR X = UBOUND(Array) TO Element + 1 STEP -1
  Array(X) = Array(X - 1)
NEXT
Array(Element) = InsertValue


'----- Delete an element:
Element = 200
FOR X = Element TO UBOUND(Array) - 1
  Array(X) = Array(X + 1)
NEXT
Array(UBOUND(Array)) = 0  'optionally clear last element


For integer, long integer, and fixed-length arrays this is about as
efficient as you can get, short of rewriting the code in assembly language. 
However, with floating point and string arrays the performance is less than
ideal.  Unless a numeric coprocessor is installed, floating point values
are assigned using interrupts and support code in the emulator library. 
This adds an unnecessary level of complication that also impacts the speed. 
When strings are assigned the situation is even worse, because of the
memory allocation overhead associated with dynamic string management.
   A better solution for floating point and string arrays is a series of
SWAP statements.  The short program below benchmarks the speed difference
of the two methods, as it inserts an element into a single precision array.

REDIM Array(1 TO 500)
CLS
Element% = 200
InsertValue = 999

Start = TIMER
FOR A% = 1 TO 500
  FOR X% = UBOUND(Array) TO Element% + 1 STEP -1
    Array(X%) = Array(X% - 1)
  NEXT
  Array(Element%) = InsertValue
NEXT
Done = TIMER
PRINT USING "##.## seconds when assigning"; Done - Start

Start = TIMER
FOR A% = 1 TO 500
  FOR X% = UBOUND(Array) TO Element% + 1 STEP -1
    SWAP Array(X%), Array(X% - 1)
  NEXT
  Array(Element%) = InsertValue
NEXT
Done = TIMER
PRINT USING "##.## seconds when swapping"; Done - Start

If you run this program in the BASIC environment, the differences may not
appear that significant.  But when the program is compiled to an executable
file, the swapping method is more than four times faster.  In fact, you
should never compare programming methods using the BASIC editor for exactly
this reason.  In many cases, the slowness of the interpreting process
overshadows significant differences between one approach and another.
   String arrays also benefit greatly from using SWAP instead of
assignments, though the amount of benefit varies depending on the length
of the strings.  If you modify the previous program to use a string array,
also add this loop to initialize the elements:

   FOR X% = 1 TO 500
     Array$(X%) = "String number" + STR$(X)
   NEXT

With BASIC PDS far strings the difference is only slightly less at about
three to one, due to the added complexity of far data.  Also, SWAP will
always be worse than assignments when inserting or deleting elements in a
fixed-length string or TYPE array.  An assignment merely copies the data
from one location to another.  SWAP, however, must copy the data in both
directions.
   Understand that when using SWAP with conventional string arrays, the
data itself is not exchanged.  Rather, the four-byte string descriptors are
copied.  But because BASIC program modules store string data in different
segments, extra work is necessary to determine which descriptor goes with
which segment.  When near strings are being used, only six bytes are
exchanged, regardless of the length of the strings.  Four bytes hold the
descriptors, and two more store the back pointers.


SUMMARY
=======

This chapter explained many of the finer points of sorting and searching
all types of data in BASIC.  It began with sorting concepts using the
simple Bubble Sort as a model, and then went on to explain indexed and
multi-key sorts.  One way to implement a multi-key sort is by aligning the
key fields into adjacent TYPE components.  While there are some
restrictions to this method, it is fairly simple to implement and also
very fast.
   The Quick Sort algorithm was shown, and the SEEQSORT.BAS program on the
accompanying disk helps you to understand this complex routine by
displaying graphically the progress of the comparisons and exchanges as
they are performed.  Along the way you saw how a few simple modifications
to any string sort routine can be used to sort regardless of
capitalization, or based on only a portion of a string element.
   You also learned that writing a truly general sort routine that can
handle any type of data requires dealing exclusively with segment and
address pointers.  Here, assembly language routines are invaluable for
assisting you when performing the necessary comparisons and data exchanges. 
Although the actual operation of the assembly language routines will be
deferred until Chapter 12, such routines may easily be added to a BASIC
program using .LIB and .QLB libraries.
   I mentioned briefly the usefulness of packing and aligning data when
possible, as an aid to fast sorting.  In particular, dates can be packed
to only three bytes in Year/Month/Day order, and other data such as zip
codes can be stored in long integers.  Because numbers can be compared much
faster than strings, this helps the sorting routines operate more quickly.
   Array searching was also discussed in depth, and both linear and binary
search algorithms were shown.  As with the sorting routines, searching can
also employ UCASE$ and MID$ to search regardless of capitalization, or on
only a portion of each array element.  Two versions of the Soundex
algorithm were given, to let you easily locate names and other data based
on how they sound.
   Besides showing the more traditional searching methods, I presented
routines to determine the minimum and maximum values in a numeric array. 
I also discussed some of the ramifications involved when searching floating
point data, to avoid the inevitable rounding errors that might cause a
legitimate match to be ignored.
   Finally, some simple ways to insert and delete elements in both string
and numeric arrays were shown.  Although making direct assignments in a
loop is the most obvious way to do this, BASIC's often-overlooked SWAP
command can provide a significant improvement in speed.
   The next chapter will conclude this section about hands-on programming
by showing a variety of program optimization techniques.
