
                           THE BWSORT ALGORITHM
                           --------------------

The Bengali character set consists of the following:

Vowels         a, ae (= a + jafala + aa-kaar), aa, i (hraswa-i), I (dirgha-i),
               u (hraswa-u), U (dirgha-u), Ri,
               e, ea (= e + jafala + aa-kaar), E (oi), o, O (ou).

Vowel forms    Similar to the vowels

Consonants     k, K (=kh), g, G (=gh), ^n (=una),
               c (=ch), C (=chh), j (=j), J (=jh), ^N (=ina),
               T, Z (=Th), D, X (=Dh), N,
               t, z (=th), d, x (=dh), n,
               p, f (=ph), b, v (=bh), m,
               Y (=antashtha-ja), r, l, b, S (=sh = talabya-sha),
               S (=murdhanya-sa), s (=dantya-sa), h, rr (=Da-e shunya ra),
                                                     rh (=Dha-e shunya ra),
               y (=antashtha-a), ^t (=khanda-ta), M (=anuswar), H (=bisarga),
                                                     ^ (=chandrabindu).

Conjunct consonants
               Some allowed combinations of two or more consonants

Digits         0 1 2 3 4 5 6 7 8 9

Punctuation symbols
               period (=dnari), comma, quote, space etc.

However, the basic primitives are:

The vowels and the pure consonants (i.e. consonants without any vowel
sound, e.g., ka-e hasanta, etc.) plus the punctuation symbols and
digits. Any Bengali string can be broken as a concatenation of these
primitives. For example,

  prakhara daaruNa ati dirgha dagdha din.

can be broken as

  p_ + r_ + a + kh_ + a + r_ + a + space + d_ + aa + r_ + u + N_ + a + space +
  a + t_ + i + space + d_ + I + r_ + gh_ + a + space + d_ + a + g_ + dh_ + a +
  space + d_ + i + n_ + a + .

Here the underscore (_) stands for the pure consonant forms (i.e. consonants
without vowel sounds, or with hasanta). Any Bengali sorting scheme (be it
a computer program or a press standard) sorts Bengali strings based on this
decomposition. As regards the positions of these primitives in the Bengali
alphabet, we have the following ordering:

  a   < ae   < aa  < i    < I   <
  u   < U    < Ri  <
  e   < ea   < oi  < o    < ou  <
  k_  < kh_  < g_  < gh_  < ^n_ <
  ch_ < chh_ < j_  < jh_  < ^N_ <
  T_  < Th_  < D_  < Dh_  < N_  <
  t_  < th_  < d_  < dh_  < n_  <
  p_  < ph_  < b_  < bh_  < m_  <
  Y_  < r_   < l_  < sh_  <
  ss_ < s_   < h_  < rr_  < rh_ <
  y_  < ^t   < M_  < H_   < ^_

The DEFAULT sorting scheme of BWSORT respects this order.

Note that there are a total of 52 alphabetic primitives. These have been
given the ASCII values A - Z, a - z in that order. Punctuation symbols
and digits are given the same ASCII values as in roman. This induces an
ordering on all finite-length Bengali strings. BWSORT sorts Bengali
strings based on this converted decomposition (using `strcmp').
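
The encoding just described can be sketched as follows. This is only an
illustration in Python, not the BWSORT source: the names DEFAULT_ORDER,
CODES and encode are assumptions made for this sketch, and the input is
assumed to be already decomposed into primitives, written with the same
tokens as the ordering table above (with "space" standing for a blank, as
in the example decomposition). Python compares ASCII strings character by
character, which matches what `strcmp' does on such keys.

```python
import string

# The 52 alphabetic primitives in DEFAULT order, copied from the table above.
DEFAULT_ORDER = [
    "a", "ae", "aa", "i", "I",
    "u", "U", "Ri",
    "e", "ea", "oi", "o", "ou",
    "k_", "kh_", "g_", "gh_", "^n_",
    "ch_", "chh_", "j_", "jh_", "^N_",
    "T_", "Th_", "D_", "Dh_", "N_",
    "t_", "th_", "d_", "dh_", "n_",
    "p_", "ph_", "b_", "bh_", "m_",
    "Y_", "r_", "l_", "sh_",
    "ss_", "s_", "h_", "rr_", "rh_",
    "y_", "^t", "M_", "H_", "^_",
]

# A-Z followed by a-z: 52 codes in increasing ASCII order.
CODES = string.ascii_uppercase + string.ascii_lowercase
ENCODE = dict(zip(DEFAULT_ORDER, CODES))


def encode(primitives):
    """Turn a list of primitives into an ASCII sort key (hypothetical helper).

    Alphabetic primitives map to A-Z, a-z; the token "space" maps to a blank;
    digits and punctuation are passed through unchanged.
    """
    out = []
    for p in primitives:
        if p == "space":
            out.append(" ")
        else:
            out.append(ENCODE.get(p, p))
    return "".join(out)


# Four words from the example sentence, already decomposed.
words = {
    "din":    ["d_", "i", "n_", "a"],
    "dagdha": ["d_", "a", "g_", "dh_", "a"],
    "ati":    ["a", "t_", "i"],
    "dirgha": ["d_", "I", "r_", "gh_", "a"],
}
in_order = sorted(words, key=lambda w: encode(words[w]))
print(in_order)
```

Here `ati' sorts first because it starts with a vowel, and the three
words starting with d_ are then ordered by their second primitive
(a < i < I).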

While this scheme seems quite reasonable, many modern dictionaries in
Bengali follow a slight variation of the primitive order. This variation
mostly conforms to old Sanskrit conventions. The OLD sorting scheme of
BWSORT is based on these conventions. We will now enumerate the
differences between the DEFAULT and OLD schemes:

1. In the OLD scheme the following primitives are paired:

      rr_ <--> D_,   rh_ <--> Dh_,   y_ <--> Y_,   ^t <--> t_

   In the dictionary order rr_, rh_ and y_ immediately follow D_, Dh_
   and Y_ respectively, though they are not in those positions in the
   alphabet. See point 4 below for a discussion of ^t.

2. The ajogabaho barnos (M, H, ^) come before any other consonant
   primitive in the order.

3. A consonant with a hasanta is treated the same way as the consonant
   without the hasanta in the OLD sorting scheme. That is, b_da is broken
   as

      b_ + a + d_ + a

   This is not grammatically correct, but this convention is followed
   in Bengali dictionaries. BWSORT's OLD scheme respects this convention.
   The DEFAULT one, on the other hand, does not put the a after the hasanta
   (b_) and thereby identifies b_da as the conjunct bda (ba-e da-e).

4. Etymologically ^t (khanda-ta) is nothing but ta with a hasanta. In view
   of this and the previous point (3), ^t is identified with t in the OLD
   sorting scheme and is not treated as a separate primitive.

   The primitive ordering for the OLD scheme is therefore as follows:

     a   < ae   < aa  < i    < I    <
     u   < U    < Ri  <
     e   < ea   < oi  < o    < ou   <
     M_  < H_   < ^_  <
     k_  < kh_  < g_  < gh_  < ^n_  <
     ch_ < chh_ < j_  < jh_  < ^N_  <
     T_  < Th_  < D_  < rr_  < Dh_  < rh_  < N_  <
     t_  = ^t   < th_ < d_   < dh_  < n_   <
     p_  < ph_  < b_  < bh_  < m_   <
     Y_  < y_   < r_  < l_   < sh_  <
     ss_ < s_   < h_
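
The OLD ordering above can be sketched as a rank table. Again this is
only an illustration in Python, not the BWSORT source: OLD_ORDER,
OLD_RANK and old_key are names assumed for this sketch. It covers only
the alphabetic primitives and only the reordering described in points 1,
2 and 4; the hasanta convention of point 3 concerns the decomposition
step and is not modelled here.

```python
# OLD order copied from the table above. A tuple groups primitives that
# share one position (t_ = ^t).
OLD_ORDER = [
    "a", "ae", "aa", "i", "I",
    "u", "U", "Ri",
    "e", "ea", "oi", "o", "ou",
    "M_", "H_", "^_",
    "k_", "kh_", "g_", "gh_", "^n_",
    "ch_", "chh_", "j_", "jh_", "^N_",
    "T_", "Th_", "D_", "rr_", "Dh_", "rh_", "N_",
    ("t_", "^t"), "th_", "d_", "dh_", "n_",
    "p_", "ph_", "b_", "bh_", "m_",
    "Y_", "y_", "r_", "l_", "sh_",
    "ss_", "s_", "h_",
]

# Map every primitive to its rank; grouped primitives get the same rank.
OLD_RANK = {}
for rank, entry in enumerate(OLD_ORDER):
    members = entry if isinstance(entry, tuple) else (entry,)
    for primitive in members:
        OLD_RANK[primitive] = rank


def old_key(primitives):
    """Sort key for a list of alphabetic primitives (hypothetical helper).

    Raises KeyError for digits and punctuation, which this sketch omits.
    """
    return tuple(OLD_RANK[p] for p in primitives)


# M_ now sorts before k_, unlike in the DEFAULT scheme.
print(sorted([["a", "k_", "a"], ["a", "M_"]], key=old_key))
# t_ and ^t are indistinguishable to the OLD scheme.
print(old_key(["t_"]) == old_key(["^t"]))
```

Tuples of ranks are used instead of single ASCII characters so that two
primitives can share a position; the tuples still compare position by
position, in the same way as the encoded strings of the DEFAULT scheme.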


These differences make the OLD sorting scheme slightly different from the
DEFAULT scheme. As we have discussed elsewhere, bwsort allows you to
choose the one you like in a variety of ways (the -s option on the command
line, setting the environment variable BWSORTSTYLE, calling sortstyle in
the interactive mode).

Before we end, some general remarks about a few BWSORT conventions are in
order:

1. bargya-ba and antashtha-ba are pronounced and written the same way
   in Bengali (like the bargya-ba in Sanskrit). We have, therefore,
   omitted the antashtha-ba from the alphabet. Some Bengali dictionaries
   still find it necessary to find out the original Sanskrit spelling and
   sort based on the type of the ba. Neither sorting scheme of BWSORT
   does that. BWFU fonts do not encourage that either.

2. Bargya-ja and antashtha-ja are pronounced the same way in Bengali.
   They are, however, written differently. So these two are treated as
   separate characters.

3. The vowels `ae' and `ea' are not listed as vowels in the classical
   definition of Bengali grammar. Their vowel form would be

      jafala + aa-kaar

   When this sequence comes immediately after a consonant (as in
   baekaraN, for example), the decomposition goes like this

      baekaraN = b_ + Y_ + aa + k_ + a + r_ + a + N_ + a

   On the other hand, when jafala + aa-kaar comes after the vowels
   `a' or `e', it is not decomposed the same way, that is, not as

      a + Y_ + aa    or    e + Y_ + aa

   Instead it is preferable to treat `ae' and `ea' as separate vowels
   which do not have any vowel forms (kaar) associated with them. This
   convention is followed for both the DEFAULT and the OLD sorting
   schemes.

4. The BWFU and BWTI fonts (on which BWSORT is based) do not define the
   Bengali vowel `Li'. I have never heard of anybody who has seen
   this character in a Bengali word (however old, obsolete or uncommon
   the word is). So I felt no justification for including this
   character in the Bengali alphabet.

5. BWSORT assumes that the files you are sorting are Bengali text files
   in the sense that their words make syntactic sense to a Bengali
   reader. For example, a word cannot start with an aa-kaar or a hasanta.
   Within a word, a hasanta and a vowel form cannot both be attached to
   the same consonant. Similarly, two vowel forms cannot modify the sound
   of a consonant simultaneously. There is nothing like a bisarga-e
   hraswa-i-kaar etc. `a' and `e' are the only vowels that take a jafala
   (+ aa-kaar) after them. And so on... If the input file does not conform
   to these general rules, you may expect peculiar behavior of bwsort.


That's all! If you find some conventions wrong or wrongly implemented,
or there is a pre-defined standard which every sorting scheme should
follow, please let me know. I can be reached at

   abhij@csa.iisc.ernet.in

Thanks for your interest in bwsort.


                            -------------------------------------------------
                            Abhijit Das (Barda)
                            Department of Computer Science and Automation
                            Indian Institute of Science
                            Bangalore 560 012
                            INDIA

