Chapter 10: Non-regular Languages

Topics
Are all languages regular?
 anbn is not a regular language
The pumping lemma
Using the pumping lemma to prove that anbn is not regular
An interesting and useful way to classify strings
A surprising construction: the CCFA of a language L
The CCFA of a language L accepts exactly L
How does the CCFA for L relate to other FA that accept L?
The Myhill-Nerode theorem
Using Myhill-Nerode to show that a language is not regular


Are all languages regular?

That would be nice, because then we could stop right here. But they’re not.

For example, the language { anbn, n = 1, 2, …} is not regular.

Note: we will usually denote the set {anbn, n=1,2,3, …} in the simpler, abbreviated form: anbn anbn is, of course is the set of all strings with any number of a’s followed by the same number of b’s. How would we as humans when given a string, test it for membership in this language? Here’s one way:
    1. Read a’s, counting as we go
    2. When we come to a "b", save our count
    3. Start counting b’s
    4. If we find another a, reject
    5. If we come to the end of the string, compare the current count with the saved count
    6. If they’re the same, accept, otherwise, reject
Why can’t we program a FA to do this? A key to the algorithm is being able to count. FA can’t count arbitrarily high, because they have only finitely many states: a FA with 10 states can count to no more than 10!

Following is a more formal proof of our assertion:


anbn is not a regular language

Suppose we have a NFA with 20 states that can recognize a21b21.  Since there are only 20 states, one will have to be revisited while reading in the a21 part. Suppose a state is revisited at the 10th and 15th transitions. The sequence of states will look like:

But this machine, by short-cutting the loop, also accepts a11b21, which is not in the language. So a machine with 20 states cannot be the required machine. The same argument holds for any number of states.

So, some languages are non-regular. The question arises, given a language, how can one tell whether it is regular or not? Of course, any finite language is regular: if L = {s1, s2, … sn} then s1 + s2 + … + sn is a RE whose language is L. But what about infinite languages? We introduce two theorems that are useful in showing whether specific languages are regular or not:

(1) The pumping lemma
(2) The Myhill-Nerode theorem


The pumping lemma

Suppose L is regular and infinite. Then there is some string xyz in L (y ¹ l ) such that xynz is also in L, for all n = 1, 2, … .

Suppose the FA for L has m states. Choose any string in L of length > m. While recognizing this string, some state must be revisited; let’s say state k is visited on transitions i and j. We divide up our input string into three pieces: x, y, and z, where the x-part finishes by visiting state k for the first time, the y-part causes us to visit state k for the second time, and the z-part is the rest:
 

input string
x
y
z
character positions 1, 2, … i i+1, … j j+1, … , m
states 
visited
doesn’t matter k
doesn’t matter
k doesn’t matter

y, the string of (j-i) characters recognized starting just after the first state k and ending with state k, is the sought string; if xyz is recognized, we can insert, in place of y, y2. The first y will take us from state-k to state-k, and so will the second, then we go on to recognize the z and accept.

If we can do it with y2, we can likewise do it with y3, y4, etc.

This theorem is mainly useful for showing that languages are not regular.


Using the pumping lemma to prove that anbn is not regular

The proof is by contradiction. That is, we will assume that L = anbn is regular, and arrive at a contradiction, thus showing that our assumption is wrong.

If L is regular, then by the pumping lemma, there is some string xyz in L such that all xynz are also in L.

Case 1: y = aa … a.
By the pumping lemma, we can "pump" up y to a high enough power k such that the number of a’s in xykz exceeds the number of b’s, creating a string that is not in L. Contradiction.

Case 2: y = aa … abb … b.
By the pumping lemma, x(aa … abb … b)(aa … abb … b)z should also be in L. Contradiction.

Case 3: y = bb … b. See Case 1.


Pumping lemma 2

Pumping Lemma 1 states that “there exists a string w = xyz …”.  Sometimes that’s not good enough to prove that a language is not regular.

Pumping Lemma 2 states:  If a language L is accepted by a finite automaton with N states, then every string w in L of length > N can be written as xyz, where y ¹l, length(xy) £ N, and all strings xynz are also in L.


An interesting and useful way to classify strings

Consider the language L = {aab, aba, bab}

Now think of the set of arbitrary strings of a’s and b’s (including l ). One such is aa.

Think of "completing" aa by appending strings of a’s and b’s (or l ), and observing whether the completed string is in L or not. We will use a completion code "L" for yes and ~ for no. We can tabulate a completion code for each completer of each string, called the string’s completion behavior.

Completion behavior of "aa":

string
completers
  l a b
any string of length ³ 2
aa ~ L
~

Adding the completion behaviors of ba and ab:
 

string
completers
  l a b
any string of length ³ 2
aa ~ ~ L
~
ba ~ ~ L
~
ab ~ ~
~

Notice that the completion behaviors of aa and ba are the same. We say that aa and ba are in the same completion class. On the other hand, ab will be in some other completion class. Following is a complete table of completion behaviors for all possible strings of a’s and b’s. The ~’s have been omitted to avoid clutter.
 
completers
l a b a

a

ab ba bb a

a

a

aba a

ab

abb bbb bba bab ba

a

l                 L L       L  
a         L L                  
b         L                    
aa     L                        
ab   L                          
ba     L                        
bb                              
aaa                              
aba L                            
aab L                            
abb                              
bbb                              
bba                              
bab L                            
baa                              

Note: we have omitted strings and completers of length 3 or more; they of course are never in L, and belong to the same class as bb.

Based on this table, we have seven completion classes:
 
{l } {bb, aaa, abb, bbb, bba, baa, all strings of length ³ 4} {ab} {aa, ba} {b} {a} {aba, aab, bab}

Notice that every conceivable string is in exactly one class; we say that this classification partitions all strings into (disjoint) equivalence classes.

For this language, we have a finite number of classes.  For some languages the number of classes will be infinite; thus, we would not be able to tabulate them, but we might be able to describe them using more abstract descriptions.


A surprising construction: the CCFA of a language L

We build a finite automaton, using these completion classes as states, as follows:

The start state is the class containing l :

Now:

Construct:
the a-transition from this state: by determining what class "la" is in, namely, {a}
the b-transition: by determining what class "lb" is in, namely, {b}


 
Transitions from the a-state:

a-trans = the class that aa is in = {aa, ba}
b-trans = class that ab is in = {ab}


 
Transitions from the b-state:

a-trans = the class that ba is in = {aa, ba}
b-trans = class that bb is in = {bb, …}

Now think about a-transitions from the aa,ba-state = the class that aaa is in; or is it the class that baa is in?
That will always be the same class, because aa and ba have the same completion behavior! so:

Transitions from the aa,ba-state:

a-trans = {bb, … }
b-trans = {aba, aab, bab} notice that this is just L. We make it our final state.


Completing the remaining transitions according to the above, our final machine is:

Notice that this machine accepts exactly L = {aba, aab, bab}.

If there are finitely many completion classes (therefore finitely many states), we call this machine the Completion Classes FA (CCFA) of L.

Question: is the CCFA of a language L a FA that accepts L? The answer is yes. Read on.


The CCFA of a language L accepts exactly L

Let M be the CCFA of L.

Terminology: Given a FA M and a string x, we say that x puts M in state S if x as input to M will cause M to stop in state S.

(1) Suppose x is in L. Then x will put M into the state containing x (by construction). This is a final state, by construction. So if x is in L, M accepts x.

(2) Suppose x takes M into a final state S.  S has at least one string y of L (by definition). Since y is in L, the completion code for y and completer l is L. But the completion behavior for all strings in S is the same, so xl ’s code is also L, so x is in L.  So if M accepts x, then x is in L.

Combining (1) and (2), x is in L Û M accepts x, and the assertion is proved.


How does the CCFA for L relate to other FA that accept L?

Here’s another FA that accepts the language L = {aab, aba, bab } introduced earlier:

Some observations: For our example: S1 = {l }, S2 = {a}, S3 = {aa}, S4 = {aab}, S5 = {ab}, S6 = {aba}, S7 = {b}, S8 = {ba}, S9 = {bab}, S10 (not depicted, defined by all undefined transitions) = {bb, aaa, abb, bbb, bba, baa, all strings of length ³ 4} Here’s how it goes for our example:

{l } = S1
{bb, aaa, abb, bbb, bba, baa, all strings of length ³ 4} = S10
{ab} = S5
{aa, ba} = S3 ÈS8
{b} = S7
{a} = S2
{aba, aab, bab} = S4 È S6 È S9


The Myhill-Nerode theorem

A language L is regular Û it has a finite number of completion classes.

Proof:
Ü if it has a finite # of completion classes, its CCFA is a recognizer for it, as just shown.
Þ if the language is regular, it has a finite number of states, defining finitely many Si’s (see above) which can be allocated to at most a finite number of completion classes.


Using Myhill-Nerode to show that a language is not regular

Let’s do anbn again:

Consider the following completion classes:
 
for a: completer b has an L, but completers b2, b3, … are all rejected
for aa:  completer bb has an L, but completers b, b3, … are all rejected – different completion behavior than a’s
for aaa:  completer bbb has an L, but completers b, b2, b4, … are all rejected – different completion behavior than a’s or aa’s
for a to any power: similarly 

So a, a2, a3, … are all in different completion classes Þ infinitely many completion classes Þ anbn non-regular.


Quotient Languages (omit)
Hosted by www.Geocities.ws

1