Chapter 4: Regular Expressions

Chapter 4: Regular Expressions

Topics
Introduction
Three operations on languages
Regular expressions
Definition of regular expression
Some common regular expressions and their languages

Introduction

We have the means to describe some languages:

{ ab abc abb }
the Kleene closure of a language: { a b }*
note that * is an "operation" on the language {a b}
which creates a new language:

all strings consisting of 0 or more a's and b's
but not a very interesting or complicated language!

We need some more operations on languages that will allow us to create more interesting languages

Three operations on languages

Operations on numbers:

we're familiar with operations on numbers: +,-,x,/, ...
operations are used to combine numbers to create new numbers: 1 + 3, 7 - 12, ...
we can build arithmetic expressions using them: (1 + 3)/7, 3 + 5(6 + 2), ...
each arithmetic expression stands for a number

Operations on languages (the analogy with numbers is complete):

we have operations on languages (remember, languages are sets): Ç, È, ...
used to combine languages to create new languages: {a}Ç{ab, a}, {a, b}È{a, cd}, ...
we can build language expressions using them: ({a}Ç{ab, a})Ç({a, b}È{a, cd}), ...
each language expression stands for a language

For the purposes of our goal, we will use three set operations:

Operation *: Kleene closure

a unary operation: operates on one set at a time (like the negation "-" operation for numbers)
this was defined in Chapter 2:

{aa, b}* = set of all strings consisting of 0 or more aa's and/or b's

Operation È: set union

a binary operation: operates on two sets at a time, (like the + operation for numbers)
{ aa, b}È{c, b} = { aa, b, c}

Operation: set product

a binary operation: operates on two sets at a time
does not use an operator symbol (like xy in algebra means x ´ y)
i.e., juxtaposition is used to indicate set product

Definition:

suppose A = { ab, a } and B = { c, b }
AB will mean all concatenations of a string chosen from A with a string chosen from B
the "left part" must be chosen from A, the "right part" form B
AB = { abc, abb, ac, ab }
formally, AB = {xy | xÎA, yÎB}

In addition to the operations, we will use parentheses for grouping, as for arithmetic expressions.

Now we are ready to construct a language expression: ({a}{ab, a})È({a, b}{a, cd}) which stands for: {aab, aa, acd, ba, bcd }

Regular expressions

we now introduce another (and very famous) form of language expression
called a regular expression (r.e. or RE)
the notation is a little cleaner and more concise than set notation:

for example, {a}* will be shortened to a*
we will use + (easier to type?) instead of È
but set product will still be by juxtaposition

we will describe an association:

every regular expression R
will be associated with a language denoted by L(R) or language(R)
symbolically: R Þ L(R) (read: "the r.e. R stands for the language L(R))

Note: we will now be using Kleene * in two ways!!

we used {a}* to represent a language: {l, a, aa, aaa, …}
by using Kleene * as an operator on a language
we will now use the notation a* (it's a r.e.!)
using Kleene * as an operator on a character
and even on another r.e.'s
the relationship between the two?

a* Þ {a}* = {l, a, aa, aaa, …}
Some examples (assume a and b are characters in an alphabet):

a Þ {a} (the r.e. a stands for the language {a})

a* denotes the RE Kleene closure of a

a* Þ {a}* = {l , a, aa, aaa, …}

we also use the RE positive closure operation a⁺

a+ Þ {a}⁺ = { a, aa, aaa, …}

a*b denotes the RE product of r.e.'s a* and b
it stands for the language formed by the set product of L(a*) with L(b)
a*b Þ {a}*{b} = {b, ab, aab, aaab, …}

(a*b) + (b*a) denotes the RE sum of r.e.'s a*b and b*a
it stands for the language formed by the set union of L(a*b) with L(b*a)
(a*b) + (b*a) Þ {a}*{b}È{b}*{a} = {b, ab, aab, aaab, … a, ba, bba, bbba, …}

Grouping operation ( "()" ):

regular expressions (like language expressions) will also use parens for grouping
c(a + b)* Þ {c, ca, cb, cab, cba, . . . }

Operator precedence:

mimic exponeniation, product and + for arithmetic expressions:

highest priority: *
next highest: product
lowest: +

Definition of regular expression

Given a vocabulary a, b, c, …

Rule 1: l is a regular expression: L(l) = Æ
Rule 2: a, b, c, … are regular expressions: L(a) = {a}, L(b) = {b}, …
Rule 3: If R and S are regular expressions, then:

(R)* is a r.e. defined by L((R)*) = L(R)*
(R)(S) is a r.e. defined by L((R)(S)) = L(R)L(S)
(R) + (S) is a r.e. defined by L((R) + (S)) = L(R) È L(S)
(R) is a r.e. defined by L( (R) ) = L(R)

Rule 4: parens may be omitted, subject to the priority rules:

Some common regular expressions and their languages

r.e.	language
a*	strings of 0 or more a's
aa*	strings of 1 or more a's
a⁺	"
aⁿ	a string of n a's (not in the definition, but sometimes used, for brevity)
(a + b)*	{a, b}* = strings of 0 or more a's and b's
ab	any # a's followed by any # of b's
(ab)*	strings of any number of ab's
bab	strings with exactly one a
(a + b)a(a + b)	strings with at least one a
a(a + b)*	strings beginning with a

Hosted by www.Geocities.ws