Chapter
4: Regular Expressions
Topics
Introduction
Three operations on languages
Regular expressions
Definition of regular expression
Some common regular expressions and their languages
Introduction
We have the means to describe some languages:
-
{ ab abc abb }
-
the Kleene closure of a language: { a b }*
-
note that * is an "operation" on the language {a b}
-
which creates a new language:
-
all strings consisting of 0 or more a's and b's
-
but not a very interesting or complicated language!
We need some more operations on languages that will allow us to
create more interesting languages
Three
operations on languages
Operations on numbers:
-
we're familiar with operations on numbers: +,-,x,/, ...
-
operations are used to combine numbers to create new numbers: 1
+ 3, 7 - 12, ...
-
we can build arithmetic expressions using them: (1 + 3)/7,
3 + 5(6 + 2), ...
-
each arithmetic expression stands for a number
Operations on languages (the analogy with numbers is complete):
-
we have operations on languages (remember, languages are sets):
Ç, È, ...
-
used to combine languages to create new languages: {a}Ç{ab,
a}, {a, b}È{a, cd}, ...
-
we can build language expressions using them: ({a}Ç{ab,
a})Ç({a, b}È{a,
cd}), ...
-
each language expression stands for a language
For the purposes of our goal, we will use three set operations:
Operation *: Kleene closure
-
a unary operation: operates on one set at a time (like the negation
"-" operation for numbers)
-
this was defined in Chapter 2:
-
{aa, b}* = set of all strings consisting of 0 or more aa's and/or b's
Operation È: set union
-
a binary operation: operates on two sets at a time, (like the +
operation for numbers)
-
{ aa, b}È{c, b} = { aa, b, c}
Operation: set product
-
a binary operation: operates on two sets at a time
-
does not use an operator symbol (like xy in algebra means x ´
y)
-
i.e., juxtaposition is used to indicate set product
Definition:
-
suppose A = { ab, a } and B = { c, b }
-
AB will mean all concatenations of a string chosen from A with a string
chosen from B
-
the "left part" must be chosen from A, the "right part" form B
-
AB = { abc, abb, ac, ab }
-
formally, AB = {xy | xÎA, yÎB}
In addition to the operations, we will use parentheses for grouping, as
for arithmetic expressions.
Now we are ready to construct a language expression:
({a}{ab, a})È({a, b}{a, cd})
which stands for:
{aab, aa, acd, ba, bcd }
Regular
expressions
-
we now introduce another (and very famous) form of language expression
-
called a regular expression (r.e. or RE)
-
the notation is a little cleaner and more concise than set notation:
-
for example, {a}* will be shortened to a*
-
we will use + (easier to type?) instead of È
-
but set product will still be by juxtaposition
-
we will describe an association:
regular expression Þ language
-
every regular expression R
-
will be associated with a language denoted by L(R) or language(R)
-
symbolically: R Þ L(R)
(read: "the r.e. R stands for the language L(R))
Note: we will now be using Kleene * in two ways!!
-
we used {a}* to represent a language: {l,
a, aa, aaa, …}
-
by using Kleene * as an operator on a language
-
we will now use the notation a* (it's a r.e.!)
-
using Kleene * as an operator on a character
-
and even on another r.e.'s
-
the relationship between the two?
a* Þ {a}* = {l,
a, aa, aaa, …}
Some examples (assume a and b are characters in an alphabet):
-
a Þ {a} (the r.e. a stands for the
language {a})
-
a* denotes the RE Kleene closure of a
-
a* Þ {a}* = {l
, a, aa, aaa, …}
-
we also use the RE positive closure operation a+
-
a+ Þ {a}+
= { a, aa, aaa, …}
-
a*b denotes the RE product of r.e.'s a* and b
-
it stands for the language formed by the set product of L(a*) with
L(b)
-
a*b Þ {a}*{b} = {b,
ab, aab, aaab, …}
-
(a*b) + (b*a) denotes the RE sum of r.e.'s a*b and b*a
-
it stands for the language formed by the set union of L(a*b) with
L(b*a)
-
(a*b) + (b*a) Þ {a}*{b}È{b}*{a}
= {b, ab, aab, aaab, … a, ba, bba, bbba, …}
Grouping operation ( "()" ):
-
regular expressions (like language expressions) will also use parens for
grouping
-
c(a + b)* Þ {c, ca, cb, cab, cba,
. . . }
Operator precedence:
-
mimic exponeniation, product and + for arithmetic expressions:
-
highest priority: *
-
next highest: product
-
lowest: +
Definition
of regular expression
Given a vocabulary a, b, c, …
Rule 1: l is a regular expression:
L(l) = Æ
Rule 2: a, b, c, … are regular expressions: L(a) = {a}, L(b) = {b},
…
Rule 3: If R and S are regular expressions, then:
(R)* is a r.e. defined by L((R)*) = L(R)*
(R)(S) is a r.e. defined by L((R)(S)) = L(R)L(S)
(R) + (S) is a r.e. defined by L((R) + (S)) = L(R) È
L(S)
(R) is a r.e. defined by L( (R) ) = L(R)
Rule 4: parens may be omitted, subject to the priority rules:
(b)* can be written b*
(a)((b)*) can be written ab*
a + b* means a + (b*) Þ {a, l,
b, bb, ...}
(a + b)* Þ {a, b}*
ab* means a(b*)
(ab)* Þ {ab}*
Some
common regular expressions and their languages
| r.e. |
language |
| a* |
strings of 0 or more a's |
| aa* |
strings of 1 or more a's |
| a+ |
" |
| an |
a string of n a's (not in the definition, but
sometimes used, for brevity) |
| (a + b)* |
{a, b}* = strings of 0 or more a's and b's |
| a*b* |
any # a's followed by any # of b's |
| (ab)* |
strings of any number of ab's |
| b*ab* |
strings with exactly one a |
| (a + b)*a(a + b)* |
strings with at least one a |
| a(a + b)* |
strings beginning with a |