| Author: | Paul McGuire <ptmcg@austin.rr.com> |
|---|---|
| Version: | 1.1 |
Don't clutter up the parser grammar with whitespace, just handle it! (likewise for comments)
Grammars must tolerate change, as grammar evolves or input text becomes more challenging.
Grammars do not have to be exhaustive to be useful.
Simple grammars can return lists; complex grammars need named results.
Class names are easier to read and understand than specialized typography.
Parsers can (and sometimes should) do more than tokenize.
Grammars should be:
No separate code-generation step.
No imposed naming conventions.
Stay Pure Python.
Design-driven:
language -> BNF -> parser impl --+->
concept ^ |
| refine/extend |
+--- language --+
Data-driven:
gather determine
sample -> text -> BNF -> parser ---+->
inputs patterns |
^ |
| gather new |
+----- (non-conforming) ---+
inputs
Compose grammar:
greeting = oneOf("Hello Ahoy Yo Hi") +
Literal(",") +
Word( string.uppercase, string.lowercase ) +
Literal("!")
Call parseString():
print greeting.parseString("Hello, World!")
print greeting.parseString("Ahoy, Matey !")
print greeting.parseString("Yo,Adrian!")
['Hello', ',', 'World', '!']
['Ahoy', ',', 'Matey', '!']
['Yo', ',', 'Adrian', '!']
Words and Literals
Word:
Word("ABC","def") matches "C", "Added", "Beef",
but not "BB", "ACE", "ADD"
Literal:
Literal("name(x)") matches "name(x)",
but not "name (x)"
CaselessLiteral:
CaselessLiteral("ABC") matches "abc", "ABC", "AbC"
returns "ABC" in all cases
Combining
And (operator +) - must match all (ignoring whitespace):
assignment = varName + Literal("=") + arithExpression
or:
assignment = varName + "=" + arithExpression
Or (operator ^) - match longest:
qty = integerNum ^ realNum
MatchFirst (operator |) - match first:
arithOp = Literal("+") | Literal("-") | Literal("X") |
Literal("/") | Literal("^") | Literal("%")
Repetition and Collection
Whitespace is not matched as part of the grammar:
Literal("name") + Literal("(") + Literal("x") + Literal(")")
matches "name(x)", "name (x)", "name( x )", "name ( x )"
oneOf("string of space-separated literal strings"):
matches "strings", "string", "of", "space-separated", or "literal"
arithOp = oneOf("+ - X / ^ %")
delimitedList(expression, delim=','):
matches expression or expression,expression,...
commaSeparatedList:
special form of delimitedList with a pre-defined expression for comma-separated values (strings, quoted strings, integers, etc.)
quotedString, dblQuotedString, sglQuotedString:
matches "abc", 'def', "string with escaped \" character"
makeHTMLTags("A") - returns beginning and ending tags:
beginning matches "<A>", "<a href="blah">", "<A/>" ending matches "</A>", "</a>"
makeXMLTags("BODY") - like makeHTMLTags, but with tighter case matching:
beginning matches "<BODY>", "<BODY align="top">"
but not "<body>" or "<Body>"
ending matches "</BODY>", but not "</body>"
Typical comment forms:
cStyleComment /* block comment,
spans lines */
dblSlashComment // everything up to end of line
cppStyleComment = cStyleComment | dblSlashComment
htmlComment <!-- commented out HTML -->
pythonComment # everything up to end of line
Strings of common character sets for Word definitions:
alphas - A-Za-z
alphas8bit - accented and umlauted characters
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
nums - 0-9
alphanums = alphas + nums
hexnums = 0-9A-Fa-f
printables = string.printables, minus space character
srange('A-Za-z')
same as 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcde...'
SkipTo - "wildcard" advance to expression
Suppress - match, but suppress matching results; useful for punctuation, delimiters
NotAny - negative lookahead
FollowedBy - assertive lookahead
Other ParserElement methods
Forward - to define recursive grammars - example in just a few slides...
For example:
Regex(r"(\+|-)?\d+")
is equivalent to:
Combine( Optional(oneOf("+ -")) + Word(nums) )
Named re groups in the Regex string are also preserved as ParseResults names
(Don't be too hasty to optimize, though...)
Dict - creates named fields using input data for field names (see pyparsing examples)
StringStart, StringEnd, LineStart, LineEnd - positional expressions for line-break dependent expressions
White - (sigh...) if you must parse whitespace, use this class
Exceptions
setName(exprName) - useful for exceptions and debugging, to name parser expression (instead of default repr-like string)
For example, compare:
integer = Word(nums)
(integer + integer).parseString("123 J56")
pyparsing.ParseException: Expected W:(0123...) (at char 4), (line:1, col:5)
vs:
integer = Word(nums).setName("integer")
(integer + integer).parseString("123 J56")
pyparsing.ParseException: Expected integer (at char 4), (line:1, col:5)
setDebug() - activates logging of all match attempts, failures, and successes; custom debugging callback methods can be passed as optional arguments:
integer.setDebug()
(integer + integer).parseString("123 A6")
prints:
Match integer at loc 0 (1,1) Matched integer -> ['123'] Match integer at loc 3 (1,4) Exception raised: Expected integer (at char 4), (line:1, col:5)
Pyparsing distribution includes
Online resources
SourceForge pyparsing support forum (RSS feed via Gmane)
comp.lang.python (but not always best)
ONLamp article, Building Recursive Descent Parsers with Python
How to parse the string "['a', 100, 3.14]" back to a Python list
First pass:
lbrack = Literal("[")
rbrack = Literal("]")
integer = Word(nums).setName("integer")
real = Combine(Optional(oneOf("+ -")) + Word(nums) + "." +
Optional(Word(nums))).setName("real")
listItem = real | integer | quotedString
listStr = lbrack + delimitedList(listItem) + rbrack
test = "['a', 100, 3.14]" print listStr.parseString(test)
Returns:
['[', "'a'", '100', '3.14', ']']
Brackets []'s are significant during parsing, but are not interesting as part of the results (just as delimitedList discarded the delimiting commas for us):
lbrack = Suppress("[")
rbrack = Suppress("]")
We also want actual values, not strings, and we really don't want to carry the quotes around from our quoted strings, so use parse actions for conversions:
cvtInt = lambda s,l,toks: int(toks[0]) integer.setParseAction( cvtInt ) cvtReal = lambda s,l,toks: float(toks[0]) real.setParseAction( cvtReal ) listItem = real | integer | quotedString.setParseAction(removeQuotes)
Updated version now returns:
['a', 100, 3.1400000000000001]
How would we handle nested lists?
We'd like to expand listItem to accept listStr's, but this creates a chicken-and-egg problem - listStr is defined in terms of listItem.
Solution: Forward declare listStr before listItem, using Forward():
listStr = Forward()
Add listStr as another type of listItem - enclose as Group to preserve nesting:
listItem = real | integer | quotedString.setParseAction(removeQuotes) \
| Group(listStr)
Finally, define listStr using '<<' operator instead of '=' assignment:
listStr << ( lbrack + delimitedList(listItem) + rbrack )
Using a nested list as input:
test = "['a', 100, 3.14, [ +2.718, 'xyzzy', -1.414] ]"
We now get returned:
['a', 100, 3.1400000000000001, [2.718, 'xyzzy', -1.4139999999999999]]
Entire program:
cvtInt = lambda s,l,toks: int(toks[0])
cvtReal = lambda s,l,toks: float(toks[0])
lbrack = Suppress("[")
rbrack = Suppress("]")
integer = Word(nums).setName("integer").setParseAction( cvtInt )
real = Combine(Optional(oneOf("+ -")) + Word(nums) + "." +
Optional(Word(nums))).setName("real").setParseAction( cvtReal )
listStr = Forward()
listItem = real | integer | quotedString.setParseAction(removeQuotes)
| Group(listStr)
listStr << ( lbrack + delimitedList(listItem) + rbrack )
test = "['a', 100, 3.14, [ +2.718, 'xyzzy', -1.414] ]"
print listStr.parseString(test)
Want to extract all links from a web page to external URLs.
Basic form is:
<a href="www.blah.com/xyzzy.html">Zork magic words</a>
But anchor references are not always so clean looking:
Use pyparsing makeHTMLTags helper method:
aStart,aEnd = makeHTMLTags("A")
Compose suitable expression:
link = aStart + SkipTo(aEnd).setResultsName("link") + aEnd
Skip any links found in HTML comments:
link.ignore(htmlComment)
Use scanString to search through HTML page - avoids need to define complete HTML grammar:
for toks,start,end in link.scanString(htmlSource):
print toks.link, "->", toks.startA.href
Complete program:
from pyparsing import makeHTMLTags,SkipTo,htmlComment
import urllib
serverListPage = urllib.urlopen( "http://www.yahoo.com" )
htmlText = serverListPage.read()
serverListPage.close()
aStart,aEnd = makeHTMLTags("A")
link = aStart + SkipTo(aEnd).setResultsName("link") + aEnd
link.ignore(htmlComment)
for toks,start,end in link.scanString(htmlText):
print toks.link, "->", toks.startA.href
Results:
<img src="http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif" width=36 height=36 border=0 alt=""><br><nobr><font face=verdana size=-2 color=#000000>Mail</font></nobr> -> r/m1 <img src="http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif" width=36 height=36 border=0 alt=""><br><nobr><font face=verdana size=-2 color=#000000>My Yahoo!</font></nobr> -> r/i1 <img src="http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif" width=36 height=36 border=0 alt=""><br><nobr><font face=verdana size=-2 color=#000000>Messenger</font></nobr> -> r/p1 Advanced -> r/so My Web -> r/mk Answers <img src="http://us.i1.yimg.com/us.yimg.com/i/ww/beta.gif" border="0"> -> r/1l <b>Yahoo! Health</b>: Make your resolutions a reality. -> s/266569 Lose weight -> s/266570 get fit -> s/266571 and be healthy -> s/266572 <b>Sign In</b> -> r/l6 <b>Sign Up</b> -> r/m7 360° -> r/3t Autos -> r/cr Finance -> r/sq Games -> r/pl
JSON (JavaScript Object Notation) is a simple serialization format for JavaScript web applications:
object
{ members }
{}
members
string : value
members , string : value
array
[ elements ]
[]
elements
value
elements , value
value
string | number | object | array| true| false | null
Keyword constants and basic building blocks:
TRUE = Keyword("true")
FALSE = Keyword("false")
NULL = Keyword("null")
jsonString = dblQuotedString.setParseAction( removeQuotes )
jsonNumber = Combine( Optional('-') + ( '0' | Word('123456789',nums) ) +
Optional( '.' + Word(nums) ) +
Optional( Word('eE',exact=1) + Word(nums+'+-',nums) ) )
Compound elements:
jsonObject = Forward()
jsonValue = Forward()
jsonElements = delimitedList( jsonValue )
jsonArray = Group( Suppress('[') + jsonElements + Suppress(']') )
jsonValue << ( jsonString | jsonNumber | jsonObject |
jsonArray | TRUE | FALSE | NULL )
memberDef = Group( jsonString + Suppress(':') + jsonValue )
jsonMembers = delimitedList( memberDef )
jsonObject << Dict( Suppress('{') + jsonMembers + Suppress('}') )
jsonComment = cppStyleComment
jsonObject.ignore( jsonComment )
Conversion parse actions:
TRUE.setParseAction( replaceWith(True) )
FALSE.setParseAction( replaceWith(False) )
NULL.setParseAction( replaceWith(None) )
jsonString.setParseAction( removeQuotes )
def convertNumbers(s,l,toks):
n = toks[0]
try:
return int(n)
except ValueError, ve:
return float(n)
jsonNumber.setParseAction( convertNumbers )
{ "glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList":
[{
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"TrueValue": true,
"FalseValue": false,
"IntValue": -144,
"FloatValue": 6.02E23,
"NullValue": null,
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML", "markup"]
}] } } }
[['glossary',
['title', 'example glossary'],
['GlossDiv',
['title', 'S'],
['GlossList',
[['ID', 'SGML'],
['SortAs', 'SGML'],
['GlossTerm', 'Standard Generalized Markup Language'],
['TrueValue', True],
['FalseValue', False],
['IntValue', -144],
['FloatValue', 6.02e+023],
['NullValue', None],
['Acronym', 'SGML'],
['Abbrev', 'ISO 8879:1986'],
['GlossDef',
'A meta-markup language, used to create markup languages such as DocBook.'],
['GlossSeeAlso', ['GML', 'XML', 'markup']]]]]]]
Accessing results by hierarchical path name:
print results.glossary.title print results.glossary.GlossDiv.GlossList.keys() print results.glossary.GlossDiv.GlossList.ID print results.glossary.GlossDiv.GlossList.FalseValue print results.glossary.GlossDiv.GlossList.Acronym example glossary ['GlossSeeAlso', 'GlossDef', 'FalseValue', 'Acronym', 'GlossTerm', 'TrueValue', 'FloatValue', 'IntValue', 'Abbrev', 'SortAs', 'ID', 'NullValue'] SGML False SGML
Implementer of a C API needed to parse the library header to extract the function and parameter declarations.
API followed a very predictable form:
int func1(float *arr, int len, double arg1); int func2(float **arr, float *arr2, int len, double arg1, double arg2);
Was able to define a straightforward grammar using pyparsing:
ident = Word(alphas, alphanums + "_")
vartype = oneOf("float double int char") + Optional(Word("*"))
arglist = delimitedList( Group(vartype + ident) )
functionCall = Literal("int") + ident + "(" + arglist+ ")" + ";"
Added results names to simplify access to individual function elements
Complete program:
from pyparsing import *
testdata = """
int func1(float *arr, int len, double arg1);
int func2(float **arr, float *arr2, int len, double arg1, double arg2);
"""
ident = Word(alphas, alphanums + "_")
vartype = Combine( oneOf("float double int") + Optional(Word("*")), adjacent = False)
arglist = delimitedList( Group(vartype.setResultsName("type") +
ident.setResultsName("name")) )
functionCall = Literal("int") + ident.setResultsName("name") + \
"(" + arglist.setResultsName("args") + ")" + ";"
for fn,s,e in functionCall.scanString(testdata):
print fn.name
for a in fn.args:
print " -",a.type, a.name
Results:
func1 - float* arr - int len - double arg1 func2 - float** arr - float* arr2 - int len - double arg1 - double arg2
Linux distributions:
CheeseShop:
SourceForge:
Pyparsing is not the solution to every text processing problem, but it is a potential solution for many such problems.