*: zero or more repetitions. "a*b" can match b, ab, aab, aaaab.
+: one or more repetitions. "a+b" can match ab and aab, but cannot match b.
?: zero or one occurrence.
{m,n}: at least m and at most n occurrences.
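A small sketch of the four quantifiers, checked with re.fullmatch() so that the whole string has to fit the pattern. Note that Python writes the bounded form with braces, {m,n}. The helper name fits and the sample strings are only illustrative.

```python
import re

def fits(pattern, text):
    """Return True only if the whole string fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# '*' allows zero a's, so "b" alone is accepted.
star = [s for s in ["b", "ab", "aaab"] if fits(r"a*b", s)]
# '+' needs at least one a, so "b" alone is rejected.
plus = [s for s in ["b", "ab", "aaab"] if fits(r"a+b", s)]
# '?' allows at most one a, so "aab" is rejected.
opt = [s for s in ["b", "ab", "aab"] if fits(r"a?b", s)]
# '{2,3}' needs two or three a's.
rng = [s for s in ["ab", "aab", "aaab", "aaaab"] if fits(r"a{2,3}b", s)]

print(star, plus, opt, rng)
```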
Using Python regular expressions
import re; re is the module that supports regular expressions.
Backslash '\'
To match the string "\section", the regular expression must be "\\section", because the backslash itself has to be escaped. In an ordinary Python string literal each of those backslashes must be escaped again, so the string we pass to compile() is "\\\\section".
A better way is to use a raw string: we can pass r"\\section" to the compile function.
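To see that the two spellings really are the same pattern, here is a minimal sketch. The sample text is made up for illustration.

```python
import re

text = r"\section{Intro}"      # the text contains one real backslash

plain = "\\\\section"          # ordinary literal: four backslashes typed
raw = r"\\section"             # raw literal: two backslashes typed

same = plain == raw            # both hold the same 9 characters
found = re.search(raw, text).group()

print(same, found)
```

Both literals produce the pattern \\section, which the regex engine reads as "a literal backslash followed by section".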
Matching and searching
re.match(pattern, string, flags)
re.search(pattern, string, flags)
The difference is: match only tries to match at the beginning of the string, while search can find a match anywhere in the string.
Some flags: re.I (ignore case), re.DOTALL (the dot matches any character, including a newline).
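A short sketch of the difference between the two functions and of the two flags mentioned above. The sample strings are made up.

```python
import re

text = "Hello World"

at_start = re.match(r"World", text)            # None: "World" is not at position 0
anywhere = re.search(r"World", text)           # found further into the string
ignore_case = re.match(r"hello", text, re.I)   # re.I lets "hello" match "Hello"

one_line = re.search(r"a.b", "a\nb")             # None: '.' does not match a newline
dotall = re.search(r"a.b", "a\nb", re.DOTALL)    # with re.DOTALL it does

print(at_start, anywhere.start(), ignore_case.group())
```

Both functions return None when nothing matches, so it is worth checking the result before calling group() on it.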
re.split() splits a string wherever the pattern matches, which is handy for extracting the pieces you need.
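For instance, re.split() can cut a line at a separator while swallowing the whitespace around it. The input line below is a hypothetical example, not taken from the post.

```python
import re

line = "id=42; topic = regex ;lang=python"   # hypothetical input

# Split on ';' together with any whitespace around it.
fields = re.split(r"\s*;\s*", line)

# Split each field again on '=' to get [key, value] pairs.
pairs = [re.split(r"\s*=\s*", field) for field in fields]

print(fields)
print(pairs)
```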
(?P<name>...): a named group.
To extract the strings we need, use group().
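A minimal sketch of group() with numbered and named groups, using the (?P<name>...) syntax. The input line and the group names id and topic are assumptions chosen for illustration.

```python
import re

# Hypothetical input, made up for this sketch.
line = "id=1024 topic=regex-basics"
pattern = re.compile(r"id=(?P<id>\d+)\s+topic=(?P<topic>[\w-]+)")

m = pattern.search(line)
whole = m.group()            # the entire match
by_number = m.group(1)       # first group, by position
by_name = m.group("topic")   # same mechanism, by name
as_dict = m.groupdict()      # all named groups at once

print(whole, by_number, by_name, as_dict)
```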
Python regular expressions are so much fun! I will play with them in some applications in follow-up posts.
An example:
about how to write a regular expression to extract the id and topic info from a string like this: "
Some comments: The metacharacter '.' can match any character except the newline!
An example of compiling a regular expression:
>>> import re
>>> p = re.compile('ab*')
>>> print(p)
re.compile('ab*')
The findall and finditer methods return all the matches of a given pattern: findall returns a list of strings, while finditer returns an iterator of match objects. Here is an illustration of finditer:
>>> p = re.compile(r'\d+')
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable_iterator object at 0x401833ac>
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)
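For comparison, here is a sketch that runs findall and finditer side by side on the same string, assuming a pattern that matches runs of digits. The variable names are illustrative.

```python
import re

digits = re.compile(r"\d+")
text = "12 drummers drumming, 11 ... 10 ..."

# findall returns the matched strings directly.
numbers = digits.findall(text)

# finditer yields match objects, which also carry the positions.
spans = [m.span() for m in digits.finditer(text)]

print(numbers)
print(spans)
```

findall is enough when only the text is needed; finditer is the better choice when the positions or the groups of each match matter.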
'\b' means a word boundary, i.e., the start or end of a word. '\w' matches a word character such as [a-zA-Z0-9_].
The following RE detects doubled words in a string.
>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'
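To see what the '\b' contributes, this sketch compares the pattern above with a variant that drops it. The variant and the test sentence are made up for illustration.

```python
import re

doubled = re.compile(r"(\b\w+)\s+\1")     # pattern from the example above
no_boundary = re.compile(r"(\w+)\s+\1")   # same pattern without \b

hit = doubled.search("Paris in the the spring").group()

# With \b the group must start at the beginning of a word, so nothing matches.
miss = doubled.search("this is fine")

# Without \b the group may start mid-word: the "is" inside "this" pairs with "is".
false_hit = no_boundary.search("this is fine").group()

print(hit, miss, false_hit)
```

The pattern has no '\b' after the backreference, so a phrase like "the theme" would still be reported; appending '\b' is one way to tighten it.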