Friday, March 20, 2009

About Python regular expressions

The repetition qualifiers:
*: zero or more repetitions. "a*b" can match b, ab, aab, aaaab.
+: one or more repetitions. "a+b" can match ab and aab, but cannot match b.
?: zero or one occurrence.
{m,n}: at least m and at most n occurrences.
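
A quick way to check the qualifiers is to try them out. The sketch below uses a small helper, matches(), which is my own name and not part of the re module; it relies on re.fullmatch() so that the whole string has to match.

```python
import re

def matches(pattern, text):
    """Return True only if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"a*b", "b"))          # True: '*' allows zero 'a'
print(matches(r"a*b", "aaaab"))      # True
print(matches(r"a+b", "ab"))         # True: '+' needs at least one 'a'
print(matches(r"a+b", "b"))          # False
print(matches(r"a?b", "aab"))        # False: '?' allows at most one 'a'
print(matches(r"a{2,3}b", "aab"))    # True: two or three 'a'
print(matches(r"a{2,3}b", "aaaab"))  # False: four 'a' is too many
```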

Using Python regular expressions
import re (re is the module that provides regular expression support).

Backslash '\'
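In order to match the string "\section", the regular expression has to be "\\section", because the backslash must be escaped for the regex engine. Written as an ordinary Python string literal, each of those backslashes must be escaped again, so the string we pass to compile() is "\\\\section".
A better way is to use a raw string, in which backslashes are not processed by Python. We can pass r"\\section" to the compile function.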
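
As a small check, the two spellings should compile to the same pattern. The sample text here is made up for illustration.

```python
import re

text = r"\section{Intro}"   # sample text containing one literal backslash

plain = re.compile("\\\\section")  # ordinary string literal: four backslashes
raw = re.compile(r"\\section")     # raw string literal: two backslashes

print(plain.pattern == raw.pattern)  # True: both are the regex \\section
print(raw.search(text).group())      # \section
```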

matching and searching
re.match(pattern, string, flags)
re.search(pattern, string, flags)
The difference is: match only tries to match at the beginning of the string, while search scans the string and can find a match anywhere in it.
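
A minimal sketch of the difference, using a made-up sample string:

```python
import re

text = "learn python regex"

print(re.match(r"python", text))          # None: "python" is not at the start
print(re.search(r"python", text).span())  # (6, 12): found further along
print(re.match(r"learn", text).group())   # learn: matches at the start
```

Both functions return None when nothing matches, so it is worth testing the result before calling group() on it.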

Some flags change how the pattern is applied, such as re.I (ignore case) and re.DOTALL (the dot also matches a newline).
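
Here is a short illustration of both flags; the sample strings are my own.

```python
import re

# re.I makes the match case-insensitive.
print(re.search(r"python", "PYTHON rocks"))                # None
print(re.search(r"python", "PYTHON rocks", re.I).group())  # PYTHON

# By default '.' stops at a newline; re.DOTALL lets it cross lines.
text = "first\nsecond"
print(re.match(r".+", text).group())                     # first
print(re.match(r".+", text, re.DOTALL).group() == text)  # True
```

Flags can be combined with the | operator, for example re.I | re.DOTALL.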

re.split() splits a string wherever the pattern matches, which is a handy way to extract the pieces you need.
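
For instance, assuming a line of fields separated by semicolons with uneven spacing (a made-up input), re.split() could be used like this:

```python
import re

line = "id=42; topic = regex ;lang=python"

# Split on a semicolon together with any whitespace around it.
fields = re.split(r"\s*;\s*", line)
print(fields)  # ['id=42', 'topic = regex', 'lang=python']

# If the separator is wrapped in a group, the separators are kept too.
print(re.split(r"(,)", "a,b"))  # ['a', ',', 'b']
```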

(?P<name>...) defines a named group, so the matched contents can be referenced by name.

To extract the strings we need, call group() on the match object.
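
A minimal sketch of named groups and group(). The input line and the group names topic and id are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical input, not taken from any real page.
line = "topic=regex id=1234"

pattern = re.compile(r"topic=(?P<topic>[a-zA-Z]+)\s+id=(?P<id>[0-9]+)")
m = pattern.search(line)

print(m.group("topic"))  # regex
print(m.group("id"))     # 1234
print(m.group(0))        # topic=regex id=1234 (group 0 is the whole match)
print(m.group(2))        # 1234 (named groups are still numbered)
print(m.groupdict())     # {'topic': 'regex', 'id': '1234'}
```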

Python regular expressions are so much fun! I will play with them in some applications in the follow-up posts.

An example:
How to write a regular expression that extracts the id and topic info from a string like this: "". The regular expression could be "[a-zA-Z]*)\"[0-9a-zA-Z=.\'\"\-\n ]*newid=\"(?P[0-9]+)\">". You can then get the id and topic info using the group() function.

1 comment:

  1. Some comments: The metacharacter '.' can match any character except the newline!

    An example of compiling a regular expression:
    >>> import re
    >>> p = re.compile('ab*')
    >>> print(p)
    re.compile('ab*')

    The findall and finditer methods return all the matches of a given pattern. Here is an illustration of finditer, using a pattern that matches runs of digits:

    >>> p = re.compile(r'\d+')
    >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
    >>> iterator
    <callable_iterator object at 0x401833ac>
    >>> for match in iterator:
    ...     print(match.span())
    ...
    (0, 2)
    (22, 24)
    (29, 31)

    '\b' means a word boundary, i.e., the start or end of a word. '\w' matches any word character, [a-zA-Z0-9_].
    The following RE detects doubled words in a string.
    >>> p = re.compile(r'(\b\w+)\s+\1')
    >>> p.search('Paris in the the spring').group()
    'the the'
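
The findall method mentioned in the comment above can be sketched the same way. This is only an illustration; the variable names digits and doubled are my own.

```python
import re

text = "12 drummers drumming, 11 ... 10 ..."
digits = re.compile(r"\d+")

# findall returns the matched strings; finditer yields match objects.
print(digits.findall(text))                       # ['12', '11', '10']
print([m.span() for m in digits.finditer(text)])  # [(0, 2), (22, 24), (29, 31)]

# With one capturing group, findall returns the group, not the whole match.
doubled = re.compile(r"(\b\w+)\s+\1")
print(doubled.findall("Paris in the the spring"))  # ['the']
```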

