Sunday, July 19, 2009

Python regular expressions: Grouping

Metacharacters are not active inside classes. For example, [akm$] will match any of the characters "a", "k", "m", or "$"; "$" is usually a metacharacter, but inside a character class it's stripped of its special nature.
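A minimal sketch of the point above, using a made-up sample string: inside the class "$" is matched literally, while outside it "$" only anchors the end of the string.

```python
import re

sample = 'make $5'  # hypothetical sample text

# Inside a character class, "$" is just another literal character.
in_class = re.findall(r'[akm$]', sample)
print(in_class)       # ['m', 'a', 'k', '$']

# Outside a class, "$" is an anchor: it matches the empty string at the end.
outside_class = re.findall(r'$', sample)
print(outside_class)  # ['']
```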

Python provides very powerful grouping facilities. When you get a match object,
you can call its group(index) or groups() method.
groups() returns the tuple (group(1), group(2), ...), while group(0) is the whole match.
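As a small illustration of group(index) and groups(), here is a sketch with a made-up pattern and input (neither comes from the original post):

```python
import re

# Hypothetical pattern with two capturing groups: user name and domain.
m = re.match(r'(\w+)@(\w+)\.com', 'alice@example.com')

whole = m.group(0)       # the entire match
user = m.group(1)        # first parenthesised group
domain = m.group(2)      # second parenthesised group
all_groups = m.groups()  # tuple of group(1), group(2), ...

print(whole)       # alice@example.com
print(all_groups)  # ('alice', 'example')
```

Note that groups() does not include group(0); it starts at the first parenthesised group.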

The syntax for a named group is one of the Python-specific extensions: (?P<name>...). name is, obviously, the name of the group. Apart from associating a name with a group, named groups behave identically to capturing groups. The MatchObject methods that deal with capturing groups all accept either an integer, to refer to a group by number, or a string containing the group name. Named groups are still given numbers, so you can retrieve information about a group in two ways:

>>> import re
>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search( '(((( Lots of punctuation )))' )
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'

You can also refer back to an earlier named group by its name, using (?P=name):
>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'

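P.S.:
Note that both match() and search() return as soon as they find ONE substring that fits the pattern. match() only tries at the start of the string, while search() scans forward; to get every match, use findall() or finditer().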
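To make the difference concrete, here is a sketch that reuses the doubled-word pattern on a made-up sentence containing two repeats; search() reports only the first one, while finditer() walks through all of them.

```python
import re

# Doubled-word pattern: a word, whitespace, then the same word again.
p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
text = 'Paris in the the spring and and summer'  # hypothetical sample

# search() stops at the first match.
first = p.search(text).group()

# finditer() yields one match object per non-overlapping match.
every = [m.group() for m in p.finditer(text)]

print(first)  # the the
print(every)  # ['the the', 'and and']
```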

An example of grouping:
This is a case I ran into in a backup and recovery system.
Suppose we want to split the filename
2009-07-22-23-09.tar.gz (year-month-day-hour-minute.tar.gz)
into the year, month, day, hour and minute at which the file was created. What should the regular expression look like?
The pattern should be

(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})-(?P<hour>\d{2})-(?P<minute>\d{2})\.tar\.gz
The file name is split into five named groups, and we can retrieve each part by calling the group(name) method on the match object.
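A runnable sketch of the filename example follows. The group names (year, month, day, hour, minute) and the variable names are assumptions chosen to mirror the description above; groupdict() returns all named groups at once.

```python
import re
from datetime import datetime

# Group names are illustrative; they follow the year-month-day-hour-minute layout.
filename_pat = re.compile(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})-'
    r'(?P<hour>\d{2})-(?P<minute>\d{2})\.tar\.gz'
)

m = filename_pat.match('2009-07-22-23-09.tar.gz')

year = m.group('year')  # retrieve one part by name
parts = m.groupdict()   # or all named parts as a dict of strings

# Convert the string parts to integers to build a datetime.
created = datetime(**{name: int(value) for name, value in parts.items()})

print(year)     # 2009
print(created)  # 2009-07-22 23:09:00
```

The keys returned by groupdict() happen to line up with the keyword arguments of datetime, which is why the conversion is a one-liner here; with different group names an explicit mapping would be needed.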

A better example using VERBOSE:
pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
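
A short usage sketch for the verbose pattern, assuming the two groups are named header and value and using a made-up header line. One detail worth noticing: the pattern does not skip whitespace after the colon, so the captured value keeps its leading space and is stripped here afterwards.

```python
import re

# Same pattern as above; the group names "header" and "value" are assumed.
pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value (non-greedy)
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

line = '  Subject: Hello world   '  # hypothetical input line
m = pat.match(line)

header = m.group('header')
raw_value = m.group('value')  # still has the space that follows the colon
value = raw_value.strip()

print(repr(header))     # 'Subject'
print(repr(raw_value))  # ' Hello world'
print(repr(value))      # 'Hello world'
```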
