| Example Expression (Python) | Effect | Annotations |
| --- | --- | --- |
| m = re.search("xell(.*)soft", s) | search the string s for something like xell_any_string_soft; m.group(1) == "_any_string_" | "()"=group; ".*" = any length string; .=any char; *=repeat previous thing zero or more times |
| "xell\s{1,4}soft" | match something like "xell soft" | "\s"=white space/tab/return; {1,4}=repeat 1 to 4 times |
| "(?i)(?s)aAb\s*BcC" | match case insensitive e.g. AAb bCC | (?i) : case insensitive search; (?s) : "." also matches newlines, so ".*", ".+" and ".{m,n}" can match across line boundaries |
| "xell[0-9ABCDEF]*soft" | match something like xell1234Asoft | "[]"=character set "[^0-9]": ^=inverse character set; *=repeat previous thing zero or more times |
| "harry (?:and|or) mary" | match "harry and mary" or "harry or mary" | "(?:)" non-group bracket; "|" = or |
| "harvex+" | match something like "harvexxxxxx" | "+"=repeat last thing 1 or more times |
| "harvex\+?" | match "harvex" or "harvex+" | "\"=escape special character +; "?"= last thing present or not |
| "\bword\b" | match "word" but only when it is a separate word | "\b"=match empty string at word boundaries only |
(python/perl style regular expressions)
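The table rows can be tried out directly. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not part of the original table.

```python
import re

# Row 1: "(.*)" captures whatever sits between "xell" and "soft".
m = re.search(r"xell(.*)soft", "xell_any_string_soft")
group_text = m.group(1)

# Row 2: one to four whitespace characters are allowed, six are not.
spaces_ok = re.search(r"xell\s{1,4}soft", "xell  soft") is not None
spaces_too_many = re.search(r"xell\s{1,4}soft", "xell      soft") is not None

# Row 3: (?i) ignores case; "\s*" also accepts a newline between the parts.
case_insensitive = re.search(r"(?i)(?s)aAb\s*BcC", "AAb\nbCC") is not None

# Row 4: a character set repeated zero or more times.
hex_like = re.search(r"xell[0-9ABCDEF]*soft", "xell1234Asoft") is not None

# Row 5: non-capturing group with alternation.
either = [re.search(r"harry (?:and|or) mary", s) is not None
          for s in ("harry and mary", "harry or mary", "harry nor mary")]

# Rows 6 and 7: "+" repeats, "\+" is a literal plus sign, "?" makes it optional.
repeated_x = re.fullmatch(r"harvex+", "harvexxxxxx") is not None
optional_plus = [re.fullmatch(r"harvex\+?", s) is not None
                 for s in ("harvex", "harvex+", "harvex++")]

# Row 8: "\b" only matches at word boundaries.
whole_word = [re.search(r"\bword\b", s) is not None
              for s in ("a word here", "swordfish")]
```

Raw strings (the r"..." prefix) are used throughout so that the backslashes reach the regular expression engine unchanged.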
A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. If a string p matches A and another string q matches B, the string pq will match AB, provided that A and B do not specify boundary conditions that are no longer satisfied by pq. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here. For details of the theory and implementation of regular expressions, consult the Friedl book referenced below, or almost any textbook about compiler construction.
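As a small illustration of the concatenation rule and its boundary-condition caveat, consider the sketch below. The sub-expressions and strings are hypothetical examples, and the patterns are joined by plain string concatenation, which is only safe here because neither part contains a top-level "|".

```python
import re

# Two hypothetical sub-expressions and strings that match them.
A, B = r"ab+", r"c\d"
p, q = "abbb", "c7"
parts_match = bool(re.fullmatch(A, p)) and bool(re.fullmatch(B, q))
concat_matches = re.fullmatch(A + B, p + q) is not None

# A boundary condition that the concatenated string no longer satisfies:
# "cat\b" needs a word boundary after "cat", which exists in "cat"
# but not in "cats".
A2, B2 = r"cat\b", r"s"
p2, q2 = "cat", "s"
boundary_parts_match = bool(re.fullmatch(A2, p2)) and bool(re.fullmatch(B2, q2))
boundary_concat_matches = re.fullmatch(A2 + B2, p2 + q2) is not None
```

In the second case each part matches on its own, yet the concatenation fails, which is exactly the exception the paragraph above describes.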
A brief explanation of the format of regular expressions follows. For further information and a gentler presentation, consult for example the Regular Expression HOWTO, accessible from http://www.python.org/doc/howto/.
Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves. You can concatenate ordinary
characters, so last matches the string 'last'.
(In the rest of this section, we'll write RE's in this
special style, usually without quotes, and strings to be matched 'in
single quotes'.)
Some characters, like "|" or "(", are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.
The special characters are:
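*?, +?, ?? The "*", "+", and "?" qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>',
it will match the entire string, and not just '<H1>'.
Adding "?" after the qualifier makes it
perform the match in non-greedy or minimal
fashion; as few characters as possible will be matched. Using
.*? in the previous expression will match
only '<H1>'.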
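The greedy and non-greedy behaviour described above can be checked with a short sketch; the variable names are illustrative only.

```python
import re

text = "<H1>title</H1>"

# Greedy: ".*" grabs as much as possible, so the match runs to the last ">".
greedy = re.match(r"<.*>", text).group()

# Non-greedy: ".*?" stops at the first ">" that lets the pattern succeed.
minimal = re.match(r"<.*?>", text).group()
```

Here greedy is the whole string, while minimal is only the opening tag.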
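{m} Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six "a" characters, but not five.
{m,n} Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 "a" characters. Omitting n specifies an infinite upper bound; for example, a{4,}b will match aaaab or
a thousand "a" characters followed by a b,
but not aaab. The comma may not be omitted or the modifier
would be confused with the previously described form.
{m,n}? Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa',
a{3,5} will match 5 "a" characters, while a{3,5}?
will only match 3 characters.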
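A brief sketch of the counted-repetition forms follows. The patterns a{6} and a{4,}b are used here as sample expressions, and the variable names are illustrative.

```python
import re

# {m}: exactly six "a" characters; five are not enough.
exactly_six = [re.fullmatch(r"a{6}", s) is not None
               for s in ("aaaaa", "aaaaaa")]

# {m,}: at least four "a" characters followed by "b".
at_least_four = [re.fullmatch(r"a{4,}b", s) is not None
                 for s in ("aaab", "aaaab", "a" * 1000 + "b")]

# {m,n} is greedy and takes as many repetitions as allowed;
# {m,n}? is non-greedy and takes as few as allowed.
greedy = re.match(r"a{3,5}", "aaaaaa").group()
minimal = re.match(r"a{3,5}?", "aaaaaa").group()
```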
\ Either escapes special characters (permitting you to match characters like "*", "?", and so forth), or signals a special sequence; special sequences are discussed below. If you're not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn't recognized by Python's parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be doubled. This is complicated and hard to understand, so it's highly recommended that you use raw strings for all but the simplest expressions.
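The difference between raw and ordinary string literals can be seen in a minimal sketch like the one below; the sample strings are made up for illustration.

```python
import re

# A raw string and a doubled-backslash string describe the same pattern.
raw = r"\bword\b"
doubled = "\\bword\\b"
same_pattern = raw == doubled

# Without the r prefix or doubling, Python turns "\b" into a backspace
# character before the regular expression engine ever sees it.
unescaped = "\bword\b"
unescaped_length = len(unescaped)          # backspace + "word" + backspace

raw_matches = re.search(raw, "a word here") is not None
unescaped_matches = re.search(unescaped, "a word here") is not None

# Matching one literal backslash takes two in a raw string, four otherwise.
literal_backslash = re.findall(r"\\", "a\\b")
four_equals_raw = "\\\\" == r"\\"
```

Recent Python 3 releases also emit a warning for unrecognized escapes in ordinary string literals, which is one more reason to prefer raw strings for patterns.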
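[] Used to indicate a set of characters. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a "-". Special characters are not active inside sets. For example, [akm$] will match any of the characters "a", "k", "m", or "$"; [a-z] will match any lowercase letter, and [a-zA-Z0-9]
matches any letter or digit. Character classes such as \w
or \S (defined below) are also acceptable inside a range.
If you want to include a "]" or a "-" inside a set, precede it with a backslash,
or place it as the first character. The pattern []]
will match ']', for example.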
You can match the characters not within a range by complementing the set. This is indicated by including a "^" as the first character of the set; "^" elsewhere will simply match the "^" character. For example, [^5] will match any character except "5".
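The set rules above can be sketched as follows; the input strings are arbitrary examples chosen for this illustration.

```python
import re

# Ranges: any letter or digit.
letters_digits = re.findall(r"[a-zA-Z0-9]", "a-Z_9!")

# "]" placed first in the set is taken literally.
closing_brackets = re.findall(r"[]]", "a]b]")

# "-" escaped with a backslash is taken literally.
with_dash = re.findall(r"[a\-z]", "a-z m")

# A character class such as \w may appear inside a set.
word_or_dot = re.findall(r"[\w.]", "a.b c")

# "^" first complements the set; elsewhere it is an ordinary character.
not_five = re.findall(r"[^5]", "1525")
caret_literal = re.findall(r"[5^]", "5^6")
```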
A|B, where A and B can be arbitrary REs, creates a regular
expression that will match either A or B. An arbitrary number of REs
can be separated by the "|" in this way. This
can be used inside groups (see below) as well. REs separated by "|" are tried from left to right, and the first
one that allows the complete pattern to match is considered the accepted
branch. This means that if A matches, B will
never be tested, even if it would produce a longer overall match. In
other words, the "|" operator is never greedy.
To match a literal "|", use \|, or enclose it inside a character class, as in
[|].
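The left-to-right, non-greedy behaviour of "|" can be observed in a small sketch; the patterns and strings here are illustrative choices.

```python
import re

# The first branch that lets the whole pattern match wins,
# even though the second branch would give a longer match.
first_wins = re.match(r"a|ab", "ab").group()

# Reordering the branches changes the result.
reordered = re.match(r"ab|a", "ab").group()

# A later branch is still tried when the first one cannot lead
# to a complete match of the overall pattern.
later_branch_used = re.fullmatch(r"(?:a|ab)c", "abc") is not None

# Two ways to match a literal "|".
escaped_bar = re.findall(r"\|", "a|b|c")
bar_in_set = re.findall(r"[|]", "a|b|c")
```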
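(...) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals "(" or ")", use \( or \), or enclose them inside a character class: [(] [)].
(?...) This is an extension notation (a "?" following a "(" is not meaningful otherwise). The first character after the "?" determines the meaning and further syntax of the construct. Extensions usually do not create a new group; (?P<name>...) is the only exception to this rule. The supported extensions follow.
(?iLmsux) (One or more letters from the set "i", "L", "m", "s", "u", "x".) The group matches the empty string; the letters set the corresponding flags (re.I, re.L, re.M, re.S, re.U, re.X) for the entire regular expression. This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the compile() function. Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.
(?:...) A non-grouping version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.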
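(?P<name>...) Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named. For example, if the pattern is (?P<id>[a-zA-Z_]\w*),
the group can be referenced by its name in arguments to methods of match
objects, such as m.group('id') or m.end('id'),
and also by name in pattern text (for example, (?P=id))
and replacement text (such as \g<id>).
(?P=name) Matches whatever text was matched by the earlier group named name.
(?#...) A comment; the contents of the parentheses are simply ignored.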
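Named groups, named backreferences and inline comments can be exercised with a sketch like the following. The group name id comes from the text above; the sample strings and variable names are assumptions made for this example.

```python
import re

# Access a named group by name on the match object.
m = re.match(r"(?P<id>[a-zA-Z_]\w*)", "spam_1 = 2")
name_text = m.group("id")
name_end = m.end("id")
# A named group is also a numbered group.
same_as_numbered = m.group(1) == m.group("id")

# (?P=id) in the pattern must repeat the text matched by the group "id".
doubled = re.search(r"(?P<id>\w+) (?P=id)", "say hello hello")
doubled_word = doubled.group("id")

# \g<id> refers to the group in replacement text.
wrapped = re.sub(r"(?P<id>[a-zA-Z_]\w*)", r"<\g<id>>", "x y1")

# (?#...) is ignored by the matcher.
comment_ignored = re.match(r"ab(?#this text is ignored)c", "abc") is not None
```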
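(?=...) Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac '
only if it's followed by 'Asimov'.
(?!...) Matches if ... doesn't match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not
followed by 'Asimov'.
(?<=...) Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. For example, (?<=abc)def will find a match in abcdef. The contained pattern must only match strings of some fixed length.
(?<!...) Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. As with positive lookbehind assertions, the contained pattern must only match strings of some fixed length.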
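The lookahead and lookbehind assertions can be compared side by side in a short sketch. The helper first_match is a hypothetical convenience function written for this example, and the lookbehind patterns and test strings are illustrative choices.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = re.search(pattern, text)
    return m.group() if m else None

# Lookahead: the asserted text is checked but not consumed.
ahead_yes = first_match(r"Isaac (?=Asimov)", "Isaac Asimov")
ahead_no = first_match(r"Isaac (?=Asimov)", "Isaac Newton")

# Negative lookahead: succeeds only when the text does NOT follow.
not_ahead_yes = first_match(r"Isaac (?!Asimov)", "Isaac Newton")
not_ahead_no = first_match(r"Isaac (?!Asimov)", "Isaac Asimov")

# Lookbehind: checks what precedes the current position.
behind_yes = first_match(r"(?<=abc)def", "abcdef")
not_behind_no = first_match(r"(?<!abc)def", "abcdef")
not_behind_yes = first_match(r"(?<!abc)def", "xyzdef")
```

Note that the matched text never includes the asserted part: the lookahead matches return 'Isaac ' and the lookbehind matches return 'def'.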
The special sequences consist of "\" and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. For example, \$ matches the character "$".
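\number Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'the
end' (note the space after the group). This special sequence can
only be used to match one of the first 99 groups. If the first digit
of number is 0, or number is 3 octal digits long,
it will not be interpreted as a group match, but as the character with
octal value number. (There is a group 0, which is the entire
matched pattern, but it can't be referenced with \0;
instead, use \g<0>.) Inside the "[" and "]" of a character
class, all numeric escapes are treated as characters.
\A Matches only at the start of the string.
\b Matches the empty string, but only at the beginning or end of a word.
\B Matches the empty string, but only when it is not at the beginning or end of a word.
\d Matches any decimal digit.
\D Matches any non-digit character.
\s Matches any whitespace character.
\S Matches any non-whitespace character.
\w Matches any alphanumeric character and the underscore.
\W Matches any character that is not alphanumeric or the underscore.
\Z Matches only at the end of the string.
\\ Matches a literal backslash.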
- © 2009 Xellsoft.com. All Rights Reserved