This is the regular expressions expert page. For a simpler and quicker overview consult the basic regular expression page.
Regex syntax
Mode of operation
Airlock Gateway uses the pattern matching engine exclusively to perform "find" operations: a match exists if the given pattern occurs at least once in the tested string. In contrast, a "match" operation would succeed only if the given pattern covers the entire tested string. Example: the pattern 'b' matches the string 'abc' in Airlock Gateway. To require a complete match, add the string start and string end anchors (^ and $) to the pattern.
Example: the pattern '^b$' does not match the string 'abc' - but the pattern '^abc$' would match it. As a consequence the empty pattern '' matches every possible string (including the empty string).
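The difference between "find" and "match" can be tried out with a short sketch. It uses Python's `re` module, which is not the PCRE engine used by Airlock Gateway but behaves the same way for these simple patterns. The helper names `occurs_in` and `covers_all` are made up for this illustration.

```python
import re

def occurs_in(pattern: str, text: str) -> bool:
    """'Find' semantics: True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text, re.DOTALL) is not None

def covers_all(pattern: str, text: str) -> bool:
    """'Match' semantics: True only if the pattern covers the whole text."""
    return re.fullmatch(pattern, text, re.DOTALL) is not None

find_b = occurs_in("b", "abc")              # 'b' occurs inside 'abc'
match_b = covers_all("b", "abc")            # 'b' does not cover 'abc'
anchored_b = occurs_in("^b$", "abc")        # anchors force a complete match
anchored_abc = occurs_in("^abc$", "abc")
empty_on_text = occurs_in("", "abc")        # the empty pattern occurs everywhere
empty_on_empty = occurs_in("", "")

print(find_b, match_b, anchored_b, anchored_abc, empty_on_text, empty_on_empty)
```

With "find" semantics, anchoring is the only way to express "the whole value must look like this".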
PCRE runs with the following flags and modes:
- dotall: the dot matches all characters (including newline characters)
- no UCP: standard mode, in which the relevant Perl and POSIX character classes match ASCII characters only
- JIT: Just-in-time compiler support - makes matching faster
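The effect of the dotall mode can be illustrated as follows. This is a sketch in Python's `re` module, where `re.DOTALL` is assumed to be a fair stand-in for the PCRE dotall flag; the sample text is invented.

```python
import re

text = "first line\nsecond line"

# Without dotall, the dot stops at the newline, so the pattern cannot
# span both lines.
without_dotall = re.search(r"first.*second", text)

# With dotall (the mode described above), the dot also matches the newline.
with_dotall = re.search(r"first.*second", text, re.DOTALL)

print(without_dotall)
print(with_dotall.group(0) if with_dotall else None)
```

In dotall mode a pattern such as `.*` can therefore run across line breaks in the tested value.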
Characters
\x{hhh..} Character with Unicode code point U+hhh.. (1 to 6 hex digits) (recommended)
\a Alert (bell) character (U+0007)
\cx "control-x", where x is any ASCII character
\e Escape character (U+001B)
\f Form feed character (U+000C)
\n Line feed character (U+000A)
\o{ddd..} Character with Unicode code point ddd.. (1 to 7 octal digits)
\r Carriage return character (U+000D)
\t Tab character (U+0009)
\xhh Character with Unicode code point hh (hex) (not recommended)
\ddd Character with Unicode code point ddd (octal) (not recommended)
Escaping
\\ Backslash character
\? Escaped character (for any non-alphanumeric character)
Escaping is needed for these characters: [({.*?+^$\|
\Q .. \E Literal-text span: treat enclosed characters as literal until the first appearance of \E (no escaping possible)
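Why escaping matters can be seen in a small sketch. Python's `re` module has no \Q .. \E span, so the example uses `re.escape`, which is assumed here to be a rough counterpart: it backslash-escapes the special characters of a literal text. The sample value is invented.

```python
import re

literal = "price[eur]=1+1"

# Unescaped, '[eur]' is a character class and '1+' a quantified '1',
# so the pattern does not match its own literal text.
unescaped = re.search(literal, literal)

# Escaped, every special character is taken literally.
escaped_pattern = re.escape(literal)
escaped = re.search(escaped_pattern, "value: price[eur]=1+1")

print(unescaped)
print(escaped.group(0) if escaped else None)
```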
Generic characters types
Basic character types
. Any character (including CR or LF)
\C One data unit ("byte") - not recommended
\R Any Unicode newline sequence such as CRLF, CR, LF, VT, FF, NEL, .. - see (*BSR_...) settings
\X A Unicode extended grapheme cluster (meaning a character potentially with modification marks), shorthand for \P{M}\p{M}*
ASCII character types
\d Any ASCII decimal digit - equals [0123456789]
\D Any character that is not an ASCII decimal digit
\s Any ASCII white space character - equals ' ', HT, LF, FF, CR, VT
\S Any character that is not an ASCII white space character
\w Any ASCII "word" character [a-zA-Z0-9_]
\W Any "non-word" character
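The ASCII-only behaviour of these classes can be illustrated with a sketch in Python's `re` module. It assumes that Python's `re.ASCII` flag is comparable to PCRE running without UCP, and that Python's default (Unicode-aware) mode is comparable to PCRE with UCP enabled; it does not run the PCRE engine itself.

```python
import re

thai_three = "\u0e53"   # THAI DIGIT THREE
e_acute = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE

# ASCII-only classes: neither character is a digit or a word character.
ascii_digit = re.search(r"\d", thai_three, re.ASCII)
ascii_word = re.search(r"\w", e_acute, re.ASCII)

# Unicode-aware classes: both characters match.
unicode_digit = re.search(r"\d", thai_three)
unicode_word = re.search(r"\w", e_acute)

print(ascii_digit, ascii_word)
print(unicode_digit is not None, unicode_word is not None)
```

A pattern that must accept non-ASCII digits or letters therefore needs Unicode properties such as \p{..} rather than \d or \w.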
non-ASCII character types
\h Any horizontal white space character (including non-ASCII U+2000, U+00a0, U+180e, ...)
\H Any character that is not a horizontal white space character
\v Any vertical white space character (including non-ASCII U+2028, U+0085)
\V Any character that is not a vertical white space character
Unicode properties and scripts
\p{xx} A character with the Unicode property xx
\P{xx} A character without the Unicode property xx
The following general category property codes are supported:
PCRE special categories
To be used within \p{xxx} and \P{xxx}.
Xan Alphanumeric: union of properties L and N
Xps POSIX-like space: property Z plus HT, NL, VT, FF and CR
Xsp Perl-like space: property Z plus HT, NL, FF, CR
Xwd Perl-like word: property Xan or underscore
Unicode script
Script recognition like \p{Greek} ("is Greek") or \P{Greek} ("is not Greek") is possible for the following scripts:
POSIX named sets
All POSIX named sets include ASCII characters only.
Boundary matchers
Comments
(?#...) comment (not nestable)
Mode modifiers
Option setting
(?i) case insensitive
(?J) allow duplicate names
(?m) multiline (^ and $ also match at newlines within the string)
(?s) dot-matches-all mode (already set by default)
(?U) default ungreedy
(?x) extended (ignore white space) - used for documentation/readability
(?-...) unset option(s)
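A few of these options can be tried out in a sketch with Python's `re` module. Python accepts (?i), (?m), (?s) and (?x) only at the start of a pattern and does not support (?J) or (?U); to switch an option on for part of a pattern it offers the scoped form (?i:...), which is used below in place of a mid-pattern (?i) ... (?-i). The sample strings are invented.

```python
import re

# (?i) at the start: the whole pattern ignores case.
whole_ci = re.search(r"(?i)airlock", "AirLock Gateway")

# Scoped (?i:...): only the enclosed 'y' ignores case.
candidates = ("xyz", "xYz", "Xyz", "xyZ")
scoped = [s for s in candidates if re.fullmatch(r"x(?i:y)z", s)]

# (?m): ^ and $ also match at line breaks inside the string.
without_m = re.search(r"^b$", "a\nb\nc")
with_m = re.search(r"(?m)^b$", "a\nb\nc")

# (?x): white space and #-comments inside the pattern are ignored.
verbose = re.search(r"""(?x)
    \d{4}   # year
    -       # separator
    \d{2}   # month
""", "released 2024-05")

print(whole_ci.group(0), scoped, without_m, with_m.group(0), verbose.group(0))
```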
Unicode operational mode
The following modifiers are recognized only at the start of a pattern or after one of the newline-setting options with similar syntax.
(*NO_START_OPT) no start-match optimization
(*UTF8) set UTF-8 mode: 8-bit library (already set by default)
(*UTF16) set UTF-16 mode: 16-bit library (not recommended)
(*UTF32) set UTF-32 mode: 32-bit library (not recommended)
(*UTF) set appropriate UTF mode for the library (not recommended)
(*UCP) set PCRE_UCP to use Unicode properties instead of ASCII for: [:alnum:] [:alpha:] [:blank:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:word:] \b \B \d \D \s \S \w \W
Newline conventions
Because Airlock Gateway operates in dot-matches-all mode with multiline mode off, these settings are rarely relevant.
(*CR) carriage return only U+000d
(*LF) linefeed only U+000a
(*CRLF) carriage return followed by linefeed
(*ANYCRLF) all three of the above
(*ANY) any Unicode newline sequence
What \R matches
(*BSR_ANYCRLF) CR, LF or CRLF
(*BSR_UNICODE) any Unicode newline sequence (default on Airlock Gateway)
Grouping, capturing and back references
Grouping
(...) capturing group
(?<name>...) named capturing group (used by Airlock Gateway in (?<URL>...) )
(?'name'...) named capturing group
(?P<name>...) named capturing group
(?:...) non-capturing group - grouping only
(?|...) non-capturing group that resets the group numbers for capturing groups in each alternative
(?>...) Atomic grouping (never gives up the first match - no backtracking)
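The difference between capturing, named and non-capturing groups can be illustrated with a sketch in Python's `re` module. Of the three named-group spellings listed above, Python accepts only (?P<name>...), so that one is used; the group name URL follows the usage mentioned above, while the sample input and the URL pattern are invented for this illustration.

```python
import re

# Named capturing group: the matched part is retrievable by name.
html = 'href="https://example.com/a?b=1"'
named = re.search(r"""(?P<URL>https?://[^\s"']+)""", html)
url = named.group("URL") if named else None

# Capturing groups are numbered; a repeated group keeps its last match.
capturing = re.search(r"(ab)+(c)", "ababc")

# A non-capturing group only groups and does not take a number.
non_capturing = re.search(r"(?:ab)+(c)", "ababc")

print(url)
print(capturing.groups())
print(non_capturing.groups())
```

Using (?:...) wherever the captured text is not needed keeps the numbering of the remaining groups stable.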
Back references
\1 .. \9 Back reference by number (for values within 1..9) - not recommended, ambiguous
\g1 .. \g9 Back reference by number (for values within 1..9)
\g{n} Back reference by number
\g{-n} Relative reference by number
\k<name> Back reference by name
\k'name' Back reference by name
\g{name} Back reference by name
\k{name} Back reference by name
(?P=name) Back reference by name
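A back reference matches the same text that the referenced group matched earlier, not the group's pattern. The following sketch uses Python's `re` module, which supports only the \1 and (?P=name) forms from the list above; the sample strings are invented.

```python
import re

# Named back reference: the closing quote must equal the opening quote.
quoted = re.search(r"""(?P<quote>['"]).*?(?P=quote)""", "say \"hi\" or 'bye'")

# Numbered back reference: a doubled word character.
doubled = re.search(r"(\w)\1", "letter")

# Opening and closing quote differ, so the back reference fails.
mismatched = re.search(r"""(['"])\w+\1""", "'abc\"")

print(quoted.group(0), doubled.group(0), mismatched)
```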
Subroutine references
Subroutine references may be recursive.
Conditional patterns
Quantifiers
Lookaround
Lookahead and lookbehind
Lookaround constructs check for existence, but do not 'consume' the text. Lookbehind is only supported for fixed-length strings.
(?=...) positive lookahead
(?!...) negative lookahead
(?<=...) positive lookbehind
(?<!...) negative lookbehind
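The four lookaround forms can be tried out in a sketch with Python's `re` module, which handles them like PCRE for these simple cases and likewise rejects variable-length lookbehind. All patterns and sample values below are invented for illustration and are not taken from an Airlock Gateway configuration.

```python
import re

# Positive lookahead: digits followed by ' EUR'; the unit is not consumed.
amount = re.search(r"\d+(?= EUR)", "price 42 EUR, tax 7 USD")

# Negative lookahead at the start: reject values beginning with a prefix.
no_prefix = r"^(?!Expires:|If-Match:).*['\"]"
accepted = re.search(no_prefix, 'Cookie: a="b"', re.DOTALL)
rejected = re.search(no_prefix, 'Expires: "x"', re.DOTALL)

# Positive lookbehind: digits preceded by 'id='.
after_id = re.search(r"(?<=id=)\d+", "user?id=123")

# Negative lookbehind: numbers not preceded by a dollar sign.
not_dollar = re.findall(r"(?<!\$)\b\d+", "$5 and 10")

# Variable-length lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False

print(amount.group(0), after_id.group(0), not_dollar, variable_lookbehind_ok)
```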
Match point reset
\K resets start of match
\K resets the start of the current whole match. It is a flexible alternative to lookbehind assertions because the discarded part of the match (the part that precedes \K) need not be fixed in length.
Backtracking control
The following tags act as soon as they are reached:
(*ACCEPT) force successful match
(*FAIL) force backtrack; synonym (*F)
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
The following tags act only when a subsequent match failure causes a backtrack to reach them. They all force a match failure, but they differ in what happens afterwards.
(*COMMIT) overall failure, no advance of starting point
(*PRUNE) advance to next starting character
(*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
(*SKIP) advance to current matching position
(*SKIP:NAME) advance to position corresponding to an earlier (*MARK:NAME); if not found, the (*SKIP) is ignored
(*THEN) local failure, backtrack to next alternation
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
Precedence
The precedence of regex operators, from highest to lowest:
- Escaped special characters like \\ \. \( \[ \^
- Extended ASCII characters like \x{hh} \t \r \f \a
- Unicode property \p{Prop}
- Collation-related bracket symbols [: :]
- Character class [ ]
- Grouping ( )
- Quantifiers * + ? {m,n}
- Concatenation
- Anchoring ^ $
- Alternation |
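Two practical consequences of this ordering are that alternation binds more loosely than anchoring and that quantifiers bind more tightly than concatenation. The sketch below shows both with Python's `re` module, which follows the same precedence for these operators; the patterns are invented for illustration.

```python
import re

# '^a|b$' reads as '(^a)|(b$)': starts with 'a' OR ends with 'b'.
loose = re.search(r"^a|b$", "abc")

# Grouping is needed to anchor both alternatives at both ends.
strict = re.search(r"^(?:a|b)$", "abc")

# 'ab+' reads as 'a(b+)': the quantifier applies to 'b' only.
single = re.search(r"ab+", "abab")

# '(?:ab)+' repeats the whole sequence.
repeated = re.search(r"(?:ab)+", "abab")

print(loose.group(0), strict, single.group(0), repeated.group(0))
```

An unanchored alternative in a "find" operation is an easy way to make a pattern match far more than intended, so grouping alternatives explicitly is the safer habit.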
Examples
Pattern | Matches |
---|---|
(the empty pattern) | Matches every string |
| Just the letter 'a' |
| Matches any of the ASCII characters between + and -. This evaluates to the three characters plus, comma and minus. |
| Invalid expression, since the letter 'a' follows the symbol '-'. |
| One of the five characters 'a', 'b', 'c', '-' or 'o' |
| Either the letter 'a' or 'b' or the symbol ']' |
| The at symbol, any digit from 0-9, a lowercase ASCII letter, any uppercase Unicode letter or the tab character |
| Any value that does not start with 'Expires:' or 'If-Match:' and contains a single or double quote |
| xYz and xyz - no other variants |
| Any Thai digit |
Original documentation
For more detailed documentation, see the extensive original PCRE man pages.