Regular Expressions - Experts Page

Regex syntax

Mode of operation

Airlock Gateway uses the pattern matching engine exclusively to perform "find" operations: a match exists if the given pattern occurs at least once in the tested string. In contrast, a "match" operation would succeed only if the given pattern covers the entire tested string. Example: the pattern 'b' matches the string 'abc' in Airlock Gateway. To enforce a complete match, the start-of-string and end-of-string anchors (^ and $) have to be added.

Example: the pattern '^b$' does not match the string 'abc', but the pattern '^abc$' does. As a further consequence of the "find" semantics, the empty pattern '' matches every possible string (including the empty string).
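
The difference between "find" and "match" can be tried out with a minimal sketch. It uses Python's built-in re module as a stand-in, which is an assumption: Airlock Gateway runs PCRE, but both engines behave the same way for these simple patterns. The helper name finds is hypothetical.

```python
import re

def finds(pattern: str, text: str) -> bool:
    """Return True if the pattern occurs anywhere in text ("find" semantics)."""
    return re.search(pattern, text) is not None

# "find": the pattern only has to occur somewhere in the string
print(finds("b", "abc"))                  # True
# Anchors turn the find into a complete match
print(finds("^b$", "abc"))                # False
print(finds("^abc$", "abc"))              # True
# The empty pattern occurs in every string, including the empty one
print(finds("", "abc"), finds("", ""))    # True True
```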

PCRE is run with the following flags / modes:

  • dotall: the dot matches all characters (including newline characters)
  • no UCP: standard mode, in which the relevant Perl and POSIX classes match ASCII characters only
  • JIT: Just-in-time compiler support - makes matching faster
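
The effect of the dotall and no-UCP modes can be illustrated with a small sketch. It assumes that Python's re.DOTALL and re.ASCII flags are close enough analogues of these PCRE modes; it is not the Airlock Gateway configuration itself.

```python
import re

# dotall: with re.DOTALL the dot also matches a line feed
dot_default = re.search("a.b", "a\nb") is not None
dot_all = re.search("a.b", "a\nb", re.DOTALL) is not None

# no UCP: with re.ASCII, \d only matches the ASCII digits 0-9
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
digit_ascii = re.search(r"\d", arabic_three, re.ASCII) is not None
digit_unicode = re.search(r"\d", arabic_three) is not None

print(dot_default, dot_all)        # False True
print(digit_ascii, digit_unicode)  # False True
```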

Characters

\x{hhh..}       Character with unicode codepoint U+hhh.. (1 to 6 hex digits) (recommended to use)
\a              Alert (bell) character (U+0007)
\cx             "control-x", where x is any ASCII character
\e              Escape character (U+001B)
\f              Form feed character (U+000C)
\n              Line feed character (U+000A)
\o{ddd..}       Character with unicode codepoint ddd.. (1 to 7 octal digits)
\r              Carriage return character (U+000D)
\t              Tab character (U+0009)
\xhh            Character with unicode codepoint hh (hex) (not recommended to use!)
\ddd            Character with unicode codepoint ddd (octal) (not recommended to use!)

Escaping

\\              Backslash character
\?              Escaped character (for any non-alphanumeric character)
                   Escaping is needed for these characters: [({.*?+^$\|
\Q .. \E        Literal-text span: treat enclosed characters as literal
                   until the first appearance of \E (no escaping possible)
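
Why escaping matters can be shown with a short sketch. Python's re module does not support \Q .. \E, so the example uses re.escape() as an assumed analogue: it backslash-escapes every metacharacter of a literal text. The sample text is made up for illustration.

```python
import re

literal = "1+1=2 (approx.)"

# Unescaped, '+', '(', ')' and '.' act as metacharacters,
# so the text does not even match itself
unescaped_hit = re.search(literal, literal) is not None

# re.escape() escapes every metacharacter, similar in effect to \Q .. \E
escaped_hit = re.search(re.escape(literal), literal) is not None

print(unescaped_hit, escaped_hit)  # False True
```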

Generic character types

Basic character types

.               Any character (including CR or LF)
\C              One data unit ("Byte") - not recommended to use
\R              Any unicode newline sequence such as CRLF, CR, LF, VT, FF, NEL, ..
                  - see (*BSR_...) settings
\X              a Unicode extended grapheme cluster (meaning a character potentially with modification marks)
                  shorthand for \P{M}\p{M}*

ASCII character types

\d              Any ASCII decimal digit - equals [0123456789]
\D              Any character that is not an ASCII decimal digit
\s              Any ASCII white space character - equals ' ', HT, LF, FF, CR, VT
\S              Any character that is not an ASCII whitespace character
\w              Any ASCII "word" character [a-zA-Z0-9_]
\W              Any "non-word" character

non-ASCII character types

\h              Any horizontal white space character (including non-ASCII U+2000, U+00a0, U+180e, ...)
\H              Any character that is not a horizontal whitespace character
\v              Any vertical white space character (including non-ASCII U+2028, U+0085)
\V              Any character that is not a vertical white space character

Unicode properties and scripts

\p{xx}          a character with the unicode property xx
\P{xx}          a character without the unicode property xx

The following general category property codes are supported:

C               Other
Cc              Control                - The ASCII and Latin-1 control characters (HT, LF, CR, NULL, Escape, NEL, ...)
Cf              Format                 - Non-visible characters for basic formatting (zero width joiner,
                                         activate arabic form shaping, ...)
Cn              Unassigned             - Codepoints that have no characters assigned
Co              Private use            - Company logos, ...
Cs              Surrogate              - No characters, reserved for use in UTF-16 for specifying characters outside basic
                                         multilingual plane

L               Letter
L&              Case letters           - A composite matching all Ll, Lu and Lt characters
Ll              Lower case letter
Lm              Modifier letter        - A small set of letter-like special-use characters
Lo              Other letter           - Letters that have no case and aren't modifiers (hebrew, arabic, bengali, tibetan,
                                         japanese, ...)
Lt              Title case letter      - Letters that appear at the start of a word (e.g. the character Dž is the title case of
                                         the lowercase dž and of the uppercase DŽ)
Lu              Upper case letter

M               Mark
Mc              Spacing mark          - Modification characters that take up space (mostly "vowel signs" in bengali, gujarati,
                                        tamil, telugu, kannada, sinhala, ...)
Me              Enclosing mark        - A small set of marks that can enclose other characters (circles, squares, diamonds,
                                        keycaps, ...)
Mn              Non-spacing mark      - Characters that modify other characters, such as accents, umlauts, "vowel signs",
                                        tone marks, ...

N               Number
Nd              Decimal number        - Zero to nine, in various scripts (not including chinese, japanese and korean):
                                        8, ٣, ໖ or ꘧
Nl              Letter number         - Mostly roman numerals (such as ⅶ, ↉, 〨, 〇, ᛯ or ᛰ)
No              Other number          - Other numbers as ⑲, ፷, ₅, ⅞ or ៵

P               Punctuation
Pc              Connector punctuation - A few punctuation characters with special linguistic meaning such as underscore
Pd              Dash punctuation      - Hyphens and dashes of all sorts
Pe              Close punctuation     - Characters like ), ], }, ⟫, 】, ︾, ...
Pf              Final punctuation     - Characters like », ›, ”, ’, ...
Pi              Initial punctuation   - Characters like «, ‘, ‛, ‹, ...
Po              Other punctuation     - Characters like !, ", *, :, @, ߷, ‱, ⁜, ፣ or ¶
Ps              Open punctuation      - Characters like (, [, {, ༼, ❰, ﹃, ...

S               Symbol
Sc              Currency symbol       - Characters like $, £, ¤, ¥, ₠, ௹, €, ₧, ₪ or ﷼
Sk              Modifier symbol       - Mostly versions of the combining characters, but as full-fledged characters in their own
                                        right such as ^, `, ¸, ˚ or ˨
Sm              Mathematical symbol   - Characters like +, >, =, ±, ÷, ⅀, ∀, ∜, ∰, ≹
                                        or ﬩ (not including '-' which is in Pd)
So              Other symbol          - Various Dingbats, box-drawing symbols, Braille patterns, ...
                                        (¦, ©, ࿏, ℡, ⌫, ▧ or ☣)

Z               Separator
Zl              Line separator        - Just the LINE SEPARATOR character (U+2028)
Zp              Paragraph separator   - Just the PARAGRAPH SEPARATOR character (U+2029)
Zs              Space separator       - Various spacing characters such as normal space, non-breaking space, em-space,
                                        thin space, ...
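
The two-letter general category codes listed above are standard Unicode codes, so they can be looked up outside of PCRE as well. The following sketch uses Python's unicodedata module for that purpose (Python's re module itself does not support \p{xx}); the selection of sample characters is purely illustrative.

```python
import unicodedata

# One sample character per category; the selection is illustrative only
samples = ["A", "a", "\u01c5", "\u0663", "_", "-", "$", "+", "^",
           "\u00a9", " ", "\u2028", "\u2029"]

# Map each code point to its Unicode general category code
categories = {f"U+{ord(ch):04X}": unicodedata.category(ch) for ch in samples}

for codepoint, category in categories.items():
    print(codepoint, category)
```

A character whose category is, for example, Lu would be matched by \p{Lu} and by the composite \p{L}.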

PCRE special categories

To be used within \p{xxx} and \P{xxx}.

Xan             Alphanumeric: union of properties L and N
Xps             POSIX-like space: property Z plus HT, NL, VT, FF and CR
Xsp             Perl-like space: property Z plus HT, NL, FF, CR
Xwd             Perl-like word: property Xan or underscore

Unicode script

Script recognition like \p{Greek} ("is Greek") or \P{Greek} ("is not Greek") is possible for the following scripts:

Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, Yi.

POSIX named sets

All POSIX named sets include ASCII characters only.

[[:alnum:]]     An alphanumeric character: [[:alpha:][:digit:]]
[[:alpha:]]     An ASCII letter: [[:lower:][:upper:]]
[[:ascii:]]     Codepoints U+0000 .. U+007F
[[:blank:]]     A space or a tab: [ \t]
[[:cntrl:]]     An ASCII control character: U+0000 to U+001F, and U+007F
[[:digit:]]     A decimal digit: [0123456789]
[[:graph:]]     A visible ASCII character: [[:alnum:][:punct:]]
[[:lower:]]     A lower-case ASCII character
[[:print:]]     A printable character: [ [:graph:]] (including space U+0020)
[[:punct:]]     Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[[:space:]]     A whitespace character: [ \t\n\x{0b}\f\r]
[[:upper:]]     An upper-case ASCII character
[[:word:]]      Same as \w
[[:xdigit:]]    A hexadecimal digit: [[:digit:]abcdefABCDEF]

[[:^xxxx:]]     negative POSIX named set
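
Python's re module does not know the [[:name:]] syntax, but the ASCII-only definitions above can be written out as explicit character classes. The following sketch does that for a few of the sets; the names POSIX_SETS and in_set are hypothetical and only serve the illustration.

```python
import re
import string

# Hypothetical mapping from POSIX set names to explicit ASCII character classes
POSIX_SETS = {
    "alpha": "[A-Za-z]",
    "digit": "[0-9]",
    "alnum": "[A-Za-z0-9]",
    "blank": r"[ \t]",
    "space": r"[ \t\n\x0b\f\r]",
    "xdigit": "[0-9A-Fa-f]",
    # string.punctuation holds the 32 ASCII punctuation characters
    "punct": "[" + re.escape(string.punctuation) + "]",
}

def in_set(name: str, ch: str) -> bool:
    """Return True if the single character ch belongs to the named set."""
    return re.fullmatch(POSIX_SETS[name], ch) is not None

print(in_set("punct", "'"))       # True
print(in_set("alpha", "\u00e9"))  # False: 'é' is not an ASCII letter
```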

Boundary matchers

^               The beginning of the string
                  (plus after newlines in multiline mode)
$               The end of the string
                  (plus before newline at end of string)
                  (plus before internal newline in multiline mode)
\b              Word boundary
                  the position is between a word character (\w) and a non-word character (\W)
                  beware: only ASCII characters are regarded as word characters
\B              Not a word boundary
[[:<:]]         start of a word
[[:>:]]         end of a word
\A              start of the string
\Z              end of the string
                  (plus before newline at end of string)
\z              end of the string
\G              first matching position in string
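
Two of the less obvious boundary rules above can be checked with a short sketch: $ also matching before a newline at the end of the string, and \b treating only ASCII characters as word characters. The sketch assumes Python's re module with the re.ASCII flag as a stand-in for PCRE without UCP. It deliberately avoids \Z, whose meaning in Python differs from PCRE.

```python
import re

# $ also matches before a newline at the very end of the string
dollar_before_newline = re.search(r"abc$", "abc\n") is not None

cafe = "caf\u00e9"  # 'café' with a precomposed 'é'

# ASCII-only word characters: 'é' counts as a non-word character,
# so a word boundary is found between 'f' and 'é'
boundary_ascii = re.search(r"caf\b", cafe, re.ASCII) is not None

# With Unicode word characters there is no boundary at that position
boundary_unicode = re.search(r"caf\b", cafe) is not None

print(dollar_before_newline)             # True
print(boundary_ascii, boundary_unicode)  # True False
```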

Comments

(?#...)         comment (not nestable)

Mode modifiers

Option setting

(?i)            case insensitive
(?J)            allow duplicate names
(?m)            multiline (^ and $ also match at newlines inside the string)
(?s)            dot-matches-all mode (already set by default)
(?U)            default ungreedy
(?x)            extended (ignore white space) - used for documentation/readability
(?-...)         unset option(s)
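
The (?i) and (?x) modifiers work the same way in many engines. The following sketch uses Python's re module as an assumed stand-in and places both modifiers at the start of the pattern; the sample strings are made up.

```python
import re

# (?i): the whole pattern is matched case-insensitively
ci = re.search(r"(?i)content-type", "Content-Type: text/html") is not None

# (?x): unescaped white space is ignored and '#' starts a comment,
# which allows a pattern to be documented inline
version = re.search(r"""(?x)
    (\d+)   # major version
    \.      # literal dot
    (\d+)   # minor version
""", "HTTP/1.1")

print(ci)                # True
print(version.groups())  # ('1', '1')
```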

Unicode operational mode

The following modifiers are recognized only at the start of a pattern or after one of the newline-setting options with similar syntax.

(*NO_START_OPT) no start-match optimization
(*UTF8)         set UTF-8 mode: 8-bit library (already set by default)
(*UTF16)        set UTF-16 mode: 16-bit library (not recommended)
(*UTF32)        set UTF-32 mode: 32-bit library (not recommended)
(*UTF)          set appropriate UTF mode for the library (not recommended)
(*UCP)          set PCRE_UCP to use Unicode properties instead of ASCII for:
                    [:alnum:] [:alpha:] [:blank:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:word:]
                    \b \B \d \D \s \S \w \W

Newline conventions

Since Airlock Gateway operates in dot-matches-all mode with multiline mode off, these settings are normally of little relevance.

(*CR)           carriage return only U+000d
(*LF)           linefeed only U+000a
(*CRLF)         carriage return followed by linefeed
(*ANYCRLF)      all three of the above
(*ANY)          any Unicode newline sequence

What \R matches

(*BSR_ANYCRLF)  CR, LF or CRLF
(*BSR_UNICODE)  any Unicode newline sequence (default on Airlock Gateway)

Grouping, capturing and back references

Grouping

(...)           capturing group
(?<name>...)    named capturing group (used by Airlock Gateway in (?<URL>...) )
(?'name'...)    named capturing group
(?P<name>...)   named capturing group
(?:...)         non-capturing group - grouping only
(?|...)         non-capturing group
                  reset group numbers for capturing groups in each alternative

(?>...)         Atomic grouping (never gives up the first match - no backtracking)

Back references

\1 .. \9        Back reference by number (for values within 1..9) - not recommended because it is ambiguous
\g1 .. \g9      Back reference by number (for values within 1..9)
\g{n}           Back reference by number
\g{-n}          Relative reference by number
\k<name>        Back reference by name
\k'name'        Back reference by name
\g{name}        Back reference by name
\k{name}        Back reference by name
(?P=name)       Back reference by name
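
A back reference matches the same text that the referenced group captured, not the group's pattern. The sketch below illustrates this with the (?P<name>...) and (?P=name) forms, which Python's re module shares with PCRE; the pattern names quoted and doubled are hypothetical.

```python
import re

# The closing quote must be the same character as the opening quote
quoted = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")

# A word that is immediately repeated (back reference by number)
doubled = re.compile(r"\b(\w+) \1\b")

print(quoted.search('say "hi" now').group())   # "hi"
print(quoted.search("a 'b\" c"))               # None: quotes do not pair up
print(doubled.search("it is is ok").group(1))  # is
```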

Subroutine references

Subroutine references may be recursive.

(?R)            recurse whole pattern
(?n)            call subpattern by absolute number (0 for whole)
(?+n)           call subpattern by relative number
(?-n)           call subpattern by relative number
(?&name)        call subpattern by name
(?P>name)       call subpattern by name
\g<name>        call subpattern by name
\g'name'        call subpattern by name
\g<n>           call subpattern by absolute number
\g'n'           call subpattern by absolute number
\g<+n>          call subpattern by relative number
\g<-n>          call subpattern by relative number
\g'+n'          call subpattern by relative number
\g'-n'          call subpattern by relative number

Conditional patterns

(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)

(?(n)...        absolute reference condition
(?(+n)...       relative reference condition
(?(-n)...       relative reference condition
(?(<name>)...   named reference condition
(?('name')...   named reference condition
(?(name)...     named reference condition
(?(R)...        overall recursion condition
(?(Rn)...       specific group recursion condition
(?(R&name)...   specific group recursion condition
(?(DEFINE)...   define subpattern for references
(?(assert)...   assertion condition
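
A conditional pattern with an absolute reference condition, (?(n)...), can be sketched as follows. The example assumes Python's re module, which supports this particular form; the pattern name bracketed and the sample strings are made up. The closing '>' is required only if the optional opening '<' in group 1 took part in the match.

```python
import re

# Group 1 optionally matches '<'; (?(1)>) demands '>' only if group 1 matched
bracketed = re.compile(r"^(<)?\w+(?(1)>)$")

for candidate in ["<abc>", "abc", "<abc", "abc>"]:
    print(candidate, bracketed.search(candidate) is not None)
```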

Quantifiers

X?              X, one single occurrence or none at all, equivalent to X{0,1}, greedy
X?+             X, one single occurrence or none at all, possessive
X??             X, one single occurrence or none at all, lazy
X*              X, zero or more times, equivalent to X{0,}, greedy
X*+             X, zero or more times, possessive
X*?             X, zero or more times, lazy
X+              X, one or more times, equivalent to X{1,}, greedy
X++             X, one or more times, possessive
X+?             X, one or more times, lazy
X{n}            X, exactly n times
X{min,}         X, at least min times, greedy
X{min,}+        X, at least min times, possessive
X{min,}?        X, at least min times, lazy
X{min,max}      X, at least min but not more than max times, greedy
X{min,max}+     X, at least min but not more than max times, possessive
X{min,max}?     X, at least min but not more than max times, lazy
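
The difference between greedy and lazy quantifiers is easiest to see on a short input. The sketch below assumes Python's re module as a stand-in for PCRE. Possessive quantifiers are left out because older Python versions do not support them.

```python
import re

text = "<a><b>"

# Greedy: .+ takes as much as possible, then backtracks to the last '>'
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? takes as little as possible and stops at the first '>'
lazy = re.search(r"<.+?>", text).group()

print(greedy)  # <a><b>
print(lazy)    # <a>
```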

Lookaround

Lookahead and lookbehind

Lookaround constructs check for existence, but do not 'consume' the text. Lookbehind is only supported for fixed-length strings.

(?=...)         positive look ahead
(?!...)         negative look ahead
(?<=...)        positive look behind
(?<!...)        negative look behind
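
That lookaround constructs check for text without consuming it can be illustrated with a short sketch. It assumes Python's re module as a stand-in for PCRE; the sample string and variable names are made up.

```python
import re

text = "user=alice;role=admin"

# Positive lookbehind: 'user=' must precede the match but is not part of it
value = re.search(r"(?<=user=)\w+", text).group()

# Positive lookahead: ';' must follow but is not consumed
name = re.search(r"alice(?=;)", text).group()

# Negative lookahead: 'role=' must not be followed by 'admin'
non_admin_here = re.search(r"role=(?!admin)\w+", text) is not None
non_admin_guest = re.search(r"role=(?!admin)\w+", "role=guest") is not None

print(value, name)                      # alice alice
print(non_admin_here, non_admin_guest)  # False True
```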

Match point reset

\K              resets start of match

This resets the start of the current overall match. It provides a flexible alternative to lookbehind assertions, because the discarded part of the match (the part that precedes \K) need not be fixed in length.

Backtracking control

The following tags act as soon as they are reached:

(*ACCEPT)      force successful match
(*FAIL)        force backtrack; synonym (*F)
(*MARK:NAME)   set name to be passed back; synonym (*:NAME)

The following tags act only when a subsequent match failure causes a backtrack to reach them. They all force a match failure, but they differ in what happens afterwards.

(*COMMIT)      overall failure, no advance of starting point
(*PRUNE)       advance to next starting character
(*PRUNE:NAME)  equivalent to (*MARK:NAME)(*PRUNE)
(*SKIP)        advance to current matching position
(*SKIP:NAME)   advance to position corresponding to an earlier (*MARK:NAME);
                  if not found, the (*SKIP) is ignored
(*THEN)        local failure, backtrack to next alternation
(*THEN:NAME)   equivalent to (*MARK:NAME)(*THEN)

Precedence

The precedence of regex operators, from highest to lowest:

  • Escaped special characters like \\ \. \( \[ \^
  • Extended ASCII characters like \x{hh} \t \r \f \a
  • Unicode property \p{Prop}
  • Collation-related bracket symbols [: :]
  • Character class [ ]
  • Grouping ( )
  • Quantifiers * + ? {m,n}
  • Concatenation
  • Anchoring ^ $
  • Alternation |

Examples

Pattern                             Matches

''                                  The empty pattern matches every string
[aaaa]                              Just the letter 'a'
[+--]                               Any ASCII character in the range from '+' to '-', which evaluates to the
                                      three characters plus, comma and minus
[a--@]                              Invalid expression: the range a-- is out of order, since the letter 'a'
                                      comes after the symbol '-' in code point order
[a-c-o]                             One of the five characters 'a', 'b', 'c', '-' or 'o'
[]ab]                               Either the letter 'a' or 'b' or the symbol ']'
[@\d[:lower:]\p{Lu}\t]              The at symbol, any digit from 0-9, a lowercase ASCII letter, any uppercase
                                      Unicode letter or the tab character
(?<!^Expires|^If-Match):.*['"]      Any value that does not start with 'Expires:' or 'If-Match:' and contains
                                      a single or double quote
(x(?i)y)z                           xyz and xYz - no other variants
(?=\p{N})\p{Thai}                   Any Thai digit
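
Some of the character class examples can be reproduced with a short sketch. It assumes Python's re module, which treats these three classes the same way as PCRE; the other examples rely on PCRE-specific syntax and are left out. The sample text is made up.

```python
import re

text = "abcdefno-]@"

# Repeating a character inside a class has no effect
only_a = re.findall(r"[aaaa]", text)

# A '-' directly after a complete range is a literal minus
range_and_minus = re.findall(r"[a-c-o]", text)

# A ']' directly after the opening '[' is a literal bracket
with_bracket = re.findall(r"[]ab]", text)

print(only_a)           # ['a']
print(range_and_minus)  # ['a', 'b', 'c', 'o', '-']
print(with_bracket)     # ['a', 'b', ']']
```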

Original documentation