21 Text Processing

[?U, ?N]

)

(

[?U, ?N]

)

(

[?U, ?N]

)

(

[?U, ?N]

)

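Quantifier :: QuantifierPrefix
Quantifier :: QuantifierPrefix ?

QuantifierPrefix :: *
QuantifierPrefix :: +
QuantifierPrefix :: ?
QuantifierPrefix :: { DecimalDigits }
QuantifierPrefix :: { DecimalDigits , }
QuantifierPrefix :: { DecimalDigits , DecimalDigits }

Atom[U, N] :: PatternCharacter
Atom[U, N] :: .
Atom[U, N] :: \ AtomEscape[?U, ?N]
Atom[U, N] :: CharacterClass[?U]
Atom[U, N] :: ( GroupSpecifier[?U] Disjunction[?U, ?N] )
Atom[U, N] :: ( ? : Disjunction[?U, ?N] )

SyntaxCharacter :: one of ^ $ \ . * + ? ( ) [ ] { } |

PatternCharacter :: SourceCharacter but not SyntaxCharacter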

[U, N]

[?U]

[?U]

[+N]

[?U]

[U]

[lookahead ∉ DecimalDigit]

HexEscapeSequence

[?U]

[?U]

one of

one of

[U]

[empty]

[?U]

[U]

[?U]

[U]

[?U]

RegExpIdentifierName

[?U]

[?U]

[U]

UnicodeIDStart

[?U]

[U]

UnicodeIDContinue

[?U]

<ZWNJ>

<ZWJ>

[U]

[+U]

[+U]

[+U]

[+U]

[~U]

[+U]

}

Each \u TrailSurrogate for which the choice of associated u LeadSurrogate is ambiguous shall be associated with the nearest possible u LeadSurrogate that would otherwise have no corresponding \u TrailSurrogate.

LeadSurrogate :: Hex4Digits but only if the SV of Hex4Digits is in the inclusive range 0xD800 to 0xDBFF

TrailSurrogate :: Hex4Digits but only if the SV of Hex4Digits is in the inclusive range 0xDC00 to 0xDFFF

NonSurrogate :: Hex4Digits but only if the SV of Hex4Digits is not in the inclusive range 0xD800 to 0xDFFF

IdentityEscape[U] :: [+U] SyntaxCharacter
IdentityEscape[U] :: [+U] /
IdentityEscape[U] :: [~U] SourceCharacter but not UnicodeIDContinue

opt

[lookahead ∉ DecimalDigit]

[U]

[+U]

}

[+U]

}

LoneUnicodePropertyNameOrValue

UnicodePropertyName

UnicodePropertyValue

UnicodePropertyName

UnicodePropertyNameCharacter

opt

UnicodePropertyValue

LoneUnicodePropertyNameOrValue

UnicodePropertyValueCharacter

UnicodePropertyValueCharacter

opt

UnicodePropertyNameCharacter

ControlLetter

CharacterClass

[U]

[

[lookahead ∉ {^}]

[?U]

]

[

[?U]

]

[U]

[empty]

[?U]

[U]

[?U]

[?U]

[?U]

[?U]

[?U]

[?U]

[U]

[?U]

[?U]

[?U]

[?U]

[?U]

[?U]

[U]

[?U]

[U]

SourceCharacter but not one of \ or ] or -

[?U]

[U]

[+U]

[?U]

[?U]

21.2.1.1 Static Semantics: Early Errors

Pattern :: Disjunction

It is a Syntax Error if NcapturingParens ≥ 2³² - 1.
It is a Syntax Error if Pattern contains multiple GroupSpecifiers whose enclosed RegExpIdentifierNames have the same StringValue.

QuantifierPrefix :: { DecimalDigits , DecimalDigits }

It is a Syntax Error if the MV of the first DecimalDigits is larger than the MV of the second DecimalDigits.

AtomEscape :: k GroupName

It is a Syntax Error if the enclosing Pattern does not contain a GroupSpecifier with an enclosed RegExpIdentifierName whose StringValue equals the StringValue of the RegExpIdentifierName of this production's GroupName.

AtomEscape :: DecimalEscape

It is a Syntax Error if the CapturingGroupNumber of DecimalEscape is larger than NcapturingParens (21.2.2.1).
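
The early errors above can be observed from script, because the RegExp constructor throws a SyntaxError when a pattern violates them. The following is a minimal sketch, assuming an engine that supports named groups and the u flag; `compiles` is a hypothetical helper introduced only for this illustration.

```javascript
// Hypothetical helper: returns true if the pattern compiles,
// otherwise the name of the error that was thrown.
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    return err.name;
  }
}

const outOfOrder = compiles("a{2,1}");            // first bound larger than second
const duplicateName = compiles("(?<n>a)(?<n>b)"); // same GroupName twice in one alternative
const danglingName = compiles("(?<n>a)\\k<m>");   // \k refers to a name with no GroupSpecifier
const danglingNumber = compiles("(a)\\2", "u");   // \2 with only one capturing group
const wellFormed = compiles("(?<n>a)\\k<n>{1,2}");
```

The numeric backreference is checked with the u flag because implementations that follow Annex B accept `\2` as a legacy escape in patterns without it.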

NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges

It is a Syntax Error if IsCharacterClass of the first ClassAtom is true or IsCharacterClass of the second ClassAtom is true.
It is a Syntax Error if IsCharacterClass of the first ClassAtom is false and IsCharacterClass of the second ClassAtom is false and the CharacterValue of the first ClassAtom is larger than the CharacterValue of the second ClassAtom.

NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassRanges

It is a Syntax Error if IsCharacterClass of ClassAtomNoDash is true or IsCharacterClass of ClassAtom is true.
It is a Syntax Error if IsCharacterClass of ClassAtomNoDash is false and IsCharacterClass of ClassAtom is false and the CharacterValue of ClassAtomNoDash is larger than the CharacterValue of ClassAtom.
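
As an illustration of the class range rules, the sketch below probes three character classes. `classError` is a hypothetical helper, and the second case uses the u flag on the assumption that the engine also implements Annex B, which relaxes that rule for patterns without it.

```javascript
// Hypothetical helper: returns null if the pattern compiles,
// otherwise the name of the error that was thrown.
function classError(source, flags = "") {
  try {
    new RegExp(source, flags);
    return null;
  } catch (err) {
    return err.name;
  }
}

const reversedRange = classError("[z-a]");         // CharacterValue of z exceeds that of a
const classEndpoint = classError("[\\d-z]", "u");  // \d is a character class, not a single character
const orderedRange = classError("[a-z]");          // valid range
```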

RegExpIdentifierStart[U] :: \ RegExpUnicodeEscapeSequence[?U]

It is a Syntax Error if SV(RegExpUnicodeEscapeSequence) is none of "$", or "_", or the UTF16Encoding of a code point matched by the UnicodeIDStart lexical grammar production.

RegExpIdentifierPart[U] :: \ RegExpUnicodeEscapeSequence[?U]

It is a Syntax Error if SV(RegExpUnicodeEscapeSequence) is none of "$", or "_", or the UTF16Encoding of either <ZWNJ> or <ZWJ>, or the UTF16Encoding of a Unicode code point that would be matched by the UnicodeIDContinue lexical grammar production.

UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue

It is a Syntax Error if the List of Unicode code points that is SourceText of UnicodePropertyName is not identical to a List of Unicode code points that is a Unicode property name or property alias listed in the “Property name and aliases” column of Table 54.
It is a Syntax Error if the List of Unicode code points that is SourceText of UnicodePropertyValue is not identical to a List of Unicode code points that is a value or value alias for the Unicode property or property alias given by SourceText of UnicodePropertyName listed in the “Property value and aliases” column of the corresponding tables Table 56 or Table 57.

UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue

It is a Syntax Error if the List of Unicode code points that is SourceText of LoneUnicodePropertyNameOrValue is not identical to a List of Unicode code points that is a Unicode general category or general category alias listed in the “Property value and aliases” column of Table 56, nor a binary property or binary property alias listed in the “Property name and aliases” column of Table 55.
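
The property-escape rules can be exercised as follows. This is a sketch only, assuming an engine that supports `\p{…}` under the u flag; `propertyError` is a hypothetical helper, and the property names chosen (Script=Greek, Lu, Alphabetic) are ordinary Unicode names picked for illustration.

```javascript
const greek = /\p{Script=Greek}/u;     // UnicodePropertyName = UnicodePropertyValue
const uppercase = /\p{Lu}/u;           // lone general category alias
const alphabetic = /\p{Alphabetic}/u;  // lone binary property

// Hypothetical helper: returns null if \p{body} compiles under the u flag,
// otherwise the name of the error that was thrown.
function propertyError(body) {
  try {
    new RegExp("\\p{" + body + "}", "u");
    return null;
  } catch (err) {
    return err.name;
  }
}

const unknownName = propertyError("NoSuchProperty");
const unknownValue = propertyError("Script=NoSuchScript");
```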

21.2.1.2 Static Semantics: CapturingGroupNumber

DecimalEscape :: NonZeroDigit

Return the MV of NonZeroDigit.

DecimalEscape :: NonZeroDigit DecimalDigits

Let n be the number of code points in DecimalDigits.
Return (the MV of NonZeroDigit × 10ⁿ) plus the MV of DecimalDigits.

The definitions of “the MV of NonZeroDigit” and “the MV of DecimalDigits” are in 11.8.3.
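
The arithmetic of CapturingGroupNumber can be sketched as below. The function name `capturingGroupNumber` and its string argument are assumptions made for illustration; the specification defines the operation on Parse Nodes, not on strings.

```javascript
// Sketch: computes the group number for a DecimalEscape given as a digit string, e.g. "12".
function capturingGroupNumber(decimalEscape) {
  if (!/^[1-9][0-9]*$/.test(decimalEscape)) {
    throw new RangeError("expected a NonZeroDigit followed by optional DecimalDigits");
  }
  const nonZeroDigit = Number(decimalEscape[0]);
  const decimalDigits = decimalEscape.slice(1);
  if (decimalDigits === "") return nonZeroDigit; // only a NonZeroDigit is present
  const n = decimalDigits.length;                // number of code points in DecimalDigits
  return nonZeroDigit * 10 ** n + Number(decimalDigits);
}

const single = capturingGroupNumber("7");
const withLeadingZero = capturingGroupNumber("105"); // 1 × 10² + MV of "05"
```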

21.2.1.3 Static Semantics: IsCharacterClass

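ClassAtom :: -
ClassAtomNoDash :: SourceCharacter but not one of \ or ] or -
ClassEscape :: b
ClassEscape :: -
ClassEscape :: CharacterEscape

Return false.

ClassEscape :: CharacterClassEscape

Return true.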

21.2.1.4 Static Semantics: CharacterValue

ClassAtom :: -

Return the code point value of U+002D (HYPHEN-MINUS).

ClassAtomNoDash :: SourceCharacter but not one of \ or ] or -

Let ch be the code point matched by SourceCharacter.
Return the code point value of ch.

ClassEscape :: b

Return the code point value of U+0008 (BACKSPACE).

ClassEscape :: -

Return the code point value of U+002D (HYPHEN-MINUS).

CharacterEscape :: ControlEscape

Return the code point value according to Table 53.

ControlEscape	Code Point Value	Code Point	Unicode Name	Symbol
`t`	9	`U+0009`	CHARACTER TABULATION	<HT>
`n`	10	`U+000A`	LINE FEED (LF)	<LF>
`v`	11	`U+000B`	LINE TABULATION	<VT>
`f`	12	`U+000C`	FORM FEED (FF)	<FF>
`r`	13	`U+000D`	CARRIAGE RETURN (CR)	<CR>

CharacterEscape :: c ControlLetter

Let ch be the code point matched by ControlLetter.
Let i be ch's code point value.
Return the remainder of dividing i by 32.
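
A short sketch of the ControlLetter computation follows. `controlLetterValue` is a hypothetical name; it assumes its argument is a single ASCII letter.

```javascript
// Sketch: the CharacterValue of \c followed by the given ControlLetter.
function controlLetterValue(letter) {
  if (!/^[A-Za-z]$/.test(letter)) {
    throw new RangeError("ControlLetter must be a single ASCII letter");
  }
  const i = letter.codePointAt(0); // code point value of the letter
  return i % 32;                   // remainder of dividing i by 32
}

const fromUpper = controlLetterValue("J"); // 74 % 32
const fromLower = controlLetterValue("j"); // 106 % 32
const matchesLineFeed = /\cJ/.test("\n");
```

Upper and lower case letters map to the same control character because their code points differ by exactly 32.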

CharacterEscape :: 0 [lookahead ∉ DecimalDigit]

Return the code point value of U+0000 (NULL).

Note

\0 represents the <NUL> character and cannot be followed by a decimal digit.

CharacterEscape :: HexEscapeSequence

Return the numeric value of the code unit that is the SV of HexEscapeSequence.

RegExpUnicodeEscapeSequence :: u LeadSurrogate \u TrailSurrogate

Let lead be the CharacterValue of LeadSurrogate.
Let trail be the CharacterValue of TrailSurrogate.
Let cp be UTF16Decode(lead, trail).
Return the code point value of cp.
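
The surrogate pair combination can be sketched with the standard UTF-16 decoding arithmetic. `utf16Decode` here is an illustrative stand-in for the specification's UTF16Decode, not an API available to scripts.

```javascript
// Sketch: combines a lead and a trail surrogate code unit into one code point.
function utf16Decode(lead, trail) {
  if (lead < 0xD800 || lead > 0xDBFF) throw new RangeError("not a lead surrogate");
  if (trail < 0xDC00 || trail > 0xDFFF) throw new RangeError("not a trail surrogate");
  return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
}

const gClef = utf16Decode(0xD834, 0xDD1E);
// In a Unicode pattern the two escapes together denote that single code point.
const escapedPairMatches = /^\uD834\uDD1E$/u.test("\u{1D11E}");
```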

RegExpUnicodeEscapeSequence :: u LeadSurrogate

Return the CharacterValue of LeadSurrogate.

RegExpUnicodeEscapeSequence :: u TrailSurrogate

Return the CharacterValue of TrailSurrogate.

RegExpUnicodeEscapeSequence :: u NonSurrogate

Return the CharacterValue of NonSurrogate.

Return the MV of Hex4Digits.

RegExpUnicodeEscapeSequence :: u{ CodePoint }

Return the MV of CodePoint.

Return the MV of HexDigits.

CharacterEscape :: IdentityEscape

Let ch be the code point matched by IdentityEscape.
Return the code point value of ch.

21.2.1.5 Static Semantics: SourceText

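UnicodePropertyNameCharacters :: UnicodePropertyNameCharacter UnicodePropertyNameCharacters opt
UnicodePropertyValueCharacters :: UnicodePropertyValueCharacter UnicodePropertyValueCharacters opt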

Return the List, in source text order, of Unicode code points in the source text matched by this production.

21.2.1.6 Static Semantics: StringValue

RegExpIdentifierName[U] :: RegExpIdentifierStart[?U]
RegExpIdentifierName[U] :: RegExpIdentifierName[?U] RegExpIdentifierPart[?U]

Return the String value consisting of the sequence of code units corresponding to RegExpIdentifierName. In determining the sequence any occurrences of \ RegExpUnicodeEscapeSequence are first replaced with the code point represented by the RegExpUnicodeEscapeSequence and then the code points of the entire RegExpIdentifierName are converted to code units by UTF16Encoding each code point.

21.2.2 Pattern Semantics

A regular expression pattern is converted into an internal procedure using the process described below. An implementation is encouraged to use more efficient algorithms than the ones listed below, as long as the results are the same. The internal procedure is used as the value of a RegExp object's [[RegExpMatcher]] internal slot.

A Pattern is either a BMP pattern or a Unicode pattern depending upon whether or not its associated flags contain a "u". A BMP pattern matches against a String interpreted as consisting of a sequence of 16-bit values that are Unicode code points in the range of the Basic Multilingual Plane. A Unicode pattern matches against a String interpreted as consisting of Unicode code points encoded using UTF-16. In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern “character” means a UTF-16 encoded code point (6.1.4). In either context, “character value” means the numeric value of the corresponding non-encoded code point.

The syntax and semantics of Pattern are defined as if the source code for the Pattern were a List of SourceCharacter values where each SourceCharacter corresponds to a Unicode code point. If a BMP pattern contains a non-BMP SourceCharacter, the entire pattern is encoded using UTF-16 and the individual code units of that encoding are used as the elements of the List.

Note

For example, consider a pattern expressed in source text as the single non-BMP character U+1D11E (MUSICAL SYMBOL G CLEF). Interpreted as a Unicode pattern, it would be a single element (character) List consisting of the single code point 0x1D11E. However, interpreted as a BMP pattern, it is first UTF-16 encoded to produce a two element List consisting of the code units 0xD834 and 0xDD1E.

Patterns are passed to the RegExp constructor as ECMAScript String values in which non-BMP characters are UTF-16 encoded. For example, the single character MUSICAL SYMBOL G CLEF pattern, expressed as a String value, is a String of length 2 whose elements are the code units 0xD834 and 0xDD1E. So no further translation of the string is necessary to process it as a BMP pattern consisting of two pattern characters. However, to process it as a Unicode pattern, UTF16Decode must be used in producing a List consisting of a single pattern character, the code point U+1D11E.

An implementation may not actually perform such translations to or from UTF-16, but the semantics of this specification requires that the result of pattern matching be as if such translations were performed.
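
The difference between the two interpretations is visible from script. The sketch below uses the same G CLEF character; the variable names are chosen for illustration only.

```javascript
const gClef = "\u{1D11E}";             // MUSICAL SYMBOL G CLEF
const codeUnits = gClef.length;        // elements seen by a BMP pattern
const codePoints = [...gClef].length;  // elements seen by a Unicode pattern

const bmpOneChar = /^.$/.test(gClef);      // one "character" is one code unit
const bmpTwoChars = /^..$/.test(gClef);    // the pair is two BMP characters
const unicodeOneChar = /^.$/u.test(gClef); // one "character" is one code point
```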

21.2.2.1 Notation

The descriptions below use the following variables:

Input is a List consisting of all of the characters, in order, of the String being matched by the regular expression pattern. Each character is either a code unit or a code point, depending upon the kind of pattern involved. The notation Input[n] means the nth character of Input, where n can range between 0 (inclusive) and InputLength (exclusive).
InputLength is the number of characters in Input.
NcapturingParens is the total number of left-capturing parentheses (i.e. the total number of Atom :: ( GroupSpecifier Disjunction ) Parse Nodes) in the pattern. A left-capturing parenthesis is any ( pattern character that is matched by the ( terminal of the Atom :: ( GroupSpecifier Disjunction ) production.
DotAll is true if the RegExp object's [[OriginalFlags]] internal slot contains "s" and otherwise is false.
IgnoreCase is true if the RegExp object's [[OriginalFlags]] internal slot contains "i" and otherwise is false.
Multiline is true if the RegExp object's [[OriginalFlags]] internal slot contains "m" and otherwise is false.
Unicode is true if the RegExp object's [[OriginalFlags]] internal slot contains "u" and otherwise is false.

Furthermore, the descriptions below use the following internal data structures:

A CharSet is a mathematical set of characters, either code units or code points depending upon the state of the Unicode flag. “All characters” means either all code unit values or all code point values also depending upon the state of Unicode.
A State is an ordered pair (endIndex, captures) where endIndex is an integer and captures is a List of NcapturingParens values. States are used to represent partial match states in the regular expression matching algorithms. The endIndex is one plus the index of the last input character matched so far by the pattern, while captures holds the results of capturing parentheses. The nth element of captures is either a List that represents the value obtained by the nth set of capturing parentheses or undefined if the nth set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.
A MatchResult is either a State or the special token failure that indicates that the match failed.
A Continuation procedure is an internal closure (i.e. an internal procedure with some arguments already bound to values) that takes one State argument and returns a MatchResult result. If an internal closure references variables which are bound in the function that creates the closure, the closure uses the values that these variables had at the time the closure was created. The Continuation attempts to match the remaining portion (specified by the closure's already-bound arguments) of the pattern against Input, starting at the intermediate state given by its State argument. If the match succeeds, the Continuation returns the final State that it reached; if the match fails, the Continuation returns failure.
A Matcher procedure is an internal closure that takes two arguments — a State and a Continuation — and returns a MatchResult result. A Matcher attempts to match a middle subpattern (specified by the closure's already-bound arguments) of the pattern against Input, starting at the intermediate state given by its State argument. The Continuation argument should be a closure that matches the rest of the pattern. After matching the subpattern of a pattern to obtain a new State, the Matcher then calls Continuation on that new State to test if the rest of the pattern can match as well. If it can, the Matcher returns the State returned by Continuation; if not, the Matcher may try different choices at its choice points, repeatedly calling Continuation until it either succeeds or all possibilities have been exhausted.
An AssertionTester procedure is an internal closure that takes a State argument and returns a Boolean result. The assertion tester tests a specific condition (specified by the closure's already-bound arguments) against the current place in Input and returns true if the condition matched or false if not.

21.2.2.2 Pattern

The production Pattern :: Disjunction evaluates as follows:

Evaluate Disjunction with +1 as its direction argument to obtain a Matcher m.
Return an internal closure that takes two arguments, a String str and an integer index, and performs the following steps:
1. Assert: index ≤ the length of str.
2. If Unicode is true, let Input be a List consisting of the sequence of code points of str interpreted as a UTF-16 encoded (6.1.4) Unicode string. Otherwise, let Input be a List consisting of the sequence of code units that are the elements of str. Input will be used throughout the algorithms in 21.2.2. Each element of Input is considered to be a character.
3. Let InputLength be the number of characters contained in Input. This variable will be used throughout the algorithms in 21.2.2.
4. Let listIndex be the index into Input of the character that was obtained from element index of str.
5. Let c be a Continuation that always returns its State argument as a successful MatchResult.
6. Let cap be a List of NcapturingParens undefined values, indexed 1 through NcapturingParens.
7. Let x be the State (listIndex, cap).
8. Call m(x, c) and return its result.

Note

A Pattern evaluates (“compiles”) to an internal procedure value. RegExpBuiltinExec can then apply this procedure to a String and an offset within the String to determine whether the pattern would match starting at exactly that offset within the String, and, if it does match, what the values of the capturing parentheses would be. The algorithms in 21.2.2 are designed so that compiling a pattern may throw a SyntaxError exception; on the other hand, once the pattern is successfully compiled, applying the resulting internal procedure to find a match in a String cannot throw an exception (except for any host-defined exceptions that can occur anywhere such as out-of-memory).

21.2.2.3 Disjunction

With parameter direction.

The production Disjunction :: Alternative evaluates as follows:

Evaluate Alternative with argument direction to obtain a Matcher m.
Return m.

The production Disjunction :: Alternative | Disjunction evaluates as follows:

Evaluate Alternative with argument direction to obtain a Matcher m1.
Evaluate Disjunction with argument direction to obtain a Matcher m2.
Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps when evaluated:
1. Call m1(x, c) and let r be its result.
2. If r is not failure, return r.
3. Call m2(x, c) and return its result.

Note

The | regular expression operator separates two alternatives. The pattern first tries to match the left Alternative (followed by the sequel of the regular expression); if it fails, it tries to match the right Disjunction (followed by the sequel of the regular expression). If the left Alternative, the right Disjunction, and the sequel all have choice points, all choices in the sequel are tried before moving on to the next choice in the left Alternative. If choices in the left Alternative are exhausted, the right Disjunction is tried instead of the left Alternative. Any capturing parentheses inside a portion of the pattern skipped by | produce undefined values instead of Strings. Thus, for example,

/a|ab/.exec("abc")

returns the result "a" and not "ab". Moreover,

/((a)|(ab))((c)|(bc))/.exec("abc")

returns the array

["abc", "a", "a", undefined, "bc", undefined, "bc"]

and not

["abc", "ab", undefined, "ab", "c", "c", undefined]

The order in which the two alternatives are tried is independent of the value of direction.

21.2.2.4 Alternative

With parameter direction.

The production Alternative :: [empty] evaluates as follows:

Return a Matcher that takes two arguments, a State x and a Continuation c, and returns the result of calling c(x).

The production Alternative :: Alternative Term evaluates as follows:

Evaluate Alternative with argument direction to obtain a Matcher m1.
Evaluate Term with argument direction to obtain a Matcher m2.
If direction is equal to +1, then
1. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps when evaluated:
  1. Let d be a Continuation that takes a State argument y and returns the result of calling m2(y, c).
  2. Call m1(x, d) and return its result.
Else,
1. Assert: direction is equal to -1.
2. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps when evaluated:
  1. Let d be a Continuation that takes a State argument y and returns the result of calling m1(y, c).
  2. Call m2(x, d) and return its result.
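
The Matcher and Continuation closures used by Disjunction and Alternative can be sketched in ordinary JavaScript. This is an illustrative model, not an implementation of the specification: `matchersFor`, `char`, `sequence`, and `disjunction` are hypothetical names, `null` stands in for failure, captures are ignored, and a single-character comparison replaces CharSet membership.

```javascript
const FAILURE = null; // stands in for the failure token

// Builds Matcher factories over a fixed Input (an array of characters).
function matchersFor(input) {
  // Matcher for one character; direction is +1 (forward) or -1 (backward).
  const char = (ch, direction) => (x, c) => {
    const e = x.endIndex;
    const f = e + direction;
    if (f < 0 || f > input.length) return FAILURE;
    if (input[Math.min(e, f)] !== ch) return FAILURE;
    return c({ endIndex: f, captures: x.captures });
  };
  // Alternative :: Alternative Term, with m1 the left Alternative and m2 the Term.
  const sequence = (m1, m2, direction) => (x, c) =>
    direction === 1 ? m1(x, (y) => m2(y, c)) : m2(x, (y) => m1(y, c));
  // Disjunction :: Alternative | Disjunction
  const disjunction = (m1, m2) => (x, c) => {
    const r = m1(x, c);
    return r !== FAILURE ? r : m2(x, c);
  };
  return { char, sequence, disjunction };
}

const identity = (state) => state; // Continuation that accepts any State
const { char, sequence, disjunction } = matchersFor([..."abc"]);
const start = { endIndex: 0, captures: [] };

// a|ab at index 0: the left alternative succeeds first.
const aOrAb = disjunction(char("a", 1), sequence(char("a", 1), char("b", 1), 1));
const firstEnd = aOrAb(start, identity).endIndex;

// (?:a|ab)c: "a" followed by "c" fails, so the Disjunction falls back to "ab".
const thenC = sequence(aOrAb, char("c", 1), 1);
const secondEnd = thenC(start, identity).endIndex;

// Direction -1: "ab" matched backward from index 2.
const backward = sequence(char("a", -1), char("b", -1), -1);
const backwardEnd = backward({ endIndex: 2, captures: [] }, identity).endIndex;
```

Passing the rest of the pattern as a Continuation is what lets an earlier choice point be retried when a later part fails, without any explicit backtracking stack.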

Note

Consecutive Terms try to simultaneously match consecutive portions of Input. When direction is equal to +1, if the left Alternative, the right Term, and the sequel of the regular expression all have choice points, all choices in the sequel are tried before moving on to the next choice in the right Term, and all choices in the right Term are tried before moving on to the next choice in the left Alternative. When direction is equal to -1, the evaluation order of Alternative and Term is reversed.

21.2.2.5 Term

With parameter direction.

The production Term :: Assertion evaluates as follows:

Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps when evaluated:
1. Evaluate Assertion to obtain an AssertionTester t.
2. Call t(x) and let r be the resulting Boolean value.
3. If r is false, return failure.
4. Call c(x) and return its result.

Note

The AssertionTester is independent of direction.

The production Term :: Atom evaluates as follows:

Return the Matcher that is the result of evaluating Atom with argument direction.

The production Term :: Atom Quantifier evaluates as follows:

Evaluate Atom with argument direction to obtain a Matcher m.
Evaluate Quantifier to obtain the three results: an integer min, an integer (or ∞) max, and Boolean greedy.
Assert: If max is finite, then max is not less than min.
Let parenIndex be the number of left-capturing parentheses in the entire regular expression that occur to the left of this Term. This is the total number of Atom :: ( GroupSpecifier Disjunction ) Parse Nodes prior to or enclosing this Term.
Let parenCount be the number of left-capturing parentheses in Atom. This is the total number of Atom :: ( GroupSpecifier Disjunction ) Parse Nodes enclosed by Atom.
Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps when evaluated:
1. Call RepeatMatcher(m, min, max, greedy, x, c, parenIndex, parenCount) and return its result.

21.2.2.5.1 Runtime Semantics: RepeatMatcher ( `m`, `min`, `max`, `greedy`, `x`, `c`, `parenIndex`, `parenCount` )

The abstract operation RepeatMatcher takes eight parameters, a Matcher m, an integer min, an integer (or ∞) max, a Boolean greedy, a State x, a Continuation c, an integer parenIndex, and an integer parenCount, and performs the following steps:

If max is zero, return c(x).
Let d be an internal Continuation closure that takes one State argument y and performs the following steps when evaluated:
1. If min is zero and y's endIndex is equal to x's endIndex, return failure.
2. If min is zero, let min2 be zero; otherwise let min2 be min - 1.
3. If max is ∞, let max2 be ∞; otherwise let max2 be max - 1.
4. Call RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount) and return its result.
Let cap be a copy of x's captures List.
For each integer k that satisfies parenIndex < k and k ≤ parenIndex + parenCount, set cap[k] to undefined.
Let e be x's endIndex.
Let xr be the State (e, cap).
If min is not zero, return m(xr, d).
If greedy is false, then
1. Call c(x) and let z be its result.
2. If z is not failure, return z.
3. Call m(xr, d) and return its result.
Call m(xr, d) and let z be its result.
If z is not failure, return z.
Call c(x) and return its result.
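
The steps of RepeatMatcher translate almost line for line into JavaScript. The following is a sketch under stated assumptions: `null` stands in for failure, `Infinity` for ∞, States are plain objects, and `lowercaseLetter` and `endAfterRepeat` are hypothetical helpers that model `[a-z]{2,4}` applied after a leading "a".

```javascript
const FAILURE = null; // stands in for the failure token

function repeatMatcher(m, min, max, greedy, x, c, parenIndex, parenCount) {
  if (max === 0) return c(x);
  const d = (y) => {
    // An empty iteration after the minimum is reached is rejected.
    if (min === 0 && y.endIndex === x.endIndex) return FAILURE;
    const min2 = min === 0 ? 0 : min - 1;
    const max2 = max === Infinity ? Infinity : max - 1;
    return repeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount);
  };
  const cap = x.captures.slice();
  // Clear the captures that belong to the quantified Atom.
  for (let k = parenIndex + 1; k <= parenIndex + parenCount; k++) cap[k] = undefined;
  const xr = { endIndex: x.endIndex, captures: cap };
  if (min !== 0) return m(xr, d);
  if (!greedy) {
    const z = c(x);
    if (z !== FAILURE) return z;
    return m(xr, d);
  }
  const z = m(xr, d);
  if (z !== FAILURE) return z;
  return c(x);
}

// Hypothetical Matcher for the class [a-z], forward direction only.
function lowercaseLetter(input) {
  return (x, c) => {
    const ch = input[x.endIndex];
    if (ch === undefined || ch < "a" || ch > "z") return FAILURE;
    return c({ endIndex: x.endIndex + 1, captures: x.captures });
  };
}

// Applies [a-z]{2,4} (greedy or not) starting just after a leading "a".
function endAfterRepeat(text, greedy) {
  const input = [...text];
  const start = { endIndex: 1, captures: [] };
  return repeatMatcher(lowercaseLetter(input), 2, 4, greedy, start, (s) => s, 0, 0).endIndex;
}

const greedyEnd = endAfterRepeat("abcdefghi", true);
const lazyEnd = endAfterRepeat("abcdefghi", false);
```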

Note 1

An Atom followed by a Quantifier is repeated the number of times specified by the Quantifier. A Quantifier can be non-greedy, in which case the Atom pattern is repeated as few times as possible while still matching the sequel, or it can be greedy, in which case the Atom pattern is repeated as many times as possible while still matching the sequel. The Atom pattern is repeated rather than the input character sequence that it matches, so different repetitions of the Atom can match different input substrings.

Note 2

If the Atom and the sequel of the regular expression all have choice points, the Atom is first matched as many (or as few, if non-greedy) times as possible. All choices in the sequel are tried before moving on to the next choice in the last repetition of Atom. All choices in the last (nth) repetition of Atom are tried before moving on to the next choice in the next-to-last (n - 1)st repetition of Atom; at which point it may turn out that more or fewer repetitions of Atom are now possible; these are exhausted (again, starting with either as few or as many as possible) before moving on to the next choice in the (n - 1)st repetition of Atom and so on.

Compare

/a[a-z]{2,4}/.exec("abcdefghi")

which returns "abcde" with

/a[a-z]{2,4}?/.exec("abcdefghi")

which returns "abc".

Consider also

/(aa|aabaac|ba|b|c)*/.exec("aabaac")

which, by the choice point ordering above, returns the array

["aaba", "ba"]

and not any of:


["aabaac", "aabaac"]
["aabaac", "c"]

The above ordering of choice points can be used to write a regular expression that calculates the greatest common divisor of two numbers (represented in unary notation). The following example calculates the gcd of 10 and 15:

"aaaaaaaaaa,aaaaaaaaaaaaaaa".replace(/^(a+)\1*,\1+$/, "$1")

which returns the gcd in unary notation "aaaaa".

Note 3

Step 4 of the RepeatMatcher clears Atom's captures each time Atom is repeated. We can see its behaviour in the regular expression

/(z)((a+)?(b+)?(c))*/.exec("zaacbbbcac")

which returns the array

["zaacbbbcac", "z", "ac", "a", undefined, "c"]

and not

["zaacbbbcac", "z", "ac", "a", "bbb", "c"]

because each iteration of the outermost * clears all captured Strings contained in the quantified Atom, which in this case includes capture Strings numbered 2, 3, 4, and 5.

Note 4

Step 1 of the RepeatMatcher's d closure states that, once the minimum number of repetitions has been satisfied, any more expansions of Atom that match the empty character sequence are not considered for further repetitions. This prevents the regular expression engine from falling into an infinite loop on patterns such as:

/(a*)*/.exec("b")

or the slightly more complicated:

/(a*)b\1+/.exec("baaaac")

which returns the array

["b", ""]

21.2.2.6 Assertion

The production Assertion :: ^ evaluates as follows:

Return an internal AssertionTester closure that takes a State argument x and performs the following steps when evaluated:
1. Let e be x's endIndex.
2. If e is zero, return true.
3. If Multiline is false, return false.
4. If the character Input[e - 1] is one of LineTerminator, return true.
5. Return false.

Note

Even when the y flag is used with a pattern, ^ always matches only at the beginning of Input, or (if Multiline is true) at the beginning of a line.

The production Assertion :: $ evaluates as follows:

Return an internal AssertionTester closure that takes a State argument x and performs the following steps when evaluated:
1. Let e be x's endIndex.
2. If e is equal to InputLength, return true.
3. If Multiline is false, return false.
4. If the character Input[e] is one of LineTerminator, return true.
5. Return false.
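
The effect of Multiline on the two anchors, and their independence from the y flag, can be checked as follows. The sample string and variable names are chosen only for illustration.

```javascript
const text = "first\nsecond";

const startSingle = /^second/.test(text); // ^ only matches at index 0
const startMulti = /^second/m.test(text); // ^ also matches after a LineTerminator
const endSingle = /first$/.test(text);    // $ only matches at InputLength
const endMulti = /first$/m.test(text);    // $ also matches before a LineTerminator

// With the y flag matching starts at lastIndex, but ^ still tests for index 0.
const sticky = /^second/y;
sticky.lastIndex = 6;
const stickyResult = sticky.test(text);

const stickyMulti = /^second/my;
stickyMulti.lastIndex = 6;
const stickyMultiResult = stickyMulti.test(text);
```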

The production Assertion :: \ b evaluates as follows:

Return an internal AssertionTester closure that takes a State argument x and performs the following steps when evaluated:
1. Let e be x's endIndex.
2. Call IsWordChar(e - 1) and let a be the Boolean result.
3. Call IsWordChar(e) and let b be the Boolean result.
4. If a is true and b is false, return true.
5. If a is false and b is true, return true.
6. Return false.

The production Assertion :: \ B evaluates as follows:

Return an internal AssertionTester closure that takes a State argument x and performs the following steps when evaluated:
1. Let e be x's endIndex.
2. Call IsWordChar(e - 1) and let a be the Boolean result.
3. Call IsWordChar(e) and let b be the Boolean result.
4. If a is true and b is false, return false.
5. If a is false and b is true, return false.
6. Return true.
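
A brief illustration of the two word boundary assertions follows, marking every position at which each one holds. The sample strings are arbitrary.

```javascript
// \b holds where exactly one neighbour is a word character.
const boundaries = "hello world".replace(/\b/g, "|");
// \B holds everywhere else, here between adjacent word characters.
const nonBoundaries = "hello".replace(/\B/g, "-");

// On empty Input both neighbours are out of range, so \b fails and \B holds.
const emptyBoundary = /\b/.test("");
const emptyNonBoundary = /\B/.test("");
```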

The production Assertion :: ( ? = Disjunction ) evaluates as follows:

Evaluate Disjunction with +1 as its direction argument to obtain a Matcher m.
Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
1. Let d be a Continuation that always returns its State argument as a successful MatchResult.
2. Call m(x, d) and let r be its result.
3. If r is failure, return failure.
4. Let y be r's State.
5. Let cap be y's captures List.
6. Let xe be x's endIndex.
7. Let z be the State (xe, cap).
8. Call c(z) and return its result.

The production Assertion :: ( ? ! Disjunction ) evaluates as follows:

Evaluate Disjunction with +1 as its direction argument to obtain a Matcher m.
Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
1. Let d be a Continuation that always returns its State argument as a successful MatchResult.
2. Call m(x, d) and let r be its result.
3. If r is not failure, return failure.
4. Call c(x) and return its result.

The production Assertion :: ( ? <= Disjunction ) evaluates as follows:

Evaluate Disjunction with -1 as its direction argument to obtain a Matcher m.
Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
1. Let d be a Continuation that always returns its State argument as a successful MatchResult.
2. Call m(x, d) and let r be its result.
3. If r is failure, return failure.
4. Let y be r's State.
5. Let cap be y's captures List.
6. Let xe be x's endIndex.
7. Let z be the State (xe, cap).
8. Call c(z) and return its result.

The production Assertion :: ( ? <! Disjunction ) evaluates as follows:

Evaluate Disjunction with -1 as its direction argument to obtain a Matcher m.
Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
1. Let d be a Continuation that always returns its State argument as a successful MatchResult.
2. Call m(x, d) and let r be its result.
3. If r is not failure, return failure.
4. Call c(x) and return its result.
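
The four lookaround assertions can be compared side by side. This is a sketch that assumes an engine with lookbehind support; the sample strings are arbitrary. The last example shows the effect of evaluating the Disjunction with direction -1: the rightmost group is matched first and, being greedy, takes as many digits as it can.

```javascript
const ahead = /\d+(?= dollars)/.exec("42 dollars")[0];    // digits followed by " dollars"
const notAhead = /\d+(?! dollars)/.exec("42 dollars")[0]; // backtracks to a shorter run
const behind = /(?<=\$)\d+/.exec("cost $42")[0];          // digits preceded by "$"
const notBehind = /(?<!\$)\d+/.exec("cost $42")[0];       // digits not preceded by "$"

// The assertion consumes no input, so the overall match is empty,
// but the captures made inside it are kept.
const backward = /(?<=(\d+)(\d+))$/.exec("1053");
```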

21.2.2.6.1 Runtime Semantics: WordCharacters ( )

The abstract operation WordCharacters performs the following steps:

Let A be a set of characters containing the sixty-three characters a through z, A through Z, 0 through 9, and _.
Let U be an empty set.
For each character c not in set A where Canonicalize(c) is in A, add c to U.
Assert: Unless Unicode and IgnoreCase are both true, U is empty.
Add the characters in set U to set A.
Return A.

21.2.2.6.2 Runtime Semantics: IsWordChar ( `e` )

The abstract operation IsWordChar takes an integer parameter e and performs the following steps:

If e is -1 or e is InputLength, return false.
Let c be the character Input[e].
Let wordChars be the result of ! WordCharacters().
If c is in wordChars, return true.
Return false.
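
The extra set U built by WordCharacters is non-empty only when Unicode and IgnoreCase are both true. The sketch below assumes an engine that implements this step as specified; U+017F (LATIN SMALL LETTER LONG S) and U+212A (KELVIN SIGN) are used because they case-fold to the ASCII letters s and k.

```javascript
const longS = "\u017F";
const kelvin = "\u212A";

const longSPlain = /\w/i.test(longS);      // U is empty without the u flag
const longSUnicode = /\w/u.test(longS);    // U is empty without the i flag
const longSBoth = /\w/iu.test(longS);      // long s canonicalizes to a member of A
const kelvinBoth = /\w/iu.test(kelvin);    // kelvin sign canonicalizes to a member of A
```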

21.2.2.7 Quantifier

The production Quantifier :: QuantifierPrefix evaluates as follows:

Evaluate QuantifierPrefix to obtain the two results: an integer min and an integer (or ∞) max.
Return the three results min, max, and true.

The production Quantifier :: QuantifierPrefix ? evaluates as follows:

Evaluate QuantifierPrefix to obtain the two results: an integer min and an integer (or ∞) max.
Return the three results min, max, and false.

The production QuantifierPrefix :: * evaluates as follows:

Return the two results 0 and ∞.

The production QuantifierPrefix :: + evaluates as follows:

Return the two results 1 and ∞.

The production QuantifierPrefix :: ? evaluates as follows:

Return the two results 0 and 1.

The production QuantifierPrefix :: { DecimalDigits } evaluates as follows:

Let i be the MV of DecimalDigits (see 11.8.3).
Return the two results i and i.

The production QuantifierPrefix :: { DecimalDigits , } evaluates as follows:

Let i be the MV of DecimalDigits.
Return the two results i and ∞.

The production QuantifierPrefix :: { DecimalDigits , DecimalDigits } evaluates as follows:

Let i be the MV of the first DecimalDigits.
Let j be the MV of the second DecimalDigits.
Return the two results i and j.
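
The Quantifier evaluation rules amount to a small lookup. The sketch below takes the quantifier as source text, which is an assumption made for illustration (the specification evaluates Parse Nodes); `evaluateQuantifier` is a hypothetical name, and `Infinity` stands in for ∞.

```javascript
// Sketch: returns { min, max, greedy } for quantifier text such as "*", "+?", or "{2,4}".
function evaluateQuantifier(source) {
  const match = /^(?:([*+?])|\{(\d+)(?:(,)(\d*))?\})(\?)?$/.exec(source);
  if (match === null) throw new SyntaxError("not a Quantifier: " + source);
  const [, symbol, first, comma, second, lazy] = match;
  let min;
  let max;
  if (symbol === "*") {
    min = 0; max = Infinity;
  } else if (symbol === "+") {
    min = 1; max = Infinity;
  } else if (symbol === "?") {
    min = 0; max = 1;
  } else {
    min = Number(first);                        // MV of the first DecimalDigits
    if (comma === undefined) max = min;         // { DecimalDigits }
    else if (second === "") max = Infinity;     // { DecimalDigits , }
    else max = Number(second);                  // { DecimalDigits , DecimalDigits }
    // Mirrors the early error for QuantifierPrefix; not part of evaluation itself.
    if (max < min) throw new SyntaxError("numbers out of order in quantifier");
  }
  return { min, max, greedy: lazy === undefined };
}

const star = evaluateQuantifier("*");
const lazyRange = evaluateQuantifier("{2,4}?");
const openEnded = evaluateQuantifier("{2,}");
```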

21.2.2.8 Atom

With parameter direction.

The production Atom :: PatternCharacter evaluates as follows:

Let ch be the character matched by PatternCharacter.
Let A be a one-element CharSet containing the character ch.
Call CharacterSetMatcher(A, false, direction) and return its Matcher result.

The production Atom :: . evaluates as follows:

If DotAll is true, then
1. Let A be the set of all characters.
Otherwise, let A be the set of all characters except LineTerminator.
Call CharacterSetMatcher(A, false, direction) and return its Matcher result.
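
The effect of DotAll on the `.` Atom can be observed directly. This sketch assumes an engine that supports the s flag; the sample strings are arbitrary.

```javascript
const withoutDotAll = /a.b/.test("a\nb");       // LineTerminator is excluded from A
const withDotAll = /a.b/s.test("a\nb");         // A is the set of all characters
const lineSeparator = /a.b/.test("a\u2028b");   // U+2028 is also a LineTerminator
const flagReflected = /a.b/s.dotAll;            // the flag is visible on the RegExp object
```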

The production Atom :: \ AtomEscape evaluates as follows:

Return the Matcher that is the result of evaluating AtomEscape with argument direction.

The production Atom :: CharacterClass evaluates as follows:

Evaluate CharacterClass to obtain a CharSet A and a Boolean invert.
Call CharacterSetMatcher(A, invert, direction) and return its Matcher result.

The production Atom :: ( GroupSpecifier Disjunction ) evaluates as follows:

Evaluate Disjunction with argument direction to obtain a Matcher m.
Let parenIndex be the number of left-capturing parentheses in the entire regular expression that occur to the left of this Atom. This is the total number of Atom :: ( GroupSpecifier Disjunction ) Parse Nodes prior to or enclosing this Atom.
Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
1. Let d be an internal Continuation closure that takes one State argument y and performs the following steps:
  1. Let cap be a copy of y's captures List.
  2. Let xe be x's endIndex.
  3. Let ye be y's endIndex.
  4. If direction is equal to +1, then
    1. Assert: xe ≤ ye.
    2. Let s be a new List whose elements are the characters of Input at indices xe (inclusive) through ye (exclusive).
  5. Else,
    1. Assert: direction is equal to -1.
    2. Assert: ye ≤ xe.
    3. Let s be a new List whose elements are the characters of Input at indices ye (inclusive) through xe (exclusive).
  6. Set cap[parenIndex + 1] to s.
  7. Let z be the State (ye, cap).
  8. Call c(z) and return its result.
2. Call m(x, d) and return its result.
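
The effect of parenIndex and direction can be observed from script code. The following informal example assumes an implementation that supports lookbehind assertions, which is where direction is -1; the variable names are chosen for illustration only.

```javascript
// Groups are numbered by the position of their opening parenthesis,
// so the nested group (b) is capture 2 and (c) is capture 3.
const forward = /(a(b))(c)/.exec("abc");
// forward[1] is "ab", forward[2] is "b", forward[3] is "c"

// Inside a lookbehind the Disjunction is matched with direction -1,
// so the rightmost group is matched first and greedily takes "053".
const backward = /(?<=(\d+)(\d+))$/.exec("1053");
// backward[0] is "", backward[1] is "1", backward[2] is "053"
```

In both cases each captured value is the run of input characters between the two endIndex values, as computed in steps 4 and 5 of the Continuation d.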

The production Atom :: ( ? : Disjunction ) evaluates as follows:

Return the Matcher that is the result of evaluating Disjunction with argument direction.

21.2.2.8.1 Runtime Semantics: CharacterSetMatcher ( `A`, `invert`, `direction` )

The abstract operation CharacterSetMatcher takes three arguments, a CharSet A, a Boolean flag invert, and an integer direction, and performs the following steps:

Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps when evaluated:
1. Let e be x's endIndex.
2. Let f be e + direction.
3. If f < 0 or f > InputLength, return failure.
4. Let index be min(e, f).
5. Let ch be the character Input[index].
6. Let cc be Canonicalize(ch).
7. If invert is false, then
  1. If there does not exist a member a of set A such that Canonicalize(a) is cc, return failure.
8. Else,
  1. Assert: invert is true.
  2. If there exists a member a of set A such that Canonicalize(a) is cc, return failure.
9. Let cap be x's captures List.
10. Let y be the State (f, cap).
11. Call c(y) and return its result.
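
The steps above can be sketched in JavaScript as follows. This is a minimal illustration, not a definitive implementation: the representation of a State as an object `{ endIndex, captures }`, the `FAILURE` sentinel, the explicit `input` parameter, and the optional `canonicalize` parameter (defaulting to the identity function, as when IgnoreCase is false) are all assumptions made for this example.

```javascript
const FAILURE = Symbol("failure"); // hypothetical stand-in for the spec value failure

// input: array of characters; A: Set of characters; invert: boolean; direction: +1 or -1
function characterSetMatcher(input, A, invert, direction, canonicalize = (ch) => ch) {
  return function matcher(x, c) {
    const e = x.endIndex;
    const f = e + direction;
    if (f < 0 || f > input.length) return FAILURE;
    const index = Math.min(e, f);            // the character being consumed
    const cc = canonicalize(input[index]);
    const found = [...A].some((a) => canonicalize(a) === cc);
    if (found === invert) return FAILURE;    // covers both the invert and non-invert cases
    return c({ endIndex: f, captures: x.captures });
  };
}

const input = Array.from("ab");
const accept = (state) => state;             // Continuation that returns the final State
const matchA = characterSetMatcher(input, new Set(["a"]), false, +1);
const forwardResult = matchA({ endIndex: 0, captures: [] }, accept); // endIndex becomes 1
```

When direction is -1 the same closure consumes the character to the left of endIndex, which is why index is computed as min(e, f) rather than e.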

21.2.2.8.2 Runtime Semantics: Canonicalize ( `ch` )

The abstract operation Canonicalize takes a character parameter ch and performs the following steps:

If IgnoreCase is false, return ch.
If Unicode is true, then
1. If the file CaseFolding.txt of the Unicode Character Database provides a simple or common case folding mapping for ch, return the result of applying that mapping to ch.
2. Return ch.
Else,
1. Assert: ch is a UTF-16 code unit.
2. Let s be the String value consisting of the single code unit ch.
3. Let u be the same result produced as if by performing the algorithm for String.prototype.toUpperCase using s as the this value.
4. Assert: Type(u) is String.
5. If u does not consist of a single code unit, return ch.
6. Let cu be u's single code unit element.
7. If the numeric value of ch ≥ 128 and the numeric value of cu < 128, return ch.
8. Return cu.
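
The branch taken when Unicode is false can be sketched in JavaScript as shown below. The function name `canonicalizeNonUnicode` and its `ignoreCase` parameter are assumptions made for this example. The branch taken when Unicode is true is not sketched, because it depends on the contents of CaseFolding.txt.

```javascript
// Sketch of the Unicode-is-false branch only; ch is a String of exactly one code unit.
function canonicalizeNonUnicode(ch, ignoreCase) {
  if (!ignoreCase) return ch;
  const u = ch.toUpperCase();
  if (u.length !== 1) return ch;             // e.g. U+00DF uppercases to "SS"
  const cu = u;
  // Never map a code unit outside Basic Latin onto one inside it.
  if (ch.charCodeAt(0) >= 128 && cu.charCodeAt(0) < 128) return ch;
  return cu;
}
```

Under these assumptions, `canonicalizeNonUnicode("a", true)` returns "A", while both U+00DF (whose uppercase form has two code units) and U+017F (whose uppercase form "S" lies inside Basic Latin) are returned unchanged.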

Note 1

Parentheses of the form ( Disjunction ) serve both to group the components of the Disjunction pattern together and to save the result of the match. The result can be used in a backreference (\ followed by a nonzero decimal number), referenced in a replace String, or returned as part of an array from the regular expression matching internal procedure. To inhibit the capturing behaviour of parentheses, use the form (?: Disjunction ) instead.

Note 2

The form (?= Disjunction ) specifies a zero-width positive lookahead. In order for it to succeed, the pattern inside Disjunction must match at the current position, but the current position is not advanced before matching the sequel. If Disjunction can match at the current position in several ways, only the first one is tried. Unlike other regular expression operators, there is no backtracking into a (?= form (this unusual behaviour is inherited from Perl). This only matters when the Disjunction contains capturing parentheses and the sequel of the pattern contains backreferences to those captures.

For example,

/(?=(a+))/.exec("baaabac")

matches the empty String immediately after the first b and therefore returns the array:

["", "aaa"]

To illustrate the lack of backtracking into the lookahead, consider:

/(?=(a+))a*b\1/.exec("baaabac")

This expression returns

["aba", "a"]

and not:

["aaaba", "a"]

Note 3

The form (?! Disjunction ) specifies a zero-width negative lookahead. In order for it to succeed, the pattern inside Disjunction must fail to match at the current position. The current position is not advanced before matching the sequel. Disjunction can contain capturing parentheses, but backreferences to them only make sense from within Disjunction itself. Backreferences to these capturing parentheses from elsewhere in the pattern always return undefined because the negative lookahead must fail for the pattern to succeed. For example,

/(.*?)a(?!(a+)b\2c)\2(.*)/.exec("baaabaac")

looks for an a not immediately followed by some positive number n of a's, a b, another n a's (specified by the first \2) and a c. The second \2 is outside the negative lookahead, so it matches against undefined and therefore always succeeds. The whole expression returns the array:

["baaabaac", "ba", undefined, "abaac"]

Note 4

In case-insignificant matches when Unicode is true, all characters are implicitly case-folded using the simple mapping provided by the Unicode standard immediately before they are compared. The simple mapping always maps to a single code point, so it does not map, for example, "ß" (U+00DF) to "SS". It may however map a code point outside the Basic Latin range to a character within, for example, "ſ" (U+017F) to "s". Such characters are not mapped if Unicode is false. This prevents Unicode code points such as U+017F and U+212A from matching regular expressions such as /[a-z]/i, but they will match /[a-z]/ui.
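
The difference described in this note can be observed directly. The following informal example uses variable names chosen for illustration only.

```javascript
const longS = "\u017F";  // LATIN SMALL LETTER LONG S
const kelvin = "\u212A"; // KELVIN SIGN

// Without the u flag neither character is mapped into Basic Latin.
const withoutU = [/[a-z]/i.test(longS), /[a-z]/i.test(kelvin)]; // [false, false]

// With the u flag simple case folding maps them to "s" and "k".
const withU = [/[a-z]/iu.test(longS), /[a-z]/iu.test(kelvin)];  // [true, true]

// Simple case folding always yields a single code point,
// so U+00DF does not match "SS" even with both flags set.
const sharpS = /^\u00DF$/iu.test("SS"); // false
```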

21.2.2.8.3 Runtime Semantics: UnicodeMatchProperty ( `p` )

The abstract operation UnicodeMatchProperty takes a parameter p that is a List of Unicode code points and performs the following steps:

Assert: p is a List of Unicode code points that is identical to a List of Unicode code points that is a Unicode property name or property alias listed in the “Property name and aliases” column of Table 54 or Table 55.
Let c be the canonical property name of p as given in the “Canonical property name” column of the corresponding row.
Return the List of Unicode code points of c.

Implementations must support the Unicode property names and aliases listed in Table 54 and Table 55. To ensure interoperability, implementations must not support any other property names or aliases.

Note 1

For example, Script_Extensions (property name) and scx (property alias) are valid, but script_extensions and Scx are not.

Note 2

The listed properties form a superset of what UTS18 RL1.2 requires.

Table 54: Non-binary Unicode property aliases and their canonical property names

Property name and aliases	Canonical property name
`General_Category` `gc`	`General_Category`
`Script` `sc`	`Script`
`Script_Extensions` `scx`	`Script_Extensions`

Table 55: Binary Unicode property aliases and their canonical property names

Property name and aliases	Canonical property name
`ASCII`	`ASCII`
`ASCII_Hex_Digit` `AHex`	`ASCII_Hex_Digit`
`Alphabetic` `Alpha`	`Alphabetic`
`Any`	`Any`
`Assigned`	`Assigned`
`Bidi_Control` `Bidi_C`	`Bidi_Control`
`Bidi_Mirrored` `Bidi_M`	`Bidi_Mirrored`
`Case_Ignorable` `CI`	`Case_Ignorable`
`Cased`	`Cased`
`Changes_When_Casefolded` `CWCF`	`Changes_When_Casefolded`
`Changes_When_Casemapped` `CWCM`	`Changes_When_Casemapped`
`Changes_When_Lowercased` `CWL`	`Changes_When_Lowercased`
`Changes_When_NFKC_Casefolded` `CWKCF`	`Changes_When_NFKC_Casefolded`
`Changes_When_Titlecased` `CWT`	`Changes_When_Titlecased`
`Changes_When_Uppercased` `CWU`	`Changes_When_Uppercased`
`Dash`	`Dash`
`Default_Ignorable_Code_Point` `DI`	`Default_Ignorable_Code_Point`
`Deprecated` `Dep`	`Deprecated`
`Diacritic` `Dia`	`Diacritic`
`Emoji`	`Emoji`
`Emoji_Component`	`Emoji_Component`
`Emoji_Modifier`	`Emoji_Modifier`
`Emoji_Modifier_Base`	`Emoji_Modifier_Base`
`Emoji_Presentation`	`Emoji_Presentation`
`Extended_Pictographic`	`Extended_Pictographic`
`Extender` `Ext`	`Extender`
`Grapheme_Base` `Gr_Base`	`Grapheme_Base`
`Grapheme_Extend` `Gr_Ext`	`Grapheme_Extend`
`Hex_Digit` `Hex`	`Hex_Digit`
`IDS_Binary_Operator` `IDSB`	`IDS_Binary_Operator`
`IDS_Trinary_Operator` `IDST`	`IDS_Trinary_Operator`
`ID_Continue` `IDC`	`ID_Continue`
`ID_Start` `IDS`	`ID_Start`
`Ideographic` `Ideo`	`Ideographic`
`Join_Control` `Join_C`	`Join_Control`
`Logical_Order_Exception` `LOE`	`Logical_Order_Exception`
`Lowercase` `Lower`	`Lowercase`
`Math`	`Math`
`Noncharacter_Code_Point` `NChar`	`Noncharacter_Code_Point`
`Pattern_Syntax` `Pat_Syn`	`Pattern_Syntax`
`Pattern_White_Space` `Pat_WS`	`Pattern_White_Space`
`Quotation_Mark` `QMark`	`Quotation_Mark`
`Radical`	`Radical`
`Regional_Indicator` `RI`	`Regional_Indicator`
`Sentence_Terminal` `STerm`	`Sentence_Terminal`
`Soft_Dotted` `SD`	`Soft_Dotted`
`Terminal_Punctuation` `Term`	`Terminal_Punctuation`
`Unified_Ideograph` `UIdeo`	`Unified_Ideograph`
`Uppercase` `Upper`	`Uppercase`
`Variation_Selector` `VS`	`Variation_Selector`
`White_Space` `space`	`White_Space`
`XID_Continue` `XIDC`	`XID_Continue`
`XID_Start` `XIDS`	`XID_Start`
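
The exact-match lookup performed by UnicodeMatchProperty can be sketched as follows. This is an informal illustration: the Map holds only the non-binary property names and aliases, the binary properties are omitted for brevity, and throwing a SyntaxError is an assumption that stands in for the early error that rejects unknown names before this operation is reached. The helper `acceptsPropertyName` is likewise hypothetical.

```javascript
// Non-binary property names and aliases only; binary properties are omitted here.
const NON_BINARY_PROPERTIES = new Map([
  ["General_Category", "General_Category"], ["gc", "General_Category"],
  ["Script", "Script"], ["sc", "Script"],
  ["Script_Extensions", "Script_Extensions"], ["scx", "Script_Extensions"],
]);

function unicodeMatchProperty(p) {
  // The lookup is exact: no case folding and no removal of underscores.
  if (!NON_BINARY_PROPERTIES.has(p)) throw new SyntaxError("unknown property: " + p);
  return NON_BINARY_PROPERTIES.get(p);
}

// Hypothetical helper: does the engine accept \p{<name>=Greek} in Unicode mode?
function acceptsPropertyName(name) {
  try {
    new RegExp("\\p{" + name + "=Greek}", "u");
    return true;
  } catch (err) {
    return false;
  }
}
```

With these definitions, `unicodeMatchProperty("scx")` returns "Script_Extensions", and `acceptsPropertyName` is expected to return true for Script_Extensions and scx but false for script_extensions and Scx.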

21.2.2.8.4 Runtime Semantics: UnicodeMatchPropertyValue ( `p`, `v` )

The abstract operation UnicodeMatchPropertyValue takes two parameters p and v, each of which is a List of Unicode code points, and performs the following steps:

Assert: p is a List of Unicode code points that is identical to a List of Unicode code points that is a canonical, unaliased Unicode property name listed in the “Canonical property name” column of Table 54.
Assert: v is a List of Unicode code points that is identical to a List of Unicode code points that is a property value or property value alias for Unicode property p listed in the “Property value and aliases” column of Table 56 or Table 57.
Let value be the canonical property value of v as given in the “Canonical property value” column of the corresponding row.
Return the List of Unicode code points of value.

Implementations must support the Unicode property value names and aliases listed in Table 56 and Table 57. To ensure interoperability, implementations must not support any other property value names or aliases.

Note 1

For example, Xpeo and Old_Persian are valid Script_Extensions values, but xpeo and Old Persian aren't.

Note 2

This algorithm differs from the matching rules for symbolic values listed in UAX44: case, white space, U+002D (HYPHEN-MINUS), and U+005F (LOW LINE) are not ignored, and the Is prefix is not supported.
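
The strictness described in these notes can be checked from script code. In the informal example below, the helper `acceptsPropertyExpression` is hypothetical, and the results assume an implementation whose Unicode data includes the Old_Persian script.

```javascript
// Hypothetical helper: does the engine accept \p{<expression>} in Unicode mode?
function acceptsPropertyExpression(expression) {
  try {
    new RegExp("\\p{" + expression + "}", "u");
    return true;
  } catch (err) {
    return false;
  }
}

const valueChecks = {
  alias: acceptsPropertyExpression("Script_Extensions=Xpeo"),          // true
  canonical: acceptsPropertyExpression("Script_Extensions=Old_Persian"), // true
  lowercase: acceptsPropertyExpression("Script_Extensions=xpeo"),      // false: case is significant
  withSpace: acceptsPropertyExpression("Script_Extensions=Old Persian"), // false: no loose matching
  gcLowercase: acceptsPropertyExpression("gc=lu"),                      // false
};
```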

Table 56: Value aliases and canonical values for the Unicode property General_Category

Property value and aliases	Canonical property value
`Cased_Letter` `LC`	`Cased_Letter`
`Close_Punctuation` `Pe`	`Close_Punctuation`
`Connector_Punctuation` `Pc`	`Connector_Punctuation`
`Control` `Cc` `cntrl`	`Control`
`Currency_Symbol` `Sc`	`Currency_Symbol`
`Dash_Punctuation` `Pd`	`Dash_Punctuation`
`Decimal_Number` `Nd` `digit`	`Decimal_Number`
`Enclosing_Mark` `Me`	`Enclosing_Mark`
`Final_Punctuation` `Pf`	`Final_Punctuation`
`Format` `Cf`	`Format`
`Initial_Punctuation` `Pi`	`Initial_Punctuation`
`Letter` `L`	`Letter`
`Letter_Number` `Nl`	`Letter_Number`
`Line_Separator` `Zl`	`Line_Separator`
`Lowercase_Letter` `Ll`	`Lowercase_Letter`
`Mark` `M` `Combining_Mark`	`Mark`
`Math_Symbol` `Sm`	`Math_Symbol`
`Modifier_Letter` `Lm`	`Modifier_Letter`
`Modifier_Symbol` `Sk`	`Modifier_Symbol`
`Nonspacing_Mark` `Mn`	`Nonspacing_Mark`
`Number` `N`	`Number`
`Open_Punctuation` `Ps`	`Open_Punctuation`
`Other` `C`	`Other`
`Other_Letter` `Lo`	`Other_Letter`
`Other_Number` `No`	`Other_Number`
`Other_Punctuation` `Po`	`Other_Punctuation`
`Other_Symbol` `So`	`Other_Symbol`
`Paragraph_Separator` `Zp`	`Paragraph_Separator`
`Private_Use` `Co`	`Private_Use`
`Punctuation` `P` `punct`	`Punctuation`
`Separator` `Z`	`Separator`
`Space_Separator` `Zs`	`Space_Separator`
`Spacing_Mark` `Mc`	`Spacing_Mark`
`Surrogate` `Cs`	`Surrogate`
`Symbol` `S`	`Symbol`
`Titlecase_Letter` `Lt`	`Titlecase_Letter`
`Unassigned` `Cn`	`Unassigned`
`Uppercase_Letter` `Lu`	`Uppercase_Letter`

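Table 57: Value aliases and canonical values for the Unicode properties Script and Script_Extensions

Property value and aliases	Canonical property value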
`Adlam` `Adlm`	`Adlam`
`Ahom` `Ahom`	`Ahom`
`Anatolian_Hieroglyphs` `Hluw`	`Anatolian_Hieroglyphs`
`Arabic` `Arab`	`Arabic`
`Armenian` `Armn`	`Armenian`
`Avestan` `Avst`	`Avestan`
`Balinese` `Bali`	`Balinese`
`Bamum` `Bamu`	`Bamum`
`Bassa_Vah` `Bass`	`Bassa_Vah`
`Batak` `Batk`	`Batak`
`Bengali` `Beng`	`Bengali`
`Bhaiksuki` `Bhks`	`Bhaiksuki`
`Bopomofo` `Bopo`	`Bopomofo`
`Brahmi` `Brah`	`Brahmi`
`Braille` `Brai`	`Braille`
`Buginese` `Bugi`	`Buginese`
`Buhid` `Buhd`	`Buhid`
`Canadian_Aboriginal` `Cans`	`Canadian_Aboriginal`
`Carian` `Cari`	`Carian`
`Caucasian_Albanian` `Aghb`	`Caucasian_Albanian`
`Chakma` `Cakm`	`Chakma`
`Cham` `Cham`	`Cham`
`Cherokee` `Cher`	`Cherokee`
`Common` `Zyyy`	`Common`
`Coptic` `Copt` `Qaac`	`Coptic`
`Cuneiform` `Xsux`	`Cuneiform`
`Cypriot` `Cprt`	`Cypriot`
`Cyrillic` `Cyrl`	`Cyrillic`
`Deseret` `Dsrt`	`Deseret`
`Devanagari` `Deva`	`Devanagari`
`Dogra` `Dogr`	`Dogra`
`Duployan` `Dupl`	`Duployan`
`Egyptian_Hieroglyphs` `Egyp`	`Egyptian_Hieroglyphs`
`Elbasan` `Elba`	`Elbasan`
`Ethiopic` `Ethi`	`Ethiopic`
`Georgian` `Geor`	`Georgian`
`Glagolitic` `Glag`	`Glagolitic`
`Gothic` `Goth`	`Gothic`
`Grantha` `Gran`	`Grantha`
`Greek` `Grek`	`Greek`
`Gujarati` `Gujr`	`Gujarati`
`Gunjala_Gondi` `Gong`	`Gunjala_Gondi`
`Gurmukhi` `Guru`	`Gurmukhi`
`Han` `Hani`	`Han`
`Hangul` `Hang`	`Hangul`
`Hanifi_Rohingya` `Rohg`	`Hanifi_Rohingya`
`Hanunoo` `Hano`	`Hanunoo`
`Hatran` `Hatr`	`Hatran`
`Hebrew` `Hebr`	`Hebrew`
`Hiragana` `Hira`	`Hiragana`
`Imperial_Aramaic` `Armi`	`Imperial_Aramaic`
`Inherited` `Zinh` `Qaai`	`Inherited`
`Inscriptional_Pahlavi` `Phli`	`Inscriptional_Pahlavi`
`Inscriptional_Parthian` `Prti`	`Inscriptional_Parthian`
`Javanese` `Java`	`Javanese`
`Kaithi` `Kthi`	`Kaithi`
`Kannada` `Knda`	`Kannada`
`Katakana` `Kana`	`Katakana`
`Kayah_Li` `Kali`	`Kayah_Li`
`Kharoshthi` `Khar`	`Kharoshthi`
`Khmer` `Khmr`	`Khmer`
`Khojki` `Khoj`	`Khojki`
`Khudawadi` `Sind`	`Khudawadi`
`Lao` `Laoo`	`Lao`
`Latin` `Latn`	`Latin`
`Lepcha` `Lepc`	`Lepcha`
`Limbu` `Limb`	`Limbu`
`Linear_A` `Lina`	`Linear_A`
`Linear_B` `Linb`	`Linear_B`
`Lisu` `Lisu`	`Lisu`
`Lycian` `Lyci`	`Lycian`
`Lydian` `Lydi`	`Lydian`
`Mahajani` `Mahj`	`Mahajani`
`Makasar` `Maka`	`Makasar`
`Malayalam` `Mlym`	`Malayalam`
`Mandaic` `Mand`	`Mandaic`
`Manichaean` `Mani`	`Manichaean`
`Marchen` `Marc`	`Marchen`
`Medefaidrin` `Medf`	`Medefaidrin`
`Masaram_Gondi` `Gonm`	`Masaram_Gondi`
`Meetei_Mayek` `Mtei`	`Meetei_Mayek`
`Mende_Kikakui` `Mend`	`Mende_Kikakui`
`Meroitic_Cursive` `Merc`	`Meroitic_Cursive`
`Meroitic_Hieroglyphs` `Mero`	`Meroitic_Hieroglyphs`
`Miao` `Plrd`	`Miao`
`Modi` `Modi`	`Modi`
`Mongolian` `Mong`	`Mongolian`
`Mro` `Mroo`	`Mro`
`Multani` `Mult`	`Multani`
`Myanmar` `Mymr`	`Myanmar`
`Nabataean` `Nbat`	`Nabataean`
`New_Tai_Lue` `Talu`	`New_Tai_Lue`
`Newa` `Newa`	`Newa`
`Nko` `Nkoo`	`Nko`
`Nushu` `Nshu`	`Nushu`
`Ogham` `Ogam`	`Ogham`
`Ol_Chiki` `Olck`	`Ol_Chiki`
`Old_Hungarian` `Hung`	`Old_Hungarian`
`Old_Italic` `Ital`	`Old_Italic`
`Old_North_Arabian` `Narb`	`Old_North_Arabian`
`Old_Permic` `Perm`	`Old_Permic`
`Old_Persian` `Xpeo`	`Old_Persian`
`Old_Sogdian` `Sogo`	`Old_Sogdian`
`Old_South_Arabian` `Sarb`	`Old_South_Arabian`
`Old_Turkic` `Orkh`	`Old_Turkic`
`Oriya` `Orya`	`Oriya`
`Osage` `Osge`	`Osage`
`Osmanya` `Osma`	`Osmanya`
`Pahawh_Hmong` `Hmng`	`Pahawh_Hmong`
`Palmyrene` `Palm`	`Palmyrene`
`Pau_Cin_Hau` `Pauc`	`Pau_Cin_Hau`
`Phags_Pa` `Phag`	`Phags_Pa`
`Phoenician` `Phnx`	`Phoenician`
`Psalter_Pahlavi` `Phlp`	`Psalter_Pahlavi`
`Rejang` `Rjng`	`Rejang`
`Runic` `Runr`	`Runic`
`Samaritan` `Samr`	`Samaritan`
`Saurashtra` `Saur`	`Saurashtra`
`Sharada` `Shrd`	`Sharada`
`Shavian` `Shaw`	`Shavian`
`Siddham` `Sidd`	`Siddham`
`SignWriting` `Sgnw`	`SignWriting`
`Sinhala` `Sinh`	`Sinhala`
`Sogdian` `Sogd`	`Sogdian`
`Sora_Sompeng` `Sora`	`Sora_Sompeng`
`Soyombo` `Soyo`	`Soyombo`
`Sundanese` `Sund`	`Sundanese`
`Syloti_Nagri` `Sylo`	`Syloti_Nagri`
`Syriac` `Syrc`	`Syriac`
`Tagalog` `Tglg`	`Tagalog`
`Tagbanwa` `Tagb`	`Tagbanwa`
`Tai_Le` `Tale`	`Tai_Le`
`Tai_Tham` `Lana`	`Tai_Tham`
`Tai_Viet` `Tavt`	`Tai_Viet`
`Takri` `Takr`	`Takri`
`Tamil` `Taml`	`Tamil`
`Tangut` `Tang`	`Tangut`
`Telugu` `Telu`	`Telugu`
`Thaana` `Thaa`	`Thaana`
`Thai` `Thai`	`Thai`
`Tibetan` `Tibt`	`Tibetan`
`Tifinagh` `Tfng`	`Tifinagh`
`Tirhuta` `Tirh`	`Tirhuta`
`Ugaritic` `Ugar`	`Ugaritic`
`Vai` `Vaii`	`Vai`
`Warang_Citi` `Wara`	`Warang_Citi`
`Yi` `Yiii`	`Yi`
`Zanabazar_Square` `Zanb`	`Zanabazar_Square`

21.2.2.9 AtomEscape

With parameter direction.

The production AtomEscape :: DecimalEscape evaluates as follows:

Evaluate DecimalEscape to obtain an integer n.
Assert: n ≤ NcapturingParens.
Call BackreferenceMatcher(n, direction) and return its Matcher result.

The production AtomEscape :: CharacterEscape evaluates as follows:

Evaluate CharacterEscape to obtain a character ch.
Let A be a one-element CharSet containing the character ch.
Call CharacterSetMatcher(A, false, direction) and return its Matcher result.

The production AtomEscape :: CharacterClassEscape evaluates as follows:

Evaluate CharacterClassEscape to obtain a CharSet A.
Call CharacterSetMatcher(A, false, direction) and return its Matcher result.

Note

An escape sequence of the form \ followed by a nonzero decimal number n matches the result of the nth set of capturing parentheses (21.2.2.1). It is an error if the regular expression has fewer than n capturing parentheses. If the regular expression has n or more capturing parentheses but the nth one is undefined because it has not captured anything, then the backreference always succeeds.
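
As an informal illustration of a backreference to a capture that is undefined, consider the following; the variable name is chosen for this example only.

```javascript
// The first alternative fails on "b", so capture 1 never participates.
// In the second alternative, \1 refers to an undefined capture and succeeds
// without consuming any input.
const undefinedBackref = /(a)|b\1/.exec("b");
// undefinedBackref[0] is "b" and undefinedBackref[1] is undefined
```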

The production AtomEscape :: k GroupName evaluates as follows:

Search the enclosing Pattern for an instance of a GroupSpecifier for a RegExpIdentifierName which has a StringValue equal to the StringValue of the RegExpIdentifierName contained in GroupName.
Assert: A unique such GroupSpecifier is found.
Let parenIndex be the number of left-capturing parentheses in the entire regular expression that occur to the left of the located GroupSpecifier. This is the total number of Atom :: ( GroupSpecifier Disjunction ) Parse Nodes prior to or enclosing the located GroupSpecifier.
Call BackreferenceMatcher(parenIndex, direction) and return its Matcher result.

21.2.2.9.1 Runtime Semantics: BackreferenceMatcher ( `n`, `direction` )

The abstract operation BackreferenceMatcher takes two arguments, an integer n and an integer direction, and performs the following steps:

Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
1. Let cap be x's captures List.
2. Let s be cap[n].
3. If s is undefined, return c(x).
4. Let e be x's endIndex.
5. Let len be the number of elements in s.
6. Let f be e + direction × len.
7. If f < 0 or f > InputLength, return failure.
8. Let g be min(e, f).
9. If there exists an integer i between 0 (inclusive) and len (exclusive) such that Canonicalize(s[i]) is not the same character value as Canonicalize(Input[g + i]), return failure.
10. Let y be the State (f, cap).
11. Call c(y) and return its result.
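
The steps above can be sketched in JavaScript as follows. As before, this is only an illustration under stated assumptions: a State is modelled as `{ endIndex, captures }`, captures is an array whose index 0 is unused so that cap[n] can be read directly, each capture is an array of characters, and `NO_MATCH`, the explicit `input` parameter, and the optional `canonicalize` parameter are hypothetical names.

```javascript
const NO_MATCH = Symbol("failure"); // hypothetical stand-in for the spec value failure

function backreferenceMatcher(input, n, direction, canonicalize = (ch) => ch) {
  return function matcher(x, c) {
    const cap = x.captures;
    const s = cap[n];
    if (s === undefined) return c(x);         // undefined capture: always succeeds
    const e = x.endIndex;
    const len = s.length;
    const f = e + direction * len;
    if (f < 0 || f > input.length) return NO_MATCH;
    const g = Math.min(e, f);                 // leftmost index of the compared span
    for (let i = 0; i < len; i++) {
      if (canonicalize(s[i]) !== canonicalize(input[g + i])) return NO_MATCH;
    }
    return c({ endIndex: f, captures: cap });
  };
}

const text = Array.from("abab");
const done = (state) => state;                // Continuation that returns the final State
const captures = [undefined, ["a", "b"]];     // capture 1 holds "ab"
const ahead = backreferenceMatcher(text, 1, +1)({ endIndex: 2, captures }, done);
const behind = backreferenceMatcher(text, 1, -1)({ endIndex: 2, captures }, done);
// ahead.endIndex is 4 and behind.endIndex is 0
```

Note that the captured characters are always compared left to right starting at g; direction only determines on which side of endIndex the compared span lies.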

21.2.2.10 CharacterEscape

The CharacterEscape productions evaluate as follows:

CharacterEscape ::
  ControlEscape
  c ControlLetter
  0 [lookahead ∉ DecimalDigit]
  HexEscapeSequence
  RegExpUnicodeEscapeSequence
  IdentityEscape

Let cv be the CharacterValue of this CharacterEscape.
Return the character whose character value is cv.

21.2.2.11 DecimalEscape

The DecimalEscape productions evaluate as follows:

DecimalEscape :: NonZeroDigit DecimalDigits opt

Return the CapturingGroupNumber of this DecimalEscape.

Note

If \ is followed by a decimal number n whose first digit is not 0, then the escape sequence is considered to be a backreference. It is an error if n is greater than the total number of left-capturing parentheses in the entire regular expression.
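
The error condition in this note can be observed as shown below. The example uses the u flag on the assumption that an implementation supporting a legacy web-compatibility grammar might otherwise accept an out-of-range escape with a different meaning; the helper `compilesInUnicodeMode` is hypothetical.

```javascript
// Hypothetical helper: reports whether source compiles as a Unicode-mode pattern.
function compilesInUnicodeMode(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

const oneGroupOneRef = compilesInUnicodeMode("(a)\\1"); // true
const refTooLarge = compilesInUnicodeMode("(a)\\2");    // false: only one capturing group
const forwardRef = compilesInUnicodeMode("\\1(a)");     // true: the count covers the whole pattern
```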

21.2.2.12 CharacterClassEscape

The production CharacterClassEscape :: d evaluates as follows:

Return the ten-element set of characters containing the characters 0 through 9 inclusive.

The production CharacterClassEscape :: D evaluates as follows:

Return the set of all characters not included in the set returned by CharacterClassEscape :: d.

The production CharacterClassEscape :: s evaluates as follows:

Return the set of characters containing the characters that are on the right-hand side of the WhiteSpace or LineTerminator productions.

The production CharacterClassEscape :: S evaluates as follows:

Return the set of all characters not included in the set returned by CharacterClassEscape :: s .

The production CharacterClassEscape :: w evaluates as follows:

Return the set of all characters returned by WordCharacters().

The production CharacterClassEscape :: W evaluates as follows:

Return the set of all characters not included in the set returned by CharacterClassEscape :: w .
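
A few informal checks illustrate the sets returned by the d, s, and w productions when neither the u nor the i flag is present. The object name `escapeChecks` is chosen for this example only.

```javascript
const escapeChecks = {
  asciiDigit: /\d/.test("7"),             // true
  arabicIndicDigit: /\d/.test("\u0663"),   // false: \d covers only 0 through 9
  lineSeparator: /\s/.test("\u2028"),      // true: U+2028 is a LineTerminator
  noBreakSpace: /\s/.test("\u00A0"),       // true: U+00A0 is WhiteSpace
  underscore: /\w/.test("_"),              // true
  accentedLetter: /\w/.test("\u00E9"),     // false: outside the basic word characters
  notDigit: /\D/.test("\u0663"),           // true: complement of \d
};
```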

The production CharacterClassEscape :: p{ UnicodePropertyValueExpression } evaluates by returning the CharSet containing all Unicode code points included in the CharSet returned by UnicodePropertyValueExpression.

The production CharacterClassEscape :: P{ UnicodePropertyValueExpression } evaluates by returning the CharSet containing all Unicode code points not included in the CharSet returned by UnicodePropertyValueExpression.

The production UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue evaluates as follows:

Let ps be SourceText of UnicodePropertyName.
Let p be ! UnicodeMatchProperty(ps).
Assert: p is a Unicode property name or property alias listed in the “Property name and aliases” column of Table 54.
Let vs be SourceText of UnicodePropertyValue.
Let v be ! UnicodeMatchPropertyValue(p, vs).
Return the CharSet containing all Unicode code points whose character database definition includes the property p with value v.

The production UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue evaluates as follows:

Let s be SourceText of LoneUnicodePropertyNameOrValue.
If ! UnicodeMatchPropertyValue("General_Category", s) is identical to a List of Unicode code points that is the name of a Unicode general category or general category alias listed in the “Property value and aliases” column of Table 56, then
1. Return the CharSet containing all Unicode code points whose character database definition includes the property “General_Category” with value s.
Let p be ! UnicodeMatchProperty(s).
Assert: p is a binary Unicode property or binary property alias listed in the “Property name and aliases” column of Table 55.
Return the CharSet containing all Unicode code points whose character database definition includes the property p with value “True”.
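
The two UnicodePropertyValueExpression forms can be illustrated informally as follows. The object name `propertyChecks` is chosen for this example only, and the results assume an implementation with current Unicode data.

```javascript
const propertyChecks = {
  // UnicodePropertyName = UnicodePropertyValue
  explicitCategory: /\p{General_Category=Uppercase_Letter}/u.test("A"), // true
  script: /\p{Script=Greek}/u.test("\u03B1"),                           // true (GREEK SMALL LETTER ALPHA)
  // LoneUnicodePropertyNameOrValue that names a General_Category value
  loneCategory: /\p{Lu}/u.test("A"),                                    // true
  // LoneUnicodePropertyNameOrValue that names a binary property
  binaryProperty: /\p{Alphabetic}/u.test("A"),                          // true
  // \P{...} matches the code points not in the set
  negated: /\P{Lu}/u.test("A"),                                         // false
};
```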

21.2.2.13 CharacterClass

The production CharacterClass :: [ ClassRanges ] evaluates as follows:

Evaluate ClassRanges to obtain a CharSet A.
Return the two results A and false.

The production CharacterClass :: [ ^ ClassRanges ] evaluates as follows:

Evaluate ClassRanges to obtain a CharSet A.
Return the two results A and true.

21.2.2.14 ClassRanges

The production ClassRanges :: [empty] evaluates as follows:

Return the empty CharSet.

The production ClassRanges :: NonemptyClassRanges evaluates as follows:

Return the CharSet that is the result of evaluating NonemptyClassRanges.

21.2.2.15 NonemptyClassRanges

The production NonemptyClassRanges :: ClassAtom evaluates as follows:

Return the CharSet that is the result of evaluating ClassAtom.

The production NonemptyClassRanges :: ClassAtom NonemptyClassRangesNoDash evaluates as follows:

Evaluate ClassAtom to obtain a CharSet A.
Evaluate NonemptyClassRangesNoDash to obtain a CharSet B.
Return the union of CharSets A and B.

The production NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges evaluates as follows:

Evaluate the first ClassAtom to obtain a CharSet A.
Evaluate the second ClassAtom to obtain a CharSet B.
Evaluate ClassRanges to obtain a CharSet C.
Call CharacterRange(A, B) and let D be the resulting CharSet.
Return the union of CharSets D and C.

21.2.2.15.1 Runtime Semantics: CharacterRange ( `A`, `B` )

The abstract operation CharacterRange takes two CharSet parameters A and B and performs the following steps:

Assert: A and B each contain exactly one character.
Let a be the one character in CharSet A.
Let b be the one character in CharSet B.
Let i be the character value of character a.
Let j be the character value of character b.
Assert: i ≤ j.
Return the set containing all characters numbered i through j, inclusive.
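
The steps above can be sketched in JavaScript as follows. Modelling a CharSet as a `Set` of one-character Strings and reporting violated assertions by throwing are assumptions made for this example.

```javascript
// A and B are Sets that each contain exactly one character.
function characterRange(A, B) {
  if (A.size !== 1 || B.size !== 1) throw new Error("A and B must each contain one character");
  const [a] = A;
  const [b] = B;
  const i = a.codePointAt(0);
  const j = b.codePointAt(0);
  if (i > j) throw new RangeError("range out of order");
  const result = new Set();
  for (let k = i; k <= j; k++) result.add(String.fromCodePoint(k));
  return result;
}
```

For example, `characterRange(new Set(["a"]), new Set(["e"]))` contains the five characters a through e, and the range from E to f contains 34 characters, including the symbol _.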

21.2.2.16 NonemptyClassRangesNoDash

The production NonemptyClassRangesNoDash :: ClassAtom evaluates as follows:

Return the CharSet that is the result of evaluating ClassAtom.

The production NonemptyClassRangesNoDash :: ClassAtomNoDash NonemptyClassRangesNoDash evaluates as follows:

Evaluate ClassAtomNoDash to obtain a CharSet A.
Evaluate NonemptyClassRangesNoDash to obtain a CharSet B.
Return the union of CharSets A and B.

The production NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassRanges evaluates as follows:

Evaluate ClassAtomNoDash to obtain a CharSet A.
Evaluate ClassAtom to obtain a CharSet B.
Evaluate ClassRanges to obtain a CharSet C.
Call CharacterRange(A, B) and let D be the resulting CharSet.
Return the union of CharSets D and C.

Note 1

ClassRanges can expand into a single ClassAtom and/or ranges of two ClassAtom separated by dashes. In the latter case the ClassRanges includes all characters between the first ClassAtom and the second ClassAtom, inclusive; an error occurs if either ClassAtom does not represent a single character (for example, if one is \w) or if the first ClassAtom's character value is greater than the second ClassAtom's character value.

Note 2

Even if the pattern ignores case, the case of the two ends of a range is significant in determining which characters belong to the range. Thus, for example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while the pattern /[E-f]/i matches all upper and lower-case letters in the Unicode Basic Latin block as well as the symbols [, \, ], ^, _, and `.
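
The two patterns in this note behave as shown in the following informal example; the object names are chosen for illustration only.

```javascript
const narrow = /[E-F]/i;
const wide = /[E-f]/i;

const narrowChecks = [narrow.test("e"), narrow.test("g"), narrow.test("_")]; // [true, false, false]

const wideChecks = {
  z: wide.test("z"),         // true: Z lies between E and f
  underscore: wide.test("_"), // true
  backtick: wide.test("`"),   // true
  brace: wide.test("{"),      // false: U+007B lies after f
  at: wide.test("@"),         // false: U+0040 lies before E
};
```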

Note 3

A - character can be treated literally or it can denote a range. It is treated literally if it is the first or last character of ClassRanges, the beginning or end limit of a range specification, or immediately follows a range specification.
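
Each of the cases listed in this note is illustrated informally below; the object name `dashChecks` is chosen for this example only.

```javascript
const dashChecks = {
  first: /[-a]/.test("-"),             // true: - is the first character
  last: /[a-]/.test("-"),              // true: - is the last character
  afterRange: /[a-c-e]/.test("-"),     // true: - immediately follows the range a-c
  notARange: /[a-c-e]/.test("d"),      // false: c-e is not a second range
  literalE: /[a-c-e]/.test("e"),       // true: e is matched literally
  rangeEnd: /[+--]/.test(","),         // true: - is the end limit of the range from + to -
};
```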

21.2.2.17 ClassAtom

The production ClassAtom :: - evaluates as follows:

Return the CharSet containing the single character - U+002D (HYPHEN-MINUS).

The production ClassAtom :: ClassAtomNoDash evaluates as follows:

Return the CharSet that is the result of evaluating ClassAtomNoDash.

21.2.2.18 ClassAtomNoDash

The production ClassAtomNoDash :: SourceCharacter but not one of \ or ] or - evaluates as follows:

Return the CharSet containing the character matched by SourceCharacter.

The production ClassAtomNoDash :: \ ClassEscape evaluates as follows:

Return the CharSet that is the result of evaluating ClassEscape.

21.2.2.19 ClassEscape

The ClassEscape productions evaluate as follows:

Let cv be the CharacterValue of this ClassEscape.
Let c be the character whose character value is cv.
Return the CharSet containing the single character c.