WCHawk

Does Safari support JavaScript RegExp?

In a cross-browser test of RegExp, Safari seemed to not support it.

MacBook Pro, OS X Mavericks (10.9.1)

Posted on Sep 5, 2016 6:40 PM

Carolyn Samit

Sep 5, 2016 8:37 PM in response to WCHawk

VikingOSX

Sep 6, 2016 7:13 AM in response to WCHawk

The answer is yes: Safari has supported JavaScript RegExp for years. It is straightforward when used in HTML. If you use RegExp in a do JavaScript command within AppleScript, it invokes an Apple Event, and Safari 9.1.3 will block you with the following dialog:

User uploaded file

One then enables Allow JavaScript from Apple Events from the Safari Develop menu, at the cost of a user password prompt when the AppleScript is run.

Look at the Regular Expressions example for JavaScript at Rosetta Code. Add it to a <script> section in an HTML document, and add the following line just before the closing </script> tag. Then open the HTML file in Safari 9.1.3.

alert(isMatch + " " + matches[1]);

Lookbehind in JS regular expressions

The positive lookbehind ( (?<= ) ) and negative lookbehind ( (?<! ) ) zero-width assertions in JavaScript regular expressions can be used to ensure a pattern is preceded by another pattern.

Chrome

  • 4 - 61 : Not supported
  • 62 - 125 : Supported
  • 126 : Supported
  • 127 - 129 : Supported

Edge

  • 12 - 18 : Not supported
  • 79 - 124 : Supported
  • 125 : Supported

Safari

  • 3.1 - 16.3 : Not supported
  • 16.4 - 17.4 : Supported
  • 17.5 : Supported
  • 17.6 - TP : Supported

Firefox

  • 2 - 77 : Not supported
  • 78 - 126 : Supported
  • 127 : Supported
  • 128 - 130 : Supported

Opera

  • 9 - 48 : Not supported
  • 49 - 109 : Supported
  • 110 : Supported

IE

  • 5.5 - 10 : Not supported
  • 11 : Not supported

Chrome for Android

Safari on iOS

  • 3.2 - 16.3 : Not supported
  • 17.6 - 18.0 : Supported

Samsung Internet

  • 4 - 7.4 : Not supported
  • 8.2 - 24 : Supported
  • 25 : Supported
  • all : Not supported

Opera Mobile

  • 10 - 12.1 : Not supported
  • 80 : Supported

UC Browser for Android

  • 15.5 : Supported

Android Browser

  • 2.1 - 4.4.4 : Not supported

Firefox for Android

  • 14.9 : Supported

Baidu Browser

  • 13.52 : Supported

KaiOS Browser

  • 2.5 : Not supported
  • 3 : Supported
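
Because support differs across the versions listed above, code that must run on older browsers can probe for lookbehind at run time before relying on it. The following is a minimal sketch, assuming a hypothetical helper name supportsLookbehind. The pattern is built from a string so that an engine without lookbehind throws a catchable error instead of failing to parse the whole script.

```javascript
// Hypothetical helper (not part of the compatibility data): detects lookbehind support.
function supportsLookbehind() {
  try {
    // Constructing from a string defers the syntax check to run time.
    new RegExp("(?<=a)b");
    return true;
  } catch (err) {
    return false;
  }
}

const lookbehindAvailable = supportsLookbehind();
console.log(lookbehindAvailable ? "lookbehind available" : "use a fallback pattern");
```

A regex literal containing a lookbehind would be rejected when the script is parsed, which is why the probe avoids one.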

Regex not compatible with safari, need help to convert

Hello guys, it seems that Safari does not support lookbehind assertions. I have a script that works fine everywhere except on Safari. Is there a way to make it fully cross-browser?

Use two regexes, one run after the other.

Isn't it possible to use a single one?

Could someone provide an example? I have no experience with regex.

this is the good work…but with no look behind for safari

It would be very helpful if you showed what you were trying to match. You can’t have lookbehind if you need to support Safari, but you can definitely use something different to match what you want. It’s difficult to know what though without knowing what your aim is.
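
The original pattern was never posted, so the following is only a sketch of the two-step idea on made-up input: the first regex matches the prefix together with the value, and the second strips the prefix. The function name extractAfterPrefix and the "user:" prefix are hypothetical.

```javascript
// Hypothetical example: get the value after "user:" without a lookbehind.
function extractAfterPrefix(input) {
  const withPrefix = input.match(/user:\w+/); // step 1: match prefix + value
  if (withPrefix === null) return null;
  return withPrefix[0].replace(/^user:/, ""); // step 2: drop the prefix
}

console.log(extractAfterPrefix("user:alice role:admin")); // "alice"
```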

Workaround for Lookbehind Regex in Safari

Prathamesh Sonpatki

In this post we will discuss:

  • What are Lookbehind Regex
  • Browser Compatibility of Lookbehind Regex
  • Alternative ways to use them so that it works in all browsers

What are Lookbehind regex

At times, we need to match a pattern only if it is followed or preceded by another pattern.

For example, consider a URL that contains the organization information:

Here, :org is dynamic name of the organization which can be of following pattern:

We want to match all URLs which match the pattern for

But there are also URLs such as which we don't want to match.

Where the slug is also of same pattern as /[a-z0-9]/ .

So we want to match the organization URLs only if they are preceded by the word organizations.

There is a way to write such regular expressions using Lookbehind regex which checks if a pattern is preceded by specific pattern.

The syntax is:

  • Positive lookbehind: (?<=Y)X , matches X , but only if there’s Y before it.
  • Negative lookbehind: (?<!Y)X , matches X , but only if there’s no Y before it.
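
As an illustration of the two forms, here is a small sketch on made-up input (the string and variable names are not from the post). It needs an engine that supports lookbehind.

```javascript
const sample = "cost: $42, qty: 7"; // made-up input

// Positive lookbehind: digits that come right after a "$".
const afterDollar = sample.match(/(?<=\$)\d+/g);

// Negative lookbehind: digits not preceded by "$" or by another digit.
const notAfterDollar = sample.match(/(?<![$\d])\d+/g);

console.log(afterDollar);    // ["42"]
console.log(notAfterDollar); // ["7"]
```

The negative form also excludes a preceding digit. Otherwise the "2" in "42" would match, since it is not directly preceded by "$".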

In our case:

We use this as follows:

Which will basically replace anything that followed our pattern organizations/:org :

Browser Compatibility

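Lookbehind regexes are very powerful, but they are not supported in all browsers. At the time this post was written, Safari did not support them; according to the compatibility data above, Safari added support only in version 16.4.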

Let's run the same example in Safari:

Alternative to use Lookbehind Regex in Safari

We can relook at our usage and try to avoid the lookbehind regex and just use the capture groups.

In this case, we capture two groups, the first one where organizations/:org/ gets captured and then in second everything else.

We want to keep organizations/:org/ and drop everything else. This can be achieved with following:

This basically keeps the first match and drops everything else. Most importantly, this works on Safari as well!
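
The post's own snippet is missing from this copy. A minimal sketch of the capture-group approach it describes, using a hypothetical URL and the slug pattern [a-z0-9]+ mentioned earlier, could look like this:

```javascript
// Hypothetical URL; the variable names are assumptions for illustration.
const url = "/organizations/acme42/projects/7/settings";

// Group 1 keeps "organizations/:org/", group 2 swallows everything after it.
const trimmed = url.replace(/(organizations\/[a-z0-9]+\/)(.*)/, "$1");

console.log(trimmed); // "/organizations/acme42/"
```

Replacing the whole match with "$1" drops the second group, so no lookbehind is needed.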

P.S. If you want detailed overview on Lookbehind and its friend Lookahead regex, this is an excellent post - https://javascript.info/regexp-lookahead-lookbehind

Flagrant Badassery

A JavaScript and regular expression centric blog

Safari Support with XRegExp 0.2.2

When I released XRegExp 0.2 several days ago, I hadn't yet tested in Safari or Swift. When I remembered to do this shortly afterwards, I found that both of those WebKit-based browsers didn't like it and often crashed when trying to use it! This was obviously a Very Bad Thing, but due to major time availability issues I wasn't able to get around to in-depth bug-shooting and testing until tonight.

It turns out that Safari's regex engine contains a bug which causes an error to be thrown when compiling a regex containing a character class ending with "[\\".

As a result, I've changed two instances of [^[\\] to [^\\[] and upped the version number to 0.2.2. XRegExp has now been tested and works without any known issues in all of the following browsers:
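
The reordering does not change what the class matches. As a quick sketch (variable names are assumptions), both forms exclude exactly "[" and "\" in a current engine:

```javascript
const original = /[^[\\]/;  // class ends with "[\\", the form reported as failing
const reordered = /[^\\[]/; // same set of excluded characters, different order

for (const ch of ["[", "\\", "a", "]"]) {
  console.log(JSON.stringify(ch), original.test(ch), reordered.test(ch));
}
```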

  • Internet Explorer 5.5 – 7
  • Firefox 2.0.0.4
  • Safari 3.0.2 beta for Windows

You can get the newest version here .

3 thoughts on “Safari Support with XRegExp 0.2.2”

I have now published my SweetXML parser, using your XRegExp.

I developed it first without using your code, and you will be pleased to hear that it was a 5 minute swap-in. Worked a dream, with no errors.

SweetButty seems like the perfect kind of project to use XRegExp… a regex intensive app, which becomes much more self-documenting through named capture.

http://bugs.webkit.org/show_bug.cgi?id=14823

Regular expressions

Regular expressions are patterns used to match character combinations in strings. In JavaScript, regular expressions are also objects. These patterns are used with the exec() and test() methods of RegExp , and with the match() , matchAll() , replace() , replaceAll() , search() , and split() methods of String . This chapter describes JavaScript regular expressions. It provides a brief overview of each syntax element. For a detailed explanation of each one's semantics, read the regular expressions reference.

Creating a regular expression

You construct a regular expression in one of two ways:

  • Using a regular expression literal, which consists of a pattern enclosed between slashes, as follows: const re = /ab+c/; Regular expression literals provide compilation of the regular expression when the script is loaded. If the regular expression remains constant, using this can improve performance.
  • Or calling the constructor function of the RegExp object, as follows: const re = new RegExp("ab+c"); Using the constructor function provides runtime compilation of the regular expression. Use the constructor function when you know the regular expression pattern will be changing, or you don't know the pattern and are getting it from another source, such as user input.

Writing a regular expression pattern

A regular expression pattern is composed of simple characters, such as /abc/ , or a combination of simple and special characters, such as /ab*c/ or /Chapter (\d+)\.\d*/ . The last example includes parentheses, which are used as a memory device. The match made with this part of the pattern is remembered for later use, as described in Using groups .

Using simple patterns

Simple patterns are constructed of characters for which you want to find a direct match. For example, the pattern /abc/ matches character combinations in strings only when the exact sequence "abc" occurs (all characters together and in that order). Such a match would succeed in the strings "Hi, do you know your abc's?" and "The latest airplane designs evolved from slabcraft." . In both cases the match is with the substring "abc" . There is no match in the string "Grab crab" because while it contains the substring "ab c" , it does not contain the exact substring "abc" .

Using special characters

When the search for a match requires something more than a direct match, such as finding one or more b's, or finding white space, you can include special characters in the pattern. For example, to match a single "a" followed by zero or more "b" s followed by "c" , you'd use the pattern /ab*c/ : the * after "b" means "0 or more occurrences of the preceding item." In the string "cbbabbbbcdebc" , this pattern will match the substring "abbbbc" .
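
The matches described in the last two subsections can be checked directly. This is a small sketch using the strings from the text; the variable names are assumptions.

```javascript
const simple = /abc/;
console.log(simple.test("Hi, do you know your abc's?")); // true
console.log(simple.test("The latest airplane designs evolved from slabcraft.")); // true
console.log(simple.test("Grab crab")); // false, only "ab c" occurs

const special = /ab*c/;
const firstMatch = "cbbabbbbcdebc".match(special);
console.log(firstMatch[0]); // "abbbbc"
```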

The following pages provide lists of the different special characters that fit into each category, along with descriptions and examples.

Assertions

Assertions include boundaries, which indicate the beginnings and endings of lines and words, and other patterns indicating in some way that a match is possible (including look-ahead, look-behind, and conditional expressions).

Character classes

Distinguish different types of characters. For example, distinguishing between letters and digits.

Groups and backreferences

Groups group multiple patterns as a whole, and capturing groups provide extra submatch information when using a regular expression pattern to match against a string. Backreferences refer to a previously captured group in the same regular expression.

Quantifiers

Indicate numbers of characters or expressions to match.

If you want to look at all the special characters that can be used in regular expressions in a single table, see the following:

Note: A larger cheat sheet is also available (only aggregating parts of those individual articles).

If you need to use any of the special characters literally (actually searching for a "*" , for instance), you must escape it by putting a backslash in front of it. For instance, to search for "a" followed by "*" followed by "b" , you'd use /a\*b/ — the backslash "escapes" the "*" , making it literal instead of special.

Similarly, if you're writing a regular expression literal and need to match a slash ("/"), you need to escape that (otherwise, it terminates the pattern). For instance, to search for the string "/example/" followed by one or more alphabetic characters, you'd use /\/example\/[a-z]+/i —the backslashes before each slash make them literal.

To match a literal backslash, you need to escape the backslash. For instance, to match the string "C:\" where "C" can be any letter, you'd use /[A-Z]:\\/ — the first backslash escapes the one after it, so the expression searches for a single literal backslash.

If using the RegExp constructor with a string literal, remember that the backslash is an escape in string literals, so to use it in the regular expression, you need to escape it at the string literal level. /a\*b/ and new RegExp("a\\*b") create the same expression, which searches for "a" followed by a literal "*" followed by "b".
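
The escaping rules above can be verified with a short sketch (variable names and test strings are assumptions):

```javascript
const fromLiteral = /a\*b/;
const fromString = new RegExp("a\\*b"); // backslash doubled at the string-literal level

console.log(fromLiteral.source === fromString.source); // true
console.log(fromLiteral.test("a*b")); // true
console.log(fromLiteral.test("aab")); // false, "*" is literal here

console.log(/\/example\/[a-z]+/i.test("/example/Docs")); // true
console.log(/[A-Z]:\\/.test("C:\\")); // true
```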

If escape strings are not already part of your pattern you can add them using String.prototype.replace() :
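
The code that originally followed this sentence is missing from this copy. A commonly used helper along these lines is sketched below; the name escapeRegExp, the exact character list, and the sample input are assumptions.

```javascript
function escapeRegExp(text) {
  // "$&" in the replacement stands for the whole matched substring.
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const userInput = "1+1=2 (really?)"; // made-up input
const safe = new RegExp(escapeRegExp(userInput));

console.log(safe.test("is 1+1=2 (really?)")); // true
console.log(escapeRegExp("a*b")); // a\*b
```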

The "g" after the regular expression is an option or flag that performs a global search, looking in the whole string and returning all matches. It is explained in detail below in Advanced Searching With Flags .

Why isn't this built into JavaScript? There is a proposal to add such a function to RegExp.

Using parentheses

Parentheses around any part of the regular expression pattern causes that part of the matched substring to be remembered. Once remembered, the substring can be recalled for other use. See Groups and backreferences for more details.

Using regular expressions in JavaScript

Regular expressions are used with the RegExp methods test() and exec() and with the String methods match() , matchAll() , replace() , replaceAll() , search() , and split() .

When you want to know whether a pattern is found in a string, use the test() or search() methods; for more information (but slower execution) use the exec() or match() methods. If you use exec() or match() and if the match succeeds, these methods return an array and update properties of the associated regular expression object and also of the predefined regular expression object, RegExp . If the match fails, the exec() method returns null (which coerces to false ).

In the following example, the script uses the exec() method to find a match in a string.
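
The example script is missing from this copy. A minimal sketch, assuming the pattern /d(b+)d/g that is mentioned later in this section and a made-up input string:

```javascript
const myRe = /d(b+)d/g;
const myArray = myRe.exec("cdbbdbsbz"); // made-up input string

console.log(myArray[0]);     // "dbbd", the full match
console.log(myArray[1]);     // "bb", the captured group
console.log(myArray.index);  // 1
console.log(myRe.lastIndex); // 5
```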

If you do not need to access the properties of the regular expression, an alternative way of creating myArray is with this script:

(See Using the global search flag with exec() for further info about the different behaviors.)

If you want to construct the regular expression from a string, yet another alternative is this script:

With these scripts, the match succeeds and returns the array and updates the properties shown in the following table.

As shown in the second form of this example, you can use a regular expression created with an object initializer without assigning it to a variable. If you do, however, every occurrence is a new regular expression. For this reason, if you use this form without assigning it to a variable, you cannot subsequently access the properties of that regular expression. For example, assume you have this script:

However, if you have this script:

The occurrences of /d(b+)d/g in the two statements are different regular expression objects and hence have different values for their lastIndex property. If you need to access the properties of a regular expression created with an object initializer, you should first assign it to a variable.
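
The two scripts being contrasted are missing from this copy. The difference can be sketched as follows (the input string and variable names are assumptions):

```javascript
// Stored in a variable: lastIndex reflects the exec() call.
const storedRe = /d(b+)d/g;
storedRe.exec("cdbbdbsbz");
console.log(storedRe.lastIndex); // 5

// Two separate literals: the second one is a new object that never ran exec().
/d(b+)d/g.exec("cdbbdbsbz");
const freshLastIndex = /d(b+)d/g.lastIndex;
console.log(freshLastIndex); // 0
```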

Advanced searching with flags

Regular expressions have optional flags that allow for functionality like global searching and case-insensitive searching. These flags can be used separately or together in any order, and are included as part of the regular expression.

To include a flag with the regular expression, use this syntax:

Note that the flags are an integral part of a regular expression. They cannot be added or removed later.

For example, re = /\w+\s/g creates a regular expression that looks for one or more characters followed by a space, and it looks for this combination throughout the string.

You could replace the line:

and get the same result.
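
The lines being compared are missing from this copy. As a sketch of the equivalence, assuming a made-up input string, the literal /\w+\s/g and the constructor form with a separate flags argument behave the same:

```javascript
const literalRe = /\w+\s/g;
const constructedRe = new RegExp("\\w+\\s", "g"); // flags go in the second argument

const words = "fee fi fo fum"; // made-up input
const literalMatches = words.match(literalRe);
const constructedMatches = words.match(constructedRe);

console.log(literalMatches);     // ["fee ", "fi ", "fo "]
console.log(constructedMatches); // ["fee ", "fi ", "fo "]
console.log(literalRe.flags);    // "g"
```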

The m flag is used to specify that a multiline input string should be treated as multiple lines. If the m flag is used, ^ and $ match at the start or end of any line within the input string instead of the start or end of the entire string.
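
A short sketch of the difference, using a made-up two-line string:

```javascript
const lines = "first line\nsecond line"; // made-up input

const withoutM = lines.match(/^\w+/g);  // ^ matches only at the start of the string
const withM = lines.match(/^\w+/gm);    // ^ also matches after each line break

console.log(withoutM); // ["first"]
console.log(withM);    // ["first", "second"]
```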

Using the global search flag with exec()

RegExp.prototype.exec() method with the g flag returns each match and its position iteratively.

In contrast, String.prototype.match() method returns all matches at once, but without their position.
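
The contrast can be sketched like this, on a made-up string (variable names are assumptions):

```javascript
const phrase = "fee fi fo fum"; // made-up input
const wordSpace = /\w+\s/g;

// exec() with the g flag: one match per call, each with its position.
const withPositions = [];
let step;
while ((step = wordSpace.exec(phrase)) !== null) {
  withPositions.push([step[0], step.index]);
}
console.log(withPositions); // [["fee ", 0], ["fi ", 4], ["fo ", 7]]

// match() with the g flag: all matches at once, no positions.
const allAtOnce = phrase.match(wordSpace);
console.log(allAtOnce); // ["fee ", "fi ", "fo "]
```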

Using unicode regular expressions

The u flag is used to create "unicode" regular expressions; that is, regular expressions which support matching against unicode text. An important feature that's enabled in unicode mode is Unicode property escapes . For example, the following regular expression might be used to match against an arbitrary unicode "word":
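
The expression itself is missing from this copy. One possible sketch uses the Unicode property escape \p{L} (any letter); the exact pattern and the sample text are assumptions.

```javascript
const wordRe = /\p{L}+/gu; // \p{...} requires the u flag
const unicodeWords = "Привет мир, hello".match(wordRe);

console.log(unicodeWords); // ["Привет", "мир", "hello"]
```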

Unicode regular expressions have different execution behavior as well. RegExp.prototype.unicode contains more explanation about this.

Note: Several examples are also available in:

  • The reference pages for exec() , test() , match() , matchAll() , search() , replace() , split()
  • The guide articles: character classes , assertions , groups and backreferences , quantifiers

Using special characters to verify input

In the following example, the user is expected to enter a phone number. When the user presses the "Check" button, the script checks the validity of the number. If the number is valid (matches the character sequence specified by the regular expression), the script shows a message thanking the user and confirming the number. If the number is invalid, the script informs the user that the phone number is not valid.

The regular expression looks for:

  • the beginning of the line of data: ^
  • followed by three numeric characters \d{3} OR | a left parenthesis \( , followed by three digits \d{3} , followed by a close parenthesis \) , in a non-capturing group (?:)
  • followed by one dash, forward slash, or decimal point in a capturing group ()
  • followed by three digits \d{3}
  • followed by the match remembered in the (first) captured group \1
  • followed by four digits \d{4}
  • followed by the end of the line of data: $
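
Putting the listed pieces together gives a pattern like the one below. This is a sketch assembled from the description; the variable names and sample numbers are assumptions.

```javascript
// ^ (ddd or (ddd)) separator ddd same-separator dddd $
const phoneRe = /^(?:\d{3}|\(\d{3}\))([-\/.])\d{3}\1\d{4}$/;

const samples = ["555-123-4567", "(555).123.4567", "555-123.4567", "5551234567"];
const verdicts = samples.map((s) => phoneRe.test(s));

console.log(verdicts); // [true, true, false, false]
```

The third sample fails because the backreference \1 requires the second separator to equal the first.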

An online tool to learn, build, & test Regular Expressions.

An online regex builder/debugger

An online interactive tutorials, Cheat sheet, & Playground.

An online visual regex tester.

Regular Expressions 101

Quick reference
  • A single character of: a, b or c [abc]
  • A character except: a, b or c [^abc]
  • A character in the range: a-z [a-z]
  • A character not in the range: a-z [^a-z]
  • A character in the range: a-z or A-Z [a-zA-Z]
  • Any single character .
  • Alternate - match either a or b a|b
  • Any whitespace character \s
  • Any non-whitespace character \S
  • Any digit \d
  • Any non-digit \D
  • Any word character \w
  • Any non-word character \W
  • Match everything enclosed (?:...)
  • Capture everything enclosed (...)
  • Zero or one of a a?
  • Zero or more of a a*
  • One or more of a a+
  • Exactly 3 of a a{3}
  • 3 or more of a a{3,}
  • Between 3 and 6 of a a{3,6}
  • Start of string ^
  • End of string $
  • A word boundary \b
  • Non-word boundary \B

Unicode® Technical Standard #18

Unicode regular expressions.

This document describes guidelines for how to adapt regular expression engines to use Unicode.

This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the online reporting form [ Feedback ]. Related information that is useful in understanding this document is found in the References . For the latest version of the Unicode Standard, see [ Unicode ]. For a list of current Unicode Technical Reports, see [ Reports ]. For more information about versions of the Unicode Standard, see [ Versions ].

  • 0.1.1 Character Classes
  • 0.1.2 Property Examples
  • 0.2 Conformance
  • 1.1.1 Hex Notation and Normalization
  • 1.2.1 Domain of Properties
  • 1.2.2 Codomain of Properties
  • 1.2.3 Examples of Properties
  • 1.2.4 Property Syntax
  • 1.2.5 General Category Property
  • 1.2.6 Script and Script Extensions Properties
  • 1.2.8 Blocks
  • 1.3 Subtraction and Intersection
  • 1.4 Simple Word Boundaries
  • 1.5 Simple Loose Matches
  • 1.6 Line Boundaries
  • 1.7 Code Points
  • 2.1 Canonical Equivalents
  • 2.2.1 Character Classes with Strings
  • 2.3 Default Word Boundaries
  • 2.4 Default Case Conversion
  • 2.5.1 Individually Named Characters
  • 2.6 Wildcards in Property Values
  • 2.7 Full Properties
  • 2.8 Optional Properties
  • 3 Tailored Support: Level 3 (Retracted)
  • Annex A: Character Blocks
  • Annex B: Sample Collation Grapheme Cluster Code (Retracted)
  • Annex C: Compatibility Properties
  • Annex D: Resolving Character Classes with Strings and Complement
  • Annex E: Notation for Properties of Strings
  • Annex F: Parsing Character Classes
  • Acknowledgments
  • Modifications

0 Introduction

Regular expressions are a powerful tool for using patterns to search and modify text. They are a key component of many programming languages, databases, and spreadsheets. Starting in 1999, this document has supplied guidelines and conformance levels for supporting Unicode in regular expressions. The following issues are involved in supporting Unicode.

  • Unicode is a large character set—regular expression engines that are only adapted to handle small character sets will not scale well.
  • Unicode encompasses a wide variety of languages which can have very different characteristics than English or other western European text.

There are three fundamental levels of Unicode support that can be offered by regular expression engines:

  • Level 1 : Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.) This is a minimal level for useful Unicode support. It does not account for end-user expectations for character support, but does satisfy most low-level programmer requirements. The results of regular expression matching at this level are independent of country or language. At this level, the user of the regular expression engine would need to write more complicated regular expressions to do full Unicode processing.
  • Level 2 : Extended Unicode Support. At this level, the regular expression engine also accounts for extended grapheme clusters (what the end-user generally thinks of as a character), better detection of word boundaries, and canonical equivalence. This is still a default level—independent of country or language—but provides much better support for end-user expectations than the raw level 1, without the regular-expression writer needing to know about some of the complications of Unicode encoding structure.

In particular:

  • Level 1 is the minimally useful level of support for Unicode. All regex implementations dealing with Unicode should be at least at Level 1.
  • Level 2 is recommended for implementations that need to handle additional Unicode features. This level is achievable without too much effort. However, some of the subitems in Level 2 are more important than others: see Level 2 .

One of the most important requirements for a regular expression engine is to document clearly what Unicode features are and are not supported. Even if higher-level support is not currently offered, provision should be made for the syntax to be extended in the future to encompass those features.

Note: The Unicode Standard is constantly evolving: new characters will be added in the future. This means that a regular expression that tests for currency symbols, for example, has different results in Unicode 2.0 than in Unicode 2.1, which added the euro sign currency symbol.

At any level, efficiently handling properties or conditions based on a large character set can take a lot of memory. A common mechanism for reducing the memory requirements—while still maintaining performance—is the two-stage table, discussed in Chapter 5 of The Unicode Standard [ Unicode ]. For example, the Unicode character properties required in RL1.2 Properties can be stored in memory in a two-stage table with only 7 or 8 Kbytes. Accessing those properties only takes a small amount of bit-twiddling and two array accesses.

Note: For ease of reference, the section ordering for this document is intended to be as stable as possible over successive versions. That may lead, in some cases, to the ordering of the sections being less than optimal.

0.1 Notation

In order to describe regular expression syntax, an extended BNF form is used:

The text also uses the following notation for sets in describing the behavior of Character Classes.

The Full Complement of a finite set results in an infinite set. Because that is not useful for regular expressions, the complement operations such as [^...] are interpreted as Code Point Complement.

A Character Class represents a set of characters. When a regex implementation follows Section 2.2.1 Character Classes with Strings the set can include sequences of characters as well. The following syntax for Character Classes is used and extended in successive sections. This syntax is not normative: regular expression implementations may need to use different syntax to be consistent with their current syntax.

The EBNF can be enhanced with other features. For example, to allow ignored spaces for readability, it can add \u{20} to SYNTAX_CHAR, and add SP? around various elements, change ITEM+ to SP? ITEM (SP? ITEM)+, etc. In this document, SP is allowed between any elements in examples, but to simplify the presentation those changes are omitted from the EBNF.

In subsequent sections of this document, additional EBNF lines will be added for additional features. In one case, marked in a comment, one of the above lines will be replaced.

Complementing affects the entire value in square brackets. That is, [^abcm-z] = [^[abcm-z]]. It is defined to be the Code Point Complement = ℙ ∖ A, and consists of the set of all code points that are not in the enclosed character class. Using syntax introduced below, [^A] is equivalent to [\p{any}--[A]] or to an expression with the equivalent literal, [[\u{0}-\u{10FFFF}]--[A]].

See Annex D: Resolving Character Classes with Strings and Complement for details.

For the purpose of regular expressions, in this document the terms “character” and “code point” are used interchangeably. Similarly, the terms “string” and “sequence of code points” are used interchangeably. Typically the code points of interest will be those representing characters. A Character Class is also referred to as the set of all characters specified by that Character Class.

In addition, for readability the simple parentheses are used where in practice a non-capturing group would be used. That is, (ab|c) is written instead of (?:ab|c).

Code points that are syntax characters or whitespace are typically escaped. For more information see [ UAX31 ]. In examples, the syntax "\s" is sometimes used to indicate whitespace. See also Annex C: Compatibility Properties . Also, in many regex implementations, the first position after the opening '[' or '[^' is treated specially, with some syntax chars treated as literals.

Note: This is only a sample syntax for the purposes of examples in this document. Regular expression syntax varies widely: the issues discussed here would need to be adapted to the syntax of the particular implementation. However, it is important to have a concrete syntax to correctly illustrate the different issues. In general, the syntax here is similar to that of Perl Regular Expressions [ Perl ]. In some cases, this gives multiple syntactic constructs that provide for the same functionality.

The following table gives examples of Character Classes:

Where string offsets are used in examples, they are from zero to n (the length of the string), and indicate positions between characters. Thus in "abcde", the substring from 2 to 4 includes the two characters "cd".
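
JavaScript string methods happen to use the same between-characters convention, so the example can be checked with a short sketch (variable names are assumptions):

```javascript
const text = "abcde";
console.log(text.slice(2, 4)); // "cd"

// A regex match reports the same offsets: start index and start + length.
const hit = /cd/.exec(text);
const startOffset = hit.index;
const endOffset = hit.index + hit[0].length;
console.log(startOffset, endOffset); // 2 4
```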

The following additional notation is defined for use here and in other Unicode specifications:

Because any character could occur as a literal in a regular expression, when regular expression syntax is embedded within other syntax it can be difficult to determine where the end of the regex expression is. Common practice is to allow the user to choose a delimiter like '/' in /ab(c)*/. The user can then simply choose a delimiter that is not in the particular regular expression.

All examples of properties being equivalent to certain literal character classes are illustrative. They were generated at a point in time, and are not updated with each release. Thus when an example contains “\p{sc=Hira} = [ぁ-ゖゝ-ゟ𛀁🈀]”, it does not imply that that identity expression would be true for the current version of Unicode.

The following section describes the possible ways that an implementation can claim conformance to this Unicode Technical Standard.

All syntax and API presented in this document is only for the purpose of illustration; there is absolutely no requirement to follow such syntax or API. Regular expression syntax varies widely: the features discussed here would need to be adapted to the syntax of the particular implementation. In general, the syntax in examples is similar to that of Perl Regular Expressions [ Perl ], but it may not be exactly the same. While the API examples generally follow Java style , it is again only for illustration.

  • Level 1: RL1.1 Hex Notation, RL1.2 Properties, RL1.2a Compatibility Properties, RL1.3 Subtraction and Intersection, RL1.4 Simple Word Boundaries, RL1.5 Simple Loose Matches, RL1.6 Line Boundaries, RL1.7 Supplementary Code Points
  • Level 2: RL2.1 Canonical Equivalents, RL2.2 Extended Grapheme Clusters and Character Classes with Strings, RL2.3 Default Word Boundaries, RL2.4 Default Case Conversion, RL2.5 Name Properties, RL2.6 Wildcards in Property Values, RL2.7 Full Properties

For example, an implementation may claim conformance to Level 1, except for Subtraction and Intersection.

A regular expression engine may be operating in the context of a larger system. In that case some of the requirements may be met by the overall system. For example, the requirements of Section 2.1 Canonical Equivalents might be best met by making normalization available as a part of the larger system, and requiring users of the system to normalize strings where desired before supplying them to the regular-expression engine. Such usage is conformant, as long as the situation is clearly documented.

A conformance claim may also include capabilities added by an optional add-on, such as an optional library module, as long as this is clearly documented.

For backwards compatibility, some of the functionality may only be available if some special setting is turned on. None of the conformance requirements require the functionality to be available by default.

1 Basic Unicode Support: Level 1

Regular expression syntax usually allows for an expression to denote a set of single characters, such as [a-z A-Z 0-9] . Because there are a very large number of characters in the Unicode Standard, simple list expressions do not suffice.

1.1 Hex Notation

The character set used by the regular expression writer may not be Unicode, or may not have the ability to input all Unicode code points from a keyboard.

The syntax must use the code point in its hexadecimal representation. For example, syntax such as \uD834\uDD1E or \xF0\x9D\x84\x9E does not meet this requirement for expressing U+1D11E ( 𝄞 ) because "1D11E" does not appear in the syntax. In contrast, syntax such as \U0001D11E, \x{1D11E} or \u{1D11E} does satisfy the requirement for expressing U+1D11E.

A sample notation for listing hex Unicode characters within strings uses "\u" followed by four hex digits or "\u{" followed by any number of hex digits and terminated by "}", with multiple characters indicated by separating the hex digits by spaces. This would provide for the following addition:

Note: \u{3b1 3b3 3b5 3b9} is syntactic sugar: useful for readability and concision, but not a requirement. It can be used anywhere the equivalent individual hex escapes could be; thus [a-\u{3b1 3b3}-ζ] behaves like [a-\u{3b1}\u{3b3}-ζ] == [a-αγ-ζ].
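As a rough illustration of how the multi-code-point form reduces to individual escapes, the following Python sketch rewrites \u{...} into the \UXXXXXXXX escapes that Python's re module already understands. The helper name expand_hex_escapes is hypothetical, and the sketch does not handle a \u{...} that is itself preceded by an escaped backslash.

```python
import re

# Matches \u{...} holding one or more space-separated hex code points.
_MULTI_HEX = re.compile(r"\\u\{\s*([0-9A-Fa-f]+(?:\s+[0-9A-Fa-f]+)*)\s*\}")

def expand_hex_escapes(pattern: str) -> str:
    """Rewrite each \\u{h1 h2 ...} as a run of \\UXXXXXXXX escapes (hypothetical helper)."""
    def repl(match):
        # One 8-digit escape per listed code point, in the original order.
        return "".join("\\U%08X" % int(h, 16) for h in match.group(1).split())
    return _MULTI_HEX.sub(repl, pattern)

expanded = expand_hex_escapes(r"[a-\u{3b1 3b3}-ζ]")
compiled = re.compile(expanded)

print(expanded)
print(bool(compiled.fullmatch("γ")))  # γ is U+03B3, the start of the second range
print(bool(compiled.fullmatch("β")))  # β is U+03B2, which falls between the two ranges
```

The expansion is purely textual, which is what makes the notation sugar: the engine sees the same character class either way.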

The following table gives examples of this hex notation:

More advanced regular expression engines can also offer the ability to use the Unicode character name for readability. See 2.5 Name Properties .

For comparison, the following table shows some additional examples of escape syntax for Unicode code points:

† Following whitespace is consumed. * ICU4C regex + ICU UnicodeSet

The Unicode Standard treats certain sequences of characters as equivalent, such as the following:

Literal text in regular expressions may be normalized (converted to equivalent characters) in transmission, out of the control of the authors of that text. For example, a regular expression may contain a sequence of literal characters 'u' and grave , such as the expression [aeiou◌̀◌́◌̈] (the last three characters being U+0300 ( ◌̀ ) COMBINING GRAVE ACCENT, U+0301 ( ◌́ ) COMBINING ACUTE ACCENT, and U+0308 ( ◌̈ ) COMBINING DIAERESIS). In transmission, the two adjacent characters in Row 1 might be changed to the different expression containing just one character in Row 2, thus changing the meaning of the regular expression. Hex notation can be used to avoid this problem. In the above example, the regular expression should be written as [aeiou\u{300 301 308}] for safety.
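The effect can be reproduced with Python's standard unicodedata module. This is only a sketch of the hazard: the pattern text is normalized explicitly here, whereas in practice the normalization would happen somewhere in transmission. Python spells the escapes as \u0300 rather than \u{300}.

```python
import re
import unicodedata

# Pattern typed with literal combining marks directly after the letter u.
literal = "[aeiou\u0300\u0301\u0308]"
# Same class, but with the marks written as escapes, so the text is pure ASCII.
escaped = r"[aeiou\u0300\u0301\u0308]"

literal_nfc = unicodedata.normalize("NFC", literal)
escaped_nfc = unicodedata.normalize("NFC", escaped)

# NFC fuses u + U+0300 into U+00F9, so the class silently changes.
print(literal_nfc == literal)                        # False
print(bool(re.search(literal_nfc, "\u0300")))        # False: the grave accent is gone
# The escaped form survives normalization untouched.
print(escaped_nfc == escaped)                        # True
print(bool(re.search(escaped_nfc, "\u0300")))        # True
```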

A regular expression engine may also enforce a single, uniform interpretation of regular expressions by always normalizing input text to Normalization Form NFC before interpreting that text. For more information, see UAX #15, Unicode Normalization Forms [ UAX15 ].

1.2 Properties

Because Unicode is a large character set that is regularly extended, a regular expression engine needs to provide for the recognition of whole categories of characters as well as simply literal sets of characters and strings; otherwise the listing of characters becomes impractical, out of date, and error-prone. This is done by providing syntax for sets of characters based on the Unicode character properties, as well as related properties and functions. Examples of such syntax are \p{Script=Greek} and [:Script=Greek:], which both stand for the set of characters that have the Script value of Greek. In addition to the basic syntax, regex engines also need to allow them to be combined with other sets defined by properties or with literal sets of characters and strings. An example is [\p{Script=Greek}--\p{General_Category=Letter}], which stands for the set of characters that have the Script value of Greek and that do not have the General_Category value of Letter.

Many character properties are defined in the Unicode Character Database (UCD), which also provides the official data for mapping Unicode characters (and code points) to property values. See UAX #44, Unicode Character Database [ UAX44 ] and Chapter 4 in The Unicode Standard [ Unicode ]. For use in regular expressions, properties can also be considered to be defined by Unicode definitions and algorithms, and by data files and definitions associated with other Unicode Technical Standards, such as UTS #51, Unicode Emoji . For example, this includes the Basic_Emoji definition from UTS #51. The full list of recommended properties is in Section 2.7, Full Properties .

UAX #44, Unicode Character Database [UAX44] divides character properties into several types: Catalog, Enumeration, Binary, String, Numeric, and Miscellaneous. Those categories are not all precisely defined or immediately relevant to regular expressions. Some are more pertinent to the maintenance of the Unicode Character Database.

For regular expressions, it is more helpful to divide up properties by the treatment of their domain (what they are properties of) and their codomain (the values of the properties). Most properties are properties of Unicode code points; thus their domains are simply the full set of Unicode code points. Typically the important information is for the subset of the code points that are characters; therefore, those properties are often also called properties of characters.

In addition to properties of characters, there are also properties of strings (sequences of characters). A property of strings is more general than a property of characters. In other words, any property of characters is also a property of strings; its domain is, however, limited to strings consisting of a single character.

Data, definitions, and properties defined by the Unicode Standard and other Unicode Technical Standards, which map from strings to values, can thus be specified in this document as defining regular-expression properties.

A complement of a property of strings or a Character Class with strings may not be valid in regular expressions. For more information, see Annex D: Resolving Character Classes with Strings and Complement and Section 2.2.1 Character Classes with Strings .

The values (codomain) of properties of characters (or strings) have the following simple types: Binary, Enumerated, Numeric, Code Point, and String. Properties can also have multivalued types: a Set or List of other types.

The Binary type is a special case of an Enumerated type limited to precisely the two values "True" and "False". In general, a property of Enumerated type has a longer list of defined values. Those defined values are abstractions, but they are identified in the Unicode Character Database with labels known as aliases. Thus, the Script value "Devanagari" may also be identified by the abbreviated alias "Deva"—both refer to the same enumerated value, even though the exact label for that value may differ.

The Code Point type is a special case of a String type where the values are always limited to single-code point strings.

The UCD "Catalog" type is the same as Enumerated (the name differs for historical reasons).

The following tables provide some examples of property values for each domain type.

Examples of Properties of Characters

Note: The Script_Extensions property maps from code points to a set of enumerated Script property values.

Expressions involving Set properties, which have multiple values, are most often tested for containment, not equality. An expression like \p{Script_Extensions=Hira} is interpreted as containment: matching each code point cp such that Script_Extensions( cp ) ⊇ {Hira}. Thus, \p{Script_Extensions=Hira} will match both U+3032 〲 VERTICAL KANA REPEAT WITH VOICED SOUND MARK (with value {Hira Kana}) and U+3041 ぁ HIRAGANA LETTER SMALL A (with value {Hira}). That also allows the natural replacement of the regular expression \p{Script=Hira} by \p{Script_Extensions=Hira} — the latter just adds characters that may be either Hira or some other script. For a more detailed example, see Section 1.2.6 Script and Script Extensions Properties .
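The containment semantics can be sketched in Python as follows. The table holds only the two characters named above, with the values given in the text; a real engine would load the full Script_Extensions data from the UCD. The function names are hypothetical.

```python
# Toy Script_Extensions data for the two characters discussed above.
SCRIPT_EXTENSIONS = {
    0x3032: {"Hira", "Kana"},  # VERTICAL KANA REPEAT WITH VOICED SOUND MARK
    0x3041: {"Hira"},          # HIRAGANA LETTER SMALL A
}

def matches_scx(code_point: int, script: str) -> bool:
    """Containment test: Script_Extensions(cp) is a superset of {script}."""
    return script in SCRIPT_EXTENSIONS.get(code_point, set())

def equals_scx(code_point: int, scripts: set) -> bool:
    """Equality test, shown only for contrast with containment."""
    return SCRIPT_EXTENSIONS.get(code_point, set()) == scripts

print(matches_scx(0x3032, "Hira"))    # True: {Hira Kana} contains Hira
print(matches_scx(0x3041, "Hira"))    # True
print(equals_scx(0x3032, {"Hira"}))   # False: equality would miss U+3032
```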

Expressions involving List properties may be tested for containment, but may have different semantics for the elements based on position. For example, each value of the kMandarin property is a list of up to two String values: the first being preferred for zh-Hans and the second for zh-Hant (where the preference differs).

Examples of Properties of Strings

Note: Properties of strings can always be “narrowed” to just contain code points. For example, [\p{Basic_Emoji} && \p{any}] is the set of characters in Basic_Emoji.

The recommended names (identifiers) for UCD properties and property values are in PropertyAliases.txt and PropertyValueAliases.txt . There are both abbreviated names and longer, more descriptive names. It is strongly recommended that both names be recognized, and that loose matching of property names and values be implemented following the guidelines in Section 5.9 Matching Rules in [ UAX44 ].

Note: It may be a useful implementation technique to load the Unicode tables that support properties and other features on demand, to avoid unnecessary memory overhead for simple regular expressions that do not use those properties.

Where a regular expression is expressed as much as possible in terms of higher-level semantic constructs such as Letter , it makes it practical to work with the different alphabets and languages in Unicode. The following is an example of a syntax addition that permits properties. Following Perl Syntax, the p is lowercase to indicate a positive match, and uppercase to indicate a complemented match.

The following table shows examples of this extended syntax to match properties:

Some properties are binary: they are either true or false for a given code point. In that case, only the property name is required. Others have multiple values, so for uniqueness both the property name and the property value need to be included.

For example, Alphabetic is a binary property, but it is also a value of the enumerated Line_Break property. So \p{Alphabetic} would refer to the binary property, whereas \p{Line Break:Alphabetic} or \p{Line_Break=Alphabetic} would refer to the enumerated Line_Break property.

There are two exceptions to the general rule that expressions involving properties with multiple values should include both the property name and property value. The Script and General_Category properties commonly have their property name omitted. Thus \p{Unassigned} is equivalent to \p{General_Category = Unassigned}, and \p{Greek} is equivalent to \p{Script=Greek}.

In order to meet requirements RL1.2 and RL1.2a , the implementation must satisfy the Unicode definition of the properties for the supported version of The Unicode Standard, rather than other possible definitions. However, the names used by the implementation for these properties may differ from the formal Unicode names for the properties. For example, if a regex engine already has a property called "Alphabetic", for backwards compatibility it may need to use a distinct name, such as "Unicode_Alphabetic", for the corresponding property listed in RL1.2 .

Implementers may add aliases beyond those recognized in the UCD. For example, in the case of the Age property an implementation could match the defined aliases "3.0" and "V3_0" , but also match "3", "3.0.0", "V3.0" , and so on. However, implementers must be aware that such additional aliases may cause problems if they collide with future UCD aliases for different values.

Ignoring an initial "is" in property values is optional. Loose matching rule UAX44-LM3 in [ UAX44 ] specifies that occurrences of an initial prefix of "is" are ignored, so that, for example, "Greek" and "isGreek" are equivalent as property values. Because existing implementations of regular expressions commonly make distinctions based on the presence or absence of "is", this requirement from [ UAX44 ] is dropped.

For more information on properties, see UAX #44, Unicode Character Database [ UAX44 ].

Of the properties in RL1.2 , General_Category and Script have enumerated property values with more than two values; the other properties are binary. An implementation that does not support non-binary enumerated properties can essentially "flatten" the enumerated type. Thus, for example, instead of \p{script=latin} the syntax could be \p{script_latin} .

The most basic overall character property is the General_Category, which is a basic categorization of Unicode characters into: Letters, Punctuation, Symbols, Marks, Numbers, Separators, and Other . These property values each have a single letter abbreviation, which is the uppercase first character except for separators, which use Z. The official data mapping Unicode characters to the General_Category value is in UnicodeData.txt .

Each of these categories has different subcategories. For example, the subcategories for Letter are uppercase , lowercase , titlecase , modifier , and other (in this case, other includes uncased letters such as Chinese). By convention, the subcategory is abbreviated by the category letter (in uppercase), followed by the first character of the subcategory in lowercase. For example, Lu stands for Uppercase Letter .
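Python's standard unicodedata module exposes these two-letter values directly, which allows a small sketch of how a category test such as \p{Lu} or \p{L} could be evaluated. The helper name is hypothetical, and the answers reflect whichever Unicode version the Python build ships with.

```python
import unicodedata

def has_general_category(char: str, wanted: str) -> bool:
    """True if char's General_Category equals wanted ('Lu') or begins with it ('L')."""
    return unicodedata.category(char).startswith(wanted)

# One sample per Letter subcategory: Lu, Ll, Lt, Lm, Lo.
for sample in ["A", "a", "\u01c5", "\u02b0", "\u4e2d"]:
    print("U+%04X" % ord(sample), unicodedata.category(sample))

print(has_general_category("A", "Lu"))       # True
print(has_general_category("\u4e2d", "L"))   # True: an uncased letter is still a Letter
print(has_general_category("1", "L"))        # False: digits are in category N
```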

Note: Because it is recommended that the property syntax be lenient as to spaces, casing, hyphens and underbars, any of the following should be equivalent: \p{Lu} , \p{lu} , \p{uppercase letter} , \p{Uppercase Letter} , \p{Uppercase_Letter} , and \p{uppercaseletter}
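A minimal Python sketch of such lenient matching is shown below. It is a simplification of the loose matching rules in [ UAX44 ]: it folds case and drops spaces, hyphens, and underscores, and it does not strip an initial "is". The alias table is hypothetical and contains a single General_Category value.

```python
import re

def loose_key(name: str) -> str:
    """Fold case and drop whitespace, hyphens, and underscores."""
    return re.sub(r"[\s_-]", "", name).lower()

# Hypothetical alias table, keyed by loose form, for one value only.
GC_ALIASES = {
    loose_key("Lu"): "Lu",
    loose_key("Uppercase_Letter"): "Lu",
}

def resolve_general_category(name: str) -> str:
    """Map any loosely spelled alias to its canonical short value."""
    return GC_ALIASES[loose_key(name)]

spellings = ["Lu", "lu", "uppercase letter", "Uppercase Letter",
             "Uppercase_Letter", "uppercaseletter"]
print({resolve_general_category(s) for s in spellings})  # {'Lu'}
```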

The General_Category property values are listed below. For more information on the meaning of these values, see UAX #44, Unicode Character Database [ UAX44 ].

Starred entries in the table are not part of the enumeration of General_Category values. They are explained below.

A regular-expression mechanism may choose to offer the ability to identify characters on the basis of other Unicode properties besides the General Category. In particular, Unicode characters are also divided into scripts as described in UAX #24, Unicode Script Property [ UAX24 ] (for the data file, see Scripts.txt ). Using a property such as \p{sc=Greek} allows implementations to test whether letters are Greek or not.

Some characters, such as U+30FC ( ー ) KATAKANA-HIRAGANA PROLONGED SOUND MARK, are regularly used with multiple scripts. For such characters the Script_Extensions property (abbreviated as scx ) identifies the set of associated scripts. The following shows some sample characters with their Script and Script_Extensions property values:

The expression \p{sc=Hira} includes those characters whose Script value is Hira, while the expression \p{scx=Hira} includes all the characters whose Script_Extensions value contains Hira. The following table shows the difference:

See Section 0.1.2 Property Examples for information about updates to the contents of a literal set across versions.

The expression \p{scx=Hira} contains not only the characters in \p{script=Hira} , but many other characters such as U+30FC ( ー ), which are either Hiragana or Katakana.

In most cases, script extensions are a superset of the script values ( \p{scx=X} ⊇ \p{sc=X} ). However, in some cases that is not true. For example, the Script property value for U+30FC ( ー ) is Common, but the Script_Extensions value for U+30FC ( ー ) does not contain the script value Common. In other words, \p{scx=Common} ⊉ \p{sc=Common} .

The usage model for the Script and Script_Extensions properties normally requires that people construct somewhat more complex regular expressions, because a great many characters (Common and Inherited) are shared between scripts. Documentation should point users to the description in [ UAX24 ]. The values for Script_Extensions are likely to be extended over time as new information is gathered on the use of characters with different scripts. For more information, see The Script_Extensions Property in UAX #24, Unicode Script Property [ UAX24 ].

As defined in the Unicode Standard, the Age property (in the DerivedAge data file in the UCD) specifies the first version of the standard in which each character was assigned. It does not refer to how long it has been encoded, nor does it indicate the historic status of the character.

In regular expressions, the Age property is used to indicate the characters that were in a particular version of the Unicode Standard. That is, a character has the Age property of that version or less. Thus \p{age=3.0} includes the letter a , which was included in Unicode 1.0. To get characters that are new in a particular version, subtract off the previous version as described in 1.3 Subtraction and Intersection . For example: [\p{age=3.1} -- \p{age=3.0}] .

Unicode blocks have an associated enumerated property, the Block property. However, there are some very significant caveats to the use of Unicode blocks for the identification of characters: see Annex A: Character Blocks . If blocks are used, some of the names can collide with Script names, so they should be distinguished, with syntax such as \p{Greek Block} or \p{Block=Greek} .

As discussed earlier, character properties are essential with a large character set. In addition, there needs to be a way to "subtract" characters from what is already in the list. For example, one may want to include all non-ASCII letters without having to list every character in \p{letter} that is not one of those 52.

The following is an example of a syntax extension to handle set operations:

The symmetric difference of two sets is defined as being the union minus the intersection, that is (A∪B)\(A∩B), or equivalently, the union of the asymmetric differences (A\B)∪(B\A).
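The two formulations can be checked against each other with Python's built-in set type, whose ^ operator is symmetric difference. The sets here are arbitrary small examples.

```python
a = set("abc")
b = set("cde")

# Union minus intersection: (A∪B)\(A∩B)
union_minus_intersection = (a | b) - (a & b)
# Union of the asymmetric differences: (A\B)∪(B\A)
union_of_differences = (a - b) | (b - a)

print(sorted(union_minus_intersection))  # ['a', 'b', 'd', 'e']
print(union_minus_intersection == union_of_differences == (a ^ b))  # True
```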

For discussions of support by various engines, see:

  • https://www.regular-expressions.info/charclassintersect.html
  • https://www.regular-expressions.info/charclasssubtract.html

Either set difference or symmetric difference can be used with union to produce all combinations of sets that can be used in regular expressions. They cannot be replaced by [^...], because it is defined to be Code Point Complement. For example, you cannot express [A--B] as [A&&[^B]]: the following are not equivalent if A contains a string s that is not in B.
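The non-equivalence can be made concrete with a small Python sketch in which a character class is modeled as a set of strings. The modeling is an assumption made for illustration: [^B] is represented by a membership predicate that only ever accepts single code points, which is what Code Point Complement implies.

```python
def difference(a: set, b: set) -> set:
    """[A--B]: everything in A, strings included, that is not in B."""
    return {s for s in a if s not in b}

def in_code_point_complement(s: str, b: set) -> bool:
    """Membership in [^B]: single code points only, and not in B."""
    return len(s) == 1 and s not in b

def intersect_with_complement(a: set, b: set) -> set:
    """[A&&[^B]]: members of A that are also in the code point complement of B."""
    return {s for s in a if in_code_point_complement(s, b)}

A = {"a", "b", "ch"}   # contains the string "ch"
B = {"a"}

print(sorted(difference(A, B)))                 # ['b', 'ch']
print(sorted(intersect_with_complement(A, B)))  # ['b']: the string "ch" is lost
```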

Code point complement can also be expressed using the property \p{any} or the equivalent literal [\u{0}-\u{10FFFF}]. Thus [^A] is equivalent to [\p{any}--A] and to [[\u{0}-\u{10FFFF}]--A].

For clarity, it is common to use doubled symbols, and require a CHARACTER_CLASS on both sides of the OPERATOR, such as [[abc]--[cde]]. Thus [abc--cde] or [abc--[cde]] or [[abc]--cde] would be illegal syntax, and cause a parse error. This also decreases the risk that the meaning of an older regular expression accidentally changes.

Note: There is no exact analog between arithmetic operations and the set operations. The operator || adds items to the current results, the operators && and -- remove items, and the operator ~~ both adds and removes items.

This specification does not require any particular operator precedence scheme. The illustrative syntax puts all operators on the same precedence level, similar to how arithmetic expressions work with + and -, where a + b - c + d - e is the same as ((((a + b) - c) + d) - e). That is, in the absence of brackets, each operator combines the following CHARACTER_CLASS with the current accumulated results. Using the same precedence level also works well in parsing (see Annex F. Parsing Character Classes ).

Binding or precedence may vary by regular expression engine, so as a user it is safest to always disambiguate using brackets to be sure. In particular, precedence may put all operators on the same level, or may take union as binding more closely. For example, where A..F stand for expressions, not characters:

Binding at the same level is used in this specification.
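Same-level binding amounts to a left-to-right fold over the operands. The following Python sketch evaluates an already-parsed sequence of operators and operand sets in that way; the function name and the input representation are hypothetical, and parsing is not attempted.

```python
# Each operator combines the accumulated result with the next operand.
OPERATORS = {
    "||": lambda acc, s: acc | s,   # union
    "&&": lambda acc, s: acc & s,   # intersection
    "--": lambda acc, s: acc - s,   # set difference
    "~~": lambda acc, s: acc ^ s,   # symmetric difference
}

def evaluate(first, rest):
    """Evaluate first op1 s1 op2 s2 ... strictly left to right."""
    result = set(first)
    for op, operand in rest:
        result = OPERATORS[op](result, set(operand))
    return result

# [[abc] || [de] && [cd]] read as (([abc] || [de]) && [cd])
same_level = evaluate("abc", [("||", "de"), ("&&", "cd")])
# For contrast, an engine that bound && more tightly: [abc] || ([de] && [cd])
tighter_intersection = set("abc") | (set("de") & set("cd"))

print(sorted(same_level))            # ['c', 'd']
print(sorted(tighter_intersection))  # ['a', 'b', 'c', 'd']
```

The differing results are the reason explicit brackets are the safest choice when a pattern must work across engines.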

The following table shows various examples of set subtraction:

The boolean expressions can also involve properties of strings or Character Classes with strings . Thus the following matches all code points that neither have a Script value of Greek nor are in Basic_Emoji:

[^[\p{Script=Greek} || \p{Basic_Emoji}]]

For more information, see Annex D: Resolving Character Classes with Strings and Complement and Section 2.2.1 Character Classes with Strings .

Most regular expression engines allow a test for word boundaries (such as by "\b" in Perl). They generally use a very simple mechanism for determining word boundaries: one example of that would be having word boundaries between any pair of characters where one is a <word_character> and the other is not, or at the start and end of a string. This is not adequate for Unicode regular expressions.

Level 2 provides more general support for word boundaries between arbitrary Unicode characters which may override this behavior.

Most regular expression engines offer caseless matching as the only loose matching. If the engine does offer this, then it needs to account for the large range of cased Unicode characters outside of ASCII.

In addition, because of the vagaries of natural language, there are situations where two different Unicode characters have the same uppercase or lowercase. To meet this requirement, implementations must implement these in accordance with the Unicode Standard. For example, the Greek U+03C3 "σ" small sigma, U+03C2 "ς" small final sigma, and U+03A3 "Σ" capital sigma all match.

Some caseless matches may match one character against two: for example, U+00DF "ß" matches the two characters "SS". And case matching may vary by locale. However, because many implementations are not set up to handle this, at Level 1 only simple case matches are necessary. To correctly implement a caseless match, see Chapter 3, Conformance of [ Unicode ]. The data file supporting caseless matching is [ CaseData ].

To meet this requirement, where an implementation also offers case conversions, these must also follow Chapter 3, Conformance of [ Unicode ]. The relevant data files are [ SpecialCasing ] and [ UData ].

Matching case-insensitively is one example of matching under an equivalence relation:

A regular expression R matches under an equivalence relation E whenever for all strings S and T: If S is equivalent to T under E, then R matches S if and only if R matches T.

In the Unicode Standard, the relevant equivalence relation for case-insensitivity is established according to whether two strings case fold to the same value. The case folding can either be simple (a 1:1 mapping of code points) or full (with some 1:n mappings).

  • “ABC” and “Abc” are equivalent under both full and simple case folding.
  • “cliff” (with the “ff” ligature) and “CLIFF” are equivalent under full case folding, but not under simple case folding.
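The two bullets can be checked in Python, whose str.casefold() applies full case folding. The standard library has no direct simple case folding function, so simple_fold below is only an approximation made for this sketch: it keeps a character's folding when that folding is a single code point and otherwise leaves the character unchanged.

```python
def full_case_equal(s: str, t: str) -> bool:
    """Equivalence under full case folding (1:n mappings allowed)."""
    return s.casefold() == t.casefold()

def simple_fold(s: str) -> str:
    """Approximation of simple case folding: keep only 1:1 foldings."""
    return "".join(c.casefold() if len(c.casefold()) == 1 else c for c in s)

def simple_case_equal(s: str, t: str) -> bool:
    return simple_fold(s) == simple_fold(t)

ligature_word = "cli\ufb00"   # "cli" + U+FB00 LATIN SMALL LIGATURE FF

print(full_case_equal("ABC", "Abc"), simple_case_equal("ABC", "Abc"))  # True True
print(full_case_equal(ligature_word, "CLIFF"))     # True: the ligature folds to "ff"
print(simple_case_equal(ligature_word, "CLIFF"))   # False: no 1:1 folding exists
```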

In practice, regex APIs are not set up to match parts of characters. For this reason, full case equivalence is difficult to handle with regular expressions. For more information, see Section 2.1, Canonical Equivalents .

For case-insensitive matching:

  • For example, /Dåb/ matches as if it were expanded into /(?:d|D)(?:å|Å|\u{212B})(?:b|B)/. (The \u{212B} is an angstrom sign, identical in appearance to Å.)
  • Back references are subject to this logical expansion, such as /(?i)(a.c)\1/, where \1 matches what is in the first grouping.
  • For example, [\p{Block=Phonetic_Extensions} [A-E]] is a character class that matches 133 code points (under Unicode 6.0). Its case-closure adds 7 more code points: a-e, Ᵽ, and Ᵹ, for a total of 140 code points.

For condition #2, in both property character classes and explicit character classes, closing under simple case-insensitivity means including characters not in the set. For example:

  • The case-closure of \p{Block=Phonetic_Extensions} includes two characters not in that set, namely Ᵽ and Ᵹ.
  • The case-closure of [A-E] includes five characters not in that set, namely [a-e] .
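A rough Python sketch of case-closure for an explicit character class follows. It is an approximation under a stated assumption: it follows only the forward single-character lower() and upper() mappings, so it can miss characters that merely fold to a member of the set (a complete closure would use the inverse of the case folding data). For [A-E] the result is exact.

```python
def case_closure(chars: set) -> set:
    """Close a set of single characters under 1:1 lower/upper mappings."""
    closed = set(chars)
    pending = list(chars)
    while pending:
        c = pending.pop()
        for variant in (c.lower(), c.upper()):
            # Skip 1:n mappings; simple case matching is 1:1 only.
            if len(variant) == 1 and variant not in closed:
                closed.add(variant)
                pending.append(variant)
    return closed

original = set("ABCDE")
closure = case_closure(original)
print(sorted(closure - original))  # ['a', 'b', 'c', 'd', 'e']
```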

Conformant implementations can choose whether and how to apply condition #2: the only requirement is that they declare what they do. For example, an implementation may:

  • uniformly apply condition #2 to all property and explicit character classes
  • uniformly not apply condition #2 to any property or explicit character classes
  • apply condition #2 only within the scope of a switch
  • apply condition #2 to just specific properties and/or explicit character classes

Most regular expression engines also allow a test for line boundaries: end-of-line or start-of-line. This presumes that lines of text are separated by line (or paragraph) separators.

Formfeed (U+000C) also normally indicates an end-of-line. For more information, see Chapter 3 of [ Unicode ].

These characters should be uniformly handled in determining logical line numbers, start-of-line, end-of-line, and arbitrary-character implementations. Logical line number is useful for compiler error messages and the like. Regular expressions often allow for SOL and EOL patterns, which match certain boundaries. Often there is also a "non-line-separator" arbitrary character pattern that excludes line separator characters.

The behavior of these characters may also differ depending on whether one is in a "multiline" mode or not. For more information, see Anchors and Other "Zero-Width Assertions" in Chapter 3 of [ Friedl ].

A newline sequence is defined to be any of the following:

\u{A} | \u{B} | \u{C} | \u{D} | \u{85} | \u{2028} | \u{2029} | \u{D A}

  • The line number is increased by one for each occurrence of a newline sequence.
  • Note that different implementations may call the first line either line zero or line one.
  • SOL is at the start of a file or string, and depending on matching options, also immediately following any occurrence of a newline sequence.
  • There is no empty line within the sequence \u{D A} , that is, between the first and second character.
  • Note that there may be a separate pattern for "beginning of text" for a multiline mode, one which matches only at the beginning of the first line. For example, in Perl this is \A.
  • EOL is at the end of a file or string, and depending on matching options, also immediately preceding a final occurrence of a newline sequence.
  • EOL matches at the end of the string
  • EOL matches before final newline
  • EOL matches before any newline
  • Where the 'arbitrary character pattern' matches a newline sequence, it must match all of the newline sequences, and \u{D A} (CRLF) should match as if it were a single character. (The recommendation that CRLF match as a single character is, however, not required for conformance to RL1.6.)
  • Note that ^$ (an empty line pattern) should not match the empty string within the sequence \u{D A} , but should match the empty string within the reversed sequence \u{A D} .
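The line-numbering rules above can be sketched in Python as follows. The pattern lists the newline sequences from the definition, with CRLF tried first so that it counts once. The function name and the first_line parameter are hypothetical; the parameter reflects the note that implementations may start at line zero or line one.

```python
import re

# CRLF first, then each single-character newline sequence.
NEWLINE_SEQUENCE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def line_number(text: str, offset: int, first_line: int = 1) -> int:
    """Logical line number of the code point at the given offset."""
    completed = sum(1 for m in NEWLINE_SEQUENCE.finditer(text) if m.end() <= offset)
    return first_line + completed

text = "a\r\nb\nc\u2028d"
print(line_number(text, 0))   # 1: the "a"
print(line_number(text, 3))   # 2: the "b", after a single CRLF
print(line_number(text, 7))   # 4: the "d"
```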

It is strongly recommended that there be a regular expression meta-character, such as "\R", for matching all line ending characters and sequences listed above (for example, in #1). This would correspond to something equivalent to the following expression. That expression is slightly complicated by the need to avoid backup.

(?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])
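Python's re module has no \R, but the expression translates directly once the closing parenthesis of the group is supplied and \u{D A} is spelled \r\n. This is a sketch of the equivalent pattern, not a claim about how any engine implements \R.

```python
import re

# CRLF as a unit, otherwise any single line-ending character,
# with a negative lookahead that refuses to split a CRLF.
LINE_BREAK = re.compile(r"(?:\r\n|(?!\r\n)[\n-\r\x85\u2028\u2029])")

print(LINE_BREAK.findall("a\r\nb\n\rc"))    # ['\r\n', '\n', '\r']
print(LINE_BREAK.split("a\r\nb\u2028c"))    # ['a', 'b', 'c']
```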

Note: For some implementations, there may be a performance impact in recognizing CRLF as a single entity, such as with an arbitrary pattern character ("."). To account for that, an implementation may also satisfy RL1.6 if there is a mechanism available for converting the sequence CRLF to a single line boundary character before regex processing.

For more information on line breaking, see [ UAX14 ].

A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units.

UTF-16 uses pairs of 16-bit code units to express code points above FFFF₁₆, while UTF-8 uses from two to four 8-bit code units to represent code points above 7F₁₆. Surrogate pairs (or their equivalents in other encoding forms) are to be handled internally as single code point values. In particular, [\u{0}-\u{10000}] will match all of the following sequences of code units:

For backwards compatibility, some regex engines allow for switches to reset matching to be by code unit instead of code point. Such usage is discouraged. For example, in order to match 👎 it is far better to write \u{1F44E} rather than \uD83D\uDC4E (using UTF-16) or \xF0\x9F\x91\x8E (using UTF-8).
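Python's str type is a sequence of code points, so it can serve as a sketch of matching by code point while the encoded forms show the code unit counts. Python writes the escape as \U0001F44E rather than \u{1F44E}.

```python
import re

thumbs_down = "\U0001F44E"

# One code point, however many code units an encoding form needs.
print(len(thumbs_down))                            # 1
print(len(thumbs_down.encode("utf-16-le")) // 2)   # 2 UTF-16 code units
print(len(thumbs_down.encode("utf-8")))            # 4 UTF-8 code units

# The pattern names the code point itself, and "." consumes it whole.
print(bool(re.fullmatch(r"\U0001F44E", thumbs_down)))               # True
print(bool(re.fullmatch(r".", thumbs_down)))                        # True
# A range ending at U+10000 includes that supplementary code point.
print(bool(re.fullmatch(r"[\u0000-\U00010000]", "\U00010000")))     # True
```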

Note: It is permissible, but not required, to match an isolated surrogate code point (such as \u{D800}), which may occur in Unicode 16-bit Strings. See Unicode String in the Unicode [ Glossary ].

2 Extended Unicode Support: Level 2

Level 1 support works well in many circumstances. However, it does not handle more complex languages or extensions to the Unicode Standard very well. Particularly important cases are canonical equivalence, word boundaries, extended grapheme cluster boundaries, and loose matches. (For more information about boundary conditions, see UAX #29, Unicode Text Segmentation [ UAX29 ].)

Level 2 support matches user expectations for sequences of Unicode characters much more closely. It is still locale-independent and easily implementable. However, for compatibility with Level 1, it is useful to have some sort of syntax that will turn Level 2 support on and off.

The features comprising Level 2 are not in order of importance. In particular, the most useful and highest priority features in practice are:

  • RL2.3 Default Word Boundaries
  • RL2.5 Name Properties
  • RL2.6 Wildcards in Property Values
  • RL2.7 Full Properties

The equivalence relation for canonical equivalence is established by whether two strings are identical when normalized to NFD.

For most full-featured regular expression engines, it is quite difficult to match under canonical equivalence, which may involve reordering, splitting, or merging of characters. For example, all of the following sequences are canonically equivalent:

  • U+006F ( o ) LATIN SMALL LETTER O
  • U+031B ( ◌̛ ) COMBINING HORN
  • U+0323 ( ◌̣ ) COMBINING DOT BELOW
  • U+01A1 ( ơ ) LATIN SMALL LETTER O WITH HORN
  • U+1ECD ( ọ ) LATIN SMALL LETTER O WITH DOT BELOW
  • U+1EE3 ( ợ ) LATIN SMALL LETTER O WITH HORN AND DOT BELOW

The regular expression pattern /o\u{31B}/ matches the first two characters of A, the first and third characters of B, the first character of C, part of the first character together with the third character of D, and part of the character in E.
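The equivalence of the five sequences can be checked with Python's unicodedata module. The sequences A through E below are an assumption: they are reconstructed from the character list and from the description of what the pattern matches in each, since the sequences themselves are not spelled out here.

```python
import unicodedata

sequences = {
    "A": "o\u031b\u0323",   # o + horn + dot below
    "B": "o\u0323\u031b",   # o + dot below + horn
    "C": "\u01a1\u0323",    # o-horn + dot below
    "D": "\u1ecd\u031b",    # o-dot-below + horn
    "E": "\u1ee3",          # o-horn-dot-below
}

nfd_forms = {label: unicodedata.normalize("NFD", s) for label, s in sequences.items()}

# All five collapse to one NFD string: o, then horn, then dot below.
print(len(set(nfd_forms.values())))                  # 1
print(nfd_forms["E"] == "o\u031b\u0323")             # True
print(len(set(sequences.values())))                  # 5 distinct code point sequences
```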

In practice, regex APIs are not set up to match parts of characters or handle discontiguous selections. There are many other edge cases: a combining mark may come from some part of the pattern far removed from where the base character was, or may not explicitly be in the pattern at all. It is also unclear what /./ should match and how back references should work.

It is feasible, however, to construct patterns that will match against NFD (or NFKD) text. That can be done by:

  • Putting the text to be matched into a defined normalization form (NFD or NFKD).
  • Having the user design the regular expression pattern to match against that defined normalization form. For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.
  • Applying the matching algorithm on a code point by code point basis, as usual.
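The three steps above can be sketched in Python with the standard re and unicodedata modules. The function name search_nfd is hypothetical. Match offsets refer to the normalized text, not to the original input, which a real implementation would need to account for.

```python
import re
import unicodedata

def search_nfd(pattern: str, text: str):
    """Normalize the text to NFD, then match code point by code point.

    The pattern is assumed to be written against NFD already.
    """
    return re.search(pattern, unicodedata.normalize("NFD", text))

pattern = "o\u031b"          # o followed by COMBINING HORN, already in NFD
precomposed = "\u1ee3"       # LATIN SMALL LETTER O WITH HORN AND DOT BELOW

print(re.search(pattern, precomposed) is None)        # True: no match on raw text
print(search_nfd(pattern, precomposed) is not None)   # True: matches after NFD
print(search_nfd(pattern, "\u1ecd\u031b") is not None)  # True: NFD reorders the marks
```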

2.2 Extended Grapheme Clusters and Character Classes with Strings

One or more Unicode characters may make up what the user thinks of as a character. To avoid ambiguity with the computer use of the term character, this is called a grapheme cluster . For example, "G" + acute-accent is a grapheme cluster: it is thought of as a single character by users, yet is actually represented by two Unicode characters. The Unicode Standard defines extended grapheme clusters that treat certain sequences as units, including Hangul syllables and base characters with combining marks. The precise definition is in UAX #29, Unicode Text Segmentation [ UAX29 ]. However, the boundary definitions in CLDR are strongly recommended: they are more comprehensive than those defined in [UAX29] and include Indic extended grapheme clusters such as ksha .

For example, an implementation could interpret \X as matching any extended grapheme cluster, while interpreting "." as matching any single code point. It could interpret \b{g} as a zero-width match against any extended grapheme cluster boundary, and \B{g} as the complement of that.

More generally, it is useful to have zero width boundary detections for each of the different kinds of segment boundaries defined by Unicode ([ UAX29 ] and [ UAX14 ]). For example:

Thus \X is equivalent to .+?\b{g} ; proceed the minimal number of characters (but at least one) to get to the next extended grapheme cluster boundary.

Regular expression engines should also provide some mechanism for easily matching against Character Classes with Strings , because they are more likely to match user expectations for many languages. One mechanism for doing that is to have explicit syntax for strings in Character Classes, as in the following addition to the syntax of Section 0.1.1 Character Classes :

The '|' separator is used to make an expression more readable. Some implementations may choose to drop the \q, although many will choose to retain it for backwards compatibility.

The following table shows examples of use of the \q syntax:

In implementing Character Classes with strings, the expression /[a-m \q{ch|chh|rr|} β-ξ]/ should behave as the alternation /(chh | ch | rr | [a-mβ-ξ] | )/ . Note that such an alternation must have the multi-code point strings ordered as longest-first to work correctly in arbitrary regex engines, because some regex engines try the leftmost matching alternative first. Therefore it does not work to have shorter strings first. The exception is where those shorter strings are not initial substrings of longer strings.
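The longest-first requirement can be demonstrated with Python's `re` module, which tries alternatives from left to right. The helper `class_with_strings` below is a hypothetical sketch: it takes the single-character part as an ordinary bracket expression and the `\q{...}` strings as a list, and emits an alternation with the multi-character strings sorted longest first and the empty string last.

```python
import re


def class_with_strings(char_class: str, strings: list[str]) -> str:
    """Build an alternation equivalent to a character class with strings."""
    # Longest strings first, so "chh" is tried before its prefix "ch".
    multi = sorted({s for s in strings if s}, key=lambda s: (-len(s), s))
    alternatives = [re.escape(s) for s in multi]
    alternatives.append(char_class)
    if "" in strings:
        alternatives.append("")  # the empty string goes last
    return "(?:" + "|".join(alternatives) + ")"


ordered = class_with_strings("[a-mβ-ξ]", ["ch", "chh", "rr", ""])
naive = "(?:ch|chh|rr|[a-mβ-ξ]|)"  # shorter string first: wrong order

ordered_result = re.match(ordered, "chha").group()
naive_result = re.match(naive, "chha").group()
```

With the naive ordering the engine stops at "ch" and never considers "chh", which is the failure the paragraph above describes.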

String literals in character classes are especially useful in combination with a property of strings. String literals can be used to modify the property by removing exceptions. Such exceptions cannot be expressed by other means. The only workaround would be to hard-code the result in an alternation, creating a large expression that loses the automatic updates of properties. For example, the following could not be expressed with alternation, except by replacing the property by hard-coded current contents (that would get out of date):

[\p{RGI_Emoji}--[a-z🧐\q{ch|sch|🇧🇪|🇧🇫|🇧🇬 }]]

If the implementation supports empty alternations, such as (ab|[ac-m]|), then it can also handle empty strings: [\q{ab}[ac-m]\q{}].

Of course, such alternations can be optimized internally for speed and/or memory, such as (ab|[ac-m]|) → ((ab?)|[c-m]|).

Like properties of strings, complemented Character Classes with strings need to be handled specially: see Annex D: Resolving Character Classes with Strings and Complement .

The simple Level 1 support using simple <word_character> classes is only a very rough approximation of user word boundaries. A much better method takes into account more context than just a single pair of letters. A general algorithm can take care of character and word boundaries for most of the world's languages. For more information, see UAX #29, Unicode Text Segmentation [ UAX29 ].

Note: Word boundaries and "soft" line-break boundaries (where one could break in line wrapping) are not generally the same; line breaking has a much more complex set of requirements to meet the typographic requirements of different languages. See UAX #14, Line Breaking Properties [ UAX14 ] for more information. However, soft line breaks are not generally relevant to general regular expression engines.

A fine-grained approach to languages such as Chinese or Thai—languages that do not use spaces—requires information that is beyond the bounds of what a Level 2 algorithm can provide.

Previous versions of RL2.4 included full default Unicode case-insensitive matching. For most full-featured regular expression engines, it is quite difficult to match under code point equivalences that are not 1:1. For more discussion of this, see 1.5 Simple Loose Matches and 2.1 Canonical Equivalents . Thus that part of RL2.4 has been retracted.

Instead, it is recommended that implementations provide for full, default Unicode case conversion, allowing users to provide both patterns and target text that has been fully case folded. That allows for matches such as between U+00DF "ß" and the two characters "SS". Some implementations may choose to have a mixed solution, where they do full case matching on literals such as "Strauß", but simple case folding on character classes such as [ß].
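As an illustration of matching on fully case-folded pattern and text, the sketch below uses Python's `str.casefold()`, which applies full case folding. The helper name `full_casefold_search` is hypothetical, and the pattern is treated as a literal to keep the example small.

```python
import re


def full_casefold_search(literal: str, text: str):
    """Fold both the literal pattern and the target text, then search."""
    return re.search(re.escape(literal.casefold()), text.casefold())


# Simple case-insensitive matching cannot relate "ß" to "SS" (not 1:1).
simple = re.search("Strauß", "RICHARD STRAUSS", re.IGNORECASE)

# After full case folding, both sides contain "strauss".
folded = full_casefold_search("Strauß", "RICHARD STRAUSS")
```

Note that offsets in the folded text may differ from offsets in the original text, because full folding can change string length.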

To correctly implement case conversions, see [ Case ]. For ease of implementation, a complete case folding file is supplied at [ CaseData ]. Full case mappings use the data files [ SpecialCasing ] and [ UData ].

2.5 Name Properties

When using names in regular expressions, the data is supplied in both the Name (na) and Name_Alias properties in the UCD, as described in UAX #44, Unicode Character Database [ UAX44 ], or computed as in the case of CJK Ideographs or Hangul Syllables. Name matching rules follow Matching Rules from [ UAX44#UAX44-LM2 ].

The following provides examples of usage:

Certain code points are not assigned names or name aliases in the standard. With the exception of "reserved", these should be given names based on Code Point Label Tags table in [ UAX44 ], as shown in the following examples:

Characters with the <reserved> tag in the Code Point Label Tags table of [ UAX44 ] are excluded : the syntax \p{reserved-058F} would mean that the code point U+058F is unassigned. While this code point was unassigned in Unicode 6.0, it is assigned in Unicode 6.1 and thus no longer "reserved".

Implementers may add aliases beyond those recognized in the UCD. They must be aware that such additional aliases may cause problems if they collide with future character names or aliases. For example, implementations that used the name "BELL" for U+0007 broke when the new character U+1F514 ( 🔔 ) BELL was introduced.

Previous versions of this specification recommended supporting ISO control names from the Unicode 1.0 name field. These names are now covered by the name aliases (see NameAliases.txt ). In four cases, the name field included both the ISO control name as well as an abbreviation in parentheses.

  • U+000A LINE FEED (LF)
  • U+000C FORM FEED (FF)
  • U+000D CARRIAGE RETURN (CR)
  • U+0085 NEXT LINE (NEL)

These abbreviations were intended as alternate aliases, not as part of the name, but the documentation did not make this sufficiently clear. As a result, some implementations supported the entire field as a name. Those implementations might benefit from continuing to support them for compatibility. Beyond that, their use is not recommended.

The \p{name=...} syntax can be used meaningfully with wildcards (see Section 2.6 Wildcards in Property Values ). For example, in Unicode 6.1, \p{name=/ALIEN/} would include a set of two characters:

  • U+1F47D ( 👽 ) EXTRATERRESTRIAL ALIEN,
  • U+1F47E ( 👾 ) ALIEN MONSTER
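A wildcard over names can be emulated by scanning all code points and testing each character name against a regular expression. This is only a sketch of the idea behind `\p{name=/ALIEN/}`: the function name is hypothetical, only the Name property is consulted (not Name_Alias), and the result depends on the Unicode version shipped with the Python runtime.

```python
import re
import sys
import unicodedata


def chars_with_name_matching(name_regex: str) -> set[str]:
    """Return every character whose Name matches the given regex."""
    pattern = re.compile(name_regex)
    return {
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        # Unnamed code points get "" as their name and never match.
        if pattern.search(unicodedata.name(chr(cp), ""))
    }


aliens = chars_with_name_matching("ALIEN")
```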

The namespace for the \p{name=...} syntax is the namespace for character names plus name aliases.

The following provides syntax for specifying a code point by supplying the precise name. This syntax specifies a single code point, which can thus be used wherever \u{...} can be used. Note that \N and \p{name} may be extended to match sequences if NamedSequences.txt is supported as in Section 2.7 Full Properties .

The \N syntax is related to the syntax \p{name=...}, but there are important distinctions:

  • \N matches a single character, while \p matches a set of characters (when using wildcards).
  • The \p{name=<character_name>} may silently fail, if no character exists with that name. The \N syntax should instead cause a syntax error for an undefined name.
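Python happens to offer a close analogue of this behavior, shown here as an illustration rather than as the syntax defined in this document: `unicodedata.lookup` resolves exactly one character from a name or name alias, and `\N{...}` in an `re` pattern (Python 3.8 or later) is rejected at compile time when the name is undefined.

```python
import re
import unicodedata

# A precise name resolves to exactly one character.
alien = unicodedata.lookup("EXTRATERRESTRIAL ALIEN")

# An undefined name is an error rather than a silent empty match.
try:
    unicodedata.lookup("NO SUCH CHARACTER NAME")
    undefined_rejected = False
except KeyError:
    undefined_rejected = True

# \N{...} inside a pattern matches that single character.
pattern = re.compile(r"\N{ALIEN MONSTER}")

try:
    re.compile(r"\N{NO SUCH CHARACTER NAME}")
    bad_pattern_rejected = False
except re.error:
    bad_pattern_rejected = True
```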

The namespace for the \N{...} syntax is the namespace for character names plus name aliases. Name matching rules follow Matching Rules from [ UAX44#UAX44-LM2 ].

The following table gives examples of the \N syntax:

Instead of a single property value, this feature allows the use of a regular expression to pick out a set of characters (or strings) based on whether the property values match the regular expression. The regular expression must support at least wildcards; other regular expressions features are recommended but optional.

Notes:

  • Where regular expressions are used in matching, the case, spaces, hyphen, and underbar are significant; it is presumed that users will make use of regular-expression features to ignore these if desired.
  • In this syntax, the syntax characters are doubled at the start and end to avoid colliding with actual property values. For example, this prevents problems with properties with string values.
  • In the unusual case that a desired property value happens to start and end with, say, @, the expression can use quoted characters such as \u{40}.
  • As usual, the syntax in this document is illustrative: characters other than '/' and '@' can be chosen if these are not appropriate for the environment used by the regular expression engine.

The @…@ syntax is used to compare property values, and is primarily intended for string properties. It allows for expressions such as [:^toNFKC_Casefold=@toNFKC@:], which expresses the set of all and only those code points CP such that toNFKC_Casefold(CP) = toNFKC(CP) . The value identity can be used in this context. For example, \p{toLowercase≠@identity@} expresses the set of all characters that are changed by the toLowercase mapping.
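The set described by `\p{toLowercase≠@identity@}` can be approximated by comparing each code point with its lowercase mapping. The sketch below uses Python's `str.lower()` as a stand-in for the toLowercase mapping; that substitution is an assumption, and the exact contents depend on the Unicode version of the runtime.

```python
import sys

# Every character that the lowercase mapping changes.
changed_by_lowercase = {
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if chr(cp).lower() != chr(cp)
}
```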

The following table shows examples of the use of wildcards.

The lists in the examples above were extracted on the basis of Unicode 5.0; different Unicode versions may produce different results.

The following table shows some additional samples, illustrating various sets. A click on the link will use the online Unicode utilities on the Unicode website to show the contents of the sets. Note that these online utilities currently use single-letter operations.

The list excludes provisional, contributory, obsolete, and deprecated properties. It also excludes specific properties: Unicode_1_Name, Unicode_Radical_Stroke, and the Unihan properties. The properties shown in the table with a gray background are covered by RL1.2 Properties. For more information on properties, see UAX #44, Unicode Character Database [ UAX44 ].

Property Domains: All listed properties marked with * are properties of strings. All other listed properties are properties of code points. The domain of these properties (strings vs code points) will not change in subsequent versions.

The properties that are not in the UCD provide property metadata in their data file headers that can be used to support property syntax. That information is used to match and validate properties and property values for syntax such as \p{pname=pvalue}, so that they can be used in the same way as UCD properties. These include the Identifier_Status and Identifier_Type , and the Emoji sequence properties.

The Name and Name_Alias properties are used in \p{name=…} and \N{…}. The data in NamedSequences.txt is also used in \N{…}. For more information see Section 2.5, Name Properties . The Script and Script_Extensions properties are used in \p{scx=…}. For more information, see Section 1.2.6, Script and Script Extensions Properties .

To test whether a string is in a normalization format such as NFC requires special code. However, there are "quick-check" properties that can detect whether characters are allowed in a normalization format at all. Those can be used for cases like the following, which removes characters that cannot occur in NFC: [\p{ Alphabetic }--\p{ NFC_Quick_Check =No}]
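Python's `unicodedata` module does not expose the quick-check properties directly, so the sketch below approximates `[\p{Alphabetic}--\p{NFC_Quick_Check=No}]` under two stated assumptions: a character is treated as unable to occur in NFC when normalizing it alone to NFC changes it, and `str.isalpha()` stands in for the Alphabetic property. Both helper names are hypothetical.

```python
import unicodedata


def cannot_occur_in_nfc(ch: str) -> bool:
    """Approximate NFC_Quick_Check=No for a single character."""
    return unicodedata.normalize("NFC", ch) != ch


def alphabetic_nfc_safe(text: str) -> str:
    """Keep alphabetic characters that may occur in NFC text."""
    return "".join(
        ch for ch in text if ch.isalpha() and not cannot_occur_in_nfc(ch)
    )


# U+212B ANGSTROM SIGN normalizes to U+00C5, so it is removed.
filtered = alphabetic_nfc_safe("A\u212B\u00C5")
```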

The Emoji properties can be used to precisely parse text for valid emoji of different kinds, while the Equivalent_Unified_Ideograph can be used to find radicals for unified ideographs (or vice versa): \p{ Equivalent_Unified_Ideograph =⼚} matches [⼚⺁厂].

See also 2.5 Name Properties and 2.6 Wildcards in Property Values .

Implementations may also add other regular expression properties based on Unicode data that are not listed above . Some possible candidates include the following. These are optional, and are not required by any conformance clauses in this document, nor is the example syntax required.

3 Tailored Support: Level 3

This section has been retracted. It last appeared in version 19 .

The Block property from the Unicode Character Database can be a useful property for quickly describing a set of Unicode characters. It assigns a name to segments of the Unicode codepoint space; for example, [\u{370}-\u{3FF}] is the Greek block.

However, block names need to be used with discretion; they are very easy to misuse because they only supply a very coarse view of the Unicode character allocation. For example:

  • Blocks are not at all exclusive. There are many mathematical operators that are not in the Mathematical Operators block; there are many currency symbols not in Currency Symbols, and so on.
  • Blocks may include characters not assigned in the current version of Unicode. This can be both an advantage and disadvantage. Like the General Property, this allows an implementation to handle characters correctly that are not defined at the time the implementation is released. However, it also means that depending on the current properties of assigned characters in a block may fail. For example, all characters in a block may currently be letters, but this may not be true in the future.
  • Writing systems may use characters from multiple blocks: English uses characters from Basic Latin and General Punctuation, Syriac uses characters from both the Syriac and Arabic blocks, various languages use Cyrillic plus a few letters from Latin, and so on.
  • Characters from a single writing system may be split across multiple blocks. See the following table on Writing Systems versus Blocks. Moreover, presentation forms for a number of different scripts may be collected in blocks like Alphabetic Presentation Forms or Halfwidth and Fullwidth Forms.
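The point about unassigned characters can be checked directly. The following sketch walks the code point range given earlier for the Greek block and collects the code points whose General_Category is Cn (unassigned). The result depends on the Unicode version of the Python runtime.

```python
import unicodedata

# The Greek block as a raw code point range: U+0370..U+03FF.
greek_block = range(0x0370, 0x0400)

# Code points in the block that have no assigned character.
unassigned = [
    cp for cp in greek_block if unicodedata.category(chr(cp)) == "Cn"
]
```

A pattern that relies on "everything in the block is a letter" therefore also has to decide what to do with these gaps.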

The following table illustrates the mismatch between writing systems and blocks. These are only examples; this table is not a complete analysis. It also does not include common punctuation used with all of these writing systems.

Writing Systems Versus Blocks

For the above reasons, Script values are generally preferred to Block values. Even there, they should be used in accordance with the guidelines in UAX #24, Unicode Script Property [ UAX24 ].

Annex B: Sample Collation Grapheme Cluster Code

This annex was retracted at the same time that Level 3 was retracted.

The following table shows recommended assignments for compatibility property names, for use in Regular Expressions. The standard recommendation is shown in the column labeled "Standard"; applications should use this definition wherever possible. If populated with a different value, the column labeled "POSIX Compatible" shows modifications to the standard recommendation required to meet the formal requirements of [ POSIX ], and also to maintain (as much as possible) compatibility with the POSIX usage in practice. That modification involves some compromises, because POSIX does not have as fine-grained a set of character properties as in the Unicode Standard, and also has some additional constraints. So, for example, POSIX does not allow more than 20 characters to be categorized as digits, whereas there are many more than 20 digit characters in Unicode.

Compatibility Property Names

The operators and contents of a character class correspond to a set of strings. With full complement, the normal set-theoretic equivalences are maintained:

  • A ∪ B = B ∪ A
  • A ∩ B = B ∩ A
  • A ∪ (B ∪ C) = (A ∪ B) ∪ C
  • A \ (B ∪ C) = (A \ B) \ C
  • ∁(∁(A)) = A
  • A \ B = A ∩ ∁B
  • A \ (B \ C) = (A \ B) ∪ (A ∩ C)
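These equivalences can be spot-checked with ordinary finite sets. In the sketch below the universe `U` is a small, arbitrary set of strings chosen for illustration, so that full complement stays finite; the set contents are assumptions made for the example.

```python
# A small finite universe of strings, so complement is computable.
U = frozenset({"a", "b", "c", "ch", "rr"})
A, B, C = {"a", "ch"}, {"a", "b"}, {"b", "ch", "rr"}


def comp(s):
    """Full complement relative to the finite universe U."""
    return U - s


identities = [
    A | B == B | A,
    A & B == B & A,
    A | (B | C) == (A | B) | C,
    A - (B | C) == (A - B) - C,
    comp(comp(A)) == A,
    A - B == A & comp(B),
    A - (B - C) == (A - B) | (A & C),
]
```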

See https://en.wikipedia.org/wiki/Set_(mathematics)#Basic_operations for more examples. (Note that that page uses one of the alternate notations for complement: A′.)

However, the full complement turns a finite set into an infinite set. This is a problem for regular expressions. If [^a] were defined to be the full complement of [a], then it would include every string except for 'a'. Matching a finite set of strings can be represented in regular expression implementations using alternation, in a straightforward way. Matching an infinite set of strings fails badly: [^a] would match "ab", since the string "ab" is not in [a]. So [^a] cannot be interpreted as full complement, since that would break well-established behavior.

This is not a problem for the other set operations: A ∪ B, A ∩ B, A \ B, A ⊖ B. None of them can produce an infinite set from finite sets. Moreover, the operator for full complement of strings is not necessary for regular expressions: that is, with the operations A ∪ B, A ∩ B, A \ B, A ⊖ B, all combinations of character classes resulting in a finite set of strings can be formed.

For this reason, [^...] remains as code point complement even when other regular expression syntax is extended to allow for strings. The normal set-theoretic equivalences still hold for all operations, except that those involving code point complement are qualified, so:

  • ∁ ℙ (∁ ℙ (A)) = A , if ℙ ⊇ A
  • A \ B = A ∩ ∁ ℙ B , if ℙ ⊇ A

These can be derived by converting ∁ ℙ A to the equivalent ( ℙ \ A ). For example, ∁ ℙ (∁ ℙ (A)) = ℙ \ (ℙ \ A) = ℙ ∩ A.
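A small worked example shows why the qualification "if ℙ ⊇ A" matters. Here `P` is a stand-in for the set of all code points (limited to lowercase ASCII letters for the example, which is an assumption), and `A` contains the string "ch", so it is not a subset of `P`.

```python
# Stand-in for the set of all single code points.
P = frozenset("abcdefghijklmnopqrstuvwxyz")

A = {"a", "ch"}   # contains a string, so P is not a superset of A
B = {"a"}


def cp_complement(s):
    """Code point complement: P minus s. Strings never appear in it."""
    return P - s


double = cp_complement(cp_complement(A))   # equals P & A, not A
difference = A - B                          # keeps the string "ch"
via_complement = A & cp_complement(B)       # loses the string "ch"
```

The double complement yields ℙ ∩ A rather than A, and intersecting with a code point complement drops the string that true set difference keeps.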

Note: Some implementations may choose to throw exceptions when complement is applied to an expression that contains (or could contain) strings. For those implementations, [^A] would not always be equivalent to [\p{any}--[A]], since the former could throw an exception, while the latter would always resolve to the code point complement.

However, the full complement of a Character Class with strings or of a property of strings could be allowed internal to a character class expression as long as the fully resolved version of the outermost expression does not contain an infinite number of strings. If an implementation is to support Full Complement, then the following section describes how this can be done. First is to provide an additional operator for Full Complement:

For example, suppose that C is a Character Class without strings or property of characters, and S is a Character Class with strings or property of strings.

  • [!![!!S]] is allowable
  • [C--S] is allowable
  • [C&&[!!S]] resolves to [C--S] and is thus allowable — it does not contain any strings.
  • [!!C--S] is allowable
  • [!!S--C] is not allowable (on the top level)

A narrowed set of single characters can always be represented by intersecting with the set of single characters, such as [ \p{Basic_Emoji}&& \p{any}] .

The following describes how a boolean expression can be resolved to a Character Class with only characters, a Character Class with strings, or a full-complemented Character Class with only characters. As usual, this is a logical expression of the process; implementations can optimize as long as they get the same results.

When incrementally parsing and building a resolved boolean expression, the process can be analyzed in terms of a series of core operations. In parsing Character Classes, the intermediate objects are logically  enhanced sets  of strings, such as A and B. The enhancement is the addition of a flag to indicate whether the internal set is  full-complemented  or not. The symbol ➕ stands for the flag value =  normal . The symbol ➖  stands for the flag value =  full-complemented . Thus:

➕ means that the internal set is treated normally; the enhanced set is the same as the internal set.

➖ means that the internal set is full-complemented; the logical contents of the enhanced set are every possible string  except those in the internal set .  Where 𝕊 stands for the set of all strings, and {α, β} is the internal set, then the semantics is: (𝕊 ∖ {α, β}), that is, the set of all strings  except for  {α, β}.

When the flag is full-complemented, adding or removing from the enhanced set has the reverse effect on the internal set.

  • adding  β to (𝕊 ∖ {α, β}) is the same as  removing  from the internal set: ⇒ (𝕊 ∖ {α}) 
  • removing  γ from (𝕊 ∖ {α, β}) is the same as  adding  to the internal set: ⇒ (𝕊 ∖ {α, β, γ}) 

For brevity in the table below, ∁ 𝕊 {α, β} is used to express (𝕊 ∖ {α, β}).

While logically the enhanced set can contain an infinite set of strings, internally there is only ever a finite set.

Creation and Unary Operations

  • [expression] and \p{expression} (without full-complementing) create enhanced sets with the internal sets corresponding to the expression, and the flags set to ➕.
  • [!!expression] and \P{expression} (with full-complementing) create enhanced sets with the internal sets corresponding to the expression, and the flags set to ➖.

[!!A] where A is an enhanced set with (set, flag) results in the flag being toggled: ➕ ⇔ ➖

Binary Operations

The table shows how to process binary operations on enhanced sets, with each result being the internal set plus flag. Examples are provided with two overlapping sets: A = {α, β} and B = {β, γ}.

The normal set equivalences hold, such as ∁ 𝕊 (A ∪ B) = ∁ 𝕊 A ∩ ∁ 𝕊 B
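One way to model an enhanced set is a finite internal set plus a boolean flag. The sketch below is an illustration only: the class and method names are hypothetical, and the union rules are derived from the semantics stated above (a complemented set means "all strings except the internal set") rather than copied from the table. Intersection and difference are expressed through union and complement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnhancedSet:
    items: frozenset            # the finite internal set
    complemented: bool = False  # False = normal, True = full-complemented

    def complement(self) -> "EnhancedSet":
        # [!!A] only toggles the flag; the internal set is untouched.
        return EnhancedSet(self.items, not self.complemented)

    def union(self, other: "EnhancedSet") -> "EnhancedSet":
        a, b = self.items, other.items
        if not self.complemented and not other.complemented:
            return EnhancedSet(a | b)
        if not self.complemented:          # normal | complemented
            return EnhancedSet(b - a, True)
        if not other.complemented:         # complemented | normal
            return EnhancedSet(a - b, True)
        return EnhancedSet(a & b, True)    # both complemented

    def intersection(self, other: "EnhancedSet") -> "EnhancedSet":
        # De Morgan: A & B is the complement of (not A | not B).
        return self.complement().union(other.complement()).complement()

    def difference(self, other: "EnhancedSet") -> "EnhancedSet":
        return self.intersection(other.complement())


def normal(*strings: str) -> EnhancedSet:
    return EnhancedSet(frozenset(strings))


all_but_ab = normal("α", "β").complement()
added = all_but_ab.union(normal("β"))         # adding β removes it internally
removed = all_but_ab.difference(normal("γ"))  # removing γ adds it internally
```

Only finite sets are ever stored, even when the logical contents are infinite.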

Properties of strings are properties that can apply to, or match, sequences of two or more characters (in addition to single characters). This is in contrast to the more common case of properties of characters, which are functions of individual code points only. Those properties marked with an asterisk in the Full Properties table are properties of strings. See, for example, Basic_Emoji.

The preferred notation for properties of strings is \p{Property_Name} , the same as for the traditional properties of characters. For regular expressions, properties of strings may appear both within and outside of character class expressions. As described in Annex D , some character class expressions are invalid when they contain properties of strings. Detection of such invalid expressions should happen early, when the regular expression is first compiled or processed.

Implementations that are constrained in that they do not support strings in character classes may use \m{Property_Name} as an alternate notation for properties of strings appearing outside of character class expressions. However:

  • \m should also accept ordinary properties of characters. If a property that applies to strings later changes to only apply to characters, a regex with such a \m{property} should not become invalid. Also, being able to use the same \m syntax outside of a character class for any property would be simpler for a regex writer.
  • Implementations with full support for \p and properties of strings in character class expressions may also optionally support the \m syntax.
  • Implementations that initially adopt \m only for properties of strings, then later add support for strings in character classes, should also add support for \p as alternate syntax for properties of strings.

It is reasonably straightforward to build a parser for Character Classes. While there are many ways to do this, the following describes one example of a logical process for building such a parser. Implementations can use optimized code, such as a DFA ( Deterministic Finite Automaton ) for processing.

The description uses Java syntax to illustrate the code, but of course would be expressed in other programming languages. At the core is a class (here called CharacterClass) that stores the information that is being built, typically a set of strings optimized for compact storage of ranges of characters, such as ICU’s UnicodeSet ( C++ ).

The methods needed are the following:

At the top level a method parseCharacterClass can recognize and branch on ‘\p{’, ‘\P{’, ‘[’, and ‘[^’ . For ‘\p{’ and ‘\P{’, it calls a parseProperty method that parses up to an unescaped ‘}’, and returns a set based on Unicode properties. See RL1.2 Properties , 2.7 Full Properties , RL2.7 Full Properties , and 2.8 Optional Properties .

For ‘[’, and ‘[^’, it calls a parseSequence method that parses out items, stopping when it hits ‘]’. The type of each item can be determined by the initial characters. There is a special check for ‘-’ so that it can be interpreted according to context. The targetSet is set to the first item. All successive items at that level are combined with the targetSet, according to the specified operation (union, intersection, etc.). Note that other binding/precedence options would require somewhat more complicated parsing.

For the Character Class item, a recursive call is made on the parseCharacterClass method. The other initial characters that are branched on are ‘\u{’, ‘\u’, ‘\q{’, ‘\N{’, ‘\’, the operators, and literal and escaped characters.

In the following examples, ➗  is a cursor marking how the parsing progresses. For brevity, intermediate steps that only change state are omitted. The two examples are the same, except that in the right-hand example the second and third character classes are grouped.
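The parsing process described above can be sketched as a small recursive-descent parser. This is written in Python rather than the Java syntax the annex refers to, and it is a deliberately reduced illustration: it handles only nested classes, `[^...]`, literals, ranges, `\q{...}`, and the `--` and `&&` operators, applies operations strictly left to right, does not skip spaces, and uses ASCII as a stand-in for the set of all code points. The function names follow the text loosely; the details are assumptions.

```python
# Stand-in for "all code points", used for code point complement.
ASCII = frozenset(chr(c) for c in range(0x80))


def parse_character_class(src: str, pos: int = 0):
    """Parse a class starting at src[pos] == '['; return (set, next_pos)."""
    assert src[pos] == "["
    pos += 1
    negate = src.startswith("^", pos)
    if negate:
        pos += 1
    target, op = set(), "|"           # union is the default operation
    while src[pos] != "]":
        if src.startswith("--", pos) or src.startswith("&&", pos):
            op = src[pos:pos + 2]
            pos += 2
            continue
        item, pos = parse_item(src, pos)
        if op == "--":
            target -= item
        elif op == "&&":
            target &= item
        else:
            target |= item
        op = "|"
    if negate:
        target = set(ASCII) - target  # code point complement drops strings
    return target, pos + 1


def parse_item(src: str, pos: int):
    """Parse one item: nested class, \\q{...}, range, or single character."""
    if src[pos] == "[":
        return parse_character_class(src, pos)   # recursive call
    if src.startswith("\\q{", pos):
        end = src.index("}", pos)
        return set(src[pos + 3:end].split("|")), end + 1
    if src[pos] == "\\":
        pos += 1                                  # escaped literal
    first = src[pos]
    pos += 1
    # Special check for '-': a range only if not followed by '-' or ']'.
    if src[pos] == "-" and src[pos + 1] not in "-]":
        last = src[pos + 1]
        return {chr(c) for c in range(ord(first), ord(last) + 1)}, pos + 2
    return {first}, pos


with_strings, _ = parse_character_class("[[a-m]--[c-e]\\q{ch|rr}]")
only_z, _ = parse_character_class("[[a-z]&&[^a-y]]")
```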

Mark Davis created the initial version of this annex and maintains the text, with significant contributions from Andy Heninger. Andy also served as co-editor for many years.

Thanks to Julie Allen, Mathias Bynens, Tom Christiansen, David Corbett, Michael D’Errico, Asmus Freytag, Jeffrey Friedl, Norbert Lindenberg, Peter Linsley, Alan Liu, Kent Karlsson, Jarkko Hietaniemi, Ivan Panchenko, Michael Saboff, Gurusamy Sarathy, Markus Scherer, Xueming Shen, Henry Spencer, Kento Tamura, Philippe Verdy, Tom Watson, Ken Whistler, Karl Williamson, and Richard Wordingham for their feedback on the document.

The following summarizes modifications from the previous revision of this document.

Revision 23

The main focus in this release is on handling the complement of properties of strings. The distinction is drawn between code point complement and full complement , followed by explicitly defining the complement operator [^...] to be code point complement , and providing the reasons for doing so in an annex. The important difference between [A--B] and [A&&[^B]] is outlined — setting out the reasons why the latter is insufficient to represent set difference.

For the EBNF in general, and for character classes with strings in particular, examples were added and the text clarified. A new annex provides examples for how character classes can be parsed.

  • Misc editorial changes
  • Changed the ¬ notation for complement to the more standard ∁ math notation, such that ∁A means the complement of A, and the ∁ can be combined with optional subscripts to distinguish the different kinds of complement operations.
  • Changed most "negation" to a form of "complement"
  • Section 0.1.1 Character Classes
  • Section 1.1  Hex Notation
  • Section 1.2.4  Property Syntax
  • Section 1.3  Subtraction and Intersection
  • Section 2.5.1  Individually Named Characters
  • Section 2.6  Wildcards in Property Values
  • Added a table of math symbols used in the document.
  • Explicitly defined the complement operator [^...] to be code point complement , for backwards compatibility.
  • Updated SYNTAX_CHAR to add { and |
  • Adjusted the examples of escape syntax to account for improved capabilities in regex handling in Java, JavaScript, and ICU.
  • Clarified that loose matching of property names and values should be implemented according to the guidelines in Section 5.9 Matching Rules in [ UAX44 ].
  • Explicitly defined \P in terms of [^...] so that it is also code point complement
  • Noted the difference between [A--B] and [A&&[^B]], and why the latter is inadequate.
  • Added information on usage of \q with properties, and how the '|' separator is used to make an expression shorter and more readable.
  • Explicitly documented the domain of listed properties (whether they are properties of strings or code points).
  • Emoji_Keycap_Sequence*
  • RGI_Emoji_Flag_Sequence*
  • Added an introduction to document why [^...] needed to be narrowed to be code point complement.
  • Made support of full complement optional, and retained the section for how to do it if desired.
  • Added a note that some implementations may wish to throw an exception with complements of sets containing strings.
  • Added symmetric difference (A ~~ B) to the table of Binary Operations.
  • Clarified notation and added examples
  • Added new annex with guidance on parsing character classes.

Modifications for previous versions are listed in those respective versions.

Copyright © 2022 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Install Adobe Acrobat Reader | Mac OS

If you're on a Windows computer, see Install Adobe Acrobat Reader | Windows .

Installing Adobe Acrobat Reader is a two-step process: Download the installation package and install Acrobat Reader from the package file. You do not have to remove the older version of Reader before installing Acrobat Reader.

System requirements

Before you install Acrobat Reader on your computer, ensure that your computer meets the minimum system requirements . If you're on macOS Big Sur, read the Big Sur compatibility document to understand the known issues.

You need macOS version 10.13 or later to run Acrobat Reader. For older versions of macOS, you can install an earlier version of Reader. For step-by-step instructions, see  Install an older version of Acrobat Reader on macOS .

Not sure which version of macOS you have? Select the Apple icon in the upper-left corner of your screen, then choose  About This Mac .

Firefox: Download and install Acrobat Reader

Go to the Adobe Acrobat Reader download page, and select  Download Acrobat Reader .

When asked whether to save the .dmg file, select Save File .

If you do not see this dialog box, another window could be blocking it. Try moving any other windows out of the way.

Double-click the .dmg file. (If you don't see the Downloads window, choose Tools > Downloads.)

Double-click Install Adobe Acrobat Reader to start the installation.

When asked if you're sure that you want to open the file, select  Open .

When prompted, enter your macOS user name and password. If you do not remember your password, refer to the Apple document: https://support.apple.com/HT202860

When you view the confirmation message that the installation is complete, select  Finish .

Safari: Download and install Acrobat Reader

Double-click the .dmg file. (If you don't see the Safari Downloads window, select Finder > (User Name) > Downloads.)

Double-click Install Adobe Acrobat Reader  to start the installation.

When prompted, enter your macOS user name and password. If you do not remember your password, refer to the Apple document: https://support.apple.com/HT202860

Chrome: Download and install Acrobat Reader

Go to the  Adobe Acrobat Reader download  page, and select  Download Acrobat Reader .

When the file appears at the bottom of the browser, select the file. (If you don't view the file, choose  Downloads  from the Chrome menu.)

Double-click  Install Adobe Acrobat Reader  to start the installation.

When prompted, enter your macOS user name and password. If you do not remember your password, refer to the Apple document: https://support.apple.com/HT202860 .

When you view the confirmation message that the installation is complete, select  Finish .

Still need help?

To see if other users are experiencing similar download and installation problems, visit the Acrobat Reader user forum . Try posting your problem on the forums for interactive troubleshooting. When posting on forums, include your operating system and product version number.

Get help faster and easier

 alt=

Quick links

Adobe MAX 2024

Adobe MAX The Creativity Conference

Oct 14–16 Miami Beach and online

The Creativity Conference

Legal Notices    |    Online Privacy Policy

Share this page

Language Navigation

Translate a webpage in Safari on Mac

If a webpage can be translated into one of your preferred languages, you can have Safari translate it.

If a language isn’t available

You may be able to make more languages available in the Translate menu by adding the languages in Language & Region settings. After you add a language to your list of preferred languages, if a translation is available to that language, it appears in the Translate menu in Safari.

To add a language, see Change Language & Region settings.

Note: The availability of translations and the number of languages that can be translated may vary by country or region.

You can also select some text in a webpage and translate that. See Translate text.

macOS Sequoia takes productivity and intelligence on Mac to new heights

MacBook Pro shows iPhone Mirroring; Mac shows Highlights in Safari; and another MacBook Pro shows a more immersive gaming experience.

Wirelessly Use iPhone Right from Mac with iPhone Mirroring

With iPhone Mirroring, a user wirelessly uses their iPhone 15 Pro from the desktop of their MacBook Pro.

Big Updates Come to Safari

On a user’s MacBook Pro, the new Highlights feature in Safari is shown.

Gaming Gets Even Better with Highly Anticipated Titles and More

  • Highly anticipated titles : Developers are delivering an amazing host of new titles to Mac. Ubisoft will release Prince of Persia: The Lost Crown and Assassin’s Creed Shadows, and Capcom will offer even more exciting titles from the popular RESIDENT EVIL series, including RESIDENT EVIL 7 biohazard and RESIDENT EVIL 2. The next major expansion of World of Warcraft: The War Within is coming later this year. Also on the way are Frostpunk 2, Palworld, Sniper Elite 4, and RoboCop: Rogue City, all leveraging powerful software technologies like MetalFX Upscaling to accelerate performance and deliver high-quality visuals across the Mac lineup. And Control Ultimate Edition and Wuthering Waves are coming soon, taking advantage of the latest M3 family of chips to deliver breathtaking visuals with ray tracing.
  • A more immersive gaming experience : Personalized Spatial Audio puts players in the middle of the action like never before, while significantly reduced audio latency with AirPods Pro (2nd generation) provides even better responsiveness. Improvements to Game Mode unlock smoother frame rates, and advanced power management features boost performance across the Mac lineup.
  • Game Porting Toolkit 2 : Since the introduction of the Game Porting Toolkit, developers have been able to bring their games to Apple devices faster than ever, and gaming enthusiasts can experience more titles on the Mac. Game Porting Toolkit 2 takes this to the next level with some of the most-requested capabilities from game developers, making it even easier to bring advanced games to Mac, as well as iPhone and iPad.

Window Tiling Is Easier and Faster Than Ever

Video Conferencing Gets More Updates

The new presenter preview is shown on a user’s Mac desktop.

The New Passwords App Keeps Credentials Secure and Organized

The new Passwords app is shown on a user’s MacBook Pro.

Apple Intelligence Ushers in the Next Chapter of AI on Mac

The Apple Intelligence-powered Rewrite tool is shown on a user’s MacBook Pro.

  • Messages has big updates to the ways users express themselves and stay connected, including all-new text effects, emoji and sticker Tapbacks, and the ability to schedule a message to send later.
  • Apple Maps is introducing even more ways to explore the world, including curated hikes and custom walking routes. Beginning this fall, users can browse thousands of hikes across all 63 national parks in the United States, filtered by length, elevation, and route type, and save them to use while offline.
  • Photos now surfaces Collections, which automatically organizes a user’s library by helpful themes, and includes a big update to search, so users can get results quickly.
  • Note taking in Notes is getting smarter, making it easier than ever to take detailed and well-written notes. New audio transcription and summarization features with Apple Intelligence enable a device to take notes for the user, letting them stay present in a situation where they need to capture details about what’s happening. And if they need to quickly crunch a number, they can just type in an equation to have it solved automatically in their note body.
  • An updated Calculator app lets users see previous calculations with history, and gives them the ability to see their expressions as they type.
  • Calendar shows events and tasks from Reminders , making it easy to see, edit, or complete tasks throughout the day. An updated Month View makes it easier to see events and reminders for an entire month at a glance.

A user’s MacBook Pro shows the updated Messages experience.

June 10, 2024

PRESS RELEASE

The Mac experience gets better than ever with iPhone Mirroring, big updates to Safari, highly anticipated games, and Apple Intelligence to deliver all-new capabilities

CUPERTINO, CALIFORNIA Apple today previewed macOS Sequoia , the next version of the world’s most advanced desktop operating system, bringing entirely new ways of working and transformative intelligence features to Mac. macOS Sequoia is full of exciting new capabilities, including iPhone Mirroring, which expands Continuity by enabling full access to and control of iPhone directly from macOS. Safari gets another big update with the new Highlights feature for effortless information discovery on webpages while browsing. The new Passwords app makes it even easier to access passwords and organize credentials all in one place. Gaming advances with a more immersive experience, as well as a breadth of new titles, including Assassin’s Creed Shadows, Frostpunk 2, and more.

macOS Sequoia also introduces Apple Intelligence, the personal intelligence system for Mac, iPhone, and iPad that combines the power of generative models with personal context to deliver intelligence that’s incredibly useful and relevant. Built with privacy from the ground up, Apple Intelligence is deeply integrated into macOS Sequoia, iOS 18, and iPadOS 18. It understands and creates language and images, takes action across apps, and draws from personal context, simplifying and accelerating everyday tasks. Taking full advantage of the power of Apple silicon and the Neural Engine, Apple Intelligence will be supported by every Mac with an M-series chip. [1]

“The all-star combination of the power of Apple silicon and the legendary ease of use of macOS have made the Mac more capable than ever. Today, we’re excited to take macOS to new heights with macOS Sequoia, a big release that elevates productivity and intelligence,” said Craig Federighi, Apple’s senior vice president of Software Engineering. “macOS Sequoia ushers in Apple Intelligence, unlocking incredible new features that will be a game changer for working on Mac. And with more ways to help users effortlessly get things done, new Continuity features like iPhone Mirroring, major updates to Safari, and a host of new games, we think Mac users are going to love it.”

macOS Sequoia makes Continuity even more magical with iPhone Mirroring, which allows users to fully access and engage with their iPhone — right from their Mac. A user’s custom wallpaper and icons appear just like on their iPhone, and they can swipe between pages on their Home Screen, or launch and browse any of their favorite apps. The keyboard, trackpad, and mouse on Mac also let a user interact with their iPhone, and audio even comes through. Users can seamlessly drag and drop between iPhone and Mac, and a user’s iPhone remains locked, so nobody else can access or see what the user is doing. It also works great with StandBy, which stays visible, so users can get information at a glance. Additionally, users can review and respond to iPhone notifications directly from their Mac.

Safari, the world’s fastest browser [2], now offers Highlights, an even easier way to discover information on the web, such as directions, summaries, or quick links to learn more about people, music, movies, and TV shows. A redesigned Reader includes even more ways to enjoy articles without distractions, featuring a streamlined view of the article a user is reading, a summary, and a table of contents for longer articles. And when Safari detects a video on the page, Viewer helps users put it front and center, while still giving them full access to system playback controls, including Picture in Picture.

A stellar lineup of games is coming to Mac — including the highly anticipated Assassin’s Creed Shadows, the next installment in Ubisoft’s blockbuster series — alongside new features like Personalized Spatial Audio that make gaming even more immersive.

Users can stay organized with new ways to arrange windows into a layout that works best for them. When a user drags a window to the edge of the screen, macOS Sequoia automatically suggests a tiled position on their desktop. Users can release their window right into place, quickly arrange tiles side by side, or place them in corners to keep even more apps in view. And new keyboard and menu shortcuts help users organize tiles even faster.

The new presenter preview makes it easier to present, letting users see what they’re about to share before they share it, and works with apps like FaceTime and Zoom. Users can also apply beautiful built-in backgrounds, including a variety of color gradients and system wallpapers, or upload their own photos. Background replacements can be applied during a video call in FaceTime or in third-party apps like Webex, and with Apple’s industry-leading segmentation, users will look their best when on a call.

Built on the foundation of Keychain, which was first introduced over 25 years ago, macOS Sequoia brings Passwords, a new app that makes it even easier to access passwords, passkeys, Wi-Fi passwords, and other credentials all in one place. iCloud syncing is backed by secure end-to-end encryption. Passwords works great with Safari, and seamlessly syncs between a user’s Apple devices and Windows with the iCloud for Windows app.

Deeply integrated into macOS Sequoia and built with privacy from the ground up, Apple Intelligence unlocks new ways for users to enhance their writing and communicate more effectively. With brand-new systemwide Writing Tools built into macOS Sequoia, users can rewrite, proofread, and summarize text nearly everywhere they write, including Mail, Notes, Pages, and third-party apps.

New image capabilities make communication and self-expression even more fun. With Image Playground, users can create playful images in seconds, choosing from three styles: Animation, Illustration, or Sketch. Image Playground is easy to use, built right into apps like Messages, and also available in a dedicated app.

Memories in Photos lets users create the stories they want to see just by typing a description. Apple Intelligence will pick out the best photos and videos based on the description, craft a storyline with chapters based on themes identified from the photos, and arrange them into a movie with its own narrative arc. In addition, a new Clean Up tool can identify and remove distracting objects in the background of a photo — without accidentally altering the subject.

With the power of Apple Intelligence, Siri takes a major step forward, becoming even more natural, contextually relevant, and personal. Additionally, users can type to Siri, and switch between text and voice to communicate with Siri in whatever way feels right for the moment — making the Siri experience on Mac incredibly easy and seamless.

With Private Cloud Compute, Apple sets a new standard for privacy in AI, with the ability to flex and scale computational capacity between on-device processing and larger, server-based models that run on dedicated Apple silicon servers. When requests are routed to Private Cloud Compute, data is not stored or made accessible to Apple, and is only used to fulfill the user’s requests, and independent experts can verify this privacy promise.

Additionally, access to ChatGPT is integrated into Siri and systemwide Writing Tools across Apple’s platforms, allowing users to access its expertise — as well as its image- and document-understanding capabilities — without needing to jump between tools.

Additional features in macOS Sequoia include:

Availability

The developer beta of macOS Sequoia is available through the Apple Developer Program at developer.apple.com starting today, and a public beta will be available through the Apple Beta Software Program next month at beta.apple.com . The release will be available as a free software update this fall. Apple Intelligence will be available in beta on iPhone 15 Pro, iPhone 15 Pro Max, and iPad and Mac with M1 and later, with Siri and device language set to U.S. English, as part of iOS 18, iPadOS 18, and macOS Sequoia this fall. For more information, visit apple.com/macos/macos-sequoia-preview  and apple.com/apple-intelligence . Features are subject to change. Some features are not available in all regions, all languages, or on all devices. For more information about availability, visit apple.com .

  1. Users with an eligible iPhone, iPad, or Mac with Siri and device language set to English (U.S.) can sign up this fall to access the Apple Intelligence beta.
  2. Testing was conducted by Apple in May 2023. See apple.com/safari for more information.

A Swift Tour: Explore Swift’s features and design

Learn the essential features and design philosophy of the Swift programming language. We'll explore how to model data, handle errors, use protocols, write concurrent code, and more while building up a Swift package that has a library, an HTTP server, and a command line client. Whether you're just beginning your Swift journey or have been with us from the start, this talk will help you get the most out of the language.

  • 0:00 - Introduction
  • 0:51 - Agenda
  • 1:05 - The example
  • 1:32 - Value types
  • 4:26 - Errors and optionals
  • 9:47 - Code organization
  • 11:58 - Classes
  • 14:06 - Protocols
  • 18:33 - Concurrency
  • 23:13 - Extensibility
  • 26:55 - Wrap up
  • Forum: Programming Languages
  • The Swift Programming Language
  • Tools used: Ubuntu
  • Tools used: Visual Studio Code
  • Tools used: Windows
  • Value and Reference types
  • Wrapping C/C++ Library in Swift

Related Videos

  • Expand on Swift macros
  • Design protocol interfaces in Swift
  • Embrace Swift generics
  • Meet Swift Regex
  • Explore structured concurrency in Swift
  • Write a DSL in Swift using result builders

1:49 - Integer variables

3:04 - User struct

3:05 - User struct error handling

11:01 - SocialGraph package manifest

11:12 - User struct

12:36 - Classes

12:59 - Automatic reference counting

13:26 - Reference cycles

14:20 - Protocols

15:21 - Common capabilities of Collections

15:31 - Collection algorithms

15:45 - Collection algorithms with anonymous parameters

16:13 - Friends of friends algorithm

19:23 - async/await

19:43 - Server

20:20 - Data race example

22:24 - Server with friendsOfFriends route

23:27 - Property wrappers

23:57 - SocialGraph command line client

26:07 - Result builders

IMAGES

  1. Regex for Safari on the App Store

  2. regex

  3. Safari doesn't support Regex lookback · Issue #235 · edge/wallet · GitHub

  4. X to Y: Safari extension replaces text and links as desired

  5. Safari not support Lookbehind Regex · Issue #7127 · vitejs/vite · GitHub

  6. Does Safari support JavaScript RegExp?

VIDEO

  1. Make Safari support mouse gesture

  2. Does safari walk 2024 #nature #safarilife #explore #wildlife #safari #walkingsafari

  3. E253

  4. Does safari know me and @Cheesys_nurserey 

  5. How does Safari auto dark mode work?

  6. Does Safari Hunting Protect Wilderness? #shorts

COMMENTS

  1. Regex lookbehind not supported in Safari

    I came across this regex lookbehind issue with Safari today and (eventually) came up with a solution using split() and replace() / replaceAll(). The following working example (using the same methods) is one approach to getting around the issue that, in June 2022, Safari does not (yet) understand regex lookbehinds.

  2. Does Safari support JavaScript RegExp?

    In a cross-browser test of RegExp, Safari seemed to not support it. The answer is yes, Safari has supported JavaScript RegExp for years. It is straightforward when used in HTML. If you use RegExp in a Do JavaScript within AppleScript, it invokes an Apple Event, and Safari 9.1.3 will block you with the following dialog:

  3. regex101: Modern Safari versions User Agent

    Regular Expressions 101. ... Safari 11.1 on Mac OS X (El Capitan). Safari 12.1 on iOS 12.2. Safari 12.1 on macOS (Mojave). Safari 13 on iOS 13.1. Safari 13 on macOS (Mojave)

  4. Lookbehind in JS regular expressions

    Lookbehind in JS regular expressions. The positive lookbehind ( (?<= )) and negative lookbehind ( (?<! )) zero-width assertions in JavaScript regular expressions can be used to ensure a pattern is preceded by another pattern. "Can I use" provides up-to-date browser support tables for support of front-end web technologies on desktop and mobile ...

  5. Use of regular expressions in macOS search fields

    Does anyone know where I can find the docs describing Regex parsing capabilities of macOS search fields? ... This Apple Support article on how to Narrow your search results on Mac is the most detailed documentation I have ... although their utility for text search is rather questionable. Regular expressions are not mentioned, so I suppose these ...

  6. Regex

    A regular expression.

  7. Regular expressions

    A regular expression (regex for short) allows developers to match strings against a pattern, extract submatch information, or simply test if the string conforms to that pattern. Regular expressions are used in many programming languages, and JavaScript's syntax is inspired by Perl.

  8. Anyway to do regular expression search/find on safari? : r/Safari

    Apparently the new macOS update will allow developers to port Chrome extensions to Safari, but the transition may take a solid few years. So if anyone knows any possibilities to get regex on Safari I'd love to hear it out, it's truly life-changing if you're comfortable with regex and do frequent searches. There is a Safari extension ...

  9. Regex not compatible with safari, need help to convert

    regex101: build, test, and debug regex. Regular expression tester with syntax highlighting, explanation, cheat sheet for PHP/PCRE, Python, GO, JavaScript, Java. Features a regex quiz & library. It would be very helpful if you showed what you were trying to match. You can't have lookbehind if you need to support Safari, but you can definitely ...

  10. Workaround for Lookbehind Regex in Safari

    Lookbehind assertions are very powerful, but they are not supported in all browsers. Safari versions before 16.4 don't support them. Let's run the same example in Safari:

  11. News from WWDC24: WebKit in Safari 18 beta

    WebKit for Safari 18 beta adds support for subresource integrity in imported module scripts, which gives cryptographic assurances about the integrity of contents of externally-hosted module scripts. WebKit for Safari 18 beta adds support for the bytes() method to the Request, Response, Blob, and PushMessageData objects.

  12. Safari Support with XRegExp 0.2.2

    When I released XRegExp 0.2 several days ago, I hadn't yet tested in Safari or Swift. When I remembered to do this shortly afterwards, I found that both of those WebKit-based browsers didn't like it and often crashed when trying to use it! This was obviously a Very Bad Thing, but due to major time…

  13. Regular expressions

    Regular expressions are patterns used to match character combinations in strings. In JavaScript, regular expressions are also objects. These patterns are used with the exec() and test() methods of RegExp, and with the match(), matchAll(), replace(), replaceAll(), search(), and split() methods of String. This chapter describes JavaScript regular expressions. It provides a brief overview of each ...

  14. regex101: Safari-extension

    Regular expression tester with syntax highlighting, explanation, cheat sheet for PHP/PCRE, Python, GO, JavaScript, Java, C#/.NET, Rust.

  15. 174931

    As somebody who uses lookbehind in regex for a lot of my code, I would like to show my strong support for the resolution of this issue. Considering that all modern browsers -- save for Safari/WebKit -- support regex lookbehind, as a web developer, I find it frustrating that WebKit is the only modern browser engine which does not support this.

  16. UTS #18: Unicode Regular Expressions

    1 Basic Unicode Support: Level 1. Regular expression syntax usually allows for an expression to denote a set of single characters, such as [a-z A-Z 0-9]. Because there are a very large number of characters in the Unicode Standard, simple list expressions do not suffice.

  17. Regex is working on Chrome but not in Safari

    The following regex is working just fine on Chrome, but it breaks in Safari with the following error: SyntaxError: Invalid regular expression: invalid group specifier name. Regex: /^[a-zA-Z0-9.!#$%...

  18. Safari

    Find out how to download, update and manage your Safari settings with official Apple support resources and tips.

  19. Install Adobe Acrobat Reader on Mac OS

    Safari: Download and install Acrobat Reader. Go to the Adobe Acrobat Reader download page, and select Download Acrobat Reader. Double-click the .dmg file. (If you don't view the Safari Downloads window, select Finder > (User Name) > Downloads .) Double-click Install Adobe Acrobat Reader to start the installation.

  20. Translate a webpage in Safari on Mac

    In the Safari app on your Mac, go to the webpage you want to translate. If the webpage can be translated, the Smart Search field displays the Translate button . Click the Translate button , then choose a language. If you think the translation might need improvement, click the Translate button , then choose Report Translation Issue.

  21. macOS Sequoia takes productivity and intelligence on Mac to new ...

    Apple today previewed macOS Sequoia, bringing entirely new ways of working and personal intelligence to the Mac.

  22. javascript

    The regex I used works in Chrome, but not in Safari, and after doing a Google search I realized that Safari doesn't yet support lookbehinds. Is there another way to write the following regex so that it does the above, but works in Safari too?

  23. A Swift Tour: Explore Swift's features and design

    A Swift Tour: Explore Swift's features and design. Learn the essential features and design philosophy of the Swift programming language. We'll explore how to model data, handle errors, use protocols, write concurrent code, and more while building up a Swift package that has a library, an HTTP server, and a command line client.

  24. Regex Not compatible on Safari

    I have a regular expression to pick the number. For ?page=2& this will return 2. My Expression is working fine for Chrome and Mozilla but not working on Safari.

  25. Apple WWDC 2024 What to Expect: AI, iPadOS, iOS 18, macOS 15, Siri

    The company will introduce 'Apple Intelligence,' its long-awaited push into modern AI for the iPhone, iPad and Mac.
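
Several of the snippets above report that lookbehind patterns work in Chrome but fail in Safari, and the "Can I use" entry cited above is the place to check which versions are affected. A minimal sketch, assuming older Safari versions still have to be supported: detect lookbehind support at runtime, and prefer a capture group where one does the same job. The function names `supportsLookbehind`, `pageNumber`, and `maskAmounts` are hypothetical and do not come from any of the quoted sources.

```javascript
// Hypothetical helpers for illustration; not taken from the quoted sources.

// Returns true if this JavaScript engine can compile a lookbehind assertion.
// Building the pattern with the RegExp constructor turns the parse failure
// into a catchable SyntaxError instead of breaking the whole script, which is
// what happens when an unsupported regex literal appears in the source.
function supportsLookbehind() {
  try {
    new RegExp("(?<=a)b");
    return true;
  } catch (err) {
    return false;
  }
}

// Lookbehind-free extraction of the page number from a query string such as
// "?page=2&": match the "page=" prefix and capture only the digits.
function pageNumber(query) {
  const match = /[?&]page=(\d+)/.exec(query);
  return match ? Number(match[1]) : null;
}

// Lookbehind-free replacement: capture the prefix and write it back with $1.
// Same intent as replacing digits that are preceded by a dollar sign,
// which would otherwise be written with a (?<=...) assertion.
function maskAmounts(text) {
  return text.replace(/(\$)\d+/g, "$1N");
}

console.log(supportsLookbehind());
console.log(pageNumber("?page=2&"));
console.log(maskAmounts("cost $15 and $200"));
```

The capture-group versions run in every engine regardless of lookbehind support, so the runtime check is only needed when a pattern genuinely cannot be rewritten that way.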