Class Pattern
- All Implemented Interfaces:
Serializable
public final class Pattern extends Object implements Serializable
Patterns are compiled regular expressions. In many cases, convenience methods such as
String.matches, String.replaceAll and
String.split will be preferable, but if you need to do a lot of work
with the same regular expression, it may be more efficient to compile it once and reuse it.
The Pattern class and its companion, Matcher, also offer more functionality
than the small amount exposed by String.
// String convenience methods:
boolean sawFailures = s.matches("Failures: \\d+");
String farewell = s.replaceAll("Hello, (\\S+)", "Goodbye, $1");
String[] fields = s.split(":");
// Direct use of Pattern:
Pattern p = Pattern.compile("Hello, (\\S+)");
Matcher m = p.matcher(inputString);
while (m.find()) { // Find each match in turn; String can't do this.
    String name = m.group(1); // Access a submatch group; String can't do this.
}
Regular expression syntax
Java supports a subset of Perl 5 regular expression syntax. An important gotcha is that Java
has no regular expression literals, and uses plain old string literals instead. This means that
you need an extra level of escaping. For example, the regular expression \s+ has to
be represented as the string "\\s+".
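As a small illustration of the extra level of escaping, the sketch below compiles the regular expression \s+ from the string literal "\\s+". The class name EscapingExample and the helper collapseWhitespace are hypothetical names chosen for this example only.

```java
import java.util.regex.Pattern;

class EscapingExample {
    // The regular expression \s+ is written as the Java string literal "\\s+".
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: replaces every run of whitespace with one space.
    static String collapseWhitespace(String input) {
        return WHITESPACE.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(collapseWhitespace("a  b\t\tc")); // prints "a b c"
    }
}
```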
Escape sequences
| \ | Quote the following metacharacter (so \. matches a literal .). |
| \Q | Quote all following metacharacters until \E. |
| \E | Stop quoting metacharacters (started by \Q). |
| \\ | A literal backslash. |
| \uhhhh | The Unicode character U+hhhh (in hex). |
| \xhh | The Unicode character U+00hh (in hex). |
| \cx | The ASCII control character ^x (so \cH would be ^H, U+0008). |
| \a | The ASCII bell character (U+0007). |
| \e | The ASCII ESC character (U+001b). |
| \f | The ASCII form feed character (U+000c). |
| \n | The ASCII newline character (U+000a). |
| \r | The ASCII carriage return character (U+000d). |
| \t | The ASCII tab character (U+0009). |
Character classes
It's possible to construct arbitrary character classes using set operations:
| [abc] | Any one of a, b, or c. (Enumeration.) |
| [a-c] | Any one of a, b, or c. (Range.) |
| [^abc] | Any character except a, b, or c. (Negation.) |
| [[a-f][0-9]] | Any character in either range. (Union.) |
| [[a-z]&&[jkl]] | Any character in both ranges. (Intersection.) |
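The set operations in the table above can be tried out with a short sketch like the following. The class and method names are hypothetical; only the character class syntax comes from the table.

```java
import java.util.regex.Pattern;

class CharClassSetOps {
    // Union of two ranges: lowercase hexadecimal digits.
    private static final Pattern HEX = Pattern.compile("[[a-f][0-9]]+");
    // Intersection: lowercase letters that are also one of j, k, or l.
    private static final Pattern JKL = Pattern.compile("[[a-z]&&[jkl]]+");
    // Negation: any characters except a, b, or c.
    private static final Pattern NOT_ABC = Pattern.compile("[^abc]+");

    static boolean isLowerHex(String s) { return HEX.matcher(s).matches(); }
    static boolean isOnlyJkl(String s) { return JKL.matcher(s).matches(); }
    static boolean hasNoAbc(String s) { return NOT_ABC.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isLowerHex("c0ffee")); // true
        System.out.println(isOnlyJkl("jazz"));    // false: a and z are outside [jkl]
        System.out.println(hasNoAbc("xyz"));      // true
    }
}
```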
Most of the time, the built-in character classes are more useful:
| \d | Any digit character (see note below). |
| \D | Any non-digit character (see note below). |
| \s | Any whitespace character (see note below). |
| \S | Any non-whitespace character (see note below). |
| \w | Any word character (see note below). |
| \W | Any non-word character (see note below). |
| \p{NAME} | Any character in the class with the given NAME. |
| \P{NAME} | Any character not in the named class. |
Note that these built-in classes don't just cover the traditional ASCII range. For example,
\w is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
For more details see Unicode TR-18,
and bear in mind that the set of characters in each class can vary between Unicode releases.
If you actually want to match only ASCII characters, specify the explicit characters you want;
if you mean 0-9 use [0-9] rather than \d, which would also include
Gurmukhi digits and so forth.
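The advice above can be sketched as follows, using GURMUKHI DIGIT ONE (U+0A67) as the non-ASCII digit. The explicit range [0-9] rejects it, while the Unicode category \p{Nd} accepts it. Whether \d itself accepts it depends on the regular expression implementation, so the sketch deliberately does not test \d. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class AsciiDigitsExample {
    // Explicit ASCII range: only the characters 0 through 9.
    private static final Pattern ASCII_DIGITS = Pattern.compile("[0-9]+");
    // Unicode decimal digits of any script.
    private static final Pattern UNICODE_DIGITS = Pattern.compile("\\p{Nd}+");

    static boolean isAsciiNumber(String s) {
        return ASCII_DIGITS.matcher(s).matches();
    }

    static boolean isUnicodeNumber(String s) {
        return UNICODE_DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String gurmukhiOne = "\u0A67"; // GURMUKHI DIGIT ONE
        System.out.println(isAsciiNumber("2024"));        // true
        System.out.println(isAsciiNumber(gurmukhiOne));   // false
        System.out.println(isUnicodeNumber(gurmukhiOne)); // true
    }
}
```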
There are also a variety of named classes:
- Unicode category names, prefixed by Is. For example \p{IsLu} for all uppercase letters.
- POSIX class names. These are 'Alnum', 'Alpha', 'ASCII', 'Blank', 'Cntrl', 'Digit', 'Graph', 'Lower', 'Print', 'Punct', 'Upper', 'XDigit'.
- Unicode block names, as accepted as input to Character.UnicodeBlock.forName(java.lang.String), prefixed by In. For example \p{InHebrew} for all characters in the Hebrew block.
- Character method names. These are all non-deprecated methods from Character whose name starts with is, but with the is replaced by java. For example, \p{javaLowerCase}.
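A minimal sketch of the four kinds of named class listed above follows. The helper names are hypothetical, and the POSIX example is only exercised with ASCII input, since the exact set of non-ASCII characters in a POSIX class may differ between implementations.

```java
import java.util.regex.Pattern;

class NamedClassesExample {
    // Unicode category name with the Is prefix: uppercase letters.
    static boolean isAllUppercase(String s) {
        return Pattern.matches("\\p{IsLu}+", s);
    }

    // Unicode block name with the In prefix: the Hebrew block.
    static boolean isAllHebrewBlock(String s) {
        return Pattern.matches("\\p{InHebrew}+", s);
    }

    // POSIX class name: alphabetic characters.
    static boolean isAllAlpha(String s) {
        return Pattern.matches("\\p{Alpha}+", s);
    }

    // Character method name: Character.isLowerCase becomes javaLowerCase.
    static boolean isAllJavaLowerCase(String s) {
        return Pattern.matches("\\p{javaLowerCase}+", s);
    }

    public static void main(String[] args) {
        System.out.println(isAllUppercase("ABC"));             // true
        System.out.println(isAllHebrewBlock("\u05D0\u05D1")); // true
        System.out.println(isAllAlpha("abc"));                 // true
        System.out.println(isAllJavaLowerCase("ABC"));         // false
    }
}
```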
Quantifiers
Quantifiers match some number of instances of the preceding regular expression.
| * | Zero or more. |
| ? | Zero or one. |
| + | One or more. |
| {n} | Exactly n. |
| {n,} | At least n. |
| {n,m} | At least n but not more than m. |
Quantifiers are "greedy" by default, meaning that they will match the longest possible input
sequence. There are also non-greedy quantifiers that match the shortest possible input sequence.
They're the same as the greedy ones but with a trailing ?:
| *? | Zero or more (non-greedy). |
| ?? | Zero or one (non-greedy). |
| +? | One or more (non-greedy). |
| {n}? | Exactly n (non-greedy). |
| {n,}? | At least n (non-greedy). |
| {n,m}? | At least n but not more than m (non-greedy). |
Quantifiers allow backtracking by default. There are also possessive quantifiers to prevent
backtracking. They're the same as the greedy ones but with a trailing +:
| *+ | Zero or more (possessive). |
| ?+ | Zero or one (possessive). |
| ++ | One or more (possessive). |
| {n}+ | Exactly n (possessive). |
| {n,}+ | At least n (possessive). |
| {n,m}+ | At least n but not more than m (possessive). |
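The difference between the three families of quantifier is easiest to see on a concrete input. The following sketch uses a hypothetical helper, firstMatch, that returns the text of the first match or null if there is none.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierExample {
    // Hypothetical helper: text of the first match, or null if none is found.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: .+ takes as much as it can, then backs off to the last '>'.
        System.out.println(firstMatch("<.+>", "<a><b>"));  // prints "<a><b>"
        // Non-greedy: .+? stops at the first '>' that allows a match.
        System.out.println(firstMatch("<.+?>", "<a><b>")); // prints "<a>"
        // Greedy with backtracking: a+ gives back one 'a' for the final a.
        System.out.println(firstMatch("a+a", "aaa"));      // prints "aaa"
        // Possessive: a++ keeps every 'a', so the final a can never match.
        System.out.println(firstMatch("a++a", "aaa"));     // prints "null"
    }
}
```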
Zero-width assertions
| ^ | At beginning of line. |
| $ | At end of line. |
| \A | At beginning of input. |
| \b | At word boundary. |
| \B | At non-word boundary. |
| \G | At end of previous match. |
| \z | At end of input. |
| \Z | At end of input, or before newline at end. |
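A short sketch of a few of these assertions follows: \b to match whole words only, and \z versus \Z to show the difference a trailing newline makes. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthExample {
    // Hypothetical helper: counts occurrences of "cat" as a whole word.
    static int countWholeWordCat(String input) {
        Matcher m = Pattern.compile("\\bcat\\b").matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // \z only matches at the very end of the input.
    static boolean isWordUpToStrictEnd(String input) {
        return Pattern.compile("\\A\\w+\\z").matcher(input).find();
    }

    // \Z also matches just before a newline that ends the input.
    static boolean isWordUpToLenientEnd(String input) {
        return Pattern.compile("\\A\\w+\\Z").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(countWholeWordCat("cat concat cat, category")); // 2
        System.out.println(isWordUpToStrictEnd("hello\n"));                // false
        System.out.println(isWordUpToLenientEnd("hello\n"));               // true
    }
}
```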
Look-around assertions
Look-around assertions assert that the subpattern does (positive) or doesn't (negative) match after (look-ahead) or before (look-behind) the current position, without including the matched text in the containing match. The maximum length of possible matches for look-behind patterns must not be unbounded.
| (?=a) | Zero-width positive look-ahead. |
| (?!a) | Zero-width negative look-ahead. |
| (?<=a) | Zero-width positive look-behind. |
| (?<!a) | Zero-width negative look-behind. |
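As an illustration, the sketch below uses a positive look-behind to pick out numbers that directly follow a dollar sign (the dollar sign itself is not part of the match), and a negative look-ahead to find a q that is not followed by a u. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookAroundExample {
    // Positive look-behind: digits that directly follow a dollar sign.
    private static final Pattern DOLLAR_AMOUNT = Pattern.compile("(?<=\\$)\\d+");
    // Negative look-ahead: a q that is not followed by a u.
    private static final Pattern Q_WITHOUT_U = Pattern.compile("q(?!u)");

    static List<String> dollarAmounts(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = DOLLAR_AMOUNT.matcher(input);
        while (m.find()) {
            result.add(m.group()); // The '$' is asserted but not included.
        }
        return result;
    }

    static boolean hasQWithoutU(String input) {
        return Q_WITHOUT_U.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(dollarAmounts("Pay $15 now and $30 later, ref 99")); // [15, 30]
        System.out.println(hasQWithoutU("Iraq")); // true
        System.out.println(hasQWithoutU("quit")); // false
    }
}
```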
Groups
| (a) | A capturing group. |
| (?:a) | A non-capturing group. |
| (?>a) | An independent non-capturing group. (The first match of the subgroup is the only match tried.) |
| \n | The text already matched by capturing group n. |
See Matcher.group(int) for details of how capturing groups are numbered and accessed.
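One common use of a capturing group together with a back-reference is spotting a doubled word. The following is a minimal sketch; the class name and the helper firstRepeatedWord are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceExample {
    // (\w+) captures a word; \1 requires the same text to appear again.
    private static final Pattern REPEATED = Pattern.compile("\\b(\\w+) \\1\\b");

    // Hypothetical helper: the first doubled word, or null if there is none.
    static String firstRepeatedWord(String input) {
        Matcher m = REPEATED.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("it is is a test"));  // prints "is"
        System.out.println(firstRepeatedWord("no repeats here")); // prints "null"
    }
}
```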
Operators
| ab | Expression a followed by expression b. |
| a|b | Either expression a or expression b. |
Flags
| (?dimsux-dimsux:a) | Evaluates the expression a with the given flags enabled/disabled. |
| (?dimsux-dimsux) | Evaluates the rest of the pattern with the given flags enabled/disabled. |
The flags are:
| i | CASE_INSENSITIVE | case insensitive matching |
| d | UNIX_LINES | only accept '\n' as a line terminator |
| m | MULTILINE | allow ^ and $ to match beginning/end of any line |
| s | DOTALL | allow . to match '\n' ("s" for "single line") |
| u | UNICODE_CASE | enable Unicode case folding |
| x | COMMENTS | allow whitespace and comments |
Either set of flags may be empty. For example, (?i-m) would turn on case-insensitivity
and turn off multiline mode, (?i) would just turn on case-insensitivity,
and (?-m) would just turn off multiline mode.
Note that on Android, UNICODE_CASE is always on: case-insensitive matching will
always be Unicode-aware.
There are two other flags not settable via this mechanism: CANON_EQ and
LITERAL. Attempts to use CANON_EQ on Android will throw an exception.
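The sketch below compares three ways of enabling case-insensitivity: an inline flag for the rest of the pattern, the CASE_INSENSITIVE constant passed to compile, and an inline flag scoped to a single subexpression. The class and method names are hypothetical, and only ASCII input is used so that the result does not depend on UNICODE_CASE.

```java
import java.util.regex.Pattern;

class InlineFlagsExample {
    // (?i) applies to the rest of the pattern.
    static boolean matchesInline(String input) {
        return Pattern.compile("(?i)hello").matcher(input).matches();
    }

    // The same effect using the flag constant.
    static boolean matchesWithConstant(String input) {
        return Pattern.compile("hello", Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    // (?i:a) makes only the "a" case-insensitive; the "b" stays case-sensitive.
    static boolean matchesScoped(String input) {
        return Pattern.compile("(?i:a)b").matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesInline("HeLLo"));       // true
        System.out.println(matchesWithConstant("HELLO")); // true
        System.out.println(matchesScoped("Ab"));          // true
        System.out.println(matchesScoped("AB"));          // false
    }
}
```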
Implementation notes
The regular expression implementation used in Android is provided by ICU. The syntax it accepts is mostly a superset of the syntax accepted by other Java language implementations. This means that existing applications will normally work as expected, but in rare cases Android may accept a regular expression that is not accepted by other implementations.
In some cases, Android will recognize that a regular expression is a simple
special case that can be handled more efficiently. This is true of both the convenience methods
in String and the methods in Pattern.
- See Also:
Matcher, Serialized Form
Field Summary
| Modifier and Type | Field | Description |
| static int | CANON_EQ | This constant specifies that a character in a Pattern and a character in the input string only match if they are canonically equivalent. |
| static int | CASE_INSENSITIVE | This constant specifies that a Pattern is matched case-insensitively. |
| static int | COMMENTS | This constant specifies that a Pattern may contain whitespace or comments. |
| static int | DOTALL | This constant specifies that the '.' meta character matches arbitrary characters, including line endings, which is normally not the case. |
| static int | LITERAL | This constant specifies that the whole Pattern is to be taken literally, that is, all meta characters lose their meanings. |
| static int | MULTILINE | This constant specifies that the meta characters '^' and '$' match only the beginning and end of an input line, respectively. |
| static int | UNICODE_CASE | This constant specifies that a Pattern that uses case-insensitive matching will use Unicode case folding. |
| static int | UNIX_LINES | This constant specifies that a pattern matches Unix line endings ('\n') only against the '.', '^', and '$' meta characters. |
Method Summary
| Modifier and Type | Method | Description |
| static Pattern | compile(String pattern) | Equivalent to Pattern.compile(pattern, 0). |
| static Pattern | compile(String regularExpression, int flags) | Returns a compiled form of the given regularExpression, as modified by the given flags. |
| protected void | finalize() | Invoked when the garbage collector has detected that this instance is no longer reachable. |
| int | flags() | Returns the flags supplied to compile. |
| Matcher | matcher(CharSequence input) | Returns a Matcher for this pattern applied to the given input. |
| static boolean | matches(String regularExpression, CharSequence input) | Tests whether the given regularExpression matches the given input. |
| String | pattern() | Returns the regular expression supplied to compile. |
| static String | quote(String string) | Quotes the given string using "\Q" and "\E", so that all meta-characters lose their special meaning. |
| String[] | split(CharSequence input) | Equivalent to split(input, 0). |
| String[] | split(CharSequence input, int limit) | Splits the given input at occurrences of this pattern. |
| String | toString() | Returns a string containing a concise, human-readable description of this object. |
Field Details
UNIX_LINES
public static final int UNIX_LINES
This constant specifies that a pattern matches Unix line endings ('\n') only against the '.', '^', and '$' meta characters. Corresponds to (?d).
- See Also:
Constant Field Values
CASE_INSENSITIVE
public static final int CASE_INSENSITIVE
This constant specifies that a Pattern is matched case-insensitively. That is, the patterns "a+" and "A+" would both match the string "aAaAaA". See UNICODE_CASE. Corresponds to (?i).
- See Also:
Constant Field Values
COMMENTS
public static final int COMMENTS
This constant specifies that a Pattern may contain whitespace or comments. Otherwise comments and whitespace are taken as literal characters. Corresponds to (?x).
- See Also:
Constant Field Values
MULTILINE
public static final int MULTILINE
This constant specifies that the meta characters '^' and '$' match only the beginning and end of an input line, respectively. Normally, they match the beginning and the end of the complete input. Corresponds to (?m).
- See Also:
Constant Field Values
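The effect of MULTILINE can be sketched by counting how many times ^# matches in a three-line input, with and without the flag. The class name and the helper countLeadingHash are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineExample {
    // Hypothetical helper: counts matches of ^# under the given flags.
    static int countLeadingHash(String input, int flags) {
        Matcher m = Pattern.compile("^#", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "#a\nb\n#c";
        // Without the flag, ^ only matches at the start of the whole input.
        System.out.println(countLeadingHash(text, 0));                 // 1
        // With the flag, ^ also matches at the start of each line.
        System.out.println(countLeadingHash(text, Pattern.MULTILINE)); // 2
    }
}
```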
LITERAL
public static final int LITERAL
This constant specifies that the whole Pattern is to be taken literally, that is, all meta characters lose their meanings.
- See Also:
Constant Field Values
DOTALL
public static final int DOTALL
This constant specifies that the '.' meta character matches arbitrary characters, including line endings, which is normally not the case. Corresponds to (?s).
- See Also:
Constant Field Values
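A minimal sketch of DOTALL: the pattern a.b is matched against an input whose middle character is a newline, first with no flags and then with DOTALL. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class DotAllExample {
    // By default, '.' does not match a line ending.
    static boolean matchesDefault(String input) {
        return Pattern.compile("a.b").matcher(input).matches();
    }

    // With DOTALL, '.' matches any character, including '\n'.
    static boolean matchesDotAll(String input) {
        return Pattern.compile("a.b", Pattern.DOTALL).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesDefault("a\nb")); // false
        System.out.println(matchesDotAll("a\nb"));  // true
    }
}
```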
UNICODE_CASE
public static final int UNICODE_CASE
This constant specifies that a Pattern that uses case-insensitive matching will use Unicode case folding. On Android, UNICODE_CASE is always on: case-insensitive matching will always be Unicode-aware. If your code is intended to be portable and uses case-insensitive matching on non-ASCII characters, you should use this flag. Corresponds to (?u).
- See Also:
Constant Field Values
CANON_EQ
public static final int CANON_EQ
This constant specifies that a character in a Pattern and a character in the input string only match if they are canonically equivalent. It is (currently) not supported in Android.
- See Also:
Constant Field Values
Method Details
matcher
Returns a Matcher for this pattern applied to the given input. The Matcher can be used to match the Pattern against the whole input, find occurrences of the Pattern in the input, or replace parts of the input.
split
Splits the given input at occurrences of this pattern.
If this pattern does not occur in the input, the result is an array containing the input (converted from a CharSequence to a String).
Otherwise, the limit parameter controls the contents of the returned array as described below.
- Parameters:
limit - Determines the maximum number of entries in the resulting array, and the treatment of trailing empty strings.
- For n > 0, the resulting array contains at most n entries. If this is fewer than the number of matches, the final entry will contain all remaining input.
- For n < 0, the length of the resulting array is exactly the number of occurrences of the Pattern plus one for the text after the final separator. All entries are included.
- For n == 0, the result is as for n < 0, except trailing empty strings will not be returned. (Note that the case where the input is itself an empty string is special, as described above, and the limit parameter does not apply there.)
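The three cases for limit can be compared on a single input with trailing separators. This is a small sketch; the class name and the helper splitOnComma are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitExample {
    private static final Pattern COMMA = Pattern.compile(",");

    // Hypothetical helper: splits on commas with the given limit.
    static String[] splitOnComma(String input, int limit) {
        return COMMA.split(input, limit);
    }

    public static void main(String[] args) {
        String input = "a,b,,";
        // limit > 0: at most that many entries; the last one holds the rest.
        System.out.println(Arrays.toString(splitOnComma(input, 2)));  // [a, b,,]
        // limit < 0: every entry, including trailing empty strings.
        System.out.println(Arrays.toString(splitOnComma(input, -1))); // [a, b, , ]
        // limit == 0: as for limit < 0, but trailing empty strings are dropped.
        System.out.println(Arrays.toString(splitOnComma(input, 0)));  // [a, b]
    }
}
```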
split
Equivalent to split(input, 0).
pattern
Returns the regular expression supplied to compile.
toString
Description copied from class: Object
Returns a string containing a concise, human-readable description of this object. Subclasses are encouraged to override this method and provide an implementation that takes into account the object's type and data. The default implementation is equivalent to the following expression:
getClass().getName() + '@' + Integer.toHexString(hashCode())
See Writing a useful toString method if you intend implementing your own toString method.
flags
public int flags()
Returns the flags supplied to compile.
compile
Returns a compiled form of the given regularExpression, as modified by the given flags. See the flags overview for more on flags.
- Throws:
PatternSyntaxException - if the regular expression is syntactically incorrect.
- See Also:
CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNIX_LINES
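Because compile throws PatternSyntaxException for a malformed expression, code that accepts regular expressions from outside (user input, configuration) may want to catch it. The following is one possible sketch; the class name and the helper isValidRegex are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CompileCheckExample {
    // Hypothetical helper: reports whether the expression compiles.
    static boolean isValidRegex(String regularExpression) {
        try {
            Pattern.compile(regularExpression);
            return true;
        } catch (PatternSyntaxException e) {
            // The exception message describes the syntax problem.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("(closed)"));  // true
        System.out.println(isValidRegex("(unclosed")); // false
    }
}
```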
compile
Equivalent to Pattern.compile(pattern, 0).
matches
Tests whether the given regularExpression matches the given input. Equivalent to Pattern.compile(regularExpression).matcher(input).matches(). If the same regular expression is to be used for multiple operations, it may be more efficient to reuse a compiled Pattern.
- See Also:
compile(java.lang.String, int), Matcher.matches()
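The sketch below contrasts the one-shot static matches with a compiled Pattern that is reused, and shows that matches tests the whole input while Matcher.find looks for an occurrence anywhere in it. The class and method names are hypothetical, and only ASCII input is used.

```java
import java.util.regex.Pattern;

class MatchesVsFindExample {
    // Compiled once and reused for every call below.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // One-shot form: compiles the expression on every call.
    static boolean isAllDigitsOneShot(String input) {
        return Pattern.matches("\\d+", input);
    }

    // Reusing the compiled Pattern: the whole input must match.
    static boolean isAllDigits(String input) {
        return DIGITS.matcher(input).matches();
    }

    // find only needs an occurrence somewhere in the input.
    static boolean containsDigits(String input) {
        return DIGITS.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllDigitsOneShot("123")); // true
        System.out.println(isAllDigits("abc123"));     // false
        System.out.println(containsDigits("abc123"));  // true
    }
}
```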
quote
Quotes the given string using "\Q" and "\E", so that all meta-characters lose their special meaning. This method correctly escapes embedded instances of "\Q" or "\E". If the entire result is to be passed verbatim to compile(java.lang.String, int), it's usually clearer to use the LITERAL flag instead.
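The two approaches mentioned above can be compared on the text 1+1, which contains the metacharacter +. In this sketch both the quoted pattern and the LITERAL pattern match only the literal text, whereas the unquoted expression also matches "11". The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class QuoteExample {
    // Treats the text as a literal by quoting it first.
    static boolean matchesQuoted(String literalText, String input) {
        return Pattern.compile(Pattern.quote(literalText)).matcher(input).matches();
    }

    // Treats the text as a literal by using the LITERAL flag.
    static boolean matchesLiteralFlag(String literalText, String input) {
        return Pattern.compile(literalText, Pattern.LITERAL).matcher(input).matches();
    }

    // Treats the text as an ordinary regular expression, for comparison.
    static boolean matchesAsRegex(String regularExpression, String input) {
        return Pattern.matches(regularExpression, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesQuoted("1+1", "1+1"));      // true
        System.out.println(matchesLiteralFlag("1+1", "1+1")); // true
        System.out.println(matchesQuoted("1+1", "11"));       // false
        System.out.println(matchesAsRegex("1+1", "11"));      // true: + is a quantifier here
    }
}
```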
finalize
Description copied from class: Object
Invoked when the garbage collector has detected that this instance is no longer reachable. The default implementation does nothing, but this method can be overridden to free resources.
Note that objects that override finalize are significantly more expensive than objects that don't. Finalizers may be run a long time after the object is no longer reachable, depending on memory pressure, so it's a bad idea to rely on them for cleanup. Note also that finalizers are run on a single VM-wide finalizer thread, so doing blocking work in a finalizer is a bad idea. A finalizer is usually only necessary for a class that has a native peer and needs to call a native method to destroy that peer. Even then, it's better to provide an explicit close method (and implement Closeable), and insist that callers manually dispose of instances. This works well for something like files, but less well for something like a BigInteger where typical calling code would have to deal with lots of temporaries. Unfortunately, code that creates lots of temporaries is the worst kind of code from the point of view of the single finalizer thread.
If you must use finalizers, consider at least providing your own ReferenceQueue and having your own thread process that queue.
Unlike constructors, finalizers are not automatically chained. You are responsible for calling super.finalize() yourself.
Uncaught exceptions thrown by finalizers are ignored and do not terminate the finalizer thread. See Effective Java Item 7, "Avoid finalizers" for more.