public final class UCharacter extends Object implements UCharacterEnums.ECharacterCategory, UCharacterEnums.ECharacterDirection
The UCharacter class provides extensions to the java.lang.Character class. These extensions provide support for more Unicode properties and together with the UTF16 class, provide support for supplementary characters (those with code points above U+FFFF). Each ICU release supports the latest version of Unicode available at that time.
Code points are represented in this API as ints. While a separate primitive datatype for them would be more convenient in Java, ints suffice in the meantime.
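The int representation can be illustrated with a minimal sketch. It uses only java.lang.String and java.lang.Character from the JDK as stand-ins, so it runs without the ICU jar; the class name CodePointDemo and its helper method are hypothetical and are not part of UCharacter.

```java
// Sketch using only JDK classes; UCharacter itself is not referenced here.
class CodePointDemo {
    // Returns the code points of s as ints, joining each surrogate pair into one value.
    static int[] codePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // an int, so values above U+FFFF fit
            out[n++] = cp;
            i += Character.charCount(cp);   // 1 char for BMP, 2 for supplementary
        }
        return out;
    }

    public static void main(String[] args) {
        // "A" followed by U+1D11E, which needs a surrogate pair in UTF-16.
        String s = "A" + new String(Character.toChars(0x1D11E));
        for (int cp : codePoints(s)) {
            System.out.printf("U+%04X supplementary=%b%n",
                    cp, Character.isSupplementaryCodePoint(cp));
        }
    }
}
```

The string in main has three UTF-16 code units but only two code points, which is why a char-based loop is not enough for supplementary characters.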
To use this class, add the jar file icu4j.jar to the class path; it contains the data files that supply the information this class uses.
For example, on Windows:
set CLASSPATH=%CLASSPATH%;$JAR_FILE_PATH/ucharacter.jar
Alternatively, copy the files uprops.dat and unames.icu from the icu4j source subdirectory $ICU4J_SRC/src/com.adobe.agl.impl.data to your class directory $ICU4J_CLASS/com.adobe.agl.impl.data.
Aside from the additions for UTF-16 support, and the updated Unicode properties, the main differences between UCharacter and Character are:
Further detailed differences can be determined from the program com.adobe.agl.dev.test.lang.UCharacterCompare.
In addition to Java compatibility functions, which calculate derived properties, this API provides low-level access to the Unicode Character Database.
Unicode assigns each code point (not just assigned characters) values for many properties. Most of them are simple boolean flags, or constants from a small enumerated list. For some properties, values are strings or other more complex types.
For more information see "About the Unicode Character Database" (http://www.unicode.org/ucd/) and the ICU User Guide chapter on Properties (http://www.icu-project.org/userguide/properties.html).
There are also functions that provide easy migration from C/POSIX functions like isblank(). Their use is generally discouraged because the C/POSIX standards do not define their semantics beyond the ASCII range, which means that different implementations exhibit very different behavior. Instead, Unicode properties should be used directly.
There are also only a few, broad C/POSIX character classes, and they tend to be used for conflicting purposes. For example, the "isalpha()" class is sometimes used to determine word boundaries, while a more sophisticated approach would at least distinguish initial letters from continuation characters (the latter including combining marks). (In ICU, BreakIterator is the most sophisticated API for word boundaries.) Another example: There is no "istitle()" class for titlecase characters.
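The point about initial letters versus continuation characters can be sketched as follows. This is an illustration built on java.lang.Character general categories, not a substitute for BreakIterator; the class WordScanDemo and its methods are hypothetical names introduced here.

```java
// Sketch: an "isalpha"-style scan versus one that lets combining marks continue a word.
class WordScanDemo {
    // Letters and combining marks (Mn, Mc, Me) may continue a word.
    static boolean isContinuation(int cp) {
        int type = Character.getType(cp);
        return Character.isLetter(cp)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Length in chars of the word that starts at index 0.
    // With allowMarks == false every position must be a letter (the naive rule).
    static int wordLength(String s, boolean allowMarks) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            boolean ok = (i == 0 || !allowMarks)
                    ? Character.isLetter(cp)   // a word must start with a letter
                    : isContinuation(cp);
            if (!ok) {
                break;
            }
            i += Character.charCount(cp);
        }
        return i;
    }

    public static void main(String[] args) {
        // 'e' + U+0301 COMBINING ACUTE ACCENT + 't' + U+00E9, then a space.
        String s = "e\u0301t\u00E9 x";
        System.out.println(wordLength(s, false)); // naive rule stops at the combining mark
        System.out.println(wordLength(s, true));  // mark-aware rule reaches the space
    }
}
```

The naive rule cuts the word after the first letter because U+0301 is a nonspacing mark, not a letter; the mark-aware rule keeps all four chars together.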
ICU 3.4 and later provide API access for all twelve C/POSIX character classes. ICU implements them according to the Standard Recommendations in Annex C: Compatibility Properties of UTS #18 Unicode Regular Expressions (http://www.unicode.org/reports/tr18/#Compatibility_Properties).
API access for C/POSIX character classes is as follows:
- alpha: isUAlphabetic(c) or hasBinaryProperty(c, UProperty.ALPHABETIC)
- lower: isULowercase(c) or hasBinaryProperty(c, UProperty.LOWERCASE)
- upper: isUUppercase(c) or hasBinaryProperty(c, UProperty.UPPERCASE)
- punct: ((1<
The C/POSIX character classes are also available in UnicodeSet patterns,
using patterns like [:graph:] or \p{graph}.
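As a rough JDK-only analogue (an assumption made so the example runs without the ICU jar), java.util.regex applies the same UTS #18 Annex C definitions to its POSIX classes when the UNICODE_CHARACTER_CLASS flag is set. The sketch below does not use UnicodeSet, and the regex syntax (\p{Alpha}) differs from the UnicodeSet patterns named above; the class PosixClassDemo is hypothetical.

```java
import java.util.regex.Pattern;

// Sketch: POSIX "alpha" in java.util.regex, ASCII-only versus UTS #18 Annex C semantics.
class PosixClassDemo {
    // Without the flag, \p{Alpha} matches only ASCII letters.
    static final Pattern ASCII_ALPHA = Pattern.compile("\\p{Alpha}+");

    // With the flag, \p{Alpha} follows the Unicode Alphabetic property.
    static final Pattern UNICODE_ALPHA =
            Pattern.compile("\\p{Alpha}+", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean asciiAlpha(String s) {
        return ASCII_ALPHA.matcher(s).matches();
    }

    static boolean unicodeAlpha(String s) {
        return UNICODE_ALPHA.matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "\u00E9t\u00E9";                 // "été"
        System.out.println(asciiAlpha(word));          // ASCII-only definition
        System.out.println(unicodeAlpha(word));        // Unicode definition
    }
}
```

This shows why the ASCII-only semantics of the C/POSIX classes are a poor fit for international text: the same pattern accepts or rejects "été" depending on which definition is in force.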
Note: There are several ICU (and Java) whitespace functions.
Comparison:
- isUWhiteSpace=UCHAR_WHITE_SPACE: Unicode White_Space property; most of general categories "Z" (separators) + most whitespace ISO controls (including no-break spaces, but excluding IS1..IS4 and ZWSP)
- isWhitespace: Java isWhitespace; Z + whitespace ISO controls but excluding no-break spaces
- isSpaceChar: just Z (including no-break spaces)
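The difference between the two Java-compatible functions can be checked with a small sketch. It calls java.lang.Character.isWhitespace and isSpaceChar from the JDK as stand-ins, on the assumption that the UCharacter methods of the same name behave as described in the list above; isUWhiteSpace has no JDK counterpart and is left out. The class WhitespaceDemo is a hypothetical name.

```java
// Sketch: how the JDK's isWhitespace and isSpaceChar disagree on tab and no-break space.
class WhitespaceDemo {
    // Formats both answers for one code point.
    static String classify(int cp) {
        return String.format("U+%04X isWhitespace=%b isSpaceChar=%b",
                cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        System.out.println(classify(0x0020)); // SPACE: a Z character that is not no-break
        System.out.println(classify(0x0009)); // TAB: an ISO control, not in category Z
        System.out.println(classify(0x00A0)); // NO-BREAK SPACE: in Z, but no-break
    }
}
```

SPACE satisfies both functions, TAB only isWhitespace, and NO-BREAK SPACE only isSpaceChar, matching the comparison above.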
This class is not subclassable.
See also: UCharacterEnums

| Modifier and Type | Field and Description |
|---|---|
| static int | MAX_VALUE: The highest Unicode code point value (scalar value) according to the Unicode Standard. |
| static int | MIN_VALUE: The lowest Unicode code point value. |
Fields inherited from interface UCharacterEnums.ECharacterCategory:
CHAR_CATEGORY_COUNT, COMBINING_SPACING_MARK, CONNECTOR_PUNCTUATION, CONTROL, CURRENCY_SYMBOL, DASH_PUNCTUATION, DECIMAL_DIGIT_NUMBER, ENCLOSING_MARK, END_PUNCTUATION, FINAL_PUNCTUATION, FINAL_QUOTE_PUNCTUATION, FORMAT, GENERAL_OTHER_TYPES, INITIAL_PUNCTUATION, INITIAL_QUOTE_PUNCTUATION, LETTER_NUMBER, LINE_SEPARATOR, LOWERCASE_LETTER, MATH_SYMBOL, MODIFIER_LETTER, MODIFIER_SYMBOL, NON_SPACING_MARK, OTHER_LETTER, OTHER_NUMBER, OTHER_PUNCTUATION, OTHER_SYMBOL, PARAGRAPH_SEPARATOR, PRIVATE_USE, SPACE_SEPARATOR, START_PUNCTUATION, SURROGATE, TITLECASE_LETTER, UNASSIGNED, UPPERCASE_LETTER

Fields inherited from interface UCharacterEnums.ECharacterDirection:
ARABIC_NUMBER, BLOCK_SEPARATOR, BOUNDARY_NEUTRAL, CHAR_DIRECTION_COUNT, COMMON_NUMBER_SEPARATOR, DIR_NON_SPACING_MARK, DIRECTIONALITY_ARABIC_NUMBER, DIRECTIONALITY_BOUNDARY_NEUTRAL, DIRECTIONALITY_COMMON_NUMBER_SEPARATOR, DIRECTIONALITY_EUROPEAN_NUMBER, DIRECTIONALITY_EUROPEAN_NUMBER_SEPARATOR, DIRECTIONALITY_EUROPEAN_NUMBER_TERMINATOR, DIRECTIONALITY_LEFT_TO_RIGHT, DIRECTIONALITY_LEFT_TO_RIGHT_EMBEDDING, DIRECTIONALITY_LEFT_TO_RIGHT_OVERRIDE, DIRECTIONALITY_NONSPACING_MARK, DIRECTIONALITY_OTHER_NEUTRALS, DIRECTIONALITY_PARAGRAPH_SEPARATOR, DIRECTIONALITY_POP_DIRECTIONAL_FORMAT, DIRECTIONALITY_RIGHT_TO_LEFT, DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC, DIRECTIONALITY_RIGHT_TO_LEFT_EMBEDDING, DIRECTIONALITY_RIGHT_TO_LEFT_OVERRIDE, DIRECTIONALITY_SEGMENT_SEPARATOR, DIRECTIONALITY_UNDEFINED, DIRECTIONALITY_WHITESPACE, EUROPEAN_NUMBER, EUROPEAN_NUMBER_SEPARATOR, EUROPEAN_NUMBER_TERMINATOR, LEFT_TO_RIGHT, LEFT_TO_RIGHT_EMBEDDING, LEFT_TO_RIGHT_OVERRIDE, OTHER_NEUTRAL, POP_DIRECTIONAL_FORMAT, RIGHT_TO_LEFT, RIGHT_TO_LEFT_ARABIC, RIGHT_TO_LEFT_EMBEDDING, RIGHT_TO_LEFT_OVERRIDE, SEGMENT_SEPARATOR, WHITE_SPACE_NEUTRAL

| Constructor and Description |
|---|
| UCharacter() |
public static final int MIN_VALUE
public static final int MAX_VALUE
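A typical use of such bounds is validating an int before treating it as a code point. The sketch below uses the JDK constants Character.MIN_CODE_POINT and Character.MAX_CODE_POINT as stand-ins; that they hold the same values as MIN_VALUE and MAX_VALUE here is an assumption, and the class CodePointRangeDemo is a hypothetical name.

```java
// Sketch: range check against the Unicode code point bounds, using JDK constants.
class CodePointRangeDemo {
    // True if cp lies within the Unicode code space, U+0000 through U+10FFFF.
    static boolean inUnicodeRange(int cp) {
        return cp >= Character.MIN_CODE_POINT && cp <= Character.MAX_CODE_POINT;
    }

    public static void main(String[] args) {
        System.out.println(inUnicodeRange(0x10FFFF)); // last code point
        System.out.println(inUnicodeRange(0x110000)); // one past the end
    }
}
```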
Copyright © 2010 - 2020 Adobe. All Rights Reserved