Class BreakIterator
- All Implemented Interfaces:
Cloneable
- Direct Known Subclasses:
RuleBasedBreakIterator, SimpleFilteredSentenceBreakIterator
A class that locates boundaries in text. This class defines a protocol for objects that break up a piece of natural-language text according to a set of criteria. Instances or subclasses of BreakIterator can be provided, for example, to break a piece of text into words, sentences, or logical characters according to the conventions of some language or group of languages. We provide five built-in types of BreakIterator:
- getTitleInstance() returns a BreakIterator that locates title boundaries. (Deprecated as of ICU 64; use getWordInstance() instead.)
- getSentenceInstance() returns a BreakIterator that locates boundaries between sentences. This is useful for triple-click selection, for example.
- getWordInstance() returns a BreakIterator that locates boundaries between words. This is useful for double-click selection or "find whole words" searches. This type of BreakIterator makes sure there is a boundary position at the beginning and end of each legal word. (Numbers count as words, too.) Whitespace and punctuation are kept separate from real words.
- getLineInstance() returns a BreakIterator that locates positions where it is legal for a text editor to wrap lines. This is similar to word breaking, but not the same: punctuation and whitespace are generally kept with words (you don't want a line to start with whitespace, for example), and some special characters can force a position to be considered a line-break position or prevent a position from being a line-break position.
- getCharacterInstance() returns a BreakIterator that locates boundaries between logical characters. Because of the structure of the Unicode encoding, a logical character may be stored internally as more than one Unicode code point. (A with an umlaut may be stored as an a followed by a separate combining umlaut character, for example, but the user still thinks of it as one character.) This iterator allows various processes (especially text editors) to treat as characters the units of text that a user would think of as characters, rather than the units of text that the computer sees as "characters".
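The difference between the iterator types is easiest to see by segmenting the same text in more than one way. The following is a minimal sketch, not part of the ICU documentation: it assumes the JDK's java.text.BreakIterator as a stand-in so that it runs without the ICU4J jar (with ICU4J, the import would be com.ibm.icu.text.BreakIterator), and the class and method names are hypothetical.

```java
// Assumption: java.text.BreakIterator stands in for com.ibm.icu.text.BreakIterator.
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SegmentDemo {
    /** Returns the pieces of text that lie between consecutive boundaries. */
    static List<String> segments(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hi there. Bye.";
        // Word segmentation keeps whitespace and punctuation as separate pieces.
        System.out.println(segments(BreakIterator.getWordInstance(Locale.US), text));
        // Sentence segmentation yields one piece per sentence.
        System.out.println(segments(BreakIterator.getSentenceInstance(Locale.US), text));
    }
}
```

Exact boundary positions can differ between the JDK and ICU rule sets, so treat the output as illustrative.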
BreakIterator's interface follows an "iterator" model (hence the name), meaning it has a concept of a "current position" and methods like first(), last(), next(), and previous() that update the current position. All BreakIterators uphold the following invariants:
- The beginning and end of the text are always treated as boundary positions.
- The current position of the iterator is always a boundary position (random-access methods move the iterator to the nearest boundary position before or after the specified position, not to the specified position).
- DONE is used as a flag to indicate when iteration has stopped. DONE is only returned when the current position is the end of the text and the user calls next(), or when the current position is the beginning of the text and the user calls previous().
- Break positions are numbered by the positions of the characters that follow them. Thus, under normal circumstances, the position before the first character is 0, the position after the first character is 1, and the position after the last character is equal to the length of the string (1 plus the index of the last character).
- The client can change the position of an iterator, or the text it analyzes, at will, but cannot change the behavior. A user who wants different behavior must instantiate a new iterator.
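The invariants about the ends of the text can be checked directly. This is a small sketch under stated assumptions: it uses the JDK's java.text.BreakIterator in place of the ICU class so that it runs without extra dependencies, and the helper name endsAreBoundaries is hypothetical.

```java
// Assumption: java.text.BreakIterator stands in for com.ibm.icu.text.BreakIterator.
import java.text.BreakIterator;
import java.util.Locale;

class InvariantDemo {
    /** Checks that both ends of the text are boundaries and that DONE is returned past them. */
    static boolean endsAreBoundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        boolean startsAtZero = it.first() == 0;
        // Stepping back from the first boundary signals DONE.
        boolean doneBeforeStart = it.previous() == BreakIterator.DONE;
        boolean endsAtLength = it.last() == text.length();
        // Stepping forward from the last boundary signals DONE.
        boolean doneAfterEnd = it.next() == BreakIterator.DONE;
        return startsAtZero && doneBeforeStart && endsAtLength && doneAfterEnd;
    }

    public static void main(String[] args) {
        System.out.println(endsAreBoundaries("one two three"));
    }
}
```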
Examples:
Creating and using text boundaries
// Assumes these imports: com.ibm.icu.text.BreakIterator and java.util.Locale.
public static void main(String[] args) {
    if (args.length == 1) {
        String stringToExamine = args[0];
        // Print each word in order.
        BreakIterator boundary = BreakIterator.getWordInstance();
        boundary.setText(stringToExamine);
        printEachForward(boundary, stringToExamine);
        // Print each sentence in reverse order.
        boundary = BreakIterator.getSentenceInstance(Locale.US);
        boundary.setText(stringToExamine);
        printEachBackward(boundary, stringToExamine);
        printFirst(boundary, stringToExamine);
        printLast(boundary, stringToExamine);
    }
}
Print each element in order
public static void printEachForward(BreakIterator boundary, String source) {
    int start = boundary.first();
    for (int end = boundary.next();
         end != BreakIterator.DONE;
         start = end, end = boundary.next()) {
        System.out.println(source.substring(start, end));
    }
}
Print each element in reverse order
public static void printEachBackward(BreakIterator boundary, String source) {
    int end = boundary.last();
    for (int start = boundary.previous();
         start != BreakIterator.DONE;
         end = start, start = boundary.previous()) {
        System.out.println(source.substring(start, end));
    }
}
Print first element
public static void printFirst(BreakIterator boundary, String source) {
    int start = boundary.first();
    int end = boundary.next();
    System.out.println(source.substring(start, end));
}
Print last element
public static void printLast(BreakIterator boundary, String source) {
    int end = boundary.last();
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}
Print the element at a specified position
public static void printAt(BreakIterator boundary, int pos, String source) {
    int end = boundary.following(pos);
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}
Find the next word
public static int nextWordStartAfter(int pos, String text) {
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(text);
    int wordStart = wb.following(pos);
    for (;;) {
        int wordLimit = wb.next();
        if (wordLimit == BreakIterator.DONE) {
            return BreakIterator.DONE;
        }
        // The rule status describes the segment that ends at wordLimit.
        int wordStatus = wb.getRuleStatus();
        if (wordStatus != BreakIterator.WORD_NONE) {
            return wordStart;
        }
        wordStart = wordLimit;
    }
}
The iterator returned by getWordInstance() is unique in that the break positions it returns don't represent both the start and end of the thing being iterated over. That is, a sentence-break iterator returns breaks that each represent the end of one sentence and the beginning of the next. With the word-break iterator, the characters between two boundaries might be a word, or they might be the punctuation or whitespace between two words. The above code uses getRuleStatus() to identify and ignore boundaries associated with punctuation or other non-word characters.
Field Summary
Fields (modifier and type, field, description):
- static final int DONE: DONE is returned by previous() and next() after all valid boundaries have been returned.
- static final int KIND_CHARACTER
- static final int KIND_LINE
- static final int KIND_SENTENCE
- static final int KIND_TITLE: Deprecated.
- static final int KIND_WORD
- static final int WORD_IDEO: Tag value for words containing ideographic characters, lower limit.
- static final int WORD_IDEO_LIMIT: Tag value for words containing ideographic characters, upper limit.
- static final int WORD_KANA: Tag value for words containing kana characters, lower limit.
- static final int WORD_KANA_LIMIT: Tag value for words containing kana characters, upper limit.
- static final int WORD_LETTER: Tag value for words that contain letters, excluding hiragana, katakana or ideographic characters, lower limit.
- static final int WORD_LETTER_LIMIT: Tag value for words containing letters, upper limit.
- static final int WORD_NONE: Tag value for "words" that do not fit into any of the other categories.
- static final int WORD_NONE_LIMIT: Upper bound for tags for uncategorized words.
- static final int WORD_NUMBER: Tag value for words that appear to be numbers, lower limit.
- static final int WORD_NUMBER_LIMIT: Tag value for words that appear to be numbers, upper limit.
Constructor Summary
Constructors
- protected BreakIterator(): Default constructor.
Method Summary
Methods (modifier and type, method, description):
- Object clone(): Clone method.
- abstract int current(): Return the iterator's current position.
- abstract int first(): Set the iterator to the first boundary position.
- abstract int following(int offset): Sets the iterator's current iteration position to be the first boundary position following the specified position.
- static Locale[] getAvailableLocales(): Returns a list of locales for which BreakIterators can be used.
- static ULocale[] getAvailableULocales(): Returns a list of locales for which BreakIterators can be used.
- static BreakIterator getBreakInstance(ULocale where, int kind): Deprecated. This API is ICU internal only.
- static BreakIterator getCharacterInstance(): Returns a new instance of BreakIterator that locates logical-character boundaries.
- static BreakIterator getCharacterInstance(Locale where): Returns a new instance of BreakIterator that locates logical-character boundaries.
- static BreakIterator getCharacterInstance(ULocale where): Returns a new instance of BreakIterator that locates logical-character boundaries.
- static BreakIterator getLineInstance(): Returns a new instance of BreakIterator that locates legal line-wrapping positions.
- static BreakIterator getLineInstance(Locale where): Returns a new instance of BreakIterator that locates legal line-wrapping positions.
- static BreakIterator getLineInstance(ULocale where): Returns a new instance of BreakIterator that locates legal line-wrapping positions.
- final ULocale getLocale(ULocale.Type type): Returns the locale that was used to create this object, or null.
- int getRuleStatus(): For RuleBasedBreakIterators, return the status tag from the break rule that determined the boundary at the current iteration position.
- int getRuleStatusVec(int[] fillInArray): For RuleBasedBreakIterators, get the status (tag) values from the break rule(s) that determined the boundary at the current iteration position.
- static BreakIterator getSentenceInstance(): Returns a new instance of BreakIterator that locates sentence boundaries.
- static BreakIterator getSentenceInstance(Locale where): Returns a new instance of BreakIterator that locates sentence boundaries.
- static BreakIterator getSentenceInstance(ULocale where): Returns a new instance of BreakIterator that locates sentence boundaries.
- abstract CharacterIterator getText(): Returns a CharacterIterator over the text being analyzed.
- static BreakIterator getTitleInstance(): Deprecated. ICU 64: use getWordInstance() instead.
- static BreakIterator getTitleInstance(Locale where): Deprecated. ICU 64: use getWordInstance() instead.
- static BreakIterator getTitleInstance(ULocale where): Deprecated. ICU 64: use getWordInstance() instead.
- static BreakIterator getWordInstance(): Returns a new instance of BreakIterator that locates word boundaries.
- static BreakIterator getWordInstance(Locale where): Returns a new instance of BreakIterator that locates word boundaries.
- static BreakIterator getWordInstance(ULocale where): Returns a new instance of BreakIterator that locates word boundaries.
- boolean isBoundary(int offset): Return true if the specified position is a boundary position.
- abstract int last(): Set the iterator to the last boundary position.
- abstract int next(): Advances the iterator forward one boundary.
- abstract int next(int n): Move the iterator by the specified number of steps in the text.
- int preceding(int offset): Sets the iterator's current iteration position to be the last boundary position preceding the specified position.
- abstract int previous(): Move the iterator backward one boundary.
- static Object registerInstance(BreakIterator iter, Locale locale, int kind): Registers a new break iterator of the indicated kind, to use in the given locale.
- static Object registerInstance(BreakIterator iter, ULocale locale, int kind): Registers a new break iterator of the indicated kind, to use in the given locale.
- void setText(CharSequence newText): Sets the iterator to analyze a new piece of text.
- void setText(String newText): Sets the iterator to analyze a new piece of text.
- abstract void setText(CharacterIterator newText): Sets the iterator to analyze a new piece of text.
- static boolean unregister(Object key): Unregisters a previously-registered BreakIterator using the key returned from the register call.
-
Field Details
-
DONE
public static final int DONE
DONE is returned by previous() and next() after all valid boundaries have been returned.
-
WORD_NONE
public static final int WORD_NONE
Tag value for "words" that do not fit into any of the other categories. Includes spaces and most punctuation.
-
WORD_NONE_LIMIT
public static final int WORD_NONE_LIMIT
Upper bound for tags for uncategorized words.
-
WORD_NUMBER
public static final int WORD_NUMBER
Tag value for words that appear to be numbers, lower limit.
-
WORD_NUMBER_LIMIT
public static final int WORD_NUMBER_LIMIT
Tag value for words that appear to be numbers, upper limit.
-
WORD_LETTER
public static final int WORD_LETTER
Tag value for words that contain letters, excluding hiragana, katakana or ideographic characters, lower limit.
-
WORD_LETTER_LIMIT
public static final int WORD_LETTER_LIMIT
Tag value for words containing letters, upper limit.
-
WORD_KANA
public static final int WORD_KANA
Tag value for words containing kana characters, lower limit.
-
WORD_KANA_LIMIT
public static final int WORD_KANA_LIMIT
Tag value for words containing kana characters, upper limit.
-
WORD_IDEO
public static final int WORD_IDEO
Tag value for words containing ideographic characters, lower limit.
-
WORD_IDEO_LIMIT
public static final int WORD_IDEO_LIMIT
Tag value for words containing ideographic characters, upper limit.
-
KIND_CHARACTER
public static final int KIND_CHARACTER
-
KIND_WORD
public static final int KIND_WORD
-
KIND_LINE
public static final int KIND_LINE
-
KIND_SENTENCE
public static final int KIND_SENTENCE
-
KIND_TITLE
@Deprecated public static final int KIND_TITLE
Deprecated. ICU 64: use getWordInstance() instead.
-
-
Constructor Details
-
BreakIterator
protected BreakIterator()
Default constructor. There is no state that is carried by this abstract base class.
-
-
Method Details
-
clone
public Object clone()
Clone method.
-
first
public abstract int first()
Set the iterator to the first boundary position. This is always the beginning index of the text this iterator iterates over. For example, if the iterator iterates over a whole string, this function will always return 0.
- Returns:
- The character offset of the beginning of the stretch of text being broken.
-
last
public abstract int last()
Set the iterator to the last boundary position. This is always the "past-the-end" index of the text this iterator iterates over. For example, if the iterator iterates over a whole string (call it "text"), this function will always return text.length().
- Returns:
- The character offset of the end of the stretch of text being broken.
-
next
public abstract int next(int n)
Move the iterator by the specified number of steps in the text. A positive number moves the iterator forward; a negative number moves the iterator backwards. If this causes the iterator to move off either end of the text, this function returns DONE; otherwise, this function returns the position of the appropriate boundary. Calling this function is equivalent to calling next() or previous() n times.
- Parameters:
n - The number of boundaries to advance over (if positive, moves forward; if negative, moves backwards).
- Returns:
- The position of the boundary n boundaries from the current iteration position, or DONE if moving n boundaries causes the iterator to advance off either end of the text.
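The equivalence between next(n) and n single steps can be sketched as follows. This is an illustration only: it assumes the JDK's java.text.BreakIterator as a dependency-free stand-in for the ICU class, restricts itself to non-negative n, and uses hypothetical helper names.

```java
// Assumption: java.text.BreakIterator stands in for com.ibm.icu.text.BreakIterator.
import java.text.BreakIterator;
import java.util.Locale;

class NextNDemo {
    /** Moves n boundaries forward from the start with a single next(n) call (n >= 0). */
    static int nthBoundary(String text, int n) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        it.first();
        return it.next(n);
    }

    /** Does the same with n separate next() calls (n >= 0). */
    static int nthBoundaryStepwise(String text, int n) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        int pos = it.first();
        for (int i = 0; i < n; i++) {
            pos = it.next();
        }
        return pos;
    }

    public static void main(String[] args) {
        String text = "one two three";
        System.out.println(nthBoundary(text, 2) == nthBoundaryStepwise(text, 2));
        // Asking for more boundaries than exist runs off the end of the text.
        System.out.println(nthBoundary(text, 10) == BreakIterator.DONE);
    }
}
```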
-
next
public abstract int next()
Advances the iterator forward one boundary. The current iteration position is updated to point to the next boundary position after the current position, and this is also the value that is returned. If the current position is equal to the value returned by last(), or to DONE, this function returns DONE and sets the current position to DONE.
- Returns:
- The position of the first boundary position following the iteration position.
-
previous
public abstract int previous()
Move the iterator backward one boundary. The current iteration position is updated to point to the last boundary position before the current position, and this is also the value that is returned. If the current position is equal to the value returned by first(), or to DONE, this function returns DONE and sets the current position to DONE.
- Returns:
- The position of the last boundary position preceding the iteration position.
-
following
public abstract int following(int offset)
Sets the iterator's current iteration position to be the first boundary position following the specified position. (Whether the specified position is itself a boundary position or not doesn't matter: this function always moves the iteration position to the first boundary after the specified position.) If the specified position is the past-the-end position, returns DONE.
- Parameters:
offset - The character position to start searching from.
- Returns:
- The position of the first boundary position following "offset" (whether or not "offset" itself is a boundary position), or DONE if "offset" is the past-the-end offset.
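The "strictly after" behavior described above can be seen with a word iterator. This is a minimal sketch that assumes the JDK's java.text.BreakIterator as a stand-in for the ICU class; nextBoundary is a hypothetical helper, and only in-range offsets are used.

```java
// Assumption: java.text.BreakIterator stands in for com.ibm.icu.text.BreakIterator.
import java.text.BreakIterator;
import java.util.Locale;

class FollowingDemo {
    /** Returns the first word boundary strictly after offset. */
    static int nextBoundary(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        return it.following(offset);
    }

    public static void main(String[] args) {
        String text = "one two three";
        // Offset 3 is itself a boundary, but the result is still the next one after it.
        System.out.println(nextBoundary(text, 3));
        // Offset 5 lies inside "two", so the result is the end of that word.
        System.out.println(nextBoundary(text, 5));
    }
}
```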
-
preceding
public int preceding(int offset)
Sets the iterator's current iteration position to be the last boundary position preceding the specified position. (Whether the specified position is itself a boundary position or not doesn't matter: this function always moves the iteration position to the last boundary before the specified position.) If the specified position is the starting position, returns DONE.
- Parameters:
offset - The character position to start searching from.
- Returns:
- The position of the last boundary position preceding "offset" (whether or not "offset" itself is a boundary position), or DONE if "offset" is the starting offset of the iterator.
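A backward lookup works the same way in mirror image. The sketch below assumes the JDK's java.text.BreakIterator as a dependency-free stand-in for the ICU class and uses a hypothetical helper name, previousBoundary.

```java
// Assumption: java.text.BreakIterator stands in for com.ibm.icu.text.BreakIterator.
import java.text.BreakIterator;
import java.util.Locale;

class PrecedingDemo {
    /** Returns the last word boundary strictly before offset. */
    static int previousBoundary(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        return it.preceding(offset);
    }

    public static void main(String[] args) {
        String text = "one two three";
        // From the end of the text, the previous boundary is the start of the last word.
        System.out.println(previousBoundary(text, text.length()));
        // Offset 4 is a boundary, but the result is still the one before it.
        System.out.println(previousBoundary(text, 4));
    }
}
```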
-
isBoundary
public boolean isBoundary(int offset)
Return true if the specified position is a boundary position. If the function returns true, the current iteration position is set to the specified position; if the function returns false, the current iteration position is set as though following() had been called.
- Parameters:
offset - the offset to check.
- Returns:
- True if "offset" is a boundary position.
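One common use of isBoundary() is snapping an arbitrary offset, such as a caret position, to a boundary. The following sketch assumes the JDK's java.text.BreakIterator as a stand-in for the ICU class; snapForward is a hypothetical helper, and it calls following() explicitly instead of relying on the iterator's position after a false result.

```java
// Assumption: java.text.BreakIterator stands in for com.ibm.icu.text.BreakIterator.
import java.text.BreakIterator;
import java.util.Locale;

class IsBoundaryDemo {
    /** Returns offset if it is a word boundary, otherwise the next boundary after it. */
    static int snapForward(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        return it.isBoundary(offset) ? offset : it.following(offset);
    }

    public static void main(String[] args) {
        String text = "one two three";
        System.out.println(snapForward(text, 4)); // already at the start of "two"
        System.out.println(snapForward(text, 5)); // inside "two"
    }
}
```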
-
current
public abstract int current()
Return the iterator's current position.
- Returns:
- The iterator's current position.
-
getRuleStatus
public int getRuleStatus()
For RuleBasedBreakIterators, return the status tag from the break rule that determined the boundary at the current iteration position.
For break iterator types that do not support a rule status, a default value of 0 is returned.
- Returns:
- The status from the break rule that determined the boundary at the current iteration position.
-
getRuleStatusVec
public int getRuleStatusVec(int[] fillInArray)
For RuleBasedBreakIterators, get the status (tag) values from the break rule(s) that determined the boundary at the current iteration position.
For break iterator types that do not support rule status, no values are returned.
If the size of the output array is insufficient to hold the data, the output will be truncated to the available length. No exception will be thrown.
- Parameters:
fillInArray - an array to be filled in with the status values.
- Returns:
- The number of rule status values from rules that determined the boundary at the current iteration position. In the event that the array is too small, the return value is the total number of status values that were available, not the reduced number that were actually returned.
-
getText
public abstract CharacterIterator getText()
Returns a CharacterIterator over the text being analyzed.
Caution: The state of the returned CharacterIterator must not be modified in any way while the BreakIterator is still in use. Doing so will lead to undefined behavior of the BreakIterator. Clone the returned CharacterIterator first and work with that.
The returned CharacterIterator is a reference to the actual iterator being used by the BreakIterator. No guarantees are made about the current position of this iterator when it is returned; it may differ from the BreakIterator's current position. If you need to move that position to examine the text, clone this function's return value first.
- Returns:
- A CharacterIterator over the text being analyzed.
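The clone-first advice above might look like this in practice. This is a sketch under stated assumptions: it uses the JDK's java.text.BreakIterator as a stand-in for the ICU class, and readText is a hypothetical helper.

```java
// Assumption: java.text.BreakIterator stands in for com.ibm.icu.text.BreakIterator.
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

class GetTextDemo {
    /** Reads the text under analysis without moving the BreakIterator's own iterator. */
    static String readText(BreakIterator it) {
        // Work on a clone; the original is owned by the BreakIterator.
        CharacterIterator copy = (CharacterIterator) it.getText().clone();
        StringBuilder sb = new StringBuilder();
        for (char c = copy.first(); c != CharacterIterator.DONE; c = copy.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText("one two");
        System.out.println(readText(it));
    }
}
```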
-
setText
public void setText(String newText)
Sets the iterator to analyze a new piece of text. The new piece of text is passed in as a String, and the current iteration position is reset to the beginning of the string. (The old text is dropped.)
- Parameters:
newText - A String containing the text to analyze with this BreakIterator.
-
setText
public void setText(CharSequence newText)
Sets the iterator to analyze a new piece of text. The new piece of text is passed in as a CharSequence, and the current iteration position is reset to the beginning of the text. (The old text is dropped.)
The text underlying the CharSequence must not be modified while the BreakIterator holds a reference to it (as could occur with a StringBuilder, for example).
- Parameters:
newText - A CharSequence containing the text to analyze with this BreakIterator.
-
setText
public abstract void setText(CharacterIterator newText)
Sets the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text. (The old iterator is dropped.)
Caution: The supplied CharacterIterator is used directly by the BreakIterator, and must not be altered in any way by code outside of the BreakIterator. Doing so will lead to undefined behavior of the BreakIterator.
- Parameters:
newText - A CharacterIterator referring to the text to analyze with this BreakIterator (the iterator's current position is ignored, but its other state is significant).
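One piece of "other state" that matters is the range of the CharacterIterator. The sketch below analyzes only part of a string by passing a StringCharacterIterator with explicit begin and end indexes. It assumes the JDK's java.text.BreakIterator as a stand-in for the ICU class, and boundariesInRange is a hypothetical helper.

```java
// Assumption: java.text.BreakIterator stands in for com.ibm.icu.text.BreakIterator.
import java.text.BreakIterator;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SubrangeDemo {
    /** Lists word boundaries within text[begin, end), reported as offsets into text. */
    static List<Integer> boundariesInRange(String text, int begin, int end) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        // The iterator is handed over to the BreakIterator and not touched again here.
        it.setText(new StringCharacterIterator(text, begin, end, begin));
        List<Integer> out = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            out.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        // Only "one two" (offsets 3 to 10) is analyzed.
        System.out.println(boundariesInRange("xx one two yy", 3, 10));
    }
}
```

With a restricted range, first() and last() correspond to the begin and end indexes of the range, not to 0 and the string length.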
-
getWordInstance
public static BreakIterator getWordInstance()
Returns a new instance of BreakIterator that locates word boundaries. This function assumes that the text being analyzed is in the default locale's language.
- Returns:
- An instance of BreakIterator that locates word boundaries.
-
getWordInstance
public static BreakIterator getWordInstance(Locale where)
Returns a new instance of BreakIterator that locates word boundaries.
- Parameters:
where - A Locale specifying the language of the text to be analyzed.
- Returns:
- An instance of BreakIterator that locates word boundaries.
- Throws:
NullPointerException - if where is null.
-
getWordInstance
public static BreakIterator getWordInstance(ULocale where)
Returns a new instance of BreakIterator that locates word boundaries.
- Parameters:
where - A ULocale specifying the language of the text to be analyzed.
- Returns:
- An instance of BreakIterator that locates word boundaries.
- Throws:
NullPointerException - if where is null.
-
getLineInstance
public static BreakIterator getLineInstance()
Returns a new instance of BreakIterator that locates legal line-wrapping positions. This function assumes the text being broken is in the default locale's language.
- Returns:
- A new instance of BreakIterator that locates legal line-wrapping positions.
-
getLineInstance
public static BreakIterator getLineInstance(Locale where)
Returns a new instance of BreakIterator that locates legal line-wrapping positions.
- Parameters:
where - A Locale specifying the language of the text being broken.
- Returns:
- A new instance of BreakIterator that locates legal line-wrapping positions.
- Throws:
NullPointerException - if where is null.
-
getLineInstance
public static BreakIterator getLineInstance(ULocale where)
Returns a new instance of BreakIterator that locates legal line-wrapping positions.
- Parameters:
where - A ULocale specifying the language of the text being broken.
- Returns:
- A new instance of BreakIterator that locates legal line-wrapping positions.
- Throws:
NullPointerException - if where is null.
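A line-break iterator only reports where wrapping is legal; choosing among those positions is up to the caller. The following is a deliberately simple greedy wrapper, offered as a sketch and not as ICU's method. It assumes the JDK's java.text.BreakIterator as a stand-in for the ICU class, measures width in UTF-16 code units (including the trailing space before a break), and uses a hypothetical helper name, wrap.

```java
// Assumption: java.text.BreakIterator stands in for com.ibm.icu.text.BreakIterator.
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WrapDemo {
    /** Greedily wraps text at legal line-break positions so lines stay within width when possible. */
    static List<String> wrap(String text, int width) {
        BreakIterator lb = BreakIterator.getLineInstance(Locale.US);
        lb.setText(text);
        List<String> lines = new ArrayList<>();
        int lineStart = lb.first();
        int lastBreak = lineStart; // most recent legal break seen on the current line
        for (int b = lb.next(); b != BreakIterator.DONE; b = lb.next()) {
            // If extending to b would overflow, end the line at the previous legal break.
            if (b - lineStart > width && lastBreak > lineStart) {
                lines.add(text.substring(lineStart, lastBreak).trim());
                lineStart = lastBreak;
            }
            lastBreak = b;
        }
        if (lineStart < text.length()) {
            lines.add(text.substring(lineStart).trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : wrap("the quick brown fox", 10)) {
            System.out.println(line);
        }
    }
}
```

A single unbreakable run longer than width is left on an overlong line; a real layout engine would also measure rendered width instead of counting code units.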
-
getCharacterInstance
public static BreakIterator getCharacterInstance()
Returns a new instance of BreakIterator that locates logical-character boundaries. This function assumes that the text being analyzed is in the default locale's language.
- Returns:
- A new instance of BreakIterator that locates logical-character boundaries.
-
getCharacterInstance
public static BreakIterator getCharacterInstance(Locale where)
Returns a new instance of BreakIterator that locates logical-character boundaries.
- Parameters:
where - A Locale specifying the language of the text being analyzed.
- Returns:
- A new instance of BreakIterator that locates logical-character boundaries.
- Throws:
NullPointerException - if where is null.
-
getCharacterInstance
public static BreakIterator getCharacterInstance(ULocale where)
Returns a new instance of BreakIterator that locates logical-character boundaries.
- Parameters:
where - A ULocale specifying the language of the text being analyzed.
- Returns:
- A new instance of BreakIterator that locates logical-character boundaries.
- Throws:
NullPointerException - if where is null.
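Counting logical characters instead of code units is a typical use of a character iterator. The sketch below assumes the JDK's java.text.BreakIterator as a stand-in for the ICU class (the two may disagree on more complex sequences such as some emoji), and countLogicalChars is a hypothetical helper.

```java
// Assumption: java.text.BreakIterator stands in for com.ibm.icu.text.BreakIterator.
import java.text.BreakIterator;
import java.util.Locale;

class CharCountDemo {
    /** Counts logical characters, treating a base letter plus combining marks as one unit. */
    static int countLogicalChars(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.US);
        it.setText(text);
        int count = 0;
        it.first();
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a' followed by U+0308 COMBINING DIAERESIS, then 'b': three code units.
        String text = "a\u0308b";
        System.out.println(text.length() + " code units, "
                + countLogicalChars(text) + " logical characters");
    }
}
```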
-
getSentenceInstance
public static BreakIterator getSentenceInstance()
Returns a new instance of BreakIterator that locates sentence boundaries. This function assumes the text being analyzed is in the default locale's language.
- Returns:
- A new instance of BreakIterator that locates sentence boundaries.
-
getSentenceInstance
public static BreakIterator getSentenceInstance(Locale where)
Returns a new instance of BreakIterator that locates sentence boundaries.
- Parameters:
where - A Locale specifying the language of the text being analyzed.
- Returns:
- A new instance of BreakIterator that locates sentence boundaries.
- Throws:
NullPointerException - if where is null.
-
getSentenceInstance
public static BreakIterator getSentenceInstance(ULocale where)
Returns a new instance of BreakIterator that locates sentence boundaries.
- Parameters:
where - A ULocale specifying the language of the text being analyzed.
- Returns:
- A new instance of BreakIterator that locates sentence boundaries.
- Throws:
NullPointerException - if where is null.
-
getTitleInstance
@Deprecated public static BreakIterator getTitleInstance()
Deprecated. ICU 64: use getWordInstance() instead.
Returns a new instance of BreakIterator that locates title boundaries. This function assumes the text being analyzed is in the default locale's language. The iterator returned locates title boundaries as described for Unicode 3.2 only. For title boundary iteration under Unicode 4.0 and above, use a word boundary iterator; see getWordInstance().
- Returns:
- A new instance of BreakIterator that locates title boundaries.
-
getTitleInstance
@Deprecated public static BreakIterator getTitleInstance(Locale where)
Deprecated. ICU 64: use getWordInstance() instead.
Returns a new instance of BreakIterator that locates title boundaries. The iterator returned locates title boundaries as described for Unicode 3.2 only. For title boundary iteration under Unicode 4.0 and above, use a word boundary iterator; see getWordInstance().
- Parameters:
where - A Locale specifying the language of the text being analyzed.
- Returns:
- A new instance of BreakIterator that locates title boundaries.
- Throws:
NullPointerException - if where is null.
-
getTitleInstance
@Deprecated public static BreakIterator getTitleInstance(ULocale where)
Deprecated. ICU 64: use getWordInstance() instead.
Returns a new instance of BreakIterator that locates title boundaries. The iterator returned locates title boundaries as described for Unicode 3.2 only. For title boundary iteration under Unicode 4.0 and above, use a word boundary iterator; see getWordInstance().
- Parameters:
where - A ULocale specifying the language of the text being analyzed.
- Returns:
- A new instance of BreakIterator that locates title boundaries.
- Throws:
NullPointerException - if where is null.
-
registerInstance
public static Object registerInstance(BreakIterator iter, Locale locale, int kind)
Registers a new break iterator of the indicated kind, to use in the given locale. Clones of the iterator will be returned if a request for a break iterator of the given kind matches or falls back to this locale.
Because ICU may choose to cache BreakIterator objects internally, this must be called at application startup, prior to any calls to BreakIterator.getInstance, to avoid undefined behavior.
- Parameters:
iter - the BreakIterator instance to adopt.
locale - the Locale for which this instance is to be registered
kind - the type of iterator for which this instance is to be registered
- Returns:
- a registry key that can be used to unregister this instance
-
registerInstance
public static Object registerInstance(BreakIterator iter, ULocale locale, int kind)
Registers a new break iterator of the indicated kind, to use in the given locale. Clones of the iterator will be returned if a request for a break iterator of the given kind matches or falls back to this locale.
Because ICU may choose to cache BreakIterator objects internally, this must be called at application startup, prior to any calls to BreakIterator.getInstance, to avoid undefined behavior.
- Parameters:
iter - the BreakIterator instance to adopt.
locale - the ULocale for which this instance is to be registered
kind - the type of iterator for which this instance is to be registered
- Returns:
- a registry key that can be used to unregister this instance
-
unregister
public static boolean unregister(Object key)
Unregisters a previously-registered BreakIterator using the key returned from the register call. The key becomes invalid after this call and should not be used again.
- Parameters:
key - the registry key returned by a previous call to registerInstance
- Returns:
- true if the iterator for the key was successfully unregistered
-
getBreakInstance
@Deprecated public static BreakIterator getBreakInstance(ULocale where, int kind)
Deprecated. This API is ICU internal only.
Returns a particular kind of BreakIterator for a locale. Avoids writing a switch statement with getXYZInstance(where) calls.
-
getAvailableLocales
public static Locale[] getAvailableLocales()
Returns a list of locales for which BreakIterators can be used.
- Returns:
- An array of Locales. All of the locales in the array can be used when creating a BreakIterator.
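The returned array can be searched before requesting a locale-specific iterator. This is a small sketch that assumes the JDK's java.text.BreakIterator as a stand-in for the ICU class; the set of locales it reports will differ from ICU's, and the helper names are hypothetical.

```java
// Assumption: java.text.BreakIterator stands in for com.ibm.icu.text.BreakIterator.
import java.text.BreakIterator;
import java.util.Locale;

class AvailableLocalesDemo {
    /** Returns true if wanted appears in the list of locales reported for BreakIterators. */
    static boolean isListed(Locale wanted) {
        for (Locale candidate : BreakIterator.getAvailableLocales()) {
            if (candidate.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    /** Returns how many locales are reported. */
    static int listedCount() {
        return BreakIterator.getAvailableLocales().length;
    }

    public static void main(String[] args) {
        System.out.println(listedCount() + " locales listed; ja listed: " + isListed(Locale.JAPANESE));
    }
}
```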
-
getAvailableULocales
public static ULocale[] getAvailableULocales()
Returns a list of locales for which BreakIterators can be used.
- Returns:
- An array of ULocales. All of the locales in the array can be used when creating a BreakIterator.
-
getLocale
public final ULocale getLocale(ULocale.Type type)
Returns the locale that was used to create this object, or null. This may differ from the locale requested at the time of this object's creation. For example, if an object is created for locale en_US_CALIFORNIA, the actual data may be drawn from en (the actual locale), and en_US may be the most specific locale that exists (the valid locale).
Note: The actual locale is returned correctly, but the valid locale is not, in most cases.
- Parameters:
type - type of information requested, either ULocale.VALID_LOCALE or ULocale.ACTUAL_LOCALE.
- Returns:
- the information specified by type, or null if this object was not constructed from locale data.