Class BreakIterator
- All Implemented Interfaces:
Cloneable
public abstract class BreakIterator extends Object implements Cloneable
Instances of BreakIterator can be used, for
example, to break a piece of text into words, sentences, or logical
characters according to the conventions of some language or group of
languages. We provide four built-in types of BreakIterator:
- getSentenceInstance() returns a BreakIterator that locates boundaries between sentences. This is useful for triple-click selection, for example.
- getWordInstance() returns a BreakIterator that locates boundaries between words. This is useful for double-click selection or "find whole words" searches. This type of BreakIterator makes sure there is a boundary position at the beginning and end of each legal word (numbers count as words, too). Whitespace and punctuation are kept separate from real words.
- getLineInstance() returns a BreakIterator that locates positions where it is legal for a text editor to wrap lines. This is similar to word breaking, but not the same: punctuation and whitespace are generally kept with words (you don't want a line to start with whitespace, for example), and some special characters can force a position to be considered a line break position or prevent a position from being a line break position.
- getCharacterInstance() returns a BreakIterator that locates boundaries between logical characters. Because of the structure of the Unicode encoding, a logical character may be stored internally as more than one Unicode code point. (A with an umlaut may be stored as an a followed by a separate combining umlaut character, for example, but the user still thinks of it as one character.) This iterator allows various processes (especially text editors) to treat as characters the units of text that a user would think of as characters, rather than the units of text that the computer sees as "characters".
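As a rough illustration of how these factory methods differ, the sketch below splits one string with a word iterator and counts user-perceived characters with a character iterator. The class `BoundaryKinds` and its helpers are hypothetical names introduced here, and exact boundaries can vary by locale and platform.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BoundaryKinds {
    /** Splits text into the pieces delimited by consecutive boundaries. */
    static List<String> segments(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    /** Counts logical characters rather than UTF-16 code units. */
    static int logicalLength(String text) {
        return segments(BreakIterator.getCharacterInstance(Locale.US), text).size();
    }

    public static void main(String[] args) {
        // Word iteration also yields the whitespace between the two words.
        System.out.println(segments(BreakIterator.getWordInstance(Locale.US), "Hello world"));
        // "a" + combining diaeresis + "b" is three chars long.
        System.out.println(logicalLength("a\u0308b"));
    }
}
```

The pieces returned by `segments` always concatenate back to the original text, which is a convenient sanity check when experimenting with different iterator types.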
BreakIterator's interface follows an "iterator" model (hence
the name), meaning it has a concept of a "current position" and methods like
first(), last(), next(), and previous() that
update the current position. All BreakIterators uphold the following
invariants:
- The beginning and end of the text are always treated as boundary positions.
- The current position of the iterator is always a boundary position (random-access methods move the iterator to the nearest boundary position before or after the specified position, not to the specified position).
- DONE is used as a flag to indicate when iteration has stopped. DONE is only returned when the current position is the end of the text and the user calls next(), or when the current position is the beginning of the text and the user calls previous().
- Break positions are numbered by the positions of the characters that follow them. Thus, under normal circumstances, the position before the first character is 0, the position after the first character is 1, and the position after the last character is 1 plus the length of the string.
- The client can change the position of an iterator, or the text it analyzes, at will, but cannot change the behavior. A client that wants different behavior must instantiate a new iterator.
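The invariants listed above can be checked directly. The following is a minimal sketch, assuming a word iterator for `Locale.US`; the class `InvariantCheck` is a hypothetical name used only for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class InvariantCheck {
    /** Returns true if the start/end and DONE invariants hold for this text. */
    static boolean holds(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);

        boolean startsAtZero = it.first() == 0;
        boolean endsAtLength = it.last() == text.length();

        // The iterator now sits on the last boundary, so next() has nowhere to go.
        boolean doneAtEnd = it.next() == BreakIterator.DONE;

        // Likewise, previous() from the first boundary reports DONE.
        it.first();
        boolean doneAtStart = it.previous() == BreakIterator.DONE;

        return startsAtZero && endsAtLength && doneAtEnd && doneAtStart;
    }

    public static void main(String[] args) {
        System.out.println(holds("Hello world"));
    }
}
```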
BreakIterator accesses the text it analyzes through a
CharacterIterator, which makes it possible to use
BreakIterator to analyze text in any text-storage vehicle that provides a
CharacterIterator interface.
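One consequence is that a BreakIterator can be pointed at part of a string without copying it. The sketch below assumes a `StringCharacterIterator` restricted to a sub-range; `RangeScan` and `firstSegment` are hypothetical names, not part of the API.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.Locale;

final class RangeScan {
    /** Returns the first segment found inside text[begin, end), or "" if there is none. */
    static String firstSegment(String text, int begin, int end) {
        // Restrict analysis to [begin, end) without creating a substring.
        CharacterIterator range = new StringCharacterIterator(text, begin, end, begin);
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(range);

        // Boundaries are reported as indices into the full string.
        int start = it.first();
        int stop = it.next();
        return stop == BreakIterator.DONE ? "" : text.substring(start, stop);
    }

    public static void main(String[] args) {
        System.out.println(firstSegment("skip: one two", 6, 13));
    }
}
```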
Note: Some types of BreakIterator can take a long time to
create, and instances of BreakIterator are not currently cached by
the system. For optimal performance, keep instances of BreakIterator
around as long as it makes sense. For example, when word-wrapping a document,
don't create and destroy a new BreakIterator for each line. Create
one break iterator for the whole document (or whatever stretch of text you're
wrapping) and use it to do the whole job of wrapping the text.
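The advice above might look like this in practice: a simple greedy wrapper that creates one line-break iterator and walks it once over the whole text. This is only a sketch under stated assumptions: `LineWrap` and `wrap` are hypothetical names, width is measured in `char` units, trailing whitespace stays attached to the preceding word and counts toward the width, and a single unbreakable run longer than the width is emitted as-is.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LineWrap {
    /** Greedily wraps text at legal line-break positions using a single iterator. */
    static List<String> wrap(String text, int width) {
        BreakIterator lines = BreakIterator.getLineInstance(Locale.US);
        lines.setText(text);                 // one iterator for the whole text

        List<String> out = new ArrayList<>();
        int lineStart = lines.first();
        int lastBreak = lineStart;
        for (int brk = lines.next(); brk != BreakIterator.DONE; brk = lines.next()) {
            // If extending to brk would overflow, close the line at the previous break.
            if (brk - lineStart > width && lastBreak > lineStart) {
                out.add(text.substring(lineStart, lastBreak));
                lineStart = lastBreak;
            }
            lastBreak = brk;
        }
        if (lineStart < text.length()) {
            out.add(text.substring(lineStart)); // remaining text forms the last line
        }
        return out;
    }

    public static void main(String[] args) {
        wrap("aaa bbb ccc", 7).forEach(System.out::println);
    }
}
```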
Examples:
Creating and using text boundaries:
public static void main(String[] args) {
    if (args.length == 1) {
        String stringToExamine = args[0];
        // print each word in order
        BreakIterator boundary = BreakIterator.getWordInstance();
        boundary.setText(stringToExamine);
        printEachForward(boundary, stringToExamine);
        // print each sentence in reverse order
        boundary = BreakIterator.getSentenceInstance(Locale.US);
        boundary.setText(stringToExamine);
        printEachBackward(boundary, stringToExamine);
        printFirst(boundary, stringToExamine);
        printLast(boundary, stringToExamine);
    }
}
Print each element in order:
public static void printEachForward(BreakIterator boundary, String source) {
    int start = boundary.first();
    for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
        System.out.println(source.substring(start, end));
    }
}
Print each element in reverse order:
public static void printEachBackward(BreakIterator boundary, String source) {
    int end = boundary.last();
    for (int start = boundary.previous(); start != BreakIterator.DONE; end = start, start = boundary.previous()) {
        System.out.println(source.substring(start, end));
    }
}
Print the first element:
public static void printFirst(BreakIterator boundary, String source) {
    int start = boundary.first();
    int end = boundary.next();
    System.out.println(source.substring(start, end));
}
Print the last element:
public static void printLast(BreakIterator boundary, String source) {
    int end = boundary.last();
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}
Print the element at a specified position:
public static void printAt(BreakIterator boundary, int pos, String source) {
    int end = boundary.following(pos);
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}
Find the next word:
public static int nextWordStartAfter(int pos, String text) {
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(text);
    int last = wb.following(pos);
    int current = wb.next();
    while (current != BreakIterator.DONE) {
        // a segment counts as a word if it contains at least one letter
        for (int p = last; p < current; p++) {
            if (Character.isLetter(text.charAt(p))) {
                return last;
            }
        }
        last = current;
        current = wb.next();
    }
    return BreakIterator.DONE;
}
The iterator returned by BreakIterator.getWordInstance() is unique in
that the break positions it returns don't represent both the start and end of
the thing being iterated over. That is, a sentence-break iterator returns
breaks that each represent the end of one sentence and the beginning of the
next. With the word-break iterator, the characters between two boundaries
might be a word, or they might be the punctuation or whitespace between two
words. The above code uses a simple heuristic to determine which boundary is
the beginning of a word: If the characters between this boundary and the next
boundary include at least one letter (this can be an alphabetical letter, a
CJK ideograph, a Hangul syllable, a Kana character, etc.), then the text
between this boundary and the next is a word; otherwise, it's the material
between words.
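The same heuristic can be used to collect words rather than locate the next one. The sketch below is a variation, not the code from this page: it tests for letters or digits (since numbers count as words, as noted earlier), and `WordExtractor` is a hypothetical class name.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordExtractor {
    /** Returns only the segments that contain at least one letter or digit. */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Skip the whitespace and punctuation found between words.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! 42")); // prints [Hello, world, 42]
    }
}
```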
- See Also:
CharacterIterator
-
Field Summary
Fields
| Modifier and Type | Field | Description |
|---|---|---|
| static int | DONE | This constant is returned by iterate methods like previous() or next() if they have returned all valid boundaries. |
-
Constructor Summary
Constructors
| Modifier | Constructor | Description |
|---|---|---|
| protected | BreakIterator() | Default constructor, for use by subclasses. |
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| Object | clone() | Returns a copy of this iterator. |
| abstract int | current() | Returns this iterator's current position. |
| abstract int | first() | Sets this iterator's current position to the first boundary and returns that position. |
| abstract int | following(int offset) | Sets this iterator's current position to the first boundary following the given offset and returns that position. |
| static Locale[] | getAvailableLocales() | Returns an array of locales for which custom BreakIterator instances are available. |
| static BreakIterator | getCharacterInstance() | Returns a new instance of BreakIterator to iterate over characters using the user's default locale. |
| static BreakIterator | getCharacterInstance(Locale where) | Returns a new instance of BreakIterator to iterate over characters using the given locale. |
| static BreakIterator | getLineInstance() | Returns a new instance of BreakIterator to iterate over line breaks using the user's default locale. |
| static BreakIterator | getLineInstance(Locale where) | Returns a new instance of BreakIterator to iterate over line breaks using the given locale. |
| static BreakIterator | getSentenceInstance() | Returns a new instance of BreakIterator to iterate over sentence-breaks using the default locale. |
| static BreakIterator | getSentenceInstance(Locale where) | Returns a new instance of BreakIterator to iterate over sentence-breaks using the given locale. |
| abstract CharacterIterator | getText() | Returns a CharacterIterator which represents the text being analyzed. |
| static BreakIterator | getWordInstance() | Returns a new instance of BreakIterator to iterate over word-breaks using the default locale. |
| static BreakIterator | getWordInstance(Locale where) | Returns a new instance of BreakIterator to iterate over word-breaks using the given locale. |
| boolean | isBoundary(int offset) | Indicates whether the given offset is a boundary position. |
| abstract int | last() | Sets this iterator's current position to the last boundary and returns that position. |
| abstract int | next() | Sets this iterator's current position to the next boundary after the current position and returns that position. |
| abstract int | next(int n) | Moves this iterator n boundaries from the current position (backward if n is negative) and returns the new position. |
| int | preceding(int offset) | Returns the position of the last boundary preceding the given offset and sets the current position to the returned value, or returns DONE if the given offset specifies the starting position. |
| abstract int | previous() | Sets this iterator's current position to the previous boundary before the current position and returns that position. |
| void | setText(String newText) | Sets the new text string to be analyzed. The current position is reset to the beginning of this new string, and the old string is lost. |
| abstract void | setText(CharacterIterator newText) | Sets the new text to be analyzed by the given CharacterIterator. |
-
Field Details
-
DONE
public static final int DONE
This constant is returned by iterate methods like previous() or next() if they have returned all valid boundaries.
- See Also:
- Constant Field Values
-
-
Constructor Details
-
BreakIterator
protected BreakIterator()
Default constructor, for use by subclasses.
-
-
Method Details
-
getAvailableLocales
Returns an array of locales for which custom BreakIterator instances are available.
Note that Android does not support user-supplied locale service providers.
-
getCharacterInstance
Returns a new instance of BreakIterator to iterate over characters using the user's default locale. See "Be wary of the default locale".
- Returns:
- a new instance of BreakIterator using the default locale.
-
getCharacterInstance
Returns a new instance of BreakIterator to iterate over characters using the given locale.
- Parameters:
where - the given locale.
- Returns:
- a new instance of BreakIterator using the given locale.
-
getLineInstance
Returns a new instance of BreakIterator to iterate over line breaks using the user's default locale. See "Be wary of the default locale".
- Returns:
- a new instance of BreakIterator using the default locale.
-
getLineInstance
Returns a new instance of BreakIterator to iterate over line breaks using the given locale.
- Parameters:
where - the given locale.
- Returns:
- a new instance of BreakIterator using the given locale.
- Throws:
NullPointerException - if where is null.
-
getSentenceInstance
Returns a new instance of BreakIterator to iterate over sentence-breaks using the default locale. See "Be wary of the default locale".
- Returns:
- a new instance of BreakIterator using the default locale.
-
getSentenceInstance
Returns a new instance of BreakIterator to iterate over sentence-breaks using the given locale.
- Parameters:
where - the given locale.
- Returns:
- a new instance of BreakIterator using the given locale.
- Throws:
NullPointerException - if where is null.
-
getWordInstance
Returns a new instance of BreakIterator to iterate over word-breaks using the default locale. See "Be wary of the default locale".
- Returns:
- a new instance of BreakIterator using the default locale.
-
getWordInstance
Returns a new instance of BreakIterator to iterate over word-breaks using the given locale.
- Parameters:
where - the given locale.
- Returns:
- a new instance of BreakIterator using the given locale.
- Throws:
NullPointerException - if where is null.
-
isBoundary
public boolean isBoundary(int offset)
Indicates whether the given offset is a boundary position. If this method returns true, the current iteration position is set to the given position; if the function returns false, the current iteration position is set as though following(int) had been called.
- Parameters:
offset - the given offset to check.
- Returns:
true if the given offset is a boundary position; false otherwise.
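Because isBoundary(int) repositions the iterator in both cases, it can double as a "snap to boundary" operation. The following is a small sketch of that idea; `BoundarySnap` and `snapForward` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BoundarySnap {
    /** Returns offset if it is a boundary, otherwise the first boundary after it. */
    static int snapForward(BreakIterator it, int offset) {
        // isBoundary() moves the iterator whether it returns true or false,
        // so the answer can be read back from current().
        it.isBoundary(offset);
        return it.current();
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText("Hello world");
        System.out.println(snapForward(words, 2)); // inside "Hello"
        System.out.println(snapForward(words, 5)); // already on a boundary
    }
}
```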
-
preceding
public int preceding(int offset)
Returns the position of the last boundary preceding the given offset and sets the current position to the returned value, or returns DONE if the given offset specifies the starting position.
- Parameters:
offset - the given start position to be searched for.
- Returns:
- the position of the last boundary preceding the given offset.
- Throws:
IllegalArgumentException - if the offset is invalid.
-
setText
Sets the new text string to be analyzed. The current position is reset to the beginning of this new string, and the old string is lost.
- Parameters:
newText - the new text string to be analyzed.
-
current
public abstract int current()
Returns this iterator's current position.
- Returns:
- this iterator's current position.
-
first
public abstract int first()
Sets this iterator's current position to the first boundary and returns that position.
- Returns:
- the position of the first boundary.
-
following
public abstract int following(int offset)
Sets this iterator's current position to the first boundary following the given offset and returns that position. Returns DONE if there is no boundary after the given offset.
- Parameters:
offset - the given position to be searched for.
- Returns:
- the position of the first boundary following the given offset.
- Throws:
IllegalArgumentException - if the offset is invalid.
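Both following(int) and preceding(int) search strictly past the given offset, even when that offset is itself a boundary. The sketch below makes this visible; `Neighbors` is a hypothetical helper class.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class Neighbors {
    /** Returns {preceding(offset), following(offset)}; either entry may be DONE. */
    static int[] around(BreakIterator it, int offset) {
        int before = it.preceding(offset); // last boundary strictly before offset
        int after = it.following(offset);  // first boundary strictly after offset
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText("Hello world");
        // Offset 5 is the boundary between "Hello" and the space, yet neither
        // call returns 5 itself.
        int[] n = around(words, 5);
        System.out.println(n[0] + " " + n[1]);
    }
}
```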
-
getText
Returns a CharacterIterator which represents the text being analyzed. Please note that the returned value is probably the internal iterator used by this object. If the invoker wants to modify the status of the returned iterator, it is recommended to first create a clone of the iterator returned.
- Returns:
- a CharacterIterator which represents the text being analyzed.
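The clone-first recommendation could be followed as in this sketch, which reads the analyzed text back without touching the iterator the BreakIterator may be using internally. `TextCopy` and `analyzedText` are hypothetical names, and the loop uses the usual CharacterIterator idiom, which stops early if the text itself contains the character '\uFFFF'.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

final class TextCopy {
    /** Reads the analyzed text back without disturbing the break iterator's state. */
    static String analyzedText(BreakIterator it) {
        // Clone first: getText() may return the iterator used internally.
        CharacterIterator copy = (CharacterIterator) it.getText().clone();
        StringBuilder sb = new StringBuilder();
        for (char c = copy.first(); c != CharacterIterator.DONE; c = copy.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText("Hello world");
        System.out.println(analyzedText(words)); // prints Hello world
    }
}
```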
-
last
public abstract int last()
Sets this iterator's current position to the last boundary and returns that position.
- Returns:
- the position of the last boundary.
-
next
public abstract int next()
Sets this iterator's current position to the next boundary after the current position and returns that position. Returns DONE if no boundary was found after the current position.
- Returns:
- the position of the next boundary, or DONE if there is none.
-
next
public abstract int next(int n)
Moves this iterator n boundaries from the current position and returns the new position. A positive n moves forward and a negative n moves backward. Returns DONE if the move would go past either end of the text.
- Parameters:
n - the number of boundaries to move.
- Returns:
- the position of the boundary reached, or DONE if there is none.
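In OpenJDK and in ICU, the argument to next(int n) is a count of boundaries to step over from the current position, not a text offset. The sketch below relies on that behavior; `Skip` and `nthBoundary` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class Skip {
    /** Returns the boundary reached by stepping n boundaries from the start, or DONE. */
    static int nthBoundary(BreakIterator it, String text, int n) {
        it.setText(text);
        it.first();
        return it.next(n); // n is a step count, not an offset into the text
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        // "a b c" has word boundaries at 0, 1, 2, 3, 4 and 5.
        System.out.println(nthBoundary(words, "a b c", 2));
        System.out.println(nthBoundary(words, "a b c", 6) == BreakIterator.DONE);
    }
}
```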
-
previous
public abstract int previous()
Sets this iterator's current position to the previous boundary before the current position and returns that position. Returns DONE if no boundary was found before the current position.
- Returns:
- the position of the previous boundary, or DONE if there is none.
-
setText
Sets the new text to be analyzed by the given CharacterIterator. The position will be reset to the beginning of the new text, and other status information of this iterator will be kept.
- Parameters:
newText - the CharacterIterator referring to the text to be analyzed.
-
clone
Returns a copy of this iterator.
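A copy is handy for looking ahead without losing the current position. The following is a minimal sketch of that use, assuming the copy keeps its own position as it does in OpenJDK; `Lookahead` and `peekNext` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class Lookahead {
    /** Returns the boundary after the current one without moving the given iterator. */
    static int peekNext(BreakIterator it) {
        BreakIterator copy = (BreakIterator) it.clone();
        return copy.next(); // only the copy advances
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText("Hello world");
        words.first();
        System.out.println(peekNext(words));  // boundary after "Hello"
        System.out.println(words.current());  // original iterator has not moved
    }
}
```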
-