public class BidiUtils extends Object
| Modifier and Type | Method and Description |
|---|---|
| static Dir | estimateDirection(String str, boolean isHtml): Estimates the directionality of a string based on relative word counts, as detailed below. |
| static Dir | getExitDir(String str, boolean isHtml): Returns the directionality of the last character with strong directionality in the string, or Dir.NEUTRAL if none was encountered. |
| static boolean | isRtlLanguage(String locale): Returns whether a locale, given as a string in the ICU syntax, is RTL. |
public static boolean isRtlLanguage(String locale)
public static Dir getExitDir(String str, boolean isHtml)
str - the string to check
isHtml - whether str is HTML / HTML-escaped
public static Dir estimateDirection(String str, boolean isHtml)
The parts of the text embedded between LRE/RLE and the matching PDF are ignored, since the directionality in which the string as a whole is displayed will not affect their display anyway, and we want to base it on the remainder.
The parts of the text embedded between LRO/RLO and the matching PDF are considered LTR/RTL "words". This is primarily in order to treat "fake bidi" pseudolocalized text as RTL.
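The handling of explicit embeddings and overrides described above can be illustrated with a minimal sketch. It is not the library's implementation: the class `EmbeddingSketch`, its `Scan` result type, and the `scan` method are hypothetical names, and the sketch assumes that nested embeddings are simply skipped along with their enclosing segment.

```java
final class EmbeddingSketch {
    static final char LRE = '\u202A'; // LEFT-TO-RIGHT EMBEDDING
    static final char RLE = '\u202B'; // RIGHT-TO-LEFT EMBEDDING
    static final char PDF = '\u202C'; // POP DIRECTIONAL FORMATTING
    static final char LRO = '\u202D'; // LEFT-TO-RIGHT OVERRIDE
    static final char RLO = '\u202E'; // RIGHT-TO-LEFT OVERRIDE

    /** Hypothetical result: text outside embeddings/overrides, plus override "word" counts. */
    static final class Scan {
        final StringBuilder remainder = new StringBuilder();
        int ltrOverrideWords;
        int rtlOverrideWords;
    }

    static Scan scan(String str) {
        Scan result = new Scan();
        int depth = 0; // number of currently open LRE/RLE/LRO/RLO segments
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c == LRE || c == RLE || c == LRO || c == RLO) {
                // Only a top-level override counts as one LTR or RTL "word".
                if (depth == 0 && c == LRO) result.ltrOverrideWords++;
                if (depth == 0 && c == RLO) result.rtlOverrideWords++;
                depth++;
            } else if (c == PDF) {
                if (depth > 0) depth--; // an unmatched PDF is dropped
            } else if (depth == 0) {
                result.remainder.append(c); // text outside any embedding is kept
            }
        }
        return result;
    }
}
```

For a "fake bidi" pseudolocalized string wrapped entirely in RLO ... PDF, `scan` would report one RTL override word and an empty remainder, which is what lets the later counting step treat such text as RTL.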
The remaining parts of the text are divided into "words" on whitespace and, inside numbers, on neutral characters that break the LTR flow around them when used inside a number in an RTL context. (This is most of them, the primary exceptions being period, comma, NBSP and colon, i.e. bidi class CS not including slash, which a long-standing Microsoft bug treats as ES.)
Each word is assigned a type - LTR, RTL, URL, signed "European" number, unsigned "European" number, negative "Arabic" number, "Arabic" number with leading plus sign, and unsigned "Arabic" number - as follows:
- Words that start with "http[s]://" (possibly preceded by some neutrals) are URLs.
- Of the remaining words, those that contain any strongly directional characters are classified as LTR or RTL based on their first strongly directional character.
- Of the remaining words, those that contain any digits are classified as "European" or "Arabic" numbers based on the type of their first digit, and as signed or unsigned depending on whether the first digit is immediately preceded by a plus or minus sign (bidi class ES).
- The remaining words are classified as "neutral" and ignored.
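The classification rules above can be sketched with the bidi classes exposed by `java.lang.Character.getDirectionality`. This is an illustration under stated assumptions, not the library's code: `WordClassifierSketch`, `WordType`, and `classify` are hypothetical names, any ES character other than `+` is treated as a minus sign, and splitting the text into words is assumed to have happened already.

```java
final class WordClassifierSketch {
    /** Hypothetical names for the word types listed in the text. */
    enum WordType {
        URL, LTR, RTL, SIGNED_EN, UNSIGNED_EN, NEGATIVE_AN, PLUS_AN, UNSIGNED_AN, NEUTRAL
    }

    private static boolean isStrong(byte d) {
        return d == Character.DIRECTIONALITY_LEFT_TO_RIGHT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    private static boolean isDigit(byte d) {
        return d == Character.DIRECTIONALITY_EUROPEAN_NUMBER
            || d == Character.DIRECTIONALITY_ARABIC_NUMBER;
    }

    static WordType classify(String word) {
        int firstStrong = -1;
        int firstDigit = -1;
        for (int i = 0; i < word.length(); i += Character.charCount(word.codePointAt(i))) {
            byte d = Character.getDirectionality(word.codePointAt(i));
            if (firstStrong < 0 && isStrong(d)) firstStrong = i;
            if (firstDigit < 0 && isDigit(d)) firstDigit = i;
        }
        // URL: "http://" or "https://" preceded only by neutral characters.
        boolean onlyNeutralsBefore = firstDigit < 0 || firstDigit > firstStrong;
        if (firstStrong >= 0 && onlyNeutralsBefore
                && (word.startsWith("http://", firstStrong)
                    || word.startsWith("https://", firstStrong))) {
            return WordType.URL;
        }
        // Strongly directional word: decided by its first strong character.
        if (firstStrong >= 0) {
            byte d = Character.getDirectionality(word.codePointAt(firstStrong));
            return d == Character.DIRECTIONALITY_LEFT_TO_RIGHT ? WordType.LTR : WordType.RTL;
        }
        // Number: decided by its first digit and an immediately preceding sign (class ES).
        if (firstDigit >= 0) {
            boolean european = Character.getDirectionality(word.codePointAt(firstDigit))
                == Character.DIRECTIONALITY_EUROPEAN_NUMBER;
            char prev = firstDigit > 0 ? word.charAt(firstDigit - 1) : '\0';
            boolean signed = firstDigit > 0
                && Character.getDirectionality(prev)
                    == Character.DIRECTIONALITY_EUROPEAN_NUMBER_SEPARATOR;
            if (european) return signed ? WordType.SIGNED_EN : WordType.UNSIGNED_EN;
            if (!signed) return WordType.UNSIGNED_AN;
            // Simplification: any ES character other than '+' counts as a minus.
            return prev == '+' ? WordType.PLUS_AN : WordType.NEGATIVE_AN;
        }
        return WordType.NEUTRAL;
    }
}
```

Because Unicode assigns the extended Arabic-Indic (Persian) digits U+06F0 to U+06F9 the bidi class EN, this sketch places them in the "European" class without any special casing.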
Once the words of each type have been counted, the directionality is decided as follows:
If the number of RTL words exceeds 40% of the total of LTR and RTL words, return Dir.RTL. The threshold favors RTL because LTR words and phrases are used in RTL sentences more commonly than RTL in LTR.
Otherwise, if there are any LTR words, return Dir.LTR.
Otherwise (i.e. if there are no LTR or RTL words), if there are any URLs, or any signed "European" numbers, or an "Arabic" number with a leading plus sign, or more than one unsigned "European" number, return Dir.LTR. This ensures that the text is displayed LTR even in an RTL context, where things like "http://www.google.com/", "-5", "+١٢٣٤٢٣٤٦٧٨٩" (assuming it is intended as an international phone number, not an explicitly signed positive number, which is a very rare use case), "3 - 2 = 1", "(03) 123 4567", and, when preceded by an Arabic letter, even "123-4567" and "400×300" are displayed incorrectly. (Most neutrals, including those in the last two examples, are treated as ending a number in order to treat such expressions as containing more than one "European" number, and thus to force their display in LTR.) Considering a string containing more than one "European" number to be LTR also makes sense because math expressions in "European" digits need to be displayed LTR even in RTL languages. However, that probably isn't a very important consideration, since math expressions would usually also contain strongly LTR or RTL variable names that should set the overall directionality. Ranges like "$1 - $5" *are* an important consideration, but their preferred direction unfortunately varies among the RTL languages. Since LTR is preferred for ranges in Persian and Urdu, and is the more widespread usage in Hebrew, it seems like an OK choice. Please note that native Persian digits are included in the "European" class because the unary minus is preferred on the left in Persian, and Persian math is written LTR.
Otherwise, if there are any negative "Arabic" numbers, return Dir.RTL. This is because the unary minus is supposed to be displayed to the right of a number written in "Arabic" digits.
Otherwise, return Dir.NEUTRAL. This includes the common case of a single unsigned number, which will display correctly in either "European" or "Arabic" digits in either directionality, so it is best not to force it to either. It also includes an otherwise neutral string containing two or more "Arabic" numbers. We do *not* consider it to be RTL because it is unclear that it is important to display "Arabic"-digit math and ranges in RTL even in an LTR context, and because we have no idea how to handle phone numbers spelled (or, more likely, misspelled) in "Arabic" digits with non-CS separators. But it is quite clear that we do not want to force it to LTR.
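The decision steps above reduce to a short cascade over the per-type word counts. The following is a sketch of that cascade only, assuming the counts have already been computed. `DirectionDecisionSketch`, its local `Dir` enum, and `decide` are hypothetical names, not the library's API. The 40% test uses integer arithmetic so that exactly 40% does not count as exceeding the threshold.

```java
final class DirectionDecisionSketch {
    /** Local stand-in for the Dir type named in the text. */
    enum Dir { LTR, RTL, NEUTRAL }

    static Dir decide(int ltrWords, int rtlWords, int urls,
                      int signedEuropean, int unsignedEuropean,
                      int plusArabic, int negativeArabic) {
        // RTL words exceed 40% of all strongly directional words.
        // Equivalent to rtl / (ltr + rtl) > 0.4, without division or floating point.
        if (rtlWords * 10 > (ltrWords + rtlWords) * 4) {
            return Dir.RTL;
        }
        if (ltrWords > 0) {
            return Dir.LTR;
        }
        // No LTR or RTL words: URLs, signed "European" numbers, "Arabic" numbers
        // with a leading plus, or several unsigned "European" numbers force LTR.
        if (urls > 0 || signedEuropean > 0 || plusArabic > 0 || unsignedEuropean > 1) {
            return Dir.LTR;
        }
        // The unary minus belongs to the right of a number in "Arabic" digits.
        if (negativeArabic > 0) {
            return Dir.RTL;
        }
        // E.g. a single unsigned number, or only unsigned "Arabic" numbers.
        return Dir.NEUTRAL;
    }
}
```

Unsigned "Arabic" numbers are not a parameter because, per the rules above, their count never changes the outcome. With 3 LTR words and 2 RTL words the RTL share is exactly 40%, so the first test fails and the result is LTR; with 2 and 2 the share is 50% and the result is RTL.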
If isHtml is true, treats str as HTML, ignoring HTML tags and escapes that
would otherwise be mistaken for LTR text.
str - the string to check
isHtml - whether str is HTML / HTML-escaped
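To see why isHtml matters for estimateDirection, note that tag names such as "span" and escapes such as "&amp;nbsp;" consist of Latin letters and would otherwise count as LTR words. The sketch below shows one naive way to blank them out before analysis. It is an assumption-laden illustration, not how the library does it: `HtmlStripSketch` and `stripHtml` are hypothetical names, and the regular expression does not handle comments, scripts, or attribute values containing `>`.

```java
import java.util.regex.Pattern;

final class HtmlStripSketch {
    // Matches either an HTML tag or a named, decimal, or hexadecimal character escape.
    private static final Pattern TAG_OR_ESCAPE = Pattern.compile(
        "<[^>]*>|&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);");

    /** Replaces each tag or escape with a space so it cannot be counted as an LTR word. */
    static String stripHtml(String html) {
        return TAG_OR_ESCAPE.matcher(html).replaceAll(" ");
    }
}
```

Replacing with a space rather than the empty string keeps the words on either side of a tag separate, so they are still counted individually.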