Categories: credativ® Inside
Tags: i18n, ICU, ICU4X, Internationalisation, Internationalization, Rust, Unicode
If you regularly deal with code that needs extensive internationalization capabilities, chances are that you’ve used functionality from one of the ICU libraries before. Developed by the Unicode Consortium, ICU provides reliable, mature and extensive implementations of all kinds of tools for internationalization and Unicode text operations. Traditionally, there have been two implementations of ICU: ICU4C, implemented in C, and ICU4J, implemented in Java. These libraries have been the gold standard for correct Unicode text handling and i18n for many years. But for some years now, the Unicode Consortium has been developing ICU4X, a relatively new implementation in Rust.
The focus of ICU4X is on availability on many platforms and in many programming languages. While older implementations like ICU4C and ICU4J are very mature and currently provide more functionality than ICU4X, they have a very large code size and runtime memory footprint, making them infeasible to use in resource-constrained environments such as web browsers or on mobile and embedded devices. ICU4X takes care to reduce library code size and provides additional facilities to optimize the size of both the library itself and the Unicode data shipped with an application.
In this article, I will provide an overview of what ICU4X can do and how to do it. If you’ve worked with other ICU implementations before, many of these operations will probably feel familiar. If, on the other hand, you have never come into contact with ICU, this article should give you a good introduction to performing various Unicode text operations using ICU4X.
I will be showing a lot of code examples of how to use ICU4X in Rust. While it should not be strictly necessary to understand Rust to follow the basics of what’s going on, some familiarity with the language will definitely help with the finer details. If you’re unfamiliar with Rust and want to learn more, I recommend The Rust Book as an introduction.
Throughout the examples I’ll be referring to various functions and types from ICU4X without showing their signatures in full detail. Feel free to open the API documentation alongside this article to look up any of the functions and types mentioned.
If you want to run the examples yourself, I recommend setting up a Cargo project with the appropriate dependency:
$ cargo new --bin icu4x-blog
$ cd icu4x-blog
$ cargo add icu
This initializes a basic Cargo.toml and src/main.rs. Now you can paste any example code into the generated main function inside main.rs and run your examples using cargo run:
$ cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/icu4x-blog`
Hello, world!
For now this only outputs the default “Hello, world!” message generated by cargo. So let’s go on and add our own examples.
The behavior of some of ICU4X’s operations depends on linguistic or cultural context. When it does, we need to specify which linguistic or cultural background we want. We do this in the form of so-called locales. At its core, a locale is identified by a short string specifying a language and region. They usually look something like “en-US” for an American English locale, or “de-AT” for a German locale as spoken in Austria.
Locales don’t do anything exciting on their own. They only tell other operations how to behave, so construction is basically the only thing we do with locales. There are two main ways to construct a locale. We can use the locale! macro to construct and validate a static Locale like this:
let en_us = icu::locid::locale!("en-US");
println!("{en_us}");
Or we can try to parse a locale from a string at runtime:
let de_at = "de-AT".parse::<icu::locid::Locale>().unwrap();
println!("{de_at}");
Note that parsing a locale can fail on invalid inputs. This is encoded by the parse function returning a Result<Locale, ParserError>. In the example above we use unwrap to ignore the possibility of an error, which will panic on actual invalid inputs:
let invalid_locale = "Invalid!".parse::<icu::locid::Locale>().unwrap();
println!("{invalid_locale}");
Taken together, these examples will produce the following output:
$ cargo run
[...]
en-US
de-AT
thread 'main' panicked at src/main.rs:8:67:
called `Result::unwrap()` on an `Err` value: InvalidLanguage
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
In practical scenarios, you will probably want to detect locales dynamically at runtime via some standard mechanism provided by your operating system or other execution platform. Unfortunately, there is currently no standardized way to detect ICU4X locales from external sources, but progress toward implementing such a solution is tracked in this issue.
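Until such a mechanism exists, you have to bridge that gap yourself. Below is a rough sketch of one possible approach (my own, not an official ICU4X facility) that derives a locale from the POSIX LANG environment variable; the helper name and the parsing rules are assumptions for illustration:

fn locale_from_env() -> Option<icu::locid::Locale> {
    // LANG typically looks like "de_DE.UTF-8"; strip the encoding suffix and
    // replace the underscore to get a BCP 47-style tag like "de-DE".
    let lang = std::env::var("LANG").ok()?;
    let tag = lang.split('.').next()?.replace('_', "-");
    // Values like "C" or "POSIX" are not valid locales and simply yield None.
    tag.parse().ok()
}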
Now that we’ve looked at how to construct locales, let’s look at some operations that need locales to function.
The first operation we’re going to look at is collation. You’re probably familiar with the concept of comparing strings lexicographically. In Rust, the str type already implements the Ord trait, allowing easy lexicographic comparison and sorting of strings. However, not all languages and cultures agree on which order letters should be sorted in. As an example, in Germany the letter Ä is usually sorted right after A, while in Sweden the letter Ä is usually sorted after Z. Rust’s standard method of comparing strings does not take these regional differences into account. ICU4X provides us with collation functionality to compare and sort strings with these cultural differences in mind.
The first step for doing so is to create a Collator like this:
let de_de = icu::locid::locale!("de-DE");
let collator_de =
    icu::collator::Collator::try_new(&de_de.into(), Default::default()).unwrap();
The first parameter is the locale we want collation for. Or technically, it’s a DataLocale, because that’s what try_new wants. The difference doesn’t need to concern us too much right now; just know that we can convert a Locale to a DataLocale using .into(). The second parameter is a CollatorOptions structure, which we could use to specify more specific options for the collation. We won’t look at the specific options here and will instead just use the defaults, but check out the API documentation if you’re curious about what options you can specify. Finally, we unwrap the Collator, since creating it can fail in cases where no collation data for the given locale can be found. We’ll talk about this possibility later when discussing data handling.
Now that we have one collator for the de-DE locale, let’s build another for Swedish (sv-SE):
let sv_se = icu::locid::locale!("sv-SE");
let collator_sv =
    icu::collator::Collator::try_new(&sv_se.into(), Default::default()).unwrap();
Now that we have built some collators, let’s sort some strings with the standard Rust comparison and different locales to see the different results:
let mut strings = ["Abc", "Äbc", "ZYX"];
strings.sort_by(Ord::cmp);
println!("Rust default sorted: {strings:?}");
strings.sort_by(|a, b| collator_de.compare(a, b));
println!("Collated de-DE: {strings:?}");
strings.sort_by(|a, b| collator_sv.compare(a, b));
println!("Collated sv-SE: {strings:?}");
This produces the following output:
$ cargo run
[...]
Rust default sorted: ["Abc", "ZYX", "Äbc"]
Collated de-DE: ["Abc", "Äbc", "ZYX"]
Collated sv-SE: ["Abc", "ZYX", "Äbc"]
As predicted, the German collation sorted the strings differently than the Swedish collation. Incidentally, the default Rust order sorted these specific strings the same as the Swedish collation, but in practice you shouldn’t rely on coincidences like this and should always use the correct collation when sorting strings for display purposes.
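Collators are not limited to sorting whole slices. You can also compare two strings directly, which returns an Ordering. A quick sketch, assuming the two collators built above are still in scope:

use core::cmp::Ordering;

// In de-DE, Ä sorts right after A, so "Äbc" comes before "Zyx".
assert_eq!(collator_de.compare("Äbc", "Zyx"), Ordering::Less);
// In sv-SE, Ä sorts after Z, so the same comparison flips.
assert_eq!(collator_sv.compare("Äbc", "Zyx"), Ordering::Greater);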
Sometimes it’s easy to forget, but not all cultures use the same calendar system. And even different cultures sharing the same calendar system might use different formats to represent dates. ICU4X provides support for converting between the representations of different calendars and formatting them according to local conventions. Besides the Gregorian calendar, popular in most regions of the world, and the ISO calendar often used for technical purposes, many other calendars, such as the Japanese, Ethiopian or Indian calendars, are supported. However, no functionality for retrieving the current time is provided, so in real applications you will have to convert from some other representation first.
In the next example, we have a known date given as an ISO date that we want to display in some locale:
let iso_date = icu::calendar::Date::try_new_iso_date(1978, 3, 8).unwrap();
Next we create a DateFormatter:
let local_formatter = icu::datetime::DateFormatter::try_new_with_length(
    &icu::locid::locale!("th").into(),
    icu::datetime::options::length::Date::Medium,
)
.unwrap();
We will use the formatter to format dates into a locale-specific textual representation. During creation we get to pick a locale (or rather a DataLocale again). We’re picking the Thai locale, because unlike most of the world it uses the Buddhist calendar instead of the Gregorian calendar. We also get to pick a format length that gives us some control over the length of the date format. We use the medium length, which uses abbreviated month names if these are available in the locale, or numeric months otherwise.
Now we just format the date and print it:
let local_format = local_formatter.format(&iso_date.to_any()).unwrap();
println!("{local_format}");
Which gives us this output:
$ cargo run
[...]
8 มี.ค. 2521
If, like me, you’re not well versed in the Buddhist calendar and Thai month names, this probably won’t tell you much. But that’s exactly the point of using an i18n library like ICU4X. We can use general operations that do the correct thing for any supported locale, without having to understand the intricacies of every specific locale.
When calling format you have to be careful to pass a date that belongs to a calendar suitable for the locale of the formatter. Even though its parameter type suggests that a date from any calendar can be used, the operation only accepts dates from the ISO calendar or dates from the correct calendar for that locale (i.e. in this example we could also have passed a date already expressed in the Buddhist calendar). In this case, we used an ISO date, which is always accepted. If you have a date in an entirely different calendar, it gets more complicated: you would need to convert your date to the correct target calendar explicitly and then pass it to format. For this you obtain the needed calendar using AnyCalendar::new_for_locale and do the conversion using Date::to_calendar.
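To illustrate, here is a minimal sketch of that explicit conversion, assuming the constructors mentioned above behave as described. It converts our ISO date into the calendar preferred by the Thai locale before formatting it:

let locale = icu::locid::locale!("th");
// Obtain the calendar that this locale prefers (the Buddhist calendar for "th").
let calendar = icu::calendar::AnyCalendar::new_for_locale(&locale.clone().into());
// Convert the ISO date into that calendar before handing it to the formatter.
let converted = icu::calendar::Date::try_new_iso_date(1978, 3, 8)
    .unwrap()
    .to_calendar(calendar);
let formatter = icu::datetime::DateFormatter::try_new_with_length(
    &locale.into(),
    icu::datetime::options::length::Date::Medium,
)
.unwrap();
println!("{}", formatter.format(&converted).unwrap());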
Unicode text is represented as a sequence of numbers called code points. But not every code point has its own atomic meaning. Some sequences of code points combine into groups to represent more complex characters. Due to this complexity it is possible in many cases that different sequences of code points represent the same sequence of semantic characters. As an example, the letter Ä can be represented as the single code point U+00C4 or as the sequence of code points U+0041 U+0308 (an A followed by combining two dots above). This has implications when we want to compare strings for equality. Naively we might compare strings by checking whether all of their code points are equal. But that would mean that strings which compare as different, because they contain different code points, might actually contain semantically identical characters.
To deal with this situation, ICU4X gives us string normalization. The idea is as follows: before comparing strings to each other, we “normalize” each string. Normalization transforms the string into a normalized representation, thereby ensuring that all strings that are semantically equal also have the same normalized representation. This means that once we have normalized the strings we want to compare, we can simply compare the resulting strings by code point to determine whether the original strings were semantically the same.
Before we can perform this normalization, we need to understand that there are multiple forms of normalization. These forms are differentiated by two properties. On one axis, they can be composing or decomposing. On the other axis, they can be canonical or compatible.
Composed normalization forms ensure that the normalized form has as few code points as possible, e.g. for the letter Ä the single code point form would be used. Decomposed normalization, on the other hand, always chooses the representation with the most code points, e.g. for the letter Ä the two code point form would be used. With composed normalization we need less storage space to store the normalized form. However, composed normalization is also usually slower to perform than decomposed normalization, because internally composed normalization first has to run decomposed normalization and then compress the result. As a rule of thumb, it is usually recommended that composed normalization be used when the normalized strings are stored on disk or sent over the network, whereas decomposed normalization should be used when the normalized form is only used internally within an application.
Canonical normalization only considers different code point representations of the same characters to be equal. Compatible normalization goes a step further and considers characters that convey the same meaning, but differ in representation, to be equal. As an example, under compatible normalization the characters “2”, “²” and “②” are all considered equal, whereas under canonical normalization they are different. Compatible normalization can be appropriate when normalizing identifiers such as usernames to detect close-but-different lookalikes.
Taking all of this together, this gives us four different possible forms of normalization:
NFC: canonical, composing
NFD: canonical, decomposing
NFKC: compatible, composing
NFKD: compatible, decomposing
Once we have decided on a normalization form to use, actually performing the normalization is easy. Here’s an example using NFD normalization:
let string1 = "\u{00C4}";
let string2 = "\u{0041}\u{0308}";
let rust_equal = string1 == string2;
let normalizer = icu::normalizer::DecomposingNormalizer::new_nfd();
let normalized1 = normalizer.normalize(string1);
let normalized2 = normalizer.normalize(string2);
let normalized_equal = normalized1 == normalized2;
println!(
    "1: {string1}, 2: {string2}, rust equal: {rust_equal}, normalized equal: {normalized_equal}"
);
$ cargo run
[...]
1: Ä, 2: Ä, rust equal: false, normalized equal: true
As we can see, string1 and string2 look the same when printed, but the == operator doesn’t consider them equal. However, normalizing both strings and comparing the results does consider them equal.
NFKD normalization can be used by constructing the normalizer using DecomposingNormalizer::new_nfkd. NFC and NFKC are accessible using ComposingNormalizer::new_nfc and ComposingNormalizer::new_nfkc respectively.
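For completeness, here’s a brief sketch of the composing variant, reusing the two strings from above; under NFC both of them normalize to the single code point U+00C4:

let nfc = icu::normalizer::ComposingNormalizer::new_nfc();
// Both the precomposed and the decomposed spelling end up as "\u{00C4}".
assert_eq!(nfc.normalize("\u{0041}\u{0308}"), "\u{00C4}");
assert_eq!(nfc.normalize("\u{00C4}"), "\u{00C4}");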
When we look into Unicode texts, we’ll often find that they aren’t only made up of individual code points, but rather of larger constructs consisting of multiple code points, such as words or lines. When processing text, it is often necessary to recognize where the boundaries between these individual pieces are. In ICU4X this process is called segmentation, and it provides us with four different types of segments to recognize: graphemes, words, sentences, and lines. The process of segmenting is very similar for each one, but each of them also has its own quirks, so we’ll look at each of them in turn.
As previously mentioned, some code points combine with other code points, thereby gaining a different meaning than each code point would have individually. If we break a string apart between two combined code points, the code points can no longer combine and thus revert to their individual meanings. Here’s an example of such an unintentional change in meaning:
let string1 = "\u{61}\u{308}\u{6f}\u{308}\u{75}\u{308}";
let string2 = "stu";
println!("string1: {string1}, string2: {string2}");
let (split1, split2) = string1.split_at(4);
println!("split1: {split1}, split2: {split2}");
println!("combined: {string2}{split2}");
$ cargo run
[...]
string1: äöü, string2: stu
split1: äo, split2: ̈ü
combined: stüü
First, note that the output of split1 and split2 shows that what was previously an ö has now been split into an o and a loose pair of two dots. Even worse: when we combine string2 and split2 in a single output, the dots at the start of split2 combine with the last character of string2, forming an extra “ü” that was never intended to exist.
So how do we know where it is safe to split a string without altering the meaning of its contained characters? For this purpose, Unicode defines the concept of grapheme clusters: sequences of code points that have a single meaning together, but are unaffected by the meaning of code points around them. As long as we’re careful to split strings only on the boundaries between grapheme clusters, we can be sure not to inadvertently change the semantics of characters contained in the string. Similarly, when we build a user interface for text editing or text selection, we should be careful to present a single grapheme cluster to the user as a single unbreakable unit.
To find out where the boundaries between grapheme clusters are, ICU4X gives us the GraphemeClusterSegmenter. Let’s look at how it would have segmented our string from earlier:
let string = "\u{61}\u{308}\u{6f}\u{308}\u{75}\u{308}";
println!("string: {string}");
let grapheme_boundaries: Vec<usize> = icu::segmenter::GraphemeClusterSegmenter::new()
    .segment_str(string)
    .collect();
println!("grapheme boundaries: {grapheme_boundaries:?}");
$ cargo run
[...]
string: äöü
grapheme boundaries: [0, 3, 6, 9]
As we can see, the segment_str function returns an iterator over the indices where boundaries between grapheme clusters are located. Naturally, the first index is always 0 and the last index is always the end of the string. We can also see that the index 4, where we split our string in the last example, is not a boundary between grapheme clusters, which is why our split caused the change in meaning we observed. Had we instead split the string at index 3 or 6, we would not have had the same problem.
Sometimes it is helpful to separate a string into its individual words. For this purpose, we get the aptly named WordSegmenter. So let’s get right into it:
let string = "Hello world";
println!("string: {string}");
let word_boundaries: Vec<usize> = icu::segmenter::WordSegmenter::new_auto()
    .segment_str(string)
    .collect();
println!("word boundaries: {word_boundaries:?}");
$ cargo run
[...]
string: Hello world
word boundaries: [0, 5, 6, 11]
So far this is very similar to the GraphemeClusterSegmenter we’ve seen before. But what if we want the words themselves and not only their boundaries? We can just iterate over windows of two boundaries at a time and slice the original string:
let words: Vec<&str> = word_boundaries
    .windows(2)
    .map(|bounds| &string[bounds[0]..bounds[1]])
    .collect();
println!("words: {words:?}");
$ cargo run
[...]
words: ["Hello", " ", "world"]
This looks better. It gives us the two words we expect. It also gives us the white space between the words. If we do not want that, we can ask the WordSegmenter to tell us whether a given boundary comes after a real word or just some white space, and filter on that:
let word_boundaries: Vec<(usize, icu::segmenter::WordType)> =
    icu::segmenter::WordSegmenter::new_auto()
        .segment_str(string)
        .iter_with_word_type()
        .collect();
println!("word boundaries: {word_boundaries:?}");
let words: Vec<&str> = word_boundaries
    .windows(2)
    .filter_map(|bounds| {
        let (start, _) = bounds[0];
        let (end, word_type) = bounds[1];
        if word_type.is_word_like() {
            Some(&string[start..end])
        } else {
            None
        }
    })
    .collect();
println!("words: {words:?}");
$ cargo run
[...]
word boundaries: [(0, None), (5, Letter), (6, None), (11, Letter)]
words: ["Hello", "world"]
In case you were wondering why the constructor for WordSegmenter is called new_auto: there are multiple word segmentation algorithms to choose from. There are also new_dictionary and new_lstm, and not every algorithm works equally well for every writing system. new_auto is a good choice in the general case, as it automatically picks a suitable implementation based on the actual data encountered in the string.
If we want to break strings into sentences, the SentenceSegmenter does just that. There’s not much special about it, so let’s get right into it:
let string = "here is a sentence. This is another sentence.";
println!("string: {string}");
let sentence_boundaries: Vec<usize> = icu::segmenter::SentenceSegmenter::new()
    .segment_str(string)
    .collect();
println!("sentence boundaries: {sentence_boundaries:?}");
let sentences: Vec<&str> = sentence_boundaries
    .windows(2)
    .map(|bounds| &string[bounds[0]..bounds[1]])
    .collect();
println!("sentences: {sentences:?}");
$ cargo run
[...]
string: here is a sentence. This is another sentence.
sentence boundaries: [0, 20, 45]
sentences: ["here is a sentence. ", "This is another sentence."]
No surprises there, so let’s move on.
The LineSegmenter identifies boundaries at which strings may be split into multiple lines. Let’s see an example:
let string = "The first line.\nThe\u{a0}second line.";
println!("string: {string}");
let line_boundaries: Vec<usize> = icu::segmenter::LineSegmenter::new_auto()
    .segment_str(string)
    .collect();
println!("line boundaries: {line_boundaries:?}");
let lines: Vec<&str> = line_boundaries
    .windows(2)
    .map(|bounds| &string[bounds[0]..bounds[1]])
    .collect();
println!("lines: {lines:?}");
$ cargo run
[...]
string: The first line.
The second line.
line boundaries: [0, 4, 10, 16, 28, 33]
lines: ["The ", "first ", "line.\n", "The\u{a0}second ", "line."]
This gives us more individual “lines” than we might have anticipated. That’s because the LineSegmenter not only gives us boundaries at line breaks already contained in the string, but also boundaries at places where a soft line break could be placed. This can be very useful if you want to wrap a long string over multiple lines.
If you want to differentiate whether a given boundary is a hard line break contained in the string or just an opportunity for an optional line break, you can inspect the code point right before the boundary using icu::properties::maps::line_break.
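To make the wrapping use case concrete, here’s a rough sketch of my own (not taken from the ICU4X documentation) that wraps a string to a maximum width by breaking at the last reported opportunity that still fits. For simplicity it measures width in bytes, which is only a crude stand-in for display width:

fn wrap(string: &str, max_bytes: usize) -> Vec<&str> {
    let boundaries: Vec<usize> = icu::segmenter::LineSegmenter::new_auto()
        .segment_str(string)
        .collect();
    let mut lines = Vec::new();
    let mut start = 0; // byte index where the current output line begins
    let mut last = 0; // last break opportunity seen so far
    for &boundary in &boundaries[1..] {
        // If the next segment would overflow the line, break at the last opportunity.
        if boundary - start > max_bytes && last > start {
            lines.push(&string[start..last]);
            start = last;
        }
        last = boundary;
    }
    lines.push(&string[start..]);
    lines
}

Called on the string from above with a maximum of 12 bytes, this yields ["The first ", "line.\n", "The\u{a0}second ", "line."].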
When processing Unicode text, there is sometimes the need to transform letters between lower case and upper case. ICU4X gives us various tools for this, so let’s look at each of them.
Lowercasing and uppercasing are very simple operations on the surface. They do similar things to Rust’s built-in str::to_lowercase and str::to_uppercase methods. So let’s see why ICU4X has separate support for them:
let string = "AaBbIıİi";
println!("string: {string}");
let locale = icu::locid::locale!("de-DE");
let cm = icu::casemap::CaseMapper::new();
let lower = cm.lowercase_to_string(string, &locale.id);
let upper = cm.uppercase_to_string(string, &locale.id);
println!("lower: {lower}, upper: {upper}");
$ cargo run
[...]
string: AaBbIıİi
lower: aabbiıi̇i, upper: AABBIIİI
So far this looks like the familiar lowercasing and uppercasing operations from most languages’ standard libraries. But note that we had to provide locale.id to run these operations. The twist here is that the rules for lowercasing and uppercasing can vary by language, which is reflected in ICU4X’s variants of these operations. Observe how the result changes if we use the locale tr-TR instead of de-DE:
$ cargo run
[...]
string: AaBbIıİi
lower: aabbııii, upper: AABBIIİİ
With ICU4X we don’t need to know the details of how different lowercase and uppercase letters pair up in different languages. As long as we pass the correct locale, ICU4X will do the correct thing.
Note, however, that the uppercasing and lowercasing operations are only intended for display purposes. If you want to compare strings case-insensitively, you want case folding instead, which we will look at later.
Titlecasing is the process of uppercasing the first letter of a segment and lowercasing all other characters. So for example, if we wanted to titlecase every word in a string, we would first use a WordSegmenter to extract every word and then use a TitlecaseMapper to perform the titlecasing on every word.
let string = "abc DŽ 'twas words and more wORDS";
println!("string: {string}");
let locale = icu::locid::locale!("de-DE");
let cm = icu::casemap::TitlecaseMapper::new();
let word_segments: Vec<usize> = icu::segmenter::WordSegmenter::new_auto()
    .segment_str(string)
    .collect();
let titlecased: String = word_segments
    .windows(2)
    .map(|bounds| {
        let word = &string[bounds[0]..bounds[1]];
        cm.titlecase_segment_to_string(word, &locale.id, Default::default())
    })
    .collect();
println!("titlecased: {titlecased}");
$ cargo run
[...]
string: abc DŽ 'twas words and more wORDS
titlecased: Abc Dž 'Twas Words And More Words
Again we had to provide &locale.id to specify which language-specific rules to obey during the case transformation. Additionally, we can pass other options as a third parameter. Here we’ve used the default options, but feel free to check out the API documentation to see what other options are supported.
Note how DŽ was transformed to Dž, even though it is a single letter whose regular uppercase form is DŽ. This is because each character has separate uppercase and titlecase forms, which just happen to be the same for most Latin characters. Also note that 'twas was transformed to 'Twas. This is because the TitlecaseMapper titlecases the first letter in a word and skips over non-letter characters at the start of the word when doing so.
Sometimes we want to tell whether two strings are equal while ignoring differences in casing. Traditionally this has been done by transforming both strings to lower case or upper case to eliminate differences in casing and comparing those strings. With Unicode strings, for some characters simple lowercasing or uppercasing isn’t enough to eliminate all differences in casing. As an example, the German letter ß uppercases to SS, but there’s also an uppercase version of ß: ẞ, which uppercases to itself, but lowercases to a regular ß. To consistently eliminate all casing differences, we need to map SS, ß, and ẞ all to the same output character. Luckily for us, ICU4X gives us the case folding operation, which promises to do just that. Let’s see it in action:
let string = "SSßẞ";
println!("string: {string}");
let locale = icu::locid::locale!("de-DE");
let cm = icu::casemap::CaseMapper::new();
let upper = cm.uppercase_to_string(string, &locale.id);
let lower = cm.lowercase_to_string(string, &locale.id);
let folded = cm.fold_string(string);
println!("upper: {upper}, lower: {lower}, folded: {folded}");
$ cargo run
[...]
string: SSßẞ
upper: SSSSẞ, lower: ssßß, folded: ssssss
As we see, in the folded string all the different versions of ß have been consistently turned into ss, which successfully eliminates all casing differences. It also means that a single ß would be considered equal to a lowercase ss, which we might not have considered equal otherwise. This is a kind of ambiguity that is hard to avoid when comparing strings case-insensitively.
Note that we didn’t have to specify any locale or language for the case folding operation. This is because case folding is often used for identifiers that are supposed to behave identically regardless of the linguistic context they’re used in. The case folding operation tries to use rules that work well across most languages. However, they don’t work perfectly for Turkic languages. To deal with this, there’s an alternative case folding operation, fold_turkic_string, just for Turkic languages. In most cases you’ll probably want to use the general folding operation, unless you’re really sure you need the special behavior for Turkic languages.
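As a quick illustration of the difference (assuming the CaseMapper from the examples above), the two operations disagree on how to fold the letter I:

let cm = icu::casemap::CaseMapper::new();
println!("default: {}", cm.fold_string("I")); // i
println!("turkic: {}", cm.fold_turkic_string("I")); // ı (dotless i)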
Given the case folding operation, we could implement a function to compare two strings case-insensitively like this:
fn equal_ci(a: &str, b: &str) -> bool {
    let cm = icu::casemap::CaseMapper::new();
    cm.fold_string(a) == cm.fold_string(b)
}
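For instance, with this helper “Straße” and “STRASSE” compare as equal, since both fold to “strasse”:

assert!(equal_ci("Straße", "STRASSE"));
assert!(!equal_ci("Straße", "Strasser"));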
So far we’ve looked at various operations that work correctly in a vast number of locales over strings made up of a huge number of valid code points. On the surface, these operations were relatively easy to use, and most of the time we only needed to specify our input and a desired locale to get the correct result. However, in the background ICU4X needs a lot of data about different locales and Unicode characters to do the right thing in every situation. So far, we never had to concern ourselves with this data at all.
So where does ICU4X get all this data from? In the default configuration we’ve been using so far, the data is shipped as part of the library and compiled directly into our application executable. This has the benefit that we don’t need to worry about shipping the data along with the binary and getting access to it at runtime, as the data is always included in the binary. But it comes at the cost of sometimes dramatically increased binary sizes. Since data for a large number of locales is included by default, we’re talking about tens of megabytes of data being included in the binary.
Since ICU4X is designed to run even in minimalist environments, such as embedded devices, forcing this increased binary size on every application would be unacceptable. Instead, ICU4X provides multiple ways to access the relevant data. Besides using the included default set of data, you can also generate your own set of data using icu4x-datagen. This allows you to reduce the included data from the start, either by limiting the number of locales to include or by limiting the functionalities supported by the data. Furthermore, you have the choice between compiling this data directly into your application binary or putting it into separate data files that your application then parses at runtime.
Reducing the set of available runtime data of course comes with the benefit of reducing the data size you need to ship with your application. On the other hand, it has the drawback of reducing the set of operations you can successfully run at runtime. Each bit of data you remove can make some operation fail if no data is available to perform that operation with the requested locale. As with many other things, reducing the data size has obvious benefits, but it is always a tradeoff. In the examples above we usually used unwrap to ignore the possibility of errors, but in a real application you’ll probably want more sophisticated error handling, like falling back to some non-failing behavior or at least reporting the error to the user.
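To give one idea of what such a fallback could look like, here is a small sketch of my own (not an official ICU4X recommendation) that falls back to Rust’s default code point ordering when no collation data is available for the requested locale:

fn sort_for_locale(strings: &mut [&str], locale: icu::locid::Locale) {
    match icu::collator::Collator::try_new(&locale.into(), Default::default()) {
        // Data for the locale is available: use proper collation.
        Ok(collator) => strings.sort_by(|a, b| collator.compare(a, b)),
        // No collation data available: fall back to plain code point order.
        Err(_) => strings.sort(),
    }
}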
I’ll avoid going through all the available options in detail and instead refer you to ICU4X’s official tutorial on data management. It should explain all the supported ways to make the required data available to your application.
I hope this has given you a satisfactory overview of what ICU4X can do. As we have seen, a lot of functionality works well out of the box. In other areas functionality is still lacking. For example, I’ve mentioned earlier that there’s currently no comfortable way to detect the user’s preferred locale in a standard way from the execution environment. Another area where ICU4X currently lags behind its C and Java counterparts is translation support. ICU4C and ICU4J provide capabilities for formatting localized messages using MessageFormat, which ICU4X still lacks. Similarly, ICU4X doesn’t currently seem to have functionality to deal with resource bundles.
Even though ICU4X doesn’t yet have all the functionality you might expect of it, overall it seems like a good choice for those cases where it already brings all the required functionality. Given some more time, we may even see more and more of the missing functionality land in ICU4X.
About the author
Senior Consultant
Sven Bartscher has been working at credativ since 2017 in the Operations Solutions team, where he is involved in the development and maintenance of the Open Security Filter, among other things. He also works as a Debian Developer on the free Debian operating system.