Localization (i18n): notes and planning #72

jgarzik · 2024-04-23T14:11:19Z

Introduction

Soliciting discussion over the localization (i18n) strategy for this project.

Goals

Goal 1: Localize everything

The goal is complete localization of all messages visible to the user, within the bounds of POSIX compliance:

All util messages, collation sequences, charsets and other util i18n details
All --help messages and related output (clap crate)
All OS error messages (io::Result)
It is believed that some minimum set of strings are required to be English/POSIX permanently. Minimize this POSIX-only set as much as possible.

Goal 2: Encourage UTF-8

To be forward-looking, this project looks for opportunities to

Drop support for non-UTF-8 strings
Default to UTF-8 charsets and strings

This project should aggressively interpret the POSIX standards in terms of UTF-8 support, and look for opportunities to create default-UTF-8 operating modes, with a fallback mode that is "POSIX-ly correct."

Implementation strategies

Current strategy

The current strategy is:

Use the gettext crate and mark strings with gettext(). This provides a starting point for per-util coding, and at least gets us started on the road to i18n.
Each util sets the charset as follows:

    bind_textdomain_codeset(PROJECT_NAME, "UTF-8")?;

Improvements to our i18n

At present, OS error messages and --help are not translated at all, and need a project-wide strategy.

Also, one idea that is aligned with the gencat util is to generate catgets message catalogs and abandon gettext. This works because catgets exists on all modern platforms.

See issue #65 for util-related tasks.

Feedback and thoughts are requested. We want to give users the best i18n support possible.


kellpossible · 2024-09-14T00:23:54Z

posixutils-rs Localization Proposal

Requirements after offline discussion with @jgarzik:

1. Prefer a single strings output file, or a single output file per supported language.
2. Need to have a strategy for:
   a) app strings
   b) OS error strings: all the errno error codes, e.g. "no such file or directory" upon open.
   c) --help strings.
3. It would be nice, but is not required, if app and help strings are embedded within each util's source code and extracted with cargo xtr, a la the gettext crate.
4. Must use POSIX-standard environment variables such as NLSPATH (described in each man page) or LC_COLLATE (sort order).
5. A single message file for posixutils as a whole. One file per util creates a distribution nightmare, and also eliminates the strong possibility of sharing strings and translations across utils.
6. Must not impact the normal build + test process in a large and negative way (e.g. slowing down each build by 10 minutes would be negative). Developer productivity and developer throughput are goals.

BOTH of the following implementation strategies are valid:
(1) extract strings from each .rs source file
(2) the "IBM approach": assign unique numbers to each and every error message, whether app-specific or generic, and maintain a posixutils global set of strings (and their translations)
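As a rough illustration of strategy (2), the sketch below keys every message by a project-wide number and falls back to built-in English text when a translation is missing. The names MsgId and Catalog, the specific numbers, and the sample French string are all hypothetical; nothing like this exists in posixutils-rs today.

```rust
use std::collections::HashMap;

/// Hypothetical project-wide message numbers. Each value is assigned once and
/// never reused, so translations stay valid when the English wording changes.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
#[repr(u32)]
pub enum MsgId {
    NoSuchFile = 1,
    PermissionDenied = 2,
    InvalidOption = 3,
}

impl MsgId {
    /// Built-in English text, used when no translation is loaded.
    pub fn default_text(self) -> &'static str {
        match self {
            MsgId::NoSuchFile => "no such file or directory",
            MsgId::PermissionDenied => "permission denied",
            MsgId::InvalidOption => "invalid option",
        }
    }
}

/// One language's translations, keyed by message number.
pub struct Catalog {
    messages: HashMap<u32, String>,
}

impl Catalog {
    pub fn new() -> Self {
        Catalog { messages: HashMap::new() }
    }

    /// Register the translation for one message number.
    pub fn insert(&mut self, id: MsgId, text: &str) {
        self.messages.insert(id as u32, text.to_string());
    }

    /// Return the translation, or the English default if it is missing.
    pub fn get(&self, id: MsgId) -> &str {
        self.messages
            .get(&(id as u32))
            .map(String::as_str)
            .unwrap_or(id.default_text())
    }
}
```

A util would build or load one Catalog at startup and call catalog.get(MsgId::NoSuchFile) wherever it prints that message. The fallback mirrors how catgets takes a default string argument.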

Considerations/Research

2b) OS Error Strings

According to https://stackoverflow.com/questions/43019882/does-libc-show-international-error-messages it seems like we should be able to use https://docs.rs/libc/latest/libc/fn.strerror.html to access the system-provided localized messages for libc errno codes. However, there might be some safety concerns around this function that will warrant more investigation if we use it: https://users.rust-lang.org/t/unsafe-and-strerror-impossible-to-fix/90804
Most likely posixutils-rs will prefer Rust standard library functions over libc. Perhaps we can use https://doc.rust-lang.org/stable/std/io/struct.Error.html#method.raw_os_error

It seems like it should be possible: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=52ae229dd19b14298c25d488516d3750. Here is the output using the French locale installed on my system:

It seems like libc::setlocale() needs to be called manually using the contents of the relevant environment variables; the locale for libc is not automatically detected from the environment.

Interestingly, the std::io::Error implementation appears to defer to libc::strerror() for its output, so there will be no need for us to call libc::strerror() manually: we can simply use the std::fmt::Display output, provided we configure libc using libc::setlocale(). It almost feels like an oversight on the part of Rust’s standard library not to have this functionality enabled by default on platforms that support it.
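A minimal sketch of that idea follows, assuming a Unix-like target. To stay free of external crates it declares setlocale by hand and hard-codes LC_ALL, which is an assumption: the value differs between platforms, and real code would more likely use libc::setlocale and libc::LC_ALL from the libc crate. The helper names init_locale_from_env and os_error_text are hypothetical.

```rust
use std::ffi::CString;
use std::io;
use std::os::raw::{c_char, c_int};

// Assumption: LC_ALL is 6 on Linux (glibc and musl) and 0 on macOS and the
// BSDs. The libc crate exports the correct constant for each target.
#[cfg(target_os = "linux")]
const LC_ALL: c_int = 6;
#[cfg(not(target_os = "linux"))]
const LC_ALL: c_int = 0;

extern "C" {
    // Provided by the C library that std already links against.
    fn setlocale(category: c_int, locale: *const c_char) -> *mut c_char;
}

/// Ask the C library to pick its locale from LC_ALL / LC_* / LANG.
/// Returns false if the requested locale is not available.
pub fn init_locale_from_env() -> bool {
    // An empty locale name means "use the environment".
    let empty = CString::new("").expect("no interior NUL");
    // SAFETY: intended to be called once at the top of main(), before other
    // threads exist; setlocale itself is not thread-safe.
    let ret = unsafe { setlocale(LC_ALL, empty.as_ptr()) };
    !ret.is_null()
}

/// Render an errno value through std's Display impl for io::Error.
pub fn os_error_text(errno: i32) -> String {
    io::Error::from_raw_os_error(errno).to_string()
}
```

Calling init_locale_from_env() first and then os_error_text(2) should give the ENOENT message in whatever language the environment selects, if that locale and its message catalogs are installed. std appends the "(os error 2)" suffix itself, and that part is not translated.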

2c) --help messages

With the exception of m4, all the binaries in posixutils-rs currently use clap’s derive macro to generate help messages. In https://docs.rs/clap/latest/clap/_derive/index.html#command-attributes the about attribute accepts an expression, about [= <expr>], which presumably gets passed to https://docs.rs/clap/latest/clap/struct.Command.html#method.about. To use this with localized messages, the messages would need to have a 'static lifetime. To get around this we could use some kind of static thread-local or global cell containing a mutex, which loads the appropriate locale at the start of main() based on the current system settings, before clap runs.

Another issue is that using the about attribute disables the parsing of help messages from the doc attribute provided by Rust’s documentation comments. If we want these struct fields documented in the standard way for Rust, we will end up with duplicated text. We could probably get around this by creating a custom derive macro which wraps clap’s, using the documentation comments for these fields in the localization system and providing the necessary about too, which would satisfy the optional requirement 3.

There is also a question about the source of truth: if each application shares messages in a single registry as per requirement 5, then we may end up with duplicates that need to be detected. Perhaps it is better to refer to localizations only by a message id and keep them separate from code documentation comments; this makes the implementation a lot simpler too. Whatever we decide to do here should follow on from the more general decision about how to localize strings in the application for requirement 2a.
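One way the global-cell idea could look is sketched below using std::sync::OnceLock, which avoids an explicit mutex because the table is written once and then only read. The names HELP, init_help and help_text, the message id "cat-about", and the hard-coded strings are all hypothetical placeholders; a real version would load the table from the message catalog rather than embed it.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Process-wide table of localized help strings, filled once at startup.
static HELP: OnceLock<HashMap<&'static str, String>> = OnceLock::new();

/// Load help strings for `locale`. Call at the top of main(), before the
/// argument parser is built. Later calls are ignored.
pub fn init_help(locale: &str) {
    let mut table = HashMap::new();
    // Placeholder data standing in for a real catalog lookup.
    if locale.starts_with("fr") {
        table.insert("cat-about", "concaténer et afficher des fichiers".to_string());
    } else {
        table.insert("cat-about", "concatenate and print files".to_string());
    }
    let _ = HELP.set(table); // the first caller wins
}

/// Look up a help string. Because HELP is a static, the returned reference
/// is 'static. Falls back to the id itself if nothing is loaded.
pub fn help_text(id: &'static str) -> &'static str {
    HELP.get()
        .and_then(|table| table.get(id))
        .map(String::as_str)
        .unwrap_or(id)
}
```

Since help_text returns a &'static str, its result could be handed to an API that wants 'static data, such as the about expression discussed above. Whether clap's derive evaluates that expression late enough for init_help to have run is something that would need checking.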

Message Format

While GNU gettext would be an obvious choice of message format, some arguably better and more modern alternatives do exist. fluent puts forward some good arguments for where its choices differ: https://github.com/projectfluent/fluent/wiki/Fluent-vs-gettext In summary of this article:

Using a message identifier unique from the source string makes the process of updating source language simpler (without invalidating translations and relying on fuzzy matching), and enables different translations for messages which may have identical English words but which in some translations may result in different words based on the context in which they are used, without the burden being on the developer to recognise these situations. It also enables message re-use/composition via references.
Gettext’s support for grammatical rules is very limited.
String formatting and message arguments are an afterthought for Gettext.
Fluent uses a single data format, Gettext uses 3.
Rust’s compiler messages are translated using fluent https://rustc-dev-guide.rust-lang.org/diagnostics/translation.html, a significant endorsement.
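The first point in the summary above can be made concrete with a small sketch. It uses plain HashMaps rather than either library, and the keys and French strings are illustrative only. gettext can also disambiguate such cases with a message context (msgctxt), but the developer has to notice the ambiguity and add it.

```rust
use std::collections::HashMap;

/// Keyed by the English source string, as in the simplest gettext usage:
/// two distinct meanings of "Open" share one key.
pub fn by_source_string() -> HashMap<&'static str, &'static str> {
    let mut m = HashMap::new();
    m.insert("Open", "Ouvrir"); // verb: "open the file"
    m.insert("Open", "Ouvert"); // adjective: "the file is open"; overwrites the verb
    m
}

/// Keyed by an identifier that is independent of the English text, so each
/// meaning keeps its own translation.
pub fn by_message_id() -> HashMap<&'static str, &'static str> {
    let mut m = HashMap::new();
    m.insert("action-open", "Ouvrir");
    m.insert("status-open", "Ouvert");
    m
}
```

With source-string keys only one of the two translations survives. With ids the English wording can also change later without invalidating either entry.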

Further comparisons between systems and crate implementations:

There are two Rust crates for gettext: https://docs.rs/gettext/latest/gettext/ (which claims to be a work-in-progress, pure Rust implementation, and hasn’t been updated in 5 years), and https://docs.rs/gettext-rs/latest/gettextrs/, bindings to GNU gettext (bringing the associated downsides of using a C dependency in a Rust project).
i18n-embed + i18n-embed-fl provides some additional functionality on top of a basic fluent setup:
- Static checking of message keys and format arguments using a procedural macro. This is a big one, avoiding common runtime mistakes. There is another library which does this for fluent using codegen.
- A standardized layout for localization resources that enables building more static analysis tooling for cargo-i18n (see "Add support for validation of fluent resources to cargo-i18n command line tool", kellpossible/cargo-i18n#31). It’s actually possible to benefit from these without using i18n-embed, simply by using the i18n.toml config file with cargo-i18n.
- Some alternatives that provide similar functionality:
  https://github.com/zaytsev/fluent-static takes an alternative approach to i18n-embed-fl and uses codegen instead of a proc macro. This provides code completion and type signatures for messages as functions (there is an open issue to implement this for i18n-embed-fl: "Consider a code generation type safe version of i18n-embed-fl", kellpossible/cargo-i18n#73).
  https://github.com/MathieuTricoire/l10n
Burden on compilation:
- gettext adds 9 additional crate dependencies, an additional 0.05s to build time.
- gettext-rs adds 6 crate dependencies, additional 144s to build time (if static build), or 0.8s if using gettext-system feature for dynamic linking.
- fluent brings in an additional 15 crate dependencies, additional 0.7s to build time.
- i18n-embed + i18n-embed-fl brings an additional 55 dependencies, additional 4.5s to build, I have some ideas for how to bring this down considerably (Reduce number of dependencies kellpossible/cargo-i18n#131).
An important consideration for which message format to use is its support in localization tooling. Because posixutils-rs is by its nature a technical tool it’s not unreasonable to expect translators working on the project would be familiar and fine working with plain text files and a git repository in order to make contributions to the project.
https://github.com/baptiste0928/rosetta Another alternative based on a json message serialization format and custom string formatting, with code generation for static type checking.
A custom message format could be constructed using a serialization format like TOML (for example via https://github.com/dtolnay/basic-toml), with message formatting/arguments handled by something like https://lib.rs/crates/minijinja, if serde were an acceptable dependency. However, currently none of our tools depend on serde. Considering that strings are sprinkled throughout the code, and that the serde derive macro has a reputation for increasing compile times when used extensively, we probably want a solution that doesn’t rely on it.

Proposal

This proposal is that we use the fluent localization system instead of gettext. For a minimal setup it could potentially even have a lower overhead, and it has none of the licensing concerns that LGPL gettext raises for systems that must build it statically. It seems like an obvious choice after considering the tradeoffs. If localization is to be taken seriously, the features that fluent provides over simpler ad-hoc systems with basic message formatting are very important.

The next decision is what to use for the scaffolding around fluent. Messages must be loaded from disk, bundles must be configured according to the user’s requested locale, ideally some form of static checking should be employed in order to help prevent mundane runtime errors. If keeping dependencies to a bare minimum is a high priority then we could gradually implement this ourselves from scratch. If however there is a desire to share this functionality with the rust community at large, then I’d propose to use i18n-embed-fl and cargo-i18n and upstream any changes which may be required in order to make it fit the requirements of this project. I’m the maintainer for those projects so I’d be very happy to take on this responsibility if that’s the direction we decide to go with.

jgarzik · 2024-09-25T19:14:51Z

Related recent unsolicited work: #254

fox0 · 2024-10-15T13:48:04Z

#340 (comment)

kellpossible · 2024-10-17T07:40:27Z

@jgarzik raised some interesting points regarding this:

The POSIX.2024 spec adds CLI utilities gettext, xgettext, etc.
These utils directly reference gettext-related functions.

With this in mind, it seems that gettext, despite its shortcomings, is fully endorsed by the POSIX.2024 spec, and therefore, in the spirit of implementing that spec, it would be best to use gettext for the localisation of these utilities. If we will be implementing the gettext and xgettext CLI utilities, then it makes sense to dogfood them on this project.

For gettext library functions we have two pre-existing options:

gettext-rs: bindings to GNU gettext
- Option to dynamically or statically link. To comply with its license, I gather we would have to dynamically link with it, which seems like something we'd probably rather avoid if possible.
gettext: a rewrite of the gettext library functions in Rust
- MIT licensed.
- Doesn't seem very actively maintained; we could fork it if necessary.

There is no default message formatting function within these gettext libraries. One option could be https://github.com/woboq/tr, which is designed to be used with the xtr command, itself designed to improve on xgettext by fixing its problems parsing Rust syntax. It appears to currently support only gettext-rs, but it seems likely they would accept a PR to add support for the gettext crate. If we went that route, however, we would lose the option to dogfood our own implementation of xgettext (which will need to support C), one of the reasons for using the gettext system in this project in the first place.

I'd propose that we start by creating our own implementation of xgettext which can additionally parse Rust syntax. We would probably need to start this from scratch, as xtr is AGPL licensed. We would then need to choose a string formatting option; we could just use the tr macro (MIT licensed) to at least share some of the work and not reinvent the wheel.
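To give a feel for the extraction step, here is a deliberately naive sketch that collects the string-literal arguments of gettext("...") calls and prints them as a .pot-style template. The function names extract_msgids and to_pot are hypothetical. A real xgettext would need a proper Rust lexer: this version does not skip comments, ignores raw strings and macros, and also matches any identifier that merely ends in gettext, such as ngettext.

```rust
/// Collect the string literal arguments of `gettext("...")` calls, in order
/// of first appearance and without duplicates.
pub fn extract_msgids(source: &str) -> Vec<String> {
    const CALL: &str = "gettext(";
    let mut found = Vec::new();
    let mut rest = source;
    while let Some(pos) = rest.find(CALL) {
        // Always advance past this match so the loop terminates.
        rest = &rest[pos + CALL.len()..];
        let mut chars = rest.trim_start().chars();
        if chars.next() != Some('"') {
            continue; // argument is not a plain string literal
        }
        let mut literal = String::new();
        let mut closed = false;
        while let Some(c) = chars.next() {
            match c {
                '"' => {
                    closed = true;
                    break;
                }
                '\\' => {
                    // Keep the escape sequence verbatim.
                    literal.push('\\');
                    if let Some(next) = chars.next() {
                        literal.push(next);
                    }
                }
                other => literal.push(other),
            }
        }
        if closed && !found.contains(&literal) {
            found.push(literal);
        }
    }
    found
}

/// Render the collected strings as a minimal .pot template body.
pub fn to_pot(msgids: &[String]) -> String {
    msgids
        .iter()
        .map(|id| format!("msgid \"{}\"\nmsgstr \"\"\n", id))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Running extract_msgids over a source file and passing the result to to_pot yields one msgid/msgstr pair per distinct string, which is the rough shape of what xgettext emits (minus the header and source-location comments).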

kellpossible · 2024-10-20T23:09:31Z

Pretty interesting related reading in this answer (for setting locale): https://unix.stackexchange.com/a/149129

jgarzik · 2024-10-21T00:29:58Z

> Pretty interesting related reading in this answer (for setting locale): https://unix.stackexchange.com/a/149129

Yep; that is also a good checklist for posixutils. We need to be responsive to these LC_xxx, and are not yet there.
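For reference, the precedence among these variables can be sketched as below: LC_ALL overrides the per-category LC_xxx variable, which overrides LANG, and empty values count as unset. The function names are hypothetical, and falling back to "C" when nothing is set is an assumption (POSIX leaves that default implementation-defined).

```rust
/// Resolve the locale for one category (for example "LC_MESSAGES") using the
/// POSIX precedence: LC_ALL, then the category variable, then LANG.
/// `lookup` abstracts the environment so the logic can be tested.
pub fn effective_locale<F>(category: &str, lookup: F) -> String
where
    F: Fn(&str) -> Option<String>,
{
    ["LC_ALL", category, "LANG"]
        .iter()
        .filter_map(|name| lookup(*name))
        .find(|value| !value.is_empty()) // empty means "not set"
        .unwrap_or_else(|| "C".to_string()) // assumed default
}

/// Convenience wrapper that reads the real process environment.
pub fn effective_locale_from_env(category: &str) -> String {
    effective_locale(category, |name| std::env::var(name).ok())
}
```

A util could call effective_locale_from_env("LC_COLLATE") to decide its sort order, or "LC_MESSAGES" to pick a message catalog.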

jgarzik added enhancement New feature or request i18n labels Apr 23, 2024

jgarzik mentioned this issue Sep 27, 2024

Add makefile script for locales #259

Closed

kellpossible mentioned this issue Oct 21, 2024

Draft: Gettext Localization System #351

Draft

3 tasks
