-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Text/Unicode oriented streams #57
Changes from all commits
cd0df6b
4196769
a539852
ff6283a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
- Start Date: 2014-04-15 | ||
- RFC PR #: | ||
- Rust Issue #: | ||
|
||
# Summary | ||
|
||
Add `TextReader` and `TextWriter` traits to `std::io` for Unicode text-oriented streams, | ||
like `Reader` and `Writer` are for byte-oriented streams. | ||
The API design of text-oriented streams guarantees well-formed Unicode scalar values (characters), | ||
so that there is no need to deal with e.g. errors caused by invalid UTF-8 in an input byte sequence. | ||
|
||
|
||
# Motivation | ||
|
||
When dealing with a potentially large amount of data, | ||
we prefer doing so incrementally rather than having the data set | ||
and all of its intermediate representations entirely in memory. | ||
This is why `Reader` and `Writer` were added. | ||
|
||
Additionally, experience in other programming language has taught us of | ||
[the Unicode sandwich](http://nedbatchelder.com/text/unipain.html): | ||
when dealing with text, the best practice is to handle Unicode only internally | ||
(in Rust: `char`, `str` and `StrBuf`; as opposed to `u8` and `[u8]`), | ||
and convert to or from bytes at the program’s boundaries, when doing I/O. | ||
Byte-oriented streams are good, but we also need text-oriented streams. | ||
|
||
For example, JSON is defined in terms of Unicode code points. | ||
Encoding these code points to UTF-8 for transmission is completely orthogonal | ||
to JSON itself. | ||
Our `serialize::json` module could be based on text streams, | ||
and avoid [the redundant UTF-8 valitiy check](https://github.com/mozilla/rust/blob/30e373390f1a2f74e78bf9ca9c8ca68451f3511a/src/libserialize/json.rs#L329) | ||
that’s involved when getting a `~str` from a byte stream. | ||
|
||
[rust-encoding](https://github.com/lifthrasiir/rust-encoding) | ||
will provide wrappers to "convert" between byte streams and text streams. | ||
For example, one that takes a `Writer`, an encoding, and an error handling behavior, | ||
and provides a `TextWriter`. | ||
|
||
Eventually, we could open a file directly in text mode with a given encoding | ||
and obtain a text stream. | ||
|
||
|
||
# Detailed design | ||
|
||
|
||
```rust | ||
/// A minimal implementation only needs `write_str`. | ||
/// However, a writer that is not based on UTF-8 may prefer | ||
/// to override `write_char` as their "most fundamental" method, | ||
/// and implement `write_str` with: | ||
/// | ||
/// | ||
/// fn write_str(&mut self, buf: &str) -> IoResult<()> { | ||
/// for c in buf.chars { | ||
/// try!(write_char(c)) | ||
/// } | ||
/// Ok(()) | ||
/// } | ||
pub trait TextWriter { | ||
fn write_str(&mut self, buf: &str) -> IoResult<()>; | ||
|
||
// These are similar to Writer, but based on `write_str` instead of `write`. | ||
fn write_char(&mut self, c: char) -> IoResult<()> { ... } | ||
fn write_line(&mut self, s: &str) -> IoResult<()> { ... } | ||
fn write_uint(&mut self, n: uint) -> IoResult<()> { ... } | ||
fn write_int(&mut self, n: int) -> IoResult<()> { ... } | ||
|
||
// These are similar to Writer | ||
fn flush(&mut self) -> IoResult<()> { ... } | ||
fn by_ref<'a>(&'a mut self) -> RefWriter<'a, Self> { ... } | ||
} | ||
|
||
impl<'a, W: TextWriter> TextWriter for RefWriter<'a, W> { ... } | ||
``` | ||
|
||
Other than `write_char`, the set of default methods is just an idea. | ||
|
||
If and when [#7771](https://github.com/mozilla/rust/issues/7771) is implemented, | ||
`write_str` can have a default implementation based on `write_char` | ||
with `#[requires(one_of(write_str, write_char)]` on the trait. | ||
|
||
|
||
|
||
```rust | ||
pub trait TextReader { | ||
// XXX See "Unresolved questions" below. | ||
fn read(&mut self, buf: &mut StrBuf, max_bytes: uint) -> IoResult<uint>; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would be nice if Also, is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I’m not very happy with the design of |
||
|
||
// These are similar to Reader | ||
fn read_to_end(&mut self) -> IoResult<~str> { ... } | ||
fn bytes<'r>(&'r mut self) -> Bytes<'r, Self> { ... } | ||
fn by_ref<'a>(&'a mut self) -> RefReader<'a, Self> { ... } | ||
|
||
// These are similar to Buffer | ||
fn read_line(&mut self) -> IoResult<~str> { ... } | ||
fn lines<'r>(&'r mut self) -> Lines<'r, Self> { ... } | ||
fn read_until<C: CharEq>(&mut self, char: C) -> IoResult<~str> { ... } | ||
fn read_char(&mut self) -> IoResult<char> { ... } | ||
fn chars<'r>(&'r mut self) -> Chars<'r, Self> { ... } | ||
} | ||
|
||
impl<'a, R: TextReader> TextReader for RefReader<'a, R> { ... } | ||
``` | ||
|
||
The set of default methods here is just an idea. | ||
|
||
|
||
# Alternatives | ||
|
||
* Let rust-encoding define `TextReader` and `TextWriter` itself and revisit later. | ||
* We may want `TextReader` to be closer to `std::io::Buffer` (which requires `Reader`) rather than just `Reader` | ||
|
||
|
||
# Unresolved questions | ||
|
||
* `fn read(&mut self, buf: &mut StrBuf, max_bytes: uint) -> IoResult<uint>;` | ||
is proposed as the most fundamental method of `TextReader`. | ||
Is this the right design? See discussion in this RFC’s pull request comments. | ||
* Which of these things should have text-oriented equivalents? | ||
The `Buffer`, `Seek`, and `Stream` traits, | ||
their buffered wrapper implementations, | ||
the readers and writers in `std::io::util`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would (perhaps naively) expect that this would be the fundamental method. Why would
write_str
be it instead?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It really could be either. See the paragraph below about rust-lang/rust#7771