RFC: Text/Unicode oriented streams #57

Closed
122 changes: 122 additions & 0 deletions active/0000-text-streams.md
@@ -0,0 +1,122 @@
- Start Date: 2014-04-15
- RFC PR #:
- Rust Issue #:

# Summary

Add `TextReader` and `TextWriter` traits to `std::io` for Unicode text-oriented streams,
just as `Reader` and `Writer` serve byte-oriented streams.
The API design of text-oriented streams guarantees well-formed Unicode scalar values (characters),
so that there is no need to deal with e.g. errors caused by invalid UTF-8 in an input byte sequence.


# Motivation

When dealing with a potentially large amount of data,
we prefer doing so incrementally rather than having the data set
and all of its intermediate representations entirely in memory.
This is why `Reader` and `Writer` were added.

Additionally, experience in other programming languages has taught us about
[the Unicode sandwich](http://nedbatchelder.com/text/unipain.html):
when dealing with text, the best practice is to handle Unicode only internally
(in Rust: `char`, `str` and `StrBuf`; as opposed to `u8` and `[u8]`),
and convert to or from bytes at the program’s boundaries, when doing I/O.
Byte-oriented streams are good, but we also need text-oriented streams.
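
As a rough illustration of the sandwich, the sketch below decodes bytes once at the input boundary, works only on `str` and `char` internally, and encodes again on the way out. It is written in present-day Rust rather than the dialect used elsewhere in this RFC, and the function names `decode`, `shout`, and `encode` are hypothetical, chosen only for this example.

```rust
use std::str::Utf8Error;

/// Input boundary: the only place where invalid UTF-8 can surface.
pub fn decode(input: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(input)
}

/// Internal processing: sees only well-formed text (`str` and `char`).
pub fn shout(text: &str) -> String {
    text.chars().flat_map(|c| c.to_uppercase()).collect()
}

/// Output boundary: convert text back to bytes (UTF-8 here).
pub fn encode(text: &str) -> Vec<u8> {
    text.as_bytes().to_vec()
}
```

Only `decode` can fail because of malformed input; everything between the two boundaries can assume valid Unicode.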

For example, JSON is defined in terms of Unicode code points.
Encoding these code points to UTF-8 for transmission is completely orthogonal
to JSON itself.
Our `serialize::json` module could be based on text streams,
and avoid [the redundant UTF-8 validity check](https://github.com/mozilla/rust/blob/30e373390f1a2f74e78bf9ca9c8ca68451f3511a/src/libserialize/json.rs#L329)
that’s involved when getting a `~str` from a byte stream.

[rust-encoding](https://github.com/lifthrasiir/rust-encoding)
will provide wrappers to "convert" between byte streams and text streams.
For example, one that takes a `Writer`, an encoding, and an error handling behavior,
and provides a `TextWriter`.
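
A minimal sketch of what such a wrapper might look like, assuming a one-method `TextWriter` and a Latin-1 (ISO-8859-1) encoding. This is not rust-encoding's actual API: `Latin1Writer` and `OnUnencodable` are hypothetical names, and the code uses present-day Rust (`std::io::Write`, `io::Result`) in place of the 2014 `Writer` and `IoResult`.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the proposed trait, reduced to one method.
pub trait TextWriter {
    fn write_str(&mut self, buf: &str) -> io::Result<()>;
}

/// Hypothetical policy for characters the encoding cannot represent.
pub enum OnUnencodable {
    /// Return an error.
    Fail,
    /// Substitute the given byte.
    Replace(u8),
}

/// Wraps a byte-oriented writer and encodes text as Latin-1.
pub struct Latin1Writer<W: Write> {
    inner: W,
    policy: OnUnencodable,
}

impl<W: Write> Latin1Writer<W> {
    pub fn new(inner: W, policy: OnUnencodable) -> Self {
        Latin1Writer { inner, policy }
    }

    /// Gives back the wrapped byte-oriented writer.
    pub fn into_inner(self) -> W {
        self.inner
    }
}

impl<W: Write> TextWriter for Latin1Writer<W> {
    fn write_str(&mut self, buf: &str) -> io::Result<()> {
        for c in buf.chars() {
            let code = c as u32;
            // Latin-1 covers exactly U+0000 through U+00FF, one byte each.
            let byte = if code <= 0xFF {
                code as u8
            } else {
                match self.policy {
                    OnUnencodable::Replace(b) => b,
                    OnUnencodable::Fail => {
                        return Err(io::Error::new(
                            io::ErrorKind::InvalidInput,
                            "character not representable in Latin-1",
                        ));
                    }
                }
            };
            self.inner.write_all(&[byte])?;
        }
        Ok(())
    }
}
```

Callers only ever hand over `&str`, so encoding errors are confined to the wrapper and governed by the policy chosen at construction time.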

Eventually, we could open a file directly in text mode with a given encoding
and obtain a text stream.


# Detailed design


```rust
/// A minimal implementation only needs `write_str`.
/// However, a writer that is not based on UTF-8 may prefer
/// to override `write_char` as its "most fundamental" method,
/// and implement `write_str` with:
///
///     fn write_str(&mut self, buf: &str) -> IoResult<()> {
///         for c in buf.chars() {
///             try!(self.write_char(c));
///         }
///         Ok(())
///     }
pub trait TextWriter {
    fn write_str(&mut self, buf: &str) -> IoResult<()>;

    // These are similar to Writer, but based on `write_str` instead of `write`.
    fn write_char(&mut self, c: char) -> IoResult<()> { ... }
    // Review comment (Member): I would (perhaps naively) expect that `write_char`
    // would be the fundamental method. Why would `write_str` be it instead?
    //
    // Reply (Contributor Author): It really could be either. See the paragraph
    // below about rust-lang/rust#7771.

    fn write_line(&mut self, s: &str) -> IoResult<()> { ... }
    fn write_uint(&mut self, n: uint) -> IoResult<()> { ... }
    fn write_int(&mut self, n: int) -> IoResult<()> { ... }

    // These are similar to Writer
    fn flush(&mut self) -> IoResult<()> { ... }
    fn by_ref<'a>(&'a mut self) -> RefWriter<'a, Self> { ... }
}

impl<'a, W: TextWriter> TextWriter for RefWriter<'a, W> { ... }
```

Other than `write_char`, the set of default methods is just an idea.

If and when [#7771](https://github.com/mozilla/rust/issues/7771) is implemented,
`write_str` can have a default implementation based on `write_char`
with `#[requires(one_of(write_str, write_char))]` on the trait.
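
To make the two choices of "most fundamental" method concrete, here is a sketch in present-day Rust (using `io::Result` and `String` instead of `IoResult` and `StrBuf`). The trait is a reduced, hypothetical version of the proposal, and `StringWriter` and `Utf16Writer` are illustrative names only. `write_char` gets a default built on `write_str`, which a writer that is not based on UTF-8 overrides.

```rust
use std::io;

/// Reduced, hypothetical version of the proposed trait.
pub trait TextWriter {
    fn write_str(&mut self, buf: &str) -> io::Result<()>;

    /// Default: encode the character as UTF-8 on the stack and forward it.
    fn write_char(&mut self, c: char) -> io::Result<()> {
        let mut utf8 = [0u8; 4];
        self.write_str(c.encode_utf8(&mut utf8))
    }
}

/// UTF-8 based writer: implementing `write_str` is enough.
pub struct StringWriter(pub String);

impl TextWriter for StringWriter {
    fn write_str(&mut self, buf: &str) -> io::Result<()> {
        self.0.push_str(buf);
        Ok(())
    }
}

/// Writer that is not UTF-8 based: `write_char` does the real work,
/// and `write_str` is the loop from the doc comment on the trait.
pub struct Utf16Writer(pub Vec<u16>);

impl TextWriter for Utf16Writer {
    fn write_str(&mut self, buf: &str) -> io::Result<()> {
        for c in buf.chars() {
            self.write_char(c)?;
        }
        Ok(())
    }

    fn write_char(&mut self, c: char) -> io::Result<()> {
        let mut units = [0u16; 2];
        self.0.extend_from_slice(c.encode_utf16(&mut units));
        Ok(())
    }
}
```

If `write_str` also had a default based on `write_char`, an implementation that overrode neither would recurse forever; that is the mistake an attribute like the one proposed in #7771 would catch at compile time.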



```rust
pub trait TextReader {
    // XXX See "Unresolved questions" below.
    fn read(&mut self, buf: &mut StrBuf, max_bytes: uint) -> IoResult<uint>;
    // Review comment: It would be nice if `read` did not necessarily have to write
    // to a `StrBuf` (and incur the related heap allocations at some point). It could
    // be parameterized over some trait that can append `&str`s (or over trait
    // objects, if `TextReader` will be largely used as a trait object). It could even
    // be a `TextWriter`, although that could be a broader trait than is necessary.
    //
    // Also, is `max_bytes` the maximum number of UTF-8 bytes that can be read, or the
    // maximum number of bytes from the underlying encoded data?
    //
    // Reply (Contributor Author): `max_bytes` is meant to be the number of UTF-8
    // bytes added to `buf`, so that you could pre-allocate with
    // `StrBuf::reserve_additional` before calling `read`.
    //
    // I’m not very happy with the design of `TextReader::read` here, but I don’t
    // know what else would be better. Suggestions welcome.

    // These are similar to Reader
    fn read_to_end(&mut self) -> IoResult<~str> { ... }
    fn bytes<'r>(&'r mut self) -> Bytes<'r, Self> { ... }
    fn by_ref<'a>(&'a mut self) -> RefReader<'a, Self> { ... }

    // These are similar to Buffer
    fn read_line(&mut self) -> IoResult<~str> { ... }
    fn lines<'r>(&'r mut self) -> Lines<'r, Self> { ... }
    fn read_until<C: CharEq>(&mut self, char: C) -> IoResult<~str> { ... }
    fn read_char(&mut self) -> IoResult<char> { ... }
    fn chars<'r>(&'r mut self) -> Chars<'r, Self> { ... }
}

impl<'a, R: TextReader> TextReader for RefReader<'a, R> { ... }
```

The set of default methods here is just an idea.
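
One possible reading of the `read` contract, sketched in present-day Rust (`String` for `StrBuf`, `usize` for `uint`, `io::Result` for `IoResult`). It assumes that `max_bytes` bounds the UTF-8 bytes appended to `buf`, that a character is never split, and that `Ok(0)` signals end of input; none of these choices is settled by this RFC. `StrReader` is a hypothetical in-memory reader used only for illustration.

```rust
use std::io;

/// Reduced, hypothetical version of the proposed trait.
pub trait TextReader {
    /// Appends at most `max_bytes` UTF-8 bytes to `buf` and returns how
    /// many were appended. `Ok(0)` means end of input in this sketch.
    fn read(&mut self, buf: &mut String, max_bytes: usize) -> io::Result<usize>;
}

/// Text reader over an in-memory string slice.
pub struct StrReader<'a> {
    rest: &'a str,
}

impl<'a> StrReader<'a> {
    pub fn new(text: &'a str) -> Self {
        StrReader { rest: text }
    }
}

impl<'a> TextReader for StrReader<'a> {
    fn read(&mut self, buf: &mut String, max_bytes: usize) -> io::Result<usize> {
        if self.rest.is_empty() {
            return Ok(0);
        }
        // Back up to the nearest character boundary so that no scalar
        // value is ever split across two calls.
        let mut end = max_bytes.min(self.rest.len());
        while !self.rest.is_char_boundary(end) {
            end -= 1;
        }
        if end == 0 {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "max_bytes is smaller than the next character",
            ));
        }
        let (head, tail) = self.rest.split_at(end);
        buf.push_str(head);
        self.rest = tail;
        Ok(end)
    }
}
```

Under this reading a caller can call `buf.reserve(max_bytes)` before `read` and know the call will not need to grow the buffer, which is the pre-allocation use case mentioned in the review discussion.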


# Alternatives

* Let rust-encoding define `TextReader` and `TextWriter` itself and revisit later.
* We may want `TextReader` to be closer to `std::io::Buffer` (which requires `Reader`) rather than just `Reader`.


# Unresolved questions

* `fn read(&mut self, buf: &mut StrBuf, max_bytes: uint) -> IoResult<uint>;`
is proposed as the most fundamental method of `TextReader`.
Is this the right design? See discussion in this RFC’s pull request comments.
* Which of these things should have text-oriented equivalents?
The `Buffer`, `Seek`, and `Stream` traits,
their buffered wrapper implementations,
the readers and writers in `std::io::util`.