collections: Stabilize String #17438

alexcrichton · 2014-09-22T15:59:13Z

Rationale

When dealing with strings, many functions deal with either a char (unicode
codepoint) or a byte (utf-8 encoding related). Method names are inconsistent
about whether they contain "byte", "char", or neither. There are also open
issues proposing to rename all methods to reflect that they operate on utf8
encodings or bytes (e.g. utf8_len() or byte_len()).

The current state of String seems to largely be what is desired, so this PR
proposes the following rationale for methods dealing with bytes or characters:

When constructing a string, the input encoding must be mentioned (e.g.
from_utf8). This makes it clear what exactly the input type is expected to be
in terms of encoding.

When a method operates on anything related to an index within the string
such as length, capacity, position, etc, the method implicitly operates on
bytes. It is an understood fact that String is a utf-8 encoded string, and
burdening all methods with "bytes" would be redundant.

When a method operates on the contents of a string, such as push() or pop(),
then "char" is the default type. A String can loosely be thought of as being a
collection of unicode codepoints, but not all collection-related operations
make sense because some can be woefully inefficient.
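The three rules above can be sketched with a short example. This is only an illustration written against the current std::string::String, whose signatures may differ from the 2014 API this PR touched; the helper name `describe` is hypothetical and not part of the PR.

```rust
/// Hypothetical helper: returns (length in bytes, number of chars).
fn describe(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // Rule 1: constructors name their input encoding (from_utf8).
    let bytes = vec![0xE2, 0x82, 0xAC]; // utf-8 encoding of U+20AC
    let mut s = String::from_utf8(bytes).expect("input is valid utf-8");

    // Rule 2: index-like quantities (len, capacity) are implicitly bytes.
    assert_eq!(s.len(), 3);
    assert!(s.capacity() >= 3);
    assert_eq!(describe(&s), (3, 1));

    // Rule 3: operations on the contents (push, pop) default to char.
    s.push('a');
    assert_eq!(s.len(), 4);
    assert_eq!(s.pop(), Some('a'));
    assert_eq!(s.pop(), Some('\u{20AC}'));
    assert!(s.is_empty());
}
```

Note that len() reports 3 for a single-character string, which is exactly the "implicitly bytes" convention described above.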

Method stabilization

The following methods have been marked #[stable]

The String type itself
String::new
String::with_capacity
String::from_utf16_lossy
String::into_bytes
String::as_bytes
String::len
String::clear
String::as_slice

The following methods have been marked #[unstable]

String::from_utf8 - The error type in the returned Result may change to
provide a nicer message when it's unwrap()'d
String::from_utf8_lossy - The returned MaybeOwned type still needs
stabilization
String::from_utf16 - The return type may change to become a Result which
includes more contextual information like where the error
occurred.
String::from_chars - This is equivalent to iter().collect(), but currently not
as ergonomic.
String::from_char - This method is the equivalent of Vec::from_elem, and has
been marked #[unstable] because it can be seen as a
duplicate of iterator-based functionality as well as
possibly being renamed.
String::push_str - This can be emulated with .extend(foo.chars()), but is
less efficient because of decoding/encoding. Due to the
desire to minimize API surface this may be able to be
removed in the future for something possibly generic with
no loss in performance.
String::grow - This is a duplicate of iterator-based functionality, which may
become more ergonomic in the future.
String::capacity - This function was just added.
String::push - This function was just added.
String::pop - This function was just added.
String::truncate - The failure conventions around String methods and byte
indices aren't totally clear at this time, so the failure
semantics and return value of this method are subject to
change.
String::as_mut_vec - the naming of this method may change.
string::raw::* - these functions are all waiting on an RFC (rust-lang/rfcs#240)
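Two of the entries above are easier to see in code: push_str versus .extend(foo.chars()), and the byte-index failure semantics of truncate. The sketch below uses the current standard library, where truncate panics if the new length is not on a char boundary; that behavior was still undecided when this PR was written. The function names are hypothetical.

```rust
/// Appends by decoding `src` into chars and re-encoding each one.
fn append_via_extend(dst: &mut String, src: &str) {
    dst.extend(src.chars());
}

/// Appends by copying the already-encoded bytes directly.
fn append_via_push_str(dst: &mut String, src: &str) {
    dst.push_str(src);
}

/// Truncates only when `new_len` is a valid char boundary in `s`.
/// Returns false (leaving `s` untouched) instead of panicking.
fn truncate_checked(s: &mut String, new_len: usize) -> bool {
    if s.is_char_boundary(new_len) {
        s.truncate(new_len);
        true
    } else {
        false
    }
}

fn main() {
    let mut a = String::from("ab");
    let mut b = String::from("ab");
    append_via_extend(&mut a, "c\u{e9}");
    append_via_push_str(&mut b, "c\u{e9}");
    assert_eq!(a, b); // same result, different amount of work

    // "h", then U+00E9 (2 bytes), then "llo": byte 2 is mid-character.
    let mut s = String::from("h\u{e9}llo");
    assert!(!truncate_checked(&mut s, 2));
    assert!(truncate_checked(&mut s, 3));
    assert_eq!(s, "h\u{e9}");
}
```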

The following method has been marked #[experimental]

String::from_str - This function only exists as it's more efficient than
to_string(), but having a less ergonomic function for
performance reasons isn't the greatest reason to keep it
around. Like Vec::push_all, this has been marked
experimental for now.

The following methods have been #[deprecated]

String::append - This method has been deprecated to remain consistent with the
deprecation of Vec::append. While convenient, it is one of
the only functional-style apis on String, and requires more
thought as to whether it belongs as a first-class method or
not (and how it relates to other collections).
String::from_byte - This is fairly rare functionality and can be emulated with
str::from_utf8 plus an assert plus a call to to_string().
Additionally, String::from_char could possibly be used.
String::byte_capacity - Renamed to String::capacity due to the rationale
above.
String::push_char - Renamed to String::push due to the rationale above.
String::pop_char - Renamed to String::pop due to the rationale above.
String::push_bytes - There are a number of unsafe functions on the String
type which allow bypassing utf-8 checks. These have all
been deprecated in favor of calling .as_mut_vec() and
then operating directly on the vector returned. These
methods were deprecated because naming them with relation
to other methods was difficult to rationalize and it's
arguably more composable to call .as_mut_vec().
String::as_mut_bytes - See push_bytes
String::push_byte - See push_bytes
String::pop_byte - See push_bytes
String::shift_byte - See push_bytes
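As a rough sketch of the replacements suggested for the deprecated byte-oriented methods, the example below goes through .as_mut_vec() instead of push_bytes, and emulates from_byte with str::from_utf8 plus an assert plus to_string(). It targets the current standard library; the function names are hypothetical, and restricting both helpers to ASCII is an assumption made here to keep the unsafe block sound, not something the PR specifies.

```rust
use std::str;

/// Stand-in for the deprecated push_bytes: appends raw bytes through
/// the underlying vector. Restricted to ASCII (an assumption of this
/// sketch) so the buffer is guaranteed to stay valid utf-8.
fn push_ascii_bytes(s: &mut String, bytes: &[u8]) {
    assert!(bytes.is_ascii(), "only ASCII input is accepted here");
    // SAFETY: ASCII bytes are valid utf-8 on their own, and appending
    // them at the end cannot split an existing multi-byte sequence.
    unsafe {
        s.as_mut_vec().extend_from_slice(bytes);
    }
}

/// Stand-in for the deprecated from_byte: an assert, str::from_utf8,
/// and a call to to_string().
fn string_from_byte(b: u8) -> String {
    assert!(b.is_ascii(), "a lone non-ASCII byte is not valid utf-8");
    let buf = [b];
    str::from_utf8(&buf)
        .expect("a single ASCII byte is valid utf-8")
        .to_string()
}

fn main() {
    let mut s = String::from("ab");
    push_ascii_bytes(&mut s, b"cd");
    assert_eq!(s, "abcd");
    assert_eq!(string_from_byte(b'x'), "x");
}
```

Anything that writes non-ASCII bytes through as_mut_vec() has to uphold the utf-8 invariant itself, which is why the method is unsafe.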

Reservation methods

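This commit does not yet touch the methods for reserving bytes. The methods on
Vec have also not yet been modified. These methods are discussed in the upcoming
Collections reform RFC
(https://github.com/aturon/rfcs/blob/collections-conventions/active/0000-collections-conventions.md#implicit-growth).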


This commit deprecates the String::shift_char() function in favor of a new insert()/remove() pair of functions. This aligns the API with Vec in that characters can be inserted at arbitrary positions. Additionally, there is no `_char` suffix due to the rationale laid out in the previous commit. Both functions are introduced as unstable: their failure semantics are in line with slices/vectors, but it is uncertain whether they should remain the same.
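A minimal sketch of how the insert()/remove() pair covers the old shift_char() use case, written against the current standard library. There, remove() returns a char and panics on an out-of-range or non-boundary index; the signatures at the time of this PR may have differed. The free function `shift_char` below is a hypothetical stand-in, not the deprecated method itself.

```rust
/// Hypothetical stand-in for the deprecated shift_char: removes and
/// returns the first char, or None if the string is empty.
fn shift_char(s: &mut String) -> Option<char> {
    if s.is_empty() {
        None
    } else {
        // Index 0 is always a char boundary in a non-empty string.
        Some(s.remove(0))
    }
}

fn main() {
    let mut s = String::from("bc");

    // insert() takes a byte index, following the rationale that
    // index-like arguments are implicitly in bytes.
    s.insert(0, 'a');
    assert_eq!(s, "abc");

    assert_eq!(shift_char(&mut s), Some('a'));
    assert_eq!(s, "bc");
}
```

Like Vec::remove(0), removing from the front shifts every remaining byte, so this costs time linear in the length of the string.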

rust-highfive · 2014-09-22T15:59:16Z

Warning

These commits modify unsafe code. Please review it carefully!

aturon · 2014-09-22T22:31:30Z

r=me, modulo a couple of minor nits on the diff.

Also: thanks for the extensive writeup of the rationale, etc! This is a good standard to set for these PRs going forward. I do wonder if the text about the overall design rationale for String should be kept somewhere more permanent -- perhaps in the guidelines? Let me know if you have any thoughts.

alexcrichton · 2014-09-23T00:33:40Z

We may not necessarily need to lay out the guidelines for string-based api design in the documentation of String itself, but I would definitely think that it would belong in the guidelines. That said, we should certainly explain very clearly that a string is a utf-8 encoded sequence of bytes, no matter what. The current documentation is a little... sparse. (cc @steveklabnik, maybe a good module/struct to beef up the doc string for?)

alexcrichton · 2014-09-23T00:34:39Z

For now I'm going to hold off on documenting the String struct itself, as that may warrant its own PR, and I'm going to try to land this in the meantime.



alexcrichton added 2 commits September 22, 2014 07:46

alexcrichton force-pushed the string-stable branch from ecfc954 to f1f302d Compare September 22, 2014 16:05

alexcrichton force-pushed the string-stable branch from f1f302d to 56ea769 Compare September 23, 2014 00:33

alexcrichton force-pushed the string-stable branch from 56ea769 to d9b0bc0 Compare September 23, 2014 13:39

alexcrichton force-pushed the string-stable branch from d9b0bc0 to 08fe149 Compare September 23, 2014 19:49

Deal with the fallout of string stabilization

5037513

alexcrichton force-pushed the string-stable branch from 08fe149 to 5037513 Compare September 24, 2014 01:32

bors closed this Sep 24, 2014

bors merged commit 5037513 into rust-lang:master Sep 24, 2014

thestinger mentioned this pull request Sep 26, 2014

Stabilize mutable slice API #17494

Closed

aturon mentioned this pull request Oct 3, 2014

Dangerously vague meaning of .len() and .truncate() on strings rust-lang/rfcs#350

Closed

alexcrichton deleted the string-stable branch October 11, 2014 21:40

aturon mentioned this pull request Nov 19, 2014

Rename &str::as_bytes() to &str::as_utf8() #14131

Closed

Gankra mentioned this pull request Dec 11, 2014

StrPrelude::char_at() confuses people by accept byte index instead of char index #19724

Closed

aturon mentioned this pull request Dec 16, 2014

Stabilization metabug: 1.0-alpha #19260

Closed

Philipp91 mentioned this pull request Feb 9, 2017

Help people find String::as_bytes() for UTF-8 #39688

Closed
