String Types

From Rust Community Wiki

Strings are types that represent textual data. There are several string types that have slightly different purposes.

The most common string type, str, is a primitive type and is used for string literals. Other string types in the standard library are OsStr and CStr. All of these types are dynamically sized and can therefore only exist behind a reference or a pointer-like type (e.g. &str). However, they have owned counterparts, String, OsString and CString, which are allocated on the heap and are Sized. This is similar to the distinction between Vec<T> and [T].

Strings are represented as slices of bytes; however, Rust guarantees that strings are always valid. For instance, the str and String types must contain valid UTF-8. Violating this rule, which is only possible with unsafe code, causes undefined behavior.

Example

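let s1: &str = "hello world!"; // 12 ASCII characters, 1 byte each
let s2 = String::from("👍👌🤝"); // 3 emoji, 4 bytes each
// len() counts bytes, so both strings have length 12
assert_eq!(s1.len(), s2.len());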

String and str

String and str are types containing UTF-8 encoded text. This means that characters have a variable width: a character can be between 1 and 4 bytes long. Therefore these strings can't be indexed by character position, and slicing a string to get a substring uses byte offsets. Slicing a string in the middle of a multi-byte character causes a panic:

let s = "hi 😂";
assert_eq!(&s[3..7], "😂"); // 😂 occupies bytes 3 to 6
let _ = &s[0..4]; // panics: byte offset 4 is inside 😂
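
When a panic is not acceptable, the standard library's str::get method can be used instead of the indexing syntax. The following is a minimal sketch; the helper name safe_slice is made up for illustration.

```rust
/// Returns the substring for the given byte range, or None if the range
/// is out of bounds or does not fall on character boundaries.
/// (`safe_slice` is a hypothetical helper name, not a std function.)
fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

For the string "hi 😂", safe_slice(s, 3, 7) yields the emoji, while safe_slice(s, 0, 4) yields None instead of panicking. The is_char_boundary method can be used to check a single offset.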

The len() method returns the byte length of the string; this is not the same as the number of characters. To iterate over the characters of a string, the chars() and char_indices() methods can be used.
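
As a small illustration of the difference, the sketch below compares the byte length with the number of chars and lists the byte offset of each char. The function names are chosen for this example only.

```rust
/// Returns (byte length, number of chars) of a string.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Collects each char together with the byte offset at which it starts.
fn char_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```

For "hi 😂", the byte length is 7 but there are only 4 chars, and the emoji starts at byte offset 3.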

A String or str can be converted from a Vec<u8> or [u8] with the String::from_utf8 and str::from_utf8 family of functions, which validate the input.
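
A minimal sketch of these conversions, using only standard library functions; the wrapper names decode, decode_owned and decode_lossy are illustrative.

```rust
use std::str;

/// Borrows the bytes as a &str if they are valid UTF-8.
fn decode(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

/// Converts a byte vector into a String without copying.
/// On failure, the original bytes are handed back to the caller.
fn decode_owned(bytes: Vec<u8>) -> Result<String, Vec<u8>> {
    String::from_utf8(bytes).map_err(|e| e.into_bytes())
}

/// Replaces invalid sequences with U+FFFD instead of failing.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

The lossy variant is convenient for diagnostics, but it changes the data, so it should not be used when the original bytes must round-trip.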

A note about Unicode

In Rust, a char is a Unicode scalar value. This is similar to, but not the same as a Unicode code point: A char can't be a high or low surrogate. This means that converting a u32 to a char can fail.
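
The fallible conversion can be sketched with char::from_u32, which returns None for values that are not Unicode scalar values. The wrapper name to_char is only for illustration.

```rust
/// Converts a u32 to a char, returning None for surrogates and for
/// values beyond the Unicode range.
fn to_char(code: u32) -> Option<char> {
    char::from_u32(code)
}
```

Here to_char(0x41) is Some('A'), whereas to_char(0xD800), a high surrogate, is None.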

A char is not the same as a character in the general sense. When talking about characters, we often mean graphemes, which can consist of multiple chars. To iterate over or count the graphemes of a string, external crates such as unicode-segmentation should be used. Note that graphemes don't necessarily correspond to what is displayed as a visual unit, since that depends on the text rendering pipeline. For example, fonts can define ligatures, and these can be turned on and off with font features.

Also note that strings that are semantically equal don't necessarily have the same byte representation. Often there are multiple ways to represent a single grapheme, and sometimes different graphemes should be treated as equal. Therefore, when comparing or sorting strings, they should be normalized beforehand to produce correct results. This can be done with the unicode-normalization crate, for example.
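
The problem can be demonstrated with the standard library alone. The sketch below uses two encodings of "é"; the constant and function names are chosen for this example, and no normalization is performed here.

```rust
/// "é" as a single precomposed scalar value (U+00E9).
const PRECOMPOSED: &str = "\u{e9}";
/// "é" as the letter "e" followed by a combining acute accent (U+0301).
const DECOMPOSED: &str = "e\u{301}";

/// Returns (byte length, number of chars) of a string.
fn byte_and_char_count(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

Both constants render as the same grapheme, yet PRECOMPOSED == DECOMPOSED is false because the comparison is done byte by byte. The first is 2 bytes and 1 char, the second is 3 bytes and 2 chars.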

OsString and OsStr

OsString and OsStr are used when interfacing with platform-specific APIs. Their format varies by system and is therefore not exposed to the programmer. One can infallibly convert a String to an OsString. The reverse is not guaranteed as OsString may contain values unrepresentable by UTF-8, so the conversion can be done fallibly (OsString::into_string) or lossily (OsStr::to_string_lossy).
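
The three conversions can be sketched as follows. The function names are illustrative. Constructing an OsString that is not valid UTF-8 requires platform-specific extension traits and is not shown.

```rust
use std::ffi::{OsStr, OsString};

/// String to OsString: always succeeds.
fn to_os(s: String) -> OsString {
    OsString::from(s)
}

/// OsString to String: fails, returning the original value,
/// if the contents are not valid Unicode.
fn try_back(os: OsString) -> Result<String, OsString> {
    os.into_string()
}

/// OsStr to String: invalid data is replaced with U+FFFD.
fn lossy(os: &OsStr) -> String {
    os.to_string_lossy().into_owned()
}
```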

CString and CStr

CString and CStr are constrained by C language requirements. Namely, they are terminated by a nul byte (b'\0') and can't contain any other nul bytes. A CString can be created with the CString::new() method, which appends the terminating nul byte itself and fails if the input contains a nul byte.

Note that C strings are not constrained to UTF-8. This means that when a &CStr is converted to a &str, it must be validated.
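
A minimal sketch of creating a C string and converting it back; the helper names make_c_string and c_to_str are made up for this example.

```rust
use std::ffi::{CStr, CString};

/// Creates a C string from a Rust string.
/// Returns None if the input contains a nul byte.
fn make_c_string(s: &str) -> Option<CString> {
    CString::new(s).ok()
}

/// Borrows a C string as &str, validating that it is UTF-8.
fn c_to_str(c: &CStr) -> Option<&str> {
    c.to_str().ok()
}
```

For example, make_c_string("hello") produces the bytes of "hello" followed by a single nul byte, while make_c_string("he\0llo") returns None.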

Important traits

Deref

Owned strings can be dereferenced to their borrowed counterparts:

  • String implements Deref<Target = str>
  • OsString implements Deref<Target = OsStr>
  • CString implements Deref<Target = CStr>

This means that str methods are also available for String because of auto-deref. For example, `String::new().chars()` is equivalent to `String::new().deref().chars()`.
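
Deref also enables deref coercion: a &String is accepted where a &str is expected. A short sketch, with hypothetical function names:

```rust
/// Takes a borrowed string slice.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

/// Passes a &String where &str is expected and calls a str method
/// directly on a String.
fn demo() -> (String, usize) {
    let owned = String::from("hello");
    // &owned has type &String and is coerced to &str
    let upper = shout(&owned);
    // chars() is defined on str, found via auto-deref
    (upper, owned.chars().count())
}
```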

AsRef

All strings implement the AsRef trait to convert them to the borrowed variant:

  • String and str implement AsRef<str>
  • OsString and OsStr implement AsRef<OsStr>
  • CString and CStr implement AsRef<CStr>

This is useful for writing functions that are generic over string types when a string reference is sufficient. For example:

fn foo(s: impl AsRef<str>) {
    let s: &str = s.as_ref();
    // do something with s
}

foo(String::from("this works"));
foo("this also works");

For conversions in the opposite direction, the ToString trait can be used.

FromStr and ToString

These traits are used to convert other values from and into a string.

FromStr is fallible, i.e. it returns a Result. It is used by str::parse().

ToString has a blanket implementation for every type T: Display. This means that ToString doesn't need to be implemented manually; instead, one should implement the Display trait. An implementation of ToString is then automatically provided.
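
The following sketch implements both directions for a hypothetical Point type. The type, its "x,y" text format and the String error type are assumptions made for this example.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical example type, written as "x,y" in text form.
#[derive(Debug, PartialEq)]
struct Point {
    x: i32,
    y: i32,
}

// Implementing Display also provides to_string() via the blanket impl.
impl fmt::Display for Point {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{},{}", self.x, self.y)
    }
}

// Implementing FromStr makes "3,4".parse::<Point>() work.
impl FromStr for Point {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let (x, y) = s
            .split_once(',')
            .ok_or_else(|| format!("missing ',' in {:?}", s))?;
        let x = x.trim().parse::<i32>().map_err(|e| format!("bad x: {}", e))?;
        let y = y.trim().parse::<i32>().map_err(|e| format!("bad y: {}", e))?;
        Ok(Point { x, y })
    }
}
```

With these implementations, "3,4".parse::<Point>() returns Ok(Point { x: 3, y: 4 }), and calling to_string() on that point gives "3,4" again.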

Borrow and ToOwned

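These traits are used to convert a borrowed string to an owned string and vice versa: String implements Borrow<str>, OsString implements Borrow<OsStr> and CString implements Borrow<CStr>. Conversely, str, OsStr and CStr implement ToOwned, with String, OsString and CString as the respective owned types.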

As a result, strings can be used with the Cow type, which stands for clone on write. It can be used to return a type that is either owned or borrowed, while avoiding allocations unless necessary. For example:

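use std::borrow::Cow;

fn foo(s: &str) -> Cow<'_, str> {
    // Conservative check: any non-lowercase char (even a space)
    // takes the allocating branch
    if s.chars().all(char::is_lowercase) {
        Cow::Borrowed(s)
    } else {
        Cow::Owned(s.to_lowercase())
    }
}

// Bind the Cow to a variable so that it outlives the borrowed &str
let cow = foo("hello world");
let s: &str = cow.as_ref();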
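
Borrow is also what allows a collection keyed by an owned string to be queried with a borrowed one, and ToOwned is what creates the owned value from the borrowed one. A minimal sketch, with illustrative function names:

```rust
use std::collections::HashMap;

/// Looks up a String key using a &str. This works because
/// String implements Borrow<str>, so no temporary String is allocated.
fn lookup(map: &HashMap<String, u32>, key: &str) -> Option<u32> {
    map.get(key).copied()
}

/// Creates an owned String from a borrowed str via ToOwned.
fn owned_copy(s: &str) -> String {
    s.to_owned()
}
```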

Sources of confusion

Rust is more pedantic than other languages when it comes to string handling, which can lead to confusion as to why a certain type or trait is used. Additionally, as strings are such a fundamental type, there are some arguably inelegant or redundant items such as FromStr.

Cow<str>

Cow<str> is not a special case, since Cow can be used with any type that implements ToOwned, e.g. [T], Path and many more. However, Cow<str> is arguably the most common use case.
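
To illustrate that Cow is not tied to strings, here is a sketch using Cow<[i32]>. The function clamp_negatives and its behavior are invented for this example.

```rust
use std::borrow::Cow;

/// Returns the input unchanged (borrowed) if it contains no negative
/// numbers; otherwise returns an owned copy with negatives set to 0.
fn clamp_negatives(values: &[i32]) -> Cow<'_, [i32]> {
    if values.iter().all(|v| *v >= 0) {
        Cow::Borrowed(values)
    } else {
        Cow::Owned(values.iter().map(|v| (*v).max(0)).collect())
    }
}
```

As with Cow<str>, an allocation only happens in the branch that actually needs to modify the data.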

FromStr and ToString

FromStr deviates from From by returning a Result, but it seems redundant with TryFrom. Likewise, ToString seems redundant with Into.

Indeed, TryFrom was going to make FromStr obsolete, but this is not possible because it would overlap with impl<T, U> TryFrom<T> for U where U: From<T> (#44174), which violates coherence. FromStr is also still used by str::parse. On the other hand, ToString::to_string() serves as a convenience method for format!("{}", x).