String Types
Strings are types that represent textual data. There are several string types that have slightly different purposes.
The most common string type, str, is a primitive type and is used for string literals. Other string types in the standard library are OsStr and CStr. All of these types are dynamically sized and can therefore only exist behind a reference or a pointer-like type (e.g. &str). However, they have owned counterparts, String, OsString and CString, which are allocated on the heap and are Sized. This is similar to the distinction between Vec<T> and [T].
Strings are represented as slices of bytes; however, Rust guarantees that strings are always valid. For instance, the str and String types must be valid UTF-8; violating this rule causes undefined behavior.
Example
let s1: &str = "hello world!";
let s2 = String::from("👍👌🤝");
assert_eq!(s1.len(), s2.len());
String and str
String and str are types containing UTF-8 encoded text. This means that characters have a variable width: a character can be between 1 and 4 bytes long. Therefore these strings can't be indexed, and slicing a string to get a substring uses byte offsets. Slicing a string in the middle of a multi-byte character causes a panic:
let s = "hi 😂";
assert_eq!(&s[3..7], "😂");
&s[0..4]; // panics
The len() method returns the byte length of the string; this is not the same as the number of characters. To iterate over the characters of a string, the chars() and char_indices() methods can be used.
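As a minimal sketch of the difference between byte length and character count, the helpers below (byte_and_char_len and prefix_chars are hypothetical names, not part of the standard library) use chars() and char_indices() to measure a string and to take a prefix without ever slicing inside a multi-byte character:

```rust
/// Returns the byte length and the number of `char`s in `s`.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Returns the first `n` chars of `s` as a substring. The cut always falls
/// on a char boundary, so the slice cannot panic.
fn prefix_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // `idx` is the byte offset at which the char after the prefix starts.
        Some((idx, _)) => &s[..idx],
        // `s` has at most `n` chars: the whole string is the prefix.
        None => s,
    }
}
```

With the string from the example above, byte_and_char_len("hi 😂") reports 7 bytes but only 4 chars.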
A String or str can be converted from a Vec<u8> or [u8] with the String::from_utf8 and str::from_utf8 family of functions, which validate the input.
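The following sketch shows how validation surfaces in practice. The function names as_text and into_text are made up for illustration, and the lossy fallback in into_text is an assumption about what a caller might want when the bytes are invalid; it is not the only reasonable policy.

```rust
use std::str;

/// Tries to view `bytes` as UTF-8 text without copying.
/// Returns `None` if the bytes are not valid UTF-8.
fn as_text(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

/// Takes ownership of `bytes` and reuses the allocation if they are valid
/// UTF-8. Otherwise falls back to a lossy conversion, which replaces each
/// invalid sequence with U+FFFD.
fn into_text(bytes: Vec<u8>) -> String {
    match String::from_utf8(bytes) {
        Ok(s) => s,
        // The error still holds the original bytes, so nothing is lost.
        Err(e) => String::from_utf8_lossy(e.as_bytes()).into_owned(),
    }
}
```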
A note about Unicode
In Rust, a char is a Unicode scalar value. This is similar to, but not the same as, a Unicode code point: a char can't be a high or low surrogate. This means that converting a u32 to a char can fail.
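A small sketch of that fallible conversion, using the standard char::from_u32 function (the wrapper name to_char is hypothetical):

```rust
/// Converts a raw `u32` to a `char` if it is a Unicode scalar value.
/// Surrogates (0xD800..=0xDFFF) and values above 0x10FFFF yield `None`.
fn to_char(n: u32) -> Option<char> {
    char::from_u32(n)
}
```

For example, to_char(0x1F602) yields the 😂 character, while to_char(0xD800), a high surrogate, yields None.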
A char is not the same as a character in the general sense. When talking about characters, we often mean graphemes, which can consist of multiple chars. To iterate over or count the graphemes of a string, external crates such as unicode-segmentation (1.10.1) should be used. Note that graphemes don't necessarily correspond to what is displayed as a visual unit, since that depends on the text rendering pipeline. For example, fonts can define ligatures, and these can be turned on and off with font features.
Also note that strings that are semantically equal don't necessarily have the same byte representation. Often there are multiple ways to represent a single grapheme, and sometimes different graphemes should be treated as equal. Therefore, when comparing or sorting strings, they should be normalized beforehand to produce correct results. This can be done with the unicode-normalization (0.1.22) crate, for example.
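The point about differing byte representations can be seen with the standard library alone. The sketch below (measure, COMPOSED and DECOMPOSED are illustrative names) encodes "é" in two ways that render identically but compare unequal; it does not perform normalization, which still requires an external crate.

```rust
/// Returns `(byte length, number of chars)` for a string.
fn measure(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// "é" as a single precomposed scalar value (U+00E9).
const COMPOSED: &str = "\u{e9}";

/// "é" as the letter "e" followed by a combining acute accent (U+0301).
const DECOMPOSED: &str = "e\u{301}";
```

Both constants represent one grapheme, yet COMPOSED == DECOMPOSED is false because the comparison is byte-wise.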
OsString and OsStr
OsString and OsStr are used when interfacing with platform-specific APIs. Their format varies by system and is therefore not exposed to the programmer. One can infallibly convert a String to an OsString. The reverse is not guaranteed as OsString may contain values unrepresentable by UTF-8, so the conversion can be done fallibly (OsString::into_string) or lossily (OsStr::to_string_lossy).
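The three conversions can be sketched as follows. The wrapper function names are hypothetical; the underlying calls (OsString::from, OsString::into_string, OsStr::to_string_lossy) are the standard library ones named above.

```rust
use std::ffi::{OsStr, OsString};

/// Infallible direction: every `String` is a valid `OsString`.
fn to_os_string(s: String) -> OsString {
    OsString::from(s)
}

/// Fallible direction: on failure, the original `OsString` is handed back
/// unchanged in the `Err` variant.
fn to_string_strict(os: OsString) -> Result<String, OsString> {
    os.into_string()
}

/// Lossy direction: anything that is not valid Unicode is replaced
/// with U+FFFD.
fn to_string_lossy(os: &OsStr) -> String {
    os.to_string_lossy().into_owned()
}
```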
CString and CStr
CString and CStr are constrained by C language requirements. Namely, they are terminated by a nul byte (b'\0') and can't contain any other nul bytes. A CString can be created with the CString::new() method, which appends the terminator itself and fails if the input contains a nul byte.
Note that C strings are not constrained to UTF-8. This means that when a &CStr is converted to a &str, it must be validated.
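A minimal sketch of both checks, assuming the caller is content with Option instead of the detailed error types (make_c_string and c_to_str are hypothetical helper names):

```rust
use std::ffi::{CStr, CString};

/// Builds a C string; returns `None` if the input contains a nul byte.
fn make_c_string(bytes: impl Into<Vec<u8>>) -> Option<CString> {
    CString::new(bytes).ok()
}

/// Borrows a C string as `&str`; returns `None` if it is not valid UTF-8.
fn c_to_str(c: &CStr) -> Option<&str> {
    c.to_str().ok()
}
```

Note that a CString can hold bytes that are not valid UTF-8, such as a lone 0xff, in which case only the second step fails.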
Important traits
Deref
Owned strings can be dereferenced to their borrowed counterparts:
String implements Deref<Target = str>
OsString implements Deref<Target = OsStr>
CString implements Deref<Target = CStr>
This means that str methods are also available for String because of auto-deref. For example, `String::new().chars()` is equivalent to `String::new().deref().chars()`.
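The same mechanism lets a reference to an owned string be passed where a borrowed one is expected. A small sketch with two made-up functions, word_count and summarize:

```rust
/// Counts whitespace-separated words; takes the borrowed type.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Uses a `&str` function and a `str` method on an owned `String`.
/// (Taking `&String` is only for demonstration; `&str` is the idiomatic
/// parameter type.)
fn summarize(owned: &String) -> (usize, bool) {
    // `&String` coerces to `&str` at the call site (deref coercion), and
    // `starts_with` is a `str` method found through auto-deref.
    (word_count(owned), owned.starts_with("the"))
}
```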
AsRef
All strings implement the AsRef trait to convert them to the borrowed variant:
String and str implement AsRef<str>
OsString and OsStr implement AsRef<OsStr>
CString and CStr implement AsRef<CStr>
This is useful for writing functions that are generic over string types when a string reference is enough. For example:
fn foo(s: impl AsRef<str>) {
    let s: &str = s.as_ref();
    // do something with s
}
foo(String::from("this works"));
foo("this also works");
For conversions in the opposite direction, from a borrowed str to an owned String, the ToString trait can be used.
FromStr and ToString
These traits are used to convert other values from and into a string.
FromStr is fallible, i.e. it returns a Result. It is used by str::parse().
ToString has a blanket implementation for every T: Display. This means that ToString doesn't need to be implemented manually; instead, one should implement the Display trait. An implementation of ToString is then automatically provided.
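Both traits can be sketched on one small type. Point is a hypothetical struct, and the "x,y" text format and the String error type are assumptions chosen for brevity; a real implementation would likely use a dedicated error type.

```rust
use std::fmt;
use std::str::FromStr;

// `Point` is a hypothetical type used only for illustration.
#[derive(Debug, PartialEq)]
struct Point {
    x: i32,
    y: i32,
}

impl FromStr for Point {
    // A plain `String` keeps the sketch short.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let (x, y) = s
            .split_once(',')
            .ok_or_else(|| format!("expected `x,y`, got `{}`", s))?;
        let x = x.trim().parse::<i32>().map_err(|e| format!("bad x: {}", e))?;
        let y = y.trim().parse::<i32>().map_err(|e| format!("bad y: {}", e))?;
        Ok(Point { x, y })
    }
}

// Implementing `Display` is enough: `to_string()` comes from the
// blanket `ToString` implementation.
impl fmt::Display for Point {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{},{}", self.x, self.y)
    }
}
```

With these two implementations, "3, -4".parse::<Point>() goes through FromStr, and calling to_string() on the result goes through Display, so a value can round-trip through its textual form.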
Borrow and ToOwned
These traits are used to convert a borrowed string to an owned string and vice versa: String implements Borrow<str>, OsString implements Borrow<OsStr> and CString implements Borrow<CStr>. In the other direction, str, OsStr and CStr implement ToOwned with the corresponding owned type.
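One place where Borrow shows up in everyday code is map lookups. The sketch below uses two hypothetical helpers, lookup and owned_copy, to show a HashMap with String keys being queried by &str, and a borrowed string being turned into an owned one with ToOwned:

```rust
use std::collections::HashMap;

/// Looks up a `String`-keyed map with a plain `&str`.
fn lookup(ages: &HashMap<String, u32>, name: &str) -> Option<u32> {
    // `HashMap::get` accepts any `&Q` where the key type implements
    // `Borrow<Q>`. Since `String: Borrow<str>`, no `String` has to be
    // allocated just to perform the lookup.
    ages.get(name).copied()
}

/// Converts a borrowed string to an owned one via `ToOwned`.
fn owned_copy(s: &str) -> String {
    s.to_owned()
}
```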
As a result, strings can be used with the Cow type, which stands for clone on write. It can be used to return a type that is either owned or borrowed, while avoiding allocations unless necessary. For example:
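use std::borrow::Cow;

fn foo(s: &str) -> Cow<str> {
    if s.chars().all(char::is_lowercase) {
        Cow::Borrowed(s)
    } else {
        Cow::Owned(s.to_lowercase())
    }
}

// Bind the Cow first so that the &str does not borrow from a temporary.
let lowered = foo("hello world");
let s: &str = lowered.as_ref();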
Sources of confusion
Rust is more pedantic than other languages when it comes to string handling, which can lead to confusion as to why a certain type or trait is used. Additionally, as strings are such a fundamental type, there are some arguably inelegant or redundant items such as FromStr.
Cow<str>
Cow<str> is not a special case, since Cow can be used with any type that implements ToOwned, e.g. [T], Path and many more. However, Cow<str> is arguably the most common use case.
FromStr and ToString
While FromStr deviates from From by returning a Result, it seems redundant with TryFrom, and ToString seems redundant with Into.
Indeed, TryFrom was going to make FromStr obsolete, but cannot as it would overlap with impl<T, U> TryFrom<T> for U where U: From<T> (#44174), which violates coherence. FromStr is also used by str::parse. On the other hand, ToString::to_string() is used as a convenience method for format!("{}", x).