
Why are there multiple string types in Rust? (I'll tell you)

Jul 30, 2021


It's no secret I love Rust.

However, Rust often reveals, usually painfully, the gaps in my knowledge about concepts that other programming languages abstracted away from me.

One of these concepts was way more complicated than I had originally thought: Strings.

Yeah. Strings. You know strings. They're great. They store text and words. Their recipe calls for a sprinkle of some words, two quotation marks, arrange them correctly and boom, we have a string. Super cool. In fact, here's a tasty string right here: "spam".

In Rust, it gets a fair bit deeper than that.

One of the most confusing things about Rust when I was just starting out learning it was the fact that Rust has multiple string types:

  1. str - effectively an unsized array of bytes ([u8]) somewhere in memory.
  2. String - an owned smart pointer referring to string data on the heap.
  3. &str - a string slice and pointer to string data somewhere in memory.
  4. &String - a borrowed String. But wait, don't we already have one of
    these with &str??

Side Note: There are also the mutable reference versions, &mut str and &mut String, but the distinction between these and their immutable versions &str and &String isn't relevant for this post, so I won't be referring to them.

So why does Rust have so many types for strings, whereas other languages might only have a single type?

Well, I'll tell you.

Sized types are sized, and their instances have the same size

Most types have a known size at compile-time. This size does not change, regardless of the exact value a variable of that type holds. A variable holding a 32-bit integer (u32) will require 4 bytes of memory, regardless of whether its exact value is 0 or 4,294,967,295. Both values require the same number of bytes to store.

use std::mem::{size_of, size_of_val};

fn main() {
    let a: u32 = 0;
    let b: u32 = 4_294_967_295;

    // The following will print that a u32 has a size of 4, meaning 4 bytes.
    println!("u32 size: {}", size_of::<u32>()); 

    // This assertion passes, despite the fact that one number is so
    // much larger than the other.
    assert_eq!(size_of_val(&a), size_of_val(&b));

    // They both have the same size as a u32, since they have that type.
    assert_eq!(size_of_val(&a), size_of::<u32>());
    assert_eq!(size_of_val(&b), size_of::<u32>());
}

This extends even to complex types, which are types that store one or more other types. Suppose we have the following struct:

// Bar is a made-up placeholder so the example compiles; any Sized type works.
struct Bar(bool);

struct Foo<'a> {
    one: i32,
    two: u64,
    some_reference: &'a str,
    another_complex_type: Bar,
}

Even if we create multiple instances of Foo, each instance will always require the same number of bytes, regardless of whatever values we store in them.
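Here's a quick sketch to check that claim. The Bar definition, the foo_size helper, and the field values are all made up for illustration; the only point is that two very different instances report the same size.

```rust
use std::mem::{size_of, size_of_val};

// `Bar` is a hypothetical stand-in; any Sized type would do here.
#[allow(dead_code)]
struct Bar {
    flag: bool,
}

#[allow(dead_code)]
struct Foo<'a> {
    one: i32,
    two: u64,
    some_reference: &'a str,
    another_complex_type: Bar,
}

// Builds a `Foo` and reports how many bytes that instance occupies.
fn foo_size(one: i32, two: u64, text: &str, flag: bool) -> usize {
    let foo = Foo {
        one,
        two,
        some_reference: text,
        another_complex_type: Bar { flag },
    };
    size_of_val(&foo)
}

fn main() {
    let small = foo_size(0, 0, "", false);
    let large = foo_size(i32::MAX, u64::MAX, "a much, much longer string slice", true);

    // Wildly different contents, identical size.
    assert_eq!(small, large);
    assert_eq!(small, size_of::<Foo>());
    println!("every Foo takes {} bytes", small);
}
```

Notice that the length of the string behind some_reference doesn't matter: Foo stores only the reference, not the text itself.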

At this point, you might be thinking "Well, duh, Chris. This isn't rocket surgery." And you'd be right!

Dynamically Sized Types and the str type

So how does this explain why Rust juggles around so many types for a string?

If Rust only had one string type, it'd be impossible to define a single size for it. Don't believe me? Let's cook up some examples. If the following variables were all the same imaginary "string" type, what size should they be?

let s1 = "hello, world!";
let s2 = "peanut";
let s3 = rand::random::<u32>().to_string();

Assuming a UTF-8 encoding, s1 would require 13 bytes, s2 would require 6 bytes, and we don't have a way to know how many bytes s3 would require. We do know that it'd be somewhere between 1 and 10 bytes, but without a magic crystal ball, we can't be certain. (Someone ought to make one of those!)
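If you'd rather not take my word for those numbers, here's a small sketch that checks them. Since we can't predict a random value, it measures the two extremes of u32 instead; decimal_len is a helper name I made up for the example.

```rust
// Number of bytes needed to store the decimal text of `n` as UTF-8.
fn decimal_len(n: u32) -> usize {
    n.to_string().len()
}

fn main() {
    // `len()` on a string reports bytes, not characters.
    assert_eq!("hello, world!".len(), 13);
    assert_eq!("peanut".len(), 6);

    // The shortest and longest a stringified u32 can ever be.
    assert_eq!(decimal_len(u32::MIN), 1); // "0"
    assert_eq!(decimal_len(u32::MAX), 10); // "4294967295"
}
```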

Underneath all of Rust's strings is a type called str. str is an example of something that is, by definition, !Sized, meaning "not sized", because we cannot enforce a single size for all values of that type.

Another name given to these kinds of types in Rust is "Dynamically Sized Types", aka DSTs.
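One way to see the "no single size" problem for yourself is a sketch like the one below. It leans on std::mem::size_of_val, which accepts unsized types and measures a value at runtime; str_size is just a wrapper name I picked for the example.

```rust
use std::mem::size_of_val;

// `size_of_val` accepts unsized types (`T: ?Sized`), so it can measure
// an individual `str` value at runtime.
fn str_size(s: &str) -> usize {
    size_of_val(s)
}

fn main() {
    // Same type (`str`), different sizes.
    assert_eq!(str_size("hello, world!"), 13);
    assert_eq!(str_size("peanut"), 6);

    // Uncommenting the next line fails to compile, because `size_of`
    // requires `T: Sized` and `str` is not:
    // let n = std::mem::size_of::<str>();
}
```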

&str has a size... also it's a fat pointer

The compiler needs us to use Sized types in order to know how much space to allocate for them. If you defined a function to take an argument with a !Sized type, your program wouldn't know how much space it would need to allocate for its stack frame. If you wanted to ask the operating system to allocate some memory for you on the heap, you'd need to specify a size to request.

So if the compiler requires sized types (so that the program knows how much space to allocate), what can we do when we want to use a so-called DST? We can put the DST behind a reference: &str!

For programmers who don't work in a language with pointers and haven't heard of what they are: Your computer's memory is addressable using numbers. This means that if I give you some random integer value, you could theoretically peek at that exact location and read any bytes located there. A pointer is a variable which holds one such address as a value. In other words, pointers "point" at memory addresses. A reference is a pointer with added context: it points to the address of another variable. This means it doesn't just point at any ol' memory address, it refers you to an active underlying value.

A &str is a reference to a str living somewhere in memory.

The cool thing about pointers/references is that they are sized, even if the type they refer to is unsized. Since pointers are just an address, storing them requires enough space to store a number.

Side note: In Rust, memory addresses use the type usize. usizes are just numbers, but the number of bytes they require varies depending on the platform for which the compiler compiled your program. 32-bit computers might have an address space that can be indexed using a 32-bit integer. As a result, on these platforms, usize would require the same number of bytes as a 32-bit integer would: 4 bytes. On 64-bit systems, typically this will be 8 bytes.

&str is also what we call a "fat pointer", which is a pointer with some extra metadata included. For &str, it contains the memory address like any other pointer, but it also contains the length of the string it points to.
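A rough way to poke at the "fat" part is to compare sizes and pull the two pieces out. This is only a sketch: parts is a helper I made up, and as_ptr and len are the standard accessors for the address and the length.

```rust
use std::mem::size_of;

// Splits a string slice into the two parts a `&str` carries:
// the address of the first byte and the length in bytes.
fn parts(s: &str) -> (*const u8, usize) {
    (s.as_ptr(), s.len())
}

fn main() {
    // A reference to a Sized type is a single address wide...
    assert_eq!(size_of::<&u32>(), size_of::<usize>());
    // ...while a `&str` is twice that: address plus length.
    assert_eq!(size_of::<&str>(), 2 * size_of::<usize>());

    let (address, len) = parts("peanut");
    println!("data at {:p}, {} bytes long", address, len);
    assert_eq!(len, 6);
}
```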

&str is so useful that the Rust compiler automatically turns your string literals into them. In fact, the compiler copies string literals directly as-is into your compiled binary. When you execute your program, the program loads these static string literals into a region of static memory that your &str references can point to.

// Both of these have the type `&str`, which is Sized.
let s1 = "hello, world!";
let s2 = "peanut";

As part of compiling this code, the compiler will copy the strings "hello, world!" and "peanut" directly into the resulting binary. Once executed, your program will load both strings into static memory. s1 and s2 become &str references, both of which are stored on the stack (which is possible because references are pointers and pointers are Sized). Both references will point to the address of their respective strings in static memory.
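In type terms, this is why string literals carry the 'static lifetime. A minimal sketch, with favorite_legume as a made-up function name:

```rust
// String literals have the type `&'static str`: the data they point to is
// baked into the binary and stays valid for the whole run of the program.
fn favorite_legume() -> &'static str {
    "peanut"
}

fn main() {
    let s1: &'static str = "hello, world!";
    let s2 = favorite_legume();

    // Copying a `&str` copies only the pointer and length, not the text,
    // so both variables point at the same bytes.
    let s1_again = s1;
    assert_eq!(s1.as_ptr(), s1_again.as_ptr());

    println!("{} / {}", s1, s2);
}
```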

If you wanted direct access to the underlying str, you'd have to put in some real effort, and I won't be demonstrating it here. In general, it's not really useful to have a handle directly to a str.

So then what's String?

&strs also go by another and more common name: string slice.

You can imagine the pointer component of a &str pointing at the start of the str value in memory. When you add its length component to this address, we can find the end of the value in memory. We can refer to this range, from start to end, as a "slice".

Slices can shrink, but never grow. Why? A slice is just a "view" into some underlying data. After all, &str is merely a reference! If slices could grow, we could use them to read bytes past the end of the string, which results in Undefined Behavior (TM) (a terrible and evil thing).
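Here's a small sketch of shrinking versus (failing at) growing. first_bytes is a helper name I invented; it wraps the standard get method, which hands back None instead of reading out of bounds.

```rust
// Returns the first `n` bytes of `s` as a smaller slice, or `None` if that
// range runs past the end of `s` or splits a multi-byte character.
fn first_bytes(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

fn main() {
    let s = "hello, world!";

    // Shrinking: a narrower view into the same bytes, nothing copied.
    let hello = &s[..5];
    assert_eq!(hello, "hello");
    assert_eq!(hello.as_ptr(), s.as_ptr());

    // "Growing" past the end is refused rather than reading stray memory.
    assert_eq!(first_bytes(s, 5), Some("hello"));
    assert_eq!(first_bytes(s, 100), None);
}
```

Indexing with &s[..100] would panic at runtime rather than return None, which is still a far cry from Undefined Behavior.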

This is where String comes in. A String owns its text and keeps it in a buffer on the heap (the String value itself is just a small, Sized handle to that buffer). Living on the heap allows the underlying data to be resizable, which means we can grow the String by appending to it.

In contrast, &str can't give us a similar guarantee since it is a slice, and is therefore only a reference to data living somewhere else in memory. That data could be living on the heap, on the stack, or even in static memory (data in the latter two can't be resized!)
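To make both halves of that concrete, here's a sketch that grows a String and then takes &str views into all three regions of memory. The grow helper and the variable names are mine, purely for illustration.

```rust
// Appends to an owned `String`; the heap buffer is reallocated if needed.
fn grow(mut s: String, extra: &str) -> String {
    s.push_str(extra);
    s
}

fn main() {
    let grown = grow(String::from("hello"), ", world!");
    assert_eq!(grown, "hello, world!");

    // Static memory: a string literal.
    let from_static: &str = "peanut";

    // Heap: a slice borrowed from a `String`.
    let from_heap: &str = grown.as_str();

    // Stack: a slice over a local byte array.
    let bytes = [b's', b'p', b'a', b'm'];
    let from_stack: &str = std::str::from_utf8(&bytes).expect("valid UTF-8");

    println!("{} {} {}", from_static, from_heap, from_stack);
    assert_eq!(from_stack, "spam");
}
```

All three slices have the exact same type, so code that receives a &str has no idea (and no need to know) where the bytes actually live.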

&String versus &str

When you borrow a String, you get &String. Sometimes you can get a &str instead, but only if you provide the compiler with extra context, such as a type annotation or a function parameter typed as &str.

&String and &str are different types! They even have different methods implemented on them. For example, String has a method called clear, which isn't a thing for slices. Conversely, slices have a method called contains, which isn't implemented on String.

Then why can I do this?

let s: String = String::from("hello, world!");
let borrowed_s: &String = &s;

// what, what, what!? did I just call a &str method on a &String?
println!("{}", borrowed_s.contains(",")); // true

This occurs through a feature of Rust called Deref Coercion, which lets the compiler automatically convert a reference to one type into a reference to another when the first type implements the Deref trait. Rust comes with a built-in implementation of Deref for String which enables &String to dereference to a &str. In this particular circumstance, a String is a pointer to data living on the heap. &str is also a pointer, and also capable of pointing at string data living on the heap. Because of this, Rust can cheaply coerce the type from &String to &str with little trouble.
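The same coercion kicks in at function boundaries. In this sketch, shout is a made-up function that only knows about &str, yet it happily accepts a &String; the explicit conversions further down do by hand what the compiler otherwise does for us.

```rust
// Accepts any string slice; a hypothetical helper for illustration.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

fn main() {
    let owned: String = String::from("hello, world!");
    let borrowed: &String = &owned;

    // Deref coercion: `&String` is accepted where `&str` is expected.
    assert_eq!(shout(borrowed), "HELLO, WORLD!");

    // The same conversion, spelled out explicitly three ways.
    let a: &str = &*owned; // explicit deref, then re-borrow
    let b: &str = owned.as_str();
    let c: &str = &owned[..]; // slice the full range
    assert_eq!(a, b);
    assert_eq!(b, c);

    // Literals are already `&str`, so they pass straight through.
    assert_eq!(shout("peanut"), "PEANUT");
}
```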

In general, as you write Rust, if you need an immutable/shared reference to a string, 9 times out of 10 you should just type your function arguments and structs to accept a &str value and allow deref coercion to do its magic in case a user wants to pass you a &String.

However, if you need a string to be mutable, or if you want to have ownership over the string, then you'll need &mut String or just own the value outright with String.
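Put together, those rules of thumb might look something like this. The three function names are invented for the example; what matters is the parameter type each one picks.

```rust
// Read-only access: borrow a slice.
fn count_commas(s: &str) -> usize {
    s.matches(',').count()
}

// In-place mutation that may grow the string: borrow mutably.
fn add_exclamation(s: &mut String) {
    s.push('!');
}

// Taking ownership: the caller hands the String over for good.
fn into_greeting(name: String) -> String {
    format!("hello, {}", name)
}

fn main() {
    let mut s = String::from("hello, world");
    assert_eq!(count_commas(&s), 1);

    add_exclamation(&mut s);
    assert_eq!(s, "hello, world!");

    let greeting = into_greeting(String::from("peanut"));
    assert_eq!(greeting, "hello, peanut");
}
```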

Summary

Rust has a few string types: str, &str, String, and &String

  • str is bytes of data encoded as a UTF-8 string. It can live in static memory, the stack, or the heap. You will virtually never want a handle to this type specifically and will want to use one of the other three types.
  • &str is a fat pointer which points to some data encoded as a UTF-8 string living in static memory, the stack or the heap. In contrast to str, &strs are Sized, which allows us to store them in variables on the stack.
  • String is owned and growable string data whose contents are always allocated on the heap.
  • &String is a reference to a String, but since String has an implementation of Deref<Target = str>, it can be conveniently coerced into &str. We'll use this type if we want to be able to mutate our String (using &mut), but otherwise we should in general reach for the &str type instead.