The Danger of Blindly Trusting Compiler Help

Rust has a reputation of having good compiler error messages [citation needed].

I generally agree! However, blindly following the hints given by the compiler may sometimes hurt beginners who don’t fully understand the language. It doesn’t help that due to Rust’s excellent formatting of code suggestions [citation needed], the suggestions really seem Correct and Canonical.

Case Study

Consider this problem:

Write a function that takes in a word and a mutable reference to an output string, then appends an asterisk to the output if the entire word is made up of ASCII characters.

Does anyone actually need this exact function? Probably not! But the pattern of “appending to an output” happens a lot, and that’s the main focus here.

Here’s how someone who is a beginner to Rust but is familiar with other programming languages might approach the problem:

fn append_asterisk_if_ascii(target: String, result: &mut String) {
    // ...
}

With the following thought process:

My function needs to receive a string, which is a String in Rust, and also a mutable reference to another String, which I know I can do by using &mut.

And here’s how they might write the function body:

fn append_asterisk_if_ascii(target: String, result: &mut String) {
    if target.is_ascii() {
        result = result + String::from("*");
    }
}

With the following thought process:

If the target is ASCII, I need to add an asterisk to the result. I read the documentation on std::string::String, and know that I can create the String for the asterisk by using String::from.

And now the beginner is trapped:

 --> src/main.rs:3:25
  |
3 |         result = result + String::from("*");
  |                  ------ ^ ----------------- String
  |                  |      |
  |                  |      `+` cannot be used to concatenate a `&str` with a `String`
  |                  &mut String
  |
help: create an owned `String` on the left and add a borrow on the right
  |
3 |         result = result.to_owned() + &String::from("*");
  |                        +++++++++++   +

For more information about this error, try `rustc --explain E0369`.
error: could not compile `testing` (bin "testing") due to previous error

Leading to:

fn append_asterisk_if_ascii(target: String, result: &mut String) {
    if target.is_ascii() {
        result = result.to_owned() + &String::from("*");
    }
}

Resulting in:

 --> src/main.rs:3:18
  |
1 | fn append_asterisk_if_ascii(target: String, result: &mut String) {
  |                                                     ----------- expected due to this parameter type
2 |     if target.is_ascii() {
3 |         result = result.to_owned() + &String::from("*");
  |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `&mut String`, found `String`
  |
help: consider dereferencing here to assign to the mutably borrowed value
  |
3 |         *result = result.to_owned() + &String::from("*");
  |         +

For more information about this error, try `rustc --explain E0308`.
error: could not compile `testing` (bin "testing") due to previous error

Leading to:

fn append_asterisk_if_ascii(target: String, result: &mut String) {
    if target.is_ascii() {
        *result = result.to_owned() + &String::from("*");
    }
}

And now the program compiles.
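For context, the first suggestion follows from how + is defined for strings: the standard library implements it for an owned String on the left and a borrowed &str on the right. Here is a minimal sketch of that shape; the helper concat and the variable names are illustrative assumptions, not part of the original program.

```rust
/// Hypothetical helper: concatenates with `+`, which needs an owned
/// `String` on the left and a borrowed `&str` on the right.
fn concat(left: String, right: &str) -> String {
    // `+` takes `left` by value and appends `right` to it.
    left + right
}

fn main() {
    let star = String::from("*");

    // `&star` is a `&String`, which deref-coerces to `&str`.
    let joined = concat(String::from("abc"), &star);
    println!("{joined}");

    // `star` was only borrowed, so it is still usable here.
    println!("{star}");
}
```

This is why the compiler asks for an owned String on the left (to_owned()) and a borrow on the right (&): it is matching the one shape of + that exists, not judging whether + was the right tool.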

Space Analysis

People familiar with Rust might have been screaming for the past few paragraphs. Let’s first get one improvement out of the way. It is irrelevant to this particular case study, but very relevant in general and should be fixed:

fn append_asterisk_if_ascii(target: String, result: &mut String) {
    // ...
}

This function signature is overly specific. Since the only thing we need target for is the method .is_ascii, which does not mutate the String, we can avoid taking ownership of the String and use a &str instead, which is an immutable string slice.

The same reasoning does not extend to result, though. It must remain a &mut String: appending can require growing the underlying buffer, and a &mut str cannot change its length.
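To make the effect of the relaxed target type concrete, here is a small sketch. The function body is unchanged from the version that compiles, and only target’s type differs. result is kept as &mut String here because appending needs a growable buffer. The caller code and its variable names are assumptions for illustration.

```rust
// Same body as the compiling version; only the type of `target` changed.
fn append_asterisk_if_ascii(target: &str, result: &mut String) {
    if target.is_ascii() {
        *result = result.to_owned() + &String::from("*");
    }
}

fn main() {
    let mut result = String::new();

    let owned = String::from("hello");
    append_asterisk_if_ascii(&owned, &mut result); // borrowed from a String
    append_asterisk_if_ascii("world", &mut result); // string literal
    append_asterisk_if_ascii("Grüße", &mut result); // not ASCII: nothing appended

    // `owned` was never moved, so the caller can keep using it.
    println!("{owned}: {result}");
}
```

With target: String, the first call would have had to give up owned (or clone it), and the literals would each have needed their own String.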

Now to the meat and potatoes:

    if target.is_ascii() {
        *result = result.to_owned() + &String::from("*");
    }

This code is Not Good for one reason: It makes plenty of unnecessary memory allocations. In fact, it makes 2 extra allocations per call, when in the ideal case it makes 0. The allocations are:

  1. result.to_owned(), which creates a clone of result, which is a String.
  2. String::from("*"), which creates a clone of the &'static str that is "*".

Note that + itself does not create a new String. It appends to the buffer of the left-hand side, which in this case is result.to_owned(), and grows that buffer only if it lacks capacity.
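The buffer behaviour can be observed directly by comparing pointers. This is a sketch under a stated assumption: the strings are given spare capacity up front so that appending one byte never needs to grow them. Both helper functions are hypothetical names introduced for the illustration.

```rust
/// Returns true if `lhs + "*"` kept using the buffer that `lhs` owned.
fn plus_reuses_lhs_buffer() -> bool {
    // Spare capacity, so appending one byte does not need to grow.
    let mut lhs = String::with_capacity(16);
    lhs.push_str("abc");
    let buffer_before = lhs.as_ptr();

    // `+` moves `lhs` and appends into its existing buffer.
    let joined = lhs + "*";
    joined.as_ptr() == buffer_before
}

/// Returns true if the `to_owned() + ...` pattern left `result`
/// pointing at a different buffer than before.
fn to_owned_replaces_buffer() -> bool {
    let mut result = String::with_capacity(16);
    result.push_str("abc");
    let buffer_before = result.as_ptr();

    // The clone is allocated while the old buffer is still alive,
    // so the new buffer lives at a different address.
    result = result.to_owned() + "*";
    result.as_ptr() != buffer_before
}

fn main() {
    println!("+ reuses the LHS buffer: {}", plus_reuses_lhs_buffer());
    println!("to_owned() swaps the buffer: {}", to_owned_replaces_buffer());
}
```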

To check this, we’ll use the heap profiling crate dhat-rs. Here’s the code:

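#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn append_asterisk_if_ascii(target: &str, result: &mut String) {
    if target.is_ascii() {
        *result = result.to_owned() + &String::from("*");
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut result = String::with_capacity(10);
    let _profiler = dhat::Profiler::builder().testing().build();

    append_asterisk_if_ascii("ascii!", &mut result);

    let stats = dhat::HeapStats::get();
    println!("  Max blocks:\t{}", stats.max_blocks);
    println!("   Max bytes:\t{}", stats.max_bytes);
    println!("Total blocks:\t{}", stats.total_blocks);
    println!(" Total bytes:\t{}", stats.total_bytes);

    Ok(())
}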

A few things are of note here:

  • We ensure the result string has sufficient capacity before the loop to avoid growing the string during the loop. Note that in this case, ensuring capacity does not change the memory used because the function replaces result each call.
  • We create the heap profiler after creating the result string to avoid measuring the heap allocation during the creation of result.

Here are the results:

  Max blocks:   2
   Max bytes:   9
Total blocks:   2
 Total bytes:   9

From our analysis earlier we know why the maximum number of blocks is 2. The breakdown for maximum number of bytes is rather complicated, but the TLDR is that the minimum heap allocation size when growing a String is 8 bytes. If you’re interested, here’s the stack trace:

alloc::raw_vec::finish_grow                                                                                     (core/src/result.rs:0:23)
alloc::raw_vec::RawVec<T,A>::grow_amortized                                                                     (alloc/src/raw_vec.rs:404:19)
alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle                                                     (alloc/src/raw_vec.rs:289:28)
alloc::raw_vec::RawVec<T,A>::reserve                                                                            (alloc/src/raw_vec.rs:293:13)
alloc::vec::Vec<T,A>::reserve                                                                                   (src/vec/mod.rs:909:18)
alloc::vec::Vec<T,A>::append_elements                                                                           (src/vec/mod.rs:1992:9)
<alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<&T,core::slice::iter::Iter<T>>>::spec_extend       (src/vec/spec_extend.rs:55:23)
alloc::vec::Vec<T,A>::extend_from_slice                                                                         (src/vec/mod.rs:2438:9)
alloc::string::String::push_str                                                                                 (alloc/src/string.rs:903:9)
<alloc::string::String as core::ops::arith::Add<&str>>::add                                                     (alloc/src/string.rs:2264:14)
testing::append_asterisk_if_ascii                                                                               (testing/src/main.rs:6:19)
testing::main                                                                                                   (testing/src/main.rs:10:5)

with the relevant constant being MIN_NON_ZERO_CAP.

The 8 bytes, plus the 1 byte for String::from("*"), makes 9 bytes.

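Why isn’t the String::from("*") allocation 8 bytes as well? Because String::from takes a different code path, one that sizes the buffer from the length of the source string instead of growing it: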

<alloc::alloc::Global as core::alloc::Allocator>::allocate    (alloc/src/alloc.rs:241:9)
alloc::raw_vec::RawVec<T,A>::allocate_in                      (alloc/src/raw_vec.rs:184:45)
alloc::raw_vec::RawVec<T,A>::with_capacity_in                 (alloc/src/raw_vec.rs:130:9)
alloc::vec::Vec<T,A>::with_capacity_in                        (src/vec/mod.rs:670:20)
<T as alloc::slice::hack::ConvertVec>::to_vec                 (alloc/src/slice.rs:162:25)
alloc::slice::hack::to_vec                                    (alloc/src/slice.rs:111:9)
alloc::slice::<impl [T]>::to_vec_in                           (alloc/src/slice.rs:441:9)
alloc::slice::<impl [T]>::to_vec                              (alloc/src/slice.rs:416:14)
alloc::slice::<impl alloc::borrow::ToOwned for [T]>::to_owned (alloc/src/slice.rs:823:14)
alloc::str::<impl alloc::borrow::ToOwned for str>::to_owned   (alloc/src/str.rs:209:62)
<alloc::string::String as core::convert::From<&str>>::from    (alloc/src/string.rs:2612:11)
testing::append_asterisk_if_ascii                             (testing/src/main.rs:6:40)
testing::main                                                 (testing/src/main.rs:10:5)

What this code path does exactly is outside my pay grade.
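The difference between the two code paths can still be observed from the outside through capacity(). A minimal sketch follows. The two function names are hypothetical, and the exact numbers printed are an implementation detail of the standard library, so they may differ between Rust versions.

```rust
/// Capacity of a string built directly from a literal.
fn exact_capacity() -> usize {
    String::from("*").capacity()
}

/// Capacity of an initially empty string after appending one byte,
/// which goes through the growth path seen in the first stack trace.
fn grown_capacity() -> usize {
    let mut s = String::new();
    s.push_str("*");
    s.capacity()
}

fn main() {
    println!("String::from:        {} byte(s)", exact_capacity());
    println!("empty then push_str: {} byte(s)", grown_capacity());
}
```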

So what does a better version look like? Here it is:

fn append_asterisk_if_ascii(target: &str, result: &mut String) {
    if target.is_ascii() {
        result.push('*');
    }
}

Or this:

fn append_asterisk_if_ascii(target: &str, result: &mut String) {
    if target.is_ascii() {
        *result += "*";
    }
}

Both do a whopping 0 extra allocations, provided result still has enough capacity to fit the new content. This is because the underlying buffer in result is reused, instead of a new string being created to replace it. The effect becomes more pronounced with more iterations of append_asterisk_if_ascii:

    let num_asterisks = 100_000;
    let mut result = String::with_capacity(num_asterisks);
    let _profiler = dhat::Profiler::builder().testing().build();

    for _ in 0..num_asterisks {
        append_asterisk_if_ascii("full ascii!", &mut result);
    }

    let stats = dhat::HeapStats::get();

which still results in 0 allocations for the better version, but for the original…

  Max blocks:   3
   Max bytes:   399995
Total blocks:   299992
 Total bytes:   14999949980
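As a cross-check that needs no external crate, allocations can also be counted with a small wrapper around the system allocator. This is only a sketch and not how dhat works internally: CountingAlloc, allocations_during, append_slow and append_fast are names made up for the illustration, and the per-thread counter assumes a platform where a const-initialized thread-local can be touched from inside the allocator without itself allocating.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // Per-thread counter, so other threads cannot disturb a measurement.
    static ALLOCATIONS: Cell<usize> = const { Cell::new(0) };
}

/// Hypothetical wrapper around the system allocator that counts calls.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // `try_with` avoids a panic if the counter is unavailable,
        // for example during thread teardown.
        let _ = ALLOCATIONS.try_with(|count| count.set(count.get() + 1));
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
    // `realloc` is not overridden: the default goes through `alloc`,
    // so growing a buffer is counted as well.
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocations made on this thread while `f` runs.
fn allocations_during(f: impl FnOnce()) -> usize {
    let before = ALLOCATIONS.with(|count| count.get());
    f();
    ALLOCATIONS.with(|count| count.get()) - before
}

fn append_slow(target: &str, result: &mut String) {
    if target.is_ascii() {
        *result = result.to_owned() + &String::from("*");
    }
}

fn append_fast(target: &str, result: &mut String) {
    if target.is_ascii() {
        result.push('*');
    }
}

fn main() {
    // Both strings are created before measuring, as in the dhat setup.
    let mut slow = String::with_capacity(10);
    let mut fast = String::with_capacity(10);

    let slow_count = allocations_during(|| append_slow("ascii!", &mut slow));
    let fast_count = allocations_during(|| append_fast("ascii!", &mut fast));

    println!("slow: {slow_count} allocation(s), fast: {fast_count} allocation(s)");
}
```

A counter like this only reports how many allocations happened, not their sizes or call sites, which is where a real profiler such as dhat earns its keep.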

Time Analysis

Let’s use hyperfine to benchmark two versions of our program!

First, we’ll add the following to our Cargo.toml:

[features]
slow = []
fast = []

Then, we can include our two versions of append_asterisk_if_ascii:

#[cfg(feature = "slow")]
fn append_asterisk_if_ascii(target: &str, result: &mut String) {
    if target.is_ascii() {
        *result = result.to_string() + &String::from("*");
    }
}

#[cfg(feature = "fast")]
fn append_asterisk_if_ascii(target: &str, result: &mut String) {
    if target.is_ascii() {
        *result += "*";
    }
}

We’ll keep the same number of iterations as before, and run a comparison of the two features:


Benchmark 1: cargo run --release --features fast
  Time (mean ± σ):      40.4 ms ±   0.8 ms    [User: 30.5 ms, System: 9.7 ms]
  Range (min … max):    39.6 ms …  43.8 ms    67 runs

Benchmark 2: cargo run --release --features slow
  Time (mean ± σ):     759.6 ms ±   2.2 ms    [User: 257.7 ms, System: 496.8 ms]
  Range (min … max):   755.8 ms … 762.3 ms    10 runs

Summary
  cargo run --release --features fast ran
   18.81 ± 0.36 times faster than cargo run --release --features slow

18.8 times faster. Cool!
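For a quick in-process comparison without feature flags, both versions can also be timed side by side with std::time::Instant. This is a rough sketch, not a replacement for the hyperfine run: the names append_slow, append_fast and time_appends are made up here, the iteration count is smaller than in the benchmark above, and a single timed run like this is noisy, so treat the numbers as indicative only.

```rust
use std::time::{Duration, Instant};

fn append_slow(target: &str, result: &mut String) {
    if target.is_ascii() {
        *result = result.to_string() + &String::from("*");
    }
}

fn append_fast(target: &str, result: &mut String) {
    if target.is_ascii() {
        *result += "*";
    }
}

/// Calls `append` repeatedly on one output string and reports the
/// final string together with the elapsed wall-clock time.
fn time_appends(append: fn(&str, &mut String), iterations: usize) -> (String, Duration) {
    let mut result = String::with_capacity(iterations);
    let start = Instant::now();
    for _ in 0..iterations {
        append("full ascii!", &mut result);
    }
    (result, start.elapsed())
}

fn main() {
    // Smaller than the 100_000 used above, to keep the sketch quick.
    let iterations = 10_000;

    let (slow_result, slow_time) = time_appends(append_slow, iterations);
    let (fast_result, fast_time) = time_appends(append_fast, iterations);

    // Both versions must build the same output.
    assert_eq!(slow_result, fast_result);
    println!("slow: {slow_time:?}, fast: {fast_time:?}");
}
```

Build with --release when trying this, since unoptimized builds distort both measurements.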

Conclusion

To be clear, I am not saying you should disregard the hints or help messages given by the Rust compiler. However, you should not assume that the help provided is accurate or solves the underlying problem exactly. The diagnostics given by the compiler are usually narrowly focused, and local rather than global.

Unfortunately, this is a tough problem to solve. For people looking to learn Rust, there’s no way around taking the time to grok the reason for the language’s existence. Tools like Clippy help with writing idiomatic code, but they aren’t a panacea either. You just have to write code, possibly bad code, and keep telling yourself there must be a better way!
