Inclusive Range Performance

I got baited by this r/rust post about Zig being faster than Rust. The post itself is not the focus of this tidbit; rather, it's this comment on the same post:

Try to change this:

let sqrt = (n as f32).sqrt() as u32;
!(3..=sqrt)

…to:

let sqrt = (n as f32).sqrt() as u32 + 1;
!(3..sqrt) ...

…just for fun and see what happens. The ..= range can be significantly slower than the .. range because in some cases it compiles to two separate comparisons instead of one. I don’t know if it will make a difference in your case but I’m curious.

u/ObligatoryOption

After a short excursion, I learned something new!

Inclusive ranges are slower than their equivalent exclusive ranges.

The inclusive range version has to perform an additional check, since the upper bound may be the largest value of the integer type. Without the check, the iteration variable would overflow and cause an infinite loop; see also this comment by CAD1997. These checks also leave fewer optimization opportunities.
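To see why the check is needed, here is a minimal sketch of an inclusive loop over u8 (my own simplified model for illustration, not the actual std implementation): a plain `i <= end` condition alone is not enough, because when `end == u8::MAX` the counter wraps after the final increment and the condition holds forever.

```rust
// Sketch: why iterating an inclusive range needs an extra comparison.
// (Simplified model for illustration; not the actual std implementation.)
fn count_inclusive(start: u8, end: u8) -> u32 {
    let mut count = 0;
    let mut i = start;
    while i <= end {
        count += 1;
        // Extra check: stop *before* incrementing past `end`. Without it,
        // `i += 1` would overflow when `end == u8::MAX` (a panic in debug
        // builds; in release, `i` wraps to 0 and `i <= end` holds forever,
        // i.e. an infinite loop).
        if i == end {
            break;
        }
        i += 1;
    }
    count
}

fn main() {
    assert_eq!(count_inclusive(3, 5), 3);
    // Works even at the type's maximum, which is exactly the case the
    // exclusive rewrite `1..(end + 1)` cannot express:
    assert_eq!(count_inclusive(250, u8::MAX), 6);
    println!("ok");
}
```

The second comparison per iteration is the cost the quoted comment is referring to.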

Rust

pub fn exclusive(upper_limit: u64) -> u64 {
    let mut sum = 0;
    for _ in 1..(upper_limit + 1) {
        sum += std::hint::black_box(1);
    }
    sum
}

pub fn inclusive(upper_limit: u64) -> u64 {
    let mut sum = 0;
    for _ in 1..=upper_limit {
        sum += std::hint::black_box(1);
    }
    sum
}

Assembly

example::exclusive:
        lea     rax, [rdi - 1]
        cmp     rax, -3
        ja      .LBB0_1
        xor     eax, eax
        lea     rcx, [rsp - 8]
.LBB0_4:
        mov     qword ptr [rsp - 8], 1
        add     rax, qword ptr [rsp - 8]
        dec     rdi
        jne     .LBB0_4
        ret
.LBB0_1:
        xor     eax, eax
        ret

example::inclusive:
        test    rdi, rdi
        je      .LBB1_1
        mov     ecx, 1
        xor     eax, eax
        lea     rdx, [rsp - 8]
.LBB1_4:
        mov     rsi, rcx
        cmp     rcx, rdi
        adc     rcx, 0
        mov     qword ptr [rsp - 8], 1
        add     rax, qword ptr [rsp - 8]
        cmp     rsi, rdi
        jae     .LBB1_2
        cmp     rcx, rdi
        jbe     .LBB1_4
.LBB1_2:
        ret
.LBB1_1:
        xor     eax, eax
        ret

Benchmark

The benchmark was done using Criterion:

fn benchmark(c: &mut criterion::Criterion) {
    let mut group = c.benchmark_group("Iteration");
    for i in &[256, 512, 1024, 2048, 4096, 8192] {
        group.bench_with_input(
            criterion::BenchmarkId::new("Exclusive", i),
            i,
            |b, i| b.iter(|| exclusive(*i)),
        );
        group.bench_with_input(
            criterion::BenchmarkId::new("Inclusive", i),
            i,
            |b, i| b.iter(|| inclusive(*i)),
        );
    }
    group.finish();
}

criterion::criterion_group!(benches, benchmark);
criterion::criterion_main!(benches);
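To run this, Criterion benchmarks typically live in a separate bench target with the default harness disabled; a minimal Cargo.toml sketch (the bench target name and the version are assumptions):

```toml
# Cargo.toml (sketch; bench name and version are assumptions)
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "iteration"
harness = false
```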

And the output:

[Plot "Iteration: Comparison": average time (µs) against input size, for the Exclusive and Inclusive variants.]

As you can see, the exclusive range version performs about twice as fast as the inclusive range version.
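As the quoted comment suggests, whenever the upper bound is known to be strictly below the type's maximum, an inclusive range can be rewritten as the equivalent exclusive range. A minimal sketch of the equivalence:

```rust
fn main() {
    let upper: u64 = 100;

    // Inclusive version: 1, 2, ..., upper
    let inclusive_sum: u64 = (1..=upper).sum();

    // Exclusive rewrite: only valid while `upper < u64::MAX`,
    // otherwise `upper + 1` overflows.
    assert!(upper < u64::MAX);
    let exclusive_sum: u64 = (1..upper + 1).sum();

    assert_eq!(inclusive_sum, exclusive_sum); // both are 5050
    println!("{}", exclusive_sum);
}
```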
