-
Notifications
You must be signed in to change notification settings - Fork 13.3k
Tracking Issue for algebraic floating point methods #136469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Added lang as requested in rust-lang/libs-team#532 (comment). That comment also mentions |
These operations are non-deterministic. @nikic do you know if LLVM scalar evolution handles that properly? We had codegen issues in the past when SE assumed that an operation was deterministic and then actually it was not. |
We also have the "may or may not fuse" intrinsics added in #124874, which so far have not been exposed in any way that has a path to stabilization. Would |
Are these "algebraic" in the sense that they have reassoc FMF? If so, then yes, SE should be treating them as non-deterministic already: https://github.com/llvm/llvm-project/blob/6684a5970e74b8b4c0c83361a90e25dae9646db0/llvm/lib/Analysis/ConstantFolding.cpp#L1437-L1444 |
|
On that note I added an unresolved question for naming since |
In theory the flags could also be represented via const generics, which would allow more fine tuned control and more flags without a method explosion. Something like: #[derive(Clone, Copy, Debug, Default, ...)]
#[non_exhaustive]
struct FpArithOps {
reassociate: bool,
contract: bool,
reciprocal: bool,
no_signed_zeros: bool,
ftz: bool,
daz: bool,
poison_nan: bool,
poison_inf: bool,
}
impl FpArithOps {
// Current algebraic_* flags
const ALGEBRAIC: Self = Self { reassociate: true, contract: true, reciprocal: true, no_signed_zeros: true, ..false };
}
// Panics if `poison_*` is set
fn add_with_ops<const OPS: FpArithOps>(self, y: Self) -> Self;
// Alows `poison_*`
unsafe fn add_with_ops_unchecked<const OPS: FpArithOps>(self, y: Self) -> Self;
// same as `f32::algebraic_div`
let x = 1.0f32.div_with_ops::<FpArithOps::ALGEBRAIC>(y); (Using a struct needs the next step of const generic support, or some way to bless types in std) |
As I wrote in the ACP:
I don’t think that we should expose the power set of LLVM’s flags, nor so many ad-hoc combinations that “method explosion” becomes a realistic problem. I’m not even sure if it’s a good idea to ever expose any of FMFs that can make an operation return poison (has anyone ever used those soundly while still getting a speedup?). And FTZ/DAZ shouldn’t be modeled as per-operation flags because you don’t want to toggle the CPU control registers for that constantly. |
Expose algebraic floating point intrinsics # Problem A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization. See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H. ### C++: 10us ✅ With Clang 18.1.3 and `-O2 -march=haswell`: <table> <tr> <th>C++</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="cc"> float dot(float *a, float *b, size_t len) { #pragma clang fp reassociate(on) float sum = 0.0; for (size_t i = 0; i < len; ++i) { sum += a[i] * b[i]; } return sum; } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" /> </td> </tr> </table> ### Nightly Rust: 10us ✅ With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i])); } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" /> </td> </tr> </table> ### Stable Rust: 84us ❌ With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum += a[i] * b[i]; } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" /> </td> </tr> </table> # Proposed Change Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature. # Alternatives Considered rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit. # References * rust-lang#21690 * rust-lang/libs-team#532 * rust-lang#136469 * https://github.com/calder/dot-bench * https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
Should |
Expose algebraic floating point intrinsics # Problem A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization. See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H. ### C++: 10us ✅ With Clang 18.1.3 and `-O2 -march=haswell`: <table> <tr> <th>C++</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="cc"> float dot(float *a, float *b, size_t len) { #pragma clang fp reassociate(on) float sum = 0.0; for (size_t i = 0; i < len; ++i) { sum += a[i] * b[i]; } return sum; } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" /> </td> </tr> </table> ### Nightly Rust: 10us ✅ With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i])); } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" /> </td> </tr> </table> ### Stable Rust: 84us ❌ With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum += a[i] * b[i]; } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" /> </td> </tr> </table> # Proposed Change Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature. # Alternatives Considered rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit. # References * rust-lang#21690 * rust-lang/libs-team#532 * rust-lang#136469 * https://github.com/calder/dot-bench * https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps try-job: x86_64-gnu-nopt
Expose algebraic floating point intrinsics # Problem A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization. See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H. ### C++: 10us ✅ With Clang 18.1.3 and `-O2 -march=haswell`: <table> <tr> <th>C++</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="cc"> float dot(float *a, float *b, size_t len) { #pragma clang fp reassociate(on) float sum = 0.0; for (size_t i = 0; i < len; ++i) { sum += a[i] * b[i]; } return sum; } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" /> </td> </tr> </table> ### Nightly Rust: 10us ✅ With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i])); } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" /> </td> </tr> </table> ### Stable Rust: 84us ❌ With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum += a[i] * b[i]; } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" /> </td> </tr> </table> # Proposed Change Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature. # Alternatives Considered rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit. # References * rust-lang#21690 * rust-lang/libs-team#532 * rust-lang#136469 * https://github.com/calder/dot-bench * https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps try-job: x86_64-gnu-nopt
Expose algebraic floating point intrinsics # Problem A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization. See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H. ### C++: 10us ✅ With Clang 18.1.3 and `-O2 -march=haswell`: <table> <tr> <th>C++</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="cc"> float dot(float *a, float *b, size_t len) { #pragma clang fp reassociate(on) float sum = 0.0; for (size_t i = 0; i < len; ++i) { sum += a[i] * b[i]; } return sum; } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" /> </td> </tr> </table> ### Nightly Rust: 10us ✅ With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i])); } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" /> </td> </tr> </table> ### Stable Rust: 84us ❌ With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum += a[i] * b[i]; } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" /> </td> </tr> </table> # Proposed Change Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature. # Alternatives Considered rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit. # References * rust-lang#21690 * rust-lang/libs-team#532 * rust-lang#136469 * https://github.com/calder/dot-bench * https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps try-job: x86_64-gnu-nopt
Expose algebraic floating point intrinsics # Problem A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization. See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H. ### C++: 10us ✅ With Clang 18.1.3 and `-O2 -march=haswell`: <table> <tr> <th>C++</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="cc"> float dot(float *a, float *b, size_t len) { #pragma clang fp reassociate(on) float sum = 0.0; for (size_t i = 0; i < len; ++i) { sum += a[i] * b[i]; } return sum; } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" /> </td> </tr> </table> ### Nightly Rust: 10us ✅ With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i])); } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" /> </td> </tr> </table> ### Stable Rust: 84us ❌ With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum += a[i] * b[i]; } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" /> </td> </tr> </table> # Proposed Change Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature. # Alternatives Considered rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit. # References * rust-lang#21690 * rust-lang/libs-team#532 * rust-lang#136469 * https://github.com/calder/dot-bench * https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps ~~try-job: x86_64-gnu-nopt~~ try-job: x86_64-gnu-aux
Expose algebraic floating point intrinsics # Problem A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization. See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H. ### C++: 10us ✅ With Clang 18.1.3 and `-O2 -march=haswell`: <table> <tr> <th>C++</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="cc"> float dot(float *a, float *b, size_t len) { #pragma clang fp reassociate(on) float sum = 0.0; for (size_t i = 0; i < len; ++i) { sum += a[i] * b[i]; } return sum; } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" /> </td> </tr> </table> ### Nightly Rust: 10us ✅ With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i])); } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" /> </td> </tr> </table> ### Stable Rust: 84us ❌ With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum += a[i] * b[i]; } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" /> </td> </tr> </table> # Proposed Change Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature. # Alternatives Considered rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit. # References * rust-lang#21690 * rust-lang/libs-team#532 * rust-lang#136469 * https://github.com/calder/dot-bench * https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps try-job: x86_64-gnu-nopt try-job: x86_64-gnu-aux
Expose algebraic floating point intrinsics # Problem A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization. See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H. ### C++: 10us ✅ With Clang 18.1.3 and `-O2 -march=haswell`: <table> <tr> <th>C++</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="cc"> float dot(float *a, float *b, size_t len) { #pragma clang fp reassociate(on) float sum = 0.0; for (size_t i = 0; i < len; ++i) { sum += a[i] * b[i]; } return sum; } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" /> </td> </tr> </table> ### Nightly Rust: 10us ✅ With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i])); } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" /> </td> </tr> </table> ### Stable Rust: 84us ❌ With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum += a[i] * b[i]; } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" /> </td> </tr> </table> # Proposed Change Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature. # Alternatives Considered rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit. # References * rust-lang#21690 * rust-lang/libs-team#532 * rust-lang#136469 * https://github.com/calder/dot-bench * https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps try-job: x86_64-gnu-nopt try-job: x86_64-gnu-aux
Expose algebraic floating point intrinsics # Problem A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization. See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H. ### C++: 10us ✅ With Clang 18.1.3 and `-O2 -march=haswell`: <table> <tr> <th>C++</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="cc"> float dot(float *a, float *b, size_t len) { #pragma clang fp reassociate(on) float sum = 0.0; for (size_t i = 0; i < len; ++i) { sum += a[i] * b[i]; } return sum; } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" /> </td> </tr> </table> ### Nightly Rust: 10us ✅ With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i])); } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" /> </td> </tr> </table> ### Stable Rust: 84us ❌ With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum += a[i] * b[i]; } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" /> </td> </tr> </table> # Proposed Change Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature. # Alternatives Considered rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit. # References * rust-lang#21690 * rust-lang/libs-team#532 * rust-lang#136469 * https://github.com/calder/dot-bench * https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps try-job: x86_64-gnu-nopt try-job: x86_64-gnu-aux
I think it is fairly clear. The operations allow algebraically justified optimizations, as-if the arithmetic was real arithmetic. I don't think you're going to find a clearer name, but feel free to provide suggestions. One alternative one might consider is |
Expose algebraic floating point intrinsics # Problem A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization. See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H. ### C++: 10us ✅ With Clang 18.1.3 and `-O2 -march=haswell`: <table> <tr> <th>C++</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="cc"> float dot(float *a, float *b, size_t len) { #pragma clang fp reassociate(on) float sum = 0.0; for (size_t i = 0; i < len; ++i) { sum += a[i] * b[i]; } return sum; } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" /> </td> </tr> </table> ### Nightly Rust: 10us ✅ With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i])); } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" /> </td> </tr> </table> ### Stable Rust: 84us ❌ With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum += a[i] * b[i]; } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" /> </td> </tr> </table> # Proposed Change Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature. # Alternatives Considered rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit. # References * rust-lang#21690 * rust-lang/libs-team#532 * rust-lang#136469 * https://github.com/calder/dot-bench * https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps try-job: x86_64-gnu-nopt try-job: x86_64-gnu-aux
Expose algebraic floating point intrinsics # Problem A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization. See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H. ### C++: 10us ✅ With Clang 18.1.3 and `-O2 -march=haswell`: <table> <tr> <th>C++</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="cc"> float dot(float *a, float *b, size_t len) { #pragma clang fp reassociate(on) float sum = 0.0; for (size_t i = 0; i < len; ++i) { sum += a[i] * b[i]; } return sum; } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" /> </td> </tr> </table> ### Nightly Rust: 10us ✅ With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i])); } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" /> </td> </tr> </table> ### Stable Rust: 84us ❌ With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum += a[i] * b[i]; } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" /> </td> </tr> </table> # Proposed Change Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature. # Alternatives Considered rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit. # References * rust-lang#21690 * rust-lang/libs-team#532 * rust-lang#136469 * https://github.com/calder/dot-bench * https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps try-job: x86_64-gnu-nopt try-job: x86_64-gnu-aux
Expose algebraic floating point intrinsics # Problem A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization. See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H. ### C++: 10us ✅ With Clang 18.1.3 and `-O2 -march=haswell`: <table> <tr> <th>C++</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="cc"> float dot(float *a, float *b, size_t len) { #pragma clang fp reassociate(on) float sum = 0.0; for (size_t i = 0; i < len; ++i) { sum += a[i] * b[i]; } return sum; } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" /> </td> </tr> </table> ### Nightly Rust: 10us ✅ With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i])); } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" /> </td> </tr> </table> ### Stable Rust: 84us ❌ With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`: <table> <tr> <th>Rust</th> <th>Assembly</th> </tr> <tr> <td> <pre lang="rust"> fn dot(a: &[f32], b: &[f32]) -> f32 { let mut sum = 0.0; for i in 0..a.len() { sum += a[i] * b[i]; } sum } </pre> </td> <td> <img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" /> </td> </tr> </table> # Proposed Change Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature. # Alternatives Considered rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles. In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit. # References * rust-lang#21690 * rust-lang/libs-team#532 * rust-lang#136469 * https://github.com/calder/dot-bench * https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps try-job: x86_64-gnu-nopt try-job: x86_64-gnu-aux
Seems fine to me. If you're relying on floats being unoptimized in const eval in a frequently used crate then that's on you |
Seems OK to me. We of course should just document what people should not rely on here. |
Seems trivially-fine to allow them in I do think Ralf's concern is at least worth a conversation in stabilization discussions, but I think that'll probably also end up arising equivalently at runtime in various ways, so I don't think there's a need to block nightly experimentation as lang. (Obviously if the maintenance burden is too high, or something, other may still block for other reasons.) |
No objections. |
@rustbot labels -I-lang-nominated We discussed this in the lang call today, and we had appetite for these being |
…r=RalfJung Make algebraic functions into `const fn` items. Tracking issue: rust-lang#136469 This PR makes the algebraic intrinsics and the unstable, algebraic functions of `f16`, `f32`, `f64`, and `f128` into `const fn` items: ```rust impl f16 { pub const fn algebraic_add(self, rhs: f16) -> f16; pub const fn algebraic_sub(self, rhs: f16) -> f16; pub const fn algebraic_mul(self, rhs: f16) -> f16; pub const fn algebraic_div(self, rhs: f16) -> f16; pub const fn algebraic_rem(self, rhs: f16) -> f16; } impl f32 { pub const fn algebraic_add(self, rhs: f32) -> f32; pub const fn algebraic_sub(self, rhs: f32) -> f32; pub const fn algebraic_mul(self, rhs: f32) -> f32; pub const fn algebraic_div(self, rhs: f32) -> f32; pub const fn algebraic_rem(self, rhs: f32) -> f32; } impl f64 { pub const fn algebraic_add(self, rhs: f64) -> f64; pub const fn algebraic_sub(self, rhs: f64) -> f64; pub const fn algebraic_mul(self, rhs: f64) -> f64; pub const fn algebraic_div(self, rhs: f64) -> f64; pub const fn algebraic_rem(self, rhs: f64) -> f64; } impl f128 { pub const fn algebraic_add(self, rhs: f128) -> f128; pub const fn algebraic_sub(self, rhs: f128) -> f128; pub const fn algebraic_mul(self, rhs: f128) -> f128; pub const fn algebraic_div(self, rhs: f128) -> f128; pub const fn algebraic_rem(self, rhs: f128) -> f128; } // core::intrinsics pub const fn fadd_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn fsub_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn fmul_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn fdiv_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn frem_algebraic<T: Copy>(a: T, b: T) -> T; ``` This PR does not preserve the initial behaviour of these functions yielding non-deterministic output under Miri; it is most likely desired to reimplement this behaviour at some point.
…r=RalfJung Make algebraic functions into `const fn` items. Tracking issue: rust-lang#136469 This PR makes the algebraic intrinsics and the unstable, algebraic functions of `f16`, `f32`, `f64`, and `f128` into `const fn` items: ```rust impl f16 { pub const fn algebraic_add(self, rhs: f16) -> f16; pub const fn algebraic_sub(self, rhs: f16) -> f16; pub const fn algebraic_mul(self, rhs: f16) -> f16; pub const fn algebraic_div(self, rhs: f16) -> f16; pub const fn algebraic_rem(self, rhs: f16) -> f16; } impl f32 { pub const fn algebraic_add(self, rhs: f32) -> f32; pub const fn algebraic_sub(self, rhs: f32) -> f32; pub const fn algebraic_mul(self, rhs: f32) -> f32; pub const fn algebraic_div(self, rhs: f32) -> f32; pub const fn algebraic_rem(self, rhs: f32) -> f32; } impl f64 { pub const fn algebraic_add(self, rhs: f64) -> f64; pub const fn algebraic_sub(self, rhs: f64) -> f64; pub const fn algebraic_mul(self, rhs: f64) -> f64; pub const fn algebraic_div(self, rhs: f64) -> f64; pub const fn algebraic_rem(self, rhs: f64) -> f64; } impl f128 { pub const fn algebraic_add(self, rhs: f128) -> f128; pub const fn algebraic_sub(self, rhs: f128) -> f128; pub const fn algebraic_mul(self, rhs: f128) -> f128; pub const fn algebraic_div(self, rhs: f128) -> f128; pub const fn algebraic_rem(self, rhs: f128) -> f128; } // core::intrinsics pub const fn fadd_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn fsub_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn fmul_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn fdiv_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn frem_algebraic<T: Copy>(a: T, b: T) -> T; ``` This PR does not preserve the initial behaviour of these functions yielding non-deterministic output under Miri; it is most likely desired to reimplement this behaviour at some point.
Rollup merge of rust-lang#140172 - bjoernager:const-float-algebraic, r=RalfJung Make algebraic functions into `const fn` items. Tracking issue: rust-lang#136469 This PR makes the algebraic intrinsics and the unstable, algebraic functions of `f16`, `f32`, `f64`, and `f128` into `const fn` items: ```rust impl f16 { pub const fn algebraic_add(self, rhs: f16) -> f16; pub const fn algebraic_sub(self, rhs: f16) -> f16; pub const fn algebraic_mul(self, rhs: f16) -> f16; pub const fn algebraic_div(self, rhs: f16) -> f16; pub const fn algebraic_rem(self, rhs: f16) -> f16; } impl f32 { pub const fn algebraic_add(self, rhs: f32) -> f32; pub const fn algebraic_sub(self, rhs: f32) -> f32; pub const fn algebraic_mul(self, rhs: f32) -> f32; pub const fn algebraic_div(self, rhs: f32) -> f32; pub const fn algebraic_rem(self, rhs: f32) -> f32; } impl f64 { pub const fn algebraic_add(self, rhs: f64) -> f64; pub const fn algebraic_sub(self, rhs: f64) -> f64; pub const fn algebraic_mul(self, rhs: f64) -> f64; pub const fn algebraic_div(self, rhs: f64) -> f64; pub const fn algebraic_rem(self, rhs: f64) -> f64; } impl f128 { pub const fn algebraic_add(self, rhs: f128) -> f128; pub const fn algebraic_sub(self, rhs: f128) -> f128; pub const fn algebraic_mul(self, rhs: f128) -> f128; pub const fn algebraic_div(self, rhs: f128) -> f128; pub const fn algebraic_rem(self, rhs: f128) -> f128; } // core::intrinsics pub const fn fadd_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn fsub_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn fmul_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn fdiv_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn frem_algebraic<T: Copy>(a: T, b: T) -> T; ``` This PR does not preserve the initial behaviour of these functions yielding non-deterministic output under Miri; it is most likely desired to reimplement this behaviour at some point.
…r=RalfJung Make algebraic functions into `const fn` items. Tracking issue: rust-lang#136469 This PR makes the algebraic intrinsics and the unstable, algebraic functions of `f16`, `f32`, `f64`, and `f128` into `const fn` items: ```rust impl f16 { pub const fn algebraic_add(self, rhs: f16) -> f16; pub const fn algebraic_sub(self, rhs: f16) -> f16; pub const fn algebraic_mul(self, rhs: f16) -> f16; pub const fn algebraic_div(self, rhs: f16) -> f16; pub const fn algebraic_rem(self, rhs: f16) -> f16; } impl f32 { pub const fn algebraic_add(self, rhs: f32) -> f32; pub const fn algebraic_sub(self, rhs: f32) -> f32; pub const fn algebraic_mul(self, rhs: f32) -> f32; pub const fn algebraic_div(self, rhs: f32) -> f32; pub const fn algebraic_rem(self, rhs: f32) -> f32; } impl f64 { pub const fn algebraic_add(self, rhs: f64) -> f64; pub const fn algebraic_sub(self, rhs: f64) -> f64; pub const fn algebraic_mul(self, rhs: f64) -> f64; pub const fn algebraic_div(self, rhs: f64) -> f64; pub const fn algebraic_rem(self, rhs: f64) -> f64; } impl f128 { pub const fn algebraic_add(self, rhs: f128) -> f128; pub const fn algebraic_sub(self, rhs: f128) -> f128; pub const fn algebraic_mul(self, rhs: f128) -> f128; pub const fn algebraic_div(self, rhs: f128) -> f128; pub const fn algebraic_rem(self, rhs: f128) -> f128; } // core::intrinsics pub const fn fadd_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn fsub_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn fmul_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn fdiv_algebraic<T: Copy>(a: T, b: T) -> T; pub const fn frem_algebraic<T: Copy>(a: T, b: T) -> T; ``` This PR does not preserve the initial behaviour of these functions yielding non-deterministic output under Miri; it is most likely desired to reimplement this behaviour at some point.
These names suggest that the operations are more accurate than normal, where really they are less accurate. One might misinterpret that these are infinite-precision operations (perhaps with rounding after a whole sequence of operations). The actual meaning isn't that these are real number operations, it's quite the opposite: they have best-effort precision with no strict guarantees. I find "algebraic" confusing for the same reason. How about |
Saying "approximate" feels imperfect, as while these operations don't promise to produce the exact IEEE result on a per-operation basis, the overall result might well be more accurate algebraically. E.g.: let mut x = 0.0f32;
for _ in 0..100000000 {
x = x.algebraic_add(0.00000001);
}
println!("algebraic: {x}");
let mut x = 0.0f32;
for _ in 0..100000000 {
x += 0.00000001;
}
println!("exact per op: {x}"); Produces (on release mode):
(The algebraically correct result is Of course, there's no guarantee that it will be. It's just that we've allowed "algebraically justifiable" optimizations to happen. |
I’m not necessarily opposed to “approximate” but I have some quibbles with it too. While there are no guarantees about precision, in practice the most important optimizations these methods enable (reassociation, mul-add contraction) are often neutral to beneficial for precision. To me, the “approximate” name sounds more like what LLVM’s |
I strongly disagree with this naming. With the few exceptions of exactly representable input/output pairs, almost all floating point math is already an approximation. And it would imply optimizations that aren't permissible.
Whether or not they are more or less accurate depends on the exact expression and input domain. For example, |
This comment has been minimized.
This comment has been minimized.
@bjoernager There are more optimizations involved in these algebraic ops than simple reassociation. |
There had been some discussion about removing the |
To justify my "approximate" naming proposal: IEEE-754 specifies the exact semantics of +, -, *, /, sqrt and that semantics is followed precisely to the last bit by the regular arithmetic operations. The "algebraic" operations are "approximate" in the sense that they are not guaranteed to follow the IEEE-754 specification exactly. |
Not in direct reply to anything specifically here, but for our reference, the flags we're turning on are (quoting from the LLVM docs):
We're not turning on:
Nor are we turning on these flags, which would be implied (along with
|
This would be a relevant consideration if there was |
It's mentioned just for reference, in light of the point @hanna-kruppe made in #136469 (comment). |
@bjoernager The “fast math” terminology (also |
I don't see why "fast" must inherently also equate to "exact" or "precise." The whole point of these functions is to optimise for execution speed, not accuracy. Just like I can run "fast" with a basket but I might drop some stuff from it, I can also take it slow and assure that its contents will be kept intact. Anyone that reads the docs would know these complications of fast arithmetic, and I don't see a reason to idiot-proof the language for people that won't bother... After all, there is a plethora of other pitfalls that can lead to junk logic or calculations etc. (which in and of itself does not break any safety guarantees). |
I agree that "Algebraic" can mean many things. "Algebraic numbers" are roots of polynomials, so √2 is algebraic and π is not an algebraic number. There is also a related concept of "algebraic function". It's obviously not what is meant here. The name "algebraic" tries to suggest what kind of optimizations are possible on this, but names shouldn't concentrate on what optimizations are allowed, they should talk about the semantics of the operation, and optimizations follow from that. It's hard for me to interpret "algebraic" in any semantic way. I still think I think the root of the naming issue is that there is no clearly defined semantics: "The exact set of optimizations is unspecified". So it's really kind of |
@bjoernager Naming shouldn’t be constrained by the worst hypothetical misunderstanding someone can invent. But with this exact name, in this exact domain, we have decades of experience showing that it leads to misunderstandings and misuse by many users. And I don’t want to just blame users for it: it’s a maximally misleading name because it focuses on the aspect that needs the least attention. Sure, you can carefully interpret it as “fast as a trade-off against some other desirable property” but that’s not where most people’s minds go first. Especially not in Rust, which has “efficiency” right in its tagline and prides itself on not trading off reliability for it. |
I believe
The response in #136469 (comment) resonates with me as well here. "approximate" would be relative to the default operations, which are as exact as possible given finite precision. I don't think there is any risk of a confusing implication that default operations act with infinite precision. That said, the logic holds less well if we wind up with something like Personally I do like |
But I'm not sure how much tolerance is reasonable to imply because even a simple re-association such as |
Feature gate:
#![feature(float_algebraic)]
This is a tracking issue for exposing
core::intrinsics::f*_algebraic
in stable Rust, including in const.Public API
Steps / History
Unresolved Questions
References
cc @rust-lang/lang @rust-lang/libs-api
Footnotes
https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html ↩
The text was updated successfully, but these errors were encountered: