SIMD Array-of-Structures-of-Arrays in nalgebra and comparison with ultraviolet
Hello everyone!
In this post I'd like to introduce the next major change that will be released in nalgebra at the end of this month (March 2020). This change is about adding the support for SIMD AoSoA to nalgebra. I'll explain what I mean by SIMD AoSoA (Array-of-Structures-of-Arrays with explicit SIMD) and how it relates to SoA (Structure-of-Arrays) and AoS (Array-of-Structures). To give you an idea, SIMD AoSoA is actually what the recent ultraviolet crate has been using to achieve its amazing performance.
Here is a benchmark including the next (to-be-released) version of nalgebra. The best times within a 2.5% range of the minimum of each row are highlighted.
Here, the ultraviolet column uses the non-wide types
of ultraviolet
(Vec2
, Mat2
, Isometry3
, etc.) while the column ultraviolet_f32x4 uses its wide types (Wec2
, Wat2
, WIsometry3
, etc.):
benchmark | nalgebra_f32x4 | ultraviolet_f32x4 | nalgebra | ultraviolet | glam | vek | cgmath | euclid |
---|---|---|---|---|---|---|---|---|
euler 2d x10000 | 2.992 us | 3.014 us | 9.028 us | 5.28 us | 5.166 us | 5.258 us | 5.259 us | 8.631 us |
euler 3d x10000 | 4.546 us | 4.587 us | 17.84 us | 18.34 us | 6.311 us | 17.57 us | 18.04 us | 17.97 us |
isometry transform point2 x1 | 8.1024 ns | 8.6412 ns | 23.4637 ns | 54.5840 ns | N/A | N/A | N/A | N/A |
isometry transform point2 x100 | 2.801 us | 2.837 us | 3.109 us | 4.918 us | N/A | N/A | N/A | N/A |
isometry transform point3 x1 | 16.1052 ns | 20.6723 ns | 61.8824 ns | 330.1143 ns | N/A | N/A | N/A | N/A |
isometry transform point3 x100 | 4.515 us | 4.794 us | 6.546 us | 18.19 us | N/A | N/A | N/A | N/A |
isometry2 inverse | 7.3004 ns | 7.7692 ns | 24.8090 ns | 59.3838 ns | N/A | N/A | N/A | N/A |
isometry2 mul isometry2 | 12.3278 ns | 12.7816 ns | 34.5714 ns | 110.7837 ns | N/A | N/A | N/A | N/A |
isometry3 inverse | 17.1929 ns | 21.4199 ns | 63.9200 ns | 212.2193 ns | N/A | N/A | N/A | N/A |
isometry3 mul isometry3 | 27.3814 ns | 40.3955 ns | 100.4628 ns | 447.4780 ns | N/A | N/A | N/A | N/A |
matrix2 determinant | N/A | N/A | 16.0198 ns | N/A | 11.2070 ns | 15.8621 ns | 16.0906 ns | N/A |
matrix2 inverse | N/A | N/A | 26.5165 ns | N/A | 18.9069 ns | N/A | 26.2478 ns | N/A |
matrix2 mul matrix2 | 12.5352 ns | 10.7532 ns | 25.4533 ns | 25.7148 ns | 16.3060 ns | 301.3775 ns | 25.6988 ns | N/A |
matrix2 mul vector2 x1 | 7.1695 ns | 8.0907 ns | 23.0932 ns | 24.4506 ns | 16.4141 ns | 93.3601 ns | 25.0774 ns | N/A |
matrix2 mul vector2 x100 | 2.684 us | 2.698 us | 3.137 us | 3.174 us | 2.828 us | 9.558 us | 3.122 us | N/A |
matrix2 transpose | 6.4984 ns | N/A | 11.0205 ns | N/A | 10.9890 ns | 10.8180 ns | 10.3812 ns | N/A |
matrix3 determinant | N/A | N/A | 33.1252 ns | N/A | 22.2592 ns | 31.8128 ns | 31.7086 ns | N/A |
matrix3 inverse | N/A | 26.7493 ns | 92.8270 ns | 284.1032 ns | 93.5328 ns | N/A | 82.3820 ns | N/A |
matrix3 mul matrix3 | 0.02883 us | 0.02638 us | 0.09077 us | 0.08728 us | 0.0474 us | 1.005 us | 0.08579 us | N/A |
matrix3 mul vector3 x1 | 13.7548 ns | 13.6309 ns | 46.4205 ns | 46.5016 ns | 21.5466 ns | 317.8168 ns | 47.3542 ns | N/A |
matrix3 mul vector3 x100 | 4.835 us | 4.917 us | 6.138 us | 6.261 us | 6.633 us | 32.93 us | 6.297 us | N/A |
matrix3 transpose | 15.7501 ns | 15.4925 ns | 52.7300 ns | 53.0845 ns | 24.1975 ns | 55.4793 ns | 52.0874 ns | N/A |
matrix4 determinant | N/A | N/A | 690.7027 ns | N/A | 85.5778 ns | 176.1770 ns | 110.6990 ns | 174.6754 ns |
matrix4 inverse | N/A | 0.08342 us | 0.5282 us | 0.3953 us | 0.257 us | 3.792 us | 0.5096 us | 0.651 us |
matrix4 mul matrix4 | 0.06897 us | 0.07068 us | 0.1285 us | 0.1493 us | 0.07653 us | 2.024 us | 0.1185 us | 0.1152 us |
matrix4 mul vector4 x1 | 21.1243 ns | 20.6690 ns | 31.5352 ns | 30.3570 ns | 24.7876 ns | 512.0570 ns | 30.9776 ns | N/A |
matrix4 mul vector4 x100 | 7.595 us | 7.662 us | 8.472 us | 8.379 us | 7.823 us | 51.42 us | 8.366 us | N/A |
matrix4 transpose | 26.2948 ns | 27.1962 ns | 103.0829 ns | 101.3104 ns | 29.2236 ns | 105.6003 ns | 98.1151 ns | N/A |
quaternion conjugate | 6.6130 ns | 6.6179 ns | 24.5468 ns | 25.5657 ns | 11.1447 ns | 24.2229 ns | 25.7246 ns | 24.9413 ns |
quaternion mul quaternion | 14.6380 ns | 16.6299 ns | 39.3908 ns | 274.4676 ns | 27.3533 ns | 62.0697 ns | 35.0299 ns | 59.5286 ns |
quaternion mul vector3 | 14.7579 ns | 18.5064 ns | 52.7303 ns | 170.8328 ns | 44.1336 ns | 81.1243 ns | 52.5011 ns | 54.1958 ns |
transform point2 x1 | N/A | N/A | 36.6350 ns | N/A | 25.6516 ns | 337.3442 ns | 30.5781 ns | 28.5532 ns |
transform point2 x100 | N/A | N/A | 5.475 us | N/A | 5.732 us | 33.81 us | 5.118 us | 4.003 us |
transform point3 x1 | N/A | N/A | 66.3694 ns | N/A | 24.0601 ns | 517.9224 ns | 67.8205 ns | 68.0836 ns |
transform point3 x100 | N/A | N/A | 9.403 us | N/A | 7.824 us | 52.5 us | 9.559 us | 10.07 us |
transform vector2 x1 | N/A | N/A | 32.5667 ns | N/A | 21.3971 ns | 330.5713 ns | N/A | 23.5693 ns |
transform vector2 x100 | N/A | N/A | 5.392 us | N/A | 5.579 us | 33.45 us | N/A | 3.909 us |
transform vector3 x1 | N/A | N/A | 55.3600 ns | N/A | 24.3628 ns | 515.1730 ns | 59.1129 ns | 47.9383 ns |
transform vector3 x100 | N/A | N/A | 9.185 us | N/A | 7.958 us | 52.39 us | 8.992 us | 8.665 us |
transform2 inverse | N/A | N/A | 87.2060 ns | N/A | N/A | N/A | N/A | 45.0272 ns |
transform2 mul transform2 | N/A | N/A | 87.5702 ns | N/A | N/A | N/A | N/A | 40.1995 ns |
transform3 inverse | N/A | N/A | 542.8401 ns | N/A | N/A | N/A | N/A | 748.2648 ns |
transform3 mul transform3d | N/A | N/A | 118.0659 ns | N/A | N/A | N/A | N/A | 115.3294 ns |
vector3 cross | 6.4232 ns | 6.2395 ns | 25.0006 ns | 36.5961 ns | 15.0303 ns | 28.6082 ns | 28.6343 ns | 29.5467 ns |
vector3 dot | 6.6820 ns | 6.3391 ns | 25.5593 ns | 26.1824 ns | 13.6831 ns | 25.4740 ns | 25.9099 ns | 26.1271 ns |
vector3 length | 5.3741 ns | 5.2332 ns | 20.5369 ns | 34.2988 ns | 20.5652 ns | 20.6259 ns | 20.9281 ns | 20.6052 ns |
vector3 normalize | 15.5892 ns | 15.6585 ns | 59.1804 ns | 60.9510 ns | 35.7763 ns | 61.3666 ns | 36.7304 ns | 61.3199 ns |
Please see the last section of this post for a more comprehensive benchmark (including the use of f32x8
and f32x16
with nalgebra) and details about the benchmark conditions.
What is AoSoA with explicit SIMD?
The data layout I call here AoSoA with explicit SIMD (or SIMD AoSoA for short) is a straightforward variant of the AoSoA (Array-of-Structures-of-Arrays) layout which is itself a
combination of two more common layouts: SoA (Structure-of-Arrays) and AoS (Array-of-Structures). So let's see what is the difference
of all those layouts with the simple example using 3D vectors (vectors with three components x, y, z
): given two arrays of
1024 3D vectors, we want to compute the sum of each pairs of vectors with the same index.
Note: The explanations here are merely a superficial introduction of the AoS vs SoA vs AoSoA concepts. I just want to explain some of the differences and some of their advantages/disadvantages. I won't give any detailed analysis of the generated assembly code after compiling the examples provided. The benchmarks at the end of this post will show the performance difference between AoS and SIMD AoSoA. You may be interested by that article too.
Note 2: For iterating through the arrays, I'll be using explicit indices for i in 0..1024
instead of iterators. This
is on purpose to make the number of iterations explicit and to make the code easier to understand for readers that are not
used to Rust iterators.
Array-of-Structures (AoS)
The Array-of-Structures layout is the most common and intuitive layout. You define a structure as the aggregation of all its fields. And multiple structures of the same type are simply stored one after the other inside of an array. Here is what our 3D vector would look like:
Here, we need to iterate through each pair of vectors, one from each set, and execute our sum. This is arguably the most intuitive way of doing this, but not necessarily the most efficient. All Rust linear algebra crates from this benchmark can work with this layout.
Pros of AoS
- It is easy to read/write/modify each vector individually.
- AoS are generally easier to reason with when designing algorithms.
Cons of AoS
- Not as efficient as other layouts when working on a large number of elements of the same type.
Structure-of-Arrays (SoA)
The Structure-of-Arrays layout is less intuitive to work with because it will store each field of a struct into its own array. Thus, our set of vector set would look like that:
These is no explicit Vector3
structure here because they are all packed into the array. Accessing the components of the i
-th
vector of the set means we access set.x[i]
, set.y[i]
and set.z[i]
. With this structure, our vector sum becomes the following:
There is more code here than in our AoS example because we need to iterate on each components individually. However this will be much more efficient than our AoS approach because this is extremely vectorization-friendly: the compiler will be able to group consecutive vector entries for each component, and use SIMD instructions to perform the 2, 4, 8, or even 16 additions at a time instead of a single one.
Pros of SoA
- Extremely fast.
Cons of SoA
- Algorithms need to be explicitly designed with SoA in mind to be efficient.
- Manipulating vectors individually is more difficult and can be less efficient than AoS.
- It is more difficult to think in terms of SoA than in terms of AoS.
Array-of-Structures-of-Arrays (AoSoA)
Now let's combine both SoA and AoS to obtain: Array-of-Structures-of-Arrays (AoSoA). The idea is to first define a wide 3D vector, i.e., we pack several vectors into a single struct:
The term wide 3D vector is inspired from the terminology used in the ultraviolet crate's documentation.
Here, our WideVector3
actually represents 4 vectors laid out in a SoA fashion. If we want our set of vector to contain more than just
4 vectors, we can define an array of WideVector3
(so we end up with an array of structure of array) with only 1024 / 4
elements
because each element already represent 4 vectors:
Our vector sum then becomes quite similar to the AoS approach, except that we have to add each 4 components in the inner loop in an SoA fashion:
So with this data layout, we still achieve great performance because components are grouped by 4, and thus their sum can
easily be auto-vectorized to a single SIMD instruction. To manipulate an individual vector we first have to access the corresponding wide
vector, and then its components on the x,y,z
arrays.
Pros of AoSoA
- Great performance compared to AoS.
- Can even beat the performance of SoA considering AoSoA can be more cache-friendly in modern architecture.
- Designing algorithms around AoSoA is somewhat easier than with plain SoA.
- Using AoSoA, most linear-algebra operations can be easily vectorized as long as they don't rely too much on branching.
Cons of AoSoA
- We still need to design our algorithms carefully to use AoSoA efficiently.
Array-of-Structures-of-arrays with explicit SIMD (SIMD AoSoA)
Finally, let's talk about what I call here SIMD AoSoA. This layout is actually exactly the same as AoSoA presented before,
except we that we use explicit SIMD primitives like, e.g., the types from
the packed_simd crate instead of plain arrays. For example we can use 4-lanes SIMD
floats packed_simd::f32x4
instead of [f32; 4]
:
Then our sum will be exactly the same as in the AoSoA approach, except that we don't have to deal with each 4 lanes
explicitly because packed_simd::f32x4
implements the Add
trait:
If we wanted to use 16-lanes vectors here (based on AVX instructions), we could simply do:
Another benefit of using SIMD types explicitly here is that we no longer rely on the compiler's auto-vectorization. So we can get SIMD instructions even in debug mode, and in some situations where the compiler would fail to auto-vectorize properly.
Using SIMD AoSoA for linear-algebra in Rust: ultraviolet and nalgebra
As far as I know, the first crate that implemented this concept for (gamedev) linear algebra in Rust is ultraviolet. At the end of this month (March 2020), you will also be able to use this approach with nalgebra, in its upcoming version 0.21.0.
With ultraviolet
, you have the choice between two families of types: regular types (Vec2
, Mat3
, Isometry3
, etc.) and wide types (Wec2
, Wat3
, WIsometry3
).
The regular types are designed to be usable with the common AoS layout, with vector/matrix components set to f32
.
The wide types on the other hand are designed to be used with the SIMD AoSoA layout: each wide type packs the corresponding four non-wide types into a single structure
(e.g. one Wec3
represents four Vec3
), and the wide vector/wide matrix components are of type f32x4
from the wide crate. As of today, ultraviolet
is limited to 32-bits floats, and 4-lanes 32-bits SIMD floats.
You can't use it with SIMD integers, nor with 64-bits floats or 32-bits SIMD floats with a number of lanes different from 4.
With nalgebra
, all types of vectors/matrices/tansformations are generic wrt. their component type. Therefore, for a
AoS layout, you can use, e.g., the Vector2<f32>
type. And if you want to leverage SIMD AoSoA, you can use Vector2<f32x4>
instead and
that will give you four 2D vectors for the price of one. Note that the f32x4
type here comes from the new simba
crate and is a newtype for the f32x4
from packed_simd (the upcoming standard safe SIMD implementation in Rust).
A newtype was required because of the orphan rules, and the need to implement some traits from the num crate for the SIMD types.
Simba is not limited to f32x4
though as it defines a newtype for every single numeric type of packed_simd. Therefore nalgebra
will also support all the integer and float SIMD types, with 2, 4, 8, or 16 lanes. You can for example write and use Isometry2<simba::f32x16>
as
a SoA set of 16 Isometry2<f32>
.
Finally nalgebra has support for SIMD on all the platforms supported by packed_simd itself.
Here are some (subjective) pros and cons for ultraviolet and nalgebra:
- Pros of ultraviolet: no generics so it is simple to use, and efficient even without compiler optimizations enabled. Works on stable Rust.
- Pros of nalgebra: generics allow the use of all SIMD types. More feature-complete and tested than ultraviolet. Based on packed_simd with great platform support but we could make it work with the wide crate too.
- Cons of ultraviolet: cannot use 64-bits floats nor SIMD types other than 32-bit 4-lanes floats. It has a more limited feature set (but that may be enough for gamedev).
- Cons of nalgebra: generics make it harder to use, and make the doc harder to browse. Also nalgebra is much less efficient without compiler optimizations enabled. Because SIMD AoSoA is based on packed_simd, the use of SIMD types will only work with the nightly compiler.
Benchmarks of Rust linear-algebra crates
Now is the time to see if SIMD AoSoA is useful at all. The benchmarks I run here are a modified version of the
mathbench-rs benchmark suite you may have already head of. For example it
was used when glam
and ultraviolet
were published.
If you want to run the benchmarks on your machine, you can clone my fork and
either run cargo bench
or RUSTFLAGS='-C target-cpu=native' cargo bench
.
The modifications I made to the benchmark were the following:
- I added
codegen-units = 1
to the[profile.bench]
section. This allows to get as much optimization as we can get from the compiler (this is typically what you want before releasing a binary). It turns out that this can affect the performance of nalgebra which benefits significantly from compiler optimizations. - I added support for ultraviolet regular types (identified by the column ultraviolet in the benchmarks) as well as its wide types (identified by the column ultraviolet_f32x4).
- Because ultraviolet use the concepts of rotors instead of quaternions for its rotation, I used its
Rotor/WRotor
types for all its quaternion benchmark rows. - I added support for using nalgebra with f32 SIMD with 4, 8, and 16 lanes identified by nalgebra_f32x4, nalgebra_f32x8 and nalgebra_f32x16.
- I added benchmarks for 3D isometries of both ultraviolet and nalgebra.
- I modified the benchmark of unary and binary operations so they measure the time to process 16 elements. For example when benchmarking 2D matrix transposition, we are actually seeing the computation time for transposing 16 matrices (instead of just one). This allows us to measure the gain of SIMD AoSoA without penalizing neither non-AoSoA crates nor AoSoA crates.
This last modification follows the same principle as the one introduced for the benchmarks presented by ultraviolet
in
their README.
-C target-cpu=native
Benchmark with In this benchmark I am compiling with the RUSTFLAGS='-C target-cpu=native
option so that the compiler emits AVX instructions
for f32x8
and f32x16
. It appears that some crates (namely glam and vek) are not as efficient as they could be
when this option is enabled so you will find a second benchmark without this option enabled in the next section.
Benchmark options:
- command line:
RUSTFLAGS='-C target-cpu=native' cargo bench
- profiling option in
Cargo.toml
:codegen-units = 1
- CPU:
AMD Ryzen 9 3900X 12-Core Processor
benchmark | nalgebra_f32x16 | nalgebra_f32x8 | nalgebra_f32x4 | ultraviolet_f32x4 | nalgebra | ultraviolet | glam | vek | cgmath | euclid |
---|---|---|---|---|---|---|---|---|---|---|
euler 2d x10000 | 2.076 us | 2.224 us | 3.05 us | 3.05 us | 9.674 us | 12.0 us | 12.42 us | 11.85 us | 11.92 us | 9.585 us |
euler 3d x10000 | 3.014 us | 2.809 us | 4.791 us | 4.639 us | 18.18 us | 17.58 us | 6.27 us | 18.12 us | 18.05 us | 17.91 us |
isometry transform point2 x1 | 5.7179 ns | 5.6563 ns | 7.8197 ns | 8.1874 ns | 22.8197 ns | 24.4878 ns | N/A | N/A | N/A | N/A |
isometry transform point2 x100 | 2.582 us | 2.587 us | 2.739 us | 2.787 us | 3.007 us | 3.169 us | N/A | N/A | N/A | N/A |
isometry transform point3 x1 | 10.5417 ns | 10.1237 ns | 15.4410 ns | 19.5679 ns | 60.0877 ns | 64.7755 ns | N/A | N/A | N/A | N/A |
isometry transform point3 x100 | 4.704 us | 4.377 us | 4.64 us | 4.941 us | 6.567 us | 6.68 us | N/A | N/A | N/A | N/A |
isometry2 inverse | 6.0317 ns | 6.4494 ns | 6.9687 ns | 7.4126 ns | 24.2412 ns | 24.8876 ns | N/A | N/A | N/A | N/A |
isometry2 mul isometry2 | 8.1413 ns | 9.2351 ns | 10.1867 ns | 10.2618 ns | 34.5250 ns | 33.1397 ns | N/A | N/A | N/A | N/A |
isometry3 inverse | 12.3065 ns | 11.2699 ns | 16.2452 ns | 21.6418 ns | 62.2947 ns | 76.3882 ns | N/A | N/A | N/A | N/A |
isometry3 mul isometry3 | 29.0822 ns | 16.5287 ns | 26.0439 ns | 31.5684 ns | 97.8058 ns | 109.7477 ns | N/A | N/A | N/A | N/A |
matrix2 determinant | N/A | N/A | N/A | N/A | 15.4800 ns | N/A | 9.9942 ns | 15.5422 ns | 15.6876 ns | N/A |
matrix2 inverse | N/A | N/A | N/A | N/A | 25.0731 ns | N/A | 15.7486 ns | N/A | 25.8362 ns | N/A |
matrix2 mul matrix2 | 10.6500 ns | 9.0379 ns | 10.3309 ns | 10.4283 ns | 24.7601 ns | 24.9938 ns | 16.1426 ns | 291.7833 ns | 25.0133 ns | N/A |
matrix2 mul vector2 x1 | 5.7680 ns | 5.2159 ns | 6.7758 ns | 6.9242 ns | 22.9934 ns | 24.4428 ns | 15.5422 ns | 85.4171 ns | 23.7398 ns | N/A |
matrix2 mul vector2 x100 | 2.641 us | 2.663 us | 2.647 us | 2.617 us | 3.14 us | 3.182 us | 2.953 us | 8.998 us | 3.232 us | N/A |
matrix2 transpose | 5.8718 ns | 6.8760 ns | 6.6906 ns | N/A | 10.6439 ns | N/A | 10.9130 ns | 10.6631 ns | 10.5080 ns | N/A |
matrix3 determinant | N/A | N/A | N/A | N/A | 31.9111 ns | N/A | 19.7343 ns | 31.3051 ns | 32.2809 ns | N/A |
matrix3 inverse | N/A | N/A | N/A | 25.6965 ns | 83.4324 ns | 82.1388 ns | 120.2562 ns | N/A | 79.4379 ns | N/A |
matrix3 mul matrix3 | 70.6877 ns | 19.6932 ns | 27.7722 ns | 26.8799 ns | 83.2946 ns | 83.9021 ns | 44.3916 ns | 944.3034 ns | 80.8441 ns | N/A |
matrix3 mul vector3 x1 | 10.6031 ns | 10.7913 ns | 13.6117 ns | 13.7169 ns | 46.1231 ns | 47.5651 ns | 20.6856 ns | 315.4715 ns | 45.3403 ns | N/A |
matrix3 mul vector3 x100 | 4.7 us | 4.807 us | 4.845 us | 4.738 us | 6.163 us | 6.142 us | 6.701 us | 32.47 us | 6.091 us | N/A |
matrix3 transpose | 14.3912 ns | 12.8725 ns | 15.0385 ns | 15.9171 ns | 19.9624 ns | 19.7458 ns | 22.7494 ns | 19.5610 ns | 20.0838 ns | N/A |
matrix4 determinant | N/A | N/A | N/A | N/A | 1.489 us | N/A | 0.083 us | 0.1515 us | 0.1021 us | 0.1565 us |
matrix4 inverse | N/A | N/A | N/A | 0.09051 us | 0.5339 us | 0.3744 us | 0.2844 us | 4.203 us | 0.4745 us | 0.5851 us |
matrix4 mul matrix4 | 0.2354 us | 0.06835 us | 0.07657 us | 0.07581 us | 0.1247 us | 0.135 us | 0.07736 us | 1.916 us | 0.1146 us | 0.1089 us |
matrix4 mul vector4 x1 | 39.0102 ns | 15.4389 ns | 20.7324 ns | 22.1028 ns | 30.6785 ns | 30.8711 ns | 25.6006 ns | 487.1873 ns | 30.3248 ns | N/A |
matrix4 mul vector4 x100 | 8.339 us | 7.419 us | 7.437 us | 7.428 us | 8.14 us | 8.209 us | 8.078 us | 47.98 us | 8.332 us | N/A |
matrix4 transpose | 95.1337 ns | 23.2408 ns | 25.8665 ns | 25.7914 ns | 97.9916 ns | 101.3651 ns | 30.0239 ns | 103.5180 ns | 102.0192 ns | N/A |
quaternion conjugate | 6.0223 ns | 6.3760 ns | 7.5628 ns | 6.9775 ns | 23.2724 ns | 23.0902 ns | 10.3465 ns | 23.2522 ns | 23.2696 ns | 23.0664 ns |
quaternion mul quaternion | 9.3504 ns | 9.4567 ns | 12.3669 ns | 12.6805 ns | 30.0095 ns | 38.4430 ns | 25.8066 ns | 42.2113 ns | 42.6575 ns | 40.9668 ns |
quaternion mul vector3 | 8.5309 ns | 7.7526 ns | 13.3755 ns | 16.7836 ns | 49.3278 ns | 56.7010 ns | 41.2790 ns | 71.9719 ns | 48.9911 ns | 48.2763 ns |
transform point2 x1 | N/A | N/A | N/A | N/A | 35.6464 ns | N/A | 23.5642 ns | 325.1071 ns | 29.7918 ns | 28.4816 ns |
transform point2 x100 | N/A | N/A | N/A | N/A | 5.386 us | N/A | 5.612 us | 33.66 us | 5.255 us | 4.066 us |
transform point3 x1 | N/A | N/A | N/A | N/A | 68.6997 ns | N/A | 23.9545 ns | 504.0400 ns | 67.0988 ns | 68.1284 ns |
transform point3 x100 | N/A | N/A | N/A | N/A | 9.393 us | N/A | 7.898 us | 49.31 us | 9.282 us | 10.0 us |
transform vector2 x1 | N/A | N/A | N/A | N/A | 32.5603 ns | N/A | 24.0791 ns | 327.3927 ns | N/A | 24.8445 ns |
transform vector2 x100 | N/A | N/A | N/A | N/A | 5.385 us | N/A | 5.649 us | 33.51 us | N/A | 3.891 us |
transform vector3 x1 | N/A | N/A | N/A | N/A | 52.9997 ns | N/A | 25.2423 ns | 487.6624 ns | 55.1891 ns | 46.3865 ns |
transform vector3 x100 | N/A | N/A | N/A | N/A | 9.083 us | N/A | 7.963 us | 49.47 us | 8.994 us | 8.771 us |
transform2 inverse | N/A | N/A | N/A | N/A | 83.1814 ns | N/A | N/A | N/A | N/A | 38.3216 ns |
transform2 mul transform2 | N/A | N/A | N/A | N/A | 79.8949 ns | N/A | N/A | N/A | N/A | 53.0397 ns |
transform3 inverse | N/A | N/A | N/A | N/A | 509.3638 ns | N/A | N/A | N/A | N/A | 573.8688 ns |
transform3 mul transform3d | N/A | N/A | N/A | N/A | 116.3846 ns | N/A | N/A | N/A | N/A | 112.3300 ns |
vector3 cross | 4.4663 ns | 4.5852 ns | 6.3479 ns | 6.1689 ns | 26.0655 ns | 24.9554 ns | 15.6013 ns | 27.9000 ns | 27.6733 ns | 28.1612 ns |
vector3 dot | 4.4944 ns | 4.5232 ns | 6.4285 ns | 6.2531 ns | 25.3941 ns | 24.9869 ns | 13.6267 ns | 24.5923 ns | 25.9651 ns | 25.8382 ns |
vector3 length | 3.5110 ns | 3.5936 ns | 5.4675 ns | 5.4127 ns | 20.7343 ns | 21.0470 ns | 20.4590 ns | 21.0161 ns | 20.7430 ns | 20.5710 ns |
vector3 normalize | 7.8013 ns | 8.3383 ns | 15.4976 ns | 15.3918 ns | 61.5877 ns | 61.3314 ns | 35.6648 ns | 59.7074 ns | 35.7934 ns | 61.2340 ns |
To get a better comparative view of the performance of AoSoA, here are the benchmarks restricted to ultraviolet's wide types
(Wec2, Wat2, WIsometry2
, etc.) and nalgebra types paramaterized by SIMD types (Vector2<f32x4>, Matrix2<f32x8>, Isometry2<f32x16>
, etc.):
benchmark | ultraviolet_f32x4 | nalgebra_f32x4 | nalgebra_f32x8 | nalgebra_f32x16 |
---|---|---|---|---|
euler 2d x10000 | 3.05 us | 3.05 us | 2.224 us | 2.076 us |
euler 3d x10000 | 4.639 us | 4.791 us | 2.809 us | 3.014 us |
isometry transform point2 x1 | 8.1874 ns | 7.8197 ns | 5.6563 ns | 5.7179 ns |
isometry transform point2 x100 | 2.787 us | 2.739 us | 2.587 us | 2.582 us |
isometry transform point3 x1 | 19.5679 ns | 15.4410 ns | 10.1237 ns | 10.5417 ns |
isometry transform point3 x100 | 4.941 us | 4.64 us | 4.377 us | 4.704 us |
isometry2 inverse | 7.4126 ns | 6.9687 ns | 6.4494 ns | 6.0317 ns |
isometry2 mul isometry2 | 10.2618 ns | 10.1867 ns | 9.2351 ns | 8.1413 ns |
isometry3 inverse | 21.6418 ns | 16.2452 ns | 11.2699 ns | 12.3065 ns |
isometry3 mul isometry3 | 31.5684 ns | 26.0439 ns | 16.5287 ns | 29.0822 ns |
matrix2 mul matrix2 | 10.4283 ns | 10.3309 ns | 9.0379 ns | 10.6500 ns |
matrix2 mul vector2 x1 | 6.9242 ns | 6.7758 ns | 5.2159 ns | 5.7680 ns |
matrix2 mul vector2 x100 | 2.617 us | 2.647 us | 2.663 us | 2.641 us |
matrix2 transpose | N/A | 6.6906 ns | 6.8760 ns | 5.8718 ns |
matrix3 inverse | 25.6965 ns | N/A | N/A | N/A |
matrix3 mul matrix3 | 26.8799 ns | 27.7722 ns | 19.6932 ns | 70.6877 ns |
matrix3 mul vector3 x1 | 13.7169 ns | 13.6117 ns | 10.7913 ns | 10.6031 ns |
matrix3 mul vector3 x100 | 4.738 us | 4.845 us | 4.807 us | 4.7 us |
matrix3 transpose | 15.9171 ns | 15.0385 ns | 12.8725 ns | 14.3912 ns |
matrix4 inverse | 90.5105 ns | N/A | N/A | N/A |
matrix4 mul matrix4 | 75.8094 ns | 76.5675 ns | 68.3461 ns | 235.3853 ns |
matrix4 mul vector4 x1 | 22.1028 ns | 20.7324 ns | 15.4389 ns | 39.0102 ns |
matrix4 mul vector4 x100 | 7.428 us | 7.437 us | 7.419 us | 8.339 us |
matrix4 transpose | 25.7914 ns | 25.8665 ns | 23.2408 ns | 95.1337 ns |
quaternion conjugate | 6.9775 ns | 7.5628 ns | 6.3760 ns | 6.0223 ns |
quaternion mul quaternion | 12.6805 ns | 12.3669 ns | 9.4567 ns | 9.3504 ns |
quaternion mul vector3 | 16.7836 ns | 13.3755 ns | 7.7526 ns | 8.5309 ns |
vector3 cross | 6.1689 ns | 6.3479 ns | 4.5852 ns | 4.4663 ns |
vector3 dot | 6.2531 ns | 6.4285 ns | 4.5232 ns | 4.4944 ns |
vector3 length | 5.4127 ns | 5.4675 ns | 3.5936 ns | 3.5110 ns |
vector3 normalize | 15.3918 ns | 15.4976 ns | 8.3383 ns | 7.8013 ns |
As we can see, both ultraviolet_f32x4
and nalgebra_f32x4
yield roughly the same computation times. The most efficient
option for my processor is nalgebra with f32x8
: it often reaches the best performance, and f32x16
comes
with little performance gain, and even significant regressions for some tests.
-C target-cpu=native
Benchmark without It appears some rust linear algebra crates do not perform as well as they could if -C target-cpu=native
is passed
to the compiler. This is for example the case of glam
and vek
. However, it appears that not using -C target-cpu=native
makes some methods of non-wide types of ultraviolet much less efficient.
Because we don't use -C target-cpu=native
here, both f32x8
and f32x16
won't emit 8- and 16-lanes AVX instructions, making
them only as efficient as f32x4
.
Here is the same benchmark with:
- command line:
cargo bench
- options in
Cargo.toml
:codegen-units = 1
. - CPU:
AMD Ryzen 9 3900X 12-Core Processor
benchmark | nalgebra_f32x16 | nalgebra_f32x8 | nalgebra_f32x4 | ultraviolet_f32x4 | nalgebra | ultraviolet | glam | vek | cgmath | euclid |
---|---|---|---|---|---|---|---|---|---|---|
euler 2d x10000 | 3.057 us | 3.069 us | 2.992 us | 3.014 us | 9.028 us | 5.28 us | 5.166 us | 5.258 us | 5.259 us | 8.631 us |
euler 3d x10000 | 4.585 us | 4.587 us | 4.546 us | 4.587 us | 17.84 us | 18.34 us | 6.311 us | 17.57 us | 18.04 us | 17.97 us |
isometry transform point2 x1 | 7.7226 ns | 6.7828 ns | 8.1024 ns | 8.6412 ns | 23.4637 ns | 54.5840 ns | N/A | N/A | N/A | N/A |
isometry transform point2 x100 | 2.625 us | 2.701 us | 2.801 us | 2.837 us | 3.109 us | 4.918 us | N/A | N/A | N/A | N/A |
isometry transform point3 x1 | 21.0932 ns | 16.1782 ns | 16.1052 ns | 20.6723 ns | 61.8824 ns | 330.1143 ns | N/A | N/A | N/A | N/A |
isometry transform point3 x100 | 6.466 us | 4.598 us | 4.515 us | 4.794 us | 6.546 us | 18.19 us | N/A | N/A | N/A | N/A |
isometry2 inverse | 7.4879 ns | 7.0432 ns | 7.3004 ns | 7.7692 ns | 24.8090 ns | 59.3838 ns | N/A | N/A | N/A | N/A |
isometry2 mul isometry2 | 13.3355 ns | 10.5486 ns | 12.3278 ns | 12.7816 ns | 34.5714 ns | 110.7837 ns | N/A | N/A | N/A | N/A |
isometry3 inverse | 29.6336 ns | 19.3869 ns | 17.1929 ns | 21.4199 ns | 63.9200 ns | 212.2193 ns | N/A | N/A | N/A | N/A |
isometry3 mul isometry3 | 51.0123 ns | 39.2781 ns | 28.5785 ns | 37.5971 ns | 98.0279 ns | 459.4930 ns | N/A | N/A | N/A | N/A |
matrix2 determinant | N/A | N/A | N/A | N/A | 16.0198 ns | N/A | 11.2070 ns | 15.8621 ns | 16.0906 ns | N/A |
matrix2 inverse | N/A | N/A | N/A | N/A | 26.5165 ns | N/A | 18.9069 ns | N/A | 26.2478 ns | N/A |
matrix2 mul matrix2 | 60.2594 ns | 10.6371 ns | 12.5352 ns | 10.7532 ns | 25.4533 ns | 25.7148 ns | 16.3060 ns | 301.3775 ns | 25.6988 ns | N/A |
matrix2 mul vector2 x1 | 29.6210 ns | 8.0023 ns | 7.1695 ns | 8.0907 ns | 23.0932 ns | 24.4506 ns | 16.4141 ns | 93.3601 ns | 25.0774 ns | N/A |
matrix2 mul vector2 x100 | 3.5 us | 2.591 us | 2.684 us | 2.698 us | 3.137 us | 3.174 us | 2.828 us | 9.558 us | 3.122 us | N/A |
matrix2 transpose | 6.8152 ns | 7.4080 ns | 6.4984 ns | N/A | 11.0205 ns | N/A | 10.9890 ns | 10.8180 ns | 10.3812 ns | N/A |
matrix3 determinant | N/A | N/A | N/A | N/A | 33.1252 ns | N/A | 22.2592 ns | 31.8128 ns | 31.7086 ns | N/A |
matrix3 inverse | N/A | N/A | N/A | 26.7493 ns | 92.8270 ns | 284.1032 ns | 93.5328 ns | N/A | 82.3820 ns | N/A |
matrix3 mul matrix3 | 0.1494 us | 0.03354 us | 0.02883 us | 0.02638 us | 0.09077 us | 0.08728 us | 0.0474 us | 1.005 us | 0.08579 us | N/A |
matrix3 mul vector3 x1 | 52.3315 ns | 14.0998 ns | 13.7548 ns | 13.6309 ns | 46.4205 ns | 46.5016 ns | 21.5466 ns | 317.8168 ns | 47.3542 ns | N/A |
matrix3 mul vector3 x100 | 10.68 us | 5.033 us | 4.835 us | 4.917 us | 6.138 us | 6.261 us | 6.633 us | 32.93 us | 6.297 us | N/A |
matrix3 transpose | 20.9653 ns | 15.7305 ns | 15.7501 ns | 15.4925 ns | 52.7300 ns | 53.0845 ns | 24.1975 ns | 55.4793 ns | 52.0874 ns | N/A |
matrix4 determinant | N/A | N/A | N/A | N/A | 690.7027 ns | N/A | 85.5778 ns | 176.1770 ns | 110.6990 ns | 174.6754 ns |
matrix4 inverse | N/A | N/A | N/A | 0.08342 us | 0.5282 us | 0.3953 us | 0.257 us | 3.792 us | 0.5096 us | 0.651 us |
matrix4 mul matrix4 | 0.4467 us | 0.08671 us | 0.06897 us | 0.07068 us | 0.1285 us | 0.1493 us | 0.07653 us | 2.024 us | 0.1185 us | 0.1152 us |
matrix4 mul vector4 x1 | 84.1744 ns | 20.9291 ns | 21.1243 ns | 20.6690 ns | 31.5352 ns | 30.3570 ns | 24.7876 ns | 512.0570 ns | 30.9776 ns | N/A |
matrix4 mul vector4 x100 | 10.32 us | 7.592 us | 7.595 us | 7.662 us | 8.472 us | 8.379 us | 7.823 us | 51.42 us | 8.366 us | N/A |
matrix4 transpose | 103.5080 ns | 32.5330 ns | 26.2948 ns | 27.1962 ns | 103.0829 ns | 101.3104 ns | 29.2236 ns | 105.6003 ns | 98.1151 ns | N/A |
quaternion conjugate | 6.7367 ns | 7.6066 ns | 6.6130 ns | 6.6179 ns | 24.5468 ns | 25.5657 ns | 11.1447 ns | 24.2229 ns | 25.7246 ns | 24.9413 ns |
quaternion mul quaternion | 20.0365 ns | 16.6831 ns | 14.6380 ns | 16.6299 ns | 39.3908 ns | 274.4676 ns | 27.3533 ns | 62.0697 ns | 35.0299 ns | 59.5286 ns |
quaternion mul vector3 | 19.3629 ns | 14.7722 ns | 14.7579 ns | 18.5064 ns | 52.7303 ns | 170.8328 ns | 44.1336 ns | 81.1243 ns | 52.5011 ns | 54.1958 ns |
transform point2 x1 | N/A | N/A | N/A | N/A | 36.6350 ns | N/A | 25.6516 ns | 337.3442 ns | 30.5781 ns | 28.5532 ns |
transform point2 x100 | N/A | N/A | N/A | N/A | 5.475 us | N/A | 5.732 us | 33.81 us | 5.118 us | 4.003 us |
transform point3 x1 | N/A | N/A | N/A | N/A | 66.3694 ns | N/A | 24.0601 ns | 517.9224 ns | 67.8205 ns | 68.0836 ns |
transform point3 x100 | N/A | N/A | N/A | N/A | 9.403 us | N/A | 7.824 us | 52.5 us | 9.559 us | 10.07 us |
transform vector2 x1 | N/A | N/A | N/A | N/A | 32.5667 ns | N/A | 21.3971 ns | 330.5713 ns | N/A | 23.5693 ns |
transform vector2 x100 | N/A | N/A | N/A | N/A | 5.392 us | N/A | 5.579 us | 33.45 us | N/A | 3.909 us |
transform vector3 x1 | N/A | N/A | N/A | N/A | 55.3600 ns | N/A | 24.3628 ns | 515.1730 ns | 59.1129 ns | 47.9383 ns |
transform vector3 x100 | N/A | N/A | N/A | N/A | 9.185 us | N/A | 7.958 us | 52.39 us | 8.992 us | 8.665 us |
transform2 inverse | N/A | N/A | N/A | N/A | 87.2060 ns | N/A | N/A | N/A | N/A | 45.0272 ns |
transform2 mul transform2 | N/A | N/A | N/A | N/A | 87.5702 ns | N/A | N/A | N/A | N/A | 40.1995 ns |
transform3 inverse | N/A | N/A | N/A | N/A | 542.8401 ns | N/A | N/A | N/A | N/A | 748.2648 ns |
transform3 mul transform3d | N/A | N/A | N/A | N/A | 118.0659 ns | N/A | N/A | N/A | N/A | 115.3294 ns |
vector3 cross | 5.9052 ns | 5.7307 ns | 6.4232 ns | 6.2395 ns | 25.0006 ns | 36.5961 ns | 15.0303 ns | 28.6082 ns | 28.6343 ns | 29.5467 ns |
vector3 dot | 5.7524 ns | 5.8113 ns | 6.6820 ns | 6.3391 ns | 25.5593 ns | 26.1824 ns | 13.6831 ns | 25.4740 ns | 25.9099 ns | 26.1271 ns |
vector3 length | 5.3441 ns | 5.4417 ns | 5.3741 ns | 5.2332 ns | 20.5369 ns | 34.2988 ns | 20.5652 ns | 20.6259 ns | 20.9281 ns | 20.6052 ns |
vector3 normalize | 15.6224 ns | 15.3854 ns | 15.5892 ns | 15.6585 ns | 59.1804 ns | 60.9510 ns | 35.7763 ns | 61.3666 ns | 36.7304 ns | 61.3199 ns |
Conclusion
The use of SIMD AoSoA is extremely promising for doing efficiently lots of linear algebra on large arrays of entities with the same types.
This will be possible in nalgebra in its next version 0.21.0 and will perform as efficiently as the current implementation in ultraviolet
when using f32x4
and will also be compatible with other SIMD types as well (e.g. f32x16
, f64x2
, i32x4
, u8x4
, etc.) Those types are newtypes wrapping
the types from packed_simd. Those newtypes will come from a new crate named simba
which I will present in more depth in the next edition of the This month in Rustsim
blog posts.
Finaly here are two elements to keep in mind:
- Compilation flags matter a lot when measuring performance.
-C target-cpu
orcodegen-units=1
can have significant impacts on your release performance. - SIMD AoSoA comes with a price: algorithms must be carefully designed to obtain as much gain as possible compared to an AoS approach.
Thank you all for reading, and for your support on patreon or GitHub sponsor!
Don't hesitate to share your thoughts on this topic, or to correct me if something seems off to you.