I'm not sure about those results, really, so I ran the same benchmark on 4 (+1) different setups.
One thing I would like to know before going into the results: have you run those benchmarks on a macOS machine? And with which Rust version?
macOS still doesn't have proper futexes (with timeout support) or a condvar implementation, so both Go and Rust fall back to the regular pthread_mutex functions, which are slower in most scenarios. Since Rust 1.62, however, Rust uses futexes on all compatible platforms (Windows, Linux, *BSD, excluding macOS).
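For context on what is being measured: the benchmark sources aren't reproduced in this post, so the snippet below is only a minimal, hypothetical stand-in for a contended-lock workload. The point is that every lock()/unlock() pair goes through a futex on Rust 1.62 and later (on futex platforms) and through pthread_mutex on macOS and on older Rust:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared counter behind a Mutex; on Rust >= 1.62 (futex platforms) the
    // lock is futex-based, on older versions / macOS it wraps pthread_mutex_t.
    let counter = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..8)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..100_000 {
                    // Contended lock/unlock pair: this is the hot path
                    // whose implementation the benchmark exercises.
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*counter.lock().unwrap(), 800_000);
}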
Let's see the results.
…
All benchmarks were compiled with go build and cargo +stable build --release. +stable was used to select the stable toolchain, since the default on my machine is nightly, although there is no notable difference between stable and 1.65 nightly (29e4a9ee0 2022-08-10) in this regard.
Apple M1 8GB 2020 (macOS 12.4, Kernel 21.5.0)
> go version
go version go1.19 darwin/arm64
> rustc +stable --version
rustc 1.63.0 (4b91a6ea7 2022-08-08)
> # Go
> hyperfine --warmup 10 -r 200 -N -- gobench/m
Benchmark 1: gobench/m
Time (mean ± σ): 12.3 ms ± 0.2 ms [User: 68.0 ms, System: 7.2 ms]
Range (min … max): 11.4 ms … 12.7 ms 200 runs
> # Rust
> hyperfine --warmup 10 -r 200 -N -- rustbench/target/release/rustbench
Benchmark 1: rustbench/target/release/rustbench
Time (mean ± σ): 14.0 ms ± 0.3 ms [User: 94.5 ms, System: 7.0 ms]
Range (min … max): 12.8 ms … 15.1 ms 200 runs
Simplified:
Go 12.3 ms ± 0.2 ms
Rust 14.0 ms ± 0.3 ms
…
Intel i7 10th gen 16GB (FreeBSD 13.1-RELEASE / Go compiled from source)
> go119 version
go version go1.19 freebsd/amd64
> rustc +stable --version
rustc 1.63.0 (4b91a6ea7 2022-08-08)
> # Go
> hyperfine --warmup 10 -r 200 -N -- gobench/m
Benchmark 1: gobench/m
Time (mean ± σ): 10.1 ms ± 0.6 ms [User: 69.3 ms, System: 3.2 ms]
Range (min … max): 9.1 ms … 15.0 ms 200 runs
> # Rust
> hyperfine --warmup 10 -r 200 -N -- rustbench/target/release/rustbench
Benchmark 1: rustbench/target/release/rustbench
Time (mean ± σ): 5.6 ms ± 0.5 ms [User: 38.0 ms, System: 0.6 ms]
Range (min … max): 5.1 ms … 10.3 ms 200 runs
Simplified:
Go 10.1 ms ± 0.6 ms
Rust 5.6 ms ± 0.5 ms
…
Raspberry Pi 4B 8GB BCM2835 (Arch Linux, Kernel 5.15.56)
> go version
go version go1.19 linux/arm64
> rustc +stable --version
rustc 1.63.0 (4b91a6ea7 2022-08-08)
> # Go
> hyperfine --warmup 10 -r 200 -N -- gobench/m
Benchmark 1: gobench/m
Time (mean ± σ): 40.1 ms ± 5.1 ms [User: 119.1 ms, System: 18.7 ms]
Range (min … max): 36.9 ms … 80.1 ms 200 runs
> # Rust
> hyperfine --warmup 10 -r 200 -N -- rustbench/target/release/rustbench
Benchmark 1: rustbench/target/release/rustbench
Time (mean ± σ): 24.7 ms ± 3.2 ms [User: 81.5 ms, System: 3.6 ms]
Range (min … max): 21.8 ms … 52.4 ms 200 runs
Simplified:
Go 40.1 ms ± 5.1 ms
Rust 24.7 ms ± 3.2 ms
…
Ryzen 5900X 64GB (Arch Linux, Kernel 5.19.2)
> go version
go version go1.19 linux/amd64
> rustc +stable --version
rustc 1.63.0 (4b91a6ea7 2022-08-08)
> # Go
> hyperfine --warmup 10 -r 200 -N -- gobench/m
Benchmark 1: gobench/m
Time (mean ± σ): 7.6 ms ± 0.8 ms [User: 89.8 ms, System: 4.7 ms]
Range (min … max): 6.1 ms … 13.9 ms 200 runs
> # Rust
> hyperfine --warmup 10 -r 200 -N -- rustbench/target/release/rustbench
Benchmark 1: rustbench/target/release/rustbench
Time (mean ± σ): 4.5 ms ± 0.2 ms [User: 54.0 ms, System: 6.0 ms]
Range (min … max): 3.6 ms … 5.4 ms 200 runs
Simplified:
Go 7.6 ms ± 0.8 ms
Rust 4.5 ms ± 0.2 ms
…
I also did one additional test with the old Rust 1.61 (the only version available on aarch64 FreeBSD 13.1-RELEASE), from before the futex patch.
AWS EC2 Graviton2 t4g.medium (FreeBSD 13.1-RELEASE arm64 / Go compiled from source)
> # Neofetch
OS: FreeBSD 13.1-RELEASE arm64
Uptime: 13 days, 22 hours, 50 mins
Packages: 167 (pkg)
Shell: fish 3.4.1
Terminal: /dev/pts/1
CPU: ARM Neoverse-N1 r3p1 (2)
Memory: 785MiB / 3998MiB
> go119 version
go version go1.19 freebsd/arm64
> rustc +stable --version
rustc 1.61.0
> # Go
> hyperfine --warmup 10 -r 200 -N -- gobench/m
Benchmark 1: gobench/m
Time (mean ± σ): 42.8 ms ± 5.8 ms [User: 78.8 ms, System: 4.6 ms]
Range (min … max): 36.3 ms … 64.0 ms 200 runs
> # Rust
> hyperfine --warmup 10 -r 200 -N -- rustbench/target/release/rustbench
Benchmark 1: rustbench/target/release/rustbench
Time (mean ± σ): 31.7 ms ± 1.1 ms [User: 60.5 ms, System: 1.4 ms]
Range (min … max): 28.6 ms … 39.0 ms 200 runs
Simplified:
Go 42.8 ms ± 5.8 ms
Rust 31.7 ms ± 1.1 ms
…
When Rust 1.61 is compared against Go, the difference is negligible: on macOS Go is a bit faster, while on aarch64 FreeBSD Rust is faster. Since 1.62, however, Rust performs about 2x faster than Go on every system where the futex is supported (and where Go is faster, it is only about 16% faster).
So, at worst (on older versions), Rust performs about as well as Go. But that only holds for older Rust versions; things change with recent ones.
Edit: moved to a separate post for advanced formatting, since Medium doesn't allow it otherwise.
Edit²:
I like Go (it is currently my main language at work); its approach to concurrency is very interesting and makes it very easy to introduce concurrency into your code, and going preemptive instead of cooperative, unlike most coroutine implementations, has its advantages.
It is easier to write performant concurrent code in Go than in other languages: you don't need to understand multiple concurrency structures and their advantages and disadvantages, because Go takes that responsibility for you. The drawback is that there are cases where the choice Go makes for you is not the best one for your use case.
Rust is a language that doesn't want to lock you into a single way of doing things; you need to choose the one that suits you best, and that involves understanding what those solutions are and what their drawbacks are. In the end, choosing the right or wrong implementation is what decides whether your code outperforms or underperforms: the wrong choice can make Rust as slow as any dynamic programming language, but you would need to make a lot of wrong decisions to get there.
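As a hypothetical illustration of that kind of decision (this is not code from the benchmark): for a plain shared counter, a lock-free atomic outperforms any mutex under contention, and Rust lets you pick it, but it will not pick it for you:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    // A lock-free counter: each increment is a single hardware atomic add,
    // with no lock acquisition and no chance of parking a thread.
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..100_000 {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(counter.load(Ordering::Relaxed), 800_000);
}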
Also, Go already uses futexes on Linux and FreeBSD (and probably SRW locks on Windows); only on macOS does it still use pthread_mutex, because of the lack of an alternative. So Go already performs the best it can on Linux, and is still slower than Rust. Go's overhead is not in the locking mechanism available on macOS, and changing it would probably have no massive effect on performance in the majority of scenarios (though it would still be an extremely good thing to adopt if macOS ever ships an alternative). So Rust was lagging behind on 1.61 and earlier because it wasn't using the most performant locking mechanism, as Go was; happily, this has now been implemented.
If you are interested in this Rust change, this is the PR.
Edit³ (yes, one more)
I was quite intrigued by the performance on aarch64 FreeBSD; something seemed a little off. So yesterday I started bootstrapping rustc 1.65 (nightly) on my amd64 installation following this guide. I didn't need to bootstrap cargo itself (cargo is extremely flexible when it comes to running different rustc versions; pointing the RUSTC environment variable at a custom compiler is enough), so I went with just rustc and the std library.
I just left it there and went to bed. It was built on my 10th gen i7 machine; the i7 is surely a good CPU that can handle the job, but to cross-compile an aarch64 rustc from an amd64 machine you need to compile all dependencies for the aarch64 architecture, including LLVM, Clang and everything else (I started right from the aarch64 FreeBSD base system txz, so at least I avoided compiling the entire base system as well). Those are very big projects and took hours to compile.
With everything done, I uploaded the distribution files (rustc and rust-std) to my aarch64 FreeBSD instance on AWS, recompiled the benchmark, and, impressively, nothing changed. Really. I can check whether it is using futexes instead of pthread mutexes with the strings CLI:
# FreeBSD AMD64 Rustc 1.63.0 (stable)
> strings rustbench/target/release/rustbench | rg 'pthread_mutex'
pthread_mutex_destroy
pthread_mutex_lock
pthread_mutex_unlock

# FreeBSD aarch64 Rustc 1.65.0 (nightly)
> strings rustbench/target/release/rustbench | grep "pthread_mutex"
pthread_mutex_destroy
pthread_mutex_lock
pthread_mutex_unlock

# FreeBSD aarch64 Rustc 1.65.0 (nightly)
# Rust 1.65.0
> strings ./rustbench/target/release/rustbench | grep "umtx"
_umtx_op

# FreeBSD aarch64 Rustc 1.61.0 (stable)
# Nothing, obviously
> strings ./rustbench2/target/release/rustbench | grep "umtx"

# Linux aarch64 Rustc 1.63.0 (stable)
# Nothing, obviously
> strings target/release/rustbench | rg 'pthread_mutex'
They are very similar, but pthread_mutex_* appears only on FreeBSD and not on Linux, so let's check whether previous rustc versions used to include those calls on Linux (they do; the check below is just to make sure):
> cargo +1.61 build --release
Finished release [optimized] target(s) in 0.13s
> strings target/release/rustbench | rg 'pthread_mutex_lock'
pthread_mutex_lock
pthread_mutex_lock@GLIBC_2.2.5
So, in fact, previous versions used pthread_mutex, and that's expected, since Rust only switched to futexes in 1.62; and umtx (FreeBSD's futex implementation) is present in the binaries produced by 1.65 but not in those from 1.61, which is also expected.
If we look at how many pthread symbols are in the 1.61 binaries, there are way more than in 1.65.0 (which we looked at at the start):
# FreeBSD aarch64 Rustc 1.61.0 (stable)
# Rust 1.61.0
> strings ./rustbench2/target/release/rustbench | grep "pthread_mutex"
pthread_mutex_destroy
pthread_mutex_lock
pthread_mutex_unlock
pthread_mutexattr_destroy
pthread_mutexattr_init
pthread_mutexattr_settype
pthread_mutex_init
pthread_mutex_trylock
"$&(*,.02468:<>@BDFHJLNPcalled `Result::unwrap()` on an `Err` valueErrormessageOsRUST_MIN_STACKlibrary/std/src/sys/unix/locks/pthread_mutex.rs
The binary compiled with 1.65.0 is in the rustbench/ directory because I moved the previous compilation results to the rustbench2/ directory; that's why the directories are not consistent with the previous tests.
Okay, but how about the performance? These are the numbers:
> # Rust 1.61.0 (aarch64 FreeBSD 13.1)
> hyperfine --warmup 10 -r 200 -N -- rustbench2/target/release/rustbench
Benchmark 1: rustbench2/target/release/rustbench
Time (mean ± σ): 31.8 ms ± 0.9 ms [User: 60.9 ms, System: 1.2 ms]
Range (min … max): 29.1 ms … 34.1 ms 200 runs
> # Rust 1.65.0 (aarch64 FreeBSD 13.1)
> hyperfine --warmup 10 -r 200 -N -- rustbench/target/release/rustbench
Benchmark 1: rustbench/target/release/rustbench
Time (mean ± σ): 30.9 ms ± 1.0 ms [User: 59.7 ms, System: 0.8 ms]
Range (min … max): 28.2 ms … 33.0 ms 200 runs
Simplified:
> uname -a
FreeBSD freebsd 13.1-RELEASE FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 GENERIC arm64

Rust 1.61 (pthread) 31.8 ms ± 0.9 ms
Rust 1.65 (futex) 30.9 ms ± 1.0 ms
I have done several runs and the difference is still 1 ms. Yes, only 1 ms. No matter how many times I run this test, the worst I can get is the two exactly matching each other in performance, but never more than 1 ms of difference.
Well, it turns out that pthread_mutex_lock already uses a futex whenever possible (FreeBSD's own documentation states that libthr uses umtx under the hood), and Tokio is using parking_lot, which has had a faster Mutex implementation for a long time.
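That also explains why the numbers barely move: with a futex- or umtx-backed lock, the kernel is only involved under contention, and the uncontended fast path is a single atomic instruction either way. Here is a heavily simplified sketch of the idea (this is not Rust's actual std implementation; the wait/wake syscalls are elided as comments):

use std::sync::atomic::{AtomicU32, Ordering};

// State: 0 = unlocked, 1 = locked, 2 = locked with waiters.
struct FutexMutex {
    state: AtomicU32,
}

impl FutexMutex {
    fn lock(&self) {
        // Fast path: a single atomic compare-exchange, no syscall at all.
        if self
            .state
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }
        // Slow path: mark the lock contended and sleep in the kernel
        // (futex(2) on Linux, _umtx_op(2) on FreeBSD) until woken.
        while self.state.swap(2, Ordering::Acquire) != 0 {
            // A real implementation would call futex_wait(&self.state, 2)
            // here; spinning is only a placeholder for the elided syscall.
            std::hint::spin_loop();
        }
    }

    fn unlock(&self) {
        if self.state.swap(0, Ordering::Release) == 2 {
            // 2 means someone may be sleeping; a real implementation calls
            // futex_wake(&self.state, 1) here to wake one waiter.
        }
    }
}

fn main() {
    let m = FutexMutex { state: AtomicU32::new(0) };
    m.lock();
    m.unlock();
}

Since FreeBSD's pthread_mutex is itself built on _umtx_op, both the 1.61 and the 1.65 binaries end up on essentially this same fast path, which matches the 1 ms difference above.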
The reason pthread does appear in the older binaries is that there are some places where Rust's own Mutex is used; in recent versions, even Rust's Mutex implementation uses futexes, so there are no pthread calls from then onwards. On FreeBSD those symbols are still present, but truss shows that the futex is being used by both binaries:
> # Rust 1.61.0 (aarch64 FreeBSD 13.1)
> truss ./rustbench2/target/release/rustbench 2>| rg '_umtx'
_umtx_op(0xffffffffd7e8,UMTX_OP_WAKE,0x1,0x0,0x0) = 0 (0x0)
==> 193ms
_umtx_op(0x40a130c8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
_umtx_op(0x40291428,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x40a12700,UMTX_OP_WAIT,0x32bb3,0x0,0x0) = 0 (0x0)
> # Rust 1.65.0 (aarch64 FreeBSD 13.1)
> truss ./rustbench/target/release/rustbench 2>| rg '_umtx'
_umtx_op(0xffffffffd7e8,UMTX_OP_WAKE,0x1,0x0,0x0) = 0 (0x0)
==> 190ms
_umtx_op(0x40a130c8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
_umtx_op(0x40298428,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x40a12700,UMTX_OP_WAIT,0x32c12,0x0,0x0) = 0 (0x0)
This shows that they behave exactly the same on 1.65 and 1.61. I went further and compiled this project with Rust 1.61 on my Linux machine: the performance is still the same, and Rust beats Go even on previous versions, because parking_lot already had faster lock implementations.
So, the conclusion is that, even on Rust 1.61.0, this code will outperform Go, unless you're running it on macOS.
Simply because macOS doesn't have futexes.