What is faster in Erlang, maps or records? Should I use lists:filter or a list comprehension to keep only the odd elements of a list? How fast is my random number generator?
There are plenty of myths around Erlang code performance. Some are rooted in long-retired releases and no longer hold. Mature codebases often carry “performance-oriented implementations” of selected standard functions. But how much do they actually save?
In God we trust, all others must bring data
Answering the list filter/comprehension question is really this simple:
./erlperf 'run(Arg) -> [X||X<-Arg, X rem 2 =:= 1].' \
--init_runner 'lists:seq(1, 100).' \
'run(Arg) -> lists:filter(fun(X) -> X rem 2 =:= 1 end, Arg).' \
--init_runner 'lists:seq(1, 100).'
Code                                 ||        QPS      Time    Rel
run(Arg) -> [X||X<-Arg, X rem         1    2419 Ki    413 ns   100%
run(Arg) -> lists:filter(fun(X        1    1501 Ki    666 ns    61%
This incantation reveals how lists:filter performs compared to a list comprehension. Both versions receive the same input: a list generated by lists:seq(1, 100). A list comprehension takes 413 ns on average per call, while lists:filter takes 666 ns.
Yes, that’s how simple benchmarking should be.
Get the tool
erlperf is an open source tool published on GitHub. Clone the source code and build it tailored to your environment – Erlang/OTP version and OS.
git clone https://github.com/max-au/erlperf
cd erlperf
rebar3 escriptize
erlperf is also published to hex.pm. I often add it to my application dependencies under the test profile, bringing benchmarking into the automated testing routine.
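For example, a rebar.config along these lines (a minimal sketch – pin the version your project needs) pulls erlperf in for the test profile only:
%% rebar.config (sketch): erlperf as a test-only dependency
{profiles, [
    {test, [
        {deps, [erlperf]}
    ]}
]}.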
Benchmarking modes
erlperf implements several modes: continuous (default), timed (low-overhead) and concurrency estimation (squeeze).
Continuous mode
By default, erlperf runs your code for 3 seconds, taking a sample every second. A sample is the number of iterations – how many times the runner code actually ran during that second. The average of the samples is then reported. For example, if your function runs for 20 milliseconds, erlperf may capture samples of 48, 52 and 50 iterations; the reported average would be 50.
This approach works well for CPU-bound calculations, but may produce unexpected results for slow functions that take longer than the sample duration. For example, timer:sleep(2000) with default settings yields zero throughput.
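One workaround (a sketch – the -d flag, sample duration in milliseconds, is described below) is to stretch the sample well beyond the runner duration:
./erlperf 'timer:sleep(2000).' -d 5000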
You can spawn multiple processes running the same runner code concurrently:
./erlperf 'rand:uniform().'
Code                 ||        QPS      Time
rand:uniform().       1   15725 Ki     64 ns
./erlperf 'rand:uniform().' -c 2
Code                 ||        QPS      Time
rand:uniform().       2   29339 Ki     67 ns
Note that the Time metric stays the same, because the time of a single iteration does not change. Total throughput (QPS, a historical name for “iterations per sample”), however, doubles, demonstrating linear performance growth with increasing concurrency.
You can change the sampling rate (sample duration, in milliseconds) and the number of samples to take. This does not change Time, because it is the cost of a single iteration. Throughput, however, drops 5 times, because the sample duration is 5 times shorter:
./erlperf 'rand:uniform().' -d 200
Code                 ||        QPS      Time
rand:uniform().       1    3037 Ki     64 ns
Timed mode
In this mode erlperf loops your code a specified number of times, measuring how long it takes to complete – essentially what timer:tc does. This mode has slightly less overhead than continuous mode, shaving up to 4 ns per iteration:
./erlperf 'erlang:unique_integer().' -l 100M
Code                          ||       QPS      Time
erlang:unique_integer().       1    189 Mi      5 ns
./erlperf 'erlang:unique_integer().'
Code                          ||       QPS      Time
erlang:unique_integer().       1    108 Mi      9 ns
This difference may be significant if you’re profiling low-level ERTS primitives.
Running multiple concurrent processes is not supported in this mode. But you can run multiple versions of the code:
./erlperf ' rand:uniform().' --init_runner 'rand:seed(exsss).' \
'rand:uniform().' --init_runner 'rand:seed(exrop).' -l 10M
Code                 ||        QPS      Time    Rel
rand:uniform().       1   19465 Ki     51 ns   100%
 rand:uniform().      1   16132 Ki     61 ns    82%
Note the trick: the runner source code is identical, and the only difference is the white space added to distinguish the two in the output. It is easy to see that the exrop random number generator is slightly faster than the default exsss.
Concurrency estimation mode
Erlang is famous for concurrency. But it is also practical – and it has primitives that may break concurrency. For example, it is easy to misuse an ETS table and run into lock contention. The example below runs erlperf in concurrency estimation (squeeze) mode:
./erlperf 'ets:insert(tab, {key, value}).' \
--init 'ets:new(tab, [public, named_table]).' -q
Code                                ||        QPS      Time
ets:insert(tab, {key, value}).       1   12365 Ki     80 ns
The init/0 code creates a public ETS table named tab, and the runner code then inserts the same key over and over again. This operation requires a write lock, effectively limiting concurrency to a single process. erlperf reports exactly that, estimating maximum concurrency at 1 process.
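For contrast (a hypothetical sketch – actual numbers depend on your hardware), having each runner process insert its own key into a table created with write_concurrency enabled should squeeze well past a single process:
./erlperf 'ets:insert(tab, {self(), value}).' \
--init 'ets:new(tab, [public, named_table, {write_concurrency, true}]).' -q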
This mode proves useful for finding concurrency bottlenecks. For example, some functions may have limited throughput because they perform remote calls served by a single process:
./erlperf 'code:is_loaded(local_udp).' -q
Code                            ||       QPS       Time
code:is_loaded(local_udp).       6   1504 Ki    3990 ns
In that example, code:is_loaded performs a gen_server:call to a singleton process.
Programmatic API
The command line interface works well for one-liners. More complex benchmarks can be driven through the erlperf module API. You can use it in tests, or even in production. The easiest way to try it is to run rebar3 shell in the erlperf repository root.
rebar3 shell
Eshell V12.3.1 (abort with ^G)
(erlperf@ubuntu)1> erlperf:run(rand, uniform, []).
14002976
In addition to all options available through the command line, the programmatic API supports various ways to define runner code, including MFA tuples and anonymous functions.
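For example (a sketch – consult the erlperf reference documentation for the full set of accepted forms), the benchmark above can also be expressed with an anonymous function:
%% assuming erlperf:run/1 accepts a zero-arity fun as the runner code
erlperf:run(fun() -> rand:uniform() end).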
Creating a benchmark
A benchmark job consists of mandatory runner code, executed in a tight loop, and optional init/0, init_runner/0,1 and done/0,1 functions.
init/0 is the easy one: it runs once when the job is started. It cannot accept any arguments, but it can return a value that may later be used by the runner, init_runner or done code. This function should be used to create resources shared between all runner processes, e.g. creating ETS tables or starting applications and required processes. It runs in the context of the job process.
init_runner/1 runs in the context of the spawned runner process. You may supply a simple Erlang statement – pg:join(runners, self()) – a function that accepts no arguments – init_runner() -> pg:join(runners, self()) – or a function that accepts one argument: the value returned from init/0. You can, for example, generate a unique name in init/0 and pass it to init_runner/1, runner/1 and done/1:
./erlperf \
--init 'list_to_atom("tmp_" ++ integer_to_list(erlang:unique_integer())).' \
--init_runner 'init_runner(Scope) -> {ok, _Pid} = pg:start(Scope), Scope.' \
--done 'done(Scope) -> gen:stop(Scope).' \
'runner(Scope) -> pg:join(Scope, group, self()), pg:leave(Scope, group, self()).'
Code                                            ||      QPS       Time
runner(Scope) -> pg:join(Scope, group, self      1   333 Ki    3005 ns
The same applies to done/0,1: you can define it with zero or one argument. For the latter, init/0 must also be defined.
runner/0,1,2 may accept zero arguments (a simple iteration), one argument – the return value of init_runner/0,1 – or two: the return value of init_runner, and a state that each iteration returns and receives back on the next one:
./erlperf --init_runner '1.' \
'run(_, S) -> io:format("~b~n", [S]), S + 1.' -l 5
1
2
3
4
5
Code                                           ||     QPS       Time
run(_, S) -> io:format("~b~n", [S]), S + 1.     1   44642   22400 ns
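The same job can be put together programmatically. A sketch, assuming erlperf:run/1 accepts a map mirroring the command line sections:
%% a sketch: init_runner seeds the state, runner threads it through iterations
erlperf:run(#{
    init_runner => "1.",
    runner => "run(_, S) -> S + 1."
}).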
Tips and tricks
Sometimes you just need a one-off benchmark for a small function you have just implemented, and adding a dependency on erlperf is too much of a hassle. You can simply change the current directory to the one containing your compiled BEAM files and run erlperf there:
cd ~/argparse/_build/test/lib/argparse/ebin/
erlperf/erlperf 'argparse:parse([], #{}).'
Code                          ||      QPS       Time
argparse:parse([], #{}).       1   970 Ki    1030 ns
Or you can provide an extra code path to erlperf and run it from another folder:
erlperf 'argparse:parse([], #{}).' -pa ~/argparse/_build/test/lib/argparse/ebin/
Code                          ||      QPS       Time
argparse:parse([], #{}).       1   999 Ki    1001 ns
If you have many applications and adding code paths is cumbersome, you can use the ERL_LIBS variable to pass the root folder of your release:
ERL_LIBS=~/argparse/_build/test/lib ./erlperf 'argparse:parse([], #{}).'
Code                          ||       QPS      Time
argparse:parse([], #{}).       1   1002 Ki    997 ns
It’s only the beginning
There are many more features packed into erlperf. You can make it print Erlang VM statistics (see the -v option), run your benchmarks isolated in a separate VM (-i), or leave your benchmark running, perform a hot code upgrade and watch throughput change in real time. You can configure concurrency estimation mode to start with 32 processes and never go beyond 64. You can run your benchmarks in a cluster and watch erlperf's cluster-wide reporting.
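A sketch of configuring those estimation bounds programmatically, assuming erlperf:run/3 accepts concurrency test options with min and max keys:
%% hypothetical bounds: start squeezing at 32 processes, stop at 64
erlperf:run(#{runner => "rand:uniform()."}, #{}, #{min => 32, max => 64}).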
If you find a problem or want to request a feature, please use the GitHub issues list. Or, even better, submit a pull request with a test case, a bugfix or an improvement.