erlperf 2.0: benchmarking Erlang code

What is faster in Erlang, maps or records? Should I use lists:filter or a list comprehension to remove odd elements from the list? How fast is my random number generator?

There are plenty of myths around Erlang code performance. Some are rooted in the past and no longer hold. Mature codebases often carry “performance-oriented implementations” of selected standard functions. But how much do they actually save?

In God we trust, all others must bring data

Answering the list filter/comprehension question is really this simple:

./erlperf 'run(Arg) -> [X||X<-Arg, X rem 2 =:= 1].' \
   --init_runner 'lists:seq(1, 100).' \
   'run(Arg) -> lists:filter(fun(X) -> X rem 2 =:= 1 end, Arg).' \
   --init_runner 'lists:seq(1, 100).'

Code                               ||        QPS      Time    Rel
run(Arg) -> [X||X<-Arg, X rem       1    2419 Ki    413 ns   100%
run(Arg) -> lists:filter(fun(X      1    1501 Ki    666 ns    61%

This incantation reveals lists:filter performance compared to a list comprehension. Both versions receive the same argument as an input: a list generated by lists:seq(1, 100). It takes 413 ns on average for a list comprehension to run, and 666 ns for lists:filter.

Yes, that’s how simple benchmarking should be.

Get the tool

erlperf is an open source tool published on GitHub. Clone the source code and build it tailored to your environment – Erlang/OTP version and OS.

git clone https://github.com/max-au/erlperf
cd erlperf
rebar3 escriptize

erlperf is also published to hex.pm. I often add it to my application dependencies for the test profile, to make benchmarking part of the automated testing routine.
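
For example, a minimal sketch of a rebar.config entry that does this (the version constraint is illustrative – pick the erlperf release you need from hex.pm):

%% rebar.config: erlperf as a test-only dependency
{profiles, [
    {test, [
        {deps, [{erlperf, "2.0.0"}]}
    ]}
]}.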

Benchmarking modes

erlperf implements several modes: continuous (default), timed (low-overhead) and concurrency estimation (squeeze).

Continuous mode

By default erlperf runs your code for 3 seconds, and takes a sample every second. A sample is reported as a number of iterations – how many times the runner code actually ran. Then the average is taken and reported. To give an example, if your function runs for 20 milliseconds, it can happen that erlperf captures samples with 48, 52 and 50 iterations. The average would be 50.

This approach works well for CPU-bound calculations, but may produce unexpected results for slow functions that take longer than sample duration. For example, timer:sleep(2000) with default settings yields zero throughput.
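
As a sketch of that effect (output omitted, since every sample completes zero iterations with the default 1000 ms sample duration):

./erlperf 'timer:sleep(2000).'

Raising the sample duration above the runner time, e.g. with -d 5000 (the option is described below), lets each sample capture at least one iteration.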

You can spawn multiple processes concurrently iterating over the same runner code (the -c option below).

./erlperf 'rand:uniform().'
Code                    ||        QPS       Time
rand:uniform().          1   15725 Ki      64 ns

./erlperf 'rand:uniform().' -c 2
Code                    ||        QPS       Time
rand:uniform().          2   29339 Ki      67 ns

Note that the Time metric stays the same, because a single iteration's time does not change. However, total throughput (QPS, a historical name for “iterations per sample”) doubles, demonstrating linear performance growth with increasing concurrency.

You can change the sampling rate (sample duration, in milliseconds) and the number of samples to take. This does not change Time, because it is still the cost of a single iteration. But throughput drops 5 times, because the sample duration is 5 times shorter:

./erlperf 'rand:uniform().' -d 200
Code                    ||        QPS       Time
rand:uniform().          1    3037 Ki      64 ns

Timed mode

In this mode erlperf loops your code a specified number of times, measuring how long it took to complete. It is essentially what timer:tc does. This mode has slightly less overhead compared to continuous mode, shaving up to 4 ns per iteration:

./erlperf 'erlang:unique_integer().' -l 100M
Code                             ||        QPS       Time
erlang:unique_integer().          1     189 Mi       5 ns

./erlperf 'erlang:unique_integer().'
Code                             ||        QPS       Time
erlang:unique_integer().          1     108 Mi       9 ns

This difference may be significant if you’re profiling low-level ERTS primitives.

Running multiple concurrent processes is not supported in this mode. But you can run multiple versions of the code:

./erlperf ' rand:uniform().' --init_runner 'rand:seed(exsss).'\
    'rand:uniform().' --init_runner 'rand:seed(exrop).' -l 10M
Code                     ||        QPS       Time     Rel
rand:uniform().           1   19465 Ki      51 ns    100%
 rand:uniform().          1   16132 Ki      61 ns     82%

Note the trick: the runner source code is identical; the only difference is whitespace added to distinguish the two in the output. It's easy to notice that the exrop random number generator is slightly faster than the default exsss.

Concurrency estimation mode

Erlang is famous for concurrency. But it's also practical – and it has primitives that may break concurrency. For example, it is easy to misuse an ETS table and run into lock contention. The example below runs erlperf in the concurrency estimation (squeeze) mode:

./erlperf 'ets:insert(tab, {key, value}).' \
    --init 'ets:new(tab, [public, named_table]).' -q

Code                                   ||        QPS       Time
ets:insert(tab, {key, value}).          1   12365 Ki      80 ns

init/0 code creates a public ETS table named ‘tab’, and then the runner code tries to insert the same key over and over again. This operation requires a write lock, effectively limiting concurrency to a single process. And erlperf reports exactly that, estimating maximum concurrency at 1 process.
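
As a hedged variation on the same example (numbers depend on hardware, so no output is shown): creating the table with write_concurrency enabled and inserting a per-process key removes the single contended lock, so the same squeeze run would be expected to scale past one process.

./erlperf 'ets:insert(tab, {self(), value}).' \
    --init 'ets:new(tab, [public, named_table, {write_concurrency, true}]).' -q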

This mode proves useful to find concurrency bottlenecks. For example, some functions may have limited throughput because they execute remote calls served by a single process:

./erlperf 'code:is_loaded(local_udp).' -q
Code                               ||        QPS       Time
code:is_loaded(local_udp).          6    1504 Ki    3990 ns

In that example, code:is_loaded performs a gen_server:call to a singleton process.

Programmatic API

The command line interface works well for one-liners. Benchmarking more complex cases is available via the erlperf module API. You can use it in tests, or even in production. The easiest way to try it is to run rebar3 shell in the erlperf repository root.

rebar3 shell

Eshell V12.3.1  (abort with ^G)
(erlperf@ubuntu)1> erlperf:run(rand, uniform, []).
14002976

In addition to all options available through the command line, the programmatic API also supports various ways to define runner code, including MFA tuples and anonymous functions.
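
For example, a minimal sketch with an anonymous function as the runner; the return value is the average number of iterations per second, the same figure as the QPS column on the command line:

%% anonymous function as the runner; returns average iterations per second
erlperf:run(fun() -> rand:uniform() end).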

Creating a benchmark

A benchmark job consists of mandatory runner code that is executed in a tight loop, and optional init/0, init_runner/0,1 and done/0,1 functions.

init/0 is the easy one: it runs once when the job is started. It cannot accept any arguments, but it can return a value that may later be used by the runner, init_runner or done code. This function should be used to create resources shared between all runner processes, e.g. creating ETS tables, starting applications or required processes. It runs in the context of the job process.

init_runner/1 runs in the context of the spawned runner process. You may supply a simple Erlang statement – pg:join(runners, self()), a function that accepts no arguments – init_runner() -> pg:join(runners, self()) or a function that accepts one argument – value returned from init/0. You can, for example, generate a unique name in init/0, and pass it to init_runner/1, runner/1 and done/1:

./erlperf \
    --init 'list_to_atom("tmp_" ++ integer_to_list(erlang:unique_integer())).' \
    --init_runner 'init_runner(Scope) -> {ok, _Pid} = pg:start(Scope), Scope.' \
    --done 'done(Scope) -> gen:stop(Scope).' \
    'runner(Scope) -> pg:join(Scope, group, self()), pg:leave(Scope, group, self()).'

Code                                            ||        QPS       Time
runner(Scope) -> pg:join(Scope, group, self      1     333 Ki    3005 ns

Same applies to done/0,1: you can define it with zero or one argument. For the latter, init/0 must also be defined.

runner/0,1,2 may accept zero arguments (simple iteration), one – return value of init_runner/0,1, or two – return value of init_runner, and a state passed to the next iteration:

./erlperf --init_runner '1.' \
'run(_, S) -> io:format("~b~n", [S]), S + 1.' -l 5
1
2
3
4
5
Code                                            ||        QPS       Time
run(_, S) -> io:format("~b~n", [S]), S + 1.      1      44642   22400 ns
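
The same structure is also available programmatically. A sketch, assuming the code-map form of erlperf:run/1 documented for erlperf 2.0 (the map keys mirror the init, init_runner, runner and done pieces above; bench_tab is just an illustrative table name):

%% init/0 creates a shared ETS table, init_runner/1 passes it on,
%% runner/1 exercises it, done/1 cleans up when the job stops
erlperf:run(#{
    init        => fun() -> ets:new(bench_tab, [public, named_table]) end,
    init_runner => fun(Tab) -> Tab end,
    runner      => fun(Tab) -> ets:insert(Tab, {key, value}) end,
    done        => fun(Tab) -> ets:delete(Tab) end
}).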

Tips and tricks

Sometimes you just need to run a one-off benchmark for a small function you just implemented. It's too much of a hassle to add a dependency on erlperf for that. You can simply change the current directory to the one containing the compiled BEAM files and run erlperf:

cd ~/argparse/_build/test/lib/argparse/ebin/
erlperf/erlperf 'argparse:parse([], #{}).'

Code                             ||        QPS       Time
argparse:parse([], #{}).          1     970 Ki    1030 ns

Or you can provide an extra code path to erlperf and run it from another folder:

erlperf 'argparse:parse([], #{}).' -pa ~/argparse/_build/test/lib/argparse/ebin/

Code                             ||        QPS       Time
argparse:parse([], #{}).          1     999 Ki    1001 ns

If you have many applications and adding code paths is cumbersome, you can use the ERL_LIBS variable to point at the lib folder of your release:

ERL_LIBS=~/argparse/_build/test/lib ./erlperf 'argparse:parse([], #{}).'

Code                             ||        QPS       Time
argparse:parse([], #{}).          1    1002 Ki     997 ns

It’s only the beginning

There are many more features packed into erlperf. You can make it print Erlang VM statistics (see the -v option), run your benchmarks isolated in a separate VM (-i), or leave your benchmark running, perform a hot code upgrade and watch throughput changes in real time. You can configure concurrency estimation mode to start with 32 processes and never go beyond 64. You can run your benchmarks in a cluster and watch erlperf's cluster-wide reporting.

If you find a problem or want to submit a feature request, please use GitHub issues list. Or, even better, submit a pull request with the test case, a bugfix or improvement.
