peer: distributed application testing

Erlang distribution is one of the most important features of the VM and runtime. Transparent message exchange between processes running on different nodes is incredible. But how do I test it?

There are several ways to run Common Test cases with multiple Erlang nodes involved. One is documented in Common Test for Large-Scale testing. I have yet to see anyone use it. Even the great guide provided by Learn You Some Erlang doesn’t shed enough light to get started. But at least it points to the ct_slave module. There are a few articles explaining how to use it, and some source code examples on GitHub. So when I began working on Scalable Process Groups, it was my natural choice for Common Test.

Down the rabbit hole

Originally spg was meant to be a transparent replacement for pg2. Hence I tried to base my test suite on pg2_SUITE. The first surprise was to find out that ct_slave is not used there. I discovered the test_server:start_node function, capable of doing what I needed, and much more. It is undocumented, therefore unsupported, so I did not want to use it for my tests.

I followed it further down the source code and eventually found slave. It started getting awkward, since its API was more or less the same as that of ct_slave. I kept looking into OTP test suites, and found several more implementations of the same concept. Some test suites went with os:cmd to start extra nodes. Some did open_port({spawn, "erl"}). Some implementations, e.g. loose_node, were reused in multiple suites.

With 13 different standards available, what choice do I have, other than inventing a 14th one?

peer: bringing everything together

There were multiple reasons why I decided to come up with yet another implementation of a connected Erlang node. No existing primitive was a solution I could be happy with.

First, I always had trouble with command line escaping when starting extra nodes. In the example below I’m trying to start Erlang with the “my value” string passed as the “key” configuration parameter of the kernel application.

1> {ok, N1} = slave:start("localhost", one, "-kernel key my value").  
{ok,one@localhost}
2> rpc:call(N1, application, get_env, [kernel, key]).                 
{ok,my}
3> {ok, N2} = slave:start("localhost", two, "-kernel key 'my value'").
{ok,two@localhost}
4> rpc:call(N2, application, get_env, [kernel, key]).                 
{ok,my}
5> {ok, N3} = slave:start("localhost", three, "-kernel key \"my value\""). 
{ok,three@localhost}
6> rpc:call(N3, application, get_env, [kernel, key]).                     
{ok,my}
7> {ok, N4} = slave:start("localhost", four, "-kernel key '\"my value\"'").
** exception error: no match of right hand side value {error,timeout}

The last attempt in fact resulted in a crash dump, but there was no way to find that out. Spoiler alert – peer makes it easy to understand what went wrong if the extra node refuses to start:

1> {ok, _, N} = peer:start(#{name => four, args => ["-kernel", "key", "my value"], connection => standard_io}).
{"could not start kernel pid",application_controller,"{bad_environment_value,\"my value\"}"}
could not start kernel pid (application_controller) ({bad_environment_value,"my value"})

I also needed an alternative way to communicate between two nodes to simulate a netsplit. While I found a smart technique in OTP (spawning two nodes, connected to the original one, but not to each other), it requires an elaborate setup ensuring that global won’t make a mesh out of my loosely connected cluster. For some test cases I did not want the node to be distributed at all, so the two-node technique could not work for me.

(test@localhost)1> {ok, N1} = slave:start("localhost", one).
{ok,one@localhost}
(test@localhost)2> {ok, N2} = slave:start("localhost", two).
{ok,two@localhost}
(test@localhost)3> nodes().
[one@localhost,two@localhost]
(test@localhost)4> rpc:call(N1, erlang, nodes, []).
[test@localhost,two@localhost]

I also wanted my tests to run in parallel. And I absolutely did not want one failing test case to affect all cases running after it. Say, a_SUITE starts an extra node named one, and b_SUITE attempts to do the same, and fails:

(test@localhost)5> {ok, N3} = slave:start("localhost", one).
** exception error: no match of right hand side value {error,{already_running,one@localhost}}

This problem wasn’t new to OTP. I found quite a number of smart ways to generate unique node names. Yet some test suites weren’t playing nice and were simply trying to halt any nodes that were connected to the test runner. There was even a function kill_slaves that felt awkward – not just for what it was doing, but also for its naming.

Naming was one of the reasons for me to come up with a new implementation. Not only did I want the language to be neutral; it is also plain wrong to call extra nodes “slaves”, for they have full power over the original node, e.g. they can halt it. I did not even need to solve the hardest problem of naming a newborn, as the term peer was already used by Common Test (although undocumented).

The road to OTP

I wrote the first, very incomplete, implementation in 2019, specifically to test spg. I had to drop that code later in favour of test_server:start_node to proceed with the upstream contribution, pg. But I kept peer in the internal WhatsApp repository, and evolved it further. Eventually all tests starting extra nodes were using it. It was a success already. But the journey was only 1% done. My goal was to bring together all implementations, and set up an industry standard approach for the BEAM community.

This could only be achieved with peer contributed to OTP, which means high quality standards: great documentation and usage examples; FreeBSD, SunOS and Windows compatibility; cleaning up OTP test suites; and finally deprecating the existing modules, slave and ct_slave.

I wanted to ensure that peer supports all necessary features before submitting the initial pull request. I visited darkest OTP corners and had lots of “how old is this code” moments. Some were eye-openers, “I could never imagine a peer node could be used that way!”

Deep analysis allowed the pull request to be merged relatively fast, given the amount of changes it contains: over 5,500 lines of code! Several follow-up PRs scrubbed over a thousand lines of copy-pasted code from OTP test suites. There are a few more in the works, applying the new testing guidelines to the remaining OTP applications.

As of November 2021, peer has been officially accepted, and is going to be released as a part of the OTP 25 standard library (stdlib).

peer highlights

peer was designed to replace a large number of scattered implementations, supporting both “raw” nodes and “well done” nodes behaving according to the (undocumented) Common Test guidelines.

Alternative connection via standard I/O

My personal favourite, this feature allows starting peers that are not distributed, but still capable of executing remote calls. The origin node does not need to be distributed either:

# erl
1> {ok, Peer, 'nonode@nohost'} = peer:start(#{connection => standard_io}).
{ok,<0.87.0>,nonode@nohost}
2> peer:call(Peer, erlang, is_alive, []).
false

You can then choose to start distribution dynamically, connect over distribution, simulate net splits, and even stop distribution without the peer node going down:

3> peer:call(Peer, net_kernel, start, [[one, shortnames]]).
{ok,<9331.79.0>}
4> peer:call(Peer, erlang, node, []).                      
one@ubuntu
5> peer:call(Peer, net_kernel, stop, []).                   
ok
6> peer:call(Peer, erlang, node, []).    
nonode@nohost
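The same building blocks can be combined into a small netsplit experiment. The following is only a sketch under a few assumptions: the module name netsplit_sketch is made up for illustration, both peers are assumed to share the default cookie, and epmd is assumed to be available on the local host. Each peer is controlled over standard I/O, so the origin node does not have to be distributed, and the peers can be connected and disconnected at will.

```erlang
-module(netsplit_sketch).
-export([run/0]).

%% Start two distributed peers controlled over standard I/O,
%% connect them to each other, then force a disconnect.
run() ->
    %% peer:random_name/0 avoids clashes with nodes left by other tests
    Opts = fun() ->
               #{name => peer:random_name(), connection => standard_io}
           end,
    {ok, P1, _N1} = peer:start_link(Opts()),
    {ok, P2, N2} = peer:start_link(Opts()),
    %% the control channel is not Erlang distribution, so these calls
    %% keep working no matter what happens to the distribution links
    Connected = peer:call(P1, net_kernel, connect_node, [N2]),
    Before = peer:call(P1, erlang, nodes, []),
    _ = peer:call(P1, erlang, disconnect_node, [N2]),
    After = peer:call(P1, erlang, nodes, []),
    ok = peer:stop(P1),
    ok = peer:stop(P2),
    {Connected, Before, After}.
```

The function returns what the first peer saw before and after the split instead of asserting on it, since how quickly the node list changes may depend on the environment.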

But the fanciest feature is console redirection for peer nodes. You can get a useful error message when a peer node is not able to start, or when it tries to dump something via the erlang:display function:

1> peer:start(#{connection => standard_io, 
                args => ["-kernel", "key", "my value"]}).
{"could not start kernel pid",application_controller,"{bad_environment_value,\"my value\"}"}
could not start kernel pid (application_controller) ({bad_environment_value,"my value"})

Crash dump is being written to: erl_crash.dump...done
** exception exit: {boot_failed,{exit_status,1}}
     in function  peer:start_it/2 (peer.erl, line 457)

There is also a less impressive alternative connection over TCP. It does not give you console redirection, but allows testing code that affects console I/O.
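A minimal sketch of the TCP variant follows. The module name tcp_peer_sketch is hypothetical; passing 0 as the connection option asks peer to pick a free port for the control connection.

```erlang
-module(tcp_peer_sketch).
-export([run/0]).

%% Start a non-distributed peer controlled over TCP and check
%% that it can still execute remote calls.
run() ->
    %% connection => 0 selects a TCP control connection on any free port
    {ok, Peer, Node} = peer:start_link(#{connection => 0}),
    %% no name was given, so the peer is not expected to be distributed
    Alive = peer:call(Peer, erlang, is_alive, []),
    ok = peer:stop(Peer),
    {Node, Alive}.
```

Because the peer's standard I/O is left alone in this mode, code under test is free to read from or write to the console.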

Testing releases in Docker containers

The award-winning question: “how do I test my Erlang application packaged as a Docker container?” The official documentation provides an example! The test suite starts with building a release enclosed in a Docker container. Then peer starts with an alternative connection over console, allowing RPC between the test runner and nodes shielded with Docker. You can even form a cluster of containers to verify your favourite service discovery library.

Flexibility that peer provides also allows starting extra nodes on remote hosts, for example, via ssh. You can use any other wrapper – some OTP test suites leverage this to waste file descriptors.
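As a rough illustration of the ssh case, the sketch below wraps the peer command line with the exec option. The module name ssh_peer_sketch and the Host argument are placeholders; it assumes passwordless ssh access to that host and an erl executable on the remote PATH.

```erlang
-module(ssh_peer_sketch).
-export([start_remote/1]).

%% Start a peer on a remote host by running "ssh Host erl".
%% The control connection goes over the ssh session's standard I/O.
start_remote(Host) when is_list(Host) ->
    case os:find_executable("ssh") of
        false ->
            %% no ssh client on this machine
            {error, ssh_not_found};
        Ssh ->
            peer:start_link(#{exec => {Ssh, [Host, "erl"]},
                              connection => standard_io})
    end.
```

The same exec option accepts any other wrapper executable, which is how a container runtime can be put in front of the peer as well.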

Common Test compatibility

In addition to peer itself, there are a number of helpers designed to work with Common Test suites.

?CT_PEER_NAME() creates a reasonably unique node name, based on the module and calling function name. To give an example, nodes started by my_SUITE while running my_testcase will have names like my_SUITE-my_testcase-123-56. The last two numbers are the origin node’s process identifier and a unique integer. This makes it easy to identify tests that create runaway nodes.
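Outside of Common Test, a similar effect can be had with peer:random_name/1. The sketch below uses a made-up module name (name_sketch) and prefix (my_app); it only demonstrates that two consecutive calls produce different names.

```erlang
-module(name_sketch).
-export([unique_names/0]).

%% Generate two node names sharing a prefix and check they differ.
unique_names() ->
    Name1 = peer:random_name(my_app),
    Name2 = peer:random_name(my_app),
    %% each call embeds a fresh unique integer in the name
    true = Name1 =/= Name2,
    {Name1, Name2}.
```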

?CT_PEER() starts a new peer node that gets connected via Erlang distribution and is ready to accept RPC. If this node crashes, dumps are created in the location expected by Common Test, and not in the current directory. If the test runner has coverage support turned on, the peer node will also start cover. Additionally, the code path for the peer node gets updated with the test suite directory, so your test can run over RPC:

rpc_case(Config) when is_list(Config) ->
    {ok, Peer, Node} = ?CT_PEER(),
    ok = rpc:call(Node, ?MODULE, run_remote, []).

run_remote() ->
    io:format("This code is executed in the peer node context~n").

?CT_PEER(["arg1", "arg2"]) does all that, and also passes additional arguments to the BEAM.

?CT_PEER(#{connection => standard_io}) enables the alternative connection via console. You can pass all other peer startup options this way too, including the node name:

{ok, Peer, Node} = ?CT_PEER(),
%% restart peer node with the same name and alternative TCP connection
peer:stop(Peer),
{ok, Peer2, Node} = ?CT_PEER(#{name => Node, connection => 0}).

Finally, if you’re developing something that is expected to be compatible with the grandparent OTP version (two releases back), you might find ?CT_PEER used this way:

Rel = integer_to_list(list_to_integer(erlang:system_info(otp_release)) - 2),
case ?CT_PEER([], Rel, PrivDir) of
    not_available ->
        {skip, "OTP " ++ Rel ++ " not found"};
    {ok, Peer, Node} ->
        ?assertEqual(Rel, 
            rpc:call(Node, erlang, system_info, [otp_release])),
        peer:stop(Peer)
end.

Stopping extra nodes

?CT_PEER() starts the peer node linked to the current process. This ensures that the extra node will be stopped if the test case fails (crashing the test runner process). However, the default Common Test mode is to run cases sequentially in the same process. Therefore I recommend stopping extra nodes with a peer:stop call in the test code itself.
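One way to follow that recommendation is to wrap the test body in try ... after, so the peer is stopped whether the assertions pass or not. This is only a sketch: the suite name stop_SUITE and the test case are invented, and it assumes the test runner itself is a distributed node so that ?CT_PEER() can connect over distribution.

```erlang
-module(stop_SUITE).
-include_lib("common_test/include/ct.hrl").
-export([all/0, ping_case/1]).

all() -> [ping_case].

%% Start a peer, use it, and always stop it explicitly.
ping_case(Config) when is_list(Config) ->
    {ok, Peer, Node} = ?CT_PEER(),
    try
        %% the actual test body goes here
        pong = net_adm:ping(Node)
    after
        %% runs on success and on failure alike
        peer:stop(Peer)
    end.
```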

Linking has an important consequence: when the controlling process of a peer node is started as part of the init_per_suite/1 callback, it needs to be unlinked:

init_per_suite(Config) ->
    {ok, Peer, Node} = ?CT_PEER(),
    unlink(Peer),
    [{node, Node}, {peer, Peer} | Config].

Otherwise the peer will be stopped before your test cases are executed, as init_per_suite runs in a separate process that exits immediately after the callback returns.
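An unlinked peer no longer goes down with its starter, so the suite has to stop it on its own. A possible shape for a whole suite is sketched below; the suite name shared_peer_SUITE and the test case are hypothetical, and the test runner is again assumed to be distributed.

```erlang
-module(shared_peer_SUITE).
-include_lib("common_test/include/ct.hrl").
-export([all/0, init_per_suite/1, end_per_suite/1, ping_case/1]).

all() -> [ping_case].

%% One peer shared by every test case in the suite.
init_per_suite(Config) ->
    {ok, Peer, Node} = ?CT_PEER(),
    %% keep the peer alive after this short-lived process exits
    unlink(Peer),
    [{node, Node}, {peer, Peer} | Config].

%% Nothing else will stop the unlinked peer, so do it here.
end_per_suite(Config) ->
    peer:stop(proplists:get_value(peer, Config)),
    ok.

ping_case(Config) when is_list(Config) ->
    pong = net_adm:ping(proplists:get_value(node, Config)).
```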

Exit code support

In some cases peer nodes are not even expected to boot correctly. You can catch the exception and verify that the correct exit code is returned:

try
    ?CT_PEER(#{connection => standard_io, args => ["-no_epmd"]}),
    ct:fail(unexpected_no_epmd_start)
catch
    exit:{boot_failed, {exit_status, 1}} ->
        ok
end

For debugging convenience, when the controlling process crashes (e.g. the test case terminates abnormally) and prints a crash report, the command line used to start the peer node is available in this report.

Halfway there

While the journey is no longer at the 1% mark, it’s still only halfway there. Several more OTP applications need to be updated. Only after that can the undocumented part of test_server be deprecated and removed. I hope this will be a part of the OTP 25 release, which is likely to be published in May 2022.

Even that won’t be the end of the journey. Backward compatibility must be retained for at least two years, keeping slave and ct_slave around. Only when that’s done will I officially call it a day and have a pear.
