Erlang VM provides outstanding observability. It takes just a few lines to question VM about any process it runs. It is safe and performant enough to be used in production, given due diligence.
Two major features available in the BEAM are sequential tracing (younger runtimes call that distributed tracing) and function call tracing. In this article I am going to talk about the latter.
Erlang tracing BIFs
Call tracing has similarities with debugging breakpoints. Traditional debuggers can set those to stop the process (or thread) upon entering a specific function. Then an operator can print arguments and go step-by-step over the function clauses.
This does not work well for realtime systems, causing call timeouts and negative downstream effects. There are debuggers with non-stop mode allowing to continue execution after hitting a breakpoint, but it changes the model from active debugging to passive observation. Which is what Erlang tracing offers.
It is surprisingly easy to start with erlang:trace/3
BIF:
(primary@max-au.com)1> erlang:trace(whereis(pg), true, ['receive']).
1
(primary@max-au.com)2> flush().
Shell got {trace,<9386.155.0>,'receive',{join,<14359.156.0>,lambda,<14359.2958.3>}}
Shell got {trace,<9386.155.0>,'receive',{join,<14359.156.0>,alpha,<14359.3028.3>}}
The example above traces messages received by pg
process. Every time pg
extracts the matching message from its queue, a trace message is sent to the tracer process. By default erlang:trace/3
sets the calling process as the tracer (there are options to change this behaviour). That’s why flushing shell message queue prints the trace.
erlang:trace/3
BIF is unbelievably powerful. Apart from tracing message passing, you can observe plethora of other events: processes spawned, linked or unlinked, and even garbage collection. Yet the most exciting feature is function call tracing.
Tracing function calls
Calling trace(whereis(pg), true, [call])
is not enough to start tracing: you also need to specify function names to trace.
---------------------------------------------------------
| pg
--------------------------------------------------------
| handle_call/3
| |---- join_local/3
| |-[+] join_local_update_ets/5
| | ---- notify_group/5
| |---- broadcast/2
This is done with trace_pattern/2,3
help. Here is an example:
%% start the default pg scope
1> pg:start(pg).
{ok,<0.88.0>}
%% enable call tracing for all functions of pg module
2> erlang:trace_pattern({'pg', '_', '_'}, true, [local]).
62
%% disable tracing for pg:handle_call/3
3> erlang:trace_pattern({'pg', handle_call, 3}, false, [local]).
1
%% trace
4> erlang:trace(whereis(pg), true, [call]).
1
5> pg:join(group, self()).
ok
6> flush().
Shell got {trace,<0.88.0>,call,{pg,join_local,[<0.86.0>,group,#{}]}}
Shell got {trace,<0.88.0>,call,{pg,join_local_update_ets,[pg,#{},#{},group,<0.86.0>]}}
Shell got {trace,<0.88.0>,call,{pg,notify_group,[#{},#{},join,group,[<0.86.0>]]}}
Shell got {trace,<0.88.0>,call,{pg,broadcast,[[],{join,<0.88.0>,group,<0.86.0>}]}}
This code demonstrates basic technique. Trace messages are generated every time a traced process enters (and/or returns from) traced function. Functions and modules are manipulated with trace_pattern
, processes with trace/3
.
Call tracing opens up limitless capabilities. One example is profiling: fprof
in Erlang works by calculating time difference between function invocations and returns. Another example is “tracing debugger” concept: every single function call is recorded and can be replayed just as if a developer is clicking “Step In” and “Step Over” buttons. This concept is really promising, but there is a serious challenge: amount of information that needs to be saved. Enabling tracing in production may generate gigabytes of trace logs in less than a second, causing out-of-memory crash. It is the reason to avoid raw tracing APIs in production.
Fortunately, there are a number of apps closing this gap. There is also dbg
that comes for free with OTP, but its API is intimidating enough to be avoided even by experienced engineers. Even harder it is to use ttb
, Trace Tool Builder, and Observer application visualising traces. Hence I’ll explore a few simple options that are often enough to get the desired data.
recon_trace
Part of the larger recon
library, this tool is well documented and easy to use. There is even a whole chapter of Erlang in Anger book by the same author. The most important improvement over raw tracing BIFs is ability to limit trace volume in terms or rate (per second) and/or total amount of trace messages:
1> recon_trace:calls({pg, '_', '_'}, 2, [{scope, local}, {pid, pg}]).
62
10:34:57.961042 <0.155.0> pg:join_local(<0.86.0>,group,#{})
10:34:57.962113 <0.155.0> pg:join_local_update_ets(pg,#{},#{},group,<0.86.0>)
The example above traces invocations of all pg
module functions for a process named pg
. Tracing stops as soon as 2 trace messages are received. This ensures safety of a production node, – but if you set the limit too high, it’s still possible to run out of memory.
It’s also possible to print the return value by adding return_to
option to the list of options. You can also trace multiple processes, or even all of them. And… that’s pretty much it. If you want more (e.g. rate-limited message send/receive tracing or elaborate trace patterns), you’re after something else.
redbug
This application is a bit more versatile at the expense of some learning curve. Documentation is available but somewhat scarce. There are a few blog posts expanding usage examples, and here is one more: the very same trace for pg
process calling any function in pg
module, stopping after 2 trace messages:
1> redbug:start("pg", [{msgs, 2}, {procs, [pg]}]).
{1,62}
% 11:13:09 <0.155.0>(pg)
% pg:join_local(<0.86.0>, group, #{})
% 11:13:09 <0.155.0>(pg)
% pg:join_local_update_ets(pg, #{},#{},group,<0.86.0>)
redbug done, msg_count - 2
Unlike recon_trace
, redbug
supports elaborate trace patterns and message send/receive tracing:
redbug:start('receive', [{msgs, 2}, {procs, [pg]}]).
{1,0}
% 11:18:09 <0.155.0>(pg)
% <<< {leave,<38208.143.0>,<38208.2487.4>,[group]}
% 11:18:09 <0.155.0>(pg)
% <<< {leave,<38208.143.0>,<38208.2955.4>,[group]}
redbug done, msg_count - 2
This snippet makes redbug
to trace two messages received by pg
. There are a number of even more advanced options like limiting message size or print depth. If you feel restrained by recon_trace
features, you may appreciate redbug
extras.
But sometimes the only thing you need is to look into a process state, or just log a bunch of events that are happening to your gen_server process. OTP behaviours let you do that without any extra tooling.
sys
This facility does not use Erlang tracing, so mentioning it here only for posterity. One of the oldest OTP parts, sys
has some functions to debug processes implementing OTP behaviours (gen_server, gen_statem and special processes coming with proc_lib). There is some documentation available, but it does not state important caveat: most of the debugging primitives there aren’t for production usage.
One case that’s relatively safe is gen_server state introspection:
(primary@max-au.com)5> sys:get_state(pg).
{state,pg,
#{<0.93.0> => {#Ref<0.1234032000.123207690.147899>,[group]}},
#{},#{},#{},#{}}
If your process isn’t crazy active, you can also log events for it:
(primary@max-au.com)6> sys:log(pg, true).
ok
(primary@max-au.com)7> pg:join(secondary, self()).
ok
(primary@max-au.com)10> sys:log(pg, get).
{ok,[{in,{'$gen_call',{<0.91.0>,
#Ref<0.4167693524.2003828739.179864>},
{join_local,secondary,<0.91.0>}}},
{out,ok,
{<0.91.0>,#Ref<0.4167693524.2003828739.179864>},
{state,pg,
#{<0.91.0> =>
{#Ref<0.4167693524.2003828739.179865>,[secondary]}},
#{},#{},#{},#{}}}]}
I found the rest of sys
functions less useful for debugging, with the exception of sys:replace_state
that allows me to run arbitrary code within the context of the target process and access to its state.
Want to learn more?
Hardcore Erlang developers probably know internals of ERTS tracing implementation. If you want to learn more, there is some internal documentation available, but the real source of truth, as usual, is in the source code.