Keep It Simple

Work of fiction, inspired by The Configuration Complexity Clock. Any resemblance to actual persons, living or dead, or any production code is purely coincidental.

Sunrise

It was a rainy day. I was waiting for a service deployment to complete. The process is deliberately slow and handled by another team. I had no control over it, so all I could do was wait for a simple change to deploy:

handle_cast({notify, <<"Australia">>, Content}, State) ->
    afp:gday_mate(Content),
    {noreply, State};
+handle_cast({notify, <<"USSR">>, Content}, State) ->
+    ?LOG_WARNING("KGB no longer available for ~p", [Content]),
+    {noreply, State};
handle_cast({notify, <<"USA">>, Content}, State) ->
    fbi:hi_mister(Content),
    {noreply, State};

I had no doubts that it worked, because I implemented a Common Test verifying expected behaviour:

ensure_content(Config) when is_list(Config) ->
    ok = content_filter:engage(<<"Innocent content">>),
    [] = kgb_deprecated:get_all_content().

The whole process felt slow, and I wasn’t really keen to write tests next time. I did not want to wait for CI to complete and CD to bore me any longer. Fair enough? So I called my code configuration. Since it’s a separate distribution channel that I can trigger anytime even without service owner approval, I was no longer limited by the code push schedule.

handle_cast({notify, Country, Content}, State) ->
    case global_config:get({agency, Country}) of
        {Mod, Fun} ->
            Mod:Fun(Content);
        _ ->
            ?LOG_ERROR("~s not supported", [Country])
    end,
    {noreply, State}.

It was convenient, but limited. As a next step I added a DSL: a list of {Module, Function, [Arguments]} tuples, plus a simple interpreter:

handle_cast({notify, Country, Content}, State) ->
    Script = global_config:get({agency, Country}),
    %% Thread the content through each {M, F, A} step of the script.
    lists:foldl(
        fun ({M, F, A}, Env) ->
            erlang:apply(M, F, [Env | A])
        end, Content, Script),
    {noreply, State}.

That let me move fast! But I made a few typos while setting the configuration:

global_config:set({agency, <<"USA">>}, [
    {fbi, hi_mister, []},
    {erlang, halt, [1]}
]).

During the incident review I was pointed to the SafeConfigDelivery system. It performs a slow rollout of global configuration changes, monitoring service health and reverting the change automatically if a problem is detected. I could safely write complex Erlang code and push it to production at my convenience, without even needing a blessing from the service owner, because config files are separated from the service code and managed by SafeConfigDelivery.

#{
    {agency, <<"USA">>} => [{fbi, hi_mister, []}, {erlang, halt, [1]}],
    {agency, <<"Australia">>} => [{afp, gday_mate, []}]
}.

It was so convenient! Unfortunately, SafeConfigDelivery was not aware of my DSL syntax. The only way to validate the config was through a failing deployment attempt.

But I found that SafeConfigDelivery can verify JSON syntax. So I rewrote the interpreter to accept JSON instead of the Erlang term format.

{
  "content": [
    {
      "agency": "USA",
      "script": [
        {
          "module": "fbi",
          "function": "hi_mister",
          "args": []
        },
        {
          "module": "erlang",
          "function": "halt",
          "args": [
            {
              "type": "integer",
              "value": "100"
            }
          ]
        }
      ]
    }
  ]
}

That wasn’t developer-friendly, and of course SafeConfigDelivery maintainers knew that. Hence they suggested using Python generators instead of raw JSON, as those were natively supported by their system.

I added a schema with a few helpers, and started writing Erlang code in Python:

import_json_schema("my_app/agency.schema", "*")

handle_cast = (
    ErlangCodeBuilder()
    .setBuilderType(erl_json.MFA)
    .setAuthor("genius")
    .addMFA(CaseExpr().condition(addMFA("maps", "get", [Atom("agency")])))
    .addMFA("erlang", "halt", [1])
    .build()
)

I was deeply satisfied with the solution and with my own DSL, which let me write Erlang in Python. Of course, I thoroughly documented it, explained how to write Python, and marketed it as a faster way to develop. I even provided a metric: the time from the first keystroke to users getting my change. Happy went I home.

Sunset

The phone woke me up at night. A robotic voice suggested that I join multiple ongoing incidents and participate in a Jenga session.

Complex system failure

Since I did a good job of marketing my “Erlang in Python” solution, several engineers tried to use it. They added a few scripts and changed some others. Every developer performed the necessary steps: wrote Python code, verified JSON correctness, and even tried the Python script locally and in the staging environment.

But one part was missing. I left no room for automated testing in my workflow. Several repositories and complicated systems were involved, none of which was designed for test automation.

Developers were breaking each other’s changes unknowingly. Eventually they formed a virtual team to do just one thing: serialise all changes to Python scripting, and test them manually. Every single script initially, and every combination later, when scripts started to depend on each other.

It completely defeated the goal of moving faster. But it still worked.

Until at some point another engineer decided to expand my interpreter, providing extra instruction support. The code release containing this change happened to have an unrelated bug, which resulted in CD reverting the code to the previous version. That coincided with SafeConfigDelivery running canaries and slowly rolling out Python scripting changes.

Positive Feedback Loop

Two machine brains fought each other, inducing a positive feedback loop. CD was rolling code back and forth, and SafeConfigDelivery did the same with the config. Sometimes, when the stars aligned, health checks passed and one or both systems proceeded to the next step, which might or might not fail, exhibiting nearly random behaviour depending on scheduler timings, kernel upgrades, stress runs, global_config changes and VM updates.

I spent a sleepless night, and then a few more, and a bunch of meetings, incident reviews, only to learn a lesson.

Complexity slows everyone

The route I took was incremental and safe. But the end state turned out to be less than desirable, and I failed to reach the goal. Where did I make the wrong turn?

Pointless Slow Sign

Pretty much every time.

  1. I wanted to “move fast”, and make changes to a service I did not own. I wanted to do it at my convenience.
  2. I decided to call my code “configuration”, and declared that it did not need to be tested in CI. That helped me move faster: I wrote no test and did a one-off manual run instead. But it slowed down every other developer: they had to verify unrelated changes manually just to ensure that my “config” wasn’t broken.
  3. I moved code to a separate system, SafeConfigDelivery. Even more developers started to break my code – there is no simple way to discover it!
  4. Service owners lost visibility. I could make any change, deploy and revert, leaving owners puzzled by alarms firing at random.
  5. I knew Python better than Erlang. I assumed everyone was like that, which was wrong. And what’s even worse, the language I used wasn’t Python. It was my DSL, combining the worst features of both Python and Erlang, while reaping the benefits of neither.
  6. I did not plan ahead. I did not think about how the system might evolve or how it could scale. Automated testing capabilities should have been taken into account from the very beginning. Otherwise testing gets progressively harder to retrofit, leaving the system vulnerable to operator errors.

I failed the main goal. Yes, initially I moved faster by incurring tech debt. But later it became a burden for me and my entire team.

How do I get back?

Simplify, and keep it simple

Just get back to the original state. Write Erlang code in Erlang, write tests with Common Test. Learn Erlang, even if Python feels more familiar.
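
As a minimal sketch of what that could look like, here is a small Common Test suite for the notify handler. The suite name agency_SUITE and the local handle_cast/2 are hypothetical stand-ins: the real clauses would live in the service module and call fbi:hi_mister/1 and friends, whereas this stand-in sends a message to the test process so the file compiles and runs on its own.

```erlang
-module(agency_SUITE).
-export([all/0, notifies_known_agency/1, stays_quiet_for_retired_agency/1]).

%% Common Test runs every test case listed here.
all() ->
    [notifies_known_agency, stays_quiet_for_retired_agency].

%% Hypothetical stand-in for the service's handle_cast/2.
%% Instead of calling a real agency module it messages the caller,
%% which keeps this sketch self-contained.
handle_cast({notify, <<"USA">>, Content}, State) ->
    self() ! {fbi, Content},
    {noreply, State};
handle_cast({notify, <<"USSR">>, _Content}, State) ->
    {noreply, State}.

%% A supported country must result in exactly the expected notification.
notifies_known_agency(Config) when is_list(Config) ->
    {noreply, state} = handle_cast({notify, <<"USA">>, <<"hello">>}, state),
    receive
        {fbi, <<"hello">>} -> ok
    after 0 ->
        ct:fail(agency_not_notified)
    end.

%% A retired agency must not be notified at all.
stays_quiet_for_retired_agency(Config) when is_list(Config) ->
    {noreply, state} = handle_cast({notify, <<"USSR">>, <<"hello">>}, state),
    receive
        {fbi, _} = Unexpected -> ct:fail({unexpected_message, Unexpected})
    after 0 ->
        ok
    end.
```

Saved as agency_SUITE.erl, this should run with ct_run -suite agency_SUITE. The point is not the assertions themselves but that they live next to the code and run in CI on every push, so a breaking change is caught before anyone has to verify it by hand.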

Pile of bricks

Do not rush for short-term gains. A slow code push process is a safety choice. So is service ownership. Adding a backdoor to push “config” feels rewarding in the short term. But it leads to a less reliable solution, ultimately slowing down everyone.

Keep it simple.

And if you need to move faster, ask the service owners for a favour: an extra code push.
