Erlang was created at Ericsson during the late 80s and early 90s to write the logic for telephone switches. It was designed by Joe Armstrong, Robert Virding and Mike Williams, and was released to the public in 1998. It was created to help write fault-tolerant, highly available programs in a distributed, soft real-time setting. This required a language that could handle millions of parallel processes, possibly distributed across many geographical locations, and that supported hot-swapping of code. It should also help programmers write software that keeps running even in the presence of software errors.
To accomplish that, Erlang supports very lightweight processes that are completely isolated from each other and communicate using message passing. Data is immutable and is not shared between processes, which means locks, mutexes and the like are not needed to protect it. Because Erlang is a functional language there are also no classes. Instead Erlang has modules, which are files that contain function definitions and attributes. A module also contains a list of imported and exported functions as well as a host of other, less commonly used features, such as specific compiler flags. Unlike a class, an Erlang module cannot be instantiated. It is simply a collection of functions. However, the state those functions work on can differ between processes, so each process can be seen, from an object-oriented point of view, as an instance of the module.
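To make the idea of isolated processes and message passing concrete, here is a minimal sketch. The module name counter and its functions are made up for illustration and do not come from the thesis. Each call to new/0 spawns a process that keeps its own count as the argument of its loop, so two processes running the same module behave like two independent instances.

```erlang
-module(counter).
-export([new/0, increment/1, value/1]).

%% Spawn a new process whose private state starts at 0.
new() ->
    spawn(fun() -> loop(0) end).

%% Asynchronous: just drop a message in the process mailbox.
increment(Pid) ->
    Pid ! increment,
    ok.

%% Synchronous: ask for the value and wait for the tagged reply.
value(Pid) ->
    Ref = make_ref(),
    Pid ! {value, self(), Ref},
    receive
        {Ref, N} -> N
    after 1000 ->
        exit(timeout)
    end.

%% The state N only exists as the argument of this loop.
%% Nothing is shared, so nothing needs a lock.
loop(N) ->
    receive
        increment ->
            loop(N + 1);
        {value, From, Ref} ->
            From ! {Ref, N},
            loop(N)
    end.
```

If you create two counters A and B, increment A twice and B once, then counter:value(A) returns 2 and counter:value(B) returns 1.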
Erlang also has strong design principles, both for error handling and for how concurrency is handled. These design principles are talked about less often than Erlang’s lightweight, message-passing processes, but they will be the focus of this blog post. The design principles are captured by what Erlang calls behaviors. To get the full benefit of behaviors you need lightweight, isolated processes that communicate using message passing. Even though most languages have some support for lightweight threads, very few have abstractions similar to Erlang’s behaviors. Regardless of what language you use, you can still learn from Erlang’s behaviors and get value out of them. Before we can learn from them, however, we first need to understand them, and we will do that through an example:
Generic server for resource handling
If you want to share a resource (such as a data structure) between processes, the way to do it is a client-server relationship where one process acts as the server and many other processes act as clients. The example below, taken from Joe Armstrong’s PhD thesis, shows how a very simple client-server behavior can be built in Erlang [1].
    -module(server).
    -export([start/3, stop/1, rpc/2]).

    %% Spawn the server loop and register the new process under Name.
    start(Name, F, State) ->
        register(Name,
                 spawn(fun() ->
                               loop(Name, F, State)
                       end)).

    stop(Name) -> Name ! stop.

    %% Send Query to the server and wait for its answer.
    rpc(Name, Query) ->
        Name ! {self(), Query},
        receive
            {Name, crash} -> exit(rpc);
            {Name, ok, Reply} -> Reply
        end.

    loop(Name, F, State) ->
        receive
            stop -> void;
            {From, Query} ->
                case (catch F(Query, State)) of
                    {'EXIT', Why} ->
                        %% The callback crashed: log it, tell the client,
                        %% and carry on with the old state.
                        log_error(Name, Query, Why),
                        From ! {Name, crash},
                        loop(Name, F, State);
                    {Reply, State1} ->
                        From ! {Name, ok, Reply},
                        loop(Name, F, State1)
                end
        end.

    log_error(Name, Query, Why) ->
        io:format("Server ~p query ~p caused exception ~p~n", [Name, Query, Why]).
Where:
- start – starts the server with a specific function F and an initial state.
- stop – stops the server.
- rpc – performs a call to the server and waits for the reply.
- loop – the main loop of the server.
Below is the client code for a “very simple home location register”, used in [1] to show how a generic server can be used, followed by an example shell session.
    -module(vshlr).
    -export([start/0, stop/0, handle_event/2, i_am_at/2, find/1]).
    -import(server, [start/3, stop/1, rpc/2]).
    -import(dict, [new/0, store/3, find/2]).

    start() -> start(vshlr, fun handle_event/2, new()).

    stop() -> stop(vshlr).

    i_am_at(Who, Where) ->
        rpc(vshlr, {i_am_at, Who, Where}).

    find(Who) ->
        rpc(vshlr, {find, Who}).

    %% Business logic: called by the generic server with the query and the state.
    handle_event({i_am_at, Who, Where}, Dict) ->
        {ok, store(Who, Where, Dict)};
    handle_event({find, "robert"}, _Dict) ->
        1/0; %% Deliberate error
    handle_event({find, Who}, Dict) ->
        {find(Who, Dict), Dict}.
1> vshlr:start().
true
2> vshlr:find("joe").
error
3> vshlr:i_am_at("joe", "sics").
ok
4> vshlr:find("joe").
{ok,"sics"}
5> vshlr:find("robert").
Server vshlr query {find,"robert"}
caused exception {badarith,[{vshlr,handle_event,2}]}
** exited: rpc **
6> vshlr:find("joe").
{ok,"sics"}
Observations: There is complete separation of client and server. Note that “client” here means the vshlr callback module that plugs into the generic server. The client is also much simpler than the server, and all of the concurrency is handled in the server, which is fully generic. This works because all interaction between server and client happens via messages, and the server process handles those messages one at a time, so the state is never touched by two processes at once. All of the business logic is contained in the client, which is written without any concurrency primitives (such as direct message passing). The client also contains no error handling code; all of the difficult, concurrency-related fault tolerance is handled by the server. To update the business logic (the client), all you need to know is how to write sequential programs, which is fairly simple.
The faults that are handled are not business-logic conditions such as looking for a person who does not exist. As you can see, the data structure just returns “error”, and it is up to the business logic in the client to handle this in some meaningful way, such as telling the user that no user named “Joe” is registered. The errors that are handled are the ones where things have clearly gone wrong and are unforeseen. For a simple application like the vshlr it is hard to even come up with an error that is not handled by the business logic, but that is exactly why they are unforeseen errors.
Standard Erlang error handling is then to “let it crash” and have a supervisor process that handles the error. For example, detailed information about what happened may be written to a log before the process is restarted. Suppose the goal is to provide location information to users in less than 100 ms, but restarting the process in its last known state and then getting the information from the server takes 1 second. In that case we do not quite manage the hard task of providing the information quickly, but we do manage the simpler task of providing the information at all. This is much better than crashing or failing to provide any information.
Offensive programming (in a way the opposite of defensive programming), with its “let it crash” mentality and supervision trees, deserves its own blog post. Until that happens, Joe Armstrong’s thesis “Making reliable distributed systems in the presence of software errors” covers this in depth, and I especially recommend chapter 5 for those interested [1], or the Erlang design principles [2].
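As a small taste of what “let it crash” looks like in practice, here is a minimal sketch built on OTP's supervisor behavior. The module name restart_demo, the registered name demo_worker and the divide/2 function are all hypothetical and chosen only for illustration. The worker contains no error handling at all: dividing by zero kills it, and the supervisor starts a fresh one.

```erlang
-module(restart_demo).
-behaviour(supervisor).
-export([start_link/0, divide/2]).
-export([init/1, start_worker/0]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Supervisor callback: one permanent worker, restarted on every crash,
%% tolerating at most 3 restarts within 5 seconds.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 5},
    Child = #{id => demo_worker,
              start => {?MODULE, start_worker, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

%% Called by the supervisor, so spawn_link links the worker to it.
start_worker() ->
    Pid = spawn_link(fun worker_loop/0),
    register(demo_worker, Pid),
    {ok, Pid}.

%% No defensive code: A / 0 raises badarith and the process dies.
worker_loop() ->
    receive
        {From, Ref, {divide, A, B}} ->
            From ! {Ref, A / B},
            worker_loop()
    end.

%% Client side: a crashed request shows up as a timeout.
divide(A, B) ->
    Ref = make_ref(),
    demo_worker ! {self(), Ref, {divide, A, B}},
    receive
        {Ref, Result} -> {ok, Result}
    after 500 ->
        {error, timeout}
    end.
```

After restart_demo:start_link(), a call to restart_demo:divide(1, 0) crashes the worker and returns {error, timeout}, while a later restart_demo:divide(8, 2) is answered by the restarted worker. A real system would normally use a gen_server as the worker; a bare process is used here only to keep the sketch in one module.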
Another benefit of a generic server is that it can be re-used for other applications where state is shared between many processes/clients. As the generic server expands and evolves, the non-functional aspects improve without risk of affecting the business logic. Similarly, the business logic can change and evolve without creating bugs in the non-functional parts.
It is also possible to verify (or at least test) the generic server part much more thoroughly when it is re-used everywhere, instead of having to test hundreds of slightly different solutions spread all over the application. It is also possible to create new versions of the generic server with other non-functional properties, such as increased verbosity for debugging.
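The re-use argument can be sketched as follows. The module reuse_demo is a made-up, stripped-down variant of the generic server (no registration and no error handling, and it uses pids and references instead of names), and the two callbacks counter/2 and stack/2 are hypothetical pieces of business logic. The point is that the same generic loop serves both without being changed.

```erlang
-module(reuse_demo).
-export([start/2, rpc/2, counter/2, stack/2]).

%% ---- Generic part: knows nothing about counters or stacks ----

start(F, State) ->
    spawn(fun() -> loop(F, State) end).

rpc(Pid, Query) ->
    Ref = make_ref(),
    Pid ! {self(), Ref, Query},
    receive
        {Ref, Reply} -> Reply
    after 1000 ->
        exit(timeout)
    end.

loop(F, State) ->
    receive
        {From, Ref, Query} ->
            {Reply, State1} = F(Query, State),
            From ! {Ref, Reply},
            loop(F, State1)
    end.

%% ---- Business logic 1: a counter (purely sequential code) ----

counter(increment, N) -> {ok, N + 1};
counter(value, N)     -> {N, N}.

%% ---- Business logic 2: a stack (purely sequential code) ----

stack({push, X}, Items) -> {ok, [X | Items]};
stack(pop, [X | Rest])  -> {{ok, X}, Rest};
stack(pop, [])          -> {empty, []}.
```

Starting one server with fun reuse_demo:counter/2 and another with fun reuse_demo:stack/2 gives two independent services that share every line of the concurrency code.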
Erlang ships with many behaviors, which can be likened to an abstract class or interface that developers implement. Because Erlang has no classes, a behavior is implemented by a callback module: by writing -behavior(<name of behavior>) the module (file) promises to export some specific functions, and if it does not, the compiler will warn about the missing callbacks. Six of these behaviors stand out as more fundamental than the others. They are described both in Joe Armstrong’s PhD thesis and in the Erlang design principles [1,2].
- gen_server – manages shared resources.
- gen_event – an event manager with zero to many event handlers.
- gen_statem – a generic state machine.
- supervisor – supervises a group of workers (other processes) and monitors, stops or starts them as needed.
- application – you fill in how to start and stop your application and it solves packaging problems for you.
- release – a collection of applications.
According to Joe Armstrong’s thesis, these six behaviors are enough to build any concurrent computer program. We will mostly focus on the first four, as they are used much more frequently and because structures to manage packages/applications/releases already exist in most languages, even if they take a different form. The four behaviors gen_server, gen_event, gen_statem and supervisor, however, are very rarely found in other languages and seldom implemented in projects. To get an idea of how frequently each of these is used, Joe writes that the AXD301 (an Ericsson telecom switch) used “122 instances of gen_server, 36 instances of gen_event and 10 instances of gen_fsm [deprecated version of gen_statem]. There were 20 supervisors and 6 applications. All this is packaged into one release” [1].
We will dig a bit deeper into the gen_server behavior, which corresponds very nicely to the generic server example above. This is expected, as gen_server stands for generic server. Below is the same business logic implemented as an Erlang gen_server.
    -module(vshlr_gen).
    -behavior(gen_server).
    -export([start/0, stop/0, i_am_at/2, find/1]). %% client functions
    -export([init/1, handle_call/3, handle_cast/2]). %% callback functions for gen_server
    %% We no longer need to import things from server
    -import(dict, [new/0, store/3, find/2]).

    start() -> gen_server:start_link({local, vshlr_gen}, vshlr_gen, [], []).

    init(_Args) -> {ok, new()}.

    stop() -> gen_server:stop(vshlr_gen).

    i_am_at(Who, Where) ->
        gen_server:call(vshlr_gen, {i_am_at, Who, Where}).

    find(Who) ->
        gen_server:call(vshlr_gen, {find, Who}).

    %% handle_call/3 receives the request, the caller (unused here) and the state.
    handle_call({i_am_at, Who, Where}, _From, Dict) ->
        {reply, ok, store(Who, Where, Dict)};
    handle_call({find, "robert"}, _From, _Dict) ->
        1/0; %% Deliberate error
    handle_call({find, Who}, _From, Dict) ->
        {reply, find(Who, Dict), Dict}.

    %% Handles async messages, but it's not used by our application so the state is unchanged.
    handle_cast(_Msg, State) ->
        {noreply, State}.
You can hopefully see that the old client and this gen_server implementation are almost identical. The initialization is slightly different, and the handle_event function has been renamed handle_call, which in gen_server also receives the caller as an extra argument. All of the benefits remain, but in this case the server ships with Erlang/OTP, is written by the language creators, and is very well tested. For example, we can have hundreds of processes using this vshlr_gen module, and it does not matter whether dict is thread safe or not: the gen_server implementation promises to handle the resource/state in a thread-safe manner. gen_server also gives verbose stack traces and error messages when errors occur, and it integrates nicely with Erlang’s supervisor behavior. This means Erlang programmers can focus on the business logic, write it in a sequential manner, and not worry about thread safety, messages and so on. The gen_server is also continuously updated and improved.
This makes code much easier to read because you do not need to search for the business logic in a forest of non-functional code. It also removes a huge chunk of possible errors, which makes code faster to write and more likely to be correct, as the difficult concurrency problems have already been handled by someone else.
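The claim that many processes can share one gen_server without any locking can be tried out with a small sketch. The module hit_counter and its hammer/2 helper are hypothetical names, not part of OTP or the thesis. The callbacks are plain sequential code, yet the total comes out right no matter how many client processes call in at the same time, because gen_server handles one request at a time.

```erlang
-module(hit_counter).
-behaviour(gen_server).
-export([start_link/0, hit/0, total/0, hammer/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

hit()   -> gen_server:call(?MODULE, hit).
total() -> gen_server:call(?MODULE, total).

%% Spawn NProcs client processes that each call hit/0 NCalls times,
%% wait until all of them have finished, then read the total.
hammer(NProcs, NCalls) ->
    Parent = self(),
    Pids = [spawn(fun() ->
                      lists:foreach(fun(_) -> hit() end,
                                    lists:seq(1, NCalls)),
                      Parent ! {done, self()}
                  end)
            || _ <- lists:seq(1, NProcs)],
    lists:foreach(fun(Pid) ->
                      receive {done, Pid} -> ok end
                  end, Pids),
    total().

%% ---- Callbacks: no locks, no message passing, just a number ----

init([]) ->
    {ok, 0}.

handle_call(hit, _From, N) ->
    {reply, ok, N + 1};
handle_call(total, _From, N) ->
    {reply, N, N}.

handle_cast(_Msg, N) ->
    {noreply, N}.
```

On a freshly started server, hit_counter:hammer(100, 50) returns 5000.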
The gen_event and gen_statem behaviors are similarly generic: they handle the non-functional parts of an event manager and a state machine respectively. To keep this blog post from growing too large I will not give more examples of how they can be used, but the Erlang design principles cover the topic very well [2].
Another benefit of having a few, very general behaviors is that they re-occur over and over in the code base. That means they quickly become familiar to those working with them. State machines, event handlers and resource management look the same everywhere, so once you understand the behaviors, large chunks of the code base become understandable. The behaviors not only provide a solution for fault-tolerant concurrent programs but also a way to unify the design/architecture of the program.
Lessons from Erlang’s behaviors
Outside of the Erlang world it is very common to see business logic intermingled with code that performs the non-functional parts, such as locks for variables, timing-related logic for embedded systems or error handling. This makes the business logic harder to see and understand. The non-functional parts, especially the bits related to concurrency, tend to produce problems that are harder to solve than the ones posed by business logic errors. Having everyone write both business logic and non-functional code is a big waste of time and a source of hard-to-solve bugs. With the Erlang architecture and behaviors, however, the concurrency is solved by fewer, more expert programmers, and the generic code can be re-used in many locations. That means these generic parts can be extensively tested, and when a bug is found and fixed, it is fixed everywhere in the whole system.
I had a colleague tell me some time ago that they had a bug caused by a process taking two locks while, somewhere else, another process took the same two locks but in a different order. In my eyes that error is not caused by lack of documentation or programmer negligence. It is the direct result of poor architecture and design choices, and for people who have never worked in a higher-level language it may be the only thing they know.
That incident also clearly exemplifies how having everyone write concurrency-related parts degrades the quality and fault tolerance of the program. It is not reasonable to expect software to contain zero errors. Instead we should aim to write software that works even in the presence of bugs. Collecting all of the concurrency in a very well tested library written by experts goes a long way in reducing errors, and with supervisors and message passing it then becomes possible to write software that functions even in the presence of bugs.
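To show how the lock-ordering problem simply does not arise in this style, here is a hedged sketch. The bank module, its account names and its error atoms are invented for illustration. Both balances are owned by a single gen_server process, so a transfer that touches two accounts is one message handled from start to finish, and there is no lock order to get wrong.

```erlang
-module(bank).
-behaviour(gen_server).
-export([start_link/1, transfer/3, balance/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Balances is a map such as #{alice => 100, bob => 50}.
start_link(Balances) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Balances, []).

transfer(From, To, Amount) ->
    gen_server:call(?MODULE, {transfer, From, To, Amount}).

balance(Account) ->
    gen_server:call(?MODULE, {balance, Account}).

init(Balances) when is_map(Balances) ->
    {ok, Balances}.

%% Each clause runs to completion before the next request is looked at,
%% so the two balances are always updated together.
handle_call({transfer, Same, Same, _Amount}, _Caller, Balances) ->
    {reply, {error, same_account}, Balances};
handle_call({transfer, _From, _To, Amount}, _Caller, Balances)
  when not is_integer(Amount); Amount =< 0 ->
    {reply, {error, bad_amount}, Balances};
handle_call({transfer, From, To, Amount}, _Caller, Balances) ->
    case {maps:find(From, Balances), maps:find(To, Balances)} of
        {{ok, FromBal}, {ok, ToBal}} when FromBal >= Amount ->
            New = Balances#{From := FromBal - Amount,
                            To := ToBal + Amount},
            {reply, ok, New};
        {{ok, _}, {ok, _}} ->
            {reply, {error, insufficient_funds}, Balances};
        _ ->
            {reply, {error, unknown_account}, Balances}
    end;
handle_call({balance, Account}, _Caller, Balances) ->
    {reply, maps:find(Account, Balances), Balances}.

handle_cast(_Msg, Balances) ->
    {noreply, Balances}.
```

Two processes can call bank:transfer(alice, bob, 30) and bank:transfer(bob, alice, 10) at the same time; the server simply handles one after the other. The trade-off of this design is that all transfers are serialized through one process.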
In the C++ world the main complaint is the loss of control and computational speed when abstracting out large parts of the code. One good example of a general technique from the C++ world that trades some explicit control for safety is resource acquisition is initialization (RAII) [3]. It is a simple pattern where, instead of acquiring a resource, using it and then freeing it manually, you acquire the resource by initializing an object. When that object goes out of scope the resource is freed automatically. This removes a tiny bit of non-functional code that used to be intermingled with the business logic. Now you simply create the object (locally) and let it be deleted when you are done with it. Very seldom, however, do you see a general wrapper that makes any data structure (or resource) thread safe. Instead you have thread-safe arrays, or a thread-safe queue. Then someone uses an improper data structure and you get odd errors that take days to figure out. These data structures are also not part of the standard library, so they have to be written by hand, and one project may contain several thread-safe queue implementations, for example.
One area where I have seen pretty good general implementations of an abstract class that solves much of the concurrency and error handling is state machines. The generic state machine handles all transitions, the handling of events and the state itself. It can also support hierarchical states and automatic transitions with “on-exit”, “on-entry” and “<State-name>” functions and so on. All of these functions are defined by the business logic, that is, by the classes implementing the generic state machine, but the error handling logic and event logic are handled entirely by the abstract class. This makes it very easy to write a thread-safe state machine, and all of the state machines follow the same structure, so they are easy to understand. The actual implementation also contains only business logic. That means any errors I encountered related to state machines were easy to debug and easy to understand. Overall it was quite a pleasurable experience working with the state machines. Sadly, that cannot be said for many other things in that code base.
The trade-offs with general solutions are twofold. First, expertise and time are needed to build the general solution. There is an initial investment that needs to be made to implement good generic solutions (if they are not provided by the language). The second is computational speed, but there are very few applications today where we really need the speed.
If there is one thing I want you to take away from this blog post, it is to start looking at your code and ask yourself: “does this code provide customer value (functional) or does it tell the computer how to do something (non-functional)?”. If you find yourself mostly writing non-functional code, then it might be time to discuss code re-use not only of the business logic but also of the non-functional parts.
If you wish to get more information about behaviors, check out Erlang’s design principles: https://www.erlang.org/doc/design_principles/des_princ.html. If you wish to learn more about how to write fault-tolerant concurrent software, I strongly recommend checking out Erlang’s error handling strategy. One way is to read Joe Armstrong’s thesis, especially chapters 4-6. You can also check out this talk by Joe: https://www.youtube.com/watch?v=cNICGEwmXLU or this one about error handling: https://www.youtube.com/watch?v=TTM_b7EJg5E. You can also check out the Erlang documentation for the supervisor behavior: https://www.erlang.org/doc/design_principles/sup_princ.html or this GitHub page with a few interesting projects and some good blog posts: https://github.com/stevana/armstrong-distributed-systems.
[1] https://erlang.org/download/armstrong_thesis_2003.pdf
[2] https://www.erlang.org/doc/design_principles/des_princ.html