Akka in Action: Let it Crash
This article is an excerpt from Akka in Action
By Raymond Roestenburg, Rob Bakker, and Rob Williams
In an ideal world, a system is always available and every action it undertakes succeeds. Building a fault-tolerant application with plain old objects and exception handling is quite a complex task, but Akka provides tools to make applications more resilient. Actors, for example, provide a way to untangle the functional code from the fault-recovery code, and the actor lifecycle makes it possible to suspend and restart actors (without invoking the wrath of the concurrency gods) in the course of recovering from faults.
In this article, we'll look at how actors simplify this task.
So what should happen when an actor processes a message and encounters an exception? We don't want to just graft recovery code into the operational flow, so catching the exception inside the actor, where the business logic resides, is not an option.
Instead of using one flow to handle both normal code and recovery code, Akka provides two separate flows: one for normal logic and one for fault recovery logic. The normal flow consists of actors that handle normal messages; the recovery flow consists of actors that monitor the actors in the normal flow. Actors that monitor other actors are called supervisors.
Figure 1 shows a supervisor monitoring an actor.
Figure 1 Normal and recovery flow
Instead of catching exceptions in an actor, we'll just let the actor crash. The actor code contains only normal processing logic and no error handling or fault recovery logic, so it's effectively not part of the recovery process, which keeps things much clearer. The mailbox for a crashed actor is suspended until the supervisor in the recovery flow has decided what to do with the exception.
So how does an actor become a supervisor? Akka enforces parental supervision, meaning that any actor that creates actors automatically becomes the supervisor of those actors. A supervisor doesn't "catch exceptions"; rather, it decides what should happen with the crashed actors that it supervises, based on the cause of the crash. The supervisor doesn't try to fix the actor or its state. It simply renders a judgment on how to recover and then triggers the corresponding strategy.
The supervisor has four options when deciding what to do with the actor:
- Restart—The actor must be recreated from its Props. After it is restarted (or rebooted, if you will), the actor will continue to process messages. Since the rest of the application uses an ActorRef to communicate with the actor, the new actor instance will automatically get the next messages.
- Resume—The same actor instance should continue to process messages; the crash is ignored.
- Stop—The actor must be terminated. It will no longer take part in processing messages.
- Escalate—The supervisor doesn't know what to do with it and escalates the problem to its parent, which is also a supervisor.
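The four options above correspond to directives in Akka's classic actor API. The following is a minimal sketch, assuming the classic akka-actor module is on the classpath. The exception classes, the LogProcessor and LogSupervisor actors, and the mapping from each exception to a directive are hypothetical and chosen only for illustration.

```scala
import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.{Decider, Escalate, Restart, Resume, Stop}

// Hypothetical exceptions, defined here only so the sketch is self-contained.
class CorruptedFileException(msg: String) extends Exception(msg)
class DbBrokenConnectionException(msg: String) extends Exception(msg)
class DiskFullException(msg: String) extends Exception(msg)

// Hypothetical child actor: normal processing logic only, no try/catch.
class LogProcessor extends Actor {
  def receive: Receive = {
    case line: String => () // parse the line and pass the result on
  }
}

object LogSupervisor {
  // The decider maps the cause of a crash to one of the four directives.
  // This particular mapping is an assumption made for the example.
  val decider: Decider = {
    case _: CorruptedFileException      => Resume   // ignore the crash, keep the instance
    case _: DbBrokenConnectionException => Restart  // replace with a fresh instance
    case _: DiskFullException           => Stop     // terminate the actor
    case _: Exception                   => Escalate // let the parent decide
  }
}

class LogSupervisor extends Actor {
  // Applies the decision only to the child that crashed.
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy()(LogSupervisor.decider)

  // Creating the child here makes this actor its supervisor.
  val processor: ActorRef = context.actorOf(Props[LogProcessor](), "logProcessor")

  def receive: Receive = {
    case msg => processor.forward(msg)
  }
}
```

Note that neither actor contains recovery code in its receive block; the decision lives entirely in the supervisor's strategy.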
Figure 2 gives an example of the strategy that we could choose when we build the log processing application with actors. The supervisor is shown to take one of the possible actions when a particular crash occurs.
Figure 2 Normal and recovery flow in the log processing application
We'll need to take some special steps to recover the failed message, which we'll discuss in detail when we talk about how to implement a restart. Suffice it to say that in most cases you don't want to reprocess a message, because it probably caused the error in the first place. An example would be the logProcessor encountering a corrupt file: reprocessing corrupt files could end up in what's called a poisoned mailbox, where no other message ever gets processed because the corrupting message fails over and over again. For this reason, Akka does not put the failing message back in the mailbox after a restart, though you can do this yourself if you're absolutely sure that the message didn't cause the error. The good news is that if a job is processing tens of thousands of messages and one is corrupt, the default behavior results in all the other messages being processed normally; the one corrupt file won't cause a catastrophic failure that erases all the work done to that point and prevents the remainder from occurring.
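One way to put a failed message back yourself is the preRestart hook of the classic actor API, which receives the cause of the crash and the message that was being processed. The sketch below is an illustration under assumptions, not the book's implementation: the RetryingDbWriter actor, the DbBrokenConnectionException class, and the isRetryable policy are hypothetical names.

```scala
import akka.actor.Actor

// Hypothetical exception for a lost database connection.
class DbBrokenConnectionException(msg: String) extends Exception(msg)

object RetryingDbWriter {
  // Hypothetical policy: retry only when the failure was caused by the
  // connection, not by the content of the message itself.
  def isRetryable(reason: Throwable): Boolean = reason match {
    case _: DbBrokenConnectionException => true
    case _                              => false
  }
}

class RetryingDbWriter extends Actor {
  import RetryingDbWriter.isRetryable

  // Called on the crashed instance before it is replaced. `message` holds
  // the message that was being processed when the crash happened, if any.
  override def preRestart(reason: Throwable, message: Option[Any]): Unit = {
    if (isRetryable(reason)) {
      // The message goes to the back of the mailbox, so ordering changes,
      // and with `!` this actor itself becomes the sender.
      message.foreach(m => self ! m)
    }
    super.preRestart(reason, message) // keep the default cleanup
  }

  def receive: Receive = {
    case row: String => () // write the row; may throw DbBrokenConnectionException
  }
}
```

If the connection stays broken, the re-sent message will fail again and trigger another restart. A supervisor can bound this, for example with the maxNrOfRetries and withinTimeRange parameters of OneForOneStrategy.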
Figure 3 shows how a crashed dbWriter actor instance is replaced with a fresh instance when the supervisor chooses to restart.
Figure 3 Handling the DbBrokenConnectionException with a restart
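What a fresh instance means in practice can be sketched with the lifecycle hooks of the classic actor API. In this minimal example, the DbConnection class is a hypothetical stand-in for a real database connection, and the DbWriter actor is an assumption about how such a writer might be structured. By default, preRestart calls postStop on the crashed instance and postRestart calls preStart on the new one, so the new instance starts with a new connection.

```scala
import akka.actor.Actor

// Hypothetical stand-in for a real database connection.
final class DbConnection(val url: String) {
  @volatile private var closed = false

  def isClosed: Boolean = closed

  def close(): Unit = {
    closed = true
  }

  def write(row: String): Unit = {
    if (closed) throw new IllegalStateException(s"connection to $url is closed")
    // a real implementation would send the row to the database here
  }
}

class DbWriter(url: String) extends Actor {
  private var connection: DbConnection = _

  // Runs when the actor first starts and, by default, again after a restart.
  override def preStart(): Unit = {
    connection = new DbConnection(url)
  }

  // Runs when the actor stops and, by default, before a restart.
  override def postStop(): Unit = {
    if (connection != null) connection.close()
  }

  def receive: Receive = {
    case row: String => connection.write(row)
  }
}
```

A supervisor could create this actor with something like context.actorOf(Props(classOf[DbWriter], url), "dbWriter"). Other actors keep using the same ActorRef across restarts and never see the instance being swapped.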
Let's recap the benefits of the "let it crash" approach:
- Fault isolation—A supervisor can decide to terminate an actor. The actor is removed from the actor system.
- Structure—The actor system hierarchy of actor references makes it possible to replace actor instances without other actors being affected.
- Redundancy—An actor can be replaced by another. In the example of the broken database connection, the fresh actor instance could connect to a different database. The supervisor could also decide to stop the faulty actor and create another type instead.
- Replacement—An actor can always be recreated from its Props. A supervisor can decide to replace a faulty actor instance with a fresh one, without having to know any of the details for recreating the actor.
- Reboot—This can be done through a restart.
- Component lifecycle—An actor is an active component. It can be started, stopped, and restarted.
- Suspend—When an actor crashes, its mailbox is suspended until the supervisor decides what should happen with the actor.
- Separation of concerns—The normal actor message processing and supervision fault recovery flows are orthogonal, and can be defined and evolve completely independently of each other.
For source code, sample chapters, the Online Author Forum, and other resources, go to http://www.manning.com/roestenburg/