Daniela Binatti, Co-Founder and CTO at Pismo, and Mauricio Galdieri, Software Architect, explored chaos engineering applied to financial services in a Pismo webinar this week.
In the presentation “Build it up, tear it down: how chaos engineering creates platform resiliency,” Daniela and Mauricio discussed the need for chaos engineering in financial services and explained how Pismo uses this technique to improve our platform.
Daniela explained how our cloud-native platform works. She highlighted that although running processes in the cloud makes things easier, many challenges arise when you build a cloud-native solution on top of a distributed infrastructure, and you have to be aware of them before running applications.
“You have to ensure that all these virtual resources we can easily launch will behave in a certain way if something goes wrong. And chaos engineering is one of the techniques we employ to guarantee the resiliency of the Pismo platform.”
Breaking things on purpose
Mauricio explained that chaos engineering consists of deliberately creating the problems that applications could face in real-life operations. This “breaking things on purpose” technique helps our engineering teams prepare for unexpected failures.
“This definition makes a lot of sense for what we do at Pismo. Chaos engineering is a series of experiments you run to build confidence in your system. We believe that things are experiments. It’s not about solving problems in production. It’s all about gaining and passing this confidence to our clients,” he says.
The experiments create failures in infrastructure and system components at different levels so that we can see what happens in each scenario. “The goal of running these experiments is to ensure that the platform continues to work normally under those different failures you are testing.”
Automating experiments
Managing chaos experiments becomes easier when you can automate them. Mauricio says that injecting failures with dedicated tools makes chaos engineering procedures more structured and easier to control.
“At Pismo, we needed a more structured way to perform those experiments. We use the Gremlin tool, which is very powerful and enables us to experiment with several scenarios and cancel or abort an experiment that may be disrupting the system too much. It’s a very balanced tool in terms of flexibility,” Mauricio explains.
Testing vs. experimenting
Mauricio highlights that chaos experiments are not about testing. He says that we expect a specific result when we perform a test. On the other hand, when we experiment, we don’t know what will happen. We may have a hypothesis, not an assurance of the results.
“When you are testing something, you can be successful or not, depending on the result. When you are experimenting with something, if the hypothesis is confirmed, you gain confidence in the system. Otherwise, you learn about it and can use the acquired knowledge to improve it.”
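The difference between a test and an experiment, together with the ability to abort described earlier, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Pismo or Gremlin implement it: the names `run_experiment`, `Outcome`, `hypothesis_max` and `abort_above` are hypothetical, and the error-rate samples are passed in as plain numbers rather than read from a real monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Outcome:
    status: str              # "confirmed", "refuted", or "aborted"
    worst_error_rate: float  # highest error rate seen during the run


def run_experiment(
    inject: Callable[[], None],     # hypothetical hook that starts the failure
    rollback: Callable[[], None],   # hypothetical hook that removes the failure
    error_rates: Iterable[float],   # samples observed while the failure is active
    hypothesis_max: float,          # hypothesis: error rate stays at or below this
    abort_above: float,             # safety limit: stop immediately above this
) -> Outcome:
    worst = 0.0
    inject()
    try:
        for rate in error_rates:
            worst = max(worst, rate)
            if rate > abort_above:
                # The experiment is disrupting the system too much: stop early.
                return Outcome("aborted", worst)
    finally:
        # Always remove the injected failure, even on abort or error.
        rollback()
    # Unlike a pass/fail test, both results are useful: a confirmed hypothesis
    # builds confidence, a refuted one shows where to improve.
    status = "confirmed" if worst <= hypothesis_max else "refuted"
    return Outcome(status, worst)


events = []
outcome = run_experiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    error_rates=[0.001, 0.004, 0.002],  # simulated samples, not real data
    hypothesis_max=0.01,
    abort_above=0.05,
)
print(outcome.status)
```

With the simulated samples above, the error rate never exceeds the hypothesis, so the outcome is reported as confirmed. A sample above `abort_above` would end the run early, and the rollback hook would still be called.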
The power of the error budget
At Pismo, our service level objective (SLO) is 99.99% uptime, which means our systems can be unavailable for a short period without breaking any contracts. This allowed downtime is known as our “error budget.” Mauricio says it is vital to use it to run chaos experiments.
“You have to run these experiments to gain confidence in your systems. For example, you can let your system down for five minutes a month. It might seem odd initially, but it is the smartest thing to bring resiliency to your platform.”
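The arithmetic behind that figure is straightforward. As a minimal sketch, assuming the budget is measured over a 30-day month (the window length and the function name `error_budget_minutes` are assumptions for illustration), a 99.99% SLO leaves a little under five minutes of allowed downtime:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the minutes of downtime an availability SLO allows in a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1], e.g. 0.9999")
    total_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return total_minutes * (1.0 - slo)


budget = error_budget_minutes(0.9999)
print(f"{budget:.2f} minutes per 30-day month")  # prints "4.32 minutes per 30-day month"
```

Any downtime caused by a chaos experiment is deducted from this budget, which is why experiments are kept short and can be aborted.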
In conclusion, our software architect stressed that, for a financial company, anticipating failures before they occur is the key to success. “With these practices, we ensure that when you pay a bill, make a transfer, or invest your money, everything will happen as expected and instantly. And you will have no unpleasant surprises.”
Watch the webinar in the video below: