Pismo employs chaos engineering to ensure our banking and payments platform is reliable and resilient. Daniela Binatti, Co-Founder and CTO at Pismo, and Mauricio Galdieri, Software Architect, spoke about this practice today at the Febraban Tech conference in São Paulo, Brazil.
Daniela and Mauricio discussed the need for chaos engineering in financial services and explained how Pismo uses it to improve our platform. “We have hundreds of APIs built on Kubernetes pods, running on public cloud infrastructure. This complex architecture has powerful features and brings many benefits – but also complexity,” says Daniela.
“Our applications use distributed databases, intricate communication processes, and temporary resources. We create and kill Kubernetes pods every day. So, it’s challenging to foresee how the system would react to unpredicted failures. With chaos engineering, we voluntarily create problems that applications could face in real-life operations. Thus, we can learn about the platform and make it more resilient.”
Breaking things on purpose
“We break our systems proactively, so they don’t break out of our control,” adds Mauricio Galdieri. “What happens when we have abnormal latency on a communication link? If an application becomes unavailable, how will other applications react? These are questions we can answer by doing chaos experiments,” he says.
Mauricio highlights that chaos experiments are not simple tests: “When we perform a test, we expect a certain result. The system may either fail or pass. When we do a chaos experiment, we may have a hypothesis. However, we don’t know for sure what will happen. If the hypothesis is confirmed, we gain confidence in the system. Otherwise, we learn about it and can use the acquired knowledge to improve it.”
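To make the difference between a test and an experiment concrete, here is a minimal sketch in Python. It is a pure simulation under stated assumptions, not Pismo's tooling and not the Gremlin API: the function names, the latency samples, the timeout, and the success-rate threshold are all hypothetical. The hypothesis is that calls keep succeeding at the required rate when extra latency is injected into a dependency.

```python
def call_succeeds(dependency_latency_ms: float, timeout_ms: float) -> bool:
    """Simulated client call: it succeeds only if the dependency answers in time."""
    return dependency_latency_ms <= timeout_ms


def run_latency_experiment(baseline_latencies_ms, injected_delay_ms,
                           timeout_ms, min_success_rate):
    """Inject extra latency into every call and evaluate the hypothesis.

    Hypothesis (an assumption for this sketch): the success rate stays at or
    above min_success_rate even with the injected delay.
    """
    degraded = [latency + injected_delay_ms for latency in baseline_latencies_ms]
    successes = sum(call_succeeds(latency, timeout_ms) for latency in degraded)
    success_rate = successes / len(degraded)
    return {
        "success_rate": success_rate,
        "hypothesis_confirmed": success_rate >= min_success_rate,
    }


# Hypothetical latency samples (milliseconds) observed in normal operation.
baseline = [20, 35, 50, 80, 120]

mild = run_latency_experiment(baseline, injected_delay_ms=50,
                              timeout_ms=200, min_success_rate=0.99)
severe = run_latency_experiment(baseline, injected_delay_ms=100,
                                timeout_ms=200, min_success_rate=0.99)

print("mild:", mild)      # hypothesis confirmed: confidence gained
print("severe:", severe)  # hypothesis refuted: something to learn and fix
```

With the mild delay every call still fits within the timeout, so the hypothesis holds. With the severe delay the slowest call times out, which is the kind of finding that would feed back into timeouts, retries, or fallbacks.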
Mauricio says that a common misconception is thinking chaos engineering is something only nerds should care about. “There are unexpected situations not directly related to software that could affect computer systems. Google, for example, runs experiments simulating the effects of a hypothetical earthquake in California. Other companies simulate situations in which someone inside the data centre feels sick or suffers an accident, for instance.”
The error budget
Our software architect has two main recommendations for companies adopting chaos engineering. The first one is to automate the experiments as much as possible. “Chaos engineering itself comprises processes that may fail. If we automate them, we can apply chaos engineering procedures to improve them.” We use the Gremlin tool to manage the experiments.
The second recommendation is to use part of the company’s error budget to do the experiments. Pismo’s contracts, for example, usually specify a service level agreement (SLA) of 99.99% availability. This means the systems can become unavailable for a few seconds per day (about 8.6 seconds, or 0.01% of 86,400 seconds) without breaking contracts. These few seconds are the error budget. “We can use part of this time to perform experiments. It’s very worthwhile. We may make the system fail for a while, but this will ensure that it won’t fail more extensively.”
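The arithmetic behind the error budget can be sketched in a few lines of Python. This is an illustrative calculation only: the function name and the 30-day period are assumptions, and real contracts may measure availability over a different window.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def error_budget_seconds(availability_pct: float, period_seconds: float) -> float:
    """Return the downtime allowed in a period at a given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_seconds * (100 - availability_pct) / 100


daily = error_budget_seconds(99.99, SECONDS_PER_DAY)
monthly = error_budget_seconds(99.99, 30 * SECONDS_PER_DAY)

print(f"Daily budget:   {daily:.2f} seconds")
print(f"30-day budget:  {monthly / 60:.2f} minutes")
```

At 99.99%, the budget works out to roughly 8.64 seconds per day, or about 4.32 minutes over 30 days. Any time spent on chaos experiments comes out of the same allowance as unplanned incidents.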
Pismo is participating in Febraban Tech as a sponsor and exhibitor. Visit us today at booth 108 on the 3rd floor and learn more about our comprehensive platform for banking and payments.