12 August

Enhancing Observability with Tail Sampling: insights from Pismo’s experience

Tail sampling has helped Pismo manage and reduce its observability costs for trace data by almost 80% while maintaining the critical visibility required for its operations.

Fernanda Testa

At a recent panel at Hacktown, one of Latin America’s leading festivals of innovation, business, and technology, held in Brazil, Alexander Magno, SRE Manager at Pismo, shared his insights on using the tail sampling processor for the OpenTelemetry Collector.

This approach has helped Pismo manage and reduce its observability costs for trace data by almost 80% while maintaining the critical visibility required for its operations.

The challenge

While experiencing exponential growth, Pismo faced a significant challenge: the cost of observability was rising in tandem with its expansion. As services became increasingly critical, particularly in handling large volumes of transactions globally, it was imperative to control these costs without compromising on the quality of observability. The solution? Implementing a sampling strategy to manage and reduce the volume of telemetry data.

Understanding sampling and its benefits

Sampling is a technique used to reduce the amount of telemetry data generated or stored, thereby managing costs effectively. While there are various sampling strategies applicable to traces, the focus of Pismo’s approach has been on tail sampling for tracing data.

Tail sampling, in particular, involves deciding which traces to retain based on an analysis of the trace data a few seconds after the transaction has started, once the spans of a trace have been collected. This method allows more informed and intelligent sampling decisions, ensuring that the most relevant and useful traces are retained while the less critical ones are discarded.
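To make this concrete, the Collector’s tail sampling processor expresses that waiting period through its decision_wait setting. A minimal sketch with illustrative values, not Pismo’s configuration:

```yaml
processors:
  tail_sampling:
    # How long to buffer spans after the first span of a trace arrives
    # before evaluating the sampling policies against the whole trace.
    decision_wait: 10s
    # Upper bound on the number of traces kept in memory while their
    # sampling decision is pending.
    num_traces: 50000
    policies:
      # Placeholder policy that keeps every trace; realistic policies
      # (errors, latency, probabilistic) are shown later in this article.
      - name: keep-all
        type: always_sample
```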

Why tail sampling?

In traditional head sampling, traces are sampled at the beginning of their lifecycle, often leading to the loss of valuable data as the decision to sample or discard is made without a complete view of the trace.

Tail sampling, on the other hand, allows for a comprehensive view of the trace data, enabling the system to retain traces that show errors, high latency, or any other significant anomalies.

Pismo chose tail sampling for its ability to provide a comprehensive view of transactions while minimizing data storage costs. By evaluating traces a few seconds after a transaction has started, the team can identify and retain only the significant ones, such as those with errors or high latency. This method ensures that critical information is not lost and that the data collected is highly relevant for analysis and troubleshooting. For successful transactions, a small percentage is kept so that the normal behavior of the applications remains observable.
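Here is a sketch of what such a policy set can look like in the tail_sampling processor; the threshold and percentage below are illustrative, not Pismo’s production values:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always retain traces that contain an error.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Retain traces whose overall duration exceeds the threshold.
      - name: keep-high-latency
        type: latency
        latency:
          threshold_ms: 2000
      # Retain a small percentage of everything else, so successful
      # transactions stay observable.
      - name: sample-successes
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Policies are evaluated independently, and a trace is retained if any one of them votes to sample it, so error and high-latency traces are always kept while healthy traffic is thinned down to the configured percentage.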

By implementing tail sampling, we have been able to:

1. Improve cost efficiency
By reducing the volume of telemetry data that needs to be stored and processed, Pismo has significantly lowered observability costs.

2. Enhance data quality
Retaining only the most relevant traces ensures that the data used for monitoring and debugging is of higher quality and more actionable.

3. Maintain critical visibility
Even with reduced data volumes, Pismo has been able to maintain the necessary visibility into its system’s performance and health, ensuring that critical issues are still detectable and addressable.

Practical implementation at Pismo

Implementing tail sampling required a strategic approach and collaboration across teams. Pismo’s engineering team worked closely with the Observability team to integrate the tail sampling processor into their existing OpenTelemetry Collector infrastructure.

This component plays a crucial role in collecting, processing, and exporting telemetry data, and its flexible architecture allows the addition of various processors, including the tail sampling processor.
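In a Collector configuration, the tail sampling processor slots into the traces pipeline between receivers and exporters. A minimal sketch, with an illustrative receiver, backend endpoint, and policy:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]

exporters:
  otlp:
    # Hypothetical backend address.
    endpoint: observability-backend.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```

One operational note: tail sampling requires every span of a trace to reach the same Collector instance, so horizontally scaled deployments typically route spans by trace ID (for example, with the load-balancing exporter) before they reach the sampling tier.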

Magno highlighted the following steps in the implementation process:

1. Evaluation and planning: the initial phase involved a thorough evaluation of Pismo’s existing observability setup and identifying the areas where sampling could be most effectively applied. This included analyzing the types of telemetry data generated and determining the critical points for trace retention.

2. Configuration of the OpenTelemetry Collector: the team configured the OpenTelemetry Collector with the tail sampling processor. This required setting up policies for trace retention based on specific criteria such as error rates and latency thresholds, plus a probabilistic policy to store a percentage of successful transactions (along the lines of the policy sketch shown earlier).

3. Testing and iteration: before full-scale deployment, the implementation was tested in a controlled environment to fine-tune the sampling policies and ensure that the critical traces were retained without losing essential visibility.

4. Monitoring and optimization: post-deployment, continuous monitoring was essential to ensure the sampling strategy was functioning as intended. This involved analyzing the sampled traces and making adjustments to the policies as needed to optimize performance and cost savings.
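For the monitoring step, the Collector’s own internal telemetry is a natural feedback loop: it can expose metrics, including per-policy decision counters from the tail sampling processor, that show whether each policy is keeping roughly the expected share of traces. A minimal sketch; the port is illustrative, and the exact configuration shape and metric names vary across Collector versions:

```yaml
service:
  telemetry:
    metrics:
      # Expose the Collector's internal metrics. Among them, the tail
      # sampling processor reports how many traces each policy sampled,
      # which is the signal used to tune thresholds and percentages.
      level: detailed
      address: 0.0.0.0:8888
```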

The results and future outlook

The adoption of tail sampling has enabled Pismo to achieve a balanced approach to observability—reducing costs while maintaining high standards of visibility and performance monitoring. This strategic implementation not only supports our current operations but also positions us to scale efficiently in the future.

Moving forward, Pismo plans to continue refining sampling strategies and exploring additional optimizations within the OpenTelemetry framework. The insights shared by Alexander Magno underscore the importance of innovative approaches in managing observability, especially in the face of rapid growth and expanding service demands.

Pismo’s experience serves as a valuable case study for other organizations seeking to enhance their observability practices without incurring prohibitive costs. By leveraging advanced sampling techniques like tail sampling, companies can achieve a sustainable balance between comprehensive monitoring and cost efficiency.
