Inside Observability-Driven Development: An Exclusive Interview

While observability helps Ops teams understand what’s happening across cloud-native, distributed systems, developers are also beginning to reap the benefits.

Observability-driven development emerges as a way to shift observability practices from post-deployment to the earlier stages of testing and development. With this approach, developers use telemetry data to build more reliable software and overcome the challenge of coding an application without having full visibility into its behavior.

But how exactly does observability empower developers?

To answer this question and many more, we’re sitting down with Saeed Zarinfam. Saeed is a software engineering consultant and technical writer with over 15 years of experience. He specializes in the design and development of scalable, mission-critical systems using Spring Boot, Spring Cloud, WebFlux, K8s, AWS, Docker, Kafka, and more.

He has worked on several enterprise, cloud-native projects for companies like Warner Bros., PayPal, and Tink (Visa) and shares his technical insights on his Medium blog.

Let’s welcome Saeed and dive right in!


Q1: Initially the territory of Ops and SRE teams, observability is now becoming a focus for developers as well. Why do you think that is?

Yes, that’s right. Several reasons come to mind, the top 3 being software architecture, tools, and standards.

The increasing popularity of microservices architecture and cloud-native applications has made the need for observability more and more visible to developers. The three pillars of observability (logs, metrics, traces) are essential concerns if you want to have a maintainable and assessable system.

These new approaches in software architecture also led to several significant changes in the software development process and introduced new concepts like GitOps or infrastructure-as-code (IaC).

Probably one of the most important concepts is "You build it, you run it." In this new approach, developers engage more in the observability territory.

On the other hand, tools like Docker also significantly influenced this shift. Developers can quickly run the required infrastructure in containers to have observability locally, so observability tools are now more accessible to development teams.

The final aspect, I would say, is standards. Standards like OpenTelemetry or facade libraries like Micrometer play a crucial role in enabling developers to gain insights into the internal workings of their systems through metrics, logging, and distributed tracing.


Q2: What are some concrete scenarios where observability can help in the development process?

When it comes to real-life scenarios, observability can help solve quite a few functional and non-functional problems.

For example, bottlenecks, slow APIs, slow database queries, and similar issues are usually discovered only after deploying the application to a certain environment. Now, through observability-driven development, you can reveal code performance issues a lot earlier.
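A minimal sketch of how a slow call could be surfaced during local development, assuming plain Python and no observability backend. The `timed` decorator, the `SLOW_CALLS` list, the 50 ms threshold, and both example functions are hypothetical names chosen for illustration; they are not from the interview.

```python
import functools
import time

# Collected (function name, elapsed seconds) pairs for calls over the threshold.
SLOW_CALLS = []


def timed(threshold_s):
    """Record calls to the wrapped function that take longer than threshold_s."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                if elapsed > threshold_s:
                    SLOW_CALLS.append((func.__name__, elapsed))
        return wrapper
    return decorator


@timed(threshold_s=0.05)
def load_orders():
    time.sleep(0.1)  # stands in for a slow database query (assumption)
    return ["order-1"]


@timed(threshold_s=0.05)
def load_profile():
    return {"name": "demo"}
```

Calling both functions in a local test run and then inspecting `SLOW_CALLS` would flag `load_orders` before the code reaches any shared environment.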

Another advantage is that, by running the observability stack locally, developers can reproduce bugs and problems in the application or in the stack itself.


Q3: How do you see observability changing development practices and the feedback developers get on their code?

I think most of these changes have already begun, and several tools already exist to help developers use observability during development.

For example, IDE plugins like Digma can significantly shorten the feedback loop by using runtime data and AI to provide insights into the code you are writing.

Other examples are services that integrate with CI tools and provide feedback at the commit stage very quickly to developers.

Tools like these will make the development process more effective and smarter.


Q4: Based on your experience as a consultant, what is the biggest challenge companies face when implementing observability?

In the not-so-distant past, we had a lot of problems both in terms of enabling observability and in terms of tool compatibility. But now, standards like OpenTelemetry and also facade libraries like Micrometer have solved these problems to a great extent. Thanks to these standards and libraries, observability stacks have become more compatible with each other. As a result, making applications observable has become easier.

Another challenge is to make the application observable in a meaningful way.

It’s not enough to simply be able to use the observability tools in your application. Providing logs is a good first step, but these days it’s not enough.

Providing meaningful metrics and traces, in addition to logs, is essential to enable observability. However, if you have all of these without a dashboard and tooling on top of them (like an alert or query system), it’s a wasted opportunity.

The other important challenge companies face is the cost. Whether you have an in-house system or use a managed service for observability, controlling the cost of having a good quality observability service can be tricky.


Q5: Instrumenting code to capture relevant data is a key step toward an observable system. What are some best practices you follow?

Observability isn’t just about metrics, logs, or traces. It’s a crucial aspect of understanding and managing complex distributed systems.

And how do you do that?

By understanding system behavior and using tools that can show you what is happening inside your application.

To do that, you need to instrument your code. In my opinion, the most important aspect of instrumenting code is identifying what metrics you need to track and where they are in your system. Using tools like OpenTelemetry speeds up instrumentation by providing guidelines on where to start, increasing detail levels, ensuring tracing coverage, and reporting errors.
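To make "identifying what metrics you need to track and where they are" concrete, here is a small sketch that uses an in-memory stand-in rather than a real OpenTelemetry or Micrometer API. The `Metrics` class, the metric names, and the `checkout` function are all hypothetical; a real project would register equivalent counters and timers through its chosen library.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """In-memory stand-in for a metrics library (illustration only)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def increment(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        # Record the duration of the wrapped block, even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()


def checkout(cart):
    """Hypothetical business operation instrumented at its entry point."""
    with metrics.timer("checkout.duration"):
        metrics.increment("checkout.requests")
        if not cart:
            metrics.increment("checkout.errors")
            raise ValueError("cart is empty")
        return sum(cart.values())
```

The three metrics here (request count, error count, duration) sit on the business operation itself, which is usually where a question like "is checkout healthy?" gets asked.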


Q6: Overcomplicated distributed systems have been nicknamed “death by a thousand microservices.” How can distributed tracing make debugging such systems easier?

A good tracing system that can provide meaningful information about service interactions is vital for microservices architecture, regardless of whether you have 10 services or 1000.

In the past, having meaningful logs and metrics was enough for debugging systems, but these days, having a tracing system in place for microservices is crucial. By providing search capability on traces, correlation ID (or request ID), trace data (timing, spans), and visual request flows, we can overcome the complexity of microservices architecture and find the root cause of issues.
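As a rough illustration of the correlation ID idea, the sketch below attaches one request ID to every log line written while a request is being handled, using only the Python standard library. The logger name, the log format, and the `handle_request` function are assumptions made for the example; a tracing library would normally propagate this ID across services for you.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled ("-" when there is none).
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_id=None):
    # Reuse the caller's ID when one was passed along, otherwise create one.
    token = request_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("validating order")
        logger.info("charging payment")
    finally:
        request_id.reset(token)
```

Because every line carries the same ID, searching the logs for that ID returns the full story of a single request.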

A tracing system can also be used to optimize microservices in addition to debugging. By collecting timing data at each stage of a request’s journey, distributed tracing enables us to identify services or components that cause performance issues.
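The sketch below shows one way timing data from a trace could point to the slow part of a request. It assumes a simplified model in which spans are non-overlapping stages of a single request; the `Span` fields and both helper functions are hypothetical and do not mirror any particular tracing backend.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    operation: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def slowest_spans(spans, top=1):
    """Return the `top` spans with the longest duration, slowest first."""
    return sorted(spans, key=lambda s: s.duration_ms, reverse=True)[:top]


def time_by_service(spans):
    """Sum span durations per service (valid only for non-overlapping spans)."""
    totals = {}
    for span in spans:
        totals[span.service] = totals.get(span.service, 0.0) + span.duration_ms
    return totals
```

With nested spans, a parent's duration includes its children, so a real analysis would compare self time rather than raw duration.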


Q7: As our applications grow more complex, the temptation to add more things to track also grows. But when you generate 20TB of traces per day, costs will add up. Is there a danger of excessive instrumentation and how do you avoid it?

Yes, this is the typical problem with any observability data, whether it’s logs, metrics, or traces, and it can become a real challenge when you have a huge amount of it. From my experience, the first step you want to take is to set a retention policy that specifies how long trace data should be stored before it is archived.
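A retention policy can be as simple as a cutoff date. The following sketch splits traces into those still inside the retention window and those ready to be archived; the seven-day default, the `recorded_at` field, and the function name are assumptions for illustration, since managed services usually expose retention as a configuration setting instead.

```python
from datetime import datetime, timedelta, timezone


def apply_retention(traces, now: datetime = None, keep_days: int = 7):
    """Split traces into (kept, expired) based on their recorded_at timestamp."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=keep_days)
    kept, expired = [], []
    for trace in traces:
        if trace["recorded_at"] >= cutoff:
            kept.append(trace)
        else:
            expired.append(trace)
    return kept, expired
```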

Other steps you can take to reduce the amount of data include sampling a percentage of requests, filtering out unnecessary or redundant data, and aggregating it.
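Sampling and filtering can be sketched in a few lines. The example below keeps a fixed share of traces by hashing the trace ID, so the same trace always gets the same decision, and drops health-check traffic entirely. Keeping every trace marked as an error is an assumption added for the example, as are the field names and the `/health` path.

```python
import hashlib


def keep_trace(trace_id, sample_rate):
    """Deterministically keep roughly `sample_rate` (0.0 to 1.0) of trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def reduce_traces(traces, sample_rate, ignored_paths=("/health",)):
    """Filter out noisy paths, then sample the rest; always keep error traces."""
    kept = []
    for trace in traces:
        if trace["path"] in ignored_paths:
            continue
        if trace.get("error") or keep_trace(trace["trace_id"], sample_rate):
            kept.append(trace)
    return kept
```

Hashing the trace ID instead of drawing a random number means every service that sees the same ID reaches the same keep-or-drop decision, so sampled traces stay complete.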


Q8: What tools or resources would you recommend to help developers get familiar with observability?

The first place to start, in my opinion, is getting familiar with OpenTelemetry. OpenTelemetry aims to standardize the way telemetry data is collected and processed across different programming languages, frameworks, and platforms. It’s open source and it has already been adopted by many companies, so all the more reason to add it to your stack.

It’s a great resource for developers to learn more about key observability concepts by digging into the main OpenTelemetry components like the OpenTelemetry Collector, the OpenTelemetry Protocol (OTLP), and the language APIs and SDKs. By doing this, you not only learn about one of the most powerful observability frameworks, but you also understand the flow of data as a developer, including how to collect and process it.

As I mentioned before, I also recently tried using the Digma IDE plugin and found it very helpful. In addition to offering a rich set of features like insight, tracing, and more, it can help developers learn more about observability locally without the need to install any additional software.

Last but not least, using the Grafana Cloud Free Forever account is a great way to try out the power of the Grafana stack and familiarize yourself with its capabilities and concepts.


Q9: We usually think about observability in terms of technical infrastructure and not the people behind it. You also have experience in building teams. Do you think observability can contribute to higher-performing engineering teams?

Without any doubt, observability can help engineering teams be more productive and collaborate better. The feedback and insights provided by observability help teams work faster and smarter.

For example, logs provide detailed insights about the behavior of a system that can help developers identify, debug, and troubleshoot issues.

On the other hand, metrics offer real-time visibility into system performance and behavior and enable teams to quickly identify and address issues before they affect users or the business.

Finally, traces provide end-to-end visibility into the flow of requests through a distributed system, allowing developers to understand how different services and components interact.

At the company level, organizations can take advantage of observability automation to lighten the burden on their technology teams and cut down on manual work. Teams can automate tasks like alerting and incident reports, while the generated data can trigger automated actions based on specific conditions.
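As a rough sketch of "automated actions based on specific conditions", the code below evaluates threshold rules against a snapshot of metrics and runs an action for each rule that fires. The `AlertRule` fields, metric names, and thresholds are hypothetical; alerting tools typically express the same idea declaratively.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertRule:
    name: str
    metric: str
    threshold: float
    action: Callable[[str, float], None]  # called with (rule name, observed value)


def evaluate(rules, snapshot):
    """Run the action of every rule whose metric exceeds its threshold."""
    fired = []
    for rule in rules:
        value = snapshot.get(rule.metric)
        if value is not None and value > rule.threshold:
            rule.action(rule.name, value)
            fired.append(rule.name)
    return fired
```

The action could post to a chat channel or open an incident report; here it is left as a plain callable so the rule logic stays independent of any specific tool.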


Q10: How do you keep in touch with the latest industry developments? Is there anything that caught your eye recently?

As a software engineer and consultant, I would say solving real-life problems has always been the best way to keep your finger on the pulse of the industry. It not only motivates you to find the best solutions but also teaches you a lot along the way. On the other hand, as a technical blogger, I follow many technical blogs and read technical books.

Attending conferences and meetups is another way to keep up to date with the latest trends and innovations in the software development industry. It’s something that I enjoy doing and a great way to discover interesting projects.

For example, besides the ones I already mentioned, a couple of other projects that stood out for me include Dapr, Spring AI, and Modulith.

If you liked this interview, you can follow Saeed on LinkedIn or his Medium blog.

Make sure to also follow ETEAM for more insights on software development and the latest tools and technologies.
