X-Ray Vision Without the Cape
When we develop applications, a certain opacity often results from low observability or, put more simply, from de-prioritizing non-functional requirements. Software teams commonly focus on getting a feature out the door rather than on building something sustainable. What if we could take the first step on that journey with minimal effort? That is the draw of AWS X-Ray: a tool with a low barrier to entry for making application performance actionable.
Moving to a highly agile delivery cycle with an increased release cadence requires observability into application performance. Research by the DevOps Research and Assessment (DORA) team supports this, finding that comprehensive monitoring and observability are a staple of high-performing teams. Monitoring key Service Level Indicators (SLIs) and setting Service Level Objectives (SLOs) allows us to determine how changes impact our applications. This becomes increasingly important as we move from large monolithic systems to distributed cloud-native architectures, where the number of moving parts to monitor is far greater.
How AWS X-Ray Works
Amazon CloudWatch provides observability into compute metrics out of the box; however, it lacks insight into an application’s performance. Using the Custom Metrics feature to record a Service Level Indicator such as ‘Response Time’ for a specific endpoint provides actionable visibility. Unfortunately, this metric does nothing to explain why the response time is what it is, and that is exactly the question AWS X-Ray is designed to answer.
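As a rough sketch of recording such a response-time SLI, the snippet below times a block of code and builds a datum in the general shape that CloudWatch custom metrics accept. The metric name, the `Endpoint` dimension, and the `handle_request` function are hypothetical, and actually publishing the datum (for example with boto3's `put_metric_data`) is left out so the example stays self-contained.

```python
import time
from contextlib import contextmanager
from datetime import datetime, timezone


@contextmanager
def response_time_metric(endpoint, sink):
    """Time the wrapped block and append a CloudWatch-style datum to `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        sink.append({
            "MetricName": "ResponseTime",  # hypothetical metric name
            "Dimensions": [{"Name": "Endpoint", "Value": endpoint}],
            "Timestamp": datetime.now(timezone.utc),
            "Value": elapsed_ms,
            "Unit": "Milliseconds",
        })


def handle_request():
    """Stand-in for a real endpoint handler (hypothetical)."""
    time.sleep(0.01)
    return {"status": "ok"}


metrics = []
with response_time_metric("/example", metrics):
    handle_request()
```

Each entry in `metrics` tells us how long the endpoint took, but nothing about where that time was spent, which is the gap tracing is meant to fill.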
Essentially, there are two components of X-Ray that are integrated into the application’s stack:
1. A library, the X-Ray SDK, that is added to the application code and produces segments describing the work done for each request.
2. An infrastructure component, the X-Ray daemon, that collects the segments produced by the library and forwards them to AWS. The daemon is easy to deploy via an OS package or a Docker container. As segments are pushed to AWS, the hosted X-Ray console is used to view the assembled traces.
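To make the hand-off between the library and the daemon more concrete, here is a minimal sketch of what a segment might look like on the wire. It assumes the documented defaults as we understand them: the daemon listens for UDP traffic on 127.0.0.1:2000, and each datagram is a small JSON header line followed by the segment document. The helper names are our own, and in practice the SDK does all of this for us.

```python
import json
import os
import socket
import time

DAEMON_ADDRESS = ("127.0.0.1", 2000)  # assumed default daemon address (UDP)
DAEMON_HEADER = '{"format": "json", "version": 1}\n'


def new_trace_id(now=None):
    """Build a trace ID: version 1, epoch seconds (8 hex), 96 random bits (24 hex)."""
    epoch = int(now if now is not None else time.time())
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def new_segment(name, start, end, trace_id=None):
    """Build a minimal segment document with only the core fields."""
    return {
        "name": name,
        "id": os.urandom(8).hex(),  # 64-bit segment ID as 16 hex digits
        "trace_id": trace_id or new_trace_id(start),
        "start_time": start,  # epoch seconds, fractional
        "end_time": end,
    }


def daemon_payload(segment):
    """Serialize a segment as header line plus JSON body."""
    return (DAEMON_HEADER + json.dumps(segment)).encode("utf-8")


def send_to_daemon(segment, address=DAEMON_ADDRESS):
    """Fire-and-forget UDP send; nothing is delivered unless a daemon is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(daemon_payload(segment), address)


start = time.time()
segment = new_segment("example-service", start, start + 0.121)
payload = daemon_payload(segment)
```

Segments from different services that share the same `trace_id` are what the console later assembles into a single trace.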
How to Use AWS X-Ray
As a simple demonstration, this example uses a ‘Backend for Frontend’ (BFF) pattern for an imaginary enterprise social app. The BFF pattern highlights distributed tracing well and makes for a good demo.
We will look at a simple ‘home’ endpoint for the app which returns a list of team members, some TODOs, and the last five documents that the individual worked on. We will define an SLO for this ‘home’ endpoint to be a response or load time of 150ms. This code will be intentionally poor to provide interesting traces.
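A small sketch of how the 150ms objective could be checked against measured response times follows. The `slo_report` helper and the sample durations are made up purely for illustration; they are not measurements from the demo application.

```python
SLO_HOME_MS = 150.0  # objective from the text: 'home' responds within 150ms


def slo_report(durations_ms, objective_ms=SLO_HOME_MS):
    """Summarize how a set of measured response times compares to the objective."""
    if not durations_ms:
        raise ValueError("need at least one measurement")
    within = sum(1 for d in durations_ms if d <= objective_ms)
    return {
        "objective_ms": objective_ms,
        "worst_ms": max(durations_ms),
        "compliance": within / len(durations_ms),  # fraction meeting the SLO
    }


# Hypothetical measurements in milliseconds, for illustration only.
sample = [120.0, 140.0, 900.0, 135.0]
report = slo_report(sample)
```

Here three of the four hypothetical requests meet the objective, so `report["compliance"]` comes out at 0.75.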
The trace that X-Ray creates begins at the entry into the system and collects segments from our instrumented services to provide a holistic view of the request. The X-Ray console provides two main sections for analysis: the Trace Map and the Segment View. The Trace Map offers a higher-level view of the system and closely matches the application design from Figure 1, making it easy to visualize how the different services impact the SLO of the endpoint.
Immediately this trace highlights a few problems with the application:
- The trace map shows an ‘Address Service’ being called several times, which is not part of the design.
- The average response time of 36ms from this service multiplied by the seven requests comes to roughly 252ms, which is far less than the total of 2.64 seconds reported by the Member Service.
- The segment timeline inside the BFF Application indicates the service calls are being executed in series.
- A duration of 2.8 seconds is far greater than the 150ms budget defined by the SLO.
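The arithmetic behind these observations can be worked through in a few lines, using the rounded figures quoted from the trace. The variable names are ours.

```python
ADDRESS_AVG_MS = 36      # average Address Service response time from the trace
ADDRESS_CALLS = 7        # number of requests made to the Address Service
MEMBER_TOTAL_MS = 2640   # total reported by the Member Service (2.64 s)
BFF_TOTAL_MS = 2800      # duration of the whole request (2.8 s)
SLO_MS = 150             # budget defined by the SLO

# Time the Address Service calls account for on their own.
address_total_ms = ADDRESS_AVG_MS * ADDRESS_CALLS

# Time inside the Member Service that the Address Service calls do not explain.
unexplained_ms = MEMBER_TOTAL_MS - address_total_ms

# How far the request is over its budget.
over_budget = BFF_TOTAL_MS / SLO_MS

print(f"address calls: {address_total_ms} ms")        # 252 ms
print(f"unexplained:   {unexplained_ms} ms")          # 2388 ms
print(f"over budget:   {over_budget:.1f}x the SLO")   # 18.7x
```

Most of the Member Service's time is therefore spent somewhere other than waiting on the Address Service, which is why the Member Service itself is the next place to look.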
At this point, it is clear that optimizing the Member Service will provide the largest gains toward the service level objective. Because this is a distributed tracing system, the trace for the Member Service is also available, and a quick look at it shows significant processing time after each request to the address service.
Reviewing the code explains what is happening between each request to the address service, but it raises several more questions:
- First, if the address service accepts an array of member identifiers, why are we sending the requests one by one?
- Second, the ‘team’ endpoint looks like code cut and pasted from the ‘member’ endpoint. Is the member address needed when retrieving a list of team members?
- Third, the address service seems quite frail. What will happen once this application is in the hands of hundreds of users?
These are all interesting questions. Regardless, the design does not require the address service, so the calls to it can be removed. In addition, the BFF Application will be updated to parallelize the requests to its dependent services.
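To illustrate the difference parallelizing makes, here is a minimal sketch using Python's asyncio. The service names and latencies are hypothetical stand-ins, and `call_service` simulates a network call with a sleep rather than making a real request. It is not the demo application's actual code.

```python
import asyncio
import time

# Hypothetical backend latencies in seconds, for illustration only.
SERVICES = {"members": 0.05, "todos": 0.08, "documents": 0.04}


async def call_service(name, latency_s):
    """Stand-in for an HTTP call to a backend service."""
    await asyncio.sleep(latency_s)
    return name


async def home_serial():
    """Call each service one after another; total time is the sum of latencies."""
    return [await call_service(n, l) for n, l in SERVICES.items()]


async def home_parallel():
    """Call all services concurrently; total time is roughly the slowest call."""
    return await asyncio.gather(*(call_service(n, l) for n, l in SERVICES.items()))


def timed(coro_fn):
    """Run a coroutine function to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start


serial_result, serial_s = timed(home_serial)
parallel_result, parallel_s = timed(home_parallel)
print(f"serial: {serial_s * 1000:.0f} ms, parallel: {parallel_s * 1000:.0f} ms")
```

Both versions return the same data, but the serial version pays for every backend call in turn, while the parallel version is bounded by the slowest dependency.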
In our final trace, we have achieved the SLO of 150ms as indicated by the total duration of 121ms. The requests to the backend services are performing as they should – in parallel. There is an opportunity for additional gains by looking into the work the TODO service is performing.
Instrumenting and measuring an application’s performance is far more powerful than just determining whether it is up or down. Distributed tracing provides a high-level view of an application’s infrastructure, along with the ability to drill down and focus on the key transactions that impact service level objectives.
AWS X-Ray is one of many observability tools that can be used to instrument application code. Although it may be limited in features and in the languages it supports, it offers a very low barrier to entry and requires very little infrastructure to achieve its purpose.
Are You Ready for a DevOps Transformation?
While software continues to eat the world at an ever-increasing pace with DevOps, the challenges and struggles companies face in implementing DevOps are very real. We can all overcome these challenges by working together, improving our tools, processes, and knowledge, and training our workforce.