Thursday, November 12, 2009

X-Trace: A Pervasive Network Tracing Framework

Summary

This paper describes the design and implementation of X-Trace, a tracing framework that gives users a comprehensive view of service behavior. The motivation was that no single diagnostic tool spans all layers and applications. Without one, it is hard for a user to understand how the different services and protocols interact, and therefore hard to find where a problem originated.

A user invokes X-Trace when starting an application task. X-Trace tags the request with a task identifier, which is propagated through every layer and protocol the request touches. From this, X-Trace can determine all the operations connected to that task and reconstruct them as a task tree.
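
To make the tagging idea concrete, here is a minimal sketch of how a propagated identifier could yield a task tree. It is an illustration under assumptions, not X-Trace's actual API: the names `Metadata`, `start_task`, `propagate`, `build_tree`, and `render` are hypothetical, the layer labels are invented, and operation IDs are sequential only to keep the example readable.

```python
import itertools
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Sequential IDs keep the example deterministic (an assumption for readability).
_op_ids = itertools.count(1)


@dataclass(frozen=True)
class Metadata:
    """Hypothetical metadata carried along with a request."""
    task_id: str              # identifies the whole task
    op_id: int                # identifies this operation
    parent_id: Optional[int]  # operation that caused this one, None for the root


def start_task(task_id):
    """Tag a new task at the point where the user starts it."""
    return Metadata(task_id, next(_op_ids), None)


def propagate(meta):
    """Copy the task ID to the next operation and record its parent."""
    return Metadata(meta.task_id, next(_op_ids), meta.op_id)


def build_tree(reports, task_id):
    """Group the reported operations of one task by their parent operation."""
    children = defaultdict(list)
    for label, meta in reports:
        if meta.task_id == task_id:
            children[meta.parent_id].append((meta.op_id, label))
    return children


def render(children, parent=None, depth=0):
    """Return the tree as indented lines, one per operation."""
    lines = []
    for op_id, label in children.get(parent, []):
        lines.append("  " * depth + label)
        lines.extend(render(children, op_id, depth + 1))
    return lines


# Each layer reports (label, metadata) as the request passes through it.
reports = []
root = start_task("task-42")
reports.append(("HTTP client", root))
tcp = propagate(root)
reports.append(("TCP connection", tcp))
ip = propagate(tcp)
reports.append(("IP hop", ip))
proxy = propagate(root)
reports.append(("HTTP proxy", proxy))

tree_lines = render(build_tree(reports, "task-42"))
print("\n".join(tree_lines))
```

Because every report carries the same task ID plus a pointer to its parent operation, the tree can be rebuilt afterwards from the reports alone, in whatever order they arrive.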

The network path generally traverses multiple administrative domains (ADs), and each AD may not wish to share internal information. X-Trace accommodates this by cleanly separating the user who requests a trace from the receiver of the trace information. The portion of the trace that is relevant to each party is sent to that party if it has X-Trace turned on, and the party can do with the data what it wishes. Even if not all parties turn on X-Trace, it still provides horizontal slices of the task tree.

X-Trace follows three main design principles:
1. Trace requests should be sent in-band: metadata should be added to the same datapath we want to trace.
2. Collected data should be sent out-of-band: this allows data to be sent even when failures occur in the path.
3. The entity requesting data is decoupled from the entity receiving the data: this allows each AD to control its own information.
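
The three principles can be illustrated with a small simulation. This is only a sketch under assumptions: `Collector`, `forward`, the `xtrace_task_id` field, and the hop and domain names are all hypothetical and not taken from the X-Trace implementation. The request carries its task ID in-band, each hop reports to its own domain's collector out-of-band, and one hop is made to fail so the effect of the second principle is visible.

```python
from collections import defaultdict


class Collector:
    """Stands in for one administrative domain's report store (hypothetical)."""

    def __init__(self):
        self.reports = []

    def report(self, task_id, hop_name):
        self.reports.append((task_id, hop_name))


def forward(request, path, collectors):
    """Pass the request along the path; return None if a hop drops it."""
    for hop in path:
        # Principle 1: the task ID is read from the request itself (in-band).
        task_id = request["xtrace_task_id"]
        # Principles 2 and 3: the report bypasses the datapath (out-of-band)
        # and goes to the collector owned by the hop's own domain.
        collectors[hop["domain"]].report(task_id, hop["name"])
        if hop.get("fails", False):
            return None  # the request is lost here; earlier reports survive
    return request


collectors = defaultdict(Collector)
path = [
    {"name": "client-proxy", "domain": "AD-A"},
    {"name": "border-router", "domain": "AD-A"},
    {"name": "cache", "domain": "AD-B", "fails": True},
    {"name": "origin-server", "domain": "AD-B"},
]
request = {"xtrace_task_id": "task-42", "payload": "GET /"}

result = forward(request, path, collectors)
print(result)
print(collectors["AD-A"].reports)
print(collectors["AD-B"].reports)
```

Although the request never reaches the last hop, both domains still hold reports for every hop it did reach, and each domain decides on its own whether to share them.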

Critique & Questions

I think X-Trace is an interesting concept, and I could see it being very useful for both network operators and users. Of course, its usefulness increases with the number of ADs on the path that use it. I'm curious whether it was ever implemented across all layers in a real-world network.

1 comment:

  1. It was implemented and used to instrument real systems like Coral (a CDN), a network authorization system, and Hadoop. It was very useful for uncovering bugs in these systems.
