Thursday, October 29, 2009

Resilient Overlay Networks

Summary

A Resilient Overlay Network (RON) is an application-layer architecture that allows distributed Internet applications to detect outages and recover from them quickly (on the order of seconds, rather than the minutes BGP can take). The RON nodes sit on top of the underlying network and constantly monitor it. When sending a packet, a node uses this information to decide whether to send it over the normal Internet path or to forward it through another RON node.

RON was evaluated on 2 deployments. It was able to detect all the outages that took place and route around them, in less than 20 seconds on average. In some cases RON also improved performance: about 5% of transfers doubled their TCP throughput, and about 5% saw their loss probability reduced by 0.05.

RON nodes aggressively probe and monitor the paths connecting them. They exchange information about the quality of each path using a routing protocol, and build forwarding tables from detailed path information, including latency, loss rate and throughput. The authors argue that a RON should be limited in size, to between 2 and 50 nodes, in order to bound the bandwidth overhead that results from the aggressive probing.
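
To see why the size limit matters, here is a rough sketch of how the number of probed paths grows if every node probes every other node. The function name is hypothetical, and the sketch only counts paths; it assumes nothing about probe size or probe frequency.

```python
def probed_paths(num_nodes: int) -> int:
    """Count directed node-to-node paths when every node probes every other node."""
    if num_nodes < 0:
        raise ValueError("num_nodes must be non-negative")
    return num_nodes * (num_nodes - 1)


if __name__ == "__main__":
    # The count grows quadratically with the number of nodes.
    for n in (2, 10, 50):
        print(n, "nodes ->", probed_paths(n), "probed paths")
```

Going from 10 to 50 nodes multiplies the number of probed paths by roughly 27, which is the kind of growth that makes a small overlay attractive.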

RON has 3 major goals:
  1. allow nodes to communicate with each other, even in the face of outages.
  2. integrate routing with distributed applications more closely than before, allowing for application-specific metrics to be used.
  3. provide a framework for the implementation of expressive routing policies, which allow paths to be chosen based on criteria such as forwarding rate controls or notions of acceptable use.

The RON architecture consists of many RON clients distributed over the network. A client interacts with RON through the conduit API in order to send and receive data. The first node that sees a packet is called the entry node; it classifies the packet according to the type of path it should use, determines a path for it, adds a RON header and forwards the packet. All other RON nodes simply forward the packet to the next RON node. The final RON node is called the exit node, and it delivers the packet to the application.

The entry node has a lot of control over the flow of the packet. The packet is tagged with a flow ID, and subsequent nodes attempt to keep the packet on the same path as the rest of the flow.
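
The entry/forward/exit behaviour described above can be sketched roughly as follows. This is only an illustration: the class names, header fields and table layout are all hypothetical and are not taken from the RON implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class RonHeader:
    # Hypothetical header fields; the real RON header layout is not shown here.
    flow_id: int
    policy_tag: str   # type of path chosen by the entry node
    destination: str  # name of the exit node


@dataclass
class Packet:
    payload: bytes
    header: Optional[RonHeader] = None


@dataclass
class RonNode:
    name: str
    # (policy_tag, destination) -> next RON node
    next_hop: Dict[Tuple[str, str], str]
    # flow_id -> next hop already chosen for that flow
    flow_cache: Dict[int, str] = field(default_factory=dict)

    def enter(self, packet: Packet, destination: str,
              flow_id: int, policy_tag: str) -> Optional[str]:
        """Entry node: classify the packet, add a header, pick the next hop."""
        packet.header = RonHeader(flow_id, policy_tag, destination)
        return self.forward(packet)

    def forward(self, packet: Packet) -> Optional[str]:
        """Return the next node's name, or None if this is the exit node."""
        header = packet.header
        if header is None:
            raise ValueError("packet has no RON header")
        if header.destination == self.name:
            return None  # exit node: hand the packet to the application
        hop = self.flow_cache.get(header.flow_id)
        if hop is None:
            hop = self.next_hop[(header.policy_tag, header.destination)]
            self.flow_cache[header.flow_id] = hop  # keep the flow on one path
        return hop


def route(nodes: Dict[str, RonNode], entry: str, packet: Packet,
          destination: str, flow_id: int, policy_tag: str) -> List[str]:
    """Walk a packet from the entry node to the exit node."""
    path = [entry]
    hop = nodes[entry].enter(packet, destination, flow_id, policy_tag)
    while hop is not None:
        path.append(hop)
        hop = nodes[hop].forward(packet)
    return path
```

The flow cache is one simple way to model "keep the packet on the same path as the rest of the flow": once a node has picked a next hop for a flow ID, later table changes only affect new flows.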

When building up its routing table, RON keeps track of (i) latency, (ii) packet loss and (iii) throughput. This information is spread between nodes using a link-state routing protocol. The latency-minimizing forwarding table is computed from an exponentially weighted moving average (EWMA) of round-trip latency samples.
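
As a minimal sketch of the latency metric, the snippet below keeps an EWMA per link and then compares the direct path with paths through a single intermediate node. The smoothing weight, the function names and the restriction to one intermediate hop are assumptions made for illustration, not details confirmed by this summary.

```python
from typing import Dict, List, Optional, Tuple

ALPHA = 0.9  # assumed weight on the previous estimate


def update_latency(previous: Optional[float], sample: float,
                   alpha: float = ALPHA) -> float:
    """Fold a new round-trip latency sample into the running EWMA."""
    if previous is None:
        return sample  # first sample seeds the estimate
    return alpha * previous + (1.0 - alpha) * sample


def best_path(latency: Dict[Tuple[str, str], float],
              src: str, dst: str) -> Tuple[List[str], float]:
    """Pick the lower-latency option: direct, or via one intermediate node."""
    inf = float("inf")
    nodes = {n for link in latency for n in link}
    best = ([src, dst], latency.get((src, dst), inf))
    for mid in sorted(nodes - {src, dst}):
        total = latency.get((src, mid), inf) + latency.get((mid, dst), inf)
        if total < best[1]:
            best = ([src, mid, dst], total)
    return best
```

With a high weight on the previous estimate, a single slow probe moves the average only slightly, so the forwarding table does not flap on one bad sample.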

RON was implemented and evaluated. It was found to detect and recover from outages in about 18 seconds on average. In some cases it also improved throughput, loss rate and latency.

Criticisms & Questions

I think this was an interesting paper. It was well organized and presented the protocol well. The evaluation also seemed sound, given that they ran 2 deployments and RON was found to help in all cases.

I think RON is a good way to allow applications to give information to the transport protocol without breaking the end-to-end argument. It solves many of the problems with application-specific needs.
