Tuesday, November 24, 2009

BotGraph: Large Scale Spamming Botnet Detection

Summary

This paper proposes BotGraph, a system for detecting and mapping spamming botnets that target major Web email providers. BotGraph includes an efficient implementation for constructing and analyzing very large graphs.

BotGraph has 2 main components: aggressive sign-up detection and stealthy bot-user detection. First, BotGraph detects aggressive account signup by looking for a sudden increase in sign-ups from a single IP address. BotGraph uses this information to limit the number of email accounts the botnet can control.
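Below is a minimal sketch of what such a sudden-increase detector might look like. The paper describes history-based anomaly detection on signups per IP; the EWMA smoothing and the specific threshold here are my own illustrative assumptions, not the paper's parameters.

from collections import defaultdict

ALPHA = 0.1        # EWMA smoothing factor (assumed)
THRESHOLD = 20     # signups above the smoothed history considered "aggressive" (assumed)

ewma = defaultdict(float)   # per-IP smoothed signups per day

def is_aggressive_signup(ip, signups_today):
    """Flag an IP whose daily signup count jumps far above its smoothed history."""
    expected = ewma[ip]
    anomalous = signups_today > expected + THRESHOLD
    ewma[ip] = (1 - ALPHA) * expected + ALPHA * signups_today   # update history
    return anomalous

# Example: a proxy IP that normally sees a handful of signups suddenly sees 500.
for day, count in enumerate([3, 2, 4, 500]):
    print(day, is_aggressive_signup("198.51.100.7", count))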

The second component identifies the stealthy bot-user accounts. To do this, BotGraph builds a user-user graph, where every node is a user and every edge has a weight representing how similar the 2 users are. BotGraph relies on the observation that all the accounts in a bot-user group are created and used with the same toolkit and so exhibit similar behaviors. Therefore, with an appropriate similarity metric, a bot-user group shows up as a tightly connected component in the user-user graph.

The similarity metric BotGraph uses is the number of IP addresses from which both users have logged in. BotGraph relies on the following 2 properties of botnets (see the sketch after the list):
  • The sharing of one IP address: the ratio of bot-users to bots is about 50:1, so many bot-users log in through the same bot, resulting in a shared IP address
  • The sharing of multiple IP addresses: botnets have high churn rate so bots are added and removed often from the botnet. To maximize account utilization, the spammer will reassign accounts to different bots, resulting in each account being used on multiple IP addresses.
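Here is a small sketch of the edge-weight computation this implies: count, for each pair of users, the number of login IPs they share, and keep only edges heavy enough to suggest coordinated behavior. The input format and the threshold of 2 shared IPs are assumptions for illustration.

from collections import defaultdict
from itertools import combinations

def build_user_graph(logins, min_shared_ips=2):
    """logins: iterable of (user, ip) pairs from the login log.
    Returns edges {(user_a, user_b): #shared login IPs} above the threshold."""
    users_by_ip = defaultdict(set)
    for user, ip in logins:
        users_by_ip[ip].add(user)

    weights = defaultdict(int)
    for users in users_by_ip.values():
        for a, b in combinations(sorted(users), 2):
            weights[(a, b)] += 1

    # Connected components of the surviving edges are candidate bot-user groups.
    return {edge: w for edge, w in weights.items() if w >= min_shared_ips}

edges = build_user_graph([("u1", "1.2.3.4"), ("u2", "1.2.3.4"),
                          ("u1", "5.6.7.8"), ("u2", "5.6.7.8"),
                          ("u3", "5.6.7.8")])
print(edges)   # {('u1', 'u2'): 2} -- u3 shares only one IP, so no edge survives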
The authors implemented their graph-creation algorithm using Hotmail logs from 2 one-month periods in 2007 and 2008. The logs contained a signup log and a login log. After constructing the user-user graph, they found 26 million botnet-created user accounts with a false positive rate of 0.44%.

Criticisms & Questions

The idea they propose is interesting and seems to work under the assumptions they make. However, given that spammers constantly evolve their techniques, the basic assumptions about bots sharing IP addresses might not hold in the future.

Monday, November 23, 2009

Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks

Summary

This paper proposes Not-a-Bot (NAB), a system that identifies and verifies human-generated versus bot-generated traffic to mitigate email spam, DDoS attacks, and click-fraud. NAB is composed of 2 main components: the attester, which lives on the client machine, and the verifier, which lives on the server that processes the requests. Requests are attested on the client machine as human-generated and then sent to the server, which verifies the attestation and applies application-specific policies to handle the requests.

NAB has 4 main requirements:
  • attestations must be generated automatically in response to human requests
  • attestations must not be transferable from the client where they were generated to attest traffic from another client
  • NAB must benefit its users but not hurt those that do not
  • NAB must preserve existing privacy and anonymity of applications
The attester's main job is to decide whether to grant an attestation when an application requests one, and to generate it if so. The attester uses keyboard and mouse activity, along with its timing relative to the request, to judge whether the action was human-generated. The attestation contains a signature over the entire request content, tying the attestation to the exact content it was generated for.
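A highly simplified sketch of that decision logic follows. The real attester runs in a trusted, TPM-backed environment; the HMAC signing, the one-second activity window, and the message layout below are stand-ins I chose for illustration.

import hashlib
import hmac
import os
import time

ACTIVITY_WINDOW = 1.0          # seconds of allowed gap after the last input event (assumed)
ATTESTER_KEY = os.urandom(32)  # stand-in for the attester's protected signing key

def attest(request_bytes, last_input_event_ts, now=None):
    """Return an attestation binding request_bytes to recent human activity, or None."""
    now = now if now is not None else time.time()
    if now - last_input_event_ts > ACTIVITY_WINDOW:
        return None   # no recent keyboard/mouse activity: refuse to attest
    nonce = os.urandom(16)
    msg = hashlib.sha256(request_bytes).digest() + nonce + int(now).to_bytes(8, "big")
    return {
        "nonce": nonce,
        "timestamp": int(now),
        "signature": hmac.new(ATTESTER_KEY, msg, hashlib.sha256).digest(),
    }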

Attestations always satisfy the following 2 properties:
  1. Non-transferability: prevents an attestation from being forged to appear as if it were generated for another attester
  2. Binding to the content of the request
NAB also includes a verifier that runs on the server processing the requests. The verifier checks that the attestation is valid; once it is, it applies application-specific policies to the request. Examples include filtering unattested email more aggressively to reduce spam, prioritizing attested requests during a DDoS attack, and only counting attested clicks to prevent click-fraud.

NAB uses nonces in the attestation to provide the following 2 security guarantees (see the verifier sketch after the list):
  • attestations can't be double-spent
  • a bot can't harvest keyboard and mouse activity to accumulate attestations beyond a fixed time window
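As referenced above, here is a companion sketch of the verifier-side checks that enforce these two guarantees. The shared-key HMAC (instead of a public-key signature), the 30-second validity window, and the unbounded nonce set are simplifying assumptions.

import hashlib
import hmac
import time

MAX_AGE = 30          # seconds an attestation stays valid (assumed)
seen_nonces = set()   # in practice, bounded to the validity window

def verify(request_bytes, att, key):
    msg = (hashlib.sha256(request_bytes).digest()
           + att["nonce"]
           + att["timestamp"].to_bytes(8, "big"))
    if not hmac.compare_digest(hmac.new(key, msg, hashlib.sha256).digest(), att["signature"]):
        return False                               # not bound to this request content
    if time.time() - att["timestamp"] > MAX_AGE:
        return False                               # too old: limits stockpiling of attestations
    if att["nonce"] in seen_nonces:
        return False                               # nonce already spent
    seen_nonces.add(att["nonce"])
    return True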

The authors evaluated NAB using traces of keyboard and mouse activity from 328 users, along with traces of spam, DDoS, and click-fraud activity. They found that NAB reduced the volume of spam by 92%, deprioritized 89% of bot-originated web activity, and detected bot-originated click-fraud activity with 87% accuracy.


Criticism & Questions

I think this paper proposed an interesting idea, and if deployed widely enough it could be useful. One practical concern is that although the authors mention the latency overhead is negligible, if many applications start to require more and more attestations, the overhead could rise enough to become significant.

I liked that they used real traces of users' keyboard and mouse activity. Their current heuristic for guessing if the action is human-generated or not is fairly simple and could be improved using more information on human-computer interaction patterns.

In addition, one of the key requirements of NAB is to benefit users while not harming nonusers. However, if any of the verifiers use the policy that attested traffic is given higher priority (such as in the DDoS example), then nonusers' requests will be given lower priority, which is essentially harming their user experience.

Thursday, November 19, 2009

Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems

Summary

This paper explores how to reduce the power consumed by idle networked systems as effectively as possible. One obstacle to putting idle systems into low-power sleep modes is that doing so causes them to lose network connectivity and prevents them from running tasks scheduled for periods of low utilization.

After collecting and analyzing data from 250 enterprise users, the authors were able to determine the value of proxying, which applications would benefit from it, and how comprehensive it would need to be in order to be effective.

Their measurements showed that machines are idle about 50% of the time, which translates into roughly 60 TWh/year of wasted electricity, or about 6 billion dollars. Comparing idle-time traffic between home and office environments, they found that the average packet rate at home is about a third of that in the office (roughly 1 versus 3 packets over the same interval). Based on this, they concluded that any solution that wakes the host to handle incoming network traffic would not be energy efficient.

Based on the above data, they propose a proxy architecture that supports different idle-time behaviors.  When designing the proxy, they classified the traffic along 2 dimensions, the first being the need to proxy the traffic with the following options:
  • don't wake protocols: generate sustained and periodic traffic and should be dealt with without waking the host
  • don't ignore protocols: require attention to maintain the correct order of operations
  • policy-dependent traffic: how it's handled depends on tradeoff between sophistication of functionality and energy savings.
and the second being the complexity required to proxy the traffic with the following options:
  • ignorable (drop): can be safely ignored
  • handled via mechanical response: easy to construct required response
  • require specialized processing: usually requires state maintenance
The authors designed the following 4 proxies:
  • proxy_1: ignores all traffic listed as ignorable & wakes machine to handle all other incoming traffic
  • proxy_2: ignores all traffic listed as ignorable, issues responses for protocol traffic to be handled with mechanical responses & wakes the machine for all other incoming traffic
  • proxy_3: applies (1) and (2) as in proxy_2, but additionally wakes the machine for incoming traffic belonging to telnet, ssh, VNC, SMB, or NetBIOS, and drops all other incoming traffic.
  • proxy_4: identical to proxy_3, but also wakes the machine for regular network backups, anti-virus software updates, FTP traffic, and Intel-specific updates.
They find that proxy_1 is inadequate for home and office and that proxy_3 achieves the most sleep in all scenarios.
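To make the classification concrete, here is a hypothetical rule table in the spirit of proxy_3. The protocol names and actions are examples I chose; they are not the paper's exact rule set.

DROP, RESPOND, WAKE = "drop", "respond", "wake"

# Invented rule table in the spirit of proxy_3.
RULES = {
    "ARP":     RESPOND,   # mechanical response on behalf of the sleeping host
    "ICMP":    RESPOND,
    "HSRP":    DROP,      # sustained/periodic chatter: ignorable
    "SSDP":    DROP,
    "telnet":  WAKE,      # remote-access traffic: wake the machine
    "ssh":     WAKE,
    "VNC":     WAKE,
    "SMB":     WAKE,
    "NetBIOS": WAKE,
}

def handle(protocol, default=DROP):
    """proxy_3-like behavior: unrecognized traffic is dropped rather than waking the host."""
    return RULES.get(protocol, default)

print(handle("ARP"), handle("ssh"), handle("BitTorrent"))   # respond wake drop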

Criticism & Questions

I think this was a very interesting paper.  The methodology seemed sound and their evaluation  thorough, especially their classification of idle-time traffic.  I will be curious to see how their work goes forward and if it ever gets implemented in real-world machines.

Cutting the Electric Bill for Internet Scale Systems

Summary

This paper describes how electricity prices vary with time and geographic location, and how data centers can take advantage of this information to save in electricity costs.

They rely on 2 key observations:
  1. electricity prices vary
  2. large distributed systems already incorporate request routing and replication
Using price information, the goal is to have a system route client requests to clusters such that total electricity cost is minimized. This also depends on the energy elasticity: the extent to which a cluster's electricity costs depend on the load on that cluster.
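A toy sketch of that routing idea: send each client's requests to the cheapest feasible cluster, subject to a distance (latency) budget. The greedy, capacity-free policy and the sample numbers are simplifying assumptions, not the paper's actual optimization.

def route(clients, clusters, prices, distance, max_distance_km=1500):
    """clients, clusters: lists of ids; prices: $/MWh per cluster;
    distance(client, cluster) -> km. Returns {client: chosen cluster}."""
    assignment = {}
    for c in clients:
        feasible = [k for k in clusters if distance(c, k) <= max_distance_km]
        if not feasible:                                    # fall back to the nearest cluster
            feasible = [min(clusters, key=lambda k: distance(c, k))]
        assignment[c] = min(feasible, key=lambda k: prices[k])
    return assignment

prices = {"east": 40.0, "west": 55.0}                       # $/MWh, fabricated
dist = lambda c, k: {"nyc": {"east": 300, "west": 4000}}[c][k]
print(route(["nyc"], ["east", "west"], prices, dist))       # {'nyc': 'east'}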

By simulating 29 different locations in the US with historical electricity prices, the authors obtain results that suggest significant cost savings for data centers, including:
  • existing systems can see energy cost savings of at least 2%
  • savings rapidly increase with energy elasticity
  • allowing client-server distances to increase leads to increased savings

Criticism & Questions

This was a very interesting paper to read. I liked how it was rooted in a practical problem and offered a very practical solution. The paper was only recently published, but I would be interested to know whether any data centers have tried implementing this approach and what kinds of results they're getting.

Scalable Application Layer Multicast

Summary

This paper proposes a new scalable application-layer multicast protocol, designed specifically for low-bandwidth, data streaming applications.

They used 3 dimensions with which to evaluate application-layer multicast protocols:
  • quality of the data delivery path
  • robustness of the overlay
  • control overhead
The NICE protocol arranges the end hosts into a hierarchy, which implicitly defines the multicast overlay data paths. The hierarchical structure is necessary for scalability and for localizing the effect of member failures.

The hosts are assigned to different layers of the hierarchy and then split into clusters of hosts that are close to each other. The host at the center of each cluster (in terms of distance to the other members) is designated the cluster leader.
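A small sketch of one way to pick such a leader, assuming pairwise distance estimates (e.g. measured RTTs) between cluster members; the minimax criterion used here is my assumption about how the "center" is computed.

def cluster_leader(members, dist):
    """members: list of host ids; dist(a, b): estimated latency between two hosts.
    Pick the member whose worst-case distance to the rest of the cluster is smallest."""
    if len(members) == 1:
        return members[0]
    return min(members,
               key=lambda m: max(dist(m, other) for other in members if other != m))

d = {("a", "b"): 10, ("a", "c"): 40, ("b", "c"): 35}
dist = lambda x, y: d.get((x, y)) or d[(y, x)]
print(cluster_leader(["a", "b", "c"], dist))   # 'b' is the most central member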

The authors ran several simulations of their protocol and of the Narada application-layer multicast protocol. Their results showed that NICE achieved lower link stress, improved end-to-end latencies, and similar failure recovery properties, all with an order of magnitude lower control traffic.

They also deployed their protocol in a wide-area testbed and found that nodes maintained low-latency paths with a maximum packet loss rate of 1% or less, while the average control overhead stayed below 1 Kbps.

Criticisms & Questions

This was another simple but clever idea.  It would be interesting to see a direct comparison of this protocol to SRM.  I did like that this paper did both simulations and a real deployment, and that it evaluated the results along multiple dimensions.

A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing

Summary

This paper describes Scalable Reliable Multicast (SRM), an application-level reliable multicast framework. SRM draws on principles from Application Level Framing (ALF), Light-Weight Sessions (LWS), the group delivery model of IP multicast, and core design principles of TCP/IP: best-effort delivery, reliability on an end-to-end basis, and adaptive parameters.

In regular TCP, the sender is responsible for ensuring reliable data delivery.  This requires receiving ACKs from receivers and maintaining reception state.  With multicast, this could cause an ACK implosion at the sender.  In addition, the IP multicast model imposes a level of indirection between the sender and receivers, making it very difficult for the sender to maintain state for all the receivers.

In SRM (as used by the wb whiteboard application), receivers are responsible for reliability.  When a receiver detects that it is missing data, it waits for a random time, drawn from an interval based on its distance from the original source, and then multicasts a repair request to the group.  Any host that has the requested data can multicast a repair to the group, and every host missing that data hears it; likewise, a host that had scheduled its own repair request cancels its timer when it overhears an equivalent request.  This suppression reduces the number of duplicate requests and repairs, avoiding request and response implosion.
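A sketch of the request-suppression timer is below. The uniform interval scaled by the distance to the source follows the scheme described in the paper, but the constants are arbitrary and the exponential back-off on repeated losses is omitted.

import random

C1, C2 = 2.0, 2.0    # tunable constants; values here are arbitrary

def repair_request_delay(distance_to_source):
    """Delay before multicasting a repair request for missing data."""
    return random.uniform(C1 * distance_to_source, (C1 + C2) * distance_to_source)

pending = {}   # missing sequence number -> scheduled fire time

def on_hear_request(seqno):
    # Another host already asked for this data: suppress our own request timer.
    pending.pop(seqno, None)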

The authors simulate this framework on several different topologies, including chains and stars.  They found that the request/repair scheme works well for bounded-degree trees where every node participates in the multicast, but less well for sparse sessions.  It also works less well when the degree bound is very high.

A suggested optimization is local recovery: setting a small TTL on repair request packets so that only the neighborhood affected by the loss sees them.

Criticisms & Questions

The idea proposed in this paper was simple, but very clever.  It made use of multicast properties to reduce the number of packets in the network, and could make use of even more properties to optimize it.  I think the simulations had good results, but I would like to see SRM implemented in a real-world environment and see how it performs.

Thursday, November 12, 2009

X-Trace: A Pervasive Network Tracing Framework

Summary

This paper describes the design and implementation of X-Trace, a tracing framework that gives users a comprehensive view of service behavior. The motivation for X-Trace was the lack of a single diagnostic tool spanning all layers and applications, which makes it hard for a user to understand the interactions between different services and protocols, and therefore hard to find where a problem originated.

A user starts X-Trace when initiating an application task. X-Trace tags the request with a task identifier, which is propagated to all layers and protocols so that X-Trace can determine all the operations connected to that task; the result is a task tree.

The network path generally traverses multiple administrative domains (ADs), and each AD may not wish to share internal information. X-Trace accommodates this by providing a clean separation between the user and the receiver of the trace information: the portion of the trace relevant to each party is sent to that party if it has X-Trace turned on, and each party can do with its data what it wishes. Even if not all parties enable X-Trace, the system still provides horizontal slices of the task tree.

There are 3 main design principles used by X-Trace:
1. Trace requests should be sent in-band: the metadata is added to the same datapath we want to trace (sketched after this list)
2. Collected data should be sent out-of-band: allows data to be sent even when failures occur in the path
3. The entity requesting the trace is decoupled from the entity receiving the trace data: this allows each AD to control its own information.
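Here is a rough sketch of the in-band metadata and its propagation primitives, as referenced in principle 1. Only the task-identifier structure and the pushNext/pushDown operations follow the paper's description; the field sizes and the report transport are simplifications I chose.

import os
from dataclasses import dataclass

@dataclass
class XTraceMetadata:
    task_id: bytes    # constant across the whole task tree
    op_id: bytes      # identifies this operation (node) in the task tree

def _new_op(md):
    return XTraceMetadata(md.task_id, os.urandom(4))

def push_next(md):
    """Propagate to the next hop in the same layer (e.g. HTTP proxy -> HTTP server)."""
    return _new_op(md)

def push_down(md):
    """Propagate to the layer below (e.g. an HTTP request spawning a TCP connection)."""
    return _new_op(md)

def report(md, parent_op_id, info):
    # Reports go out-of-band to the local AD's collection infrastructure; print as a stand-in.
    print(md.task_id.hex(), parent_op_id.hex(), "->", md.op_id.hex(), info)

root = XTraceMetadata(os.urandom(8), os.urandom(4))
tcp_md = push_down(root)
report(tcp_md, root.op_id, "tcp connect")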

Critique & Questions

I think X-Trace is an interesting concept. I could see it being very useful for network operators and users as well. Of course, the usefulness increases with the number of ADs on the path that are using it. I'm curious about whether this was ever implemented in all layers in the real-world network.

NetFPGA: A Tool for Network Research and Education

Summary

This paper describes NetFPGA, a project designed to give students practical experience with networking concepts at the physical and link layers. Students can use NetFPGA to build real networking hardware, and because of its low cost, many boards can be connected together to build interesting networked systems.

So far there have been 2 versions of NetFPGA. Version 1 includes 3 FPGAs, one of which is the Control FPGA, preprogrammed to connect the Ethernet controller to the other 2 User FPGAs. The students built a simple logic analyzer in order to be able to view the signals in a waveform viewer.

There were several limitations to Version 1:
  • Board format was awkward, requiring self-assembly
  • Too slow for research projects
  • No on-board CPU
  • Windows not a possible development platform
This led to Version 2 of NetFPGA. Version 2 switched to the PCI card format, which was beneficial because most users were already familiar with it and it offered greater bandwidth. It also became possible to develop using ModelSim on Windows XP.

Future work includes further testing to analyze configuration and hardware problems in addition to improving the logic analyzer.

RCP and intrusion detection at ICSI are 2 projects currently making use of NetFPGA.

Critique & Questions

This was a quick read and a good introduction to NetFPGAs.

Tuesday, November 10, 2009

Internet Indirection Infrastructure

Summary

This paper describes an overlay network called the Internet Indirection Infrastructure (i3), designed to make it easy to provide services like multicast, anycast, and mobility. i3 adds a level of indirection on top of the current network, decoupling the act of sending from the act of receiving. Rather than sending to one or more IP addresses, a sender sends to a logical identifier. Receivers indicate which identifiers they are interested in by inserting triggers.

i3 was designed to be versatile, allowing users to build services like mobility, anycast, and service composition on top of it. It also decentralizes the responsibility for building efficient trees by delegating that task to the end hosts.

i3 uses rendezvous-based communication. Each packet is an (id, data) pair, where id is an m-bit identifier and data is the IP packet payload. Receivers insert triggers of the form (id, addr), which state that all packets with identifier id should be sent to addr. When a receiver wants to receive data, it puts a trigger into the network; when a packet with that id arrives, it is forwarded to the receiver.
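A minimal sketch of this rendezvous abstraction: triggers are stored in a table keyed by identifier, and a packet is forwarded to every matching trigger. In real i3 the triggers live on Chord nodes and matching is done on identifier prefixes; the exact-match dictionary here is a simplification.

from collections import defaultdict

triggers = defaultdict(set)          # identifier -> set of receiver addresses

def insert_trigger(ident, addr):
    triggers[ident].add(addr)

def deliver(addr, data):
    print(f"delivering {data!r} to {addr}")   # in real i3, an IP packet to addr

def send_packet(ident, data):
    for addr in triggers.get(ident, ()):
        deliver(addr, data)

insert_trigger("video-1", "10.0.0.5")
insert_trigger("video-1", "10.0.0.9")    # a second trigger gives multicast "for free"
send_packet("video-1", b"frame-42")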

The authors implemented and evaluated i3 using the Chord lookup system. Initial evaluations suggest that i3 is highly flexible and can support many sophisticated applications.

Critique & Questions

Interesting paper. The rendezvous-style communication they use is fairly straightforward, although from a practical standpoint there seem to be many challenges to work through before a real deployment could take place. I liked that they acknowledged if and how i3 could be implemented for real.

A Policy-aware Switching Layer for Data Centers

Summary

This paper proposes PLayer, a policy-aware layer-2 for data centers. PLayer is made up of interconnected policy-aware switches (pswitches). Middleboxes are plugged into the pswitches, taking them off the physical network path. The pswitches decide which middleboxes each packet needs to traverse and forward the packet accordingly.

The main principles of PLayer are:
1) separating policy from reachability - the middleboxes a packet traverses are determined explicitly by policy, not implicitly by network path selection mechanisms
2) taking middleboxes off the physical network path - middleboxes are removed from choke points

PLayer aims to provide the following properties:
1) correctness
2) flexibility
3) efficiency

For forwarding, PLayer uses centralized policy and middlebox controllers to set up and maintain the rule tables at the pswitches.
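Here is a hypothetical flavor of what a pswitch rule lookup might look like; the policy entries and the port-based traffic classes are invented for illustration and are not the paper's actual rule format.

# (previous hop, destination port) -> next hop; entries invented for illustration.
POLICY = {
    ("internet", 80):      "firewall",
    ("firewall", 80):      "load_balancer",
    ("load_balancer", 80): "web_servers",
}

def next_hop(previous_hop, dst_port):
    return POLICY.get((previous_hop, dst_port), "drop")

# A web request entering from the internet traverses firewall -> load balancer -> servers,
# regardless of where those boxes sit in the physical topology.
hop, port = "internet", 80
while hop not in ("web_servers", "drop"):
    hop = next_hop(hop, port)
    print("forwarding to", hop)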

Critique & Questions

I thought this was a very well written and well structured paper. They did a good job at explaining the current problems in data centers. I also liked that they went into detail on how PLayer could be incorporated into the current infrastructure with minimal changes.

Wednesday, November 4, 2009

DNS Performance and the Effectiveness of Caching

Summary

This paper presents an analysis of traces of DNS and TCP traffic from MIT and KAIST. The paper explores the performance as perceived by the client and the effectiveness of caching.

Their data comes from 3 traces: 2 from MIT and one from KAIST. The data includes outgoing DNS queries and incoming responses, as well as outgoing TCP connection start and end packets. Most answered A lookups were followed by a TCP connection; those that weren't were excluded from the analysis, as were TCP connections not preceded by DNS queries.

Client-Perceived Performance:

Latency was found to be adversely affected by the number of referrals in a DNS query. 70% of referrals hit a cached NS record in the local DNS cache, showing that caching NS records is very beneficial because it reduces the load on the root servers.

They found a large number of unanswered lookups, which they break into 3 categories: zero referrals, non-zero referrals, and loops. They suggest that the large number of unanswered queries is due to the persistent retransmission strategy used by name servers and to the referral loops that exist between name servers. Over 99.9% of answered queries required at most 2 retransmissions, and over 90% required none, so they conclude that DNS name servers are too persistent in their retry strategy.

10-42% of lookups result in a negative answer. The largest cause is inverse lookups for IP addresses with no inverse mappings. They also found that negative caching may not be very effective because the distribution of names producing negative answers is heavy-tailed. 15-27% of lookups sent to root name servers resulted in negative responses, most probably caused by incorrectly implemented or configured resolvers.

Effectiveness of Caching:
- How useful is it to share DNS caching among several client machines?
- What is the impact of the TTL value on caching effectiveness?

Previous work shows that web object popularity follows a Zipf-like distribution. NS records were found to have much higher TTLs than A records, resulting in far fewer DNS requests to the root and gTLD servers. If NS records were not cached, the load on the root and gTLD servers would have increased by a factor of about 5, making NS record caching critical to DNS scalability.

They found that more popular domain names have shorter TTLs, supporting the idea that DNS-based load-balancing is useful only for popular sites.

They determine that sharing a cache across many clients would not be very helpful. The sites common across clients are the most popular ones, so there is some performance gain there, but most sites are accessed by only one client. Even when a site is accessed multiple times, it is usually the same client accessing it in succession, in which case that client's local cache is adequate.

They find that increasing TTL values increases hit rates, but the effect is only noticeable for TTL values below roughly 1000 seconds. This is because most cache hits are due to multiple successive accesses of the same server by the same client. It suggests that low TTL values on A records would not harm performance much.

Criticism & Questions

Another interesting paper. I think it did a good job at evaluating many of the features that were explained in the first paper. Their traces seemed like a good representation of the real-world environment. Most of their conclusions were logical based on the data they presented.

Most of the graphs were helpful in understanding their data, but some were really hard to make out. One thing that was especially confusing was that Table II came before Table I.

Development of the Domain Name System

Summary

This paper presents the original design, implementation, surprises, successes and shortcomings of the Domain Name System.

Before DNS, the HOSTS.TXT system was used to map domain names to IP addresses. This system was essentially a single file, centrally maintained at the SRI Network Information Center and distributed to all hosts. As the network grew and the number of hosts went from tracking organizations to tracking individual workstations, this was no longer a feasible solution. Existing systems at the time, such as IEN116 and the XEROX naming system, did not fit the requirements, so the authors set out to design a new system.

The specification for DNS was that it should be a hierarchical name space with typed data at the nodes. Following are some of the requirements/assumptions for the DNS:
  • it should provide at least all the info in the HOSTS.TXT file
  • the database should be maintainable in a distributed manner
  • there should be no obvious size limits for names, name components, data associated with a name, etc.
  • should interoperate across the DARPA internet and other environments
  • provide tolerable performance
DNS has 2 major active components: name servers and resolvers. Name servers store information and answer queries. Resolvers interface with the client programs and are responsible for finding the name server that has the information that the client wants.

The name space is a variable-depth tree where each node has an associated label. There is no standard printing rule for the name format, and the design decouples the structure of the tree from any semantics.

There are 2 mechanisms for transferring data from source to destination: zones and caching. A zone is a contiguous section of the tree, usually controlled by a single organization, which is responsible for maintaining the zone's data and providing redundancy for it. DNS resolvers cache responses for the TTL specified by the zone.

The basic algorithm used by a DNS resolver is to search "downward" in the name tree, starting from the closest enclosing domain it already knows how to reach.
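A toy illustration of that downward search, with fabricated zone data: the resolver starts at the closest enclosing zone it already knows and follows referrals until it reaches the authoritative answer.

# Fabricated zone data: each zone either answers a name or refers to a child zone.
ZONES = {
    ".":        {"referral": {"edu.": "a.edu-servers.net"}},
    "edu.":     {"referral": {"mit.edu.": "ns.mit.edu"}},
    "mit.edu.": {"answer":   {"www.mit.edu.": "192.0.2.10"}},
}

def resolve(name, known_zone="."):
    zone = known_zone
    while True:
        data = ZONES[zone]
        if name in data.get("answer", {}):
            return data["answer"][name]
        # Follow the referral whose zone name is the longest matching suffix.
        matches = [z for z in data.get("referral", {}) if name.endswith(z)]
        if not matches:
            return None
        zone = max(matches, key=len)

print(resolve("www.mit.edu."))   # 192.0.2.10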

While building this system, they came across the following surprises:
  • refinements of semantics
  • performance of the network was much worse than the original design had expected, but lookups that required no network access did really well.
  • the need for negative caching
Some successes of the DNS system:
  • variable depth hierarchy
  • organizational structure of names
  • datagram access
  • additional section processing: responding server can anticipate next query and include that information in the response
  • caching
  • mail address cooperation
Some shortcomings of the DNS system:
  • type and class growth
  • easy upgrade of applications: hard to convert existing apps to use DNS
  • distribution of control vs. distribution of expertise/responsibility

Critique & Questions

Very interesting paper to read. It was nice to see the thought process around DNS. It is interesting to note that this paper deems caching as a success of DNS while the other paper we read shows that caching of A records is not very effective. This can probably be attributed to all the changes in the internet in the past 20 years. I wonder if the distribution of domain names when this paper was written (1988) would be Zipf-like, since there would presumably be fewer domain names. It's also important to note the lack of any authenticity and integrity checking of the data, which poses a security concern, something that was never considered in the original design.

Tuesday, November 3, 2009

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications

Summary

This paper describes Chord, a lookup protocol for peer-to-peer applications. Chord is a distributed hash table (DHT) based on consistent hashing, which provides load balancing and low overhead when a node joins or leaves the network. Chord is also scalable: in an N-node network, each node needs to know about only O(log N) other nodes, a lookup requires O(log N) messages, and a node join or leave requires O(log^2 N) messages to update routing state.

In the Chord protocol, every node and key is assigned an identifier using a base hash function. These identifiers determine which keys reside on which nodes: arranging all the nodes in a circle in order of their identifiers, a key is assigned to the first node whose identifier is equal to or follows the key's identifier, called the key's successor. Every node also knows its successor on the ring.

When a node joins or leaves the network, only an O(1/N) fraction of the keys needs to be moved. When a node n joins, it takes over some of the keys previously assigned to n's successor; when n leaves, its keys are reassigned to its successor.

The basic lookup idea, given that each node knows only its successor, is that when a node cannot resolve a query itself, it passes the query to its successor, which passes it on in turn until some node can resolve it. This ensures correctness but is inefficient. To improve performance, each node maintains additional routing state (a finger table), allowing it to forward each lookup well past its successor and much closer to the destination.
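The sketch below simulates that routing behavior: each hop either returns the key's successor or forwards toward the farthest preceding finger, so the number of hops stays logarithmic in the ring size. For simplicity every node here is given a view of the whole ring, so this is a simulation of the routing decisions rather than a distributed implementation; the 6-bit identifier space is arbitrary.

M = 6   # identifier bits; the identifier space is 0 .. 2**M - 1

def in_interval(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    a, b, x = a % 2**M, b % 2**M, x % 2**M
    return (a < x <= b) if a < b else (x > a or x <= b)

class Node:
    def __init__(self, ident, ring):
        self.id, self.ring = ident, ring        # ring: sorted list of all node ids (toy shortcut)

    def successor_of(self, ident):
        for n in self.ring:
            if n >= ident % 2**M:
                return n
        return self.ring[0]

    def find_successor(self, key, hops=0):
        succ = self.successor_of(self.id + 1)
        if in_interval(key, self.id, succ):
            return succ, hops                   # succ is responsible for key
        # Forward to the farthest "finger" that still precedes the key.
        for i in reversed(range(M)):
            finger = self.successor_of(self.id + 2**i)
            if in_interval(finger, self.id, key - 1):
                return Node(finger, self.ring).find_successor(key, hops + 1)
        return succ, hops

ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
print(Node(8, ring).find_successor(54))         # (56, 2): node 56 holds key 54, reached in 2 hops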

The authors both simulated and implemented the Chord protocol; both confirmed that lookup latency grows logarithmically with the number of nodes in the network.

Criticism & Questions

I enjoyed reading this paper. I think they did a good job of first explaining the overview of the algorithm and then going into details.

I did have one question about the algorithm. They explain that successor(k) is responsible for key k and what happens to keys when a node enters or leaves the network. However, I'm unsure of how the key-value entries are initially entered into each node's lookup table. Maybe the keys are assigned manually to whatever nodes exist when the application is set up?

Monday, November 2, 2009

Looking Up Data in P2P Systems

Summary

This paper explores peer to peer systems and looks at the Lookup Problem. The Lookup Problem can be stated as how to find a data item X given that it can be stored on a dynamic set of nodes.

There are 2 broad categories of solutions to this problem. The first is structured solutions, including a central database of filename-location mappings and a DNS-style hierarchical approach; both have resilience problems because they store the data on a few critical nodes. The second category is symmetric schemes, in which every node is as important as every other node. This category includes broadcasting queries to neighbors until some node finds the item in its database, which does not scale well, and using "superpeers" in a hierarchical structure, which again has resilience problems.

Most recent P2P lookup algorithms combine these categories into structured symmetric algorithms that provide guarantees as well as resilience. All of these algorithms use a DHT. The general idea is that a lookup query is forwarded from node to node, each time getting "closer" to the destination. "Closeness" can be measured in several ways: numeric difference between IDs, number of common prefix bits, or bitwise XOR of the IDs.
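For concreteness, here are the three closeness notions written out; the 32-bit identifier width is arbitrary, since each DHT fixes its own identifier size.

BITS = 32   # identifier width; each DHT fixes its own

def numeric_distance(a, b):        # Chord-style: clockwise distance around the ring
    return (b - a) % (1 << BITS)

def common_prefix_bits(a, b):      # Pastry/Tapestry-style prefix matching
    diff = a ^ b
    return BITS if diff == 0 else BITS - diff.bit_length()

def xor_distance(a, b):            # Kademlia-style
    return a ^ b

a, b = 0x12345678, 0x1234F0F0
print(numeric_distance(a, b), common_prefix_bits(a, b), xor_distance(a, b))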

Chord uses skiplist-like routing; Pastry, Tapestry, and Kademlia use tree-like routing; and CAN uses routing in a d-dimensional space. The authors conclude the paper with the following open questions: the choice of distance function, operation costs, fault tolerance and concurrent changes, proximity routing, malicious nodes, and indexing and keyword search.

Criticisms & Questions

I liked this paper. I think it did a good job at explaining the different lookup algorithms and the benefits and tradeoffs of each.

One interesting point they made about malicious nodes was that the worst a malicious node could do is "convincingly deny that data exists". This essentially becomes a DOS attack. Even if it is the worst case, isn't that still a cause for worry?