Tuesday, November 24, 2009

BotGraph: Large Scale Spamming Botnet Detection

Summary

This paper proposes BotGraph, a system that detects and maps botnet spamming attacks against major Web email providers. BotGraph includes an efficient implementation for constructing and analyzing large graphs.

BotGraph has 2 main components: aggressive sign-up detection and stealthy bot-user detection. First, BotGraph detects aggressive account sign-ups by looking for a sudden increase in sign-ups from a single IP address. It uses this information to limit the number of email accounts the botnet can control.
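
As a rough illustration of this first step, the sketch below flags an IP address whose daily sign-up count suddenly jumps far above its smoothed history. This is not the authors' actual algorithm; the threshold, smoothing factor, and minimum-activity cutoff are made-up values for illustration only.

# Hypothetical sketch: flag an IP when its sign-up count jumps far above
# its smoothed history. Parameter values are illustrative only.
ALPHA = 0.2          # EWMA smoothing factor
SPIKE_FACTOR = 10.0  # how far above the average counts as a "sudden increase"
MIN_SIGNUPS = 20     # ignore IPs with negligible activity

def detect_aggressive_signups(daily_counts_by_ip):
    """daily_counts_by_ip: {ip: [signups_day1, signups_day2, ...]}"""
    suspicious = set()
    for ip, counts in daily_counts_by_ip.items():
        if not counts:
            continue
        ewma = counts[0]
        for today in counts[1:]:
            if today >= MIN_SIGNUPS and today > SPIKE_FACTOR * max(ewma, 1.0):
                suspicious.add(ip)   # sudden surge relative to history
                break
            ewma = ALPHA * today + (1 - ALPHA) * ewma
    return suspicious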

The second component identifies the set of accounts controlled by bots. To do this, BotGraph builds a user-user graph, where every node is a user and every edge carries a weight representing how similar the two users are. The reasoning is that all the accounts in a bot-user group are driven by the same spamming toolkit and therefore exhibit similar behavior, so with an appropriate similarity metric each bot-user group appears as a connected component in the user-user graph.

The similarity metric BotGraph uses is the number of IP addresses from which both users have logged in (a sketch of this construction follows the list below). BotGraph relies on the following 2 properties of botnets:
  • The sharing of one IP address: the ratio of bot-users to bots is 50:1, so many bot-user accounts log in through the same bot, resulting in a shared IP address
  • The sharing of multiple IP addresses: botnets have a high churn rate, so bots join and leave the botnet frequently. To maximize account utilization, the spammer reassigns accounts to different bots, so each account ends up being used from multiple IP addresses.
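
The following is a minimal sketch of the graph construction and grouping idea under these assumptions: the edge weight is the number of shared login IPs, edges below a threshold are dropped, and connected components are reported as candidate bot-user groups. The threshold value is illustrative, and the paper's actual large-scale, distributed implementation is far more involved.

from collections import defaultdict
from itertools import combinations

def bot_user_groups(logins, weight_threshold=2):
    """logins: iterable of (user, ip) pairs from the login log.
    Returns connected components of the thresholded user-user graph."""
    users_by_ip = defaultdict(set)
    for user, ip in logins:
        users_by_ip[ip].add(user)

    # Edge weight = number of IPs shared by the two users.
    shared = defaultdict(int)
    for users in users_by_ip.values():
        for u, v in combinations(sorted(users), 2):
            shared[(u, v)] += 1

    # Keep edges at or above the threshold and find connected components.
    graph = defaultdict(set)
    for (u, v), w in shared.items():
        if w >= weight_threshold:
            graph[u].add(v)
            graph[v].add(u)

    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        groups.append(component)
    return groups
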
The authors ran their graph construction algorithm on Hotmail logs (a sign-up log and a login log) from two one-month periods in 2007 and 2008. After constructing the user-user graph, they found 26 million botnet-created user accounts with a false positive rate of 0.44%.

Criticisms & Questions

This was an interesting idea, and it seems to work under the assumptions the authors make. However, given that spammers constantly evolve their techniques, the basic assumption that bot accounts share IP addresses might not hold in the future.

Monday, November 23, 2009

Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks

Summary

This paper proposes Not-a-Bot (NAB), a system that distinguishes human-generated traffic from bot-generated traffic to mitigate email spam, DDoS attacks, and click-fraud. NAB is composed of 2 main components: the attester, which lives on the client machine, and the verifier, which lives on the server that processes the requests. Requests are attested on the client machine as human-generated and then sent to the server, which verifies the attestation and applies application-specific policies to handle the requests.

NAB has 4 main requirements:
  • attestations must be generated automatically in response to human requests
  • attestations must not be transferable from the client where they were generated to attest traffic from another client
  • NAB must benefit its users while not hurting those who do not use it
  • NAB must preserve existing privacy and anonymity of applications
The attester's main job is to decide whether to grant an attestation when an application requests one, and to generate it if so. The attester uses keyboard and mouse activity, along with timing information, to determine whether the action was human-generated. The attestation contains a signature over the entire request content, which ties the attestation to the content it was generated for.
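
A highly simplified sketch of such an attester is shown below. The function and field names, the shared-secret HMAC, and the one-second activity window are all assumptions for illustration; they are not the paper's actual design.

import hashlib
import hmac
import time

# Hypothetical illustration only: the real attester is isolated from the
# (possibly compromised) OS and would use proper signing keys, not a shared secret.
ATTESTER_KEY = b"machine-local-secret"
HUMAN_ACTIVITY_WINDOW = 1.0   # seconds; illustrative value

last_input_time = 0.0  # updated by keyboard/mouse input handlers

def record_human_input():
    global last_input_time
    last_input_time = time.time()

def request_attestation(request_bytes, nonce):
    """Return an attestation only if human input occurred recently.
    nonce is expected to be bytes."""
    if time.time() - last_input_time > HUMAN_ACTIVITY_WINDOW:
        return None  # no recent keyboard/mouse activity: refuse to attest
    # Bind the attestation to the full request content and a nonce.
    msg = nonce + hashlib.sha256(request_bytes).digest()
    return hmac.new(ATTESTER_KEY, msg, hashlib.sha256).hexdigest()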

Attestations always satisfy the following 2 properties:
  1. Non-transferability: prevents an attestation generated on one client from being passed off as if it were generated by another attester
  2. Binding to the content of the request
NAB also includes a verifier that runs on the server processing the requests. The verifier checks that the attestation is valid; once it is verified, the server applies application-specific policies to handle the request. Examples include filtering unattested mail more aggressively to reduce spam, prioritizing attested requests during a DDoS attack, and honoring only clicks with valid attestations to prevent click-fraud.

NAB uses nonces in attestations to provide the following 2 security guarantees (see the verifier sketch after this list):
  • attestations can't be double-spent
  • a bot can't harvest keystrokes and clicks to accumulate attestations beyond a fixed time window
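
A rough sketch of how a verifier might enforce these guarantees, reusing the hypothetical attestation format from the attester sketch above: a signature check, nonce tracking against double-spending, and a freshness window. The constants and names are illustrative, not the paper's implementation.

import time

MAX_ATTESTATION_AGE = 30.0  # seconds; illustrative validity window
seen_nonces = set()         # in practice this set would be bounded and expired

def verify_attestation(attestation, nonce, issued_at, check_signature):
    """check_signature: callable that validates the attester's signature
    over the nonce and request content."""
    if not check_signature(attestation, nonce):
        return False                         # forged, or the content was modified
    if nonce in seen_nonces:
        return False                         # double-spend attempt
    if time.time() - issued_at > MAX_ATTESTATION_AGE:
        return False                         # stale: outside the time window
    seen_nonces.add(nonce)
    return True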

The authors evaluated NAB using keyboard and mouse activity traces from 328 users, together with traces of spam, DDoS, and click-fraud activity. They found that NAB reduced the volume of spam by 92%, deprioritized 89% of bot-originated web activity, and detected bot-originated click-fraud with 87% accuracy.


Criticism & Questions

I think this paper proposed an interesting idea, and if deployed widely enough, it could be useful. One practical concern is that although the authors say the latency overhead is negligible, if many applications begin to require attestations, the overhead could grow large enough to become significant.

I liked that they used real traces of users' keyboard and mouse activity. Their current heuristic for deciding whether an action is human-generated is fairly simple and could be improved with more information about human-computer interaction patterns.

In addition, one of the key requirements of NAB is to benefit users without harming nonusers. However, if a verifier adopts a policy that gives attested traffic higher priority (as in the DDoS example), then nonusers' requests are given lower priority, which effectively degrades their experience.

Thursday, November 19, 2009

Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems

Summary

This paper explores how to reduce power usage in idle systems as effectively as possible. One obstacle to putting idle systems into low-power sleep modes is that doing so causes the system to lose network connectivity and prevents it from running tasks scheduled for periods of low utilization.

After collecting and analyzing data from 250 enterprise users, the authors were able to quantify the value of proxying, determine which applications would benefit from it, and assess how comprehensive a proxy would need to be to be effective.

Their study showed that machines are idle about 50% of the time, which results in 60 TWh/year of wasted electricity, amounting to 6 billion dollars. Comparing idle-time traffic between home and office environments, they found that the average number of packets seen during idle periods is 1 at home and 3 at the office. Based on this, they concluded that any solution that wakes the host to handle incoming network traffic would not be efficient.

Based on the above data, they propose a proxy architecture that supports different idle-time behaviors. When designing the proxy, they classified idle-time traffic along 2 dimensions. The first is whether the traffic needs to be proxied at all:
  • don't-wake protocols: generate sustained, periodic traffic and should be handled without waking the host
  • don't-ignore protocols: require attention to ensure correct operation
  • policy-dependent traffic: how it is handled depends on the tradeoff between sophistication of functionality and energy savings
The second is the complexity required to proxy the traffic:
  • ignorable (drop): can be safely ignored
  • handled via mechanical response: the required response is easy to construct
  • requires specialized processing: usually requires maintaining state
The authors designed the following 4 proxies (a simplified decision sketch follows the list):
  • proxy_1: ignores all traffic listed as ignorable and wakes the machine to handle all other incoming traffic
  • proxy_2: ignores all traffic listed as ignorable, issues responses for protocol traffic that can be handled with mechanical responses, and wakes the machine for all other incoming traffic
  • proxy_3: does (1) and (2) as in proxy_2, but wakes the machine only for traffic belonging to telnet, ssh, VNC, SMB, or NetBIOS and drops all other incoming traffic
  • proxy_4: identical to proxy_3, but also wakes the machine for regular network backups, anti-virus software updates, FTP traffic, and Intel-specific updates
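
A simplified sketch of the per-packet decision such a proxy makes is shown below. The three-way split mirrors the classification above and proxy_3's wake list comes from the summary; the protocol names in the first two sets and the function itself are illustrative, not the paper's implementation.

IGNORABLE = {"multicast_chatter", "broadcast_discovery"}   # illustrative names
MECHANICAL = {"ARP", "ICMP_ECHO", "DHCP_RENEW"}            # answerable by the proxy itself
WAKE_SERVICES = {"telnet", "ssh", "VNC", "SMB", "NetBIOS"} # proxy_3's wake list

def proxy3_decision(packet_protocol):
    """Return one of 'drop', 'respond', or 'wake' for an incoming packet."""
    if packet_protocol in IGNORABLE:
        return "drop"        # sustained/periodic traffic the host need not see
    if packet_protocol in MECHANICAL:
        return "respond"     # proxy constructs the response itself
    if packet_protocol in WAKE_SERVICES:
        return "wake"        # user-facing services: wake the sleeping host
    return "drop"            # proxy_3 drops everything else
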
They find that proxy_1 is inadequate for both home and office environments and that proxy_3 achieves the most sleep across all scenarios.

Criticism & Questions

I think this was a very interesting paper. The methodology seemed sound and the evaluation thorough, especially the classification of idle-time traffic. I will be curious to see how this work progresses and whether it is ever implemented in real-world machines.

Cutting the Electric Bill for Internet Scale Systems

Summary

This paper describes how electricity prices vary with time and geographic location, and how data centers can take advantage of this information to save in electricity costs.

They rely on 2 key observations:
  1. electricity prices vary
  2. large distributed systems already incorporate request routing and replication
Using price information, the goal is to have the system route client requests to clusters so that the total electricity cost is minimized. The achievable savings also depend on energy elasticity: the extent to which a cluster's electricity consumption depends on the load placed on it.
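
As a toy sketch of this routing idea (assuming a fully energy-elastic system, a simple latency bound, and hypothetical latency() and price() helpers), a scheduler might pick the cheapest cluster that still meets a latency constraint:

def route_request(client_location, clusters, latency, price, max_latency_ms=100):
    """clusters: cluster ids with spare capacity.
    latency(client, cluster) -> ms; price(cluster) -> current electricity price.
    Pick the cheapest cluster that still meets the latency bound."""
    candidates = [c for c in clusters
                  if latency(client_location, c) <= max_latency_ms]
    if not candidates:
        # Fall back to the lowest-latency cluster if none meets the bound.
        return min(clusters, key=lambda c: latency(client_location, c))
    return min(candidates, key=price)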

By simulating 29 different locations in the US with historical electricity prices, the authors obtain results suggesting significant cost savings for data centers, including:
  • existing systems can see energy cost savings of at least 2%
  • savings increase rapidly with energy elasticity
  • allowing client-server distances to increase leads to increased savings

Criticism & Questions

This was a very interesting paper to read. I liked how it was rooted in a practical problem and offered a very practical solution. The paper was recently published, but I would be interested to know whether any data centers have tried implementing this approach and what kind of results they are getting.

Scalable Application Layer Multicast

Summary

This paper proposes NICE, a new scalable application-layer multicast protocol designed specifically for low-bandwidth, data-streaming applications.

They used 3 dimensions with which to evaluate application-layer multicast protocols:
  • quality of the data delivery path
  • robustness of the overlay
  • control overhead
The NICE protocol arranges the end hosts into a hierarchy, which defines the multicast overlay data paths. The hierarchical structure is key to scalability and to localizing the effect of member failures.

The hosts are assigned to layers in the hierarchy and, within each layer, grouped into clusters of hosts that are close to one another. The host at the center of its cluster, in terms of network distance to the other members, is designated the cluster leader.
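
A small sketch of that leader-selection rule, with distance() standing in for whatever latency estimate the protocol actually uses: pick the member whose worst-case distance to the rest of the cluster is smallest.

def cluster_leader(members, distance):
    """Pick the member minimizing its maximum distance to the other members,
    i.e. the member closest to the cluster's center."""
    return min(
        members,
        key=lambda m: max((distance(m, o) for o in members if o != m), default=0.0),
    )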

The authors ran several simulations comparing their protocol with the Narada application-layer multicast protocol. The results showed that NICE had lower link stress, improved end-to-end latencies, and similar failure-recovery properties, all with an order of magnitude less control traffic.

They also deployed their protocol in a wide-area testbed and found that nodes maintained low-latency paths with a maximum packet loss rate of 1% or less, and that the average control overhead was less than 1 Kbps.

Criticisms & Questions

This was another simple but clever idea. It would be interesting to see a direct comparison of this protocol with SRM. I did like that the paper included both simulations and a real deployment, and that the results were evaluated along multiple dimensions.

A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing

Summary

This paper describes Scalable Reliable Multicast (SRM), an application-level reliable multicast framework. SRM draws on Application Level Framing (ALF), Light-Weight Sessions (LWS), the group delivery model of IP multicast, and core design principles of TCP/IP: best-effort delivery, reliability on an end-to-end basis, and adaptive parameters.

In sender-based reliability schemes such as TCP, the sender is responsible for ensuring reliable data delivery. This requires collecting ACKs (or NACKs) from receivers and maintaining reception state. With multicast, this approach would cause an ACK implosion at the sender. In addition, the IP multicast model imposes a level of indirection between the sender and receivers, making it very difficult for the sender to maintain state for all receivers.

In the wb framework, receivers are responsible for reliability. When a receiver detects that it is missing data, it waits a random amount of time determined by its distance from the original sender and then multicasts a repair request to the group. Any host that has the data can multicast the repair to the group, and all other hosts missing the packet hear the request (or the repair) and cancel their own repair-request timers. This suppresses duplicate repair requests and avoids a request and response implosion.
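
The sketch below illustrates this request-suppression logic in schematic form. The class, the backoff constants c1 and c2, and the uniform backoff interval are written from my reading of the scheme and are illustrative rather than the paper's exact algorithm.

import random

class SrmReceiver:
    """Illustrative sketch of SRM-style repair-request suppression."""

    def __init__(self, distance_to_sender, c1=2.0, c2=2.0):
        self.d = distance_to_sender      # estimated one-way delay to the sender
        self.c1, self.c2 = c1, c2        # backoff parameters
        self.pending_requests = {}       # seqno -> scheduled fire time

    def on_loss_detected(self, seqno, now):
        # Wait a random time in [c1*d, (c1+c2)*d] before multicasting a request.
        delay = random.uniform(self.c1 * self.d, (self.c1 + self.c2) * self.d)
        self.pending_requests[seqno] = now + delay

    def on_request_or_repair_heard(self, seqno):
        # Another host asked for (or resent) the same data: suppress our request.
        self.pending_requests.pop(seqno, None)

    def due_requests(self, now):
        # Requests whose timers expired without being suppressed.
        return [s for s, t in self.pending_requests.items() if t <= now]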

The authors simulate this framework on several different node topologies, including chains and stars. They found that the request/repair scheme worked well for bounded-degree trees where every node participates in the multicast session, but less well for sparse sessions. It also works less well when the degree bound is very high.

A suggested optimization is local recovery: setting a TTL on a repair-request packet so that only the neighborhood affected by the loss sees it.

Criticisms & Questions

The idea proposed in this paper is simple but very clever. It exploits the properties of multicast to reduce the number of packets in the network, and could exploit even more of them for further optimization. The simulations had good results, but I would like to see SRM implemented in a real-world environment to see how it performs.

Thursday, November 12, 2009

X-Trace: A Pervasive Network Tracing Framework

Summary

This paper describes the design and implementation of X-Trace, a tracing framework that gives users a comprehensive view of service behavior. The motivation for X-Trace is the lack of a single diagnostic tool that spans all layers and applications, which makes it hard for users to understand the interactions between different services and protocols, and therefore hard to find where a problem originated.

A user starts X-Trace when initiating an application task. X-Trace tags the request with a task identifier that is propagated to all layers and protocols involved, so that X-Trace can determine all the operations connected to that task; the result is a task tree.

The network path generally traverses multiple administrative domains (ADs), and each AD may not wish to share internal information. X-Trace accommodates this by cleanly separating the user who requests a trace from the recipients of the trace data. The portion of the trace relevant to each party is sent to that party if it has X-Trace enabled, and each party can do with the data what it wishes. Even if not all parties enable X-Trace, the system still provides horizontal slices of the task tree.

There are 3 main design principles behind X-Trace (a simplified sketch of the metadata handling follows the list):
1. Trace requests should be sent in-band: the metadata is added to the same datapath being traced
2. Collected trace data should be sent out-of-band: this allows data to be reported even when failures occur along the path
3. The entity requesting the trace is decoupled from the entity receiving the trace data: this allows each AD to control its own information.
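
A very rough sketch of the in-band metadata idea, with hypothetical field and function names; the real X-Trace metadata also carries parent/operation identifiers and defined propagation primitives for moving to the next hop and down to lower layers, which are not modeled here.

import os
from dataclasses import dataclass

@dataclass
class XTraceMetadata:
    task_id: bytes      # identifies the whole user task
    op_id: bytes        # identifies this particular operation

    def next_op(self):
        """New metadata for the next operation in the same task, e.g. the
        next hop at the same layer or a call into a lower layer."""
        return XTraceMetadata(self.task_id, os.urandom(4))

def start_task():
    """Called when the user initiates a task; tags the first request."""
    return XTraceMetadata(task_id=os.urandom(8), op_id=os.urandom(4))

def report(metadata, info, send_report):
    """Send trace data out-of-band to the local AD's report collector."""
    send_report({"task": metadata.task_id.hex(),
                 "op": metadata.op_id.hex(),
                 "info": info})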

Critique & Questions

I think X-Trace is an interesting concept. I could see it being very useful for network operators as well as users. Of course, its usefulness increases with the number of ADs on the path that deploy it. I am curious whether it was ever implemented across all layers in a real-world network.