Summary
The paper describes a new diagnostic system that gives operators more detail about what went wrong. The system has two goals: to gather diagnostic information without any application-specific knowledge, and to use inference to determine which entities in the network are impacting each other.
They analyze several example problems and characterize each by three attributes: what was impacted, the symptom, and the identified cause. Their inference model uses history to gauge the current impact. They use a strategy similar to probabilistic inference (i.e., Prob(D = Dnow | S = Snow)) and make the assumption that a high conditional probability indicates causality.
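To make the history-based estimate concrete, here is a minimal sketch of how Prob(D = Dnow | S = Snow) could be approximated from past observations. It is my own simplification, not the paper's procedure: each component's state is reduced to a single number, and the function name `estimate_conditional` and the tolerance `tol` are hypothetical.

```python
from typing import Optional, Sequence


def estimate_conditional(
    history_s: Sequence[float],
    history_d: Sequence[float],
    s_now: float,
    d_now: float,
    tol: float = 0.1,  # hypothetical similarity threshold
) -> Optional[float]:
    """Empirical estimate of Prob(D = d_now | S = s_now) from paired history."""
    if len(history_s) != len(history_d):
        raise ValueError("histories must be the same length")
    # Time steps in the past where S looked the way it does now.
    matching = [i for i, s in enumerate(history_s) if abs(s - s_now) <= tol]
    if not matching:
        return None  # no comparable history, so no estimate
    # Among those time steps, how often did D also look the way it does now?
    hits = sum(1 for i in matching if abs(history_d[i] - d_now) <= tol)
    return hits / len(matching)
```

When S has never been in a state close to its current one, the sketch returns `None` rather than guessing, since there is no history to condition on.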
The system, NetMedic, is composed of three main functions: capturing the current component state, generating a dependency graph, and diagnosing using the component state and the dependency graph.
They evaluate NetMedic in many different situations and on many different faults. NetMedic performed well: it identified the correct component 80% of the time, and the number is only slightly lower for simultaneously occurring faults.
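The three functions can be pictured as a small pipeline. The sketch below is illustrative only: all names (`capture_state`, `build_graph`, `diagnose`, the `impact` callback) are hypothetical, and the diagnosis step only scores direct dependencies of the affected component, which is far simpler than the ranking the paper describes.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]      # hypothetical: variable name -> current value
Graph = Dict[str, List[str]]  # component -> components it depends on


def capture_state(readings: Dict[str, State]) -> Dict[str, State]:
    """Step 1: snapshot each component's current variables."""
    return {name: dict(variables) for name, variables in readings.items()}


def build_graph(edges: List[Tuple[str, str]]) -> Graph:
    """Step 2: turn (source, destination) pairs into a dependency map."""
    graph: Graph = {}
    for src, dst in edges:
        graph.setdefault(dst, []).append(src)
        graph.setdefault(src, [])
    return graph


def diagnose(
    affected: str,
    graph: Graph,
    states: Dict[str, State],
    impact: Callable[[State, State], float],
) -> List[Tuple[str, float]]:
    """Step 3: rank the direct dependencies of `affected` by impact score."""
    scored = [
        (src, impact(states[src], states[affected]))
        for src in graph.get(affected, [])
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The `impact` callback is where a history-based estimate of how strongly one component influences another would plug in.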
Criticism & Questions
I enjoyed this paper. The authors laid out their assumptions up front and suggested how those could affect their results. They did evaluate NetMedic in many different situations, but unfortunately they weren't able to test it on real faults and had to make do with faults they injected. I would be interested in seeing how it would perform when met with real faults happening in real time (i.e., many simultaneous and overlapping faults).
Thursday, September 17, 2009
Real fault testing is difficult. It is hard to know what you don't know (e.g., uncovering the last bug).