Dustin Stevens-Baier
Comp 578
11-13-06

Assignment #11

1. Compare and contrast the different techniques for anomaly detection that were presented in Section 10.1.2. In particular try to identify circumstances in which the definitions of anomalies used in the different techniques might be equivalent or situations in which one might make sense, but another would not. Be sure to consider different types of data.

Proximity-based and density-based techniques both use a distance-based approach, and neither requires prior knowledge of the data set's distribution. Many proximity-based techniques use distance-based outlier detection: the anomalous objects are the ones that are farthest from most other points. Density-based techniques identify anomalous objects as those lying in regions of low density. A drawback of both the proximity-based and density-based approaches is that the computational cost becomes high when the data set is large. Model-based techniques are used when the distribution of the data is known in advance. If the distribution is not known and there is no training set, another technique should be used. A drawback of model-based techniques is that the model can be very difficult to build, especially when the data distribution is not known. Most of the statistical tests also apply only to a single attribute, which makes them hard to use on data with many attributes.
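The two distance-based definitions above can be compared with a minimal sketch. It assumes the proximity score is the distance to the k-th nearest neighbor and the density score is the inverse of the average distance to the k nearest neighbors; the function names, the toy points, and k = 2 are all made up for illustration.

```python
import math

def knn_distances(points, index, k):
    """Sorted distances from points[index] to its k nearest neighbors."""
    p = points[index]
    dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != index)
    return dists[:k]

def proximity_score(points, index, k):
    """Distance to the k-th nearest neighbor; larger means more anomalous."""
    return knn_distances(points, index, k)[-1]

def density_score(points, index, k):
    """Inverse of the average distance to the k nearest neighbors.

    Smaller means more anomalous. Assumes no duplicate points, since an
    average distance of zero would divide by zero.
    """
    nearest = knn_distances(points, index, k)
    return 1.0 / (sum(nearest) / len(nearest))

# Hypothetical data: one tight cluster plus a single far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
k = 2

prox = [proximity_score(points, i, k) for i in range(len(points))]
dens = [density_score(points, i, k) for i in range(len(points))]

# Index of the most anomalous point under each definition.
most_distant = max(range(len(points)), key=lambda i: prox[i])
least_dense = min(range(len(points)), key=lambda i: dens[i])
print(most_distant, least_dense)
```

On this toy data both definitions pick the same point, because the density here is computed directly from the neighbor distances. Each score needs the distance from a point to every other point, which is where the cost on large data sets comes from.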

8. Many statistical tests for outliers were developed in an environment in which a few hundred observations was a large data set. We explore the limitations of such approaches.

(a) For a set of 1,000,000 values, how likely are we to have outliers according to the test that says a value is an outlier if it is more than three standard deviations from the average?

About 2,700 outliers for a data set of 1,000,000 values. This follows from standard statistics: assuming the values are normally distributed, about 0.27% of them lie more than 3 standard deviations above or below the mean, and 0.0027 * 1,000,000 = 2,700.
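The count can be checked with a short calculation. This sketch assumes the values follow a normal distribution, which is the usual assumption behind the three-standard-deviation test; the variable names are illustrative.

```python
import math

n = 1_000_000

# P(|Z| > 3) for a standard normal Z, using the complementary error function:
# P(|Z| > z) = erfc(z / sqrt(2)).
p_outlier = math.erfc(3 / math.sqrt(2))

# Expected number of values flagged as outliers among n values.
expected_outliers = n * p_outlier
print(round(p_outlier, 4), round(expected_outliers))  # prints: 0.0027 2700
```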

(b) Does the approach that states an outlier is an object of unusually low probability need to be adjusted when dealing with large data sets? If so, how?

It sometimes does need to be adjusted. In certain situations, 2700 outliers could be far too many, for example if a part has a very good success rate. In other situations it is not enough, for example if the part has a very bad success rate. The threshold would have to be adjusted according to the actual failure rate of the part. For instance, if the failure rate is 1 in 1,000,000, then 2700 parts being deemed failures is far too high. However, if the failure rate is 1 in 10, then 2700 parts being deemed failures is far too low. The goal is to have the outliers best predict the parts that fail.
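One possible way to make such an adjustment is to pick the number of standard deviations so that the expected number of flagged values matches the number of failures one actually expects. The sketch below assumes normally distributed values and a two-sided test; the helper `two_sided_threshold` is hypothetical and only illustrates the idea.

```python
from statistics import NormalDist

def two_sided_threshold(n, expected_outliers):
    """Return z such that about `expected_outliers` of n normal values
    lie more than z standard deviations from the mean (both tails)."""
    tail = expected_outliers / n / 2  # probability mass in each tail
    return NormalDist().inv_cdf(1 - tail)

# The usual rule: about 2700 flagged values in 1,000,000.
z_default = two_sided_threshold(1_000_000, 2700)

# A stricter rule for a part that fails about 1 time in 1,000,000.
z_strict = two_sided_threshold(1_000_000, 1)

print(z_default, z_strict)
```

With the first call the threshold comes out at roughly 3 standard deviations, and the stricter target pushes it to nearly 5.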

9. The probability density of a point x with respect to a multivariate normal distribution having mean μ and covariance matrix Σ is given by the equation

prob(x) = 1 / ((sqrt(2*Pi))^m * |Σ|^.5) * e^(-(x-μ) * Σ^-1 * (x-μ) / 2)

NOTE: There is a typo in the above equation, as noticed by Cory, and it appears he is right. The last component needs to be (x-μ)^T.

Using the sample mean x_bar and covariance matrix S as estimates of the mean μ and covariance matrix Σ, respectively, show that the log prob(x) is equal to the Mahalanobis distance between a data point x and the sample mean x_bar plus a constant that does not depend on x.

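Take the ln of both sides. Then simplify the right-hand side down to -ln(((2*Pi)^.5)^m * |S|^.5) - .5*((x-x_bar) * S^-1 * (x-x_bar)^T)

Then move ln(prob(x)) to the right side of the equation and (x-x_bar) * S^-1 * (x-x_bar)^T to the left side, which gives (x-x_bar) * S^-1 * (x-x_bar)^T = -2*ln(prob(x)) - 2*ln(((2*Pi)^.5)^m * |S|^.5). The left side is the Mahalanobis distance between x and x_bar. Equivalently, ln(prob(x)) is -.5 times the Mahalanobis distance plus the constant -ln(((2*Pi)^.5)^m * |S|^.5), which does not depend on x.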
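The relationship can be checked numerically: for any x, ln(prob(x)) plus half the Mahalanobis distance should equal the same constant. The sketch below works in two dimensions only, takes the Mahalanobis distance without a square root, and uses a made-up mean and covariance matrix standing in for x_bar and S.

```python
import math

def mahalanobis(x, mean, cov):
    """(x - mean) * cov^-1 * (x - mean)^T for 2-dimensional x (no square root)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # inverse of a 2x2 matrix
    dx = [x[0] - mean[0], x[1] - mean[1]]
    return sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))

def log_prob(x, mean, cov):
    """ln of the 2-dimensional normal density, evaluated from the density formula."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    m = 2  # number of dimensions
    norm = math.sqrt(2 * math.pi) ** m * math.sqrt(det)
    density = math.exp(-0.5 * mahalanobis(x, mean, cov)) / norm
    return math.log(density)

# Hypothetical sample mean and covariance matrix.
mean = (1.0, 2.0)
cov = ((2.0, 0.5), (0.5, 1.0))
det = 2.0 * 1.0 - 0.5 * 0.5

# The constant term: -ln((sqrt(2*Pi))^m * |S|^.5) with m = 2.
constant = -math.log(math.sqrt(2 * math.pi) ** 2 * math.sqrt(det))

# ln(prob(x)) + .5 * mahalanobis(x) should be the same constant for every x.
residuals = [
    log_prob(x, mean, cov) + 0.5 * mahalanobis(x, mean, cov)
    for x in [(1.0, 2.0), (0.0, 0.0), (3.0, 1.5)]
]
print(residuals, constant)
```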
