Dustin Stevens-Baier
Assignment #3

18. This exercise compares and contrasts some similarity and distance measures.

(a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001
y = 0100011000

L1 = |0 - 0| + |1 - 1| + |0 - 0| + |1 - 0| + |0 - 0| + |1 - 1| + |0 - 1| + |0 - 0| + |0 - 0| + |1 - 0| = 3

f₀₁ = 1
f₁₀ = 2
f₀₀ = 5
f₁₁ = 2

J = f₁₁ / (f₀₁ + f₁₀ + f₁₁) = 2 / (1 + 2 + 2) = 2/5 = 0.4
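The counts above can be checked mechanically. The following is a minimal sketch, assuming the vectors are given as bit strings; the helper name match_counts is illustrative and not part of the exercise.

```python
def match_counts(x, y):
    """Return (f00, f01, f10, f11) for two equal-length bit strings.

    fab counts the positions where x has bit a and y has bit b.
    """
    counts = {"00": 0, "01": 0, "10": 0, "11": 0}
    for a, b in zip(x, y):
        counts[a + b] += 1
    return counts["00"], counts["01"], counts["10"], counts["11"]


x = "0101010001"
y = "0100011000"

f00, f01, f10, f11 = match_counts(x, y)

# Hamming distance: number of positions where the bits differ.
hamming = f01 + f10

# Jaccard similarity: 1-1 matches over all positions that are not 0-0 matches.
jaccard = f11 / (f01 + f10 + f11)

print(f00, f01, f10, f11)  # 5 1 2 2
print(hamming, jaccard)    # 3 0.4
```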

(b) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain. (Note: The Hamming measure is a distance while the other three measures are similarities, but don't let this confuse you.)

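The Hamming distance is more similar to the Simple Matching Coefficient, since Simple Matching Coefficient = 1 - (Hamming distance / number of bits). Unlike the Jaccard similarity, it does not ignore the 0-0 matches. The Jaccard similarity is more similar to the cosine measure, because both ignore 0-0 matches.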
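One way to see which measures depend on 0-0 matches is to append extra positions where both vectors are 0 and watch which values change. This is a minimal sketch using the vectors from part (a); the function names and the choice of ten padding positions are assumptions made for illustration. It relies on the relation SMC = 1 - (Hamming distance / number of bits).

```python
from math import sqrt


def hamming(x, y):
    """Number of positions where the two binary vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def smc(x, y):
    """Simple Matching Coefficient: matching positions over all positions."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


def jaccard(x, y):
    """Jaccard similarity: 1-1 matches over positions that are not 0-0."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    return f11 / (f11 + hamming(x, y))


def cosine(x, y):
    """Cosine similarity: dot product over the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]

# Pad both vectors with ten extra 0-0 positions (illustrative choice).
x_pad = x + [0] * 10
y_pad = y + [0] * 10

for name, f in [("SMC", smc), ("Jaccard", jaccard), ("cosine", cosine)]:
    print(name, f(x, y), f(x_pad, y_pad))

# SMC moves with the padding; Jaccard and cosine stay the same.
print(1 - hamming(x, y) / len(x))  # equals smc(x, y) up to rounding
```

The raw Hamming distance is unchanged by the padding, but the normalized form Hamming distance / number of bits shrinks, which is why SMC rises.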

(c) Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

f₁₁ matches should not occur because we are dealing with separate species. The other three should occur a lot. Since the Jaccard similarity ignores f₀₀, it isn't a good idea to use it. The Hamming distance will take these into account.

(d) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)

Because we are comparing the same species, the dominant factor will be f₁₁. I would not use the Hamming distance because it doesn't seem to give a good representation: it would take f₀₀ into account, and we don't want that. It seems like the best one would actually be correlation, since it gives a decent distribution that can get close to 1.

19. For the following vectors, x and y, calculate the indicated similarity or distance measures.

(a) x = (1,1,1,1), y = (2,2,2,2)

Cosine: (1*2 + 1*2 + 1*2 + 1*2)/[(1²+1²+1²+1²)^(1/2)*(2²+2²+2²+2²)^(1/2)] = 1

Correlation: mean X = (1/4)(1+1+1+1) = 1
   mean Y = (1/4)(2+2+2+2) = 2
   cov = 1/(4-1) [(1-1)(2-2)+(1-1)(2-2)+(1-1)(2-2)+(1-1)(2-2)] = 0
   std X = [(1/(4-1))[(1-1)²+(1-1)²+(1-1)²+(1-1)²]]^(1/2) = 0
   std Y = [(1/(4-1))[(2-2)²+(2-2)²+(2-2)²+(2-2)²]]^(1/2) = 0

   corr = 0/0 (undefined)

Euclidean: [(1-2)²+(1-2)²+(1-2)²+(1-2)²]^(1/2) = 2

(b) x = (0,1,0,1), y = (1,0,1,0)

Cosine: (0*1 + 1*0 + 0*1 + 1*0)/[(0²+1²+0²+1²)^(1/2)*(1²+0²+1²+0²)^(1/2)] = 0

Correlation: mean X = (1/4)(0+1+0+1) = 1/2
   mean Y = (1/4)(1+0+1+0) = 1/2
   cov = 1/(4-1) [(0-1/2)(1-1/2)+(1-1/2)(0-1/2)+(0-1/2)(1-1/2)+(1-1/2)(0-1/2)] = -1/3
   std X = [(1/(4-1))[(0-1/2)²+(1-1/2)²+(0-1/2)²+(1-1/2)²]]^(1/2) = 3^(1/2)/3
   std Y = [(1/(4-1))[(1-1/2)²+(0-1/2)²+(1-1/2)²+(0-1/2)²]]^(1/2) = 3^(1/2)/3

   corr = (-1/3)/[(3^(1/2)/3)(3^(1/2)/3)] = -1

Euclidean: [(0-1)²+(1-0)²+(0-1)²+(1-0)²]^(1/2) = 2

Jaccard: 0/(2+2+0) = 0

(c) x = (0,-1,0,1), y = (1,0,-1,0)

Cosine: (0*1 + (-1)*0 + 0*(-1) + 1*0)/[(0^2+(-1)^2+0^2+1^2)^(1/2)*(1^2+0^2+(-1)^2+0^2)^(1/2)] = 0

Correlation: mean X = (1/4)(0-1+0+1) = 0
   mean Y = (1/4)(1+0-1+0) = 0
   cov = 1/(4-1) [(0-0)(1-0)+(-1-0)(0-0)+(0-0)(-1-0)+(1-0)(0-0)] = 0
   std X = [(1/(4-1))[(0-0)^2+(-1-0)^2+(0-0)^2+(1-0)^2]]^(1/2) = (2/3)^(1/2)
   std Y = [(1/(4-1))[(1-0)^2+(0-0)^2+(-1-0)^2+(0-0)^2]]^(1/2) = (2/3)^(1/2)

   corr = 0/[((2/3)^(1/2))((2/3)^(1/2))] = 0

Euclidean: [(0-1)^2+(-1-0)^2+(0+1)^2+(1-0)^2]^(1/2) = 2

(d) x = (1,1,0,1,0,1), y = (1,1,1,0,0,1)

Cosine: (1*1 + 1*1 + 0*1 + 1*0 + 0*0 + 1*1)/[(1^2+1^2+0^2+1^2+0^2+1^2)^(1/2)*(1^2+1^2+1^2+0^2+0^2+1^2)^(1/2)] = 3/4

Correlation: mean X = (1/6)(1+1+0+1+0+1) = 2/3
   mean Y = (1/6)(1+1+1+0+0+1) = 2/3
   cov = 1/(6-1) [(1-2/3)(1-2/3)+(1-2/3)(1-2/3)+(0-2/3)(1-2/3)+(1-2/3)(0-2/3)+(0-2/3)(0-2/3)+(1-2/3)(1-2/3)] = 1/15
   std X = [(1/(6-1))[(1-2/3)^2+(1-2/3)^2+(0-2/3)^2+(1-2/3)^2+(0-2/3)^2+(1-2/3)^2]]^(1/2) = 2*((3/5)^(1/2))/3
   std Y = [(1/(6-1))[(1-2/3)^2+(1-2/3)^2+(1-2/3)^2+(0-2/3)^2+(0-2/3)^2+(1-2/3)^2]]^(1/2) = 2*((3/5)^(1/2))/3

   corr = (1/15)/[(2*((3/5)^(1/2))/3)(2*((3/5)^(1/2))/3)] = 1/4

Jaccard: 3/(1+1+3) = 3/5

(e) x = (2,-1,0,2,0,-3), y = (-1,1,-1,0,0,-1)

Cosine: (2*(-1) + (-1)*1 + 0*(-1) + 2*0 + 0*0 + (-3)*(-1))/[(2^2+(-1)^2+0^2+2^2+0^2+(-3)^2)^(1/2)*((-1)^2+1^2+(-1)^2+0^2+0^2+(-1)^2)^(1/2)] = 0

Correlation: mean X = (1/6)(2-1+0+2+0-3) = 0
   mean Y = (1/6)(-1+1-1+0+0-1) = -1/3
   cov = 1/(6-1) [(2-0)(-1+1/3)+(-1-0)(1+1/3)+(0-0)(-1+1/3)+(2-0)(0+1/3)+(0-0)(0+1/3)+(-3-0)(-1+1/3)] = 0
   std X = [(1/(6-1))[(2-0)^2+(-1-0)^2+(0-0)^2+(2-0)^2+(0-0)^2+(-3-0)^2]]^(1/2) = 3*(2/5)^(1/2)
   std Y = [(1/(6-1))[(-1+1/3)^2+(1+1/3)^2+(-1+1/3)^2+(0+1/3)^2+(0+1/3)^2+(-1+1/3)^2]]^(1/2) = (1/3)*6^(1/2)

   corr = 0/[(3*(2/5)^(1/2))((1/3)*6^(1/2))] = 0
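The hand calculations for exercise 19 can be cross-checked with a short script. This is a minimal sketch, assuming the 1/(n-1) sample convention used in the worked answers, integer vector components, and x = (0,-1,0,1) for part (c), which is the vector used in that part's arithmetic. The function names are illustrative and do not come from the textbook.

```python
from fractions import Fraction
from math import sqrt


def cosine(x, y):
    """Cosine similarity: dot(x, y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def euclidean(x, y):
    """Euclidean (L2) distance."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation(x, y):
    """Pearson correlation using sample covariance and variances (1/(n-1)).

    Exact fractions keep a zero covariance exactly zero. Returns None when
    either standard deviation is zero, i.e. the undefined 0/0 case.
    """
    n = len(x)
    mean_x, mean_y = Fraction(sum(x), n), Fraction(sum(y), n)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    if var_x == 0 or var_y == 0:
        return None
    return float(cov) / sqrt(var_x * var_y)


def jaccard(x, y):
    """Jaccard similarity for binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return f11 / (f11 + mismatches)


parts = {
    "a": ((1, 1, 1, 1), (2, 2, 2, 2)),
    "b": ((0, 1, 0, 1), (1, 0, 1, 0)),
    "c": ((0, -1, 0, 1), (1, 0, -1, 0)),
    "d": ((1, 1, 0, 1, 0, 1), (1, 1, 1, 0, 0, 1)),
    "e": ((2, -1, 0, 2, 0, -3), (-1, 1, -1, 0, 0, -1)),
}

for label, (x, y) in parts.items():
    line = f"({label}) cosine={cosine(x, y)} corr={correlation(x, y)} euclid={euclidean(x, y)}"
    # Jaccard is only meaningful when both vectors are binary.
    if set(x) | set(y) <= {0, 1}:
        line += f" jaccard={jaccard(x, y)}"
    print(line)
```

Returning None for part (a) is a design choice to flag the undefined correlation explicitly rather than raise an error or report a number.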

1. Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible. The bibliographic notes and book Web site provide pointers to visualization software.

It seemed to make a lot of sense to use the classic iris data set. I also used the Jump 6.0 free trial download, since I already have this program for pattern recognition. A good reason to use Jump is that the sample data is already included, so we just need to do some analyzing and graphing.

Here is a picture of the sample data inside Jump. Below are the bivariate fits of all the different combinations of petal length, petal width, sepal width, and sepal length.