Dustin Stevens-Baier
Assignment #3


18. This exercise compares and contrasts some similarity and distance measures.

(a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001
y = 0100011000

L1 = |0 - 0| + |1 - 1| + |0 - 0| + |1 - 0| + |0 - 0| + |1 - 1| + |0 - 1| + |0 - 0| + |0 - 0| + |1 - 0| = 3

f01 = 1
f10 = 2
f00 = 5
f11 = 2

J = f11 / (f01 + f10 + f11) = 2 / (1 + 2 + 2) = 2/5 = .4
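
The counts above can be checked with a short script. This is only a sketch: the helper name match_counts is made up for illustration, and the vectors are typed in as bit strings.

```python
# Binary vectors from part (a), written as bit strings.
x = "0101010001"
y = "0100011000"

def match_counts(a, b):
    """Return (f00, f01, f10, f11) for two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    counts = {"00": 0, "01": 0, "10": 0, "11": 0}
    for bit_a, bit_b in zip(a, b):
        counts[bit_a + bit_b] += 1
    return counts["00"], counts["01"], counts["10"], counts["11"]

f00, f01, f10, f11 = match_counts(x, y)
hamming = f01 + f10                  # number of positions that differ
jaccard = f11 / (f01 + f10 + f11)    # 0-0 matches are ignored

print(f00, f01, f10, f11)  # 5 1 2 2
print(hamming, jaccard)    # 3 0.4
```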




(b) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain. (Note: The Hamming measure is a distance while the other three measures are similarities, but don't let this confuse you.)

The Hamming distance is more similar to the Simple Matching Coefficient, since SMC = 1 - (Hamming distance / number of bits). Both take the 0-0 matches into account, unlike the Jaccard similarity. The Jaccard similarity is more similar to the cosine measure, because both ignore the 0-0 matches.
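
A small sketch of this comparison, using the vectors from part (a). The function name measures is made up for illustration, and the code assumes each vector contains at least one 1. Padding both vectors with extra 0-0 positions changes the SMC but leaves the Jaccard and cosine values alone.

```python
import math

def measures(x, y):
    """Return (hamming, smc, jaccard, cosine) for two binary lists."""
    n = len(x)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    hamming = n - f11 - f00                    # positions that differ
    smc = (f11 + f00) / n                      # counts 0-0 matches
    jaccard = f11 / (n - f00)                  # ignores 0-0 matches
    cosine = f11 / math.sqrt(sum(x) * sum(y))  # ignores 0-0 matches
    return hamming, smc, jaccard, cosine

x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]

base = measures(x, y)
padded = measures(x + [0] * 10, y + [0] * 10)  # ten extra 0-0 positions

print(base)
print(padded)
```

For the original vectors the SMC equals 1 - 3/10, which is the relation between SMC and the Hamming distance.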



(c) Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

f11 matches should not occur because we are dealing with separate species. The other three counts should occur a lot. Since the Jaccard similarity ignores f00, it isn't a good idea to use it. The Hamming distance will take these into account.

(d) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)

Because we are comparing members of the same species, the dominant factor will be f11. I would not use the Hamming distance because it doesn't seem to give a good representation: it would take f00 into account, and we don't want that. It seems like the best measure would actually be correlation, since it gives a decent distribution that can get close to 1.

19. For the following vectors, x and y, calculate the indicated similarity or distance measures.

(a) x = (1,1,1,1), y = (2,2,2,2)

Cosine:
(1*2 + 1*2 + 1*2 + 1*2)/[(1^2+1^2+1^2+1^2)^(1/2)*(2^2+2^2+2^2+2^2)^(1/2)] = 1


Correlation: X = 1/4(1+1+1+1) = 1
             Y = 1/4(2+2+2+2) = 2
             cov = 1/(4-1) [(1-1)(2-2)+(1-1)(2-2)+(1-1)(2-2)+(1-1)(2-2)] = 0
             std X = [(1/(4-1))[(1-1)^2+(1-1)^2+(1-1)^2+(1-1)^2]]^(1/2) = 0
             std Y = [(1/(4-1))[(2-2)^2+(2-2)^2+(2-2)^2+(2-2)^2]]^(1/2) = 0

             corr = 0/0 (undefined)


Euclidean: [(1-2)^2+(1-2)^2+(1-2)^2+(1-2)^2]^(1/2) = 2

(b) x = (0,1,0,1), y = (1,0,1,0)

Cosine:
(0*1 + 1*0 + 0*1 + 1*0)/[(0^2+1^2+0^2+1^2)^(1/2)*(1^2+0^2+1^2+0^2)^(1/2)] = 0

Correlation: X = 1/4(0+1+0+1) = 1/2
             Y = 1/4(1+0+1+0) = 1/2
             cov = 1/(4-1) [(0-1/2)(1-1/2)+(1-1/2)(0-1/2)+(0-1/2)(1-1/2)+(1-1/2)(0-1/2)] = -1/3
             std X = [(1/(4-1))[(0-1/2)^2+(1-1/2)^2+(0-1/2)^2+(1-1/2)^2]]^(1/2) = 3^(1/2)/3
             std Y = [(1/(4-1))[(1-1/2)^2+(0-1/2)^2+(1-1/2)^2+(0-1/2)^2]]^(1/2) = 3^(1/2)/3

             corr = (-1/3)/[(3^(1/2)/3)(3^(1/2)/3)] = -1


Euclidean: [(0-1)^2+(1-0)^2+(0-1)^2+(1-0)^2]^(1/2) = 2


Jaccard: 0/(2+2+0) = 0

(c) x = (0,-1,0,1), y = (1,0,-1,0)

Cosine:
(0*1 + (-1)*0 + 0*(-1) + 1*0)/[(0^2+(-1)^2+0^2+1^2)^(1/2)*(1^2+0^2+(-1)^2+0^2)^(1/2)] = 0


Correlation: X = 1/4(0-1+0+1) = 0
             Y = 1/4(1+0-1+0) = 0
             cov = 1/(4-1) [(0-0)(1-0)+(-1-0)(0-0)+(0-0)(-1-0)+(1-0)(0-0)] = 0
             std X = [(1/(4-1))[(0-0)^2+(-1-0)^2+(0-0)^2+(1-0)^2]]^(1/2) = (2/3)^(1/2)
             std Y = [(1/(4-1))[(1-0)^2+(0-0)^2+(-1-0)^2+(0-0)^2]]^(1/2) = (2/3)^(1/2)

             corr = 0/[((2/3)^(1/2))((2/3)^(1/2))] = 0


Euclidean: [(0-1)^2+(-1-0)^2+(0+1)^2+(1-0)^2]^(1/2) = 2

(d) x = (1,1,0,1,0,1), y = (1,1,1,0,0,1)

Cosine:
(1*1 + 1*1 + 0*1 + 1*0 + 0*0 + 1*1)/[(1^2+1^2+0^2+1^2+0^2+1^2)^(1/2)*(1^2+1^2+1^2+0^2+0^2+1^2)^(1/2)] = 3/4


Correlation: X = 1/6(1+1+0+1+0+1) = 2/3
             Y = 1/6(1+1+1+0+0+1) = 2/3
             cov = 1/(6-1) [(1-2/3)(1-2/3)+(1-2/3)(1-2/3)+(0-2/3)(1-2/3)+(1-2/3)(0-2/3)+(0-2/3)(0-2/3)+(1-2/3)(1-2/3)] = 1/15
             std X = [(1/(6-1))[(1-2/3)^2+(1-2/3)^2+(0-2/3)^2+(1-2/3)^2+(0-2/3)^2+(1-2/3)^2]]^(1/2) = 2*((3/5)^(1/2))/3
             std Y = [(1/(6-1))[(1-2/3)^2+(1-2/3)^2+(1-2/3)^2+(0-2/3)^2+(0-2/3)^2+(1-2/3)^2]]^(1/2) = 2*((3/5)^(1/2))/3

             corr = (1/15)/[(2*((3/5)^(1/2))/3)(2*((3/5)^(1/2))/3)] = 1/4


Jaccard: 3/(1+1+3) = 3/5

(e) x = (2,-1,0,2,0,-3), y = (-1,1,-1,0,0,-1)

Cosine:
(2*(-1) + (-1)*1 + 0*(-1) + 2*0 + 0*0 + (-3)*(-1))/[(2^2+(-1)^2+0^2+2^2+0^2+(-3)^2)^(1/2)*((-1)^2+1^2+(-1)^2+0^2+0^2+(-1)^2)^(1/2)] = 0


Correlation: X = 1/6(2-1+0+2+0-3) = 0
             Y = 1/6(-1+1-1+0+0-1) = -1/3
             cov = 1/(6-1) [(2-0)(-1+1/3)+(-1-0)(1+1/3)+(0-0)(-1+1/3)+(2-0)(0+1/3)+(0-0)(0+1/3)+(-3-0)(-1+1/3)] = 0
             std X = [(1/(6-1))[(2-0)^2+(-1-0)^2+(0-0)^2+(2-0)^2+(0-0)^2+(-3-0)^2]]^(1/2) = 3*(2/5)^(1/2)
             std Y = [(1/(6-1))[(-1+1/3)^2+(1+1/3)^2+(-1+1/3)^2+(0+1/3)^2+(0+1/3)^2+(-1+1/3)^2]]^(1/2) = (2/3)^(1/2)

             corr = 0/[(3*(2/5)^(1/2))((2/3)^(1/2))] = 0
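
The hand calculations in problem 19 can be cross-checked numerically. The sketch below uses the same 1/(n-1) convention for covariance and standard deviation as the work above. The function names are made up for illustration, and part (c) is entered as x = (0,-1,0,1), which is the vector the worked calculation actually uses. Correlation returns None when a standard deviation is 0, which is the 0/0 case from part (a).

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def correlation(x, y):
    """Pearson correlation with 1/(n-1) scaling; None if a std dev is 0."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if std_x == 0 or std_y == 0:
        return None  # 0/0: the correlation is undefined
    return cov / (std_x * std_y)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

pairs = {
    "a": ((1, 1, 1, 1), (2, 2, 2, 2)),
    "b": ((0, 1, 0, 1), (1, 0, 1, 0)),
    "c": ((0, -1, 0, 1), (1, 0, -1, 0)),
    "d": ((1, 1, 0, 1, 0, 1), (1, 1, 1, 0, 0, 1)),
    "e": ((2, -1, 0, 2, 0, -3), (-1, 1, -1, 0, 0, -1)),
}

for label, (x, y) in pairs.items():
    print(label, cosine(x, y), correlation(x, y), euclidean(x, y))
```

Because of floating-point rounding, values such as -1 or 0 may print with a tiny error, so compare them with a tolerance rather than exactly. In part (e) the covariance is 0, so the correlation also comes out as 0.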


1. Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible. The bibliographic notes and book Web site provide pointers to visualization software.



It seemed to make a lot of sense to use the classic iris data set. I used the JMP 6.0 free trial download, since I already have this program for Pattern Recognition. A good reason to use JMP is that the sample data is already included, so we just need to do some analyzing and graphing.





Here is a picture of the sample data inside JMP. Below are the bivariate fits of all the different combinations of petal length, petal width, sepal width, and sepal length.

























Here are some histograms along with some data from JMP.














Here is a pie chart, which isn't all that useful in this case since the sample has the same number of observations for each type of flower.





There are also CDF plots that come from the histograms.
















There are also some stem-and-leaf plots.
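
For readers without the plotting program, a stem-and-leaf display is easy to build by hand. The sketch below is only an illustration: the function name stem_and_leaf is made up, the ten values are placeholder measurements in centimeters rather than output from the analysis above, and the code assumes non-negative values with one decimal place.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines: stem = integer part, leaf = tenths digit."""
    stems = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)               # e.g. 5.1 -> 51
        stems[tenths // 10].append(tenths % 10)
    return [
        f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]

# Placeholder sample (assumption): substitute a real column such as sepal length.
sample = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]

for line in stem_and_leaf(sample):
    print(line)
# 4 | 466799
# 5 | 0014
```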









