Dustin Stevens-Baier
Comp 578
9-7-06

Assignment #2

3.    You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: "It's so simple that I can't believe that no one has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best selling product had the worst satisfaction since it had the most complaints. Could you help me set him straight?"

A.    Who is right, the marketing director or his boss? If you answered his boss, what would you do to fix the measure of satisfaction?

The marketing director is wrong and his boss is correct. The raw number of complaints about a product cannot determine its level of satisfaction relative to other products, because it ignores how many units were sold. If the best-selling product receives the most complaints, that does not mean it has the worst customer satisfaction. The number of items sold largely determines how much customer feedback a product receives: the more you sell, the bigger the sample from which both negative and positive feedback is drawn.

B.    What can you say about the attribute type of the original product satisfaction attribute?

The marketing director is correct that a count of complaints is a ratio attribute, but he does not normalize it, so it does not allow a proportional comparison with other products. A better measure would track the number of complaints per item sold.
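As a rough illustration of the complaints-per-item-sold idea, the sketch below ranks three products by complaint rate. The product names and all sales and complaint figures are invented for the example, and `complaint_rate` is just a helper name chosen here.

```python
# Hypothetical figures, invented for illustration only.
products = {
    "A": {"sold": 10000, "complaints": 200},  # best seller, most complaints
    "B": {"sold": 500, "complaints": 50},
    "C": {"sold": 2000, "complaints": 20},
}


def complaint_rate(stats):
    """Complaints per item sold."""
    return stats["complaints"] / stats["sold"]


rates = {name: complaint_rate(stats) for name, stats in products.items()}

# Rank from lowest complaint rate (most satisfied) to highest.
ranking = sorted(rates, key=rates.get)

for name in ranking:
    print(f"{name}: {rates[name]:.1%} complaints per sale")
```

With these made-up numbers, product A has the most complaints in absolute terms but only the second-highest complaint rate, which is the point the boss was making.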

4. (a) Is the marketing director in trouble? Will his approach work for generating an ordinal ranking of the product variations in terms of customer preference? Explain.

Yes, he is in trouble, because a scenario can be devised where we have a contradiction. For instance, if a person says 1 > 2 and 2 > 3 but then says 3 > 1, no ordinal ranking can be obtained from these answers. The first two comparisons give 1 > 2 > 3, but the last one contradicts that by saying 3 > 1. The third comparison only needs to be asked when the first two leave a pair undetermined: for instance, given 1 > 2 and 3 > 2, we know that 2 is lowest and we still need to find out how 1 and 3 rank.
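The contradiction described above can be checked mechanically. The following is a minimal sketch, assuming each answer is recorded as a (better, worse) pair; `find_ranking` is a hypothetical helper, and it uses a standard topological sort (Kahn's algorithm), which is a technique brought in for the example rather than anything from the assignment.

```python
from collections import defaultdict


def find_ranking(preferences):
    """Return a best-to-worst order consistent with every (better, worse)
    pair, or None if the pairs contain a cycle (a contradiction)."""
    graph = defaultdict(set)
    indegree = defaultdict(int)
    items = set()
    for better, worse in preferences:
        items.update((better, worse))
        if worse not in graph[better]:
            graph[better].add(worse)
            indegree[worse] += 1

    # Kahn's algorithm: repeatedly take an item that nothing remaining beats.
    ready = sorted(item for item in items if indegree[item] == 0)
    order = []
    while ready:
        item = ready.pop(0)
        order.append(item)
        for worse in sorted(graph[item]):
            indegree[worse] -= 1
            if indegree[worse] == 0:
                ready.append(worse)

    # If a cycle exists, some items never become ready.
    return order if len(order) == len(items) else None


consistent = find_ranking([(1, 2), (2, 3)])
contradictory = find_ranking([(1, 2), (2, 3), (3, 1)])
print(consistent, contradictory)
```

Note that a returned order is only one ordering consistent with the answers. For the pairs 1 > 2 and 3 > 2 the sketch still returns an order, even though the answers do not say how 1 and 3 compare.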

(b) Is there a way to fix the marketing director's approach? More generally, what can you say about trying to create an ordinal measurement scale based on pairwise comparisons?

I don't think that this approach makes the most sense, since the customers should be able to compare all of the variations at once. However, sometimes there are too many variations to get a reasonable ordinal ranking that way. In that case you need conditional questions that are asked only when their answers are needed. This way you can avoid the contradictions that can arise with the marketing director's approach.

(c) For the original product evaluation scheme, the overall rankings of each product variation are found by computing its average over all test subjects. Comment on whether you think that this is a reasonable approach. What other approaches might you take?

I think that the original idea would work okay as long as the value we are averaging is measured uniformly across the entire scope of items. For instance, if every customer rates their experience on a scale of 1-10 then it would be fine, but if some items or some customers are rated on a different scale then it wouldn't work. There would be obvious issues with this idea as well: there would be people who rate everything a 10 or a 1. Another approach would be to take the number of complaints made about each item and divide it by the number of purchases; the items could then be ordered by that rate.

5. Can you think of a situation in which identification numbers would be useful for prediction?

I can think of a couple of ways, if the ID were tied to age or location. For instance, a phone number is a kind of ID number, and area codes give you a good prediction of location. Also, check numbers are a form of ID and give a good prediction of time and possibly of the age of the person with the account.

14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.

Based on what I know about elephants, as one of these measurements increases they all would, so I would use a measure that shows how strongly they are correlated with one another. A good choice would probably be Pearson's correlation coefficient. This would enable us to predict some of the values from the others. For instance, if we know the weight and height, then we can probably predict the other attributes from the correlation data we have.
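To make the correlation idea concrete, here is a small sketch of Pearson's correlation coefficient computed between two attributes. The weight and height values are invented for illustration and are not real elephant data; `pearson` is a helper written for this example.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements for five elephants (made up for the example).
weight_kg = [2700, 3100, 3500, 4000, 4400]
height_m = [2.3, 2.5, 2.6, 2.8, 3.0]

r = pearson(weight_kg, height_m)
print(f"correlation between weight and height: {r:.3f}")
```

A value of r close to 1 would support the claim that the attributes rise together. Because the coefficient subtracts the means and divides by the spread, the differing units of the attributes (kilograms versus meters) do not affect it.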

15. You are given a set of m objects that is divided into K groups, where the i^th group is of size m_i. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes?

(a) We randomly select n*m_i/m elements from each group.

(b) We randomly select n elements from the data set, without regard for the group to which an object belongs.

In the first scheme you get a much better proportional representation of each group, something that can't be said for the second one, because n < m. It is possible that the number of groups K is a lot larger than the sample size. This means that you may not be able to sample each group proportionately. Another issue is that as the size of n grows closer to m, you lose any advantage you once had by sampling.
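The two schemes can be compared with a short sketch. The group names, group sizes, and sample size below are invented for the example, and rounding n*m_i/m to the nearest whole number is an assumption made here, since the shares need not be integers.

```python
import random


def stratified_sample(groups, n, rng):
    """Scheme (a): draw about n * m_i / m objects from each group."""
    m = sum(len(members) for members in groups.values())
    sample = {}
    for name, members in groups.items():
        k = round(n * len(members) / m)  # assumption: round to nearest integer
        sample[name] = rng.sample(members, k)
    return sample


def simple_random_sample(groups, n, rng):
    """Scheme (b): draw n objects from the pooled data, ignoring groups."""
    pooled = [(name, obj) for name, members in groups.items() for obj in members]
    return rng.sample(pooled, n)


rng = random.Random(0)  # fixed seed so the run is repeatable
groups = {"large": list(range(900)), "small": list(range(900, 1000))}

strat = stratified_sample(groups, 50, rng)
srs = simple_random_sample(groups, 50, rng)
srs_counts = {name: sum(1 for g, _ in srs if g == name) for name in groups}

print({name: len(objs) for name, objs in strat.items()})
print(srs_counts)
```

Scheme (a) always takes 45 objects from the large group and 5 from the small one here, while the per-group counts under scheme (b) vary from run to run and can under-represent or miss a small group entirely.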

16. Consider a document-term matrix, where tf_ij is the frequency of the i^th word in the j^th document and m is the number of documents. Consider the variable transformation that is defined by tf'_ij = tf_ij * log(m/df_i), where df_i is the number of documents in which the i^th term appears, which is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

(a) What is the effect of this transformation if a term occurs in one document?

In this situation it means that the term will be weighted very heavily.

For example, let the frequency be 2, the total number of documents be 6, and the term appear in 1 document.
2*log(6/1) = 1.56 (using log base 10)

In every document?

If the term occurs in every document, its weight will be zero, since log(m/m) = 0.

For example, let the frequency be 2, the total number of documents be 6, and the term appear in 6 documents.
2*log(6/6)= 0
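The two worked cases can be reproduced with a few lines of code. This is a minimal sketch that assumes a base-10 logarithm, which the question does not specify; `idf_weight` is a helper name chosen for the example.

```python
import math


def idf_weight(tf, m, df):
    """tf' = tf * log10(m / df); the base-10 logarithm is an assumption."""
    return tf * math.log10(m / df)


one_doc = idf_weight(tf=2, m=6, df=1)     # term appears in one document
every_doc = idf_weight(tf=2, m=6, df=6)   # term appears in every document

print(round(one_doc, 2))    # 1.56
print(round(every_doc, 2))  # 0.0
```

A different logarithm base would rescale every weight by the same constant, so the ordering of terms and the zero weight for a term found in every document would not change.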

(b) What might be the purpose of this transformation?

A problem in data mining, in particular for search engines and other practical purposes on the web, is that common terms that appear on every page should not be weighted as heavily as the more useful terms. For example, if someone were searching for "the baseball", they might be interested in every document that has the word "baseball" in it, but they definitely don't care about every document that has the word "the" in it.
