# Clustering and Classification methods for Biologists

## Data Pre-processing

Depending on the analysis, it may be necessary or desirable to pre-process the data before starting. Pre-processing may be needed because of algorithmic constraints: for example, variables may need to be the same data type, or it may be desirable to use transformations to speed up processing. Pre-processing can include the following.

• Monotonic transformations such as square root or logarithmic. A transformation is monotonic if it never reverses the ordering of the values, and this can be demonstrated graphically: a plot of the original values against the transformed values will not have any peaks or troughs if the transformation is monotonic. The square transformation (change x to x²) is not monotonic if both negative and positive numbers are included. The plot dips down to 0 at x = 0 and increases as x gets increasingly negative or positive.
• Change the data type by degrading it to a less informative type. Information may be lost in a transformation but it is never added. For example, a continuous score can be converted to a binary format using an above/below threshold rule. Similarly, a continuous variable could be reduced to a small number of ordered values. This process is called discretization. Although conceptually simple, the decisions on the placement of class boundaries for the new ordinal variable are potentially complex. How many classes should there be? How are class boundaries defined: equal interval, equal frequency, natural breaks?
• It might be important to reduce the number of variables using a selection/rejection routine. There are automated, stepwise methods that use statistical criteria. Alternatively, Huberty (1994) suggested the use of logical screening (on theoretical, reliability and practical grounds) to screen variables. This is possible if some initial research identifies variables that may have some theoretical link. We may also wish to take into account the reliability of the data, and we should not ignore the practical problems of obtaining data, including cost (time or financial) factors.
• Data reduction using a projection method such as PCA or Sammon mapping. (Sammon mapping uses an iterative process to produce a two-dimensional representation of a data matrix with more than two variables.)
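The first two items in the list can be illustrated with a short sketch in plain Python. The function names (`is_monotonic`, `equal_interval_bins`, `equal_frequency_bins`) and the example values are hypothetical, chosen only for illustration. The sketch assumes the values are not all identical, and it does not treat tied values specially when forming equal-frequency classes.

```python
import math

def is_monotonic(xs, transform):
    """Return True if transform never reverses the ordering of the values in xs."""
    ys = [transform(x) for x in sorted(xs)]
    non_decreasing = all(a <= b for a, b in zip(ys, ys[1:]))
    non_increasing = all(a >= b for a, b in zip(ys, ys[1:]))
    return non_decreasing or non_increasing

def equal_interval_bins(values, n_bins):
    """Assign each value to one of n_bins classes of equal width (classes 0..n_bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    # The maximum value would otherwise land in class n_bins, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign classes so that each class holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

positives = [0.5, 1, 4, 9, 16]
mixed = [-3, -2, -1, 0, 1, 2, 3]

sqrt_ok = is_monotonic(positives, math.sqrt)                 # True
square_mixed_ok = is_monotonic(mixed, lambda x: x ** 2)      # False: dips to 0 at x = 0
square_pos_ok = is_monotonic(positives, lambda x: x ** 2)    # True: no sign change

scores = [1.0, 2.0, 2.5, 3.0, 10.0, 11.0]
by_interval = equal_interval_bins(scores, 3)    # [0, 0, 0, 0, 2, 2]
by_frequency = equal_frequency_bins(scores, 3)  # [0, 0, 1, 1, 2, 2]
```

The two binning rules give different classes for the same scores. Equal intervals leave the middle class empty because the data are clustered at the extremes, while equal frequencies place two values in every class. This is one reason the choice of class boundaries needs care.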

Now try the following self-assessment question. If you need help, look at the resources listed on the section menu page.

##### Data Pre-processing

The following statements all refer to possible data pre-processing actions. Some are correct, others are incorrect. Identify the correct ones.

a) Cosine(X) is a monotonic transformation.
b) The natural logarithm of X is a monotonic transformation.
c) It can be beneficial to use LOG(X+1) rather than LOG(X) as a transformation.
d) Rank(X) is a non-monotonic transformation (replace each value of x with its position in an ordered list).
e) Discretization of a continuous variable is necessary to construct a histogram of frequencies.
f) Information is lost if a variable, such as height, is transformed into two categories: above and below the mean.

Answers

a) Incorrect. Cosine(X) is not monotonic; a plot of Cosine(X) against X is an oscillating curve (it may be necessary to first transform your data to radians).
b) Correct. This is a monotonic transformation.
c) Correct. LOG(0) is not defined and will create a problem with your software. If you add 1 to every number, LOG(0) becomes LOG(1), which is 0.
d) Incorrect. Rank(X) is a monotonic transformation.
e) Correct. The values have to be placed in 'bins' to enable the frequency distribution to be calculated.
f) Correct. Information must be lost because the original values, which could be very diverse, are replaced by only one of two possible values.
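
Several of these answers can be checked numerically. The following is a minimal sketch in plain Python. The helper names (`log_plus_one`, `rank_transform`, `above_mean`) and the example data are hypothetical, and the rank transform shown here breaks ties by original position, which is only one of several possible conventions.

```python
import math

def log_plus_one(values):
    """Apply LOG(X+1), which is defined when X is 0."""
    return [math.log(v + 1) for v in values]

def rank_transform(values):
    """Replace each value with its 1-based position in the sorted list."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

def above_mean(values):
    """Recode each value as 1 (above the mean) or 0 (at or below the mean)."""
    mean = sum(values) / len(values)
    return [1 if v > mean else 0 for v in values]

# a) Cosine falls from 0 to pi and rises again to 2*pi, so it is not monotonic.
cos_values = [math.cos(x) for x in (0.0, math.pi, 2 * math.pi)]
cos_falls_then_rises = cos_values[0] > cos_values[1] < cos_values[2]

# c) LOG(0) is undefined; Python's math.log raises ValueError for it.
try:
    math.log(0)
    log_zero_defined = True
except ValueError:
    log_zero_defined = False

counts = [0, 3, 12, 1, 150]
logged = log_plus_one(counts)   # first element is log(1) = 0.0

# d) Ranks preserve the ordering of the original values.
ranks = rank_transform(counts)  # [1, 3, 4, 2, 5]

# f) Five distinct heights collapse to only two possible values.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
binary = above_mean(heights)    # [0, 0, 0, 1, 1]
```

Sorting the data by `counts` or by `ranks` gives the same order, which is what it means for the rank transformation to be monotonic.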