SBRACO: a little program for Bray-Curtis ordination

Federico Spinazzi
federico@syspr03.disat.unimi.it

Contents

1  Theory of Bray-Curtis ordination
    1.1  The data and the problem as they are coded
    1.2  What to do...
    1.3  ...and how
2  Implementation
    2.1  What SBRACO does.
        2.1.1  Data format
        2.1.2  Output
        2.1.3  Graphs
    2.2  What SBRACO doesn't do.
3  Examples
    3.1  Ecological data
    3.2  Mineralogical data
    3.3  Artificial data

List of Figures

    1  Ordination of the classical Dune Meadow data of [4].
    2  Bray-Curtis ordination of mineral compositions.
    3  Bray-Curtis ordination of artificial data set (1).
    4  Bray-Curtis ordination of artificial data set (2).

1  Theory of Bray-Curtis ordination

The main idea beyond Bray-Curtis ordination is that of D-vegetational space. Beals formalizes this idea as follows:

``Vegetation is known to be closely correlated with the enviroment (including past enviroment), and a vegetational space is required that, in contrast to species-dimensional space, reflects directly the enviromental space.

[...]

Althought individual species do not relate linearly with the enviroment, it might be supposed that some measure of overall vegetational difference between samples might do so. The dimensions of such a vegetational space will be then determined by vectors of vegetational change from point to point (i.e.: from sample to sample)''

The model implied by Bray and Curtis ([3]) was nearly the same, and, it is much a better model than that implied by Euclidean distance, where each species is considered as an axis perpendicular to each other. The accent on Euclidean distance is dictated by the role it has in PCA (recall the equivalence of Principal Coordinate Analysis with Euclidean distance and PCA).

Major faults in BC ordination seem to be the methods originally developed to select reference points for the construction of axes.

Beals, however, describe a robust method to select such points, and the implementatio follows his description.

1.1  The data and the problem as they are coded

We deal with the following general problem: given a set of sites where the number of individuals belonging to each of a set of species have been estimated/counted, tell what are the main enviromental factors that control the distribution of species, if any. Well, ordination techniques try to answer the if any part of the question and often give a good help on the what are one. Of course, knowledge will help ;-).

A file, containing a series of numbers is given you as a present; it looks like this:

o-c-i o-c-i o-c-i ¼o-c-i
optional-row-identifier 0 1 ¼¼ 123
optional-row-identifier 0 31 ¼¼ 10
¼ ¼¼ ¼¼ ¼
optional-row-identifier 0 1 ¼¼ missing
optional-row-identifier 32 231 ¼¼ 2

where o-c-i stands for optional-column-identifier. Assume also that that fine missing is absent from our data.

1.2  What to do...

Ordination techniques are multivariate in nature: the information coded in more1 than one variable is used to explore the data for groupings, regular patterns, relationships, indipendencies and so on. Also, some kind of data reduction is searched for. This allows graphical inspection of data and, hopefully, an improved knowledge on the content of the answer we have been invited to to give in 1.1.

The general idea is that each observation/row (for our question: each site) belongs to a space each variable/column (for our question: each species) is an axis of2.

Such idea bring us to search for meaningfull directions in this space, where meaningfullness is not necessarly mathematical or geometrical. Each method search for direction in a different way. Ok, let's search.

1.3  ...and how

Here I only try to explain how BC ordination (best: BC ordination as actually implemented in SBRACO) search.

First of all, a distance matrix is computed from the data given as input. Beals strongly recommend to use Sorensen distance and so do I. I've added an option for Euclidean distance.

Here the formula (xhj is the number of individuals for the j-th specie in the h-th site):

Sorensen Distance is a `City-Block metric' (also know as Canberra metric) it is computed thinking at the space as a city (think at Manhattan, where, to go from here to there you can follow only perpendicular roads); in this space each dimension (i.e.: each species), contribute to the overall distance in proportion to their relative differences in the two samples. Quoting Beals:

``This necessarly weights an enviromental factor according to the number of species responding to it, as well as to how dramatically they respond.''

Beals suggests to standardize data before computing distance. I've set some standardization avaiable from SBRACO, but my experience is that no-standarization works fine in most cases.

Ok, this is the first step. Now take this beautiful square matrix and search the site (i.e.: the column or the row of this distance matrix) whose distances show the higest variance. Found ? Ok, keep it. Now search, between the remaining sites, the one whose distance show the lower linear regression coefficient with the distance of the first point3. Found it ? Ok, you have the second reference point for the first axis.

With this searching algorithm we try to extract a gradient at a time and so we should be able to examine complex-shaped enviromental gradients from the coordinates we are going to compute.

Now, you can project the i-th sample onto this axis with the following obscure equation:

xi = DAB2+DAi2-BBi2
2DAB
where A is the first reference point, B the second. This equation set the origin in A. There is a way to set it in the middle of the axes:
xi = DAi2-BBi2
2DAB
(1)
Here, we have the aproximation of the D-vegetational space, that is not euclidean, with an euclidean one. Some grade (who knows how much ?) of distortion will then be applied.

We have n numbers and we can do anything with them. We only want to plot them, so it would be fine to have another axis.

Take a coffee or a tea or an italian wine (also think at how much povetry coffee and tea have brought in the southern parts of the world). Then keep on reading.

In order to compute following axes we try to remove the distance accounted by previous axes from the original distance matrix, i.e., we try to construct a residual distance matrix4.

So we use the following equation to ideally-compute a second distance matrix on the axis already computed5 with elements:

RDij =   æ
Ö

Dij2-ODij2
 
where Dij2 is the original distance and ODij2 is the distance accross all the previous axes. Here I have assumed that ODij2 is an euclidean distance because it seems that traslation in an Euclidean space has occured just before (see equation 1).

Ok, from now on it is a matter of looping for how many axes you choose and subtracting a matrix from another. The loop stop (hopefully) if the reference points are too close (or the denominator in (1) will be zero !).

On output you'll have the coordinates of each sample for each of the axis SBRACO had been able to compute and store.

This method of searching is called Variance-Regression by Beals and is the only one actually implemented as it seems to me the most robust of all (Beals recomends it too).

2  Implementation

SBRACO was compiled with the great DJGPP C/C++ compiler for DOS written mainly by DJ Delorie ( www.delorie.com ) as a part6 of the GNU Project ( www.fsf.org ). It is a 32-bit program that will allow you to use up to 4Gb of RAM under DOS and 64Mb in a DOS box under MS-Windows.

The user interface was written with JPTUI, a freeware library in C++ written by Jeepy ( www.teaser.fr/~jpdelprat ).

If you like the program and have some money, fund one of this two project or buy some roses to your love, as you prefer.

My experience with it ranges from an 80486 with 8Mb RAM, MS-DOS, to a P200MMX with 80Mb RAM, Windows NT 4.0. The first machine will work out a file with 800 samples and 102 species in ... I don't remeber, say 5 minutes (the time necessary to disk swapping is very long, with more RAM you can get it faster). On a 64Mb machine with Windows NT I manage to run an ordination of 2758 samples, 18 variables.

Away from tecnhical information remember that you have to provide to SBRACO a DPMI program. It is infact distribuited with a DPMI server called CWSDPMI.EXE for use under DOS. If you use Windows or a DOS emulator such as linux DOS-EMU, you'll needn't it, but put it somewhere in your PATH (please refer to DOS documentation about it) or simply put it in the same directory if you are using DOS. DJGPP programs are smart enough to use long names where they can, i.e. under Windows 95 (not under Windows NT).

2.1  What SBRACO does.

It implements only the Variance-Regression method; you can choose between Sorensen and Euclidean distances (I think the latter could be useful when data come from chemical analysis or so). Standardization provided are:

As you will see there are other distances I've played with in the past (Jaccard, Morosita, Horn) but they all perform bad for me. Also, for binary data, Sorensen distance works fine. I decided to disable them, for the moment.

The program provides you a little editor in main window. Actually it can display only 255 columns, but it can read file of any size (provide it founds enough memory).

2.1.1  Data format

Data must be stored in an ASCII file, each sample on a line, each value must be separated by a combination of one or more tabs or spaces. Missing data are not allowed. Only numbers are to be placed in the file. See class2.dat for an example.

You can ask to order column or rows of the data provided. Don't use BC ordination to ordinate species. Even if some graphs seem very beautiful, I have tried with the Dune Meadow data of [4] and I found that it gives bad results for species. [2] suggests a different approach for species-ordination, but I haven't implemented it, yet. This option is only provided to avoid you the job of trasposing file coded with species as rows ( ;-)).

2.1.2  Output

SBRACO will ask you for a file where to store the coordinates computed and will write them in it. The output is ASCII, as the input. It don't check for disk space while writing, but this shouldn't be a problem.

2.1.3  Graphs

The graphs are of two types: screen preview or graphics file. You need to request a preview if you also want a graphics file. Actually it uses GNUPLOT as a plotting engine. It is a (free) fine package, as you can judge from the graphs in section 3. If you don't want to learn it, simply enjoy the screen preview. Actually only a postscript output is possible from SBRACO, but a little knowledge of GNUPLOT will allow you to produce other output as well. The DOS version coming with SBRACO cannot output `gif' files so I disabled this option. If you can use a Win32 system I recomend to try the 3.6 version compiled for this system. Visit http://science.nas.nasa.gov/~woo/gnuplot/beta/ .

Of course you may use your favourite graphing software.

2.2  What SBRACO doesn't do.

It doesn't accept file with labels for rows or columns; it will not allow you to insert comment in the file. These are the most important reasons for which its developement is actually virtually stopped. I'm rewriting all the I/O routines for this and for memory usage optimization. It cannot deal with missing data ( :-( ).

3  Examples

3.1  Ecological data

As an example I compare BC with CA on the Meadow Data of [4]. The ordination is made on species. Please notice the great difference between the two and the fact that BC ordination give misleading results on them (as I previously pointed out, this happens very often with species).

As a general point, if you want to study both row and column at the same time then Correspondence Analysis is much better. I recomend to try BC ordination if you cannot uncover any gradient in your data by means of PCA, CA or any other method. Maybe it will help ..., maybe not. I don't include graphs here but you should be able to see from the file meadow.dat that the ordination of sites are instead very similar for BC and CA (notice that the sites are, here, the column).



Figure 1: Ordination of the classical Dune Meadow data of [4].

3.2  Mineralogical data

They come from my thesis. 819 chemical composition are divided in 11 mineral classes. Another 1758 comes from an Ice Core.

I used it to check if they can be thought as coming from the same region of compositional space. Also it serves as a well know dataset to test the program and the method7.



Figure 2: Bray-Curtis ordination of mineral compositions. Above 819 samples belonging to 11 different classes; below 1758 samples of unknown class coming from an Ice Core drilled at Summit, Greenland ([8]).

3.3  Artificial data

Here I simulated 2 enviromental gradient and I supposed that they were indipendent. We have 400 sites and 13 species. Imagine two transect that perpendiculary cross at their mid-points. On each transect you have 200 sites, equally spaced. Now suppose that, along each transect, an enviromental gradient exists. This gradient regulates the abundance of the 12 species, while one of the species has an random distribution between the 400 sites (see fig. 3.3). Try to distinguish sites belonging to each transect (here PCA goes really wrong if you delete the specie that has an uniform distribution, fig. 3.3).



Figure 3: Bray-Curtis ordination of artificial data set (1). See text.



Figure 4: Bray-Curtis ordination of artificial data set (2). See text.

References

[1]
Beals E., 1984. Bray-Curtis Ordination: An Effective Strategy for Analysis of Multivariate Ecological Data. Advances in Ecological Research, Vol. 14, 3-55..

[2]
Beals E., 1965. Species Patterns in a Lebanese Potrietum. Vegetatio, 13, 69-87.

[3]
Bray J. R., Curtis J. T., 1957. An ordination of the upland forest of the southern Winsconsin. Ecological Monographies, 27, 325-349.

[4]
Jongman R.H.G., Ter Braak C.J.F., Van Tongeren O.F.R., eds., 1987. Data analysis in community and landscape ecology. Cambridge University Press.

[5]
Hill M. O. 1973. Reciprocal averaging: An eigenvector method of ordination. Journal of Ecology, 61, 237-249.

[6]
Goodall D. W. 1954. Objective methods for the classification of vegetation III. An essay in the use of factor analysis. Australian Journal of Botany, 2, 304-324.

[7]
Austin, M. P., Orloci L., 1966. Geometric models in ecology. II. An evaluation of some ordination techniques. Journal of Ecology, 54, 763-773.

[8]
Maggi V. 1997. Mineralogy of atmospheric microparticles deposited along the Greenland Ice Core Project ice core. Journal of Geophysical Research, 102, 26,725-26,734


Footnotes:

1 in ecology this seems more like a more

2 In the case of an ecological survey, we have already pointed out the fact that the description and the coding of this space could be critical.

3 I refer to the term b in the equation: first = a+b(second)

4 Here Beals promised to give a formula to compute residual for `city-block', but he did ? (Let me know if you have it !).

5 It is a waste of memory to have two matrix allocated and it is EXACTLY what SBRACO actually does. The developement stopped as I realized to have to rewrite all the low-level function for data management and storage. Future version will use a quarter of the memory needed now. It allows also to add other analysis (PCA, COA, ...)

6 It is not directly related to it, I guess.

7 SBRACO allocated about 140Mb for this run.


File translated from TEX by TTH, version 1.57.