Statistics - New Developments, Hampton's Writing

Statistics on the Transcription of James Hampton's Diary

June 3, 2004. Removed Case 1 and 2 statistics, since there seem to have been some errors with them. Still investigating.

May 26, 2004. Added more Case 2 statistics.

May 25, 2004. Added Case 2.

May 14, 2004. Added comments on base case and added Case 1.

May 12, 2004. Opened statistics page. Placed statistics for base case.

Base Case

Here are statistics on the diary transcription, although these do not take into account the # mark for uncertain characters.

List and counts of all graphemes in transcription.
A Lotus .wk1 file of this information
A comparison of discrepancies between these counts and those given in Stamp's paper.

Here are more statistics. The single-character and digram files exclude * (unreadable characters), the # mark for uncertain characters, and the uncommon characters J2, 4, O1, 10, L, 15, Y1, qL1, n, qL0, P3, e, P4, v, Y5, K1 and Y0.

LTCT Output - single-grapheme distribution, digram distribution, doubled characters, and other statistics.
VFQ Output - other single-grapheme statistics and vowel identifications by the Sukhotin algorithm. Characters with a number in the first column are vowels according to the Sukhotin algorithm.
HMM/Sukhotin comparison - a comparison of vowel identifications by the Sukhotin algorithm and the Hidden Markov Model in Stamp's paper.
The most frequent trigrams. (Thanks to Jeff Haley.)
The most frequent 5-grapheme strings. (Thanks to Bruce Grant.)
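
For readers unfamiliar with it, the following is a minimal sketch of the Sukhotin vowel-identification procedure as it is commonly described. It is not the VFQ program that produced the output above. The function name sukhotin_vowels and the input format (lists of grapheme tokens) are assumptions made for illustration.

```python
from collections import defaultdict


def sukhotin_vowels(sequences):
    """Return the graphemes classified as vowels, in order of selection.

    `sequences` is an iterable of token lists (hypothetical input format).
    """
    # Symmetric adjacency counts; doubled graphemes (the diagonal) are ignored.
    adj = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for seq in sequences:
        symbols.update(seq)
        for a, b in zip(seq, seq[1:]):
            if a != b:
                adj[a][b] += 1
                adj[b][a] += 1

    # Every grapheme starts as a consonant; its score is its row sum.
    score = {s: sum(adj[s].values()) for s in symbols}
    vowels = []
    while score:
        # Sorting first makes ties resolve deterministically.
        best = max(sorted(score), key=score.get)
        if score[best] <= 0:
            break
        vowels.append(best)
        del score[best]
        # Each remaining consonant loses twice its adjacency with the new vowel.
        for s in score:
            score[s] -= 2 * adj[s][best]
    return vowels
```

For example, sukhotin_vowels([list("banana")]) picks out "a" as the only vowel. On a real transcription the tokens would be graphemes such as Y3 or vv rather than single letters.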

Discussion

Study of the base case is still ongoing, but some things seem obvious.

The Sukhotin/HMM comparison table suggests that the Sukhotin vowel algorithm and the Hidden Markov Model used in Stamp's paper may in fact be saying the same thing. For standard and phonemic English, the HMM placed consonants in state 0 and vowels in state 1. If we assume, however, that the Sukhotin algorithm does just the opposite, the two are in complete agreement, provided one further assumption is made.

The assumption is that a model value ratio of 1.6 or more is sufficient to definitely place a result in either of the two states. Stamp stated that a ratio of 10 would be necessary, but these results call that into question. The samples of English were very large (around 6 million characters or phonemes) compared to the Hamptonese sample here (about 29,000 graphemes). That may well make the Hamptonese results less definite. Stamp did state that a sample of 10,000 characters is sufficient to get valid results for English, but under what conditions?
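
The effect of the threshold can be illustrated with a small sketch. It assumes that the "model value ratio" means the larger of a grapheme's two per-state model values divided by the smaller; that reading, the function name assign_states, and the input format are all assumptions, and the numbers in the usage note are invented placeholders, not values from the Hamptonese model.

```python
def assign_states(emissions, min_ratio=1.6):
    """Map each symbol to state 0, state 1, or None (undecided).

    `emissions` maps a symbol to a pair (value in state 0, value in state 1).
    A symbol is assigned only if the larger value is at least `min_ratio`
    times the smaller one.
    """
    result = {}
    for symbol, (p0, p1) in emissions.items():
        hi, lo = max(p0, p1), min(p0, p1)
        if hi == 0:
            # No evidence in either state.
            result[symbol] = None
            continue
        ratio = float("inf") if lo == 0 else hi / lo
        if ratio >= min_ratio:
            result[symbol] = 0 if p0 > p1 else 1
        else:
            result[symbol] = None
    return result
```

With the placeholder input {"a": (0.02, 0.05), "b": (0.03, 0.04)}, a threshold of 1.6 places "a" in state 1 and leaves "b" undecided, while a threshold of 10 leaves both undecided. This is the sense in which a lower threshold yields more definite assignments.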

The other observation is the degree to which the /Y3 vv/ digram dominates the distributions. Further investigation shows that even the /Y3 vv Y3 vv/ string is rather dominant. Also, 90% of the occurrences of /Ki/ are in the digram /Ki Ki/. We shall therefore assume that we need to treat these groups as single graphemes, which leads to Case 1.
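
Checks of this kind can be sketched as follows, assuming the transcription has been split into a flat list of grapheme tokens. The helper names ngram_counts and share_in_doubled are hypothetical and are not part of the LTCT or VFQ programs.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive graphemes."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def share_in_doubled(tokens, grapheme):
    """Fraction of occurrences of `grapheme` that sit next to another copy."""
    total = paired = 0
    for i, tok in enumerate(tokens):
        if tok != grapheme:
            continue
        total += 1
        prev_same = i > 0 and tokens[i - 1] == grapheme
        next_same = i + 1 < len(tokens) and tokens[i + 1] == grapheme
        if prev_same or next_same:
            paired += 1
    return paired / total if total else 0.0
```

For instance, ngram_counts(tokens, 2).most_common(10) lists the ten most frequent digrams, and share_in_doubled(tokens, "Ki") gives the share of /Ki/ occurrences that belong to a /Ki Ki/ pair.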

Case 1

The following transformations were applied to the transcription:

   /Y3 vv/        -->  /Y3v/
   /Y3 vv Y3 vv/  -->  /Y6/
   /Ki Ki/        -->  /Ki2/   {/KiKi/ is too ambiguous}

However, the resulting statistics exclude the same graphemes as the Base Case LTCT and VFQ results, and additionally 13, qL, and HH.
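
The merging step can be sketched as below. The order in which the rules were applied is not recorded here, so the sketch assumes the longest pattern is tried first (otherwise /Y3 vv Y3 vv/ could never match once /Y3 vv/ had been merged). The names RULES and merge_graphemes are hypothetical.

```python
# Case 1 transformations, as (pattern, merged grapheme) pairs.
RULES = [
    (("Y3", "vv", "Y3", "vv"), "Y6"),
    (("Y3", "vv"), "Y3v"),
    (("Ki", "Ki"), "Ki2"),
]


def merge_graphemes(tokens, rules=RULES):
    """Replace each matching run of graphemes by its merged grapheme."""
    # Assumption: try longer patterns before shorter ones.
    ordered = sorted(rules, key=lambda rule: len(rule[0]), reverse=True)
    out = []
    i = 0
    while i < len(tokens):
        for pattern, merged in ordered:
            n = len(pattern)
            if tuple(tokens[i:i + n]) == pattern:
                out.append(merged)
                i += n
                break
        else:
            # No rule matched at this position; keep the grapheme as is.
            out.append(tokens[i])
            i += 1
    return out
```

Scanning left to right, a run of three /Ki/ becomes /Ki2 Ki/ rather than /Ki Ki2/; that tie-breaking choice is also an assumption of this sketch.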

Here are the resulting statistics:

LTCT Output.
VFQ Output.
5-Grapheme Strings.

Discussion

It is obvious that /qL3 vv/ is also an important influence on the 5-gram statistics. This leads immediately to Case 2.

Case 2

This transformation is applied:

   /qL3 vv/  --> /qLv/

However, this excludes the same graphemes as in Case 1, and also J.

This is the result:

5-Grapheme Strings.
6-Grapheme Strings.
7-Grapheme Strings.

Discussion

Still considering these results. Perhaps there is a calculation error.

For later.

