Words
In order to analyse the data quantitatively, it is necessary to identify and define what, for the purposes of this study, a misspelling is. Also, to give an indication of the frequency of these misspellings, some unit of writing must be chosen. This unit could range from the largest, a student's two examination scripts, to the (practically) smallest, the number of characters written. The former would involve the least analysis but would produce only a gross, if useful, estimate of misspelling frequency. The latter would involve an overwhelming amount of analysis and produce a perhaps exaggerated and unnecessary delicacy; also, since some misspellings involve the addition or omission of letters, the unit of writing would be influenced by the misspellings themselves. Between these two extremes there are several possible units: the individual script, the paragraph, the sentence, the T-unit, and the word. For present purposes, I have chosen the word.
The definition of a word is problematic under the best circumstances, but it is even more so with the present data. This is principally because many students seem only vaguely aware of the importance of regular spacing between individual words, and are willing to insert spaces in what are generally considered words. I have therefore approached the concept of word intuitively, and decisions once made have been applied consistently throughout the scripts.
Among the decisions made was how much of the scripts to include in the word count. Students are required to write their name, identity number, section number, and teacher’s name at the top of the script. Although there are indeed misspellings involved here, especially in the case of my name, I have discounted this part. On the other hand, I have included the title, although not all students wrote one. In some cases, alternative words, for example have/consist, are both included, and they are both counted. Such extratextual observations as The End, Thank You, and my personal favourite Please see the adverse side have not been included in the word count. If there is evidence that the student was actually prevented from completing the last word in the composition because the examination time had expired, this incomplete word is not considered. At the same time, it is not possible to identify those words in the body of the script which may be incomplete for the same reason.
In many cases, what I consider to be two words have been elided. The most frequent example of this occurs with the indefinite article. Since most students write the indefinite article as a separate word and since this is the orthographic standard, I have counted, for example, alot as two words. Similarly, such forms as inorder are interpreted as two words. On the other hand, legally contracted forms such as it's or doesn't could be considered as one word. However, for the sake of consistency, and since the uncontracted forms also occur, such contracted forms are taken to be two words.
The elision of words is mirrored by division of words. This may be completely illegal, as in a bout, which is counted as one word. In other cases, the illegality is not so clear. Forms such as can not, no body, some thing and, in context, every thing are treated uniformly as one word.
A few other problems arise, but they can be dealt with decisively and expeditiously. Abbreviations such as TV or KFUPM I consider to be one word. Similarly, the few examples of numbers together with any abbreviated or iconic units are counted as a single word. There are a few examples of contiguous repetition of a word; in these cases, only one of the words is included in the word count.
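The counting conventions described above can be sketched in code. The following is only an illustrative approximation: the lookup tables are hypothetical and contain nothing beyond the examples quoted in the text, and the sketch ignores context, whereas the actual decisions (for instance on every thing) were made by reading each script.

```python
# Hypothetical lookup tables built only from the examples quoted in the text.
RUN_TOGETHER = {"alot", "inorder"}      # elided forms, counted as two words
CONTRACTIONS = {"it's", "doesn't"}      # contracted forms, counted as two words
DIVIDED = {("a", "bout"), ("can", "not"), ("no", "body"),
           ("some", "thing"), ("every", "thing")}  # divided forms, counted as one word


def normalise(token: str) -> str:
    """Lower-case a token and strip surrounding punctuation (apostrophes are kept)."""
    return token.lower().replace("\u2019", "'").strip(".,;:!?\"()")


def count_words(text: str) -> int:
    """Count words using the conventions sketched above (context is ignored)."""
    words = [w for w in (normalise(t) for t in text.split()) if w]
    count, i, previous = 0, 0, None
    while i < len(words):
        word = words[i]
        if word == previous:
            # Contiguous repetition: only one of the words is counted.
            i += 1
            continue
        if i + 1 < len(words) and (word, words[i + 1]) in DIVIDED:
            # A divided form such as "can not" is counted as one word.
            count += 1
            previous = None
            i += 2
            continue
        count += 2 if word in RUN_TOGETHER or word in CONTRACTIONS else 1
        previous = word
        i += 1
    return count


print(count_words("Alot of students can not do it. It's true."))
```

A real count would need much larger tables and a human reader to resolve cases such as a bout, which is a legitimate two-word string in some contexts.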
The above procedure, when applied to the 246 examination scripts written by 123 students, yields a total of 43,181 words. The mean number of words per student, taking the midterm and final scripts together, is 351.06. There is a small difference in this mean between the two semesters: 347.59 for the first and 356.86 for the second semester. Similarly, there is a relatively small difference between the means for the two midterm examinations and those for the two final examinations. The first and second semester midterm means are 149.97 and 154.05 respectively. For the final examinations, the means are 197.62 and 197. Therefore, there is a certain consistency in terms of numbers of words written by the first and second semester groups of students. Pages 1 and 2 of Appendix B show the distribution of numbers of words written by students in the four examinations. Remembering that there were more students in the fall semester (77) than in the spring semester (46), the similarity of profile of these distributions between midterm examinations and between final examinations gives some justification for considering the two groups as one homogeneous group.
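The reported means can be cross-checked with a little arithmetic. The sketch below uses only figures quoted in the paragraph above; the variable names are illustrative. It treats the overall mean as a mean per student (both scripts together), which is the reading the totals support.

```python
# Figures quoted in the text.
total_words = 43_181
students = {"fall": 77, "spring": 46}
mean_per_student = {"fall": 347.59, "spring": 356.86}
fall_midterm_mean, fall_final_mean = 149.97, 197.62

n_students = sum(students.values())

# Overall mean per student, computed directly from the total word count.
overall_mean = total_words / n_students

# The same mean recovered by weighting each semester's mean by its student count.
weighted_mean = sum(students[s] * mean_per_student[s] for s in students) / n_students

# The fall midterm and final means should add up to the fall per-student mean.
fall_sum = fall_midterm_mean + fall_final_mean

print(n_students, overall_mean, weighted_mean, fall_sum)
```

The two routes to the overall mean agree to within rounding of the published semester means.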
The striking difference in numbers of words written is not between the two groups of students but between the midterm and final examinations. This is most probably accounted for by the nature of the questions. Whereas the midterm examination questions are explicit and limited, the final examinations both invite the application of ‘your own knowledge and experience’. One might therefore expect more words, and specifically more words not provided in the magazine article, in the final examinations. This could explain the slightly larger number of words in the spring midterm examination, where part of the question is not so explicitly text-based.
Misspellings
An analysis of misspellings involves two problems: definition and identification. Even when a suitable definition of a misspelling has been decided, there remains the problem of identification. The students chosen for the study have in general developed a reasonably legible hand, but there are many cases where the identity of individual letters is ambiguous. This occurs not only with the least proficient handwriters, of whom there are still a few and who have difficulty in forming letters, but it is also found in some of the proficient and fluent handwriters who may have developed identical or very similar forms for pairs of letters such as the letters o and a. I must admit that I used a hybrid for these letters when writing Russian in examinations in an attempt to hide the fact that I did not know the orthographic representation of certain unstressed vowels. However, I have no evidence, or indeed suspicion, that our students intentionally employ this archetypically Phoenician ruse. In cases of ambiguity, I have carefully studied and compared other examples of the letter and letter combinations involved throughout the student’s two scripts. This usually results in disambiguation, but is strictly limited to cases which are beyond reasonable doubt. Therefore, a small number of misspellings probably remain unidentified.
Once letters and words have been interpreted and identified, misspellings in the resultant words must also be identified. The problems involved in this can be illustrated by considering what would happen if all the words in the data were run through a computer spellchecker. I have not actually done this since the input of 43,181 words would have been time-consuming, and, more importantly, the time invested would not have produced the desired results. Such a check would have failed to identify what must for the present purposes be considered misspellings.
Of course, a computer spellcheck would certainly identify those forms which are not included in the dictionary. However, it is by no means clear that each of these forms should be considered misspellings in the present circumstances. For example, the dictionary may be too limited to include technical words such as entrepreneurship and would almost certainly not contain proper nouns like Bahrain. Also, acceptable variant spellings, either within or across geographic regions, may not be allowed. In fact, each potential misspelling would have to be designated as a misspelling or a correct spelling according to personal decisions based on the whole context of its production.
While a computer spellchecker would identify many potential misspellings, it would at the same time fail to identify a considerable number of them. This would occur when a misspelling of a word produces a correct spelling of another word. For example, in the string it dose not works, the word dose may well be admitted as a legal spelling by the dictionary, but it must certainly be included as a misspelling in the present, or indeed any, study. Similarly, tow in the string There are tow factor can reasonably be interpreted as a misspelling.
Of course, the reductio ad absurdum would be to consider Kuwait as a misspelling for Bahrain in the string Kuwait is island. Therefore, it is necessary to decide whether each word, be it legally or illegally spelled, can be reasonably construed, in the light of its grammatical and semantic context, as a failed attempt to spell another word. In the example above, Kuwait is a successful attempt to spell the wrong word; it is not a misspelling. In the example it dose not works, while dose is clearly a misspelling, how should one consider works? I have not designated such incorrect grammatical forms as misspellings, unless of course they are themselves illegally spelled. This applies, for example, to incorrectly used adverb and adjective forms. Similarly, the words affect and effect are not considered misspellings, irrespective of their context.
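The two failure modes of a dictionary lookup described above can be illustrated with a minimal sketch. The word list here is a hypothetical stand-in for a spellchecker's dictionary, not any real product's lexicon, and the function does nothing more than flag forms that are absent from it.

```python
# Hypothetical miniature dictionary standing in for a spellchecker's word list.
DICTIONARY = {
    "it", "does", "dose", "not", "works",
    "there", "are", "two", "tow", "factor",
    "is", "island",
}


def flag_nonwords(text: str) -> list[str]:
    """Return the forms in `text` that are absent from the dictionary."""
    return [word for word in text.lower().split() if word not in DICTIONARY]


# Real-word errors pass unnoticed: "dose" and "tow" are both legal spellings.
print(flag_nonwords("it dose not works"))
print(flag_nonwords("There are tow factor"))

# Correctly spelled forms missing from the dictionary are flagged instead.
print(flag_nonwords("Bahrain is island"))
print(flag_nonwords("entrepreneurship"))
```

Deciding that dose is a failed attempt at does, while Bahrain is not a misspelling at all, requires the grammatical and semantic context that such a lookup never consults.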
The inadequacies of a computer spellcheck for present purposes highlight the necessity of human scanning, during which literal, grammatical and semantic information from the scripts, together with knowledge of the circumstances under which they were written, is used to produce a list of misspellings. Since I am familiar with the work of these students and the particular composition examinations, I have a schema which helps to interpret the scripts. In an extreme example, I am probably the only person who could know with confidence that intrope is a misspelling of entropy because the student who wrote it had used that word throughout the semester at every possible, and indeed impossible, opportunity. A less extreme example is the word Shake, which is more easily seen as a misspelling of Sheikh when it is realised that the word occurs in the article on which the examination question was based. However, I am constantly aware of the danger of imposing my own schema on the student scripts, and have tried at each stage to reinterpret misspellings.
There are fewer than ten words in the whole sample for which I have been unable to assign the presumed intended word. One student bemoans the tedious and butty technical terms which evidently plague his life at KFUPM; whatever butty might mean in this phrase (bitty, precise; batty or dotty, stupid; ‘butty’, inconsistent?), it almost certainly does not refer to sandwiches, and is therefore a misspelling. In another example, bigen to de thrugh, attempts at begin and through can be discerned, but de is a misspelling since it is, to my best knowledge, a nonword. However, apart from these few exceptions, each misspelling has been assigned a word which I feel confident it was supposed to be.
Certain words have been excluded from the list of misspellings for five possible reasons. First, and as outlined above, grammatically or semantically inappropriate forms have not been included unless they are themselves misspelled. Second, as far as variant spellings are concerned, the data only poses questions of American and British standards. Education in English in Saudi Arabian state primary and secondary schools has standard British spelling, which is Education Ministry policy. On the other hand, American spelling, and the American education system, is used in schools of the Saudi ARAMCO company, many of whose graduates become students at KFUPM. Therefore, both British and American standard spellings are considered correct.
The third group of words which might be considered misspellings are those which include incorrect or questionable use or nonuse of capitalization or hyphenation. I have not taken these to be misspellings. Proper nouns are the fourth group of words. Most of these examples are words taken from the texts on which the questions are based or from the questions themselves. Since they were available to the students at the time of writing, they are considered misspellings if they are not spelled exactly as in the text.
Finally, there is one example, batchelor (degree), which was misprinted (or misspelled) in the first semester final examination article, and which several students copied correctly. This is therefore not a misspelling.
Appendix AI contains the 2,061 misspellings, as defined above, found in the 246 scripts. They are listed by student in the order in which they occur in the script. The alphanumeric code accompanying each misspelling identifies the student responsible for the misspelling and the examination in which he produced it. Each student was assigned a number from 001 to 123: 001 to 077 are fall semester students, and 078 to 123 are spring semester students. The midterm examination is represented by the letter M or m, and the final examination by F or f; upper case and lower case letters identify first and second semester students respectively.
The misspellings in Appendix AI are tokens, that is, all occurrences of misspellings in the scripts. Appendix AII was prepared from the first appendix to produce an alphabetical list of types, of which there are 1,444. Since the present study is specifically concerned with misspellings, I have not reduced various inflected forms of a word to one lemma. This is particularly important because inflection of words appears to be a significant and considerable source of misspelling. Therefore, a type subsumes all those tokens which are literally identical, disregarding only capital and small letters. The exception to this is a few homographic multiple entries, where identical tokens represent different words. In such cases, the second entry is asterisked, as in this example:
bay*> pay> ml
factores 14,31,31
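The reduction of tokens to types can be sketched as follows. The sample data is hypothetical: the intended word buy for the second bay entry is invented purely to illustrate a homographic pair, and the real lists are those in Appendices AI and AII.

```python
from collections import Counter

# Hypothetical tokens as (misspelling, presumed intended word) pairs.
tokens = [
    ("factores", "factors"),
    ("Factores", "factors"),  # differs only in capitalisation: same type
    ("bay", "pay"),
    ("bay", "buy"),           # invented homographic entry: a separate type
]


def count_types(pairs):
    """Group tokens into types, ignoring case but keeping homographs apart."""
    return Counter((form.lower(), intended.lower()) for form, intended in pairs)


types = count_types(tokens)
print(len(types), "types from", sum(types.values()), "tokens")
```

Keying each type on the intended word as well as the written form is one way to keep homographic entries apart; inflected forms are deliberately not merged into a single lemma.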
The quantitative information available in Appendices AI to AVI is shown
in the following table of numbers of misspelling types and tokens.
| exam | | | | this exam only |
| M | | | | |
| m | | | | |
| F | | | | |
| f | | | | |
| total | | | | |
There are also differences between the midterm and final examinations as far as types and tokens are concerned. Expressing tokens as an increase over types, the midterm tokens are 24.37% and 26.5% higher in the fall and spring semesters respectively. However, the increases in the final are 41.34% and 32.28%. An examination of appendices AIII to AVI may indicate why types in the final examination produce a significantly larger number of tokens than those in the midterm examination. In the midterm examinations, there are only three types for which there are more than four tokens, and these account for 26 tokens. However, in the final examinations there are twelve types with more than four tokens, and these yield 96 tokens. At least ten of these final examination types are misspellings, such as Ianguge and cources (fall semester) and causway and Bahrin (spring semester) which appear in and are apparently taken from either the question or the article for the examination. Such misspellings I shall call ‘text-derived’, since there is a strong possibility that they were miscopied during the examinations from available printed text.
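The percentages above express the surplus of tokens over types as a proportion of the number of types. A minimal sketch of that calculation follows. Since the per-examination counts are not reproduced here, the example applies the formula only to the overall totals quoted earlier (2,061 tokens and 1,444 types); the function name is illustrative.

```python
def token_increase(types: int, tokens: int) -> float:
    """Percentage by which the number of tokens exceeds the number of types."""
    return (tokens - types) / types * 100


# Overall totals quoted in the text for the whole sample.
overall = token_increase(types=1_444, tokens=2_061)
print(round(overall, 2))
```

The overall figure is not directly comparable with the per-examination percentages, because a type that recurs in several examinations is counted only once in the combined list.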
Of course, there is no practicable way to know if a student has actually referred to a given word in the text. However, the probability that this has happened can be estimated. It is almost completely certain that misspellings such as etrepreneur are text-derived since the word entrepreneur appears in the question and the textbook chapter article and it is extremely unlikely that any of the present students has any previous familiarity with the word. On the other hand, it might seem absurd to consider, for example, th (for ‘the’) as text-derived; not only does the definite article occur in all the present texts, but also its use is indispensable for a grammatically acceptable composition. Nevertheless, this misspelling could at least be considered potentially text-related if it occurs in a string of directly copied source wording. Take, for example, this sentence: Each input unite of a computer system reads data fron a sepsific form and converts it into elecrtonic pulses. The lexical items unit, specific and electronic all occur in the vocabulary component of Preparatory Year English and are extensively practiced and examined. Thus, they might be considered familiar to students, and therefore the misspellings are not obviously text-derived. However, their occurrence in a long string of source wording increases the probability that they are in fact text-derived.
The probability of a misspelling being text-derived increases with the unfamiliarity of the lexical item and the extent to which its context is source wording. Of course, these are highly subjective criteria, but they can be applied using personal intuition informed by knowledge of the texts and the students and tempered by conservative scepticism. Following this procedure, I conservatively estimate the number of misspelling tokens which are text-derived to represent 33% and 28% of all tokens in the midterm examinations and 27% and 27% in the final examinations. There may well be some instances where my intuition has failed and where I have included misspellings which are not in fact text-derived. However, these examples are certainly outweighed by the large number of misspellings which I suspect are text-related but have not included as such. Even my conservative estimates suggest that a considerable amount of the misspelling data is influenced by visual processing.
Since misspelling types in the final examination yield a far higher percentage of tokens than in the midterm examination, and since this appears to be related to text-derived misspellings, it seems surprising that the percentage of text-derived misspellings does not differ significantly, and indeed decreases slightly, in the final examination. What seems to happen is that the midterm examination questions specifically demand the extensive use of a set of lexis taken from the textbook chapter. On the other hand, a much smaller set of lexis is available in the article for the final examination. Therefore, much greater use is made of this smaller number of words. Also, because the article has been prepared before the examination, students may feel that, having invested some time in understanding the text and having that text to hand during the examination, they will make use of as much of it as they can.
The quantitative information available in Appendices AI to AVI suggests
one further fact. The total numbers given in the previous table do not
take account of the different numbers of students in the two semesters.
The following table shows the mean misspelling types and tokens per student.
| exam | | | | |
| M | | | | |
| m | | | | |
| F | | | | |
| f | | | | |
Correlations of misspellings
Appendix C has been prepared in order to consider whether a correlation between misspelling and various measures of student achievement can be discerned. Each page gives a schematic representation of an original, accurate, small scale graph plotting the number of words and the number of misspelling tokens written by individual students in one composition examination.
Each coloured mark represents a student, and the colour of the mark indicates the grade achieved by that student in the test referred to in the title; the colour code is given at the bottom of the page. The vertical axis denotes words written in the examination given in the title, and indicates the mean number of words written, as well as 10% and 20% above and below the mean. The horizontal axis shows the number of misspelling tokens. The coordinates of a point at the centre of the black squares would be the mean number of words written and either the mean or median number of misspellings. In fact, in most cases the mean and median are either identical or have a difference of less than two. Since different examinations produce different ranges of misspellings, the scale on the horizontal axis has been arranged for each examination around the mean or median to produce a spread of students across the whole page. Similarly, the vertical axis ranges from the smallest to the largest number of words written by an individual student in this examination.
The black squares enclose students whose word and misspelling totals are closer to the mean. This serves as an immediate visual indication of a student’s relative performance. A diagonal line through the bottom left and top right corners of the squares would separate students with frequencies of misspellings per word higher and lower than the mean. In other words, students above this line towards the top left of the page are the better spellers; students below the line towards the bottom right are worse spellers.
Using this measure of spelling proficiency, the number of better and worse spellers according to their grades in the relevant measure of student achievement can be counted. Because the spread of students with grade C (average) does not yield significant results, I have counted two groups: students with grades A or B (the higher achievers) and students with grades D or F (the lower achievers).
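The counting procedure just described can be sketched in code. This is an approximation under stated assumptions: the reference frequency is taken to be the mean number of misspellings divided by the mean number of words, the student records are invented for illustration, and a student exactly on the line is arbitrarily classed as a worse speller.

```python
# Hypothetical records: (grade, words written, misspelling tokens).
records = [
    ("A", 200, 5),
    ("B", 180, 12),
    ("C", 170, 9),
    ("D", 150, 14),
    ("F", 160, 6),
]

mean_words = sum(words for _, words, _ in records) / len(records)
mean_misspellings = sum(errors for _, _, errors in records) / len(records)
reference_frequency = mean_misspellings / mean_words


def is_better_speller(words: int, misspellings: int) -> bool:
    """True if the student's misspellings per word fall below the reference."""
    return misspellings / words < reference_frequency


GROUPS = {"A": "A+B", "B": "A+B", "D": "D+F", "F": "D+F"}  # grade C is left out

tally = {"A+B": {"better": 0, "worse": 0}, "D+F": {"better": 0, "worse": 0}}
for grade, words, errors in records:
    group = GROUPS.get(grade)
    if group is None:
        continue  # average (C) students are not counted
    key = "better" if is_better_speller(words, errors) else "worse"
    tally[group][key] += 1

print(tally)
```

The resulting tally has the same shape as the counts of better and worse spellers among higher and lower achievers, without the further split into students inside and outside the black squares.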
Appendix C, pages 1 to 4, gives words and misspelling tokens written in the first semester midterm, first semester final, second semester midterm and second semester final examinations respectively, and correlates these with the final grade achieved by the student. This grade is the grade awarded for English 101 on the basis of midterm and final objective and composition examination results, and therefore represents the most general measure of student achievement in English available.
The following table sets out the number of A or B grade and D or F grade
students who have lower and higher than average misspelling frequencies
in the first semester midterm composition examination, the lower frequency
denoting the better spellers. The fourth and seventh columns (boldface
numbers) show the total number of students; this total includes students
in the second and fifth columns who are outside the larger black square
in the appendix (much lower or higher misspelling frequency) and students
in the third and sixth columns who are inside the square (slightly higher
or lower misspelling frequency).
| grades | | | | | | |
| A + B | | | | | | |
| D + F | | | | | | |
The figures for the first semester final examination, taken from Appendix
C2, are set out below.
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
Appendix C, pages 5 to 8, follow the format of the first four pages
of the appendix, but in this case correlate words and misspellings with
the composition grade achieved by the student. The following five tables
give information from these appendix pages in the same way as the previous
tables. The figures for the first semester midterm examination are
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
Of more immediate interest than composition grades is the correlation
of misspellings with objective examination grades, which may be considered
to reflect reading skills. Appendix C, pages 9 to 12, contain this information,
which is set out as before in the following four tables, along with correlations.
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
The 55% correlation is extraordinary. It is possible that there is something extraordinary in the second semester final objective examination. As described before, the objective examinations contain questions which are not directly ‘reading’ questions, for example grammar questions. It should be remembered that 25% of the second semester final examination was devoted to questions on word prefixes and roots, which were considered to involve reading skills to a minimal extent. Therefore, the correlation of 55% supports the hypothesis that there is a correlation between spelling and reading proficiency, since reading proficiency is not rewarded in a quarter of the questions in the second semester final examination. This may also explain why the correlation with final grades in the final examinations is higher for the better spellers than the worse spellers in the first semester but lower in the second.
A more direct test of reading proficiency is provided by six in-class
reading quizzes based on texts in the English 101 coursebook. The six quizzes
have a total of 70 multiple-choice questions. Appendix C, pages 13 to 16,
give the grades achieved by students in these quizzes and the words and
misspellings written in the four examinations. As before, information from
these four pages is given in the tables below.
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
| grades |
|
|
|
|
|
|
| A + B |
|
|
|
|
|
|
| D + F |
|
|
|
|
|
|
To sum up, the quantitative analysis of the data suggests, as far as the present students are concerned, that the frequency of misspelling is not a simple function of the number of words written. Rather it may plausibly be considered to reflect reading proficiency. The range of correlations for total population between spelling proficiency and individual measures of student achievement is 57% to 83% (0.57 to 0.83) and this is remarkably similar to the range of correlation found by Malmquist, referred to earlier, which is 0.50 to 0.80. I attributed the range to individual student differences and differences between test instruments, and this has apparently been replicated here on a small scale.
Correct spelling does not seem to be an important criterion in the evaluation by their teacher of the students’ writing, and is therefore not a significant problem to be addressed in the syllabus. However, if misspelling is a superficial indication of underlying reading problems, then it is important to consider misspellings in the light of reading strategies which might produce them. The present data is particularly suitable for such an examination since a considerable proportion of it can be considered text-derived, as previously defined.