Introduction to Computational Linguistics
For Kathy Taylor
Computational Linguistics Paper on ASL Machine Translation
What are the basic requirements for putting sign language into a computer? What would such a system provide in terms of functionality and usefulness? Could such a system function as machine translation to or from sign language once technology advances enough to process video images? The requirements for building a functional MT system for sign language may be a tall order given current technology. These are the kinds of questions investigated in this research project. Computerized notational systems for sign language research are available now, and World Wide Web sources will be used for more information relating to sign writing and notational systems.

There are some basic features that a machine translation system for sign language should have, such as the ability to mark up a video image. The system should have search capabilities based on features of sign language, including Non-Manual Signals (facial expressions, head movement, body tilt, etc.). It should include sound as well, because there are verbalizations in sign language just as there are gestures in spoken language. The system should also be capable of doing some transcription of sign language automatically. To do this, it must be able to deal with both citation forms and discourse forms. For example, a single fingerspelled letter would look different from the same letter fingerspelled within a word, due to assimilation and other linguistic effects. To be effective, the MT system would need a dictionary based on phonetic forms, not on English glosses. An iconic interface would be required to access this dictionary in a pre-determined order. For example, the unmarked handshapes, the A-handshape, S-handshape, O-handshape, C-handshape, 1-handshape, B-flat-handshape, and 5-handshape, could be ordered from the most closed handshape (fist-like) to the most open handshape (palm-like).
Requirements like these need to be taken into consideration from the start when building a functional MT system for sign language.
This research paper focuses on the basic requirements for building a functional Machine Translation (MT) system for sign language. In essence, this paper is a requirements analysis for a sign language MT system. The stages such a system will go through are input source data gathering, sign language recognition, text conversion, analysis, transfer, synthesis, and output production in the target language.
Source Data Gathering
Input should consist of audio and visual data captured from a microphone and a camcorder.
The camcorder should record the front of the person signing in order to capture non-manual signals (NMS) and other facial expressions relevant to sign language syntax, morphology, and discourse. This frontal view is called the second person view, while the first person view is the view from the signer's own perspective [Starner 1997]. The second person view should be used because NMS are visible in it but not in the first person view. In one first person setup, for example, a miniature camera was mounted vertically on the flap of a baseball cap [Starner 1997]. The movement of the arms and hands could be recorded, but only the nose could be seen from this position [Starner 1997]; the rest of the face was not recorded at all. Other NMS such as head tilt and shifts in body posture would pose further difficulties: the image from the miniature camera would be disorienting if the head tilted, and if the signer shifted his torso, the movement would be harder to detect relative to the background. A camera mounted on a tripod, by contrast, would provide a stable image, and movement would be easier to detect relative to the background.
Mouthing would also be visible from a second person view. Currently, no linguistic emphasis is being placed on mouthing. Mouthing is considered to be part of contact sign [Lucas & Valli 1992]. Whether or not mouthing is part of natural sign language is debatable. Mouthing itself is not considered to be NMS, except where NMS have lexicalized mouthing. For example, reduced English mouthing in the ASL signs HAVE and FINISH became lexicalized as part of the ASL signs but are not explicitly considered to be NMS [Lucas & Valli 1992]. However, some ASL NMS do have mouthing that did not come from the lexicalization of English mouthing. There are some ASL expressions like PAH! and CHA that are mouthed. PAH! is mouthed with the sign FINALLY. The mouth configuration CHA is mouthed when the meaning of the sign being produced at the time needs to be modified to include the meaning of the word "big" [Lucas & Valli 1992].
Glove systems cannot sense NMS at all. Glove systems are wearable devices that use some sort of fabric (such as lycra) with embedded sensors to cover the hand(s), arm(s), or, in a body suit, the full body [Harling 1995]. For a glove system to sense NMS, the face would have to be completely covered with fabric, and the embedded sensors would have to be dense enough and sensitive enough to detect eyebrow and mouth movements, the most noticeable NMS. Covering the face with fabric would be very uncomfortable, to say the least. The worst case scenario would be trying to record eye gaze movements: contact lenses with embedded sensors would have to be used! However, glove systems can be used for translation of the tactile sign language used by the deaf-blind population. There are existing fingerspelling glove systems that serve as transliterators of spoken language for deaf-blind people [Kramer & Leifer 1987, Kramer & Leifer 1989].
The video image should include the body from the head down to the hips, where most signs are produced. A couple of feet on each side of the body should be allowed for extension of the arms during signing, and half a foot above the head for signs produced on or near the head. The video image should not be obstructed by the microphone itself; any obstruction will prevent a clear reading of the sign language by the machine translation system. A plain background in a light, neutral color should be used to obtain the best image in a video-digitizing application on a computer, such as Adobe Premiere.
Audio input is needed for incidental vocalizations of words and verbalizations in sign language. There are meaningful verbalizations in sign language just as there are meaningful gestures in spoken language. The microphone should be able to pick up any words or verbalizations the signer makes, for example trills, and be discriminating enough to detect audible but non-voiced expressions over background noise, such as the PAH! expression mentioned above. A high-quality clip-on microphone placed on the collar of the signer's shirt could be used.
Parallel Input Processing
Ideally, both audio and visual input would be processed by the same digitizer in parallel.
Sign Language Recognition
Continuing this parallel-processing concept, sign language recognition should be done on a statistical basis with proven Hidden Markov Modeling (HMM) techniques. In spoken language processing, statistical methods are based on the probability that one word will follow another, and HMM techniques have been used for speech recognition for many years [Starner 1994]. HMM techniques have recently been adapted to the visual mode and appear successful to the point of being accepted within the visual recognition field [Starner 1994]. Visual recognition is not as complex as language recognition. HMM techniques can recognize continuous speech, cursive writing, and the hand-signed portion of ASL sentences [Starner 1994]. However, an HMM system needs to be trained on test sets, and the larger the training corpus, the better the accuracy [Starner 1994].
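The decoding step at the heart of an HMM recognizer can be illustrated with a small sketch. Everything here is an invented toy: two hidden states ("IDLE" and "SIGN"), two quantized observation symbols, and made-up probabilities; a real recognizer would operate over continuous video and audio features, but the Viterbi search below is the same in principle.

```python
# Toy HMM decoding for sign recognition: given a sequence of observed feature
# symbols, the Viterbi algorithm recovers the most likely hidden state path.
# States, symbols, and probabilities are invented for illustration only.

states = ["IDLE", "SIGN"]
start_p = {"IDLE": 0.8, "SIGN": 0.2}
trans_p = {"IDLE": {"IDLE": 0.6, "SIGN": 0.4},
           "SIGN": {"IDLE": 0.3, "SIGN": 0.7}}
emit_p = {"IDLE": {"rest": 0.7, "move": 0.3},
          "SIGN": {"rest": 0.2, "move": 0.8}}

def viterbi(observations):
    """Return the most probable hidden state path for the observations."""
    # paths maps each state to (probability, best state sequence ending here)
    paths = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_paths = {}
        for s in states:
            # pick the best predecessor state for s at this step
            prob, prev = max(
                (paths[p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states)
            new_paths[s] = (prob, paths[prev][1] + [s])
        paths = new_paths
    return max(paths.values())[1]

print(viterbi(["rest", "move", "move", "rest"]))
```

A production system would train these probabilities from a signed corpus rather than fixing them by hand, exactly as the text notes that HMM systems must be trained on test sets.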
Parallel HMM Processing
Since HMM techniques have already been used for processing speech and video images separately, these same techniques can be used to process video and audio portions of sign language at the same time. The HMM algorithms should be used to process both the audio and the visual input simultaneously. Processing both audio and visual input simultaneously may produce a higher accuracy rate than processing the audio and visual input separately.
The HMM techniques for this sign language MT should be based on probabilities of signs, NMS, words or verbalizations, not just signs alone. For example, the machine translation system would choose the next sign, NMS, word, or verbalization based on its probability that it would appear most often after a particular sign, NMS, word, or verbalization in a sequence.
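The probability-based selection just described can be sketched as a simple bigram model. The tiny gloss corpus below is invented purely for demonstration; a real system would estimate these counts from a large corpus of transcribed signing.

```python
# Bigram sketch: count how often one event (sign, NMS, word, or verbalization)
# follows another in a training corpus, then predict the most likely next
# event. The three-sentence gloss corpus is invented for illustration.

from collections import Counter, defaultdict

corpus = [
    ["IX-1", "FINISH", "EAT"],
    ["IX-1", "FINISH", "PAH!"],
    ["IX-1", "WANT", "EAT"],
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    for prev, nxt in zip(sentence, sentence[1:]):
        bigram_counts[prev][nxt] += 1

def most_likely_next(event):
    """Return the event that most often followed `event` in the corpus."""
    followers = bigram_counts[event]
    return followers.most_common(1)[0][0] if followers else None

print(most_likely_next("IX-1"))   # FINISH follows IX-1 twice, WANT only once
```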
A restricted sign language grammar helps with sign language recognition by identifying nouns, verbs, and other categories in the input stream based on syntactic structure and sequence. Where applicable, the grammar limits a dictionary lookup to entries within the relevant syntactic class.
Please note that a statistical (for example, using HMM techniques) or a rule-based grammar would only be used to help recognize sign language at this point. The grammar wouldn't be used to translate sign language to another language whether spoken or signed. However, the work done by the grammar can be saved for further use by the MT system.
Please note that a dictionary lookup here is not the same as looking up a word in a dictionary, as common sense would dictate. One critical difference is that the orthographic alphabet is not used within the keyword. The keyword would consist of sign language features chosen so that a dictionary entry can be found as quickly as possible. Handshape, location, and movement features of sign language would be used to minimally recognize a sign or NMS. Words and verbalizations would be recognized based on standard phonological features, with phonetic adaptations to account for typical distortions found in the voices of deaf people.
The dictionary access method should be based on the most restrictive set of features in sign language, namely the handshape first, then the location, and finally the movement. Handshape, location, and movement could be used in that hierarchical order to limit the search in the dictionary. There are fewer handshapes than locations, and fewer locations than movements. Due to assimilation effects, storing two signs that often occur together as a single entity in a dictionary could increase accuracy and speed up processing [Starner 1994]. These groupings would be called bigrams within a grammar [Starner 1994]. Trigrams store three signs [Starner 1994].
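The handshape-then-location-then-movement narrowing can be sketched with a dictionary keyed by feature triples. The feature labels and glosses below are invented placeholders, not a real ASL lexicon.

```python
# Hierarchical dictionary access: entries are keyed by
# (handshape, location, movement) and candidates are narrowed feature by
# feature, most restrictive first. All entries are invented examples.

sign_dictionary = {
    ("B-flat", "chin", "outward"): "THANK-YOU",
    ("B-flat", "chest", "circular"): "PLEASE",
    ("1", "temple", "outward"): "KNOW",
}

def lookup(handshape, location=None, movement=None):
    """Narrow candidate signs by handshape, then location, then movement."""
    candidates = {k: v for k, v in sign_dictionary.items()
                  if k[0] == handshape}
    if location is not None:
        candidates = {k: v for k, v in candidates.items()
                      if k[1] == location}
    if movement is not None:
        candidates = {k: v for k, v in candidates.items()
                      if k[2] == movement}
    return list(candidates.values())

print(lookup("B-flat"))                     # handshape alone leaves two
print(lookup("B-flat", "chin", "outward"))  # all three features settle it
```

Bigrams and trigrams could be accommodated in the same structure by storing a frequently co-occurring pair or triple of signs under a single composite key.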
The range of distorted to clear articulation among the deaf and hard-of-hearing population has to be taken into consideration when processing the words or verbalizations found in the audio stream. There may not be any significant audio input if the signer chooses to be quiet. Otherwise, standard speech dictionary lookup techniques can be used with additions to the dictionary of words and verbalizations used by deaf people with sign language.
One of the problems with a dictionary of this kind is how to order it. The interface of an online fingerspeller at the World Wide Web address www.iwaynet.net/~ggwiz/asl was impressive because animated pictures of real hands fingerspelling a single letter were used [Gay 1998]. This interface was ordered by the fingerspelled alphabet, but handshapes that are not part of the fingerspelled alphabet are not included. For example, the "feeling finger" is not included. The "feeling finger" is the handshape in which the second finger sticks forward at an angle from the rest of the extended fingers.
There is an online animated ASL dictionary at the World Wide Web address www.bconnex.net/~randys that is impressive, but it also has a limitation [Stine 1997]: within each handshape category, the entries are ordered by gloss in alphabetical order. Using glosses to order a dictionary is not acceptable for a sign language MT system; the dictionary must be ordered by features of the actual sign itself.
The handshape feature alone will not be enough to order the entries in a dictionary. There are many signs that start with the same handshape, but are located differently or move differently.
The order could follow the fingerspelled alphabet, with handshapes that are not part of the fingerspelled alphabet appended to the end. An alternative would be to order handshapes by how open or closed they are relative to one another. For example, the unmarked handshapes, the A-handshape, S-handshape, O-handshape, C-handshape, 1-handshape, B-flat-handshape, and 5-handshape, could be ordered as listed from the most closed handshape (fist-like) to the most open handshape (palm-like). Unmarked handshapes could also be ordered before marked handshapes.
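The closed-to-open ordering could be implemented as a simple sort key. The numeric "openness" ranks below are an assumption made only to demonstrate the sort; they are not an established phonetic scale.

```python
# Sort handshapes from most closed (fist-like) to most open (palm-like),
# following the unmarked-handshape ordering proposed in the text. Marked
# handshapes, absent from the rank table, sort to the end. Ranks are
# illustrative assumptions, not an established scale.

OPENNESS_RANK = {"A": 0, "S": 1, "O": 2, "C": 3, "1": 4, "B-flat": 5, "5": 6}

def handshape_order(handshapes):
    """Order handshapes closed-to-open; unranked (marked) shapes go last."""
    return sorted(handshapes,
                  key=lambda h: OPENNESS_RANK.get(h, len(OPENNESS_RANK)))

print(handshape_order(["5", "A", "C", "R", "O"]))
```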
Ordering by location has the same problem as ordering by handshape. There are many signs in the same location that differ only in their handshape or movement.
Contacting signs could be ordered by their locations from the head down to the hips, or vice versa. Signs made in neutral signing space could be ordered by their locations on the X, Y, and Z axes.
Ordering solely on movement is not feasible because there are signs that move the same way but have different handshapes or locations. It is difficult to order by movement because there are an indefinite variety of movements.
These movements would have to be classified according to some common movement feature. There could be movement along the three axes X, Y, and Z. The type of movement could be straight, curved, or oscillating. The handshape itself may open or close. There could be multiple movements occurring simultaneously or in alternation. As these possibilities show, ordering by movement is complex, and the MT system would have to look up a sign based on prior knowledge of a predetermined order.
Clearly, all three features, the handshape, the location, and the movement, have to be used as keywords to look up a sign in a dictionary. The handshape could be the primary index, the location the secondary index, and the movement the tertiary index for sorting signs within a dictionary.
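The three-level index can be sketched as a compound sort over rank tables. All rank values and entries below are invented placeholders for whatever feature ordering the system finally adopts.

```python
# Three-level dictionary ordering: handshape is the primary index, location
# the secondary, and movement the tertiary. Rank tables and entries are
# invented for illustration.

HANDSHAPE_RANK = {"A": 0, "S": 1, "O": 2, "C": 3, "1": 4, "B-flat": 5, "5": 6}
LOCATION_RANK = {"head": 0, "chin": 1, "chest": 2, "hip": 3, "neutral": 4}
MOVEMENT_RANK = {"straight": 0, "curved": 1, "oscillating": 2}

entries = [
    ("5", "neutral", "curved", "GLOSS-1"),
    ("A", "chest", "straight", "GLOSS-2"),
    ("A", "chin", "oscillating", "GLOSS-3"),
]

# Python compares the key tuples element by element, which gives exactly the
# handshape-then-location-then-movement ordering described in the text.
entries.sort(key=lambda e: (HANDSHAPE_RANK[e[0]],
                            LOCATION_RANK[e[1]],
                            MOVEMENT_RANK[e[2]]))
print([e[3] for e in entries])
```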
The sign language dictionary would have to be integrated with the spoken language dictionary. The dictionary would have to include NMS and verbalizations as separate entries as well. The identification of NMS may be more difficult because they may spread over several signs.
Dictionary lookup would assist with the conversion of recognized sign language into text. The recognized sign language input would have to be converted into a text representation for further text-based processing. This text-based processing could feed either a new linguistic knowledge system translating that particular sign language into a particular target language, or an existing interlingua system.
Typically, text representations of signs are called glosses, but this term should be avoided because NMS are not called glosses. The NMS need to be represented in text form as well.
Representing NMS may be difficult because NMS also spread and overlap signs. A line over the text is usually used with an abbreviated label for the particular NMS. However, these kinds of lines require a more graphical text interface than a standard computer keyboard has available. A font would have to be developed, just like fonts are developed for languages using a non-English alphabet. This font will need to include a method for representing NMS and their spread.
One suggestion would be to design a font with a line over a space to use between glosses, and there would be at least two forms of the alphabet, one with and one without the line. The NMS labels would be embedded on the same text string as the glosses with some kind of NMS beginning and ending flags or indicators. Multiple NMS may be used simultaneously and may begin and end in different places within the input stream. This method represents simultaneous events in a sequential manner and so has its limitations. For example, if two NMS started at the same time, one of the NMS start indicators would have to be placed first, and the other one second within the text stream.
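The flag-based encoding can be sketched as follows. The angle-bracket flag syntax and the NMS labels are an invented convention, used only to show how overlapping simultaneous spans can be written into one sequential text string.

```python
# Embed NMS begin/end indicators in the same string as the glosses, so that
# NMS which spread over several signs, and overlap each other, can be
# represented sequentially. The <label>...</label> convention is invented.

def encode(glosses, nms_spans):
    """glosses: list of gloss strings.
    nms_spans: list of (label, start, end) index pairs over the glosses,
    with end exclusive. Returns one flagged text string."""
    tokens = []
    for i, gloss in enumerate(glosses):
        # open every NMS span that begins at this gloss
        for label, start, end in nms_spans:
            if start == i:
                tokens.append(f"<{label}>")
        tokens.append(gloss)
        # close every NMS span that ends after this gloss
        for label, start, end in nms_spans:
            if end == i + 1:
                tokens.append(f"</{label}>")
    return " ".join(tokens)

# A negation NMS over the last two glosses overlapping a brow raise over
# the first two: the two spans interleave in the sequential encoding.
print(encode(["IX-1", "GO", "NOT"],
             [("neg", 1, 3), ("brow-raise", 0, 2)]))
```

As the text notes, when two NMS begin at the same gloss, one flag must still be written before the other; the sequential encoding imposes an arbitrary order on genuinely simultaneous events.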
Of course, words found in the audio stream are not called glosses. They are simply words and can be converted into text with standard speech-processing code. Verbalizations will need some kind of mnemonic text classification based on their phonetic features.
Analysis, Transfer, and Synthesis
Once the text conversion is done, the text stream can be used as input to a transfer system or interlingua: the input is analyzed and a source language representation is produced. The transfer module works with the source language representation and converts it into a representation of the target language using rules and other conversion methods. The target language representation is then used to create output, whether spoken or signed.
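The analysis-transfer-synthesis pipeline can be sketched in miniature. The representations and transfer rules below are invented stand-ins: real analysis would build linguistic structure rather than a token list, and real transfer would involve far richer rules.

```python
# Minimal analysis -> transfer -> synthesis sketch, with invented
# representations: analysis produces a token-list "source representation",
# transfer maps it to target tokens by rule, synthesis emits output text.

def analyze(text_stream):
    """Produce a (deliberately trivial) source representation: a token list."""
    return text_stream.split()

# Transfer rules mapping source glosses to target words (invented examples).
TRANSFER_RULES = {"IX-1": "I", "FINISH": "finished", "EAT": "eating"}

def transfer(source_repr):
    """Convert the source representation into a target representation,
    passing through any token without a rule."""
    return [TRANSFER_RULES.get(tok, tok) for tok in source_repr]

def synthesize(target_repr):
    """Produce target-language output from the target representation."""
    return " ".join(target_repr)

print(synthesize(transfer(analyze("IX-1 FINISH EAT"))))
```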
Sign Language Output
To produce linguistically correct sign language output, the MT system must be able to generate the sign language output dynamically. Prerecorded images are not feasible [Speers], because all the variations of simultaneously occurring events, such as NMS and verbalizations during a sign, could not be accounted for without the number of images growing exponentially. The organization of the lexicon must be flexible enough to allow specification of unique features of signs, such as agreement features, and redundancy should be avoided [Speers]. Following these concepts, sample feature specifications are shown in Diagram A below. The "O" represents one and only one selection among the choices available in that column. The "•" represents a feature that may be selected together with others under one column.
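The choice structure of Diagram A could be represented as a small data structure: the agreement type is a one-of-three choice, while the features under each type form a multi-select set. The validation below mirrors the diagram; the category and feature names are taken directly from Diagram A.

```python
# Diagram A as a data structure: one agreement type must be chosen, and any
# subset of the features listed under that type may be selected with it.

AGREEMENT_TYPES = {
    "no agreement features": set(),
    "single agreement features": {"eye gaze", "head tilt",
                                  "location of hand"},
    "double agreement features": {"eye gaze", "head tilt",
                                  "location of dominant hand",
                                  "location of weak hand"},
}

def valid_specification(agreement_type, selected_features):
    """True if the agreement type exists and every selected feature is
    permitted under that type."""
    allowed = AGREEMENT_TYPES.get(agreement_type)
    if allowed is None:
        return False
    return set(selected_features) <= allowed

print(valid_specification("single agreement features",
                          ["eye gaze", "head tilt"]))
print(valid_specification("no agreement features", ["eye gaze"]))
```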
Independent Multi-media Dictionary
A sophisticated multi-media dictionary should be implemented to assist users and programmers with accessing the MT data independently of the MT system itself. The multi-media dictionary should have access to audio strings of words and verbalizations, animated pictures of the fingerspelled alphabet, videotapes of people signing citation forms of signs, and video clips of various NMS produced in isolation. The multi-media dictionary interface should include audio and visual playback capabilities. The user interface should be user-friendly with regard to deafness: the audio output should be displayed visually, like an oscilloscope trace. The user interface should also have the ability to tag the visual display of an audio or video entry for feature transcription and notation.
Instance-based learning (IBL) systems, neural networks, expert systems, decision tree building systems, and template matching are other techniques that have been used for limited sign language translation. A neural network needs to be trained just like an HMM system and is very limited in the scope of its lexicon. IBL matches input to the nearest neighbor in its database [Kadous]. IBL and template matching need a pre-defined database and are not dynamic. A decision tree building system is like a flowchart. Expert systems and decision tree systems need IF-THEN rules to be programmed, but they are dynamic and need not be trained. However, such rule-based systems have not been completely successful. An MT system based on HMM techniques appears most likely to succeed.
Diagram A: Sample Feature Specifications
O no agreement features
O single agreement features
  • eye gaze
  • head tilt
  • location of hand
O double agreement features
  • eye gaze
  • head tilt
  • location of dominant hand
  • location of weak hand
[Gay 1998] Gay, Greg. (1998). GG Wiz's Fingerspeller; http://www.iwaynet.net/~ggwiz/asl/index.html
[Harling 1995] Harling, Philip A. (1995). Gesture-Based Interaction.
[Kadous] Kadous, Mohammed. (1996). Machine Recognition of Auslan Signs Using PowerGloves: Towards Large-Lexicon Recognition of Sign Language.
[Kramer & Leifer 1987] Kramer, J. & Leifer, L. (1987). The talking glove: An expressive and receptive verbal communication aid for the deaf, deaf-blind, and nonvocal. In Murphy, H. J., editor, Proceedings of the Third Annual Conference on Computer Technology/Special Education/Rehabilitation, California State University, Northridge.
[Kramer & Leifer 1989] Kramer, J. & Leifer, L. (1989). The talking glove: A speaking aid for nonvocal deaf and deaf-blind individuals. RESNA 12th Annual Conference, New Orleans, LA.
[Lucas & Valli 1992] Lucas, Ceil & Valli, Clayton. (1992). Language Contact in the American Deaf Community. San Diego, CA: Academic Press, xi-xii, 41, 46-47, 75-106.
[Speers] Speers, d'Armond. (in progress). Doctoral dissertation, Georgetown University, Washington, DC.
[Starner 1997] Starner, Thad. (1997). Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video.
[Starner 1994] Starner, Thad. (1994). Real-Time American Sign Language Recognition from Video Using Hidden Markov Models.
[Stine 1997] Stine, Randy. (1997). Animated American Sign Language Dictionary.