Voice Control -- Lexicon Representation

Study on Voice Controlled Computing
Suggestions for Representing the Lexicon

The example lexicon I provide is in the same format in which Dr. Dougherty provides his examples, that is, as a set of PROLOG rules. That sample lexicon is by no means the only one with which I worked this semester. I needed machine-parsable lists of words and their parts of speech, so I downloaded two lexicons from the Internet. They came in different machine-readable formats, so I wrote a converter in C++ for each of them: first one for the Moby part-of-speech file, which I later modified to handle a lexicon taken from the Oxford Text Archive.

Once converted, each of these lexicons held tens of thousands of entries. That seemed good until I tried to compile them. After setting my stack sizes to several times their defaults, I compiled a lexicon of about 70,000 entries based on the lexicon from the Oxford Text Archive. The compile took most of an hour and produced a 17 MB executable that took 5 minutes to load on my AMD-K6/2 350 running RedHat 6.0.

The point of all this is that representing a large lexicon as a single set of PROLOG facts is not a reasonable solution. Some other approach is needed, so I offer a few areas that may be researched.

Solution I. Forget using PROLOG.
Everything that is done in PROLOG could be written in C++, with all the words looked up in a database. The problem is that PROLOG supplies much of the problem-solving intelligence needed in other aspects of natural language processing. It would be a shame to reinvent the PROLOG wheel when it works so well for part of the project.

Solution II. Have a C++ module load the data from the database and pass it to the PROLOG module.
This idea is very attractive to me, but I'm not sure it's possible. I have not yet adequately studied the problem of interfacing between PROLOG and C++.
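One way this division of labor might look is sketched below. It is only a sketch under several assumptions: a std::multimap stands in for the real database, the utterance arrives as lowercase-able words separated by spaces with no punctuation, no word contains a quote character, and the names Lexicon, factsForSentence and lex/2 are hypothetical. How the resulting facts would actually reach the PROLOG module (a small temporary file, a pipe, or a foreign-language interface) depends on the PROLOG system and is left open.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for the database: word -> category, with one entry per
// category a word can take.
using Lexicon = std::multimap<std::string, std::string>;

// Look up only the words that occur in `sentence` and return their
// entries as PROLOG fact strings, so the PROLOG side sees a handful
// of facts instead of the whole lexicon.
std::vector<std::string> factsForSentence(const Lexicon& db,
                                          const std::string& sentence) {
    std::vector<std::string> facts;
    std::set<std::string> seen;  // words already handled in this sentence
    std::istringstream words(sentence);
    std::string word;
    while (words >> word) {
        for (char& c : word) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        if (!seen.insert(word).second) continue;  // repeated word, skip
        auto range = db.equal_range(word);
        for (auto it = range.first; it != range.second; ++it) {
            facts.push_back("lex('" + it->first + "', " + it->second + ").");
        }
    }
    return facts;
}
```

For a three-word command this yields only the few facts those words need, which is the attraction of the approach: the PROLOG program keeps its grammar rules while the bulk of the lexicon stays outside it.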

Solution III. Have the PROLOG program find its words in the database.
PROLOG does offer disk I/O functionality, but I'm not sure how powerful it is. If PROLOG disk I/O is up to the task, this could be the simplest way to solve the problem. Still, that's a big if.

No doubt other solutions to this problem exist. These are the ones that come to mind, and I intend to pursue the last two in the future.
