Experiments in Information Retrieval from Spoken Documents
A. G. Hauptmann, R. E. Jones,
K. Seymore, S. T. Slattery,
M. J. Witbrock*, and M. A. Siegler
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213-3890
*Justsystem Pittsburgh Research Center
4616 Henry St.
Pittsburgh, PA 15213
ABSTRACT
This paper describes the experiments performed as part of the TREC-97 Spoken
Document Retrieval Track. The task was to pick the correct document from 35
hours of recognized speech documents, based on a text query describing exactly
one document. The experiments described here include: vocabulary size
experiments to assess the effect of words missing from the speech recognition
vocabulary; experiments with speech recognition using a stemmed language model;
experiments using confidence annotations that estimate the correctness of each
recognized word; and experiments using multiple hypotheses from the recognizer.
Finally, we measured the effects of corpus size on the SDR task. Despite fairly high word
error rates, information retrieval performance was only slightly degraded for
speech recognizer transcribed documents.
INTRODUCTION
For the first time, the 1997 Text REtrieval Conference (TREC-97) included an
evaluation track for information retrieval on spoken documents. In this paper,
we describe some experiments in spoken document retrieval, with details of
both the speech recognition system and the information retrieval engine.
The SDR Data
The speech data were identical to the training data used in the 1997 ARPA Speech
Recognition Workshop HUB-4 broadcast news evaluations [11]. The main difference
lay in the split between training and testing data; here roughly half of the
material was reserved for the test data and only half the total acoustic data
was used for training the acoustic models. There were three "versions" of the
data available from NIST: A manually generated transcript (which also contained
some errors), a speech recognition transcript provided by IBM, and the raw audio
data, to be transcribed by our own recognizer. There were about 1200 news
stories in the training data set and 1451 in the test set.
Information Retrieval Scoring Metrics
The IR task consisted of a list of queries, for each of which one or more
relevant documents were to be returned by the IR system. The test queries were
designed to simulate a known-item retrieval task. For each query, only one
document was considered relevant for the purposes of this evaluation. While
other documents may have had some relevance to the query, only the document it
was designed to retrieve was scored as a correct retrieval. To measure the
effectiveness of the IR system, we report the inverse average inverse rank (IAIR):
\[ \mathrm{IAIR} = \frac{N}{\sum_{i=1}^{N} 1/r_i} \]
where $r_i$ is the rank of document i (the correct document for query i) and N is the number of queries.
One characteristic of the IAIR is that it rewards correct documents near the top
more than documents in the middle or towards the end of the rankings. Both
average rank and IAIR score 1.0 for a perfect retrieval and larger numbers for
less than perfect retrievals. However, using the average rank metric, the
difference between returning a document at rank 100 versus rank 200 is large,
where this difference is almost negligible for the IAIR metric. At the other end
of the scale, the difference between returning a document at rank 2 versus rank
10 is small for the average rank, but large for IAIR. In real life situations,
where users’ time is valuable, closeness to the top is more critical than the
average rank over all items returned.
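The contrast between the two metrics can be checked with a short calculation. The following is a minimal sketch, assuming IAIR is the number of queries divided by the sum of the inverse ranks; the rank lists are hypothetical and do not come from the evaluation.

```python
def average_rank(ranks):
    """Arithmetic mean of the ranks at which the known items were returned."""
    return sum(ranks) / len(ranks)


def iair(ranks):
    """Inverse average inverse rank: N divided by the sum of 1/rank."""
    return len(ranks) / sum(1.0 / r for r in ranks)


# Hypothetical ranks of the single relevant document for four queries.
perfect = [1, 1, 1, 1]
near_top = [1, 1, 1, 10]
far_down = [1, 1, 1, 200]

for name, ranks in [("perfect", perfect), ("near_top", near_top), ("far_down", far_down)]:
    print(name, round(average_rank(ranks), 3), round(iair(ranks), 3))
```

In this toy case, moving one document from rank 10 to rank 200 raises the average rank by 47.5 but changes the IAIR by only about 0.04.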
The Speech Recognition Component
The Sphinx-III speech recognition system was used for the CMU TREC SDR
evaluation, and it was configured similarly to that used in the 1996 DARPA
CSR evaluation [9], although several changes have been made to the recognizer
since then. Sphinx-III is a large vocabulary, speaker independent, fully
continuous hidden Markov model speech recognizer with separately trained
acoustic, language and lexical models.
For the current evaluation a gender-independent HMM with 6,000 senonically-tied
states and 16 diagonal-covariance Gaussian mixtures was trained on a union of
the CSR Wall Street Journal corpus and the 1996 TREC-6 training set.
The decoder used a Katz-smoothed trigram language model [12] trained on the
1992-1996 Broadcast News Language Modeling (BN LM) corpus [11]. This is a fairly
standard language model, much like those that have been used by the DARPA speech
recognition community for the past several years. As a space optimization,
singleton trigrams and bigrams were excluded. As a new feature, this language
model incorporated cross-sentence-boundary trigrams to better model utterances
containing more than one sentence.
The lexicon was chosen from the most common words in the corpus, at a size that
balanced the trade-off between leaving words out-of-vocabulary and introducing
acoustically confusable words [8]. For this evaluation, the vocabulary was
comprised of the most frequent 51,000 words in the BN LM corpus, supplemented by
some 200 multi-word phrases and some 150 acronyms. The vocabulary size was
initially based on our experience with broadcast news, and a subsequent careful
analysis of the trade-off showed that this choice was a good one. More details
of the trade-off involved in vocabulary selection are provided below.
Compared with the earlier Sphinx-II speech recognition system, Sphinx-III boasts
a higher accuracy but at significant computational cost. To achieve a lower word
error rate of 27.4% versus 45.9% for Sphinx-II on a subset of the training data,
the original Sphinx-III system processing time increased to 120 times real time
on a 266 MHz DEC Alpha compared with only 1.4 times real time for Sphinx-II. By
reducing the beam width of the search and by optimizing the space required, the
Sphinx-III processing time was reduced to about 30 times real time, with only a
slight loss in word transcription accuracy. This 75% reduction in processing time
resulted in about a 10% relative increase in word error rate. Decoding the audio files in the test
data required about 1000 hours of CPU time.
The Information Retrieval Component
Both documents and queries were processed using the same conditioning tools,
namely noise filtering, stopword removal, and term stemming:
Noise Filtering: The goal of noise filtering was simply to remove non-alphabetic
ASCII characters, punctuation, and case distinctions.
Stopword removal: A set of 811 stopwords from a combination of the SMART [13] IR
engine and other available stopword lists were removed entirely.
Term mappings: A set of 4578 mappings was used to map words with irregular word
endings that were not properly covered by an implementation of the Porter [6]
algorithm. An on-line Houghton-Mifflin dictionary was used for this lookup of
irregular words and their roots.
An example of this mapping is APPENDICES → APPENDIX.
Term stemming: An implementation of the Porter algorithm was applied to map
words to their common root.
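The conditioning steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stopword set and the irregular-form table are tiny hypothetical stand-ins for the real lists, `toy_stem` is a crude placeholder and not the Porter algorithm, and replacing non-alphabetic characters with a space (rather than deleting them) is an assumption.

```python
import re

# Tiny stand-ins for the 811-word stopword list and the 4578 irregular-form
# mappings described above; the real lists are not reproduced here.
STOPWORDS = {"the", "of", "a", "and", "in"}
IRREGULAR = {"appendices": "appendix"}


def toy_stem(word):
    """Crude suffix stripper used only as a placeholder for the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def condition(text):
    """Apply noise filtering, stopword removal, term mapping, then stemming."""
    # Noise filtering: drop case distinctions and non-alphabetic characters.
    tokens = re.sub(r"[^a-z]+", " ", text.lower()).split()
    # Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Irregular forms are mapped directly; everything else is stemmed.
    return [IRREGULAR[t] if t in IRREGULAR else toy_stem(t) for t in tokens]


print(condition("The appendices of the 1997 reports, torched?"))
```

Because documents and queries pass through the same function, a query word such as "torched?" and a document word "torched" end up as the same term.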
A heavily stripped down core of the CMU Informedia SEIDX engine [10] was used to
compare queries with documents. A relevance score was created for each pair
according to the following equation:
The terms of this equation are:
the query term frequency for vocabulary word i,
the document term frequency for vocabulary word i,
the inverse document frequency for vocabulary word i, and
the sign-of-value function (0 if 0, 1 if positive).
3. Official TREC-6 SDR Results
Table 1 shows the official CMU TREC SDR results. Since the transcriptions were
subject to filtering by stopword removal and stemming as discussed above, the
word error rates were reported for both the unfiltered and filtered references
and hypotheses. An analysis of the results showed several preprocessing errors
and confirmed an insight into the relationship between word error rate and
information retrieval.
Transcription Source    WER (%) Unfiltered    WER (%) Filtered    IAIR
Reference               0                     0                   1.35
CMU-SR                  35.5                  26.4                1.44
IBM-SR                  45.6                  47.4                1.64
Table 1: Performance of the CMU TREC-6 SDR Evaluation System according to the
NIST scoring system on 49 queries. The filtered word error rate (WER) reflects
the effect of stopword removal and stemming.
Vocabulary Coverage
The words that were in the queries but were missing from the speech recognizer’s
51,000 word vocabulary were "CIA", "TORCHED?", "SMOKING?", "WELL-KNOWN", and "GOLDFINGER".
These problems are primarily due to inconsistencies in the preprocessing phases.
While "C.I.A." was in the vocabulary, "CIA" was not, resulting in a completely
missed word during information retrieval. Similarly, an oversight in the
preprocessing phase allowed the question mark to become part of the word in "torched?"
and "smoking?". For "well-known", the component words "well" and "known"
were both in the vocabulary, but the compound "well-known" was not present as a single
token, and thus was treated as an irretrievable word. The only true missing word
in our 51,000-word vocabulary was "Goldfinger". Thus the 51,000 word vocabulary
selection provided excellent coverage for this test evaluation.
Recognition Accuracy versus Information Retrieval Quality
The official TREC results confirmed that vastly reduced word error rates
translate into slight improvements in information retrieval. Comparing the
performance on the baseline IBM speech recognition data with that on the CMU
speech recognition output, on the filtered texts, we found that nearly doubling
the filtered word error rate led to only a 14% decrease in information retrieval
effectiveness as measured by IAIR.
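The arithmetic behind this comparison can be reproduced from the filtered WER and IAIR figures in Table 1. The snippet below is a simple check; the variable names are illustrative only.

```python
# Filtered WER (%) and IAIR from Table 1.
cmu = {"wer": 26.4, "iair": 1.44}
ibm = {"wer": 47.4, "iair": 1.64}

# How many times larger the IBM filtered WER is than the CMU filtered WER.
wer_ratio = ibm["wer"] / cmu["wer"]

# Relative increase in IAIR (larger IAIR means worse retrieval).
iair_increase = (ibm["iair"] - cmu["iair"]) / cmu["iair"]

print(f"filtered WER ratio: {wer_ratio:.2f}x")
print(f"relative IAIR increase: {iair_increase:.1%}")
```

The filtered word error rate grows by a factor of about 1.8, while the IAIR grows by about 14%.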
4. Experiments
Some of the experiments described here were performed before the actual test
data with queries was available from NIST. In order to allow meaningful
experiments to be performed on the TREC-6 training data, 1167 documents were
selected from the set and known-item retrieval style queries were generated for
374 of them by hand. In some of the very early experiments, a much smaller test
set composed of only 103 broadcast news stories with associated known-item
queries from a privately collected corpus was added to the 1167 documents to
permit initial investigation of ideas involving the speech recognition
configuration. We shall refer to this latter test set as the "small test set."
Vocabulary Size Experiments
Prior to the evaluation we attempted to find a good vocabulary size that was
optimized for both speech recognition and information retrieval. We chose three
different vocabulary sizes, 40,000, 51,000 and 64,000 words, constructed a
language model for each one, and then performed speech recognition. Table 2
shows that as the vocabulary got larger, the rate of out-of-vocabulary words
decreased, but beyond 51,000 words speech recognition accuracy did not improve.
Additional vocabulary coverage was thus obtained only at the cost of adding many
acoustically confusable words, and information retrieval effectiveness decreased
slightly. We chose to use the 51,000-word vocabulary for our official TREC
submission. As explained in the analysis of vocabulary coverage above, this
vocabulary size left only one unrecognizable word amongst the terms used in the
49 test queries. This experiment was performed prior to the official TREC
submission on the 103 queries that constituted our in-house development test set.
Vocabulary Size    Out Of Vocabulary Rate    Word Error Rate    IAIR
40k Words          1.13 %                    26.4 %             1.24
51k Words          0.83 %                    26.8 %             1.21
64k Words          0.75 %                    26.8 %             1.22
Table 2: Effect of Vocabulary Size on System Performance.
This experiment was performed on the "small" test set of 103 queries.
Stemmed Language Models
Using the small test set described above and the 51,000-word vocabulary, we also
investigated the concept of language modeling tailored specifically to
information retrieval. Since the words in the recognition output are stemmed
before being used for IR, distinctions between different forms of a stem are
irrelevant to the IR system. In an attempt to take advantage of this observation,
a language model was built from a stemmed version of the LM training data. Each
root word in the language model had multiple "pronunciations" in the lexicon to
reflect the original, unstemmed, forms.
For example, suppose the root forms of the words "recognize", "recognized", and
"recognition" all map into the common root "recogni"+suffix, where the suffix in
this case is either "ze", "zed", or "tion". The stemmed language model would
provide only one transition from the root "recogni" into words that can follow,
in effect collapsing multiple paths between individual words into one path
between root words. The lexicon would reflect the alternate inflected forms as
alternate pronunciations of the root word, i.e.
Recogni R EH K AX G N AY Z
Recogni(2) R EH K AX G N AY Z DD
Recogni(3) R EH K AX G N IH SH AX N
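The construction of such a lexicon can be sketched as follows. This is a hypothetical illustration, not the tooling used for the experiments: the `root_of` table is written by hand here, whereas in practice it would come from the stemmer, and the entry labelling simply mirrors the example above.

```python
from collections import defaultdict

# Unstemmed lexicon entries (word -> phone string) from the example above.
lexicon = {
    "recognize": "R EH K AX G N AY Z",
    "recognized": "R EH K AX G N AY Z DD",
    "recognition": "R EH K AX G N IH SH AX N",
}

# Hypothetical word-to-root table; in practice this would come from the stemmer.
root_of = {"recognize": "recogni", "recognized": "recogni", "recognition": "recogni"}


def build_stemmed_lexicon(lexicon, root_of):
    """Group the pronunciations of all inflected forms under their root."""
    by_root = defaultdict(list)
    for word, phones in lexicon.items():
        by_root[root_of[word]].append(phones)
    entries = []
    for root, prons in by_root.items():
        for n, phones in enumerate(prons, start=1):
            # The first pronunciation is unnumbered; later ones get (2), (3), ...
            label = root if n == 1 else f"{root}({n})"
            entries.append((label, phones))
    return entries


for label, phones in build_stemmed_lexicon(lexicon, root_of):
    print(label, phones)
```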
The premise was that this stemmed language model would avoid much of the
confusion due to acoustic variations in suffixes of words, but would aid in the
correct recognition of the important roots of the words. Table 3 shows the
results of these experiments. The word error rate of the stemmed language model
was higher than for the baseline language model. The WER increased both when only
stemmed words were counted and when all original words were compared.
Furthermore the information retrieval effectiveness (as measured by the inverse
average inverse rank metric) also showed a decrease.
Language Model    WER Unfiltered    WER Filtered    IAIR
Baseline          26.8 %            22.6 %          1.21
Stemmed           35.1 %            23.8 %          1.26
Table 3: Using a language model built from stemmed LM training texts. This
experiment was also done with the "small" 103-query in-house development test
set.
Confidence Annotation
Since state-of-the-art speech recognition software does not produce a perfect
transcript of what was said, we would like to obtain any extra information we
can about the likelihood of correctness of particular words. This is akin to the
situation in which a human annotator makes a guess at a word that was hard to
hear, and marks that this word may have been mis-heard.
An ideal automatic confidence annotator would label each word produced by the
speech recognizer with a label correct to indicate that this is in fact the word
that was spoken, and incorrect to indicate that this word was not spoken. We
will compare the results of our annotation to this ideal, which we call Perfect
Annotation.
Features for Confidence Annotation
The confidence annotation we performed is based on work by Lin Chase [2], though
annotation has been explored by many others including [3,4,5]. Typically
confidence annotation is performed by gathering features of individual word
occurrences in the hypothesized text, drawn from information produced either
within the speech recognizer or outside it. These features
are then automatically examined to find indicators of likely correctness and
incorrectness. The candidate features we considered were:
Acoustic Score. This is the score the speech recognizer assigns the word based
on the probability that the acoustics observed were generated by the hypothesis.
Language Model Score. This is a score assigned by the speech recognizer, based
on the probability that the word occurs given the previous two words.
Duration. This is the duration of the word, and helps offset the duration
dependence of the acoustic score.
N-best Homogeneity. The n-best list is the list of the best n guesses at the
words spoken in the document, sorted according to a weighted combination of
acoustic and language model scores. A word appearing in our hypothesis may
appear in many or few of the competing hypotheses. N-best list homogeneity is
the proportion of hypotheses that the word appears in. We set n to 200 for the
confidence annotation experiments.
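As an illustration of the last feature, n-best homogeneity can be computed as below. This is a minimal sketch that assumes each hypothesis is a plain list of words and ignores word position and timing, which a real implementation would probably need to take into account; the 4-best list is hypothetical.

```python
def nbest_homogeneity(word, nbest):
    """Fraction of the n-best hypotheses that contain the given word."""
    if not nbest:
        return 0.0
    hits = sum(1 for hyp in nbest if word in hyp)
    return hits / len(nbest)


# Hypothetical 4-best list; each hypothesis is a list of words.
nbest = [
    ["fair", "education", "calm"],
    ["fair", "education", "common"],
    ["fair", "education", "intercom"],
    ["fair", "educated", "calm"],
]

for w in ["fair", "education", "calm"]:
    print(w, nbest_homogeneity(w, nbest))
```

A word that survives in every competing hypothesis scores 1.0, while a word that the recognizer frequently swaps for alternatives scores lower.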
Experimental Description - Confidence Annotation
For each set of features, the experiment proceeds as follows:
Label all words in the training set as correct or incorrect by comparing them to the
words in the reference transcript.
Build a decision tree that finds sets of features that perform well in
distinguishing between correct and incorrect words in speech recognition
hypotheses.
Use the decision tree to test the features of words in the test set. Once a word has been
sorted into a leaf node, the proportion of correct and incorrect words from the
training set with these features is used to calculate an approximate probability
of correctness.
Perform information retrieval by weighting each word according to the
probability that it is correct (the confidence).
We conducted experiments by splitting the training data into two sections,
training our decision tree on one half, testing on the other half, then
reversing the roles.
Decision Tree Building
The decision tree building algorithm we use is C4.5 [7]. It functions by taking
all training data, and attempting to find rules based on features which
distinguish between classes. Each item of training data is a word along with its
associated features (described above), and its class of correct or incorrect.
It does this by taking each feature in turn, asking a question about that feature, and
using the answer to partition the data. A feature is chosen if it has high
information gain, i.e. if the resulting two groups of data contain less of a mix
of correct and incorrect. The ideal split would create classes that contain
exclusively correct or exclusively incorrect examples.
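The splitting criterion can be made concrete with a small calculation. The sketch below computes plain information gain for a threshold question on one numeric feature; it is an illustration of the idea only, not the C4.5 implementation, and the feature values and labels are hypothetical.

```python
import math


def entropy(labels):
    """Binary entropy (in bits) of a list of True/False correctness labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on the question `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


# Hypothetical n-best homogeneity values and correctness labels for six words.
homogeneity = [0.20, 0.35, 0.40, 0.80, 0.90, 0.95]
correct = [False, False, True, True, True, True]

print(information_gain(homogeneity, correct, 0.35))  # split into two pure groups
print(information_gain(homogeneity, correct, 0.85))  # one group remains mixed
```

The first threshold separates correct from incorrect words completely and so recovers the full entropy of the data; the second leaves a mixed group and gains much less.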
Since such ideal splits are rare, the decision tree building halts when no more
information gain (reduction in entropy) can be achieved. At this point, each
leaf of the tree contains examples which have all the same features for
questions asked at each partition, and which are mostly of one class. The
proportion of correct examples at this node is the probability of correctness
that will be assigned to any word with the same features.
When using the decision tree to classify a new word, we check each of its
features to find which leaf-node of the decision tree to classify it into. At
that point, it is classified as having the probability of correctness
corresponding to this leaf node.
Evaluating Confidence Annotation: Cross-Entropy Reduction
The most common method of evaluating word confidence annotation is cross-entropy
reduction. Cross-entropy is a measure of how well our model of the probability
of word correctness corresponds to Perfect Annotation (as defined above). If our
model annotates perfectly, its cross-entropy is 0. The worse the annotation
performs the higher the cross-entropy.
The most naive form of confidence annotation we can perform is to tag each word
with a probability of correctness equal to the overall word-accuracy. Thus if we
know that our recognizer generally gets 80% of words correct, the baseline
confidence annotator assigns each word an 80% probability of correctness. We
then measure the quality of our annotation by measuring how much better it
performs than this baseline.
The cross-entropy is computed over all words i from the actual probability that
word i is incorrect and the predicted probability that word i is incorrect.
Thus we obtain a figure for cross-entropy for the
default model of classifying each word as correct with probability equal to the
word-accuracy, and score our improvements in modeling the probability of
correctness by how much they reduce cross-entropy as a percentage of this
baseline.
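A minimal sketch of this evaluation follows. It assumes the standard binary cross-entropy averaged over words, with logarithms to base 2; the exact normalization used in the experiments is not stated here, and the labels and predictions are hypothetical.

```python
import math


def cross_entropy(actual_incorrect, predicted_incorrect, eps=1e-12):
    """Mean binary cross-entropy (bits) between 0/1 truth and predictions."""
    total = 0.0
    for p, q in zip(actual_incorrect, predicted_incorrect):
        q = min(max(q, eps), 1 - eps)  # keep log2() finite at 0 and 1
        total -= p * math.log2(q) + (1 - p) * math.log2(1 - q)
    return total / len(actual_incorrect)


def cross_entropy_reduction(actual_incorrect, predicted_incorrect):
    """Percentage reduction relative to the constant word-error-rate baseline."""
    error_rate = sum(actual_incorrect) / len(actual_incorrect)
    baseline = cross_entropy(actual_incorrect, [error_rate] * len(actual_incorrect))
    model = cross_entropy(actual_incorrect, predicted_incorrect)
    return 100.0 * (baseline - model) / baseline


# Hypothetical data: 1 means the word was misrecognized.
actual = [0, 0, 0, 0, 1]
predicted = [0.1, 0.1, 0.2, 0.1, 0.7]
print(round(cross_entropy_reduction(actual, predicted), 1))
```

Under this definition, an annotator that repeats the baseline error rate for every word scores a reduction of 0%, and a perfect annotator approaches 100%.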
Information Retrieval Using Word Confidence Weights
First we describe two orthogonal ways of using word confidence weights in the
relevance scheme described above:
Expected Term Frequency (ETF): The ETF is an estimate of how many times the term
actually occurred given the number of observations. Assuming independent
observations, this is the product of the term frequency for the word and the
probability of the word being correct.
Expected Inverse Document Frequency (EIDF): To calculate EIDF, we first
calculate the probability that this word occurs somewhere in the document, for
each document:
\[ P(w \in d) = 1 - \prod_{j \in d} \bigl(1 - P(w_j = w)\bigr) \]
where $w_j$ is the word actually spoken at position j of document d. Since
typically $P(w_j = w)$ is very small when the recognized word at position j is
not w, we only take the product over terms for
which the recognized word was w. Summing this value over all documents and
dividing by the total number of documents gives us an approximate value of the
expected document frequency for this word.
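Both quantities can be sketched in a few lines. The code below is an illustration under assumptions: documents are represented as lists of (recognized word, confidence) pairs, ETF is taken as the sum of the per-occurrence confidences, the occurrence probability is one minus the product of (1 - confidence) over occurrences recognized as the word, and `log(1/EDF)` is an assumed form for turning the expected document frequency into an IDF weight.

```python
import math
from collections import defaultdict


def expected_term_frequency(doc):
    """Sum of per-occurrence confidences for each recognized word."""
    etf = defaultdict(float)
    for word, conf in doc:
        etf[word] += conf
    return dict(etf)


def prob_word_in_doc(doc, word):
    """1 - product of (1 - confidence) over occurrences recognized as `word`."""
    p_absent = 1.0
    for w, conf in doc:
        if w == word:
            p_absent *= 1.0 - conf
    return 1.0 - p_absent


def expected_document_frequency(docs, word):
    """Average over documents of the probability that `word` occurs."""
    return sum(prob_word_in_doc(d, word) for d in docs) / len(docs)


# Hypothetical collection: each document is a list of (word, confidence) pairs.
docs = [
    [("tax", 0.9), ("tax", 0.5), ("vote", 0.8)],
    [("vote", 0.6)],
    [("tax", 0.2)],
]

print(expected_term_frequency(docs[0]))
edf = expected_document_frequency(docs, "tax")
print(edf, math.log(1.0 / edf))  # log(1/EDF) is an assumed IDF form
```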
Oracle Experiments
Since the interaction between confidence annotation and information retrieval
may be complex, we also conducted an experiment to see how we could make use of
confidence scores in the idealized case in which we know exactly which words are
correct, and which are incorrect. We removed words in two different ways:
Pre-filter: Before the hypothesis is filtered, all the words that are not found
in the reference are removed.
Post-filter: After the hypothesis is filtered, all the words that are not found
in a filtered version of the reference are removed.
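The oracle removal step can be sketched as a membership test against the reference. This is a simplified illustration that treats both transcripts as word lists and does not align them; the transcripts shown are hypothetical. For the Pre-filter variant the function would be applied to the raw word lists, and for the Post-filter variant to the lists after stopword removal and stemming.

```python
def oracle_filter(hypothesis, reference):
    """Keep only hypothesis words that also occur in the reference."""
    allowed = set(reference)
    return [w for w in hypothesis if w in allowed]


# Hypothetical transcripts for one document.
reference = ["the", "senate", "voted", "on", "taxes"]
hypothesis = ["the", "senate", "noted", "on", "taxis", "taxes"]

print(oracle_filter(hypothesis, reference))
```

Note that the filter can only drop words: the substituted "noted" disappears, but the missing "voted" is not restored.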
Table 4 shows that for both training and testing sets, the Post-Filter Oracle
annotation was able to significantly reduce the IR error of the decoded
transcripts. This indicates that a more realistic experiment might be able to do
this as well.
We performed an analysis of some of the differences between documents in the
stemmed oracle experiment, and reference information retrieval experiments. We
should expect the number of query words in the correct document to decrease,
since oracle confidence annotation cannot correct for substitutions and
deletions, but will drop all incorrectly substituted and inserted words. A
cursory glance at documents and queries revealed that some documents contain
more query words as speech hypotheses than the corresponding reference
transcription. Our intuition here is that speech recognition can occasionally
correct for spelling errors in the references, and so words that are incorrect
with respect to the reference transcription may be correct for the purposes of
information retrieval.
                Baseline Performance                          Oracle Annotation
                Reference Transcripts    Decoded Transcripts  Pre-Filter    Post-Filter
Training Set    1.233                    1.283                1.285         1.269
Testing Set     1.332                    1.382                1.374         1.338
Table 4: Baseline and Oracle Annotation on TREC-6 Training and Testing Sets.
Values are IAIR
Information Retrieval Experiments for Confidence Annotations
In order to see how well cross-entropy reduction translates into gains in
information retrieval accuracy, we conducted a series of experiments. Since we
also hoped to find the best way of incorporating weights into information
retrieval we performed the following information retrieval experiments:
ETF: for this experiment, we used ETF and regular IDF.
EIDF: for this experiment, we used EIDF and regular TF.
ETF-EIDF: for this experiment, we used both ETF and EIDF.
Pre-Filter      ETF      EIDF     ETF-EIDF
Training set    1.276    1.283    1.277
Testing set     1.378    1.383    1.399

Post-Filter     ETF      EIDF     ETF-EIDF
Training set    1.273    1.281    1.274
Testing set     1.381    1.382    1.382
Table 5: Confidence Annotation Performance on TREC-6 Training and Testing Sets.
Values are IAIR.
The results of these experiments are found in Table 5. Although the IAIR was
reduced in most cases, the upper bound found in the
Oracle Annotation was not attained.
6. Using N-best Lists for Information Retrieval
Typically, speech recognition systems produce a transcription of each spoken
utterance in much the same way that a human transcriptionist might. However, the
transcription used is only the most probable decoding of the acoustic signal,
out of a large number of hypotheses that are considered during the recognition
process. It is a relatively simple matter to obtain a list of these different
hypotheses, ranked in order of decreasing likelihood.
Using these additional hypotheses seems promising for information retrieval,
since it offers the hope of including terms that would otherwise be missed by
the speech recognizer in documents, allowing them to match with query terms and
increase document recall. On the other hand, words incorrectly identified in
lower ranked recognition hypotheses may cause spurious matches with query terms,
decreasing retrieval precision.
Experiments Using N-Best Lists
In the context of the TREC-6 SDR task, an initial attempt was made to evaluate
retrieval effectiveness using n-best hypotheses lists generated from the speech
recognition decoder lattice. N-Best hypotheses were generated for the 1451
stories in the TREC-6 SDR test data. Of these, decoding failed completely in
four cases, resulting in empty transcriptions. For the remaining 1447 stories,
lists of the two hundred most likely hypotheses were generated for each
utterance. Table 6 shows an example of N-best hypotheses.
Ideally, one would use hypothesis probabilities generated during decoding to
weight the terms during retrieval, but for this preliminary experiment, the n
hypotheses for each utterance were simply concatenated together into one larger
document. No discounting of weights for less probable hypotheses was done.
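The concatenation scheme can be sketched as follows. The data layout is an assumption made for illustration: a story is a list of utterances, and each utterance is a list of hypotheses ordered from most to least likely.

```python
def expand_document(utterance_nbest, n):
    """Concatenate the top-n hypotheses of every utterance into one term list."""
    terms = []
    for hypotheses in utterance_nbest:
        for hyp in hypotheses[:n]:
            terms.extend(hyp)
    return terms


# Hypothetical story with two utterances; hypotheses are ordered best-first.
story = [
    [["child", "withstand", "calm"], ["child", "withstand", "common"]],
    [["school", "fund"], ["school", "fun"]],
]

print(expand_document(story, 1))
print(expand_document(story, 2))
```

Because no discounting is applied, terms shared by several hypotheses are counted once per hypothesis, so their term frequency grows with n.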
N    Nth most likely decoder hypothesis
1    HATE FAIR ADEQ EDUC CHILD WITHSTAND CALM
2    HATE FAIR ADEQ EDUC CHILD WITHSTAND COMMON
3    HATE FAIR ADEQ EDUC CHILD WITHSTAND INTERCOM
4    HATE FAIR ADEQ EDUC CHILD WITHSTAND CALM
Table 6: The top four hypotheses for utterance three of story j960531d.7, after
stop word removal and stemming. Note that the fourth hypothesis is identical to
the first after this processing; the original hypotheses differed only in
inflected forms.
The effect on retrieval effectiveness of using the documents generated from the
n-best lists in the TREC-6 test set is illustrated in Table 7. Note that for N
set at 50, the IAIR on the hypothesized transcripts (1.317) is actually slightly
lower, and thus better, than the IAIR on the reference transcripts (1.332). This
may again be
due to effects of misspellings in the reference transcripts. These results were
obtained from the official NIST queries using the full TREC-6 SDR corpus. The 49
queries include the corrected transcription for the words "well-known", "C.I.A.",
"smoking?", and "torched?". Thus the baseline at 1 hypothesis is slightly better
than the official number reported in Table 1.
Number of Hypotheses (N)    IAIR
1                           1.368
2                           1.353
5                           1.366
10                          1.365
20                          1.367
50                          1.317
100                         1.320
200                         1.325
Table 7: IR Performance of N-Best hypotheses on the TREC-6 test set.
While it is encouraging that an improvement in retrieval can be obtained at all
by this method, it is clear that further work will be required if the promise of
this idea is to be realized. In particular, the increasingly harmful effect of
adding large numbers of less probable hypotheses to the documents suggests that
discounting each hypothesized word by its recognition score may improve
performance even more.
7. Scaling Collection Size
Many of our experiments, including some of the ones reported here, seem to
suffer from two problems. The effect size of our experimental variables seems to
be fairly small, and the difference between the reference text retrieval and the
speech recognition transcript retrieval is only a few percent of the inverse
average inverse rank. If this relationship holds even as we scale to larger,
more realistic, and more useful collections, then we can consider the problem of
spoken document retrieval practically solved to within a few percent of perfect
text retrieval effectiveness.
To test this hypothesis using the TREC-6 training set, we increased the number
of text documents in the corpus up to 14,000 and measured the inverse average
inverse rank for the same retrieval queries. However, instead of actually
performing speech recognition on the added documents, artificially degraded
texts were used. In this case, the degradation method attempted to model
word errors only through deletion of term words. Although this is a primitive
model of speech recognition errors, it may represent an upper performance bound.
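A deletion-only degradation of this kind can be sketched as below. The details are assumptions for illustration: each term is dropped independently at random, and the deletion rate of 0.25 and the fixed seed are hypothetical choices, not values taken from the experiment.

```python
import random


def degrade(terms, deletion_rate, seed=0):
    """Drop each term independently with probability `deletion_rate`."""
    rng = random.Random(seed)
    return [t for t in terms if rng.random() >= deletion_rate]


# Hypothetical conditioned document and an assumed deletion rate.
document = ["senate", "vote", "tax", "bill", "budget", "debate"] * 50
degraded = degrade(document, deletion_rate=0.25)
print(len(document), len(degraded))
```

Since this model never inserts or substitutes terms, it cannot create spurious matches with query words, which is one reason it may be optimistic compared with real recognizer output.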
Figure 1 shows the relationship between the inverse average inverse rank
information retrieval performance and the size of the document collection. As
more documents are added to the collection, the gap between the reference (perfect
text) retrieval and the speech recognition based retrieval grows. At collections
larger than 10,000 documents the gap starts to widen significantly. We can
expect to experience larger discrepancies between speech transcribed and
perfectly transcribed documents, which may make spoken document retrieval
unusable for collections numbering 100,000 documents or more.
8. Summary
There are several conclusions we can draw based on our experiments:
First of all, we have found that even large reductions in speech recognition
word error rate result only in small information retrieval improvements. On the
converse side, the quality of information retrieval is a lot higher than the
speech recognition word error rate figures would indicate. Despite fairly high
word error rates, information retrieval performance was only slightly degraded
for speech recognizer transcribed documents.
Stemmed language modeling did not help speech recognition or information
retrieval.
A 51,000-word vocabulary covered the range of words
used in the queries quite well. Only one query word was truly outside of this
vocabulary.
We could expect better performance on the reference texts if better IR weighting
schemes and pre-processing functions were used. These improvements would
probably also result in small gains in the speech corpus, although we have done
no studies of this, since improving the IR functions was not our focus.
Confidence Measures provide no benefit. Even an oracle confidence measure, which
can reliably single out the correctly recognized words and discard all the other
words, provides only a small increase in retrieval effectiveness (as measured in
IAIR). This points to the conclusion that deleted (missing) words are most
critical, while inserted words do not affect the retrieval in the same
proportion.
Since deleted (missing) words are critical to the retrieval effectiveness, one
can try to reduce their number by adding probable words from the speech recognizer
hypothesis N-Best list. Using the N-Best list to augment the speech recognition
output with likely words shows great promise. Our experiments indicate that this
approach might drastically reduce the difference between perfect text
transcripts and speech recognizer generated transcripts.
In general, most of our findings are very preliminary. While we believe we may
have uncovered trends, the TREC-6 SDR track provided too little data for
definitive experiments, so we did not conduct significance tests on the observed
trends. Furthermore, the difference between the speech
recognizer generated transcripts and the perfect text transcripts was too small
in this corpus. However, the experiments we have done on increasing the scale of
these document collections by orders of magnitude raise the concern that
the initially promising results for SDR will not hold up in larger data sets.
Figure 1: Effect of collection size on IR performance of the TREC-6 training set
with reference and artificially degraded documents. The X Axis is the number of
documents used in the analysis, and the Y Axis is the IAIR.
REFERENCES
[1] M.-Y. Hwang, "Subphonetic Acoustic Modeling for Speaker-Independent Continuous
Speech Recognition," PhD Thesis, CMU-CS-93-230, Carnegie Mellon University,
1993.
[2] L. L. Chase, PhD thesis, Carnegie Mellon University Robotics Tech Report, 1997.
[3] S. Cox and R. Rose, "Confidence Measures for the Switchboard Database," IEEE
International Conference on Acoustics, Speech and Signal Processing, 1996.
[4] L. Gillick and Y. Ito, "Confidence Estimation and Evaluation," LVCSR Hub-5
Workshop Presentation, 1996.
[5] P. Jeanrenaud, M. Siu, and H. Gish, "Large Vocabulary Word Scoring as a Basis for
Transcription Generation," Proceedings of Eurospeech, 1995.
[6] M. F. Porter, "An algorithm for suffix stripping," Program, 14(3):130-137, July
1980.
[7] J. R. Quinlan, C4.5: Programs for Machine Learning, San Francisco, Calif.: Morgan
Kaufmann, 1993.
[8] K. Seymore, S. Chen, M. Eskenazi, and R. Rosenfeld, "Language and Pronunciation
Modeling in the CMU 1996 Hub 4 Evaluation," Proc. Spoken Language Systems
Technology Workshop, Morgan Kaufmann Publishers, 1997.
[9] M. Siegler, U. Jain, B. Raj, and R. Stern, "Automatic Segmentation,
Classification, and Clustering of Broadcast News Audio," Proc. Spoken Language
Systems Technology Workshop, Morgan Kaufmann Publishers, 1997.
[10] M. J. Witbrock and A. G. Hauptmann, "Speech Recognition and Information
Retrieval," Proceedings of the 1997 DARPA Speech Recognition Workshop,
Chantilly, VA, February 2-5, 1997.
[11] D. Graff, Z. Wu, R. MacIntyre, and M. Liberman, "The 1996 Broadcast News Speech
and Language-Model Corpus," Proceedings of the 1997 DARPA Speech Recognition
Workshop, Chantilly, VA, February 2-5, 1997.
[12] S. Katz, "Estimation of probabilities from sparse data for the language model
component of a speech recognizer," IEEE Transactions on Acoustics, Speech and
Signal Processing, ASSP-35(3):400-401, March 1987.
[13] G. Salton, Ed., "The SMART Retrieval System," Prentice-Hall, Englewood Cliffs,
NJ, 1971.