Information Research, Vol. 8 No. 4, July 2003
Task dimensions of user evaluations of information retrieval systems
F.C. Johnson, J.R. Griffiths and R.J. Hartley
Department of Information and Communications
Manchester Metropolitan University
Manchester, UK
Abstract
This paper reports on the evaluation of three search engines using a variety of
user-centred evaluation measures grouped into four criteria of retrieval system
performance. This exploratory study of users' evaluations of search engines took
as its premise that user system success indicators will derive from the
retrieval task the system supports (in its objective to facilitate search). This
resulted in the definition of user evaluation as a multidimensional construct
which provides a framework to link evaluations to system features in defined
user contexts. Our findings indicate that users' evaluations across the engines
will vary, and the dimensional approach to evaluation suggests the possible
impact of system features. Further analysis suggests a moderating effect on the
strength of the evaluation by a characterization of the user and/or query
context. The development of this approach to user evaluation may contribute
towards a better understanding of system feature and contextual impact on user
evaluations of retrieval systems.
Introduction
This paper presents the development of a framework for the evaluation of
Information Retrieval (IR) systems, focusing on Internet search engines from a
user perspective. The traditional measures for evaluation based on the relevancy
of the retrieved output may only be a partial match of users' objectives and of
the systems' objectives. The user's judgement of the success of the system may
be influenced by factors other than the recall and precision of the output. Such
factors are likely to be related to the degree to which the system meets its
objective to facilitate and maximize the users' search. Usability measures are
typically used for the evaluation of an interactive retrieval system and allow
some investigation of the impact of system features on users' search behaviour
and, in turn, system performance. An alternative approach is to replace
(action-based) usability measures with users' assessment of the system on
selected variables. Towards this end we posit that users' evaluation of a retrieval
system is a multidimensional construct based on the user information searching
process which the system seeks to support. In this paper we investigate the
development of this criterion on which users might judge the success of a
retrieval system. A small scale evaluation of three search engines was carried
out using groups of possible success indicators and the interrelations between
these variables were examined to suggest those which appear to be determinants
of users' judgment of system success (with information retrieval). This initial
investigation provides a basis on which we can speculate on the value of
dimensional user evaluations of system success defined in terms of system
suitability for the user task. It is proposed, for further research, that this
user-centred approach can provide a framework in which user evaluations on the
dimensions can relate to relevant system and situational impacts. If users'
evaluations, determined by the search process, are dependent on system features
then we can expect evaluation to vary across systems. Further, if a moderating
effect of the user query context can be observed then we may be closer to
understanding variations in users' evaluations of the same system.
Search engine evaluation and development
Harter and Hert (1997) in their comprehensive review of evaluation of IR systems
define evaluation partly as a process by which system effectiveness is assessed
in terms of the degree to which its goals and objectives are accomplished.
Modern interactive retrieval engines aim to support the user in the provision of
features and functionality which help maximize their search for the retrieval of
relevant and pertinent results. The user and their search as an integral
component of the system under investigation has been the focus of ongoing
developments in evaluation methodologies for interactive retrieval systems (for
example, Robertson and Hancock-Beaulieu, 1992 and Harman, 2000). As Dunlop (2000:
1270), reflecting on MIRA, a working group to advance novel evaluation methods
for IR applications, states:
the challenge for interactive evaluation in IR is to connect the two types of
evaluation: engine performance, and suitability for end-users.
Traditional recall and precision measures of engine performance are based on the
concept of relevance, that for a given query there is a set of documents which
match the subject matter of that query. Relative recall and precision measures
have been used for the evaluation of search engine performance but comparison
across these studies is difficult. Different scales for the relevance judgements
have been used (Chu and Rosenthal, 1996; Ding and Marchionini, 1996; Leighton
and Srivastava, 1999) and/or the results are drawn from different query sets (Gauch
and Wang, 1996; Tomaiuolo and Packer, 1996; Back, 2000). It was, however, the
long-standing objection to the very basis of these performance measures, binary
relevance judgments, which was highlighted as a major criticism of the
evaluations of web search engines participating in the Web Special Interest
track of the large-scale testing environment of the Text Retrieval Conference
(TREC-8) (Hawking et al., 1999; Hawking et al., 2001).
Sherman's report of the Infonortics 5th search engine meeting (2000) states that
the relevancy of the system output is a poor match for these systems' objectives,
which he suggests include getting information from users, providing browsing
categories, promoting popular sites and speed of results. These, it seems, are
objectives which assist a user to express a query, navigate through the
collection, or quickly get to the requested information. This trend in the
development of search
assistance features supporting casual users has been consistently noted (Sullivan,
2000; Feldman, 1998, 1999; Wiggens and Matthews, 1998). Search and retrieval
features are touted as maximising user capabilities in manipulating the
information searching process. Statistical probabilistic retrieval systems allow
natural language or, more precisely, unstructured queries which may support the
user's task of query formulation; concept processing of a search statement may
determine the probable intent of a search; relevance feedback can assist users
in modifying a query; and the use of on- and off-the-page indicators to rank the
retrieved items may assist users in judging the hit list, as may visualisation
techniques, which can be used to provide a view of a subset of the
collection.
The increasingly interactive nature of system design motivates approaches to
evaluation which accommodate the user and the process of interaction. Front-end
features will affect users' interactions which in turn will partly determine the
effective performance of the back-end system. The requirements which can be
derived from this interaction of system features, user and query are articulated
well in Belkin et al., as highlighted in Harter and Hert (1997: 26):
if we are going to be serious about evaluating effectiveness of interactive IR,
we [sic. need to] develop measures based upon the search process itself and upon
the task which has lead the searchers to engage in the IR situation.
The rise of interactivity was acknowledged from TREC-3 onwards with the
introduction of a special interactive track (Beaulieu et al., 1996) and its goal
to investigate the process as well as the outcome in interactive searching (Hersh
and Over, 2001). Various usability measures of the user-system performance, such
as number of tasks completed, number of query terms entered, number of commands
used, number of cycles or query reformulations, number of errors and time taken,
were thus derived by consideration of the users' actions or behaviour in
carrying out a search task. As such, usability measures may provide indicators
of the impact specific features of a retrieval system have on searcher behaviour.
Voorhees and Garofolo (2000) itemise studies which have investigated features
such as the effects of visualization techniques, and different styles of
interaction, while several studies have focused on the system feature of
relevance feedback in interactive IR. Belkin et al. (2001) looked at query
reformulation, and White et al. (2001) compared implicit and explicit feedback.
In the context of TREC-8 interactive track, Fowkes and Beaulieu (2000) examined
searching behaviour, related to the query formulation and reformulation stages
of an interactive search process, with a relevance feedback system. Of
particular note in this study was the moderating effect of the user query
context, where different query expansion techniques were found to be suited to
the degree of query complexity.
Others replace usability measures with satisfaction measures and ask users
about their satisfaction with or judgment of the general performance of the
system and possibly its specific features (or more widely, the interface). Su
(1992, 1998) identified twenty user measures of system success which were
grouped into the evaluation dimensions of relevance, efficiency, utility and
user satisfaction. These were correlated to the users' rating for overall system
success to determine 'value of search results as a whole' to be the best single
user measure of success. The pursuit of the best single determinant of users'
system success rating is a worthy one, since it would considerably reduce the
cost and effort of evaluation. Yet its application would mask the potentially
complex judgment resulting from the many factors with which a user engages in
interacting with the system and on which the user may draw in assessing the
system. The crux of the issue is that while a user evaluation (expressed in a
single construct) is more easily obtained, it is at the expense of knowing why.
From the system developers' perspective it may be of value to have greater
insight into user evaluations and the subtle balance of the interrelations
between the various indicators of a user judgment of system success. A single
system, for example, may be rated as providing excellent results but requiring
considerable user time and effort, thus high on effectiveness but low in
efficiency or indeed vice versa. Hildreth (2001), for example, found that 'perceived
ease of use' of an OPAC was related to 'user satisfaction with the results'. Our
concern with what might be the determinants of a user judgment such as 'ease of
use' leads to consideration of a more detailed system evaluation. For example, a
system's layout and representation of the retrieved items may contribute to a
user perception of ease of use helping the user to quickly and easily make
relevancy judgments. It would seem reasonable to speculate that a system could
score high in the user's judgment of this aspect but, in fact, score low on user
satisfaction with the search results.
Dimensions and determinants of user evaluation
A system evaluation based on various success indicators requires some basis on
which the interrelations between these factors can be sought and understood.
This requirement is perhaps demonstrated by the fact that previous studies have
been unable to establish a consistent relation between user satisfaction and the
recall and precision of the search results (Sandore, 1990; Su, 1992; Gluck,
1996; and Saracevic and Kantor, 1988). The reason for this may be that it is an
erroneous assumption that there ought to be a relation between user and system
measures of performance. It would certainly be dependent on the individual and
their situation. The framework we seek to develop is thus one which relates and
groups various user indicators of success into dimensions of the users' overall
judgment in order that meaningful relations can be sought in the evaluation. The
remainder of this paper describes our preliminary investigation into the
defining function of the retrieval task dimensions on user evaluation indicators
and measures. Ideally, user evaluation based on what the user is doing and the
system is supporting will link to system features in well defined search
contexts. A further requirement for our feasibility study is thus to identify
possible user/query contexts which may moderate users' judgments of the system.
Users with different demands on the system are likely to vary accordingly in
their assessment based on the system and its features. Succinctly put, our
feasibility study set out to identify the possible dimensions and indicators of
user evaluations in a framework in which evaluation is dependent on system
features and moderated in a user/query context.
The IR task process
Several models of information searching have been suggested which may be used to
identify the user success factors in system evaluation. Information retrieval,
for example, may be viewed as an exploratory process, a view which led Brajnik
(1999) to derive evaluation statements for criteria such as system flexibility.
The approach we explored focused on the more mechanical process models to
demarcate the tangible (system dependent) information retrieval activities. Such
models of the basics of information searching are standard, for example in
Salton (1989) as highlighted in Baeza-Yates (1999: 262), and from which the
following interacting steps are defined:
formulation and submission of a query,
examination of the results, with a possible feedback loop to re-formulate the query, and
integration of search results and evaluation of the whole search.
Each step provides some statement of user-requirement, what the goal-directed
user is trying to do with the system. Each step in the process thus represents a
dimension which defines the variables and measures on which a user may evaluate
a system's success in supporting information retrieval.
Table 1 groups these indicators of system success under each of the dimensions:
Query formulation, Query reformulation, Examination of results, and Evaluation of
search results and search as a whole. The majority of these came from existing
(and generally accepted) measures of retrieval system performance (for example,
as listed in Su, 1992), which form the broad criteria of effectiveness, utility
and efficiency. The main effect of using the lower level task dimensions is the
decomposition of interaction by the three task process steps.
Table 1: Indicators of user judgment of system success grouped by task dimensions
Effectiveness - Evaluation of results:
  Satisfaction with precision
  Satisfaction with ranking
Utility - Evaluation of search and results:
  Value of search results
  Satisfaction with results
  Resolution of the problem
  Rate value of participation
  Quality of results
Efficiency - Evaluation of search as a whole:
  Search session time
  Response time
Interaction - Query formulation:
  Satisfaction with query input
Interaction - Query reformulation:
  Satisfaction with query modification
  Satisfaction with query visualisation
Interaction - Examine results:
  Satisfaction with visualisation of item
  Satisfaction with manipulation of output
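As an illustration only, the grouping in Table 1 can be held as a nested mapping from criterion to task dimension to indicators, which makes it straightforward to collect ratings per dimension later. The names FRAMEWORK and indicators_for are hypothetical and do not come from the study; the labels are copied from Table 1.

```python
# Table 1 as a nested mapping: criterion -> task dimension -> indicators.
# FRAMEWORK and indicators_for are illustrative names, not from the paper.
FRAMEWORK = {
    "Effectiveness": {
        "Evaluation of results": [
            "Satisfaction with precision",
            "Satisfaction with ranking",
        ],
    },
    "Utility": {
        "Evaluation of search and results": [
            "Value of search results",
            "Satisfaction with results",
            "Resolution of the problem",
            "Rate value of participation",
            "Quality of results",
        ],
    },
    "Efficiency": {
        "Evaluation of search as a whole": [
            "Search session time",
            "Response time",
        ],
    },
    "Interaction": {
        "Query formulation": ["Satisfaction with query input"],
        "Query reformulation": [
            "Satisfaction with query modification",
            "Satisfaction with query visualisation",
        ],
        "Examine results": [
            "Satisfaction with visualisation of item",
            "Satisfaction with manipulation of output",
        ],
    },
}


def indicators_for(criterion):
    """Return every indicator listed under a criterion, across its dimensions."""
    return [
        indicator
        for indicators in FRAMEWORK[criterion].values()
        for indicator in indicators
    ]


print(indicators_for("Interaction"))
```

Keeping the task dimension as an explicit level, rather than flattening to criterion and indicator, preserves the decomposition of interaction into its three process steps.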
Measures of search results
Users' judgment of the success of the search results and the search as a whole
may be based on the criterion of effectiveness manifest in the system output,
the retrieved set. Traditional measures of retrieval effectiveness are based on
the notion of relevance, and evaluation of the system will be partially
dependent on the ability of the system to meet its basic function to retrieve
only relevant documents. We used Su's user measures of satisfaction with
precision and with ranking. Further indicators of a user success judgment of the
search may be based on factors other than relevance. Measures of utility focus
on the actual usefulness or value of the retrieved items to the individual
information seeker (Saracevic et al., 1988). Cleverdon (1991) argued (with
Cooper, 1973, who put forward a straight utility-theory single measure) that
retrieval effectiveness measures of recall and precision should be used in
combination with (and possibly related to) these more user-oriented measures
which are based on such factors as subjective satisfaction statements, search costs,
and time spent. Indeed, various factors may bear on users' judgements of overall
satisfaction with the value of the search results. For example, users may be
influenced by the extent to which information quality can be assumed based on
the source; the extent to which the information is accurate or correct; and, the
extent to which the information is at the right level to meet user need. Yet,
Saracevic (and Su, 1998: 558) report that standard utility measures do not exist.
Saracevic used the following evaluative statements:
How much time spent reviewing abstracts;
Assign a cost value to usefulness of results;
What contribution this information made to resolution of problem that motivated
your question;
Overall how satisfied with results.
Su found that a measure of utility value of the search results as a whole
correlated most strongly with users' overall judgement of system success and
thus proposed it to be the best single measure of a system. As measures of
utility we used satisfaction with results and resolution of the problem (derived
from Saracevic), and value of search results as a whole, value of participation
and quality of results (from Su).
Measures of user-system interaction
In the process of getting the search results the user is involved in the steps
of query formulation, re-formulation and examination of the results. These
comprise broad categories of user interaction with the system, and the specific
actions subsumed relate to how the user interacts and manipulates or commands
the system to retrieve the required information. In the absence of usability
measures, interaction will be largely determined by satisfaction measures alone.
As Belkin and Vickery (1985: 194) point out, satisfaction is a concept intended
to capture an overall judgement based on user reaction to the system, thus
extending the range of factors relevant to the evaluation. Our approach, which
has interaction as a broad category of the three interactive task dimensions,
breaks down the concept into measures based on these relevant factors, grouped in
Table 1 and explained as follows.
Query formulation: on consulting an information retrieval system, user reaction
to the system may be influenced by the perceived ease of expressing the query.
The user may be influenced in this judgement by, for example, the availability
of different search methods, such as natural language searching or power search,
to specify a search topic. A measure of user satisfaction with query input is
thus defined in terms of the perceived ease in the expression or specification
of the query.
Query reformulation: users may be influenced by any assistance or feedback
received for formulating or modifying the query, such as the system suggesting
query terms from a thesaurus or offering 'more like this' feedback options.
Another form of feedback lies with the query visualization, the assistance
provided in understanding the impact of a query. An example is the use of
folders to categorise search results, which may suggest to the user different
perspectives of the topic that are useful in refining the search. The measures
of user satisfaction with query modification and satisfaction with query
visualization are thus defined, respectively, in terms of the system suggesting
search terms or facilitating query by example, and in terms of understanding the
impact of the query.
Examining the results: on receiving results the user will be involved in some
process of interpreting the results in the given frame of the information need
and would want to see quickly and easily an item's topic or meaning and why it
was retrieved. Summary representation features for visualising the 'aboutness'
of an item might support a user in this task, e.g., by highlighting query terms,
showing category labels and providing a clear and organised layout. Thus we
defined the measures of satisfaction with visualisation of item representation
and with manipulation of the output (e.g., summary display features such as
category labels, and sort by).
Measures of search efficiency
The final stage in the retrieval task is evaluation of the search as a whole and
relates to the criterion of efficiency. Boyce et al. (1994: 241) highlight the
difference between effectiveness and efficiency thus:
an effectiveness measure is one which measures the general ability of a system
to achieve its goals. It is thus user oriented. An efficiency measure considers
units of goods or services provided per unit of resources provided.
Dong and Su (1997: 79) state that response time is becoming a very important
issue for many users. If users want to retrieve information as quickly as
possible, this may in part equate to the efficiency of the system, the
judgement of which will affect user evaluation of the system's success as a whole.
Thus, whilst efficiency seems hard to define, these studies and others (such as
Stobart and Kerridge, 1996; Nahl, 1998) seem to corroborate the importance of
speed of response as an indicator of efficiency.
Multidimensional framework for search engine evaluation
In our feasibility study these indicators of satisfaction were used to elicit a
user judgment of the success of three search engines. Our aim was not to obtain
an evaluation of these engines as such but rather to confirm or refute our
premise that a dimensional user evaluation can result in meaningful relations
sought in the user judgments made across and within the systems. This can be
stated in three propositions which were explored to varying degrees afforded by
the scope of the study.
Proposition One states that the user judgment of system success is a response to
how well the system has supported the retrieval task and, as such, is a
multidimensional construct. The nature of user evaluations was explored as
follows:
User success ratings assigned on the four criteria were correlated with an
overall success judgment to find which, if any, appears to be the most important
factor in defining user judgment of the system.
Users' ratings on the measures were correlated with the overall success ratings
for each associated criterion to find which, if any, contributed most strongly
to the user's overall rating of a criterion.
User derived reasons for attributing satisfaction ratings, overall and on each
criterion, were collected using open-ended questions and analysed to confirm, or
otherwise, our measures as those on which users themselves base an evaluation of
system success.
Proposition Two states that user ratings will vary across the systems reflecting
the support of system features to the retrieval task. In the scope of this
feasibility study we do not claim to test this, but evidence of it was sought in
the finding that user evaluations varied across the engines.
User ratings on each of the measures were compared across the search engines to
find which engine, if any, received notably higher/lower ratings. Some
speculation was made as to the possible impact of system features.
Proposition Three states that user evaluations of the systems will be moderated
by the context of the users' information query making different demands on the
system. Again this was not tested but evidence was sought of contexts leading to
the systems receiving different evaluations.
Four task identifiers were analysed against the overall user ratings and the
four evaluation criteria across all three search engines to ascertain if a
moderating effect of context was obtained.
Implementation
Twenty-three participants were recruited from second year students of the
Department of Information and Communications, MMU. A short briefing was given a
few days prior to their search session to explain the project and to present the
Information Need Characteristic Questionnaire which was to be completed before
the search session. No restrictions were placed on the type of information
sought or the purpose for which it was intended, but the questionnaire did
capture a characterisation of the search context in terms of the following
parameters used in Saracevic (1988): a) problem definition (on a scale from 1-5,
would you describe your problem as weakly defined or clearly defined?); b)
intent (on a scale from 1-5, would you say that your use of this information
will be open to many avenues, or for a specifically defined purpose); c) amount
of prior knowledge (on a scale from 1-5, how would you rank the amount of
knowledge you possess in relation to the problem which motivated the request?);
and d) expectation (on a scale from 1-5 how would you rank the probability that
information about the problem which motivated this research question will be
found in the literature?). Participants were then asked to conduct the search
using as many reformulations as required and to search for as long as they would
under normal conditions. Each was required to search the three engines, Excite,
NorthernLight and HotBot, the order of which was varied to remove any learning
curve effect. These engines were selected on the basis that each had at least one
discernible feature, so that each search would present a unique search experience
and the engines could be distinguished.
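The text says only that the order of the engines was varied; it does not state the scheme used. One common way to do this, shown here purely as a sketch under that assumption, is to cycle participants through every possible ordering so that each ordering is used about equally often. The function name assign_orders is hypothetical.

```python
from itertools import permutations

ENGINES = ["Excite", "NorthernLight", "HotBot"]


def assign_orders(n_participants, engines=ENGINES):
    """Cycle through all orderings so each is used about equally often.

    This rotation scheme is an assumption for illustration; the study
    does not report how the engine order was varied.
    """
    orders = list(permutations(engines))  # 3! = 6 possible orders
    return [orders[i % len(orders)] for i in range(n_participants)]


schedule = assign_orders(23)  # twenty-three participants, as in the study
for participant, order in enumerate(schedule[:3], start=1):
    print(participant, " -> ".join(order))
```

With 23 participants and 6 orderings the design cannot be perfectly balanced, so five orderings are used four times and one is used three times.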
Following each search, participants were required to respond on a Likert-type
scale to questions relating to each of the user satisfaction variables indicated
in Table 1 as defining each of the evaluation criteria. In addition, users were
asked to provide an overall success rating of the engine with respect to each
criterion, i.e., effectiveness, efficiency, utility and interaction. This was to
allow the testing of Proposition One as stated above, towards the identification
of the measures which, when validated, could form the basis of a judgment
for each criterion.
To measure system effectiveness users were asked to rate on a three point scale
the degree of relevance of each item retrieved, leaving it open as to how many
individual items were assessed. The searchers were provided with definitions of
relevant, partially relevant and non-relevant which had their origins in the
ISILT (Information Science Index language test) project undertaken by Keen and
Digger in the early 1970s (Keen, 1973) and which have been used in various
tests since that time. Since the searches were carried out on three different
engines, it is highly likely that identical items would be retrieved, with a
possibility that 'already seen' items are consciously or unconsciously judged to
be less relevant the second or third time around. This is partly resolved by
varying the order of the engines to which searchers were assigned. In this
instance, however, the impact was not considered to be great, as the aim was not
to evaluate the performance of the individual engines per se but to obtain a
user's expression of satisfaction with precision across the engines' results.
Participants were then asked to rate on a five point scale their satisfaction
with the precision of the search results. An overall rating of effectiveness was
obtained by the users' assessment of the overall success of the search engine in
retrieving items relevant to the information problem or purpose on a five point
scale. To measure utility users were asked to rate on a five point scale the
worth of their participation, with respect to the resulting information; the
contribution the information made to the resolution of the problem; satisfaction
with results; the quality of the results; and the value of the search results as
a whole. Participants were asked to rate on a five point scale the overall
success of the search engine in terms of the actual usefulness of the items
retrieved. To measure interaction the users were asked to rate on a five point
scale satisfaction with query input, query modification, query visualisation,
manipulation of output and satisfaction with visualisation of representation of
item. To measure efficiency participants were required to record the search
session time and to rate on a five point scale the overall success of the search
engine in retrieving items efficiently.
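The three point relevance judgements described above could be reduced to a single precision figure per search, although the text does not report how, or whether, this was done. The sketch below is one possible reduction: the label strings, the function name precision and the half credit for partially relevant items are all assumptions made for illustration.

```python
def precision(judgements, partial_weight=0.5):
    """Precision over the items a searcher chose to assess.

    judgements: list of "relevant", "partial" or "non-relevant" labels.
    partial_weight: credit given to a partially relevant item
    (0.5 is an assumption, not a value taken from the study).
    """
    if not judgements:
        raise ValueError("no items were assessed")
    credit = {"relevant": 1.0, "partial": partial_weight, "non-relevant": 0.0}
    return sum(credit[label] for label in judgements) / len(judgements)


# Made-up judgements for one search session (not data from the study).
session = ["relevant", "partial", "non-relevant", "relevant"]
print(precision(session))                    # partial items count as half
print(precision(session, partial_weight=0))  # strict: only fully relevant count
```

Because searchers decided how many items to assess, the denominator differs between sessions, which is one reason such figures are hard to compare across searches.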
Analysis and interpretation
Our primary aim, expressed in Proposition One, was to explore the
multidimensional nature of users' evaluation of a system. To begin to understand
how users might themselves evaluate these systems we correlated users' overall
success rating with the ratings assigned on the four criteria to suggest which,
if any, appears to be the most important or contributory factor to users' overall
judgement of system success. The results are shown in Table 2, where a moderate
correlation is defined as greater than 0.4 and less than 0.7 and a strong
correlation as between 0.7 and 0.9.
Table 2: Global and engine level - overall success rating correlated against the
four criteria (Spearman's rank correlation coefficient; ** = strong, * = moderate
strength correlation)
Criterion      Global  Excite  NorthernLight  HotBot
Effectiveness  .759**  .779**  .795**         .729**
Efficiency     .817**  .843**  .908**         .741**
Utility        .710**  .362    .930**         .806**
Interaction    .592*   .511*   .660*          .580*
The strength of the correlation ratings indicates that users' overall success
rating of the system appears to be multidimensional. The Efficiency criterion
held the strongest correlation with users' overall rating (.817) and Interaction
the weakest (.592) which is in accordance with other studies which suggest ease
of use and response time to be strong determinants of satisfaction. This result
is fairly consistent with the analysis done at the level of the individual
search engine, although we note the strong correlation with utility on
NorthernLight and HotBot which was not found with Excite.
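For readers unfamiliar with the statistic used in Table 2, Spearman's rank correlation can be computed as the Pearson correlation of the two sets of ranks, with tied ratings sharing the mean of the ranks they occupy. The sketch below is a minimal implementation under that standard definition; the ratings in the example are made up and are not data from the study, and the function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples of at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5


# Made-up five-point ratings for six searches (not data from the study).
overall = [5, 4, 4, 2, 3, 1]
efficiency = [5, 5, 4, 1, 3, 2]
print(round(spearman(overall, efficiency), 3))
```

Handling ties matters here because five-point ratings from a small group of searchers will contain many repeated values.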
Indicators of the dimensions
Correlations were also taken of the measures within each criterion with users'
overall rating. Strong correlations may suggest single measures as the best
indicator of users' evaluation of the criterion. Results at the global level,
across the three engines, suggest that 'satisfaction with precision' has the
strongest correlation (.733) with the judgement of the systems' effectiveness.
All the measures of utility held a strong correlation with its overall
judgement, the strongest measure being 'satisfaction with results' (.755). The
measure 'rate worth of participation' had a negative correlation, which indicates
that as utility rises the value of participation decreases. This may bring into
question users' interpretation of value of participation. We can only speculate,
but it was possibly misinterpreted as referring to the amount of user effort
exerted.
Interestingly, the measure of 'time taken to search' held a negligible
correlation (.062) with the judgement of the systems' efficiency. The judgement
of the satisfaction with systems' interaction held a correlation of moderate
strength (.506) with the measure 'satisfaction with query visualization'. This
is followed by 'satisfaction with facility to input query' (.486), 'satisfaction
with visualization of item representation' to ease understanding of item/s from
the hitlist (.452), 'ability to modify query' (.437), and 'ability to manipulate
output' (.132), this being a very weak correlation. At the level of individual
search engines, the judgement of interaction on both Excite and HotBot held the
strongest correlation with satisfaction with query visualization, whereas
NorthernLight held the strongest correlation with the ability to modify the
query. The low correlations of the measure of efficiency and the negative
correlation of the measure of value of participation are considered in the
concluding section of this paper.
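The global interaction correlations reported above can be ordered and labelled with the strength bands given for Table 2. In the sketch below the coefficients are those quoted in the text; the function name strength, and the treatment of values at or below 0.4 as 'weak' and of values above 0.9 as 'strong', are assumptions, since the text defines only the moderate and strong bands.

```python
def strength(rho):
    """Label a coefficient using the bands stated for Table 2.

    The text defines 'moderate' (above 0.4, below 0.7) and 'strong'
    (0.7 to 0.9). Treating values above 0.9 as 'strong' and everything
    at or below 0.4 as 'weak' is an assumption made for this sketch.
    """
    size = abs(rho)
    if size >= 0.7:
        return "strong"
    if size > 0.4:
        return "moderate"
    return "weak"


# Global correlations between each interaction measure and the overall
# interaction rating, as reported in the text.
interaction = {
    "query visualization": 0.506,
    "facility to input query": 0.486,
    "visualization of item representation": 0.452,
    "ability to modify query": 0.437,
    "ability to manipulate output": 0.132,
}

for measure, rho in sorted(interaction.items(), key=lambda kv: -kv[1]):
    print(f"{rho:.3f}  {strength(rho):8}  {measure}")
```

The absolute value is used so that a negative coefficient, such as the one reported for worth of participation, is labelled by its size rather than its sign.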
User derived reasons for attributing satisfaction ratings, overall and on each
criterion, were collected using open-ended questions. These were analysed to
suggest the extent to which our selected measures compared to those on which
users themselves might base an evaluation of system success. Some 250 comments
were collected and, in the main, simply confirm the users' understanding of our
intended interpretation of system effectiveness and utility. The selected, but
representative, comments in Table 3 show that the user-derived reasons for
assigning success ratings on effectiveness seem to confirm the finding that user
satisfaction with precision held the strongest correlation with user ratings of
effectiveness. The utility measures all held strong correlations and the
user-derived reasons also indicate that these were measures which users may
themselves use. Further analysis would be required to ascertain if in fact these
measures were simply variations of the same measure 'satisfaction with results'.
Table 3: User-derived reasons for assigning success ratings on the four
criteria.
Effectiveness
'Information retrieved was extremely relevant to my needs.'
'Most items retrieved appear to have some relevance.'
'Too much irrelevant information.'
Efficiency
'Ease of use.'
'Had to redefine search twice.'
'The search terms were attempting to pin down a concept that was hard to
verbalize [or] encapsulate.'
'Would become "extremely efficient" as the user becomes more adept with search
terminology phrasing and when an "advanced search" would be more appropriate.'
'Very quick, only had to search once.'
'One search term located all items that were of some relevance.'
'Needed to define search better.'
'Minimum effort, but results not good.'
'Search engine seemed efficient enough, but the search term was unusual. I think
with a more concrete search term the SE would have performed well.'
Utility
'Those items were found useful.'
'I have gained further info[rmation] on the subject I was search[ing] for.'
'Current info[rmation] was located.'
Interaction
'I changed the query once and it was helpful.'
'The SE easily allowed the query to be modified.'
'Found it hard to refine search.'
'The query was easy to change but yielded no better results.'
'Good options to change query.'
'Refining the search was hard, I couldn't think of any new queries and the SE
didn't offer any help trying to narrow down search queries, like the SE I
usually use.'
'Could lead to different routes of enquiry from the initial search term.'
The correlations found with the interaction measures were relatively low, with
satisfaction with query visualization holding the strongest correlation. The
user-derived reasons, however, indicate that there was perhaps some expectation
that the system would provide some assistance in modifying the query and that
this would impact on evaluation of system interaction. The efficiency criterion
held the strongest correlation with an overall judgment of success, yet the
measure of search time held a low correlation as a measure of efficiency.
Interestingly, the user-derived reasons for assigning ratings of success on this
criterion suggest that users relate efficiency not to time taken but to
something equating to the amount of user effort required to conduct a search.
The real worth of this investigation is that while we may suggest that user
evaluation is multidimensional and there are some obvious candidate variables on
which to base the measure of these evaluation dimensions, considerable research
is still needed to properly ascertain and validate user measures. We still need
to understand how users themselves evaluate a system and to what these measures
relate.
Feature impact
Our second proposition is that system characteristics will affect user
evaluations on the task dimensions which have defined the evaluation. Users'
ratings varied across the search engines and, within the scope of a small scale
study, this was viewed as indicative that user evaluations are not random but an
elicited response reflecting the support of the system to the users' task. As
has been noted, strong correlations between users' overall rating and utility were
found on NorthernLight and HotBot, in contrast to the very weak correlation
found on Excite. The marked difference in the strength of correlation found
between the systems is interesting but only in that it suggests that users'
overall judgement may be more strongly associated with a judgement made on a
particular criterion depending on the system. In a full-scale evaluation study
with a far larger sample, more insight and interpretation could be possible from
an analysis of the central tendency on the rating scales for the individual
measures. For example, taking the strongest correlation with users' overall
rating and interaction on NorthernLight (see Table 2), it was found that 77% of
users of this system rated the ability to modify the query whereas globally and
on the other two systems the measure of user satisfaction with query
visualization had the strongest correlation. Again this may indicate the
influence of a system feature on the users' judgement, but this cannot be
substantiated without a much larger investigation.
Impact of query context
Our final proposition was that users' evaluations of the system may be moderated
by some contextual characterisation of the user and information query. To obtain
some indication of the support for this proposition we analysed the user/query
context where a system received high/low ratings by correlating the four task
identifiers (task defined, task purpose, task knowledge and task probability)
against the overall satisfaction rating and the four criteria. Again we stress
the exploratory nature of this exercise and that a greater sample would be
required in an evaluation situation to support any analysis at this level.
Globally, across the three engines, moderate strength correlations indicated
that as task definition increases so does overall rating of system success
(.407), and satisfaction with effectiveness (.418) and efficiency (.482). The
correlations between task definition and utility (.307) and interaction (.221)
were weak. Weak or very weak correlations were obtained between task purpose,
task knowledge, and task probability and the overall success rating and the four
criteria. The suggestion that a system receives a higher rating of effectiveness
and efficiency when the user has a well-defined task is not surprising. It would
be reasonable to assume that in such a context the information seeker will have
a fairly good idea of the search requirements to obtain good results. Indeed,
the effect of the moderating context will be of more interest if notable
variations can be found between the systems evaluated.
The data for NorthernLight are indicative of such a finding. Whilst weak
correlations were found globally for task purpose, the data obtained for
NorthernLight alone gave moderate strength correlations with
Efficiency (.636) and Utility (.577). This compares with the weak
correlations found on the Excite data with Efficiency (.161) and Utility (.002). It
would not be surprising if a correlation was found for intent of task purpose
and utility across all three engines: a broad query, open to many avenues, could
lead to a high rating of the utility of the results. That such a finding is
strongly held only on one engine could lead to speculation that a feature of the
engine leads to results which better support a broad query. For example,
NorthernLight boasts features related to the visual organisation and
representation of the search results which may better support the user with a
broad query. Indeed users of the system expressed a high level of satisfaction
with item visualization, in terms of understanding the content of the hit list,
and query visualization in terms of understanding the impact of the query on the
results obtained. A larger scale study would be necessary to ascertain the
significance of this rating when compared to others.
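The per-engine comparison described above can be sketched as follows: ratings are grouped by search engine, and a task identifier is correlated with a criterion rating separately within each group. This is an illustrative sketch only. The Pearson coefficient, the record layout, the field names and the values are all assumptions, and none of the numbers are study data.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation; returns 0.0 if either side is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_by_engine(records, predictor, outcome):
    """Correlate `predictor` with `outcome` separately for each engine."""
    grouped = defaultdict(lambda: ([], []))
    for record in records:
        xs, ys = grouped[record["engine"]]
        xs.append(record[predictor])
        ys.append(record[outcome])
    return {engine: pearson(xs, ys) for engine, (xs, ys) in grouped.items()}


# Hypothetical questionnaire records (engine labels and scales are assumed).
records = [
    {"engine": "A", "task_purpose": 1, "utility": 2},
    {"engine": "A", "task_purpose": 2, "utility": 3},
    {"engine": "A", "task_purpose": 3, "utility": 4},
    {"engine": "A", "task_purpose": 4, "utility": 5},
    {"engine": "B", "task_purpose": 1, "utility": 3},
    {"engine": "B", "task_purpose": 2, "utility": 4},
    {"engine": "B", "task_purpose": 3, "utility": 4},
    {"engine": "B", "task_purpose": 4, "utility": 3},
]

for engine, r in correlation_by_engine(records, "task_purpose", "utility").items():
    print(f"engine {engine}: r = {r:.3f}")
```

A marked difference between the per-engine coefficients is the pattern that would prompt the question of whether a system feature supports a particular query context, although with small groups such differences can easily arise by chance.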
Conclusions
Our preliminary investigation into user evaluations of internet search engines
would seem to indicate that these are determined by many factors which together
may represent dimensions of some overall user judgment of the system. To explore
the value of a framework for evaluation based on multi-dimensions we defined and
grouped possible indicators of success on the search task process in which the
user is engaged. This would seem to be a reasonable approach to the substitution
of usability measures for the evaluation of interactive system objectives to
maximize search for information retrieval. The success indicators of user
satisfaction were drawn from existing measures or were defined by considering
the objectives of the system features. However, there is a need
to ascertain from users themselves what these success indicators might be and to
validate these as relating to task dimensions. This is clear, not least, from
our finding of some discrepancy in how users might define system efficiency with
user effort being a key influence on the user judgment of the system. Further
research is thus required to develop the multidimensional construct of user
evaluation as possibly a function of the system support for the user retrieval
task process.
Ultimately our aim in developing the framework is to have user evaluations link
to system features, thus allowing a system to score high in particular aspects
but not necessarily in all aspects. In this study, we can only speculate with
caution that a system feature of query modification contributed to the users'
evaluations, and that there was some observed effect of task purpose as a
moderating variable in one system. While this tempts comment, we can only
speculate that this was attributable to a system feature which better supported a
particular query context. Only in a full-scale evaluation could this be tested
using appropriate statistical techniques, such as regression analysis, to
express user judgment of the system as a function of the task defined indicators
of success and to explore, within and across systems, the relationship held
among the dependent and moderating variables. Obviously this would be an
expensive undertaking, but one which would not only allow variation to be found
in users' assessments across systems influenced by system features, but also
variations within search dimensions as influenced by user query task contexts.
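One common way to express a moderating effect in a regression is to add an interaction term between the predictor and the moderator. The sketch below fits, by ordinary least squares, the assumed model overall = b0 + b1*criterion + b2*moderator + b3*(criterion*moderator). It illustrates the general technique only and is not the analysis proposed for a full-scale study; the variable names, the model form and the data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular; coefficients not identifiable")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_moderated_regression(criterion, moderator, overall):
    """Least-squares fit returning [b0, b1, b2, b3]; b3 is the interaction term."""
    rows = [[1.0, c, t, c * t] for c, t in zip(criterion, moderator)]
    k = 4
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, overall)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical ratings: a criterion score, a task-context score, an overall score.
criterion = [1, 2, 3, 4, 5, 1, 3, 5]
moderator = [1, 1, 2, 2, 3, 3, 1, 2]
overall = [2, 2, 3, 4, 5, 2, 3, 4]

b0, b1, b2, b3 = fit_moderated_regression(criterion, moderator, overall)
print(f"intercept={b0:.3f} criterion={b1:.3f} moderator={b2:.3f} interaction={b3:.3f}")
```

An interaction coefficient that differs clearly from zero would suggest that the strength of the link between the criterion and the overall judgement depends on the task context. Judging that properly requires significance tests and a far larger sample, which this sketch does not attempt.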
Acknowledgements
This paper reports on results of a feasibility study of user evaluation of
internet search engines, Devise (Johnson et al., 2001). Thanks go to Re:source
for its funding of Devise under grant number NRG/085 from January 2000-2001. The
full report is available from our website http://www.mmu.ac.uk/h-ss/dic/cerlim/
and can be requested from the thesis and reports lending division of the British
Library. This research is being developed in the Department of Information &
Communications at MMU, and thanks are due to Sarah Crudge for many interesting
discussions on its development towards the registration of her PhD on user
evaluations of search engines.
References
Back, J. (2000). An evaluation of relevancy ranking techniques used by Internet
search engines. Library and information research news, 24(77), 30-34.
Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern information retrieval.
Reading, MA: Addison-Wesley.
Belkin, N.J., Cool, C., Kelly, D., Lin, S.J., Park, S.Y., Perez-Carballo, J.,
Sikora, C. (2001). Iterative exploration, design and evaluation of support for
query reformulation in interactive information retrieval. Information
processing and management, 37(3), 403-434.
Beaulieu, M., Robertson, S., and Rasmussen, E. (1996). Evaluating interactive
systems in TREC. Journal of the American Society for Information Science, 47(1),
85-94.
Belkin, N.J. and Vickery, A. (1985). Interaction in information systems: a
review of research from document retrieval to knowledge-based system. London:
the British Library.
Boyce, B.R., Meadow, C.T., and Kraft, D.H. (1994). The measurement of
information science. Reading, MA: Academic Press.
Brajnik, G. (1999). Information seeking as explorative learning. In: S. W.
Draper, M. D. Dunlop, I. Ruthven, and C.J. Van Rijsbergen, eds., Proceedings of
Mira 99: Evaluating Interactive Information Retrieval, Glasgow, April 1999.
British Computer Society, Electronic Workshops in Computing. Retrieved 17 July
2003 from http://www1.bcs.org.uk/DocsRepository/02800/2836/brajnik.pdf
Chu, H. and Rosenthal, M. (1996). Search engines for the world wide web: A
comparative study and evaluation methodology. In: ASIS ’96: Proceedings of the 59th ASIS
annual meeting, 33, pp.127-135. Medford, NJ: Information Today. Retrieved 8 July
2003 from http://www.asis.org/annual-96/ElectronicProceedings/chu.html
Cleverdon, C.W. (1991). The significance of the Cranfield tests on indexing
languages. In: Proceedings of the 14th International Conference on Research and
Development in Information Retrieval (ACM SIGIR ’91), pp. 3-12. New York: ACM
Press.
Cooper, W.S. (1973). On selecting a measure of retrieval effectiveness. Journal
of the American Society for Information Science, 24, 87-100.
Ding, W. and Marchionini, G. (1996). A comparative study of Web search service
performance. In: Hardin, S. ed., Proceedings of the 59th Annual Meeting of the
American Society for Information Science Vol 33. (pp. 136-142.) Baltimore, MD:
Information Today.
Dong, X., and Su, L. (1997). A comparative study of Web search service
performance. In: C. Schwartz and M. Rorvig, eds. Proceedings of the 60th Annual
Meeting of the American Society for Information Science. Vol 34. (pp. 136-142).
Medford, NJ: Information Today.
Dunlop, M. (2000). Reflections on Mira: interaction evaluation in information
retrieval. Journal of the American Society for Information Science, 51(14),
1269-1274.
Feldman, S. (1998). Web search services in 1998: trends and challenges.
Searcher, 6(6), 29-39. Retrieved 8 July 2003 from
www.infotoday.com/searcher/jun98/story2.htm
Feldman, S. (1999). Search engines: the 1999 conference. Information Today,
16(6). Retrieved 8 July 2003 from http://www.infotoday.com/IT/jun99/feldman.htm
Fowkes, H. and Beaulieu, M. (2000). Interactive searching behaviour: Okapi
experiment for Trec-8. Paper presented at the British Computer Society
Information Retrieval Special Group 22nd Annual Colloquium on Information
Retrieval Research, Cambridge, 5-7 April. Retrieved 17 July from
http://irsg.eu.org/irsg2000online/papers/fowkes.htm
Gauch, S. and Wang, G. (1996). Information fusion with ProFusion. In: H. Maurer,
ed. Proceedings of the World Conference of the Web Society (Webnet ’96), San
Francisco, CA, Oct 15-19. Retrieved 8th July from
http://www.csbs.utsa.edu:80/info/webnet96/html/155.htm.
Gluck, M. (1996). Exploring the relationship between user satisfaction and
relevance in information systems. Information Processing and Management,
32(1), 11-18.
Harman, D. (2000). What we have learned and have not learned from TREC. Paper
presented at the British Computer Society Information Retrieval Special Group
22nd Annual Colloquium on Information Retrieval Research, Cambridge, 5-7 April.
Harter, S.P. and Hert, C.A. (1997). Evaluation of information retrieval
systems: approaches, issues, and methods. Annual Review of Information Science
and Technology, 32, 3-94.
Hawking, D., Craswell, N., Thistlewaite, P., Harman, D. (1999). Results and
challenges in web search evaluation. In: The Eighth International World Wide Web
Conference, Toronto, May 11-14. Retrieved 8 July 2003 from
http://www8.org/w8-papers/2c-search-discover/results/results.html
Hawking, D., Craswell, N., Bailey, P., Griffiths, K. (2001). Measuring search
engine quality. Information Retrieval, 4, 33-59.
Hersh, W. and Over, P. (2001). TREC-9 Interactive Track Report. In: The Ninth
Text Retrieval Conference (TREC-9). Gaithersburg, MD: National Institute of
Standards and Technology. Retrieved 8 July 2003 from
http://trec.nist.gov/pubs/trec9/t9_proceedings.html
Hildreth, C.R. (2001). Accounting for users’ inflated assessments of on-line
catalogue search performance and usefulness: an experimental study. Information
Research, 6(2). Retrieved 8 July 2003 from
http://informationr.net/ir/6-2/paper101.html.
Johnson, F. C., Griffiths, J.R., and Hartley, R.J. (2001). Devise: a framework
for the evaluation of Internet search engines. London: Resource: the Council for
Museums, Archives and Libraries. (Library and Information Commission Research
Report 100)
Keen, E.M. and Digger, J.A. (1972). Report of an information science index
language test. Aberystwyth: College of Librarianship Wales.
Keen, E.M. (1973). The Aberystwyth index language tests. Journal of
Documentation, 29(1), 1-35.
Leighton, H.V. and Srivastava, J. (1999). First 20 precision among World Wide
Web search services (search engines). Journal of the American Society for
Information Science, 50(10), 870-881.
Nahl, D. (1998). Ethnography of novices’ first use of Web search engines:
affective control in cognitive processing. Internet Reference Services
Quarterly, 32(2), 69.
Robertson, S.E. and Hancock-Beaulieu, M. (1992). On the evaluation of IR
systems. Information Processing and Management, 28(4), 457-466.
Salton, G. (1989). Automatic text processing: the transformation, analysis and
retrieval of information by computer. Reading, MA: Addison-Wesley.
Sandore, B. (1990). Online searching: what measure satisfaction? Library and
Information Science Research, 12, 33-54.
Saracevic, T., Kantor, P., Chamis, A.Y., and Trivison, D. (1988). A study of
information seeking and retrieving. I. Background and methodology. Journal of
the American Society for Information Science, 39(3), 161-176.
Saracevic, T. and Kantor, P. (1988). A study of information seeking and
retrieving. II. Users, questions and effectiveness. Journal of the American
Society for Information Science, 39(3), 177-196.
Sherman, C. (2000). The FireworksFly [Available at
websearch.about.com/library/weekly/aa041800b.htm]
Sherman, C. (2000). 'Old economy' information retrieval clashes with 'new
economy' Web upstarts at the Fifth Annual Search Engine Conference: Conference
Report. Medford, NJ: Information Today. Retrieved 17 July 2003 from
http://www.infotoday.com/newsbreaks/nb000424-2.htm
Stobart, S. and Kerridge, S. (1996). An investigation into World Wide Web search
engine use from within the UK: preliminary findings. Ariadne, No. 6. Retrieved 8
July 2003 from http://www.ariadne.ac.uk/issue6/survey/
Su, L. (1992). Evaluation measures for interactive information retrieval.
Information Processing and Management, 28(4), 503-516.
Su, L. (1998). Value of search results as a whole as the best single measure of
information retrieval performance. Information Processing and Management, 34(5),
557-579.
Sullivan, D. (2000). Web search engine trends and achievements since the 1999
Boston Search Engine meeting. In: Search Engines Today and the New Frontier: the
Fifth Search Engine Meeting, Boston, Massachusetts, April 2000. Retrieved 8
July 2003 from
http://www.infonortics.com/searchengines/sh00/sullivan_files/frame.htm.
[PowerPoint presentation]
Tomaiuolo, N.G. and Packer, J.G. (1996). An analysis of Internet search engines:
assessment of over 200 search queries. Computers in Libraries, 16(6), 58-62.
Voorhees, E. and Garofolo, J. (2000). The TREC spoken document retrieval track.
Bulletin of the American Society for Information Science, 26(5). Retrieved 17
July 2003 from http://www.asis.org/Bulletin/June-00/voorheesgarofolo.html
White, R.W., Jose, J.M. and Ruthven, I. (2001). Comparing explicit and implicit
feedback techniques for web retrieval: TREC-10 Interactive track report. In: The
tenth Text Retrieval Conference (TREC 2001), Gaithersburg, Maryland, November
13-16. Gaithersburg, MD: National Institute of Standards and Technology.
Retrieved 8 July 2003 from http://trec.nist.gov/pubs/trec10/t10_proceedings.html
Wiggins, R. and Matthews, J. (1998). Plateaus, peaks and promises: the
Infonortics ’98 search engine conference. Searcher, 6(6). Retrieved 8 July 2003
from http://www.infotoday.com/searcher/jun98/story4.htm
---------------------------------------------------------------------------------
How to cite this paper:
Johnson, F.C., Griffiths, J.R. and Hartley, R.J. (2003) "Task dimensions of user
evaluations of information retrieval systems" Information Research, 8(4), paper
no. 157 [Available at: http://informationr.net/ir/8-4/paper157.html]
© the authors, 2003.