Literature Review Of
Adverse Impact
Tracey Blackwood
Industrial Seminar
Literature Review Project
October 31, 2000
Adverse impact in an employment testing context occurs when the testing procedure used consistently produces significantly different results for members of one particular group than it does for members of another group. This disparity in results produces different employment outcomes for members of the lower scoring group. Adverse impact is most commonly described in terms of selection testing procedures, and this is where much of the research has been done. When adverse impact occurs in employment selection testing, the difference in average results can lead to significantly different selection rates for each group. The legal definition of adverse impact as outlined by the Equal Employment Opportunity Commission is a selection rate for any racial, ethnic, or sex subgroup which is less than 4/5 of the rate for the group with the highest rate of selection (EEOC, 1978). This is commonly referred to as the four-fifths rule. Once adverse impact has been discovered, the employer using the test has a legal obligation to show that the test is job related and that unfair discrimination has not occurred.
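The four-fifths comparison described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names, group labels, and applicant counts are hypothetical and do not come from the EEOC guidelines or from any study reviewed here.

```python
def selection_rate(selected, applicants):
    """Proportion of applicants in a group who were selected."""
    if applicants <= 0:
        raise ValueError("applicants must be positive")
    return selected / applicants


def four_fifths_check(groups):
    """Compare each group's selection rate with the highest group rate.

    `groups` maps a group label to a (selected, applicants) pair.
    Returns a dict mapping each label to (impact_ratio, flagged), where
    flagged is True when the ratio falls below 4/5.
    """
    rates = {name: selection_rate(s, n) for name, (s, n) in groups.items()}
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group has any selections")
    return {name: (rate / highest, rate / highest < 0.8)
            for name, rate in rates.items()}


# Hypothetical applicant counts, for illustration only.
result = four_fifths_check({"group_a": (48, 80), "group_b": (12, 40)})
for name, (ratio, flagged) in result.items():
    print(name, round(ratio, 2), "adverse impact indicated" if flagged else "ok")
```

In this made-up example group_a is selected at a rate of 0.60 and group_b at 0.30, so the ratio for group_b is one half of the highest rate and falls below the 4/5 threshold.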
There is an abundance of information and research on adverse impact. It occurs in a number of contexts, with various groups commonly responding in different ways to separate forms of evaluation and testing. Likewise, researchers in the field have proposed and examined many possible ways to reduce adverse impact in selection. At the heart of this research is the issue of validity. Specifically, researchers want to find methods of testing which will effectively reduce existing adverse impact yet remain valid for their purpose of predicting the actual job performance of candidates. There is considerable controversy surrounding this issue, as many feel that the adverse impact witnessed in certain kinds of testing procedures represents real differences in the population and is not exclusively the product of a flawed test. The issue therefore lies in finding methods to reduce adverse impact which uphold validity.
Before examining the specific details of adverse impact research it is important to differentiate between selection procedures and individual selection tests. Rather than relying on one specific test to determine employment decisions, many employers use a combination of predictors to test individually for separate criteria that are deemed to be job relevant. Collectively, all of the tests used by an organization in the employment selection process will hereafter be referred to as the selection procedure or process. Much of the research in the field of adverse impact has focused on attempts to reduce adverse impact by weighting certain criteria (with less adverse impact) more heavily than others (with high adverse impact) in the selection procedure. Another common method is to introduce low adverse impact tests and their associated criteria into a procedure that has demonstrated high adverse impact. The most commonly addressed trend in adverse impact research involves the use of cognitive ability tests in employment selection. While research has shown that cognitive ability tests can validly predict job performance in most cases (Hunter, 1986; Schmidt, Ones, & Hunter, 1992), there is also a demonstrated tendency for African-Americans to score on average one standard deviation lower and for Hispanic candidates to score 0.6 to 0.8 standard deviation units lower on these tests than their white counterparts (Hunter & Hunter, 1984; Jensen, 1980). Much of the empirical research on modification of selection procedures has addressed this discrepancy.
Regression analysis has become the standard measurement of test bias. The approach was developed by Cleary in 1968 and is known as the Cleary model (Bartram, 1995, p. 54). Regression analysis relates test scores to actual job performance on the two axes of a graph. Two lines are plotted, one for the majority group and one for the minority group. If the lines have different slopes, the test predicts performance well for one group and not the other. If the lines have the same slope but different intercepts, the same test score corresponds to a consistently different level of predicted performance in each group, so a single common line would systematically over- or underpredict performance for one of them (Sackett & Wilk, 1994).
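The comparison of group regression lines can be sketched as follows. This is a minimal illustration, assuming ordinary least squares fitted separately to each group. The function names and the data are hypothetical, and a real bias analysis would also test whether the slope and intercept differences are statistically significant, which this sketch does not do.

```python
def fit_line(test_scores, performance):
    """Ordinary least squares fit of performance = intercept + slope * score."""
    n = len(test_scores)
    if n != len(performance) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(test_scores) / n
    mean_y = sum(performance) / n
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    if sxx == 0:
        raise ValueError("test scores must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, performance))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def compare_group_lines(scores_a, perf_a, scores_b, perf_b):
    """Fit one line per group and report how the two lines differ."""
    slope_a, intercept_a = fit_line(scores_a, perf_a)
    slope_b, intercept_b = fit_line(scores_b, perf_b)
    return {
        "slope_difference": slope_a - slope_b,
        "intercept_difference": intercept_a - intercept_b,
    }


# Hypothetical data: both groups share a slope, but the lines are offset.
comparison = compare_group_lines(
    [1, 2, 3, 4], [3, 5, 7, 9],   # group A
    [1, 2, 3, 4], [2, 4, 6, 8],   # group B
)
print(comparison)
```

In this made-up data the slopes are identical and only the intercepts differ, which is the pattern in which a single common line would mispredict performance for one group by a constant amount.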
It is important at this point to discuss very briefly the legal history of the adverse impact issue, which has been defined by various court rulings in the years following the 1964 Civil Rights Act. Since many court cases have applied in some way to the overall issue, in the interest of space only a few of the most relevant will be discussed here. The Civil Rights Act of 1964 was the genesis of the subsequent court cases that defined adverse impact, bringing about a number of legal battles centered on the adverse impact of selection tests. The first was the 1971 case of Griggs v. Duke Power Company. The ruling in this case established the guideline that the burden of proof lies on the employer to prove that an employment requirement is job-related (Cascio, 1998, p. 25). It also established that discriminatory preference of any kind in a test was prohibited, regardless of intent (fair in form but discriminatory in operation) (Cascio, 1998, p. 25). In subsequent cases throughout the 1970s, the Supreme Court further defined proper employment selection procedure, concerning proof of job relevance, reliable and unbiased measures, and predictor validity for minority and majority candidates (Cascio, 1998, p. 26).
The EEOC in 1978 established the four-fifths rule, and defined an employment test as anything which tests the suitability of an applicant (Bartram, 1995). It also created a guideline for selection tests stating that if two or more tests are available and equally valid for an organization's purpose, the test which has demonstrated the least adverse impact should always be used (EEOC, 1978). The Civil Rights Act of 1991 prohibited the practice of within-group norming, used by some organizations to reduce adverse impact and increase minority hiring (Sackett & Wilk, 1994). This act also established the guidelines of using relatively low cut scores, practice tests, and preparation of all candidates to reduce adverse impact when no alternatives to the present test are available (Bartram, 1995). All of the research outlined in the following pages has been conducted in the aftermath of these cases and frequently refers to the guidelines and practices established by this body of legislation.
With the background information established, it is now time to turn directly to the adverse impact research of the last decade, its results, implications, applications, and shortcomings. The effects of altering criterion weights among various predictors in selection procedures have been the subject of a great deal of research. Weights correspond to organizational values concerning the particular dimensions of performance. Based upon a conceptualization of two broad dimensions that apply to all jobs, task performance and contextual performance, Hattrup, Rock, and Scalia (1997) conducted an experiment examining strategies for weighting these two factors in a composite criterion measure for job selection, and the effect that these different weights would have on adverse impact. The researchers asserted, based on regression data, that increasing the weight on task performance dimensions implied increased weight on cognitive ability testing in the selection procedure, and likewise that contextual performance weight was correlated with personality testing (Hattrup et al., 1997). In contrast to the well-established phenomenon of adverse impact in cognitive testing, research has shown that measures of work orientation and conscientiousness (contextual measures) have no significant racial adverse impact and are also valid predictors of job performance (Ones, Viswesvaran, & Schmidt, 1993). To summarize the results of this study, adverse impact against minorities was greatest when task performance was the only criterion dimension used, and decreased as contextual performance was weighted more heavily. However, when contextual performance was weighted higher than task performance, overall job performance decreased as a result (Hattrup et al., 1997).
The researchers felt, based on their data, that organizations need to place greater emphasis on contextual performance in selection and evaluation of employees in order to offset the adverse impact of cognitive based tests (Hattrup et al., 1997).
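The effect of shifting weight between two measures can be illustrated with a short calculation. The sketch below is not the procedure used by Hattrup et al. It assumes two standardized measures with a known intercorrelation, and uses the standard result that a weighted composite has a standardized group difference equal to the weighted sum of the individual differences divided by the standard deviation of the composite. The function name and all numerical values are hypothetical: a one standard deviation difference is assumed on the cognitive measure, no difference on the contextual measure, and no correlation between the two.

```python
import math


def composite_group_difference(weights, differences, correlation):
    """Standardized group difference (d) on a two-measure weighted composite.

    Assumes both measures are standardized (unit variance within groups)
    and correlate at `correlation` with each other.
    """
    w1, w2 = weights
    d1, d2 = differences
    variance = w1 ** 2 + w2 ** 2 + 2 * w1 * w2 * correlation
    if variance <= 0:
        raise ValueError("composite variance must be positive")
    return (w1 * d1 + w2 * d2) / math.sqrt(variance)


# Hypothetical inputs: d = 1.0 on the cognitive measure, d = 0.0 on the
# contextual measure, and uncorrelated measures.
for cognitive_weight in (1.0, 0.5, 0.2):
    d = composite_group_difference(
        (cognitive_weight, 1.0 - cognitive_weight), (1.0, 0.0), 0.0
    )
    print(cognitive_weight, round(d, 3))
```

Under these assumptions the composite difference shrinks as weight moves away from the cognitive measure, but equal weighting does not simply halve it, because the composite also has a smaller standard deviation than a single measure.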
Wilfried DeCorte (1999) proposed a method similar to that studied by Hattrup et al. His study sought to develop a predictor composite that maximized the job performance of those selected and at the same time controlled the level of adverse impact to comply with the 4/5 rule. He noted that previous researchers had focused strictly on controlling adverse impact and had not worked on balancing this control with job performance, and that previous attempts to implement predictor composites controlling for adverse impact had resulted in a lower quality workforce (DeCorte, 1999, p. 695). Five scenarios were tested, each varying the weights of predictors and cutoff scores. Two calculations were performed for each scenario: the first obtained predictor weights in a similar fashion to Hattrup et al. (1997), and the second used constrained programming to determine predictor weights and cutoff scores which maximized the quality of applicants and controlled for adverse impact. The results measured the tradeoff between quality and control of adverse impact for each scenario. It was determined that the "cost of removing adverse impact may be quite substantial" if an organization values task performance (DeCorte, 1999, p. 700). The average quality of employees was found to be substantially lower with predictor composites that controlled adverse impact. However, the constrained programming method identified predictor composites which maximized the quality of the workforce within the limits of the 4/5 rule, with much better results than those obtained with the previous method of multiple regression (DeCorte, 1999). Thus a more refined and exact method for determining predictor weights within composites was developed as a result of this study.
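The idea of maximizing quality subject to the 4/5 rule can be illustrated with a simple grid search. This is not DeCorte's constrained programming procedure. It is a deliberately simplified stand-in that assumes two standardized predictors, normally distributed scores with equal spread in both groups, a fixed cutoff expressed in majority-group standard deviation units, and composite validity as a proxy for workforce quality. All function names and numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def composite_stats(w, validities, differences, correlation):
    """Validity and group difference of a composite weighting predictor 1 by w
    and predictor 2 by (1 - w), for standardized predictors."""
    w1, w2 = w, 1.0 - w
    variance = w1 ** 2 + w2 ** 2 + 2 * w1 * w2 * correlation
    if variance <= 0:
        raise ValueError("composite variance must be positive")
    sd = math.sqrt(variance)
    validity = (w1 * validities[0] + w2 * validities[1]) / sd
    d = (w1 * differences[0] + w2 * differences[1]) / sd
    return validity, d


def impact_ratio(d, cutoff_z):
    """Minority selection rate divided by majority selection rate when the
    minority mean lies d standard deviations below the majority mean."""
    majority_rate = 1.0 - STANDARD_NORMAL.cdf(cutoff_z)
    minority_rate = 1.0 - STANDARD_NORMAL.cdf(cutoff_z + d)
    return minority_rate / majority_rate


def best_weight(validities, differences, correlation, cutoff_z, steps=100):
    """Grid search for the weight with the highest composite validity among
    those whose impact ratio is at least 4/5. Returns (weight, validity),
    or None if no weight on the grid satisfies the constraint."""
    best = None
    for i in range(steps + 1):
        w = i / steps
        validity, d = composite_stats(w, validities, differences, correlation)
        if impact_ratio(d, cutoff_z) >= 0.8:
            if best is None or validity > best[1]:
                best = (w, validity)
    return best


# Hypothetical inputs: predictor 1 is more valid but shows a large group
# difference; predictor 2 is less valid and shows none.
print(best_weight((0.5, 0.3), (1.0, 0.0), 0.0, cutoff_z=0.0))
```

With these illustrative inputs the constraint forces most of the weight onto the less valid predictor, and the resulting composite validity is well below what an unconstrained weighting would reach. This mirrors, in a toy setting, the tradeoff between quality and adverse impact that the study reports.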
A recent study conducted by Barrett, Carobine, and Doverspike in 1999 measured the racial adverse impact of short-term memory tests as compared to cognitive ability tests in personnel selection. Using statistically matched pairs of white and black applicants, the study compared scores on reading comprehension (cognitive ability) tests and short-term memory tests in an employee selection process. The study found a considerable reduction in score differences between the groups on the short-term memory test. The results of this study suggest that implementing short-term memory tests into the selection process can reduce differences between white and black applicants in overall score (Barrett et al., 1999).
Dave Bartram, writing for the International Journal of Selection and Assessment, suggested that varying cutoff scores on selection tests might be a key element in reducing adverse impact (1995). His research tested the effects of using a common cutoff score for all groups, set at a level so low that adverse impact would be minimized. Of course the cutoff score must be set high enough for the test to still retain its utility, so the study sought to find an acceptable balance by measuring the effects of varying cutoff scores on test utility. This study found that when mean differences between groups averaged between 0.2 and 0.5 z-scores, the test could retain utility without significant adverse impact (Bartram, 1995). However, utility and adverse impact both quickly became extreme on their respective sides of this narrow margin of safety. Ultimately this study suggested three ways to reduce adverse impact in a valid selection test. First, organizations should use a relatively low cut score to initially screen out applicants who are not qualified. Secondly, organizations should ensure that the remaining candidates are prepared properly for the test. Finally, practice tests should be implemented before actual testing. The paper suggested that these actions would significantly reduce bias and enable higher cutoff scores to be used in selecting qualified applicants (Bartram, 1995).
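The relationship between a common cutoff score and the four-fifths rule can be sketched numerically. The following is not Bartram's analysis. It assumes normally distributed scores with equal spread in both groups, with the lower scoring group's mean sitting d standard deviations below the other, and searches by bisection for the highest cutoff at which the selection rate ratio still reaches 4/5. The function names and search bounds are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def impact_ratio(cutoff_z, d):
    """Lower scoring group's selection rate divided by the higher scoring
    group's rate, for a cutoff in the higher group's z-score units."""
    higher_rate = 1.0 - STANDARD_NORMAL.cdf(cutoff_z)
    lower_rate = 1.0 - STANDARD_NORMAL.cdf(cutoff_z + d)
    return lower_rate / higher_rate


def highest_compliant_cutoff(d, low=-4.0, high=4.0, tol=1e-6):
    """Bisection search for the highest cutoff whose impact ratio is >= 0.8.

    Relies on the ratio falling steadily as the cutoff rises, which holds
    under the normal-distribution assumption when d is positive.
    """
    if d <= 0:
        raise ValueError("d must be positive")
    if impact_ratio(low, d) < 0.8:
        raise ValueError("no compliant cutoff within the search bounds")
    if impact_ratio(high, d) >= 0.8:
        return high
    while high - low > tol:
        mid = (low + high) / 2
        if impact_ratio(mid, d) >= 0.8:
            low = mid
        else:
            high = mid
    return low


# Compare a small and a large mean difference (illustrative values).
for d in (0.2, 1.0):
    print(d, round(highest_compliant_cutoff(d), 2))
```

Under these assumptions a group difference of one standard deviation only satisfies the rule when the cutoff is set far below the mean, while a difference of 0.2 tolerates a cutoff somewhat above it. This is consistent with the narrow margin described above.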
A 1997 experiment came to five major conclusions in a study of predictor combinations to reduce adverse impact and maximize predictive efficiency (Schmitt, Rogers, Chan, Sheppard, & Jennings, 1997). The authors examined the effects of four factors (number of predictors, predictor intercorrelations, validity, and level of predictor subgroup differences) on the level of subgroup differences associated with a composite containing cognitive testing and another containing just alternate predictors (Schmitt et al., 1997, p. 720). First, the study found that the addition of three alternate predictors into a composite containing cognitive testing reduced adverse impact, but not substantially enough to ensure 4/5 rule compliance. Second, adverse impact in general remained high over a broad range of variations in the aforementioned variables. Third, alternate predictors with little or no demonstrated adverse impact and high intercorrelation and validity will have the greatest effect in reducing adverse impact when added to a selection procedure. Fourth, manipulation of predictor intercorrelation produced the largest variation between predictive validity and adverse impact goals. Fifth, relative validity levels of individual predictors are important to consider because if a composite is to be optimally weighted, those with the highest validity will be weighted the highest (Schmitt et al., 1997).
This research shows a clear trend in the scientific approach to selection testing. Research is focusing on the effects of offsetting cognitive ability testing with more personality and contextual performance testing in the selection process. From the results of the previously mentioned research it can be easily deduced that issues of validity are raised when contextual performance begins to replace task performance as a predictor of job performance. Maxwell and Arvey (1993) described a commonly held belief among specialists that "validity must be sacrificed to reduce adverse impact" (p. 433). They referred to the "golden rule" of Hartigan & Wigdor (1989), which implied that items selected for a test on the sole basis of reducing adverse impact would compromise the construct validity of a test and therefore reduce its predictive power.
Some authors have attacked the issue of adverse impact as the natural product of differences among groups in the population which comprises the labor pool rather than the product of a defective examination procedure (Wollack, 1994; Bartram, 1995). Bartram wrote that in general research has shown that "differences between ethnic groups and between the sexes in measures of ability are reflected in differences in job performance", but differences in test scores seem to overestimate the real differences measured in job performance (1995, p. 60). Stephen Wollack, writing about the merits of cognitive testing, cited "an overwhelming body of evidence" which shows vast disparities in educational skills in the general population along racial and ethnic lines (1994, p. 218). It is thus natural to reason that the problem may not lie in the testing procedure but rather be reflective of society itself. Bartram likewise came to the conclusion, following his study, that the problem was a result of real racial-ethnic differences which unfortunately are revealed in the tests, stating simply that "adverse impact is an effect of unfairness, not a cause" (1995, p. 61).
Wollack encourages employers not to follow personnel practices which deny reality and makes the important argument that "When the objective of merit selection is subordinated to affirmative action, no one wins. Organizational competence is compromised and the beneficiaries of affirmative action programs are forever stigmatized" (1994, p. 220). He outlines several steps that may be useful to employers who seek a "representative workforce of qualified employees" (1994, p. 224). First, employment tests should be honest and "appropriately rigorous", with emphasis on recruitment efforts to draw qualified minority applicants (Wollack, 1994, p. 221). Secondly, strict standards such as inflexible cutoff scores and top down selection of only the highest scorers represent decision making based on statistically insignificant differences, and banding of scores along with flexible cutoff scores may reduce group differences (Wollack, 1994). The last suggestion made is to consider a wide range of skills which may be equally applicable to the job, compose an "eligibility list" of these skills and base decisions on a composite of these scores (Wollack, 1994, p. 223).
A 1998 study performed by Ryan, Ployhart, and Friedel also seems to stress caution in using personality testing to compensate for cognitive based adverse impact. They cited a number of past experiments on the subject. They note that in the study performed by Schmitt et al., adverse impact was still very high when cognitive testing was combined with three non-cognitive factors, but practically non-existent when the cognitive section was eliminated from the composite (Schmitt, Rogers, Chan, Sheppard, & Jennings, 1997). They also caution against the use of data collected in samples which may not be representative of every particular pool of applicants, and they stress collection of local data before an organization generalizes results of a study to its particular needs (Ryan et al., 1998). One of the most important points made is that there is often considerable within group variance in minority group applicants and this can lead to misleading results in adverse impact studies (Ryan et al., 1998). The authors ran a study that measured the effects of weighting personality and ability testing differently with two large samples of applicants. The study had a number of important conclusions. First, it found that adverse impact can indeed exist in personality testing based on differences between particular samples and general population expectations. Second, the study measured adverse impact on a number of groups, not just the typical white-black cognitive impact, and how adverse impact variations in one group may affect another. Research, for example, has demonstrated that there is a general female superiority on verbal testing and male superiority in tests of quantitative ability and spatio-visual ability (Halpern, 1992).
An inverse relationship between adverse impact against African-Americans and adverse impact against women was discovered on one test, suggesting that organizations should look at multi-group adverse impact and not take a narrow approach in studying one group only. There was also a general finding that applicant pool characteristics can affect adverse impact. Specifically, if the number of minority applicants in the sample is small, any within group variation will have a great effect on overall adverse impact. The researchers also obtained mixed results from their two samples when weighting personality and cognitive measures, finding that greater weighting of personality variables did reduce adverse impact in one sample but not in the other. The conclusion from this study is that differential weighting may not be an ultimate solution to adverse impact, and any organization attempting to use this method must take the aforementioned points into account before proceeding (Ryan et al., 1998).
Score adjustment is another method of potentially reducing adverse impact which came into favor during the 1980s with the use of the General Aptitude Test Battery (GATB) in federal services working to refer potential employees to organizations (Sackett & Wilk, 1994). Scores were lower on average for Black and Hispanic applicants than for their White counterparts on the GATB. As a solution, GATB scores were converted to percentile scores within the ethnic group of the candidate, and candidates with different raw scores would be referred as equal based upon their standing within their ethnic group. This practice, known as within group norming, was outlawed by the Civil Rights Act of 1991, in addition to many other forms of score adjustment (Sackett & Wilk, 1994). Sackett and Wilk produced an article in 1994 which examined many forms of score adjustment in testing, particularly in the realm of physical ability tests which produce adverse impact against women and cognitive tests which produce adverse impact against Black and Hispanic candidates. They noted that score adjustment is still legal in the form of adding points to veterans' scores in public sector jobs. Among the rationales cited for score adjustment was the test bias issue (if a test is biased, one score does not mean the same thing to members of different groups). As examples of bias, they listed language specific tests, item formats favoring majority groups, and the argument that test taking skills are not good predictors of job performance. They noted that score adjustment would technically be justified if there was a common regression slope but a different intercept on the Cleary model for two different groups, thus showing that one group scored predictably lower in a consistent manner (Sackett & Wilk, 1994, p. 933).
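The mechanics of within group norming can be shown with a brief sketch, offered only to clarify how the historical practice worked. It assumes a simple percentile rank (the percentage of the reference group scoring at or below a given raw score). The actual GATB conversion tables are not described in the sources reviewed here, and the function names, group labels, and scores are hypothetical.

```python
def percentile_rank(score, reference_scores):
    """Percentage of the reference group scoring at or below `score`."""
    if not reference_scores:
        raise ValueError("reference group must not be empty")
    at_or_below = sum(1 for s in reference_scores if s <= score)
    return 100.0 * at_or_below / len(reference_scores)


def within_group_percentiles(candidates, norms):
    """Convert raw scores to percentile ranks within each candidate's group.

    `candidates` is a list of (name, group, raw_score) tuples and `norms`
    maps each group label to that group's reference scores.
    """
    return {name: percentile_rank(raw, norms[group])
            for name, group, raw in candidates}


# Hypothetical norm groups whose score distributions are offset by 10 points.
norms = {"group_x": [40, 50, 60, 70, 80], "group_y": [30, 40, 50, 60, 70]}
candidates = [("candidate_1", "group_x", 70), ("candidate_2", "group_y", 60)]
print(within_group_percentiles(candidates, norms))
```

The two hypothetical candidates have different raw scores but the same standing within their own groups, so they receive the same percentile and would have been referred as equals under this practice.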
Sackett and Wilk discuss several methods of score adjustment which consider group membership. Bonus points add a constant number of points to scores of people within a group. The authors point out that the success of this method depends on standard deviations being similar between groups as well. Within group norming is discussed as well. This method is more successful than bonus points, but the sample on which an organization bases its norms must be local and not national averages because of differences in applicant pools. Using separate cutoffs for different groups is technically similar to bonus points in that it holds one group to a lower standard than another (Sackett & Wilk, 1994, p. 937). Top down selection from separate lists entails ranking candidates within their groups and selecting the candidates by rank from each list to comply with the number set to be hired from each group (Sackett & Wilk, 1994). Banding groups individuals within a particular score range into bands and considers all applicants within a band to be statistically equal. Variations on this method include minority preference within bands (picking minority members of a band first), and sliding bands which reformulate the band at the next lowest score once all applicants scoring at the top of the range have been selected. A common method is to select all minority members in a band, and then the majority members at the top of that band, then slide the band down and select more minority members. This method of only selecting majority members from the top of the band and minority members from all scores within the band is very similar to the bonus point approach. A criticism of this approach is that it violates the central tenet of banding, which is that statistically similar scores should be treated equally (Sackett & Wilk, 1994).
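How a band might be formed can be sketched briefly. The width calculation below is an assumption: it uses the standard error of the difference between two scores, a formula common in the banding literature but not given in the text above. The function names, test statistics, and scores are all hypothetical.

```python
import math


def band_width(score_sd, reliability, z=1.96):
    """Band width based on the standard error of the difference between two
    scores: z * SEM * sqrt(2), where SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    sem = score_sd * math.sqrt(1.0 - reliability)
    return z * sem * math.sqrt(2.0)


def top_band(scores, width):
    """Scores treated as statistically equal to the highest score."""
    if not scores:
        raise ValueError("scores must not be empty")
    top = max(scores)
    return [s for s in scores if s >= top - width]


# Hypothetical test with SD = 10 and reliability = 0.90.
width = band_width(10.0, 0.90)
print(round(width, 2), top_band([95, 90, 88, 85, 80], width))
```

A sliding band would repeat the same calculation from the next highest remaining score once the top scorers have been selected. The minority preference variants described above differ only in the order in which applicants inside the band are chosen.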
Sackett and Wilk also discussed group differences on various tests and how these differences relate to job performance. They noted that common differences on cognitive ability tests do not show underprediction of actual job performance for the groups most adversely affected, Black and Hispanic candidates. To quote the authors directly, "[research] has documented consistent predictive validity for a wide range of jobs, a lack of predictive bias against Blacks and Hispanics, and large and consistent adverse impact by race" (Sackett & Wilk, 1994, p. 944). The authors noted that personality inventories (self-report measures of traits) are scored gender specifically, and this practice has not been controversial, but these differences are smaller in magnitude than the ethnic differences in ability tests (pp. 944-946). The authors address the issue of physical ability testing and its adverse impact against women. Very large female-male differences have been observed on measures of strength and endurance in the general population and therefore it is hard to argue for a measurement bias (Sackett & Wilk, 1994, p. 949). Thus adverse impact is inevitable unless cutoff scores are very low.
In her general defense of selection testing as a practice, Charlene Solomon noted two ends of a continuum in selection testing (1993). At one end are employers who rely strictly on interviews and recommendations. At the other end are companies who rely strictly on standardized test scores. She points out that "most experts believe that tests should only be one component of the hiring process" (Solomon, 1993, p. 100). Job relevant tests with no between group differences are the goal of selection testing according to Solomon (1993). As an example, a "job attitude test", a multidimensional selection test that measures attitudes toward a range of workplace behaviors, including "integrity, dependability, service, safety, and productivity", would be fair to basically all subgroups (Solomon, 1993, p. 101). Solomon encourages employers to first develop a list of skills needed for specific jobs in order to initially "weed people out" who simply don't have the qualifications to perform the job (1993, p. 102). The next step is to find a test vendor or publisher who uses studies that reflect information on a wide range of subjects, representative of the entire population, or specific to a population from which the employer will be hiring. This company must also keep current with EEO and ADA data to ensure legal compliance and up to date statistics on adverse impact. Alternative forms must always be investigated, and a good publisher will "have many alternative formats for each test" (Solomon, 1993, p. 102). An example would be tests which are suitable to basic language and cultural differences among applicants in sentence structure and punctuation. Scientific norming, a statistical process which assures tests are equivalent across cultures, is an important factor to consider as well (Solomon, 1993). The goal in selection testing is to "maintain and promote diversity [and] to maintain and promote productivity" (Solomon, 1993, p. 103).
In conclusion, an organization seeking to reduce adverse impact and retain validity in its selection procedure has a lot of important factors to consider. It is important that employment tests are honest and appropriately difficult, but they should also consider a wide range of skills that may be applicable to the job in question. Samples used in data trying to assess potential adverse impact of a test should be gathered locally and specifically, not based simply on national norms. Every pool of applicants is different and small differences in numbers of applicants, and within group variation in a particular applicant pool can lead to large differences in adverse impact data. Employers must avoid being too narrow in their approach to adverse impact. Multi-group adverse impact ramifications must be considered when attempting to reduce adverse impact against one particular group. Using an initially low cutoff score to screen out applicants not suited to the job, then preparing the remaining candidates properly for their selection tests can make a lot of difference in potential adverse impact difficulties. A balance must be sought between adverse impact and predictor validity in any selection procedure. The research above has examined many possible approaches to the problem, but each organization must ultimately decide which qualities it most values in workers and implement the procedure which will select those candidates and minimize the amount of adverse impact.
References
Barrett, G.V., Carobine, R.G., & Doverspike, D. (1999). The reduction of adverse impact in an employment setting using a short-term memory test. Journal of Business and Psychology, 14, 373-377.
Bartram, D. (1995). Predicting adverse impact in selection testing. International Journal of Selection and Assessment, 3, 52-61.
Cascio, W.F. (1998). Applied Psychology in Human Resource Management, 5th ed. New Jersey: Prentice-Hall, Inc.
DeCorte, W. (1999). Weighing job performance predictors to both maximize the quality of the selected workforce and control the level of adverse impact. Journal of Applied Psychology, 84, 695-702.
Equal Employment Opportunity Commission. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43, 38290-38315.
Halpern, D.F. (1992). Sex Differences in Cognitive Abilities. Hillsdale, NJ: Lawrence Erlbaum.
Hartigan, J.A., & Wigdor, A.K. (Eds.). (1989). Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery. Washington, D.C.: National Academy Press.
Hattrup, K., Rock, J., & Scalia, C. (1997). The effects of varying conceptualizations of job performance on adverse impact, minority hiring, and predicted performance. Journal of Applied Psychology, 82, 656-664.
Hunter, J.E., & Hunter, R.F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-98.
Hunter, J.E. (1986). Cognitive ability, cognitive aptitudes, job knowledge, and job performance. Journal of Vocational Behavior, 29, 340-362.
Jensen, A.R. (1980). Bias in mental testing. New York: Free Press.
Maxwell, S.E., & Arvey, R.D. (1993). The search for predictors with high validity and low adverse impact: Compatible or incompatible goals? Journal of Applied Psychology, 78, 433-436.
Ones, D.S., Viswesvaran, C., & Schmidt, F.L. (1993). Comprehensive meta-analysis of integrity test validities: Findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology, 78, 679-703.
Ryan, A.M., Ployhart, R.E., & Friedel, L.A. (1998). Using personality testing to reduce adverse impact: A cautionary note. Journal of Applied Psychology, 83, 298-307.
Sackett, P.R., & Wilk, S.L. (1994). Within-group norming and other forms of score adjustment in pre-employment testing. American Psychologist, 49, 929-952.
Schmidt, F.L., Ones, D.S., & Hunter, J.E. (1992). Personnel selection. Annual Review of Psychology, 43, 627-670.
Schmitt, N., Rogers, W., Chan, D., Sheppard, L., & Jennings, D. (1997). Adverse impact and predictive efficiency of various predictor combinations. Journal of Applied Psychology, 82, 719-730.
Solomon, C.M. (1993). Testing is not at odds with diversity efforts. Personnel Journal, 72, 100-104.
Wollack, S. (1994). Confronting adverse impact in cognitive examinations. Public Personnel Management, 23, 217-224.
