AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, INC. et al v. PUBLIC.RESOURCE.ORG, INC.

Filing 60

MOTION for Summary Judgment Filed by AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, INC., AMERICAN PSYCHOLOGICAL ASSOCIATION, INC., NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION, INC. (Attachments: #1 Statement of Facts Points of Authority, #2 Statement of Facts Statement of Undisputed Facts, #3 Declaration Declaration of Jonathan Hudis, #4 Exhibit Ex. A, #5 Exhibit Ex. B, #6 Exhibit Ex. C, #7 Exhibit Ex. D, #8 Exhibit Ex. E, #9 Exhibit Ex. F, #10 Exhibit Ex. G, #11 Exhibit Ex. H, #12 Exhibit Ex. I, #13 Exhibit Ex. J, #14 Exhibit Ex. K, #15 Exhibit Ex. L, #16 Exhibit Ex. M, #17 Exhibit Ex. N, #18 Exhibit Ex. O, #19 Exhibit Ex. P, #20 Exhibit Ex. Q, #21 Exhibit Ex. R, #22 Exhibit Ex. S, #23 Exhibit Ex. T, #24 Exhibit Ex. U, #25 Exhibit Ex. V-1, #26 Exhibit Ex. V-2, #27 Exhibit Ex. W, #28 Exhibit Ex. X, #29 Exhibit Ex. Y, #30 Exhibit Ex. Z, #31 Exhibit Ex. AA, #32 Exhibit Ex. BB, #33 Exhibit Ex. CC, #34 Exhibit Ex. DD, #35 Exhibit Ex. EE, #36 Exhibit Ex. FF-1, #37 Exhibit Ex. FF-2, #38 Exhibit Ex. FF-3, #39 Exhibit Ex. FF-4, #40 Exhibit Ex. FF-5, #41 Exhibit Ex. FF-6, #42 Exhibit Ex. GG, #43 Exhibit Ex. HH, #44 Exhibit Ex. II, #45 Exhibit Ex. JJ, #46 Exhibit Ex. KK, #47 Exhibit Ex. LL, #48 Exhibit Ex. MM, #49 Declaration Declaration of Marianne Ernesto, #50 Exhibit Ex. NN, #51 Exhibit Ex. OO, #52 Exhibit Ex. PP, #53 Exhibit Ex. QQ, #54 Exhibit Ex. RR, #55 Exhibit Ex. SS, #56 Exhibit Ex. TT, #57 Exhibit Ex. UU, #58 Exhibit Ex. VV, #59 Exhibit Ex. WW, #60 Exhibit Ex. XX, #61 Exhibit Ex. YY, #62 Exhibit Ex. ZZ, #63 Exhibit Ex. AAA, #64 Exhibit Ex. BBB, #65 Exhibit Ex. CCC, #66 Exhibit Ex. DDD, #67 Exhibit Ex. EEE, #68 Exhibit Ex. FFF, #69 Exhibit Ex. GGG, #70 Exhibit Ex. HHH, #71 Exhibit Ex. III, #72 Exhibit Ex. JJJ, #73 Declaration Declaration of Lauress Wise, #74 Exhibit Ex. KKK, #75 Exhibit Ex. LLL, #76 Declaration Declaration of Wayne Camara, #77 Exhibit Ex. MMM, #78 Declaration Declaration of Felice Levine, #79 Exhibit Ex. NNN, #80 Exhibit Ex. OOO (Public Version), #81 Exhibit Ex. PPP, #82 Exhibit Ex. QQQ, #83 Exhibit Ex. RRR, #84 Exhibit Ex. SSS, #85 Exhibit Ex. TTT-1, #86 Exhibit Ex. TTT-2, #87 Exhibit Ex. UUU, #88 Declaration Declaration of Kurt Geisinger, #89 Declaration Declaration of Dianne Schneider, #90 Text of Proposed Order Proposed Order, #91 Certificate of Service Certificate of Service)(Hudis, Jonathan). Added MOTION for Permanent Injunction on 12/22/2015 (td).

EXHIBIT V-1
Case No. 1:14-cv-00857-TSC-DAR

STANDARDS for Educational and Psychological Testing

American Educational Research Association
American Psychological Association
National Council on Measurement in Education

Copyright © 1999 by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. All rights reserved. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher.

Published by
American Educational Research Association
1430 K St., NW, Suite 1200
Washington, DC 20005

Library of Congress Card number: 99066845
ISBN: 0-935302-25-5
ISBN-13: 978-0-935302-25-7

Printed in the United States of America. First printing in 1999; second, 2002; third, 2004; fourth, 2007; fifth, 2008; and sixth, 2011.

The Standards for Educational and Psychological Testing will be under continuing review by the three sponsoring organizations. Comments and suggestions will be welcome and should be sent to The Committee to Develop Standards for Educational and Psychological Testing, in care of the Executive Office, American Psychological Association, 750 First Street, NE, Washington, DC 20002-4242.

Prepared by the Joint Committee on Standards for Educational and Psychological Testing of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education.

TABLE OF CONTENTS

5. Test Administration, Scoring, and Reporting ... 61
   Background; Standards 5.1-5.16
8. The Rights and Responsibilities of Test Takers ... 85
   Background; Standards 8.1-8.13
9. Testing Individuals of Diverse Linguistic Backgrounds ... 91
   Background; Standards 9.1-9.11
10. Testing Individuals with Disabilities ... 101
   Background; Standards 10.1-10.12

PART III: TESTING APPLICATIONS
11. The Responsibilities of Test Users ... 111
   Background; Standards 11.1-11.24
12. Psychological Testing and Assessment ... 119
   Background; Standards 12.1-12.20
13. Educational Testing and Assessment ... 137
   Background; Standards 13.1-13.19
14. Testing in Employment and Credentialing ... 151
   Background; Standards 14.1-14.17
15. Testing in Program Evaluation and Public Policy ... 163
   Background; Standards 15.1-15.13

GLOSSARY
INDEX

PREFACE

There have been five earlier documents from the three sponsoring organizations guiding the development and use of tests.
The first of these was Technical Recommendations for Psychological Tests and Diagnostic Techniques, prepared by a committee of the American Psychological Association (APA) and published by that organization in 1954. The second was Technical Recommendations for Achievement Tests, prepared by a committee representing the American Educational Research Association (AERA) and the National Council on Measurement Used in Education (NCMUE) and published by the National Education Association in 1955. The third, which replaced the earlier two, was published by APA in 1966 and prepared by a committee representing APA, AERA, and the National Council on Measurement in Education (NCME), and called the Standards for Educational and Psychological Tests and Manuals. The fourth, Standards for Educational and Psychological Tests, was again a collaboration of AERA, APA, and NCME, and was published in 1974. The fifth, Standards for Educational and Psychological Testing, also a joint collaboration, was published in 1985.

In 1991, APA's Committee on Psychological Tests and Assessment suggested the need to revise the 1985 Standards. Representatives of AERA, APA, and NCME met and discussed the revision, principles that should guide that revision, and potential Joint Committee members. By 1993, the presidents of the three organizations had appointed members, and the Committee had its first meeting in November 1993.

The Standards has been developed by a joint committee appointed by AERA, APA, and NCME. Members of the Committee were:

Eva Baker, co-chair
Paul Sackett, co-chair
Lloyd Bond
Leonard Feldt
David Goh
Bert Green
Edward Haertel
Jo-Ida Hansen
Sharon Johnson-Lewis
Suzanne Lane
Joseph Matarazzo
Manfred Meier
Pamela Moss
Esteban Olmedo
Diana Pullin

From 1993 to 1996, Charles Spielberger served on the Committee as co-chair.

Each sponsoring organization was permitted to assign up to two liaisons to the Joint Committee's project. Liaisons served as the conduits between the sponsoring organizations and the Joint Committee. APA's liaison from its Committee on Psychological Tests and Assessments changed several times as the membership of the Committee changed.

Liaisons to the Joint Committee:
AERA - William Mehrens
APA - Bruce Bracken, Andrew Czopek, Rodney Lowman, Thomas Oakland
NCME - Daniel Eignor

APA and NCME also had committees who served to monitor the process and keep relevant parties informed.

APA Ad Hoc Committee of the Council of Representatives:
Melba Vasquez
Donald Bersoff
Stephen DeMers
James Farr
Bertram Karon
Nadine Lambert
Charles Spielberger

NCME Standards and Test Use Committee:
Gregory Cizek
Allen Doolittle
Lee Ann Gamache
Donald Ross Green
Ellen Julian
Tracy Muenz
Nambury Raju

A management committee was formed at the beginning of this effort. They monitored the financial and administrative arrangements of the project and advised the sponsoring organizations on such matters.

Management Committee:
Frank Farley, APA
George Madaus, AERA
Wendy Yen, NCME

Staffing for the revision included Dianne Brown Maranto as project director and Dianne L. Schneider as staff liaison. Wayne J. Camara served as project director from 1993 to 1994. APA's legal counsel conducted the legal
review of the Standards. William C. Howell and William Mehrens reviewed the standards for consistency across chapters. Linda Murphy developed the indexing for the book.

The Joint Committee solicited preliminary reviews of some draft chapters from recognized experts. These reviews were primarily solicited for the technical and fairness chapters. Reviewers are listed below:

Marvin Alkin
Philip Bashook
Bruce Bloxom
Jeffery P. Braden
Robert L. Brennan
John Callender
Ronald Cannella
Lee J. Cronbach
James Cummins
John Fremer
Kurt F. Geisinger
Robert M. Guion
Walter Haney
Patti L. Harrison
Richard Jeanneret
Gerald P. Koocher
Frank Landy
Ellen Lent
Robert Linn
Theresa C. Liu
Stanford von Mayrhauser
Milbrey W. McLaughlin
Samuel Messick
Craig N. Mills
Robert J. Mislevy
Kevin R. Murphy
Mary Anne Nester
Maria Pennock-Roman
Carole Perlman
Michael Rosenfeld
Jonathan Sandoval
Cynthia B. Schmeiser
Kara Schmitt
Neal Schmitt
Richard J. Shavelson
Lorrie A. Shepard
Mark E. Swerdlik
Janet Wall
Anthony R. Zara

Draft versions of the Standards were widely distributed for public review and comment three times during this revision effort, providing the Committee with a total of nearly 8,000 pages of comments. Organizations who submitted comments on drafts are listed below. Many individuals contributed to the input from each organization, and although we wish we could acknowledge every individual who had input, we lacked information as to who contributed to each organization's response. The Joint Committee could not have completed its task without the thoughtful reviews of so many professionals.

Sponsoring Associations:
American Educational Research Association (AERA)
American Psychological Association (APA)
National Council on Measurement in Education (NCME)

Membership Organizations (Scientific, Professional, Trade & Advocacy):
American Association for Higher Education (AAHE)
American Board of Medical Specialties (ABMS)
American Counseling Association (ACA)
American Evaluation Association (AEA)
American Occupational Therapy Association
American Psychological Society (APS)
APA Division of Counseling Psychology (Division 17)
APA Division of Developmental Psychology (Division 7)
APA Division of Evaluation, Measurement, and Statistics (Division 5)
APA Division of Mental Retardation & Developmental Disabilities (Division 33)
APA Division of Pharmacology & Substance Abuse (Division 28)
APA Division of Rehabilitation Psychology (Division 22)
APA Division of School Psychology (Division 16)
Asian American Psychological Association (AAPA)
Association for Assessment in Counseling (AAC)
Association of Test Publishers (ATP)
Australian Council for Educational Research Limited (ACER)
Chicago Industrial/Organizational Psychologists (CIOP)
Council on Licensure, Enforcement, and Regulation (CLEAR), Examination Resources & Advisory Committee (ERAC)
Equal Employment Advisory Council (EEAC)
Foundation for Rehabilitation Certification, Education and Research
Human Sciences Research Council, South Africa
International Association for Cross-Cultural Psychology (IACCP)
International Brotherhood of Electrical Workers
International Language Testing Association
International Personnel Management Association Assessment Council (IPMAAC)
Joint Committee on Testing Practices (JCTP)
National Association for the Advancement of Colored People (NAACP), Legal Defense and Educational Fund, Inc.
National Center for Fair and Open Testing (FairTest)
National Organization for Competency Assurance (NOCA)
Personnel Testing Council of Metropolitan Washington (PTC/MW)
Personnel Testing Council of Southern California (PTC/SC)
Society for Human Resource Management (SHRM)
Society of Indian Psychologists (SIP)
Society for Industrial and Organizational Psychology (APA Division 14)
Society for the Psychological Study of Ethnic Minority Issues (APA Division 45)
State Collaborative on Assessment & Student Standards, Technical Guidelines for Performance Assessment Consortium (TGPA)
Telecommunications Staffing Forum
Western Region Intergovernmental Personnel Assessment Council (WRIPAC)

Credentialing Boards:
American Board of Physical Medicine and Rehabilitation
American Medical Technologists
Commission on Rehabilitation Counselor Certification
National Board for Certified Counselors (NBCC)
National Board of Examiners in Optometry
National Board of Medical Examiners
National Council of State Boards of Nursing

Government and Federal Agencies:
Army Research Institute (ARI)
California Highway Patrol, Personnel and Training Division, Selection Research Program
City of Dallas, Civil Service Department
Commonwealth of Virginia, Department of Education
Defense Manpower Data Center (DMDC), Personnel Testing Division
Department of Defense (DOD), Office of the Assistant Secretary of Defense
Department of Education, Office of Educational Improvement, National Center for Education Statistics
Department of Justice, Immigration and Naturalization Service (INS)
Department of Labor, Employment and Training Administration (DOL/ETA)
U.S. Equal Employment Opportunity Commission (EEOC)
U.S. Office of Personnel Management (OPM), Personnel Resources & Development Center

Test Publishers/Developers:
American College Testing (ACT)
CTB/McGraw-Hill
The College Board
Educational Testing Service (ETS)
Highland Publishing Company
Institute for Personality & Ability Testing (IPAT)
Professional Examination Service (PES)

Institutions:
Center for Creative Leadership
Gallaudet University, National Task Force on Equity in Testing Deaf Professionals
Kansas State University
National Center on Educational Outcomes (NCEO)
Pennsylvania State University
University of Haifa, Israel
University of North Carolina at Charlotte
University of Southern Mississippi, Department of Psychology

When the Joint Committee completed its task of revising the Standards, it then submitted its work to the three sponsoring organizations for approval. Each organization had its own governing body and mechanism for approval, as well as definitions for what their approval means.
AERA: This endorsement carries with it the understanding that, in general, we believe the Standards to represent the current consensus among recognized professionals regarding expected measurement practice. Developers, sponsors, publishers, and users of tests should observe these Standards.

APA: The APA's approval of the Standards means the Council adopts the document as APA policy.

NCME: NCME endorses the Standards for Educational and Psychological Testing and recognizes that the intent of these Standards is to promote sound and responsible measurement practice. This endorsement carries with it a professional imperative for NCME members to attend to the Standards.

Although the Standards is prescriptive, it does not contain enforcement mechanisms. These standards were formulated with the intent of being consistent with other standards, guidelines, and codes of conduct published by the three sponsoring organizations and listed below. The reader is encouraged to obtain these documents, some of which have references to testing and assessment in specific applications or settings.

The Joint Committee on the Standards for Educational and Psychological Testing

References

American Educational Research Association. (June, 1992). Ethical Standards of the American Educational Research Association. Washington, DC: Author.

American Federation of Teachers, National Council on Measurement in Education, & National Education Association. (1990). Standards for Teacher Competence in Educational Assessment of Students. Washington, DC: National Council on Measurement in Education.

American Psychological Association. (December, 1992). Ethical Principles of Psychologists and Code of Conduct. American Psychologist, 47(12), 1597-1611.

Joint Committee on Testing Practices. (1988). Code of Fair Testing Practices in Education. Washington, DC: American Psychological Association.

National Council on Measurement in Education. (1995). Code of Professional Responsibilities in Educational Measurement. Washington, DC: Author.

INTRODUCTION

Educational and psychological testing and assessment are among the most important contributions of behavioral science to our society, providing fundamental and significant improvements over previous practices. Although not all tests are well developed nor are all testing practices wise and beneficial, there is extensive evidence documenting the effectiveness of well-constructed tests for uses supported by validity evidence. The proper use of tests can result in wiser decisions about individuals and programs than would be the case without their use and also can provide a route to broader and more equitable access to education and employment. The improper use of tests, however, can cause considerable harm to test takers and other parties affected by test-based decisions. The intent of the Standards is to promote the sound and ethical use of tests and to provide a basis for evaluating the quality of testing practices.

Participants in the Testing Process

Educational and psychological testing and assessment involve and significantly affect individuals, institutions, and society as a whole. The individuals affected include students, parents, teachers, educational administrators, job applicants, employees, clients, patients, supervisors, executives, and evaluators, among others. The institutions affected include schools, colleges, businesses, industry, clinics, and government agencies. Individuals and institutions benefit when testing helps them achieve their goals. Society, in turn, benefits when testing contributes to the achievement of individual and institutional goals.

The interests of the various parties involved in the testing process are usually, but not always, congruent. For example, when a test is given for counseling purposes or for job placement, the interests of the individual and the institution often coincide. In contrast, when a test is used to select from among many individuals for a highly competitive job or for entry into an educational or training program, the preferences of an applicant may be inconsistent with those of an employer or admissions officer. Similarly, when testing is mandated by a court, the interests of the test taker may be different from those of the party requesting the court order.
There are many participants in the testing process, including, among others: (a) those who prepare and develop the test; (b) those who publish and market the test; (c) those who administer and score the test; (d) those who use the test results for some decision-making purpose; (e) those who interpret test results for clients; (f) those who take the test by choice, direction, or necessity; (g) those who sponsor tests, which may be boards that represent institutions or governmental agencies that contract with a test developer for a specific instrument or service; and (h) those who select or review tests, evaluating their comparative merits or suitability for the uses proposed.

These roles are sometimes combined and sometimes further divided. For example, in clinics the test taker is typically the intended beneficiary of the test results. In some situations the test administrator is an agent of the test developer, and sometimes the test administrator is also the test user. When an industrial organization prepares its own employment tests, it is both the developer and the user. Sometimes a test is developed by a test author but published, advertised, and distributed by an independent publisher, though the publisher may play an active role in the test development. Given this intermingling of roles, it is difficult to assign precise responsibility for addressing various standards to specific participants in the testing process. This document begins with a series of chapters on the test development process, which focus primarily on the responsibilities of test developers, and then turns to chapters on specific uses and applications, which focus primarily on responsibilities of test users. One chapter is devoted specifically to the rights and responsibilities of test takers.
The Standards is based on the premise that effective testing and assessment require that all participants in the testing process possess the knowledge, skills, and abilities relevant to their role in the testing process, as well as awareness of personal and contextual factors that may influence the testing process. They also should obtain any appropriate supervised experience and legislatively mandated practice credentials necessary to perform competently those aspects of the testing process in which they engage. For example, test developers and those selecting and interpreting tests need adequate knowledge of psychometric principles such as validity and reliability.

The Purpose of the Standards

The purpose of publishing the Standards is to provide criteria for the evaluation of tests, testing practices, and the effects of test use. Although the evaluation of the appropriateness of a test or testing application should depend heavily on professional judgment, the Standards provides a frame of reference to assure that relevant issues are addressed. It is hoped that all professional test developers, sponsors, publishers, and users will adopt the Standards and encourage others to do so. The Standards makes no attempt to provide psychometric answers to questions of public policy regarding the use of tests. In general, the Standards advocates that, within feasible limits, the relevant technical information be made available so that those involved in policy debate may be fully informed.

Categories of Standards

The 1985 Standards designated each standard as "primary" (to be met by all tests before operational use), "secondary" (desirable, but not feasible in certain situations), or "conditional" (importance varies with application). The present Standards continues the tradition of expecting test developers and users to consider all standards before operational use; however, the Standards does not continue the practice of designating levels of importance. Instead, the text of each standard, and any accompanying commentary, discusses the conditions under which a standard is relevant. It was not the case that under the 1985 Standards test developers and users were obligated to attend only to the primary standards. Rather, the term "conditional" meant that a standard was primary in some settings and secondary in others, thus requiring careful consideration of the applicability of each standard for a given setting.

The absence of designations such as "primary" or "conditional" should not be taken to imply that all standards are equally significant in any given situation. Depending on the context and purpose of test development or use, some standards will be more salient than others. Moreover, some standards are broad in scope, setting forth concerns or requirements relevant to nearly all tests or testing contexts, and other standards are narrower in scope. However, all standards are important in the contexts to which they apply. Any classification that gives the appearance of elevating the general importance of some standards over others could invite neglect of some standards that need to be addressed in particular situations. Further, the current Standards does not include standards considered secondary or "desirable." The continued use of the secondary designation would risk encouraging both the expansion of the Standards to encompass large numbers of "desirable" standards and the inappropriate assumption that any guideline not included in the Standards as at least "secondary" was inconsequential.
Unless otherwise specified in the standard or commentary, and with the caveats outlined below, standards should be met before operational test use. This means that each standard should be carefully considered to determine its applicability to the testing context under consideration. In a given case there may be a sound professional reason why adherence to the standard is unnecessary. It is also possible that there may be occasions when technical feasibility may influence whether a standard can be met prior to operational test use. For example, some standards may call for analyses of data that may not be available at the point of initial operational test use. If test developers, users, and, when applicable, sponsors have deemed a standard to be inapplicable or unfeasible, they should be able, if called upon, to explain the basis for their decision. However, there is no expectation that documentation be routinely available of the decisions related to each standard.

Tests and Test Uses to Which These Standards Apply

A test is an evaluative device or procedure in which a sample of an examinee's behavior in a specified domain is obtained and subsequently evaluated and scored using a standardized process. While the label test is ordinarily reserved for instruments on which responses are evaluated for their correctness or quality and the terms scale or inventory are used for measures of attitudes, interests, and dispositions, the Standards uses the single term test to refer to all such evaluative devices.

A distinction is sometimes made between test and assessment. Assessment is a broader term, commonly referring to a process that integrates test information with information from other sources (e.g., information from the individual's social, educational, employment, or psychological history). The applicability of the Standards to an evaluation device or method is not altered by the label applied to it (e.g., test, assessment, scale, inventory).

Tests differ on a number of dimensions: the mode in which test materials are presented (paper and pencil, oral, computerized administration, and so on); the degree to which stimulus materials are standardized; the type of response format (selection of a response from a set of alternatives as opposed to the production of a response); and the degree to which test materials are designed to reflect or simulate a particular context. In all cases, however, tests standardize the process by which test-taker responses to test materials are evaluated and scored. As noted in prior versions of the Standards, the same general types of information are needed for all varieties of tests.

The precise demarcation between those measurement devices used in the fields of educational and psychological testing that do and do not fall within the purview of the Standards is difficult to identify. Although the Standards applies most directly to standardized measures generally recognized as "tests," such as measures of ability, aptitude, achievement, attitudes, interests, personality, cognitive functioning, and mental health, it may also be usefully applied in varying degrees to a broad range of less formal assessment techniques.
Admittedly, it will generally not be possible to apply the Standards rigorously to unstandardized questionnaires or to the broad range of unstructured behavior samples used in some forms of clinic- and school-based psychological assessment (e.g., an intake interview), and to instructor-made tests that are used to evaluate student performance in education and training. It is useful to distinguish between devices that lay claim to the concepts and techniques of the field of educational and psychological testing and those which represent unstandardized or less standardized aids to day-to-day evaluative decisions. Although the principles and concepts underlying the Standards can be fruitfully applied to day-to-day decisions, such as when a business owner interviews a job applicant, a manager evaluates the performance of subordinates, or a coach evaluates a prospective athlete, it would be overreaching to expect that the standards of the educational and psychological testing field be followed by those making such decisions. In contrast, a structured interviewing system developed by a psychologist and accompanied by claims that the system has been found to be predictive of job performance in a variety of other settings falls within the purview of the Standards.

Cautions to be Exercised in Using the Standards

Several cautions are important to avoid misinterpreting the Standards:

1) Evaluating the acceptability of a test or test application does not rest on the literal satisfaction of every standard in this document, and acceptability cannot be determined by using a checklist. Specific circumstances affect the importance of individual standards, and individual standards should not be considered in isolation. Therefore, evaluating acceptability involves (a) professional judgment that is based on a knowledge of behavioral science, psychometrics, and the community standards in the professional field to which the tests apply; (b) the degree to which the intent of the standard has been satisfied by the test developer and user; (c) the alternatives that are readily available; and (d) research and experiential evidence regarding feasibility of meeting the standard.

2) When tests are at issue in legal proceedings and other venues requiring expert witness testimony, it is essential that professional judgment be based on the accepted corpus of knowledge in determining the relevance of particular standards in a given situation. The intent of the Standards is to offer guidance for such judgments.

3) Claims by test developers or test users that a test, manual, or procedure satisfies or follows these standards should be made with care. It is appropriate for developers or users to state that efforts were made to adhere to the Standards, and to provide documents describing and supporting those efforts. Blanket claims without supporting evidence should not be made.

4) These standards are concerned with a field that is evolving. Consequently, there is a continuing need to monitor changes in the field and to revise this document as knowledge develops.

5) Prescription of the use of specific technical methods is not the intent of the Standards. For example, where specific statistical reporting requirements are mentioned, the phrase "or generally accepted equivalent" always should be understood.

The standards do not attempt to repeat or to incorporate the many legal or regulatory requirements that might be relevant to the issues they address.
In some areas, such as the collection, analysis, and use of test data and results for different subgroups, the law may both require participants in the testing process to take certain actions and prohibit those participants from taking other actions. Where it is apparent that one or more standards or comments address an issue on which established legal requirements may be particularly relevant, the standard, comment, or introductory material may make note of that fact. Lack of specific reference to legal requirements, however, does not imply that no relevant requirement exists. In all situations, participants in the testing process should separately consider and, where appropriate, obtain legal advice on legal and regulatory requirements.

The Number of Standards

The number of standards has increased from the 1985 Standards for a variety of reasons. First, and most importantly, new developments have led to the addition of new standards. Commonly these deal with new types of tests or new uses for existing tests, rather than being broad standards applicable to all tests. Second, on the basis of recognition that some users of the Standards may turn only to chapters directly relevant to a given application, certain standards are repeated in different chapters. When such repetition occurs, the essence of the standard is the same. Only the wording, area of application, or elaboration in the comment is changed. Third, standards dealing with important nontechnical issues, such as avoiding conflicts of interest and equitable treatment of all test takers, have been added. Although such topics have not been addressed in prior versions of the Standards, they are not likely to be viewed as imposing burdensome new requirements. Thus the increase in the number of standards does not per se signal an increase in the obligations placed on test developers and test users.

Tests as Measures of Constructs

We depart from some historical uses of the term "construct," which reserve the term for characteristics that are not directly observable but which are inferred from interrelated sets of observations. This historical perspective invites confusion. Some tests are viewed as measures of constructs, while others are not. In addition, considerable debate has ensued as to whether certain characteristics measured by tests are properly viewed as constructs. Furthermore, the types of validity evidence thought to be suitable can differ as a result of whether a given test is viewed as measuring a construct. We use the term construct more broadly as the concept or characteristic that a test is designed to measure. Rarely, if ever, is there a single possible meaning that can be attached to a test score or a pattern of test responses. Thus, it is always incumbent on a testing professional to specify the construct interpretation that will be made on the basis of the score or response pattern. The notion that some tests are not under the purview of the Standards because they do not measure constructs is contrary to this use of the term. Also, as detailed in chapter 1, evolving conceptualizations of the concept of validity no longer speak of different types of validity but speak instead of different lines of validity evidence, all in service of providing information relevant to a specific intended interpretation of test scores. Thus, many lines of evidence can contribute to an understanding of the construct meaning of test scores.

Organization of This Volume

Part I of the Standards, "Test Construction, Evaluation, and Documentation," contains standards for validity (ch. 1); reliability and errors of measurement (ch. 2); test development and revision (ch. 3); scaling, norming, and score comparability (ch.
4); test administration, scoring, and reporting (ch. 5); and supporting documentation for tests (ch. 6). Part II addresses "Fairness in Testing," and contains standards on fairness and bias (ch. 7); the rights and responsibilities of test takers (ch. 8); testing individuals of diverse linguistic backgrounds (ch. 9); and testing individuals with disabilities (ch. 10). Part III treats specific "Testing Applications," and contains standards involving general responsibilities of test users (ch. 11); psychological testing and assessment (ch. 12); educational testing and assessment (ch. 13); testing in employment and credentialing (ch. 14); and testing in program evaluation and public policy (ch. 15).

Each chapter begins with introductory text that provides background for the standards that follow. This revision of the Standards contains more extensive introductory text material than its predecessor. Recognizing the common use of the Standards in the education of future test developers and users, the committee opted to provide a context for the standards themselves by presenting more background material than in previous versions. This text is designed to assist in the interpretation of the standards that follow in each chapter. Although the text is at times prescriptive and exhortatory, it should not be interpreted as imposing additional standards. The Standards also contains an index and includes a glossary that provides definitions for terms as they are specifically used in this volume.

PART I: Test Construction, Evaluation, and Documentation

1. VALIDITY

Background

Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing and evaluating tests. The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores required by proposed uses that are evaluated, not the test itself. When test scores are used or interpreted in more than one way, each intended interpretation must be validated.

Validation logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use. The proposed interpretation refers to the construct or concept the test is intended to measure. Examples of constructs are mathematics achievement, performance as a computer technician, depression, and self-esteem. To support test development, the proposed interpretation is elaborated by describing its scope and extent and by delineating the aspects of the construct that are to be represented.
The detailed description provides a conceptual framework for the test, delineating the knowledge, skills, abilities, processes, or characteristics to be assessed. The framework indicates how this representation of the construct is to be distinguished from other constructs and how it should relate to other variables. The conceptual framework is partially shaped by the ways in which test scores will be used. For instance, a test of mathematics achievement might be used to place a student in an appropriate program of instruction, to endorse a high school diploma, or to inform a college admissions decision. Each of these uses implies a somewhat different interpretation of the mathematics achievement test scores: that a student will benefit from a particular instructional intervention, that a student has mastered a specified curriculum, or that a student is likely to be successful with college-level work. Similarly, a test of self-esteem might be used for psychological counseling, to inform a decision about employment, or for the basic scientific purpose of elaborating the construct of self-esteem. Each of these potential uses shapes the specified framework and the proposed interpretation of the test's scores and also has implications for test development and evaluation.

Validation can be viewed as developing a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use. The conceptual framework points to the kinds of evidence that might be collected to evaluate the proposed interpretation in light of the purposes of testing. As validation proceeds, and new evidence about the meaning of a test's scores becomes available, revisions may be needed in the test, in the conceptual framework that shapes it, and even in the construct underlying the test.

The wide variety of tests and circumstances makes it natural that some types of evidence will be especially critical in a given case, whereas other types will be less useful. The decision about what types of evidence are important for validation in each instance can be clarified by developing a set of propositions that support the proposed interpretation for the particular purpose of testing. For instance, when a mathematics achievement test is used to assess readiness for an advanced course, evidence for the following propositions might be deemed necessary: (a) that certain skills are prerequisite for the advanced course; (b) that the content domain of the test is consistent with these prerequisite skills; (c) that test scores can be generalized across relevant sets of items; (d) that test scores are not unduly influenced by ancillary variables,
:-^^.-l (JPs wt a rv¡¡rr¡rvr¡ .,,-. ^Ê ¡r¡6 PdrJdólr reading marerial. As another example, a tesr of anxiery might measure only physiological ¡eacrions and not emo¡ional, cognitìve, or siruational components. . Construct-irrelevanr variance reFers to rhe degree to rvhich tesr scores are aflecred by processes shar are extraneous ro irs inrended consrrucr. The test scores may be systematically influenccd to some extent by components thar are not part ol che construcr. In rhe case oF a reading comprehension cest, consrruct'irrelevant components mighr inclucle an emotional reaction to thc rest conrent, familiariry rvirh the subject mattcr ofthe reading pesseges on rhe test, or the wriring skill needed to compose â resPonse. Depending on the detailed definirion oFthe consrrucr, vocabulary knowledge or reading speed míght atso be irrelevant comPonenÉ. On a tes¡ of anxiery, a response bias ro underreport anxiery might be considered e source of co¡rstruct-irrelerant variance. Nearly atl tests leave out elements that ¡ience wirh similar rests and contexts, and the expecred consequences of the proposed test use. Plausible rival hypotheses can ofren be generated by considering wherher a resr measuÍes less o¡ mo¡e rhan irs proposed construct. Such conce¡ns are refer¡ed ¡o as conttruct underrepresentatíon end eonsttuct' some porenrial users believe should be measu¡ed and include some elemen¡s ¡hat sorne irreleuant variance. Conscruct underrepresentation reFers to rhe degree to which a tesr fails to caPtur€ important aspects oF rhe construct. ir impiies a narrowed meaning of test scorcs t¡ecause the test does not adequatefy sample some rypes ofcontenr, engage some psychological processes, or elicit some ]vays of responding rhat are encompassed by the intended consrrucr. Täke, for example, a cest of reading administration conditions, or language level that may materialli' limit or qualifr the inter' pretation oF rcst scores. Thar is, the process of validation may iead to revisions in the test, rhe conceptual f¡ameworl< of rhe test, or boúr. The ¡evised test would then need validation' 1ù/hen propositions have been identified comprehension in¡ended to measure chil- drent ability ro read and interpret stories wirh understanding. A particular test might underrepresent r-he intended construcr because ir did nor contain a sufTìcient variery of read- porenrial users consider inappropriate. Va.lidation involves ca¡efrrl artendon ro possible disto¡rions in meaning arising from inadequate representarìon of the construct and also to aspeccs o[ meesurement such as rest fotmat, rhat would suppon the proposed interpretation of test scores, validation can proceed by developing empirical evidence, examining releva¡r lirerarure, and./o¡ conducting togical analyses to evaluate each of these proposirions. Empirical evidence may include both local evidence, produced wirhin the contexts where che rest will be used, and evidence From similar testing l0 AERA-APA-NCME_OOOO02O PAßT I/ VALIDIW applications in orher serrings. Use ofexisring evidencc from similar resrs and conrexrs can enhance the qualiry of the vaiidiry argumenr, especiâlly when currenr data are limired. Because a validity a¡gumenr rypically depends on more than one proposition, srrong eyidengg 1n 9,lppor, of one in no way diminishes rhe need for evidence ro supporr others. 
For example, a strong predictor-criterion relationship in an employment setting is not sufficient to justify test use for selection without considering the appropriateness and meaningfulness of the criterion measure. Professional judgment guides decisions regarding the specific forms of evidence that can best support the intended interpretation and use. As in all scientific endeavors, the quality of the evidence is primary. A few lines of solid evidence regarding a particular proposition are better than numerous lines of evidence of questionable quality.

Validation is the joint responsibility of the test developer and the test user. The test developer is responsible for furnishing relevant evidence and a rationale in support of the intended test use. The test user is ultimately responsible for evaluating the evidence in the particular setting in which the test is to be used. When the use of a test differs from that supported by the test developer, the test user bears special responsibility for validation. The standards apply to the validation process, for which the appropriate parties share responsibility. It should be noted that important contributions to the validity evidence are made as other researchers report findings of investigations that are related to the meaning of scores on the test.

Sources of Validity Evidence

The following sections outline various sources of evidence that might be used in evaluating a proposed interpretation of test scores for particular purposes. These sources of evidence may illuminate different aspects of validity, but they do not represent distinct types of validity. Validity is a unitary concept. It is the degree to which all the accumulated evidence supports the intended interpretation of test scores for the proposed purpose. Like the 1985 Standards, this edition refers to types of validity evidence, rather than distinct types of validity. To emphasize this distinction, the treatment that follows does not follow traditional nomenclature (i.e., the use of the terms content validity or predictive validity). The glossary contains definitions of the traditional terms, explicating the difference between traditional and current use.

Evidence Based on Test Content

Important validity evidence can be obtained from an analysis of the relationship between a test's content and the construct it is intended to measure. Test content refers to the themes, wording, and format of the items, tasks, or questions on a test, as well as the guidelines for procedures regarding administration and scoring. Test developers often work from a specification of the content domain. The content specification carefully describes the content in detail, often with a classification of areas of content and types of items. Evidence based on test content can include logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores. Evidence based on content can also come from expert judgments of the relationship between parts of the test and the construct. For example, in developing a licensure test, the major facets of the specific occupation can be specified, and experts in that occupation can be asked to assign test items to the categories defined by those facets. They, or other qualified experts, can then judge the representativeness of the chosen set of items.
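The kind of expert-judgment evidence just described is sometimes summarized numerically. As a purely illustrative sketch, and not a method prescribed by the Standards, the following Python fragment compares the facet mix of an item set, as classified by experts, against the proportions called for in a content blueprint; the facet labels, target proportions, and item assignments are all invented for the example.

```python
from collections import Counter

def blueprint_coverage(item_facets, blueprint):
    """Compare the facet mix of a chosen item set against a content blueprint.

    item_facets: list of facet labels, one per item, as assigned by experts.
    blueprint:   dict mapping facet label -> intended proportion of items.
    Returns a dict: facet -> (intended proportion, observed proportion).
    """
    counts = Counter(item_facets)
    total = len(item_facets)
    return {facet: (intended, counts.get(facet, 0) / total)
            for facet, intended in blueprint.items()}

# Hypothetical licensure-test example: three occupational facets.
blueprint = {"diagnosis": 0.40, "treatment": 0.40, "ethics": 0.20}
assignments = ["diagnosis"] * 35 + ["treatment"] * 45 + ["ethics"] * 20

for facet, (want, got) in blueprint_coverage(assignments, blueprint).items():
    print(f"{facet:10s} intended {want:.2f}  observed {got:.2f}")
```

A gap between intended and observed proportions would not by itself settle representativeness; it would simply flag facets for the expert panel to reexamine.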
Sometimes rules or algorithms can be constructed to select or generate items that differ systematically on the various facets of content, according to specifications.

Some tests are based on systematic observations of behavior. For example, a listing of the tasks comprising a job domain may be developed from observations of behavior in a job, together with judgments of subject-matter experts. Expert judgments can be used to assess the relative importance, criticality, and/or frequency of the various tasks. A job sample test can then be constructed from a random or stratified sampling of tasks rated highly on these characteristics. The test can then be administered under standardized conditions in an off-the-job setting.
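The stratified task-sampling step just described can be sketched in code. The fragment below is illustrative only: the task names, content categories, ratings, and the 4.0 eligibility cutoff are hypothetical, and a real job analysis would document how the expert ratings were gathered and combined.

```python
import random

def sample_job_tasks(task_ratings, cutoff, per_category, seed=0):
    """Draw a stratified sample of highly rated tasks for a job sample test.

    task_ratings: dict mapping task name -> (content category, mean
        importance rating from subject-matter experts).
    cutoff: minimum rating for a task to be eligible.
    per_category: how many eligible tasks to draw from each category.
    """
    rng = random.Random(seed)
    by_category = {}
    for task, (category, rating) in task_ratings.items():
        if rating >= cutoff:                      # keep only highly rated tasks
            by_category.setdefault(category, []).append(task)
    return {cat: rng.sample(tasks, min(per_category, len(tasks)))
            for cat, tasks in by_category.items()}

# Hypothetical expert ratings (1-5) for tasks in two content categories.
ratings = {"wire a panel": ("electrical", 4.6),
           "read a schematic": ("electrical", 4.8),
           "label a junction box": ("electrical", 2.9),
           "file a work order": ("clerical", 4.2),
           "archive old orders": ("clerical", 3.1)}
print(sample_job_tasks(ratings, cutoff=4.0, per_category=1))
```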
a J^l:-.^-^,.l orher .rariables- Wide individLral differ in process cen be revealing and may lead ro reconsideration ofcertain rest formats' Evidence oF response Processes can conrribute to quesrions about diFFerences in meaning or interpretation of tesc sco¡es across ¡'¡l Evidence about content can be used, in part, to address questions about differences in the meaning or interpretation ol rest scores across relevant subgroups oI cxaminees. Of particular concetn is the extenc ro which consrrucr underrep¡escn¡arion o¡ consrrucr-irrelevanr componenr may give an unfair advantage or disadvantage to one or more subgroups oF examinees. Careful review of the construct and test content domain by a diverse panel ences ofexpetrs may point ro porenrial sourccs of their performance. relevant subgroups oF examinees. Process studies involving examinees f¡om different subgroups cen assis¡ in determining the extent to which capabìlities irrelevanr or ancillary to ùe construct may be differenrially influencing AERA*APA-N C M Ë_O OOOO22 PART I / VALIDITY Studies of response proceJses are nor lim- iced to the examinee. Assessmenrs olten rely on observers or judges ro record and/o¡ evalua¡e examinees' perlormances or products. In such cases, relevanr validiry evidence includes . the exren¡ ro which rhe processes ofobservers or judges a¡e consistenr wirh rhe inrended inrerprerarion ofscores. For instance, il judges are expecred to apply parriculer crireria in scoring examinees' perlormances, ir is imporranr ro ascerrain wherhe¡ rhey are, in facr, applying the appropriare criteria and nor being influenced by factors rhat are irrelevant ro rhe intended inrerpretarion. Thus, validarion may include empirical srudies oFhow observers or judges reco¡d and evaluare dara along rvirh analyses ofrhe appropriareness of these processes to rhe inrended inrerpreration or consrrucr defi ni¡ion. Evrornc¡ Basro oH lmrnn¡l- SrRucrun¡ Analyses oF rhe inrernal srructure of a resr can indicare the degree ro which rhe relationships âmong rest irems and resr componenrs conForm to rhe consrrucr on which rhe proposed resr score interpretarions arc based. The conceptual framework for a rest may imply a single dimension of behavior, or ir may posit several components ¡har a¡c each expecred ro be homogeneous, bur rhar are also disrincr from each orher. For example, a measure of discomforr on a heakh survey mighr assess borh physical ànd emorional healrh. The ex(enr ro which irem inrerrela- tionships bear out rhe presumptions oF rhe f¡amework would be relevanr ro vâlidiry. The specific rypcs of analysis and rheir ìnterpreration depend on how rhe resr will be used. For example, if a parricular application posited a series of test componenrs of increasing difficulr¡ empirical evidence ol the extent to which response pârrerns confo¡med ro this expecracion would be provided. A theory rhar posired unidimensionaliry would call for evidence ol irem homogeneiry. In rhis case, rhe irem inrerrelarionships also provide an esrimare of score reliabiliry, bur such an index would be inappropriate for rests wirh a more complex inrernal srrucrure. Some srudies of the internal srrucrure of to shorv wherher particular icems may Êrncrion diFferenrly Êor idenrifiable tests are designed subgroups oI cxam i nees. Differenrial item Funcrioning occurs when differenr groups oFexaminees wirh similar overall abiliry, or similar starus on an appropriare criterion, have, on everage, sysremarically differenr responses to a parricular irem. 
Some studies of the internal structure of tests are designed to show whether particular items may function differently for identifiable subgroups of examinees. Differential item functioning occurs when different groups of examinees with similar overall ability, or similar status on an appropriate criterion, have, on average, systematically different responses to a particular item. This issue is discussed in chapters 3 and 7. However, differential item functioning is not always a flaw or weakness. Subsets of items that have a specific characteristic in common (e.g., specific content, task representation) may function differently for different groups of similarly scoring examinees. This indicates a kind of multidimensionality that may be unexpected or may conform to the test framework.
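One widely used statistic for flagging differential item functioning, though not one mandated by the Standards, is the Mantel-Haenszel common odds ratio computed across matched ability strata. The following minimal sketch uses invented counts; an operational DIF analysis would also attach significance tests and effect-size classifications.

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio for one studied item.

    strata: list of (a, b, c, d) tuples, one per matched ability level:
        a = reference-group examinees answering correctly
        b = reference-group examinees answering incorrectly
        c = focal-group examinees answering correctly
        d = focal-group examinees answering incorrectly
    A value near 1.0 suggests little DIF; values far from 1.0 suggest the
    item functions differently for the two groups at matched ability.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical counts at three total-score levels (reference vs. focal group).
strata = [(40, 20, 30, 30),   # low scorers
          (60, 15, 45, 25),   # middle scorers
          (80, 5, 70, 12)]    # high scorers
print(f"MH odds ratio = {mantel_haenszel_odds_ratio(strata):.2f}")
```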
Test-criterion relationships. Evidence of the relation of test scores to a relevant criterion may be expressed in various ways, but the fundamental question is always: How accurately do test scores predict criterion performance? The degree of accuracy deemed necessary depends on the purpose for which the test is used. The criterion variable is a measure of some attribute or outcome that is of primary interest, as determined by test users, who may be administrators in a school system, the management of a firm, or clients. The choice of the criterion and the measurement procedures used to obtain criterion scores are of central importance. The value of a test-criterion study depends on the relevance, reliability, and validity of the interpretation based on the criterion measure for a given testing application.

Historically, two designs, often called predictive and concurrent, have been distinguished for evaluating test-criterion relationships. A predictive study indicates how accurately test data can predict criterion scores that are obtained at a later time. A concurrent study obtains predictor and criterion information at about the same time. When prediction is actually contemplated, as in education or employment settings, or in planning rehabilitation regimens, predictive studies can retain the temporal differences and other characteristics of the practical situation. Concurrent evidence, which avoids temporal changes, is particularly useful for psychodiagnostic tests or to investigate alternative measures of some specified construct. In general, the choice of research strategy is guided by prior evidence of the extent to which predictive and concurrent studies yield the same or different results in the domain.

Test scores are sometimes used in allocating individuals to different treatments, such as different jobs within an institution, in a way that is advantageous for the institution and for the individuals. In that context, evidence is needed to judge the suitability of using a test when classifying or assigning a person to one job versus another or to one treatment versus another. Classification decisions are supported by evidence that the relationship of test scores to performance criteria is different for different treatments. It is possible for tests to be highly predictive of performance for different education programs or jobs without providing the information necessary to make a comparative judgment of the efficacy of assignments or treatments. In general, decision rules for selection or placement are also influenced by the number of persons to be accepted or the numbers that can be accommodated in alternative placement categories.

Evidence about relations to other variables is also used to investigate questions of differential prediction for groups. For instance, a finding that the relation of test scores to a relevant criterion variable differs from one group to another may imply that the meaning of the scores is not the same for members of the different groups, perhaps due to construct underrepresentation or construct-irrelevant components. However, the difference may also imply that the criterion has different meaning for different groups. The differences in test-criterion relationships can also arise from measurement error, especially when group means differ, so such differences do not necessarily indicate differences in score meaning. (See chapter 7.)
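A minimal sketch of one common differential-prediction check follows: fit a separate least-squares line predicting the criterion from test scores within each group and compare slopes and intercepts. Group names, effect sizes, and data are all fabricated for illustration.

```python
# Compare within-group regressions of a criterion on test scores. Markedly
# different lines would prompt the inquiry described above (construct
# underrepresentation, construct-irrelevant variance, criterion bias, or
# measurement error) before drawing any conclusion about score meaning.
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares regression of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

rng = np.random.default_rng(1)
for name, intercept_shift in [("group A", 0.0), ("group B", 4.0)]:
    test = rng.normal(50, 10, size=300)
    criterion = 0.5 * test + intercept_shift + rng.normal(0, 5, size=300)
    slope, intercept = fit_line(test, criterion)
    print(f"{name}: slope={slope:.2f}, intercept={intercept:.2f}")
```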
Validity generalization. An important issue in educational and employment settings is the degree to which evidence of validity based on test-criterion relations can be generalized to a new situation without further study of validity in that new situation. When a test is used to predict the same or similar criteria (e.g., performance of a given job) at different times or in different places, it is typically found that observed test-criterion correlations vary substantially. In the past, this has been taken to imply that local validation studies are always required. More recently, meta-analytic analyses have shown that in some domains much of this variability may be due to statistical artifacts such as sampling fluctuations and variations across validation studies in the ranges of test scores and in the reliability of criterion measures. When these and other influences are taken into account, it may be found that the remaining variability in validity coefficients is relatively small. Thus, statistical summaries of past validation studies in similar situations may be useful in estimating test-criterion relationships in a new situation. This practice is referred to as the study of validity generalization.

In some circumstances, there is a strong basis for using validity generalization. This would be the case where the meta-analytic database is large, where the meta-analytic data adequately represent the type of situation to which one wishes to generalize, and where correction for statistical artifacts produces a clear and consistent pattern of validity evidence. In such circumstances, the informational value of a local validity study may be relatively limited. In other circumstances, the inferential leap required for generalization may be much larger. The meta-analytic database may be small, the findings may be less consistent, or the new situation may involve features markedly different from those represented in the meta-analytic database. In such circumstances, situation-specific evidence of validity will be relatively more informative. Although research on validity generalization shows that results of a single local validation study may be quite imprecise, there are situations where a single study, carefully done, with adequate sample size, provides sufficient evidence to support test use in a new situation. This highlights the importance of examining carefully the comparative informational value of local versus meta-analytic studies.

In conducting studies of the generalizability of validity evidence, the prior studies that are included may vary according to several situational facets. Some of the major facets are (a) differences in the way the predictor construct is measured, (b) the type of job or curriculum involved, (c) the type of criterion measure used, (d) the type of test takers, and (e) the time period in which the study was conducted. In any particular study of validity generalization, any number of these facets might vary, and a major objective of the study is to determine empirically the extent to which variation in these facets affects the test-criterion correlations obtained.

The extent to which predictive or concurrent evidence of validity generalization can be used in new situations is in large measure a function of accumulated research. Although evidence of generalization can often help to support a claim of validity in a new situation, the extent of available data limits the extent to which the claim can be sustained. The above discussion focuses on the use of cumulative databases to estimate predictor-criterion relationships. Meta-analytic techniques can also be used to summarize other forms of data relevant to other inferences one may wish to draw from test scores in a particular application, such as effects of coaching and effects of certain alterations in testing conditions to accommodate test takers with certain disabilities.
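The bare-bones sketch below illustrates the meta-analytic reasoning just described, under simple assumptions: average the validity coefficients across studies, estimate how much of their spread is attributable to sampling error alone, and examine what variance remains. The study correlations and sample sizes are fabricated, and real meta-analyses add further artifact corrections.

```python
# Bare-bones aggregation of test-criterion correlations across studies.
import numpy as np

r = np.array([0.25, 0.31, 0.18, 0.40, 0.28])   # per-study validity coefficients
n = np.array([120, 80, 200, 60, 150])          # per-study sample sizes

r_bar = np.sum(n * r) / np.sum(n)              # sample-size-weighted mean r
observed_var = np.sum(n * (r - r_bar) ** 2) / np.sum(n)
sampling_var = (1 - r_bar ** 2) ** 2 / (n.mean() - 1)  # expected from N alone
residual_var = max(observed_var - sampling_var, 0.0)

print(f"mean r = {r_bar:.3f}")
print(f"variance: observed={observed_var:.4f}, "
      f"sampling={sampling_var:.4f}, residual={residual_var:.4f}")
# A small residual variance is the pattern that supports generalizing
# validity evidence to a new situation without a local study.
```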
Evidence Based on Consequences of Testing

An issue receiving attention in recent years is the incorporation of the intended and unintended consequences of test use into the concept of validity. Evidence about consequences can inform validity decisions. Here, however, it is important to distinguish between evidence that is directly relevant to validity and evidence that may inform decisions about social policy but falls outside the realm of validity. Distinguishing between issues of validity and issues of social policy becomes particularly important in cases where differential consequences of test use are observed for different identifiable groups.

For example, concerns have been raised about the effect of group differences in test scores on employment selection and promotion, the placement of children in special education classes, and the narrowing of a school's curriculum to exclude learning objectives that are not assessed. Although information about the consequences of testing may influence decisions about test use, such consequences do not in and of themselves detract from the validity of intended test interpretations. Rather, judgments of validity or invalidity in the light of testing consequences depend on a more searching inquiry into the sources of those consequences.

Take, as an example, a finding of different hiring rates for members of different groups as a consequence of using an employment test. If the difference is due solely to an unequal distribution of the skills the test purports to measure, and if those skills are, in fact, important contributors to job performance, then the finding of group differences per se does not imply any lack of validity for the intended inference. If, however, the test measured skill differences unrelated to job performance (e.g., a sophisticated reading test for a job that required only minimal functional literacy), or if the differences were due to the test's sensitivity to some examinee characteristic not intended to be part of the test construct, then validity would be called into question, even if test scores correlated positively with some measure of job performance. Thus, evidence about consequences may be directly relevant to validity when it can be traced to a source of invalidity such as construct underrepresentation or construct-irrelevant components. Evidence about consequences that cannot be so traced, and that in fact reflects valid differences in performance, is crucial in informing policy decisions but falls outside the technical purview of validity.

Tests are commonly administered in the expectation that some benefit will be realized from the intended use of the scores. A few of the many possible benefits are selection of efficacious treatments for therapy, placement of workers in suitable jobs, prevention of unqualified individuals from entering a profession, or improvement of classroom instructional practices. A fundamental purpose of validation is to indicate whether these specific benefits are likely to be realized. Thus, in the case of a test used in placement decisions, the validation would be informed by evidence that alternative placements, in fact, are differentially beneficial to the persons and the institution. In the case of employment testing, if a test publisher claims that use of the test will result in reduced employee training costs, improved workforce efficiency, or some other benefit, then the validation would be informed by evidence in support of that claim.
Claims are sometimes made for benefits of testing that go beyond direct uses of the test scores themselves. Educational tests, for example, may be advocated on the grounds that their use will improve student motivation or encourage changes in classroom instructional practices by holding educators accountable for valued learning outcomes. Where such claims are central to the rationale advanced for testing, the direct examination of testing consequences necessarily assumes even greater importance. The validation process in such cases would be informed by evidence that the anticipated benefits of testing are being realized.

Integrating the Validity Evidence

A sound validity argument integrates various strands of evidence into a coherent account of the degree to which existing evidence and theory support the intended interpretation of test scores for specific uses. It encompasses evidence gathered from new studies and evidence available from earlier reported research. The validity argument may indicate the need for refining the definition of the construct, may suggest revisions in the test or other aspects of the testing process, and may indicate areas needing further study.

Ultimately, the validity of an intended interpretation of test scores relies on all the available evidence relevant to the technical quality of a testing system. This includes evidence of careful test construction; adequate score reliability; appropriate test administration and scoring; accurate score scaling, equating, and standard setting; and careful attention to fairness for all examinees, as described in subsequent chapters of the Standards.

STANDARDS

Standard 1.1

A rationale should be presented for each recommended interpretation and use of test scores, together with a comprehensive summary of the evidence and theory bearing on the intended use or interpretation.

Comment: The rationale should indicate what propositions are necessary to investigate the intended interpretation. The comprehensive summary should combine logical analysis with empirical evidence to provide support for the test rationale. Evidence may come from studies conducted locally, in the setting where the test is to be used; from specific prior studies; or from comprehensive statistical syntheses of available studies meeting clearly specified criteria. No type of evidence is inherently preferable to others; rather, the quality and relevance of the evidence to the intended test use determine the value of a particular kind of evidence. A presentation of empirical evidence on any point should give due weight to all relevant findings in the scientific literature, including those inconsistent with the intended interpretation or use. Test developers have the responsibility to provide support for their own recommendations, but test users are responsible for evaluating the quality of the validity evidence provided and its relevance to the local situation.

Standard 1.2

The test developer should set forth clearly how test scores are intended to be interpreted and used. The population(s) for which a test is appropriate should be clearly delimited, and the construct that the test is intended to assess should be clearly described.

Comment: Statements about validity should refer to particular interpretations and uses. It is incorrect to use the unqualified phrase "the validity of the test." No test is valid for all purposes or in all situations.
Each recom- 17 AE RA-APA-NCM E_OOOOO27 I I *ræn n¡¡r A F$Fìf,i h | ¡À[1{I l¡ÀrÍ¡ I.\ VALIDITY / PARÏ I .^^,,i.-" .,-li.l^^-^A-) ,,-- ^. :-..-^.^.-.i^¡ion and should specifr in clear language the popularion for which the test is in¡ended, the consrrucr ir is inrended to meâsure) and the manner and contexts in which tes! scores are obtained should be desc¡ibed in as ¡¡ruch detail as is pracrical, including major relevarit sociodemographic arrd developmental to be employed. Commeit: Statistical findings can be influ* enced by.factors aifecting the sample on which rhe resulr are bæed. \Øhen the sample Standard 1.3 IFvalidiry for some common or likely interpretation has nor been investig'ated, o¡ if the interpretation is inconsistent with available evidence, that fact should be made clea¡ atd potentiel users should be cautioned about making unsupported inte¡prentions. Comment: IF past expe rience suggesrs that a tesr is likely to be used inappropriately [or cerrain kinds ofdecisions, specific warnings against such uses should be given. On rhe other hand, no ¡wo situa¡ions are ever idendcal, so some generalization by rhe user is always necessary. Professional judgment is required to evaluate the extent ro which existing validicy evidence supports a given tesr use. Standard 1.4 If a test is used in a way that has not been v¿lidated, it is incumbent on the user ro jus- tifu the new use, collecting new evidence if necessary, Comment: Professional judgment is required to evaluate the extent ro which existing valìdiry evicience applies in rhe new situarion and ro determine what new evidence may bc nceded. The amount and kinds of new evidence required may be influenced by experience wich similar prior ¡est user or inrerprerarions and by the amount, qualiry, and relevance of existing data. Standard 1.5 The composition of any sample of examinees from which validiry evldence is cha¡acteristics, is inrended to rep¡esent a popuiation, that popularion should be described, and attenrion should be drawn to any systematic Facrors thar may limir the representativeness of the sample. Factors that might reasonably be cxpecred to affect the results include selfselecrion, arrrition, linguistic prowess, disabiliry starus, and exclusion criteria, and others. lf rhe subjects ofa validiry study are Petients, for example, then the diagnoses ofthe patienrs are important, as well as other cha¡acrerisrics, such as the severiry of rhe diagnosed condition. For tests used in industry the employment starus (e.g., applicanrs versus currenr job holders), the genera.l level ofexperience and educational bacÇround and the gender and ethnic composirion of the sample may be relevan¡ information. For tests used in educ¿tional settings, relevant information may include educational background, developmenral level, communiry characreristics, or school admissions policies, as rvell as the gender and ethnic composicion of rhe sampleSometimes resrrictions about privary preclude obcaining such population in[o¡mation. êr--J--J { ô ù14¡¡UatU r.O When the validation resrs in Pãrt on the appropriateness of test content, the procedures followed in specifying and çnerating test content should be described and justiÊed in reference to the construct the test is intended to meâsure or the domain it is intended to EPt€sent, If the definition of the content sampled incorporates criteria such as importance, [recriticaliy, ù.se criteria should also quency, or be clearly explained and justified. 
Comment: For example, test developers might provide a logical structure that maps the items on the test to the content domain, illustrating the relevance of each item and the adequacy with which the set of items represents the content domain. Areas of the content domain that are not included among the test items could be indicated as well.

Standard 1.7

When a validation rests in part on the opinions or decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The qualifications and experience of the judges should be presented. The description of procedures should include any training and instructions provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth.

Comment: Systematic collection of judgments or opinions may occur at many points in test construction (e.g., in eliciting expert judgments of content appropriateness or adequate content representation), in formulating rules or standards for score interpretation (e.g., in setting cut scores), or in test scoring (e.g., rating of essay responses). Whenever such procedures are employed, the quality of the resulting judgments is important to the validation. It may be entirely appropriate to have experts work together to reach consensus, but it would not then be appropriate to treat their respective judgments as statistically independent.

Standard 1.8

If the rationale for a test use or score interpretation depends on premises about the psychological processes or cognitive operations used by examinees, then theoretical or empirical evidence in support of those premises should be provided. When statements about the processes employed by observers or scorers are part of the argument for validity, similar information should be provided.

Comment: If the test specification delineates the processes to be assessed, then evidence is needed that the test items do, in fact, tap the intended processes.

Standard 1.9

If a test is claimed to be essentially unaffected by practice and coaching, then the sensitivity of test performance to change with these forms of instruction should be documented.

Comment: Materials to aid in score interpretation should summarize evidence indicating the degree to which improvement with practice or coaching can be expected. Also, materials written for test takers should provide practical guidance about the value of test preparation activities, including coaching.

Standard 1.10

When interpretation of performance on specific items, or small subsets of items, is suggested, the rationale and relevant evidence in support of such interpretation should be provided. When interpretation of individual item responses is likely but is not recommended by the developer, the user should be warned against making such interpretations.

Comment: Users should be given sufficient guidance to enable them to judge the degree of confidence warranted for any use or interpretation recommended by the test developer. Test manuals and score reports should discourage overinterpretation of information that may be subject to considerable error. This is especially important if interpretation
of performance on isolated items, small subsets of items, or subtest scores is suggested.

Standard 1.11

If the rationale for a test use or interpretation depends on premises about the relationships among parts of the test, evidence concerning the internal structure of the test should be provided.

Comment: It might be claimed, for example, that a test is essentially unidimensional. Such a claim could be supported by a multivariate statistical analysis, such as a factor analysis, showing that the score variability attributable to one major dimension was much greater than the score variability attributable to any other identified dimension. When a test provides more than one score, the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed.
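A minimal sketch of the kind of analysis this comment contemplates follows: an eigenvalue decomposition of the inter-item correlation matrix, in which a claim of essential unidimensionality is supported when the first eigenvalue dominates the rest. The item responses are simulated from a single fabricated ability dimension.

```python
# Eigenvalue check of essential unidimensionality on simulated item scores.
import numpy as np

rng = np.random.default_rng(2)
n_examinees, n_items = 500, 10
ability = rng.normal(size=(n_examinees, 1))
items = 0.7 * ability + 0.7 * rng.normal(size=(n_examinees, n_items))

corr = np.corrcoef(items, rowvar=False)          # inter-item correlations
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print("largest eigenvalues:", np.round(eigenvalues[:3], 2))
print(f"first/second ratio: {eigenvalues[0] / eigenvalues[1]:.1f}")
```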
Standard 1.12

When interpretation of subscores, score differences, or profiles is suggested, the rationale and relevant evidence in support of such interpretation should be provided. Where composite scores are developed, the basis and rationale for arriving at the composites should be given.

Comment: When a test provides more than one score, the distinctiveness of the separate scores should be demonstrated, and the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed. Moreover, evidence for the validity of interpretations of two separate scores would not necessarily justify an interpretation of the difference between them. Rather, the rationale and supporting evidence must pertain directly to the specific score or score combination to be interpreted or used.

Standard 1.13

When validity evidence includes statistical analyses of test results, either alone or together with data on other variables, the conditions under which the data were collected should be described in enough detail that users can judge the relevance of the statistical findings to local conditions. Attention should be drawn to any features of a validation data collection that are likely to differ from typical operational testing conditions and that could plausibly influence test performance.

Comment: Such conditions might include (but would not be limited to) the following: examinee motivation or prior preparation, the distribution of test scores over examinees, the time allowed for examinees to respond or other administrative conditions, examiner training or other examiner characteristics, the time intervals separating collection of data on different measures, or conditions that may have changed since the validity evidence was obtained.

Standard 1.14

When validity evidence includes empirical analyses of test responses together with data on other variables, the rationale for selecting the additional variables should be provided. Where appropriate and feasible, evidence concerning the constructs represented by other variables, as well as their technical properties, should be presented or cited. Attention should be drawn to any likely sources of dependence (or lack of independence) among variables other than dependencies among the construct(s) they represent.

Comment: The patterns of association between and among scores on the instrument under study and other variables should be consistent with theoretical expectations. The additional variables might be demographic characteristics, indicators of treatment conditions, or scores on other measures. They might include intended measures of the same construct or of different constructs. The reliability of scores from such other measures and the validity of intended interpretations of scores from these measures are an important part of the validity evidence for the instrument under study. If such variables include composite scores, the construction of the composites should be explained. In addition to considering the properties of each variable in isolation, it is important to guard against faulty interpretations arising from spurious sources of dependency among measures, including correlated errors or shared variance due to common methods of measurement or common elements.

Standard 1.15

When it is asserted that a certain level of test performance predicts adequate or inadequate criterion performance, information about the levels of criterion performance associated with given levels of test scores should be provided.

Comment: Regression equations are more useful than correlation coefficients, which are generally insufficient to fully describe patterns of association between tests and other variables. Means, standard deviations, and other statistical summaries are needed, as well as information about the distribution of criterion performances conditional upon a given test score. Evidence of overall association between variables should be supplemented by information about the form of that association and about the variability associated with that association in different ranges of test scores. Note that data collections employing examinees selected for their extreme scores on one or more measures (extreme groups) typically cannot provide adequate information about the association.

Standard 1.16

When validation relies on evidence that test scores are related to one or more criterion variables, information about the suitability and technical quality of the criteria should be reported.

Comment: The description of each criterion variable should include evidence concerning its reliability, the extent to which it represents the intended construct (e.g., job performance), and the extent to which it is likely to be influenced by extraneous sources of variance. Special attention should be given to sources that previous research suggests may introduce extraneous variance that might bias the criterion for or against identifiable groups.

Standard 1.17

If test scores are used in conjunction with other quantifiable variables to predict some outcome or criterion, regression (or equivalent) analyses should include those additional relevant variables along with the test scores.

Comment: In general, if several predictors of some criterion are available, the optimum combination of predictors cannot be determined solely from separate, pairwise examinations of the criterion variable with each separate predictor in turn. It is often informative to estimate the increment in predictive accuracy that may be expected when each variable, including the test score, is introduced in addition to all other available variables. Analyses involving multiple predictors should be verified by cross-validation or equivalent analysis whenever feasible, and the precision of estimated regression coefficients should be reported.
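A minimal sketch of the incremental-prediction idea in the comment to Standard 1.17 follows: compare the criterion variance explained with and without the test score while the other predictor is retained. The predictor names and all data are fabricated.

```python
# Incremental contribution of a test score over an existing predictor.
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary least-squares fit with an intercept column."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(3)
n = 400
prior_gpa = rng.normal(size=n)                       # hypothetical predictor
test      = 0.5 * prior_gpa + 0.9 * rng.normal(size=n)
criterion = 0.6 * prior_gpa + 0.3 * test + rng.normal(size=n)

base = r_squared(prior_gpa.reshape(-1, 1), criterion)
full = r_squared(np.column_stack([prior_gpa, test]), criterion)
print(f"R^2 without test = {base:.3f}, with test = {full:.3f}, "
      f"increment = {full - base:.3f}")
```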
Standard 1.18

When statistical adjustments, such as those for restriction of range or attenuation, are made, both adjusted and unadjusted coefficients, as well as the specific procedure used, and all statistics used in the adjustment, should be reported.

Comment: The correlation between two variables, such as test scores and criterion measures, depends on the range of values on each variable. For example, the test scores and the criterion values of selected applicants will typically have a smaller range than the scores of all applicants. Statistical methods are available for adjusting the correlation to reflect the population of interest rather than the sample available. Such adjustments are often appropriate, as when comparing results across various situations. Reporting an adjusted correlation should be accompanied by a statement of the method and the statistics used in making the adjustment.

Standard 1.19

If a test is recommended for use in assigning persons to alternative treatments or is likely to be so used, and if outcomes from those treatments can reasonably be compared on a common criterion, then, whenever feasible, supporting evidence of differential outcomes should be provided.

Comment: If a test is used for classification into alternative occupational, therapeutic, or educational programs, it is not sufficient just to show that the test predicts treatment outcomes. Support for the validity of the classification procedure is provided by showing that the test is useful in determining which persons are likely to profit differentially from one treatment or another. Treatment categories may have to be combined to assemble sufficient cases for statistical analysis. It is recognized, however, that such research may not be feasible, because ethical and legal constraints on differential assignments may forbid control groups.

Standard 1.20

When a meta-analysis is used as evidence of the strength of a test-criterion relationship, the test and the criterion variables in the local situation should be comparable with those in the studies summarized. If relevant research includes credible evidence that any other features of the testing application may influence the strength of the test-criterion relationship, the correspondence between those features in the local situation and in the meta-analysis should be reported. Any significant disparities that might limit the applicability of the meta-analytic findings to the local situation should be noted explicitly.

Comment: The meta-analysis should incorporate all available studies meeting explicitly stated inclusion criteria. Meta-analytic evidence used in test validation typically is based on a number of tests measuring the same or very similar constructs and criterion measures that likewise measure the same or similar constructs. A meta-analytic study may also be limited to a single test and a single criterion. For each study included in the analysis, the test-criterion relationship is expressed in some common metric, often as an effect size. The strength of the test-criterion relationship may be moderated by features of the situation in which the test and criterion measures were obtained (e.g., types of jobs, characteristics of test takers, time interval separating collection of test and criterion measures, year or decade in which the data were collected). If test-criterion relationships vary according to such moderator variables, then, the numbers of studies permitting, the meta-analysis should report separate estimated effect size distributions conditional upon relevant situational features. This might be accomplished, for example, by reporting separate distributions for subsets of studies or by estimating the magnitudes of the influences of situational features on effect sizes.
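The two classical adjustments named in Standard 1.18 can be stated compactly; a minimal sketch follows with fabricated inputs. Consistent with the standard, the unadjusted coefficient is reported alongside each adjusted value.

```python
# Classical corrections for attenuation and for direct range restriction.
import math

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimate the true-score correlation given the two score reliabilities."""
    return r_xy / math.sqrt(rel_x * rel_y)

def correct_for_range_restriction(r, sd_unrestricted, sd_restricted):
    """Thorndike Case 2 correction for direct restriction on the predictor."""
    u = sd_unrestricted / sd_restricted
    return u * r / math.sqrt(1 + r * r * (u * u - 1))

r_observed = 0.30   # fabricated observed test-criterion correlation
print(f"unadjusted:            {r_observed:.2f}")
print(f"attenuation-corrected: {correct_for_attenuation(r_observed, 0.85, 0.70):.2f}")
print(f"range-corrected:       {correct_for_range_restriction(r_observed, 10, 6):.2f}")
```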
Standard 1.21

Any meta-analytic evidence used to support an intended test use should be clearly described, including methodological choices in identifying and coding studies, correcting for artifacts, and examining potential moderator variables. Assumptions made in correcting for artifacts such as criterion unreliability and range restriction should be presented, and the consequences of these assumptions made clear.

Comment: Meta-analysis inevitably involves judgments regarding a number of methodological choices. The bases for these judgments should be articulated. In the case of choices involving some degree of uncertainty, such as artifact corrections based on assumed values, the uncertainty should be acknowledged and the degree to which conclusions about validity hinge on these assumptions should be examined and reported.

Standard 1.22

When it is clearly stated or implied that a recommended test use will result in a specific outcome, the basis for expecting that outcome should be presented, together with relevant evidence.

Comment: If it is asserted, for example, that using a given test for employee selection will result in reduced employee errors or training costs, evidence in support of that assertion should be provided. A given claim for the benefits of test use may be supported by logical or theoretical argument as well as empirical data. Due weight should be given to findings in the scientific literature that may be inconsistent with the stated expectation.

Standard 1.23

When a test use or score interpretation is recommended on the grounds that testing or the testing program per se will result in some indirect benefit in addition to the utility of information from the test scores themselves, the rationale for anticipating the indirect benefit should be made explicit. Logical or theoretical arguments and empirical evidence for the indirect benefit should be provided. Due weight should be given to any contradictory findings in the scientific literature, including findings suggesting important indirect outcomes other than those predicted.

Comment: For example, certain educational testing programs have been advocated on the grounds that they would have a salutary influence on classroom instructional practices or would clarify students' understanding of the kind or level of achievement they were expected to attain. To the extent that such claims enter into the justification for a testing program, they become part of the validity argument for test use and so should be examined as part of the validation effort. Due weight should be given to evidence against such predictions, for example, evidence that under some conditions educational testing may have a negative effect on classroom instruction.
Standard 1.24

When unintended consequences result from test use, an attempt should be made to investigate whether such consequences arise from the test's sensitivity to characteristics other than those it is intended to assess or to the test's failure fully to represent the intended construct.

Comment: The validity of test score interpretations may be limited by construct-irrelevant components or construct underrepresentation. When unintended consequences appear to stem, at least in part, from the use of one or more tests, it is especially important to check that these consequences do not arise from such sources of invalidity. Although group differences, in and of themselves, do not call into question the validity of a proposed interpretation, they may increase the salience of plausible rival hypotheses that should be investigated as part of the validation effort.

2. RELIABILITY AND ERRORS OF MEASUREMENT

Background

A test, broadly defined, is a set of tasks designed to elicit or a scale to describe examinee behavior in a specified domain, or a system for collecting samples of an individual's work in a particular area. Coupled with the device is a scoring procedure that enables the examiner to quantify, evaluate, and interpret the behavior or work samples. Reliability refers to the consistency of such measurements when the testing procedure is repeated on a population of individuals or groups.

The discussion that follows introduces concepts and procedures that may not be familiar to some readers. It is not expected that the brief definitions and explanations presented here will be sufficient to enable the less sophisticated reader to become adequately conversant with these developments. To achieve a better understanding, such readers may need to consult more comprehensive treatments in the measurement literature.

The usefulness of behavioral measurements presupposes that individuals and groups exhibit some degree of stability in their behavior. However, successive samples of behavior from the same person are rarely identical in all pertinent respects. An individual's performances, products, and responses to sets of test questions vary in their quality or character from one occasion to another, even under strictly controlled conditions. This variation is reflected in the examinee's scores. The causes of this variability are generally unrelated to the purposes of measurement. An examinee may try harder, may make luckier guesses, be more alert, feel less anxious, or enjoy better health on one occasion than another. An examinee may have knowledge, experience, or understanding that is more relevant to some tasks than to others in the domain sampled by the test. Some individuals may exhibit less variation in their scores than others, but no examinee is completely consistent. Because of this variation and, in some instances, because of subjectivity in the scoring process, an individual's obtained score and the average score of a group will always reflect at least a small amount of measurement error. To say that a score includes a component of error implies that there is a hypothetical error-free value that characterizes an examinee at the time of testing.
In classical test theory rhis error-free value is referred to as the persons tra¿ score for the test or measurement ìt is conceprualized as the hypothetical average score resulting from many reperirions o[ the test or alternate forms of rhe insrrument. ln staristical terms, the true procedure. score is a personal parâmeter and eac,h observed score oFan examinee is presumed to esdmâte rhis pararncrer. Under a¡ approach to reliabilìry estimation known as generalizability theory, a comparable concept is refe¡red to as an exal,ni' neds uniuerse scor¿. Under item rcsponsc theory (lRT), a closely related concept is called an examinee's ability or nøit pardmeter, though observed scores and trait parameters may be srated in different units. The hypothetical diFFerence becween an examinee's observed score on any parciculer meesuremenr and rhe examineeì true or universe score for the procedure is called medsurcment et'ror. The definition of what constitutes a scandardized test or measurement procedure has b¡oadened significandy in recenr years. At one time the ca¡dinal features oF most standa¡dized tests wcr€ consistency of chc ¡est marerials from examinee to examinee, close adherence to stipulated procedures for test adminis¡ration, and use of prescribed scoring rules that could be applied wiúr a high degree of consisrency. These features were, in fact, what made a resr "scandardized," and thcy made meaningÊrl norms possible' In employ25 AERA-APA-NCM E_OO OOO35 BETIABILITY A¡¡D EBBOBS OF MEASUREMETIT / PAFT I men( serrings and cerrifica¡ion orograms, flex, ible measuremenr procedures have been in who rake ¡he easie¡ forrn may be expected to for many years. Individua.lized oral examinations, simularions, analyses of exrended ¡he mo¡e dilficult Fo¡rn. Such a diFFe¡ence would nor be considered an e¡ror oF measurement under mosr merhods olquanrifuing and summarizing error,'rhough generalizabiliry' theory would permit resr Fo¡m differences ro use cese reports, and performance in real-lile ser- 'rings such as'clinics are norv commonplace: In educarion, horvevet large-scale resring programs rvith a high degree of flexibiliry in resr formac and adminisrrarive procedures are a relarively recenr developmenr. In some programs cumulative porrfolios olstudenr work have been subs¡irured for mo¡e r¡adirìonal end-of-year resrs of achievemenr. Other prog¡ams norv allow examinees ro choose rheir earn a higher everage score rhan rhose who rake bc recognized as an error source. The systemaric facro¡s rhar may differentiall¡ affecr rhe perforrnance ofindividua] resr takers are nor as easily derected or overridden as those aflecring groups. For example, some examinees experience levefs oItesr anxiery rhar severely impair cognirive effìciency. The orvn topics ro demonsrrare rheir abiliries. Srill others permir or encourage small groups of examinees to work coope¡arively in completing rhe cest. ,A science examinacion, for exam- The individual sysremaric errors are nor gen- ple, might involve a ream oI high school eraìly regarded as an elemenr rhat conrribures srudents who conduct a study of the sources pr€sencc of such a condirion can somerimes be recognized in an examinee, bur rhe effecr cannot be overcome by sratisúcal adjustmens. ro unreliabilir¡. Rather, they consrirure a of pollution in local srreams and prepare a report on their findings. Examinarions of rhis kind raise complex ìssues regarding rhe domain reprcsenred by rhe tesr and abour rhe generalizabiliry of indìvidual and group scores. 
Eâch step roward grearer flexibiliry sou¡ce of construcr-irrelevant variance and rhus may derracr f¡om validiry almost inevitably enlarges rhe scope and magnirude of measuremenr error. However, it is possible that some of rhe resultanr sacrifices in reliabiliry may reduce consrrucr i¡relevance or const¡uct underrepresenrarion in an assessment Program. ineek morivation, inreresr, or atrenrion and the inconsistent application of skills are clearly internal factors that may Iead ro score inconsistencies. DiFFcrences among resting si¡es in ¡heir freedom f¡om disrracrions, ¡he ¡andom effecrs ofscorer subjectivir¡', and variarion in scorer standards are examples ol external factors. The porency and importance oiany parricular source depend on rhe speciFic conditions under which rhe measures are taken, holv perlormances are scored, and the Characteristics and lmplicatíons of Measurement Error Errors of meæurement are generally viewed as random and unpredicrable. They are cônceptually disringuished flrom sysrematic errors, which may also affecr performance of individuals or groups, bur in a consistenr rarher than a random manner. For example, a s;,sremaric g¡oup error would occu¡ as a ¡esuk ofdifferences in the diffìculry of resr fo¡ms thar have not been adequately equared. When one resr [orm is less diflìcult than anorher, examinees Imporranr sou¡ces of measuremenr error nray be broadly categorized as chose roored wirhin rhe examinees and rhose exrernal ro ,h.- Fl,,..,,-.;^^" ;^ .L- t-,,-l ^r-- --^- interpretarions made from rhe scores. A parricular lacto¡ such as the subjectiviry in scoring, may be a significanr source of measuremenr error in some ãssessmenrs and a minor consideration in others. Some changes in scores from one occasion to anorher, ir should be nored, are nor regarded as error, because rhey result, in parr, f¡om an intervention, learning, or maturarion 26 AERA APA NCME 0000036 PABT 1 / BEIIABIIITY AND ERROBS OF MEASUREMENT thar has occurred berween the initial and fìnal measures. The diflc¡ence within an individual indicares, ro some €xr€nt, the eFFec¿s of the intervention or the extent olgrowth. In such serrings, change per se cons¡irutes the phenomenon of inreresr. The diflerence .or the forms, scorers, adminisrrations, or other relevant dimensions. h aÌso includes a descriprion oF rhe examinee population ro whom the foregoing data appl¡ as the dara mây eccurarely reflect what is ¡rue of one popularion change score ¡hcn becomcs rhe measure to which reliabiliry perrains. Measurement error reduces the usefulness of measures. k limics rhe exren¡ to which example, a given reliabiliry coefficient or esti- test results can be generalized beyond the par- ticulars ofa specific application of rhe measurement pÍocess, ThereÊore, ir reduces the confidence char can be placed in any single meesu remen t. Because ¡andom me¿Jurement errors are inconsistent and unpredictable, they cannor be rcmoved from observed scores. However, their aggregate magnitude can be summarized in several ways, as discussed below. Summarizing Reliability Data Info¡ma¡ion about measurement error essential to the proper evaluation and use is of an instrum¿nr. This is true whether the me¿surc is bascd on the responses to a specific set of questions, a portfolio oFwork samples, the per[ormance of a task, or the crearion of an original product. The ideal approach to rhe srudy of reliabiliry enrails independenr replication of the entire measurement process. 
However, only a rough or partial approximarion of such replicarion is possible in many resri n g siruadons, and invesri gario n o[ mezsu remenr effor may requìre special srudìes that depan fro m rotrtine testin g p roccdu res. Nevertheless, ir should be che goal of tesr developers ro investigate test reliabiliry as fully as practicaJ conside¡arions permir. No rest developer is exempr from this responsibiliry. The crirical inflormation on reliabiliry includes the identificarion oFthe major sources of error¡ summary sraristics bearing on rhe size o[such errors, and the degree of gcneralizabiliry of scores across altcrnate bur. misrepresen t .what. is true.of. another. .For. . mared srandard error derived lrom scores of a nationally representarive sample may difler significanrly From char obtained for a more homogeneous sample drawn from one gender, one erhnic group, or one community. Reliabiliry in[ormation may be reported in terms o[va¡iances or srandard deviations of measurement errors, in terms of one or more coefficients, or in terms of IRT-based test informarion frrnctions. The srandard crror of measuremenr is the stand¿¡d deviation o[a hypothetical disrriburion oF measuremenr errors thar arises when a given population is assessed via a particular tcst or procedure. The overell variance of measuremenr errors is actually a weighted average of the values that hold at various ttue score levels. The variance at a particular level is called e conditional error uariance end irs square root a" eonditiotul st¿nd¿rd enor. Tiadirionall¡', three broad categories oF reliabiliry coeffìcienrs have been recognized: (a) coefficients derived from che i n isrratìo n ol parailel Forms in i ndepende nt testing sessions (alternare-form coeffìcients); adm (b) coelficienrs obrained by administration of rhe same instrument on seperate occesio ns ( cest- reresc or stab il ity coeffi cien rs); and (c) coefficients based on the relationships among scores derived from individual items or subsers of rhe items within a test, all daca accruing from a single administration (internal consistency coeffi cients). tùØhere test scoring involves a high level oÊ judgment, indexes ofscorer consistency are commonly obtained. With the development ol generalizability theor¡ the forcgoing rhree categorics may now be seen as special cases of a more general classification: generalízabiliry coeffi cients. 27 AE RA_APA-N C M E-O O OO 037 RELIABILITY AND EBRÍ)RS OF MEÁSUREMEÀIT I.ike rradirional reliabiliry coefficienrs, a gnerølizabilitT coffidentìs defined as ¡he ¡a¡io of rrue or universe score variance ¡o obse¡ved Unlike rradirional approaches to the study ofreliabiliry horvever, generalizàbili rt-ihèory Þérriíirs the'¡eíeãícher rotpeci$, score variance. and estimare rhe various components oF rrue score variance, error variance, and observed furimarion is rypically accomplished by rhe applicarion of rhe cechniques oFanalysis of variance. Of special inreresr are score variance. the separace numerica.l escimares oF rhe componen$ ofoverall erro¡ variance. Such esrimares permit examinarion oFrhe conrributìon oleach source ofe¡ror ro rhe overall measurement proc€ss. The generaìizabiliry approach also makes possible rhe esrimarion of coeffìcients rhar apply to a wide variery of porenrial measurement designs. The test inlormation ñrnction, an impor- IRI, effìcienrly summarizes how rvell the test discriminarer among individuals a¡ various levels ofrhe abiliry or rrait being assessed. 
Like traditional reliability coefficients, a generalizability coefficient is defined as the ratio of true or universe score variance to observed score variance. Unlike traditional approaches to the study of reliability, however, generalizability theory permits the researcher to specify and estimate the various components of true score variance, error variance, and observed score variance. Estimation is typically accomplished by the application of the techniques of analysis of variance. Of special interest are the separate numerical estimates of the components of overall error variance. Such estimates permit examination of the contribution of each source of error to the overall measurement process. The generalizability approach also makes possible the estimation of coefficients that apply to a wide variety of potential measurement designs.

The test information function, an important result of IRT, efficiently summarizes how well the test discriminates among individuals at various levels of the ability or trait being assessed. Under the IRT conceptualization, a mathematical function called the item characteristic curve or item response function is used as a model to represent the increasing proportion of correct responses to an item for groups at progressively higher levels of the ability or trait being measured. Given an adequate database, the parameters of the characteristic curve of each item in a test can be estimated. The test information function can then be approximated. This function may be viewed as a mathematical statement of the precision of measurement at each level of the given trait. Precision, in the IRT context, is analogous to the reciprocal of the conditional error variance of classical test theory.
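A minimal sketch under one common model, the two-parameter logistic, follows: the item response function gives the probability of a correct response at each trait level, item information is a² · P · (1 − P), and test information is the sum over items. The item parameters are fabricated.

```python
# Test information function under a two-parameter logistic IRT model.
import numpy as np

def two_pl_probability(theta, a, b):
    """Item response function: probability of a correct response at theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def test_information(theta, a, b):
    p = two_pl_probability(theta[:, None], a, b)
    return (a ** 2 * p * (1 - p)).sum(axis=1)

a = np.array([1.2, 0.8, 1.5, 1.0])   # fabricated discriminations
b = np.array([-1.0, 0.0, 0.5, 1.5])  # fabricated difficulties
theta = np.linspace(-3, 3, 7)
for t, info in zip(theta, test_information(theta, a, b)):
    # The conditional standard error at theta is 1 / sqrt(information).
    print(f"theta={t:+.1f}  information={info:.2f}  SE={1/np.sqrt(info):.2f}")
```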
Interpretation of Reliability Data

In general, reliability coefficients are most useful in comparing tests or measurement procedures, particularly those that yield scores in different units or metrics. However, such comparisons are rarely straightforward. Allowances must be made for differences in the variability of the groups on which the coefficients are based, the techniques used to obtain the coefficients, the sources of error reflected in the coefficients, and the lengths of the instruments being compared in terms of testing time.

Generalizability coefficients and the many coefficients included under the traditional categories may appear to be interchangeable, but some convey quite different information from others. A coefficient in any given category may encompass errors of measurement from a highly restricted perspective, a very broad perspective, or some point between these extremes. For example, a coefficient may reflect error due to scorer inconsistencies but not reflect the variation that characterizes a succession of examinee performances or products. A coefficient may reflect only the internal consistency of item responses within an instrument and fail to reflect measurement error associated with day-to-day changes in examinee health, efficiency, or motivation.

It should not be inferred, however, that alternate-form or test-retest coefficients based on test administrations several days or weeks apart are always preferable to internal consistency coefficients. For many tests, internal consistency coefficients do not differ significantly from alternate-form coefficients. Where only one form of a test exists, retesting may result in an inflated correlation between the first and second scores due to idiosyncratic features of the test or to examinee recall of initial responses. Also, an individual's status on some attributes, such as mood or emotional state, may change significantly in a short period of time. In the assessment of such constructs the multiple measures that give rise to reliability estimates should be obtained within the short period in which the attribute remains stable. Therefore, for characteristics of this kind an internal consistency coefficient may be preferred.

The standard error of measurement is generally more relevant than the reliability coefficient once a measurement procedure has been adopted and interpretation of scores has become the user's primary concern. It should be noted that standard errors share some of the ambiguities which characterize reliability coefficients, and estimates may vary in their quality. Information about the precision of measurement at each of several widely spaced score levels, that is, conditional standard errors, is usually a valuable supplement to the single statistic for all score levels combined. Like reliability and generalizability coefficients, standard errors may reflect variation from many sources of error or only a few. For most purposes, a more comprehensive standard error is more informative than a less comprehensive value. However, there are many exceptions to this generalization. Practical constraints often preclude conduct of the kinds of studies that would yield estimates of the preferred standard errors.

Measurements derived from observations of behavior or evaluations of products are especially sensitive to a variety of error factors. These include evaluator biases and idiosyncrasies, scoring subjectivity, and intra-examinee factors that cause variation from one performance or product to another. The methods of generalizability theory are well suited to the investigation of the reliability of the scores on such measures. Estimates of the error variance associated with each specific source and with the interactions between sources indicate the extent to which examinee scores may be generalized to a population of scorers and to a universe of products or performances.

The interpretations of test scores may be broadly categorized as relative or absolute. Relative interpretations convey the standing of an individual or group within a reference population. Absolute interpretations relate the status of an individual or group to defined standards. These standards may originate in empirical data for one or more populations or be based entirely on authoritative judgment. Different values of the standard error apply to the two types of interpretations.

The test information function can be perceived as an alternative to traditional indices of measurement precision, but there are important distinctions that should be noted. Standard errors under classical test theory can be derived by several different approaches. These yield similar, but not identical, results. More significantly, standard errors, like reliability coefficients, may reflect a broad configuration of error factors or a restricted configuration, depending on the design of the reliability study. Test information functions, on the other hand, are limited to the restricted definition of measurement error that is associated with internal consistency reliabilities. In addition, under IRT several different mathematical models have been proposed and accepted as the basic form of the item characteristic curve. Adoption of one model rather than another can have a material effect on the derived test information function.

A final consideration has significant implications for both IRT and classical approaches to quantification of test score precision. It is this: Indices of precision depend on the scale in which they are reported. An index stated in terms of raw scores or the trait level estimates of IRT may convey a radically different perception of reliability than the same index restated in terms of derived scores. This same contrast may hold for conditional standard errors. In terms of the basic score scale, precision may appear to be high at one score level, low at another. But when the conditional standard errors are restated in units of derived scores, such as grade equivalents or standard scores, quite different trends in comparative precision may emerge. Therefore, measurement precision under both theories very strongly depends on the scale in which test scores are reported and interpreted.
ThereForc, rneasure' menr precision under both theories very strongly depends on the scalc in which test scores are reporred and interpreced. Precision and consisrency in measurement ate always desirable. However, the need 29 AERA APA NCME OOOOO39 BELIABILITY AI{D ERBOBS OF MEASUREMENT for orecision increases as the consequences of decisions and interpretarions grow in importance. If a decision can and will be corroborared by informarion From other sources or if an erroneous inicial decision can be quicldy corrécted,'sco res'çvirh modeslrel iabil iry may sulfice. But ifa rest score leads ro a decision thar is nor easily reversed, such as rejection or admission of a candidate to a prolessional school or the decision by a jury thar a serious injury was susrained, rhe need for a high degree of precisìon is much greater, \Íïere rhe purpose of measuremenr is classification, some measuremenr er¡ors ere more serious than orhers. An individual who is far above or far below rhe value esrablished for pass/fail or for eligibiliry for a special program can be mismeasured wichour se¡ious consequences. Mismeasurement of examinees whose rrue scores are close ro rhe cur score is a mo¡c serious concern. The techniques used ro quanrifr reliabiliry should recognizt these circumstances. This can be done by reporring rhe condirional srandard er¡or in rhe viciniry of;he crirical value. Some authorities have proposed that a semanric disrincrion be made be¡ween "relia- biliry of scores" and "degree oF agreement in classificarion." The former rerm would be reserved for analysis of score variarion under repeâred mearuremen¡. The term clasifcation towistencl or inter-ratn agreement, rather than reliabilìry, would be used in discussions of consistency of classifìcation. Adoprion of such usagc wouid maicc it clear rhar rhe importance ofan error ofany given size depcnds on rhe proximiry oF rhe examinee's score ro rhe cur score. However, it should be recognized tl,ar thc degree olconsisrency or agreerrrerrr in cxaminee classificacion is specific to the cut score employed and its location within che score distribution. / PART I larqe Þ---f-' rh¡ r--'_.* __"- _'-Þ--' ---o- çro,,rç. -__- nnçi¡ite anrl n"orti.e - -.o""'-*urernenr errors ofindividuals may averege our almost completely in group means. However, the sampling errors associated wirh rhe random sampling oFpersons who are rested For purposes of þrogram evaluatio n are still -p¡es= ent..This component of the variarion in rhe mean achievemenr of school clæses fiom year to year or in rhe average expressed sarislacrion o[ rhe c]ienrs oF a progrem may consritute e porenr source olerror in program evaluarions. h can be a significanr source o[error in infe¡ences abour programs even if rhere is a high degtee of precision in oFsuccessive samples individual resr scores. Therefo¡e, when an instrumen¡ is used to make group judgmenrs, reliabiliry data must bear direcrly on the interpretacions specific ro groups. Srandard errors appropriare ro individua.l scorei are nor appropriate measures oF the precision ol group aveteges. A more appropriace srarisric is rhe srandard erro¡ of rhe observed score means. Generalizabiliry theory can provide more refined indices when ¡he sources of measurement error are numerous and complex. Tlpicall¡ developers and disrriburors cf resrs have primary responsibility for obraining and reporting evidence oF reliability or test inFormation Íunctions. 
Typically, developers and distributors of tests have primary responsibility for obtaining and reporting evidence of reliability or test information functions. The user must have such data to make an informed choice among alternative measurement approaches and will generally be unable to conduct reliability studies prior to operational use of an instrument.

In some instances, however, local users of a test or procedure must accept at least partial responsibility for documenting the precision of measurement. This obligation holds when one of the primary purposes of measurement is to rank or classify examinees within the local population. It also holds when users must rely on local scorers who are trained to use the scoring rubrics provided by the test developer. In such settings, local factors may materially affect the magnitude of error variance and observed score variance. Therefore, the reliability of scores may differ appreciably from that reported by the developer.

The reporting of reliability coefficients alone, with little detail regarding the methods used to estimate the coefficients, the nature of the group from which the data were derived, and the conditions under which the data were obtained, constitutes inadequate documentation. General statements to the effect that a test is "reliable" or that it is "sufficiently reliable to permit interpretations of individual scores" are rarely, if ever, acceptable. It is the user who must take responsibility for determining whether or not scores are sufficiently trustworthy to justify anticipated uses and interpretations. Of course, test constructors and publishers are obligated to provide sufficient data to make informed judgments possible.

As the foregoing comments emphasize, there is no single, preferred approach to quantification of reliability. No single index adequately conveys all of the relevant facts. No one method of investigation is optimal in all situations, nor is the test developer limited to a single approach for any instrument. The choice of estimation techniques and the minimum acceptable level for any index remain a matter of professional judgment.

Although reliability is discussed here as an independent characteristic of test scores, it should be recognized that the level of reliability of scores has implications for the validity of score interpretations. Reliability data ultimately bear on the repeatability of the behavior elicited by the test and the consistency of the resultant scores. The data also bear on the consistency of classifications of individuals derived from the scores. To the extent that scores reflect random errors of measurement, their potential for accurate prediction of criteria, for beneficial examinee diagnosis, and for wise decision making is limited. Relatively unreliable scores, in conjunction with other convergent information, may sometimes be of value to a test user, but the level of a score's reliability places limits on its unique contribution to validity for all purposes.
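Because the chapter repeatedly distinguishes among internal consistency, alternate-form, and test-retest estimates, a small illustration of the most common internal consistency index may be useful before the standards themselves. The sketch below computes coefficient alpha from a persons-by-items score matrix; the data, the function name, and the use of NumPy are illustrative assumptions, not part of the Standards.

    import numpy as np

    def coefficient_alpha(scores):
        """Cronbach's alpha for a (persons x items) score matrix.

        alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
        """
        scores = np.asarray(scores, dtype=float)
        n_items = scores.shape[1]
        item_variances = scores.var(axis=0, ddof=1)      # variance of each item
        total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
        return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

    # Hypothetical responses: 6 examinees, 4 items scored 0/1
    responses = [[1, 1, 0, 1],
                 [1, 0, 0, 0],
                 [1, 1, 1, 1],
                 [0, 0, 0, 1],
                 [1, 1, 1, 0],
                 [0, 1, 0, 0]]
    print(round(coefficient_alpha(responses), 3))

As the text emphasizes, this is only one definition of measurement error among several, and no single such index conveys all of the relevant facts.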
Standard 2.1
For each total score, subscore, or combination of scores that is to be interpreted, estimates of relevant reliabilities and standard errors of measurement or test information functions should be reported.

Comment: It is not sufficient to report estimates of reliabilities and standard errors of measurement only for total scores when subscores are also interpreted. The form-to-form and day-to-day consistency of total scores on a test may be acceptably high, yet subscores may have unacceptably low reliability. For all scores to be interpreted, users should be supplied with reliability data in enough detail to judge whether scores are precise enough for the users' intended interpretations. Composites formed from selected subtests within a test battery are frequently proposed for predictive and diagnostic purposes. Users need information about the reliability of such composites.

Standard 2.2
The standard error of measurement, both overall and conditional (if relevant), should be reported both in raw score or original scale units and in units of each derived score recommended for use in test interpretation.

Comment: The most common derived scores include standard scores, grade or age equivalents, and percentile ranks. Because raw scores on norm-referenced tests are only rarely interpreted directly, standard errors in derived score units are more helpful to the typical test user. A confidence interval for an examinee's true score, universe score, or percentile rank serves much the same purpose as a standard error and can be used as an alternative approach to convey reliability information. The implications of the standard error of measurement are especially important in situations where decisions cannot be postponed and corroborative sources of information are limited.

Standard 2.3
When test interpretation emphasizes differences between two observed scores of an individual or two averages of a group, reliability data, including standard errors, should be provided for such differences.

Comment: Observed score differences are used for a variety of purposes. Achievement gains are frequently the subject of inferences for groups as well as individuals. Differences between verbal and performance scores of intelligence and scholastic ability tests are often employed in the diagnosis of cognitive impairment and learning problems. Psychodiagnostic inferences are frequently drawn from the differences between subtest scores. Aptitude and achievement batteries, interest inventories, and personality assessments are commonly used to identify and quantify the relative strengths and weaknesses or the pattern of trait levels of an examinee. When the interpretation of test scores centers on the peaks and valleys in the examinee's test score profile, the reliability of score differences for all pairs of scores is critical.
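Standard 2.3 turns on the reliability of a difference between two scores. As an illustrative sketch, and not a formula prescribed by the Standards, the functions below apply a commonly used classical expression for the reliability of a difference between two scores reported on scales with equal standard deviations, along with the standard error of that difference; the variable names and example values are hypothetical.

    import math

    def difference_reliability(r_xx, r_yy, r_xy):
        """Reliability of X - Y for scores with equal standard deviations."""
        return (0.5 * (r_xx + r_yy) - r_xy) / (1.0 - r_xy)

    def sem_difference(sd, r_xx, r_yy):
        """Standard error of a difference: sqrt(SEM_x^2 + SEM_y^2), equal SDs."""
        return sd * math.sqrt(2.0 - r_xx - r_yy)

    # Hypothetical subtests: each reliable (.90) but highly correlated (.80)
    print(round(difference_reliability(0.90, 0.90, 0.80), 2))  # 0.50
    print(round(sem_difference(10.0, 0.90, 0.90), 2))          # 4.47

The example shows why the standard asks for such data: two individually reliable, highly correlated subtests can yield a difference score of only modest reliability.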
Standard 2.4
Each method of quantifying the precision or consistency of scores should be described clearly and expressed in terms of statistics appropriate to the method. The sampling procedures used to select examinees for reliability analyses and descriptive statistics on these samples should be reported.

Comment: Information on the method of subject selection, sample sizes, means, standard deviations, and demographic characteristics of the groups helps users judge the extent to which reported data apply to their own examinee populations. If the test-retest or alternate-form approach is used, the interval between testings should be indicated. Because there are many ways of estimating reliability, each influenced by different sources of measurement error, it is unacceptable to say simply, "The reliability of test X is .90." A better statement would be, "The reliability coefficient of .90 reported for scores on test X was obtained by correlating scores from forms A and B, administered on successive days. The data were based on a sample of 400 10th-grade students from five middle-class suburban schools in New York State. The demographic breakdown of this group was as follows: ...."

Standard 2.5
A reliability coefficient or standard error of measurement based on one approach should not be interpreted as interchangeable with another derived by a different technique unless their implicit definitions of measurement error are equivalent.

Comment: Internal consistency, alternate-form, test-retest, and generalizability coefficients should not be considered equivalent, as each may incorporate a unique definition of measurement error. Error variances derived via item response theory may not be equivalent to error variances estimated via other approaches. Test developers should indicate the sources of error that are reflected in or ignored by the reported reliability indices.

Standard 2.6
If reliability coefficients are adjusted for restriction of range or variability, the adjustment procedure and both the adjusted and unadjusted coefficients should be reported. The standard deviations of the group actually tested and of the target population, as well as the rationale for the adjustment, should be presented.

Comment: Application of a correction for restriction in variability presumes that the available sample is not representative of the test-taker population to which users might be expected to generalize. The rationale for the correction should consider the appropriateness of such a generalization. Adjustment formulas that presume constancy in the standard error across score levels should not be used unless constancy can be defended.
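The adjustment the comment warns about is typically the classical formula that assumes the standard error of measurement is the same in the restricted group and in the target population. A minimal sketch, with hypothetical values:

    def adjust_for_restriction(r_restricted, sd_restricted, sd_target):
        """Classical range-restriction adjustment for a reliability coefficient.

        Assumes a constant standard error of measurement in the restricted
        sample and the target population -- the very assumption Standard 2.6
        says must be defended before the formula is used.
        """
        ratio = (sd_restricted / sd_target) ** 2
        return 1.0 - ratio * (1.0 - r_restricted)

    # Hypothetical: r = .80 in a restricted sample (SD 8) vs. target SD 12
    print(round(adjust_for_restriction(0.80, 8.0, 12.0), 3))  # 0.911

Per the standard, both the unadjusted value (.80 here) and the adjusted value would be reported, together with the two standard deviations and the rationale.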
Standard 2.7
When subsets of items within a test are dictated by the test specifications and can be presumed to measure partially independent traits or abilities, reliability estimation procedures should recognize the multifactor character of the instrument.

Comment: The total score on a test that is clearly multifactor in nature should be treated as a composite score. If an internal consistency estimate of total score reliability is obtained by the split-halves procedure, the halves should be parallel in content and statistical characteristics. Stratified coefficient alpha should be used rather than the more familiar nonstratified coefficient.

Standard 2.8
Test users should be informed about the degree to which rate of work may affect examinee performance.

Comment: It is not possible to state, in general, whether reliability coefficients will increase or decrease when rate of work becomes an important source of systematic variance. Rate of work, as an examinee trait, may be more stable or less stable from occasion to occasion than the other factors the test is designed to measure. Because speededness has differential effects on various estimates, information on speededness is helpful in interpreting reported coefficients. The importance of the speed factor can sometimes be inferred from analyses of item responses and from observations by examiners during test administrations conducted for reliability analyses. The distribution of "last item attempted" and increases in the frequency of omitted responses toward the end of a test are also highly informative, though not conclusive, evidence regarding speededness. A decline in the proportion of correct responses, beyond that attributable to increasing item difficulty, may indicate that some examinees were responding randomly. With computer-administered tests, abnormally fast item response times, particularly toward the end of the test, may also suggest that examinees were responding randomly. In the case of constructed-response exercises, including essay questions, the completeness of the responses may suggest that time constraints had little effect on early items but a significant effect on later items. Introduction of a speed factor into what might otherwise be a power test may have a marked effect on alternate-form and test-retest reliabilities. A shift from a paper-and-pencil format to a computer-administered format may affect test speededness.

Standard 2.9
When a test is designed to reflect rate of work, reliability should be estimated by the alternate-form or test-retest approach, using separately timed administrations.

Comment: Split-half coefficients based on separate scores from the odd-numbered and even-numbered items are known to yield inflated estimates of reliability for highly speeded tests. Coefficient alpha and other internal consistency coefficients may also be biased, though the size of the bias is not as clear as that for the split-halves coefficient.
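The split-half estimates that Standards 2.7 and 2.9 qualify are usually stepped up with the Spearman-Brown formula. The sketch below computes an odd/even split-half coefficient from a hypothetical response matrix; as the comment notes, such an estimate would be inflated for a highly speeded test, which is why separately timed administrations are preferred in that case.

    import numpy as np

    def split_half_reliability(scores):
        """Odd/even split-half correlation, stepped up by Spearman-Brown.

        r_full = 2 * r_half / (1 + r_half)
        """
        scores = np.asarray(scores, dtype=float)
        odd = scores[:, 0::2].sum(axis=1)   # 1st, 3rd, 5th, ... items
        even = scores[:, 1::2].sum(axis=1)  # 2nd, 4th, 6th, ... items
        r_half = np.corrcoef(odd, even)[0, 1]
        return 2.0 * r_half / (1.0 + r_half)

    # Hypothetical 0/1 responses: 6 examinees, 6 items
    responses = [[1, 1, 1, 0, 1, 0],
                 [1, 0, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1, 1],
                 [0, 0, 1, 0, 0, 0],
                 [1, 1, 0, 1, 1, 1],
                 [0, 1, 0, 0, 1, 0]]
    print(round(split_half_reliability(responses), 3))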
Standard 2.10
When subjective judgment enters into test scoring, evidence should be provided on both inter-rater consistency in scoring and within-examinee consistency over repeated measurements. A clear distinction should be made among reliability data based on (a) independent panels of raters scoring the same performance or product, (b) a single panel scoring successive performances or new products, and (c) independent panels scoring successive performances or new products.

Comment: Task-to-task variations in the quality of an examinee's performance and rater-to-rater inconsistencies in scoring represent independent sources of measurement error. Reports of reliability studies should make clear which of these sources are reflected in the data. Where feasible, the error variances arising from each source should be estimated. Generalizability studies and variance component analyses are especially helpful in this regard. These analyses can provide separate error variance estimates for tasks within examinees, for judges, and for occasions within the time period of trait stability. Information should be provided on the qualifications of the judges used in reliability studies. Inter-rater or inter-observer agreement may be particularly important for ratings and observational data that involve subtle discriminations. It should be noted, however, that when raters evaluate positively correlated characteristics, a favorable or unfavorable assessment of one trait may color their opinions of other traits. Moreover, high inter-rater consistency does not imply high examinee consistency from task to task. Therefore, internal consistency within raters and inter-rater agreement do not guarantee high reliability of examinee scores.
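A minimal one-facet sketch of the variance component analysis the comment recommends, for a fully crossed persons-by-raters design. The estimators follow the usual two-way ANOVA expected mean squares; the ratings, the rubric scale, and the choice to treat raters as the only error facet are all illustrative assumptions.

    import numpy as np

    def g_study_persons_by_raters(ratings):
        """Variance components and a relative G coefficient, persons x raters."""
        x = np.asarray(ratings, dtype=float)
        n_p, n_r = x.shape
        grand = x.mean()
        ms_p = n_r * ((x.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
        ms_r = n_p * ((x.mean(axis=0) - grand) ** 2).sum() / (n_r - 1)
        resid = (x - x.mean(axis=1, keepdims=True)
                   - x.mean(axis=0, keepdims=True) + grand)
        ms_pr = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))
        var_pr = ms_pr                                # rater-by-person error
        var_p = max((ms_p - ms_pr) / n_r, 0.0)        # persons (universe score)
        var_r = max((ms_r - ms_pr) / n_p, 0.0)        # rater leniency/severity
        g_rel = var_p / (var_p + var_pr / n_r)        # relative decisions, n_r raters
        return var_p, var_r, var_pr, g_rel

    # Hypothetical: 5 examinees each scored by 3 raters on a 1-6 rubric
    ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 6], [3, 3, 4], [1, 2, 2]]
    print(g_study_persons_by_raters(ratings))

A fuller G study would add tasks and occasions as facets, in line with the comment's observation that rater agreement alone does not guarantee reliable examinee scores.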
Standard 2.11
If there are generally accepted theoretical or empirical reasons for expecting that reliability coefficients, standard errors of measurement, or test information functions will differ substantially for various subpopulations, publishers should provide reliability data as soon as feasible for each major population for which the test is recommended.

Comment: If test score interpretation involves inferences within subpopulations as well as within the general population, reliability data should be provided for both the subpopulations and the general population. Test users who work exclusively with a specific cultural group or with individuals who have a particular disability would benefit from an estimate of the standard error for such a subpopulation. Some groups of test takers—pre-school children, for example—tend to respond to test stimuli in a less consistent fashion than do older children.

Standard 2.12
If a test is proposed for use in several grades or over a range of chronological age groups and if separate norms are provided for each grade or each age group, reliability data should be provided for each age or grade population, not solely for all grades or ages combined.

Comment: A reliability coefficient based on a sample of examinees spanning several grades or a broad range of ages in which average scores are steadily increasing will generally give a spuriously inflated impression of reliability. When a test is intended to discriminate within age or grade populations, reliability coefficients and standard errors should be reported separately for each population.

Standard 2.13
If local scorers are employed to apply general scoring rules and principles specified by the test developer, local reliability data should be gathered and reported by local authorities when adequate size samples are available.

Comment: For example, many statewide testing programs depend on local scoring of essays, constructed-response exercises, and performance tests. Reliability analyses bear on the possibility that additional training of scorers is needed and, hence, should be an integral part of program monitoring.

Standard 2.14
Conditional standard errors of measurement should be reported at several score levels if constancy cannot be assumed. Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score.

Comment: Estimation of conditional standard errors is usually feasible even with the sample sizes that are typically used for reliability analyses. If it is assumed that the standard error is constant over a broad range of score levels, the rationale for this assumption should be presented.

Standard 2.15
When a test or combination of measures is used to make categorical decisions, estimates should be provided of the percentage of examinees who would be classified in the same way on two applications of the procedure, using the same form or alternate forms of the instrument.

Comment: When a test or composite is used to make categorical decisions, such as pass/fail, the standard error of measurement at or near the cut score has important implications for the trustworthiness of these decisions. However, the standard error cannot be translated into the expected percentage of consistent decisions unless assumptions are made about the form of the distributions of measurement errors and true scores. It is preferable that this percentage be estimated directly through the use of a repeated-measurements approach, if consistent with the requirement of test security and if adequate samples are available.
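Standard 2.15 asks for a directly estimated percentage of consistent classifications. A minimal sketch of that repeated-measurements tabulation follows; the scores and the pass/fail cut are hypothetical, and the chance-corrected index shown is a standard descriptive supplement rather than one mandated by the text.

    import numpy as np

    def decision_consistency(form_a, form_b, cut):
        """Proportion of examinees classified the same way on two applications."""
        a = np.asarray(form_a) >= cut
        b = np.asarray(form_b) >= cut
        agreement = np.mean(a == b)
        # Chance-corrected agreement (Cohen's kappa) as a supplementary index
        p_chance = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
        kappa = (agreement - p_chance) / (1 - p_chance)
        return agreement, kappa

    # Hypothetical scores for 8 examinees on two alternate forms, cut score 70
    form_a = [62, 75, 71, 88, 69, 73, 55, 90]
    form_b = [65, 72, 68, 85, 72, 74, 58, 93]
    print(decision_consistency(form_a, form_b, 70))

Note that the disagreements in this small example come from examinees near the cut score, consistent with the chapter's earlier discussion of mismeasurement in the vicinity of critical values.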
Standard 2.16
In some testing situations, the items vary from examinee to examinee—through random selection from an extensive item pool or application of algorithms based on the examinee's level of performance on previous items or preferences with respect to item difficulty. In this type of testing, the preferred approach to reliability estimation is one based on successive administrations of the test under conditions similar to those prevailing in operational test use.

Comment: Varying the set of items presented to each examinee is an acceptable procedure in some settings. If this approach is used, reliability data should be appropriate to this procedure. Estimates of standard errors of ability scores can be computed through the use of IRT and reported routinely as part of the adaptive testing procedure. However, those estimates are not an adequate substitute for estimates based on successive administrations of the adaptive test, nor do they bear on the issue of stability over short intervals. IRT estimates are contingent on the adequacy of both the item parameter estimates and the item response models adopted in the theory. Estimates of reliabilities and standard errors of measurement based on the administration and analysis of alternate forms of an adaptive test reflect errors associated with the entire measurement process. The alternate-form estimates provide an independent check on the magnitude of the errors of measurement specific to the adaptive feature of the testing procedure.

Standard 2.17
When a test is available in both long and short versions, reliability data should be reported for scores on each version, preferably based on an independent administration of each.

Comment: Some tests and test batteries are published in both a "full-length" version and a "survey" or "short" version. In many applications the Spearman-Brown formula will satisfactorily approximate the reliability of one of these from data based on the other. However, context effects are commonplace in tests of maximum performance. Also, the short version of a standardized test often comprises a nonrandom sample of items from the full-length version. Therefore, the shorter version may be more reliable or less reliable than the Spearman-Brown projections from the full-length version. The reliability of scores on each version is best evaluated through an independent administration of each, using the designated time limits.

Standard 2.18
When significant variations are permitted in test administration procedures, separate reliability analyses should be provided for scores produced under each major variation if adequate sample sizes are available.

Comment: To accommodate examinees with disabilities, test publishers might authorize modifications in the procedures and time limits that are specified for the administration of the paper-and-pencil edition of a test. In some cases, modified editions of the test itself may be provided. For example, tape-recorded versions for use in a group setting or with individual equipment may be used to test examinees who exhibit reading disabilities or attention deficits. If such modifications can be employed with test takers who are not disabled, insights can be gained regarding the possible effects on test scores of these nonstandard administrations.

Standard 2.19
When average test scores for groups are used in program evaluations, the groups tested should generally be regarded as a sample from a larger population, even if all examinees available at the time of measurement are tested. In such cases the standard error of the group mean should be reported, as it reflects variability due to sampling of examinees as well as variability due to measurement error.

Comment: The graduating seniors of a liberal arts college, the current clients of a social service agency, and analogous groups exposed to a program of interest typically constitute a sample in a longitudinal sense. Presumably, comparable groups from the same population will recur in future years, given static conditions. The factors leading to uncertainty in conclusions about program effectiveness arise from the sampling of persons as well as measurement error. Therefore, the standard error of the mean observed score, reflecting variation in both true scores and measurement errors, represents a more realistic standard error in this setting. Even this value may underestimate the variability of group means over time. In many settings, the static conditions assumed under random sampling of persons do not prevail.

Standard 2.20
When the purpose of testing is to measure the performance of groups rather than individuals, a procedure frequently used is to assign a small subset of items to each of many subsamples of examinees. Data are aggregated across subsamples and item subsets to obtain a measure of group performance. When such procedures are used for program evaluation or population descriptions, reliability analyses must take the sampling scheme into account.

Comment: This type of measurement program is termed matrix sampling. It is designed to reduce the time demanded of individual examinees and to increase the total number of items on which data are obtained. This testing approach provides the same type of information about group performances that would accrue if all examinees could respond to all exercises in the item pool. Reliability statistics must be appropriate to the sampling plan used with respect to examinees and items.
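The matrix-sampling design in the comment can be illustrated with a small aggregation sketch. The booklet assignments and scores below are hypothetical, and the estimator shown, a simple mean of per-item proportions correct, is only one of several possibilities; an operational analysis would also attach a standard error that reflects the sampling of both examinees and items.

    import numpy as np

    # Hypothetical matrix sample: each booklet carries 2 of 6 pool items,
    # and each booklet is answered by a different subsample of examinees.
    # Entries are per-examinee 0/1 scores on that booklet's items.
    booklets = {
        ("item1", "item2"): [[1, 0], [1, 1], [0, 0]],
        ("item3", "item4"): [[1, 1], [0, 1], [1, 1]],
        ("item5", "item6"): [[0, 0], [1, 0], [0, 1]],
    }

    # Aggregate across subsamples and item subsets: proportion correct per
    # item, then an overall group estimate over the whole item pool.
    item_p = {}
    for items, scores in booklets.items():
        scores = np.asarray(scores, dtype=float)
        for j, item in enumerate(items):
            item_p[item] = scores[:, j].mean()

    group_estimate = np.mean(list(item_p.values()))
    print(item_p)
    print(round(group_estimate, 3))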
3. TEST DEVELOPMENT AND REVISION

Background

Test development is the process of producing a measure of some aspect of an individual's knowledge, skill, ability, interests, attitudes, or other characteristics by developing items and combining them to form a test, according to a specified plan. Test development is guided by the stated purpose(s) of the test and the intended inferences to be made from the test scores. The test development process involves consideration of content, format, the context in which the test will be used, and the potential consequences of using the test. Test development also includes specifying conditions for administering the test, determining procedures for scoring the test performance, and reporting the scores to test takers and test users. This chapter focuses primarily on the following aspects of test development: stating the purpose(s) of the test, defining a framework for the test, developing test specifications, developing and evaluating items and their associated scoring procedures, assembling the test, and revising the test. The first section describes the test development process that begins with a statement of the purpose(s) of the test and culminates with the assembly of the test. The second section addresses several special considerations in test development, including considerations in delineating the test framework and in developing performance assessments. The chapter concludes with a discussion of test revision. Issues bearing on validity, reliability, and fairness are interwoven within the stages of test development. Each of these topics is addressed comprehensively in other chapters of the Standards: validity in chapter 1, reliability in chapter 2, and aspects of fairness in chapters 7, 8, 9, and 10. Additional material on test administration and scoring, and on reporting scores and results, is provided in chapter 5. Chapter 4 discusses score scales, and the focus of chapter 6 is test documents.

Test Development

The process of developing educational and psychological tests commonly begins with a statement of the purpose(s) of the test and the construct or content domain to be measured. Tests of the same construct or domain can differ in important ways, because a number of decisions must be made as the test is developed. It is helpful to consider the four phases leading from the original statement of purpose(s) to the final product: (a) delineation of the purpose(s) of the test and the scope of the construct or the extent of the domain to be measured; (b) development and evaluation of the test specifications; (c) development, field testing, evaluation, and selection of the items and scoring guides and procedures; and (d) assembly and evaluation of the test for operational use. What follows is a description of typical test development procedures, though there may be sound reasons that some of these steps are followed in some settings and not in others.

The first step is to extend the original statement of purpose(s), and the construct or content domain being considered, into a framework for the test that describes the extent of the domain, or the scope of the construct to be measured. The test framework, therefore, delineates the aspects (e.g., content, skills, processes, and diagnostic features) of the construct or domain to be measured. For example, "Does eighth-grade mathematics include algebra?" "Does verbal ability include text comprehension as well as vocabulary?" "Does self-esteem include both feelings and acts?" The delineation of the test framework can be guided by theory or an analysis of the content domain or job requirements, as in the case of many licensing and employment tests. The test framework serves as a guide to subsequent test evaluation. The chapter on validity provides a more thorough discussion of the relationships among the construct or content domain, the test framework, and the purpose(s) of the test.
l"-,,- L.-^ / PART ¡ tions, The tesr specificarions delineate the For- t^- -^^L ¡(L¡¡¡, V(r¡L¡ pUrpusLò trta/ L - l¡rurL :-^---- uL - --^-L^effecrively sened by a shon consrrucced-ræponse formar. Short-answer irems require a response of no more rian a few words. Extended-response Formats require rhe ¡esr raker ro wri¡e a more mat of irems, tæks, or quesrions; the response formàt or condirions For resþonding; and rhe rype ofscoring procedures. The specificarions oIone or more sentences or paragraphs. Perlormance assessmenrs oF¡en seek ¡o emulare rhe concexr or condirions in may indicare rhe desired psychomerric propdiffìculry and discriminarion, as well as rhe desired resr properties such as ¡est difficulry, inter-irem correlations, and reliabiliry. The resr specificarions may also include such lactors as rime resrricrions, characrerisrìcs oF rhe incended popularion oF test takers, and procedures for adminisrration. which the intended knowledge or skills --l- ^L^.,. ¡esc is ro measure, and whar irs scores are intended to convey, the nexr step is to design rhe resr by esrablishing resr specifica- whar che erries oFirems, such as All subsequenr cesr developmenr acriviries are guided by the tesr specifications. Tesr specificarions will include, ar lcasr implicitl¡ an indicarion of wherher rhe rest scores will be primarily norm-¡eferenced or crire¡io n-relerenced. lVhen scores are norrn- ex¡ensive response are acruelly applied. One rype oÊ per[ormance assessment, for example, is rhe sranda¡dized job or work sample. A rask is presenred ro che rest taker in a standardized formar under standa¡dized conditions. Job or rvork samples might include, for example, rhe assessmenr of a practitionert abiliry to make an accurare diagnosis and recommend rreatment flor a defined condition, a manager's abiliry ro arriculate goals for an organization, or a srudenr's proficiency in performing e science laborarory experimenr. AJI rypes oÊirems require some indica¡ion of how to score the responses. For selecr- referenced, relarive score inrerprerations are oF primary inreresr. A score for an individual or ed-response irems, one alternadve is considered for a definable group is rankeci within one or more disrribucions of scores or compared to the average performance oF rest ukers for various reference popularions (e.g., based on age, grade, diagnosric caregor¡', or job classificatio n). lf hen sco res aÍe crirerion-reFerenced, absolure score interpretarions are oFprimary inreresr. The meaning of such scores does nor depend on rank informarion. Rather, rhe tesr score conveys direcrly a level ofcomperence. in some defined crirerion domain. Borh relarive and absolure inrerprecarions are of¡en used with a given resr, bur the resc developer decermines which approach is mosr relcvant For thar rest. The nature of the item and response For- In other testing progrerns, ùe ahernarives may be weighted differenciaiiy. For shorr-answer mars that may be specified depends on rhe purposes o[ the tesr and the defined domain o[ the tesr. Selecced-response Formars, such as mulriplc-choice irems, are suitable for many purposes of resring. The test specifications indicate how many alrernarives a¡e ¡o be used rhe correcr response in some resring programs. items, a list oIacceprable alternatives may suFfice; exrended-response irems need more detailed rules for scoring, somerimes callcd scoing rabics. 
Scoring rubrics specify the criteria for evaluating performance and may vary in the degree of judgment entailed, in the number of score levels, and in other ways. It is common practice for test developers to provide scorers with examples of performances at each of the score levels to help clarify the criteria. For extended-response items, including performance tasks, two major types of scoring procedures are used: analytic and holistic. Both of the procedures require explicit performance criteria that reflect the test framework. However, the approaches differ in the degree of detail provided in the evaluation report. Under the analytic scoring procedure, each critical dimension of the performance criteria is judged independently, and separate scores are obtained for each of these dimensions in addition to an overall score. Under the holistic scoring procedure, the same performance criteria may implicitly be considered, but only one overall score is provided. Because the analytic procedure provides information on a number of critical dimensions, it potentially provides valuable information for diagnostic purposes and lends itself to evaluating strengths and weaknesses of test takers. In contrast, the holistic procedure may be preferable when an overall judgment is desired and when the skills being assessed are complex and highly interrelated. Regardless of the type of scoring procedure, designing the items and developing the scoring rubrics and procedures is an integrated process.

A participatory approach may be used in the design of items, scoring rubrics, and sometimes the scoring process itself. Many interested persons (e.g., practitioners, teachers) may be involved in developing items and scoring rubrics, and/or evaluating the subsequent performances. If a participatory approach is used, participants' knowledge about the domain being assessed and their ability to apply the scoring rubrics are of critical importance. Equally important, for those involved in developing tests and evaluating performances, is their familiarity with the nature of the population being tested. Relevant characteristics of the population being tested may include the typical range of expected skill levels, their familiarity with the response modes required of them, and the primary language they use.

The test developer usually assembles an item pool that consists of a larger set of items than what is required by the test specifications. This allows for the test developer to select a set of items for the test that meet the test specifications. The quality of the items is usually ascertained through item review procedures and pilot testing. Items are reviewed for content quality, clarity, and lack of ambiguity. Items sometimes are reviewed for sensitivity to gender or cultural issues. An attempt is generally made to avoid words and topics that may offend or otherwise disturb some test takers, if less offensive material is equally useful. Often, a field test is developed and administered to a group of test takers who are somewhat representative of the target population for the test. The field test helps determine some of the psychometric properties of the test items, such as an item's difficulty and ability to discriminate among test takers of different standing on the scale.
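A minimal sketch of the two field-test statistics just mentioned: classical item difficulty (the proportion correct) and item discrimination (here a corrected item-total point-biserial correlation). The response matrix and function name are hypothetical.

    import numpy as np

    def item_analysis(scores):
        """Classical difficulty (p) and discrimination for 0/1 item scores.

        Discrimination is the correlation of each item with the total of
        the remaining items (a corrected item-total point-biserial).
        """
        x = np.asarray(scores, dtype=float)
        p_values = x.mean(axis=0)
        discrimination = []
        for j in range(x.shape[1]):
            rest = np.delete(x, j, axis=1).sum(axis=1)
            discrimination.append(np.corrcoef(x[:, j], rest)[0, 1])
        return p_values, np.array(discrimination)

    # Hypothetical field-test responses: 8 examinees, 4 items
    responses = [[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1],
                 [1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
    p, d = item_analysis(responses)
    print(p, d)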
Ongoing testing programs often pretest items by inserting them into existing tests. Those items are not used in obtaining test scores of the test takers, but the item responses provide useful data for test development.

The next step in test development is to assemble items into a test or to identify an item pool for an adaptive test. The test developer is responsible for ensuring that the items selected for the test meet the requirements of the test specifications. Depending upon the purpose(s) of the test, relevant considerations in item selection may include the content quality and scope, the weighting of items and subdomains, and the appropriateness of the items selected for the intended population of test takers. Often test developers will specify the distribution of psychometric indices of the items to be included in the test. For example, the specified distribution of item difficulty indices for a selection test would differ from the distribution specified for a general achievement test. When psychometric indices of the items are estimated using item response theory (IRT), the fit of the model to the data is also evaluated. This is accomplished by evaluating the extent to which the assumptions underlying the item response model (e.g., unidimensionality, local independence, speededness, and equality of slope parameters) are satisfied.

The test developer is also responsible for ensuring that the scoring procedures are consistent with the purpose(s) of the test and facilitate meaningful score interpretation. The nature of the intended score interpretations will determine the importance of psychometric characteristics of items in the test construction process. For example, indices of item difficulty and discrimination, and inter-item correlations, may be particularly important when relative score interpretations are intended. In the case of relative score interpretations, good discrimination among test takers at all points along the construct continuum is desirable. It is important, however, that the test specifications are not compromised when optimizing the distribution of these indices. In the case of absolute score interpretations, different criteria apply. In this case, the extent to which the relevant domain has been adequately represented is important, even if many of the items are relatively easy or nondiscriminating within a relevant population. It is important, however, to assure the quality of the content of relatively easy or nondiscriminating items. If cut scores are necessary for score interpretation in criterion-referenced programs, the level of item discrimination constitutes critical information primarily in the vicinity of the cut scores. Because of these differences in test development procedures, tests designed to facilitate one type of interpretation function less effectively for other types of interpretation. Given appropriate test design and supporting evidence, however, scores arising from some norm-referenced programs may provide reasonable absolute score interpretations, and scores arising from some criterion-referenced programs may provide reasonable relative score interpretations.

When evaluating the quality of the items in the item pool and the test itself, test developers often conduct studies of differential item functioning (see chapter 7). Differential item functioning is said to exist when test takers of approximately equal ability on the targeted construct or content domain differ in their responses to an item according to their group membership. In theory, the ultimate goal of such studies is to identify construct-irrelevant aspects of item content, item format, or scoring criteria that may differentially affect test scores of one or more groups of test takers. When differential item functioning is detected, test developers try to identify plausible explanations for the differences, and then they may replace or revise items that give rise to group differences if construct irrelevance is deemed likely. However, at this time, there has been little progress in discerning the cause or substantive themes that account for differential item functioning on a group basis. Items for which the differential item functioning index is significant may constitute valid measures of an element of the intended domain and differ in no way from other items that show nonsignificant indexes. When the differential item functioning index is significant, the test developer must take care that any replacement items or item revisions do not compromise the test specifications.
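The text does not prescribe a particular DIF index. As an illustrative sketch, the function below computes the Mantel-Haenszel common odds ratio, one widely used DIF statistic for a dichotomous item, with examinees matched on a stratifying variable such as total score. The data, names, and strata are hypothetical.

    import numpy as np

    def mantel_haenszel_odds_ratio(item, group, total):
        """Mantel-Haenszel common odds ratio across matching strata.

        item: 0/1 item scores; group: 0 = reference, 1 = focal;
        total: matching variable (e.g., total score defining strata).
        Values near 1.0 suggest little DIF on this item.
        """
        item, group, total = map(np.asarray, (item, group, total))
        num = den = 0.0
        for s in np.unique(total):
            m = total == s
            n_s = m.sum()
            a = np.sum(m & (group == 0) & (item == 1))  # reference correct
            b = np.sum(m & (group == 0) & (item == 0))  # reference incorrect
            c = np.sum(m & (group == 1) & (item == 1))  # focal correct
            d = np.sum(m & (group == 1) & (item == 0))  # focal incorrect
            num += a * d / n_s
            den += b * c / n_s
        return num / den

    # Hypothetical data: two strata (total 5 and 8), 8 examinees per stratum
    item  = [1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0]
    group = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
    total = [5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8]
    print(round(mantel_haenszel_odds_ratio(item, group, total), 2))  # 3.0

As the passage cautions, a flagged item is not automatically defective; the index is a screening statistic, and judgments about construct irrelevance require substantive review.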
Given appropriare resr design and supporting evidence, howeve( scores arising from some norm-referenced programs may provide reasonable absolure score interprerarions and scores arising from some crirerion-referenced programs may provide reasonable relative score inrerprerarions. \ùØhen evalua¡ing the qualiry ol¡he irems in the irem pool and rhe resr irself, resc developers ofrcn conducr srudies of differenrial irem Funcrioning (see chaprer 7). Differenrial item funcrioning is said to exisr when rest rakers oÊapproximarely equal abiliry on the targeted construct or content domain diffe¡ in their responses to an irem according ro rheir group rnembership. In rheor¡ rhe ulrimare goal of such srudies is to identifr consrrucrirretevant æpecs of irem conrenr, irem Formar, ures of an elemenr of rhe intended domain and differ in no rvay from orher i¡ems ¡har show cant i ndexes. 1ùí¡hen rhe di fferenrial item frrnccioning index is significanr, the tcsr developer must mke câre rher any replacemenr items or item revisions do not compromise the resr specifi carions. nonsigni fi When mulriple Fo¡ms of a resr are prepared, rhe rest specificarions govern each ol the forms. Also, when an irem pool is developed [or a compurerized adaprive resr, rhe specifications refer borh ro rhe irem pool and ro che rules or proceciures by which rhe individual item sers are creared for each tesr taker. Some of the arrracciye Features of computerized adaptive tests, such as tailoring rhe difficulry level oF rhe irems ro rhe resr raker's abilit¡ place addirional consr¡aints on ¡he design ofsuch rests. In general, a large numbe¡ of items is needed for a computerized adaptive rest to ensure that each railored irem set mcers rhe requireme nrs oÊ the test specifi, ca¡ions. Fu¡¡her, tests oFrcn are developed in the context of larger sysrems or programs. Multiple item sers, for example, may be created for use wích diffe¡ent groups of resr takers or on different tesring dates. Lasr, when a short fo¡m ofa rest is prepared, the tesr specificacions of rhe original resr govern rhe shorr form. DiÊferences in rhe tesr specificarions and the psychomerric properries of rhe short form and the original resr will affecr rhe inrer, pretation of rhe scores derived f¡om rhe shorr 40 AERA APA NCME OOOOO5O PAßT I / TEST DEVELOPMENT AND REVISION form. In any ofthese cases, the same fr¡nda- iations of icem scores wich productiviry meas- ofcurrent sales personnel or e measure of mental methods and principles of test development apply. ures Special Considerations ¡n Test DeveloPment wirh cuscomer loyalry. Similarl¡ an inventory ro help idenrifr different patterns ofpsychopatholory might be developed using patienrs from different diagnostic subgroups, Vhen tesr development relies on a data-based approach, it is likely that some items will be seleced bæed on chance occusenc€s in the dat¿ Cross-'¡¿lidarion studies are routinely conducted to determine rhe tendenry to select items by chance, which involves administering the rest to a This seccion elabora¡es on several topics discussed above. Fìrsr, considera¡ions in delinearing the lramework for t-he test are discussed. Follorving this, considerations io the develop- ment of perlorma¡ce âssessmen$ and porrfolios are addressed. Oelineating the Framework for the Test The scenario presented above ourlines whar is often done to develop a rest. However, rhe activiries do not dways happen in a rigid sequence. 
There is often a subtle interplay between the process of conceptualizing a construct or content domain and the development of a test of that construct or domain. The framework for the test provides a description of how the construct or domain will be represented. The procedures used to develop items and scoring rubrics and to examine item characteristics may often contribute to clarifying the framework. The extent to which the framework is defined a priori is dependent on the testing application. In many testing applications, a well-defined framework and detailed test specifications guide the development of items and their associated scoring rubrics and procedures.

In some areas of psychological measurement, test development may be less dependent on an a priori defined framework and may rely more on a data-based approach that results in an empirically derived definition of the framework. In such instances, items are selected primarily on the basis of their empirical relationship with an external criterion, their relationships with one another, or their power to discriminate among groups of individuals. For example, construction of a selection test for sales personnel might be guided by the correlations of item scores with productivity measures of current sales personnel, or a measure of client satisfaction might be assembled from those items in an item pool that correlate most highly with customer loyalty. Similarly, an inventory to help identify different patterns of psychopathology might be developed using patients from different diagnostic subgroups. When test development relies on a data-based approach, it is likely that some items will be selected based on chance occurrences in the data. Cross-validation studies, which involve administering the test to a comparable sample, are routinely conducted to determine the tendency to select items by chance.

In many testing applications, the framework for the test is specified initially, and this specification subsequently guides the development of items and scoring procedures. Empirical relationships may then be used to inform decisions about retaining, rejecting, or modifying items. Interpretations of scores from tests developed by this process have the advantage of a logical/theoretical and an empirical foundation for the underlying dimensions represented by the test.
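A minimal sketch of the data-based selection and cross-validation idea just described: items are picked by their correlation with an external criterion in one sample, and the selection is then checked in a comparable independent sample. The data are simulated pure noise so that any selected items are chance selections; the threshold, sample sizes, and names are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical: 40 candidate items, pure noise, 100 examinees per sample
    n_items, n_obs = 40, 100
    dev_items = rng.integers(0, 2, size=(n_obs, n_items))
    dev_criterion = rng.normal(size=n_obs)

    # Select items whose criterion correlation clears a threshold (0.1 here)
    r_dev = np.array([np.corrcoef(dev_items[:, j], dev_criterion)[0, 1]
                      for j in range(n_items)])
    selected = np.where(np.abs(r_dev) > 0.1)[0]

    # Cross-validate: score the selected item set in an independent sample
    val_items = rng.integers(0, 2, size=(n_obs, n_items))
    val_criterion = rng.normal(size=n_obs)
    val_scores = val_items[:, selected].sum(axis=1)
    r_val = np.corrcoef(val_scores, val_criterion)[0, 1]

    print(len(selected), "items selected by chance; validation r =", round(r_val, 2))

Because the items are noise, the validation correlation will hover near zero even though a number of items cleared the development threshold, which is exactly the shrinkage that cross-validation is meant to expose.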
Fu¡ther, borh logical and empirical specifications may s(are rhat rhe resr rakeç rhe evidence are imporcant ro document rhe extent to rvhich perFormance assessmenfs-tasks as o<aminer, o¡ borh pania working togerher should scoring criteria-reflecr rhe processes or skills rhar are specified by rhe domain definirion- lVhen rasks are designed to elici¡ complex cognirive processes, logical analyses of ¡he ¡asks and bo¡h logical and empirical analyses of the resr rakers' performences on the rasks provide necessary validiry evidcnce. ponflolio. The parricular responsibili ries ol each parry are delineated in the specificarions. The well as be involved in rjre sclec¡ion of ¿he con¡en¿s of ¡he more standardized rhe conrenrs and procedures oFadminisrration, rhe easier ir is ro esrablish comparabiliry of porrfolio,based scores. Regardless of che meúrods ued, all performance assessrnenß are evaluared by the same s¡andards oF technical qualiry as orher forms of tesrs. Ponrrouos A, i unique cype of performance assessmenr is an ndividual porrfol io. PorrFol ios are systemaric collecrions oIwork or educarional products rypicalty collected over rime. Like orher assess."nr procedrlres, the design of portfolios is dependent on rhe purpose. Typical purposes include judgmenr o[ the improvement in job o¡ educarional perFormance and evaluarion of the cligibiliry for employmenr, promorion, or graduation. A wel l-designed ponfolio specifi es the na¡ure of úre wo¡k rhar is to be pur into the porrfolìo. The ponfolio may include enrries such as represenrarive produ.m, the besr work oFrhe tesr teker, or indicators of orogress. For example, in an employmenr serring involving promoriont employees may be inscructed ro include rhei¡ best work or products. Alternativel¡ ifthe purpose is ro judge a student's educarional growth, studenrs may be asked to provide evidence ol improvemenr rvith respecc to particular comperencies or skills. They may also be requesred ro p¡ov.ide jusrificarions for ùre choices. Stül other methods may include the use of videoapes, qhi- birions, demonstr¿tions, simulacions, and so on. In employment senings, employees may be involved in the selecrion of rheir work and prod- Test Revisions Tess and their supporring documents (e.g., rest manuals, technical manuals, user's guides) are reviewed periodically ro derermine wherher ¡cvisions a¡e nceded. P.evisions or amendmen¡s are necessary lvhen nerv resea¡ch dau, significant changes in the domain, or new condirions oF test use and interprerarion would eirhe¡ improve the validiry o[ inrerprenrions of rhe resr scores or suggert that the rest is no longer fully appropriate for is inrended use. As an example, ress are ¡evised ifthe test conrenr or language has be- come ourdated and, rhereforc, may subsequenrly affect the validiry oF rhe rest score interprerarions. Revisions to resr conrenr are also made to ensure ¡}re confidentialiry of dre tesr. k should be noted, however, that outdared norms may nor have che same implicadors for revisioru as an ourda¡ed resc For o<ample, it may be necessary co updare the norms for an achieverncn¡ resc a-frer a period of rising or falling achievemenr in the norming .popularion, o¡ when rhere are changes in rhe test-taking population, but rhe resr conrenr irself may continue ro be as relevant as it was when rhe tesr was developed. 42 AERA_APA_NCME-OOOOO52 PART I / TEST DEVETOPMENT AND REVISION Standard 3.1 Tests and testing programs should be developed on a sound scientific basis. 
Test devel- opers and publishers should compile arid document adequate evidence bgari4g o4 test development. Standard 3.2 The purpose(s) of the test, definition of the domain, and the test specifications should be stated clearly so that judgments can be made about the appropriateness of the defined domain fot the stated purpose(s) of the test and about the ¡elation of items to the dimensions of the domain they are intended to represent. Comment: The adequacy and uselulness of test interprererions depend on rhe rigor wirh which rhe purposes of the test and rhe domain represented by the rest have been defined and STAhIDARDS] Commmt: Professional judgment plap a major role in developing the resc spccificarions. The specific proccdures used for developing the specifications depend on the purposes oF the test. For example, in developing licensure and cenificarion tesr, practice anal¡'ses or job analyses usually provide rhe basis for defining rhe rest specificarions, and job analyses primarily resr. For achievement tests to be given ar the end ofa course, the test specifìcations should be bæed sewe this frrncrion For cmployment on an outline ofcourse conrenr and goals. for placemenr tess, ir may be necessary to examine rhe required entry knowledge and skills for several cou¡ses. 'Sühereas, Standard 3.4 The procedures used to inte¡pret test scores, and, when appropriate, the no¡mative or sanda¡dization samples or the criterion used should be documented. explicated. The domain definicion should be Òmment: Test specifications may indicate chat sufficiently detailed and delimired to show clearly what dimensions of knowledge, skill, processes, attirude, values, emotions, or ùe inrended behavior are included and whar dimensions A dear description witl enha¡ce accurate judgments by reviewers and orhers about the congruence ofthe defined domain and the resr irems. a¡e excluded. The test specifications should be documented, along with their rationale and the procrss by which they were developed. The rest specifications should define the conrenr of the test, the proposed number of items, dre item formats, the desired psychometric properties of the items, and the item and section arrangement. They should also speciþ the amounr oF time flor tesring directions to the test takers, procedures to be used for test adminisrrarion a¡d scoring, a¡d orher reler"anr informadon- ñr absolute others in one or more defined popularions. In absolute score interpretarions, the score or average is assumed to reflecr directly a level of com- petence Standard 3.3 score inrerpreations are or teladve score interpreations, or borh. In relative score inrerpretadons the stans ofan individud (or group) is dete¡mined by comparing rhe sco¡e (or mean score) to the yrformanæ oF or masrery in some defined criterion desþed to facilirare one rype oF domain. Tess interpretation frrnction less effectively for orher q¡pcs of interprearions, Given appropriate tcst design and adequare supporring dam, however, scores arising from norm-reFerenced tesring pro- grams mey provide reasonable absolure score intcrprecations and scores arising from criterionrelerenced programs may provide reasonable relative scorc interpreracions. Standard 3.5 tùl'hen appropriate, relevant experts external to the testing program should review the test speciÊcations, The purpose of the rwiew, rhe 43 AE RA_APA_N CM E-O O OO 05 3 ln'rn o¡ntalfH¡ üsù R ñññ I \ | IåIAII tÉ uEv ! 
v tf Étusa ---i---- :- -^-J--^--) --^-^^- L-- --.L:-L -L^ rEv¡cw ,5 Lu¡luuLrEu, PruLc55 U/ W¡rrL¡¡ f¡rc end the ¡esulcs of the review shou.ld be documented. The qualificarions, relevant experiences, and demographic characreristics of expert judges should also be. documenled, Comment: Experr review of ¡he rest specifications may serve many useful purposes such as helping ro âssure contenr qualiry and representativeness. The experr judges may include individuals representing defined populations ofconcern to the tesc specifications. Fo¡ example, if rhe rest is related ro crhnic minoriry concerns, rhe experr review r¡pically includes members o[ appropriare echnic minoriry groups or experrs on minoriry group issues. Standard 3,ô The type of items, the response formats, scoring procedures, and test administration procedures should be selested based on the purposes of the test, the domain to be measured, and úre intended test takem. To rle exrent possible, test content shou-ld be chosen to ensure that intended inferences Êom test sco¡es are equally rralid for membe¡s of different groups of resr økers. The test rev¡ew process should include empirical analyses and, when appropriate, the use of expert judges to review items and rerponse formars. The qualifications, relevant experiences, and demographic cha¡acterisdcs oFexpen judges shouJd also be documented. 1^----.. F*^^,, ;,,¡^,. -",, h, -"L-,.1 .^ ;t^^ mare rial likely ro be inappropriare, confusing, or ofFensive For groups in the tesr-taking population, For example, judges may be asked ro identiÇ rvherher lack of e.xposure ro problem conrex$ in mathematics word problems may be of concern for some groups of srudents. Various groups of tesr rakers can be defined by characterisrics such as age, erhniciry, culture, gender, disabilir¡ or demographic region. tify fËST OEVELOPMENT AiII} REVISIOi¡ / PART ¡ illrtllU¿llU J.l The procedures used to develop, review, and try out items, and to select irems from the itenr pool should be documenred. If the items were classified into different categories or subtests according ro the test specifications, the procedures used for the classifica.:^- ^-J.L- -^^-^^-:^.^^^-- ^^) dLLs( dLt of the classification should be documented. Comment: Empirical evidence and/or experr judgment are used to classifr irems according to caregorics oF the ¡esr specificarions, For example, proFessional panels may be used For classiSing the i¡ems or for dcrermining the appropriareness of the developer's cìassification scheme. The paneì and procedures used should be chosen with care as they will afFect rhe accurary oF the classifìcation. Standard 3.8 'When item tryouts or field tests are conducted, the procedures used to select the sample(s) of test takers for item tq/ouß ùd the resulting cha¡acterisrícs oi rhe sample(s) should be documented. When appropriate, the sample(s) should be er representatiye es possible of the population(s) for which the test is intended. Comment: Conditions rvhich may differenrially affect performance on the tesr irems by the sample(s) as compared to che incended population(s) should be documented when approprìate. As an example, res¡ rakers may be less motivated when they know rheir sco¡es will not have an impact on rhem. 
Comment: Although overall sample size is important, it is important also that there be an adequate number of cases in regions critical to the determination of the psychometric properties of items. If the test is to achieve greatest precision in a particular part of the score scale and this consideration affects item selection, the manner in which item statistics are used needs to be carefully described. When IRT is used as the basis of test development, it is important to document the adequacy of fit of the model to the data. This is accomplished by providing information about the extent to which IRT assumptions (e.g., unidimensionality, local item independence, or equality of slope parameters) are satisfied. Test developers should show that any differences between the administration conditions of the field test and the final form do not affect item performance. Conditions that can affect item statistics include item position, time limits, length of test, mode of testing (e.g., paper-and-pencil versus computer-administered), and use of calculators or other tools. For example, in field testing items, those placed at the end of a test might obtain poorer item statistics than those inserted within the test.

Standard 3.10
Test developers should conduct cross-validation studies when items are selected primarily on the basis of empirical relationships rather than on the basis of content or theoretical considerations. The extent to which the different studies identify the same item set should be documented.

Comment: When data-based approaches to test development are used, items are selected primarily on the basis of their empirical relationships with an external criterion, their relationships with one another, or their power to discriminate among groups of individuals. Under these circumstances, it is likely that some items will be selected based on chance occurrences in the data used. Administering the test to a comparable sample of test takers or a hold-out sample provides a means by which the tendency to select items by chance can be determined.

Standard 3.11
Test developers should document the extent to which the content domain of a test represents the defined domain and test specifications.

Comment: Test developers should provide evidence of the extent to which the test items and scoring criteria represent the defined domain. This affords a basis to help determine whether performance on the test can be generalized to the domain that is being assessed. This is especially important for tests that contain a small number of items, such as performance assessments. Such evidence may be provided by expert judges.

Standard 3.12
The rationale and supporting evidence for computerized adaptive tests should be documented. This documentation should include procedures used in selecting subsets of items for administration, in determining the starting point and termination conditions for the test, in scoring the test, and for controlling item exposure.

Comment: It is important to assure that documentation of the procedures does not compromise the security of the test items. If a computerized adaptive test is intended to measure a number of different content subcategories, item selection procedures are to assure that the subcategories are adequately represented by the items presented to the test taker.
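As an illustrative sketch of the adaptive item selection that Standard 3.12 asks developers to document, the code below picks, at each step, the unadministered item with the greatest Fisher information at the current ability estimate under a 2PL model. The item bank, the fixed-length stopping rule, and the crude stand-in ability update are hypothetical simplifications; an operational program would also impose content and exposure constraints.

    import math

    # Hypothetical 2PL item bank: (discrimination a, difficulty b)
    bank = [(1.2, -1.0), (0.8, -0.3), (1.5, 0.0), (1.0, 0.7), (1.3, 1.4)]

    def info(a, b, theta):
        """Fisher information of a 2PL item at ability theta."""
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        return a * a * p * (1.0 - p)

    def select_next(theta, administered):
        """Maximum-information selection from the remaining items."""
        remaining = [i for i in range(len(bank)) if i not in administered]
        return max(remaining, key=lambda i: info(*bank[i], theta))

    theta, administered = 0.0, []
    for _ in range(3):              # fixed-length stopping rule (3 items)
        i = select_next(theta, administered)
        administered.append(i)
        theta += 0.5                # stand-in for a real ability update
    print(administered)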
Standard 3.10
Test developers should conduct cross-validation studies when items are selected primarily on the basis of empirical relationships rather than on the basis of content or theoretical considerations. The extent to which the different studies identify the same item set should be documented.

Comment: When data-based approaches to test development are used, items are selected primarily on the basis of their empirical relationships with an external criterion, their relationships with one another, or their power to discriminate among groups of individuals. Under these circumstances, it is likely that some items will be selected based on chance occurrences in the data used. Administering the test to a comparable sample of test takers or a hold-out sample provides a means by which the tendency to select items by chance can be determined.

Standard 3.11
Test developers should document the extent to which the content domain of a test represents the defined domain and test specifications.

Comment: Test developers should provide evidence of the extent to which the test items and scoring criteria represent the defined domain. This affords a basis to help determine whether performance on the test can be generalized to the domain that is being assessed. This is especially important for tests that contain a small number of items, such as performance assessments. Such evidence may be provided by expert judges.

Standard 3.12
The rationale and supporting evidence for computerized adaptive tests should be documented. This documentation should include procedures used in selecting subsets of items for administration, in determining the starting point and termination conditions for the test, in scoring the test, and for controlling item exposure.

Comment: It is important to assure that documentation of the procedures does not compromise the security of the test items. If a computerized adaptive test is intended to measure a number of different content subcategories, item selection procedures are to assure that the subcategories are adequately represented by the items presented to the test taker.

Standard 3.13
When a test score is derived from the differential weighting of items, the test developer should document the rationale and process used to develop, review, and assign item weights. When the item weights are obtained based on empirical data, the sample used for obtaining item weights should be sufficiently large and representative of the population for which the test is intended. When the item weights are obtained based on expert judgment, the qualifications of the judges should be documented.

Comment: Changes in the population of test takers, along with other changes, such as changes in instructions, training, or job requirements, may impact the originally derived item weights, necessitating subsequent studies after an appropriate period of time.

Standard 3.14
The criteria used for scoring test takers' performance on extended-response items should be documented. This documentation is especially important for performance assessments, such as scorable portfolios and essays, where the criteria for scoring may not be obvious to the user.

Comment: The completeness and clarity of the test specifications, including the definition of the domain, are essential in developing the scoring criteria. The test developer needs to provide a clear description of how the test scores are intended to be interpreted to help ensure the appropriateness of the scoring procedures.

Standard 3.15
When using a standardized testing format to collect structured behavior samples, the domain, test design, test specifications, and materials should be documented as for any other test. Such documentation should include a clear definition of the behavior expected of the test takers, the nature of the expected responses, and any materials or directions that are necessary to carry out the testing.

Comment: In developing the test, the age, language, experience, and ability level of test takers should be considered, as should other possible unique sources of difficulty for groups in the population to be tested. Test directions that specify time allowances, the nature of the responses expected, and rules regarding use of supplementary materials, such as notes, references, dictionaries, calculators, or manipulatives such as lab equipment, may be established via field testing.

Standard 3.16
If a short form of a test is prepared, for example, by reducing the number of items on the original test or organizing portions of a test into a separate form, the specifications of the short form should be as similar as possible to those of the original test. The procedures used for the reduction of items should be documented.

Comment: The extent to which the specifications of the short form differ from those of the original test, and the implications of such differences for interpreting the scores derived from the short form, should be documented.
Standard 3.17
When previous research indicates that irrelevant variance could confound the domain definition underlying the test, then to the extent feasible, the test developer should investigate sources of irrelevant variance. Where possible, such sources of irrelevant variance should be removed or reduced by the test developer.

Standard 3.18
For tests that have time limits, test development research should examine the degree to which scores include a speed component and evaluate the appropriateness of that component, given the domain the test is designed to measure.

Standard 3.19
The directions for test administration should be presented with sufficient clarity and emphasis so that it is possible for others to replicate adequately the administration conditions under which the data on reliability and validity, and, where appropriate, norms were obtained.

Comment: Because all people administering tests, including those in schools, industry, and clinics, need to follow test administration conditions carefully, it is essential that test administrators receive detailed instructions on test administration guidelines and procedures.

Standard 3.20
The instructions presented to test takers should contain sufficient detail so that test takers can respond to a task in the manner that the test developer intended. When appropriate, sample material, practice or sample questions, criteria for scoring, and a representative item identified with each major area in the test's classification or domain should be provided to the test takers prior to the administration of the test or included in the testing material as part of the standard administration instructions.

Comment: For example, in a personality inventory it may be intended that test takers give the first response that occurs to them. Such an expectation should be made clear in the inventory directions. As another example, in directions for interest or occupational inventories, it may be important to specify whether test takers are to mark the activities they would like ideally or whether they are to consider both their opportunity and their ability realistically. The extent and nature of practice materials and directions depend on expected levels of knowledge among test takers. For example, in using a novel test format, it may be very important to provide the test taker a practice opportunity as part of the test administration. In some testing situations, it may be important for the instructions to address such matters as the effects that guessing and time limits have on test scores. If expansion or elaboration of the test instructions is permitted, the conditions under which this may be done should be stated clearly in the form of general rules and by giving representative examples. If no expansion or elaboration is to be permitted, this should be stated explicitly. Publishers should include guidance for dealing with typical questions from test takers.
Users should be instructed how to deal with questions that may arise during the testing period.

Standard 3.21
If the test developer indicates that the conditions of administration are permitted to vary from one test taker or group to another, permissible variation in conditions for administration should be identified, and a rationale for permitting the different conditions should be documented.

Comment: In deciding whether the conditions of administration can vary, the test developer needs to consider and study the potential effects of varying conditions of administration. If conditions of administration vary from the conditions studied by the test developer or from those used in the development of norms, the comparability of the test scores may be weakened and the applicability of the norms can be questioned.

Standard 3.22
Procedures for scoring and, if relevant, scoring criteria should be presented by the test developer in sufficient detail and clarity to maximize the accuracy of scoring. Instructions for using rating scales or for deriving scores obtained by coding, scaling, or classifying constructed responses should be clear. This is especially critical if tests can be scored locally.

Standard 3.23
The process for selecting, training, and qualifying scorers should be documented by the test developer. The training materials, such as the scoring rubrics and examples of test takers' responses that illustrate the levels on the score scale, and the procedures for training scorers should result in a degree of agreement among scorers that allows for the scores to be interpreted as originally intended by the test developer. Scorer reliability and potential drift over time in raters' scoring standards should be evaluated and reported by the person(s) responsible for conducting the training session.

Standard 3.24
When scoring is done locally and requires scorer judgment, the test user is responsible for providing adequate training and instruction to the scorers and for examining scorer agreement and accuracy. The test developer should document the expected level of scorer agreement and accuracy.

Comment: A common practice of test developers is to provide examples of training materials (e.g., scoring rubrics, test takers' responses at each score level) and procedures when scoring is done locally and requires scorer judgment.
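The scorer agreement that Standards 3.23 and 3.24 ask to be examined and documented is often summarized with an exact-agreement rate and a chance-corrected index. The Python sketch below is illustrative only: the rubric scores are hypothetical, and Cohen's kappa is one common choice of index rather than anything the Standards prescribe.

    from collections import Counter

    def exact_agreement(r1, r2):
        """Proportion of responses to which two scorers assign the same score."""
        return sum(a == b for a, b in zip(r1, r2)) / len(r1)

    def cohens_kappa(r1, r2):
        """Cohen's kappa: agreement between two scorers corrected for the
        agreement expected by chance alone."""
        n = len(r1)
        po = exact_agreement(r1, r2)
        c1, c2 = Counter(r1), Counter(r2)
        pe = sum(c1[k] * c2[k] for k in set(r1) | set(r2)) / (n * n)
        return (po - pe) / (1 - pe)

    # Hypothetical scores on a 0-4 rubric assigned by two trained scorers.
    scorer_a = [3, 2, 4, 1, 3, 2, 0, 4, 3, 2]
    scorer_b = [3, 2, 3, 1, 3, 2, 1, 4, 3, 3]
    print("exact agreement:", exact_agreement(scorer_a, scorer_b))   # 0.7
    print("kappa:", round(cohens_kappa(scorer_a, scorer_b), 3))      # 0.6

Recomputing such indices periodically on common papers is one way to detect the scorer drift the standard mentions.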
Standard 3.25
A test should be amended or revised when new research data, significant changes in the domain represented, or newly recommended conditions of test use may lower the validity of test score interpretations. Although a test that remains useful need not be withdrawn or revised simply because of the passage of time, test developers and test publishers are responsible for monitoring changing conditions and for amending, revising, or withdrawing the test as indicated.

Comment: Test developers need to consider a number of factors that may warrant the revision of a test, including outdated test content and language. If an older version of a test is used when a newer version has been published or made available, test users are responsible for providing evidence that the older version is as appropriate as the new version for that particular test use.

Standard 3.26
Tests should be labeled or advertised as "revised" only when they have been revised in significant ways. A phrase such as "with minor modification" should be used when the test has been modified in minor ways. The score scale should be adjusted to account for these modifications, and users should be informed of the adjustments made to the score scale.

Comment: It is the test developer's responsibility to determine whether revisions to a test would influence test score interpretations. If test score interpretations would be affected by the revisions, it would then be appropriate to label the test "revised." When tests are revised, the nature of the revisions and their implications for test score interpretations should be documented.

Standard 3.27
If a test or part of a test is intended for research use only and is not distributed for operational use, statements to this effect should be displayed prominently on all relevant test administration and interpretation materials that are provided to the test user.

Comment: This standard refers to tests that are intended for research use only and does not refer to standard test development functions that occur prior to the operational use of a test (e.g., field testing).

4. SCALES, NORMS, AND SCORE COMPARABILITY

Background

Test scores are reported on scales designed to assist score interpretation. Typically, scoring begins with responses to separate test items, which are often coded using 0 or 1 to represent wrong/right or negative/positive, but sometimes using numerical values to indicate finer response gradations. Then the item scores are combined, often by addition but sometimes by a more elaborate procedure, to obtain a raw score. Raw scores are determined, in part, by features of a test such as test length, choice of time limit, item difficulties, and the circumstances under which the test is administered. This makes raw scores difficult to interpret in the absence of further information. Interpretation and statistical analyses may be facilitated by converting raw scores into an entirely different set of values called derived scores or scale scores. The various scales used for reporting scores on college admissions tests, the standard scores often used to report results for intelligence scales or vocational interest and personality inventories, and the grade equivalents reported for achievement tests in the elementary grades are examples of scale scores. The process of developing such a score scale is called scaling a test. Scale scores may aid interpretation by indicating how a given score compares to those of other test takers, by enhancing the comparability of scores obtained using different forms of a test, or in other ways.

Another way of assisting score interpretation is to establish standards or cut scores that distinguish different score ranges. In some cases, a single cut score may define the boundary between passing and failing. In other cases, a series of cut scores may define distinct proficiency levels. Cut scores may be established for either raw or scale scores. Both scale scores and standards or cut scores can be central to the use and interpretation of test scores. For that reason, their defensibility is an important consideration in test validation. There is a close connection between standards or cut scores and certain scale scores. If the successive score ranges defined by a series of cut scores are relabeled, say 0, 1, 2, and so on, then a scale score has been created.
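As an illustration of the scaling process just described, the Python sketch below sums 0/1 item scores to raw scores and applies one simple kind of derived-score conversion, a linear transformation to a standard-score scale. The response data are hypothetical, and the scale mean of 500 and standard deviation of 100 are arbitrary illustrative choices, not values the Standards endorse.

    from statistics import mean, pstdev

    def raw_scores(item_matrix):
        """Sum 0/1 item scores to a raw score for each examinee."""
        return [sum(row) for row in item_matrix]

    def to_scale(raw, ref_mean, ref_sd, scale_mean=500, scale_sd=100):
        """Linear derived score: fixes the reference-group mean at 500
        and standard deviation at 100 (illustrative values only)."""
        return scale_mean + scale_sd * (raw - ref_mean) / ref_sd

    responses = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1]]  # hypothetical
    raws = raw_scores(responses)
    m, s = mean(raws), pstdev(raws)
    print([round(to_scale(r, m, s)) for r in raws])

Operational score scales are often built with more elaborate, nonlinear procedures, but the principle of re-expressing raw scores in reference-group units is the same.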
In addition to facilitating interpretations of a single test form considered in isolation, scale scores are often created to enhance comparability across different forms of the same test, across test formats or administration conditions, or even across tests designed to measure different constructs (e.g., related subtests in a battery). Equated scores from alternate forms of a test can often be interpreted more easily when expressed in scale score units rather than raw score units. Scaling may be used to place scores from different levels of an achievement test on a continuous scale and thereby facilitate inferences about growth or development. Scaling can also enhance the comparability of scores derived from tests in different areas, as in subtests within an aptitude, interest, or achievement battery.

Norm-Referenced and Criterion-Referenced Score Interpretations

Individual raw scores or scale scores are often referred to the distribution of scores for one or more comparison groups to draw useful inferences about an individual's performance. Test score interpretations based on such comparisons are said to be norm-referenced. Percentile rank norms, for example, indicate the standing of an individual or group within a defined population of individuals or groups. An example of such a comparison group might be fourth-grade students in the United States, tested in the last two months of a recent school year. Percentiles, averages, or other statistics for such reference groups are called norms. By showing how the test score of a given examinee compares to those of others, norms assist in the classification or description of examinees.
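A percentile rank of the kind described here can be computed directly from a norm group's score distribution. The Python sketch below is illustrative only: the norm-group scores are hypothetical, and the midpoint convention for tied scores is one of several conventions in use.

    from bisect import bisect_left, bisect_right

    def percentile_rank(score, norm_group):
        """Percentile rank of a score within a sorted norm group: percentage
        of the reference distribution below the score, counting half of any
        scores tied with it (midpoint convention)."""
        below = bisect_left(norm_group, score)
        ties = bisect_right(norm_group, score) - below
        return 100.0 * (below + 0.5 * ties) / len(norm_group)

    # Hypothetical norm-group raw scores, sorted once for reuse.
    norms = sorted([12, 15, 15, 18, 20, 21, 21, 21, 24, 27])
    print(percentile_rank(21, norms))  # 65.0: standing within the norm group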
Other test score interpretations make no direct reference to the performance of other examinees. These interpretations may take a variety of forms; most are collectively referred to as criterion-referenced interpretations. Derived scores supporting such interpretations may indicate the likely proportion of correct responses on some larger domain of items, or the probability of an examinee's answering particular sorts of items correctly. Other criterion-referenced interpretations may indicate the likelihood that some psychopathology is present. Still other criterion-referenced interpretations indicate the probability that an examinee's level of tested knowledge or skill is adequate to perform successfully in some other setting; such probabilities may be summarized in an expectancy table. Scale scores to support such criterion-referenced score interpretations are often developed on the basis of statistical analyses of the relationships of test scores to other variables.

Some scale scores are developed primarily to support norm-referenced interpretations and others, criterion-referenced interpretations. In practice, however, there is not always a sharp distinction. Both criterion-referenced and norm-referenced scales may be developed and used for the same test scores. Moreover, a norm-referenced score scale originally developed, for example, to indicate performance relative to some specific reference population might, over time, also come to support criterion-referenced interpretations. This could happen as research and experience brought increased understanding of the capabilities implied by different scale score levels. Conversely, results of an educational assessment might be reported on a scale consisting of several ordered proficiency levels, defined by descriptions of the kinds of tasks students at each level were able to perform. That would be a criterion-referenced scale, but once the distribution of scores over levels was reported, say, for all eighth-grade students in a given state, individual students' scores would also convey information about their standing relative to that tested population.

Interpretations based on cut scores may likewise be either criterion-referenced or norm-referenced. If qualitatively different descriptions are attached to successive score ranges, a criterion-referenced interpretation is supported. For example, the descriptions of performance levels in some assessment task scoring rubrics can enhance score interpretation by summarizing the capabilities that must be demonstrated to merit a given score. In other cases, criterion-referenced interpretations may be based on empirically determined relationships between test scores and other variables. But when tests are used for selection, it may be appropriate to rank-order examinees according to their test performance and establish a cut score so as to select a prespecified number or proportion of examinees from one end of the distribution, if the selection use is otherwise supported by relevant reliability and validity evidence. In such cases, the cut score interpretation is norm-referenced; the labels reject or fail versus accept or pass are determined solely by an examinee's standing relative to others tested.

Criterion-referenced interpretations based on cut scores are sometimes criticized on the grounds that there is very rarely a sharp distinction of any kind between those just below versus just above a cut score. A neuropsychological test may be helpful in diagnosing some particular impairment, for example, but the probability that the impairment is present is likely to increase continuously as a function of the test score. Cut scores may nonetheless aid in formulating rules for reaching decisions on the basis of test performance. It should be recognized, however, that the probability of misclassification will generally be relatively high for persons with scores close to the cut points.

Norms

The validity of norm-referenced interpretations depends in part on the appropriateness of the reference group to which test scores are compared. Norms based on hospitalized patients, for example, might be inappropriate for some interpretations of nonhospitalized patients' scores. Thus, it is important that reference populations be carefully defined and clearly described. Validity of such interpretations also depends on the accuracy with which norms summarize the performance of the reference population. That population may be small enough that essentially the entire population can be tested (e.g., all pupils at a given grade level in a given district tested on the same occasion). Often, however, only a sample of examinees from the reference population is tested. It is then important that the norms be based on a technically sound, representative, scientific sample of sufficient size. Patients in a few hospitals in a small geographic region are unlikely to be representative of all patients in the United States, for example.
Moreover, the appropriateness of norms based on a given sample may diminish over time. Thus, for tests that have been in use for a number of years, periodic review is generally required to assure the continued utility of norms. Renorming may be required to maintain the validity of norm-referenced test score interpretations.

More than one reference population may be appropriate for the same test. For example, achievement test performance might be interpreted by reference to local norms based on sampling from a particular school district, norms for a state or type of community, or national norms. For other tests, norms might be based on occupational or educational classifications. Descriptive statistics for all examinees who happen to be tested during a given period of time (sometimes called user norms or program norms) may be useful for some purposes, such as describing trends over time. But there must be sound reason to regard that group of test takers as an appropriate basis for such inferences. When there is a suitable rationale for using such a group, the descriptive statistics should be clearly characterized as being based on a sample of persons routinely tested as part of an ongoing program.

Comparability and Equating

Many test uses involve different versions of the same test, which yield scores that can be used interchangeably even though they are based on different sets of items. In testing programs that offer a choice of examination dates, for example, test security may be compromised if the same form is used repeatedly. Other testing applications may entail repeated measurement of the same individuals, perhaps to measure change in levels of psychological dysfunction, change in attitudes, or educational progress. In such contexts, reuse of the same set of test items may result in correlated errors of measurement and biased estimates of change. When distinct forms of a test are constructed to the same explicit content and statistical specifications and administered under identical conditions, they are referred to as alternate forms, or sometimes parallel or equivalent forms. The process of placing scores from such alternate forms on a common scale is called equating. Equating is analogous to the calibration of different balances so that they all indicate the same weight for any given object. However, the equating process for test scores is more complex. It involves small statistical adjustments to account for minor differences in the difficulty and statistical properties of the alternate forms. In theory, equating should provide accurate score conversions for any set of persons drawn from the examinee population for which the test is designed. Furthermore, the same score conversion should be appropriate regardless of the score interpretation or use intended. It is not possible to construct conversions with these ideal properties between scores on tests that measure different constructs; that differ materially in difficulty, reliability, time limits, or other conditions of administration; or that are designed to different specifications.

There is another assessment approach that may provide interchangeable scores based on responses to different items using different methods, not referred to as equating. This is the use of adaptive tests. It has long been recognized that little is learned from examinees' responses to items that are much too easy or much too difficult for them.
Consequently, some testing procedures use only a subset of the available items with each examinee in order to avoid boredom or frustration, or to shorten testing time. An adaptive test consists of a pool of items together with rules for selecting a subset of those items to be administered to an individual examinee, and a procedure for placing different examinees' scores on a common scale. The selection of successive items is based in part on the examinee's responses to previous items. The item pool and item selection rules may be designed so that each examinee receives a representative set of items of appropriate difficulty. The selection rules generally assure that an acceptable degree of precision is attained before testing is terminated. At one time, such tailored testing was limited to certain individually administered psychological tests. With advances in item response theory (IRT) and in computer technology, however, adaptive testing is becoming more sophisticated. With some adaptive tests, it may happen that two examinees rarely if ever respond to precisely the same set of items. Moreover, two examinees taking the same adaptive test may be given sets of items that differ markedly in difficulty. Nevertheless, when certain statistical and content conditions are met, test scores produced by an adaptive testing system can function like scores from equated alternate forms.

Scaling to Achieve Comparability

The term equating is properly reserved only for score conversions derived for alternate forms of the same test. It is often useful, however, to compare scores from tests that cannot, in theory, be equated. For example, it may be desirable to interpret scores from a shortened (and hence less reliable) form of a test by first converting them to corresponding scores on the full-length version. For the evaluation of examinee growth over time, it may be desirable to develop scales that span a broad range of developmental or educational levels. Test revision often brings a need for some linkage between scores obtained using newer and older editions. International comparative studies or use with hearing-impaired examinees may require test forms in different languages. In still other cases, linkages or alignments may be created between tests measuring different constructs, perhaps comparing an aptitude with a form of behavior, or linking measures of achievement in several content areas. Scores from such tests may sometimes be aligned or presented in a concordance table to aid users in estimating relative performance on one test from performance on another. Score conversions to facilitate such comparisons may be described using terms like linkage, calibration, concordance, projection, moderation, or anchoring. These weaker score linkages may be technically sound and may fully satisfy desired goals of comparability for one purpose or for one subgroup of examinees, but they cannot be assumed to be stable over time or invariant across multiple subgroups of the examinee population, nor is there any assurance that scores obtained using different tests will be equally accurate. Thus, their use for other purposes or with other populations than originally intended may require additional research. For example, a score conversion that was accurate for a group of native speakers might systematically overpredict or underpredict the scores of a group of nonnative speakers.
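The small statistical adjustments involved in equating, described earlier in this section, can be illustrated with the simplest case: a linear (mean-sigma) conversion for randomly equivalent groups. The Python sketch below is a simplified illustration with hypothetical data, not an operational procedure; as the standards in this chapter make clear, operational equating must also document the study design, samples, and equating error.

    from statistics import mean, pstdev

    def linear_equate(y_scores, x_scores):
        """Mean-sigma linear equating for randomly equivalent groups:
        returns a function mapping a Form Y raw score onto the Form X
        scale by matching the two score means and standard deviations."""
        my, sy = mean(y_scores), pstdev(y_scores)
        mx, sx = mean(x_scores), pstdev(x_scores)
        return lambda y: mx + sx * (y - my) / sy

    # Hypothetical raw scores from two randomly equivalent groups.
    form_x = [30, 34, 35, 38, 40, 42, 45]
    form_y = [28, 31, 33, 35, 37, 40, 43]   # Form Y is slightly harder
    equate = linear_equate(form_y, form_x)
    print(round(equate(35), 1))  # a Form Y 35 expressed on the Form X scale

The same arithmetic underlies many of the weaker linkages named above; what distinguishes true equating is not the formula but the conditions the forms and samples must satisfy.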
Cut Scores

A critical step in the development and use of some tests is to establish one or more cut points dividing the score range to partition the distribution of scores into categories. These categories may be used just for descriptive purposes or may be used to distinguish among examinees for whom different programs are deemed desirable or different predictions are warranted. An employer may determine a cut score to screen potential employees or promote current employees; a school may use test scores to decide which of several alternative instructional programs would be most beneficial for a student; in granting a professional license, a state may specify a minimum passing score on a licensure test. These examples differ in important respects, but all involve delineating categories of examinees on the basis of test scores. Such cut scores embody the rules according to which tests are used or interpreted. Thus, in some situations the validity of test interpretations may hinge on the cut scores. There can be no single method for determining cut scores for all tests or for all purposes, nor can there be any single set of procedures for establishing their defensibility. These examples serve only as illustrations.

The first example, that of an employer hiring all those who earn scores above a given level on an employment test, is most straightforward. Assuming that the employment test is valid for its intended use, average job performance would typically be expected to rise steadily, albeit slowly, with each increment in test score, at least for some range of scores surrounding the cut point. In such a case the designation of the particular value for the cut point may be largely determined by the number of persons to be hired or promoted. There is no sharp difference between those just below the cut point and those just above it, and the use of the cut score does not entail any criterion-referenced interpretation. This method of establishing a cut score may be subject to legal requirements with respect to the nature of the validity and reliability evidence needed to support the use of rank-order selections and the unavailability of effective alternative selection methods, if it has a disproportionate effect on one or more subgroups of employees or prospective employees.

In the second example, a school district might structure its courses in writing around three categories of needs. For children whose proficiency is least developed, instruction might be provided in small groups, with considerable individual attention to assist them in creating meaningful written stories grounded in their own experience. For children whose proficiency was further developed, more emphasis might be placed on systematic exploration of the stages of the writing process. Instruction for children at the highest proficiency level might emphasize mastery of specific writing genres or prose structures used in more formal writing. In an appropriate implementation of such a program, children could easily be transferred from one level to another if their original placement was in error or as their proficiency increased.
Ideally, cut scores delineating categories in this application would be based on research demonstrating empirically that pupils in successive score ranges did most often benefit more from the respective treatments to which they were assigned than from the alternatives available. It would typically be found that between those score ranges in which one or another instructional treatment was clearly superior, there was an intermediate region in which neither treatment was clearly preferred. The cut score might be located somewhere in that intermediate region.

In the final example, that of a professional licensure examination, the cut score represents an informed judgment that those scoring below it are likely to make serious errors for want of the knowledge or skills tested. Little evidence apart from errors made on the test itself may document the need to deny the right to practice the profession. No test is perfect, of course, and regardless of the cut score chosen, some examinees with inadequate skills are likely to pass and some with adequate skills are likely to fail. The relative probabilities of such false positive and false negative errors will vary depending on the cut score chosen. A given probability of exposing the public to potential harm by issuing a license to an incompetent individual (false positive) must be weighed against some corresponding probability of denying a license to, and thereby disenfranchising, a qualified examinee (false negative). Changing the cut score to reduce either probability will increase the other, although both kinds of errors can be minimized through sound test design that anticipates the role of the cut score in test use and interpretation.

Determining cut scores in such situations cannot be a purely technical matter, although empirical studies and statistical models can be of great value in informing the process. Cut scores embody value judgments as well as technical and empirical considerations. Where the results of the standard-setting process have highly significant consequences, and especially where large numbers of examinees are involved, those responsible for establishing cut scores should be concerned that the process by which cut scores are determined be clearly documented and defensible. The qualifications of any judges involved in standard setting and the process by which they are selected are part of that documentation. Care must be taken to assure that judges understand what they are to do. The process must be such that well-qualified judges can apply their knowledge and experience to reach meaningful and relevant judgments that accurately reflect their understandings and intentions. A sufficiently large and representative group of judges should be involved to provide reasonable assurance that results would not vary greatly if the process were replicated.
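The tradeoff between false positives and false negatives described above can be tabulated for candidate cut scores whenever criterion information is available. The Python sketch below is purely illustrative: the scores and the competence classifications are hypothetical, and in practice true competence is itself estimated with error.

    def classification_errors(scores, truly_competent, cut):
        """Counts of false positives (not competent but passing) and false
        negatives (competent but failing) at a given cut score."""
        fp = sum(1 for s, ok in zip(scores, truly_competent) if s >= cut and not ok)
        fn = sum(1 for s, ok in zip(scores, truly_competent) if s < cut and ok)
        return fp, fn

    scores          = [48, 52, 55, 58, 60, 63, 66, 70, 74, 80]
    truly_competent = [False, False, True, False, True, True, False, True, True, True]
    for cut in (55, 60, 65, 70):
        fp, fn = classification_errors(scores, truly_competent, cut)
        print(f"cut={cut}: false positives={fp}, false negatives={fn}")

Raising the cut in this toy example reduces false positives while increasing or holding false negatives, which is precisely the tradeoff the text says must be weighed.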
Standard 4.1
Test documents should provide test users with clear explanations of the meaning and intended interpretation of derived score scales, as well as their limitations.

Comment: All scales (raw score or derived) may be subject to misinterpretation. Sometimes scales are extrapolated beyond the range of available data or are interpolated without sufficient data points. Grade- and age-equivalent scores have been criticized in this regard, but percentile ranks and standard score scales are also subject to misinterpretation. If the nature or intended uses of a scale are novel, it is especially important that its uses, interpretations, and limitations be clearly described. Illustrations of appropriate versus inappropriate interpretations may be helpful, especially for types of scales or interpretations that may be unfamiliar to most users. This standard pertains to score scales intended for criterion-referenced as well as for norm-referenced interpretation.

Standard 4.2
The construction of scales used for reporting scores should be described clearly in test documents.

Comment: When scales, norms, or other interpretive systems are provided by the test developer, technical documentation should enable users to judge the quality and precision of the resulting derived scores. This standard pertains to score scales intended for criterion-referenced as well as for norm-referenced interpretation.

Standard 4.3
If there is sound reason to believe that specific misinterpretations of a score scale are likely, test users should be explicitly forewarned.

Comment: Test publishers and users can reduce misinterpretations of grade-equivalent scores, for example, by ensuring that such scores are accompanied by instructions that make clear that grade-equivalent scores do not represent a standard of growth per year or grade and that roughly 50% of the students tested in the standardization sample should by definition fall below grade level. As another example, a score scale point originally defined as the mean of some reference population should no longer be interpreted as representing average performance if the scale is held constant over time and the examinee population changes.

Standard 4.4
When raw scores are intended to be directly interpretable, their meanings, intended interpretations, and limitations should be described and justified in the same manner as is done for derived score scales.

Comment: In some cases the items in a test are a representative sample of a well-defined domain of items. The proportion correct on the test may then be interpreted as an estimate of the proportion of items in the domain that could be answered correctly. In other cases, different interpretations may be attached to scores above or below one or another cut score. Support should be offered for any such interpretations recommended by the test developer.

Standard 4.5
Norms, if used, should refer to clearly described populations. These populations should include individuals or groups to whom test users will ordinarily wish to compare their own examinees.

Comment: It is the responsibility of test developers to describe norms clearly and the responsibility of test users to employ norms appropriately. Users need to know the applicability of a test to different groups.
Differentiated norms or summary information about differences between gender, ethnic, language, disability, grade, or age groups, for example, may be useful in some cases. The permissible uses of such differentiated norms and related information may be limited by law. Users also need to be made alert to situations in which norms are less appropriate for some groups or individuals than others. On an occupational interest inventory, for example, norms for persons actually engaged in an occupation may be inappropriate for interpreting the scores of persons not so engaged. As another example, the appropriateness of norms for personality inventories or relationship scales may differ depending upon an examinee's sexual orientation.

Standard 4.6
Reports of norming studies should include precise specification of the population that was sampled, sampling procedures and participation rates, any weighting of the sample, the dates of testing, and descriptive statistics. The information provided should be sufficient to enable users to judge the appropriateness of the norms for interpreting the scores of local examinees. Technical documentation should indicate the precision of the norms themselves.

Comment: Scientific sampling is important if norms are to be representative of intended populations. For example, schools already using a given published test and volunteering to participate in a norming study for that test should not be assumed to be representative of schools in general. In addition to sampling procedures, participation rates should be reported, and the method of calculating participation rates should be clearly described. Studies that are designed to be nationally representative often use weights so that the weighted sample better represents the nation than does the unweighted sample. When weights are used, it is important that the procedure for deriving the weights be described and that the demographic representation of both the weighted and the unweighted samples be given. If norming data are collected under conditions in which student motivation in completing the test is likely to differ from that expected during operational use, this should be clearly documented. Likewise, if the instructional histories of students in the norming sample differ systematically from those to be expected during operational test use, that fact should be noted. Norms based on samples cannot be perfectly precise. Even though the imprecision of norm-referenced interpretations due to imperfections in the norms themselves may be small compared to that due to measurement error, estimates of the precision of norms should be available in technical documentation. For example, standard errors based on the sample design might be presented. In some testing applications, norms based on all examinees tested over a given period of time may be useful for some purposes. Such norms should be clearly characterized as being based on a sample of persons routinely tested as part of an ongoing testing program.
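The effect of the sample weights mentioned in this comment can be illustrated with a weighted mean. In the Python sketch below, the scores and post-stratification weights are hypothetical; an operational norming report would document how such weights were derived and give the demographic makeup of the weighted and unweighted samples.

    def weighted_mean(scores, weights):
        """Weighted mean, as used when a norming sample is weighted so that
        it better represents the intended population."""
        total = sum(weights)
        return sum(s * w for s, w in zip(scores, weights)) / total

    scores  = [40, 45, 50, 55, 60]
    weights = [1.4, 0.8, 1.0, 0.9, 1.2]   # hypothetical sampling weights
    print("unweighted mean:", sum(scores) / len(scores))        # 50.0
    print("weighted mean:  ", round(weighted_mean(scores, weights), 2))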
For example, it is nor possible to determine the percenrile rank ofa school's average test score iIa]l rhar is known are the percenrile ran[<s o[each of that school's srudens. Ir may somerimes be use6.rl ro develop special norms for group meens, bur when the sizes of rhe groups differ marerially or when some groups are much more heterogeneow than othe¡s, the construction and inrerpretarion olgroup norrns is problematical. One common and acceptable procedure is to report the percentile rank of the median group rnember, for example, the median percenrile rank of the pupils rested in a given school. Standard 4.9 When raw scoie or derived score scales aÍe designed for criterion-referenced interpretation, including the classification of exami¡ees into separate categories, the rarionaje for recommended score interpretations should be clearly explained. Commen t: Crire¡ion-re[erenced inrerpretarions or inlerences rhat do nor uke rhe form of comparisons ro rhe rest perFormance oI other examinees. Examples include starements that some psychopathologr is likely present, that a prospective employee possesses specific skills required in a given position, or rhat a clrild scoring above a cenain score point cân successfully apply a given ser ofskìlls. Such interpretarions may refer ro the absolute levels ofte-st sÇores or ro patrerns ofscores for an individual cxamince. Wheneve¡ the test developer ¡eco¡nmends such interpretations, are score-based descriprions the rationale and empirical basis should be clearly presentcd. Serious efForrs should be made whenever possible ro obein independent 56 AERA APA NCMË 0000066 PABT I / SCALES, T'¡ORMS, AND SCOFE COMPABABILITY evidence concerning rhe soundness oIsuch score ìn rcrpretarions. Criterion-¡eferenced and norm-re[erenced scales are not mutually exclusive. Given adequate supporring data, scores mey be interprered by both approaches, not necessarily just one or the other. STANI}ARDS score reported and used is a pass-fail decision, for exarnple, then rhe fo¡m-ro-form equivâlence or oí measuremeng for examinees fa¡ above below the cut score is oFno concern, Far Some tcsting ac¡ommodadons may only af[ect the dependence of tesr scores on capabilities irrelevant to the construct the test is intended ro measure. Use of a large-print edition, for Standard 4.10 A clear rationale and supporting evidence should be provided for any claim tlrat sco¡es earned on different forms of a test may be used inte¡changeably. In some cases, dìrect cvidence oÊscore equiralence may be provided, In other cases, evidence may come from a demonstcation that the tfieoretical assump- tions underlfng procedures for establishing score comparability have been sufficiendy sat- isfied. The speciÊc rationale and the evidencc required will depend in part on the intended uses for which score equiralence is claimed. example, assures that performance does not depend on che abiliry ro perceive srandard-sizc print. In such cases, relatively modesr srudies or professional judgment may be suffìcient to support claims of score equivalence. Standard 4,11 When claims of fo¡m-to-form score equivalence are based on equating procedures, detailed technical information should be provided on the method by which equadng frrnctions or other liokaga were esteb[shed and on the accuracy ofequating fr¡ncdons. 
Comment: Support should be provided For any irems or tesring marerials, or different testing Comment: The fundamental concern is to show thar equated scores measure essenrially procedures, are inrerchangeable for some pur- rhe same consrrucr, w.irh very s.imilar levels assertion that scores obuined using different pose. This s¡andard applies, [or example, to alternate forms o[a paper-and-pencil rest or to alrernare sers oFitems raken by different examinees in computerizcd adaprivc resting. ft also applies to test lorms administered in different Êormats (e.g., paper-and'pencil and compurerized tesrs) or test forms designed for individuel ve¡sus group adminisrra¡ion. Score equivalence is easiesr ro establish when different forms are construcred Êollowing idenrical procedures and then equated sadstica.lly. \X/hen úra¡ is not possible, for example, in cases where differenr tesr fo¡mars are used, addirional evidence may be required ro esablish úre requisite degree ofscore equivalence For the intended contexr and purpose. tVhen recom¡nended inferences or acúons a¡e based solely on classifications of examinees inro one oFrwo or more categories, rhe rationale and evidence should address consistency of classifìcarion. if *re only of reliabiliry and conditional standa¡d errors of measuremenr. Technic¿l in[ormarion should include the design of equating studQ, rhe s¡arisric¿l merhods used, rhe siz¿ and relèvant characceristics oFexaminee samples used in equating studies, and rhe characeristics of any anchor tesrs or linking items. S¡anda¡d errors of equaring funcrions should be esrimated a¡d reported wheneve¡ possible. Sample sizes permitting, ir may bc inlormativc ro dctermine equating firnctions independendy for identifiable subgroups of examinees. Ir may also be informative ro use rwo anchor Forms and co conduc the eq""dnt using each ofthe a¡chors. In some ceJes, equadng funcdons may be determined independendy using differenr saris¡ical merlods. The corespondencc of separate funcrions obtained by such methods can lend suppon to r.l-re adequacy of the equacing ruula. Any substandal disparities found by such merhod¡ 57 AERA APA NCMÊ OOOOO6T I f --- clrtú RUF! tt Ðgt(. lL) [F{lU[.!fdrìU-rð SCALES, NOBMS, AND SCOßE COMPAHABILITY should be resolved or reDorted. To be most useful, equating error should be presented in units of the reporred score scele. For testing programs rvith cut scores, equaring error neâr ¡he cut score is ol primary importance. The degree oF scrutin)' of equâting firncrions should bc commensurare with rhe extenr oF test use anticipated and che importance of the decisions rhe test scores are intended to inlorm. Standard 4.12 In equating studies rhat rely on rhe steijstical equivalence o[examinee groups receiving different forms, methods of assuring such equivalence should be described in detail. I P,ABT I Standard 4,'!4 IV'hen score conversions ot comparíson procedures are used to relate scores on rests or test forms that are not closeþ parallel, the construction, intended interpretation, arid limitations of those conversions or comparisons should be clearly described. Comment: Various score conversions or concordance tables have been consrructed relating diflerent levels ofdifficulry, relaring earlie¡ to revised forms ofpublished resrs, creating score conco¡dances benveen diflerent tests at Commnr: Cerrain equaúng derigns rely on the tests of simila¡ or differenr const¡ucts, or For other purposes, Such conversions are olren use[ul, but they may also be subjecr to misinterpretation. 
The limitations of such conver- rando m sions should be clearly described. is ro s¡rstematically mix differenr [est fo¡ms and Ql¡nd¡r¡l â l6 eq uiv-alence of groups receivi ng differenr forms. Often, one way to assure such equivalence then distribute them in a random fashion so thar roughly equal numbers of examinees in each group resred receive each form. Standard 4.13 additional test forms are created by uk' ing a subser of the items in an eristing æst form or by rearranging its items and there is sou¡rd reason to believe that scores on these fo¡ms \lhen ur vt r(r¡¡^ --,. L- :^C1,,-^--Å L-.:í-- ^^^.--. ¡¡rê, -Cf-^.- In equating studies that employ an anchor evidence should be prorrided that there is no test design, the characteristics of the ancho¡ undue distortion of norms for the different versions or of score linkages berween tlem. rest and its similarity to the forms being equated should be presented, including both content specifications and empirically determined relationships among test scores, If anchor items are used, as in some lRT-based and ctassical equating studies, the represenrativeness and psychometric characteristics ofanchor items should be presented, Comment: Tesrs o¡ tesr Forms may be linked via common icems embedded wirhin each oF them, or a common test adminisrered togeth- er wirh each oFrhcm. These common items tesrs are referred to æ linking items, anchor items, or anchor tesrs. \W'ith such methods, the qualiry oF rhe resulting equating depends srrongly on the adequacy ofthe anchor tesr or or items used. Comment:Some res¡s and resr barteries are published in both a full-lengrh version and a survey or shorr version. In orher cases, mukiple versions ofa single test form mây be creered by rearranging irs irems. ir should not be assumed rhat performance dara derived from rhe adminisrrarion of irems as parr of the initial version can be used to approximate norms or construct convcrsion rables for alrernarive intact resrs, Due caution is required in cases where context efFects are likely, including speeded tests, long tesrs where Facigue may be a factor, and so on. In many cases, adequate psychomctric data may only be obtainable from independent adminisrrarions of rhe akernare Forms. co AE RA-APA_N C M E-O OOOO 68 PABT I / SCALES, NORMS, AND SCORE COMPARABIL¡TY Standard 4.16 If test specifications are changed from one version of a test to a subsequent rersion, such changes should be identified in the test manual, a¡d an indication should be given tha¡ _ converted sco¡es For the two versions may not be strictly equivalent: W'hen substantial changes in test specifications occu¡, either scores should be teported on a new scale or a dear statement should be provided to dert users that tÄe scores are not direcdy comparable with those on ea¡lie¡ versions of the test. Comment: Major shifu sometimes occur in the specifications of tesr that a¡e used for subsnnrial periods of rime. Often such changes take of improvemcn¡s in item rypes or ofshiFts in con¡ent that have been shown co improve validity and, therefore, are highly advantage desi¡able. h is imporranr to recognize, howev- er, rhar such shiFts will resulr in scores tha¡ c¿nnot be made stricrly interchangeable with scoÍes on an earlier form of the tesr. Testing progrems that attempt to maintain a common scale over time should conduct periodic checks of the stabiliry of the scale on which scores are reported. Comm¿nt: ln some resring programs, irems are inr¡oduced into and retired from item pools on an ongoing bæis. 
ln oùer cases, the irems in successive tesr forms may overlap very linle, or not ar al.l. In eicfrcr c¿se, ifa 6xed scale is used for reporring, ir is imponant ro ãssure thar ùre mea¡ing ofrhe scalod scores does nor chanç over dme. Standard 4.19 'Nlhen proposed score interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be clearly documented. Comment: Cur scores may be established to select a specified number of examinees (e.g., to fill existing vacancies), in which case little furrhe¡ documentation may be needed concerning the specific quesrion ofhow the cut scores are essblished, though anention should be paid to legal requiremenm that may apply. In orher .'ces, however, cut scores may be used ro classifr examinees inro distincr caregories (e.g., diagnoscic categories, or passing versus merhod musr be clearly documented. Ideall¡ the role ofcut scores in test use and interpretarion is takcn ìnro accorürt during resc design. Adequate precision in regions ofscore scales where cut poinrs are esablished is prerequisite ro reliable classification of o<aminees into c¿tegories. Ifstandard sening employs data on dre score distributions for criterion groups or on the relation of test scores to one or mo¡e criterion va¡iables, those dara should be summa¡izcd in rechnical documentation. If a judgmental sundard-setting process is followed, dre method employed should be clearly described, a¡rd the precise nature of the judgmens called for should Standard 4.18 If a publisher provides norms lor use in test score interpretation, then so long as the test bility to Commcnt: Tesc publishers should assurc rhat up-to-dare norms are readily available, but it remains the test use¡i responsibiliry to avoid inappropriate use ofno¡ms rhat a¡e out o[date and ro st¡ive to assure accurate and appropriare tesr interpretarions, lailing) for which rhere are no preestablished quotas. In rhese cases, the srandard-setting Standard 4.f 7 temains in print, STANDARI}S it is the publisher's responsi- that the test is renormed with sufficient frequency to permit continued accuassure rate and appropriate score interpretations. be presented, wherher chosc a¡c judgmcnts of persons, of item or test performances, or of orhe¡ c¡iterion perlormances predictcd by test scoru. Documenudon should also include thc selection and qualification of judges, training provided, any feedback to judges concerning the implicarions oItheir provisional judgmens' 59 AE RA_AP A_N CM E_0 o oo 0 69 laç^oo^ t\ I l, SCALES, NOBMS, AND SCOBE COMPARABILITY ßg\¡1l ¡tti{[ [q, rrlt EÉrñ! u!rL¿ E and en/ opporiúniiies forjudges ro confer v/idt one anorher. lVhere applicable, variabiliry over judges should be reponed. rùl/henever fæsible, a¡ esrimere should be provided of rhe amounr oI variation in cur scores thar mighr be expecred il / PART I 'oelow versus just above the cu( score, bur evidence should be provided where Feasible oFa relarionship be¡ween resr and crirerion perlormance over a score inrerval ¡ha¡ includes or approaches the cur score. the standard-serri ng procedure were replicared. 
Standard 4.20
When feasible, cut scores defining categories with distinct substantive interpretations should be established on the basis of sound empirical data concerning the relation of test performance to relevant criteria.

Comment: In employment settings, although it is important to establish that test scores are related to job performance, the precise relation of test and criterion may have little bearing on the choice of a cut score. However, in contexts where distinct interpretations are applied to different score categories, the empirical relation of test to criterion assumes greater importance. Cut scores used in interpreting diagnostic tests may be established on the basis of empirically determined score distributions for criterion groups. With achievement or proficiency tests, such as those used in licensure, suitable criterion groups (e.g., successful versus unsuccessful practitioners) are often unavailable. Nonetheless, it is highly desirable, when appropriate and feasible, to investigate the relation between test scores and performance in relevant practical settings. Note that a carefully designed and implemented procedure based solely on judgments of content relevance and item difficulty may be preferable to an empirical study with an inadequate criterion measure or other deficiencies. Professional judgment is required to determine an appropriate standard-setting approach (or combination of approaches) in any given situation. In general, one would not expect to find a sharp difference in levels of the criterion variable between those just below versus just above the cut score, but evidence should be provided where feasible of a relationship between test and criterion performance over a score interval that includes or approaches the cut score.

Standard 4.21
When cut scores defining pass-fail or proficiency categories are based on direct judgments about the adequacy of item or test performances or performance levels, the judgmental process should be designed so that judges can bring their knowledge and experience to bear in a reasonable way.

Comment: Cut scores are sometimes based on judgments about the adequacy of item or test performances (e.g., essay responses to a writing prompt) or performance levels (e.g., the level that would characterize a borderline examinee). The procedures used to elicit such judgments should result in reasonable, defensible standards that accurately reflect the judges' values and intentions. Reaching such judgments may be most straightforward when judges are asked to consider kinds of performances with which they are familiar and for which they have formed clear conceptions of adequacy or quality. When the responses elicited by a test neither sample nor closely simulate the use of tested knowledge or skills in the actual criterion domain, judges are not likely to approach the task with such clear understandings. Special care should be taken to assure that judges have a sound basis for making the judgments requested. Thorough familiarity with descriptions of different proficiency categories, practice in judging task difficulty with feedback on accuracy, the experience of actually taking a form of the test, feedback on the failure rates entailed by provisional standards, and other forms of information may be beneficial in helping judges to reach sound and principled decisions.
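The replication estimate contemplated in the comment to Standard 4.19 can be summarized numerically for a judgmental process like those of Standards 4.20 and 4.21. The Python sketch below assumes, purely for illustration, a simple panel design in which each judge recommends a cut score and the panel cut is their mean; the judges' recommendations are hypothetical, and the Standards do not prescribe this or any particular design.

    from statistics import mean, stdev
    from math import sqrt

    def judge_based_cut(judgments):
        """Summarize a panel's judgments as a mean cut score, with a
        standard error over judges suggesting how much the cut might vary
        if the panel process were replicated."""
        m = mean(judgments)
        se = stdev(judgments) / sqrt(len(judgments))
        return m, se

    judges = [62, 65, 58, 64, 60, 66, 61]   # hypothetical recommended cuts
    cut, se = judge_based_cut(judges)
    print(f"cut score = {cut:.1f}, standard error over judges = {se:.2f}")

A larger, more representative panel shrinks this standard error, which is one way to read the text's call for a sufficiently large and representative group of judges.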
5. TEST ADMINISTRATION, SCORING, AND REPORTING

Background

The usefulness and interpretability of test scores require that a test be administered and scored according to the developer's instructions. When directions to examinees, testing conditions, and scoring procedures follow the same detailed procedures, the test is said to be standardized. Without such standardization, the accuracy and comparability of score interpretations would be reduced. For tests designed to assess the examinee's knowledge, skills, or abilities, standardization helps to ensure that all examinees have the same opportunity to demonstrate their competencies. Maintaining test security also helps to ensure that no one has an unfair advantage. Occasionally, however, situations arise in which modifications of standardized procedures may be advisable or legally mandated. Persons of different backgrounds, ages, or familiarity with testing may need nonstandard modes of test administration or a more comprehensive orientation to the testing process, in order that all test takers can come to the same understanding of the task. Standardized modes of presenting information or of responding may not be suitable for specific individuals, such as persons with some kinds of disability, or persons with limited proficiency in the language of the test, so that accommodations may be needed (see chapters 9 and 10). Large-scale testing programs generally have established specific procedures to be used in considering and granting accommodations. Some test users feel that any accommodation not specifically required by law could lead to a charge of unfair treatment and discrimination. Although accommodations are made with the intent of maintaining score comparability, the extent to which that is possible may not be known. Comparability of scores may be compromised, and the test may then not measure the same constructs for all test takers.

Tests and assessments differ in their degree of standardization. In many instances different examinees are given not the same test form, but equivalent forms that have been shown to yield comparable scores. Some assessments permit examinees to choose which tasks to perform or which pieces of their work are to be evaluated. A degree of standardization can be maintained by specifying the conditions of the choice and the criteria of evaluation of the products. When an assessment permits a certain kind of collaboration, the limits of that collaboration can be specified. With some assessments, test administrators may be expected to tailor their instructions to help assure that all examinees understand what is expected of them. In all such cases, the goal remains the same: to provide accurate and comparable measurement for everyone, and unfair advantage to no one. The degree of standardization is dictated by that goal, and by the intended use of the test.

Standardized directions to test takers help to ensure that all test takers understand the mechanics of test taking. Directions generally inform test takers how to make their responses, what kind of help they may legitimately be given if they do not understand the question or task, how they can correct inadvertent responses, and the nature of any time constraints. General advice is sometimes given about omitting item responses. Many tests, including computer-administered tests, require special equipment. Practice exercises are often presented in such cases to ensure that the test taker understands how to operate the equipment. The principle of standardization includes orienting test takers to materials with which they may not be familiar. Some equipment may be provided at the testing site, such as shop tools or balances. Opportunity for test takers to practice with the equipment will often be appropriate, unless using the equipment is the purpose of the test.
Tests are sometimes administered by computer, with test responses made by keyboard, computer mouse, or similar device. Although many test takers are accustomed to computers, some are not and may need some brief explanation. Even those test takers who use computers will need to know about some details. Special issues arise in managing the testing environment, such as the arrangement of illumination so that light sources do not reflect on the computer screen, possibly interfering with display legibility. Maintaining a quiet environment can be challenging when candidates are tested separately, starting at different times and finishing at different times from neighboring test takers. Those who administer computer-based tests require training in the hardware and software used for the test, so that they can deal with problems that may arise in human-computer interactions.

Standardized scoring procedures help to ensure accurate scoring and reporting, which are essential in all circumstances. When scoring is done by machine, the accuracy of the machine is at issue, including any scoring algorithm. When scoring is done by human judges, scorers require careful training. Regular monitoring can also help to ensure that every test protocol is scored according to the same standardized criteria and that the criteria do not change as the test scorers progress through the submitted test responses.

Test scores, per se, are not readily interpreted without other information, such as norms or standards, indications of measurement error, and descriptions of test content. Just as a temperature of 50° in January is warm for Minnesota and cool for Florida, a test score of 50 is not meaningful without some context. When the scores are to be reported to persons who are not technical specialists, interpretive material can be provided that is readily understandable to those receiving the report. Often, the test user provides an interpretation of the results for the test taker, suggesting the limitations of the results and the relationship of any reported scores to other information. Scores on some tests are not designed to be released to test takers; only broad test interpretations, or dichotomous classifications, such as pass/fail, are intended to be reported.

Interpretations of test results are sometimes prepared by computer systems. Such interpretations are generally based on a combination of empirical data and expert judgment and experience. In some professional applications of individualized testing, the computer-prepared interpretations are communicated by a professional, possibly with modifications for special circumstances. Such test interpretations require validation. Consistency with interpretations provided by nonalgorithmic approaches is clearly a concern.

In some large-scale assessments, the primary target of assessment is not the individual test taker but is a larger unit, such as a school district or an industrial plant. Often, different test takers are given different sets of items, following a carefully balanced matrix sampling plan, to broaden the range of information that can be obtained in a reasonable time period. The results acquire meaning when aggregated over many individuals taking different samples of items. Such assessments may not furnish enough information to support even minimally valid, reliable scores for individuals, as each individual may take only an incomplete test. Some further issues of administration and scoring are discussed in chapter 3, "Test Development and Revision."
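As an added illustration (not part of the Standards text; the counts and probabilities below are hypothetical), a minimal matrix sampling simulation shows how group-level domain coverage can be obtained even though each examinee answers only part of the item pool:

    # Illustrative sketch only; hypothetical item pool and response rates.
    import random

    random.seed(1)
    pool = list(range(60))                     # the full content domain
    booklets = [pool[i::3] for i in range(3)]  # each booklet holds 20 items
    p_correct = {i: random.uniform(0.4, 0.9) for i in pool}

    # 100 examinees per booklet; each answers only that booklet's items.
    totals = {i: [0, 0] for i in pool}         # item: [correct, attempts]
    for booklet in booklets:
        for _ in range(100):
            for item in booklet:
                totals[item][0] += random.random() < p_correct[item]
                totals[item][1] += 1

    # The group-level estimate aggregates over the whole domain, although
    # no individual examinee attempted more than a third of the pool.
    domain_mean = sum(c / n for c, n in totals.values()) / len(pool)
    print(round(domain_mean, 3))

The design buys broad content coverage at the group level at the cost of individual scores, which is exactly the trade-off Standard 5.12 below addresses.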
Standard 5.1
Test administrators should follow carefully the standardized procedures for administration and scoring specified by the test developer, unless the situation or a test taker's disability dictates that an exception should be made.

Comment: Specifications regarding instructions to test takers, time limits, the form of item presentation or response, and test materials or equipment should be strictly observed. In general, the same procedures should be followed as were used when obtaining the data for scaling and norming the test scores. A test taker with a disabling condition may require special accommodation. Other special circumstances may require some flexibility in administration. Judgments of the suitability of adjustments should be tempered by the consideration that departures from standard procedures may jeopardize the validity of the test score interpretations.

Standard 5.2
Modifications or disruptions of standardized test administration procedures or scoring should be documented.

Comment: Information about the nature of modifications of administration should be maintained in secure data files, so that research studies or case reviews based on test records can take this into account. This includes not only special accommodations for particular test takers, but also disruptions in the testing environment that may affect all test takers in the testing session. A researcher may wish to use only the records based on standardized administration. In other cases, research studies may depend on such information to form groups of respondents. Test users or test sponsors should establish policies concerning who keeps the files and who may have access to the files. Whether the information about modifications is reported to users of test data, such as admissions officers, depends on different considerations (see chapters 8 and 10). If such reports are made, certain cautions may be appropriate.

Standard 5.3
When formal procedures have been established for requesting and receiving accommodations, test takers should be informed of these procedures in advance of testing.

Comment: When large-scale testing programs have established strict procedures to be followed, administrators should not depart from these procedures.

Standard 5.4
The testing environment should furnish reasonable comfort with minimal distractions.

Comment: Noise, disruption in the testing area, extremes of temperature, poor lighting, inadequate work space, illegible materials, and so forth are among the conditions that should be avoided in testing situations. The testing site should be readily accessible. Testing sessions should be monitored where appropriate to assist the test taker when a need arises and to maintain proper administrative procedures. In general, the testing conditions should be equivalent to those that prevailed when norms and other interpretative data were obtained.

Standard 5.5
Instructions to test takers should clearly indicate how to make responses. Instructions should also be given in the use of any equipment likely to be unfamiliar to test takers. Opportunity to practice responding should be given when equipment is involved, unless use of the equipment is being assessed.

Comment: When electronic calculators are provided for use, examinees may need practice in using the calculator. Examinees may need practice responding with unfamiliar tasks, such as a numeric grid, which is sometimes used with mathematics performance items. In computer-administered tests, the method of responding may be unfamiliar to some test takers. Where possible, the practice responses should be monitored to ensure that the test taker is making acceptable responses. In some performance tests that involve tools or equipment, instructions may be needed for unfamiliar tools, unless accommodating to unfamiliar tools is part of what is being assessed. If a test taker is unable to use the equipment or make the responses, it may be appropriate to consider alternative testing modes.
Standard 5.6
Reasonable efforts should be made to assure the integrity of test scores by eliminating opportunities for test takers to attain scores by fraudulent means.

Comment: In large-scale testing programs where the results may be viewed as having important consequences, efforts to assure score integrity should include, when appropriate and practicable, stipulating requirements for identification, constructing seating charts, assigning test takers to seats, requiring appropriate space between seats, and providing continuous monitoring of the testing process. Test developers should design test materials and procedures to minimize the possibility of cheating. Test administrators should note and report any significant instances of testing irregularity. A local change in the date or time of testing may offer an opportunity for fraud. In general, steps should be taken to minimize the possibility of breaches in test security. In any evaluation of work products (e.g., portfolios), steps should be taken to ensure that the product represents the candidate's own work, and that the amount and kind of assistance provided is consistent with the intent of the assessment. Ancillary documentation, such as the date when the work was done, may be useful.

Standard 5.7
Test users have the responsibility of protecting the security of test materials at all times.

Comment: Those who have test materials under their control should, with due consideration of ethical and legal requirements, take all steps necessary to assure that only individuals with a legitimate need for access to test materials are able to obtain such access before the test administration, and afterwards as well, if any part of the test will be reused at a later time. Test users must balance test security with the rights of all test takers and test users. When sensitive test documents are challenged, it may be appropriate to employ an independent third party, using a closely supervised secure procedure to conduct a review of the relevant materials. Such secure procedures are usually preferable to placing tests, manuals, and an examinee's test responses in the public record.

Standard 5.8
Test scoring services should document the procedures that were followed to assure accuracy of scoring. The frequency of scoring errors should be monitored and reported to users of the service on reasonable request. Any systematic source of scoring errors should be corrected.

Comment: Clerical and mechanical errors should be examined. Scoring errors should be minimized and, when they are found, steps should be taken promptly to minimize their recurrence.
Standard 5.9
When test scoring involves human judgment, scoring rubrics should specify criteria for scoring. Adherence to established scoring criteria should be monitored and checked regularly. Monitoring procedures should be documented.

Comment: Human scorers may be provided with scoring rubrics listing acceptable alternative responses, as well as general criteria. Consistency of scoring is often checked by rescoring randomly selected test responses and by rescoring some responses from earlier administrations. Periodic checks of the statistical properties (e.g., means, standard deviations) of scores assigned by individual scorers during a scoring session can provide feedback for the scorers, helping them to maintain scoring standards. Lack of consistent scoring may call for retraining or dismissing some scorers or for reexamining the scoring rubrics.

Standard 5.10
When test score information is released to students, parents, legal representatives, teachers, clients, or the media, those responsible for testing programs should provide appropriate interpretations. The interpretations should describe in simple language what the test covers, what scores mean, the precision of the scores, common misinterpretations of test scores, and how scores will be used.

Comment: Test users should consult the interpretive material prepared by the test developer or publisher and should revise or supplement the material as necessary to present the local and individual results accurately and clearly. Score precision might be depicted by error bands, or likely score ranges, showing the standard error of measurement.
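By way of illustration (this worked example is not part of the Standards text), one conventional classical test theory expression for such an error band uses the standard error of measurement computed from the score standard deviation and the reliability coefficient:

    \mathrm{SEM} = \sigma_x \sqrt{1 - r_{xx}}, \qquad \text{band} \approx x \pm z \cdot \mathrm{SEM}

where \sigma_x is the standard deviation of the scores, r_{xx} is the reliability, and z is a normal deviate (1.96 for an approximately 95% band). Under these assumptions, a test with a score standard deviation of 10 and a reliability of .91 has SEM = 10 x sqrt(.09) = 3, so the band reported around an observed score x would be roughly x plus or minus 6 points.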
Standard 5.11
When computer-prepared interpretations of test response protocols are reported, the sources, rationale, and empirical basis for these interpretations should be available, and their limitations should be described.

Comment: Whereas computer-prepared interpretations may be based on expert judgment, the interpretations are of necessity based on accumulated experience and may not be able to take into consideration the context of the individual's circumstances. Computer-prepared interpretations should be used with care in diagnostic settings, because they may not take into account other information about the individual test taker, such as age, gender, education, prior employment, and medical history, that provide context for test results.

Standard 5.12
When group-level information is obtained by aggregating the results of partial tests taken by individuals, validity and reliability should be reported for the level of aggregation at which results are reported. Scores should not be reported for individuals unless the validity, comparability, and reliability of such scores have been established.

Comment: Large-scale assessments often achieve efficiency by "matrix sampling" of the content domain by asking different test takers different questions. The testing then requires less time from each test taker, while the aggregation of individual results provides for domain coverage that can be adequate for meaningful group- or program-level interpretations, such as schools, or grade levels within a locality, or particular subject-matter areas. Because the individual receives only an incomplete test, an individual score would have limited meaning. If individual scores are provided, comparisons between scores obtained by different individuals are based on responses to items that may cover different material. Some degree of calibration among incomplete tests can sometimes be made. Such calibration is essential to the comparisons of individual scores.

Standard 5.13
Transmission of individually identified test scores to authorized individuals or institutions should be done in a manner that protects the confidential nature of the scores.

Comment: Care is always needed when communicating the scores of identified test takers, regardless of the form of communication. Face-to-face communication, as well as telephone and written communication, present well-known problems. Transmission by electronic media, including computer networks and facsimile, presents modern challenges to confidentiality.

Standard 5.14
When a material error is found in test scores or other important information released by a testing organization or other institution, a corrected score report should be distributed as soon as practicable to all known recipients who might otherwise use the erroneous scores as a basis for decision making. The corrected report should be labeled as such.

Comment: A material error is one that could change the interpretation of the test score. Innocuous typographical errors would be excluded. Timeliness is essential for decisions that will be made soon after the test scores are received.

Standard 5.15
When test data about a person are retained, both the test protocol and any written report should also be preserved in some form. Test users should adhere to the policies and record-keeping practice of their professional organizations.

Comment: The protocol may be needed to respond to a possible challenge from a test taker. The protocol would ordinarily be accompanied by testing materials and test scores. Retention of more detailed records of responses would depend on circumstances and should be covered in a retention policy (see the following standard). Record keeping may be subject to legal and professional requirements. Policy for the release of any test information for other than research purposes is discussed in chapter 8.

Standard 5.16
Organizations that maintain test scores on individuals in data files or in an individual's records should develop a clear set of policy guidelines on the duration of retention of an individual's records, and on the availability, and use over time, of such data.

Comment: In some instances, test scores become obsolete over time, no longer reflecting the current state of the test taker. Outdated scores should generally not be used or made available, except for research purposes. In other cases, test scores obtained in past years can be useful as, for example, in longitudinal assessment. The key issue is the valid use of the information. Score retention and disclosure may be subject to legal and professional requirements.

6. SUPPORTING DOCUMENTATION FOR TESTS

Background

The provision of supporting documents for tests is the primary means by which test developers, publishers, and distributors communicate with test users. These documents are evaluated on the basis of their completeness, accuracy, currency, and clarity and should be available to qualified individuals as appropriate.
A test's documentation typically specifies the nature of the test; its intended use; the processes involved in the test's development; technical information related to scoring, interpretation, and evidence of validity and reliability; scaling and norming if appropriate to the instrument; and guidelines for test administration and interpretation. The objective of the documentation is to provide test users with the information needed to make sound judgments about the nature and quality of the test, the resulting scores, and the interpretations based on the test scores. The information may be reported in documents such as test manuals, technical manuals, user's guides, specimen sets, examination kits, directions for test administrators and scorers, or preview materials for test takers.

Test documentation is most effective if it communicates information to multiple user groups. To accommodate the breadth of training of professionals who use tests, separate documents or sections of documents may be written for identifiable categories of users such as practitioners, consultants, administrators, researchers, and educators. For example, the test user who administers the tests and interprets the results needs interpretive information or guidelines. On the other hand, those who are responsible for selecting tests need to be able to judge the technical adequacy of the tests. Therefore, some combination of technical manuals, user's guides, test manuals, test supplements, examination kits, or specimen sets ordinarily is published to provide a potential test user or test reviewer with sufficient information to evaluate the appropriateness and technical adequacy of the test. The types of information presented in these documents typically include a description of the intended test-taking population, stated purpose of the test, test specifications, item formats, scoring procedures, and the test development process. Technical data, such as psychometric indices of the items, reliability and validity evidence, normative data, and cut scores or configural rules, including those for computer-generated interpretations of test scores, also are summarized.

An essential feature of the documentation for every test is a discussion of the known appropriate and inappropriate uses and interpretations of the test scores. The inclusion of illustrations of score interpretations, as they relate to the test developer's intended applications, also will help users make accurate inferences on the basis of the test scores. When possible, illustrations of improper test uses and inappropriate test score interpretations will help guard against the misuse of the test.

Test documents need to include enough information to allow test users and reviewers to determine the appropriateness of the test for its intended purposes. References to other materials that provide more details about research by the publisher or independent investigators should be cited and should be readily obtainable by the test user or reviewer. This supplemental material can be provided in any of a variety of published or unpublished forms; when demand is likely to be low, it may be maintained in archival form, including electronic storage. Test documentation is useful for all test instruments, including those that are developed exclusively for use within a single organization.
In addition to technical documentation, descriptive materials are needed in some settings to inform examinees and other interested parties about the nature and content of the test. The amount and type of information will depend on the particular test and application. For example, in situations requiring informed consent, information should be sufficient to develop a reasoned judgment. Such information should be phrased in nontechnical language and should be as inclusive as is consistent with the use of the test scores. The materials may include a general description and rationale for the test; sample items or complete sample tests; and information about conditions of test administration, confidentiality, and retention of test results. For some applications, however, the true nature and purpose of a test are purposely hidden or disguised to prevent faking or response bias. In these instances, examinees may be motivated to reveal more or less of the characteristics intended to be assessed. Under these circumstances, hiding or disguising the true nature or purpose of the test is acceptable provided this action is consistent with legal principles and ethical standards.

This chapter provides general standards for the preparation and publication of test documentation. The other chapters contain specific standards that will be useful to test developers, publishers, and distributors in the preparation of materials to be included in a test's documentation.

Standard 6.1
Test documents (e.g., test manuals, technical manuals, user's guides, and supplemental material) should be made available to prospective test users and other qualified persons at the time a test is published or released for use.

Comment: The test developer or publisher should judge carefully which information should be included in first editions of the test manual, technical manual, or user's guides and which information can be provided in supplements. For low-volume, unpublished tests, the documentation may be relatively brief. When the developer is also the user, documentation and summaries are still necessary.

Standard 6.2
Test documents should be complete, accurate, and clearly written so that the intended reader can readily understand the content.

Comment: Test documents should provide sufficient detail to permit reviewers and researchers to judge or replicate important analyses published in the test manual. For example, reporting correlation matrices in the test document may allow the test user to judge the data upon which decisions and conclusions were based, or describing in detail the sample and the nature of any factor analyses that were conducted will allow the test user to replicate reported studies.

Standard 6.3
The rationale for the test, recommended uses of the test, support for such uses, and information that assists in score interpretation should be documented. Where particular misuses of a test can be reasonably anticipated, cautions against such misuses should be specified.

Comment: Test publishers make every effort to caution test users against known misuses of tests. However, test publishers are not required to anticipate all possible misuses of a test.
If publishers do know of persistent test misuse by a test user, extraordinary educational efforts may be appropriate.

Standard 6.4
The population for whom the test is intended and the test specifications should be documented. If applicable, the item pool and scale development procedures should be described in the relevant test manuals. If normative data are provided, the norming population should be described in terms of relevant demographic variables, and the year(s) in which the data were collected should be reported.

Comment: Known limitations of a test for certain populations also should be clearly delineated in the test documents. In addition, if the test is available in more than one language, test documents should provide information on the translation or adaptation procedures, on the demographics of each norming sample, and on score interpretation issues for each language into which the test has been translated.

Standard 6.5
When statistical descriptions and analyses that provide evidence of the reliability of scores and the validity of their recommended interpretations are available, the information should be included in the test's documentation. When relevant for test interpretation, test documents ordinarily should include item level information, cut scores and configural rules, information about raw scores and derived scores, normative data, the standard errors of measurement, and a description of the procedures used to equate multiple forms.

Standard 6.6
When a test relates to a course of training or study, a curriculum, a textbook, or packaged instruction, the documentation should include an identification and description of the course or instructional materials and should indicate the year in which these materials were prepared.

Standard 6.7
Test documents should specify qualifications that are required to administer a test and to interpret the test scores accurately.

Comment: Statements of user qualifications need to specify the training, certification, competencies, or experience needed to have access to a test.

Standard 6.8
If a test is designed to be scored or interpreted by test takers, the publisher and test developer should provide evidence that the test can be accurately scored or interpreted by the test takers. Tests that are designed to be scored and interpreted by the test taker should be accompanied by interpretive materials that assist the individual in understanding the test scores and that are written in language that the test taker can understand.

Standard 6.9
Test documents should cite a representative set of the available studies pertaining to general and specific uses of the test.

Comment: Summaries of cited studies (excluding published works, dissertations, or proprietary documents) should be made available on request to test users and researchers by the publisher.
Such evidence example, c¿se studies mighr cite as appropri- might be provided, For example, by reporring the finding oFan independent revierv oFthe algorithms by qualified professionals. ¿ssisr rhe user ate examples of women and men of differenr individuals differing in sexual orienration; persons represenring va¡ious erhnic, cultural, or racial groups; and individuals wirh special needs. The inclusion of examples illusrraring rhe diversiry ofprospecrive ¡esr rakers ages; is nor inrended ro promore interprerarion of test scores in a manner inconsistenr wirh legal requirements that may ¡esr¡ict cerrain practices in some contexts, such as employee selecrion. C.lqa¡la¡Åq Ê II T ll lqqt V. that more rhan one method can be used for adminisrration or for recording responses-such as marking responses in a test booklet, on a separate a test is designed so answer sheet, or on a computer keyboardthen the manual should dearly document the extent to which scores arising from these methods are interchangeable. If the results are not interchangeable, this fact should be reported, and gui<iance shouid be given for the inte¡pretation of sco¡es obtained unde¡ the various conditions or methods of administration. Standard 6.12 Comment: The cesr user should be informed ofany cut scores or configural rulcs nccessa¡y for undersranding compurer-generared score inrerprerarions. A descript'ion of borh the sam- ples used to derive cut scores or configural nles and rhe merhods used to derive the cur scores should be provided. \fhen proprietary inrer- in rhe withholding oFcut Every test form and supporting document should carry a copyright date or publication date. Comment: During rhe operarionai life oFa tesr, new or revised tesr forms may be published, and manuals and orher ma¡erials may be added or ¡evised. Use¡s and porential users are entirled to k¡row the publication date¡ of various documenß rhar include resr norms" Communication among researchers is hampered when rhe parricular resr documenrs used in experimental studies are ambiguously reFerenced in research reports. Standard 6.15 Publishers and scoring services that ofFer computer-generated interpretations of test !t ll - I sLorÉs srlouu P(ov¡qc a SurtrfraÐ/ oI1-lmc endence supporring the interpretadons given. ests resul¡ a test, the test's documentation should be amended, supplemented, or revised to keep information for users currenr ¿¡d ro provide useÊrl additional information or c¿urions. Ulq¡ $tandard 6.11 If Standard 6.'13 '!íhen substantial changes are made to scores a¡d dist¡ibuto¡s should provide general information for test users and researchets who may be required to determine the appropriateness oFan i¡tended test use in a specific contelc. Síhen a particular test use can¡ot be jusdfied, the TÞst developers, publishers, rcsponse to an inquþ Èom a prospective test user should indicate this fact clearly. General information also should be provided for test takers and legal guardians who must provide consent prior to a test's adminisrration. or configural rulcs, the owners of rhe intellecrual 70 AERA APA NCME OOOOOSO AERA-APA-NC M E-OOOOOS 1 7 FAIRrüËqq AhË TFqTthtfì t-lruE4 Ãhr$T ü r--rtr qrEtbltrt Ërt bllJr ¡ ttutä EË ú TEST USH Background This chapter addresses overiiding issues of fairness in testing. Ir is intended borh ro emphasize rhe imporrance of fai¡ness in all aspects of tesring and assessment and ro serve as a con[ext for the rechnica] sundards. 
l¿re¡ chaprers address in greater derail some fai¡ness issues involving the responsibilities of resr ofresr users, rhe rights and responsibiliries rakers, the tesring oFindividuals ofdiverse linguisric backgrounds, and the tesring of rhose with disabiliries. Chapters l2 through 15 dso add¡ess some fairness issues specific ro psycho- logical, educarional, employmenr and credenrialing, and program evaluarion applicerions of tesring and assessmenr. Concern fo¡ Fairness in resring is pervasivq, ¿¡¿ the rrearmen¡ accorded rhe ropic here cannor do justice ro rhe complex issues involved. A full consideration ofFairness would explore rhe many frrncrions of testing in relarion to irs many goals, including rhe broad goal oFachieving equaliry ofopporrunity in our society. I¡ would consider the technical properties of resrs, rhe ways rest resulrs are reporred, and the facrors that are validly or erroneously rhoughr ro accou¡rr for parrerns of resr performance íor groups and individuals. A comprehensive analysis would also examine rhe regulations, srarures, r ^-) Ld5s l---. -L , govctu rcs( usc anor tne drrs ^--^ ¡4w trlaL ¡emedies [or harmful pracrices. The Srandards cannot hope to deal adequarely with all rhese broad issues, some olwhich have occ"cioned sharp disagreernenr among specialisrs and orher rhoughtful observers. Rathe¡, rhe focus oF the Standards is on those aspecrs of resrs, resrìng, and resr use rhat are the customary responsibiliries of rhose who make, use, and inrerpret resrs, and thar a¡e characcerized by some measure of professional and technicai consensus. Absolutc fairness Lo evety examinee is impossible ro attain, il For no other reasons rhan the facrs that rests have imperFect reliabiliry and rhar va.lidiry in any parricular context is a marter oFdegree. But neither is any a.hernative selecrio n or evaluarion mechanism perfecrly fâir. Properly designed and used, tes$ câ.n and do furrher socieral goals oF fairness and equaliry of opporruniry. Serious rechnical deficiencies in resr design, use, or inrerpretarion should, of course, be add¡essed, bur rhe fairness ofresring in any given conlexr must be judged relarive ro rJrar o[feasible test and nontcst alrernatives. Ir is general pracrice rhat large-scale rests are subjected ro c¿refrrl review and empirical checls ro minimize bias. The amounr oFexplicir er¡€nrion ro lairness in the design o[well-made resrs compares favorably ro thar oF many alrernative selection or evaluarion merhods. it is also cruci¿l to bear in mind rhar rest settings are incerpersonal. The inreracrion oF examiner v¡irh examinee should be professiona.l, courreous, caring, and respectful. In mosr resring sicuacions, rhe roles of examiner a¡d examinee are sharply unequal in snrus. ,{. proFessional's infe¡ences and reporrs from resr findings may markedly impacr the tife of rhe person who is examined. Artenrion ro these aspecr of test use and inrerprerarion is no importanr rhan more rechnical concerns. less fu is emphasized in professional education and rraining, users oFrests should be alert to the possibiliry rhat human issues involving examincr and examinee may somerimes affect rest fai¡ness. Atrention ro inrerpcrsonal issues is always imporranr, pcrhaps especially so when exe.minees have a disabilicy or differ from the examiner in erhnic, racial, or religious background; in gender or sexual orientation; in socioeconomic status; in age; or in other reJpecrs that may effect the examinee-examiner interaction. 
73 AERA APA NCME OOOOOS2 FAIR¡¡ESS IN TESTINC AND TEST USE Varying Views of Fairness The term fairness a¡d is used F¡nurss in many different has no single technica.l meaning. It wa¡,s is pos- sible that rwo individuals may endorse fairness in tesring as a desirable social goal, yec reach quire differenr conclusions abour the lairness oFa given resting program. Ourlined bclorv are four principd ways in which the rerm fairness is used. h should be noted, horvever, that many additional inrerprentions may be lound in rhe rechnical and popular literature. The first nvo characterizarions presenred here relate fairness to absence of bias and ro equiuble rrearment of all examinees in the testing process. There is broad consensus rhar tesrs should be free from bias (as defined below) and rhat all examinees should be trca¡ed fairly in the testing process itsell (e.g., atrorded rhe same or comparable procedures in test¡ng, test scoring, and use ofscores)- Thc third characterization of rest fairness add¡esses the equaliry of testing outcomÊs lor examinee subgroups defined by race, ethniciry, gender, disabiliry, or or,her cha¡acccristics. The idea ùat fairness requires equaliry in overall passing rates for dif[erent groups has been almost enrirely repudiated in the proFessional resdng lireratu¡e. A more widely accepced view rvould hold thar examinees of equal standing with respecr ro rhe constçuct the rest is intended to meâsure should on average earn the same test score, irrespective of group membership. Unfortunacel¡ because examìnees' levels of the const¡uct are measured imperFecrly, this requirement is rarely amenable to direct cxamination. The fourth definition of [airness relates ro equiry in opportuniry to learn the ma¡crial covered in an achicvement test. There would be genera.l agreement that adequate opportuniry to learn is clearly relevant to some uscs and inrerpreutions of achievemenr tests and clearly inelev-¿¡u to othen, alùough dìsagreement might arise as to rhe relevance of opporruniry ro learn co cesr fairness in some specific situations. As LAcK ot / PART II BrAs Bias is used here as a technical tcrm. It is said to arise when defìciencies in a test itself or rhe manner in which it is used result in differenr meanings for sco¡es earned by membe¡s oI different idenrifiable subgroups. \ù/hen evidence of such deficiencies is found at the fevef of icem response parrerns for members of diF[erent groups, rhe rcrms itenz biat or differentíal itemfu,2ctilníng (DlF) are often used. \ü/hen evide nce is found by comparing rhe parterns of association for different groups berrveen resr scores and orher variables, rhe rcrm predictiue bias mry be used. The concept ol bias and techniques for irs dereccion are discussed below and are also discussed in orher chaprers o( rhe Standdrdt There is general consensus ¡hat considerarion of bias is critical to sound tesring pracrice. FnrRHEss AS EoUTTABLE Tnetrmrrr rN THE TESTTI{G PRocess There is consensus that jusr treetment rhroughout rhe tesring process is a necessary condition For rest lairness. There is also consensus that fair rrearmenc of all examinees requires consideration nor only ofa test itself, bur also rhe conrext and purpose of testing and the manner in rvhich tesr scores are used. A well-designed rest is not intrinsically fair or unFai¡, bur rhe use ofthe çesr in a parricular circumsrance or rvith particular examinees may be fair or unfair. Unfairness can have individual and collecrive consequences. 
Regardless ofrhe purpose of resring, fairness requires rhat all examinees be given a comparable opportunity to demonstrate rheir sranding on the construct(s) the test is inrended to meesure. Just rrearment also includes such lactors as appropriate resting condirions and equal opporruniry to become familiar with rhe test format, pracrice marerials, and so fo¡rh. In siruarions rvhere individual or group cest results are reported, just treatment also implies rhat such reporting should be accurate and luily informarive. 74 AERA APA NCME OOOOOS3 PABT II / FAIffNESS IIJ TESTING ÁHD TEST USE Fairne.s e.lso recuires rhar all exa.minees '-a----" be affo¡ded appropriate tesring conditions, Careful standardization o[ ¡esrs and adminis¡ration condirions generally helps co assure rhat examinees have comparabfe opporruniry to demonstrate the abilities or attributes ro be measured. In some cases, horvever, aspecrs oI the resting process rhat pose no parricular challenge fo¡ most examinees may prevenr specific groups or individuals from accurarely demonsrraring rheir sranding wirh respect ro rhe consrrucr of inre¡esr (e.g., due to disabiliry or language backgiound). In some insrances, greater comperability may sometimes be a¡¡ained if sta¡dardized procedures are modified. There are conrexts in which some such modifica- rions are fo¡bidden by law and orher conrexts in which some such modifications are required by law. In all cases, srandardized procedures should be followed for a]l examinees unless explicir, documenred accommodarions have been rnade. Ideall¡ examinees would also be afforded equal opporrunity ro þrepare fo¡ a resr, E..-*:-^^^ 5t(uu(u :- 4ttl 145ç L- drtutucu L^Arrrrlrlr5 -L^..|.I ¡r( uc -rr-_r_J equal access ro materials provided by the testing organizarion and sponsor which describe rhe tesr contenr and purpose and offer specific familiarizarion and preparation [or test taking. In addi¡ion to assuring equity in access to accepred resources for ¡esr preparation, rhis principle covers resr securiry lor nondisclosed tesrs. if some examinees were ro have prior access to the contents of ã secure ¡es¡, for example, basing decisions upon rhe relative performance of diffcrent examinees would be unfair ro orhers \¡/ho did not have such access. On resrs rhar havc imporrant individual conseqLrences, all examinees should have a meaningful opporrunicy to provide inpur to ¡elevanr decision makers ifprocedural irregularirles in resring are alleged, iF the validiry oF rhe individual's sco¡e is challenged or may nor be reporred, or if similar speciaf circumstances arise. Fina!!v, rhe conceo¡ion of Fairness as -'-- --"'vr_'""-"t' equitable trearment in rhe tesring process exrends ro rhe reporring oF individual and group test resula. lndividual rcst score inFormation is entided to confidential rrearmenr in most circumstances. Confi dential iry should be respected; scores should be disclosed only as appropriate. ìVhen test scores ere reporred, either for groups or individuals, score reporrs should be accurere and in[ormarive. lt may be especially imporranr when reporting resulrs to nonprofessional audiences ro use appropriate language and wording and to rry to design reports ro reduce thc likelihood of inappropriare inrerpreiarions. 
rVhen group achievement differences are reporred, for example, including addirional information ro help the intended audience undersrand confounding facrors such as unequal educational opporruniry may help ro reduce misinrerpretation of test results and inc¡ease rhe likelihood that ¡esrs will be used wisely. Frun¡¡¡ss As EQuAUry ril OuTco[rES 0F ïEsTtNG The idea that fairness requires overall pasling rates ro be comparable across groups is not generally accepted in rhe prolessional lireraru¡e. Mosr tesring professìonals would probably agree that rvhile group differcnces in testing outcomes should in many cases trigger heighrened scruriny For possible sou¡ces of test bias, outcome diffe¡ences across groups do nor in themselves indicare thar a resring application is biased or unFair. ft mighr be argued rhat when resrs are used For selecrion, persons who all would perform equally rvell on the criterion measure iIselecred should have an equa.l chance ofbeing chosen regardless of group membership. Unfortunarel¡ there is rarely any direct procedure For derermining wherher rhis ideal has been metMoreover, iFscore distriburions diffe¡ from onc group to anorher, it is generally impossible to sarisfr ¡his ideal usi¡g any test thar has cor¡elation wirh che criteri- ::.ï,jrîI.r*cr AERA APA NCME OOOOO84 FAIRNESS ¡N TESTING AND TËST USE Many testing proFessionals would agree iF a rest is ftee of bias and examinees have received fair treatment in the testing process, rhen rhe condirions o[ fairness have been met. That is, given evidence olthe validiry ofinrended resr uses and inrerprerations, including evidence of lack oI bias and arrention ro issues offair rrearrnenr, fairness has been esrablished regardless ofgroupJevel thar outcornes. This vierv need nor imply rhar unequal tesring outcomes should be ignored alrogether. They may be imporranr in genereting new hyporheses about bias and fair treatment. But in this view, unequal our- / PART II At least three importanr diffìcuhies arise with rhis conception of [airness. Firsr, definition of oppornniry Ø l¿arn is rhe diffìcult in plactice, especielly at rhe level of individuals. Opporruniry is a marrer oldegree. Moreover, rhe measuremenr of some imporranr learning outcomes may reguire siudenrs ro wo¡k wi¡h material they have not seen befo¡e. Second, even ìf i¡ is possib.le ro docurnenr rhe ropics included in the cu¡riculum for a group ofsrudents, specific conrenr coverage for any one student may be impossible ro derermine. Finall¡ rhere is a well-founded desire to comes ec the group level have no direct bear- assure rhat credenrials arresr ro cerrain proficiencies or capabilities. Grancing a diploma to ing on quesrions o[ resr Fairness. There may a low-scoring exârninee on the grounds that be legal requiremenrs to investigate cerrain diffe¡ences in ourcomes of resting among subgroups. Those requiremenrs ñrrther may pro- vide thar, other things being equal, a tesring alternative that minimizes ourcome difFer€nces across relevant subgroups should be used. The srandards in this chapter are intended to be applied in a manner consistent with legal and regulatory standards. FllRu¡s As 0PPoRruNnY To LEARN This frnal conception o[fairness arises in connection wich edr.ìcational achievement resting. In many conreK6! achievement resß are intended to assess what a test taker knows or 'When can do as a result of lormal inscruction. some resr takers have not had the opporruniry to lea¡n the subject marter covered by the cest contenr, rhey are likely ro ger low scores. 
The test score may accurately reflecr rvhar the resr raker knows and c¡n do, but low scores may have ¡esulred in parr from not having had rhe opportuniry to learn the material tested as well as Êom having had the oppomrniry and having failed to learn. \$lhen rest takers have not had the oppomrnity to learn the materid tested, the policy of using rheir test scores as a basis For withholding a high school diploma, lor example, is viewed as unfair. This issue is Ê.rrttrer discussed in chapter 13, on educational tesring. rhe srudent had insufTìcient opporrunity to learn rhe material resred means certificaring somcone who has not atrained rhe degree of proficiency rhe diploma is inrended to signifr. Ir should be noted rhar opporruniry ro learn ordinarily plays no role in determining che fairness of tesrs used [or employmenr and credentialing, which are covered in chapter 14, nor oladmíssions testing. In rhose circumstances, it is deemed Fair thar rhe tesr should cover the full range ol requisire knowledge and slcìlls. Howeveç rhere are siruarions in which the agency rhat decermines the contens oFa resr used For employmenc or credenrialing also sers rhe curriculum rhac must be followed in preparing ¡o ¡ake rhe rcsr. In such cases, it is rhe responsibiliry of thar agency to essure rhat what is ro be ¡ested is Fully included in rhe spccification of what is ro be taught. Bias Associated With Test Gontent and Response Processes ó¡¿s in tesrs and testing refers to consrruct-irrelevant componenrs rhat result in sysremarically lower or higher scores for idenrifiable groups of examinees. Such consrrucr-irrelevanr score cornponenrs may be inrroduced due to inappropriare sampling o[ The term 7õ AERA_APA-NCM E_OOOOO85 PART II / FAIBNESS fN TESîING ANT ÏEST USE resr conrenr or lack o[clariry in ¡est instruc- tions. They may also arise iFscoring crireria fail to credir fully some correct problem approaches o¡ solutions rhat are more rypi- cal ofone group rhan another. Evidence of these potential sources of bias may be soughr in rhe contenr of the resrs, in comparisons oF rhe inrernal strucrure of resr responses for difïerent groups, and in comparisons o[ rhe ¡ela¡ionships of resr scores to other meesures, although none of rhese rypes ofevidence is unequivocal. lighdy with rhese wars. devoring more artenrion instead, sa¡ to social and industrial developments, rhen rhar stare's test rakers might be relatively disadvanraged. Bias may also resuk lrom a lack of clariry in resr instrucrions or lrom scoring rubrics that c¡edit responses more rypical oF one group rhan another. For example, cognirive abiliry tesrs ofren require rest takers ro classiÇ objecrs according to an unspecified ¡ule. Ifa given task credim classificetion on rhe bæis of the stimulus ob,jects' functions, bur an identifiable subgroup olexaminees rends to classifr Corure¡¡r-Reuno SouRces or Tmr Bns Bias drre ro inappropriate selection oF test conrent may somerimes be detecrcd by inspection of the test itself. In some testing it is common for resr developers to engage en independent panel of diverse experts to review test content for language that might be interprered differently by members o[diffe¡ent groups and for marerial thar mighr be offensive or emotionaJly disurbing ro some test rakers. For performance assessmenrs, panels are ofren engaged to review the scoring rubric as well. 
A cesc intended to measure verbal analogical reasoning, for conrexrs, example, should indude words in general ue, not words and expressions associared with particular disciplines, occupations, ethnic groups, or loca¡ions. \JØhere marerial likely to be differenrially inreresting or relevanr ro some examinees is included, it may be balanced by material thar may be of particular :^.--^-. .^,L^ -^-^:-:-- -..^-:-^^^ In educarional achievemenc resting, alignment rvirh curriculum may bear on questions oFconrenr-related resr bias. One may ask how well a rest represents sorne content domain and also wherher that domain is appropriare given inrended score intcrprcrations. A rest of l9rh-cenrury Unired Stares hisrory mighr give considerable emphasis to rhe Wa¡ of 1812, rhe Mexican Va¡, the Civil \flaç and the Spanish American \ía¡. IFsome state's cu riiculum f¡amework deal r relatively rhe objecrs on the basis of rheir physical appcarence, faulry resr interprctations are likely. Similarl¡ iI rhe scoring rub¡ic for a construced response irem reserves rhe highest score level for rhose examinees who in Facc provide more info¡marion o¡ elaboration than was acrually requested, then less tesr-wise o<aminees who simply lollow irutrucions will earn lower scores. In rhis case, tesrwiseness becomes a construc¡-irrelevanr cornponent of rest scores. Jud$menral merhods for the review of .-",. -^Å.-". :.-*" ^-- ^ê-- -..^^l---^.-J L., sntisrical proccdurcs for idenrifring irems on tesm that frrnction di[ferently across idenrifi- able subgroups of examinees. Diflerenrial i¡em functioning (DIF) is said to exisr rvhen examinees oFequal abilicy differ on avereger according to rheir group membership, in their responses to a particular item. if examinees from each group ere divided into subgroups according to rhe tsred abiiiry and su'bgrorps at the same abiliry level have uncqual probabilities of answering a given item correctly, then there is evidence that that ircm may not be fi.rnccioning as inrended. ft may be measuring something differenr From rhe remainder of the rest or ir may be measuring wirh different levels ofprecision for different subgrouPs of exeminces. Such an item may offer a valid measutement of some narrow elemenc oF the intended construct, or it may taP some construct-irrelevant comPonent that advanrages 77 AERA APA NCME OOOOO8G FAIRNESS fÎ{ T€STING AND TEST USE or disadvanrages members of one group. Although DIF procedures may hold some promise for improving tesr qualiry, there has been lirrle progress in identifring the causes or substanrive themes thar characceriz¡ irems exhibiring DIF. That is, once items on â lesr have been starístically identified as fr¡ncrioning differenrly from one examinee group ro another, ic has been diifìcutr to specifu rhe reasons lor the differential performance or to identifr a common deficiency among the idenrified items. R¡sporusr-R¡uno In some S0uRcEs 0F TÊsr BrAs cãses, consttuct-irrelevant score componenß may arise because resc irems elicit varieties of responses other than those incended or can be solved in ways that were nor intended. For example, clienm responding to a diagnosric inventory may attempr to provide ¡hc answers they rhink ¡he resr administrâror expec$ as opposed to che ânswers thet besr describe themselves. To the exrent rhar such response acquiescence is more rypical of some groups chan orhers, bias may result. Bias may aiso be associared with test response formats thar pose particular diffìculries Êor one group or anorher. 
For example, tesr perfo¡mancc may rely on some capabiliry (e.g., English language proficiency or fine-motor coordinarion) rhac is ir¡elevanr to the intent ol the measuremenr bur nonerheless poses impediments for some examinees. A rest of quan ritative reasoning that makes inappropriarcly heavy demands on verbal abiliry would probably be biased against examinees whose firsr language is orher than chat ofthe test. In addition to conrent reviews and DIF analyses, evidence ofbias related (o response processes may be provided by comparisons oF rhe internal structure ofthe test responses for different groups of examinees. If an analysis of the facrors or diÍnensions underlying test performance reveals different internaì structures for differenr groups, it may be that different conscr,rcrs are being measu¡ed o¡ .ir / PARI II may simply be rhar groups differ in rhei¡ variabiliry with respecr ro the same underlying dimensions. 'lf hen rhere is evidence ¡ha¡ tests, including personaliry resr, measure diÊ ferent const¡ucts in differenr gender, racial, or cultural groups, it is imporrant ro determine rhac rhe inrernal st¡uctuce of the cesr supports inferences made [or c]ients from these distincr subgroups of the client popularion. In siruations where internal test strucrure varies markedly acros ethnicalli' diverse cultures, it may be inappropriate ro make direcr comperisons of scores of members oF these different cuhural groups. Bias may also be indicared by patterns oFassociarion berween res¡ scores and orhe¡ variables. Perhaps the most familia¡ [orm such evidence may rake is a diflerence across groups in the regression equarions relating selectìon test performance to criterion performance. This case is discussed at greater length in the following section. However, evidence of bias based on relations ro o¡her variables may also take many othe¡ forms. The relationship berween two tes¡s of rhe same cognitive abiliry mighr be found to diÊ fer from one group to anorher, for example. Such a diFference might indicate bias in one or boch cests. As anoch€¡ insrance, a higher rhan expecred association berween reading and mathematics achievemenr resr scores among students who might well have limìt- ed English proficiency could rrigger an .investigarion to determine whether Ianguage proficiency was influencing some examinees' marhemarics scores. Patterns of score averages or other distributional summaries mighc also.point .ro porenrid sources of tesr bias. If ma-les outperformed females on one meesurc of academic performance and, in the same popularion, females oucperformed males on anorher, ir would follow that rhe rwo meas'ures could not borh be linearly relared ro the identical underlying construct. Nore, however, rhar if the resred popularions dif[ered, if the conrenr domains sampled diffe¡ed, or if 7B AERA-APA-NCM E-OOOOO87 PART II / FAIRNESS IN TESTI¡¡G ANT TEST USE the construcrs cesred othe¡wise differed due to varying motivational contexrs or orher effects, rwo reliable tests, each valid [or its intended purpose, might show such a patrern. Association need not imply any direct o¡ causal linkage, and alre¡narive explanarions for petrerns of associarion should usually be considered. ln some cases, a resrcrirerion coi¡elarion may arise because rhe resr and crirerion borh depend on rhe same construct-irrelevant abiliry. IF identifiable subgroups diller rvith respecc ro rhar extraneous abiliry, rhen bias may result. 
Bias may also be indicated by patterns of association between test scores and other variables. Perhaps the most familiar form such evidence may take is a difference across groups in the regression equations relating selection test performance to criterion performance. This case is discussed at greater length in the following section. However, evidence of bias based on relations to other variables may also take many other forms. The relationship between two tests of the same cognitive ability might be found to differ from one group to another, for example. Such a difference might indicate bias in one or both tests. As another instance, a higher than expected association between reading and mathematics achievement test scores among students who might well have limited English proficiency could trigger an investigation to determine whether language proficiency was influencing some examinees' mathematics scores. Patterns of score averages or other distributional summaries might also point to potential sources of test bias. If males outperformed females on one measure of academic performance and, in the same population, females outperformed males on another, it would follow that the two measures could not both be linearly related to the identical underlying construct. Note, however, that if the tested populations differed, if the content domains sampled differed, or if the constructs tested otherwise differed due to varying motivational contexts or other effects, two reliable tests, each valid for its intended purpose, might show such a pattern. Association need not imply any direct or causal linkage, and alternative explanations for patterns of association should usually be considered. In some cases, a test-criterion correlation may arise because the test and criterion both depend on the same construct-irrelevant ability. If identifiable subgroups differ with respect to that extraneous ability, then bias may result.

Fairness in Selection and Prediction

When tests are used for selection and prediction, evidence of bias or lack of bias is generally sought in the relationships between test and criterion scores for the respective groups. Under one broadly accepted definition, no bias exists if the regression equations relating the test and the criterion are indistinguishable for the groups in question. (Some formulations may hold that not only regression slopes and intercepts but also standard errors of estimate must be equal.) If test-criterion relationships differ, different decision rules may be followed depending on the group to which the person belongs. If fitting a common prediction equation for all groups combined suggests that the criterion performance of persons in any one group is systematically overpredicted or underpredicted, and if bias in the criterion measure has been set aside as a possible explanation, one possibility is to generate a separate prediction formula for each group. Another possibility is to seek predictor variables that may be used in lieu of, or in addition to, the initial predictor score to reduce differential prediction without reducing overall predictive accuracy. If separate regression equations are employed, the effect of their use on the distribution of predicted criterion scores for the different groups should be examined. Note that in the United States, the use of different selection rules for identifiable subgroups of examinees is legally proscribed in some contexts. There may, however, be legal requirements to consider alternative selection procedures in some such situations.

There is often tension between the perspective that equates fairness with lack of bias, in the technical sense, and the perspective that focuses on testing outcomes. A test that is valid for its intended purpose might be considered fair if a given test score predicts the same performance level for members of all groups. It might nonetheless be regarded by some as unfair if average test scores differ across groups. This is because a given selection score and criterion threshold will often result in proportionately more false negative decisions in groups with lower mean test scores. In other words, a lower-scoring group will usually have a higher proportion of examinees who are rejected on the basis of their test scores even though they would have performed successfully if they had been selected. This seeming paradox is a statistical consequence of the imperfect correlation between test and criterion. It does not occur because of any other property of the test and has no direct relationship to group demographics. It is a purely statistical phenomenon that occurs as a function of lower test scores, regardless of group membership. For example, it usually occurs when the top and bottom test score halves of the majority group are compared. The fairness of a test or another predictor should be evaluated relative to that of nontest alternatives that might be used instead.
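The false negative paradox described above is easy to reproduce by simulation. In the sketch below, both groups share a single, unbiased prediction equation; the correlation of .50, the half-standard-deviation difference in test means, and the zero cutoffs are all illustrative assumptions, not values drawn from the Standards.

    import numpy as np

    rng = np.random.default_rng(0)
    n, rho = 200_000, 0.5                     # assumed test-criterion correlation

    for label, mu in [("higher-scoring group", 0.0),
                      ("lower-scoring group", -0.5)]:
        test = rng.standard_normal(n) + mu    # groups differ only in mean test score
        # Identical regression of criterion on test for both groups, so the
        # test is unbiased under the regression definition given above.
        criterion = rho * test + np.sqrt(1 - rho**2) * rng.standard_normal(n)
        would_succeed = criterion >= 0.0      # assumed criterion threshold
        rejected = test < 0.0                 # assumed selection cutoff
        false_negative_rate = np.mean(rejected[would_succeed])
        print(f"{label}: {false_negative_rate:.1%} of would-be successes rejected")

Running the sketch shows a substantially larger share of rejected-but-would-have-succeeded examinees in the lower-scoring group, even though a single prediction equation fits both groups.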
Group Outcome Differences Due to Choice of Predictors

Success in virtually all real-world endeavors requires multiple skills and abilities, which may interact in complex ways. Testing programs typically address only a subset of these. Some skills and abilities are excluded because they are assessed in other components of the selection process (e.g., completion of course work or an interview); others may be excluded because reliable and valid measurement is economically, logistically, or administratively infeasible. Success in college, for example, requires perseverance, motivation, good study habits, and a host of other factors in addition to verbal and quantitative reasoning ability. Even if each of the criteria employed in a selection process is demonstrably valid and appropriate for that purpose, issues of fairness may arise in the choice of which factors are measured. If identifiable groups differ in their average levels of measured versus unmeasured job-relevant characteristics, then fairness becomes a concern at the group level as well as the individual level.

Can Consensus Be Achieved?

It is unlikely that consensus in society at large, or within the measurement community, is imminent on all matters of fairness in the use of tests. As noted earlier, fairness is defined in a variety of ways and is not exclusively addressed in technical terms; it is subject to different definitions and interpretations in different social and political circumstances. According to one view, the conscientious application of an unbiased test in any given situation is fair, regardless of the consequences for individuals or groups. Others would argue that fairness requires more than satisfying certain technical requirements. It bears repeating that while the Standards will provide more specific guidance on matters of technical adequacy, matters of values and public policy are crucial to responsible test use.

Standard 7.1

When credible research reports that test scores differ in meaning across examinee subgroups for the type of test in question, then to the extent feasible, the same forms of validity evidence collected for the examinee population as a whole should also be collected for each relevant subgroup. Subgroups may be found to differ with respect to appropriateness of test content, internal structure of test responses, the relation of test scores to other variables, or the response processes employed by individual examinees. Any such findings should receive due consideration in the interpretation and use of scores as well as in subsequent test revisions.

Comment: Scores differ in meaning across subgroups when the same score produces systematically different inferences about examinees who are members of different subgroups. In those circumstances where credible research reports differences in score meaning for particular subgroups for the type of test in question, this standard calls for separate, parallel analyses of data for members of those subgroups, sample sizes permitting. Relevant examinee subgroups may be defined by race or ethnicity, culture, language, gender, disability, age, socioeconomic status, or other classifications. Not all forms of evidence can be examined separately for members of all such groups. The validity argument may rely on existing research literature, for example, and such literature may not be available for some populations. For some kinds of evidence, some separate subgroup analyses may not be feasible due to the limited number of cases available.
Data may sometimes be accumulated so that these analyses can be performed after the test has been in use for a period of time. This standard is not satisfied by assuring that such groups are represented within larger, pooled samples, although this may also be important. In giving "due consideration in the interpretation and use of scores," pursuant to this standard, test users should be mindful of legal restrictions that may prohibit or limit within-group scoring and other practices.

Standard 7.2

When credible research reports differences in the effects of construct-irrelevant variance across subgroups of test takers on performance on some part of the test, the test should be used if at all only for those subgroups for which evidence indicates that valid inferences can be drawn from test scores.

Comment: An obvious reason why a test may not measure the same constructs across subgroups is that different components come into play from one subgroup to another. Alternatively, an irrelevant component may have a more significant effect on the performance of examinees in one subgroup than in another. Such intrusive elements are rarely entirely absent for any subgroup but are seldom present to any great extent. The decision whether or not to use a test with any given examinee subgroup necessarily involves a careful analysis of the validity evidence for different subgroups, as called for in Standard 7.1, and the exercise of thoughtful professional judgment regarding the significance of the irrelevant components. A conclusion that a test is not appropriate for a particular subgroup requires an alternative course of action. This may involve a search for a test that can be used for all groups or, in circumstances where it is feasible to use different construct-equivalent tests for different groups, for an alternative test for use in the subgroup for which the intended construct is not well measured by the current test. In some cases multiple tests may be used in combination. In some circumstances, such as employment testing, there may be legal or other constraints on the use of different tests for different subgroups. It is acknowledged that there are occasions where examinees may request or demand to take a version of the test other than that deemed most appropriate by the developer or user. An individual with a disability may decline an alternate form and request the standard form. Acceding to this request, after ensuring that the examinee is fully informed about the test and how it will be used, is not a violation of this standard.

Standard 7.3

When credible research reports that differential item functioning exists across age, gender, racial/ethnic, cultural, disability, and/or linguistic groups in the population of test takers in the content domain measured by the test, test developers should conduct appropriate studies when feasible. Such research should seek to detect and eliminate aspects of test design, content, and format that might bias test scores for particular groups.

Comment: Differential item functioning exists when examinees of equal ability differ, on average, according to their group membership in their responses to a particular item.
In some domains, existing research may indicate that differential item functioning occurs infrequently and does not replicate across samples. In others, research evidence may indicate that differential item functioning occurs reliably at meaningful, above-chance levels for some particular groups; it is to such circumstances that the standard applies. Although it may not be possible prior to first release of a test to study the question of differential item functioning for some such groups, continued operational use of a test may afford opportunities to check for differential item functioning.

Standard 7.4

Test developers should strive to identify and eliminate language, symbols, words, phrases, and content that are generally regarded as offensive by members of racial, ethnic, gender, or other groups, except when judged to be necessary for adequate representation of the domain.

Comment: Two issues are involved. The first deals with the inadvertent use of language that, unknown to the test developer, has a different meaning or connotation in one subgroup than in others. Test publishers often conduct sensitivity reviews of all test material to detect and remove sensitive material from the test. The second deals with settings in which sensitive material is essential for validity. For example, history tests may appropriately include material on slavery or Nazis. Tests on subjects from the life sciences may appropriately include material on evolution. A test of understanding of an organization's sexual harassment policy may require employees to evaluate examples of potentially offensive behavior.

Standard 7.5

In testing applications involving individualized interpretations of test scores other than selection, a test taker's score should not be accepted as a reflection of standing on the characteristic being assessed without consideration of alternate explanations for the test taker's performance on that test at that time.

Comment: Many test manuals point out variables that should be considered in interpreting test scores, such as clinically relevant history, school record, vocational status, and test-taker motivation. Influences associated with variables such as socioeconomic status, ethnicity, gender, cultural background, language, or age may also be relevant. In addition, medication, visual impairments, or other disabilities may affect a test taker's performance on, for example, a paper-and-pencil test of mathematics.

Standard 7.6

When empirical studies of differential prediction of a criterion for members of different subgroups are conducted, they should include regression equations (or an appropriate equivalent) computed separately for each group or treatment under consideration, or an analysis in which the group or treatment variables are entered as moderator variables.

Comment: Correlation coefficients provide inadequate evidence for or against a differential prediction hypothesis if groups or treatments are found not to be approximately equal with respect to both test and criterion means and variances. Considerations of both regression slopes and intercepts are needed. For example, despite equal correlations across groups, differences in intercepts may be found.
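A minimal sketch of the two analyses named in Standard 7.6 follows: separate per-group regressions, and a pooled model in which a group indicator enters as a moderator of both the intercept and the slope. It uses ordinary least squares via numpy; the variable and function names are illustrative assumptions.

    import numpy as np

    def fit_line(test, criterion):
        # Ordinary least squares for criterion = a + b * test within one group.
        X = np.column_stack([np.ones_like(test), test])
        coef, *_ = np.linalg.lstsq(X, criterion, rcond=None)
        return coef                           # [intercept a, slope b]

    def moderated_fit(test, criterion, group):
        # Pooled model: criterion = a + b*test + c*group + d*(group * test).
        # c estimates the intercept difference and d the slope difference;
        # both near zero is consistent with indistinguishable regressions.
        X = np.column_stack([np.ones_like(test), test, group, group * test])
        coef, *_ = np.linalg.lstsq(X, criterion, rcond=None)
        return coef                           # [a, b, c, d]

As the comment notes, comparing correlation coefficients alone would miss exactly the intercept differences that the c and d terms are designed to expose.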
Standard 7.7

In testing applications where the level of linguistic or reading ability is not part of the construct of interest, the linguistic or reading demands of the test should be kept to the minimum necessary for the valid assessment of the intended construct.

Comment: When the intent is to assess ability in mathematics or mechanical comprehension, for example, the test should not contain unusual words or complicated syntactic conventions unrelated to the mathematical or mechanical skill being assessed.

Standard 7.8

When scores are disaggregated and publicly reported for groups identified by characteristics such as gender, ethnicity, age, language proficiency, or disability, cautionary statements should be included whenever credible research reports that test scores may not have comparable meaning across these different groups.

Comment: Comparisons across groups are only meaningful if scores have comparable meaning across groups. The standard is intended as applicable to settings where scores are implicitly or explicitly presented as comparable in score meaning across groups.

Standard 7.9

When tests or assessments are proposed for use as instruments of social, educational, or public policy, the test developers or users proposing the test should fully and accurately inform policymakers of the characteristics of the tests as well as any relevant and credible information that may be available concerning the likely consequences of test use.

Standard 7.10

When the use of a test results in outcomes that affect the life chances or educational opportunities of examinees, evidence of mean test score differences between relevant subgroups of examinees should, where feasible, be examined for subgroups for which credible research reports mean differences for similar tests. Where mean differences are found, an investigation should be undertaken to determine that such differences are not attributable to a source of construct underrepresentation or construct-irrelevant variance. While initially the responsibility of the test developer, the test user bears responsibility for uses with groups other than those specified by the developer.

Comment: Examples of such test uses include situations in which a test plays a dominant role in a decision to grant or withhold a high school diploma or to promote a student or retain a student in grade. Such an investigation might include a review of the cumulative research literature or local studies, as appropriate. In some domains, such as cognitive ability testing in employment, a substantial relevant research base may preclude the need for local studies. In educational settings, as discussed in chapter 13, potential differences in opportunity to learn may be relevant as a possible source of mean differences.

Standard 7.11

When a construct can be measured in different ways that are approximately equal in their degree of construct representation and freedom from construct-irrelevant variance, evidence of mean score differences across relevant subgroups of examinees should be considered in deciding which test to use.

Comment: Mean score differences, while important, are but one factor influencing the choice between one test and another. Cost, testing time, test security, and logistic issues (e.g., an application where very large numbers of examinees must be screened in a very short time) are among the issues also entering into the professional judgment about test use.
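Where Standards 7.10 and 7.11 call for examining mean score differences between subgroups, a standardized summary is often more informative than raw means. The sketch below computes a pooled-standard-deviation effect size; treating this as the appropriate summary is an illustrative assumption on the editor's part, not a prescription of the Standards.

    import numpy as np

    def standardized_mean_difference(scores_a, scores_b):
        # Difference in group means expressed in pooled-standard-deviation units.
        scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
        na, nb = len(scores_a), len(scores_b)
        pooled_var = ((na - 1) * scores_a.var(ddof=1) +
                      (nb - 1) * scores_b.var(ddof=1)) / (na + nb - 2)
        return (scores_a.mean() - scores_b.mean()) / np.sqrt(pooled_var)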
Standard 7.12

The testing or assessment process should be carried out so that test takers receive comparable and equitable treatment during all phases of the testing or assessment process.

Comment: For example, should a person administering a test or interpreting test results recognize a personal bias for or against an examinee, or for or against any subgroup of which the examinee is a member, the person could take a variety of steps, ranging from seeking a review of test interpretations from a colleague to withdrawal from the testing process.

THE RIGHTS AND RESPONSIBILITIES OF TEST TAKERS

Background

This chapter addresses fairness issues unique to the interests of the individual test taker. Fair treatment of test takers is not only a matter of equity; it also promotes the validity and reliability of the inferences made from the test performance. The standards presented in this chapter reflect widely accepted principles in the field of measurement. The standards address the responsibilities of test takers with regard to test security, their access to test results, and their rights when irregularities in their testing are claimed. Other issues of fairness are treated in other chapters: general principles in chapter 7; the testing of linguistic minorities in chapter 9; the testing of persons with disabilities in chapter 10. General considerations concerning reports of test results are covered in chapter 5.

Test takers have the right to be assessed with tests that meet current professional standards, including standards of technical quality, fairness, administration, and reporting of results. Fair and equitable treatment of test takers involves providing, in advance of testing, information about the nature of the test, the intended use of test scores, and the confidentiality of the results. Test takers, or their legal representatives when appropriate, need enough information about the test and the intended use of test results to reach a competent decision about participating in testing. In some instances, formal informed consent for testing is required by law or by other standards of professional practice, such as those governing research on human subjects. The greater the consequences to the test taker, the greater the importance of ensuring that the test taker is fully informed about the test and voluntarily consents to participate, except when testing without consent is permitted by law. If a test is optional, the test taker has the right to know the consequences of taking or not taking the test. The test taker has the right to acceptable opportunities for asking questions or expressing concerns, and may expect timely responses to legitimate questions.

Where consistent with the purposes and nature of the assessment, general information is usually provided about the test's content and purposes. Some programs, in the interests of fairness, provide all test takers with helpful materials, such as study guides, sample questions, or complete sample tests, when such information does not jeopardize the validity of the results from future test administrations. Advice may also be provided about test-taking strategies, including time management and the advisability of omitting an item response, when that is permitted. Information is made known about the availability of special accommodations for those who need them.
The policy on retesting may be stated, in case the test taker feels that the present performance does not appropriately reflect his/her best performance.

As participants in the assessment, test takers have responsibilities as well as rights. Their responsibilities include preparing themselves for the test, following the directions of the test administrator, representing themselves honestly on the test, and informing appropriate persons if they believe the test results do not adequately reflect them. In group testing situations, test takers are expected not to interfere with the performance of other test takers. Test validity rests on the assumption that a test taker has earned fairly a particular score or pass/fail decision. Any form of cheating, or other behavior that reduces the fairness and validity of a test, is irresponsible, is unfair to other test takers, and may lead to sanctions. It is unfair for a test taker to use aids that are prohibited. It is unfair for a test taker to arrange for someone else to take the test in his/her place. The test taker is obligated to respect the copyrights of the test publisher or sponsor in all test materials. This means that the test taker will not reproduce the items without authorization, nor disseminate, in any form, material that is clearly analogous to the reproduction of the items. Test takers, as well as test administrators, have the responsibility not to compromise security by divulging any details of the test items to others, nor may they request such details from others. Failure to honor these responsibilities may compromise the validity of test score interpretations for themselves and for others.

Sometimes, testing programs use special scores, statistical indicators, and other indirect information about irregularities in testing to help ensure that the test scores are obtained fairly. Unusual patterns of responses, large changes in test scores upon retesting, speed of responding, and similar indicators may trigger careful scrutiny of certain testing protocols. The details of these procedures are generally kept secure to avoid compromising their use. However, test takers can be made aware that in special circumstances, such as response or test score anomalies, their test responses may receive special scrutiny. If evidence of impropriety or fraud so warrants, the test taker's score may be canceled, or other action taken.
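One of the indirect indicators mentioned above, a large score change upon retesting, can be screened for mechanically. The sketch below flags retest gains that exceed a critical multiple of the standard error of the difference between two scores; the reliability input and the 2.33 critical value (roughly the 99th percentile of a normal distribution) are illustrative assumptions, not a procedure the Standards prescribe.

    import numpy as np

    def flag_large_gains(first, second, sd, reliability, z_crit=2.33):
        # Standard error of measurement, from the score SD and reliability.
        sem = sd * np.sqrt(1.0 - reliability)
        # Standard error of the difference between two administrations.
        se_diff = sem * np.sqrt(2.0)
        gains = np.asarray(second, dtype=float) - np.asarray(first, dtype=float)
        return gains / se_diff > z_crit   # True marks protocols for scrutiny

A flag of this kind would only trigger the careful human scrutiny the text describes, not any automatic action against the test taker.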
Because these Standards are directed to test providers, and not to test takers, standards about test-taker responsibilities are phrased in terms of providing information to test takers about their rights and responsibilities. Providing this information is the joint responsibility of the test developer, the test administrator, the test proctor, if any, and the test user, and may be apportioned according to particular circumstances.

Standard 8.1

Any information about test content and purposes that is available to any test taker prior to testing should be available to all test takers. Important information should be available free of charge and in accessible formats.

Comment: The intent of this standard is equal treatment for all. Important information would include that necessary for testing, such as when and where the test is given, what material should be brought, the purpose of the test, and so forth. More detailed information, such as practice materials, is sometimes offered for a fee. Such offerings should be made to all test takers.

Standard 8.2

Where appropriate, test takers should be provided, in advance, as much information about the test, the testing process, the intended test use, test scoring criteria, testing policy, and confidentiality protection as is consistent with obtaining valid responses.

Comment: Where appropriate, test takers should be informed, possibly by a test bulletin or similar procedure, about test content, including subject area, topics covered, and item formats. They should be informed about the advisability of omitting responses. They should be aware of any imposed time limits, so that they can manage their time appropriately. General advice should be given about test-taking strategy. In computer administrations, they should be told about any provisions for review of items they have previously answered or omitted. Test takers should understand the intended use of test scores and the confidentiality of test results. They should be advised whether they will have access to their results. They should be informed about the policy concerning taking the test again and about the possibility that some test protocols may receive special scrutiny for security reasons. Test takers should be informed about the consequences of misconduct or improper behavior, such as cheating, that could result in their being prohibited from completing the test, receiving test scores, or other sanctions.

Standard 8.3

When the test taker is offered a choice of test format, information about the characteristics of each format should be provided.

Comment: Test takers sometimes have to choose between a paper-and-pencil administration and a computer-administered test, which may be adaptive. Some tests are offered in several different languages. Sometimes an alternative assessment is offered in lieu of the ordinary test. Test takers need to know the characteristics of each alternative so that they can make an informed choice.

Standard 8.4

Informed consent should be obtained from test takers, or their legal representatives when appropriate, before testing is done except (a) when testing without consent is mandated by law or governmental regulation, (b) when testing is conducted as a regular part of school activities, or (c) when consent is clearly implied.

Comment: Informed consent implies that the test takers or representatives are made aware, in language that they can understand, of the reasons for testing, the type of tests to be used, the intended use, and the range of material consequences of the intended use. If written, video, or audio records are made of the testing session, or other records are kept, test takers are entitled to know what testing information will be released and to whom. Consent is not required when testing is legally mandated, such as a court-ordered psychological assessment, but there may be legal requirements for providing information. When testing is required for employment or for educational admissions, applicants, by applying, have implicitly given consent to the testing. Nevertheless, test takers and/or their legal representatives should be given appropriate information about a test when it is in their interest to be informed. Young test takers should receive an explanation of the reasons for testing.
Even a child as young as two or three, as well as older test takers of limited cognitive ability, can understand a simple explanation as to why they are being tested (such as, "I'm going to ask you to try to do some things so that I can see what you know how to do and what things you could use some more help with").

Standard 8.5

Test results identified by the names of individual test takers, or by other personally identifying information, should be released only to persons with a legitimate, professional interest in the test taker or who are covered by the informed consent of the test taker or a legal representative, unless otherwise required by law.

Comment: Scores of individuals identified by name, or by some other means by which a person can be readily identified, such as social security number, should be kept confidential. In some situations, information may be provided on a confidential basis to other practitioners with a legitimate interest in the particular case, consistent with legal and ethical considerations. Information may be provided to researchers if a test taker's anonymity is maintained and the intended use is consistent with accepted research practice and is not inconsistent with the conditions of the test taker's informed consent.

Standard 8.6

Test data maintained in data files should be adequately protected from improper disclosure. Use of facsimile transmission, computer networks, data banks, and other electronic data processing or transmittal systems should be restricted to situations in which confidentiality can be reasonably assured.

Comment: When facsimile or computer communication is used to transmit a test protocol to another site for scoring, or if scores are similarly transmitted, special provisions should be made to keep the information confidential. See Standard 5.13.

Standard 8.7

Test takers should be made aware that having someone else take the test for them, disclosing confidential test material, or any other form of cheating is inappropriate and that such behavior may result in sanctions.

Comment: Although the standards cannot regulate the behavior of test takers, test takers should be made aware of their personal and legal responsibilities. Arranging for someone else to impersonate the nominal test taker constitutes fraud. Disclosure of confidential testing material for the purpose of giving other test takers pre-knowledge is unfair and may constitute copyright infringement. In licensure and certification tests, such actions may compromise public health and safety. The validity of test score interpretations is compromised by inappropriate test disclosure.

Standard 8.8

When score reporting includes assigning individuals to categories, the categories should be chosen carefully and described precisely. The least stigmatizing labels, consistent with accurate representation, should always be assigned.

Comment: When labels are associated with test results, care should be taken to be precise in the meanings associated with the labels and to avoid unnecessarily stigmatizing consequences associated with a label. For example, in an assessment designed to aid in determining whether an individual is competent to stand trial, the label "incompetent" is appropriate for individuals who perform poorly on the assessment.
However, in a test of basic literacy skills, it is more appropriate to use a label such as "not proficient" rather than "incompetent," because the latter term has a more global and derogatory meaning.

Standard 8.9

When test scores are used to make decisions about a test taker or to make recommendations to a test taker or a third party, the test taker or the legal representative is entitled to obtain a copy of any report of test scores or test interpretation, unless that right has been waived or is prohibited by law or court order.

Comment: In some cases a test taker may be adequately informed when the test report is given to an appropriate third party (treating psychologist or psychiatrist) who can interpret the findings to the test taker. In professional applications of individualized testing, when the test taker is given a copy of the test report, the examiner or a knowledgeable third party should be available to interpret it, even if it is clearly written, as the test taker may misunderstand or raise questions not specifically answered in the report. In employment testing situations, where test results are used solely for the purpose of aiding selection decisions, waivers of access are often a condition of employment, although access to test information may often be appropriately required in other circumstances.

Standard 8.10

In educational testing programs and in licensing and certification applications, when an individual score report is expected to be delayed beyond a brief investigative period, because of possible irregularities such as suspected misconduct, the test taker should be notified, the reason given, and reasonable efforts made to expedite review and to protect the interests of the test taker. The test taker should be notified of the disposition when the investigation is closed.

Standard 8.11

In educational testing programs and in licensing and certification applications, when it is deemed necessary to cancel or withhold a test taker's score because of possible testing irregularities, including suspected misconduct, the type of evidence and procedures to be used to investigate the irregularity should be explained to all test takers whose scores are directly affected by the decision. Test takers should be given a timely opportunity to provide evidence that the score should not be canceled or withheld. Evidence considered in deciding upon the final action should be made available to the test taker on request.

Comment: Any form of cheating, or behavior that reduces the validity and fairness of test results, should be investigated promptly and appropriate action taken. Withholding or canceling a test score may arise because of suspected misconduct by the test taker, or because of some anomaly involving others, such as theft or administrative mishap. An avenue of appeal should be available and made known to candidates whose scores may be amended or withheld. Some testing organizations offer the option of a prompt and free retest or arbitration of disputes.

Standard 8.12

In educational testing programs and in licensing and certification applications, when testing irregularities are suspected, reasonably available information bearing directly on the assessment should be considered, consistent with the need to protect the privacy of test takers.
Comment: Unless allegations of misconduct of the test taker are made by associates, the information to be collected would ordinarily be limited to that obtainable without invading the privacy of the test taker or his/her associates.

Standard 8.13

In educational testing programs and in licensing and certification applications, test takers are entitled to fair consideration and reasonable process, as appropriate to the particular circumstances, in resolving disputes about testing. Test takers are entitled to be informed of any available means of recourse.

Comment: When a test taker's score may be questioned and may be invalidated, or when a test taker seeks a review or revision of his/her score or some other aspect of the testing, scoring, or reporting process, the test taker is entitled to some orderly process for effective input into or review of the decision making of the test administrator or test user. Depending upon the magnitude of the consequences associated with the test, this can range from an internal review of all relevant data by a test administrator, to an informal conversation with an examinee, to a full administrative hearing. The greater the consequences, the greater the extent of procedural protections that should be made available. Test takers should also be made aware of procedures for recourse, fees, expected time for resolution, and any possible consequences for the test taker. Some testing programs advise that the test taker may be represented by an attorney, although possibly at the test taker's expense.
