AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, INC. et al v. PUBLIC.RESOURCE.ORG, INC.
Filing
60
MOTION for Summary Judgment Filed by AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, INC., AMERICAN PSYCHOLOGICAL ASSOCIATION, INC., NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION, INC. (Attachments: #1 Statement of Facts Points of Authority, #2 Statement of Facts Statement of Undisputed Facts, #3 Declaration Declaration of Jonathan Hudis, #4 Exhibit Ex. A, #5 Exhibit Ex. B, #6 Exhibit Ex. C, #7 Exhibit Ex. D, #8 Exhibit Ex. E, #9 Exhibit Ex. F, #10 Exhibit Ex. G, #11 Exhibit Ex. H, #12 Exhibit Ex. I, #13 Exhibit Ex. J, #14 Exhibit Ex. K, #15 Exhibit Ex. L, #16 Exhibit Ex. M, #17 Exhibit Ex. N, #18 Exhibit Ex. O, #19 Exhibit Ex. P, #20 Exhibit Ex. Q, #21 Exhibit Ex. R, #22 Exhibit Ex. S, #23 Exhibit Ex. T, #24 Exhibit Ex. U, #25 Exhibit Ex. V-1, #26 Exhibit Ex. V-2, #27 Exhibit Ex. W, #28 Exhibit Ex. X, #29 Exhibit Ex. Y, #30 Exhibit Ex. Z, #31 Exhibit Ex. AA, #32 Exhibit Ex. BB, #33 Exhibit Ex. CC, #34 Exhibit Ex. DD, #35 Exhibit Ex. EE, #36 Exhibit Ex. FF-1, #37 Exhibit Ex. FF-2, #38 Exhibit Ex. FF-3, #39 Exhibit Ex. FF-4, #40 Exhibit Ex. FF-5, #41 Exhibit Ex. FF-6, #42 Exhibit Ex. GG, #43 Exhibit Ex. HH, #44 Exhibit Ex. II, #45 Exhibit Ex. JJ, #46 Exhibit Ex. KK, #47 Exhibit Ex. LL, #48 Exhibit Ex. MM, #49 Declaration Declaration of Marianne Ernesto, #50 Exhibit Ex. NN, #51 Exhibit Ex. OO, #52 Exhibit Ex. PP, #53 Exhibit Ex. QQ, #54 Exhibit Ex. RR, #55 Exhibit Ex. SS, #56 Exhibit Ex. TT, #57 Exhibit Ex. UU, #58 Exhibit Ex. VV, #59 Exhibit Ex. WW, #60 Exhibit Ex. XX, #61 Exhibit Ex. YY, #62 Exhibit Ex. ZZ, #63 Exhibit Ex. AAA, #64 Exhibit Ex. BBB, #65 Exhibit Ex. CCC, #66 Exhibit Ex. DDD, #67 Exhibit Ex. EEE, #68 Exhibit Ex. FFF, #69 Exhibit Ex. GGG, #70 Exhibit Ex. HHH, #71 Exhibit Ex. III, #72 Exhibit Ex. JJJ, #73 Declaration Declaration of Lauress Wise, #74 Exhibit Ex. KKK, #75 Exhibit Ex. LLL, #76 Declaration Declaration of Wayne Camara, #77 Exhibit Ex. MMM, #78 Declaration Declaration of Felice Levine, #79 Exhibit Ex. NNN, #80 Exhibit Ex. 
OOO (Public Version), #81 Exhibit Ex. PPP, #82 Exhibit Ex. QQQ, #83 Exhibit Ex. RRR, #84 Exhibit Ex. SSS, #85 Exhibit Ex. TTT-1, #86 Exhibit Ex. TTT-2, #87 Exhibit Ex. UUU, #88 Declaration Declaration of Kurt Geisinger, #89 Declaration Declaration of Dianne Schneider, #90 Text of Proposed Order Proposed Order, #91 Certificate of Service Certificate of Service)(Hudis, Jonathan). Added MOTION for Permanent Injunction on 12/22/2015 (td).
EXHIBIT V-1
Case No. 1:14-cv-00857-TSC-DAR
EXHIBIT I
AERA-APA-NCME_0000001
STANDARDS
for educational and psychological testing

American Educational Research Association
American Psychological Association
National Council on Measurement in Education
Copyright © 1999 by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. All rights reserved. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher.

Published by
American Educational Research Association
1430 K St., NW, Suite 1200
Washington, DC 20005

Library of Congress Card number: 99066845
ISBN: 0-935302-25-5
ISBN-13: 978-0-935302-25-7

Printed in the United States of America
First printing in 1999; second, 2002; third, 2004; fourth, 2007; fifth, 2008; and sixth, 2011.

The Standards for Educational and Psychological Testing will be under continuing review by the three sponsoring organizations. Comments and suggestions will be welcome and should be sent to The Committee to Develop Standards for Educational and Psychological Testing in care of the Executive Office, American Psychological Association, 750 First Street, NE, Washington, DC 20002-4242.
Prepared by the
Joint Committee on Standards for Educational and Psychological Testing of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education.
TABLE OF CONTENTS

5. Test Administration, Scoring, and Reporting
Background
Standards 5.1-5.16
8. The Rights and Responsibilities of Test Takers
Background
Standards 8.1-8.13
9. Testing Individuals of Diverse Linguistic Backgrounds
Background
Standards 9.1-9.11
10. Testing Individuals with Disabilities
Background
Standards 10.1-10.12
PART III
TESTING APPLICATIONS
11. The Responsibilities of Test Users
Background
Standards 11.1-11.24
12. Psychological Testing and Assessment
Background
Standards 12.1-12.20
13. Educational Testing and Assessment
Background
Standards 13.1-13.19
14. Testing in Employment and Credentialing
Background
Standards 14.1-14.17
15. Testing in Program Evaluation and Public Policy
Background
Standards 15.1-15.13
GLOSSARY
INDEX
PREFACE
There have been five earlier documents from three sponsoring organizations guiding the development and use of tests. The first of these was Technical Recommendations for Psychological Tests and Diagnostic Techniques, prepared by a committee of the American Psychological Association (APA) and published by that organization in 1954. The second was Technical Recommendations for Achievement Tests, prepared by a committee representing the American Educational Research Association (AERA) and the National Council on Measurement Used in Education (NCMUE) and published by the National Education Association in 1955. The third, which replaced the earlier two, was published by APA in 1966, prepared by a committee representing APA, AERA, and the National Council on Measurement in Education (NCME), and called the Standards for Educational and Psychological Tests and Manuals. The fourth, Standards for Educational and Psychological Tests, was again a collaboration of AERA, APA, and NCME, and was published in 1974. The fifth, Standards for Educational and Psychological Testing, also a joint collaboration, was published in 1985.

In 1991 APA's Committee on Psychological Tests and Assessment suggested the need to revise the 1985 Standards. Representatives of AERA, APA, and NCME met and discussed the revision, principles that should guide that revision, and potential Joint Committee members. By 1993, the presidents of the three organizations had appointed members and the Committee had its first meeting in November, 1993.

The Standards has been developed by a joint committee appointed by AERA, APA, and NCME. Members of the Committee were:

Eva Baker, co-chair
Paul Sackett, co-chair
Lloyd Bond
Leonard Feldt
David Goh
Bert Green
Edward Haertel
Jo-Ida Hansen
Sharon Johnson-Lewis
Suzanne Lane
Joseph Matarazzo
Manfred Meier
Pamela Moss
Esteban Olmedo
Diana Pullin
From 1993 to 1996 Charles Spielberger served on the Committee as co-chair. Each sponsoring organization was permitted to assign up to two liaisons to the Joint Committee's project. Liaisons served as the conduits between the sponsoring organizations and the Joint Committee. APA's liaison from its Committee on Psychological Tests and Assessments changed several times as the membership of the Committee changed.

Liaisons to the Joint Committee:
AERA - William Mehrens
APA - Bruce Bracken, Andrew Czopek, Rodney Lowman, Thomas Oakland
NCME - Daniel Eignor

APA and NCME also had committees who served to monitor the process and keep relevant parties informed.

APA Ad Hoc Committee of the Council of Representatives:
Melba Vasquez
Donald Bersoff
Stephen DeMers
James Farr
Bertram Karon
Nadine Lambert
Charles Spielberger

NCME Standards and Test Use Committee:
Gregory Cizek
Allen Doolittle
Le Ann Gamache
Donald Ross Green
Ellen Julian
Tracy Muenz
Nambury Raju
A management committee was formed at the beginning of this effort. They monitored the financial and administrative arrangements of the project, and advised the sponsoring organizations on such matters.

Management Committee:
Frank Farley, APA
George Madaus, AERA
Wendy Yen, NCME

Staffing for the revision included Dianne Brown Maranto as project director, and Dianne L. Schneider as staff liaison. Wayne J. Camara served as project director from 1993 to 1994. APA's legal counsel conducted the legal review of the Standards. William C. Howell and William Mehrens reviewed the standards for consistency across chapters. Linda Murphy developed the indexing for the book.

The Joint Committee solicited preliminary reviews of some draft chapters from recognized experts. These reviews were primarily solicited for the technical and fairness chapters. Reviewers are listed below:

Marvin Alkin
Philip Bashook
Bruce Bloxom
Jeffery P. Braden
Robert L. Brennan
John Callender
Ronald Cannella
Lee J. Cronbach
James Cummins
John Fremer
Kurt F. Geisinger
Robert M. Guion
Walter Haney
Patti L. Harrison
Richard Jeanneret
Gerald P. Koocher
Frank Landy
Ellen Lent
Robert Linn
Theresa C. Liu
Stanford von Mayrhauser
Milbrey W. McLaughlin
Samuel Messick
Craig N. Mills
Robert J. Mislevy
Kevin R. Murphy
Mary Anne Nester
Maria Pennock-Roman
Carole Perlman
Michael Rosenfeld
Jonathan Sandoval
Cynthia B. Schmeiser
Kara Schmitt
Neal Schmitt
Richard J. Shavelson
Lorrie A. Shepard
Mark E. Swerdlik
Janet Wall
Anthony R. Zara
Draft versions of the Standards were widely distributed for public review and comment three times during this revision effort, providing the Committee with a total of nearly 8,000 pages of comments. Organizations who submitted comments on drafts are listed below. Many individuals contributed to the input from each organization, and although we wish we could acknowledge every individual who had input, our information is incomplete as to who contributed to each organization's response. The Joint Committee could not have completed its task without the thoughtful reviews of so many professionals.

Sponsoring Associations
American Educational Research Association (AERA)
American Psychological Association (APA)
National Council on Measurement in Education (NCME)
Membership Organizations (Scientific, Professional, Trade & Advocacy)
American Association for Higher Education (AAHE)
American Board of Medical Specialties (ABMS)
American Counseling Association (ACA)
American Evaluation Association (AEA)
American Occupational Therapy Association
American Psychological Society (APS)
APA Division of Counseling Psychology (Division 17)
APA Division of Developmental Psychology (Division 7)
APA Division of Evaluation, Measurement, and Statistics (Division 5)
APA Division of Mental Retardation & Developmental Disabilities (Division 33)
APA Division of Pharmacology & Substance Abuse (Division 28)
APA Division of Rehabilitation Psychology (Division 22)
APA Division of School Psychology (Division 16)
Asian American Psychological Association (AAPA)
Association for Assessment in Counseling (AAC)
Association of Test Publishers (ATP)
Australian Council for Educational Research Limited (ACER)
Chicago Industrial/Organizational Psychologists (CIOP)
Council on Licensure, Enforcement, and Regulation (CLEAR), Examination Resources & Advisory Committee (ERAC)
Equal Employment Advisory Council (EEAC)
Foundation for Rehabilitation Certification, Education and Research
Human Sciences Research Council, South Africa
International Association for Cross-Cultural Psychology (IACCP)
International Brotherhood of Electrical Workers
International Language Testing Association
International Personnel Management Association Assessment Council (IPMAAC)
Joint Committee on Testing Practices (JCTP)
National Association for the Advancement of Colored People (NAACP), Legal Defense and Educational Fund, Inc.
National Center for Fair and Open Testing (FairTest)
National Organization for Competency Assurance (NOCA)
Personnel Testing Council of Metropolitan Washington (PTC/MW)
Personnel Testing Council of Southern California (PTC/SC)
Society for Human Resource Management (SHRM)
Society of Indian Psychologists (SIP)
Society for Industrial and Organizational Psychology (APA Division 14)
Society for the Psychological Study of Ethnic Minority Issues (APA Division 45)
State Collaborative on Assessment & Student Standards Technical Guidelines for Performance Assessment Consortium (TGPA)
Telecommunications Staffing Forum
Western Region Intergovernmental Personnel Assessment Council (WRIPAC)
Credentialing Boards
American Board of Physical and Medical Rehabilitation
American Medical Technologists
Commission on Rehabilitation Counselor Certification
National Board for Certified Counselors (NBCC)
National Board of Examiners in Optometry
National Board of Medical Examiners
National Council of State Boards of Nursing
Government and Federal Agencies
Army Research Institute (ARI)
California Highway Patrol, Personnel and Training Division, Selection Research Program
City of Dallas, Civil Service Department
Commonwealth of Virginia, Department of Education
Defense Manpower Data Center (DMDC), Personnel Testing Division
Department of Defense (DOD), Office of the Assistant Secretary of Defense
Department of Education, Office of Educational Improvement, National Center for Education Statistics
Department of Justice, Immigration and Naturalization Service (INS)
Department of Labor, Employment and Training Administration (DOL/ETA)
U.S. Equal Employment Opportunity Commission (EEOC)
U.S. Office of Personnel Management (OPM), Personnel Resources & Development Center
Test Publishers/Developers
American College Testing (ACT)
CTB/McGraw-Hill
The College Board
Educational Testing Service (ETS)
Highland Publishing Company
Institute for Personality & Ability Testing (IPAT)
Professional Examination Service (PES)

Other Institutions
Center for Creative Leadership
Gallaudet University National Task Force on Equity in Testing Deaf Professionals
University of Haifa, Israeli Group
Kansas State University
National Center on Educational Outcomes (NCEO)
Pennsylvania State University
University of North Carolina - Charlotte
University of Southern Mississippi, Department of Psychology
When the Joint Committee completed its task of revising the Standards, it then submitted its work to the three sponsoring organizations for approval. Each organization had its own governing body and mechanism for approval, as well as definitions for what their approval means.

AERA: This endorsement carries with it the understanding that, in general, we believe the Standards to represent the current consensus among recognized professionals regarding expected measurement practice. Developers, sponsors, publishers, and users of tests should observe these Standards.

APA: The APA's approval of the Standards means the Council adopts the document as APA policy.

NCME: NCME endorses the Standards for Educational and Psychological Testing and recognizes that the intent of these Standards is to promote sound and responsible measurement practice. This endorsement carries with it a professional imperative for NCME members to attend to the Standards.

Although the Standards is prescriptive, the Standards itself does not contain enforcement mechanisms. These standards were formulated with the intent of being consistent with other standards, guidelines, and codes of conduct published by the three sponsoring organizations, and listed below. The reader is encouraged to obtain these documents, some of which have references to testing and assessment in specific applications or settings.

The Joint Committee on the
Standards for Educational and
Psychological Testing
References

American Educational Research Association. (June, 1992). Ethical Standards of the American Educational Research Association. Washington, DC: Author.

American Federation of Teachers, National Council on Measurement in Education, & National Education Association. (1990). Standards for Teacher Competence in Educational Assessment of Students. Washington, DC: National Council on Measurement in Education.

American Psychological Association. (December, 1992). Ethical Principles of Psychologists and Code of Conduct. American Psychologist, 47(12), 1597-1611.

Joint Committee on Testing Practices. (1988). Code of Fair Testing Practices in Education. Washington, DC: American Psychological Association.

National Council on Measurement in Education. (1995). Code of Professional Responsibilities in Educational Measurement. Washington, DC: Author.
INTRODUCTION
Educational and psychological testing and assessment are among the most important contributions of behavioral science to our society, providing fundamental and significant improvements over previous practices. Although not all tests are well-developed nor are all testing practices wise and beneficial, there is extensive evidence documenting the effectiveness of well-constructed tests for uses supported by validity evidence. The proper use of tests can result in wiser decisions about individuals and programs than would be the case without their use and also can provide a route to broader and more equitable access to education and employment. The improper use of tests, however, can cause considerable harm to test takers and other parties affected by test-based decisions. The intent of the Standards is to promote the sound and ethical use of tests and to provide a basis for evaluating the quality of testing practices.

Participants in the Testing Process

Educational and psychological testing and assessment involve and significantly affect individuals, institutions, and society as a whole. The individuals affected include students, parents, teachers, educational administrators, job applicants, employees, clients, patients, supervisors, executives, and evaluators, among others. The institutions affected include schools, colleges, businesses, industry, clinics, and government agencies. Individuals and institutions benefit when testing helps them achieve their goals. Society, in turn, benefits when testing contributes to the achievement of individual and institutional goals.

The interests of the various parties involved in the testing process are usually, but not always, congruent. For example, when a test is given for counseling purposes or for job placement, the interests of the individual and the institution often coincide. In contrast, when a test is used to select from among many individuals for a highly competitive job or for entry into an educational or training program, the preferences of an applicant may be inconsistent with those of an employer or admissions officer. Similarly, when testing is mandated by a court, the interests of the test taker may be different from those of the party requesting the court order.

There are many participants in the testing process, including, among others: (a) those who prepare and develop the test; (b) those who publish and market the test; (c) those who administer and score the test; (d) those who use the test results for some decision-making purpose; (e) those who interpret test results for clients; (f) those who take the test by choice, direction, or necessity; (g) those who sponsor tests, which may be boards that represent institutions or governmental agencies that contract with a test developer for a specific instrument or service; and (h) those who select or review tests, evaluating their comparative merits or suitability for the uses proposed. These roles are sometimes combined and sometimes further divided. For example, in clinics the test taker is typically the intended beneficiary of the test results. In some situations the test administrator is an agent of the test developer, and sometimes the test administrator is also the test user. When an industrial organization prepares its own employment tests, it is both the developer and the user. Sometimes a test is developed by a test author but published, advertised, and distributed by an independent publisher, though the publisher may play an active role in the test development. Given this intermingling of roles, it is difficult to assign precise responsibility for addressing various standards to specific participants in the testing process.

This document begins with a series of chapters on the test development process, which focus primarily on the responsibilities of test developers, and then turns to chapters on specific uses and applications, which focus primarily on responsibilities of test users. One chapter is devoted specifically to the rights and responsibilities of test takers.
The Standards is based on the premise that effective testing and assessment require that all participants in the testing process possess the knowledge, skills, and abilities relevant to their role in the testing process, as well as awareness of personal and contextual factors that may influence the testing process. They also should obtain any appropriate supervised experience and legislatively mandated practice credentials necessary to perform competently those aspects of the testing process in which they engage. For example, test developers and those selecting and interpreting tests need adequate knowledge of psychometric principles such as validity and reliability.

The Purpose of the Standards

The purpose of publishing the Standards is to provide criteria for the evaluation of tests, testing practices, and the effects of test use. Although the evaluation of the appropriateness of a test or testing application should depend heavily on professional judgment, the Standards provides a frame of reference to assure that relevant issues are addressed. It is hoped that all professional test developers, sponsors, publishers, and users will adopt the Standards and encourage others to do so.

The Standards makes no attempt to provide psychometric answers to questions of public policy regarding the use of tests. In general, the Standards advocates that, within feasible limits, the relevant technical information be made available so that those involved in policy debate may be fully informed.

Categories of Standards

The 1985 Standards designated each standard as "primary" (to be met by all tests before operational use), "secondary" (desirable, but not feasible in certain situations), or "conditional" (importance varies with application). The present Standards continues the tradition of expecting test developers and users to consider all standards before operational use; however, the Standards does not continue the practice of designating levels of importance. Instead, the text of each standard, and any accompanying commentary, discusses the conditions under which a standard is relevant. It was not the case that under the 1985 Standards test developers and users were obligated to attend only to the primary standards. Rather, the term "conditional" meant that a standard was primary in some settings and secondary in others, thus requiring careful consideration of the applicability of each standard for a given setting.

The absence of designations such as "primary" or "conditional" should not be taken to imply that all standards are equally significant in any given situation. Depending on the context and purpose of test development or use, some standards will be more salient than others. Moreover, some standards are broad in scope, setting forth concerns or requirements relevant to nearly all tests or testing contexts, and other standards are narrower in scope. However, all standards are important in the contexts to which they apply. Any classification that gives the appearance of elevating the general importance of some standards over others could invite neglect of some standards that need to be addressed in particular situations.

Further, the current Standards does not include standards considered secondary or "desirable." The continued use of the secondary designation would risk encouraging both the expansion of the Standards to encompass large numbers of "desirable" standards and the inappropriate assumption that any guideline not included in the Standards as at least "secondary" was inconsequential.
Unless otherwise specified in the standard or commentary, and with the caveats outlined below, standards should be met before operational test use. This means that each standard should be carefully considered to determine its applicability to the testing context under consideration. In a given case there may be a sound professional reason why adherence to the standard is unnecessary. It is also possible that there may be occasions when technical feasibility may influence whether a standard can be met prior to operational test use. For example, some standards may call for analyses of data that may not be available at the point of initial operational test use. If test developers, users, and, when applicable, sponsors have deemed a standard to be inapplicable or unfeasible, they should be able, if called upon, to explain the basis for their decision. However, there is no expectation that documentation be routinely available of the decisions related to each standard.
Tests and Test Uses to Which These Standards Apply
A test is an evaluative device or procedure in which a sample of an examinee's behavior in a specified domain is obtained and subsequently evaluated and scored using a standardized process. While the label test is ordinarily reserved for instruments on which responses are evaluated for their correctness or quality, and the terms scale or inventory are used for measures of attitudes, interest, and dispositions, the Standards uses the single term test to refer to all such evaluative devices.

A distinction is sometimes made between test and assessment. Assessment is a broader term, commonly referring to a process that integrates test information with information from other sources (e.g., information from the individual's social, educational, employment, or psychological history). The applicability of the Standards to an evaluation device or method is not altered by the label applied to it (e.g., test, assessment, scale, inventory).

Tests differ on a number of dimensions: the mode in which test materials are presented (paper and pencil, oral, computerized administration, and so on); the degree to which stimulus materials are standardized; the type of response format (selection of a response from a set of alternatives as opposed to the production of a response); and the degree to which test materials are designed to reflect or simulate a particular context. In all cases, however, tests standardize the process by which test-taker responses to test materials are evaluated and scored. As noted in prior versions of the Standards, the same general types of information are needed for all varieties of tests.
The precise demarcation between those measurement devices used in the fields of educational and psychological testing that do and do not fall within the purview of the Standards is difficult to identify. Although the Standards applies most directly to standardized measures generally recognized as "tests," such as measures of ability, aptitude, achievement, attitudes, interests, personality, cognitive functioning, and mental health, it may also be usefully applied in varying degrees to a broad range of less formal assessment techniques. Admittedly, it will generally not be possible to apply the Standards rigorously to unstandardized questionnaires or to the broad range of unstructured behavior samples used in some forms of clinic- and school-based psychological assessment (e.g., an intake interview), and to instructor-made tests that are used to evaluate student performance in education and training. It is useful to distinguish between devices that lay claim to the concepts and techniques of the field of educational and psychological testing and those which represent nonstandardized or less standardized aids to day-to-day evaluative decisions. Although the principles and concepts underlying the Standards can be fruitfully applied to day-to-day decisions, such as when a business owner interviews a job applicant, a manager evaluates the performance of subordinates, or a coach evaluates a prospective athlete, it would be overreaching to expect that the standards of the educational and psychological testing field be followed by those making such decisions. In contrast, a structured interviewing system developed by a psychologist and accompanied by claims that the system has been found to be predictive of job performance in a variety of other settings falls within the purview of the Standards.

Cautions to be Exercised in Using the Standards

Several cautions are important to avoid misinterpreting the Standards:

1) Evaluating the acceptability of a test or test application does not rest on the literal satisfaction of every standard in this document, and acceptability cannot be determined by using a checklist. Specific circumstances affect the importance of individual standards, and individual standards should not be considered in isolation. Therefore, evaluating acceptability involves (a) professional judgment that is based on a knowledge of behavioral science, psychometrics, and the community standards in the professional field to which the tests apply; (b) the degree to which the intent of the standard has been satisfied by the test developer and user; (c) the alternatives that are readily available; and (d) research and experiential evidence regarding feasibility of meeting the standard.

2) When tests are at issue in legal proceedings and other venues requiring expert witness testimony, it is essential that professional judgment be based on the accepted corpus of knowledge in determining the relevance of particular standards in a given situation. The intent of the Standards is to offer guidance for such judgments.

3) Claims by test developers or test users that a test, manual, or procedure satisfies or follows these standards should be made with care. It is appropriate for developers or users to state that efforts were made to adhere to the Standards, and to provide documents describing and supporting those efforts. Blanket claims without supporting evidence should not be made.

4) These standards are concerned with a field that is evolving. Consequently, there is a continuing need to monitor changes in the field and to revise this document as knowledge develops.

5) Prescription of the use of specific technical methods is not the intent of the Standards. For example, where specific statistical reporting requirements are mentioned, the phrase "or generally accepted equivalent" always should be understood.
The standards do not attempt to repeat or to incorporate the many legal or regulatory requirements that might be relevant to the issues they address. In some areas, such as the collection, analysis, and use of test data and results for different subgroups, the law may both require participants in the testing process to take certain actions and prohibit those participants from taking other actions. Where it is apparent that one or more standards or comments address an issue on which established legal requirements may be particularly relevant, the standard, comment, or introductory material may make note of that fact. Lack of specific reference to legal requirements, however, does not imply that no relevant requirement exists. In all situations, participants in the testing process should separately consider and, where appropriate, obtain legal advice on legal and regulatory requirements.
The Number of Standards

The number of standards has increased from the 1985 Standards for a variety of reasons. First, and most importantly, new developments have led to the addition of new standards. Commonly these deal with new types of tests or new uses for existing tests, rather than being broad standards applicable to all tests. Second, on the basis of recognition that some users of the Standards may turn only to chapters directly relevant to a given application, certain standards are repeated in different chapters. When such repetition occurs, the essence of the standard is the same. Only the wording, area of application, or elaboration in the comment is changed. Third, standards dealing with important nontechnical issues, such as avoiding conflicts of interest and equitable treatment of all test takers, have been added. Although such topics have not been addressed in prior versions of the Standards, they are not likely to be viewed as imposing burdensome new requirements. Thus the increase in the number of standards does not per se signal an increase in the obligations placed on test developers and test users.

AERA_APA_NCME_0000015

Tests as Measures of Constructs

We depart from some historical uses of the term "construct," which reserve the term for characteristics that are not directly observable, but which are inferred from interrelated sets of observations. This historical perspective invites confusion. Some tests are viewed as measures of constructs, while others are not. In addition, considerable debate has ensued as to whether certain characteristics measured by tests are properly viewed as constructs. Furthermore, the types of validity evidence thought to be suitable can differ as a result of whether a given test is viewed as measuring a construct.

We use the term construct more broadly as the concept or characteristic that a test is designed to measure. Rarely, if ever, is there a single possible meaning that can be attached to a test score or a pattern of test responses. Thus, it is always incumbent on a testing professional to specify the construct interpretation that will be made on the basis of the score or response pattern. The notion that some tests are not under the purview of the Standards because they do not measure constructs is contrary to this use of the term. Also, as detailed in chapter 1, evolving conceptualizations of the concept of validity no longer speak of different types of validity but speak instead of different lines of validity evidence, all in service of providing information relevant to a specific intended interpretation of test scores. Thus, many lines of evidence can contribute to an understanding of the construct meaning of test scores.

Organization of This Volume

Part I of the Standards, "Test Construction, Evaluation, and Documentation," contains standards for validity (ch. 1); reliability and errors of measurement (ch. 2); test development and revision (ch. 3); scaling, norming, and score comparability (ch. 4); test administration, scoring, and reporting (ch. 5); and supporting documentation for tests (ch. 6). Part II addresses "Fairness in Testing," and contains standards on fairness and bias (ch. 7); the rights and responsibilities of test takers (ch. 8); testing individuals of diverse linguistic backgrounds (ch. 9); and testing individuals with disabilities (ch. 10). Part III treats specific "Testing Applications," and contains standards involving general responsibilities of test users (ch. 11); psychological testing and assessment (ch. 12); educational testing and assessment (ch. 13); testing in employment and credentialing (ch. 14); and testing in program evaluation and public policy (ch. 15).

Each chapter begins with introductory text that provides background for the standards that follow. This revision of the Standards contains more extensive introductory text material than its predecessor. Recognizing the common use of the Standards in the education of future test developers and users, the committee opted to provide a context for the standards themselves by presenting more background material than in previous versions. This text is designed to assist in the interpretation of the standards that follow in each chapter. Although the text is at times prescriptive and exhortatory, it should not be interpreted as imposing additional standards.

The Standards also contains an index and includes a glossary that provides definitions for terms as they are specifically used in this volume.
PART I

Test Construction, Evaluation, and Documentation
1. VALIDITY
Background

Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing and evaluating tests. The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores required by proposed uses that are evaluated, not the test itself. When test scores are used or interpreted in more than one way, each intended interpretation must be validated.
Validation logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use. The proposed interpretation refers to the construct or concept the test is intended to measure. Examples of constructs are mathematics achievement, performance as a computer technician, depression, and self-esteem. To support test development, the proposed interpretation is elaborated by describing its scope and extent and by delineating the aspects of the construct that are to be represented. The detailed description provides a conceptual framework for the test, delineating the knowledge, skills, abilities, processes, or characteristics to be assessed. The framework indicates how this representation of the construct is to be distinguished from other constructs and how it should relate to other variables.

The conceptual framework is partially shaped by the ways in which test scores will be used. For instance, a test of mathematics achievement might be used to place a student in an appropriate program of instruction, to endorse a high school diploma, or to inform a college admissions decision. Each of these uses implies a somewhat different interpretation of the mathematics achievement test scores: that a student will benefit from a particular instructional intervention, that a student has mastered a specified curriculum, or that a student is likely to be successful with college-level work. Similarly, a test of self-esteem might be used for psychological counseling, to inform a decision about employment, or for the basic scientific purpose of elaborating the construct of self-esteem. Each of these potential uses shapes the specified framework and the proposed interpretation of the test's scores and also has implications for test development and evaluation.
Validation can be viewed as developing a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use. The conceptual framework points to the kinds of evidence that might be collected to evaluate the proposed interpretation in light of the purposes of testing. As validation proceeds, and new evidence about the meaning of a test's scores becomes available, revisions may be needed in the test, in the conceptual framework that shapes it, and even in the construct underlying the test.
The wide variety of tests and circumstances makes it natural that some types of evidence will be especially critical in a given case, whereas other types will be less useful. The decision about what types of evidence are important for validation in each instance can be clarified by developing a set of propositions that support the proposed interpretation for the particular purpose of testing. For instance, when a mathematics achievement test is used to assess readiness for an advanced course, evidence for the following propositions might be deemed necessary: (a) that certain skills are prerequisite for the advanced course; (b) that the content domain of the test is consistent with these prerequisite skills; (c) that test scores can be generalized across relevant sets of items; (d) that test scores are not unduly influenced by ancillary variables, such as writing ability; (e) that success in the advanced course can be validly assessed; and (f) that examinees with high scores on the test will be more successful in the advanced course than examinees with low scores on the test. Examples of propositions in other testing contexts might include, for instance, the proposition that examinees with high general anxiety scores experience significant anxiety in a range of settings, the proposition that a child's score on an intelligence scale is strongly related to the child's academic performance, or the proposition that a certain pattern of scores on a neuropsychological battery indicates impairment characteristic of brain injury. The validation process evolves as these propositions are articulated and evidence is gathered to evaluate their soundness.
Identifying the propositions implied by a proposed test interpretation can be facilitated by considering rival hypotheses that may challenge the proposed interpretation. It is also useful to consider the perspectives of different interested parties, existing experience with similar tests and contexts, and the expected consequences of the proposed test use. Plausible rival hypotheses can often be generated by considering whether a test measures less or more than its proposed construct. Such concerns are referred to as construct underrepresentation and construct-irrelevant variance.

Construct underrepresentation refers to the degree to which a test fails to capture important aspects of the construct. It implies a narrowed meaning of test scores because the test does not adequately sample some types of content, engage some psychological processes, or elicit some ways of responding that are encompassed by the intended construct. Take, for example, a test of reading comprehension intended to measure children's ability to read and interpret stories with understanding. A particular test might underrepresent the intended construct because it did not contain a sufficient variety of reading passages or ignored a common type of reading material. As another example, a test of anxiety might measure only physiological reactions and not emotional, cognitive, or situational components.

Construct-irrelevant variance refers to the degree to which test scores are affected by processes that are extraneous to its intended construct. The test scores may be systematically influenced to some extent by components that are not part of the construct. In the case of a reading comprehension test, construct-irrelevant components might include an emotional reaction to the test content, familiarity with the subject matter of the reading passages on the test, or the writing skill needed to compose a response. Depending on the detailed definition of the construct, vocabulary knowledge or reading speed might also be irrelevant components. On a test of anxiety, a response bias to underreport anxiety might be considered a source of construct-irrelevant variance.

Nearly all tests leave out elements that some potential users believe should be measured and include some elements that some potential users consider inappropriate. Validation involves careful attention to possible distortions in meaning arising from inadequate representation of the construct and also to aspects of measurement such as test format, administration conditions, or language level that may materially limit or qualify the interpretation of test scores. That is, the process of validation may lead to revisions in the test, the conceptual framework of the test, or both. The revised test would then need validation.

When propositions have been identified that would support the proposed interpretation of test scores, validation can proceed by developing empirical evidence, examining relevant literature, and/or conducting logical analyses to evaluate each of these propositions. Empirical evidence may include both local evidence, produced within the contexts where the test will be used, and evidence from similar testing applications in other settings. Use of existing evidence from similar tests and contexts can enhance the quality of the validity argument, especially when current data are limited.
Because a validity argument typically depends on more than one proposition, strong evidence in support of one in no way diminishes the need for evidence to support others. For example, a strong predictor-criterion relationship in an employment setting is not sufficient to justify test use for selection without considering the appropriateness and meaningfulness of the criterion measure. Professional judgment guides decisions regarding the specific forms of evidence that can best support the intended interpretation and use. As in all scientific endeavors, the quality of the evidence is primary. A few lines of solid evidence regarding a particular proposition are better than numerous lines of evidence of questionable quality.

Validation is the joint responsibility of the test developer and the test user. The test developer is responsible for furnishing relevant evidence and a rationale in support of the intended test use. The test user is ultimately responsible for evaluating the evidence in the particular setting in which the test is to be used. When the use of a test differs from that supported by the test developer, the test user bears special responsibility for validation. The standards apply to the validation process, for which the appropriate parties share responsibility. It should be noted that important contributions to the validity evidence are made as other researchers report findings of investigations that are related to the meaning of scores on the test.

Sources of Validity Evidence

The following sections outline various sources of evidence that might be used in evaluating a proposed interpretation of test scores for particular purposes. These sources of evidence may illuminate different aspects of validity, but they do not represent distinct types of validity. Validity is a unitary concept. It is the degree to which all the accumulated evidence supports the intended interpretation of test scores for the proposed purpose. Like the 1985 Standards, this edition refers to types of validity evidence, rather than distinct types of validity. To emphasize this distinction, the treatment that follows does not follow traditional nomenclature (i.e., the use of the terms content validity or predictive validity). The glossary contains definitions of the traditional terms, explicating the difference between traditional and current use.

Evidence Based on Test Content

Important validity evidence can be obtained from an analysis of the relationship between a test's content and the construct it is intended to measure. Test content refers to the themes, wording, and format of the items, tasks, or questions on a test, as well as the guidelines for procedures regarding administration and scoring. Test developers often work from a specification of the content domain. The content specification carefully describes the content in detail, often with a classification of areas of content and types of items. Evidence based on test content can include logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores. Evidence based on content can also come from expert judgments of the relationship between parts of the test and the construct. For example, in developing a licensure test, the major facets of the specific occupation can be specified, and experts in that occupation can be asked to assign test items to the categories defined by those facets. They, or other qualified experts, can then judge the representativeness of the chosen set of items. Sometimes rules or algorithms can be constructed to select or generate items that differ systematically on the various facets of content, according to specifications.
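The kind of rule-based selection just described can be sketched in code. The snippet below is a minimal illustration, not a procedure prescribed by the Standards; the item bank, facet names, and blueprint counts are all invented for the example. Each item is tagged with a content facet, and items are sampled at random within each facet to match a target blueprint.

```python
import random

def select_items(bank, blueprint, seed=0):
    """Randomly select items so the test matches a content blueprint.

    bank: list of (item_id, facet) pairs, each item tagged by content facet.
    blueprint: dict mapping facet -> number of items required.
    """
    rng = random.Random(seed)
    chosen = []
    for facet, n_needed in blueprint.items():
        pool = [item_id for item_id, f in bank if f == facet]
        if len(pool) < n_needed:
            raise ValueError(f"not enough items for facet {facet!r}")
        chosen.extend(rng.sample(pool, n_needed))
    return chosen

# Hypothetical bank: six algebra items and six geometry items.
bank = [(f"A{i}", "algebra") for i in range(6)] + \
       [(f"G{i}", "geometry") for i in range(6)]
test = select_items(bank, {"algebra": 3, "geometry": 2})
print(test)  # five item ids: 3 algebra, 2 geometry
```

A real blueprint would typically cross several facets (content area, cognitive level, item format), but the same stratified-sampling idea applies.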
Some tests are based on systematic observations of behavior. For example, a listing of the tasks comprising a job domain may be developed from observations of behavior in a job, together with judgments of subject-matter experts. Expert judgments can be used to assess the relative importance, criticality, and/or frequency of the various tasks. A job sample test can then be constructed from a random or stratified sampling of tasks rated highly on these characteristics. The test can then be administered under standardized conditions in an off-the-job setting.

The appropriateness of a given content domain is related to the specific inferences to be made from test scores. Thus, when considering an available test for a purpose other than that for which it was first developed, it is especially important to evaluate the appropriateness of the original content domain for the proposed new use. In educational program evaluations, for example, tests may properly cover material that receives little or no attention in the curriculum, as well as that toward which instruction is directed. Policymakers can then evaluate student achievement with respect to both content neglected and content addressed. On the other hand, when student mastery of a delivered curriculum is tested for purposes of informing decisions about individual students, such as promotion or graduation, the framework elaborating a content domain is appropriately limited to what students have had an opportunity to learn from the curriculum as delivered.

Evidence about content can be used, in part, to address questions about differences in the meaning or interpretation of test scores across relevant subgroups of examinees. Of particular concern is the extent to which construct underrepresentation or construct-irrelevant components may give an unfair advantage or disadvantage to one or more subgroups of examinees. Careful review of the construct and test content domain by a diverse panel of experts may point to potential sources of irrelevant difficulty (or easiness) that require further investigation.

Evidence Based on Response Processes

Theoretical and empirical analyses of the response processes of test takers can provide evidence concerning the fit between the construct and the detailed nature of performance or response actually engaged in by examinees. For instance, if a test is intended to assess mathematical reasoning, it becomes important to determine whether examinees are, in fact, reasoning about the material given instead of following a standard algorithm. For another instance, scores on a scale intended to assess the degree of an individual's extroversion or introversion should not be strongly influenced by social conformity.

Evidence based on response processes generally comes from analyses of individual responses. Questioning test takers about their performance strategies or responses to particular items can yield evidence that enriches the definition of a construct. Maintaining records that monitor the development of a response to a writing task, through successive written drafts or electronically monitored revisions, for instance, also provides evidence of process. Documentation of other aspects of performance, like eye movements or response times, may also be relevant to some constructs. Inferences about processes involved in performance can also be developed by analyzing the relationship among parts of the test and between the test and other variables. Wide individual differences in process can be revealing and may lead to reconsideration of certain test formats.

Evidence of response processes can contribute to questions about differences in meaning or interpretation of test scores across relevant subgroups of examinees. Process studies involving examinees from different subgroups can assist in determining the extent to which capabilities irrelevant or ancillary to the construct may be differentially influencing their performance.
Studies of response processes are not limited to the examinee. Assessments often rely on observers or judges to record and/or evaluate examinees' performances or products. In such cases, relevant validity evidence includes the extent to which the processes of observers or judges are consistent with the intended interpretation of scores. For instance, if judges are expected to apply particular criteria in scoring examinees' performances, it is important to ascertain whether they are, in fact, applying the appropriate criteria and not being influenced by factors that are irrelevant to the intended interpretation. Thus, validation may include empirical studies of how observers or judges record and evaluate data along with analyses of the appropriateness of these processes to the intended interpretation or construct definition.
Evidence Based on Internal Structure

Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based. The conceptual framework for a test may imply a single dimension of behavior, or it may posit several components that are each expected to be homogeneous, but that are also distinct from each other. For example, a measure of discomfort on a health survey might assess both physical and emotional health. The extent to which item interrelationships bear out the presumptions of the framework would be relevant to validity.

The specific types of analysis and their interpretation depend on how the test will be used. For example, if a particular application posited a series of test components of increasing difficulty, empirical evidence of the extent to which response patterns conformed to this expectation would be provided. A theory that posited unidimensionality would call for evidence of item homogeneity. In this case, the item interrelationships also provide an estimate of score reliability, but such an index would be inappropriate for tests with a more complex internal structure.
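One widely used index derived from item interrelationships is coefficient alpha. The sketch below is a minimal illustration with made-up response data, not an analysis mandated by the text; as the passage notes, such an index is appropriate only when the test is plausibly unidimensional.

```python
def cronbach_alpha(scores):
    """Coefficient alpha from examinee rows (one score per item per row).

    alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores),
    using population variances (divide by n).
    """
    k = len(scores[0])  # number of items

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([row[j] for row in scores]) for j in range(k)]
    total_var = var([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Made-up 0/1 responses for four examinees on four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]
print(round(cronbach_alpha(responses), 3))  # 0.667 for this toy matrix
```

With a genuinely multidimensional test, a single alpha over all items would understate or misstate reliability, which is the point the text makes about more complex internal structures.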
Some studies of the internal structure of tests are designed to show whether particular items may function differently for identifiable subgroups of examinees. Differential item functioning occurs when different groups of examinees with similar overall ability, or similar status on an appropriate criterion, have, on average, systematically different responses to a particular item. This issue is discussed in chapters 3 and 7. However, differential item functioning is not always a flaw or weakness. Subsets of items that have a specific characteristic in common (e.g., specific content, task representation) may function differently for different groups of similarly scoring examinees. This indicates a kind of multidimensionality that may be unexpected or may conform to the test framework.
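A common statistical screen for differential item functioning of the kind described above is the Mantel-Haenszel common odds ratio, which compares groups after matching examinees on total score. The sketch below uses fabricated counts purely for illustration; the Standards does not prescribe this or any particular method.

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across matched score strata.

    Each stratum is a tuple (a, b, c, d):
      a = reference-group correct,  b = reference-group incorrect,
      c = focal-group correct,      d = focal-group incorrect.
    A value near 1.0 suggests the item behaves similarly for both
    groups once overall score level is held constant.
    """
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

# Fabricated counts for three total-score strata (low, middle, high).
strata = [
    (30, 20, 25, 25),   # low scorers
    (40, 10, 35, 15),   # middle scorers
    (45, 5, 44, 6),     # high scorers
]
print(round(mantel_haenszel_odds_ratio(strata), 2))
```

Matching on total score is what distinguishes DIF analysis from a raw comparison of group pass rates, which would confound item behavior with overall ability differences.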
Evidence Based on Relations to Other Variables

Analyses of the relationship of test scores to variables external to the test provide another important source of validity evidence. External variables may include measures of some criteria that the test is expected to predict, as well as relationships to other tests hypothesized to measure the same constructs, and tests measuring related or different constructs. Measures other than test scores, such as performance criteria, are often used in employment settings. Categorical variables, including group membership variables, become relevant when the theory underlying a proposed test use suggests that group differences should be present or absent if a proposed test interpretation is to be supported. Evidence based on relationships with other variables addresses questions about the degree to which these relationships are consistent with the construct underlying the proposed test interpretations.
Convergent and discriminant evidence. Relationships between test scores and other measures intended to assess similar constructs provide convergent evidence, whereas relationships between test scores and measures purportedly of different constructs provide discriminant evidence. For instance, within some theoretical frameworks, scores on a multiple-choice test of reading comprehension might be expected to relate closely (convergent evidence) to other measures of reading comprehension based on other methods, such as essay responses; conversely, test scores might be expected to relate less closely (discriminant evidence) to measures of other skills, such as logical reasoning. Relationships among different methods of measuring the construct can be especially helpful in sharpening and elaborating score meaning and interpretation.

Evidence of relations with other variables can involve experimental as well as correlational evidence. Studies might be designed, for instance, to investigate whether scores on a measure of anxiety improve as a result of some psychological treatment or whether scores on a test of academic achievement differentiate between instructed and noninstructed groups. If performance increases due to short-term coaching are viewed as a threat to validity, it would be useful to investigate whether coached and uncoached groups perform differently.

Test-criterion relationships. Evidence of the relation of test scores to a relevant criterion may be expressed in various ways, but the fundamental question is always: How accurately do test scores predict criterion performance? The degree of accuracy deemed necessary depends on the purpose for which the test is used.

The criterion variable is a measure of some attribute or outcome that is of primary interest, as determined by test users, who may be administrators in a school system, the management of a firm, or clients. The choice of the criterion and the measurement procedures used to obtain criterion scores are of central importance. The value of a test-criterion study depends on the relevance, reliability, and validity of the interpretation based on the criterion measure for a given testing application.

Historically, two designs, often called predictive and concurrent, have been distinguished for evaluating test-criterion relationships. A predictive study indicates how accurately test data can predict criterion scores that are obtained at a later time. A concurrent study obtains predictor and criterion information at about the same time. When prediction is actually contemplated, as in education or employment settings, or in planning rehabilitation regimens, predictive studies can retain the temporal differences and other characteristics of the practical situation. Concurrent evidence, which avoids temporal changes, is particularly useful for psychodiagnostic tests or to investigate alternative measures of some specified construct. In general, the choice of research strategy is guided by prior evidence of the extent to which predictive and concurrent studies yield the same or different results in the domain.
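In a predictive study of the kind just described, the basic numerical evidence is the correlation between test scores and criterion scores obtained later. A minimal sketch, with invented admissions-test and first-year-grade data (the numbers carry no empirical claim):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented data: admissions test scores and later first-year grades.
test_scores = [520, 580, 610, 640, 700, 730]
criterion   = [2.4, 2.9, 2.7, 3.3, 3.5, 3.8]
print(round(pearson_r(test_scores, criterion), 2))
```

The same computation applies to a concurrent design; what differs is when the criterion data are collected, not the statistic. A single coefficient also says nothing about the relevance or reliability of the criterion measure itself, which the text identifies as the central question.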
Test scores are sometimes used in allocating individuals to different treatments, such as different jobs within an institution, in a way that is advantageous for the institution and for the individuals. In that context, evidence is needed to judge the suitability of using a test when classifying or assigning a person to one job versus another or to one treatment versus another. Classification decisions are supported by evidence that the relationship of test scores to performance criteria is different for different treatments. It is possible for tests to be highly predictive of performance for different education programs or jobs without providing the information necessary to make a comparative judgment of the efficacy of assignments or treatments. In general, decision rules for selection or placement are also influenced by the number of persons to be accepted or the numbers that can be accommodated in alternative placement categories.
Evidence about relations to other variables is also used to investigate questions of differential prediction for groups. For instance, a finding that the relation of test scores to a relevant criterion variable differs from one group to another may imply that the meaning of the scores is not the same for members of the different groups, perhaps due to construct underrepresentation or construct-irrelevant components. However, the difference may also imply that the criterion has different meaning for different groups. The differences in test-criterion relationships can also arise from measurement error, especially when group means differ, so such differences do not necessarily indicate differences in score meaning. (See chapter 7.)
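Differential prediction of this kind is often examined by fitting the test-criterion regression separately within each group and comparing slopes and intercepts. The sketch below uses fabricated data for two hypothetical groups, purely to illustrate the comparison; it is not a procedure the Standards prescribes.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Fabricated test-score / criterion pairs for two examinee groups.
group_a = ([50, 60, 70, 80], [2.0, 2.5, 3.0, 3.5])
group_b = ([50, 60, 70, 80], [2.2, 2.5, 2.8, 3.1])

for name, (x, y) in [("A", group_a), ("B", group_b)]:
    slope, intercept = fit_line(x, y)
    print(f"group {name}: slope={slope:.3f} intercept={intercept:.3f}")
```

Differing regression lines would suggest the test-criterion relationship is not the same across groups, but, as the text cautions, criterion differences and measurement error are rival explanations that must be ruled out before concluding that score meaning differs.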
Validity generalization. An important issue in educational and employment settings is the degree to which evidence of validity based on test-criterion relations can be generalized to a new situation without further study of validity in that new situation. When a test is used to predict the same or similar criteria (e.g., performance of a given job) at different times or in different places, it is typically found that observed test-criterion correlations vary substantially. In the past, this has been taken to imply that local validation studies are always required. More recently, meta-analytic analyses have shown that in some domains, much of this variability may be due to statistical artifacts such as sampling fluctuations and variations across validation studies in the ranges of test scores and in the reliability of criterion measures. When these and other influences are cal