AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, INC. et al v. PUBLIC.RESOURCE.ORG, INC.

Filing 60

MOTION for Summary Judgment Filed by AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, INC., AMERICAN PSYCHOLOGICAL ASSOCIATION, INC., NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION, INC. (Attachments: #1 Statement of Facts Points of Authority, #2 Statement of Facts Statement of Undisputed Facts, #3 Declaration Declaration of Jonathan Hudis, #4 Exhibit Ex. A, #5 Exhibit Ex. B, #6 Exhibit Ex. C, #7 Exhibit Ex. D, #8 Exhibit Ex. E, #9 Exhibit Ex. F, #10 Exhibit Ex. G, #11 Exhibit Ex. H, #12 Exhibit Ex. I, #13 Exhibit Ex. J, #14 Exhibit Ex. K, #15 Exhibit Ex. L, #16 Exhibit Ex. M, #17 Exhibit Ex. N, #18 Exhibit Ex. O, #19 Exhibit Ex. P, #20 Exhibit Ex. Q, #21 Exhibit Ex. R, #22 Exhibit Ex. S, #23 Exhibit Ex. T, #24 Exhibit Ex. U, #25 Exhibit Ex. V-1, #26 Exhibit Ex. V-2, #27 Exhibit Ex. W, #28 Exhibit Ex. X, #29 Exhibit Ex. Y, #30 Exhibit Ex. Z, #31 Exhibit Ex. AA, #32 Exhibit Ex. BB, #33 Exhibit Ex. CC, #34 Exhibit Ex. DD, #35 Exhibit Ex. EE, #36 Exhibit Ex. FF-1, #37 Exhibit Ex. FF-2, #38 Exhibit Ex. FF-3, #39 Exhibit Ex. FF-4, #40 Exhibit Ex. FF-5, #41 Exhibit Ex. FF-6, #42 Exhibit Ex. GG, #43 Exhibit Ex. HH, #44 Exhibit Ex. II, #45 Exhibit Ex. JJ, #46 Exhibit Ex. KK, #47 Exhibit Ex. LL, #48 Exhibit Ex. MM, #49 Declaration Declaration of Marianne Ernesto, #50 Exhibit Ex. NN, #51 Exhibit Ex. OO, #52 Exhibit Ex. PP, #53 Exhibit Ex. QQ, #54 Exhibit Ex. RR, #55 Exhibit Ex. SS, #56 Exhibit Ex. TT, #57 Exhibit Ex. UU, #58 Exhibit Ex. VV, #59 Exhibit Ex. WW, #60 Exhibit Ex. XX, #61 Exhibit Ex. YY, #62 Exhibit Ex. ZZ, #63 Exhibit Ex. AAA, #64 Exhibit Ex. BBB, #65 Exhibit Ex. CCC, #66 Exhibit Ex. DDD, #67 Exhibit Ex. EEE, #68 Exhibit Ex. FFF, #69 Exhibit Ex. GGG, #70 Exhibit Ex. HHH, #71 Exhibit Ex. III, #72 Exhibit Ex. JJJ, #73 Declaration Declaration of Lauress Wise, #74 Exhibit Ex. KKK, #75 Exhibit Ex. LLL, #76 Declaration Declaration of Wayne Camara, #77 Exhibit Ex. MMM, #78 Declaration Declaration of Felice Levine, #79 Exhibit Ex. NNN, #80 Exhibit Ex. 
OOO (Public Version), #81 Exhibit Ex. PPP, #82 Exhibit Ex. QQQ, #83 Exhibit Ex. RRR, #84 Exhibit Ex. SSS, #85 Exhibit Ex. TTT-1, #86 Exhibit Ex. TTT-2, #87 Exhibit Ex. UUU, #88 Declaration Declaration of Kurt Geisinger, #89 Declaration Declaration of Dianne Schneider, #90 Text of Proposed Order Proposed Order, #91 Certificate of Service Certificate of Service)(Hudis, Jonathan). Added MOTION for Permanent Injunction on 12/22/2015 (td).

EXHIBIT V-2 Case No. 1:14-cv-00857-TSC-DAR

TESTING INDIVIDUALS OF DIVERSE LINGUISTIC BACKGROUNDS

Background

For all test takers, any test that employs language is, in part, a measure of their language skills. This is of particular concern for test takers whose first language is not the language of the test. Test use with individuals who have not sufficiently acquired the language of the test may introduce construct-irrelevant components to the testing process. In such instances, test results may not reflect accurately the qualities and competencies intended to be measured. In addition, language differences are almost always associated with concomitant cultural differences that need to be taken into account when tests are used with individuals whose dominant language is different from that of the test. Whether a certain dialect of a language should be considered a different language cannot be resolved here, although some aspects of the present discussion are relevant to the debate. In either case, special attention to issues related to language and culture may be needed when developing, administering, scoring, and interpreting test scores and making decisions based on test scores. Language proficiency tests, if appropriately designed and used, are an obvious exception to this concern because they are intended to measure familiarity with the language of the test as required in educational and other settings.

Individuals who are bilingual can vary considerably in their ability to speak, write, comprehend aurally, and read in each language. These abilities are affected by the social or functional situations of communication. Some people develop socially and culturally acceptable ways of speaking that combine two or more languages simultaneously.
Other individuals familiar with two languages may perform more slowly, less efficiently, and at times less accurately on problem-solving tasks that are administered in the less familiar language. Language dominance is not necessarily an indicator of language competence in taking a test, and some accommodation may be necessary even when administering the test in the more familiar language. Therefore it is important to consider language background in developing, selecting, and administering tests and in interpreting test performance. Consequently, for example, test norms based on native speakers of English either should not be used with individuals whose first language is not English or such individuals' test results should be interpreted as reflecting in part current level of English proficiency rather than ability, potential, aptitude or personality characteristics or symptomatology. In cases where a language-oriented test is inappropriate due to the test takers' limited proficiency in that language, a nonverbal test may be a suitable alternative.

Where effective job performance requires the ability to communicate in the language of the test, persons who do not have adequate proficiency in that language may perform poorly on the test, on the job, or both. In that case, the tests used for prediction of future job performance appropriately would be administered in the language of the job, as long as the language level needed for the test did not exceed the level needed to meet work requirements. Test users should understand that poor test performance, as well as poor job performance, may result from poor language proficiency rather than other deficiencies.

Many issues addressed in this chapter are also relevant to testing individuals who have unique linguistic characteristics due to disabilities such as deafness and/or blindness. For example, issues regarding test translation and adaptation are applicable to American Sign Language (ASL) versions of traditional tests.
It should be noted, however, that ASL is not only a different language but is also a different mode of communication. [AERA_APA_NCME_0000100] Also, individuals with disabilities may require modifications in test administration procedures similar to those required by non-native speakers. A more specific discussion of testing individuals with disabilities is provided in chapter 10.

Issues discussed in earlier chapters, in particular chapters 1-5, including validity of test score inferences, test reliability, and test development and administration are germane to this chapter. The present chapter extends these discussions, emphasizing the importance of recognizing the possible impact of language abilities and skills on test performance.

There may be legal requirements relevant to the testing of individuals with different language backgrounds. The standards in this chapter are intended to be applied in a manner consistent with those requirements.

Test Translation, Adaptation, and Modification

Testing test takers in their primary language may be necessary in order to draw valid inferences based on their test scores. Thus, language modifications are often needed. Translating a test to the primary language represents one such modification. However, a number of hazards need to be avoided when doing this sort of translation. One cannot simply assume that such a translation produces a version of the test that is equivalent in content, difficulty level, reliability, and validity to the original untranslated version. Further, one cannot assume that test takers' relevant acculturation experiences are comparable across the two versions. Also, many words have different frequency rates or difficulty levels in various languages. Therefore, words in two languages that appear to be close in meaning may differ significantly in ways that seriously impact the translated test for the intended test use. Additionally, the test content of the translated version may not be equivalent to that of the original version. For example, a test of reading skills in language A that is translated to serve as a test of reading skills in language B may include content not equally meaningful or appropriate for people who read only language B.

For the purposes of test translation and adaptation for use with test takers whose first language is not the language of the test, back translation is not recommended as a standalone procedure. It may provide an artificial similarity of meaning across languages but not the best version in the new language. In most situations, an iterative process more akin to test development and validation is suggested to ensure that similar constructs are measured across versions. When test forms in two or more languages are developed concurrently, it is generally desirable that some items originate in each of the languages involved. The decision as to whether to use the standard original language test or an adapted version is a complex matter. Issues that may have an impact on this decision are discussed in the next section.

Other strategies of test modification may be appropriate when the test taker's primary language is not the language of the test. These include modifying aspects of the test or the test administration procedure such as the presentation format, the response format, the time allowed to complete the test, the test setting (individual administration instead of group testing), and the use of only those portions of the test that are appropriate for the level of language proficiency of the test taker. If modifications are made to the presentation or response format of the test, it may sometimes be appropriate for the modified test to be field tested with an adequate population sample prior to use with its intended population.

Issues of Equivalence

The term equivalence, as used here, refers to the degree to which test scores can be used to make comparable inferences for different examinees. [AERA_APA_NCME_0000101] When a test is designed for and used with linguistically homogeneous populations, issues of equivalence are relatively straightforward (for example, see chapters 1 and 4). If an individual examinee can be demonstrated to belong to the population for which the test was designed, then adhering to standard procedures of test administration and interpretation is expected to lead to reliable and valid inferences based on the examinee's test score. When a test is intended for use with test takers who differ linguistically from those for whom the test was designed, establishing equivalence poses a greater challenge.

In general, the linguistic and cultural characteristics of the intended examinee population should be reflected in examinee samples used throughout the processes of test design, validation, and norming. At each of these stages of test development and standardization, distinct linguistic groups should receive the same level of specific attention. The inclusion of proportional representation of linguistic subgroups in aggregate standardization and validation samples may be insufficient to assure equivalence across linguistic groups.

Issues associated with construct equivalence are perhaps most fundamental. One may question whether the test score for a particular individual represents that individual's standing with respect to the same construct as is measured in the target population. For example, among non-native speakers of the language of the test, one may not know whether a test designed to measure primarily academic achievement becomes in whole or in part a measure of proficiency in the language of the test. There are several psychometric techniques that can be used to determine the equivalence of constructs across groups, including confirmatory factor analysis, analysis of data contained in multimethod-multitrait matrices and the equivalence of responsiveness of the groups to experimental manipulations. These techniques may be supplemented with logical analyses of the results based on knowledge of the linguistic characteristics of the test taker's population of origin.

Other types of equivalence also need to be considered when testing individuals from different linguistic backgrounds. Functional equivalence addresses the question of whether similar activities or behaviors measured by a test have the same meaning in different cultural or linguistic groups. Translation equivalence requires that the translated or adapted test be comparable in content to the original; it was addressed above in the discussion of test translation and adaptation. Metric equivalence concerns the issue of whether scores from the same test administered in different languages have comparable psychometric properties. For example, with metric equivalence, a score of 50 on test X in language A is interpretable in the same way as a score of 50 on test X in language B. In general, metric equivalence will be limited to particular contents, examinee groups, and types of interpretations.

Language Proficiency Testing

Consideration of relevant within-linguistic group differences is crucial in determining appropriate test interpretation and decision making in educational programs and in some professional applications of individualized tests.
For example, individuals whose first language is not the language of the test may vary considerably in their proficiency along a continuum from those who have no knowledge of the language of the test to those who are fluent in it and knowledgeable of the corresponding culture. Further, a demographic proxy such as Mexican or German is likely to prove insufficient in determining the language of test administration because members of the same cultural group may vary widely in their degree of acculturation, proficiency in the language of the test, familiarity with words and syntax in their native languages, educational background, familiarity with tests and test-taking skills, and other factors that may significantly affect the reliability and validity of inferences drawn from test scores. [AERA_APA_NCME_0000102] Thus, it is essential that individual differences that may affect test performance be taken into account when testing individuals of differing linguistic backgrounds.

The need exists to consider both language dominance and language proficiency. Standardized tests that assess multiple domains in a given language can be helpful in determining language dominance and proficiency. The person conducting the testing first should obtain information about the language in which the examinee is dominant (i.e., the preferred or salient language). Following this determination of dominance, the examinee's level of proficiency in the dominant language should be established. If the languages are similarly dominant, then proficiency should be established for both (or all) languages. Then the test should be administered in the most proficient language if available (unless the purpose of the testing is to determine proficiency in the language of the test).
However, testing individuals in their dominant language alone is no panacea because, as suggested above, a bilingual individual's two languages are likely to be specialized by domain (e.g., the first language is used in the context of home, religious practices, and native culture, whereas the second language is used in the context of school, work, television, and mainstream culture). Thus, a test in either language by itself will likely measure some domains and miss out on others. In such situations, testing in both languages (i.e., the dominant language and the language in which the test taker is most proficient) may be necessary, provided appropriate tests are available. If assessment in both languages is carried out, careful consideration should be given to the possibility of order effects.

Because students are expected to acquire proficiency in the language used in schools that is appropriate to their ages and educational levels, tests suitable for assessing their progress in that language are needed. For example, some tests, especially paper-and-pencil measures, that are prepared for students of English as a foreign language may not be particularly useful if they place insufficient emphasis on the assessment of important listening and speaking skills. Measures of competency in all relevant English language skills (e.g., communicative competence, literacy, grammar, pronunciation, and comprehension) are likely to be most valuable in the school context.

Observing students' speech in naturalistic situations can provide additional information about their proficiency in a language. However, findings from naturalistic observations may not be sufficient to judge students' ability to function in that language in formal, academically oriented situations (e.g., classrooms).
For example, it is not appropriate to base judgments of a child's ability to benefit from instruction in one language solely on language fluency observed in speech use on the playground. Nor is it appropriate to base judgments of a person's ability to perform a job on assessments of formal language usage, if formal language usage is not linked to job performance.

In general, there are special difficulties attendant upon the use of a test with individuals who have not had an adequate opportunity to learn the language used by the test. When a test is used to inform a decision process that has a broad impact, it may be important for the test user to review the test itself and to consider the possible use of alternative information-gathering tools (e.g., additional tests, sources of observational information, modified forms of the chosen test) to ensure that the information obtained is adequate to the intended purpose. Reviews of this kind may sometimes reveal the need to create a formal adaptation of a test or to develop a new test that is suitable for the specific linguistic characteristics of the individuals being tested. [AERA_APA_NCME_0000103]

Testing Bilingual Individuals

Test use with examinees who are bilingual also poses special challenges. An individual who knows two languages may not test well in either language. As an example, children from homes where parents speak Spanish may be able to understand Spanish but express themselves best in English. In addition, some persons who are bilingual use their native language in most social situations and use English primarily for academic and work-related activities; the use of one or both languages depends on the nature of the situation. As another example, proficiencies in conversational English and written English can often differ. Non-native English speakers who may give the impression of being fluent in conversational English may not be competent in taking tests that require English literacy skills. Thus, an understanding of an individual's type and degree of bilingualism is important to proper test use.

Administration and Examiner Variables

When an examinee cannot be assumed to belong to the cultural or linguistic population upon which the test was standardized, then use of standardized administration procedures may not provide a comparable administration of the test for that examinee. In this situation, the fundamental principle of sound practice is that examinees, regardless of background, should be provided with an adequate opportunity to complete the test and demonstrate their level of competence on the attributes the test is intended to measure. There may be, however, complex interactions among examiner, examinee, and situational variables that require careful attention on the part of the practitioner administering the test. Factors that may affect the performance of the examinee include the cultural and linguistic background of both the examiner and examinee; the gender and testing style of the examiner; the level of acculturation of the examinee and examiner; whether the test is administered in the original language of the test, the examinee's primary language, or whether both languages are used (and if so in what order); the time limits of the testing; and whether a bilingual interpreter is used.

Use of Interpreters in Testing

Ideally, when an adequately translated version of the test or a suitable nonverbal test is unavailable, assessment of individuals with limited proficiency in the language of the test should be conducted by a professionally trained bilingual examiner.
The bilingual examiner should be proficient in the language of the examinee at the level of a professional trained in that language. When a bilingual examiner is not available, an alternative is to use an interpreter in the testing process and administer the test in the examinee's native language. Although a commonly used procedure, this practice has some inherent difficulties. For example, there may be a lack of linguistic and cultural equivalence between the translation and the original test, the translator or the interpreter may not be adequately trained to work in the testing situation, and representative norms may not be available to score and interpret the test results appropriately. These difficulties may pose significant threats to the validity of inferences based on test results.

When the need for an interpreter arises for a particular testing situation, it is important to obtain a fully qualified interpreter to assist the examiner in administering the test. The most important consideration in testing with the services of an interpreter is the interpreter's ability and preparedness in carrying out the required duties during testing. [AERA_APA_NCME_0000104] The interpreter obviously needs to be fluent in both the language of the test and the examinee's native language and have general familiarity with the process of translating. To be effective, the interpreter also needs to have a basic understanding of the process of psychological and educational assessment, including the importance of following standardized procedures, the importance of accurately conveying to the examiner an examinee's actual responses, and the role and responsibilities of the interpreter in testing. Additionally, it is inappropriate for the interpreter to have any prior personal relationship with the test taker that is likely to jeopardize the objectivity of the test administration.
However, in small linguistic or cultural communities, speakers of the alternate languages are often known to each other. Therefore, in such cases, it is the responsibility of the test user or examiner to ensure that the interpreter has received adequate instruction in the principles of objective test administration and to assess preexisting biases so that test interpretations can take such factors into account. If clear biases are evident and cannot be ameliorated, then the examiner should make arrangements to obtain another interpreter.

Whenever proficiency in the language of the test is essential to job performance, use of a translator to assist a candidate with licensure, certification, or civil service examinations should be permitted only when it will not compromise standards designed to protect public health, safety, and welfare. When a translator is permitted, it also is essential that the candidate not receive help interpreting the content of the test or any other assistance that would compromise the integrity of the licensure or certification decision. Creation of audio tapes that enable a candidate to listen to each question being read in the language of the test may be more appropriate when such an accommodation is justified.

In educational and psychological testing, it may be appropriate for an interpreter to become familiar with all details of test content and administration prior to the testing. Also, time needs to be provided for the interpreter to translate test instructions and items, if necessary. In psychological testing, it is often desirable for the examiner to demonstrate for the interpreter how certain test items are administered and explain what to expect during testing. In addition, it is important that, prior to the testing, the examiner and the interpreter become familiar with each other's style of speaking and the speed at which they work.
Immediately prior to the assessment, the role of the interpreter needs to be explained clearly to the examinee. It is essential that the interpreter make all efforts to provide accurate information in translation. The interpreter must reflect a professional attitude and maintain objectivity throughout the testing process (e.g., not interject subjective opinions, not give cues to the examinee).

Once the testing is completed, the examiner is responsible for reviewing the test responses with the assistance of the interpreter. Responses that are difficult to interpret (e.g., vocabulary words), nontest behaviors that might have special meanings (e.g., body language), as well as language factors (e.g., mixed use of two languages) and cultural factors that might have an effect on testing results need to be discussed fully. This information is to be used then by the examiner in carefully evaluating the test results and drawing inferences from the results.

Cultural Differences and Individual Testing

Linguistic behavior that may appear eccentric or be judged to be less appropriate in one culture may be seen as more appropriate in another culture and may need to be taken into account during the testing process. For example, children or adults from some cultures may be reluctant to speak in elaborate language to adults or people in higher status roles and instead may be encouraged to speak to such persons only in response to specific questions or with formulaic utterances. [AERA_APA_NCME_0000105] Thus, when tested, such test takers may respond to an examiner probing for elaborate speech with only short phrases or by shrugging their shoulders. Interpretations of scores resulting from such testing may prove to be inaccurate if this tendency is not properly taken into consideration. At the same time, the examiner should not presume that their reticence is necessarily a cultural characteristic.
Additional information (e.g., prior observations or a family member's consultation) may be needed to discuss the extent of culture's possible influence on linguistic performance.

The values associated with the nature and degree of verbal output also may differ across cultures. One cultural group may judge verbosity or rapid speech as rude, whereas another may regard those speech patterns as indications of high mental ability or friendliness. An individual from one culture who is evaluated with values appropriate to another culture may be considered taciturn, withdrawn, or of low mental ability. Resulting interpretations and prescriptions of treatment may be invalid and potentially harmful to the individual being tested.

Standard 9.1

Testing practice should be designed to reduce threats to the reliability and validity of test score inferences that may arise from language differences.

Comment: Some tests are inappropriate for use with individuals whose knowledge of the language of the test is questionable. Assessment methods together with careful professional judgment are required to determine when language differences are relevant. Test users can judge how best to address this standard in a particular testing situation.

Standard 9.2

When credible research evidence reports that test scores differ in meaning across subgroups of linguistically diverse test takers, then to the extent feasible, test developers should collect for each linguistic subgroup studied the same form of validity evidence collected for the examinee population as a whole.

Comment: Linguistic subgroups may be found to differ with respect to appropriateness of test content, the internal structure of their test responses, the relation of their test scores to other variables, or the response processes employed by individual examinees. Any such findings need to receive due consideration in the interpretation and use of scores as well as in test revisions.
There may also be legal or regulatory requirements to collect subgroup validity evidence. Not all forms of evidence can be examined separately for members of all linguistic groups. The validity argument may rely on existing research literature, for example, and such literature may not be available for some populations. For some kinds of evidence, separate linguistic subgroup analyses may not be feasible due to the limited number of cases available. Data may sometimes be accumulated so that these analyses can be performed after the test has been in use for a period of time. [AERA_APA_NCME_0000106] It is important to note that this standard calls for more than representativeness in the selection of samples used for validation or norming studies. Rather, it calls for separate, parallel analyses of data for members of different linguistic groups, sample sizes permitting. If a test is being used while such data are being collected, then cautionary statements are in order regarding the limitations of interpretations based on test scores.

Standard 9.3

When testing an examinee proficient in two or more languages for which the test is available, the examinee's relative language proficiencies should be determined. The test generally should be administered in the test taker's most proficient language, unless proficiency in the less proficient language is part of the assessment.

Comment: Unless the purpose of the testing is to determine proficiency in a particular language or the level of language proficiency required for the test is a work requirement, test users need to take into account the linguistic characteristics of examinees who are bilingual or use multiple languages. This may require the sole use of one language or use of multiple languages in order to minimize the introduction of construct-irrelevant components to the measurement process.
For example, in educational settings, testing in both the language used in school and the native language of the examinee may be necessary in order to determine the optimal kind of instruction required by the examinee. Professional judgement needs to be used to determine the most appropriate procedures for establishing relative language proficiencies. Such procedures may range from self-identification by examinees through formal proficiency testing.

Standard 9.4

Linguistic modifications recommended by test publishers, as well as the rationale for the modifications, should be described in detail in the test manual.

Comment: Linguistic modifications may be recommended for the original test in the primary language or for an adapted version in a secondary language, or both. In any case, the test manual should provide appropriate information regarding the recommended modifications, their rationales, and the appropriate use of scores obtained using these linguistic modifications.

Standard 9.5

When there is credible evidence of score comparability across regular and modified tests or administrations, no flag should be attached to a score. When such evidence is lacking, specific information about the nature of the modification should be provided, if permitted by law, to assist test users properly to interpret and act on test scores.

Comment: The inclusion of a flag on a test score where a linguistic modification was provided may conflict with legal and social policy goals promoting fairness in the treatment of individuals of diverse linguistic backgrounds. If a score from a modified administration is comparable to a score from a nonmodified administration, there is no need for a flag. Similarly, if a modification is provided for which there is no reasonable basis for believing that the modification would affect score comparability, there is no need for a flag.
Further, reporting practices that use asterisks or other non-specific symbols to indicate that a test's administration has been modified provide little useful information to test users.

AERA_APA_NCME_0000107

Standard 9.6
When a test is recommended for use with linguistically diverse test takers, test developers and publishers should provide the information necessary for appropriate test use and interpretation.

Comment: Test developers should include in test manuals and in instructions for score interpretation explicit statements about the applicability of the test with individuals who are not native speakers of the original language of the test. However, it should be recognized that test developers and publishers seldom will find it feasible to conduct studies specific to the large number of linguistic groups found in certain countries.

Standard 9.7
When a test is translated from one language to another, the methods used in establishing the adequacy of the translation should be described, and empirical and logical evidence should be provided for score reliability and the validity of the translated test's score inferences for the uses intended in the linguistic groups to be tested.

Comment: For example, if a test is translated into Spanish for use with Mexican, Puerto Rican, Cuban, Central American, and Spanish populations, score reliability and the validity of test score inferences should be established with members of each of these groups separately where feasible. In addition, the test translation methods used need to be described in detail.

Standard 9.8
In employment and credentialing testing, the proficiency level required in the language of the test should not exceed that appropriate to the relevant occupation or profession.

Comment: Many occupations and professions require a suitable facility in the language of the test. In such cases, a test that is used as a part of selection, advancement, or credentialing may appropriately reflect that aspect of performance. However, the level of language proficiency required on the test should be no greater than the level needed to meet work requirements. Similarly, the modality in which language proficiency is assessed should be comparable to that on the job. For example, if the job requires only that employees understand verbal instructions in the language used on the job, it would be inappropriate for a selection test to require proficiency in reading and writing that particular language.

Standard 9.9
When multiple language versions of a test are intended to be comparable, test developers should report evidence of test comparability.

Comment: Evidence of test comparability may include but is not limited to evidence that the different language versions measure equivalent or similar constructs, and that score reliability and the validity of inferences from scores from the two versions are comparable.

Standard 9.10
Inferences about test takers' general language proficiency should be based on tests that measure a range of language features, and not on a single linguistic skill.

Comment: For example, a multiple-choice, pencil-and-paper test of vocabulary does not indicate how well a person understands the language when spoken nor how well the person speaks the language. However, the test score might be helpful in determining how well a person understands some aspects of the written language. In making educational placement decisions, a more complete range of communicative abilities (e.g., word knowledge, syntax) will typically need to be assessed.

AERA_APA_NCME_0000108
Standard 9.11
When an interpreter is used in testing, the interpreter should be fluent in both the language of the test and the examinee's native language, should have expertise in translating, and should have a basic understanding of the assessment process.

Comment: Although individuals with limited proficiency in the language of the test should ideally be tested by professionally trained bilingual examiners, the use of an interpreter may be necessary in some situations. If an interpreter is required, the professional examiner is responsible for ensuring that the interpreter has the appropriate qualifications, experience, and preparation to assist appropriately in the administration of the test. It is necessary for the interpreter to understand the importance of following standardized procedures, how testing is conducted typically, the importance of accurately conveying to the examiner an examinee's actual responses, and the role and responsibilities of the interpreter in testing.

AERA_APA_NCME_0000109

10. TESTING INDIVIDUALS WITH DISABILITIES

Background
With the advancement of scientific knowledge, medical practices, and social policies, increasing numbers of individuals with disabilities are participating more fully in educational, employment, and social activities. This increased participation has resulted in a greater need for the testing and assessment of individuals with disabilities for a variety of purposes. Individuals with disabilities are defined as persons possessing a physical, mental, or developmental impairment that substantially limits one or more of their major life activities. Although the Standards focus on technical and professional issues regarding the testing of individuals with disabilities, test developers and users are encouraged to become familiar with federal, state, and local laws, and court and administrative rulings that regulate the testing and assessment of individuals with disabilities.
Tests are administered to individuals with disabilities in various settings and for diverse purposes. For example, tests are used for diagnostic purposes to determine the existence and nature of a test taker's disabilities. Testing is also conducted for prescriptive purposes to determine intervention plans. In addition, tests are administered to persons who have been diagnosed with identified disabilities for educational and employment purposes to make placement, selection, or other similar decisions, or for monitoring performance as a tool for educational accountability. These uses of tests for persons with disabilities occur in a variety of contexts including school, clinical, counseling, forensic, employment, and credentialing.

Issues Regarding Accommodation When Testing Individuals With Disabilities
A major issue when testing individuals with disabilities concerns the use of accommodations, modifications, or adaptations. The purpose of these accommodations or modifications is to minimize the impact of test-taker attributes that are not relevant to the construct that is the primary focus of the assessment. The terms accommodation and modification have varying connotations in different subfields. Here accommodation is used as the general term for any action taken in response to a determination that an individual's disability requires a departure from established testing protocol. Depending on circumstances, such accommodation may include modification of test administration processes or modification of test content. No connotation that modification implies a change in the construct(s) being measured is intended.

A standardized test that has been designed for use with the general population may be inappropriate for use for individuals with specific disabilities if the test requires the use of sensory, motor, language, or psychological skills that are affected by the disability and that are not relevant to the focal construct.
For example, a person who is blind may read only in Braille format, and an individual with hemiplegia may be unable to hold a pencil and thus would have difficulty completing a standard written exam. In addition, some individuals with disabilities may possess other attendant characteristics (e.g., a person with a physical disability may fatigue easily), causing them to be further challenged by some standardized testing situations. In these examples, if reading, use of a pencil, and fatigue are incidental to the construct intended to be measured by the test, modifications of tests and test administration procedures may be necessary for an accurate assessment.

Note also that accommodations are not needed or appropriate under a variety of circumstances. First, the disability may, in fact, be directly relevant to the focal construct. For example, no accommodation is appropriate for a person who is completely blind if the test is designed to measure visual spatial ability. Similarly, in employment testing it would be inappropriate to make test modifications if the test is designed to assess essential skills required for the job and the modifications would fundamentally alter the construct being measured. Second, an accommodation for a particular disability is inappropriate when the purpose of a test is to diagnose the presence and degree of that disability. For example, allowing extra time on a timed test to assess the existence of a specific learning disability would make it very difficult to determine if a processing difficulty actually exists. Third, it is important to note that not all individuals with disabilities require special provisions when taking all tests. Many individuals have disabilities that would not influence their performance on a particular test, and hence no modification is needed.

AERA_APA_NCME_0000110

Professional judgment necessarily plays a substantial role in decisions about test accommodations.
Judgment comes into play in determining whether a particular individual needs accommodation and the nature and extent of such accommodation. In some circumstances, individuals with disabilities request testing accommodations and provide appropriate documentation in support of the request. Generally the request is reviewed by the agency sponsoring the assessment or an outside source knowledgeable about the assessment process and the type of disability. In either case, a conclusion is drawn as to what constitutes reasonable accommodation. Disagreement may arise between the accommodation requested by an individual with a disability and the granted accommodation. In these situations, and to the extent permitted by law, the overarching concern is the validity of the inference made from the score on the modified test: Fairness to all parties is best served by a decision about test modification that results in the most accurate measure possible of the construct of interest. The role of professional judgment is further complicated by the fact that empirical research on test accommodations is often lacking.

When modifying tests it is also important to recognize that individuals with the same type of disability may differ considerably in their need for accommodation. A central consideration in determining a test modification for a disability is to recognize that the modifications should be tailored directly to the specific needs of individual test takers. As an example, it would be incorrect to make the assumption that all individuals with visual impairment would be successfully accommodated by providing testing materials in Braille format. Depending on the extent of the disability, it may be more appropriate for some individuals to receive testing materials written in large print, while others might need a tape cassette or reader.
As test modifications involve altering some aspect of a test originally developed for use with a target population, it is important to recognize that making these alterations has the potential to affect the psychometric qualities of the test. There have been few empirical investigations into the effects of various accommodations on the reliability of test scores or the validity of inferences drawn from modified tests. Due to a number of practical limitations (e.g., small sample size, nonrandom selection of test takers with disabilities), there is no precise, technical solution available for equating modified tests to the original form of these tests. Thus it is difficult to compare scores from a test modified for persons with disabilities with scores from the original test.

Modifications designed to accommodate persons with disabilities also may change the construct measured by the test, or the extent to which it is fully measured. For example, a test of oral comprehension may become a test of reading comprehension when administered in written format to a person who is deaf or hard of hearing. Such a change in test administration may alter the construct being measured by the original test. When this occurs, the scores on the standard and modified versions of the test will not have the same meaning. Similarly, modification of test administration may also alter the predictive value of test scores. For example, when a speed test is administered with relaxed time requirements to a person with a disability, the relationship of test scores to criteria such as job performance may be affected. Appropriate professional judgment should be exercised in interpreting and using scores on modified tests.

AERA_APA_NCME_0000111

Some modified tests, with accompanying research to support the appropriate modifications, have been available for a number of years. Although the development of tests and testing procedures for individuals with disabilities is encouraged by the Standards, it should be noted that all relevant individual standards given elsewhere in this document are fully applicable to the testing applications and modifications or accommodations considered in this chapter. Issues of validity and reliability are critical whenever modifications or accommodations occur.

Strategies of Test Modification
A variety of test modification strategies have been implemented in various settings to accommodate the needs of test takers with disabilities. Some require modifying test administration procedures (e.g., instructions, response format) while others alter test medium, timing, settings, or content. Depending on the nature and extent of the disability, one or more test modification procedures may be appropriate for a particular individual. The listing here of a variety of modification strategies should not suggest that the full array of strategies is routinely available or appropriate; the decision to modify rests on a determination that modification is needed to make valid inferences about the individual's standing on the construct in question.

Modifying Presentation Format
One modification option is to alter the medium used to present the test instructions and items to the test takers. For example, a test booklet may be produced in Braille or large print for individuals with visual impairments. When tests are computer-administered, [illegible] be used. Individuals with a hearing disability may receive test instructions through the use of sign communication or writing.

Modifying Response Format
Modifications also can be made to allow individuals with disabilities to respond to test items using their preferred communication modality. For example, an individual with severe language deficits might be allowed to point to the preferred response. A test taker who cannot manually record answers to test items or questions may be assisted by an aide who would mark the answer. Other ways of obtaining a response include having the respondent use a tape recorder, a computer keyboard, or a Braillewriter.

Modifying Timing
Another modification available is to alter the timing of tests. This may include extended time to complete the test, more breaks during testing, or extended testing sessions over several days. Many national testing programs (e.g., achievement, certification) allow persons with disabilities additional time to take the test. Reading Braille, using a cassette recorder, or having a reader may take longer than reading regular print. Reading large type may or may not be more time-consuming, depending on the layout of the material and on the nature and severity of the impairment.

Modifying Test Setting
Tests normally administered in group settings may be administered individually for a variety of purposes. Individual administration may avoid interference with others taking a test in a group. Some disabilities (e.g., attention deficit disorder) make it impractical to test in a group setting. Other alterations may include changing the testing location if it is not wheelchair accessible, providing tables or chairs that provide greater physical support, or altering the lighting conditions for individuals who are visually impaired.
AERA_APA_NCME_0000112

Using Only Portions of a Test
Another strategy of test accommodation involves the use of portions of a test in assessing persons with disabilities. These procedures are sometimes used in clinical testing when certain subparts of a test require physical, sensory, language, or other capabilities that a test taker with disabilities does not have. This approach is commonly used in cognitive and achievement testing when the physical or sensory limitations of an individual interfere with the ability to perform on a test. For example, if a cognitive ability test includes items presented orally combined with items presented in a written fashion, the orally-presented items might be omitted when the test is given to an individual with a hearing disability as they will not provide an adequate assessment of that individual's cognitive ability. Results on such items are more likely to reflect the individual's hearing difficulty rather than his or her true cognitive ability.

Although omitting test items may represent an effective accommodation technique, it may also prevent the test from adequately measuring the intended skills or abilities, especially if those skills or abilities are of central interest. For example, it should be noted that eliminating a portion of the test may not be appropriate in situations such as certification testing and employment testing where the construct measured by each portion may represent a separate and necessary job or occupational requirement.

Using Substitute Tests or Alternate Assessments
One additional modification is to replace a test standardized on the general population with a test or alternate assessment that has been specially designed for individuals with disabilities. More valid results may be obtained through the use of a test specifically designed for use with individuals with disabilities. Although a substitute test may represent a desirable accommodation solution, it may be difficult to find an adequate replacement that measures the same construct with comparable technical quality, and for which scores can be placed on the same scale as the original test.

Using Modifications in Different Testing Contexts
There are important contextual differences between the individualized use of tests, as in the case of clinical diagnosis, and group or large-scale testing, as in the case of testing for academic achievement, employment, credentialing, or admissions. Individual diagnostic testing is conducted typically for clinical or educational purposes. In these contexts a highly qualified test professional (e.g., a licensed or certified psychologist) is responsible for the entire assessment process of test selection, administration, interpretation, and reporting of results. The test professional seeks to gather appropriate information about the client's specific disability and preferred modality of communication and uses this information to determine the accommodations appropriate for the test taker. During the assessment process, any modified tests are used along with other assessment methods to collect data about the client's functioning in relevant areas. Inferences are then made based on this multitude of information. Test modifications may be used during assessment not only out of necessity but also as a source of clinical insight about the client's functioning. For example, a test taker with obsessive compulsive disorder may be allowed to continue to complete a test item, subtest, or a total test beyond the standardized time limits. Although in such cases the performance of the test taker cannot be judged according to the standardized scoring standards, the fact that the test taker could produce a successful performance with extra time often aids clinical interpretation.
The use of test modifications in large-scale testing is different, however. Large-scale testing is used for purposes such as measurement of academic achievement, program evaluation, credentialing, licensure, and employment. In these contexts, a standardized test usually is administered to all test participants. Large numbers of test takers are not uncommon, and decisions may in some cases be made solely on the basis of test information, as in the case of a test used as an initial screening device in an employment context. In some cases, decision making requires the comparison of test takers, as in selection or admission contexts where the number of applicants may greatly exceed the number of available openings. This context highlights the need for concern for fairness to all parties, as comparisons must be made between test scores obtained by individuals with disabilities taking modified tests and scores obtained by individuals under regular conditions. While test takers should not be disadvantaged due to a disability not relevant to the construct the test is intended to assess, the resulting accommodation should not put those taking a modified test at an undue advantage over those tested under regular conditions. As research on the comparability of scores under regular and modified conditions is sometimes limited, decisions about appropriate accommodation in these contexts involve important and difficult professional judgments.

AERA_APA_NCME_0000113

Reporting Scores on Modified Tests
The practice of reporting scores on modified tests varies in different contexts. In individual testing, the test professional commonly reports when tests have been administered in a nonstandardized fashion when providing test scores. Typically, the [illegible] in making accommodations or modifications are described in the test report, and the validity of the inferences resulting from the modified test scores is discussed. This practice of reporting the nature of modifications is consistent with implied requirements to communicate information as to the nature of the assessment process if the modifications impact the reliability of test scores or the validity of inferences drawn from test scores.

On the other hand, the reporting of test scores from modified tests in large-scale testing has created considerable debate. Often when scores from a nonstandardized version of a test are reported, the score report contains an asterisk next to the score or some other designation, often called a flag, to indicate that the test administration was modified. Sometimes recipients of these special designations are informed of the meaning of the designation; many times no information is provided about the nature of the modification made. Some argue that reporting scores from nonstandard test administrations without special identification misleads test users and perhaps even harms test takers with disabilities, whose scores may not accurately reflect their abilities. Others, however, argue that identifying scores of test takers with disabilities as resulting from nonstandard administrations unfairly labels these test takers as persons with disabilities, stigmatizes them, and may deny them the opportunity to compete equally with test takers without disabilities when they might otherwise be able to do so. Federal laws and the laws of most states bar discrimination against persons with disabilities, require individualized reasonable accommodations in testing, and limit practices that could stigmatize persons with disabilities, particularly in educational, admissions, credentialing, and employment testing.
The fundamental principles relevant here are that important information about test score meaning should not be withheld from test users who interpret and act on the test scores, and that irrelevant information should not be provided. When there is sufficient evidence of score comparability across regular and modified administrations, there is no need for any sort of flagging. When such evidence is lacking, an undifferentiated flag provides only very limited information to the test user, and specific information about the nature of the modification is preferable, if permitted by law.

AERA_APA_NCME_0000114

Standard 10.1
In testing individuals with disabilities, test developers, test administrators, and test users should take steps to ensure that the test score inferences accurately reflect the intended construct rather than any disabilities and their associated characteristics extraneous to the intent of the measurement.

Comment: Chapter 1 (Validity) deals more broadly with the critical requirement that a test score reflects the intended construct. The need to attend to the possibility of construct-irrelevant variance resulting from a test taker's disability is an example of this general principle. In some settings, test users are prohibited from inquiring about a test taker's disability, making the standard contingent on test taker self-report of a disability or a need for accommodation.

Standard 10.2
People who make decisions about accommodations and test modification for individuals with disabilities should be knowledgeable of existing research on the effects of the disabilities in question on test performance. Those who modify tests should also have access to psychometric expertise for so doing.

Comment: In some areas there may be little known about the effects of a particular disability on performance on a particular type of tests.

Standard 10.3
Where feasible, tests that have been modified for use with individuals with disabilities should be pilot tested on individuals who have similar disabilities to investigate the appropriateness and feasibility of the modifications.

Comment: Although useful guides for modifying tests are available, they do not provide a universal substitute for trying out a modified test. Even when such tryouts are conducted on samples inadequate to produce norm data, they are useful for checking the mechanics of the modifications. In many circumstances, however, lack of ready access to individuals with similar disabilities, or an inability to postpone decision making, make this unfeasible.

Standard 10.4
If modifications are made or recommended by test developers for test takers with specific disabilities, the modifications as well as the rationale for the modifications should be described in detail in the test manual and evidence of validity should be provided whenever available. Unless evidence of validity for a given inference has been established for individuals with the specific disabilities, test developers should issue cautionary statements in manuals or supplementary materials regarding confidence in interpretations based on such test scores.

Comment: When test developers and users intend that a modified version of a test should be interpreted as comparable to an unmodified one, evidence of test score comparability should be provided.

Standard 10.5
Technical material and manuals that accompany modified tests should include a careful statement of the steps taken to modify the test to alert users to changes that are likely to alter the validity of inferences drawn from the test score.

Comment: If empirical evidence of the nature and effects of changes resulting from modifying standard tests is lacking, it is impossible to assess the impact of significant modifications. Documentation of the procedures used to modify tests will not only aid in the administration and interpretation of the given test but will also inform others who are modifying tests for people with specific disabilities. This standard should apply to both test developers and test users.

AERA_APA_NCME_0000115

Standard 10.6
If a test developer recommends specific time limits for people with disabilities, empirical procedures should be used, whenever possible, to establish time limits for modified forms of timed tests rather than simply allowing test takers with disabilities a multiple of the standard time. When possible, fatigue should be investigated as a potentially important factor when time limits are extended.

Comment: Such empirical evidence is likely only in the limited settings where a sufficient number of individuals with similar disabilities are tested. Not all individuals with the same disability, however, necessarily require the same accommodation. In most cases, professional judgment based on available evidence regarding the appropriate time limits given the nature of an individual's disability will be the basis for decisions. Legal requirements may be relevant to any decision on absolute time limits.

Standard 10.7
When sample sizes permit, the validity of inferences made from test scores and the reliability of scores on tests administered to individuals with various disabilities should be investigated and reported by the agency or publisher that makes the modification. Such investigations should examine the effects of modifications made for people with various disabilities on resulting scores, as well as the effects of administering standard unmodified tests to them.

Comment: In addition to modifying tests and test administration procedures for people who have disabilities, evidence of validity for inferences drawn from these tests is needed. Validation is the only way to amass knowledge about the usefulness of modified tests for people with disabilities. The costs of obtaining validity evidence should be considered in light of the consequences of not having usable information regarding the meanings of scores for people with disabilities. This standard is feasible in the limited circumstances where a sufficient number of individuals with the same level or degree of a given disability is available.

Standard 10.8
Those responsible for decisions about test use with potential test takers who may need or may request specific accommodations should (a) possess the information necessary to make an appropriate selection of measures, (b) have current information regarding the availability of modified forms of the test in question, (c) inform individuals, when appropriate, about the existence of modified forms, and (d) make these forms available to test takers when appropriate and feasible.

Standard 10.9
When relying on norms as a basis for score interpretation in assessing individuals with disabilities, the norm group used depends upon the purpose of testing. Regular norms are appropriate when the purpose involves the test taker's functioning relative to the general population. If available, normative data from the population of individuals with the same level or degree of disability should be used when the test taker's functioning relative to individuals with similar disabilities is at issue.
Standard 10.10
Any test modifications adopted should be appropriate for the individual test taker, while maintaining all feasible standardized features. A test professional needs to consider reasonably available information about each test taker's experiences, characteristics, and capabilities that might impact test performance, and document the grounds for the modification.

AERA_APA_NCME_0000116

Standard 10.11
When there is credible evidence of score comparability across regular and modified administrations, no flag should be attached to a score. When such evidence is lacking, specific information about the nature of the modification should be provided, if permitted by law, to assist test users properly to interpret and act on test scores.

Comment: The inclusion of a flag on a test score where an accommodation for a disability was provided may conflict with legal and social policy goals promoting fairness in the treatment of individuals with disabilities. If a score from a modified administration is comparable to a score from a nonmodified administration, there is no need for a flag. Similarly, if a modification is provided for which there is no reasonable basis for believing that the modification would affect score comparability, there is no need for a flag. Further, reporting practices that use asterisks or other nonspecific symbols to indicate that a test's administration has been modified provide little useful information to test users. When permitted by law, if a nonstandardized administration is to be reported because evidence does not exist to support score comparability, then this report should avoid referencing the existence or nature of the test taker's disability and should instead report only the nature of the accommodation provided, such as extended time for testing, the use of a reader, or the use of a tape recorder.

Standard 10.12
In testing individuals with disabilities for diagnostic and intervention purposes, the test should not be used as the sole indicator of the test taker's functioning. Instead, multiple sources of information should be used.

Comment: For example, when assessing the intellectual functioning of persons with mental retardation, results from an individually administered intelligence test are generally supplemented with other pertinent information, such as case history, information about school functioning, and results from other cognitive tests and adaptive behavior measures. In addition, at times a multidisciplinary evaluation (e.g., physical, psychological, linguistic, neurological, etc.) may be needed to yield an accurate picture of the person's functioning.

AERA_APA_NCME_0000117

PART III
TESTING APPLICATIONS

AERA_APA_NCME_0000118

11. THE RESPONSIBILITIES OF TEST USERS

Background

Previous chapters have dealt primarily with the responsibilities of those who develop, market, evaluate, or mandate the administration of tests and the rights and obligations of test takers. Many of the standards in these chapters, and in the chapters that follow, refer to the development of tests and their use in specific settings. The present chapter includes standards of a more general nature that apply in almost all measurement contexts. In particular, attention is centered on the responsibilities of those who may be considered the users of tests. This group includes psychologists, educators, and other professionals who select the specific instruments or supervise test administration on their own authority or at the behest of others. It also includes all individuals who actively participate in the interpretation and use of test results, other than the test takers themselves.

It is presumed that a legitimate educational, psychological, or employment purpose justifies the time and expense of test administration. In most settings, the user communicates this purpose to those who have a legitimate interest in the measurement process and subsequently conveys the implications of examinee performance to those entitled to receive the information. Depending on the measurement setting, this group may include individual test takers, parents and guardians, educators, employers, policymakers, the courts, or the general public.

Where administration of tests or use of test data is mandated for a specific population by governmental authorities, educational institutions, licensing boards, or employers, the developer and user of an instrument may be essentially the same. In such settings, there often is no clear separation between the professional responsibilities of those who produce the instrument and those who administer the test and interpret the results. Instruments produced by independent publishers, on the other hand, present a somewhat different picture. Typically, these tests will be used with a variety of populations and for diverse purposes.

The conscientious developer of a standardized test attempts to screen and educate potential users. Furthermore, most publishers and test sponsors work vigorously to prevent the misuse of standardized measures and the misinterpretation of individual scores and group averages. Test manuals often illustrate sound and unsound interpretations and applications. Some identify specific practices that are not appropriate and should be discouraged. Despite the best efforts of test developers, however, appropriate test use and sound interpretation of test scores are likely to remain primarily the responsibility of the test user.
Test takers, parents and guardians, legislators, policymakers, the media, the courts, and the public at large often yearn for unambiguous interpretations of test data. In particular, they often tend to attribute positive or negative results, including group differences, to a single factor or to the conditions that prevail in one social institution, most often the home or the school. These consumers of test data frequently press for explicit rationales for decisions that are based only in part on test scores. The wise test user helps all interested parties understand that sound decisions regarding test use and score interpretation involve an element of professional judgment. It is not always obvious to the consumers that the choice of various information-gathering procedures often involves experience that is not easily quantified or verbalized. The user can help them appreciate the fact that the weighting of quantitative data, educational and occupational information, behavioral observations, anecdotal reports, and other relevant data often cannot be specified precisely.

AERA_APA_NCME_0000119

Because of the appearance of objectivity and numerical precision, test data are sometimes allowed to totally override other sources of evidence about test takers. There are circumstances in which selection based exclusively on test scores may be appropriate. For example, this may be the case in pre-employment screening. But in educational and psychological settings, test users are well advised, and may be legally required, to consider other relevant sources of information on test takers, not just test scores. In the latter situations, the psychologist or educator familiar with the local setting and with local test takers is best qualified to integrate this diverse information effectively.

As reliance on test results has grown in recent years, greater pressure has been placed on test users to explain to the public the rationale for test-based decisions. More than ever before, test users are called upon to defend their testing practices. They do this by documenting that their test uses and score interpretations are supported by measurement authorities for the given purpose, that the inferences drawn from their instrument are validated for use with a given population, and that the results are being used in conjunction with other information, not in isolation. If these conditions are met, the test user can convincingly defend the decisions made or the administrative actions taken in which tests played a part.

It is not appropriate for these Standards to dictate minimal levels of test-criterion correlation, classification accuracy, or reliability for any given purpose. Such levels depend on whether decisions must be made immediately on the strength of the best available evidence, however weak, or whether decisions can be delayed until better evidence becomes available. But it is appropriate to expect the user to ascertain what the alternatives are, what the quality and consequences of these alternatives are, and whether a delay in decision making would be beneficial. Cost-benefit compromises become necessary in test use, as they often are in test development. It should be noted, however, that in some contexts legal requirements may place limits on the extent to which such compromises can be made. As with standards for the various phases of test development, when relevant standards are not met in test use, the reasons should be persuasive. The greater the potential impact on test takers, for good or ill, the greater the need to identify and satisfy the relevant standards.

In selecting a test and interpreting a test score, the test user is expected to have a clear understanding of the purposes of the testing and its probable consequences.
The knowledgeable user has definite ideas on how to achieve these purposes and how to avoid bias, unfairness, and undesirable consequences. In subscribing to these Standards, test publishers and agencies mandating test use agree to provide information on the strengths and weaknesses of their instruments. They accept the responsibility to warn against likely misinterpretations by unsophisticated interpreters of individual scores or aggregated data. However, the ultimate responsibility for appropriate test use and interpretation lies predominantly with the test user. In assuming this responsibility, the user must become knowledgeable about a test's appropriate uses and the populations for which it is suitable. The user must also become adept, particularly in statewide and community-wide assessment programs, in communicating the implications of test results to those entitled to receive them.

In some instances, users may be obligated to collect additional evidence about a test's technical quality. For example, if performance assessments are locally scored, evidence of the degree of inter-scorer agreement may be required. Users also should be alert to the probable local consequences of test use, particularly in the case of large-scale testing programs. If the same test material is used in successive years, users should actively monitor the program to ensure that reuse has not compromised the integrity of the results.

AERA_APA_NCME_0000120

Some of the standards that follow reiterate ideas contained in other chapters, principally chapter 5 "Test Administration, Scoring, and Reporting," chapter 7 "Fairness in Testing and Test Use," chapter 8 "Rights and Responsibilities of Test Takers," and chapter 13 "Educational Testing and Assessment." This repetition is intentional.
It permits an enumeration in one chapter of the major obligations that must be assumed largely by the test administrator and user, though these responsibilities may refer to topics that are covered more fully in other chapters.

Standard 11.1
Prior to the adoption and use of a published test, the test user should study and evaluate the materials provided by the test developer. Of particular importance are those that summarize the test's purposes, specify the procedures for test administration, define the intended populations of test takers, and discuss the score interpretations for which validity and reliability data are available.

Comment: A prerequisite to sound test use is knowledge of the materials accompanying the instrument. As a minimum, these include manuals provided by the test developer. Ideally, the user should be conversant with relevant studies reported in the professional literature. The degree of reliability and validity required for sound score interpretations depends on the test's role in the assessment process and the potential impact of the process on the people involved. The test user should be aware of legal restrictions that may constrain the use of the test. On occasion, professional judgment may lead to the use of instruments for which there is little documentation of validity for the intended purpose. In these situations, the user should interpret scores cautiously and take care not to imply that the decisions or inferences are based on test results that are well-documented with respect to reliability or validity.

Standard 11.2
When a test is to be used for a purpose for which little or no documentation is available, the user is responsible for obtaining evidence of the test's validity and reliability for this purpose.

Comment: The individual who uses test scores for purposes that are not specifically recommended by the test developer is responsible for collecting the necessary validity evidence.
Support for such uses may sometimes be found in the professional literature. If previous evidence is not sufficient, then additional data should be collected. The provisions of this standard should not be construed to prohibit the generation of hypotheses from test data. For example, though some clinical tests have limited or contradictory validity evidence for common uses, clinicians generate hypotheses based appropriately on examinee responses to such tests. However, these hypotheses should be clearly labeled as tentative. Interested parties should be made aware of the potential limitations of the test scores in such situations.

AERA_APA_NCME_0000121

Standard 11.3
Responsibility for test use should be assumed by or delegated only to those individuals who have the training, professional credentials, and experience necessary to handle this responsibility. Any special qualifications for test administration or interpretation specified in the test manual should be met.

Comment: Test users should not attempt to interpret the scores of test takers whose special needs or characteristics are outside the range of the user's qualifications. This standard has special significance in areas such as clinical testing, forensic testing, testing in special education, testing people with disabilities or limited exposure to the dominant culture, and in other such situations where potential impact is great. When the situation falls outside the user's experience, assistance should be obtained. A number of professional organizations have codes of ethics that specify the qualifications of those who administer tests and interpret scores.

Standard 11.4
The test user should have a clear rationale for the intended uses of a test or evaluation procedure in terms of its validity and contribution to the assessment and decision-making process.

Comment: Justification for the role of each instrument in selection, diagnosis, classification, and decision making should be arrived at before test administration, not afterwards. Preferably, the rationale should be available in printed materials prepared by the test publisher or by the user.

Standard 11.5
Those who have a legitimate interest in an assessment should be informed about the purposes of testing, how tests will be administered, the factors considered in scoring examinee responses, how the scores are typically used, how long the records will be retained, and to whom and under what conditions the records may be released.

Comment: This standard has greater relevance and application to educational and clinical testing than to employment testing. In most uses of tests for screening job applicants and applicants to educational programs, for licensing professionals and awarding credentials, or for measuring achievement, the purposes of testing and the uses to be made of the test scores are obvious to the examinee. Nevertheless, it is wise to communicate this information at least briefly even in these settings. In some situations, however, the rationale for the testing may be clear to relatively few test takers. In such settings, a more detailed and explicit discussion may be called for. Retention and release of records, even when such release would clearly benefit the examinee, are often governed by statutes or institutional practices. As relevant, examinees should be informed about these constraints and procedures.

Standard 11.6
Unless the circumstances clearly require that the test results be withheld, the test user is obligated to provide a timely report of the results that is understandable to the test taker and others entitled to receive this information.

Comment: The nature of score reports is often dictated by practical considerations. In some cases only a terse printed report may be feasible. In others, it may be desirable to provide both an oral and a written report. The interpretation should vary according to the level of sophistication of the recipient. When the examinee is a young child, an explanation of the test results is typically provided to parents or guardians. Feedback in the form of a score report or interpretation is not typically provided when tests are administered for personnel selection or promotion.

AERA_APA_NCME_0000122

Standard 11.7
Test users have the responsibility to protect the security of tests, to the extent that developers enjoin users to do so.

Comment: When tests are used for purposes of selection, licensure, or educational accountability, the need for rigorous protection of test security is obvious. On the other hand, when educational tests are not part of a high-stakes program, some publishers consider teacher review of test materials to be a legitimate tool in clarifying teacher perceptions of the skills measured by a test. Consistency and clarity in the definition of acceptable and unacceptable practices is critical in such situations. When tests are involved in litigation, inspection of the instruments should be restricted, to the extent permitted by law, to those who are legally or ethically obligated to safeguard test security.

Standard 11.8
Test users have the responsibility to respect test copyrights.

Comment: Legally and ethically, test users may not reproduce copyrighted materials for routine test use without consent of the copyright holder. These materials, in both paper and electronic form, include test items, ancillary forms such as answer sheets or profile forms, scoring templates, conversion tables of raw scores to derived scores, and tables of norms.

Standard 11.9
Test users should remind test takers and others who have access to test materials that the legal rights of test publishers, including copyrights, and the legal obligations of other participants in the testing process may prohibit the disclosure of test items without specific authorization.

Standard 11.10
Test users should be alert to the possibility of scoring errors; they should arrange for rescoring if individual scores or aggregated data suggest the need for it.

Comment: The costs of scoring error are great, particularly in high-stakes testing programs. In some cases, rescoring may be requested by the test taker. If such a test taker right is recognized in published materials, it should be respected. In educational testing programs, users should not depend entirely on test takers to alert them to the possibility of scoring errors. Monitoring scoring accuracy should be a routine responsibility of testing program administrators wherever feasible.

Standard 11.11
If the integrity of a test taker's scores is challenged, local authorities, the test developer, or the test sponsor should inform the test takers of their relevant rights, including the possibility of appeal and representation by counsel.

Comment: Proctors in entrance or licensure testing programs may report irregularities in the test process that result in challenges. University admissions officers may raise challenges when test scores are grossly inconsistent with other applicant information. Test takers should be apprised of their rights in such situations.

AERA_APA_NCME_0000123
Standard 11.12
Test users or the sponsoring agency should explain to test takers their opportunities, if any, to retake an examination; users should also indicate whether the earlier as well as later scores will be reported to those entitled to receive the score reports.

Comment: Some testing programs permit test takers to retake an examination several times, to cancel scores, or to have scores withheld from potential recipients. If test takers have such privileges, they and score recipients should be so informed.

Standard 11.13
When test-taking strategies that are unrelated to the domain being measured are found to enhance or adversely affect test performance significantly, these strategies and their implications should be explained to all test takers before the test is administered. This may be done either in an information booklet or, if the explanation can be made briefly, along with the test directions.

Comment: Test-taking strategies, such as guessing, skipping time-consuming items, or initially skipping and then returning to difficult items as time allows, can influence test scores positively or negatively. The effects of various strategies depend on the scoring system used and aspects of item and test design such as speededness or the number of response alternatives provided in multiple-choice items. Differential use of such strategies by test takers can affect the validity and reliability of test score interpretations. The goal of test directions should be to convey information on the possible effectiveness of various strategies and, thus, to provide all test takers an equal opportunity to perform optimally. The use of such strategies by all test takers should be encouraged if their effect facilitates performance and discouraged if their effect interferes with performance.

Standard 11.14
Test users are obligated to protect the privacy of examinees and institutions that are involved in a measurement program, unless a disclosure of private information is agreed upon, or is specifically authorized by law.

Comment: Protection of the privacy of individual examinees is a well-established principle in psychological and educational measurement. In some instances, test takers and test administrators may formally agree to a lesser degree of protection than the law appears to require. In other circumstances, test users and testing agencies may adopt more stringent restrictions on the communication and sharing of test results than relevant law dictates. The more rigorous standards sometimes arise through the codes of ethics adopted by relevant professional organizations. In some testing programs the conditions for disclosure are stated to the examinee prior to testing, and taking the test can constitute agreement for the disclosure of test score information as specified. In other programs, the test taker or his/her parents or guardians must formally agree to any disclosure of test information to individuals or agencies other than those specified in the test administrator's published literature. It should be noted that the right of the public and the media to examine the aggregate test results of public school systems is guaranteed in some states.

Standard 11.15
Test users should be alert to potential misinterpretations of test scores and to possible unintended consequences of test use; users should take steps to minimize or avoid foreseeable misinterpretations and unintended negative consequences.

Comment: Well-meaning, but unsophisticated, audiences may adopt simplistic interpretations of test results or may attribute high or low scores or averages to a single causal factor. Experienced test users can sometimes anticipate such misinterpretations and should try to prevent them. Obviously, not every unintended consequence can be anticipated. What is required is a reasonable effort to prevent negative consequences and to encourage sound interpretations.

AERA_APA_NCME_0000124

Standard 11.16
Test users should verify periodically that their interpretations of test data continue to be appropriate, given any significant changes in their population of test takers, their modes of test administration, and their purposes in testing.

Comment: Over time, a gradual change in the demographic characteristics of an examinee population may significantly affect the inferences drawn from group averages. The accommodations made in test administration in recognition of examinee disabilities or in response to unforeseen circumstances may also affect interpretations.

Standard 11.17
In situations where the public is entitled to receive a summary of test results, test users should formulate a policy regarding timely release of the results and apply that policy consistently over time.

Comment: In school testing programs, districts commonly viewed as a coherent group may avoid controversy by adopting the same policies regarding the release of test results. If one district routinely releases aggregated data in much greater detail than another, groundless suspicions can develop that information is being suppressed in the latter district.

Standard 11.18
When test results are released to the public or to policymakers, those responsible for the release should provide and explain any supplemental information that will minimize possible misinterpretations of the data.

Comment: Preliminary briefings prior to the release of test results can give reporters for the news media an opportunity to assimilate relevant data. Misinterpretation can often be the result of the limited time reporters have to prepare media reports or inadequate presentation of information that bears on test score interpretation. It should be recognized, however, that the interests of the media are not always consistent with the intended purposes of measurement programs.

Standard 11.19
When a test user contemplates an approved change in test format, mode of administration, instructions, or the language used in administering the test, the user should have a sound rationale for concluding that validity, reliability, and appropriateness of norms will not be compromised.

Comment: In some instances, minor changes in format or mode of administration may be reasonably expected, without evidence, to have little or no effect on validity, reliability, and appropriateness of norms. In other instances, however, changes in format or administrative procedures can be assumed a priori to have significant effects. When a given modification becomes widespread, consideration should be given to validation and norming under the modified conditions.

Standard 11.20
In educational, clinical, and counseling settings, a test taker's score should not be interpreted in isolation; collateral information that may lead to alternative explanations for the examinee's test performance should be considered.

AERA_APA_NCME_0000125

Comment: It is neither necessary nor feasible to make an intensive review of every test taker's score. In some settings there may be little or no collateral information of value. In counseling, clinical, and educational settings, however, considerable relevant information is likely to be available.
Obvious alternative explanations of low scores include low motivation, limited fluency in the language of the test, unfamiliarity with cultural concepts on which test items are based, and perceptual or motor impairments. In clinical and counseling settings, the test user should not ignore how well the test taker is functioning in daily life.

Standard 11.21
Test users should not rely on computer-generated interpretations of test results unless they have the expertise to consider the appropriateness of these interpretations in individual cases.

Comment: The scoring agency has the responsibility of documenting the basis for the interpretations. The user of a computerized scoring and reporting service has the obligation to be familiar with the principles on which such interpretations were derived. The user should have the ability to evaluate a computer-based score interpretation in the light of other relevant evidence on each test taker. Automated, narrative reports are not a substitute for sound professional judgment.

Standard 11.22
When circumstances require that a test be administered in the same language to all examinees in a linguistically diverse population, the test user should investigate the validity of the score interpretations for test takers believed to have limited proficiency in the language of the test.

Comment: The achievement, abilities, and traits of examinees who do not speak the language of the test as their primary language may be seriously mismeasured by the test. The scores of test takers with severe linguistic limitations will probably be meaningless. If language proficiency is not relevant to the purposes of testing, the test user should consider excusing these individuals, without prejudice, from taking the test and substituting alternative evaluation methods. However, it is recognized that such actions may be impractical, unnecessary, or legally unacceptable in some settings.
Standard 1'1.23 If â tesr is mandated for persons of a given or all students in a particular grade, identiff individuals whose disabilities or linguistic bacþround indicates the need for special accommodations in test age users should administration and ensu¡e that these accommodations are employed. C0 mment: Appropriate accommodations depend upon the narure of the tesr and rhe needs of rhe tesr rake¡. The mandaring aur-horiry has primary responsibiliry for defining rlre acceprable accommodations for va¡ious categories of tesr akers. The user musr rake responsibiliry for identifring those test takers who fall within rhese ategories and implement the appropriate accommodarions. Standard 11.24 When a major purpose of testing is to describe the status of a local, regional, or particulæ sminee population, dre program crite¡ia fo¡ inclusion o¡ exclusion ofindìviduals should be stricdy adhered to. Comment: In census-rype programs, biased results can arise from the exdusion ofparticular subgroups ofsrudents. Financial and othe¡ advanrages may eccrue eirher lrom exa&gerating or from reducing rhe proportion ol highachieving or low-achieving srudenr. Clear.l¡ these are unprofessional practices. l lB AERA_APA_NCME-OOOO 1 26 12. PSVTHOLTGEGAL TËSTNTüG ANf} ASSESS[ffi8ruT Background This chapcer add¡esses issues imporranr to professionals who use psychologiøl tesrs with their clienrs. Topics include test selection and administration, resr interprerarion, collaceral informadon used in psychological tacing, gpes o[tests, a¡d purposes o[testing- The rypes ol psychological tesrs reviewed in this chapter include cognitive and neuropsychological; adapdve, sociai, and problcm behavior; family and couples; persona.liry; and vocarional. [n addition, rhe chaprer includes an overview of four common uses of psychological resrs: diagnosis; intervention planning and outcome evaluation; lega.l and governmenral decisions; and personal ewareness, growth, and action. 
Employment testing is another context in which psychological testing is used. The standards in this chapter are applicable to those employment settings in which individual in-depth assessment is conducted (e.g., an evaluation of a candidate for a senior executive position). Employment settings in which tests are designed to measure specific job-related characteristics across multiple candidates are treated in the text and standards of chapter 14.

For all professionals who use tests, knowledge of cultural background and physical capabilities that influence (a) a test taker's development, (b) the methods for obtaining and conveying information, and (c) the planning and implementation of interventions is critical. Therefore, readers are encouraged to review chapters 7, 8, 9, and 10 that discuss fairness and bias in testing, the rights and responsibilities of test takers, testing individuals of diverse linguistic backgrounds, and testing individuals with disabilities. Readers will find important additional detail on validity; reliability; test development; scaling; test administration, scoring, and reporting; and general responsibilities of test users in chapters 1, 2, 3, 4, 5, and 11, respectively.

The use of tests provides one method of collecting information within the larger framework of a psychological assessment of an individual. Typically, psychological assessments involve an interaction between a professional who is trained and experienced in testing and a client. Clients may include patients, counselees, parents, employees, employers, attorneys, students, and other responsible parties who are test takers or who use the test results contained in psychological reports. The results from tests and inventories, used within the context of a psychological assessment, may help the professional to understand the client more fully and to develop more informed and accurate hypotheses, inferences, and decisions about a client's situation.
A psychological assessment is a comprehensive examination undertaken to answer specific questions about a client's psychological functioning during a particular time interval or to predict a client's psychological functioning in the future. An assessment may include administering and scoring tests, and interpreting test scores, all within the context of the individual's personal history. Inasmuch as test scores characteristically are interpreted in the context of other information about the client, an individual psychological assessment usually also includes interviewing the client; observing client behavior; reviewing educational, psychological, and other relevant records; and integrating these findings with other information that may be provided by third parties. The tasks of a psychological assessment (collecting, evaluating, integrating, and reporting salient information relevant to those aspects of a client's functioning that are under examination) comprise a complex and sophisticated set of professional activities.

The interpretation of tests and inventories can be a valuable part of the intervention process and, if used appropriately, can provide useful information to clients as well as to other users of the test interpretation. For example, the results of tests and inventories may be used to assess the psychological functioning of an individual; to assign diagnostic classifications; to detect neuropsychological impairment; to assess cognitive and personality strengths, vocational interests, and values; to determine developmental stages; and to evaluate treatment outcomes.
Test results also may provide information used to make decisions that have a powerful and lasting impact on people's lives (e.g., vocational and educational decision making; diagnosis; treatment planning; selection decisions; intervention and outcome evaluation; parole, sentencing, civil commitment, child custody, and competency to stand trial decisions; and personal injury litigation).

Test Selection and Administration

Prior to beginning the assessment process, the test taker should understand who will have access to the test results and the written report, how test results will be shared with the test taker, and if and when decisions based on the test results will be shared with the test taker and/or a third party.

The assessment process begins by clarifying, as much as is possible, the reasons for which a client is presented for assessment. Guided by these reasons or other relevant concerns, the tests, inventories, and diagnostic procedures to be used are chosen, and other sources of information needed to evaluate the client and the referral issues are identified. The professional reviews more than the name of the test in choosing a test and is guided by the validity and reliability evidence and the applicability of the normative data available in the test's accumulated research literature. In addition to being thoroughly versed in proper administrative procedure, the professional is responsible for being familiar with the validity and reliability evidence for the intended use and purposes of the tests and inventories selected and for being prepared to develop a logical analysis that supports the various facets of the assessment and the inferences made from the assessment.

Validity and reliability considerations are paramount, but the demographic characteristics (e.g., gender, age, income, sociocultural and language background, education and other socioeconomic variables) of the group for which the test was originally constructed and for which initial and subsequent normative data are available also are important test selection issues. Selecting a test with demographically appropriate normative groups relevant for the client being tested is important to the generalizability of the inferences that the professional seeks to make. Sometimes the items or tasks contained in a test are designed for a particular group and are viewed as irrelevant for another group. A test constructed for one group may be applied to other groups with appropriate qualifications that explain the test choice based on the supporting research data and on professional experience.

The selection of psychological tests and inventories, for a particular client, often is individualized. However, in some settings a predetermined battery of tests may be taken by all participants, and group interpretations may be provided. The test taker may be a child, an adolescent, or an adult.

The settings in which the tests or inventories are used include (but are not limited to) preschool, elementary, middle, or secondary schools; colleges or universities; pre-employment or employment settings; mental health or outpatient clinics; hospitals; prisons; or professionals' offices.

Professionals who oversee testing and assessment are responsible for ensuring that all persons who administer and score tests have received the appropriate education and training needed to perform these tasks. In addition, they are responsible in group testing situations for ensuring that the individuals who use the test results are trained to interpret the scores properly.

When conducting psychological testing, standardized test administration procedures should be followed. When nonstandard administration procedures are needed, they are to be described and justified. Professionals also are responsible for ensuring that testing conditions are appropriate. For example, the examiner may need to determine if the client is capable of reading at the level required, and if clients with vision, hearing, or neurological disabilities are adequately accommodated. Finally, professionals are responsible for protecting the confidentiality and security of the test results and the testing materials.

One advantage of individually administered measures is the opportunity to observe and adjust testing conditions as needed. In some circumstances, test administration may provide the opportunity for skilled examiners to carefully observe the performance of persons under standardized conditions. For example, their observations may allow them to more accurately record behaviors being assessed, to understand better the manner in which persons arrive at their answers, to identify personal strengths and weaknesses, and to make modifications in the testing process. Thus, the observations of trained professionals can be important to all aspects of test use.

Test Score Interpretation

Test scores ideally are interpreted in light of the available normative data, the psychometric properties of the test, the temporal stability of the constructs being measured, and the effect of moderator variables and demographic characteristics (e.g., gender, age, income, sexual orientation, sociocultural and language background, education, and other socioeconomic variables) on test results. The professional rarely has the resources available to personally conduct the research or to assemble representative norms needed to make accurate inferences about each individual client's current and future functioning. Therefore, the professional may rely on the research and the body of scientific knowledge available for the test that warrants appropriate inferences. Presentation and analyses of validity and reliability evidence often are not needed in a written report, but the professional strives to understand, and prepares to articulate, such evidence as the need arises.

Tests and inventories that meet high technical standards of quality are a necessary but not a sufficient condition to ensure the responsible use and interpretation of test scores. The level of competence of the professional who interprets the scores and integrates the inferences derived from psychological tests depends upon the educational and experiential qualifications of the professional. With experience, professionals learn that the challenges in psychological test score interpretation increase in magnitude along a continuum of professional judgment with brief screening inventories at one end of the continuum and comprehensive multidimensional assessments at the other. For example, the interpretations of achievement and ability test scores, personality test scores, and batteries of neuropsychological test scores represent points on a continuum that require increasing levels of specialized knowledge, judgment, and skill by an experienced professional regardless of the soundness of the technical characteristics of the tests being used. The education and experience necessary to administer group tests and/or proctor computer-administered tests generally are less stringent than are the qualifications necessary to interpret individually administered tests. The use and interpretation of individually administered tests requires completion of rigorous educational and applied training, a high degree of professional judgment, appropriate credentialing, and adherence to the professional's ethical guidelines.

When making inferences about a client's past, present, and future behaviors and other characteristics from test scores, the professional reviews the literature to develop familiarity with supporting evidence. When there is strong evidence supporting the reliability and validity of a test, including its applicability to the client being assessed, the professional's ability to draw inferences increases. Nevertheless, the professional still corroborates results from testing with additional information from a variety of sources [illegible]. When an inference is based on a single study or based on several studies whose samples are not representative of the client, the professional is more cautious about the inferences. Corroborating data from the assessment's multiple sources of information, including stylistic and test-taking behaviors inferred from observations during the test, will strengthen the confidence placed in the inference. Importantly, data that are not supportive of the inference are acknowledged and either reconciled or noted as limits to the confidence placed in the inference.

An interpretation of a test taker's test scores based upon existing research examines not only the demonstrated relationship between the scores and the criterion or criteria, but also the appropriateness of the latter. The criterion and the chosen predictor test or tests are subjected to a similar examination to understand the degree to which their underlying constructs are congruent with the inferences under consideration. Threats to the interpretability of obtained scores are minimized by clearly defining how particular psychological tests are used.
These threats occur as a result of construct-irrelevant variance (i.e., aspects of the test that are not relevant to the purpose of the test scores) and construct underrepresentation (i.e., important facets relevant to the purpose of the testing, but for which the test does not account). A client's response bias is another example of a construct-irrelevant component that may significantly skew the obtained scores, possibly rendering the scores uninterpretable. In situations where response bias is anticipated, the professional may choose a test that has scales (e.g., faking good, faking bad, social desirability, percent yes, percent no) that clarify the threats to validity from the test taker's response bias. In so doing, the professional may be able to assess the degree to which test takers are acquiescing to the perceived demands of the test administrator or attempting to portray themselves as impaired by "faking bad," or well-functioning by "faking good." In interpreting the test taker's obtained response bias score(s), the evidence of validity for constructs underlying each response bias scale, each scale's internal consistency, its interrelations with other scales, and evidence of validity are considered.

For some purposes, including career counseling and neuropsychological assessment, test batteries frequently are used. Such batteries often include tests of verbal ability, numerical ability, nonverbal reasoning, mechanical reasoning, clerical speed and accuracy, spatial ability, and language usage. Some batteries also include interest and personality inventories. When psychological test batteries incorporate multiple methods and scores, patterns of test results frequently are interpreted to reflect a construct or even an interaction among constructs underlying test performances. Higher order interactions among the constructs underlying configurations of test outcomes may be postulated on the basis of test score patterns.
The literature reporting evidence of reliability and validity that supports the proposed interpretations should be identifiable. If the literature is incomplete, the resulting inferences may be presented with the qualification that they are hypotheses for future verification rather than probabilistic statements that imply some known validity evidence.

Collateral Information Used in Psychological Testing and Psychological Assessment

The quality of psychological testing and psychological assessment is enhanced by obtaining credible collateral information from various third-party sources such as teachers, personal physicians, family members, and school or employment records. Psychological testing also is enhanced by using various methods to acquire information. Structured behavioral observations, checklists and ratings, interviews, and criterion- and norm-referenced measures are but a few of the methods that may be used to acquire information. The use of psychological tests also can be enhanced by acquiring information about multiple traits or attributes to help characterize a person. For example, an evaluation of career goals may be enhanced by obtaining a history of current and prior employment as well as by administering tests to assess academic aptitude and achievement, vocational interests, work values, and personality and temperament characteristics. The availability of information on multiple traits or attributes, when acquired from various sources and through the use of various methods, enables professionals to assess more accurately an individual's psychosocial functioning and facilitates more effective decision making.
Types of Psychological Tests

For purposes of this chapter, the types of psychological tests have been divided into five categories: cognitive and neuropsychological tests; adaptive, social, and problem behavior tests; family and couples tests; personality tests; and vocational tests.

Cognitive and Neuropsychological Testing

Tests often are used to assess various classes of cognitive and neuropsychological functioning including intelligence; broad ability domains (e.g., verbal, quantitative, and spatial abilities); and more focused domains (e.g., attention, sensorimotor functions, perception, learning, memory, reasoning, executive functions, and language). Overlap may occur in the constructs that are assessed by tests of differing functions or domains. In common with other types of tests, cognitive and neuropsychological tests require a minimally sufficient level of test-taker attentional capacity.

Cognitive Ability. Measures designed to quantify cognitive abilities are among the most widely administered tests. The interpretation of cognitive ability tests is guided by the theoretical constructs used to develop the test. Many cognitive ability tests consist of multidimensional test batteries that are designed to assess a broad range of abilities and skills. Individually administered test batteries also are required for testing for purposes such as diagnosing a cognitive disorder. Test results are used to draw inferences about a person's overall level of intellectual functioning as well as strengths and weaknesses in various cognitive abilities. Because each test in a battery examines a different function, ability, skill, or combination thereof, the test taker's performance can be understood best when scores are not combined or aggregated, but rather when each score is interpreted within the context of all other scores and other assessment data.
For example, low scores on timed tests alert the examiner to slowed responding as a problem that may not be apparent if scores on different kinds of tests are combined.

Attention. Attention refers to that class of functioning that encompasses arousal, establishment and deployment of sets, sustained attention, and vigilance as constructs. Tests may measure levels of alertness, orientation, and localization; the ability to focus, shift, and maintain attention and to track one or more stimuli under various conditions; span of attention; information processing speed and choice reaction time; and short-term information storage capacity. Scores for each aspect of attention that has been examined should be reported individually so that the nature of an attention disorder can be clarified.

Motor, Sensorimotor Functions, and Lateral Preferences. Visual, auditory, somatosensory, and other sensory sensitivity and discrimination can be measured by simple motor or verbal responses to selective stimulation upon command.

Perception and Perceptual Organization/Integration. This class of functioning involves reasoning and judgment as they relate to the processing and elaboration of complex sensory combinations and inputs. Tests of perception may emphasize immediate perceptual processing but also may require conceptualizations that involve some reasoning and judgmental processes. Some tests have a motor component ranging from a simple motor response to an elaborate construction. Also, [illegible] slow performance that may be caused by something other than perceptual dysfunction.

Learning and Memory. This class of functions involves the acquisition and retention of information beyond the attentional requirements of immediate or short-term information processing and storage. These tests may measure acquisition of new information through various sensory channels and by means of assorted test formats (e.g., word lists, prose passages, geometric figures, formboards, digits, and musical melodies). Memory tests also may require retention and recall of old information (e.g., personal data as well as commonly learned facts and skills).

Abstract Reasoning and Categorical Thinking. Tests of reasoning and thinking vary widely. They assess the examinee's ability to infer relationships or to respond to changing environmental circumstances and to act in goal-oriented situations.

Executive Functions. This class of functions is involved in the organized performances that are necessary for the independent, purposive, and effective attainment of personal goals in various cognitive processing, problem-solving, and social situations. Some tests emphasize reasoned plans of action that anticipate consequences of alternative solutions, motor performance in problem-solving situations that require goal-oriented intentions, and regulation of performance for achieving a desired outcome.

Language. Language assessment typically focuses on phonology, morphology, syntax, semantics, and pragmatics. Receptive and expressive language functions may be assessed, including listening, reading, talking, and written language skills and abilities. Assessment of central language disorders focuses on functional speech and verbal comprehension measured through oral, written, or gestural modes; lexical access and elaboration; repetition of spoken language; and associative verbal fluency. When assessing persons who are nonnative English speakers or who are bilingual or [illegible] an assessment of language competence and the order of dominance among the different languages.
If a multilingual person is assessed for a possible language disorder, one issue for the professional to consider is the degree to which the disorder may be due more directly to language-related qualities (e.g., phonological, morphological, syntactic, semantic, pragmatic delays; mental retardation; peripheral sensory or central neurological impairment; psychological conditions; hearing disorders) than to dominance of a non-English language.

Academic Achievement. Academic achievement tests are measures of academic knowledge and skills that a person has acquired in formal and informal learning opportunities. Two major types of academic achievement tests include general achievement batteries and diagnostic achievement tests. General achievement batteries are designed to assess a person's level of learning in multiple areas (e.g., reading, mathematics, spelling, social studies, science). Diagnostic achievement tests, on the other hand, typically focus on one particular subject area (e.g., reading) and assess important academic skills in greater detail. Test results are used to determine the test taker's strengths as well as specific difficulties and may help identify sources of the difficulties and ways to overcome them. Chapter 13 provides additional detail on academic achievement testing in educational settings.

Social, Adaptive, and Problem Behavior Testing

Measures of social, adaptive, and problem behaviors assess ability and motivation to care for one's self and to relate to others. Adaptive behaviors include a repertoire of knowledge, skills, and abilities that enable a person to meet the daily demands and expectations of the environment, such as eating, dressing, using transportation, interacting with peers, communicating with others, making purchases, managing money, maintaining a schedule, remaining in school, and maintaining a job.
Problem behaviors include behavioral adjustment difficulties that interfere with a person's effective functioning in daily life situations.

Family and Couples Testing

Family testing addresses the issues of family dynamics, cohesion, and interpersonal relations among family members including partners, parents, children, and extended family members. Tests developed to assess families and couples are distinguished by measuring the interaction patterns of partial or whole families, requiring simultaneous focus on two or more family members in terms of their transactions. Testing with couples may address personal factors such as issues of intimacy, compatibility, shared interests, trust, and spiritual beliefs.

Personality Testing

Broadly considered, the assessment of personality requires a synthesis of aspects of an individual's functioning that contribute to the formulation and expression of thoughts, attitudes, emotions, and behaviors. In the assessment of an individual, cognitive and emotional functioning may be considered separately, but their influences are interrelated. For example, a person whose perceptions are highly accurate, or who is relatively stable emotionally, may be able to control suspiciousness better than can a person whose perceptions are inaccurate or distorted or who is emotionally unstable. Scores on a personality test may be regarded as reflecting the underlying theoretical constructs or empirically derived scales or factors that guided the test's construction.

The stimulus and response formats of personality tests vary widely. Some include a series of questions (e.g., self-report inventories) to which the test taker is required to choose from several well-defined options; others involve being placed in a novel situation in which the test taker's response is not completely structured (e.g., responding to visual stimuli, telling stories, discussing pictures, or responding to other projective stimuli). The responses are scored and combined into either logically or statistically derived dimensions established by previous research.

Personality tests may be designed to focus on the assessment of normal or abnormal attitudes, feelings, traits, and related characteristics. Tests intended to measure normal personality characteristics are constructed to yield scores reflecting the degree to which a person manifests personality dimensions empirically identified and hypothesized to be present in the behavior of most individuals. A person's configuration of scores on these dimensions is then used to infer how the person behaves presently and how she/he may behave in new situations. Test scores outside of the expected range may be considered extreme expressions of normal traits or indicative of psychopathology. Such scores also may reflect normal functioning of the person within a culture different from that of the normative population sample.

Other personality tests are designed specifically to measure constructs underlying abnormal functioning and psychopathology. Developers of some of these tests use previously diagnosed individuals to construct their scales and base their inferences on the association between the test's scale scores, within a given range, and the behavioral correlates of persons who scored within that range. If inferences made from scores go beyond the theory that guided the test's construction, then the inferences must be validated by collecting and analyzing additional relevant data.

Vocational Testing

Vocational testing generally includes the measurement of interests, work needs, and values, as well as consideration and assessment of related elements of career development, maturity, and indecision. The results from inventories that assess these constructs often are used for enhancing personal growth and understanding, career counseling, outplacement counseling, and vocational decision making. These interventions frequently take place in the context of educational settings. However, interest inventories and measures of work values also may be used in workplace settings as part of training and development programs, for career planning, or for selection, placement, and advancement decisions.

Interest Inventories. The measurement of interests is designed to identify a person's preferences for various activities. Self-report interest inventories are widely used to assess personal preferences including likes and dislikes for various work and leisure activities, school subjects, occupations, or types of people. The resulting scores may provide insight into types and patterns of differential interests in educational curricula (e.g., college majors), in different fields of work (e.g., specific occupations), or in more or basic areas of interests related to specific activities (e.g., sales, office practices, or general mechanical activities).

Work Values Inventories. The measurement of work values identifies a person's preferences for the various reinforcements one may obtain from work activities. Sometimes these values are identified as needs that persons seek to satisfy. Work values or needs may be categorized as intrinsic and important for the pleasure gained from the activity (e.g., independence, ability utilization, achievement) or as extrinsic and important for the rewards they bring (e.g., coworkers, supervisory relations, working conditions). The format of work values tests usually involves a self-rating of the importance of the value associated with qualities described by the items.

Measures of Career Development, Maturity, and Indecision. Additional areas of vocational assessment include measures of career development and maturity and measures of career indecision.
Inventories that measure career development and maturity typically elicit client self-descriptions in response to items that inquire about the individual's knowledge of the world of work; self-appraisal of one's decision-making skills; attitudes toward careers and career choices; and the degree to which the individual already has engaged in career planning. Measures of career indecision usually are constructed and standardized to assess both the level of career indecision of a client as well as the reasons for, or antecedents of, indecision. Such career development, maturity, and indecision findings may be used with individuals and groups to guide the design and delivery of career services and to evaluate the effectiveness of career interventions.

Purposes of Psychological Testing

For purposes of this chapter, psychological test uses have been divided into four categories: testing for diagnosis; intervention planning and outcome evaluation; legal and governmental decisions; and personal awareness, growth, and action. However, these categories are not always mutually exclusive.

Testing for Diagnosis

Diagnosis refers to a process that includes the collection and integration of test results with prior and current information about a person together with relevant contextual conditions to identify characteristics of healthy psychological functioning as well as psychological disorders. Disorders may manifest themselves in information obtained during the testing of an individual's cognitive, emotional, social, personality, neuropsychological, physical, perceptual, and motor attributes.

Psychodiagnosis. Psychological tests are helpful to professionals involved in the psychological diagnosis of an individual. Testing may be performed to confirm a hypothesized diagnosis or to rule out alternative diagnoses. Psychodiagnosis is complicated by the prevalence of comorbidity between diagnostic categories.
For example, a client diagnosed as suffering from schizophrenia simultaneously may be diagnosed as suffering from depression. Or, a child diagnosed as having a learning disability also may be diagnosed as suffering from an attention deficit disorder. The goal of psychodiagnosis is to assist each client in receiving the appropriate interventions for the psychological or behavioral dysfunctions that the client, or a third party, views as impairing the client's expected functioning and/or enjoyment of life. In developing treatment plans, professionals often use noncategorical diagnostic descriptions of client functioning along treatment-relevant dimensions (e.g., degree of anxiety, amount of suspiciousness, openness to interpretations, amount of insight into behaviors, and level of intellectual functioning).

The first step in evaluating a test's suitability to yield scores or information indicative of a particular diagnostic syndrome is to compare the construct that the test is intended to measure with the symptomatology described in the diagnostic criteria. This step is important because different diagnostic systems may use the same diagnostic term to describe different symptoms; even within one diagnostic system the symptoms described by the same term may differ between editions of the manual identifying the diagnostic criteria. Similarly, a test that uses a diagnostic term in its title may differ significantly from another test using a similar title or from a subscale with the same term. For example, some diagnostic systems may define depression by behavioral symptomatology (e.g., psychomotor retardation, disturbance in appetite or sleep) or by affective symptomatology (e.g., dysphoric feeling, emotional flatness) or by cognitive symptomatology (e.g., thoughts of hopelessness, morbidity) or some other symptomatology. Further, rarely are the symptoms of diagnostic categories mutually exclusive. Hence, it can be expected that a given symptom may be shared by several diagnostic categories. More knowledgeable and precisely drawn inferences relating to a diagnosis may be obtained from test scores if appropriate weight is given to the symptoms included in the diagnostic category and to the suitability of each test to assess the symptoms.

Different methods may be used to assess particular diagnostic categories. Some methods rely primarily on structured interviews using a "yes" or "no" format in which the professional is interested in the presence or absence of diagnosis-specific symptomatology. Other methods often rely principally on tests of personality or cognitive functioning and use configurations of obtained scores. These configurations of scores indicate the degree to which a client's responses are similar to those of individuals who have been determined by prior research to belong to a specific diagnostic group.

Diagnoses made with the help of test scores typically are based on empirically demonstrated relationships between the test score and the diagnostic category. Validity studies that demonstrate relationships between test scores and diagnostic categories currently are available for some diagnostic categories. Sometimes tests that do not have supporting validity studies also may be useful to the professional in arriving at a diagnosis. This also may occur, for example, when the symptoms assessed by a test are a subset of the criteria that comprise a particular diagnostic category.
While it often is not feasible for individual professionals to personally conduct research into relationships between obtained scores and inferences, their familiarity with the body of the research literature that examines these relationships is important. The professional often can enhance the diagnostic inferences derived from test scores by integrating the test results with inferences made from other sources of information regarding the client's functioning such as self-reported history or information provided by significant others or systematic observations in the natural environment or in the testing setting. In arriving at a diagnosis, a professional also looks for information that does not corroborate the diagnosis, and in those instances, places appropriate limits on the degree of confidence placed in the diagnosis. When relevant to the referral issue, the professional acknowledges alternative diagnoses that may require consideration. Particular attention is paid to all relevant available data before concluding that a client falls into a diagnostic category. Cultural sensitivity is paramount to avoid misdiagnosing and overpathologizing culturally appropriate behavior, affect or cognition. Tests also are used to assess the appropriateness of continuing the initial diagnostic characterization, especially after a course of treatment or if the client's psychological functioning has changed over time.

AERA_APA_NCME_0000135

Neuropsychodiagnosis. Neuropsychological testing analyzes the current psychological and behavioral status, including manifestations of neurological, neuropathological, and neurochemical changes that may arise during development or from brain injury or illness.
The purposes of neuropsychological testing typically include, but are not limited to, the following: differential diagnoses between psychogenic and neurogenic sources of cognitive, perceptual, and personality dysfunction; differential diagnoses between two or more suspected etiologies of cerebral dysfunction; evaluation of impaired functioning secondary to a cerebral, cortical, or subcortical event; establishment of neuropsychological baseline measurements for monitoring progressive cerebral disease or recovery effects; comparison of pre- and post-pharmacologic, surgical, behavioral, or psychological interventions; identification of patterns of higher cortical function and dysfunction for the formulation of rehabilitation strategies and for the design of remedial procedures; and characterizing brain-behavior functions to assist the trier of fact in criminal and civil legal actions.

TESTING FOR INTERVENTION PLANNING AND OUTCOME EVALUATION

Professionals often rely on test results for assistance in planning, executing, and evaluating interventions. Therefore, their awareness of validity information that supports or does not support the relationship between test results, prescribed interventions, and desired outcome is important. Interventions may be intended to prevent the onset of one or more symptoms, to stabilize or overcome them, to ameliorate their effects, to minimize their impact, and to provide for a person's basic physical, psychological, and social needs. Intervention planning typically occurs following an evaluation of the nature and severity of a disorder and a review of personal and contextual conditions that may impact its resolution. Subsequent evaluations may occur in an effort to diagnose further the nature and severity of the disorder, to review the effects of interventions, to revise them as needed, and to meet ethical and legal standards.

TESTING FOR JUDICIAL AND GOVERNMENTAL DECISIONS

Clients may voluntarily seek psychological testing as part of psychological assessments to assist in matters before a court or other governmental agencies. Conversely, courts or other governmental agencies sometimes require a client to submit involuntarily to a psychological or neuropsychological assessment that may involve a wide range of psychological tests. The goal of these psychological assessments is to provide important information to a third party, client's attorney, opposing attorney, judge, or administrative board about the psychological functioning of the client that has bearing on the legal issues in question.

At the outset of evaluations for judicial and government decisions, it is imperative to clarify the purpose of the evaluation, who will have access to the test results and the reports, and any rights that the client may have to refuse to participate in court-ordered evaluations.

The goals of psychological testing in judicial and governmental settings are informed and constrained by the legal issues to be addressed, and a detailed understanding of their salient aspects is essential. Legal issues may arise as part of a civil proceeding (e.g., involuntary commitment, testamentary capacity, competence to stand trial, parole, child custody, personal injury, discrimination issues), a criminal proceeding (e.g., competence to stand trial, not guilty by reason of insanity, mitigating circumstances in sentencing), determination of reasonable accommodations for employees with disabilities, or an administrative proceeding or decision (e.g., license revocation, parole, worker's compensation). Each of these legal issues is defined in law applicable to a particular legislative jurisdiction. The definition of each legal issue may be jurisdiction specific. For example, the criteria by which a person can be involuntarily committed often differ between legislative jurisdictions. Furthermore, tests initially administered for one purpose also may be used for another purpose (e.g., initially used for a civil case but later used in administrative or criminal proceedings).

AERA_APA_NCME_0000136

Legislatures, courts, and other administrative bodies often define legal issues in commonly used language, not in diagnostic or other technical psychological terms. The professional is responsible for explaining the diagnostic frame of reference, including test scores and inferences made from them, in terms of the legal criteria by which the jury, judge, or administrative board will decide the legal issue. For example, a diagnosis of schizophrenia or neuropsychological impairment, which does not also include a reference to the legal criteria, neither precludes an examinee from obtaining sole custody of children in a child custody dispute nor does it necessarily acquit a person of criminal responsibility.

In instances involving legal or quasi-legal issues, it is important to assess the examinee's test-taking orientation including response bias to ensure that the legal proceedings have not affected the responses given. For example, a person seeking to obtain the greatest possible monetary award for a personal injury may be motivated to exaggerate cognitive and emotional symptoms, while persons attempting to forestall the loss of a professional license may attempt to portray themselves in the best possible light by minimizing symptoms or deficits. In forming an assessment opinion, it is necessary to interpret the test scores with informed knowledge relating to the available validity and reliability evidence. When forming such opinions, it also is necessary to integrate a client's test scores with all other sources of information that bear on current status including psychological, medical, educational, occupational, legal, and other relevant collateral records.

Some tests are intended to provide information about a client's functioning that helps clarify a given legal issue (e.g., parental functioning in a child custody case or ability to understand charges against a defendant in competency to stand trial matters). The manuals of some tests also provide demographic and actuarial data for normative groups that are representative of persons involved in the legal system. However, many tests measure constructs that are generally relevant to the legal issues even though norms specific to the judicial or governmental context may not be available. Professionals are expected to make every effort to be aware of evidence of validity and reliability that supports or does not support their inferences and to place appropriate limits on the opinions rendered.

Test users who practice in judicial and government settings are expected to be aware of conflict of interest that may lead to bias in the interpretation of test results. Protecting the confidentiality of a client's test results and of the test instrument itself poses particular challenges for professionals involved with attorneys, judges, jurors, and other legal and quasi-legal decision makers. The test
taker does have a right to expect that test results will be communicated only to persons who are legally authorized to receive them and that other information from the testing session that is not relevant to the evaluation will not be reported. It is important for the professional to be apprised of possible threats to confidentiality and test security (e.g., releasing the test questions, the examinee's responses, and raw and scaled scores on tests to another qualified professional) and to seek, if necessary, appropriate legal and professional remedies.

TESTING FOR PERSONAL AWARENESS, GROWTH, AND ACTION

Tests and inventories frequently are used to provide information to help individuals to understand themselves, to identify their own strengths and weaknesses, and to otherwise clarify issues important to their own decision making and development. For example, test results from personality inventories may help clients better understand themselves and also understand their interactions with others. Results from interest inventories and tests of ability may be useful to individuals who are making educational and career decisions. Appropriate cognitive and neuropsychological tests that have been normed and standardized for children may facilitate the monitoring of development and growth during the formative years when relevant interventions may be more efficacious for preventing potentially disabling learning disabilities from being overlooked or misdiagnosed.

AERA_APA_NCME_0000137

Test results may be used for self-exploration, self-growth, and decision making in several ways. First, the results can provide individuals with new information that allows them to compare themselves with others or to evaluate themselves by focusing on self-descriptions and characterizations. Test results also may serve to stimulate discussions between a client and professional, to facilitate client insights, to provide directions for future considerations, to help individuals identify strengths and assets, and to provide the professional with a general framework for organizing and integrating information about an individual. Testing for personal growth may take place in training and development programs, within an educational curriculum, during psychotherapy, in rehabilitation programs, as part of an educational or career planning process, or in other situations.

Summary

The application of psychological tests continues to expand in scope and depth on a course that is characterized by an increasingly diverse set of purposes, procedures, and assessment needs and challenges. Therefore, the responsible use of tests in practice requires a commitment by the professional to develop and maintain the necessary knowledge and competence to select, administer, and interpret tests and inventories as crucial elements of the psychological testing and assessment process. The standards in this chapter provide a framework for guiding the professional toward achieving relevance and effectiveness in the use of psychological tests within the boundaries or limits defined by the professional's educational, experiential and ethical foundations. Earlier chapters and standards that are relevant to psychological testing and assessment describe general aspects of test quality (chapters 1-6, chapter 11), test fairness (chapters 7-10), and test use (chapter 11). Chapter 13 discusses educational applications; chapter 14 discusses test use in the workplace, including credentialing, and the importance of collecting data that provide evidence of a test's accuracy for predicting job performance; and chapter 15 discusses test use in program evaluation and public policy.

AERA_APA_NCME_0000138

STANDARDS

Standard 12.1

Those who use psychological tests should confine their testing and related assessment activities to their areas of competence, as demonstrated through education, supervised training, experience, and appropriate credentialing.

Comment: The responsible use and interpretation of test scores require appropriate levels of experience and sound professional judgment. Competency also requires sufficient familiarity with the population from which the test taker comes to allow appropriate interaction, test selection, test administration, and test interpretation. For example, when personality tests and neuropsychological tests are administered as part of a psychological assessment of an individual, the test scores must be understood in the context of the individual's physical and emotional state, as well as the individual's cultural, educational, occupational, and medical background, and must take into account other evidence relevant to the tests used. Test interpretation in this context requires professionally responsible judgment that is exercised within the boundaries of knowledge and skill afforded by the professional's education, training, and supervised experience.

Standard 12.2

Those who select tests and interpret test results should refrain from introducing biases that accommodate individuals or groups with a vested interest in decisions affected by the test interpretation.

Comment: Individuals or groups with a vested interest in the significance or meaning of the findings from psychological testing include many school personnel, attorneys, referring health professionals, employers, professional associates, and managed care organizations. In some settings, a professional may have a professional relationship with multiple clients (e.g., with both the test taker and the organization requesting assessment). A professional engaged in a professional relationship with multiple clients takes care to ensure that the multiple relationships do not become a conflict of interest that would occur when the professional's judgment toward one client is unduly influenced by his or her relationship with the other client. Test selections and interpretations that favor a special external expectation or perspective by deviating from established principles of sound test interpretation are unprofessional and unethical.

Standard 12.3

Tests selected for use in individual testing should be suitable for the characteristics and background of the test taker.

Comment: Considerations for test selection should include culture, language and/or physical requirements of the test and the availability of norms and evidence of validity for a population representative of the test taker. If no normative or validity studies are available for the population at issue, test interpretations should be qualified and presented as hypotheses rather than conclusions.

Standard 12.4

If a publisher suggests that tests are to be used in combination with one another, the professional should review the evidence on which the procedures for combining tests is based and determine the rationale for the specific combination of tests and the justification of the interpretation based on the combined scores.
Comment: For example, if measures of developed abilities (e.g., achievement or specific or general abilities) or personality are packaged with interest measures to suggest a requisite combination of scores, or a neuropsychological battery is being applied, then supporting validity data for such combinations of scores should be available.

AERA_APA_NCME_0000139

Standard 12.5

The selection of a combination of tests to address a complex diagnosis should be appropriate for the purposes of the assessment as determined by available evidence of validity. The professional's educational training and supervised experience also should be commensurate with the test user qualifications required to administer and interpret the selected tests.

Comment: For example, in a neuropsychological assessment for evidence of an injury to a particular area of the brain, it is necessary to select a combination of tests of known diagnostic sensitivity and specificity to impairments arising from trauma to various regions of the cerebral hemispheres.

Standard 12.6

When differential diagnosis is needed, the professional should choose, if possible, a test for which there is evidence of the test's ability to distinguish between the two or more diagnostic groups of concern rather than merely to distinguish abnormal cases from the general population.

Comment: Professionals will find it particularly helpful if evidence of validity is in a form that enables them to determine how much confidence can be placed in inferences regarding an individual. Differences between group means and their statistical significance provide inadequate information regarding validity for individual diagnostic purposes. Additional information might consist of confidence intervals, effect sizes, or a table showing the degree of overlap of predictor distributions among different criterion groups.

Standard 12.7

When the validity of a diagnosis is appraised by evaluating the level of agreement between test-based inferences and the diagnosis, the diagnostic terms or categories employed should be carefully defined or identified.

Standard 12.8

Professionals should ensure that persons under their supervision, who administer and score tests, are adequately trained in the settings in which the testing occurs and with the populations served.

Standard 12.9

Professionals responsible for supervising group testing programs should ensure that the individuals who interpret the test scores are properly instructed in the appropriate methods for interpreting them.

Comment: If, for example, interest inventories are given to college students for use in academic advising, the professional who supervises the academic advisors is responsible for ensuring that the advisors know how to provide an examinee an appropriate interpretation of the test results.

Standard 12.10

Prior to testing, professionals and test administrators should provide the test taker with appropriate introductory information in language understandable to the test taker. The test taker who inquires also should be advised of opportunities and circumstances, if any, for retesting.

Comment: The client should understand testing time limits, who will have access to the test results, if and when test results will be shared with the test taker, and if and when decisions based on the test results will be shared with the test taker.

AERA_APA_NCME_0000140

Standard 12.11

Professionals and others who have access to test materials and test results should ensure the confidentiality of the test results and testing materials consistent with legal and professional ethics requirements.
Comment: Professionals should be knowledgeable and conform to record-keeping and confidentiality guidelines required by the state or province in which they practice and the professional organizations to which they belong. Confidentiality has different meanings for the test developer, the test user, the test taker, and third parties (e.g., school, court, employer). To the extent possible, the professional who uses tests is responsible for managing the confidentiality of test information across all parties. It is important for the professional to be aware of possible threats to confidentiality and the legal and professional remedies available. Professionals also are responsible for maintaining the security of testing materials and for protecting the copyrights of all tests to the extent permitted by law.

Standard 12.12

The professional examines available norms and follows administration instructions, including calibration of technical equipment, verification of scoring accuracy and replicability, and provision of settings for testing that facilitate optimal performance of test takers. However, in those instances where realistic rather than optimal test settings will best satisfy the assessment purpose, the professional should report the reason for using such a setting and, when possible, also conduct the testing under optimal conditions to provide a comparison.

Comment: Because the normative data against which a client's performance will be evaluated were collected under the reported standard procedures, the professional needs to be aware of and take into account the effect that nonstandard procedures may have on the client's obtained score. When the professional uses tests that employ an unstructured response format, such as some projective techniques and informal behavioral ratings, the professional should follow objective scoring criteria, where available and appropriate, that are clear and minimize the need for the scorer to rely only on individual judgment. The testing may be conducted in a realistic, less than optimal, setting to determine how a client with an attentional disorder, for example, performs in a noisy or distracting environment rather than in an optimal environment that typically protects the test taker from such external threats to performance efficiency.

Standard 12.13

Those who select tests and draw inferences from test scores should be familiar with the relevant evidence of validity and reliability for tests and inventories used and should be prepared to articulate a logical analysis that supports all facets of the assessment and the inferences made from the assessment.

Comment: A presentation and analysis of validity and reliability evidence generally is not needed in a written report, because it is too cumbersome and of little interest to most report readers. However, in situations in which the selection of tests may be problematic (e.g., verbal subtests with deaf clients), a brief description of the rationale for using or not using particular measures is advisable. When potential inferences derived from psychological test data are not supported by evidence of validity yet may hold promise for future validation, they may be described by the test developer and professional as hypotheses for further validation in test interpretation. Such interpretive remarks should be qualified to communicate to the source of the referral that such inferences do not as yet have adequately demonstrated evidence of validity and should not be the basis for a diagnostic decision or prognostic formulation.

AERA_APA_NCME_0000141

Standard 12.14

The interpretation of test results in the assessment process should be informed when possible by an analysis of stylistic and other qualitative features of test-taking behavior that are inferred from observations during interviews and testing and from historical information.

Comment: Such features of test-taking behavior include manifestations of fatigue, momentary fluctuations in emotional state, rapport with the examiner, test taker's level of motivation, withholding or distortion of response as seen in instances of deception and malingering or in instances of pseudoneurological conditions, and unusual response or general adaptation to the testing environment.

Standard 12.15

Those who use computer-generated interpretations of test data should evaluate the quality of the interpretations and, when possible, the relevance and appropriateness of the norms upon which the interpretations are based.

Comment: Efforts to reduce a complex set of data into computer-generated interpretations of a given construct may yield grossly misleading or simplified analyses of meanings of test scores, that in turn may lead to faulty diagnostic and prognostic decisions as well as mislead the trier of fact in judicial and government settings.

Standard 12.16

Test interpretations should not imply that empirical evidence exists for a relationship among particular test results, prescribed interventions, and desired outcomes, unless empirical evidence is available for populations similar to those representative of the examinee.

Standard 12.17

Criterion-related evidence of validity should be available when recommendations or decisions are presented by the professional as having an actuarial basis.

Standard 12.18

The interpretation of test or test battery results generally should be based upon multiple sources of convergent test and collateral data and an understanding of the normative, empirical, and theoretical foundations as well as the limitations of such tests.

Comment: A given pattern of test performances represents a cross-sectional view of the individual being assessed within a particular context (i.e., medical, psychosocial, educational, vocational, cultural, ethnic, gender, familial, genetic, and behavioral). The interpretation of findings derived from a complex battery of tests in such contexts requires appropriate education, supervised experience, and an appreciation of procedural, theoretical, and empirical limitations of the tests.

Standard 12.19

The interpretation of test scores or patterns of test battery results should take cognizance of the many factors that may influence a particular testing outcome. Where appropriate, a description and analysis of the alternative hypotheses or explanations that may have contributed to the pattern of results should be included in the report.

Comment: Many factors (e.g., unusual testing conditions, motivation, educational level, employment status, lateral sensorimotor usage preferences, health, or disability status) may influence individual testing results. When such factors are known to introduce construct-irrelevant variance in component test scores, those factors should be considered during test score interpretations.

AERA_APA_NCME_0000142

Standard 12.20

Except for some judicial or governmental referrals, or in some employment testing situations when the client is the employer, professionals should share test results and interpretations with the test taker. Such information should be expressed in language that the test taker, or when appropriate the test taker's legal representative, can understand.
Comment: For example, in rehabilitation settings, where clients typically are required to participate actively in intervention programs, sharing of such information, expressed in terms that can be understood readily by the client and family members, may facilitate the effectiveness of intervention.

AERA_APA_NCME_0000143

13. EDUCATIONAL TESTING AND ASSESSMENT

Background

This chapter concerns testing in formal educational settings from kindergarten through postgraduate training. Results of tests administered to students are used to make judgments, for example, about the status, progress, or accomplishment of individuals or groups. Tests that provide information about individual performance are used to (a) evaluate a student's overall achievement and growth in a content domain, (b) diagnose student strengths and weaknesses in and across content domains, (c) plan educational interventions and to design individualized instructional plans, (d) place students in appropriate educational programs, (e) select applicants into programs with limited enrollment, and (f) certify individual achievement or qualifications. Tests that provide information about the status, progress, or accomplishments of groups such as schools, school districts, or states are used (a) to judge and monitor the quality of educational programs for all or for particular subsets of individuals, and (b) to infer the success of policies and interventions that have been selected for evaluation. These testing purposes are typically mandated by institutions such as schools and colleges and by governing bodies of public and privately administered educational programs.

In this chapter three broad areas of educational testing are considered that encompass one or more of the above purposes: (a) routine school, district, state, or other system-wide testing programs; (b) testing for selection in higher education; and (c) individualized and special needs testing. While the second and third areas refer to relatively specific purposes of testing, system-wide testing programs can encompass multiple individual and group purposes. For each of these areas, the chapter elaborates on the specific purposes and domains encompassed and raises specific issues of technical quality and fairness in testing that may not be addressed or emphasized in the preceding chapters. This chapter does not explicitly address issues related to tests constructed and administered by teachers for their own classroom use or provided by publishers of instructional materials. While many aspects of the Standards, particularly those in the areas of validity, reliability, test development, and fairness, are relevant to such tests, this document is not intended for tests used by teachers for their own classroom purposes.

Issues in Educational Testing

This chapter first considers some cross-cutting issues: the distinctions among types of tests, the design or use of tests to serve multiple purposes including the measurement of change, and the "stakes" associated with different purposes for testing in education.

DISTINCTIONS AMONG TYPES OF TESTS AND ASSESSMENTS

Tests used in educational settings range from tests consisting of traditional item formats such as multiple-choice items to performance assessments including scorable portfolios. Every test, regardless of its format, measures test-taker performance in a specified domain. Performance assessments, however, attempt to emulate the context or conditions in which the intended knowledge or skills are actually applied. As discussed in chapter 3, they are diverse in nature and can be product-based as well as behavior-based. The execution of the tasks posed in these tests often involves relatively extended time periods, ranging from a few minutes to a class period or more to several hours or days.
Examples of such performances might include solving problems using manipulable materials, making complex inferences after collecting information, or explaining orally or in writing the rationale for a particular course of government action under given economic conditions. The performance task may be undertaken by a single individual or a team of students.

AERA_APA_NCME_0000144

Performance assessments may require increased testing time to provide sufficient domain sampling for reasonable estimates of individual attainment and for making generalizations to the broader domain. Extended time periods, collaboration, and the use of ancillary materials pose great challenges to the standardization of administration and scoring of some performance assessments. This is particularly true when test takers define their own tasks or when they select their own work products for evaluation. When this is the case, test takers need to be aware of the basis for scoring as well as the nature of the criteria that will be applied. Further, performance assessments often require complex procedures and training to increase the accuracy of judgments made by those evaluating student performance (see chapter 3).

An individual portfolio may be used as another type of performance assessment. Scorable portfolios are systematic collections of educational products typically collected over time and possibly amended over time. The particular purpose of the portfolio determines whether it will include representative products, the best work of the student, or indicators of progress. The purpose also dictates who will be responsible for compiling the contents of the portfolio: the examiner, the student, or both parties working together. The more standardized the contents and procedures of administration, the easier it is to establish comparability of portfolio-based scores.
Establishing comparability requires portfolios to be constructed according to test specifications and standards, and the development of objective procedures to judge their quality. The test specifications for portfolios may indicate that students are to make certain decisions about the nature of the work to be included. For example, in constructing an art portfolio, students may select the media that best represent their work. Establishing comparability also requires specifications regarding the kinds of assistance the student may have received during portfolio preparation. It is particularly difficult to compare the performance of students whose portfolios may vary in content. All performance assessments, including scorable portfolios, are judged by the same standards of technical quality as traditional tests of achievement. Electronic media are often used both to present testing material and to record and score test takers' responses. These tests may be administered in schools, in special laboratory settings, or in external testing centers. Examples include simple enhancements of text by audio-taped instructions to facilitate student understanding, computer-based tests traditionally given in paper-and-pencil format, computer-adaptive tests, and newer, interactive multimedia testing situations where attributes of performance assessments are supported by computer. Some computer-based tests also may have the capacity to capture aspects of students' processes as they solve test items. They may, for example, monitor time spent on items, solutions tried and rejected, or editing sequences for texts. Electronic media also make it possible to provide test administration conditions designed to assist students with particular needs, such as those with different language backgrounds, attention problems, or physical disabilities. Computers can also help identify the contributions of individuals to a group task completed by a team or in geographically remote locations on a network.
Computer-based tests are evaluated by the same technical quality standards as other tests administered through more traditional means. It is especially important that test takers be familiarized with the media of the test so that any unfamiliarity with computers or strategies does not lead to inferences based on construct-irrelevant variance. Furthermore, it is important to describe scoring algorithms, expert models upon which they may be based, and technical data supporting their use in any documentation accompanying the testing system. It is important, however, to assure that the documentation does not jeopardize the security of the items that could adversely affect the validity of score interpretations. Some computer-based tests may also generate recommendations for instructional practices based on test results. Describing the basis for these recommendations assists the user in evaluating their applicability in a given situation.

MULTIPLE PURPOSES AND MEASURING CHANGE

Many tests are designed or used to serve multiple purposes in education. For example, a test may be used to monitor individual student achievement as well as to evaluate the quality of educational programs at the school or district level. As another example, a test may be used to evaluate an individual's performance relative to the performance of one or more reference populations as well as to evaluate the level of the individual's competence in some defined domain (see chapters 3 and 4). The evidence needed for the technical quality of one purpose, however, will differ from the evidence needed for another purpose. Consequently, it is important to evaluate the evidence of technical quality for each purpose of testing. Test results may be used to infer the growth or progress as well as the status of individuals or groups of students, such as when tests are expected to reveal the effects of instruction, of changes in educational policy, or of other interventions. In such cases, the test's ability to detect change is essential. If differences in scores are reported, the technical quality of the differences needs attention. More generally, whenever inferences about growth or progress are made, it is important to evaluate the validity of those inferences.

STAKES OF TESTING

The importance of the results of testing programs for individuals, institutions, or groups is often referred to as the stakes of the testing program. At the individual level, when significant educational paths or choices of an individual are directly affected by test performance, such as whether a student is promoted or retained at a grade level, graduated, or admitted or placed into a desired program, the test use is said to have high stakes. A low-stakes test, on the other hand, is one administered for informational purposes or for highly tentative judgments such as when test results provide feedback to students, teachers, and parents on student progress during an academic period. Testing programs for institutions can have high stakes when aggregate performance of a sample or of the entire population of test takers is used to infer the quality of service provided, and decisions are made about institutional status, rewards, or sanctions based on test results. For example, the quality of reading curriculum and instruction may be judged on the basis of test results because test scores can indicate the rate of student progress or the levels of attainment reached by groups of students. Even when test results are reported in the aggregate and intended for a low-stakes purpose such as monitoring the educational system, the public release of data can raise the stakes for particular schools or districts. Judgments about program quality, personnel, and educational programs might be made and policy decisions might be affected, even though the tests were not intended or designed for those purposes. The higher the stakes associated with a given test use, the more important it is that test-based inferences are supported with strong evidence of technical quality. In particular, when the stakes for an individual are high, and important decisions depend substantially on test performance, the test needs to exhibit higher standards of technical quality for its avowed purposes than might be expected of tests used for lower-stakes purposes (see chapters 1, 2, and 7 for a more thorough discussion on validity, reliability, and bias in testing, respectively). Although it is never possible to achieve perfect accuracy in describing an individual's performance, efforts need to be made to minimize errors in estimating individual scores or in classifying individuals in pass/fail or admit/reject categories. Further, enhancing validity for high-stakes purposes, whether individual or institutional, typically entails collecting sound collateral information both to assist in understanding the factors that contributed to test results and to provide corroborating evidence that supports inferences based on test results. These issues will be addressed more fully as they relate to the three areas of testing described below.

School, District, State, or Other System-Wide Testing Programs

As indicated previously, system-wide testing programs can span multiple purposes. At the individual level, tests are used for low-stakes purposes, such as monitoring and providing feedback on student progress, and for more high-stakes purposes, such as certifying students' acquisition of particular knowledge and skills for promotion, placement into special instructional programs, or graduation.
At the school, district, state, or other aggregate level, a common purpose of tests is to evaluate the progress made by groups of students or to monitor the long-term effectiveness of the overall educational system. Educational testing programs may also permit comparisons among the performance of various groups of students in different programs or in diverse settings for the purpose of making an evaluation of those learning environments. Chapter 15 provides a more thorough discussion on program evaluation. In these contexts, educational tests are designed to measure certain aspects of students' knowledge and skills as reflected in curriculum goals and standards. There may be considerable variation in the breadth and depth of the knowledge and skills that are measured by such tests. Some educational tests focus on the test takers' general ability or knowledge in a particular content area, such as their understanding of mathematics or science. Other tests focus on test takers' specific knowledge of a topic in detail, such as trigonometry. Still others emphasize specific skills or procedures, such as the ability to write persuasively or to design, conduct, and interpret the results of a scientific experiment. Tests may address other cognitive aspects of test takers' development, such as their ability to work with others to solve problems or their self-reported habits and attitudes, as well as noncognitive aspects, such as students' ability to perform particular physical tasks. In most cases, valid interpretation of the results requires that evidence of the fit between the test domain and the relevant curriculum goals or standards be ascertained. Testing programs may involve the use of tests designed to represent a set of general educational standards as determined for instance by the state, district, or relevant educational professional organization.
Such tests are conceptually similar to criterion-referenced tests, in that a set of content standards is developed that is intended to provide broad specifications for student performance by delimiting the content and general skills to be measured. Subsequently, descriptive or empirical targets or levels of achievement are developed and referred to as performance standards. These performance standards are intended to define further the knowledge and skills required of students for each of the different categories of proficiency. This type of testing may involve the development of a new test to assess the relevant content and skills or the selection of an existing test that can be referenced to the standards. Whether a test is designed or selected, valid interpretation of the results in light of the standards entails assessment of the degree of fit between the test domain and content and the descriptive statements of standards or goals. This involves a process of mapping or referencing the content and skills of the test to those of the standards to be sure that gaps or imbalances do not occur. The curriculum goals or standards may be sufficiently broad to encompass many different ways for students to demonstrate their status, accomplishments, or progress. Some goals or standards may not lend themselves to conventional test formats. These are cases in which the test may result in construct underrepresentation that refers to the extent to which a test fails to capture important aspects of what it is intended to measure. Chapter 1 provides a more thorough discussion of construct underrepresentation. In these cases, interpretation of test results in light of goals or standards is enhanced by an understanding of what is not covered as well as what is covered by the test.
Sometimes, additional commercial or locally developed tests are administered within a particular jurisdiction, and attempts are made to link these existing tests to the proficiency levels reported for the new test or to provide other evidence of comparability. It is important to provide logical and empirical validity evidence of any reported links. For example, evidence can be collected to determine the extent to which the existing test can provide information about the proficiency of individual students and groups of students in the particular content areas and skills addressed by the standards. The validity of such links is problematic to the extent that the tests measure different content (see chapter 4 for a discussion on issues in equating and linking tests). When inferences are to be drawn about the performance of groups of students, practical considerations and the format of the test (e.g., performance assessment) often dictate that different subgroups of students within each unit respond to different sets of tasks or items, a procedure referred to as matrix sampling. This matrix sampling approach allows for a test to better represent the breadth of the target domain without increasing the testing time for each test taker. Group-level results are most useful when testing programs and student populations remain sufficiently stable to provide information about trends over time. When a testing program is designed for group-level reporting and employs matrix sampling, reporting individual scores generally is not appropriate. When interpreting and using scores about individuals or groups of students, consideration of relevant collateral information can enhance the validity of the interpretation, by providing corroborating evidence or evidence that helps explain student performance.
Test results can be influenced by multiple factors, including institutional and individual factors such as the quality of education provided, students' exposure to education (e.g., through regular school attendance), and students' motivation to perform well on the test. As the stakes of testing increase for individual students, the importance of considering additional evidence to document the validity of score interpretations and the fairness in testing increases accordingly. The validity of individual interpretations can be enhanced by taking into account other relevant information about individual students before making important decisions. It is important to consider the soundness and relevance of any collateral information or evidence used in conjunction with test scores for making educational decisions. Further, fairness in testing can be enhanced through careful consideration of conditions that affect students' opportunities to demonstrate their capabilities. For example, when tests are used for promotion and graduation, the fairness of individual interpretations can be enhanced by (a) providing students with multiple opportunities to demonstrate their capabilities through repeated testing with alternate forms or through other construct-equivalent means, (b) ensuring students have had adequate notice of skills and content to be tested along with other appropriate test preparation material, (c) providing students with curriculum and instruction that affords them the opportunity to learn the content and skills that are tested, and (d) providing students with equal access to any specific preparation for test taking (e.g., test-taking strategies). Chapter 7 provides a more thorough discussion on fairness in testing. Collateral information can also enhance interpretation and decisions at the institutional level.
For instance, changes in test scores from year to year may not only reflect changes in the capabilities of students but also changes in the student population (e.g., successive cohorts of students). Differences in scores across ethnic groups may be confounded with differences in socioeconomic status of the communities in which they live and, hence, the educational resources to which students have access. Differences in scores from school to school may similarly reflect differences in resources and activities such as the qualification of teachers or the number of advanced course offerings. While local empirical evidence of the influence of these factors may not be readily available, consideration of evidence from similar contexts available in published literature can enhance the quality of the interpretation and use of current results. Because public participation is an integral part of educational governance, policymakers, professional educators, and members of the public are concerned with the nature of educational tests, the domains that the tests are intended to measure, the choices in test design, adoption, and implementation, and the issues associated with valid interpretation and uses of test results. It is important that test results be reported in a way that all stakeholders can understand, that enables sound interpretations, and that decreases the chance of misinterpretations and inappropriate decisions. Large-scale testing is increasingly viewed as a tool of educational policy. From this perspective, tests used for program evaluation, such as some state tests that are aligned to the state's own curriculum standards, are not used solely as measures of school outcomes (see chapter 15 for a more thorough discussion on the use of tests for program evaluation).
They are also viewed as a means to influence curriculum and instruction, to hold teachers and school administrators accountable, to increase student motivation, and to communicate performance expectations to students, to teachers, and to the public. If such goals are set forth as part of the rationale for a testing program, the validity of the testing program needs to be examined with respect to these goals. Beyond any intended policy goals, it is important to consider potential unintended effects that may result from large-scale testing programs. Concerns have been raised, for instance, about narrowing the curriculum to focus only on the objectives tested, restricting the range of instructional approaches to correspond to the testing format, increasing the number of dropouts among students who do not pass the test, and encouraging other instructional or administrative practices that may raise test scores without affecting the quality of education. It is important for those who mandate tests to consider and monitor their consequences and to identify and minimize the potential of negative consequences.

Selection in Higher Education

It is widely recognized that tests are used in the selection of applicants for admission to particular educational programs, especially admissions to colleges, universities, and professional schools. Selection criteria may vary within an institution by academic specialization. In addition to scores from selection tests, many other sources of evidence are used in making selection decisions, including past academic records, transcripts, and grade-point average or rank in class. Scores on tests used to certify students for high school graduation may be used in the college admissions process.
Other measures used by some institutions are samples of previous work by students, lists of academic and service accomplishments, letters of recommendation, and student-composed statements evaluated for the appropriateness of the goals and experience of the student or for writing proficiency. Two major points may be made about the role of tests in the admissions process. Often, scores are used in combination with other sources of information. Some of these supplemental sources of evidence may not be [...], or may lack comparability from applicant to applicant. For this reason, it is important that studies be conducted examining the relationships among test scores, data from other sources of information, and college performance. Second, the public and policymakers are to be cautious about the widespread use of reports of college admission test scores to infer the effectiveness of middle school and high school as well as to compare schools or states. Admissions tests, whether they are intended to measure achievement or ability, are not directly linked to a particular instructional curriculum and, therefore, are not appropriate for detecting changes in middle school or high school performance. Because of differential motivational factors and other demographic variables found across and within pre-collegiate programs, self-selection precludes general comparisons of test scores across demographic groups. Therefore, self-selection also precludes comparisons of test scores among the full ranges of pre-collegiate programs.

Individualized and Special Needs Testing

Individually administered tests are used by school psychologists and other professionals in schools and other related settings to facilitate the learning and development of students who may have special educational needs (see chapter 12). Some of these services are reserved for those students who have gifted capabilities as well as for those students who may have relatively minor academic difficulties (e.g., such as those requiring remedial reading). Other services are reserved for students who display behavioral, emotional, physical, and/or more severe learning difficulties. Services may be provided to students who are in regular classroom settings as well as to students who need more specialized instruction outside of the regular classroom. The ultimate purpose of these services is to assure all students are [...] assessed and placed into appropriate educational programs.
Individually administered tests can serve a number of purposes, including screening, diagnostic classification, intervention planning, and program evaluation. For screening purposes, tests are administered to identify students who might differ significantly from their peers and might require additional assessment. For example, screening tests may be used to identify young children who show signs of developmental disorders and to signal the need for further evaluation. For diagnostic purposes, tests may be used to clarify the types and extent of an individual's difficulties or problems in light of well-established criteria. Test results provide an important basis for determining whether the student meets eligibility requirements for special education and other related services and, if so, the specific types of services that the student needs. Test results may be used for intervention purposes in establishing behavior and learning goals and objectives for the student, planning instructional strategies that should be used, and specifying the appropriate setting in which special services are to be delivered (e.g., regular classroom, resource room, full-time special class, etc.). Subsequent to the student's placement in special services, tests may be administered to monitor the progress of the student toward prescribed learning goals and objectives. Test results may be used also to evaluate the effectiveness of instruction to determine whether the special services need to be continued, modified, or discontinued.
Many types of tests are used in individualized and special needs testing. These include tests of cognitive abilities, academic achievement, learning processes, visual and auditory memory, speech and language, vision and hearing, and behavior and personality. These tests are used typically in conjunction with other assessment methods such as interviews, behavioral observation, and review of records. Each of these may provide useful data for making appropriate decisions about a student. In addition, procedures that aim to link assessment closely to intervention may be used, including behavioral assessments, assessments of learning environments, curriculum-based tests, and portfolios. Regardless of the qualities being assessed and types of data collection methods employed, assessment data used in making special education decisions are evaluated in terms of validity, reliability, and relevance to the specific needs of the students. They must also be judged in terms of their usefulness for designing appropriate educational programs for students who have special needs. The amount and complexity of the assessment data required for making various decisions about a student will vary depending on the purpose of testing, the needs of the student, and other information already available about the student (e.g., current scores on a relevant test may be on file for some students but not for others).
In general, testing for screening and program evaluation purposes typically involves the use of one or two tests rather than comprehensive test batteries. For determining eligibility and designing intervention, testing and assessment is more comprehensive and may involve multiple procedures and sources. Moreover, in-depth analyses and interpretation of the data are necessary. In special education, tests are selected, administered, and interpreted by school psychologists, school counselors, regular and special educators, speech pathologists, and physical therapists, among other professionals. The validity of inferences will be enhanced if test users possess adequate knowledge of the principles of measurement and evaluation. However, this diverse group of test users may differ in their levels of technical expertise in measurement and degree of professional training in assessment procedures. It is important that professional evaluators administer and interpret only those tests with which they have training and competence, in order to prevent misuse of tests. State and federal law generally requires that students who are referred for possible special education services be screened for eligibility. The screening or initial assessment may in turn call for a more comprehensive evaluation. But the large numbers of students to be tested, the high cost of special education programs, and the limits of time create pressures on special education assessment practices. Assessment usually must be completed within a specific number of working days after referral, and, in most instances, the school district is responsible for funding special services recommended by the child study team. Occasionally, administrators might be inclined to use less expensive, less time-consuming, or more readily available testing procedures than a professional evaluator believes are warranted. An example would be the inappropriate use of available, but less adequately trained, staff to evaluate students. There also might be pressures to minimize or overlook problems that require expensive services. These conditions are likely to adversely affect the validity of the interpretation of test results. Adherence to professional standards governing test use in conducting special education assessments is important, in the face of pressures to use more expedient procedures.
The responsible use of tests by school personnel can improve the opportunities for promoting the development and learning of all children.

Standard 13.1

When educational testing programs are mandated by school, district, state, or other authorities, the ways in which test results are intended to be used should be clearly described. It is the responsibility of those who mandate the use of tests to monitor their impact and to identify and minimize potential negative consequences. Consequences resulting from the uses of the test, both intended and unintended, should also be examined by the test user.

Comment: Mandated testing programs are often justified in terms of their potential benefits for teaching and learning. Concerns have been raised about the potential negative impact of mandated testing programs, particularly when they result directly in important decisions for individuals or institutions.
Frequent concerns include narrowing the curriculum to focus only on the objectives tested, increasing the number of dropouts among students who do not pass the test, or encouraging other instructional or administrative practices simply designed to raise test scores rather than to affect the quality of education.

Standard 13.2

In educational settings, when a test is designed or used to serve multiple purposes, evidence of the test's technical quality should be provided for each purpose.

Comment: In educational testing, it has become common practice to use the same test for multiple purposes (e.g., monitoring achievement of individual students, providing information to assist in instructional planning for individuals or groups of students, evaluating schools or districts). No test will serve all purposes equally well. Choices in test development and evaluation that enhance validity for one purpose may diminish validity for other purposes. Different purposes require somewhat different kinds of technical evidence, and appropriate evidence of technical quality for each purpose should be provided by the test developer. If the test user wishes to use the test for a purpose not supported by the available evidence, it is incumbent on the user to provide the necessary additional evidence (see chapter 1).

Standard 13.3

When a test is used as an indicator of achievement in an instructional domain or with respect to specified curriculum standards, evidence of the extent to which the test samples the range of knowledge and elicits the processes reflected in the target domain should be provided. Both tested and target domains should be described in sufficient detail so their relationship can be evaluated. The analyses should make explicit those aspects of the target domain that the test represents as well as those aspects that it fails to represent.
Comment: Increasingly, tests are being developed to monitor progress of individuals and groups toward local, state, or professional curriculum standards. Rarely can a single test cover the full range of performances reflected in the curriculum standards. To assure appropriate interpretations of test scores as indicators of performance on these standards, it is essential to document and evaluate both the relevance of the test to the standards and the extent to which the test represents the standards. When existing tests are selected by a school, district, or state to represent local curricula, it is incumbent on the user to provide the necessary evidence of the congruency of the curriculum domain and the test content. Further, conducting studies of the cognitive strategies and skills employed by test takers or studies of the relationships between test scores and other performance indicators relevant to the broader domain enables evaluation of the extent to which generalizations to the broader domain are supported. This information should be made available to all those who use the test and interpret the test scores.

Standard 13.4

Local norms should be developed when necessary to support test users' intended interpretations.

Comment: Comparison of examinees' scores to local as well as more broadly representative norm groups can be informative. Thus, sample size permitting, local norms are often useful in conjunction with published norms, especially if the local population differs markedly from the population on which published norms are based.
In some cases, local norms may be used exclusively.

Standard 13.5

When test results substantially contribute to making decisions about student promotion or graduation, there should be evidence that the test adequately covers only the specific or generalized content and skills that students have had an opportunity to learn.

Comment: Students, parents, and educational staff should be informed of the domains on which the students will be tested, the nature of the item types, and the standards for mastery. Reasonable efforts should be made to document the provision of instruction on tested content and skills, even though it may not be possible or feasible to determine the specific content of instruction for every student. Chapter 7 provides a more thorough discussion of the difficulties that arise with this conception of fairness in testing.

Standard 13.6

Students who must demonstrate mastery of certain skills or knowledge before being promoted or granted a diploma should have a reasonable number of opportunities to succeed on equivalent forms of the test or be provided with construct-equivalent testing alternatives of equal difficulty to demonstrate the skills or knowledge. In most circumstances, when students are provided with multiple opportunities to demonstrate mastery, the time interval between the opportunities should allow for students to have the opportunity to obtain the relevant instructional experiences.

Comment: The number of opportunities and time between each testing opportunity will vary with the specific circumstances of the setting. Further, some students may benefit from a different testing approach to demonstrate their achievement. Care must be taken that evidence of construct equivalence of alternative approaches is provided as well as the equivalence of cut scores defining passing expectations.
Standard 13.7

In educational settings, a decision or characterization that will have major impact on a student should not be made on the basis of a single test score. Other relevant information should be taken into account if it will enhance the overall validity of the decision.

Comment: As an example, when the purpose of testing is to identify individuals with special needs, including students who would benefit from gifted and talented programs, a screening for eligibility or an initial assessment should be conducted. The screening or initial assessment may in turn call for more comprehensive evaluation. The comprehensive assessment should involve the use of multiple measures, and data should be collected from multiple sources. Any assessment data used in making decisions are evaluated in terms of validity, reliability, and relevance to the specific needs of the students. It is important that in addition to test scores, other relevant information (e.g., school record, classroom observation, parent report) is taken into account by the professionals making the decision.

Standard 13.8

When an individual student's scores from different tests are compared, any educational decision based on this comparison should take into account the extent of overlap between the two constructs and the reliability or standard error of the difference score.

Comment: When difference scores between two tests are used to aid in making educational decisions, it is important that the two tests are standardized and, if appropriate, normed on the same population at about the same time. In addition, the reliability and standard error of the difference scores between the two tests are affected by the relationship between the constructs measured by the tests as well as the standard errors of measurement of the scores of the two tests. In the case of comparing ability with achievement test scores, the overlapping nature of the two constructs may render the reliability of the difference scores lower than test users normally would assume. If the ability and/or achievement tests involve a significant amount of measurement error, this will also reduce the confidence one may place on the difference scores. All these factors affect the reliability of difference scores between tests and should be considered by professional evaluators in using difference scores as a basis for making important decisions about a student. This standard is also relevant when comparing scores from different components of the same test such as multiple aptitude test batteries and selection tests.

Standard 13.9

When test scores are intended to be used as part of the process for making decisions for educational placement, promotion, or implementation of prescribed educational plans, empirical evidence documenting the relationship among particular test scores, the instructional programs, and desired student outcomes should be provided. When adequate empirical evidence is not available, users should be cautioned to weigh the test results accordingly in light of other relevant information about the student.

Comment: The validity of test scores for placement or promotion decisions rests, in part, upon evidence about whether students, in fact, benefit from the differential instruction. Similarly, in special education, when test scores are used in the development of specific educational objectives and instructional strategies, evidence is needed to show that the prescribed instruction enhances students' learning.
When there is limited evidence about the relationship among test results, instructional plans, and student achievement outcomes, test developers and users should stress the tentative nature of the test-based recommendations and encourage teachers and other decision makers to consider the usefulness of test scores in light of other relevant information about the students.

Standard 13.10

Those responsible for educational testing programs should ensure that the individuals who administer and score the test(s) are proficient in the appropriate test administration procedures and scoring procedures and that they understand the importance of adhering to the directions provided by the test developer.

Standard 13.11

In educational settings, test users should ensure that any test preparation activities and materials provided to students will not adversely affect the validity of test score inferences.

Comment: In most educational testing contexts, the goal is to use a sample of test items to make inferences to a broader domain. When inappropriate test preparation activities occur, such as teaching items that are equivalent to those on the test, the validity of test score inferences is adversely affected. The appropriateness of test preparation activities and materials can be evaluated, for example, by determining the extent to which they reflect the specific test items and the extent to which test scores are artificially raised without actually increasing students' level of achievement.
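As an illustration of the point made in the comment on Standard 13.8, the following is a minimal sketch, not taken from the Standards, of how the standard error and the reliability of a difference score can be computed under classical test theory. It assumes that both tests are reported on a common scale with the same standard deviation and that measurement errors on the two tests are uncorrelated. The function names and the example values (a scale with a standard deviation of 15, reliabilities of .90, and a correlation of .70 between the two tests) are hypothetical.

```python
import math


def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement for one test: SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def se_difference(sd: float, rel_x: float, rel_y: float) -> float:
    """Standard error of the difference X - Y, assuming uncorrelated errors
    and a common scale (same SD) for both tests."""
    return math.sqrt(sem(sd, rel_x) ** 2 + sem(sd, rel_y) ** 2)


def difference_reliability(rel_x: float, rel_y: float, r_xy: float) -> float:
    """Reliability of the difference X - Y for two tests with equal variances.

    rel_x and rel_y are the reliabilities of the two tests; r_xy is the
    correlation between them (the overlap of the two constructs).
    """
    if r_xy >= 1.0:
        raise ValueError("correlation between tests must be below 1")
    return ((rel_x + rel_y) / 2.0 - r_xy) / (1.0 - r_xy)


if __name__ == "__main__":
    # Hypothetical ability and achievement tests on a scale with SD = 15.
    print(round(se_difference(15, 0.90, 0.90), 2))
    print(round(difference_reliability(0.90, 0.90, 0.70), 2))
```

With these illustrative values, two tests that each have a reliability of .90 but correlate .70 with each other yield a difference score with a reliability of about .67, which is the kind of reduction the comment on Standard 13.8 cautions about.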
Standard 13.12

In educational settings, those who supervise others in test selection, administration, and interpretation should have received education and training in testing necessary to ensure familiarity with the evidence for validity and reliability for tests used in the educational setting and to be prepared to articulate or to ensure that others articulate a logical explanation of the relationship among the tests used, the purposes and the interpretations of the test scores.

Standard 13.13

Those responsible for educational testing programs should ensure that the individuals who interpret the test results to make decisions within the school context are qualified to do so or are assisted by and consult with persons who are so qualified.

Comment: When testing programs are used as a strategy for guiding instruction, teachers expected to make inferences about instructional needs may need assistance in interpreting test results for this purpose. If the tests are normed locally, statewide, or nationally, teachers and administrators need to be proficient in interpreting the norm-referenced test scores.

The interpretation of some test scores is sufficiently complex to require that the user have relevant psychological training and experience or be assisted by and consult with persons who have such training and experience. Examples of such tests include individually administered intelligence tests, personality inventories, projective techniques, and neuropsychological tests.

Standard 13.14

In educational settings, score reports should be accompanied by a clear statement of the degree of measurement error associated with each score or classification level and information on how to interpret the scores.

Comment: This information should be communicated in a way that is accessible to persons receiving the score report. For instance, the degree of uncertainty might be indicated by a likely range of scores or by the probability of misclassification.

Standard 13.15

In educational settings, reports of group differences in test scores should be accompanied by relevant contextual information, where possible, to enable meaningful interpretation of these differences. Where appropriate contextual information is not available, users should be cautioned against misinterpretation.

Comment: Observed differences in test scores between groups (e.g., classified by gender, race/ethnicity, school/district, geographical region) can be influenced, for example, by differences in course-taking patterns, in curriculum, in teachers' qualifications, or in parental educational level. Differences in performance of cohorts of students across time may be influenced by changes in the population of students tested or changes in learning opportunities for students. Users should be advised to consider the appropriate contextual information and cautioned against misinterpretation.

Standard 13.16

In educational settings, whenever a test score is reported, the date of test administration should be reported. This information and the age of any norms used for interpretation should be considered by test users in making inferences.

Comment: When a test score is used for a particular purpose, the date of the test score should be taken into consideration in determining its worth or appropriateness for making inferences about a student. Depending on the particular domain measured, the validity of score inferences may be questionable as time progresses. For instance, a reading score from a test administered 6 months ago to an elementary school-aged student may no longer reflect the student's current reading level. Thus, a test score should not be used if it has been determined that undue time has passed since the time of data collection and that the score no longer can be considered a valid indicator of a student's current level of proficiency.

Standard 13.17

When change or gain scores are used, such scores should be defined and their technical qualities should be reported.

Comment: The use of change or gain scores presumes the same test or equivalent forms of the test were used and that the test has (or the forms have) not been materially altered between administrations. The standard error of the difference between scores on the pretest and posttest, the regression of posttest scores on pretest scores, or relevant data from other reliable methods for examining change, such as those based on structural equation modeling, should be reported.

Standard 13.18

Documentation of design, models, scoring algorithms, and methods for scoring and classifying should be provided for tests administered and scored using multimedia or computers. Construct-irrelevant variance pertinent to computer-based testing and the use of other media in testing, such as the test taker's familiarity with technology and the test format, should be addressed in their design and use.

Comment: It is important to assure that the documentation does not jeopardize the security of the items that could adversely affect the validity of score interpretations. Computer and multimedia testing need to be held to the same requirements of technical quality as are other tests.

Standard 13.19

In educational settings, when average or summary scores for groups of students are reported, they should be supplemented with additional information about the sample size and shape or dispersion of score distributions.

Comment: Score reports should be designed to communicate clearly and effectively to their intended audiences. In most cases, reports that go beyond average score comparisons are helpful in furthering thoughtful use and interpretation of test scores. Depending on the intended purpose and audience of the score report, additional information might take the form of standard deviations or other common measures of score variability, or of selected percentile points for each distribution. Alternatively, benchmark score levels might be established and then, for each group or region, the proportions of test takers attaining each specified level could be reported. Such benchmarks might be defined, for example, as selected percentiles of the pooled distribution for all groups or regions. Other distributional summaries or reporting formats may also be useful. The goal of more detailed reporting must be balanced against goals of clarity and conciseness in communicating test scores.

14. TESTING IN EMPLOYMENT AND CREDENTIALING

Background

Employment testing is carried out by organizations for purposes of employee selection, promotion, or placement. Selection generally refers to decisions about which individuals will enter the organization; placement refers to decisions as to how to assign individuals to positions within the work force; and promotion refers to decisions about which individuals within the organization will advance. What all three have in common is a focus on the prediction of future job behaviors, with the goal of influencing organizational outcomes such as efficiency, growth, productivity, and employee motivation and satisfaction.

Testing used in the processes of licensure and certification, which will here generically be called credentialing, focuses on the applicant's current skill or competency in a specified domain. In many occupations, individuals must be licensed by governmental agencies in order to engage in the particular occupation. In other occupations, professional societies or other organizations assume responsibility for credentialing. Although licensure is typically a credential for entry into an occupation, credentialing programs may exist at varying levels, from novice to expert in a given field. Certification is usually sought voluntarily, although occupations differ in the degree to which obtaining certification influences employability or advancement. Testing is commonly only a part of a credentialing process, which may also include other requirements, such as education or supervised experiences. The Standards apply to the use of tests in the broader credentialing process.

Testing is also carried out in work organizations for a variety of purposes other than employment decision making and credentialing. Testing to detect psychopathology can take place, as in the case of an employee exhibiting behavioral problems at work. Testing as a tool for personal growth can be part of training and development programs, in which instruments measuring personality characteristics, interests, values, preferences, and work styles are commonly used with the goal of providing self-insight to employees.
Testing can also take place in the context of program evaluation, as in the case of an experimental study of the effectiveness of a training program, where tests may be administered as pre- and post-measures. The focus of this chapter, though, is on the use of testing in employment and credentialing. Many issues relevant to such testing are discussed in other chapters: technical matters in chapters 1-6, fairness issues in chapters 7-10, general issues of test use in chapter 11, and individualized assessment of job candidates in chapter 12.

Employment Testing

The Influence of Context on Test Use

Employment testing involves using test information to aid in personnel decision making. Both the content and the context of employment testing vary widely. Content may cover various domains of knowledge, skills, abilities, traits, dispositions, and values. The context in which tests are used also varies widely. Some contextual features represent choices made by the employing organization; others represent constraints that must be accommodated by the employing organization. Decisions about the design, evaluation, and implementation of a testing system are specific to the context in which the system is to be used. Important contextual features include the following:

Internal vs. external candidate pool. In some instances, such as promotional settings, the candidates to be tested are already employed by the organization. In others, applications are sought from outside the organization. In others, a mix of internal and external candidates is sought.

Untrained vs. specialized jobs. In some instances, untrained individuals are selected either because the job does not require specialized knowledge or skill or because the organization plans to offer training after the point of hire.
In other instances, trained or experienced workers are sought with the expectation that they can immediately step into a specialized job. Thus, the same job may require very different selection systems depending on whether trained or untrained individuals will be hired or promoted.

Short-term vs. long-term focus. In some instances, the goal of the selection system is to predict performance immediately upon or shortly after hire. In other instances, the concern is with longer-term performance, as in the case of predictions as to whether candidates will successfully complete a multiyear overseas job assignment. Concerns about changing job tasks and job requirements also can lead to a focus on characteristics projected to be necessary for performance on the target job in the future, even if not a part of the job as currently constituted.

Screen in vs. screen out. In some instances, the goal of the selection system is to screen in individuals who will perform well on one set of behavioral or outcome criteria of interest to the organization. In others, the goal is to screen out individuals for whom the risk of pathological, deviant, or criminal behavior on the job is deemed too high. A testing system well suited to one objective may be completely inappropriate for another. That an individual is evaluated as a low risk for engaging in pathological behavior does not imply a prediction that the individual will exhibit high levels of job performance. That a test is predictive of one criterion does not support the inference of linkages to other criteria of interest as well.

Mechanical vs. judgmental decision making. In some instances, test information is used in a mechanical, standardized fashion. This is the case when scores on a test battery are combined by formula and candidates are selected in strict top-down rank order, or when only candidates above specific cut scores are eligible to continue to subsequent stages of a selection system. In other instances, information from a test is judgmentally integrated with information from other tests and with nontest information to form an overall assessment of the candidate.

Ongoing vs. one-time use of a test. In some instances, a test may be used for an extended period of time in an organization, permitting the accumulation of data and experience about the test in that context. In other instances, concerns about test security are such that repeated use is infeasible, and a new test is required for each test administration. For example, a work-sample test for lifeguards, requiring retrieving a mannequin from the bottom of a pool, is not compromised if candidates possess detailed knowledge of the test in advance. In contrast, a written job knowledge test may be severely compromised if some candidates have access to the test in advance. The key question is whether advance knowledge of test content changes the constructs measured by the test.

Fixed applicant pool vs. continuous flow. In some instances, an applicant pool can be assembled prior to beginning the selection process, as in the case of a policy that all candidates applying before a specific date will be considered. In other cases, there is a continuous flow of applicants about whom employment decisions need to be made on an ongoing basis. A ranking of candidates is possible in the case of the fixed pool; in the case of a continuous flow, a decision may need to be made about each candidate independent of information about other candidates.

Small vs. large sample size. Large sample sizes are sometimes available for jobs with many incumbents, in situations in which multiple similar jobs can be pooled, or in situations in which organizations with similar jobs collaborate in selection system development. In other situations, sample sizes are small; at the extreme is the case of the single-incumbent job. Sample size affects the degree to which different lines of evidence can be drawn on in examining validity for the intended inference to be drawn from the test. For example, relying on the local setting for empirical linkages between test and criterion scores is not technically feasible with small sample sizes.

Size of applicant pool, relative to the number of job openings. The size of an applicant pool can constrain the type of testing system that is feasible. For desirable jobs, very large numbers of candidates may vie for a small number of jobs. Under such scenarios, short screening tests may be used to reduce the pool to a size for which the administration of more time-consuming and expensive tests is practicable. Large applicant pools may also pose test security concerns, limiting the organization to testing methods that permit simultaneous test administration to all candidates.

Thus, test use by employers is conditioned by contextual features such as those in the foregoing list. Knowledge of these features plays an important part in the professional judgment that will influence both the type of testing system that will be developed and the strategy that will be used to evaluate critically the validity of the inference(s) drawn using the testing system.

The Validation Process in Employment Testing

The fundamental inference to be drawn from test scores in most applications of testing in employment settings is one of prediction: the test user wishes to make an inference from test results to some future job behavior or job outcome. Even when the validation strategy used does not involve empirical predictor-criterion linkages, as in the case of reliance on validity evidence based on test content, there is an implied criterion. Thus, while different strategies of gathering evidence may be used, the inference to be supported is that scores on the test can be used to predict subsequent job behavior. The validation process in employment settings involves the gathering and evaluation of evidence relevant to sustaining or challenging this inference. As detailed below, a variety of validation strategies can be used to support this inference.

It thus follows that establishing this predictive inference requires that attention be paid to two domains: that of the test (the predictor) and that of the job behavior or outcome of interest (the criterion). Evaluating the use of a test for an employment decision can be viewed as testing the hypothesis of a linkage between these domains. Operationally, there are many ways of testing this hypothesis. This is illustrated by the following diagram:

[Diagram: a predictor measure, a criterion measure, a predictor construct domain, and a criterion construct domain, connected by linkages numbered 1 through 5.]

The diagram differentiates between a predictor construct domain and a predictor measure and between a criterion construct domain and a criterion measure. A predictor construct domain is defined by specifying the set of behaviors that will be included under a particular construct label (e.g., verbal reasoning, typing speed, conscientiousness). Similarly, a criterion construct domain specifies the set of job behaviors or job outcomes that will be included under a particular construct label (e.g., performance of core job tasks, teamwork, attendance, sales volume, overall job performance). Predictor and criterion measures are attempts at operationalizing these domains.

The diagram enumerates a number of inferences commonly of interest. The first is the inference that scores on a predictor measure are related to scores on a criterion measure. This inference is tested through empirical examination of relationships between the two measures.
The second and fourth are conceptually similar: both examine the inference that an operational measure can be interpreted as representing an individual's standing on the construct domain of interest. Logical analysis, expert judgment, and convergence with or divergence from conceptually similar or different measures are among the forms of evidence that can be examined in testing these linkages. The third is the inference of a relationship between the predictor construct domain and the criterion construct domain. This linkage is established on the basis of theoretical and logical analysis. It commonly draws on systematic evaluation of job content and expert judgment as to the individual characteristics linked to successful job performance. The fifth represents the linkage between the predictor measure and the criterion construct domain.

Some predictor measures are designed explicitly as samples of the criterion construct domain of interest, and, thus, isomorphism between the measure and the construct domain constitutes direct evidence for linkage 5. Establishing linkage 5 in this fashion is the hallmark of approaches that rely heavily on what these Standards refer to as "validity evidence based on test content," referred to as content validity in prior conceptualizations of the validation process. Tests in which candidates for lifeguard positions perform rescue operations or in which candidates for word processor positions type and edit text exemplify this approach.

A prerequisite to the use of a predictor measure for personnel selection is that the linkage between the predictor measure and the criterion construct domain be established. As the diagram illustrates, there are multiple strategies for establishing this crucial linkage. One strategy is direct, via linkage 5; a second involves pairing linkage 1 and linkage 4; and a third involves pairing linkage 2 and linkage 3.

When the test is designed as a sample of the criterion construct domain, this linkage can be established directly via linkage 5.

Another strategy for linking a predictor measure and the criterion construct domain focuses on linkages 1 and 4: pairing an empirical link between the predictor and criterion measures with evidence of the adequacy with which the criterion measure represents the criterion construct domain. The empirical link between the predictor measure and the criterion measure is part of what these Standards refer to as "validity evidence based on relationships to other variables," referred to as criterion-related validity in prior conceptualizations of the validation process. The empirical link of the test and the criterion measure must be supplemented by evidence of the relevance of the criterion measure to the criterion construct domain to complete the linkage between the test and the criterion construct domain. Evidence of the relevance of the criterion measure to the criterion construct domain is commonly based on job analysis, though in some cases the link between the domain and the measure is so direct that relevance is apparent without job analysis (e.g., when the criterion construct of interest is absenteeism or turnover). Note that this strategy does not necessarily rely on a well-developed predictor construct domain. Predictor measures such as empirically keyed biodata measures are constructed on the basis of empirical links between test item responses and the criterion measure of interest. Such measures may, in some instances, be developed without a fully established a priori conception of the predictor construct domain; the basis for their use is the direct empirical link between test responses and a relevant criterion measure.
Yet another strategy for linking predictor scores and the criterion construct domain focuses on pairing evidence of the adequacy with which the predictor measure represents the predictor construct domain (linkage 2) with evidence of the linkage between the predictor construct domain and the criterion construct domain (linkage 3). As noted above, there is no single direct route to establishing these linkages. They involve lines of evidence subsumed under "construct validity" in prior conceptualizations of the validation process. A combination of lines of evidence, such as expert judgment of the characteristics predictive of job success, inferences drawn from an analysis of critical incidents of effective and ineffective job performance, and interview and observation methods, may support inferences about the predictor constructs linked to the criterion construct domain. Measures of these predictor constructs may then be selected or developed, and the linkage between the predictor measure and the predictor construct domain can be established with various lines of evidence for linkage 2 discussed above.

Thus multiple sources of data and multiple lines of evidence can be drawn on to evaluate the linkage between a predictor measure and the criterion construct domain of interest. There is not a single correct or even a preferred method of inquiry for establishing this linkage. Rather, the test user must consider the specifics of the testing situation and apply professional judgment in developing a strategy for testing the hypothesis of a linkage between the predictor measure and the criterion domain.

For many testing applications, there is a considerable cumulative body of research that speaks to some, if not all, of the inferences discussed above. A meta-analytic integration of this research can form an integral part of the strategy for linking test information to the construct domain of interest. The value of collecting local validation data varies with the magnitude, relevance, and consistency of research findings using similar predictor measures and similar criterion construct domains for similar jobs. In some cases, a small and inconsistent cumulative research record may lead to a validation strategy that relies heavily on local data; in others, a large, consistent research base may make investing resources in additional local data collection unnecessary.

Bases for Evaluating Test Use

While a primary goal of employment testing is the accurate prediction of subsequent job behaviors or job outcomes, it is important to recognize that there are limits to the degree to which such criteria can be predicted. Perfect prediction is an unattainable goal. First, behavior in work settings is also influenced by a wide variety of organizational and extra-organizational factors, including supervisor and peer coaching, formal and informal training, changes in job design, changes in organizational structures and systems, and changing family responsibilities, among others. Second, behavior in work settings is influenced by a wide variety of individual characteristics, including knowledge, skills, abilities, personality, and work attitudes, among others. Thus any single characteristic will be only an imperfect predictor, and even complex selection systems focus on the set of constructs deemed most critical for the job, rather than on all characteristics that can influence job behavior. Third, some measurement error always occurs even in well-developed test and criterion measures.

Thus, testing systems cannot be judged against a standard of perfect prediction but rather in terms of comparisons with available alternative selection methods.
Professional judgment, informed by knowledge of the research literature about the degree of predictive accuracy relative to available alternatives, influences decisions about test use.

Decisions about test use are often influenced by additional considerations including utility (i.e., cost-benefit) evaluation, value judgments about the relative importance of selecting for one criterion domain vs. others, concerns about applicant reactions to test content and process, the availability and appropriateness of alternative selection methods, statutory or regulatory requirements governing test use, and social issues such as workforce diversity. Organizational values necessarily come into play in making decisions about test use; organizations with comparable evidence supporting an intended inference drawn from test scores may thus reach different conclusions about whether to use any particular test.

Testing in Professional and Occupational Credentialing

Tests are widely used in the credentialing of persons for many occupations and professions. Licensing requirements are imposed by state and local governments to ensure that those licensed possess knowledge and skills in sufficient degree to perform important occupational activities safely and effectively. Certification plays a similar role in many occupations not regulated by governments and is often a necessary precursor to advancement in many occupations. Certification has also become widely used to indicate that a person has certain specific skills (e.g., operation of specialized auto repair equipment) or knowledge (e.g., estate planning), which may be only part of their occupational duties. Licensure and certification, as well as registry and other warrants of expertise, will here generically be called credentialing.

Tests used in credentialing are intended to provide the public, including employers and government agencies, with a dependable mechanism for identifying practitioners who have met particular standards. The standards are strict, but not so stringent as to unduly restrain the right of qualified individuals to offer their services to the public. Credentialing also serves to protect the profession by excluding persons who are deemed to be not qualified to do the work of the occupation. Qualifications for credentials typically include educational requirements, some amount of supervised experience, and other specific criteria, as well as attainment of a passing score on one or more examinations. Tests are used in credentialing in a broad spectrum of professions and occupations, including medicine, law, psychology, teaching, architecture, real estate, and cosmetology. In some of these, such as actuarial science, clinical neuropsychology, and medical specialties, tests are also used to certify advanced levels of expertise. Relicensure or recertification is also required in some occupations and professions.

Tests used in credentialing are designed to determine whether the essential knowledge and skills of a specified domain have been mastered by the candidate. The focus of performance standards is on levels of knowledge and performance necessary for safe and appropriate practice. Test design generally starts with an adequate definition of the occupation or specialty so that persons can be clearly identified as engaging in the activity. Then, the nature and requirements of the occupation, in its current form, are delineated. Often, a thorough analysis is conducted of the work performed by people in the profession or occupation to document the tasks and abilities that are essential to practice. A wide variety of empirical approaches is used, including delineation, critical incidence techniques, job analysis, training needs assessments, or practice studies and surveys of practicing professionals.
Panels of respected experts in the field often work in collaboration with qualified specialists in testing to define test specifications, including the knowledge and skills needed for safe, effective performance, and an appropriate way of assessing that performance. Forms of testing may include traditional multiple-choice tests, written essays, and oral examinations. More elaborate performance tasks, sometimes using computer-based simulation, are also used in assessing such practice components as, for example, patient diagnosis or treatment planning. Hands-on performance tasks may also be used (e.g., operating a boom crane or filling a tooth) while being observed by one or more examiners.

Credentialing tests may cover a number of related but distinct areas. Designing the testing program includes deciding what areas are to be covered, whether one or a series of tests is to be used, and how multiple test scores are to be combined to reach an overall decision. In some cases high scores on some tests are permitted to offset low scores on other tests, so that additive combination is appropriate. In other cases, an acceptable performance level is required on each test in an examination series.

Validation of credentialing tests depends mainly on content-related evidence, often in the form of judgments that the test adequately represents the content domain of the occupation or specialty being considered. Such evidence may be supplemented with other forms of evidence external to the test. Criterion-related evidence is of limited applicability in licensure settings because criterion measures are generally not available for those who are not granted a license.

Defining the minimum level of knowledge and skill required for licensure or certification is one of the most important and difficult tasks facing those responsible for credentialing. Verifying the appropriateness of the cut score or scores on the tests is a critical element in validity. The validity of the inference drawn from the test depends on whether the standard for passing makes a valid distinction between adequate and inadequate performance. Often, panels of experts are used to specify the level of performance that should be required. Standards must be high enough to protect the public, as well as the practitioner, but not so high as to be unreasonably limiting. Verifying the appropriateness of the cut score or scores on a test used for licensure or certification is a critical element of the validity of test results.

Legislative bodies sometimes attempt to legislate a cut score, such as a score of 70%. Arbitrary numerical specifications of cut scores are unhelpful for two reasons. First, without detailed information about the test, job requirements, and their relationship, sound standard setting is impossible. Second, without detailed information about the format of the test and the difficulty of items, such numerical specifications have little meaning.

Tests for credentialing need to be precise in the vicinity of the passing, or cut, score. They may not need to be precise for those who clearly pass or clearly fail. Sometimes a test used in credentialing is designed to be precise only in the vicinity of the cut score. Computer-based mastery tests may include a procedure to end the testing when a decision about the candidate's performance can be clearly made or when a maximum time limit is reached. This may result in a shorter test for candidates whose performance clearly exceeds or falls far below the minimum performance required for a passing score. The test taker may be told only whether the decision was pass or fail. Because such mastery tests are not designed to indicate how badly the candidate failed, or how well the candidate passed, providing scores that are much higher or lower than the cut score could be misleading. Nevertheless, candidates who fail are likely to profit from information about the areas in which their performance was especially weak. When feedback to candidates about how well or how poorly they performed is intended, precision throughout the score range is needed.

Practice in professions and occupations often changes over time. Evolving legal restrictions, progress in scientific fields, and refinements in techniques can result in a need for changes in test content. When change is substantial, it becomes necessary to revise the definition of the job, and the test content, to reflect changing circumstances. When major revisions are made in the test, the cut score that identifies required test performance is also reestablished.

Because credentialing is an ongoing process, with tests given on a regular schedule, new versions of the test are often needed. From a technical perspective, all versions of a test should be prepared to the same specifications and represent the same content. Alternate test forms should have comparable score scales so that scores can retain their meaning. Various methods of jointly calibrating alternate forms can be used to assure that the standard for passing represents the same level of performance on all forms. It may be noted that release of past test forms may compromise the quality of test form comparability.

Some credentialing groups consider it necessary, as a practical matter, to adjust their criteria yearly in order to regulate the number of accredited candidates entering the profession. This questionable procedure raises serious problems for the technical quality of the test scores. Adjusting the cut score annually implies higher standards in some years than in others, which, although open and straightforward, is difficult to justify on the grounds of quality of performance. Adjusting the score scale so that a certain number or proportion reach the passing score, while less obvious to the candidates, is technically inappropriate because it changes the meaning of the scores from year to year. Passing a credentialing examination should signify that the candidate meets the knowledge and skill standards set by the credentialing body, independent of the availability of work.

Issues of cheating and test security are of special importance for testing practices in credentialing. Issues of test security are covered in chapters 5 and 11. Issues of cheating by test takers are covered in chapter 8. Issues concerning the technical quality of tests are found in chapters 1-6, and issues of fairness in chapters 7-10.

Standard 14.1

Prior to development and implementation of an employment test, a clear statement of the objective of testing should be made. The subsequent validation effort should be designed to determine how well the objective has been achieved.

Comment: The objectives of employment tests can vary considerably. Some aim to screen out those least suited for the job in question, while others are designed to identify those best suited for the job. Tests also vary in the aspects of job behavior they are intended to predict, which may include quantity or quality of work output, tenure, counterproductive behavior, and teamwork, among others.

Standard 14.2

When a test is used to predict a criterion, the decision to conduct local empirical studies of predictor-criterion relationships and interpretation of the results of local studies of predictor-criterion relationships should be grounded in knowledge of relevant research.
Comment: The cumulative literature on the relationship between a particular type of predictor and type of criterion may be sufficiently large and consistent to support the predictor-criterion relationship without additional research. In some settings, the cumulative research literature may be so substantial and so consistent that a dissimilar finding in a local study should be viewed with caution unless the local study is exceptionally sound. Local studies are of greatest value in settings where the cumulative research literature is sparse (e.g., due to the novelty of the predictor and/or criterion used), where the cumulative record is inconsistent, or where the cumulative literature does not include studies similar to the local setting (e.g., a test with a large cumulative literature dealing exclusively with production jobs, and a local setting involving managerial jobs).

Standard 14.3

Reliance on local evidence of empirically determined predictor-criterion relationships as a validation strategy is contingent on a determination of technical feasibility.

Comment: Meaningful evidence of predictor-criterion relationships is conditional on a number of features, including (a) the job being relatively stable, rather than in a period of rapid evolution; (b) the availability of a relevant and reliable criterion measure; (c) the availability of a sample reasonably representative of the population of interest; and (d) an adequate sample size for estimating the strength of the predictor-criterion relationship.

Standard 14.4

When empirical evidence of predictor-criterion relationships is part of the pattern of evidence used to support test use, the criterion measure(s) used should reflect the criterion construct domain of interest to the organization. All criteria used should represent important work behaviors or work outputs, on the job or in job-relevant training, as indicated by an appropriate review of information about the job.

Comment: When criteria are constructed to represent job activities or behaviors (e.g., supervisory ratings of subordinates on important job dimensions), systematic collection of information about the job informs the development of the criterion measures, though there is no clear choice among the many available job analysis methods. There is not a clear need for job analysis to support criterion use when measures such as absenteeism or turnover are the criteria of interest.

Standard 14.5

Individuals conducting and interpreting empirical studies of predictor-criterion relationships should identify contaminants and artifacts that may have influenced study findings, such as error of measurement, range restriction, and the effects of missing data. Evidence of the presence or absence of such features, and of actions taken to remove or control their influence, should be retained and made available as needed.

Comment: Error of measurement in the criterion and restriction in the variability of predictor or criterion scores systematically reduce estimates of the relationship between predictor measures and the criterion construct domain, and procedures for correction for the effects of these artifacts are available. When these procedures are applied, both corrected and uncorrected values should be presented, along with the rationale for the correction procedures chosen. Statistical significance tests for uncorrected correlations should not be used with corrected correlations.
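As an illustration only, the two artifacts named in the comment above (error of measurement in the criterion and range restriction) can be sketched with two widely taught textbook corrections: the classical correction for attenuation and the direct range restriction formula often labeled Thorndike Case II. The text does not prescribe these particular formulas; the function names, parameter names, and the choice of corrections below are assumptions made for this sketch. The report function returns the uncorrected value alongside each corrected value, in line with the recommendation that both be presented.

```python
import math


def correct_for_criterion_unreliability(r_observed: float,
                                        criterion_reliability: float) -> float:
    """Classical correction for attenuation, applied to the criterion only.

    Assumed formula: r_corrected = r_observed / sqrt(criterion reliability).
    """
    if not 0.0 < criterion_reliability <= 1.0:
        raise ValueError("criterion_reliability must be in (0, 1]")
    return r_observed / math.sqrt(criterion_reliability)


def correct_for_direct_range_restriction(r_observed: float,
                                         sd_unrestricted: float,
                                         sd_restricted: float) -> float:
    """Correction for direct range restriction on the predictor.

    Assumed formula (often labeled Thorndike Case II), with
    u = sd_unrestricted / sd_restricted:
        r_corrected = r * u / sqrt(1 + r^2 * (u^2 - 1))
    """
    if sd_unrestricted <= 0.0 or sd_restricted <= 0.0:
        raise ValueError("standard deviations must be positive")
    u = sd_unrestricted / sd_restricted
    return (r_observed * u) / math.sqrt(1.0 + r_observed ** 2 * (u ** 2 - 1.0))


def validity_report(r_observed: float,
                    criterion_reliability: float,
                    sd_unrestricted: float,
                    sd_restricted: float) -> dict:
    """Return the uncorrected and corrected coefficients side by side."""
    return {
        "uncorrected": r_observed,
        "corrected_for_criterion_unreliability":
            correct_for_criterion_unreliability(r_observed, criterion_reliability),
        "corrected_for_range_restriction":
            correct_for_direct_range_restriction(
                r_observed, sd_unrestricted, sd_restricted),
    }


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the functions.
    for label, value in validity_report(0.30, 0.64, 2.0, 1.0).items():
        print(f"{label}: {value:.3f}")
```

Significance tests computed for the uncorrected coefficient are deliberately left out of the report, since the comment cautions against applying them to corrected values.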
Other features to be considered include issues such as missing data for some variables for some individuals, decisions about the retention or removal of extreme data points, the effects of capitalization on chance in selecting predictors from a larger set on the basis of strength of predictor-criterion relationships, and the possibility of spurious predictor-criterion relationships, as in the case of collecting criterion ratings from supervisors who know selection test scores.

Standard 14.6

Evidence of predictor-criterion relationships in a current local situation should not be inferred from a single previous validation study unless the previous study of the predictor-criterion relationship was done under favorable conditions (i.e., with a large sample size and a relevant criterion) and if the current situation corresponds closely to the previous situation.

Comment: Close correspondence means that the job requirements or underlying psychological constructs are substantially the same (as is determined by a job analysis), and that the predictor is substantially the same.

Standard 14.7

If tests are to be used to make job classification decisions (e.g., the pattern of predictor scores will be used to make differential job assignments), evidence that scores are linked to different levels or likelihoods of success among jobs or job groups is needed.

Standard 14.8

Evidence of validity based on test content requires a thorough and explicit definition of the content domain of interest. For selection, classification, and promotion, the characterization of the domain should be based on job analysis.
Comment: In general, the job content domain should be described in terms of job tasks or worker knowledge, skills, abilities, and other personal characteristics that are clearly operationally defined so that they can be linked to test content, and for which job demands are not expected to change substantially over a specified period of time. Knowledge, skills, and abilities included in the content domain should be those the applicant should already possess when being considered for the job in question.

Standard 14.9

When evidence of validity based on test content is a primary source of validity evidence in support of the use of a test in selection or promotion, a close link between test content and job content should be demonstrated.

Comment: For example, if the test content samples job tasks with considerable fidelity (e.g., actual job samples such as machine operation) or, in the judgment of experts, correctly simulates job task content (e.g., certain assessment center exercises), or samples specific job knowledge required for successful job performance (e.g., information necessary to exhibit certain skills), then content-related evidence can be offered as the principal form of evidence of validity. If the link between the test content and the job content is not clear and direct, other lines of validity evidence take on greater importance.

Standard 14.10

When evidence of validity based on test content is presented, the rationale for defining and describing a specific job content domain in a particular way (e.g., in terms of tasks to be performed or knowledge, skills, abilities, or other personal characteristics) should be stated clearly.

Comment: When evidence of validity based on test content is presented for a job or class of jobs, the evidence should include a description of the major job characteristics that a test is meant to sample, including the relative frequency, importance, or criticality of the elements.
Standard 14.11

If evidence based on test content is a primary source of validity evidence supporting the use of a test for selection into a particular job, a similar inference should be made about the test in a new situation only if the critical job content factors are substantially the same (as is determined by a job analysis), the reading level of the test material does not exceed that appropriate for the new job, and there are no discernible features of the new situation that would substantially change the original meaning of the test material.

Standard 14.12

When the use of a given test for personnel selection relies on relationships between a predictor construct domain that the test represents and a criterion construct domain, two links need to be established. First, there should be evidence for the relationship between the test and the predictor construct domain, and second, there should be evidence for the relationship between the predictor construct domain and major factors of the criterion construct domain.

Comment: There should be a clear conceptual rationale for these linkages. Both the predictor construct domain and the criterion construct domain to which it is to be linked should be defined carefully. There is no single route to establishing these linkages. Evidence in support of linkages between the two construct domains can include patterns of findings in the research literature and systematic evaluation of job content to identify predictor constructs linked to the criterion domain. The bases for judgments linking the predictor and criterion construct domains should be articulated.
Standard 14.13

When decision makers integrate information from multiple tests or integrate test and nontest information, the role played by each test in the decision process should be clearly explicated, and the use of each test or test composite should be supported by validity evidence.

Comment: A decision maker may integrate test scores with interview data, reference checks, and many other sources of information in making employment decisions. The inferences drawn from test scores should be limited to those for which validity evidence is available. For example, viewing a high test score as indicating overall job suitability, and thus precluding the need for reference checks, would be an inappropriate inference from a test measuring a single narrow, albeit relevant, domain, such as job knowledge. In other circumstances, decision makers integrate scores across multiple tests, or across multiple scales within a given test.

Standard 14.14

The content domain to be covered by a credentialing test should be defined clearly and justified in terms of the importance of the content for credential-worthy performance in an occupation or profession. A rationale should be provided to support a claim that the knowledge or skills being assessed are required for credential-worthy performance in an occupation and are consistent with the purpose for which the licensing or certification program was instituted.

Comment: Some form of job or practice analysis provides the primary basis for defining the content domain. If the same examination is used in the licensure or certification of people employed in a variety of settings and specialties, a number of different job settings may need to be analyzed. Although the job analysis techniques may be similar to those used in employment testing, the emphasis for licensure is limited appropriately to knowledge and skills necessary for effective practice. The knowledge and skills contained in a core curriculum designed to train people for the job or occupation may be relevant, especially if the curriculum has been designed to be consistent with empirical job or practice analyses.

In tests used for licensure, skills that may be important to success but are not directly related to the purpose of licensure (e.g., protecting the public) should not be included. For example, in real estate, marketing skills may be important for success as a broker, and assessment of these skills might have utility for agencies selecting brokers for employment. However, lack of these skills may not present a threat to the public and would appropriately be excluded from consideration for a licensing examination. The fact that successful practitioners possess certain knowledge or skills is relevant but not persuasive. Such information needs to be coupled with an analysis of the purpose of a licensing program and the reasons that the knowledge or skill is required in an occupation or profession.

Standard 14.15

Estimates of the reliability of test-based credentialing decisions should be provided.

Comment: The standards for decision reliability described in chapter 2 are applicable to tests used for licensure and certification. Other types of reliability estimates and associated standard errors of measurement may also be useful, but the reliability of the decision of whether or not to certify is of primary importance.

Standard 14.16

Rules and procedures used to combine scores on multiple assessments to determine the overall outcome of a credentialing test should be reported to test takers, preferably before the test is administered.

Comment: In some cases, candidates may be required to score above a specified minimum on each of several tests. In other cases, the pass-fail decision may be based solely on a total composite score. While candidates may be told that tests will be combined into a composite, the specific weights given to various components may not be known in advance (e.g., to achieve equal effective weights, nominal weights will depend on the variance of the components).

Standard 14.17

The level of performance required for passing a credentialing test should depend on the knowledge and skills necessary for acceptable performance in the occupation or profession and should not be adjusted to regulate the number or proportion of persons passing the test.

Comment: The number or proportion of persons granted credentials should be adjusted, if necessary, on some basis other than modifications to either the passing score or the passing level. The cut score should be determined by a careful analysis and judgment of acceptable performance. When there are alternate forms of the test, the cut score should be carefully equated so that it has the same meaning for all forms.

15. TESTING IN PROGRAM EVALUATION AND PUBLIC POLICY

Background

Tests are widely used in program evaluation and in public policy decision making. Program evaluation is the set of procedures used to make judgments about the client's need for a program, the way it is implemented, its effectiveness, and its value. Policy studies are somewhat broader than program evaluations and refer to studies that contribute to judgments about plans, principles, or procedures enacted to achieve broad public goals. There is no sharp distinction between policy studies and program evaluations, and in many instances there is substantial overlap between the two types of investigations. Test results are often one important source of evidence for the initiation, continuation, modification, termination, or expansion of various programs and policies.
Interpretation of test scores in program evaluation and policy studies usually entails the complex analysis of a number of variables. For example, some programs are mandated for a broad population; others target only certain subgroups. Some are designed to affect attitudes, while others are intended to have a more direct impact on behavior. It is important that the participants included in any study at least meet the specified criteria for the program or policy under review so that appropriate interpretation of test results will be possible. Test results will reflect not only the effects of rules for participant selection and the impact of participation in different programs or treatments, but also the characteristics of those tested. Relevant background information about clients or students may be obtained in order to strengthen the inferences derived from the test results. Valid interpretations may depend upon additional considerations that have nothing to do with the appropriateness of the test or its technical quality, including study design, administrative feasibility, and the quality of other available data. It is not the intent of this chapter to deal with these varied considerations in any substantial way. In order to develop defensible conclusions, however, investigators conducting program evaluations and policy studies are encouraged to supplement test results with data from other sources. These include information about program characteristics, delivery, costs, client backgrounds, degree of participation, and evidence of side effects. Because test results lend important weight to evaluation and policy studies, it is critical that any tests used in these investigations be sensitive to the questions of the study and appropriate for the test takers. It is important to evaluate any proposed test in terms of its relevance to the goals of the program or policy and/or to the particular question its use will address.
It is relatively rare for a test to be designed specifically for program evaluation or policy study purposes. Typically, the instruments used in such studies were originally developed for purposes other than program or policy evaluation. In addition, because of cost or convenience, certain tests may be adopted for use in a program evaluation or policy study even though they may have been developed for a somewhat different population of respondents. Some tests may be selected for use in program evaluation or policy studies because the tests are well known and thought to be especially credible to the clients or the public consumer. Even though certain tests may be more familiar to the public or may be less time-consuming or less expensive to use than an instrument developed specifically for the evaluation, they may be nonetheless inappropriate for use as criterion measures to determine the need for or to evaluate the effects of particular interventions.

As government agencies and other institutions move to improve their own routine data collection capability, fewer special studies are conducted to evaluate programs and policies. Instead, evaluations and policy studies may depend upon a special analysis of data previously collected for other purposes. In these cases, the investigators may reanalyze test data already obtained and analyzed for another purpose in order to make inferences about program or policy effectiveness. This procedure is called secondary data analysis. In some circumstances, it may be difficult to assure a good match between the existing test and the intervention or the policy under examination. Moreover, it may be difficult to reconstruct in detail the conditions under which the data were originally collected. Secondary data analysis also requires consideration of whether adequate informed consent was obtained from subjects in the original data collection to allow secondary analysis to occur without obtaining additional consent.

In selecting (or developing) a test or in deciding to use existing data in evaluation and policy studies, careful investigators attempt to balance the purpose of the test, its likelihood to be sensitive to the intervention under study, the credibility of the test to interested parties, and the costs of its administration. Otherwise, test results may lead to inappropriate interpretations about the progress, impact, and overall value of programs and policies under review.

Program Evaluation

Tests may be used in program evaluations to provide information on the status of clients or students before, during, or following an intervention, as well as to provide information on appropriate comparison groups. Whereas understanding the performance of an individual student or client is often the goal of many testing activities, program evaluation targets the performance of, or impact on, groups. Tests are used in program evaluations in a variety of fields, such as social services, education, health services, and military and employment training. The term program, broadly interpreted, describes interventions that range from large-scale state or national programs with provisions for local flexibility to small-scale, more experimental projects. In many cases, evaluation is mandated by the agency or funding source for the program, and the intervention is evaluated by judging its effectiveness in meeting stated goals. Some examples of programs that might use test results as part of their evaluation data include psychotherapeutic services, military training programs and job placement programs, school curricula, or services for individuals with special needs.

Test results, along with other information, may be used to compare competing interventions, such as alternative reading curricula or different psychotherapeutic interventions, or to describe the long-term pattern of effects for one or more groups. It is often important to assess a program for its differential effectiveness in meeting the needs of subgroups (such as different ethnic or gender groups within the target population). Even though the performance of groups is of primary interest in program evaluation, the analysis of individuals' histories and test performances may provide additional useful information to aid in the interpretation of test results.

Because of administrative realities, such as cost constraints and response burden, methodological refinements may be adopted to increase the efficiency of testing. One strategy is to obtain a sample of participants to be evaluated from the larger set of those exposed to a program or policy. When there is a sufficient number of clients affected by the program or policy to be evaluated, and when there is a desire to limit the time spent on testing, evaluators can create multiple forms of shorter tests from a larger pool of items. By constructing a number of different test forms consisting of relatively few items and assigning these test forms to different subsamples of test takers (a procedure known as matrix sampling), a larger number of items can be included in the study than could reasonably be administered to any single test taker. When it is desirable to represent a domain with a large number of test items, this approach is often used. However, individual scores are not usually created or interpreted when matrix sampling is employed.
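The matrix sampling procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any particular evaluation design: the function names are hypothetical, the forms are built as disjoint round-robin blocks (operational designs may instead overlap items across forms), and results are summarized per item at the group level, with no score computed for an individual test taker.

```python
import random
from collections import defaultdict


def build_forms(item_ids, n_forms):
    """Split an item pool into n_forms short, disjoint forms (round-robin)."""
    forms = [[] for _ in range(n_forms)]
    for position, item in enumerate(item_ids):
        forms[position % n_forms].append(item)
    return forms


def assign_forms(test_taker_ids, n_forms, seed=0):
    """Randomly assign each test taker to one form, keeping subsamples balanced."""
    rng = random.Random(seed)
    shuffled = list(test_taker_ids)
    rng.shuffle(shuffled)
    return {taker: index % n_forms for index, taker in enumerate(shuffled)}


def item_means(responses):
    """Aggregate (taker_id, item_id, score) records into a mean score per item.

    Only group-level item statistics are produced; individual total scores
    are intentionally not computed.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _taker, item, score in responses:
        totals[item] += score
        counts[item] += 1
    return {item: totals[item] / counts[item] for item in totals}


if __name__ == "__main__":
    # Hypothetical pool of 12 items spread over 3 forms of 4 items each.
    forms = build_forms(list(range(12)), n_forms=3)
    assignment = assign_forms([f"taker_{i}" for i in range(9)], n_forms=3)
    for taker, form_index in sorted(assignment.items()):
        print(taker, "answers items", forms[form_index])
```

With this layout each test taker answers only 4 of the 12 items, while the study as a whole still covers the full pool.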
Because procedures for sampling individuals or test items may vary in a number of ways, adequate analysis and interpretation of test results for any study depend upon a clear description of how samples were formed and the manner in which test results were aggregated.

Policy Uses of Tests

As noted previously, tests are also used in policy analyses, and the distinction between program evaluation and policy uses of tests is often a matter of degree. Programs are expected to share particular goals, procedures, and resources. Policy is a broader term, applying to plans, principles, procedures, or programs enacted to achieve particular goals in different settings. Programs provide direct services or interventions. Policies may be constructed to achieve their goals by direct or indirect means. Indeed, one direct approach used to achieve a policy goal might include the funding of specific programs. Other examples of direct policy approaches might involve the provision of training resources to improve performance in particular health-service occupations, or the enactment of new recertification requirements for accountants. Studies of the need for or impact of both of these policies could in part depend upon the analyses of test results. To illustrate in more depth, to meet the general policy objective of containing the costs of health care, direct policies might include giving incentives to clients to participate in fitness programs and the development of patient education programs. Tests could measure the understandings and attitudes of participants about the relationship of fitness to the prevention of illness.

Another policy example, using a more indirect approach, is to encourage educators to create more effective programs for children from low-income families. As an approach, a state's educational authorities might require the separate reporting of test scores for children in high-poverty areas. Large differences in group performance would be expected to attract the attention of the public and to place greater pressure on the schools to improve the performance of particular groups of children.

In decentralized governments, policy implementation may be left to local authorities and may be interpreted in a number of different ways. As a result, it may be difficult to select or develop a single test or outcome measure that will be sensitive to the range of different activities or tactics used to implement a given policy. For that reason, policy studies may often use more than one test or outcome measure to provide a more adequate picture of the range of effects.

Issues in Program and Policy Evaluation

Test results are sometimes used as one way to inspire program administrators as well as to infer institutional effectiveness. This use of tests, including the public reporting of results, is thought to encourage an institution to improve its services for its clients. For example, consistently poor achievement test results may trigger special management attention for public schools in some locales. The interpretation of test results is especially complex when tests are used both as an institutional policy mechanism and as a measure of effectiveness. For example, a policy or program may be based on the assumption that providing clear goals and general specifications of test content (such as the type of topics, constructs and cognitive domains, and responses included in the test) may be a reasonable strategy to communicate new expectations to educators. Yet, the desire to influence test or evaluation results to show acceptable institutional performance could lead to inappropriate testing practices, such as teaching the test items in advance, modifying test administration procedures, discouraging certain students or clients from participating in the testing sessions, or focusing exclusively on test-taking procedures.
These practices might occur instead of those aimed at helping the test taker learn the domains measured by the test. Because results derived from such practices might lead to spuriously high estimates of impact and might reflect the negative side effects of this particular policy, diligent investigators may estimate the impact of such consequences in order to interpret the test results appropriately. Looking at possible inappropriate consequences of tests as well as their benefits will better assess policy claims that particular types of testing programs lead to improved performance.

On the other hand, policy studies and program evaluations often do not make available reports of results to the test takers and may give no clear reasons to the test taker for participating in the testing procedure. For example, when matrix sampling is used for program evaluation, it may not be feasible to provide such reports. If little effort is made to motivate the test taker to regard the test seriously (for instance, if the purpose of the test is not explained to the test taker), it is possible that test takers might have little reason to try to perform well on the test. Obtained test results then might well underrepresent the impact of the program, institution, or policy because of poor motivation on the part of the test taker. When there is a suspicion that the test might not have been taken seriously, motivation of test takers may be explored by collecting additional information, using observation or interview methods.

The issues of inappropriate preparation or unmotivated performance are examples that raise basic questions about the validity of interpretations of test results. In every case, it is important to consider the potential impact of the testing process itself, including test administration and reporting practices, on the test taker.
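As an illustration of the matrix sampling procedure referred to in this chapter, the following minimal Python sketch divides an item pool into several short forms and assigns each form to a different subsample of test takers. The function name, the round-robin allocation, and the pool and sample sizes are assumptions made for the example; the chapter does not prescribe any particular sampling rule.

```python
import random


def matrix_sample(item_ids, test_taker_ids, n_forms, seed=0):
    """Split an item pool into short forms and spread test takers across them.

    Hypothetical helper: returns (forms, assignment), where forms is a list
    of item lists and assignment maps each test taker to a form index.
    """
    rng = random.Random(seed)

    items = list(item_ids)
    rng.shuffle(items)
    # Deal the shuffled items round-robin so each item lands on exactly one form.
    forms = [items[i::n_forms] for i in range(n_forms)]

    takers = list(test_taker_ids)
    rng.shuffle(takers)
    # Rotate shuffled test takers through the forms to get near-equal subsamples.
    assignment = {taker: idx % n_forms for idx, taker in enumerate(takers)}
    return forms, assignment


# Assumed sizes: a 60-item pool, 300 test takers, 4 forms of 15 items each.
forms, assignment = matrix_sample(range(60), range(300), n_forms=4)
```

Under these assumptions every item in the pool is administered to some subsample, while no test taker answers more than a quarter of the items. Results would then be aggregated at the group level, consistent with the statement above that individual scores are not usually created under matrix sampling.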
Public policy decisions are rarely based solely on the results of empirical studies, even when the studies have been well done. The more expansive and indirect the policy, the more likely will it be that other considerations will come into play, such as the political and economic impact of abandoning, changing, or retaining the policy or the reaction to offering rewards or sanctions to institutions. In a political climate, tests used in policy settings may be subjected to intense and detailed scrutiny. When results do not support a favored position, attempts may be made to discount the appropriateness of the testing procedure, construct, or interpretation.

It is important that all tests used in public evaluation or policy contexts meet the standards described in earlier chapters. As described in chapter 8, tests are to be administered by trained personnel. It is also essential that assistance be provided to those responsible for interpreting study results to practitioners, to the lay public, and to the media. Careful communication of the study's goals, procedures, findings, and limitations increases the chances that the public's interpretations will be accurate and useful.

Additional Considerations

This chapter and its associated standards are directed to users of tests in program evaluation and policy studies and to the conditions under which those studies are usually conducted. Other standards documents that are relevant to this chapter include The Program Evaluation Standards: How to Assess Evaluations of Educational Programs, prepared by the Joint Committee on Standards for Educational Evaluation (2nd ed., Thousand Oaks, CA: Sage Publications, 1994), and the Code of Fair Testing Practices in Education, prepared by the Joint Committee on Testing Practices (Washington, DC: Joint Committee on Testing Practices, 1988).

Standard 15.1

When the same test is designed or used to serve multiple purposes, evidence of technical quality for each purpose should be provided.

Comment: In educational testing, for example, it has become common practice to use the same test for multiple purposes (e.g., monitoring achievement of individual students, providing information to assist in instructional planning for individuals or groups of students, evaluating schools or districts). No test will serve all purposes equally well. Choices in test development and evaluation that enhance validity for one purpose may diminish validity for other purposes. Different purposes require somewhat different kinds of technical evidence, and appropriate evidence of technical quality for each purpose should be provided by the test developer. If the test user wishes to use the test for a purpose not supported by the available evidence, it is incumbent on the user to provide the necessary additional evidence.

Standard 15.2

Evidence should be provided of the suitability of a test for use in evaluation or policy studies, including the relevance of the test to the goals of the program or policy under study and the suitability of the test for the populations involved.

Comment: Faulty inferences may be made when test scores are not sensitive to the features of a particular intervention. For instance, a test designed for selection may be ineffective as a measure of the effects of an intervention. It is also important to employ tests that are appropriate for the age and background of test takers.

Standard 15.3

When change or gain scores are used, the definition of such scores should be made explicit, and their technical qualities should be reported.

Comment: The use of change or gain scores presumes that the same test or equivalent forms of the test were used and that the test (or forms) have not been materially altered between administrations. The standard error of the difference between scores on pretests and posttests, the regression of posttest scores on pretest scores, or relevant data from other reliable methods for examining change, such as those based on structural equation modeling, should be reported.

Standard 15.4

In program evaluation or policy studies, investigators should complement test results with information from other sources to generate defensible conclusions based on the interpretation of test results.

Comment: Descriptions or analyses of such variables as client selection criteria, services, clients, setting, and resources are often needed to provide a comprehensive picture of the program or policy under review and to aid in the interpretation of test results. Performance on indicators other than tests is almost always useful and in many cases is essential. Examples of other information include attrition rates or patterns of participation. Another source of information might be to determine the degree of motivation of the test takers. When individual scores are not reported to test takers, it is important to determine whether the examinees took the test experience seriously.

Standard 15.5

Agencies using tests to conduct program evaluations or policy studies, or to monitor outcomes, should clearly describe the population the program or policy is intended to serve and should document the extent to which the sample of test takers is representative of that population.

Comment: For example, a clinic with a diverse client population using testing to assess the outcome of a particular treatment may routinely report the extent of participation by subgroups of clients, for instance, those of diverse ethnic backgrounds or for whom English is a second language.
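The standard on change or gain scores above asks that the standard error of the difference between pretest and posttest scores be reported. One common way to approximate that quantity is sketched below in Python. The sketch assumes classical test theory with independent measurement errors on the two occasions; the function names, score values, standard deviations, and reliabilities are hypothetical and are not taken from the Standards.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def se_of_difference(sd_pre, rel_pre, sd_post, rel_post):
    """Standard error of a posttest-minus-pretest difference.

    Assumes the measurement errors on the two occasions are independent,
    so their error variances add.
    """
    return math.sqrt(sem(sd_pre, rel_pre) ** 2 + sem(sd_post, rel_post) ** 2)


# Hypothetical test taker: pretest 50, posttest 58, on forms with SD 10
# and reliability 0.91 on each occasion.
gain = 58.0 - 50.0
se_diff = se_of_difference(10.0, 0.91, 10.0, 0.91)  # sqrt(18), about 4.24
ratio = gain / se_diff  # gain expressed in standard-error units
```

Reporting the gain alongside its standard error lets a reader judge whether an observed change is large relative to measurement error. It does not by itself address the other evidence named in the comment, such as the regression of posttest scores on pretest scores.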
Standard 15.6

When matrix sampling procedures are used for program evaluation or population descriptions, rules for sampling items and test takers should be provided, and reliability analyses must take the sampling scheme into account.

Standard 15.7

When educational testing programs are mandated by school, district, state, or other authorities, the ways in which test results are intended to be used should be clearly described. It is the responsibility of those who mandate the use of tests to identify and monitor their impact and to minimize potential negative consequences. Consequences resulting from the uses of the test, both intended and unintended, should also be examined by the test user.

Comment: Mandated testing programs are often justified in terms of their potential benefits for teaching and learning. Concerns have been raised about the potential negative impact of mandated testing programs, particularly when they affect important decisions for individuals or institutions. To the extent possible, students, parents, and staff should be informed of the domains on which the students will be tested, the nature of the item types, and the standards for mastery. Effort should be made to document the provision of instruction in tested content and skills, even though it may not be possible or feasible to determine the specific content of instruction for every student. An example of negative impact is the use of strategies to raise performance artificially.

Standard 15.8

When it is clearly stated or implied that a recommended test use will result in a specific outcome, the basis for expecting that outcome should be presented, together with relevant evidence.

Comment: A given claim for the benefits of test use, such as improving students' achievement, may be supported by logical or theoretical argument as well as empirical data. Due weight should be given to findings in the scientific literature that may be inconsistent with the stated claim.

Standard 15.9

The integrity of test results should be maintained by eliminating practices designed to raise test scores without improving performance on the construct or domain measured by the test.

Comment: Such practices may include teaching test items in advance, modifying test administration procedures, and discouraging or excluding certain test takers from taking the test. These practices can lead to spuriously high scores that do not reflect performance on the underlying construct or domain of interest.

Standard 15.10

Those who have a legitimate interest in an assessment should be informed about the purposes of testing, how tests will be administered and scored, how long records will be retained, and to whom and under what conditions the records may be released.

Comment: Those with a legitimate interest may include the test takers, their parents or guardians, or personnel who may be affected by results (teachers, program staff).

Standard 15.11

When test results are released to the public or to policymakers, those responsible for the release should provide and explain any supplemental information that will minimize possible misinterpretations of the data.

Comment: The context and limitations of the study should be described, with particular attention given to methods of causal inferences.

Standard 15.12

Reports of group differences in average test scores should be accompanied by relevant contextual information, where possible, to enable meaningful interpretation of these differences. Where appropriate contextual information is not available, users should be cautioned against misinterpretation.

Comment: Observed differences in average test scores between groups (e.g., classified by gender, race/ethnicity, or geographical region) can be influenced, for example, by differences in life experiences, training experience, effort, instructor quality, or level and type of parental support. In education, differences in group performance across time may be influenced by changes in the population of those tested or changes in their experiences. Users should be advised to consider the appropriate contextual information and be cautioned against misinterpretation.

Standard 15.13

Those who mandate testing programs should ensure that the individuals who interpret the test results to make decisions within the school or program context are qualified to assume this responsibility and proficient in the appropriate methods for interpreting test results.

Comment: When testing programs are used as a strategy for guiding interventions or instruction, professionals expected to make inferences leading to program improvement may need assistance in interpreting test results for this purpose. The interpretation of some test scores is sufficiently complex to require that the user have relevant psychological training and experience. Examples of such tests include individually administered intelligence tests, personality inventories, projective techniques, and neuropsychological tests.

GLOSSARY

This glossary provides definitions of terms as used in this text. For many of the terms, multiple definitions can be found in the literature; also, technical usage may differ from common usage.

adaptive testing A
sequential form of individual testing in which successive items, or sets of items, in the test are chosen based primarily on their psychometric properties and content, in relation to the test taker's responses to previous items.

ability testing The use of standardized tests to evaluate the current performance of a person in some defined domain of cognitive, psychomotor, or physical functioning.

ability/trait parameter In item response theory (IRT), a theoretical value indicating the level of a test taker on the ability or trait measured by the test; analogous to the concept of true score in classical test theory.

absolute score interpretation The meaning of a test score for an individual or an average score for a defined group, indicating an individual's or group's level of performance in some defined criterion domain. By contrast, see relative score interpretation.

accommodation See test modification.

acculturation The process whereby individuals from one culture adopt the characteristics and values of another culture with which they have come in contact.

achievement levels/proficiency levels Descriptions of a test taker's competency in a particular area of knowledge or skill, usually defined as ordered categories on a continuum, often labeled from "basic" to "advanced," or "novice" to "expert," that constitute broad ranges for classifying performance. See cut score.

achievement testing A test to evaluate the extent of knowledge or skill attained by a test taker in a content domain in which the test taker had received instruction.

adjusted validity/reliability coefficient A validity or reliability coefficient (most often, a product-moment correlation) that has been adjusted to offset the effects of differences in score variability, criterion variability, or the unreliability of test and/or criterion. See restriction of range or variability.

age equivalent The chronological age in a defined population for which a given score is the median (middle) score. Thus, if children 10 years and 6 months of age have a median score of 17 on a test, the score 17 is said to have an age equivalent of 10-6 for that population. See grade equivalent.

alternate forms Two or more versions of a test that are considered interchangeable, in that they measure the same constructs in the same ways, are intended for the same purposes, and are administered using the same directions. Alternate forms is a generic term used to refer to any of three categories. Parallel forms have equal raw score means, equal standard deviations, equal error structures, and equal correlations with other measures for any given population. Equivalent forms do not have the statistical similarity of parallel forms, but the dissimilarities in raw score statistics are compensated for in the conversions to derived scores or in form-specific norm tables. Comparable forms are highly similar in content, but the degree of statistical similarity has not been demonstrated. See linkage.

analytic scoring A method of scoring in which each critical dimension of performance is judged and scored separately, and the resultant values are combined for an overall score. In some instances, scores on the separate dimensions may also be used in interpreting performance. See holistic scoring.

anchor test A common set of items administered with each of two or more different forms of a test for the purpose of equating the scores obtained on these forms.

assessment Any systematic method of obtaining information from tests and other sources, used to draw inferences about characteristics of people, objects, or programs.

attention assessment The process of collecting data and making an appraisal of a person's ability to focus on the relevant stimuli in a situation. The assessment may be directed at mechanisms involved in arousal, sustained attention, selective attention and vigilance, or limitation in the capacity to attend to incoming information.

automated narrative report See computer-prepared test interpretation.

back translation A translation of a test, which is itself a translation from an original test, back into the language of the original test. The degree to which a back translation matches the original test indicates the accuracy of the original translation.

battery A set of tests usually administered as a unit. The scores on the several tests usually are scaled so that they can readily be compared or used in combination for decision making.

bias In a statistical context, a systematic error in a test score. In discussing test fairness, bias may refer to construct underrepresentation or construct-irrelevant components of test scores that differentially affect the performance of different groups of test takers. See predictive bias, construct underrepresentation, construct irrelevance.

bilingual The characteristic of being relatively proficient in two languages.

calibration 1. In linking test score scales, the process of setting the test score scale, including mean, standard deviation, and possibly shape of score distribution, so that scores on a scale have the same relative meaning as scores on a related scale. 2. In item response theory, the process of determining the parameters of the response function for an item.

certification A voluntary process, often national in scope, by which individuals who have been certified have demonstrated some level of knowledge and skill in an occupation. See licensing, credentialing.

classical test theory A psychometric theory based on the view that an individual's observed score on a test is the sum of a true score component for the test taker, plus an independent measurement error component.

classification accuracy The degree to which neither false positive nor false negative categorizations and diagnoses occur when a test is used to classify an individual or event. See sensitivity and specificity.

coaching Planned short-term instructional activities in which prospective test takers participate prior to the test administration for the primary purpose of improving their test scores. Coaching typically includes simple practice, instruction on test-taking strategies, and related activities. Activities that approximate the instruction provided by regular school curricula or training programs are not typically referred to as coaching.

coefficient alpha An internal consistency reliability coefficient based on the number of parts into which the test is partitioned (e.g., items, subtests, or raters), the interrelationships of the parts, and the total test score variance. Also called Cronbach's alpha and, for dichotomous items, KR 20.

cognitive assessment The process of systematically gathering test scores and related data in order to make judgments about an individual's ability to perform various mental activities involved in the processing, acquisition, retention, conceptualization, and organization of sensory, perceptual, verbal, spatial, and psychomotor information.

composite score A score that combines several scores according to a specified formula.

computer-administered test A test administered by a computer. Questions appear on a computer-produced display, and the test taker answers by using a keyboard, "mouse" or other similar response device.

computer-based mastery test An adaptive test administered by computer that indicates whether or not the test taker has mastered a certain domain. The test is not designed to provide scores indicating degree of mastery but only whether the test performance was above or below some specified level. Thus a computer-based mastery test is not simply a mastery test given by computer. See mastery test.

computer-based test See computer-administered test.

computer-generated test interpretation See computer-prepared test interpretation.

computer-prepared test interpretation A programmed, computer-prepared interpretation of an examinee's test results, based on empirical data and/or expert judgment.

computerized adaptive test An adaptive test administered by computer. See adaptive testing.

conditional measurement error variance The variance of measurement errors that affect the scores of examinees at a specified test score level; the square of the conditional standard error of measurement.

conditional standard error of measurement The standard deviation of measurement errors that affect the scores of examinees at a specified test score level.

confidence interval An interval between two values on a score scale within which, with specified probability, a score or parameter of interest lies. The term is also used in these standards to designate Bayesian credibility intervals that define the probability that the unknown parameter falls in the specified interval.

configural scoring rule A rule for scoring a set of two or more elements (such as items or subtests) in which the score depends on a particular pattern of responses to the elements.

construct The concept or the characteristic that a test is designed to measure.

construct domain The set of interrelated attributes (e.g., behaviors, attitudes, values) that are included under a construct's label. A test typically samples from this construct domain.

construct equivalence 1. The extent to which the construct measured by one test is essentially the same as the construct measured by another test. 2. The degree to which a construct measured by a test in one cultural or linguistic group is comparable to the construct measured by the same test in a different cultural or linguistic group.

construct irrelevance The extent to which test scores are influenced by factors that are irrelevant to the construct that the test is intended to measure. Such extraneous factors distort the meaning of test scores from what is implied in the proposed interpretation.

construct underrepresentation The extent to which a test fails to capture important aspects of the construct that the test is intended to measure. In this situation, the meaning of test scores is narrower than the proposed interpretation implies.

construct validity A term used to indicate that the test scores are to be interpreted as indicating the test taker's standing on the psychological construct measured by the test. A construct is a theoretical variable inferred from multiple types of evidence, which might include the interrelations of the test scores with other variables, internal test structure, observations of response processes, as well as the content of the test. In the current standards, all test scores are viewed as measures of some construct, so the phrase is redundant with validity. The validity argument establishes the construct validity of a test. See construct, validity argument.

constructed response item An exercise for which examinees must create their own responses or products rather than choose a response from an enumerated set. Short-answer items require a few words or a number as an answer, whereas extended-response items require at least a few sentences.

content domain The set of behaviors, knowledge, skills, abilities, attitudes or other characteristics to be measured by a test, represented in a detailed specification, and often organized into categories by which items are classified.

content standard A statement of a broad goal describing expectations for students in a subject matter at a particular grade or at the completion of a level of schooling.

content validity A term used in the 1974 Standards to refer to a kind or aspect of validity that was "required when the test user wishes to estimate how an individual performs in the universe of situations the test is intended to represent" (p. 28). In the 1985 Standards, the term was changed to content-related evidence emphasizing that it referred to one type of evidence within a unitary conception of validity. In the current Standards, this type of evidence is characterized as "evidence based on test content."

convergent evidence Evidence based on the relationship between test scores and other measures of the same construct.

credentialing Granting to a person, by some authority, a credential, such as a certificate, license, or diploma, that signifies an acceptable level of performance in some domain of knowledge or activity.

criterion domain The construct domain of a variable used as a criterion. See construct domain.

criterion-referenced score interpretation See criterion-referenced test.

criterion-referenced test A test that allows its users to make score interpretations in relation to a functional performance level, as distinguished from those interpretations that are made in relation to the performance of others. Examples of criterion-referenced interpretations include comparison to cut scores, interpretations based on expectancy tables, and domain-referenced score interpretations.
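The glossary entry for coefficient alpha names three ingredients: the number of parts, the interrelationships of the parts, and the total test score variance. A minimal Python sketch of the usual computing formula, alpha = k/(k-1) x (1 - sum of part variances / total score variance), is given below. The function name and the small score matrix are hypothetical, and sample variances (n - 1 in the denominator) are an assumption of the example.

```python
from statistics import variance


def coefficient_alpha(scores):
    """Coefficient (Cronbach's) alpha for a rectangular score matrix.

    scores: list of rows, one per test taker; each row lists that
    person's scores on the k parts (items, subtests, or raters).
    """
    k = len(scores[0])
    # Variance of each part across test takers.
    part_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Variance of the total score across test takers.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(part_vars) / total_var)


# Hypothetical data: five test takers scored on three parts.
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [3, 3, 4],
    [5, 4, 5],
]
alpha = coefficient_alpha(scores)
```

The interrelationships of the parts enter through the total score variance: the more strongly the parts covary, the larger the total variance relative to the sum of the part variances, and the higher the coefficient.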
cross-validation A procedure in which a scoring system or set of weights for predicting performance, derived from one sample, is applied to a second sample in order to investigate the stability of prediction of the scoring system or weights.

cut score A specified point on a score scale, such that scores at or above that point are interpreted or acted upon differently from scores below that point. See performance standard.

derived score A score to which raw scores are converted by numerical transformation (e.g., conversion of raw scores to percentile ranks or standard scores).

diagnostic and intervention decisions Decisions based upon inferences derived from psychological test scores as part of an assessment of an individual that lead to placing the individual in one or more categories. See also intervention planning.

differential item functioning A statistical property of a test item in which different groups of test takers who have the same total test score have different average item scores or, in some cases, different rates of choosing various item options. Also known as DIF.

discriminant evidence Evidence based on the relationship between test scores and measures of different constructs.

documentation The body of literature (e.g., test manuals, manual supplements, research reports, publications, user's guides, etc.) made available by publishers and test authors to support test use.

domain sampling The process of selecting test items to represent a specified universe of performance.

empirical evidence Evidence based on some form of data, as opposed to that based on logic or theory. As used here, the term does not specify the type of evidence; this is in contrast to some settings where the term is equated with criterion-related evidence of validity.

equated forms Two or more test forms constructed to cover the same explicit content, to conform to the same statistical specifications, and to be administered under identical procedures (alternate forms); through statistical adjustments, the scores on the alternate forms share a common scale.

equating Putting two or more essentially parallel tests on a common scale. See alternate forms.

equivalent forms See alternate forms.

error of measurement The difference between an observed score and the corresponding true score or proficiency. See standard error of measurement and true score.

factor 1. Any variable, real or hypothetical, that is an aspect of a concept or construct. 2. In measurement theory, a statistical dimension defined by a factor analysis. See factor analysis.

factor analysis Any of several statistical methods of describing the interrelationships of a set of variables by statistically deriving new variables, called factors, that are fewer in number than the original set of variables.

factorial structure 1. The set of factors obtained in a factor analysis. 2. Technically, the correlation of each factor with each of the original variables from which the factors are derived.

fairness In testing, the principle that every test taker should be assessed in an equitable way. See chapter 7.

false negative In classification, diagnosis, or selection, an error in which an individual is assessed or predicted not to meet the criteria for inclusion in a particular group but in truth does (or would) meet these criteria. See sensitivity and specificity.

false positive In classification, diagnosis, or selection, an error in which an individual is assessed or predicted to meet the criteria for inclusion in a particular group but in truth does not (or would not) meet these criteria. See sensitivity and specificity.
field test A test administration used to check the adequacy of testing procedures, generally including test administration, test responding, test scoring, and test reporting. A field test is generally more extensive than a pilot test. See pilot test.
flag An indicator attached to a test score, a test item, or other entity to indicate a special status. A flagged test score generally signifies a score obtained in a modified, nonstandard test administration. A flagged test item generally signifies an item with undesirable characteristics, such as excessive differential item functioning.
functional equivalence In evaluating test translations, the degree to which similar activities or behaviors have the same functions in different cultural or linguistic groups.
gain score In testing, the difference between two scores obtained by a test taker on the same test or two equated tests taken on different occasions, often before and after some treatment.
generalizability coefficient A reliability index encompassing one or more independent sources of error. It is formed as the ratio of (a) the sum of variances that are considered components of test score variance in the setting under study to (b) the foregoing sum plus the weighted sum of variances attributable to various error sources in this setting. Such indices, which arise from the application of generalizability theory, are typically interpreted in the same manner as reliability coefficients. See generalizability theory.
generalizability theory An extension of classical reliability theory and methodology in which the magnitudes of errors from specified sources are estimated through the use of one or another experimental design, and the application of the statistical techniques of the analysis of variance. The analysis indicates the generalizability of scores beyond the specific sample of items, persons, and observational conditions that were studied.
grade equivalent The school grade level for a given population for which a given score is the median score in that population. See age equivalent.
high-stakes test A test used to provide results that have important, direct consequences for examinees, programs, or institutions involved in the testing.
holistic scoring A method of obtaining a score on a test, or a test item, based on a judgment of overall performance using specified criteria. See analytic scoring.
informed consent The agreement of a person, or that person's legal representative, for some procedure to be performed on or by the individual, such as taking a test or completing a questionnaire. The agreement, which is usually written, is made after the nature, possible effects, and use of the procedure has been explained.
intelligence test A psychological or educational test designed to measure an individual's level of cognitive functioning in accord with some recognized theory of intelligence.
internal consistency coefficient An index of the reliability of test scores derived from the statistical interrelationships of responses among item responses or scores on separate parts of a test.
176 AERA_APA_NCME_0000182
internal structure In test analysis, the factorial structure of item responses or subscales of a test. See factorial structure.
inter-rater agreement The consistency with which two or more judges rate the work or performance of test takers; sometimes referred to as inter-rater reliability.
intervention planning The activity of a practitioner that involves the development of a treatment protocol.
inventory A questionnaire or checklist, usually in the form of a self-report, that elicits information about an individual's personal opinions, interests, attitudes, preferences, personality characteristics, motivations, and typical reactions to situations and problems.
item A statement, question, exercise, or task on a test for which the test taker is to select or construct a response, or perform a task. See item prompt.
item characteristic curve A mathematical function relating the probability of a certain item response, usually a correct response, to the level of the attribute measured by the item. Also called item response curve, or item response function, or ICC.
item pool The aggregate of items from which a test or test scale's items are selected during test development, or the total set of items from which a particular test is selected for a test taker during adaptive testing.
item prompt The question, stimulus, or instructions that direct the efforts of examinees in formulating their responses to a constructed-response exercise.
item response theory (IRT) A mathematical model of the relationship between performance on a test item and the test taker's level of performance on a scale of the ability, trait, or proficiency being measured, usually denoted as θ. In the case of items scored 0/1 (incorrect/correct response) the model describes the relationship between θ and the item mean score (P) for test takers at level θ, over the range of permissible values of θ. In most applications, the mathematical function relating P to θ is assumed to be a logistic function that closely resembles the cumulative normal distribution.
job analysis A general term referring to the investigation of positions or job classes to obtain descriptive information about job duties and tasks, responsibilities, necessary worker characteristics (e.g. knowledge, skills, and abilities), working conditions, and/or other aspects of the work.
job performance measurement The measurement of an incumbent's performance of a job. This may include a job sample test, an assessment of job knowledge, and possibly ratings of the incumbent's actual performance on the job.
job sample test A test of the ability of an individual to perform the tasks of which the job is comprised.
licensing The granting, usually by a government agency, of an authorization or legal permission to practice an occupation or profession. See also certification, credentialing.
linkage The result of placing two or more tests on the same scale, so that scores can be used interchangeably. Several linking methods are used. See equating, calibration, moderation, projection, and alternate forms.
literature In this document, a term denoting accessible reports of research, such as books, articles published in professional journals, technical reports, and accessible versions of papers presented at professional meetings.
177 AERA_APA_NCME_0000183
local evidence Evidence (usually related to reliability or validity) collected for a specific set of test takers in a single institution or at a specific location.
local norms Norms by which test scores are referred to a specific, limited reference population of particular interest to the test user (e.g., locale, organization, or institution); local norms are not intended as representative of populations beyond that setting.
local setting The organization or institution where a test is used.
low-stakes test A test used to provide results that have only minor or indirect consequences for examinees, programs, or institutions involved in the testing.
mandated tests Tests that are administered because of a mandate from an external authority.
mastery test 1. A criterion-referenced test designed to indicate the extent to which the test taker has mastered some domain of knowledge or skill. Mastery is generally indicated by attaining a passing score or cut score. 2. In some technical use, a test designed to indicate whether a test taker has or has not attained a prescribed level of mastery of a domain. See cut score, computer-based mastery test.
matrix sampling A measurement format in which a large set of test items is organized into a number of relatively short item sets, each of which is randomly assigned to a subsample of test takers, thereby avoiding the need to administer all items to all examinees in a program evaluation.
meta-analysis A statistical method of research in which the results from several independent, comparable studies are combined to determine the size of an overall effect or the degree of relationship between two variables.
moderation In test linking, the term moderation, used without a modifier, usually signifies statistical moderation, which is the adjustment of the score scale of one test, usually by setting the mean and standard deviation of one set of test scores to be equal to the mean and standard deviation of another distribution of test scores.
moderator variable In regression analysis, a variable that serves to explain, at least in part, the correlation of two other variables.
modification See test modification.
neuropsychodiagnosis Classification or description of inferred central nervous system status on the basis of neuropsychological assessment.
neuropsychological assessment A specialized type of psychological assessment of normal or pathological processes affecting the central nervous system and the resulting psychological and behavioral functions or dysfunctions.
norm-referenced test interpretation A score interpretation based on a comparison of a test taker's performance to the performance of other people in a specified reference population. See criterion-referenced test.
normalized standard score A derived test score in which a numerical transformation has been chosen so that the score distribution closely approximates a normal distribution, for some specific population.
norms Statistics or tabular data that summarize the distribution of test performance for one or more specified groups, such as test takers of various ages or grades. Norms are usually designed to represent some larger population, such as test takers throughout the country. The group of examinees represented by the norms is referred to as the reference population.
178 AERA_APA_NCME_0000184
operational use The actual use of a test, after initial test development has been completed, to inform an interpretation, decision, or action based, in part, upon test scores.
outcome evaluation An evaluation of the efficacy of an intervention.
parallel forms See alternate forms.
percentile The score on a test below which a given percentage of scores fall.
percentile rank Most commonly, the percentage of scores in a specified distribution that fall below the point at which a given score lies. Sometimes the percentage is defined to include scores that fall at the point; sometimes the percentage is defined to include half of the scores at the point.
performance assessments Product- and behavior-based measurements based on settings designed to emulate real-life contexts or conditions in which specific knowledge or skills are actually applied.
performance standard 1. An objective definition of a certain level of performance in some domain in terms of a cut score or a range of scores on the score scale of a test measuring proficiency in that domain. 2. A statement or description of a set of operational tasks exemplifying a level of performance associated with a more general content standard; the statement may be used to guide judgments about the location of a cut score on a score scale. The term often implies a desired level of performance. See cut score.
personality inventory An inventory that measures one or more characteristics that are regarded generally as psychological attributes or interpersonal proclivities or skills.
pilot test A test administered to a sample of test takers to try out some aspects of the test or test items, such as instructions, time limits, item response formats, or item response options. See field test.
policy The principles, plan, or procedures established by an agency, institution, organization, or government, generally with the intent of reaching a long-term goal.
portfolio In assessment, a systematic collection of educational or work products that have been compiled or accumulated over time, according to a specific set of principles.
precision of measurement A general term that refers to a measure's sensitivity to measurement error. See standard error of measurement, error of measurement.
practice analysis A general term referring to the investigation of a certain work position, or profession, to obtain descriptive information about the activities and responsibilities of the position and about the knowledge, skills, and abilities needed to engage in the work of the position. The concept is essentially the same as job analysis but is generally preferred for professional occupations involving a great deal of individual decision making. See job analysis.
predictive bias The systematic under- or overprediction of criterion performance for people belonging to groups differentiated by characteristics not relevant to criterion performance.
predictive validity A term used in the 1974 Standards to refer to a type of "criterion-related validity" that applies "when one wishes to infer from a test score an individual's most probable standing on some other variable called a criterion" (p. 26). In the 1985 Standards, the term criterion-related validity was changed to criterion-related evidence, emphasizing that it referred to one type of evidence within a unitary conception of validity. The current document refers to "evidence based on relations to other variables" that include "test-criterion relationships." Predictive evidence indicates how accurately test data can predict criterion scores that are obtained at a later time.
179 AERA_APA_NCME_0000185
program evaluation The collection and synthesis of systematic evidence about the use, operation, and effects of some planned set of procedures.
program norms See user norms.
projection In test scaling, a method of linking in which scores on one test (X) are used to predict scores on another test (Y). The projected Y score is the average Y score for all persons with a given X score. Like regression, the projection of test Y onto test X is different from the projection of test X onto test Y. See linkage.
proposed interpretation A summary, or a set of illustrations, of the intended meaning of test scores, based on the construct(s) or concept(s) the test is designed to measure.
protocol A record of events. A test protocol consists of the test record and test scores.
psychodiagnosis Formalization or classification of functional mental health status based on psychological assessment. See neuropsychodiagnosis.
psychological assessment A comprehensive examination of psychological functioning that involves collecting, evaluating, and integrating test results and collateral information, and reporting information about an individual. Various methods may be used to acquire information during a psychological assessment: administering, scoring and interpreting tests and inventories; behavioral observation; client and third-party interviews; analysis of prior educational, occupational, medical, and psychological records.
psychological testing Any procedure that involves the use of tests or inventories to assess particular psychological characteristics of an individual.
random error An unsystematic error; a quantity (often observed indirectly) that appears to have no relationship to any other variable.
random sample See sample.
raw score The unadjusted score on a test, often determined by counting the number of correct answers, but more generally a sum or other combination of item scores. In item response theory, the estimate of test taker proficiency, usually symbolized θ̂, is analogous to a raw score although, unlike a raw score, its scaling is not arbitrary.
reference population The population of test takers represented by test norms. The sample on which the test norms are based must permit accurate estimation of the test score distribution for the reference population. The reference population may be defined in terms of examinee age, grade, or clinical status at time of testing, or other characteristics.
relative score interpretation The meaning of the test score for an individual, or the average score for a definable group, derived from the rank of the score or average within one or more reference distributions of scores. See absolute score interpretation.
reliability The degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable, and repeatable for an individual test taker; the degree to which scores are free of errors of measurement for a given group. See generalizability theory.
180 AERA_APA_NCME_0000186
reliability coefficient A unit-free indicator that reflects the degree to which scores are free of measurement error. The indicator resembles (or is) a product-moment correlation. In classical test theory, the term represents the ratio of true score variance to observed score variance for a particular examinee population. The conditions under which the coefficient is estimated may involve variation in test forms, measurement occasions, raters, scorers, or clinicians, and may entail multiple examinee products or performances. These and other variations in conditions give rise to qualifying adjectives, such as alternate-form reliability, internal consistency reliability, test-retest reliability, etc. See generalizability theory.
response bias A test taker's tendency to respond in a particular way or style to items on a test (i.e., acquiescence, social desirability, the tendency to choose "true" on a true-false test) that yields systematic, construct-irrelevant error in test scores.
response process A component, usually hypothetical, of a cognitive account of some behavior, such as making an item response.
response protocol A record of the responses given by a test taker to a particular test.
restriction of range or variability Reduction in the observed score variance of an examinee sample, compared to the variance of the entire examinee population, as a consequence of constraints on the process of sampling examinees. See adjusted validity/reliability coefficient.
rubric See scoring rubric.
sample A selection of a specified number of entities called sampling units (test takers, items, etc.) from a larger specified set of possible entities, called the population. A random sample is a selection according to a random process, with the selection of each entity in no way dependent on the selection of other entities. A stratified random sample is a set of random samples, each of a specified size, from several different sets, which are viewed as strata of the population.
scale 1. The system of numbers, and their units, by which a value is reported on some dimension of measurement. Length can be reported in the English system of feet and inches or in the metric system of meters and centimeters. 2. In testing, scale sometimes refers to the set of items or subtests used in the measurement and is distinguished from a test in the type of characteristic being measured. One speaks of a test of verbal ability, but a scale of extroversion-introversion.
scale score See derived score.
scaling The process of creating a scale or a scale score. Scaling may enhance test score interpretation by placing scores from different tests or test forms onto a common scale or by producing scale scores designed to support criterion-referenced or norm-referenced score interpretations. See scale.
score Any specific number resulting from the assessment of an individual; a generic term applied for convenience to such diverse measures as test scores, estimates of latent variables, production counts, absence records, course grades, ratings, and so forth.
scoring formula The formula by which the raw score on a test is obtained. The simplest scoring formula is "raw score equals number correct." Other formulas differentially weight item responses. For example, in an attempt to correct for guessing or nonresponse, zero weights may be assigned to nonresponses and negative weights to incorrect responses.
181 AERA_APA_NCME_0000187
scoring rubric The established criteria, including rules, principles, and illustrations, used in scoring responses to individual items and clusters of items. The term usually refers to the scoring procedures for assessment tasks that do not provide enumerated responses from which test takers make a choice. Scoring rubrics vary in the degree of judgment entailed, in the number of distinct score levels defined, in the latitude given scorers for assigning intermediate or fractional score values, and in other ways.
split-halves reliability coefficient An internal consistency coefficient obtained by using half the items on the test to yield one score and the other half of the items to yield a second, independent score. The correlation between the scores on these two half-tests, adjusted via the Spearman-Brown formula, provides an estimate of the alternate-form reliability of the total test.
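The split-halves procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Standards' own method: the odd/even split of items, the function names, and the sample response matrices are all hypothetical choices. The Spearman-Brown step uses the standard classical test theory form, in which a test lengthened by a factor k has projected reliability k·r / (1 + (k − 1)·r).

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a half-test has no variance")
    return cov / sqrt(var_x * var_y)


def spearman_brown(r, length_factor=2.0):
    """Project reliability r to a test length_factor times as long."""
    return length_factor * r / (1.0 + (length_factor - 1.0) * r)


def split_half_reliability(item_scores):
    """item_scores: one row per test taker, one column per item.

    Assumption: items are split by position (1st, 3rd, ... versus
    2nd, 4th, ...); other ways of forming the halves are possible.
    """
    first_half = [sum(row[0::2]) for row in item_scores]
    second_half = [sum(row[1::2]) for row in item_scores]
    half_test_r = pearson(first_half, second_half)
    # Step the half-test correlation up to full test length.
    return spearman_brown(half_test_r, length_factor=2.0)


# Hypothetical 0/1 item responses for three test takers on four items.
consistent = [[1, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
print(split_half_reliability(consistent))
```

Calling `spearman_brown` with a `length_factor` below 1 projects the reliability of a shortened test, which matches the "shortened or lengthened" wording of the Spearman-Brown formula entry.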
screening test A test that is used to make broad categorizations of examinees as a first step in selection decisions or diagnostic processes.
security (of a test) See test security.
selection A purpose for testing that results in the acceptance or rejection of applicants for a particular educational or employment opportunity.
sensitivity In classification of disorders, the proportion of cases in which a disorder is detected when it is in fact present.
Spearman-Brown formula A formula derived within classical test theory that projects the reliability of a shortened or lengthened test from the reliability of a test of specified length.
specificity In classification of disorders, the proportion of cases for which a diagnosis of disorder is rejected when rejection is warranted.
speededness A test characteristic, dictated by the test's time limits, that results in a test taker's score being dependent on the rate at which work is performed as well as the correctness of the responses. The term is not used to describe tests of speed. Speededness is often an undesirable characteristic.
stability The extent to which scores on a test are essentially invariant over time. Stability is an aspect of reliability and is assessed by correlating the test scores of a group of individuals with scores on the same test, or an equated test, taken by the same group at a later time.
standard error of measurement The standard deviation of an individual's observed scores from repeated administrations of a test (or parallel forms of a test) under identical conditions. Because such data cannot generally be collected, the standard error of measurement is usually estimated from group data. See error of measurement.
standard score A type of derived score such that the distribution of these scores for a specified population has convenient, known values for the mean and standard deviation. The term is sometimes used to signify a mean of 0.0 and a standard deviation of 1.0. See derived score.
standardization 1. In test administration, maintaining a constant testing environment and conducting the test according to detailed rules and specifications, so that testing conditions are the same for all test takers. 2. In test development, establishing scoring norms based on the test performance of a representative sample of individuals with which the test is intended to be used. 3. In statistical analysis, transforming a variable so that its standard deviation is 1.0 for some specified population or sample. See standard score.
182 AERA_APA_NCME_0000188
standards-based assessment Assessments intended to represent systematically described content and performance standards.
stratified coefficient alpha A modification of coefficient alpha that renders it appropriate for a multi-factor test by defining the total score as the composite of scores on single-factor part-tests.
stratified sample See sample.
systematic error A consistent score component (often observed indirectly), not related to the test performance. See bias.
technical manual A publication prepared by test authors and publishers to provide technical and psychometric information on a test.
test An evaluative device or procedure in which a sample of an examinee's behavior in a specified domain is obtained and subsequently evaluated and scored using a standardized process.
test developer The person(s) or agency responsible for the construction of a test and for the documentation regarding its technical quality for an intended purpose.
test development The process through which a test is planned, constructed, evaluated, and modified, including consideration of content, format, administration, scoring, item properties, scaling, and technical quality for its intended purpose.
test documents Publications such as test manuals, technical manuals, user's guides, specimen sets, and directions for test administrators and scorers that provide information for evaluating the appropriateness and technical adequacy of a test for its intended purpose.
test information function A mathematical function relating each level of an ability or latent trait, as defined under item response theory (IRT), to the reciprocal of the corresponding conditional measurement error variance.
test manual A publication prepared by test developers and publishers to provide information on test administration, scoring, and interpretation and to provide technical data on test characteristics. See user's guide.
test modification Changes made in the content, format, and/or administration procedure of a test in order to accommodate test takers who are unable to take the original test under standard test conditions.
test security Limiting access to the specific content of a test to those who need to know it for test development, test scoring, and test evaluation. In particular, test items on secure tests are not published; unauthorized copying is forbidden by any test taker or anyone otherwise associated with the test. A secure test is not for publication in any form, in any venue.
test specifications A detailed description for a test, often called a test blueprint, that specifies the number or proportion of items that assess each content and process/skill area; the format of items, responses, and scoring rubrics and procedures; and the desired psychometric properties of the items and test such as the distribution of item difficulty and discrimination indices.
test user The person(s) or agency responsible for the choice and administration of a test, for the interpretation of test scores produced in a given context, and for any decisions or actions that are based, in part, on test scores.
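The standard score entry in this glossary, and the third sense of standardization, both describe rescaling scores to a chosen mean and standard deviation. A minimal sketch follows, assuming the listed scores are the whole specified population (so the population standard deviation is used). The function name, the default targets, and the sample scores are hypothetical.

```python
from statistics import fmean, pstdev


def standard_scores(raw_scores, target_mean=0.0, target_sd=1.0):
    """Linearly rescale raw scores to a chosen mean and standard deviation.

    Assumption: raw_scores is the full specified population, so the
    population standard deviation (pstdev) is the appropriate divisor.
    """
    mean = fmean(raw_scores)
    sd = pstdev(raw_scores)
    if sd == 0:
        raise ValueError("scores have no variance, so they cannot be standardized")
    return [target_mean + target_sd * (x - mean) / sd for x in raw_scores]


# Hypothetical raw scores with mean 5 and population standard deviation 2.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standard_scores(raw)            # mean 0.0, standard deviation 1.0
rescaled = standard_scores(raw, 50, 10)    # another convenient scale
print(z_scores)
print(rescaled)
```

The defaults correspond to the "mean of 0.0 and a standard deviation of 1.0" usage mentioned in the entry; any other convenient pair of values can be passed instead.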
test-retest reliability A reliability coefficient obtained by administering the same test a second time to the same group after a time interval and correlating the two sets of scores.
183 AERA_APA_NCME_0000189
timed tests A test administered to a test taker who is allotted a strictly prescribed amount of time to respond to the test.
top-down A method of selecting the best applicants according to some numerical scale of suitability. Often, "best" is taken to mean "highest scoring on some test."
translational equivalence The degree to which the translated version of a test is equivalent to the original test. Translational equivalence is typically examined in terms of the language used, the scores produced, and the constructs measured by the translated version and the original test. See back translation.
true score In classical test theory, the average of the scores that would be earned by an individual on an unlimited number of perfectly parallel forms of the same test. In item response theory, the error-free value of test taker proficiency, usually symbolized by θ.
unidimensional Having only one dimension, or only one latent variable.
validity The degree to which accumulated evidence and theory support specific interpretations of test scores entailed by proposed uses of a test.
validity argument An explicit scientific justification of the degree to which accumulated evidence and theory support the proposed interpretation(s) of test scores.
user norms Descriptive statistics (including percentile ranks) for a sample of test takers that does not represent a well-defined reference population, for example, all persons tested during a certain period of time, or a set of self-selected test takers. Also called program norms. See norms.
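User norms are reported with descriptive statistics such as percentile ranks, and the percentile rank entry in this glossary describes three ways of counting scores that fall exactly at the point of interest. The sketch below illustrates those three conventions. The function name, the `ties` keyword and its values, and the sample distribution are hypothetical.

```python
def percentile_rank(score, distribution, ties="below"):
    """Percentage of scores in `distribution` counted as falling below `score`.

    ties="below":   count only scores strictly below the point
    ties="include": also count scores that fall at the point
    ties="half":    count half of the scores at the point
    """
    if not distribution:
        raise ValueError("distribution must contain at least one score")
    below = sum(1 for x in distribution if x < score)
    at_point = sum(1 for x in distribution if x == score)
    if ties == "below":
        counted = below
    elif ties == "include":
        counted = below + at_point
    elif ties == "half":
        counted = below + 0.5 * at_point
    else:
        raise ValueError("ties must be 'below', 'include', or 'half'")
    return 100.0 * counted / len(distribution)


# Hypothetical distribution of five scores, two of them tied at 20.
scores = [10, 20, 20, 30, 40]
for convention in ("below", "include", "half"):
    print(convention, percentile_rank(20, scores, ties=convention))
```

With tied scores the three conventions give different answers for the same score, which is why a report of percentile ranks needs to state which definition was used.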
user's guide A publication prepared by the test authors and publishers to provide information on a test's purpose, appropriate uses, proper administration, scoring procedures, normative data, interpretation of results, and case studies. See test manual.
validation The process through which the validity of the proposed interpretation of test scores is investigated.
validity generalization Applying validity evidence obtained in one or more situations to other similar situations on the basis of simultaneous estimation, meta-analysis, or synthetic validation arguments.
variance components In testing, variances accruing from the separate constituent sources that are assumed to contribute to the overall variance of observed scores. Such variances, estimated by methods of the analysis of variance, often reflect situation, location, time, test form, rater, and related effects.
vocational assessment A specialized type of psychological assessment designed to generate hypotheses and inferences about interests, work needs and values, career development, vocational maturity, and indecision.
weighted scoring A method of scoring a test in which the number of points awarded for a correct (or diagnostically relevant) response is not the same for all items in the test. In some cases, the scoring formula awards more points for one response to an item than for another.
184 AERA_APA_NCME_0000190
INDEX
Numbers in this index refer to specific standard(s).
Acceptable performance on credentialing tests, 14.17
Based on knowledge and skills only, 14.17
Rules and procedures to determine overall outcome of credentialing tests, 14-
Accommodation, see "Test modifications"
Achievement in instructional domain, 13.3
Actuarial basis for recommendations and decisions, 12.17
Adaptive testing procedure, 2.16
Adequacy of fit, 3.9
Adequacy of item or test performance, 4.21
Confidentiality protection, 8.2
Consequence of misconduct, 8.2
Adjusted validity/reliability coefficient, 1.
I 8 Adminisrracion, 2.18, 3.6, 3.9,7.20-3.71, 5.1-5.7 , 6.7 6.8, 6. ¡ ¡, 6. I 5, 8. t-8.3, 9.3, 9.5, 9.|, I 0-1, 10.5r0.6, 10.8, 11.r, 11.3, 11.5, 11.9, tì.13, lt.t6, ll.r9. u.22, 12.5, 12.8, tz.L0-12.r2. 13.6, f3.t0- t3.r2, r3.t6, r3.18, r5.r0 Àccomodations for æminco with disabilìria, 2.t8, 10.r, t0.8, I t.l6 Ädcquarc uaining of adminisrmror, 12.8, t3.10, 13.r2 ¡\dvancc informarion, 8.2, 12. 10, I 5. AÌlernare merhods, 6.1 I, l l.6 l6 Scoring aireria, 8.2 Tcr Iaking srratcgis, 8-2, I I . l3 Tcsdng policy, 8.2, 12.10, 15.10 Jìmc limir, 8.2, 12.10 To rcst takcr,8.2, 8.4, 12.10 Uæof torscors,8.2, 12.10, 15.10 Advanccmcnt,9.S Altcrnatc forms, sec'Tqr forms" ¡{nchor rsr, 4.11,4.13 Pryclomctric characrcriscis, Rcpraenrarivcnas, 4. I 3 fubirmtion of dispurö, 8.1 I ,{.ttcnution, corrccrion for, I.t8, 2.6 Arr¡irion ra¡o, 15.4 4.1 3 ft Bcnch-uLs, 13.19 Clariry of direcrioro, 3.20 Compurer-adminisæred ræs, 2.8, 8.3, 13.18 Compurer-scorcd rar, t3.l 8 C¿libmtion, 4.15, 5.12, 12.12 Conditions, 3.9, 5.4, 8.1, 12.12 Cæc srudis,6.10, Corocnr fo¡ms, 6.15 Catcgorioì decisions, 2.1 5 Ccmu-rypc tcring programs, I L24 Changc scora, 13.17, 15.3 Chæctcrisrics of job, 14.10, l4-12 Disruprioro, 5.2 Examincc! mosr proficicnt languagc, 9,3 Guesing, 3.20 How to makc rsporoes, 5-5 Inrcrprerers, 9. I I Minimizc posibiliry of brcachcs in test sccuriry 5'6 Modifisrions of sandard proccdurc, 2.18,5.25.3,9.5, u.r9, r2.12 Moniroring, 5.4-5.5 Opporrunir¡' ro pracricc usìng cquipmcnt, 5.5 Papcr-and-pcncil adminisrntion, 2.8, 8.3 Pcrmissiblc variarion in condirions, 3.21 Pracrice matcrials, 3.20, 8. Protccr sccuriry of l, I 3. I l rcr rurcrials, 5.7, I1.9, tZ.l t Quarions f¡om rcsr rakcrs,3.20 Self-scored rss, 6.8 Srandardizcd irotrucrioro ro rat akcn, 5.5 Srmdardiæd proccdura, 5. 
ì -5.2 Tcr taking srrarcgio, Ì l.l3 Uscr 3.20, 10.6 qualifiarioro, 6.7, 13.12 Adva¡cc informarion, 8.2, 8.4, 1 1,5, 15,r0 1 r.24, r2.2 l0.tz Chearing, 8.2, 8.7, 8.10-8, I l, I l.l l, ClasiÊc¿tion, 2.14, 3.7, 3.22,4.9,4.t9, 14.7, tA.B Employmcnt, 14.7, 14.8 Of consrructcd raporoes, 3.22 Of aaminca, 4.9, 4.19 Clæification consisrcncy, 2.1 5 Clinicel and courscling scnings, I 1.20 Coaching, t.9 Coding, 3.22 Collarcr¿l inlormation, 12. I 8 Combining tsr, t2.4- t2.5 Addrsing compla diagnosa, Judfìation for inrcrprcrarion, 12.1 I 2,4 Rationalc, I 2.4 Spccial qualificatioro, I t.3 Srandud adminisr¡a¡ion ins¡¡ucioro, 3.20, 12.8, 12.12,13.10 Îmc limis, Bìas, 7,3.7.4, 7.1?', ll.l3, 12.10, 14.16, Compmbilicy, 4.r0,7.8,9.4,9.9, tO.1, 10.1 l, 13.8, t4.l t .Acrw groups, 7.8 Job contcnt Êcror, l4.l I Modifiarioro for ind¡viduåls wiÌh disbilirìo, I0.4 Multiplcluguagc vcrsions of Ì6t, 9.9 Score,4.l0, 9.4, I0.11, 13.8 Conrpura-adminisrcrd rar, 2.8, 5.5, 6-ll, 8.2-8.3, 13.18 Dæumcntarion of dcign, 13.18 Documcntarion of scoring algorirhru, l3'18 Mcthods for sco¡ing and classifring, l3-18 185 AERA-APA-NCME-OOOO1 91 II¡DEX a^^-,..-- L---J,-.i^^ r2 ro rÞ(¡¡¡6¡ rJ.¡u ç,^...;,.i^-Á^. Consrruct-ir¡clcqnr variancc, 13. l8 áf l Lcgal rcquìrcmenrs, 4. ì 9 Compurer-gencmrcd inrcrprcrations, i.11, 6.12, I Lzl, t2.t5 Pæ/Ëil,4.21 Proccdurs for cstablishing, Cur scors. 6.12 Empìricl basis, 5.ì I Limit¡rioro, 5.1 l, I 1.21, I2.15 4. I 9 Proficiency catcgorie, 4-21 Rarionalc,4.19 Relarion of rcsr pcr[ormance ro releva¡t críreria, 4.20 No¡ms, 12.15 Qualiry, l2.l 5 Rationale, II Sourca, 5. 1 t.4, 2. 1 7, 11.5, t4.7, t4.r3, r4.r 5-r4-16 Actuarial buis, 12.17 Ccrdfiotion, I4.15 Dccision malriog, 5.1 I Compurcrizcd adaprivc ras, 3.12, 4-10, 8-3 Documenration,3-12 Clæifierion, R¡rionalc,3.l2,4.10 Supporring cvidencc, J.l2 Concordanæ rablæ, 4, I 4 Condirional smndard erro¡s of measurcment, 2. I 4 Confidcncc inrerud, ?.2 Confidenrialiry prorcction, 8.2, 8.6, Conflia of intcrar, 1 1 3.7 -1 1.9, 1 3. 
13, 11.4, 13.7 Consrrucr overlap, t3.8 Desired sruclenr outcomes, lJ.9 DiaEnosis, I 1.4 Edus¡ional placcmenr, lJ-9 l2.l Graduation, l3.J Inregraring inforrotion from mulriplc tcsrs and I 12.2 Corucqucncc of misconduct, 8.2 Consequcnø oftcsr wc, 1.24 Coruisrenry of scors, 2.4 Cororrucr dscriprìon, 1.2 sourccs, t4.13 Job clæsifrerions, 14.7 Corotruc cquiva.lcnt ræs,7.2, School conrcxr, 13.13 Selccrion, I 1.4 Pss/fail,14.16 Promotion, 11.5,13.9 13.6 Consrrucr-irrcl*mr uriance, 7.2, 7.10, ì2.19, 13.18 Cororrucr overlap, 13.8 Corotruct rcprcscnurioo, 7,1 I Courucr undcrrepracnrarion, 7. I 0 Conrcnr domaìn, l-6,5.11,7.3,13.5, 14.8, 14.10, 14.14 )oD, tl.tv Limirarions, {.1 Conrcnr specìfìarions, 1.6 Conrst cffccs, 2.17, 4.15, 13.1 Conrrol(ing ítcm cxposurc, 3. Convcrgcnt widcncc, 12. l8 I Manings,4.l Copyrìghr, 8.7, I 1.8-l 1.9, Infringemcnt, 8.7 Dc¡ivcd sco¡es, 2.2, 3.22, 4,2, 4.7, 6.5 Dcscriprivc sratisrio, 2.4 Differcncc scorcs, 13.8 5 2 Convcrtcd særs,4.16 Posiblc noncquiwlcncc in r*isions, Sranda¡diæd rsrs, 13.8 4.l6 l2.l I Prorecrion, I 1.8-l 1.9, 12- I I Copvright dare, 6. l.l Crcdenríaling tsring, 9.8, L4-14-14.17 Crcdcnrial-worthy pe¡formance in an coparion, Lrvcl of pcrformance rcquircd for prsing, 14.17 Liccrourc md ccrrifietion, 14.15 Crircrion comruct domain, 14.12 Crirqrion-¡cfcrcnccd inrcrprctarion, 4, I, 4,9 Enpírical bæis, 4.9 Rarionalc, 4.9 Critcrion-¡cfcrcnccd ruting progrns, 1 -4, I 4.2 Cross-rdida¡ion s¡udis, 3. I 0 2.1 4-2.1 5, 4.4, 4. t hypochesis, 7.6 Disabilirics (rcsring individuals wirh), see'Tcsring individuals wirì disabilitio" Divcsiry, 6. I 0, 9. l-9.8, 9. l0-9. I I, I 0. I - I 0. I 2, | 1.22-l 1.23 (ndividu¡ls wirh disabilirics, lo.l-10.12, I 1.23 Línguhric, 9.1 -9.8, 9, l0-9. 
I l, I 1.2?-1 1.23 Dæmcnration, scc "Publkhc¡ maccrials/rsponsibilitics" Eduqtional toting progms, 8.10-8.13, 9.3, I1.20, r3.r-r3-r9, 15.7, 15.12-15.13 Average of summaryscora for groups, l3-19, t5.t2 Educarional placcmenr, 13.9 Graduarion, 13.5-13.6 Culrural diFcrcnca, I.l-9.1 I Curiculum s¡andatds, 13.3 t3.6, 14.t7 Differcn¡ial diagnosis, t 2.6 Abitiry ro disringuish beween multiplc groups ofconccrn,12.6 Diflerenrial itcm ftrncrioning (DlF), 7.3 Diffcrcntid predicrion r4.r4 Cut særc, Yùidìcy, 1t.4, t3.7 Defined domain, 3.1 I Dcrived score sølcs, 4. f Inrcnded inrerpretarion, 4. ì l, 4.19 -4.21, 6.5, 6. 12, Group diffcrcnces in rar scors, l3.l 5 Guiding insrrucrions, 13.13, 15.t3 186 AERA APA NCME OOOO192 INDEX Mandarcd tss, I5.7, 15.13 l0.ll, 13.5-13.6 Abscncc of biæ, 7 .3-7 .4,7 .lZ Equaìiry of tcsring ourcomc for qamincc sub- Faimss, 7.1-7.12, 8.1, 8.1l, 9.5, Promotion, 13.5-13.6, 13.9 Qualifiøtioro o[ administrarors, I3.10 Qualifierions of scorcn, t3.10 Scorc repors, 13.14 Special nccds groups,7.8,7.10'7.11 Equirable rrcarmenr o[all 13.7 13.5-11.6 idcntifierion, Srandards for mætcry 7.8,7.12,8.1, 9.5, l0.l Vdidiryofscoreinfcrcnccsærimcpsc, Ëmpirial cvidcncc, 4-20, 7.6, 14.4-14.5, 9,7 , 10.5, 13.16 10.2 Effects ofdisabilitics on rcsr pcrformancc, 1 2.1 6, 1 3.9, 15.8 qmines, 7.t-7.4, I Opponuniry ro lca¡n,7.10, 13.5-13-6 Fariguc, t0.6 Fìeld rss, 3.8-3.9 Fhgged resr scorc, g.5, l 0. l Forms, scc 'Tesr forms" l Con¡aminana and a¡cifaca, 14.5 Supponing bæis for upccting specific coms, ouc Gain scoræ, t3.17, 15.3 Rcportof tcc-hnical qualitic, 13.17, 15.3 Gcncmlizability, 2,5,2.L0,3. t l, I 2.16, 13.3 Group-lwcl informarion, 5.12, 11.24, t3.15,15.12 .Aggregaring rcsulu, 5.i 2 15.8 Employmcnt tcsting, 9.8, t 4. t-ì 4. 
13 14.8 Job analysis, l4-4, 14.6 C{æsifrøtion, Job clæsifìetion dcc-siqru, 14.7 Caurions againsr misrcprcscnnrions, 15.12 l4.l Objcctivs, Diffcrcnca, 13.t5, 14.12 Prcdicrion, l4-1,14.4 15.12 Group maru.4.8 Pcrsonncl sclcction, Group pcrformancc mcæurc, 2,20 Prcdic¡or-crirerionrclarìonships,14-2-14.6 Grouprcringprogms,l2.g Promorion, Scrcening, 14.8-14.9 Profsional supcNisor raponsibiliria, 14.8-14.9 1 Equating pmccdure, 4. I I Equaring studics, 4.lL-4.13 Archorrcrdaign,4.l3 tndividual tcting, 12.3, t2.t8-12.t9, t3.t3 Sclccion, Equatcd foms, 4.I lnfo¡med choicc, 8.3 Info¡mcd conscn¡, 8.4-8.5 Êxapriom, 8.4 Inrcgriryolrsrraula, Cha¡aaeris¡ia of anchor tcsts or linking ircms, 4.1I Clasical,4.|3 Daign, 4.1 I Inrc¡prctcrs, Starisric¡l merhods uscd, 4, I 4. l2 I 14,5 ofrsr to program goals, 15.2 2.8-2.9 Examinccsubgroups,T.l-7.4,7.6,7.10-7-12,1t.21 Expcn ,iudgmcnt, |.7,3.5-3-7,3.1 l, 3.t l, 4.19, 4.21, l4-9 Cut lntcrprctivc nate¡i¿l For loc¡l rclcæc, 5.10, Common misinrcrpre rarions, J. l0 How roræ witl bc ucd, 5. Ì 0 l5.tj lVhar scores mcan, 5.10 15.2 Examìnee pcrformancc, scc "Score inrerprerarion" Precision of scores, 5.1.0 Simplc ìanguagc, 5. t0 Error varianca, 2.5 Ethics, 12.2, 12.10 Rclcvancc oftcrscoru, 9-ll Qulifierions, 9.1I Staristiel cquivalencc of cxamiocc groups, E¡ror of meæurcmenr, 15.9 lnrcr-ircm ærrclarion, 3.3 Inrcrprcmtion of individual itcm rsporocs, LlO Inrcrprctarion Examìnee sampla, 4.1I IRT-bæed,4.13 Evaluarìon, 12.9 l4.l rcorcs,4.2l Dcmogmphic chamcrcrisrio of judge, Job ræk conrcnr, 14.9 Qualifierion of judga, 3.5-3.6 Rclmnt upcricoø ol judgc, Srædrd scaiog, 4.1 9 l.J Prore, 3.5 Purposc, 3.5 Rcsuls,3.5 Êxperr rcvicw, Exrcndcd response ircms, 3.I4 3.5-3.6 3-5-3.6 Whar rcsr covcrs, 5.10 lnrcr-rarcr agrccmcnt, 3.23 Invcstigarion of rs( raker misconducr, 8. l0-8. 
t2 Irrclryantwimcc,3.lT kem dcvclopmcnr, 3.7 kcm mjuarion,3-9 Psychomcrric propcrrië, 3.9 Suplc docipúon, 3.9 lrcm pool, 4.17, 6.4 ko rsponse drcor'' (lRT), 2.16, 5.9 Ábiliry or mir parmcrcr, 2.1 6 lrcm puametcr stimara, 2.16, 3.9 Ircm rryiry, 3.7 ftcm selcction, 3.7,1.r-3.10, 3.12 Empirical rclarionships,3,lO ftcm dif[ìoìry, 1.9 187 AERA APA NCME OOOO193 INNEV Item discrimination, J.9 Item inform¡rion, 3.9 Adminisrrarion, 5.4-5.5,12.8 Scoring, 5.9, 12.8-12.9 Motivarion ofrcsr ukcrs, 15.4 Mukidisciplinary cvalu¡rion, 10.12 Proccdures,3.l2 Subscu ofirems,3.12 Tcndenc¡ to sclecr by chance, Itcm tryours, 3.7-3.8 Itcm wcigha, 3.10 Mulrimeclia resring, 13.18 Documcnrarion ofdcsign, 13.18 Documcnrrìo¡ ofscoring algorirhms, 13-18 3.13 Bæcdoncmpirical data,3.l3 B*cd on cxpcrr judgmcnr, 3.13 Job æalysis, t4.6, 14.8, l4.t Job contenr domai¡, 14.10 Mcchodsofscoringandclæsifring, 13.18 Mukiplc-aprirudc rcsr bacterics, 13.8 Comparing scora from rcr componenrs. 13.8 l, 14.14 Mukìple-language resrs, 8-3 Muhiple-purposc rsrs, 13.2, 14.10 Knowlcdge, 14.10 Abiliris, l5.l Âppropriare rcchnical cvidence for cach purposc, 13.2, l5.l Skìlls, 14. t0 Tasþs, l¡bcls, 14.10 Nomatirc daø,6.4-6.5,13.16 8.8 Lrær srigmatizing, 8.8 l:nguagc di flerc¡ccs (taúng individuals wirh), 9. l -9. I I 11.22 , Dscriptivc scatiscìc,4.6 Parriciparion rara, 4.6 1.22 11.22 Appropriarcns oFrsa, 9.1, I L:rguage pro6cicn c'y, 9,3, 9.8, 9.10, Popularion, 4.6 Sampling proccdurcs,4.6 Bilingual,9.3 Communicarive abilitics, 9,10 Exminccs,9.3, 9.10 Mukiplc languages, 9.3 Rcquircd lwel for occuparions, 9.8 Lugc-sølc tcsring prcgms, 5.3,5.6,5.12 barniog opporruniry changc, l3^l5 Lrgally mndatcd tsting, 8.4 Licerourc md ccniÊqtion, 8.7, Ll0-8.13, 9.8, \feighring of samplc,4.6 Norm-rcfercnced inrerprerarion, 4.1,4.9,13.13, 13.16 Norm-refc¡enced raring programs, 3.4 Norms, 2.12, 3.1r, 4.2, 4.54.8, 4.15,1. i 8, 10.t, I l. 
19, 12.1,'t2.17, 12.18, t3.4, 13.8, 13.13 Croup mcans, 4.8 lndividuals wirh disabili¡ies, 10.9 14.14- 14.17 Limirarions of rsr særes, I1,2 Linguiscic abiliry, 7.7,ll-23 Linguistic characrcrisrìcs of q:ramines, Outcome oonimring, 15.5, 15.8 Basis for cxpecring ourcomc, 15.8 Ourcome of crcdenrialing Pæsllail, 14.16-14.t7 Levcl ofperlormancc rcquircd, 14-16-14-17 Períormanqe æessmcn6,3.i4 9.2 i.l-Ì2 "Scorcri' cvidencc,9.7 Linkagc,4.i5, Læal score¡s, scc Pilor rcsring, 10.3 Policy srudia, 15.2, 15.4-15.5, t5.Il-15-12 Mædatcdt6úngprcg¡ús, 13.1,15.7,15.13 Dcsaiption of wa¡s ¡ouks will bc uscd, 13.t, 15.7,15.13 Negarive conscqucnco, 13.1, 15.7, t3.6 Marrixsampling,2.20,5.l2, 15.6 Møurement crror, 13.8, 13-14 1.21 va¡iablcs,7.6 Modìfia¡ioru, scc "Tsr modiFtadoro" Moniroting, 5.4-5.r, 5.9, 12.8-12.9 Mcra-analysis, 1.20, '' Modcraror 'r ras, t4.t6 I.l-9.3, 9.t-9.6, 11.22 Linguistic subgroups, Mærcry of skills, lÃc:J, 4.7, 13.4 Prccision,4.6 Knowledge and skills neccsaqy, 14.14 Purposc of program, 14.14 Logisl Norming poputation, 6.4 Ycus of dara colfccrion, 6.4, 13.t6 Norming studics, 4.6 Dares of raring, 4.6 Rclcæc of rcsr rcsuls, 15. I l-15.12 Suirabiìiryof tsr, I5.2 Iblicy makers,7.9, l5.lì EduarioD¡j,7.g 15-13 Public, 7.9 Social,?.9 Popularions, 1.2, 1.r,3.6,3.8,4.5-4.7,6.4,7.),7.3, I t.l, I1.16, 11.74, 12.3, 12.8, 12.16, 13.4, 13.8, 13.15, 15.5-15.6 Background ofrest taker, 12.3 Ccnsu.r¡pc resring p¡oBrâms, I 1.24 Chracrerisria of tcsr rakcr, 12.3 88 AE RA-APA-NCM E-O OOO 1 94 I¡¡DEX Cu)¡u¡el diffcrcncs, I 3. I 5 Amcnding, rcvising, or rvithdrawing tcsr, 3.25. Dscriprions, 2.20, 15.6 Gradua.l changc in dcmographic t 6.13 charucrcristic, Ll6 15.5 diffc¡encs,7.1,7.3,13.f5 Pracricc cffæs, 1.9 Precision ofscorcs, 2.4 Prcdiction, 14.1, 14.4, 14.6-14.7 ,Al¡senrceism, I4.4 Job bchavior. 
l4.l Job-relevanr rrainin& l4-4 Job succas, 14.7 Turnovcr, 14.4 Vork bchavior, {{.4 Vorkourpur,I4.4 P¡cdictor consrruct domain, 14.t2 Predictor-critcrion relationships, 14.2-14-6 Groundcd in rseuch, 14.2 Prctcsr/posrtcc scor6, I 3. 17, I 5.3 Changc scora, 13.17, l5-3 Gain scorcs, 13.17, l5-3 Privacy prorccrion, I l.l4 Procedural pçotcctions, 8.12-8.t3 P¡octors, I t.l I Rcprccnrarivcns, 1.5,12.16, 13.4, Conscnc Cor¡ecred s@rc rcporr, 5.14 Critcria for scoring, 3.20 Dirccrions for adminisrmrion, 3.19 Dirccrions ro rsr ralicm,3.3,8.l Documcnra¡ion of proædurcs rst, Gcncnl informarion, 6. I 5 Idcntifierion ol¡clarcd cou¡s or orriculum, 6.6 Information ro policy makcß, 7.9, ll,l8 lrorruaioru for using raring scalcs, 3.22 I, l2.l l2.l l2.f Program øaluarion,2.18, Eliminare pmcticcs dcigncd to raisc tsr scors, 5.9 Iorcrpreration and rcisse of rqulrs, I 5. 13 oftar ¡o p¡ogtem goaJs, t5.2 Programgoals, 15.2 Suilabiliry lrogram moniroring,2.16 Promorion, 14.8-14.9 14.8-14.9 12-6-12.7 Diagnosric sensitiviry and spcciÊciry, Individual rsring, I2.3 Inrcrprcrive rcmarlc, 12. 13 Potcnrial infc¡cne dccribcd 12.13 12.5 ll.l, as hyporhoc, ll.l3, lz.4 proccdura,5.l ll.3-11.4, lt.7-11.9, Âdminis¡¡arion 3.3 Qu:ìifìcarions to adninisrer and scorc rar, 6.7 R¡rionalc, ll.4 R¡rionalc for modificarions, 10.4 Rccommcndarions and aurions regarding modi- Rcnorming wich suflìcicn< frcqucncy.4.18 Rscuch ro avoid biæ,7.J Rwisioro and implicarions on rst scorc inrc¡- prcr¡rion, 3.26,6.13 Using rcss in combinaúon, lZ.4-12.5 Publishc¡matc¡ials/raponsibilirics, l-l-1.3,2.11-2.12, 1.1-3.5, 3.9-3.r3, 3.15, 3.19-3.27 ,4.1-4.6, 4. 
ì I, 4.14-4.16,4.18-4.19, 5.1, 5.10, 5.14,6.1-6.15,7.37.4,7.9-7.10,8.1-8.2, 9.4, 9.6-9.7, I0.1-10.5, 10.7- l0-8, Modificd Fo¡ms, I0.8 Normíng studis, 4.6, 6.4 No¡ms,4.2,4.5 Pracric or samplc qucsrions or (õ6, 3.20, 8.1 Proccdurs for rest adminisrracion and scoring, fisrions, 10.4 tuliabilirydara,2.rr-2.12,6.5 Psychologial rcring, l2.l-12.20 Complcx diagnoss, 12.5 Diagnosis, lnsrruciore ro tesr rakers,3.20 Interpreurion ofscorc, I'.9, l.l2 lnrcrprcrive marerial, 5.10,6,8,6.10 Linguisric modifierioro,9.4 I2.l 2.20, l5.t-15.13 Employmenr, ro modiÇ 3.24 Forcign language rnnslarion or adaphrion proc¿durs, 6.4 Supcruiscd training, I ucd 10.5 Dæumenrarion withour compromising sccuriry, 3,12, Iì.18 Expccred levcl oÉscorcr agrccmcnr and acilracy, Crcdcnrialing, Expcricnce, forms,6.l5 Copyrighr date, ó.14 13.12-13.13 Eduqtional, speakers, 9.6 Cæc srudio, 6.10 Caurions againsr misuscs, 6.3, I t.7, I1.8 Compuccr-gcncrated interprcrarions,6.t2 Subgroup P¡ofssìonal compcrcncc¡ 12.1,12.5, ¡2.8, ì2.10-12.1 Àppliabiliry ofresr to non-native Samplc marcrial, 3.20 Scorc rcporr, l.l0 Scoring uireria, 3-22 Scoring præcduro, 5.1 Sccuriry tl.8-11.9 Scroiriviry reicws, 7.4 Sutcmcnc rcgarding rcerch-w-only cats, 3.27 S¡arisdal dacriprions and ana.lyscs Suggcsrionsrouscrß$incombination, 12.4 Summuics of citcd studics,6.9 189 AERA APA NCME OOOO195 !NOEX Supplenreoi:.ì nrrreriaÌ, 6. 
i Tþchnic¡l docurnen¡adon, 4.2, Tcchniel manual,6.l, rrunsraüons or a 4.6,4.19 10.5 Rcliabiliry coefiìcicnts, 2.5-2.6,2.t1-2.t2 Tcst bullctin (advancc informarion), Tèsr directions, 8.2 Aftcrnare-form cocfficicns, 2.5 Inrernal consisrency cocffìcicnts,2.í 3.15 Tcr manr¡al, l-10, 3.1, 10.4-10.5, ll.3 4.16,6-t-6.2,6.4,9.4, Tcrtakingstratcgics, ll.l3 score¡s, 3.23-3.24 Resca¡ch Trrrslaricn infcrmarion,9.T Uscri guidc, 6.1 Validig' informarion, 6.5 uc only rors, Respcnsc formar, 3.27 2.8,3.G,3.14,3.22,4.11,5.t,t.5, I 1.13,12.12 Purposeofrsc,3.2,).6,8.1,11.1,11-2,ìi.i,tl.l6, 11.24, 13.2-13.3,13.7,13.12, 14.14 14.5 1.1,6.3,9.4 Rarv scoro, 4.4, 6-5 Rangcrouieion, Consrucrcd, 2.8,3.22, 4.21 Exrendcd-raponsc,3.l4 Unstrucrurcd, 12.12 Rcstric¡ion ofrangc or variabiliry, 1.18, 2.6 Rccenrion policy,5.¡5.5,16,8.6, t1.5, 15.10 Confideaiialiry 8.6 Data rransmision sccuriry, 8,6 Purionale, 4.4 Pro¡ecrion from impropcr disclosu¡e, 8.6 Valid uc of informarion, 5.16, 15.t0 Rccesropportuniry, ll.l2, 12.10, 13.6 4.4 Mcaninç,4.4 Raàingabi|iry,7.7 Limimtioro, Righrof rarraher,8.10-8.13, ll.l0-11.12, Rclationship bcvecn rat sco¡a, 13.8- l 3.9, I 3.12 fuleæcofsummaryrc¡raulrropublic t5.l Rsrricion of rmgcorwiabiliryadjvsmcnt,Z.(t Tsr-¡cresr or srabiliry cocfficicnr, 2-5 Rcplicabiliry, t2-12 Taining matcrials for (ntcnded inrcrprcrarions, (6r, y./ Virhin-cxaminee consisrency, 2.10 ll.l7-ll.l8, I Policy for rimely rclæc, I L 12.20, 13.6 Appeal and rcprcen rarion by counsel, I I .l I Rcrestopporruniy, 11.12, 13.6 Rubric, sce "Scoring rubric" l7 Provision of supplcmcnral cxplanarions, I l. 
I 8, l5.ll Smplc rcpracnatimøs, 3.8 Smplingprocedures,2.4,3.8,3.l0, 14.6, 15.6 Rcliabiliry Z.l-2.20, 3.3,3.19,3.23, 5.12,9..1,9.7,9.9, Scale dcvelopmcnr proccdure, 6.4 ' It.l-11.2, 11.19, 12.13,13.8, 13.12, 14.15, I5.6 Scalc *abitiry,4.17 Alternrte-form reliabiliry admarc, 2-9 Ovcr címc,4-17 Anal¡ss for særes produccd under major wriaScalcs, 4,2 tions,Z.l8 Data for major popularioro, 2.1I Data for scpararc grades ud agc groups, Data forsubpopularioro,2.lt Scaling, 3.22 Scorc comparabilir¡, 4.10,9.4, 2.12 14.15 '13.8 DiFcrencc scores, Error wiancc c¡imaro, 2-10 Btimats, 2.1, 2.9 Generalizabilir,v cocÊìcicnr, 2.5 Inrer-mtcr consisrcnry, 2.10 laguage diffcrcncc,9.l Local rcliabiliry dam, 2. 12 Dccirion rcliabilìry, tesr, 2.17 2.8-2.9 Rcliabi.liryarimarìonproædura,2.7 Rcportcdforlaclofaggrcgation-1,12 Sampling procdurcs, 15.6 Scorcr,3.23 Sou¡cc of mæurcmcnr c¡¡or, 2.10 Spccdcdncs, sec "Rarc ofworli Systcmaric vuiancc, 2.8 Tfstcompambiliry,9.9 Tcsr.rcrsr rcliabilicy cstímare , 2.9 lang and shorr vcmioro ofa Rarc of mrk, l0.ll, 13.4 Scorc convcrsioro,4,l4 Limirations,4,l4 Sco¡c diFercnces,2.3 Scorc equivalcncc,4.t0-4.1 I Dirccr cvidcnce, 4.10 Equating procedurs, 4.1t Intcndcd uses, 4. l0 Scorc intc6riry 5,6 I.l-1.2, 1.9, 1.12, 1.21,2.11,3.4, 3.14, 3.16, 3.18, 3.75-3.26, 4-l , 4.3-4.4, 4.6-4.7, 4.10, 4.16, 4.t8-4.20,5.1, 5.10-5.1 l, 5.14,6.3,6.5, 6.7-6.8, 6.10.6.12, 7.1.7.r, 7 .8, B-7, 8.9, 9.2, 9.5- Scorc inrcrprerarion, 9-7,9.9, 10.4-10.5,'l0.7,IO.9,lO.ll, 11.1,11.3, ll,5-ll.6,ll.l5,l1,i7-11,18,11.?0,11,2?,12.9, lZ.l), 12.19,13.3, 13.7.1).r, 13.12-13.15, 14.13, 14.16,15.11-15.13 Absolu¡e, 3-4 Aflecred by ¡evisions,3.26,4.16 Alrcrmrc cxplanarions for rðr rakcr! pcrfom- a¡ce,7.5,11.20, 17.19,13.7 Cæe scudiæ, ó.10 190 AËRA APA NCME 0000196 INDEX Compurer-gcnerated interprcratioro, 5.t t, 6.I2 Rcqucr for rryicw or ¡evision of scores, 8.13 Rcrcntion ofindividual dara, 5.15,8.6, t5.10 Vaivcr of acæs, 8.9 Contcxrual informarion, 13.t5, 15. 
t2 Cut scor*, 1.19-4.20, 6.5 Differe nce rcorcs, 13.8 Scorc E6ecs ofmodificarioro for individuals wirh disabilirics, 10.7 Flaggcd scors, 9.5, l0.l I lnfcrcae wirhin subpopularions, 2.11, 7.3-7.4 Intcrprcrive marcrial lor loæl rclcæc, 5.10, l r.l7-t Ll8, 13.r2-r3.14, t5.l I I¡cm læcl informarion, 6.5 Linguistielìy diversc uminca, 9.2,9.6, 11.22 Matcrial eror requirc c¡nmcd src repon, 5.14 Modifi arioro for indi"idu¡ls wiûr .l ¡abiliria, I 0.4 Norms, 4.6, 10.9 Porcnria.l misinrcrprcrariotrs, I Ll5, 13.11-13.15, 5.12 Rclarivc, 3.4 Scorc cquivaìencc, 4, I 0 Scoro obnincd undcr alrernatc ondirions, 6. I Sclf-scored rss, 6.8 Shorr form, 3.1ó 4.1 -4.4, 4 -9 Agc-equivalcnt scorcs, 4. I Crirerion-¡efc¡enccd inrcrprctation, Derivcd sco¡s, 4.1, 4.4, 4.9 Forøarning of porent¡al 4 -l -4.7, 4.9 specifi c Disinrcrprer¡- rions, 4.3 Gradc-cquivalenr s@rcs, 4. I Norm-rcfcrcnccd inrcrprcurion, Á.1 -4.2, 4.9 Pcrccnrile trls, 4. t Raw scora, 4-1,4.4,4.9 Studard scorc sc¿les, 4.1 Scorcn, 2.12, 3.22-3.24, 5.9, 6.7, 12.8, I3.10 Accuracy, 3.24, 13.10 Agrcemcnr, 3.24 Fccdback, 5.9 1 Spccial sc¡la, I qualifiøtioro, I t.3 LÐc:.],2.r?, 3.22, 3.24 Moniroring, 5.9 Qualifierions, 3.23, 6.7, 13.10 tulìabiliry 3.23 Rcrraining or dismising, 5.9 Specd componcnr appropriarcncs, 3.18 Scorcr judgnenr, Subgrorrp diffc¡cncæ, 7.1, Sclccting, 3.23 7 -B 3.2{, 5.9 lmslarcd rcsts, 9.7 Valid infercnø for samincc Tiainìng, 3.23, 12.8, 13.10 Scors, rypcs \flcightcd scoring, 14.16 Composirc scorcs, 1.12, 2.1, 2.7, 14.16 Subscorc, 1.12, 2.1 Særing crircria, 3.14, 5.9, 8.2, 12. I I subgroups, 7.2 VaJidiry jcopudizcd by dcparurc from smdud p¡occdurs, 5. I Score rcporting, 2.17 , 5.13-5.16, 6.12,7.8,8.4-8.6, 8.8- 8.r 1, 8.r3, 9.4-9.5, I r.6, I t.12, I t.t4, I l.l7I t.t 8, 12.9, t 2. 
t 5, t2.19 - 12.30, t3.t6- 13.t7, r3.r9, 15.3, 15.10-rt.tr Agc of norms ucd for rcporring, t3.16 Aronymiry for raca¡chcrs, 8.5 Canccllarion or wirhdrawal of scoru, 8.1 I Catcgorical decisiom, 8.8 Changc scora, 13-17, 15.3 Computcr-gcncmrcd inrerprcrarions, 6. 12, 12.15 Conditions for disclosure, I t.l4 Confi denriaìiry, 5. 13. 8.4-8.t, 8.9 Co¡rectcd score reporr, J.I4 Dare of rar adminisrrarion, 13.16 Dela¡r bccausc oIposiblc irrcgulariries, 8.Ì0 Dæcriprion ud ualysis of alrc¡narc hyporhææ or cxplanations, 12. l9 Exm ¡cnks, I1.12 Flaggcd resr scorc, 9.5 Format appropriarc for recipicnr, 11.6, 12.9, 12.20, r3.r4, t3.19, r5.n Gain scorc, 13.17,15.3 Invalidarion of scorc, 8.13 Linguisrielly mod¡lìcd 16$, 9.4 Public rcporrìng fot gtoups,7.8, I ì. I 7- I l.l 8, r3.19, l 5.1 r Scoring crros, 5.8, I l-10 Særing præcdurc, 3,14, 5.1-5.2, 5.8-5.9 Særing rubrics, 3.23-3.24, 5.9 Scoring scruicc, 5.8, 6.12 Søccning, 1 1.5, t3.7, l4.t Sacening in, I 4. I Scccning our, l4.l Scleaion, 2.14, 9.8, l4-8-14.9, l4.l l-14.t2 Employcc, 14.8-14.9, l4.l t-14.12 Selcction tsc, t3.8 Comparing scora, 13.8 Sclf-scorcd rsc, 6.8 t 3. I 7, I 5.3 2.19 Vdiabiliry duc to mcæurcmcnt crror, 2.19 Vuiabilicy duc to umpling, 2 19 Srenda¡d crrors ofabiliry scora, 2.t6 Stmdard crrors olcquating functioro, 4.1I Sc¿ndud crrors of mc6urcmcnr, 2.1 -?.3, 2.5, 2. I l -2. 12, Srandard crror of Smndard cmr of rìc diffcrcncc scorc, I 3.8, rhc group 2.t4,6.5,13.8, ma, 14.15 Condirion¡1,2'2 Ove¡dl, 2.2 Rcpcarcd-mcæu¡cmcnrs approach, 2. I 5 Smdud scrriog, 4.194.20 Srndudiarion, 3.20 Snndard¡ for mærcry 13.5 191 AERA_APA_N CM E-OOOO 1 97 ñav iltuil t Àt Strucrural equarion modding, L1.17, Srudenr ourcomcs, 13.9 15.3 Shon form, 3.16 Tèst framewo¡k, 3.2 Tst info¡marion funcrions, Tàrgct doruin, 13.3 Tar batrcris, 2- II Tat inrerprcratioo, 2.2-21'7.12, l2-l-12.5, L2 14-12.16' 12.18 12.19-12.20,13.4, 13.12-13.13,15.4 conrcnc, 3.6,7.3-7.4, 8.1 Tcsr design, 3. 
I 5, 7.3 Tcsr dcvelopcr raponsibiliria, sæ Tsr Observed, 2.3 Tcst items, 3.6 "Publishcr Conrcnr qualiry, 3.6 Sensirivicy to gcnder and cultural isues,3.6 marerials/raporoibiliiics" Tcst modifìcations, 2.18,1.26' 5.1-i.3' 8.1' 9 4-9.5' Tèst dcvclopmcnr, 3'l-1.27,4,19,6.4,7'4,7 7,7.10, 9.11,l0.l-10.8, l0.ll, 11.23 9.G9.7,9.9,l0.l-10.7, l4.l Acconrmodations for individuals rvirh disabili¡is, Accommodadons for individuals rvirh disabilil0.l I, I 1.23 tics, l0.l Appropriarc for individual rst Þker' 10.10 Comparabiliry ofmultiplc-language vcrsions, Documentarion, 5-2 9.9 Documentation ol procedurcs ued to modiÇ Cut scora, 4.19 resr, 10.5 Dcfìnirion ofdomain,3.2 Effectson rcsultingscores, 10.7 Dcfinitionofobjecrivc, l4.l Flagged scoræ, 9.5, t0.l I Doomcnurion of proadura ucd ro modìfr tat, lndividu¡ls u'ith disabilitic, 10.2-10.3 lO.i Intcrprcters, 9 l'l Êffcca ofdisabilitia on tct performancc, 10.2 Linguisric modifisrions, 9.4-9.5,lI.Zi Effccr of modifiøcions [o¡ individuals wirh disPilor rsring for appropriarcns md føibiliry, 10.3 abilicis, 10.7 Psychomctric expertisc, Ì0.2 EmpirieJ procedurc ro srablish time limirs for Requcsring and rccciving accommodations, 5.3, modìfìed fo¡ms, 10.6 8.3, t0.l-10.2, 10.8 kcm sclccrion, 3.6 Scorc comparabiliry, 10.4 Linguistic or reading lcvcl, 7.7 iìmc limi¿s, 10.6 Linguisticaliy divcae subgroups, 9.6 Test purposc, scc "Purposc of tcst" Pilor rscing ofmodifierom for individuals wirh Tesr revisioro, 3.25-3.26,4.16 disabiliúc. 10.3 Tcr score inrerprcrarion, see "Scorc intcrprcration" Rarionale for modifìqrioro, 10.4 Test sccuriry, 5.6'5.7,lL7' l2'll' l3.ll Raponsc formar,3.6 Test selcction, 7.9,7.11, 10 8, 12.2-12-3, 12.5' 12.6, Selc dcvclopment proccdures,6.4 12.11, 13-12 Særing præeduro, 3.6 Addrsing complcx diagnose, 12.5 Scnsitivc or off<nsivc conrcnr, 7.4 Biæes, 12.2 Tesr adminisrration proccdura, 3.6 Culturc, I 2 3 Taring outcoms for cxamincc subgroups, 7. 
I 0 Diffe¡cnúal àiagnosis, l'2.6 Tianslations from onc languaç ro a¡orher,9.7 Languagc and physical requirements, I 2.3 Tat diftculcy, 3.3 Modificd forms, 10.8 Tst dircctions, 3.15 Norms, l?.3 Tsr [orm, 3.16, 4.i0-4.1i. 6.5,7.2,8-3,9.4,9.9,l0.lRariondc, lz.t3 t0.8,10.10-10-ll,13.6,13.17-13.18,14.17 Test uscr qualifiøtions, I 2-5, I 3. I 2 Adaptcd vcrsion in sccondary languge, 9.4 Validiryforpopularionof tattakcr, 12.3 Alre¡na¡cforms,4.1l,7-2,8.3, 14.t7 Vcsrcd intcrst, 12.2 Compurcr administcrcd, 13.18 Tcstscttings,l2.S' l3.ll Equarcdforms,4.tl,4.l3,6.5,l4.l7 Tcscspccifications,3.2-3.5,3-7,3-11,3.14-3-17,4.16, Intcrchangabiliry,4.l0 6.4,7., Mxing md distributìog for equadng sru diæ, 4.12 Changa from onc vcrsion ro subscquenr vcrsion, Modifiørioro For individua.ls with disabilirie, 4.t6 r0.t-10.8, 10.t0-10.t I Characcerisria, 7.9 Mulcimcdia, 13.18 Coroequencæ, 7.) MultiplcJanguage vesions, 8-3, 9.9 Dcûnirion of content ol rest, 3.3 Mulrlplc vcriom Êom reermngcmcnt of ircms, DeÍìnirio¡ of domain, 1.14, 1.17 4.15 Dcvclopmcnt p roccss, 3.3 Score equivalence, 4. 0-4. t I I 192 AERA APA NCME OOOO198 INOEX Di¡cctioro ro tor uhcrs, 3.3 lnlormation ro policy makcrs, Challcngc, I I. I I 7.9 frcm and rcion ar¡angcmenr,3.3 kcm formas,3.3 P¡æcdurcs lor Tesúng prograru, 2.18,2.70,3.1,4.17, Ll0-8.13, 9.1, ll.l2, t1.20, t3.l-I3.19, 15.1, 15.13 ta¡ adminismtion md rcring, Proposcd numbcr of itcms, 3.3 Psychomcrric propcr¡ia ofitems, Rerionalc, 3.3 Short Tcring policy, 8.2 3.3 Thcorcriol foundations of tcst, l2 l8 Timc limic for rsa, 3. 18, 8-2, 10-6 Exrcnsions for modified forms, 10.6 3.3 Tianslarions ofa rsq 9.7 form,3.l6 Tating rimc, Unstuaucd ruporoe 6omar, I 2.12 3,3 Tarrakerswithdisabilities,scc"Tstingindividualswirh disabiliris" Uscof ta¡scors, l.l, 1.2, 1.3, 1.4,7-10-7-ll,8.2, 11.2, r3.l, 13.9, 15.7 Cautions abouc unsupporrcd inrcrprctations. 
1.3 t4 Dæision making lor eduational placement, 13.9 12.14 Evidence toiusú$'ncw use, 1.4, Il.2 Motivation, 12.t4 Mqn tesr sco¡c differencs bcuccn relcuant Rappom, I 2.14 subgroups, 7.10-7.1 I Rcsporoa, 12.14 Uscrrspomibiliriu,l.l,l.4,3.24,4.5,4.7-4.8,5.2, Tottakingsrratcgio,S.2, ll.l3, 15.7,15.9 Ncgarivc impact in manda¡cd taring programs, 5.7, 5.10, 7.10, 8.7, 9. 0, 0. 1, 11.l-11.24, 12.1 , t2.4-r2.5,12.8-r2.9, r2.tt-12.t2, r3.1, r3.3, 15.7,15.9 I3.10-l3.ll, 13.t9,t5.7,15.1t-15.12 Tes¡uc, 1.19, r.71,t.23,6.9,6.t5,7.9-7.tl,9.5-9.6, Adequatctniningofsuperviscdtsradminis¡ra 10.5, 10.8, t0.tI,I1.2-lt.3, 14-4-14.5, 14.7,14.9, tors and sco¡erc, 12.8, 13.10 15.10-15.1 I Amrcncssof legal corotnins, ll.t, l2.ll Conscquenca,T.9 C¡nsidcration ofcollarcml informarion for rest Employmcnt sclccdon or promotion, 14.9 inrcçrctation, I 1.20 Flaggcd scores,9.5, l0.l t Evaluation of computer-gcncratcd inrerprcra Job clæifiarion decisions, 14.7 cions, I1.21 Jutifierion for tcúng progrm, I.23, 15.10. Formulatc poliry for rcleuc ofaggregated data, l5.l I I 1.17, 13.19 Linguirically divuse subgroups, 9.5-9.6 Srudia, 6.9, 14.+11.5 Gcncral language profìcicncy of æmincc, 9.10, 11.22 Tar ue rarionale, 1.8, t.ll, 12.13 ldcnriÇ individuals nceding spccial accommoda Tcsr ucr rcporoibilirie, sce "Uscr raponsibilitia" tions, ì 1.23 Tcring cnvironmc¡t, 5.4, 12.12 lnformcd abou¡ purposs and adminisrrarion of Oprima.l, 12.i2 Rcalisric,I2.t2 tesr,ll.5 Insrruc¡ioro ro individuals who inrerprer tcst Taring lor diagnosis, 12.6-12.7 scorc,l2.9,13.l0 Taringindividualsrvirhdisabilicis,l0.L-l0,l2,ll.Z3 lnrcrprctirc marerial for losl releæe, 5.10. Avoidìng consrruct irrclcvant variancc, l0.l I Lt7-ll.l8, 13.19, l5.l I Diagnostic purposo, 10.12 Flagged tesr score, l0.l I Jurification for ue oftest, I 1.4 Minimizc or avoid misintcrprcrarions ofscores, Funcrionìng relarivc to gcncra.l population, 10.9, Tsrraking behavioa t2. 
Fatigue, 1 11.23 r Funcrioning rclarivc ro individuals wirh samc lcvel ofdisabiliry, 10.9 lntcrycntion purposa, 10,12 Mainnining all fcæiblc scenda¡diæd ferura, 10.10 Modif¡ætions adoprcd, 10.10 Mukiplc sources of info¡marion rcquircd, Nor rclc indiaor of tar ekcr's finaioning, Normativc dar¡, 10 9 Racarch of cffcæ oldisabilitic on rar formancc, 10.2 Taringirrcgularitia, S.l0-8.12, ll.ll 10.12 10-12 pcr 1 r.r5, r5.u Monitor impacr of mandarcd tæting progmms, I3.I, 15.7 Monitor scoring acoracy, l l l0 Obrain cvidcncc of rcliabiliry and validiry for nw purposc, I 1.2 Prevcnt negarivc conscqucnco, I l.l5 PrcFesiona.l compctencc, l2 l, 12.5 Profcsional judgmcnt, I l.l Prorect privacy ofexaminccs and insrirutions' t l. 14 Protccr sccuriry of ¡crs' 5 7' 8'7' ll 7-11'9' 12,t1, l3.ll 193 AERA APA NCME OOOO199 rru0åq Rz¡inn¡lc fnr rh:no¡ ;n r¡.. l^.-". ^. ..1-i-;.- (ration, 'l l .19 Rarionale for inrcnded usa, i L4- I I .i Rcview cvidence for uing rars in combinarion, 12.4 Scorc rcponing, I 1.6 Study and evaluarc marerials, ì l.l Tocrakingsratcgies, ll.l3 Uscr qualificarioru, I t.3 Usa with groups notspeciÊcd by dæclopee 7.t0 Verì! appropriarencs of inrcrprerations, ! I . 16, I5.l l-ì5.12 r-^-.,,,,". 14.8- l4.l I V¿lidarion, crirerion-rcfared cviclencc, LIJ-f.21, 12.17, t4.3 Asumprions, l6 Empirical cvidcncc, ì.8 Evidcnce bærd on rcsponsc processcs, l.B lnrcrnal consisrcncy cvidencc, l.l I lnrerrelarionshipsofscorcs. hnguagc differenca, l.ll, l.lZ 9.1 Linguisric subgroup validìcy oidcnce, ,ct.?, 11.27 Modifiørjoro for rst rakcr widr disbilitis, t0.4 Mulriplc prcdicrors. 
13.7, t4.t3, l5.l uftiple-purpose Ola rss, 13.2 diagnosis, 12.6-l?.7 Placemcnr or promor¡on dccisions, 13.9 Prolìle inrcrprcrarion, l.l2 Scorc intcrprenrion recionalc, 15 Crirerion pcrformance, l.l5 Criterion rclcvancc, l.16 Subgroups, 7.1-7-2 Subscorc inrcrprerarion, l.l9 Tct l.2l 1.21 t.l7 Predicrion, 1.17,14.3 Predicrivc srud¡ l' l5 Smrisriel anal¡is, l.(Z-1.t8 Muftiplc prcdictors, sccuriry, 8.7, Tcsr Judgmensrgaldingmcúrodologial choiq, l.l2 Tor comparabiliry 9.9 1.20 Mcra-aaì¡ic *idcnce, I.20- l.B, l.t I Scors from combi¡arion oî ræcs, 12.4-12,5 DìFercnrial prcdiccion îor groups, Ethiel md lcgal consrrainrc, l.l9 T^-L-:..l ¡cull¡¡g¡ r ¡¡ Rcporred for levcl ofaggrcgarion, 5.12 l.2l Concurrcnr stud¡ l. Gcncraliation, ,..,r-^-- Effeccs of rimc pæagc, t 3. N,f Validation, contcnt-rclated widencc, 1.6-1.7, ;.,.t^!,.^i Convcrgcnr widcnce, l. 14 Discriminanr evidence, l. I 4 uc ¡¡rionalc, l3.l I . I II Toring individuals wirh disabilirics, t0.l Thcorerical evidcnce, 1.8 Tnnslarions ofa rs¡,9.7 Usefulnes of modified 166; 10.7 Validiry gcneralizarion, 1.20 Varcd inrc¡csr, 12.2 r--:L:t:-, ¡kà¡uu¡(,, rr 2 ¡:.J Tsr<riterion rclationships, 1.16, Use oFtarscoro, l.16 Validarion, gcneral issues, l. I -t.6, i.20 Waiver of accas, 8.9 Wcighrcd scoring, 14.16 t. I 3- l. l 4, !.22-t.24, l4. t Conscrucr-irrelcvant componcns, 1.24 Conscrucr undcrrcprcsenmtion, 1.24 Dara collccrion condirioro, l.t3 Evjdcncc fo¡ expæred ou¡come, l 22 Group diffcrcnca, t.24 lndircct bcnefit rationalc, l.2J lnrcrprcration of r6t sco.s, I,24 Objccrivc for cmploymcnr rot, l4.l Sratisriql analysis, Ll3 Tating condirions, l.l3 Validation proccdurcs, 1.6 Valid¡rion smplc, 1.5 Validiry I. 1 -1.24, 3.r9, 3.25, 5.12, 6.t2. 7.1 -7.2. 8.7, 8.r r,9.r-9.2, 9.7,r,9, to.t, t0.4-t0.5, t0.7, tt.tr 1.2, I l.t 9, t t.22, t2.3-r2.6,'t2.13, | 3.2, t3.7, r3.9, l3.n-13.r2,13.r6, r3.18, r4.¡3, r5.l Chmges likc.ly from modifierions fo¡ individuals wirh disabiliria, 10.5 Compurcr-adminisrcccd rss, 13. 
18 Computer-generated interpretations, 6.12 194 AERA_APA_NCME_0000200 ISBN-13: 978-0-935302-25-7 AERA_APA_NCME_0000201


