In the first season of The Wire, an American crime drama television series, a young girl who lives in a poor and crime-ridden neighborhood asks Wallace, a teenaged drug dealer, for help with a math problem. It's a word problem that has multiple passengers getting on and off a bus and that asks how many passengers are on the bus at the end of it. The girl is lost. Wallace reframes the problem for her, describing a situation in which different buyers and sellers of crack cocaine take and give her different numbers of vials. When she answers correctly, Wallace asks her why she can't do the same problem when it's in her math book. She explains that if she gets the vial count wrong, the drug dealers will hurt her, so she must get it right.
What does this have to do with our work? A lot. Imagine you're working in a region with very low school enrolment, and you're testing the impact of a program designed to bring more children into school. Say you want to measure how much children learn once they benefit from the program compared to children who don't have the program where they live. If your measurement tool has math problems on it and they're presented in a format or with situations that would be unfamiliar to out-of-school children, then those children have less of a chance of answering the questions correctly, even if they're perfectly comfortable with the math concepts behind the questions. For example, a child in school may be able to answer a question like: 100 - (5 x 6) - 45 = ?
But a child who has not gone to school may only be comfortable with this when it is framed in the context of an errand to the market, where she’s asked to bring back the correct change after her mother gives her 100 rupees and asks for 6 sachets of laundry soap at 5 rupees each and a kilogram of sugar for 45 rupees. If you give both types of children problems framed the way they would be in school settings, then it's as though you are giving different tests to the two groups of children and trying to compare their scores. It's not a very useful comparison.
Psychometrics is the branch of psychology focused on the design, administration, and interpretation of quantitative assessments. These assessments could include student tests, measures of social-emotional skills, and ratings of the quality of teaching practices.
This summer, Andrew Ho, a psychometrician at Harvard's Graduate School of Education, delivered a pair of lectures at the World Bank on the basics of psychometrics. We strongly recommend you experience these lectures for yourself. To pique your interest, in this blog post we highlight three main areas where psychometrics can and should inform the work we do at the Bank.
First, psychometrics helps you use the word “validate” correctly when you’re talking about measurement tools and the scores they generate. People often claim that they are working with a “validated measure” or a “valid and reliable” measure, but this usage doesn’t quite cut it. We validate not a test but a score, and not really a score but an interpretation or use of a score for a specific purpose. In fact, the point of measurement is to create one or more scores for a use or interpretation, and if we don’t specify what these scores are and how we hope to use them, we cannot proceed with validation. You see, validation requires you to present five distinct types of evidence:
1. Content: There is evidence that what is being measured is aligned to content standards or to the theoretical evidence on the construct. For example, exam questions should measure what is specified in content standards and what is taught by the curriculum. Questionnaire items on grit should align with the theoretical underpinning of grit. Reports should include example items, if not the full instrument. Authors should make the case that these items are measuring the intended content or construct.
2. Cognition: There is evidence that your prompts and items are interpreted correctly by respondents. Some test items have multiple solution paths, and some of those paths may be unintended. A test claiming to measure geometric proficiency should not have items that can be solved using algebra. Developers can gather evidence about cognition by talking to examinees and having them think aloud as they answer items. Other inconsistencies can be more obvious; for example, if you’re trying to ascertain whether children in rural Nepal have a pincer grasp, you probably should not ask parents if their little ones can pick up a Cheerio.
3. Coherence: The components of the measure relate to each other as expected, resulting in sufficiently precise (reliable) scores. This means, for example, that parents who respond that their child can read paragraphs should also respond that their child can read sentences. If not, then the score will depend on which prompts happen to be asked, and this is measurement error. Other kinds of measurement error manifest over occasions: when you go back and ask the parent the same questions two weeks later, you hope to get the same responses.
4. Correlation: The score predicts the outcomes it should or correlates well with other scores that are predictive of outcomes. If, for example, you’re constructing a measure of the quality of teaching practices, then the resulting teacher scores should predict student learning outcomes.
5. Consequence: The use of the scores achieves its intended purpose. If scores, for example, are being used to separate children under the age of 5 into those who are developmentally on-track and those who are not, with the goal of targeting lower-scoring children for remediation, then there should be an appreciable gain for children right below the cutoff in longer-term evaluations, compared to those just above the cutoff.
Second, psychometrics can also offer empirically guided methods for shortening or adapting a tool for a specific context. Specifically, we can calculate the amount of information (or precision) for each question or for the instrument and get rid of questions that don’t add more information. We face this issue all the time, for example, when our government counterparts want instruments that they themselves can use routinely for quality assurance, not multi-hour surveys originally designed for research purposes. Andrew gave us an example of a scale meant for assessing hearing-related function and quality of life, where the authors identified a smaller set of questions that provided a similar amount of information as the full scale and could be used at scale in clinical settings.
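To make "information per question" concrete, here is a rough sketch that assumes a two-parameter logistic (2PL) item response model, which is our assumption and not something stated in the lectures. Under that model, an item with discrimination a and difficulty b contributes information a² · P · (1 − P) at ability level theta, where P is the probability of a correct response. The item names and parameter values are hypothetical, and the `shorten` helper is only one simple way to pick a subset.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of one 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def shorten(items, thetas, keep):
    """Return the `keep` item names with the most information summed over `thetas`."""
    def total_information(name):
        a, b = items[name]
        return sum(item_information(t, a, b) for t in thetas)

    return sorted(items, key=total_information, reverse=True)[:keep]


# Hypothetical item parameters: name -> (discrimination a, difficulty b).
items = {
    "q1": (1.8, 0.0),
    "q2": (0.4, 0.0),
    "q3": (1.5, -1.0),
    "q4": (0.3, 2.0),
}
# Ability levels where we care about precision (an assumption for this sketch).
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]

short_form = shorten(items, thetas, keep=2)
print("Items kept:", short_form)
```

In this toy example the weakly discriminating items are dropped because they add little information anywhere on the ability range. A real shortening exercise would also check that the remaining items still cover the intended content.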
Finally, psychometric analyses can tell us whether we can credibly compare scores of sub-groups in a population or the scores across different populations. Again, this is an issue we face quite regularly at the Bank, as we often compare groups within a population (e.g. poor and non-poor) and across populations (e.g. a comparison of countries). Let’s go back to our example from The Wire. The girl clearly knows how to do chained addition and subtraction but only in a context that she has experience with. A higher-income student, however, may struggle with math in the context of running drugs. If this is the case for poor and affluent children more generally, then this question would exhibit something called differential item functioning (or a lack of measurement invariance) since two children with the same math ability from two different socio-economic backgrounds have different probabilities of answering the problem correctly. If too many questions on an instrument exhibit differential item functioning, then it would be difficult to credibly compare scores between groups. If our goal is to measure mathematical proficiency in familiar contexts, we must adapt assessments accordingly.
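One way to check a single question for differential item functioning is the Mantel-Haenszel procedure, sketched below. It is a common technique that we are bringing in for illustration and not necessarily the one used in any analysis described here. Children are matched on a proxy for ability (typically their total test score), and within each matched stratum we compare the odds of answering the item correctly in a "reference" group and a "focal" group. A common odds ratio near 1 suggests no DIF. The group labels, the `make_records` helper, and all counts are hypothetical.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item.

    records: iterable of (group, matching_score, correct) tuples, where group
    is "reference" or "focal" and correct is 1 or 0.
    """
    # One 2x2 table per matching-score stratum:
    # [reference correct, reference incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        if group not in ("reference", "focal"):
            raise ValueError(f"unknown group: {group!r}")
        index = (0 if group == "reference" else 2) + (0 if correct else 1)
        tables[score][index] += 1

    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these data")
    return numerator / denominator


def make_records(group, score, n_correct, n_incorrect):
    """Hypothetical helper that expands counts into individual records."""
    return [(group, score, 1)] * n_correct + [(group, score, 0)] * n_incorrect


# Made-up counts for the bus problem, in two matched ability strata.
records = (
    make_records("reference", 1, 6, 4) + make_records("focal", 1, 3, 7)
    + make_records("reference", 2, 8, 2) + make_records("focal", 2, 5, 5)
)

odds_ratio = mantel_haenszel_odds_ratio(records)
print(f"Mantel-Haenszel odds ratio: {odds_ratio:.2f}")
```

With these invented counts the ratio is well above 1, meaning that children matched on ability have higher odds of getting the item right in the reference group. A full analysis would add a significance test and look across all items, not just one.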
To be fair, someone could also say that measuring math ability in high-income contexts is important (which would be an argument based on content), so the fact that the girl can’t do the bus problem is construct-relevant. That is why it’s so important to start with the intended interpretation of the score when we design and adapt our measurement tools.
At the Bank, we use more and more data that come in the form of scores or indices. When these scores are used to inform policy or project design, they become high-stakes. With the help of The Wire (and Andrew Ho), we hope we’ve convinced you of some of the many ways psychometrics can help us ensure we are using our measures and scores appropriately.