Smart people, usually with good reason, like to make statements like “Measure what is important, don’t make important what you can measure,” or “Measure what we treasure, not treasure what we measure.” It is rumored that even Einstein weighed in on this, saying: “Not everything that can be counted counts, and not everything that counts can be counted.” A variant of this has also become a rallying cry among the “anti-randomistas,” who agitate against focusing research only on questions that can be answered experimentally.
However, I am confident most researchers can agree that there is not much worse than the helpless feeling of not being able to vouch for the veracity of what you measured. We can deal with papers reporting null results, and we can deal with messy or confusing stories, but what gives no satisfaction to anyone is presenting some findings and then having to say: “This could all be wrong, because we’re not sure the respondents in our surveys are telling the truth.” This does not mean that research on sensitive topics does not get done, but, like the proverbial sausage, it is sometimes better not to look too closely at where the data came from and how they were made.
Last Thursday I attended a conference on AI and Development organized by CEGA, DIME, and the World Bank’s Big Data groups (website, where they will also add video). This followed a World Bank policy research talk last week by Olivier Dupriez on “Machine Learning and the Future of Poverty Prediction” (video, slides). These events highlighted a lot of fast-emerging work, which, given this blog’s focus, I thought I would try to summarize through the lens of how it might help us design development interventions and impact evaluations.
A typical impact evaluation takes a sample S, gives some units a treatment Treat, and is interested in estimating something like:
Y(i,t) = b(i,t)*Treat(i,t) + D'X(i,t) for units i in the sample S
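To make the notation concrete, here is a minimal sketch of estimating this kind of expression by OLS on simulated data. The constant term, the error term, the data-generating process, and all variable names are illustrative assumptions of mine, not anything from the conference or the talk.

```python
# Minimal sketch: estimate the treatment effect b in
# Y_i = a + b*Treat_i + D'X_i + e_i by OLS, pooling over t for simplicity.
# The simulated data and all parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                   # units in the sample S
treat = rng.integers(0, 2, size=n)         # Treat(i): randomized 0/1 assignment
X = rng.normal(size=(n, 3))                # X(i): baseline covariates
true_b, true_D = 0.5, np.array([0.2, -0.1, 0.3])
y = 1.0 + true_b * treat + X @ true_D + rng.normal(scale=1.0, size=n)

# Stack a constant, the treatment dummy, and the covariates, then solve OLS.
Z = np.column_stack([np.ones(n), treat, X])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(f"Estimated treatment effect b-hat: {coef[1]:.3f}")  # should be near 0.5
```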
We can think of machine learning and artificial intelligence as possibly affecting every term in this expression:
- If you could go back to the time you did not have any children and could choose exactly the number of children to have in your whole life, how many would that be?
- How many of these children would you like to be boys, how many would you like to be girls, and for how many would it not matter if it’s a boy or a girl?
About a year ago I reviewed Angela Duckworth’s book on grit. At the time I noted that it contained compelling ideas, but flagged two big issues: her self-assessed 10-item Grit scale could be very gameable, and there was only limited rigorous evidence on whether efforts to improve grit have lasting impacts.
A cool new paper by Sule Alan, Teodora Boneva, and Seda Ertac makes excellent progress on both fronts. They conduct a large-scale experiment in Turkey with almost 3000 fourth-graders (8-10 year olds) in over 100 classrooms in 52 schools (randomization was at the school level, with 23 schools assigned to treatment).
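For readers who want to see what school-level (clustered) assignment like this looks like mechanically, here is a minimal sketch. The draw of 23 of 52 schools mirrors the numbers above, but the school IDs and the use of simple (unstratified) randomization are my own simplifying assumptions, not the authors’ actual procedure.

```python
# Minimal sketch of school-level (clustered) random assignment: treatment is
# assigned to whole schools, and every student inherits their school's status.
# Numbers mirror the description above (52 schools, 23 treated); everything
# else is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(2024)
n_schools, n_treated = 52, 23

schools = np.arange(1, n_schools + 1)                    # hypothetical school IDs
treated_schools = rng.choice(schools, size=n_treated, replace=False)

# Each school's assignment; standard errors would then be clustered by school.
assignment = {int(s): int(s in treated_schools) for s in schools}
print(sorted(int(s) for s in treated_schools))
```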
About a year ago, I wrote a blog post on issues surrounding data collection and measurement. In it, I talked about “list experiments” for sensitive questions, on which I was not sold at the time. However, now that I have a number of studies going to the field, at different stages of data collection, many of which cover sensitive topics among adolescent female populations, I am paying closer attention to them. As I read and thought about how to implement these methods in our surveys, I came up with a number of questions about their optimal design. In addition, there is probably more to be learned about these methods to improve them further, which opens up the possibility of experimenting with them when we can. Below are some of the things I am thinking about, and, as we still have some time before our data collection tools are finalized, you, our readers, have a chance to help shape them with your comments and feedback.
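For those less familiar with list experiments, here is a minimal sketch of the basic difference-in-means estimator behind them: the control group reports how many of J innocuous items apply to them, the treatment group gets the same list plus the sensitive item, and the difference in mean counts estimates the prevalence of the sensitive behavior. The data below are simulated and all numbers are illustrative assumptions, not results from our surveys.

```python
# Minimal sketch of the list experiment (item count technique) estimator.
# Control arm: count of J innocuous items; treatment arm: same J items plus
# the sensitive one. The difference in mean counts estimates prevalence.
# Simulated with an assumed true prevalence of 0.30 for illustration only.
import numpy as np

rng = np.random.default_rng(7)
n_per_arm, J, true_prevalence = 500, 4, 0.30

# Counts of innocuous items that apply (same distribution in both arms).
innocuous_control = rng.binomial(J, 0.5, size=n_per_arm)
innocuous_treat = rng.binomial(J, 0.5, size=n_per_arm)
sensitive = rng.binomial(1, true_prevalence, size=n_per_arm)

control_counts = innocuous_control               # short list: J items
treat_counts = innocuous_treat + sensitive       # long list: J + 1 items

estimate = treat_counts.mean() - control_counts.mean()
se = np.sqrt(treat_counts.var(ddof=1) / n_per_arm +
             control_counts.var(ddof=1) / n_per_arm)
print(f"Estimated prevalence: {estimate:.3f} (SE {se:.3f})")
```

The appeal of the design is that no individual respondent ever directly admits to the sensitive item; the cost, as the variance formula above suggests, is a noisier estimate than direct questioning would give.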