Although I try to follow the research in my field regardless of where it is conducted, I usually don’t pick studies from the U.S. or other developed countries for discussion in this space. However, when a study tests interventions to improve outcomes for adolescents, reports some encouraging findings, and may be applicable in the developing world, we can make an exception. So, today’s post is about a study that takes place in a public high school on the south side of Chicago…
Cook et al. (2014) start with the following premise: there are very few interventions with proven effectiveness in improving learning outcomes for adolescents. While some interventions are shown to improve graduation rates, the lack of impact on learning (as measured by standardized test scores) has led to two types of approaches: a recommendation to channel students who have fallen behind into vocational or technical training, and a recommendation to focus on early childhood interventions to reduce the prevalence of problems encountered in adolescence. For adolescents who have fallen well behind their grade-appropriate level of learning, schools try to improve the quality of grade-level education but are usually limited in their ability to provide catch-up services, such as tutoring.
The authors counter that perhaps we are giving up on these youths too quickly. If the reasons for children falling behind in the classroom are behavioral (due to their disadvantaged backgrounds) in addition to academic, then the right intervention may need to focus not only on academic remediation but on behavior change as well. So, the authors set out to test a combination intervention in a school where many of the students may benefit from such an approach.
Target group: Male students in 9th and 10th grade in a Chicago Public School (CPS), almost all of whom are eligible for free school lunches. Based on baseline data available from CPS, and after excluding students who missed more than 60% of school days the previous year and failed more than 75% of their classes, the authors constructed an index of academic risk (based on absences in the previous year, failed courses, and being old for grade) and selected the bottom third for the study. This group had missed a lot of school the previous year, had low GPAs (even lower for math), had disciplinary issues, and sat around the 25th percentile of the national math and reading score distribution (using the tests employed to evaluate this intervention). So, the target group is an at-risk group of 14-16-year-old male secondary school students (note, however, the exclusion of those mentioned above and of those who are already out of school).
Intervention: There are two interventions. BAM (an acronym for Becoming a Man) is described as having components on social-emotional learning (SEL) as well as components based on standard elements of cognitive behavioral therapy (CBT). As I had little idea what these would entail, the authors helpfully provide an example, which I include here:
“The nature of the intervention is best illustrated by example. The very first activity for youth in the program is the “Fist Exercise.” Students are divided into pairs; one student is told he has 30 seconds to get his partner to open his fist. Then the exercise is reversed. Almost all youth attempt to use physical force to compel their partners to open their fists. During debrief, the group leader asks youth to explain what they tried and how it worked, pointedly noting that (as is usually the case) almost no one asked their partner to open their fist. When youth are asked why, they usually provide responses such as: “he wouldn’t have done it,” or “he would have thought I was a punk.” The group leader will then follow up by asking: “How do you know?” The exercise is an experiential way to teach youth about hostile attribution bias. The example also shows how the program is engaging to youth who might not normally sign up for pro-social activities, because it is slightly subversive – to participate they get out of an academic class, and then the first activity winds up involving rowdy horseplay.”
The students chosen for BAM have an opportunity to participate in up to 27 weekly one-hour sessions with one college-educated instructor, who does not need to have any specialized training. The sessions have 8-15 students each and take place during school hours (an incentive for students to skip an academic class and attend BAM instead).
The second intervention is two-on-one math tutoring. The instructors are recruited “Teach for America-style,” i.e., for one-year stints with a small stipend plus benefits. As they work with an extremely small class size of two, they don’t need training in class management and other crucial “teacher skills,” but they generally have good math and interpersonal skills (and are willing to commit to an academic year for about $16,000 plus benefits). The sessions are one hour per day (compared to the 1.5 hours of tutoring per week available to all eligible children at the school through No Child Left Behind funds).
Study Design: 106 students from grades 9 and 10 make up the bottom third of the male academically at-risk group. Of these, 34 are assigned to control (status quo services), 24 to BAM, and 48 to BAM plus tutoring (BAM+). Two design elements are unusual here. First, the authors forgo a more classical 2x2 factorial design that would have included a tutoring-only arm. I assume that a combination of the small sample size, budgetary constraints, and the fact that the “Match-style” tutoring is likely being tested elsewhere (by the same research team) contributed to this decision. Obviously, the three-arm study has its downsides (any improvement of BAM+ over and above BAM might have been achievable with the “+” alone), but this is not unusual for a pilot. (The design actually gives me some solace, as one of the studies I am working on was similarly constrained and could not afford four study arms; we had to make a similar choice of Control, Treatment A, and Treatment A+B – with prolonged discussions of which treatment arm to leave out.) The second unusual element is the unequal sizes of the two treatment groups: all else equal, such deviations reduce statistical power, and it seems that some logistical constraints went into this decision. It’s easy to see that the small overall sample size will be an issue – especially for detecting differences between the two intervention arms. I return to this issue when discussing the results.
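To get a feel for how underpowered the comparison between the two treatment arms is, here is a minimal simulation sketch. It is not from the paper: the assumed effect sizes (0.25 SD for BAM, 0.50 SD for BAM+) and the unit outcome SD are my own illustrative choices; only the arm sizes (24 and 48) come from the study.

```python
# Minimal power simulation for the BAM vs. BAM+ comparison (illustrative
# assumptions, not the paper's own power calculations).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_bam, n_bamplus = 24, 48                 # arm sizes from the study
effect_bam, effect_bamplus = 0.25, 0.50   # assumed effects, for illustration only

n_sims, rejections = 5000, 0
for _ in range(n_sims):
    y_bam = rng.normal(effect_bam, 1.0, n_bam)
    y_bamplus = rng.normal(effect_bamplus, 1.0, n_bamplus)
    _, p = ttest_ind(y_bam, y_bamplus)    # two-sample t-test between the arms
    rejections += (p < 0.05)

print(f"Simulated power to detect a 0.25 SD gap: {rejections / n_sims:.2f}")
# With these sample sizes the power comes out far below the conventional 0.80,
# which is one reason the paper pools the two treatment arms in the main analysis.
```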
Data: The authors have administrative data on attendance in the various intervention sessions. Given that this is a school-based intervention, they also have decent baseline data from CPS records on these students. The main outcome variables are the math and reading tests that CPS administers in 9th (EXPLORE) and 10th (PLAN) grades. The usual problem with this type of outcome variable is that it is subject to selection bias – students who are absent that day or have dropped out of school have missing values. If the program has an effect on these outcomes, then there is differential attrition from the sample that can cause bias. The authors show that there is indeed differential attrition (the control group is less likely to have follow-up test scores) and that the students who don’t take the tests are a different group than those who do. It’s not clear why the authors did not use tests that could be administered at home – probably a combination of costs and perhaps an expectation of lower attrition rates. The authors also examine GPA, courses failed, and some behavioral outcomes, such as disciplinary incidents, days absent, and out-of-school suspensions.
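For readers who want to see what checking for differential attrition looks like in practice, here is a minimal sketch. The missingness counts are hypothetical (only the arm sizes, 72 treated and 34 controls, come from the study), and the paper’s own check presumably conditions on baseline covariates rather than using a raw contingency table.

```python
# Is a follow-up score more likely to be missing in the control group than in
# the pooled treatment group? (Hypothetical missingness counts.)
import numpy as np
from scipy.stats import chi2_contingency

# rows: treated, control; columns: score observed, score missing
table = np.array([[60, 12],    # hypothetical: 72 treated, 12 missing
                  [24, 10]])   # hypothetical: 34 controls, 10 missing

chi2, p, _, _ = chi2_contingency(table)
print(f"missing share, treated: {table[0, 1] / table[0].sum():.2f}, "
      f"control: {table[1, 1] / table[1].sum():.2f}, p = {p:.3f}")
# A large gap in missing shares (or a small p-value) signals differential
# attrition, which is why the paper also reports imputed/bounded estimates.
```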
Findings: For the main analysis, the authors combine the two treatment groups (due to some cross-contamination and the lack of power). The ITT estimate is impressive: more than 0.5 standard deviations (of the control group distribution) in math achievement test scores. When the authors benchmark the performance against national percentile rankings, the results are equally impressive. The effect is a 12 percentage point (pp) increase over a mean ranking around the 25th percentile in the control group. With the 75% take-up rate (attending at least one session in either intervention), the treatment effect on the treated is about 15 pp. The authors predict that the ranking of compliers in the control group would be around the 19th percentile, so the 15 pp improvement would move the treated children to roughly the 34th percentile of the national distribution. There are no improvements in reading, although the number of non-math courses failed is significantly reduced in the combined treatment group. The number of days absent is also lowered significantly, by about 10 days (ITT) from a mean of 44 days.
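The step from the ITT to the treatment effect on the treated is the standard Bloom adjustment (scale the ITT by the take-up rate). A minimal sketch, using the rounded figures quoted above rather than the paper’s exact estimates:

```python
# Bloom (1984) adjustment: with no crossover from control into treatment,
# TOT = ITT / take-up rate. Inputs are the rounded figures quoted in the post,
# so the output only approximates the paper's reported ~15 pp.
itt_pp = 12.0     # ITT effect on national percentile ranking, in pp
take_up = 0.75    # share of the treatment group attending at least one session

tot_pp = itt_pp / take_up
print(f"Implied TOT: {tot_pp:.0f} pp")
# The paper then benchmarks this against control-group compliers at roughly the
# 19th percentile: 19th + ~15 pp puts treated compliers near the 34th percentile.
```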
The rest of the paper consists of robustness checks and discussion. The main robustness checks center on the standard errors and the missing test scores. The standard errors could be too small due to multiple testing (multiple outcome variables) and false discovery rates. The authors provide p-values based on permutation tests, among other procedures. The main findings, with the exception of the reduction in days absent (the p-value for which goes up to 0.141), are robust to these calculations.
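For readers unfamiliar with permutation tests: the idea is to reshuffle the treatment labels many times and ask how often a difference as large as the observed one arises by chance. A minimal sketch with simulated data (not the study’s data); the arm sizes of 72 and 34 are the only inputs taken from the study:

```python
# Permutation test for a difference in mean test scores (simulated data).
import numpy as np

rng = np.random.default_rng(1)
treated = rng.normal(0.5, 1.0, 72)   # simulated: 72 treated, true effect 0.5 SD
control = rng.normal(0.0, 1.0, 34)   # simulated: 34 controls

observed = treated.mean() - control.mean()
pooled = np.concatenate([treated, control])

n_perm, count = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)                     # reassign treatment labels at random
    diff = pooled[:72].mean() - pooled[72:].mean()
    count += (abs(diff) >= abs(observed))   # two-sided comparison

print(f"Permutation p-value: {count / n_perm:.3f}")
# The appeal is that the p-value does not lean on large-sample normal
# approximations, which is reassuring with only 106 students.
```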
The problem arises when the authors deal with the missing values. In addition to controlling for baseline characteristics that are predictive of attrition (the exclusion of which does not alter the results much), the authors use a multiple imputation technique to fill in missing values (assuming missingness at random conditional on covariates), as well as assigning arbitrarily low values to missing test scores and estimating quantile regressions to look at differences at the median rather than the mean. The findings do not survive these adjustments: a combination of slightly smaller effect sizes (still quite large in absolute terms) and slightly larger standard errors means that none of the earlier findings remain significant at conventional levels. For example, the ITT (SE) for math goes from 0.51 (0.20) to 0.39 (0.30). [Note: earlier the authors had argued – rightly, I thought – that because worse-performing students are less likely to take the tests and there are more of these students in the control group, reasonable adjustments that bring these students back into the sample by filling in their missing test scores should have increased the ITT effects, not decreased them. However, given that the confidence intervals for the two estimates overlap, perhaps one need not dwell on this too much.]
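A minimal sketch of the “fill in low values and compare medians” logic, on simulated data; the paper’s actual bounding, imputation, and quantile-regression procedures are more involved than this:

```python
# Pessimistic imputation check (simulated data): assign an arbitrarily low
# score to students with missing follow-up tests, then compare medians, which
# are insensitive to how low that fill-in value is.
import numpy as np

rng = np.random.default_rng(2)
scores_t = rng.normal(0.5, 1.0, 72)        # simulated treated scores
scores_c = rng.normal(0.0, 1.0, 34)        # simulated control scores
scores_t[rng.random(72) < 0.10] = np.nan   # simulated missingness: 10% treated
scores_c[rng.random(34) < 0.25] = np.nan   # simulated missingness: 25% control

LOW = -5.0                                 # arbitrarily low fill-in value
filled_t = np.where(np.isnan(scores_t), LOW, scores_t)
filled_c = np.where(np.isnan(scores_c), LOW, scores_c)

print(f"complete-case mean diff:   {np.nanmean(scores_t) - np.nanmean(scores_c):.2f}")
print(f"median diff after filling: {np.median(filled_t) - np.median(filled_c):.2f}")
# With more missingness in the control group, a pessimistic fill-in like this
# tends to widen the estimated gap, which is exactly the puzzle raised in the
# bracketed note above about why the adjusted estimates shrink instead.
```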
The authors also examine the effects of each intervention separately. Not surprisingly, it is impossible to state with any confidence that there is a difference between BAM and BAM+. However, it does look like the effects of the non-academic intervention alone (BAM) are as large as when it is combined with tutoring (BAM+). The authors interpret this result as potentially contradicting earlier experimental results – which showed that while BAM led to large reductions in behavioral problems, it did not improve test scores – and suggest that BAM by itself could be effective. I found this assertion to be a stretch and unconvincing. If that were the case, why do we see effects only in math achievement – the subject the tutoring focuses on?
The authors then examine the cost-effectiveness of the intervention, which costs approximately $4,400 per student per year. Comparing effects on math scores with studies of the Perry Preschool intervention, cash transfers through the EITC (the Earned Income Tax Credit, for readers not familiar with the U.S. tax code), and class size reductions, they come down favorably on the combined intervention in Chicago. Granted, this is mainly because improvements per $1,000 spent are small in other studies (hence the motivation for this study) and the confidence intervals here are wide due to the small sample size, but the authors argue that there is nothing intrinsic about this intervention that cannot be scaled up.
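A back-of-the-envelope version of the “improvement per $1,000” metric, using the rounded numbers quoted in this post (the paper’s own cost-effectiveness calculations are more careful about costs and comparators):

```python
# Effect size per $1,000 spent, using the rounded figures quoted in the post.
cost_per_student = 4_400   # dollars per student per year
itt_effect_sd = 0.51       # ITT effect on math scores, in control-group SDs

effect_per_1000 = itt_effect_sd / (cost_per_student / 1_000)
print(f"~{effect_per_1000:.2f} SD per $1,000 spent")   # roughly 0.12 SD per $1,000

# Comparator programs (Perry Preschool, EITC transfers, class-size reductions)
# are scored the same way: their estimated test-score effect divided by their
# per-student cost in thousands of dollars.
```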
The authors conclude by discussing the optimal timing of interventions, particularly contrasting early childhood interventions with those during adolescence. Taking for granted that plasticity may decline from childhood to adolescence, the authors still question whether this is sufficient reason to prioritize funding towards early childhood rather than adolescence. They argue that most interventions show a ‘fade-out’ of impacts, so there may be a trade-off between intervening early, when plasticity is higher, and treating someone closer to the time when they are about to get into trouble (drop out of school, get arrested, get pregnant, etc.). Furthermore, they argue that it is easier to target at-risk adolescents for the right interventions because we have ‘more of a track record’ with them.
Given that psychological problems among adolescents are at least as big an issue in many developing countries, it is possible to think about combining the usual interventions that target adolescents with others focusing on social-cognitive skills, such as mentorship interventions informed by proper psychological models of adolescent behavior and behavioral therapy (see, for example, this study by Bandiera et al. 2012 in Uganda). Previously, many of us raised the same objections about the potentially prohibitive costs of these types of interventions in developing countries as those quoted in this study for the U.S. But perhaps the benefits are high enough, and there are innovative ways of designing interventions to keep the costs down. Who is game?