Joshua B. Gilbert
Analyzing heterogeneous treatment effects plays a crucial role in understanding the impacts of educational interventions. A standard practice for heterogeneity analysis is to examine interactions between treatment status and pre-intervention participant characteristics, such as pretest scores, to identify how different groups respond to treatment. This study demonstrates that identical observed patterns of heterogeneity on test score outcomes can emerge from entirely distinct data-generating processes. Specifically, we describe scenarios in which treatment effect heterogeneity arises from either variation in treatment effects along a pre-intervention participant characteristic or from correlations between treatment effects and item easiness parameters. We demonstrate analytically and through simulation that these two scenarios cannot be distinguished if analysis is based on summary scores alone, as such outcomes are insufficient to identify the relevant generating process. We then describe a novel approach that identifies the relevant data-generating process by leveraging item-level data. We apply our approach to a randomized trial of a reading intervention in second grade, and show that any apparent heterogeneity by pretest ability is driven by the correlation between treatment effect size and item easiness. Our results highlight the potential of employing measurement principles in causal analysis, beyond their common use in test construction.
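The second scenario in this abstract can be illustrated with a small deterministic sketch. It assumes a Rasch-type model in which the log-odds of a correct response equal ability plus item easiness plus an item-specific treatment effect. The 20-item test, the effect size of 0.4 logits, and the function names are hypothetical choices for illustration, not the paper's specification. Every student receives the same item-level effects, yet the expected sum-score gain differs by ability because the effects sit on the easier items.

```python
import numpy as np

def sigmoid(x):
    """Logistic function mapping log-odds to probabilities."""
    return 1.0 / (1.0 + np.exp(-x))

def expected_sum_score_effect(theta, easiness, item_effects):
    """Expected treated-minus-control sum score at ability theta.

    Assumes P(correct) = sigmoid(theta + easiness + treatment * item_effect).
    """
    control = sigmoid(theta + easiness)
    treated = sigmoid(theta + easiness + item_effects)
    return float(np.sum(treated - control))

# Hypothetical 20-item test with easiness spread evenly around zero.
easiness = np.linspace(-2.0, 2.0, 20)

# Treatment shifts only the easier items, so effect size correlates with
# easiness. No person-level moderator enters the model.
item_effects = np.where(easiness > 0, 0.4, 0.0)

low_ability_gain = expected_sum_score_effect(-1.0, easiness, item_effects)
high_ability_gain = expected_sum_score_effect(1.0, easiness, item_effects)

print(f"Expected gain at theta = -1: {low_ability_gain:.3f}")
print(f"Expected gain at theta = +1: {high_ability_gain:.3f}")
```

Under these assumptions the lower-ability student shows the larger sum-score gain, which an analyst looking only at summary scores could read as a treatment-by-pretest interaction.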
Longitudinal models of individual growth typically emphasize between-person predictors of change but ignore how growth may vary within persons because each person contributes only one point at each time to the model. In contrast, modeling growth with multi-item assessments allows evaluation of how relative item performance may shift over time. While traditionally viewed as a nuisance under the label of “item parameter drift” (IPD) in the Item Response Theory literature, we argue that IPD may be of substantive interest if it reflects how learning manifests on different items at different rates. In this study, we present a novel application of the Explanatory Item Response Model (EIRM) to assess IPD in a causal inference context. Simulation results show that when IPD is not accounted for, both parameter estimates and their standard errors can be affected. We illustrate with an empirical application to the persistence of transfer effects from a content literacy intervention on vocabulary knowledge, revealing how researchers can leverage IPD to achieve a more fine-grained understanding of how vocabulary learning develops over time.
When analyzing treatment effects on test score data, education researchers face many choices for scoring tests and modeling results. This study examines the impact of those choices through Monte Carlo simulation and an empirical application. Results show that estimates from multiple analytic methods applied to the same data will vary because, as predicted by Classical Test Theory, two-step models using sum or IRT-based scores provide downwardly biased standardized treatment effect coefficients compared to latent variable models. This bias dominates any other differences between models or features of the data generating process, such as the variability of item discrimination parameters. An errors-in-variables (EIV) correction successfully removes the bias from two-step models. Model performance is not substantially different in terms of precision, standard error calibration, false positive rates, or statistical power. An empirical application to data from a randomized controlled trial of a second-grade literacy intervention demonstrates the sensitivity of the results to model selection and tradeoffs between model selection and interpretation. This study shows that the psychometric principles most consequential in causal inference are related to attenuation bias rather than optimal scoring weights.
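The attenuation described in this abstract follows from Classical Test Theory: measurement error inflates the outcome's standard deviation, so a standardized effect on observed scores shrinks by roughly the square root of the reliability. The simulation below is a minimal sketch of that relationship and of a simple disattenuation step. The sample size, true effect, and error variance are hypothetical, the reliability is known by construction rather than estimated, and the correction shown may differ from the EIV procedure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200_000          # large sample so sampling noise is small (hypothetical)
true_effect = 0.4    # treatment effect in latent SD units (hypothetical)
error_var = 1.0      # measurement error variance (hypothetical)

# Random assignment, latent outcome, and an error-prone observed score.
treat = rng.integers(0, 2, size=n)
theta = true_effect * treat + rng.normal(0.0, 1.0, size=n)
score = theta + rng.normal(0.0, np.sqrt(error_var), size=n)

def standardized_effect(y, t):
    """Mean difference divided by the pooled within-group SD."""
    y1, y0 = y[t == 1], y[t == 0]
    pooled_sd = np.sqrt((y1.var(ddof=1) + y0.var(ddof=1)) / 2.0)
    return (y1.mean() - y0.mean()) / pooled_sd

# Within-group reliability: latent variance over observed variance.
reliability = 1.0 / (1.0 + error_var)

observed = standardized_effect(score, treat)
corrected = observed / np.sqrt(reliability)

print(f"Observed standardized effect:  {observed:.3f}")
print(f"Disattenuated effect:          {corrected:.3f}")
print(f"True latent effect:            {true_effect:.3f}")
```

In applied work the reliability would itself be estimated, for example from an internal consistency coefficient, which adds uncertainty that this sketch ignores.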
This simulation study examines the characteristics of the Explanatory Item Response Model (EIRM) for estimating treatment effects, compared to classical test theory (CTT) sum and mean scores and item response theory (IRT)-based theta scores. Results show that the EIRM and IRT theta scores provide generally equivalent bias and false positive rates compared to CTT scores and superior calibration of standard errors under model misspecification. Analysis of the statistical power of each method reveals that the EIRM and IRT theta scores provide a marginal benefit to power and are more robust to missing data than other methods when parametric assumptions are met, and provide a substantial benefit to power under heteroskedasticity, but their performance is mixed under other conditions. The methods are illustrated with an empirical data application examining the causal effect of an elementary school literacy intervention on reading comprehension test scores, which demonstrates that the EIRM provides a more precise estimate of the average treatment effect than the CTT or IRT theta score approaches. Tradeoffs of model selection and interpretation are discussed.
The current study aimed to explore the impact of COVID-19 on the reading achievement growth of Grade 3-5 students in a large urban school district in the U.S. and whether the impact differed by students’ demographic characteristics and instructional modality. Specifically, using administrative data from the school district, we investigated to what extent students made gains in reading during the 2020-2021 school year relative to the pre-COVID-19 typical school year in 2018-2019. We further examined whether the effects of students’ instructional modality on reading growth varied by demographic characteristics. Overall, students had lower average reading achievement gains over the 9-month 2020-2021 school year than the 2018-2019 school year, with learning loss effect sizes of 0.54, 0.27, and 0.28 standard deviation units for Grades 3, 4, and 5, respectively. Substantially reduced reading gains were observed for Grade 3 students, students from high-poverty backgrounds, English learners, and students with reading disabilities. Additionally, findings indicate that among students with similar demographic characteristics, higher-achieving students tended to choose the fully remote instruction option, while lower-achieving students appeared to opt for in-person instruction at the beginning of the 2020-2021 school year. However, students who received in-person instruction most likely demonstrated continuous growth in reading over the school year, whereas initially higher-achieving students who received remote instruction showed stagnation or decline, particularly in the spring 2021 semester. Our findings support the notion that in-person schooling during the pandemic may serve as an equalizer for lower-achieving students, particularly from historically marginalized or vulnerable student populations.
Analyses that reveal how treatment effects vary allow researchers, practitioners, and policymakers to better understand the efficacy of educational interventions. In practice, however, standard statistical methods for addressing Heterogeneous Treatment Effects (HTE) fail to address the HTE that may exist within outcome measures. In this study, we present a novel application of the Explanatory Item Response Model (EIRM) for assessing what we term “item-level” HTE (IL-HTE), in which a unique treatment effect is estimated for each item in an assessment. Results from data simulation reveal that when IL-HTE are present but ignored in the model, standard errors can be underestimated and false positive rates can increase. We then apply the EIRM to assess the impact of a literacy intervention focused on promoting transfer in reading comprehension on a digital formative assessment delivered online to approximately 8,000 third-grade students. We demonstrate that allowing for IL-HTE can reveal treatment effects at the item-level masked by a null average treatment effect, and the EIRM can thus provide fine-grained information for researchers and policymakers on the potentially heterogeneous causal effects of educational interventions.
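The idea that item-level effects can hide behind a null average can be sketched with simulated data. The example below assumes a Rasch-type response model with item-specific treatment effects that are positive on the harder half of the items and negative on the easier half. The item count, effect sizes, and variable names are hypothetical, and the per-item differences in proportion correct are a descriptive stand-in, not the EIRM the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(1)

n_persons = 8000                        # roughly the study's sample size
easiness = np.linspace(-1.0, 1.0, 10)   # hypothetical 10-item assessment

# Hypothetical item-level effects (logits): harder items gain, easier lose.
item_effects = np.where(easiness < 0, 0.5, -0.5)

treat = rng.integers(0, 2, size=n_persons)
theta = rng.normal(0.0, 1.0, size=n_persons)

# Log-odds of a correct response for every person-item pair.
logits = (theta[:, None] + easiness[None, :]
          + treat[:, None] * item_effects[None, :])
prob = 1.0 / (1.0 + np.exp(-logits))
responses = (rng.random(prob.shape) < prob).astype(int)

# Treated-minus-control difference in proportion correct, item by item.
per_item_diff = (responses[treat == 1].mean(axis=0)
                 - responses[treat == 0].mean(axis=0))
average_diff = per_item_diff.mean()

print("Per-item differences:", np.round(per_item_diff, 3))
print(f"Average across items: {average_diff:.3f}")
```

In this setup the average difference sits near zero while individual items move in opposite directions, so an analysis of the total score alone would miss the item-level pattern.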