Methodology, measurement and data

Search EdWorkingPapers

Search EdWorkingPapers by author, title, or keywords.

The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education

Paiheng Xu, Jing Liu, Nathan Jones, Julie Cohen, Wei Ai. 04/2024

Assessing instruction quality is a fundamental component of any improvement efforts in the education system. However, traditional manual assessments are expensive, subjective, and heavily dependent on observers’ expertise and idiosyncratic factors, preventing teachers from getting timely and frequent feedback. Different from prior research that focuses on low-inference instructional practices, this paper presents the first study that leverages Natural Language Processing (NLP) techniques to assess multiple high-inference instructional practices in two distinct educational settings: in-person K-12 classrooms and simulated performance tasks for pre-service teachers. This is also the first study that applies NLP to measure a teaching practice that has been demonstrated to be particularly effective for students with special needs. We confront two challenges inherent in NLP-based instructional analysis, including noisy and long input data and highly skewed distributions of human ratings. Our results suggest that pretrained Language Models (PLMs) demonstrate performances comparable to the agreement level of human raters for variables that are more discrete and require lower inference, but their efficacy diminishes with more complex teaching practices. Interestingly, using only teachers’ utterances as input yields strong results for student-centered variables, alleviating common concerns over the difficulty of collecting and transcribing high-quality student speech data in in-person teaching settings. Our findings highlight both the potential and the limitations of current NLP techniques in the education domain, opening avenues for further exploration.

Download 04/20241.38 MB

Marginal Returns to Public Universities

Jack Mountjoy. 04/2024

This paper studies the causal impacts of public universities on the outcomes of their marginally admitted students. I use administrative admission records spanning all 35 public universities in Texas, which collectively enroll 10 percent of American public university students, to systematically identify and employ decentralized cutoffs in SAT/ACT scores that generate discontinuities in admission and enrollment. The typical marginally admitted student completes an additional year of education in the four-year sector, is 12 percentage points more likely to earn a bachelor's degree, and eventually earns 5-10 percent more than their marginally rejected but otherwise identical counterpart. Marginally admitted students pay no additional tuition costs thanks to offsetting grant aid; cost-benefit calculations show internal rates of return of 19-23 percent for the marginal students themselves, 10-12 percent for society (which must pay for the additional education), and 3-4 percent for the government budget. Finally, I develop a method to disentangle separate effects for students on the extensive margin of the four-year sector versus those who would fall back to another four-year school if rejected. Substantially larger extensive margin effects drive the results.

Download 04/20241.35 MB

The Prevalence and Policy Implications of Between-School Heterogeneity in Learning Outcomes: Evidence from Six Public Education Systems

Daniel Rodriguez-Segura, Savannah Tierney. 04/2024

While learning outcomes in low- and middle-income countries are generally at low levels, the degree to which students and schools more broadly within education systems lag behind grade-level proficiency can vary significantly. A substantial portion of existing literature advocates for aligning curricula closer to the proficiency level of the “median child” within each system. Yet, amidst considerable between-school heterogeneity in learning outcomes, choosing a single instructional level for the entire system may still leave behind those students in schools far from this level. Hence, establishing system-wide curriculum expectations in the presence of significant between-school heterogeneity poses a significant challenge for policymakers — especially as the issue of between-school heterogeneity has been relatively unexplored by researchers so far. This paper addresses the gap by leveraging a unique dataset on foundational literacy and numeracy outcomes, representative of six public educational systems encompassing over 900,000 enrolled children in South Asia and West Africa. With this dataset, we examine the current extent of between-school heterogeneity in learning outcomes, the potential predictors of this heterogeneity, and explore its potential implications for setting national curricula for different grade levels and subjects. Our findings reveal that between-school heterogeneity can indeed present both a severe pedagogical hindrance and challenges for policymakers, particularly in contexts with relatively higher levels of performance and in the higher grades. In response to meaningful between-system heterogeneity, we also demonstrate through simulation that a more nuanced, data-driven targeting of curricular expectations for different schools within a system could empower policymakers to effectively reach a broader spectrum of students through classroom instruction.

Download 04/20245.63 MB

Public Good Perceptions and Polarization: Evidence from Higher Education Appropriations

Reuben Hurst, Andrew Simon, Michael Ricks. 03/2024

To understand the causes and consequences of polarized demand for government expenditure, we conduct three field experiments in the context of public higher education. The first two experiments study polarization in taxpayer demand. We provide information to shape beliefs about social returns on investment. Our treatments narrow the political partisan gap in ideal policies---a reduction in ideological polarization---by up to 32%, with differences in partisan reasoning as a key mechanism. Providing information also affects how people communicate their ideal policies to elected officials, increasing their propensity to write a (positive) letter to an official of the other party---a reduction in affective polarization. In the third experiment, we send these letters to a randomized subset of elected officials to study how policymakers respond to constituent demand. We find that officials who receive their constituents' demands engage more with higher education issues in our correspondences.

Download 03/20241.25 MB

Multiply by 37 (or Divide by 0.023): A Surprisingly Accurate Rule of Thumb for Converting Effect Sizes from Standard Deviations to Percentile Points

Paul T. von Hippel. 03/2024

Educational researchers often report effect sizes in standard deviation units (SD), but SD effects are hard to interpret. Effects are easier to interpret in percentile points, but converting SDs to percentile points involves a calculation that is not transparent to educational stakeholders. We show that if the outcome variable is normally distributed, we can approximate the percentile-point effect simply by multiplying the SD effect by 37 (or, equivalently, dividing the SD effect by 0.027). For students in the middle three-fifths of a normal distribution, this rule of thumb is always accurate to within 1.6 percentile points for effect sizes of up to 0.8 SD. Two examples show that the rule can be just as accurate for empirical effects from real studies. Applying the rule to empirical benchmarks, we find that the least effective third of educational interventions raise scores by 0 to 2 percentile points; the middle third raise scores by 2 to 7 percentile points; and the most effective third raise scores by more than 7 percentile points.

Download 03/2024856.72 KB

Exploring Multidimensional Approaches to Learning During Early Childhood in Ghana

Kenji Kitamura, Dana Charles McCoy, Sharon Wolf. 03/2024

Children's approaches to learning (AtL) are widely recognized as a critical predictor of educational outcomes, especially in early childhood. Nevertheless, there remains a dearth of understanding regarding the dimensionality of AtL, the reciprocal dynamics between AtL and learning outcomes, and how AtL operates in non-Western contexts. This paper aims to extend the existing AtL literature by both conceptually and empirically investigating the dimensionality of the AtL scale of the International Development and Early Learning Assessment (IDELA) – a globally used measure of early childhood development – based on data from Ghanaian children newly enrolled in formal schooling. Additionally, our research explores reciprocal relationships between AtL subconstructs and academic skills over time. Our analysis identifies two dimensions within the IDELA AtL scale: Self-Regulation (SR) and Motivation. We found that children with higher levels of SR early in schooling demonstrated better literacy and numeracy skills in later grades compared to their peers with low early SR, whereas children's motivation did not predict subsequent literacy and numeracy skills. This study enhances understanding of AtL in non-Western contexts, with implications for culturally appropriate support for children’s engagement in learning.

Download 03/2024994.56 KB

Which Colleges Increase Voting Rates?

D'Wayne Bell, John B. Holbein, Samuel Imlay, Jonathan Smith. 02/2024

We study how colleges shape their students' voting habits by linking millions of SAT takers to their college-enrollment and voting histories. To begin, we show that the fraction of students from a particular college who vote varies systematically by the college's attributes (e.g. increasing with selectivity) but also that seemingly similar colleges can have markedly different voting rates. Next, after controlling for students' college application portfolios and pre-college voting behavior, we find that attending a college with a 10 percentage-point higher voting rate increases entrants' probability of voting by 4 percentage points (10 percent). This effect arises during college, persists after college, and is almost entirely driven by higher voting-rate colleges making new voters. College peers' initial voting propensity plays no discernible role.

Download 02/20241.74 MB

A Global Regression Discontinuity Design: Theory and Application to Grade Retention Policies

Isaac M. Opper, Umut Özek. 02/2024

We use a marginal treatment effect (MTE) representation of a fuzzy regression discontinuity setting to propose a novel estimator. The estimator can be thought of as extrapolating the traditional fuzzy regression discontinuity estimate or as an observational study that adjusts for endogenous selection into treatment using information at the discontinuity. We show in a frequentest framework that it is consistent under weaker assumptions than existing approaches and then discuss conditions in a Bayesian framework under which it can be considered the posterior mean given the observed conditional moments. We then use this approach to examine the effects of early grade retention. We show that the benefits of early grade retention policies are larger for students with lower baseline achievement and smaller for low-performing students who are exempt from retention. These findings imply that (1) the benefits of early grade retention policies are larger than have been estimated using traditional fuzzy regression discontinuity designs but that (2) retaining additional students would have a limited effect on student outcomes.

Download 02/2024500.91 KB

How Measurement Affects Causal Inference: Attenuation Bias is (Usually) More Important Than Scoring Weights

Joshua B. Gilbert. 02/2024

When analyzing treatment effects on test scores, researchers face many choices and competing guidance for scoring tests and modeling results. This study examines the impact of scoring choices through simulation and an empirical application. Results show that estimates from multiple methods applied to the same data will vary because two-step models using sum or factor scores provide attenuated standardized treatment effects compared to latent variable models. This bias dominates any other differences between models or features of the data generating process, such as the use of scoring weights. An errors-in-variables (EIV) correction removes the bias from two-step models. An empirical application to data from a randomized controlled trial demonstrates the sensitivity of the results to model selection. This study shows that the psychometric principles most consequential in causal inference are related to attenuation bias rather than optimal scoring weights.

Download 02/20242.25 MB

Disentangling Person-Dependent and Item-Dependent Causal Effects: Applications of Item Response Theory to the Estimation of Treatment Effect Heterogeneity

Joshua B. Gilbert, Luke W. Miratrix, Mridul Joshi, Benjamin W. Domingue. 02/2024

Analyzing heterogeneous treatment effects (HTE) plays a crucial role in understanding the impacts of educational interventions. A standard practice for HTE analysis is to examine interactions between treatment status and pre-intervention participant characteristics, such as pretest scores, to identify how different groups respond to treatment. This study demonstrates that identical patterns of HTE on test score outcomes can emerge either from variation in treatment effects due to a pre-intervention participant characteristic or from correlations between treatment effects and item easiness parameters. We demonstrate analytically and through simulation that these two scenarios cannot be distinguished if analysis is based on summary scores alone. We then describe a novel approach that identifies the relevant data-generating process by leveraging item-level data. We apply our approach to a randomized trial of a reading intervention in second grade, and show that any apparent HTE by pretest ability is driven by the correlation between treatment effect size and item easiness. Our results highlight the potential of employing measurement principles in causal analysis, beyond their common use in test construction.

Download 02/20242.06 MB