Date: 2026-02-24

The Urban Institute’s adjusted rankings of state National Assessment of Educational Progress (NAEP) scores are serious, careful work. They call for serious, careful engagement, and a resistance to uncritical celebration.

With that in mind, I worry that education decision makers will prematurely treat a reordered NAEP list as proof of a “southern surge.” That is exactly the kind of narrative leap that can obscure more than it reveals.

No one loves a breakthrough in education more than me, but I’ve learned over time that the distance between a useful analysis and a sweeping conclusion is often crossed in the time it takes to write a press release.

Let’s trust, but verify.

I’m not a statistician, researcher, or Ph.D. But if I’m going to advocate for systemic educational improvement and be a zealot for student outcomes wherever I can, I need to stretch my thinking. It’s my religion. What follows is just my attempt to understand after too much reading.

As I understand it, the NAEP reports how the students a state actually serves scored in a given year. Those averages are shaped by who lives in each state. Places with more low-income students, English learners, and students receiving special education services tend to post lower scores, not necessarily because their schools are worse, but because those student populations face greater barriers to measured academic success. In full disclosure, this is the basis of a liberal argument that for years got under my skin: that states do better or worse depending on demography.

It’s a little more complicated now that the center-right has also embraced this argument. Not because they think demography is destiny (they don’t), but because the southern surge narrative may or may not support their favored systemic reforms in how we teach reading and math.

Urban’s analysts, considered to be center-left, shift the question from asking which states have the highest average scores to asking which states’ students score higher or lower than we would expect given the kinds of students they serve. Using student-level data, they predict a score for each child based on age, gender, race and ethnicity, socioeconomic status, special education status, and English learner status, then compare that prediction with the child’s actual score and average those differences by state.

Very simply put, they are asking whether students do better or worse in a given state than demographically similar students do nationwide.
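For readers who, like me, understand a method better once they see it run, here is a minimal Python sketch of that logic. It is my own simplification, not Urban’s actual model: instead of a regression on all the characteristics listed above, it predicts each student’s score as the national average for students with the same profile, using only two made-up flags (`low_income`, `english_learner`). The records, the state labels, and the function names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Toy, made-up records: (state, low_income, english_learner, score).
# These are NOT real NAEP data; they only illustrate the mechanics.
STUDENTS = [
    ("A", True, False, 210),
    ("A", True, False, 214),
    ("A", True, True, 200),
    ("A", False, False, 230),
    ("B", False, False, 240),
    ("B", False, False, 236),
    ("B", True, False, 204),
    ("B", True, True, 196),
]


def raw_state_means(students):
    """Average score of the students each state actually serves."""
    by_state = defaultdict(list)
    for state, _low_income, _english_learner, score in students:
        by_state[state].append(score)
    return {state: mean(scores) for state, scores in by_state.items()}


def adjusted_state_means(students):
    """Average gap between actual and predicted scores, by state.

    Assumption: the "prediction" is simply the national mean for students
    with the same profile. A real analysis would use a richer model.
    """
    # Step 1: national average score for each demographic profile.
    by_profile = defaultdict(list)
    for _state, low_income, english_learner, score in students:
        by_profile[(low_income, english_learner)].append(score)
    predicted = {profile: mean(scores) for profile, scores in by_profile.items()}

    # Step 2: actual minus predicted for each student, averaged by state.
    gaps_by_state = defaultdict(list)
    for state, low_income, english_learner, score in students:
        gaps_by_state[state].append(score - predicted[(low_income, english_learner)])
    return {state: mean(gaps) for state, gaps in gaps_by_state.items()}


if __name__ == "__main__":
    print("raw:", raw_state_means(STUDENTS))
    print("adjusted:", adjusted_state_means(STUDENTS))
```

In this toy data, state B leads on raw averages (219 versus 213.5), while state A comes out slightly ahead after adjustment (about +0.5 versus -0.5), because A’s students outscore their demographic peers nationwide by a small margin. The reshuffle comes entirely from changing the question, not from changing any score.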

That shift produces a striking reshuffle: Mississippi jumps to the top, some traditional high performers drop, and several Southern states rise. Adjusting the data this way is both legitimate and important as an analytical move, much like adjusting hospital mortality rates for how sick the incoming patients are. It gives a more precise answer to a narrow question about how students in each state perform relative to a national baseline after accounting for observable characteristics.

But precision about a narrow question is not the same as an answer to the broader one about which states have the most effective school systems, and why. Reading the rankings as an answer to that broader question rests on assumptions the model cannot fully test.

Urban’s adjustments rely on crude proxies and contested demographic classifications. “Low income” is typically defined by the faulty free or reduced-price lunch eligibility metric. Poverty in rural Mississippi and poverty in suburban Connecticut are not the same material condition, even when they share a label. States also differ in how they classify students for special education and English learner services, decisions that affect funding and test inclusion. A state that labels more struggling students as special education can look unusually effective after controlling for special education status, even if nothing exceptional is happening in classrooms.

There is a deeper structural assumption as well. The model assumes that the effect of being low-income is the same everywhere. But a system where 70 percent of students live in poverty is not just a scaled-up version of one where 20 percent do. It is a different instructional environment in terms of teacher retention, peer effects, classroom management, and what it takes just to keep learning on track. When we hold individual characteristics constant, we do not hold that environment constant. The model treats what remains as school quality. That is an assumption, not a finding.

None of this makes the adjusted rankings useless. It means they are one reasonable way to slice the data, built on specific choices about what to measure and how. Small shifts in NAEP scores, often within normal sampling error, can move states several places, especially in the middle of the pack. The ordered list looks firmer than the underlying uncertainty justifies.
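To convince myself of that last point, I tried a small simulation. This is a sketch under loud assumptions: the six state labels and their “true” adjusted scores are invented, and the standard error of 1.0 point is a placeholder I chose, not a figure from NAEP or from Urban. The hypothetical function `rank_ranges` adds random noise to each score many times and records the best and worst rank each state lands in.

```python
import random

# Hypothetical adjusted scores (points above or below expectation).
# Made-up numbers for illustration; not taken from the Urban analysis.
TRUE_SCORES = {
    "State1": 4.0,
    "State2": 1.0,
    "State3": 0.5,
    "State4": 0.0,
    "State5": -0.5,
    "State6": -4.0,
}


def rank_ranges(true_scores, standard_error, draws=1000, seed=0):
    """Return {state: (best_rank, worst_rank)} across noisy re-rankings.

    Each draw adds independent Gaussian noise with the given standard
    error to every state's score, then ranks states from highest (1)
    to lowest. The seed makes the simulation repeatable.
    """
    rng = random.Random(seed)
    best = {state: len(true_scores) for state in true_scores}
    worst = {state: 1 for state in true_scores}
    for _ in range(draws):
        noisy = {
            state: score + rng.gauss(0, standard_error)
            for state, score in true_scores.items()
        }
        ordered = sorted(noisy, key=noisy.get, reverse=True)
        for rank, state in enumerate(ordered, start=1):
            best[state] = min(best[state], rank)
            worst[state] = max(worst[state], rank)
    return {state: (best[state], worst[state]) for state in true_scores}


if __name__ == "__main__":
    for state, (best, worst) in rank_ranges(TRUE_SCORES, standard_error=1.0).items():
        print(f"{state}: best rank {best}, worst rank {worst}")
```

With the noise set to zero, every state keeps a single fixed rank. With noise on the order of the gaps between the middle states, those states trade places from draw to draw even though nothing about their underlying performance has changed.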

So what should we take seriously?

First, trends over time. A state whose adjusted performance improves over multiple NAEP cycles is telling us more than a state that surfaces near the top in a single, post-pandemic snapshot. One or two flattering years amount to a hypothesis, not a conclusion.

Second, coherence across subjects, grades, and subgroups. If a state is genuinely beating demographic expectations, that signal should appear in math and reading, in fourth and eighth grade, and it should persist when we look specifically at Black students, low-income students, and English learners. Gains that disappear when we disaggregate are not the gains reformers should be touting.

By that standard, Mississippi is interesting not simply because it ranks highly on one adjusted list, but because its improvements show up in both raw and adjusted scores, appear concentrated in literacy, are plausibly tied to specific reading reforms, and have accumulated across several test cycles. There is a story there that goes beyond a statistical artifact.

I’m raising a question I can’t resolve here, and I’m not sure anyone can on this evidence alone.

The larger danger is not imperfect models, since all models are imperfect, but selective consumption. States that rise on a new metric trumpet it. States that fall ignore it. Rankings are quickly turned into causal stories and policy talking points that outrun what the evidence can honestly support.

Matthew Chingos, who led the Urban analysis, put it bluntly. “The states that get adjusted up love this. The states that get adjusted down ignore it.” That asymmetry is the signature of evidence being used as political capital rather than as a tool for learning.

An outcomes-driven approach to education policy should demand from our statistics what we demand from our reforms. We should want clarity about what is being measured, transparency about assumptions, and proof that results hold up across reasonable tests.

The adjusted NAEP rankings invite those questions without determining the answers. For me, welcoming the analysis while withholding the applause is not cynicism. It is what it looks like to take outcomes and evidence seriously.

* To be clear, this is my opinion and my attempt to understand an important issue. A meditation. Personal study. I make no claims to authority.