A 90-Year-Old Psychology Test Just Revealed Why AI Cannot Think Like Your Students

In 1935, a psychologist named John Ridley Stroop designed a test so simple that most adults can take it in under a minute: look at a word like "red" printed in blue ink, and say the color of the ink, not the word. The test became one of the most replicated instruments in cognitive psychology, used for nine decades to measure something neuroscientists call executive control — the brain's capacity to hold a goal in mind and resist an automatic, competing response. In June 2026, three researchers from Queens College, CUNY, and Texas A&M University ran the same test on the world's most advanced AI models. The results, published in the peer-reviewed journal PNAS Nexus, are not a minor technical footnote. They are a precise, measurable demonstration of the exact thing that should determine how every school in India designs its AI literacy curriculum starting this year.

The Test That Has Outlived Every AI Hype Cycle

The Stroop task is deceptively simple. Words for colors — red, blue, green — are displayed in colored ink. Sometimes the word and the ink color match. Sometimes they conflict: the word "red" printed in blue ink. The subject is asked to name the ink color while ignoring the word itself. For a human being, this creates genuine friction, because reading is so deeply automatic that the brain has to actively suppress it in order to focus on color. Despite that friction, healthy adult humans maintain high accuracy even on long lists of conflicting words, because executive control allows the brain to hold a single goal — name the color — steady against a constant pull toward a different, easier, more practiced response.

The research team, led by Suketu Chandrakant Patel along with Hongbin Wang and Jin Fan, set out to test something that had never been measured with this precision before: do large language models possess anything resembling this capacity? Their paper, titled "Deficient executive control in transformer attention," tested GPT-4o, GPT-5, Claude 3.5 Sonnet, Claude Opus 4.1, and Gemini 2.5 against word lists of increasing length, with mismatched colors and words interspersed throughout.

The findings, summarized by ScienceDaily on June 10, are stark in a way that pure benchmark scores rarely are. GPT-4o achieved 91 percent accuracy on five-word lists. At ten words, accuracy fell to 57 percent. At forty words, it dropped to just 15 percent. Claude 3.5 Sonnet held up longer, remaining stable through twenty-word lists, but then experienced a sharp decline to 24 percent accuracy on forty-word lists. The same pattern, with variations in where the collapse began, appeared across GPT-5, Claude Opus 4.1, and Gemini 2.5. When matching and mismatched items were interleaved within the same list, performance for the mismatched items fell toward zero in some conditions.

This is not a story about one model being worse than another. Every leading model, regardless of company or architecture, showed the same underlying signature of failure: strong performance on short, simple instances of the task, and a collapse that accelerated as the task grew longer and more demanding of sustained, conflict-resistant focus.

What Is Actually Happening Inside the Model

The explanation the researchers offer is precise and important to understand correctly. As the lists grew longer and the conflicting information accumulated, the models did not simply produce more random errors. They increasingly defaulted to reading the word rather than naming the ink color — reverting to the response that their training had made most automatic, exactly the behavior a human would also be tempted toward, except that humans suppress it and the models, under pressure, did not.

This distinction matters because it tells us something specific about how transformer-based attention differs from biological executive control. Human executive function is not simply about processing information; it is about actively maintaining a goal in working memory while continuously monitoring for, and suppressing, competing responses. It is a form of internal governance over one's own cognitive process. The study suggests that what current AI architectures do, even when their output looks remarkably like sustained reasoning, may not involve an equivalent governing mechanism. The model can describe what executive control is. It can write a textbook-quality paragraph about the Stroop effect. What it apparently struggles to do is exercise it under sustained, escalating pressure.

The researchers are careful to frame this as a finding about mechanism, not a verdict on intelligence broadly. But the practical implication for any institution placing AI into the hands of developing minds is the same either way: the tool that sounds the most confident is not always the tool exercising the most control.

Why This Should Matter Enormously to Every School Adopting AI

India's schools are, this year, in the middle of the most significant integration of AI into the K-12 curriculum the country has ever attempted. CBSE's AI and Computational Thinking mandate brings students into structured contact with AI tools and AI concepts from Class 3 onward. Parallel to that policy shift, students across India and the world are independently adopting general-purpose AI tools for homework help, research, and writing at a rate that has alarmed researchers and educators alike. A January 2026 survey by the American Association of Colleges and Universities and Elon University's Imagining the Digital Future Center, polling more than a thousand faculty members, found that 95 percent believe AI is making students excessively reliant on the technology, 90 percent believe it is decreasing students' critical thinking abilities, and 83 percent believe it is shortening their attention spans.

The Stroop study gives those concerns a mechanism, not just an impression. If AI systems themselves do not reliably sustain a single cognitive goal under the pull of competing information, then a student who outsources sustained, conflict-laden thinking to an AI system is not borrowing a stronger form of attention. They are borrowing a system that, by this measurement, exercises less of it than the student's own developing brain does.

This connects directly to a separate and now well-documented phenomenon researchers have termed "cognitive debt." A widely cited 2025 MIT Media Lab study led by Nataliya Kosmyna found that students who relied heavily on large language models for academic writing showed measurably reduced activation in brain regions associated with working memory, creativity, and executive function — and that when those same students were later asked to write without AI assistance, they showed impaired recall and remained anchored to the linguistic patterns the model had previously suggested to them. The pattern is consistent: cognitive capacities that are not actively exercised do not strengthen, and capacities that are persistently outsourced can measurably weaken.

The Difference Between Using AI and Outsourcing Thought to It

None of this is an argument against AI in classrooms. It is an argument for precision about what AI is good at and what it is not, and for designing every student-facing AI interaction around that precision rather than around convenience.

AI systems remain extraordinarily capable at tasks that reward pattern completion across vast trained knowledge — summarizing, drafting, translating, generating practice material, explaining a concept in multiple ways until one lands. These are exactly the tasks where a single, well-defined goal does not have to be sustained against escalating internal conflict over a long sequence. Where the Stroop research suggests caution is in any design that treats AI as a substitute for the specific cognitive labor of sustained, effortful reasoning — the kind a student needs to build through repeated practice, not bypass through delegation.

A 2026 systematic review published in Smart Learning Environments examining AI dialogue systems and student cognition reached a similar conclusion from a different angle: the educational value of AI assistance depends heavily on whether the tool is designed to provoke the student's own reasoning process or simply to complete the task on the student's behalf. The same underlying technology produces opposite educational outcomes depending entirely on this design choice.

This is precisely the distinction that should guide which AI tools schools choose to put in front of children, and how. A tool that answers a student's question is solving the school's short-term convenience problem. A tool that asks the student a better question in return — one that requires the student to hold a line of reasoning in mind, evaluate competing possibilities, and arrive at understanding through their own sustained effort — is solving the actual problem that AI literacy education exists to address.

Designing for Genuine Thinking, Not Just Convenient Answers

This principle sits at the foundation of how Cypher, AI Ready School's AI-powered active learning companion, was designed. Cypher is built specifically to guide students' thinking rather than to deliver finished answers — surfacing follow-up questions, asking a student to justify a step before moving to the next one, and creating exactly the kind of structured friction that builds independent reasoning capacity rather than eroding it. Where a general-purpose AI chatbot will, by default, produce the most complete and convenient answer it can, Cypher is designed to do the opposite: to make a student do more of their own cognitive work, not less, while still removing the genuine barriers — confusion, lack of confidence, absence of a person to ask — that prevent a student from engaging with a problem at all.

This is not a cosmetic design preference. It is a direct response to exactly the kind of evidence the Stroop study and the cognitive debt research point toward. If sustained, effortful, conflict-resistant reasoning is a capacity that strengthens through practice and weakens through disuse, then the single most consequential design decision any school makes when adopting AI is whether the tool in front of its students removes the difficulty from thinking, or removes the unnecessary barriers around it while preserving the difficulty that actually builds capability.

The distinction also reframes what AI literacy should actually teach. A curriculum that teaches students to write effective prompts and recognize AI-generated content is necessary but insufficient. A complete AI literacy curriculum should teach students, explicitly, that AI systems — for reasons now demonstrated in peer-reviewed, replicable research — do not reason the way human brains do, are not uniformly reliable across all types of cognitive tasks, and require active human judgment precisely in the long, complex, conflict-laden situations where their performance is least dependable. That is a more sophisticated and more honest form of AI literacy than simply teaching tool fluency, and it is consistent with the broader ambition of CBSE's own curriculum framework, which explicitly aims to develop critical, ethical reasoners about technology rather than passive users of it.

What This Means for the Classroom This Term

For teachers integrating AI into lesson delivery, the practical takeaway is not to avoid AI for complex, multi-step tasks. It is to maintain visibility into how a student arrived at a piece of work, not only what the work contains. A research project, a long-form essay, or a multi-step mathematics proof are exactly the categories of task where sustained reasoning under competing considerations matters most — and exactly the categories where, per this research, AI assistance is most likely to look fluent while actually reasoning least reliably.

This is one of the reasons real-time visibility into student learning has become a foundational design requirement, not an add-on feature, in how Morpheus supports teachers. A teacher who can see where a student's understanding is genuinely solid versus where it has been quietly outsourced to an AI-generated answer is in a position to intervene before a gap calcifies into a dependency. That visibility, paired with Zion's structured, school-governed environment for student AI use, allows schools to give students real access to AI's genuine strengths — drafting, explaining, practicing — while keeping a human eye on the parts of learning that AI, by its own measured nature, cannot yet reliably replace.

For school leaders evaluating which AI tools to bring into their institutions this year, the Stroop research offers a useful, concrete question to ask of any vendor: does this tool's design assume that fluency equals reliability? A model can produce a perfectly articulate, confident-sounding response while its underlying attention mechanism is, by the measure of this study, on the verge of collapse. Schools that select tools purely on the basis of how polished the output sounds are selecting on exactly the wrong signal.

A More Honest Starting Point for AI in Education

There is a tendency, in both marketing and in well-meaning enthusiasm, to describe AI in classrooms as a kind of tireless tutor — endlessly patient, endlessly knowledgeable, available at any hour. Much of that is true and valuable. But "tireless" should never be confused with "reliable under cognitive load," and this study is one of the clearest empirical demonstrations yet that those are different properties entirely.

The honest version of AI in education does not require abandoning the technology. It requires building every product, every classroom practice, and every curriculum decision on an accurate model of what the technology actually does well and where it genuinely struggles — informed by research like this, rather than by the smoothness of a chatbot's prose. A ninety-year-old psychology test, run on the newest AI models in the world, just gave educators a precise, reproducible way to see the difference.

The measure of a good AI tool in a school is not how convincingly it sounds like it is thinking. It is whether it leaves the student a better thinker than it found them.

AI Ready School provides a complete K-12 AI ecosystem — Cypher (personalised learning companion that guides thinking rather than replacing it), Morpheus (AI teaching agent with real-time learning visibility), Zion (safe, governed AI tool suite for students), NEO (AI Innovation Labs), and Matrix (sovereign AI infrastructure) — built on the principle that AI should strengthen independent thinking, not substitute for it.

To see how AI Ready School designs AI literacy and AI-assisted learning around genuine cognitive development, reach out at hey@aireadyschool.com or call +91 9100013885.

‍