AI tools are transforming the production function for economics research, and rapidly changing both the skills required of a research assistant and how we assess those skills when hiring. At the World Bank's Development Impact department, objective assessment of coding skills is an essential part of the hiring process. We have traditionally relied on take-home coding tests in Stata or R: applicants are provided a dataset and assignment and given a set number of hours to return their code and outputs. For a long time, successfully completing the test and producing clear, reproducible, functional code served as a reasonable proxy for technical competence.
That assumption no longer holds. In recent recruitment rounds, we observed that tasks which once required candidates to reason carefully about data structures, syntax, and edge cases can now be completed end-to-end with AI assistance. A 2023 World Bank blog post on AI-generated Stata code noted that ChatGPT’s performance on Stata tasks was mixed compared with other languages. However, that landscape has changed rapidly, and the Stata tooling ecosystem has evolved to include context-aware, agentic assistance. Tools such as Claude Code and Stata MCP, which integrate AI-assisted coding into a statistical computing environment, have substantially improved how AI systems work with Stata codebases and data.
How AI performs on the software test
To understand what this means concretely, we ran several common AI configurations directly on a past version of our software test, in both R and Stata.
Without an execution environment, Copilot scored 55.5% in R and 48.9% in Stata (the latter requiring a few manual fixes to run because Copilot cannot execute Stata code directly, Stata being proprietary). In other words, a widely available AI tool with no additional setup already performs at the level of a median applicant.
The picture changes substantially with an agentic setup, i.e. asking an AI agent to write code in a computing environment that gives the model direct access to the working directory, such as VS Code, Positron, or Claude Code. A script generated by Codex (OpenAI) running inside Positron scored 78% in R and 77% in Stata; scripts generated by Claude Code scored 80% in R and 81% in Stata.
The gap between the two configurations reflects what the execution environment adds: the model can read data files, run code, observe errors, and iterate, rather than generating code blindly.
Until recently, we assumed that Stata test results would be less affected by AI coding tools. Chat-based AI has always been weaker in Stata, given that Stata is proprietary software with a smaller public footprint. However, newer agentic coding tools have changed the calculus: Stata and R scores are nearly identical once the agent has the opportunity to iterate and improve its code.
Why did the AI-generated script not achieve a score of 100%? Our grading rubric evaluates how a research assistant should approach an unfamiliar dataset: for example, analyzing and visualizing variables, addressing outliers and missing values, and demonstrating careful data exploration. Many of these steps are not strictly required to complete the test tasks, so the model skips them. The model's goal is to solve the task efficiently, while a candidate's goal is to showcase methodological rigor and domain judgment.
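As a rough illustration of the exploration steps the rubric rewards, the sketch below counts missing values and flags outliers for a single variable before any analysis is run. It is written in Python purely for illustration (the test itself is in Stata or R), and the function name `explore`, the `income` values, and the choice of the 1.5 × IQR rule are all assumptions made for this example, not part of our rubric.

```python
import statistics


def explore(values):
    """Summarize one numeric variable: missing count, median, IQR-rule outliers."""
    # Treat None as a missing response and keep only observed values.
    observed = [v for v in values if v is not None]
    missing = len(values) - len(observed)

    # Quartiles of the observed values (default 'exclusive' method).
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Flag values outside the 1.5 * IQR fences (an assumed, common rule of thumb).
    outliers = [v for v in observed if v < low or v > high]

    return {
        "n": len(values),
        "missing": missing,
        "median": statistics.median(observed),
        "outliers": outliers,
    }


# Hypothetical values for illustration only; None marks a missing response.
income = [120, 135, 128, None, 142, 131, 9999, 125, None, 138]
summary = explore(income)
print(summary)
```

A check like this is not needed to produce the requested outputs, which is exactly why a model optimizing for task completion tends to skip it; whether the flagged value is a data-entry error or a genuine observation remains a judgment call for the analyst.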
AI changes how we code, not what we need to know
High scores do not tell the whole story. Research on AI-generated code in software development documents recurring quality problems that carry over directly to data science: AI-generated solutions tend to be longer and more redundant than necessary, exhibit irregular structural patterns that deviate from human coding idioms, and do not consistently minimize complexity. A review of the Positron-generated Stata script confirmed this: while the code scored 76.6% and ran correctly, a Stata expert identified it as unnecessarily complex, with elaboration that obscured rather than clarified analytical intent. In applied research, producing working code is not the same as producing maintainable, research-grade code. This matters because it points to what the software test is still measuring, and what it is not.
AI has changed how code is written, but it has not removed the need for coding knowledge; what has shifted is where that knowledge matters most. The skill of recognizing when AI output is unnecessarily complex or poorly structured is precisely what a good research assistant needs. But applying a grading rubric that explicitly penalizes these features (or bugs) requires expert reviewers with time to read scripts carefully, which limits scalability. Even expert reviewers face genuine ambiguity about what they are observing.
The assessment design space is still open
One response to these findings is to restrict AI use during the assessment itself, through controlled environments, or tools that detect or block AI assistance. This would preserve the signal value of the software test as currently designed. But, besides raising practical challenges around implementation and fairness, this approach would sidestep a more fundamental issue: using AI is now part of how code is written, and a candidate who cannot leverage these tools effectively may be less prepared for the job, not more.
The more productive question is: what kind of assessment can reveal essential judgment that AI cannot replace? Our hypothesis is that we need to observe how candidates interact with AI tools themselves. In our next recruitment round, we will ask candidates to submit AI chat transcripts alongside their code, to understand how tools were used and corrected during the task, and we will include a human-in-the-loop score in the grading rubric. This reflects that once AI is part of the coding workflow, it is important to evaluate not only the final script but also the tools and prompts used to develop it.