Andrej Karpathy Open-Sources ‘Autoresearch’: A 630-Line Python Tool That Enables AI Agents to Run Autonomous ML Experiments on Single GPUs

Andrej Karpathy has released autoresearch, a minimalist Python tool designed to let AI agents run machine learning experiments autonomously. The project is a stripped-down version of the nanochat LLM training core, condensed into a single file of roughly 630 lines of code. It is designed to run on a single NVIDIA GPU.

Autonomous Iteration Loop

The framework establishes a clear division of labor between the human researcher and the AI agent. The system operates in a continuous feedback loop, with progress tracked via git commits on a feature branch.

Component	Responsibility	File format
Human	Iterates on the high-level research instructions and constraints.	`.md` (Markdown)
AI agent	Proposes and implements modifications to the training script.	`.py` (Python)
Execution	Runs a fixed-length training job to evaluate each change.	Shell/Python

The agent reads the human-provided instructions, modifies the training code (adjusting the neural network architecture, optimizer, or hyperparameters), and runs a training job of fixed length: five minutes.

Evaluation and Validation Metrics

To ensure that the agent only stores beneficial changes, the system uses bits per byte (BPB) as a primary validation metric. BPB measures the compression efficiency of a model on a validation dataset; lower scores indicate a more accurate model.
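As a rough illustration of the metric, bits per byte can be derived from a model's summed cross-entropy loss. The sketch below assumes the loss is available as a total negative log-likelihood in nats over the validation text; the function name and arguments are hypothetical and are not taken from the autoresearch code.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte.

    total_nll_nats: sum of per-token cross-entropy losses over the validation text.
    total_bytes: number of bytes in that same text (not the number of tokens).
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # normalizes the result so it does not depend on the tokenizer.
    return total_nll_nats / (math.log(2) * total_bytes)


# Hypothetical numbers: 1,000 tokens covering 4,000 bytes, mean loss 2.7 nats/token.
example_bpb = bits_per_byte(total_nll_nats=2.7 * 1000, total_bytes=4000)
print(f"{example_bpb:.3f} bits per byte")
```

Because the score is normalized by bytes rather than tokens, it stays comparable even if an experiment changes the tokenizer or vocabulary size.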

Validation Protocol: The agent commits code changes to the git branch only if the final BPB score is lower than the previous best.
Observed Performance: In initial runs, Karpathy demonstrated the agent reducing validation loss from 1.0 to 0.97 BPB through autonomous code iteration.
Granularity: Every completed 5-minute training run is recorded as a data point, allowing researchers to compare the effectiveness of different prompts or agent settings over time.
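The keep-only-if-better rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's actual code: `evaluate` is a hypothetical stand-in for "apply the change and run the 5-minute training job", and the git commit or revert is represented only by comments.

```python
from typing import Callable, Iterable, List, Tuple


def keep_if_better(
    candidates: Iterable[str],
    evaluate: Callable[[str], float],
    best_bpb: float,
) -> Tuple[List[Tuple[str, float]], float]:
    """Accept a candidate change only if it lowers the best validation BPB so far."""
    accepted: List[Tuple[str, float]] = []
    for name in candidates:
        bpb = evaluate(name)  # hypothetical: one fixed-length training run
        if bpb < best_bpb:
            # In a real setup, this is where the change would be committed.
            best_bpb = bpb
            accepted.append((name, bpb))
        # Otherwise the change is discarded and the working tree stays as it was.
    return accepted, best_bpb


# Made-up scores standing in for real training runs.
fake_scores = {"wider-mlp": 0.99, "higher-lr": 1.01, "new-schedule": 0.97}
accepted, best = keep_if_better(fake_scores, fake_scores.__getitem__, best_bpb=1.0)
print(accepted, best)
```

With these made-up scores, the second candidate is rejected because it scores worse than the current best, while the first and third each set a new best and are kept.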

Case Study: Implementation by Shopify’s Tobi Lutke

After the release, Shopify CEO Tobi Lutke adapted the autoresearch framework for an internal project. By letting the agent iterate on the architecture of a small model, Lutke reported a 19% improvement in validation scores. Notably, the small model optimized by the agent ended up outperforming a larger model that had been configured through standard manual methods.

OK this thing is absolutely crazy. Before going to bed I…

* used try to create a new qmdresearch directory
* I told my pi to read this github repo and make a version of that model to increase the qmd query with the goal of high quality and speed. Get training data at… https://t.co/hbCfD62ElJ

— tobi lutke (@tobi) March 8, 2026

Karpathy noted that some code tweaks found by the agent were later merged back into his broader nanochat framework, which shows that the tool can find optimizations that carry over to larger-scale systems.

I’ve put together an “autoresearch” project and made a small repo that contains it if people want to play over the weekend. It is the core of nanochat LLM training reduced to one GPU, one file version of ~630 lines of code, then:

-Things that affect people… pic.twitter.com/3tyOq2P9c6

— Andrej Karpathy (@karpathy) March 7, 2026

Technical Significance for Engineers

For developers, autoresearch represents a shift toward an ‘agentic’ workflow in model development. Rather than tuning hyperparameters by hand, the engineering task becomes prompt engineering: instructing the agent so that it navigates the search space more efficiently. The ~630-line limit ensures that the entire codebase fits within the context window of modern LLMs, reducing errors in code generation and allowing the agent to maintain a complete view of the training script.

Key Takeaways

Autonomous Research Loop: The framework lets AI agents iterate on ML experiments autonomously by reading a human-provided Markdown (.md) instruction file and modifying a Python (.py) training script without manual intervention.
~630-Line Core: Because the nanochat LLM training core is stripped down to a single file of ~630 lines, the codebase is small enough to fit entirely within an LLM's context window, reducing code generation errors.
Metric-Driven Commits: The agent runs fixed 5-minute training jobs on a single NVIDIA GPU and commits code changes to the git feature branch only if they lower the validation bits-per-byte (BPB) score.
Proven Performance Gains: In a real-world test (as described in his tweet), Shopify CEO Tobi Lutke used the tool to achieve a 19% improvement in validation scores, resulting in a smaller, agent-optimized model that outperformed a larger, manually configured one.
Shift in Engineering Focus: The project moves the engineer's role from manual hyperparameter tuning to agent engineering, where the goal is to write prompts that guide the AI toward the most effective neural architectures and training settings.
