Artificial intelligence

Meta and Harvard Researchers Introduce Confucius Code Agent (CCA): A Software Engineering Agent That Can Work on Large Codebases

How far can the centralized language model go if real innovation moves from the core to the agent framework and tool stack? Meta and Harvard researchers released Confucius Code Agent, an open source AI software developer built on the Confucius SDK designed for software repositories of scales and runtimes. The program targets real GitHub projects, complex testing toolchains during testing, and reproducible results in benchmarks such as SWE Bench Pro and SWE Bench Verified, while presenting a full scaffolding for developers.

Confucius SDK, scaffolding around the model

The Confucius SDK is an agent development platform that treats scaffolding as a core design issue rather than a thin wrapper around a language model. Arranged around 3 axes, Agent experience, user experience, and developer experience.

Agent experience controls what the model sees, including content structure, working memory and tool outputs. User Experience focuses on readability, code diversity and security for human developers. Developer Experience focuses on detecting, detecting and correcting the agent’s own errors.

The SDK introduces 3 main modes, an integrated orchestrator with phase performance memory, a continuous note-taking system, and an extended tool interface. The meta agent then automates the integration and development of the agent configuration through a build, test, upgrade loop. Confucius Code Agent is one concrete implementation of this software engineering scaffold.

Hierarchical working memory for long-horizon coding

Real software tasks in SWE Bench Pro often require thinking about multiple files and multiple interaction steps. The orchestrator in the Confucius SDK maintains a hierarchical working memory, which divides the trajectory into scopes, summarizes the previous steps and stores the compressed context for later evolution.

This design helps keep information within the bounds of the content model while preserving important artifacts such as patches, error logs, and design decisions. The bottom line is that effective tool-based coding agents require a transparent memory structure, not just a sliding window of past messages.

Persistent note-taking for reading at different times

The second method is a note-taking system that uses a dedicated agent to write structured Markdown notes from the action sequence. These notes capture specific work strategies, cache protocols and common failure modes, and are stored as long-term memory that can be reused at all times.

The research team used Confucius Code Agent twice in 151 instances of SWE Bench Pro and Claude 4.5 Sonnet. In the first phase the agent solves tasks from scratch and generates notes. In the second run the agent reads these notes. In this setting, the average spin drops from 64 to 61, token usage drops from about 104k to 93k, and Resolve@1 improves from 53.0 to 54.4. This shows that the notes are not just logs, they act as a working memory of the session.

Modular extensions and tooling are complex

The Confucius SDK exposes tools as extensions, for example file editing, command execution, test runners and code searching. Each extension can maintain its own status and quick threads.

The research team is studying the impact of tool usage complexity using ablation on a 100-sample subset of SWE Bench Pro. For Claude 4 Sonnet, going from a configuration without enhanced context features to enhanced context raises Resolution @ 1 from 42.0 to 48.6. With the Claude 4.5 Sonnet, the simple tool use configuration reaches 44.0, while the rich tool handling reaches 51.6, and 51.0 for the medium variant. These numbers show that the way the agent chooses and ranks the tools is as important as the choice of backbone model.

Meta Agent for automated agent design

On top of these methods, the Confucius SDK includes a meta agent that takes the agent’s natural language specification and recursively suggests configuration, information and extension sets. It then deploys the candidate agent to the tasks, checks the tracking and metrics, and organizes the optimization in the design, testing, improvement loop.

The Confucius Code Agent that the research team is testing is generated with the help of this meta agent, rather than just being manually tuned. This approach turns one of the agent engineering process itself into an optimization problem guided by LLM.

Results on SWE Bench Pro and SWE Bench Verified

The main test uses SWE Bench Pro, which has 731 GitHub issues that require real repositories to pass the test. All the compared systems share the same storage, tooling and testing harnesses, so the differences come from the scaffolds and models.

In SWE Bench Pro, the reported score is Solve @ 1

  • Claude 4 Sonnet and SWE Agent, 42.7
  • Claude 4 Sonnet and Confucius Code Agent, 45.5
  • Claude 4.5 Sonnet and SWE Agent, 43.6
  • Claude 4.5 Sonnet with Live SWE Agent, 45.8
  • Claude 4.5 Sonnet and Confucius Code Agent, 52.7
  • Claude 4.5 Opus with Anthropic system card scaffold, 52.0
  • Claude 4.5 Opus by Confucius Code Agent, 54.3

These results show that a strong scaffold with a middle tier model, Claude 4.5 Sonnet with Confucius Code Agent at 52.7, can outperform a strong model with a weak scaffold, Claude 4.5 Opus with 52.0.

In SWE Bench Verified, Confucius Code Agent with Claude 4 Sonnet reaches Resolve@1 74.6, compared to 66.6 for SWE Agent and 72.8 for OpenHands. The different mini SWE Agent with Claude 4.5 Sonnet reaches 70.6, which is also below the Confucius Code Agent with Claude 4 Sonnet.

The research team also reports performance as a fixed file count function. For tasks editing 1 to 2 files, Confucius Code Agent achieves 57.8 Resolve@1, for 3 to 4 files it achieves 49.2, for 5 to 6 files it achieves 44.1, for 7 to 10 files it achieves 52.6, and for more than 10 files it achieves 44. This shows stable behavior for multiple file changes in large codebases.

Key Takeaways

  • Scaffolding can exceed the size of the model: Confucius Code Agent shows that with strong scaffolding, Claude 4.5 Sonnet reaches 52.7 Resolve@1 in SWE-Bench-Pro, surpassing Claude 4.5 Opus with weak scaffolding at 52.0.
  • Hierarchical working memory is important for long-horizon encoding: The Confucius SDK orchestrator uses hierarchical working memory and context compression to manage long trajectories over large repositories, rather than relying on simple history.
  • Persisting notes act as a reverse time active memory: In 151 SWE-Bench-Pro jobs with Claude 4.5 Sonnet, reusing structured notes reduces turns from 64 to 61, token usage from about 104k to 93k, and increases Resolve @ 1 from 53.0 to 54.4.
  • Tool configuration has a significant impact on success rates: In the 100 sub-system of SWE-Bench-Pro, going from a simple to a richer tool with Claude 4.5 Sonnet increases the Resolve@1 from 44.0 to 51.6, showing that the learned routing and recovery techniques are a big performance protector, not just implementation details.
  • Meta Agent automates agent configuration and tuning: A meta-agent iteratively raises information, toolsets and settings, then evaluates and organizes them by building, testing, improving the loop, and producing Confucius Code Agent itself is produced through this process rather than just manual tuning.

Check it out THE PAPER IS HERE. Also, feel free to follow us Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to Our newspaper. Wait! are you on telegram? now you can join us on telegram too.

Check out our latest issue of ai2025.deva 2025-centric analytics platform that transforms model implementations, benchmarks, and ecosystem activity into structured datasets that you can sort, compare, and export.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the power of Artificial Intelligence for the benefit of society. His latest endeavor is the launch of Artificial Intelligence Media Platform, Marktechpost, which stands out for its extensive coverage of machine learning and deep learning stories that sound technically sound and easily understood by a wide audience. The platform boasts of more than 2 million monthly views, which shows its popularity among viewers.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button