Race to build AI software engineers

Dariusz Semba
#generativeai #softwareengineering #llms #aiagents
AI software engineer and researchers

The race to build a truly useful AI software engineer is on! 🔥🔥🔥

In just 3 weeks, we’ve witnessed multiple independent breakthroughs in AI agents solving real-world GitHub issues.

Timeline of events:

Although the results may seem similar, in each case researchers found their own unique ways to improve on the benchmark.

AI research is a constant race

Was it the release of Devin that really forced everyone else to reveal their progress?
Or had the release of the SWE-bench and/or general advancements in AI set off a timed explosion of developments? 🤔

The announcement of Devin surely put pressure on the rest of the researchers. The authors of SWE-agent even decided to release their code and results early, and have yet to publish their paper.
At the same time, SWE-agent being open-source turns up the heat as well, pushing Cognition Labs (Devin’s creators) to improve the technology further and make it accessible to everyone.
The techniques developed by the researchers vary across solutions, so there may be potential for combining their learnings to obtain even better results. 📈

Importance of benchmarks

Benchmarks allow us to evaluate and compare different systems. They measure progress and thereby point toward an end goal. Just like any dataset we train our models on, a benchmark should reflect the real problem, embodying all the complexity and real-world challenges to be solved.

Every time an AI model reaches >90% accuracy on a given benchmark, another, more challenging one is created, on which the same model performs poorly.

SWE-bench is an attempt to apply LLMs in a real-world software engineering setting. It’s definitely challenging: a simple RAG pipeline with GPT/Claude doesn’t yield good results.
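To make the "simple RAG" baseline concrete, here is a minimal sketch of its retrieval step: rank repository files by overlap with the issue text, then stuff the top hits into a patch-generation prompt. Real baselines use BM25 or embeddings over the whole repository; the function names here are illustrative, not SWE-bench's actual API.

```python
# Minimal sketch of a RAG-style baseline for issue solving (illustrative names).
# We rank files by naive token overlap with the issue text, then build a prompt
# that asks an LLM for a patch. Real systems use BM25/embedding retrieval.

def tokenize(text):
    return set(text.lower().split())

def retrieve_files(issue_text, repo_files, k=2):
    """Return the k file paths whose contents share the most tokens with the issue."""
    issue_tokens = tokenize(issue_text)
    scored = sorted(
        repo_files.items(),
        key=lambda item: len(issue_tokens & tokenize(item[1])),
        reverse=True,
    )
    return [path for path, _ in scored[:k]]

def build_prompt(issue_text, repo_files, k=2):
    """Assemble a patch-generation prompt from the issue and retrieved context."""
    context = "\n\n".join(
        f"# {path}\n{repo_files[path]}"
        for path in retrieve_files(issue_text, repo_files, k)
    )
    return (
        "You are fixing a GitHub issue. Produce a unified diff.\n\n"
        f"Issue:\n{issue_text}\n\nRelevant files:\n{context}"
    )
```

The resulting prompt would then be sent to GPT-4 or Claude; as the SWE-bench results show, this retrieve-and-prompt approach resolves only a tiny fraction of issues.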

Researcher benchmarking his AI software engineer

Limitations of SWE-bench

SWE-bench itself is not ideal. It comprises 2,294 examples from 12 popular open-source projects, all written in Python. It covers only a small, specific domain of problems, leaving out other languages and the kinds of problems often encountered in enterprise settings.

Some SWE-bench examples (e.g. those coming from a certain repository) may be less difficult than others. The MAGIS framework scores from about 0% on some repositories to as high as 40% accuracy on one specific repository.

Evaluating against SWE-bench can get costly in some scenarios: the more complex the agent, the more prompts/requests to LLMs under the hood. To address this problem, the authors created SWE-bench Lite, a subset of examples that is still diverse.

Will programmers get replaced soon?

14% doesn’t seem like a lot. On the other hand, if you could make even just 14% of bugs disappear simply by waving a magic wand, that would be a big deal.

Unfortunately (or fortunately for programmers), that’s not the case.

The SWE-bench dataset uses human-written tests to verify that the bugs were solved correctly by AI. Without those tests, you would end up with a lot of useless, potentially detrimental, changes in your code.
One solution would be to write tests for every issue first; only then could you actually resolve a given issue automatically 14% of the time. Does that translate to 14% time savings? Not necessarily. Writing tests requires knowing what went wrong, and that constitutes most of the work.
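The test-based verification described above can be sketched as follows. SWE-bench records which tests failed before the fix and which passed; a candidate patch counts as "resolved" only if the former now pass and the latter still do. This is a hedged sketch, not the benchmark's actual harness code.

```python
# Sketch of SWE-bench-style verification: after applying a candidate patch,
# the issue counts as resolved only if the tests that failed before the fix
# now pass ("fail-to-pass") and previously passing tests still pass.
# The run_tests callable and test IDs here are illustrative.

def is_resolved(run_tests, fail_to_pass, pass_to_pass):
    """run_tests(test_id) -> True if that test passes after applying the patch."""
    fixed = all(run_tests(t) for t in fail_to_pass)      # the bug is actually gone
    no_regressions = all(run_tests(t) for t in pass_to_pass)  # nothing else broke
    return fixed and no_regressions
```

Without the human-written `fail_to_pass` tests, there is no signal separating a genuine fix from a plausible-looking but useless patch, which is exactly the dependence on humans described above.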

Other alternatives would be to:

What do you think? Should engineers be concerned, or is there cause for excitement?
