AI can fix bugs – but it can’t find them: OpenAI study highlights the limits of LLMs in software engineering

Large language models (LLMs) may have transformed software development, but companies should think twice before simply replacing human software engineers with LLMs, despite claims that the models can replace “low-level” engineers.

In a new paper, OpenAI researchers describe how they developed an LLM benchmark called SWE-Lancer to test how much foundation models can earn from freelance software engineering tasks. The test showed that, although the models can fix bugs, they cannot see why the bug exists in the first place and go on to make more mistakes.

The researchers gave three LLMs (OpenAI’s GPT-4o and o1, and Anthropic’s Claude 3.5 Sonnet) 1,488 freelance software engineering tasks from the freelance platform Upwork, worth a combined $1 million. They divided the tasks into two categories: individual contributor tasks (resolving bugs or implementing features) and management tasks (in which the model acts as a manager who selects the best proposal to solve a problem).

“The results show that the real-world freelance work in our benchmark remains challenging for frontier language models,” the researchers write.

The test shows that foundation models cannot fully replace human engineers. While they can help with fixing bugs, they are not yet at the level where they could earn freelance money on their own.

Benchmarking freelancing models

The researchers, together with 100 other professional software engineers, identified potential tasks on Upwork and, without changing any wording, fed them into a Docker container to create the SWE-Lancer dataset. The container has no internet access and cannot reach GitHub, “to avoid the possibility of models scraping code diffs or pull request details,” they said.
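To make that setup concrete, here is a minimal sketch (not the paper’s actual harness) of how such an offline evaluation container could be launched with the Docker SDK for Python; the image name, command and mount path are hypothetical placeholders:

```python
import docker  # Docker SDK for Python (pip install docker)

# Hypothetical image, command and mount path, for illustration only.
IMAGE = "swe-task-env:latest"
TASK_SNAPSHOT = "/tmp/task_snapshot"

client = docker.from_env()

# network_disabled=True cuts off all network traffic, so the model under test
# cannot reach GitHub or scrape solution details during evaluation.
container = client.containers.run(
    IMAGE,
    command="python run_task.py",
    network_disabled=True,
    volumes={TASK_SNAPSHOT: {"bind": "/workspace", "mode": "rw"}},
    working_dir="/workspace",
    detach=True,
)

result = container.wait()          # block until the task run finishes
print(container.logs().decode())   # inspect the run output
print("exit code:", result["StatusCode"])
```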

The team identified 764 individual contributor tasks, worth a total of $414,775, ranging from 15-minute bug fixes to week-long feature requests. The management tasks, which involved reviewing freelancer proposals and job postings, would pay out $585,225.

The tasks came from the expense management platform Expensify.

The researchers generated prompts based on the task title and description, along with a snapshot of the codebase. If there were additional proposals for resolving the issue, “we also generated a management task using the issue description and list of proposals,” they said.
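As a rough sketch of how such prompt assembly might work (the field names and template wording here are assumptions, not the paper’s actual prompts), a task record could be turned into both an individual contributor prompt and a management prompt:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class UpworkTask:
    title: str
    description: str
    snapshot_path: str                       # where the codebase snapshot is mounted
    proposals: List[str] = field(default_factory=list)

def build_ic_prompt(task: UpworkTask) -> str:
    """Individual contributor prompt: fix the issue given the repo snapshot."""
    return (
        f"Task: {task.title}\n\n"
        f"Description:\n{task.description}\n\n"
        f"The repository snapshot is available at {task.snapshot_path}. "
        "Produce a patch that resolves the issue."
    )

def build_manager_prompt(task: UpworkTask) -> Optional[str]:
    """Management prompt: pick the best freelancer proposal, if any exist."""
    if not task.proposals:
        return None
    options = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(task.proposals))
    return (
        f"Issue: {task.title}\n\n"
        f"{task.description}\n\n"
        f"Candidate proposals:\n{options}\n\n"
        "Reply with the number of the proposal most likely to resolve the issue."
    )
```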

From there, the researchers moved on to developing end-to-end tests. They wrote Playwright tests for each task that applied the generated patches, which were then “triple verified” by professional software engineers.

“These end-to-end tests simulate real user flows, such as logging into the application, performing complex actions (making financial transactions) and verifying that the model’s solution works as expected,” the paper explains.
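A toy end-to-end check in that spirit might look like the sketch below, using Playwright for Python; the URL, selectors and credentials are hypothetical placeholders rather than SWE-Lancer’s real test code:

```python
# Toy end-to-end check using Playwright for Python
# (pip install playwright && playwright install).
from playwright.sync_api import sync_playwright, expect

def test_login_and_submit_expense() -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Real user flow: log into the application...
        page.goto("http://localhost:8080/login")          # placeholder URL
        page.fill("#email", "test-user@example.com")
        page.fill("#password", "not-a-real-password")
        page.click("button[type=submit]")

        # ...perform a complex action (a financial transaction)...
        page.click("text=New expense")
        page.fill("#amount", "42.00")
        page.click("text=Submit")

        # ...and verify that the patched behaviour works as a user would see it.
        expect(page.locator(".expense-row").first).to_contain_text("42.00")

        browser.close()

if __name__ == "__main__":
    test_login_and_submit_expense()
```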

Test results

After running the test, the researchers found that none of the models earned the full $1 million value of the tasks. Claude 3.5 Sonnet, the best-performing model, earned only $208,050 and resolved just 26.2% of the individual contributor issues. However, the researchers point out: “the majority of its solutions are incorrect, and higher reliability is needed for trustworthy deployment.”

Across the individual contributor tasks, Claude 3.5 Sonnet performed best, followed by o1 and GPT-4o.

“Agents excel at localizing, but fail to root cause, resulting in partial or flawed solutions,” the paper explains. “Agents pinpoint the source of an issue remarkably quickly, using keyword searches across the whole repository to quickly locate the relevant file and functions – often far faster than a human would. However, they often exhibit a limited understanding of how the issue spans multiple components or files, and fail to address the root cause, leading to solutions that are incorrect or insufficiently comprehensive. We rarely find cases where the agent aims to reproduce the issue or fails due to not finding the correct file or location to edit.”

Interestingly, the models performed better on the manager tasks, which required reasoning to evaluate technical understanding.

These benchmark tests show that AI models can solve some low-level coding problems but cannot yet replace low-level software engineers. The models still take time, often make mistakes, and cannot chase a bug down to find the root cause of a coding problem. Many low-level engineers still do the job better, although the researchers said this may not be the case for much longer.
