Can AI programming make $400,000?


Author: Tan Zixin, Head Technology


Large language models (LLMs) are changing the way software is developed, and whether AI can now replace human programmers at scale has become a topic of intense industry interest.

In just two years, AI models have gone from solving basic computer-science exercises to competing with top humans in international programming contests. OpenAI's o1, for example, competed in the 2024 International Olympiad in Informatics (IOI) under the same conditions as human contestants and won a gold medal, demonstrating strong programming potential.

At the same time, the pace of AI iteration is accelerating. On SWE-Bench Verified, a benchmark for evaluating code generation, GPT-4o scored 33% in August 2024; by the release of the new o3 generation of models, the score had more than doubled to 72%.

To better evaluate AI models' real-world software engineering capabilities, OpenAI has now open-sourced a new evaluation benchmark, SWE-Lancer, which for the first time ties model performance to monetary value.

SWE-Lancer is a benchmark of more than 1,400 freelance software engineering tasks drawn from the Upwork platform, with a total real-world payout value of about $1 million.

Features of the new benchmark

Task prices in SWE-Lancer reflect real market value: the harder the task, the higher the reward.

It includes both independent engineering tasks and management tasks in which the model must choose between competing technical implementation proposals. The benchmark therefore targets not just programmers but the entire development team, including architects and managers.

Compared to previous software engineering benchmarks, SWE-Lancer offers several advantages, such as:

  1. All 1,488 tasks carry the real rewards employers paid to freelance engineers, providing a natural, market-determined difficulty gradient, with payouts ranging from $250 to a substantial $32,000.

Thirty-five percent of the tasks are worth more than $1,000, and 34% fall between $500 and $1,000. The Individual Contributor (IC) Software Engineering (SWE) task group contains 764 tasks valued at $414,775, and the SWE Management task group contains 724 tasks worth a total of $585,225.

  2. In the real world, large-scale software engineering requires not only hands-on coding but also overall technical management. The benchmark uses real-world data to evaluate models acting as a project's SWE "technical lead".

  3. Advanced full-stack engineering evaluation. SWE-Lancer reflects real-world software engineering because its tasks come from platforms with millions of real users.

The tasks involve mobile and web engineering development, interaction with APIs, browsers, and external applications, as well as verification and reproduction of complex issues.

For example, one $250 task improves reliability (fixing a double-triggered API call), a $1,000 task fixes a bug (resolving permission discrepancies), and a $16,000 task implements a new feature (adding in-app video playback on web, iOS, Android, and desktop).

  4. Domain diversity. 74% of IC SWE tasks and 76% of SWE management tasks involve application logic, while 17% of IC SWE tasks and 18% of SWE management tasks involve UI/UX development.

In terms of difficulty, the tasks SWE-Lancer selects are highly challenging: on average, the corresponding issues in the open-source dataset on GitHub took 26 days to resolve.
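The $250 "double-trigger API call" fix mentioned above is a classic pattern: ignore a duplicate submission while the first request is still in flight. A minimal sketch in Python; the class and names are hypothetical, not taken from the actual Upwork task:

```python
class SubmitGuard:
    """Drop duplicate submissions while a request is still in flight."""

    def __init__(self, send):
        self._send = send        # the real API call
        self._in_flight = False

    def submit(self, payload):
        if self._in_flight:      # duplicate trigger: ignore it
            return None
        self._in_flight = True
        try:
            return self._send(payload)
        finally:
            self._in_flight = False
```

In a UI, a guard like this sits between the button handler and the network layer, so a double click fires the API only once.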

In addition, OpenAI says it reduced bias by selecting representative task samples from Upwork and hiring 100 professional software engineers to write and verify end-to-end tests for every task.
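End-to-end tests of this kind exercise the user-visible flow and assert on observable outcomes rather than implementation details. A toy illustration in Python; the app flow and test are invented for this sketch, not part of the benchmark:

```python
def checkout_flow(cart, pay):
    """Toy app flow under test: total the cart and charge once."""
    total = sum(cart.values())
    receipt = pay(total)
    return {"charged": total, "receipt": receipt}

def test_checkout_charges_once():
    charges = []
    def fake_gateway(amount):     # stand-in for the external payment API
        charges.append(amount)
        return f"rcpt-{len(charges)}"
    result = checkout_flow({"book": 12, "pen": 3}, fake_gateway)
    assert result == {"charged": 15, "receipt": "rcpt-1"}
    assert charges == [15]        # payment API was hit exactly once
```

Because the test only observes what a user (or external service) would see, a model's patch passes or fails regardless of how it restructures the internals.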

AI coding: an earning-power showdown

Although many tech leaders keep claiming that AI models can replace "low-level" engineers, whether companies can fully replace human software engineers with LLMs remains a big question mark.

The first round of results shows that on the full SWE-Lancer dataset, even today's gold-medal-level AI models earn far less than the potential total reward of $1 million.

Overall, every model performs better on SWE management tasks than on IC SWE tasks, and IC SWE tasks remain largely unconquered by AI models. The best-performing model tested is Claude 3.5 Sonnet, developed by OpenAI's competitor Anthropic.

On IC SWE tasks, every model's single-pass rate and earn rate are below 30%; on SWE management tasks, the best-performing model, Claude 3.5 Sonnet, scored 45%.

Claude 3.5 Sonnet performed strongly on both IC SWE and SWE management tasks, outperforming the second-best model, o1, by 9.7% on IC SWE tasks and 3.4% on SWE management tasks.

Converted into revenue, the top-performing Claude 3.5 Sonnet earned more than $400,000 in total across the full dataset.
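The money-weighted scoring behind figures like these can be sketched simply: each task either passes its tests and pays out in full, or pays nothing. This is an illustrative reading of the metric, not OpenAI's actual harness:

```python
def score(tasks):
    """tasks: list of (payout_usd, passed) pairs.
    Returns (single-pass rate, dollars earned, earn rate vs. total payout)."""
    total = sum(payout for payout, _ in tasks)
    earned = sum(payout for payout, passed in tasks if passed)
    pass_rate = sum(1 for _, passed in tasks if passed) / len(tasks)
    return pass_rate, earned, earned / total
```

A model can thus have a modest pass rate but a high earn rate if it happens to solve the expensive tasks, which is why reporting both numbers is informative.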

It is worth noting that more inference compute turns out to be very helpful for "AI earning money".

On IC SWE tasks, the researchers' experiments on the o1 model with deeper reasoning enabled showed that more inference compute raised the single-pass rate from 9.3% to 16.5%, lifted earnings from $16,000 to $29,000, and raised the earn rate from 6.8% to 12.1%.

The researchers conclude that the best model, Claude 3.5 Sonnet, solved 26.2% of the IC SWE problems, but most of its remaining solutions still contain errors and need substantial improvement before reliable deployment. o1 and then GPT-4o follow, and single-pass rates on management tasks are typically more than double those on IC SWE tasks.

This also means that, however hyped the idea of AI agents replacing human software engineers may be, companies should think twice: AI models can solve some "low-level" coding problems, but they cannot replace even "low-level" software engineers, because they do not understand why code errors exist and therefore go on to make further, cascading mistakes.

The current evaluation framework does not yet support multimodal input. The researchers also have not yet measured "return on investment", for example by comparing a freelancer's fee with the API cost of completing the same task; that will be the focus of the benchmark's next iteration.

Be an "AI-enhanced" programmer

For now, AI still has a long way to go before it can truly replace human programmers; after all, delivering a software engineering project is not as simple as generating code on demand.

For example, programmers often encounter extremely complex, abstract, and ambiguous customer requirements, which require an in-depth understanding of various technical principles, business logic, and system architecture.

In addition, programming is not just about implementing existing logic, but also requires a lot of creativity and innovative thinking. Programmers need to come up with new algorithms, design unique software interfaces and interaction methods, etc. These truly novel ideas and solutions are the weaknesses of AI.

Programmers often need to communicate and collaborate with team members, clients, and other stakeholders: understanding requirements and constraints from all parties, expressing their own views clearly, and working with others to complete projects. Human programmers can also continuously learn and adapt, quickly absorbing new knowledge and skills and applying them to real projects, whereas an AI model must be retrained and retested to keep up.

The software development industry is also subject to various legal and regulatory constraints, such as intellectual property rights, data protection, and software licensing. Artificial intelligence may have difficulty fully understanding and complying with these legal and regulatory requirements, thereby creating legal risks or liability disputes.

In the long run, AI progress will still displace some programmer positions, but in the short term "AI-enhanced programmers" are the mainstream, and fluency with the latest AI tools is one of the core skills of an excellent programmer.
