A complete victory over GPT-4, crushing the closed-source model in seconds! Code Llama's mysterious version exposed

Original source: Xinzhiyuan

Image source: Generated by Unbounded AI

Just two days after its release, Code Llama has once again ignited the AI coding revolution.

Remember the mysterious Unnatural Code Llama version that appeared in Meta's Code Llama paper, the one that nearly matches GPT-4?

AI guru Sebastian explained in his blog:

It is a version of Code Llama-Python 34B fine-tuned on 15,000 unnatural-language instructions.

By burying this detail in the paper, Meta seems to be hinting to the open source community that Code Llama has great potential, so go fine-tune it!

And now WizardCoder 34B, fine-tuned from Code Llama, has beaten GPT-4 outright on the HumanEval benchmark.

Specifically, WizardCoder scores 73.2% pass@1, beating the March version of GPT-4 (67%).

In addition, WizardCoder 34B outperforms the latest GPT-3.5 and Claude 2.

The programming model WizardCoder was released in June by Microsoft and Hong Kong Baptist University. Fine-tuned 13B and 7B versions are said to be coming soon.

According to Jim Fan, a senior AI scientist at Nvidia, this is essentially an open version of "Unnatural Code Llama".

While the benchmark numbers look good, HumanEval only tests a narrow distribution and may be overfit; what really matters is how models perform on natural, real-world coding tasks. Coding benchmarks need a major upgrade.

## A mysterious version of Code Llama was born?

On Friday, Meta officially open-sourced three versions of Code Llama.

In the HumanEval and MBPP benchmark results, many people noticed a version not mentioned in Meta's official announcement: Unnatural Code Llama.

This mysterious version achieved 62.2% on HumanEval pass@1.

The fine-tuned WizardCoder 34B released today reaches 73.2% on HumanEval pass@1.

According to its introduction, WizardCoder 34B is a version of Code Llama fine-tuned with the synthetic Evol-Instruct dataset.

The following is a visualization of the performance comparison with all open source and closed source models.

For the comparison with OpenAI's models, the researchers note that GPT-4 and ChatGPT-3.5 each have two HumanEval results:

The figures in OpenAI's official GPT-4 report (2023/03/15) are 67.0% and 48.1%, respectively, while the researchers' own tests with the latest API (2023/08/26) give 82.0% and 72.5%.
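For context, the pass@1 numbers above follow the standard HumanEval metric introduced with OpenAI's Codex paper. Here is a minimal sketch of the unbiased pass@k estimator; the n, c and k values below are made-up toy numbers, not the researchers' actual settings:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: probability that at least one of k
    # sampled completions is correct, given c correct out of n generated.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example: one problem, 10 samples, 3 of which pass the unit tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
# A benchmark score is the average of this value over all problems.
```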

In addition, the researchers stress that these results are 100% reproducible!

A demo of WizardCoder 34B is open for anyone to try.

It has also been pointed out that overfitting to public leaderboards is one of the main reasons open source models struggle in practice. WizardCoder's data preparation, for example, uses HumanEval pass@1 scores to decide whether to evolve the dataset further; optimizing against the test set defeats the purpose of the test set.

Just yesterday, researchers from Phind also fine-tuned Code Llama-34B to beat GPT-4 on the HumanEval evaluation.

## ChatGPT vs. Code Llama

How does Code Llama perform in actual coding tasks?

A netizen ran a head-to-head test of GPT-3.5 and Code Llama Instruct-34B, using the Code Llama 34B access provided by Perplexity.AI.

He fed the same 8 coding tasks to both models and compared the quality of the code they generated.

The result: GPT-3.5 won, 8:5.

The following are the specific test results.

Question 1

Use Python to accomplish this task: given two strings word1 and word2, merge them by adding letters in alternating order, starting with word1. If one string is longer than the other, append the remaining letters to the end of the merged string.

Output the merged string.

For example:

Input: word1 = "abc", word2 = "pqr" Output: "apbqcr"

Both GPT-3.5 and Code Llama completed it - 1:1
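For reference, a minimal Python sketch of one possible solution (not the output of either model):

```python
def merge_alternately(word1: str, word2: str) -> str:
    # Interleave characters from both strings, then append the leftover tail.
    merged = []
    for a, b in zip(word1, word2):
        merged += [a, b]
    shorter = min(len(word1), len(word2))
    merged.append(word1[shorter:] or word2[shorter:])
    return "".join(merged)

print(merge_alternately("abc", "pqr"))  # apbqcr
```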

Question 2

Use Python to accomplish this task: given a string s, reverse only the vowels in the string and return it.

The vowels are "a", "e", "i", "o" and "u", and they can appear multiple times in both lowercase and uppercase.

For example: input: s = "hello" output: "holle"

GPT-3.5 completed it, Code Llama did not - 2:1
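Again for reference, a minimal two-pointer sketch of one possible solution (not either model's actual output):

```python
def reverse_vowels(s: str) -> str:
    # Swap vowels inward from both ends, skipping consonants.
    vowels = set("aeiouAEIOU")
    chars = list(s)
    i, j = 0, len(chars) - 1
    while i < j:
        if chars[i] not in vowels:
            i += 1
        elif chars[j] not in vowels:
            j -= 1
        else:
            chars[i], chars[j] = chars[j], chars[i]
            i, j = i + 1, j - 1
    return "".join(chars)

print(reverse_vowels("hello"))  # holle
```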

Question 3

Use Python to accomplish this task: given an integer array nums, move all 0s to the end of it while maintaining the relative order of the non-zero elements.

Note that you have to do this in-place, without making a copy of the array.

For example: Input: nums = [0,1,0,3,12] Output: [1,3,12,0,0]

GPT-3.5 completed it, Code Llama did not - 3:1
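A minimal in-place sketch of one possible solution:

```python
def move_zeroes(nums: list[int]) -> None:
    # Compact non-zero values to the front, then zero-fill the rest, in place.
    write = 0
    for x in nums:
        if x != 0:
            nums[write] = x
            write += 1
    for i in range(write, len(nums)):
        nums[i] = 0

nums = [0, 1, 0, 3, 12]
move_zeroes(nums)
print(nums)  # [1, 3, 12, 0, 0]
```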

Question 4

Use Python to accomplish this task: you have a long flowerbed in which some plots are planted with flowers and some are not. However, flowers cannot be planted in adjacent plots.

Given an integer array flowerbed containing 0s and 1s, where 0 means empty and 1 means not empty, and an integer n, output true if n new flowers can be planted in the flowerbed without violating the no-adjacent-flowers rule, and false otherwise.

Example 1: Input: flowerbed = [1,0,0,0,1], n = 1 Output: true Example 2: Input: flowerbed = [1,0,0,0,1], n = 2 Output: false

Both models completed it - 4:2
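One possible greedy solution, sketched for reference:

```python
def can_place_flowers(flowerbed: list[int], n: int) -> bool:
    # Greedily plant in every empty plot whose neighbours are also empty.
    bed = [0] + flowerbed + [0]  # pad both ends so borders need no special case
    planted = 0
    for i in range(1, len(bed) - 1):
        if bed[i - 1] == 0 and bed[i] == 0 and bed[i + 1] == 0:
            bed[i] = 1
            planted += 1
    return planted >= n

print(can_place_flowers([1, 0, 0, 0, 1], 1))  # True
print(can_place_flowers([1, 0, 0, 0, 1], 2))  # False
```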

Question 5

Using Python, given an input string s, reverse the order of the words. A word is defined as a sequence of non-whitespace characters. Words in s will be separated by at least one space.

Output a string of words joined by single spaces in reverse order. Note that s may contain leading or trailing spaces or multiple spaces between two words.

The returned string should have only one space to separate words. Do not include any extra spaces.

Example: Input: s = "the sky is blue" Output: "blue is sky the"

Both models completed - 5:3
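A minimal sketch of one possible solution, relying on Python's whitespace-aware split():

```python
def reverse_words(s: str) -> str:
    # split() without arguments drops leading, trailing and repeated spaces.
    return " ".join(reversed(s.split()))

print(reverse_words("the sky is blue"))  # blue is sky the
```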

Question 6

Use Python to accomplish this task. Given a string s and an integer k, return the maximum number of vowels in any substring of length k in s.

The vowels in English are "a", "e", "i", "o" and "u". Example: Input: s = "leetcode", k = 3 Output: 2

Explanation: "lee", "eet" and "ode" contain 2 vowels.

Both models completed it - 6:4
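A sliding-window sketch of one possible solution:

```python
def max_vowels(s: str, k: int) -> int:
    # Keep a running vowel count for the current window of length k.
    vowels = set("aeiou")
    count = sum(c in vowels for c in s[:k])
    best = count
    for i in range(k, len(s)):
        count += (s[i] in vowels) - (s[i - k] in vowels)
        best = max(best, count)
    return best

print(max_vowels("leetcode", 3))  # 2
```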

Question 7

Use Python to accomplish this task: given a string s that contains asterisks *, in one operation you can select an asterisk in s, remove the nearest non-asterisk character to its left, and remove the asterisk itself.

Output the string after all asterisks have been removed. Example: Input: s = "leet**cod*e" Output: "lecoe"

GPT-3.5 completed it, but Code Llama did not - 7:4
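A stack-based sketch of one possible solution:

```python
def remove_stars(s: str) -> str:
    # Each '*' deletes the most recently kept character.
    stack = []
    for c in s:
        if c == "*":
            stack.pop()
        else:
            stack.append(c)
    return "".join(stack)

print(remove_stars("leet**cod*e"))  # lecoe
```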

Question 8

Use Python to accomplish this task: given an integer array temperatures representing the daily temperatures, return an array answer where answer[i] is the number of days you have to wait after day i for a warmer temperature.

If there is no future day with a warmer temperature, keep answer[i] == 0. Example: Input: temperatures = [73,74,75,71,69,72,76,73] Output: [1,1,4,2,1,1,0,0]

Both models completed - 8:5
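A monotonic-stack sketch of one possible solution:

```python
def daily_temperatures(temperatures: list[int]) -> list[int]:
    # Stack holds indices of days still waiting for a warmer temperature.
    answer = [0] * len(temperatures)
    stack = []
    for i, t in enumerate(temperatures):
        while stack and temperatures[stack[-1]] < t:
            j = stack.pop()
            answer[j] = i - j
        stack.append(i)
    return answer

print(daily_temperatures([73, 74, 75, 71, 69, 72, 76, 73]))
# [1, 1, 4, 2, 1, 1, 0, 0]
```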

Regarding the two models' performance, the netizen stresses that this was not a rigorous study, just a quick test. Regenerating each model's code would usually produce better answers, but he did not test that.

So the test's conclusion should not be taken as the definitive performance of the two models.

## Llama 3, comparable to GPT-4, is expected to be open source

Since the release of Llama and Llama 2, the machine learning community has exploded, and fine-tuned models have sprung up everywhere.

OpenAI researcher Jason Wei said that at a Meta GenAI social event he learned that Llama 3 and Llama 4 will also be open-sourced in the future.

"We have the computing power to train Llama 3 and 4. Our plan is to make Llama 3 as good as GPT-4."

"Wow, if Llama 3 is as good as GPT-4, will you still open source it?"

"Yes, we will. Sorry, alignment people."

Another netizen said that Meta hopes to open source a GPT-5-level model, and seems intent on staying open source right up until AGI.

I want to be clear about what this means: there is no kill switch.

If something goes wrong, an agent goes out of control or a bad actor weaponizes it, there is no easy way to shut it down. It can run on any small cluster. There is no security at all.

Safety research becomes meaningless.

All the work people have done to make AI systems honest, aligned, ethical, and so on becomes meaningless. The world's AI systems will evolve toward whichever system yields the greatest economic benefit, regardless of values or motivations. There are no guardrails. Anyone can change an AI's values or capabilities at will, for better or worse.

If Meta keeps open-sourcing models while AI keeps getting smarter, it is clear to me that things will get messy. The arrival of these alien intelligences is already disrupting the world, and it will be even worse if we give up what little control humanity has.

As far as I can tell, Meta's enthusiasm for open source comes mainly from the "open source community dogma", that is, "open source is good". And as far as I know, they were not that pro-open-source until the accidental leak of their first model, LLaMA, and they have worn the open-source banner ever since.

On this topic, Musk remarked that LLMs based on autoregressive Transformers are extremely energy-inefficient, not only in training but also in inference, and that he thinks they are off by several orders of magnitude.

## Llama 2 coding ability soars

Llama 2 is a very strong model across the board.

However, it has one very obvious weakness: coding ability.

According to the data in Meta's Llama 2 paper, Llama 2's performance on HumanEval (a benchmark for evaluating LLM coding ability) is worse than GPT-3.5's, let alone GPT-4's.

Annotated figure from the original Llama 2 paper

But coding will certainly be an important direction for the open source community's use of Llama 2, so Meta naturally cannot afford to be weak there. Hence Code Llama, which is heavily optimized for coding ability.

Two days ago, Meta officially released the Code Llama family in three sizes (7B, 13B and 34B) and three variants: the general code model Code Llama, the instruction-following model Code Llama-Instruct, and the Python-specialized version Code Llama-Python.

These models are free for both research and commercial use, under the same license as Llama 2.

On coding benchmarks, Code Llama 34B scores nearly twice as high as Llama 2, greatly narrowing the gap with GPT-4.

As mentioned above, the mysterious Unnatural Code Llama in the paper, the version that nearly matches GPT-4, is Code Llama-Python 34B fine-tuned on 15,000 unnatural-language instructions: Meta's quiet hint that the open source community should go fine-tune Code Llama.

Why is there no 70B Code Llama model?

Interestingly, Code Llama only comes in 7B, 13B and 34B parameter versions, missing the 70B size that Llama 2 has.

Although Meta did not explain why in the paper, technology guru Sebastian offered two possible reasons:

  1. Code Llama was trained on 500B tokens, while Llama 2 was trained on 2T tokens.

Since Code Llama's training data is only about a quarter of Llama 2's, there may simply not be enough data, given LLM scaling laws, to train a Code Llama 70B that performs well.

  2. Code Llama supports a context size of 100k tokens, which is very useful for code tasks.

In contrast, Llama 2 only supports input lengths of up to 4k. Making a 70B model support 100k-token inputs would likely push its computational requirements too high (a rough estimate follows below).
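As a rough illustration of that cost, here is a back-of-envelope sketch of the KV-cache size alone. It assumes a Llama-2-70B-like shape (80 layers, 8192 hidden dimension), fp16 values, and plain multi-head attention with no grouped-query trick, so treat it as an upper bound rather than Meta's actual figures:

```python
# Back-of-envelope KV-cache size for a hypothetical 70B model at 100k context.
n_layers = 80          # assumption: Llama-2-70B-like depth
hidden_dim = 8192      # assumption: Llama-2-70B-like width
bytes_per_value = 2    # fp16
seq_len = 100_000

# Keys and values are cached per layer for every token in the context.
kv_cache_bytes = 2 * n_layers * hidden_dim * bytes_per_value * seq_len
print(f"KV cache per sequence: {kv_cache_bytes / 1e9:.0f} GB")  # ~262 GB
```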
