How does Code Llama 70B model compare with GitHub Copilot?

The version is available free of cost for both research and commercial purposes under the same community license as Llama 2.

February 14, 2024 03:46 pm | Updated 03:46 pm IST

FILE PHOTO: The Code Llama 70B is expected to be the largest and the “most powerful” model in the Code Llama brood. | Photo Credit: Reuters

On a mission to build open-source language models, Meta AI released an update to its range of Code Llama models on January 29. The Code Llama 70B is expected to be the largest and the “most powerful” model in the Code Llama brood.

In August, the company released the 7 billion, 13 billion and 34 billion parameter models in succession. Now, Code Llama 70B is said to be capable of handling a large bulk of coding queries. While making the announcement, Meta CEO Mark Zuckerberg said the team plans to eventually include these improvements in Llama 3 as well.

Just like the previous versions, the 70B model comes in three variations: Code Llama - 70B, the base model; Code Llama - Python, which is fine-tuned on Python code alone; and Code Llama - Instruct, which is fine-tuned to understand natural language instructions, meaning it is honed to interact more easily with humans.

The varied sizes are built to cater to different compute and latency requirements. The blog Meta posted alongside the release explains that the 7B model can run on a single GPU and is faster, making it best suited for quick, real-time tasks like code completion. The 34B and 70B models, on the other hand, generate code more accurately but draw more compute.


How does the new model fare?

Running the latest 70B version has raised questions within the coding and research community about its heavy compute load. Discussions on community forums like Hacker News threw up some suggestions: developers were advised to deploy quantised models, use a MacBook with the M2 chip, or rent a GPU on an hourly basis. (Quantisation reduces a model's computational demands and memory footprint, improving power efficiency, but it also makes the model somewhat less accurate.)
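The trade-off behind quantisation can be seen in a toy sketch: storing weights as 8-bit integers plus a scale factor cuts memory by roughly 4x versus 32-bit floats, at the cost of small rounding errors. This is an illustrative example only, not the scheme any particular runtime uses.

```python
# Toy symmetric int8 quantisation: each float32 weight (4 bytes) becomes
# one int8 (1 byte) plus a shared per-tensor scale, at the cost of
# small rounding errors bounded by half the scale.

def quantize_int8(weights):
    """Map float weights to int8 values and a per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.013, -0.507, 0.331, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# The rounding step is where accuracy is lost:
errors = [abs(a - b) for a, b in zip(weights, restored)]
```

Real deployments (e.g. 4-bit schemes popular for running large models locally) push this further, trading more accuracy for an even smaller memory footprint.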

But even then, the model moves at a slow pace and could burn a serious hole in developers' pockets, given that professional coders will likely need it for lengthy coding tasks.

Even so, the real test for Code Llama is in how it performs when weighed against large language models like GPT-4.

On the HumanEval benchmark, a dataset of 164 programming problems that measures the functional correctness and logic of code generation models, Code Llama 70B scores 65.2, far lower than GPT-4, which scores 85.4. (GPT-4 powers the widely used AI coding assistant GitHub Copilot.) The Llama model also scores lower than ChatGPT, which received 72.3 on the benchmark. In fact, the model doesn't even feature in the top ten best-performing models on the AI coders list.
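For context on what these scores mean: HumanEval reports pass@k, the probability that at least one of k generated solutions passes a problem's unit tests. The benchmark's accompanying paper gives an unbiased estimator for it, sketched below; a score of 65.2 corresponds to roughly 65% of the 164 problems being solved at k = 1.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the HumanEval paper.

    n: samples generated per problem
    c: samples that pass the unit tests
    k: evaluation budget
    Returns the probability that at least one of k drawn samples passes.
    """
    if n - c < k:
        # Fewer failing samples than the budget: a passing one is guaranteed.
        return 1.0
    # 1 minus the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples per problem of which 5 pass, pass@1 estimates 0.5:
print(pass_at_k(10, 5, 1))  # → 0.5
```

Scores are then averaged across all 164 problems, which is why a single number like 65.2 or 85.4 can summarise a whole model.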

Rather, older foundational models for coding like StarCoder or CodeGen-16B-Mono and GPT-4 are all prominently featured on the leaderboard and have been key to advancements made in AI tools like Code Llama.

But as is with other open-source experiments, the real story for Code Llama will start unfolding once coders get their hands on the model and start building their own versions of it.

For instance, after the release of the previous, smaller-sized Code Llama models last year, models like the Phind Model V7, built on top of a fine-tuned Code Llama-34B, came close to GPT-4 in terms of performance. The Phind Model V7 achieved a heroic 74.7% pass@1 on the HumanEval benchmark, compared to GPT-4's 67%.

A Hugging Face leaderboard for the best open-source AI models for coding has two versions of Phind's Code Llama, the 34B V2 and the 34B V1, ranking within the top five.

(However, benchmark scores by themselves shouldn't be taken as the sole evidence of how well a model works in the real world.)

In terms of interface, coders have complained that the prompt format in Code Llama-70B is “complicated” and that it has unnecessarily stringent guardrails for prompts. William Falcon, CEO of AI development platform Lightning AI, called the guardrails on the new Code Llama “too safe”. Other engineers on Reddit forums have also said the tool was prone to throwing up warnings when asked simply to add comments or rewrite a function.

Putting Llama at the centre of its AI push fits with Meta’s history of backing open-source projects. Sceptics are curious about how an open-source product might help Meta financially, but the company seems to know what it’s doing. Besides cloud computing partnerships with Microsoft Azure and AWS that bring in money directly, the company also seeks to draw in more developers to work with Meta. (Facebook’s PyTorch coding framework for machine learning apps did well to lure more AI/ML developers into the company.)

It is also a smart way of outsourcing work: after the release of Llama, the open-source community worked on it to make it run on a phone.

Nathan Lambert, a former research scientist at machine learning platform Hugging Face, has called Llama 2 one of the most popular open-source LLMs. “It’s the model that most people, and most startups, are playing with,” Lambert told CNBC.
