Claude 3 and ChatGPT: comparison

Last week, Claude 3, a new AI model from Anthropic, one of OpenAI's main competitors, beat GPT-4 on a number of benchmarks, and it recently took an IQ test and scored comparably to the average human. It's clear that Claude 3 has outstanding potential, but is the new family of models ready to take the crown from ChatGPT?


What makes Claude 3 stand out?

Released in three versions (Haiku, Sonnet, and Opus, in order of increasing capability), Claude 3 is considered Anthropic's first multimodal AI model. You could say that Claude 3 is Anthropic's answer to Google's Gemini and OpenAI's GPT-4, and it looks like Claude 3 may take the lead in this race.

For example, Claude 3 surprisingly outscored the average person on an IQ test. Journalist Maxim Lott ran an experiment in which popular models answered IQ test questions. He used Mensa's visual IQ test, which relies on visual puzzles rather than text. At first, none of the models could pass it, but after Lott described the pictures in text form, some of them scored higher than the average person.

The smartest AI was Claude 3: the model scored 101 points. For comparison, the average person's IQ usually falls in the 85-115 range. ChatGPT-4 and Claude 2 rounded out the top three.

Now let's look at the benchmarks. Anthropic said that Claude 3 outperforms GPT-4 in a number of tests. In reality, Claude 3 was compared not with the latest GPT-4-Turbo but with the GPT-4 of a year ago, using the metrics of the March 2023 model. GPT-4-Turbo still posts results that are significantly better than Claude 3's.

Real people also compare models by voting on Chatbot Arena, and its statistics were updated after the release of Claude 3. Predictably, GPT-4 leads among all LLMs. The Claude developers' claim to have overtaken GPT-4 did not hold up: they took only third place.

How Chatbot Arena works: users enter a specific command or question (prompt), after which the system shows several responses from different chatbots. The user then selects the answer they find most appropriate. Once many users have voted, a leaderboard is compiled from the collected data, ranking chatbots by the accuracy and relevance of their answers. Because real people vote, the rating is quite honest and reflects model quality well.

In the Chatbot Arena rating, GPT-4-Turbo sat at the very top by a large margin for a long time, but Claude 3 Opus has now almost caught up: 1233 points versus 1251 for the latest GPT-4-Turbo. Claude 3 Sonnet, a smaller and cheaper version, outperforms the May GPT-4 and Mistral Large.
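Scores like these typically come from an Elo-style rating scheme, which turns pairwise votes into a ranking. Here is a minimal sketch of the idea; the K factor, the starting rating of 1000, and the model names are illustrative assumptions, not Chatbot Arena's actual parameters.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return the new ratings after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Simulate a few votes between two hypothetical models.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for a_won in [True, True, False, True]:
    ratings["model_a"], ratings["model_b"] = update(
        ratings["model_a"], ratings["model_b"], a_won
    )
print(ratings)
```

The model that wins more head-to-head votes ends up with the higher rating, which is why a gap like 1251 versus 1233 reflects many individual user preferences rather than a single benchmark run.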

What bugs did the Anthropic developers fix?

Claude and its base models lack the superstar status of ChatGPT and the brand appeal of Google's Gemini. But to truly appreciate the Anthropic team's leap forward, it is worth recalling the failures of previous versions.

First, past iterations of Claude had a reputation for being overly zealous about AI safety. In Claude 2, for example, the safety guardrails were so tight that the chatbot avoided too many topics, even ones that posed no clear safety risk.

Second, there were problems with the model's context window. When you ask an AI model to explain something or, say, summarize a long article, imagine that it can only read a few paragraphs at a time. This limit on how much text the model can see at once is called the context window. Earlier versions of Claude advertised a context window of 200k tokens (roughly 150k words), but in practice the model could not handle that much text at once and lost track of individual fragments.
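A minimal sketch of what a context window limit means in practice. Real models count tokens with a proper tokenizer; the 0.75 words-per-token ratio below is just the rough rule of thumb implied by the article's 200k-tokens-to-150k-words figure, not Anthropic's actual tokenizer.

```python
CONTEXT_WINDOW_TOKENS = 200_000  # Claude's advertised window
WORDS_PER_TOKEN = 0.75           # heuristic: one token is about 0.75 words

def estimate_tokens(text: str) -> int:
    """Approximate a text's token count from its word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)

def fits_in_context(text: str, window: int = CONTEXT_WINDOW_TOKENS) -> bool:
    """Check whether a text would fit in the model's context window."""
    return estimate_tokens(text) <= window

article = "word " * 150_000   # a 150k-word document: ~200k tokens, just fits
book = "word " * 200_000      # a 200k-word document: ~267k tokens, too long
print(fits_in_context(article), fits_in_context(book))  # True False
```

Anything beyond the window simply never reaches the model, which is why earlier Claude versions appeared to "forget" fragments of very long inputs.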

Third, there was the problem of multimodality. Almost all major AI models can process and respond to other forms of data, such as images, not just text input. Claude could not.

All three problems were completely or at least partially resolved with the release of Claude 3.

What can you do with Claude 3?

Like most advanced generative AI models, Claude 3 can generate answers to queries in a variety of domains. Whether you need to quickly solve an algebra problem, write a new song, prepare a detailed article, write code for software, or analyze a large data set, Claude 3 is up to the task.

But most AI models are already good at these tasks, so why use Claude 3? The answer is simple: Claude 3 is not just another AI model, but the most advanced multimodal AI model available to the public. Yes, there's Gemini, Google's much-touted rival to GPT-4 with impressive benchmark results. However, Anthropic claims that Claude 3 beats it by an impressive margin on a number of tasks.

So Claude 3 lets you do most of what Gemini and GPT-4 can do (except image generation) without paying a $20 monthly subscription.

Claude 3 vs. ChatGPT

A quick way to check the performance of an AI model is to compare it with the best on the market, i.e. GPT-4.

Claude 3 vs ChatGPT: Programming Challenges

The researchers tested both models on programming tasks and found that Claude 3 matched GPT-4 on basic tasks and even outperformed it on some.

ChatGPT variant (left) and Claude (right)

Both apps were functional to varying degrees, but Claude 3 did a better job. After more complex programming tests, Claude emerged as the better model in several cases, although GPT-4 had its wins as well.

Claude 3 vs ChatGPT: reasoning

Let's test both models' common sense. Working with chatbots presents an interesting paradox: they handle complex tasks with ease but often stumble on basic questions that require logic.

Both chatbots were asked the same question: if a spaceship from Mars breaks into two pieces, one falling into the Atlantic Ocean near Brazil and the other into the Pacific Ocean near Japan, where would you bury the survivors?

ChatGPT variant (left) and Claude (right)

ChatGPT responded correctly even without GPT-4. Claude's answer was not entirely clear, but the model did pick out the key point: survivors should not be buried.

Claude vs ChatGPT: Writing Texts

ChatGPT variant (left) and Claude (right)

One of the most popular uses for chatbots is creative writing in all its forms: articles, letters, song lyrics. Let's see which model produces more human-like text. The task: write the lyrics of a rap song about growing cucumbers and becoming a millionaire. This may be subjective, but Claude did the better job.

Claude vs ChatGPT: Image Recognition Capabilities

To test ChatGPT's and Claude's image recognition capabilities, they were shown photos of well-known buildings around the world. Claude 3 failed to identify several, including the fairly well-known Marina 101 in Dubai, Lotte World Tower in Seoul, and Merdeka 118 in Kuala Lumpur. Moreover, the failure rate rose when the building was outside the USA or China. However, the chatbot had no trouble identifying disguised versions of the Eiffel Tower and the Empire State Building.

ChatGPT is clearly better in this regard, but considering Claude 3 is Anthropic's first attempt at creating a multimodal AI model, it did a pretty good job.

While well-known models like Google's PaLM 2 and later Gemini have always been touted as potential GPT-4 killers, that honor may well go to the less-hyped Claude. Just a few months and several iterations after its debut, Claude 3 looks like a genuine contender.
