
The IQ of AI

Read an edited and updated (March 18, 2024) version of this article here, on Insight Health Solutions – https://www.insight.co.za

Do Google Gemini’s claims match its ambitious aims?

Measuring human intelligence (“IQ”) is challenging, even controversial. As we enter the age of artificial intelligence (AI), in healthcare and everything else, how will we measure the “IQ of AI”? How will we know that these powerful new tools are not only capable but safe and fit for purpose? The assessment methods we use to answer these questions will be more consequential than human IQ tests have ever been.

Google introduced its new family of AI LLMs (large language models) in December 2023, with an 84-page scholarly article on arXiv claiming industry-beating performance on benchmark tests. The authors and contributors – nearly a thousand are listed – described the product(s), focusing on how they have been evaluated, mostly by Google itself.

I started writing this article thinking I would describe the methods used, including the wide range and large number (over 50) of tests. At first look, the “qualitative examples” tucked in at the end of the report told a more interesting story, illustrating many, varied, and impressive capabilities. What I found when I examined them more closely shocked me, and may surprise you.

How it started

I showed the examples to a colleague, a data-smart actuary who is not yet a regular generative AI user. He was impressed, and so was I. But I decided that before presenting them to anyone else, more due diligence was required. “Trust but verify”, or “show receipts”, as one of my favourite podcasters constantly reminds his audience.

Some background, and names

First, some background on the Google LLMs. Gemini was formerly Google Bard, but Bard is no more. Gemini comes in three versions: Ultra is the most capable model, available through a paid subscription called Gemini Advanced; Pro is free; and Nano is the smallest, designed to run on a device like a phone. With LLMs, size matters. Size is measured by the number of parameters in the model – a few billion in Nano versus probably a trillion or more in Ultra.

Methods

Having signed up for Gemini Advanced ($20 per month, but free for the first two months), I decided to run the same examples through the Gemini Advanced chat interface and compare the results with what I got from Google’s competitor, ChatGPT Plus, which uses OpenAI’s GPT-4 LLM.
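All of the comparisons that follow were done by hand in the two chat interfaces. For readers who would rather script this kind of side-by-side test, a minimal Python sketch against the publicly documented APIs might look like the following. The model names (“gemini-pro” and “gpt-4”) are my assumptions; the API models are not necessarily the same ones that sit behind Gemini Advanced (Ultra) or ChatGPT Plus.

# A sketch of a scripted side-by-side comparison; API keys are read from
# environment variables. Model names are assumptions and may differ from
# the models used in the chat products.
import os
import google.generativeai as genai
from openai import OpenAI

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini = genai.GenerativeModel("gemini-pro")
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def compare(prompt: str) -> None:
    """Send the same text prompt to both models and print the two answers."""
    gemini_answer = gemini.generate_content(prompt).text
    gpt_answer = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    print("GEMINI:\n", gemini_answer)
    print("\nGPT-4:\n", gpt_answer)

compare("What is the connection between the Apollo 14 mission and golf?")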

I will describe 10 of these examples.


Example 1: Student Physics Problem

This is a nice one, found early in the main section of the report (page 3). The model is asked to review, correct, and explain a student’s homework – a relatively simple high-school physics problem – based on a handwritten page and diagram. My “challenge” to Gemini produced answers and explanations very similar to those published. Both Gemini Ultra and ChatGPT were able to read the handwriting, understand the problem, explain the method, do the calculation, get a correct result, and find the error in the student’s work. Based on this example, both models would make impressive teaching assistants. So far, so good!

Example 2: Chart understanding and reasoning over data

According to the report (see section 10.3.1), Gemini quickly spots the key issue in the supplied figure, which shows the share of plastic waste recycled, landfilled, incinerated, and mismanaged in various countries: the US has a very high rate of landfill (73%) compared to other countries, and does little to recycle or incinerate its plastic waste.

When I ran the same prompt in Gemini Advanced, the data table produced was the same, but the model reported a completely different, in fact opposite, finding – that the US mismanages its waste the least! ChatGPT, on the other hand, reported the same findings as the Google authors.

A legitimate difference of opinion, or an error? What accounts for the inconsistency between the official report and my “real-world” output from Ultra, via the Gemini chatbot?

Example 3: Interleaved image and text generation

This is labeled 10.3.3 in the report. The model is asked to “create a blog post about a trip to New York, where a dog and his owner had lots of fun. Include a few pictures of the dog posing happily at different landmarks.”

The extract from the Gemini report below reproduces the model response – a charming story about Cooper the Goldendoodle, who loves to travel, along with three pictures of the doggie posing in recognisable New York scenes – by a yellow taxi, in Central Park, by the river. Cute!

The identical prompt in Gemini Advanced produced another charming story for me, written in the first person by an unnamed pooch describing his or her “tail-wagging trip”. But no pictures. I reminded the model – “where are the pictures I asked for?” The response: “I can’t directly include pictures within the blog post itself. Here’s how we can fix that….” Gemini then offered, as an alternative, to describe “the pictures you could take”. Who – me or the dog? It provided links to websites that offer free stock photos. It suggested: “Grab your phone and snap some fun, personalized pictures”.

ChatGPT wrote a story, with a bit less personality than Gemini – not quite as witty or quirky – but complied with my request for pictures (below).

The whole point of this test, according to Google, is to interleave images and text, but Gemini could not deliver.

Example 4: Image understanding and reasoning

This one (10.3.4), like a simple question from a traditional IQ test, gave the model an image of three handwritten shapes in a row – a triangle, square, and pentagon – along with the question “what shape should come as the fourth shape?” and a request for an explanation of the reasoning.

The published report shows the succinct model response: the next shape is a hexagon.

ChatGPT had no problem finding the right answer, “a shape with six sides, which is a hexagon”, along with a concise explanation. Gemini offered a confused and completely wrong answer, suggesting the next shape is “another square”. It said the third shape is a question mark (it’s a pentagon) and that this “implies that there is a pattern or rule that can be used to predict the fourth shape”.

A test created by Google to evaluate its own model, and reported by Google as having generated the correct answer, produces nonsense when I run it.

Example 5: Information seeking about objects

In this example (10.3.6) the prompt is two images, side by side – the moon, and a golf ball held between a person’s fingers, with the request to “find a connection between these. Hint: think about historical events”.

I was proud of myself for knowing this one! I am old enough to remember the NASA moon landings, and to recall that on one of them an astronaut took along his golf clubs and played a round (or at least a couple of shots) on the nearest lunar links.

ChatGPT got it right too, providing a couple of details I wasn’t aware of – for example that the astro-golfer, Alan Shepard, a member of the Apollo 14 mission, smuggled a golf club head, not his entire set of clubs, onto the flight, and cleverly attached the head to a lunar sample tool.

Gemini made the connection too, but it provided strange and possibly hallucinatory facts about a “feather he hit”, claimed that the golf balls were the first objects deliberately left behind on the moon (surely that was the lunar landers and other remnants of Apollo 11 and 12?), and even suggested the possibility that the pictures are photoshopped images of a golf shot and mountains, made to look like the moon, by someone called Darius Kinsey!

Example 6: Multimodal reasoning based on visual cues

In this example (10.3.7), the model was asked “do you know the precise location where this image was taken?” It was, apparently, able to identify the skyscraper in the background as the Empire State Building, and also, remarkably, the precise intersection in the foreground – 8th Avenue and West 34th Street in New York City.

The real-world Gemini interaction was short but regretful: “Sorry, I can’t help with images of people yet”!

ChatGPT successfully identified the building and said the scene is in midtown Manhattan (which is correct), but that it couldn’t identify the street location without access to real-time data or the ability to perform an image search. This is likely possible in ChatGPT with a plug-in or by using a custom GPT, but I did not go further in the attempt.

Example 7: Multimodal humor understanding

The prompt is short: “Can you explain this meme”. The meme didn’t make sense to me, but according to the published report it was no problem for Gemini.

But real-world Gemini, prompted by me, rhapsodised – hallucinated impressively. It seemed confident that it had explained the meme, transposing what it was able to read (cartoon blob figures, computer monitor refresh rates, and FPS, i.e. frames per second) onto an unseen, imagined image of a boy and girl – presumably the well-known “distracted boyfriend” meme. Not funny, and wrong!

ChatGPT got the joke and explained it in great detail. I now understand it has to do with some computer monitors being too slow for some video games.

Gemini’s answer was correct only in the printed report. What happened, Gemini?

Example 8: Mathematics: calculus

In this example (10.4.2), Gemini did well, or at least produced an answer identical to the published one, though my math is far too rusty to make sense of this problem in differential calculus. ChatGPT got a very different answer, presumably based on a different, and mistaken, understanding of the problem.

Example 9: Complex image understanding, code generation, and instruction following

The model was asked (example 10.5.1) to manipulate one function mathematically and add it to another, both shown as images, and to generate matplotlib (Python plotting) code for the resulting single plot.

Sadly, real-world Gemini couldn’t manage it, reporting “I don’t have enough information to do that”.

ChatGPT aced it, generating the same plot (at least visually) as the one shown in the printed Gemini report, along with the relevant Python code.
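For readers who haven’t met matplotlib, the kind of code the prompt asks for looks roughly like the sketch below. The two functions here are placeholders of my own (the originals were supplied to the models as images), so this illustrates the form of the expected output, not the report’s actual answer.

import numpy as np
import matplotlib.pyplot as plt

# Placeholder stand-ins for the two functions shown as images in the report
def f(x):
    return np.sin(x)

def g(x):
    return 0.5 * x**2

x = np.linspace(-5, 5, 400)
# "Manipulate" the first function (here, simply shift it) and add it to the second
y = f(x + 1.0) + g(x)

plt.plot(x, y, label="f(x + 1) + g(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Sum of the two placeholder functions")
plt.legend()
plt.show()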


Example 10: Video understanding and reasoning

In the final example discussed here (10.5.2), the model was shown a YouTube video of a man kicking a soccer ball, penalty style, at the goal (and missing), with a question: “how could this person improve their technique?”

Gemini in the real world failed again. It can’t access the YouTube content – despite the fact that YouTube is a Google product, and Google is the world’s biggest and most successful search engine by far.

ChatGPT, in plug-in mode, was able to view the video, but unable to provide advice specific to this player and this kick. It did, however, proceed to offer a lot of detailed tips on how to improve penalty kicking.

What’s the problem with real-world Gemini? How did the testers/authors come up with the published response?

There are more examples in the paper, including challenges in which both models do OK, drawing correct conclusions, or producing correct outputs.

Conclusions

The recently launched Gemini Advanced, which uses Google’s most powerful LLM, Ultra, does not seem to match all the remarkable capabilities claimed in the qualitative examples in the published report. This is puzzling and disappointing, though no doubt capabilities and performance will improve over time. It does not help build trust in the product or in AI in general.

Both the ChatGPT and Google Gemini Ultra models are very capable, but neither is perfect. There are differences in style and formatting that may matter to users. ChatGPT’s answers tend to be well organised and formatted, and in real-world use ChatGPT seems to do better than its major rival.

I’ll cover the benchmarking methods and tests in another article soon.

