Can large language models solve logic puzzles? There's one way to find out: you have to ask. That's exactly what Fernando Perez-Cruz and Hyun Song Shin did recently. (Perez-Cruz is an engineer; Shin is a research director at the Bank for International Settlements and the person who taught me some of the more mathematical parts of economic theory in the early 1990s.)
The riddle in question is known as “Cheryl's Birthday.” Cheryl challenges her friends Albert and Bernard to guess her birthday, and for the purposes of the riddle they know it is one of 10 dates: May 15, 16 or 19; June 17 or 18; July 14 or 16; or August 14, 15 or 17. To speed up the guessing, Cheryl tells Albert her birth month and tells Bernard the day of the month, but not the month itself.
Albert and Bernard think for a while. Then Albert announces, “I don't know your birthday, but I do know that Bernard doesn't know it either.” Bernard replies, “In that case, I now know your birthday.” Albert replies, “Now I know your birthday too.” When is Cheryl's birthday?* And, more importantly, what can we learn by asking GPT-4?
The puzzle is difficult. Solving it requires eliminating possibilities step by step while pondering questions such as, “What must Albert know, given that he knows Bernard doesn't know?” It is therefore extremely impressive that when Perez-Cruz and Shin repeatedly asked GPT-4 to solve the puzzle, the large language model got the answer right each time, fluently offering varied and accurate explanations of the logic of the problem. Yet this bravura feat of logical mastery was nothing more than a clever illusion. The illusion fell apart when Perez-Cruz and Shin asked the computer a trivially modified version of the puzzle, changing the names of the characters and the months.
GPT-4 continued to produce fluent, plausible explanations of the logic, so fluent that you really have to pay attention to spot the moments when those explanations crumble into nonsense. Both the original problem and its answer are available online, so presumably the computer had learned to reword that text in a clever way, giving the appearance of being a brilliant logician.
When I tried the same thing, keeping the formal structure of the puzzle but changing the names to Juliet, Bill and Ted and the months to January, February, March and April, I got the same disastrous result. GPT-4 and the new GPT-4o both worked through the structure of the argument competently, but reached incorrect conclusions at several steps, including the last one. (I also realized that on my first attempt I had introduced a fatal typo into the puzzle that made it unsolvable. GPT-4 didn't bat an eyelid and “solved” it anyway.)
Curious, I tried another famous puzzle. A contestant on a game show is trying to find a prize behind one of three doors. The quizmaster, Monty Hall, allows a preliminary guess, opens another door that does not reveal the grand prize, and then offers the contestant the option of switching doors. Should he switch?
The Monty Hall problem is actually much simpler than Cheryl's birthday, but it is confusingly counterintuitive. I made it harder for GPT-4o by adding some complications: I introduced a fourth door and asked not whether the contestant should switch (which he should), but whether it is worth paying $3,500 to switch once two doors have been opened, given that the grand prize is $10,000.**
GPT-4o's response was remarkable. It avoided the cognitive trap of the puzzle and clearly articulated the logic of each step. Then it stumbled at the finish line, added a nonsensical assumption and, as a result, arrived at the wrong answer.
What are we to conclude from this? In a sense, Perez-Cruz and Shin have simply found a twist on the well-known problem that large language models sometimes weave plausible fictions into their answers. Instead of plausible factual errors, the computer here produced plausible logical errors.
Defenders of large language models might reply that perhaps the computer could produce better results with a cleverly designed prompt (which is true, although the word “perhaps” is doing a lot of work). It is also almost certain that future models will produce better results. But as Perez-Cruz and Shin argue, perhaps that is beside the point. A computer that seems so right and yet is so wrong is a dangerous tool. It is as if we were relying on a spreadsheet for our analysis (which is dangerous enough) and the spreadsheet occasionally and unpredictably forgot how multiplication works.
This is not the first time we've learned that large language models can be phenomenal bullshit engines. The trouble is that the bullshit is so terribly plausible. We've seen falsehoods and mistakes before, and God knows we've seen fluent bluffers before. But this? This is something new.
*If Bernard had been told the 18th (or the 19th), he would know that the birthday was June 18 (or May 19). So when Albert says he knows that Bernard doesn't know the answer, that rules out those possibilities: Albert must have been told July or August rather than May or June. Bernard's reply that he now knows the answer shows that the day can't be the 14th (otherwise he would still not know whether the month was July or August). The remaining dates are August 15, August 17 and July 16. Albert knows the month, and the fact that he now knows the answer shows that the month must be July and that Cheryl's birthday is July 16.
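For readers who would like to check the elimination mechanically, here is a minimal Python sketch of my own (not from Perez-Cruz and Shin's paper) that encodes the three statements as successive filters over the 10 candidate dates.

```python
# Brute-force check of the Cheryl's Birthday elimination (illustrative only).
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]

def unique_days(candidates):
    # Days that pin down the date on their own among the candidates.
    return {d for _, d in candidates
            if sum(1 for _, day in candidates if day == d) == 1}

# Albert (who knows the month) knows Bernard doesn't know the date,
# so Albert's month cannot contain a uniquely identifying day.
step1 = [(m, d) for m, d in DATES
         if not any(day in unique_days(DATES)
                    for month, day in DATES if month == m)]

# Bernard (who knows the day) now knows the date,
# so his day must be unique among the surviving candidates.
step2 = [(m, d) for m, d in step1 if d in unique_days(step1)]

# Albert now knows the date too, so his month must appear only once
# among what is left.
months = [m for m, _ in step2]
answer = [(m, d) for m, d in step2 if months.count(m) == 1]

print(answer)  # [('July', 16)]
```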
**The probability of picking the right door at the start is 25 percent, and that doesn't change when Monty Hall opens two empty doors. Therefore, the chance of winning the $10,000 is 75 percent if you switch to the remaining closed door and 25 percent if you stick with your first choice. For a risk-neutral player, it is worth paying up to $5,000 to switch.
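And for anyone who would rather simulate than reason it out, here is a short Monte Carlo sketch of my own, assuming Monty always opens two empty doors that are neither the contestant's pick nor the prize; it reproduces expected winnings of roughly $2,500 for staying and $7,500 for switching.

```python
# Monte Carlo sketch of the four-door Monty Hall variant (illustrative only).
import random

def average_winnings(switch, doors=4, prize_value=10_000, trials=200_000):
    total = 0
    for _ in range(trials):
        prize = random.randrange(doors)
        pick = random.randrange(doors)
        # Monty opens two empty doors: never the pick, never the prize.
        openable = [d for d in range(doors) if d not in (pick, prize)]
        opened = random.sample(openable, 2)
        # Exactly one closed door remains besides the original pick.
        other = next(d for d in range(doors) if d != pick and d not in opened)
        final = other if switch else pick
        total += prize_value if final == prize else 0
    return total / trials

print(round(average_winnings(switch=False)))  # roughly 2,500
print(round(average_winnings(switch=True)))   # roughly 7,500
```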