

What I Learned by Creating AI to Play Codenames

Artificial intelligence can be smart and stupid at the same time.

Key points

  • Artificial intelligence has now beaten humans at several different games.
  • Codenames is a party game played between two teams that involves connecting verbal clues to target words.
  • A program that plays Codenames demonstrates the difficulty of creating technology that is clever in the same ways humans are clever.

Artificial intelligence has beaten or challenged humans at many games: Go, Hanabi, bridge, Diplomacy, Dota 2, and of course chess. I decided to try my hand at making a program* that performs passably at Codenames.

For those who don’t know, Codenames is a party game played between two teams. Twenty-five cards lie on a table, each displaying a word. Some cards belong to one team, some to the other, a few are neutral, and one is the game-over word, the “bomb”: guess it and your team loses instantly. Only one person on each team, the spymaster, knows which is which. The spymasters take turns giving clues to their teams, hoping the guessers will say the right words while avoiding the opponents’ words and the bomb. A clue consists of one word and a number: the number of cards you’re targeting with your clue word. Take a look at the image below, from a program I wrote. I’m playing the spymaster for the red team. If I’m feeling daring, I might say, “sin, 3.” I hope my teammates will guess “thief,” “angel,” and “fall.” Stealing is a sin, angels have to do with morality, and “fall,” as a stretch, might be linked with “angel,” as in a fallen angel.

Playing the game requires a massive amount of knowledge about the world. It also invites creative associations. And it benefits from thinking about what your teammates and spymaster and opponents are thinking. How hard would it be for AI?

I knew of a shortcut: word embeddings. Computer scientists have developed a method to encode the meanings of words as numbers. An algorithm trawls through billions of words of text on the internet and looks at how often each vocabulary word appears in the vicinity of every other vocabulary word. Then it assigns each word a sequence of perhaps 300 numbers summarizing its relationship to other words. This sequence represents the word’s location in a 300-dimensional space of meaning. If people write “thief” and “burglar” in similar circumstances, the two words will be embedded near each other in that semantic space.
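
To make that concrete, here is a minimal Python sketch of how embeddings get loaded and compared. It is my illustration rather than the program’s actual code, and it assumes the standard GloVe text format, in which each line holds a word followed by its vector:

```python
import numpy as np

def load_glove(path):
    """Read GloVe's plain-text format: each line is a word followed by its vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def similarity(a, b):
    """Cosine similarity: close to 1 for words that appear in similar contexts."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# vectors = load_glove("glove.6B.300d.txt")  # the 300-dimensional GloVe release
# similarity(vectors["thief"], vectors["burglar"])  # relatively high
# similarity(vectors["thief"], vectors["cloud"])    # much lower
```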

I started by downloading a set of word embeddings called GloVe, and trimmed it from 400,000 words to the most popular 50,000 words (ignoring generic words like “a” and “both”). Then I calculated the distance between each of these words—potential clues—and each of the 400 words that could potentially appear in a Codenames game.
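
Continuing the sketch above (again illustrative, with hypothetical clue_words and board_words lists standing in for the 50,000 candidate clues and 400 possible board words), that precomputation amounts to building a table of cosine similarities:

```python
import numpy as np

def build_similarity_table(vectors, clue_words, board_words):
    """sims[i, j] = cosine similarity between clue_words[i] and board_words[j]."""
    C = np.stack([vectors[w] for w in clue_words])   # shape (n_clues, 300)
    B = np.stack([vectors[w] for w in board_words])  # shape (n_board, 300)
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return C @ B.T  # normalizing first turns the dot product into a cosine
```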

Here’s how my program generates clues: It first tries to find a clue that will target every single one of its team’s words that remain on the board. At the beginning of a game, that’s nine, a herculean task. It scans through the 50,000 possible clues and looks for words that are closer to all (say) nine of those words than to any of the opponents’ words or to the bomb word. If it finds none—in practice, successfully targeting three words is impressive, and even two can be a challenge—it will reduce the number of targeted words to eight. For each possible grouping of eight of those nine words, it will again try to find a clue that’s closer to all eight—by some buffer amount—than to the words to be avoided. (If it finds more than one clue, it picks the one that’s closest to the most distant target word. And if it finds clues for more than one grouping of target words, it again picks the clue that’s closest to the most distant target word.) It then offers that clue, and the number of words it’s targeting.
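
Roughly, that search looks like the sketch below. It is an illustrative Python sketch, not the program’s exact code: `sim` stands for a similarity lookup, and the buffer value is an arbitrary placeholder.

```python
from itertools import combinations

BUFFER = 0.05  # placeholder margin; the real program uses "some buffer amount"

def choose_clue(sim, candidate_clues, team_words, avoid_words):
    """Try to cover all remaining team words; shrink the target set until a clue clears it."""
    for n in range(len(team_words), 0, -1):
        best = None  # (similarity to the most distant target, clue, number targeted)
        for targets in combinations(team_words, n):
            for clue in candidate_clues:
                # (A real spymaster would also skip clues that appear on the board.)
                worst_target = min(sim(clue, t) for t in targets)
                nearest_avoid = max((sim(clue, a) for a in avoid_words),
                                    default=float("-inf"))
                # The clue must be closer to every target than to any word to avoid.
                if worst_target > nearest_avoid + BUFFER:
                    if best is None or worst_target > best[0]:
                        best = (worst_target, clue, n)
        if best is not None:
            return best[1], best[2]  # the clue and how many words it targets
    return None
```

Picking the candidate whose weakest link to a target is strongest is what “closest to the most distant target word” means in practice.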

[Image: My Codenames program. Source: Matthew Hutson]

Guessing is much simpler. If the clue is “sin, 3,” the program finds the three words on the board most similar to “sin.” One little trick: hints can carry over. Let’s say that on its previous turn, the clue was “asteroid, 2,” and it guessed “dinosaur” and “space.” “Dinosaur” was a red word, but “space” was a neutral word, so the spymaster must have had another red word in mind when suggesting “asteroid.” Now, even though this turn’s clue is “sin,” the guesser will be biased slightly toward words that match “asteroid,” and will be more likely to guess “fall.”
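
Sketched in Python (again illustrative; the carryover weight below is a placeholder, not the value the program actually uses):

```python
CARRYOVER_WEIGHT = 0.25  # placeholder bias toward an earlier, not-fully-resolved clue

def rank_guesses(sim, unrevealed_words, clue, count, leftover_clue=None):
    """Score each board word against the clue, with a small bonus for a leftover earlier clue."""
    def score(word):
        s = sim(clue, word)
        if leftover_clue is not None:
            s += CARRYOVER_WEIGHT * sim(leftover_clue, word)
        return s
    return sorted(unrevealed_words, key=score, reverse=True)[:count]
```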

As I write in a recent article for The New Yorker about AI and common sense, I learned at least a couple of things from this project. First, my Codenames AI is pretty smart. I was impressed when it provided the clue “wife, 2,” targeting “princess” and “lawyer.” But it’s also pretty stupid. We can embed a lot of human knowledge in a set of numbers, allowing technology to make clever connections, but that doesn't mean it will be clever in all the ways we are clever.

To quantify how well it plays, I ran a small user study with crowdworkers. I randomly generated 10 collections of 25 words; for each, I wrote a clue targeting two of one team’s nine words, and my program generated its own clue for two of those nine words. Then I showed online participants the 10 collections, each with either my clue or the computer’s. On average, for my clues, 1.6 of the two words people guessed were our team’s words (even if they weren’t the two targets I had in mind). For the computer’s clues, that average was 1.3. Respectable. (Chance was 0.1.) As for guessing from my clues, the computer did slightly better than the people did, though the difference likely isn’t statistically significant: people guessed 1.6 of my team’s words, the computer 1.7.

Of course, the software has weaknesses not captured in that study. I would never attempt “sin, 3” with this program as a guesser, because “fall” makes sense only through its association with “angel,” and the program looks only at single-hop similarities between clues and cards. Indeed, when I entered that clue, it correctly guessed “thief” and “angel” but incorrectly guessed “charge,” ending the game. It misses obvious clues in head-smacking ways. (Neither "plant" nor "garden" triggered it to guess "root.")

More important, Codenames tests only some aspects of common sense, the deployment of static meanings one might look up. Consider a night of playing Codenames. Common sense reveals itself in the word associations and rational inference needed to score points. It’s also apparent in the unconscious ability to flip an hourglass or fill your wine glass. And it entails social fluidity. One night, I recorded this exchange, during a conversation about my program:

Human 1: “There’s a bit of artificial intelligence standing around this table right now.”

Human 2: “Are you talking about me or you?”

Human 1: “We’ll leave that oblique.”

* I wrote most of my program in 2017. In 2019, I debugged it, added a graphical user interface, and ran the user study. I’ve placed the code on GitHub.
