A Turing Test for X

Humans continue to build machines that get better and better at tasks we once thought only a human could do. One of the earliest examples is the software project Chinook, which plays checkers so well that it provably cannot be defeated. More contemporary examples include Deep Blue and AlphaGo, which bested top human players at chess and Go, respectively.

But do these systems play their respective games in the same manner as the best humans? Or have they found exotic, even incomprehensible strategies that differ noticeably from those of their human counterparts?

In our episode Detecting Cheating in Chess, we interviewed Kenneth Regan about his work analyzing chess games and spotting differences between human and algorithmic play. As you might expect, chess players sometimes encounter situations where there’s one obviously correct move to take. In those situations, we expect human and machine play to be virtually indistinguishable. Ken’s work had to account for this (listen to our episode to find out how) in order to gain accuracy. Ultimately, Ken found that computer play was strongly characterized by moves which kept a large breadth of future strategies open. This is in contrast to human players, whose choices admitted less diverse directions for the remainder of the game.
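Regan’s actual statistical model is considerably more involved, but to make the idea of “breadth” concrete, here’s a minimal sketch of one possible proxy for it: for each candidate move, count how many replies an engine rates within a small margin of the best reply. This is my own illustrative statistic, not Regan’s; it assumes the python-chess library and a UCI engine binary (e.g. Stockfish) on your PATH.

```python
# A toy proxy for "breadth of future strategies": after a candidate move,
# count how many replies the engine scores within `margin` centipawns of
# the best reply. Illustrative only -- not Regan's actual model.
import chess
import chess.engine

ENGINE_PATH = "stockfish"  # assumption: a UCI engine binary on your PATH

def breadth_after_move(board, move, engine, margin=50, depth=12):
    """Number of near-best replies available after playing `move`."""
    board.push(move)
    try:
        if board.legal_moves.count() == 0:
            return 0  # checkmate or stalemate: nothing to branch into
        infos = engine.analyse(board, chess.engine.Limit(depth=depth),
                               multipv=min(10, board.legal_moves.count()))
        scores = [info["score"].relative.score(mate_score=10000)
                  for info in infos]
        best = max(scores)
        return sum(1 for s in scores if best - s <= margin)
    finally:
        board.pop()

engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
board = chess.Board()
for move in list(board.legal_moves)[:5]:
    print(board.san(move), breadth_after_move(board, move, engine))
engine.quit()
```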

In a manner analogous to Regan’s work, Coquide, Georgeot, and Giraud explore the same question for the game of Go in their paper Distinguishing humans from computers in the game of go: a complex network approach. Their paper focuses on a novel way of representing the game, and the choices made within it, as a network. Various properties of this network can be studied to see how human play differs from computer play. Similar to Regan, they found that “In general, the computer has a tendency to play using a more varied set of most played moves”. Their work suggests that various statistics of the network representation of a game can be useful in revealing whether it was played by a human or a machine.
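The paper encodes moves as local board patterns; as a simpler illustration of the general idea, here’s a sketch that builds a move network where nodes are plain board coordinates and a directed edge links each move to the one played next. The toy games below are invented; statistics such as the degree distribution are the kind of network property the authors compare across player types.

```python
# A minimal sketch of a "move network" in the spirit of Coquide et al.
# Nodes are moves (here, simple board coordinates; the paper uses richer
# local-pattern encodings), and a directed weighted edge links each move
# to the move played immediately after it.
import networkx as nx

def build_move_network(games):
    """games: iterable of move sequences, each move a (row, col) tuple."""
    G = nx.DiGraph()
    for moves in games:
        for a, b in zip(moves, moves[1:]):
            # accumulate edge weights across all games
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1
            else:
                G.add_edge(a, b, weight=1)
    return G

# Hypothetical toy input: two very short "games".
games = [[(3, 3), (15, 15), (3, 15)],
         [(3, 3), (15, 3), (15, 15)]]
G = build_move_network(games)
print(G.number_of_nodes(), G.number_of_edges())
# The out-degree distribution is one statistic that could be compared
# between human and computer play.
print(sorted(d for _, d in G.out_degree()))
```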

When discussing ways of measuring the difference between human and computer activity, it’s not long before someone describes a process as “a Turing test for ______”. I’m actually not thrilled with this sort of label. To me, it captures only one small part of why Turing’s Imitation Game is interesting. The Imitation Game is not just about whether a judge can recognize the difference between human and computer choices. A key feature of the Imitation Game is that one player has the objective of deliberately deceiving the judge. Some may find this point pedantic, but I think it’s important. Regardless, the ship has long since sailed on that point, and “Turing tests for ______” seem here to stay.

The Imitation Game is interesting because the ability to hold an interactive conversation seems to be among the most impressive feats of the human brain. Through conversation (along with patience, practice, and commitment), it seems a human can be instructed in virtually any skill. One might even say, metaphorically, that when a master gives instructions to an apprentice, the master is programming that apprentice in the programming language of natural language. Granted, this language is ambiguous, and the compiler doesn’t always run the code in the expected way, but the analogy is useful.

Given the usefulness and significance of a well-functioning conversational agent, when it comes to the Imitation Game, we should take an interest in software agents capable of reliably fooling judges. State-of-the-art chatbots are nowhere near this achievement today. Their conversations differ from human ones in countless ways. However, I expect that gap to begin closing at a measurable rate in the future.

But how does software compare to humans in other conversation-related tasks? A recent paper by Stolcke and Droppo titled Comparing Human and Machine Errors in Conversational Speech Transcription explores this question. The paper considers the errors made by machine transcription compared to human transcription. When asked to transcribe audio to text, can a judge distinguish between a human and a machine based on the errors that are made?
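Comparisons like this rest on the standard bookkeeping of transcription errors: align the reference and hypothesis word sequences by edit distance, then classify each difference as a substitution, insertion, or deletion. Here’s a small self-contained sketch of that alignment (the paper’s analysis builds on these same error categories, though with far more machinery).

```python
# Align a reference transcript with a hypothesis by edit distance and
# tally substitutions, insertions, and deletions -- the raw material of
# word error rate (WER) and of error-pattern comparisons.

def align_errors(ref, hyp):
    """Return counts of (substitutions, insertions, deletions) via DP."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # backtrace to classify each difference
    subs = ins = dels = 0
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels

ref = "the soup was hot".split()
hyp = "the soup was pot".split()
print(align_errors(ref, hyp))                  # (1, 0, 0): one substitution
print(sum(align_errors(ref, hyp)) / len(ref))  # word error rate: 0.25
```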

To quote the paper:

“An informal Turing-like test also demonstrated that error patterns in the two types of transcription are not obviously distinguishable.”

This conclusion was reached after volunteers at a conference attempted to judge which errors were human-generated and which were machine-generated. In this informal test, the human judges were unable to reliably distinguish the two. The researchers themselves, whom we can consider experts in these types of errors given their effort to study them, could probably perform a bit better on the test. They observe two interesting cases where machine transcription currently struggles.

First is the case of “backchanneling” - utterances provided by a listener to signal to the speaker that they are listening and understanding. A common one is the utterance “uh huh”.

Second is the case of “filled pauses” - utterances provided by a speaker to signal to a listener that they’re not done talking. A common one is the utterance “ummm”.

Though distinguishable, these two examples are phonetically similar. Further, one cannot say that “uh huh” is always used as a backchannel; the context is quite important. A human listener seems to have easy access to that context and parses these utterances fluidly, even unconsciously. This, on the other hand, is one area where machines struggle and make mistakes. Machine transcription via deep learning is able to learn abstractions of grammar, likely phonetic sequences, and the information carried by pausing, pace, and pitch, yet it seems to fail to learn the more complex contextual clues needed to disambiguate these use cases.
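To make the role of context concrete, here’s a toy heuristic of my own devising (not anything from the paper): classify a hesitation-like token by who holds the floor around it. It’s crude, but it illustrates the kind of turn-taking information a transcription system would need to get these cases right.

```python
# Toy illustration: the same token is a backchannel if the *other* speaker
# holds the floor, but more like a filled pause if the speaker is mid-turn
# and keeps talking afterwards. A heuristic sketch, not a real system.

def label_utterance(token, speaker, prev_speaker, next_speaker):
    """Classify a hesitation-like token using crude turn-taking context."""
    if token not in {"uh huh", "umm", "uh"}:
        return "word"
    if prev_speaker != speaker and next_speaker != speaker:
        # interjected while someone else is talking, and they continue
        return "backchannel"
    if prev_speaker == speaker and next_speaker == speaker:
        # embedded in the speaker's own ongoing turn
        return "filled pause"
    return "ambiguous"

# Hypothetical mini-transcript: (speaker, utterance) pairs.
turns = [("A", "so I was thinking"), ("B", "uh huh"), ("A", "we could go"),
         ("A", "umm"), ("A", "tomorrow instead")]
for i, (spk, tok) in enumerate(turns):
    prev_spk = turns[i - 1][0] if i > 0 else None
    next_spk = turns[i + 1][0] if i + 1 < len(turns) else None
    print(tok, "->", label_utterance(tok, spk, prev_spk, next_spk))
```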

Perhaps the most advanced builders of these systems use ensembles that leverage language models to help disambiguate. In other words, a recording might be genuinely ambiguous from a purely acoustic perspective. Yet a choice between “the soup was pot” and “the soup was hot” is easily corrected by a human, and likely by a machine as well. But higher-order structures, such as the behavior of signaling that one is still speaking but needs a pause to think (i.e. “ummmm”), are seemingly complex behaviors requiring the listener to consider (simulate) the thinking process of the speaker.
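Here’s a minimal sketch of that rescoring idea: combine an acoustic score with a language-model score log-linearly and keep the best-scoring hypothesis. The scores and the tiny bigram “language model” below are invented purely for illustration.

```python
# Language-model rescoring sketch: the acoustic model finds two hypotheses
# nearly indistinguishable, but a language model knows "was hot" is far
# more common than "was pot". All numbers here are made up.
import math

# Hypothetical acoustic log-probabilities for two near-homophones.
acoustic = {"the soup was hot": -10.1, "the soup was pot": -10.0}

# A toy bigram count table standing in for a real language model.
bigram_counts = {("was", "hot"): 900, ("was", "pot"): 3}

def lm_logprob(sentence, counts, alpha=1.0, vocab=1000):
    """Add-alpha smoothed bigram log-probability (toy stand-in for an LM)."""
    words = sentence.split()
    total = sum(counts.values())
    logp = 0.0
    for a, b in zip(words, words[1:]):
        logp += math.log((counts.get((a, b), 0) + alpha)
                         / (total + alpha * vocab))
    return logp

def rescore(hypotheses, lm_weight=1.0):
    """Pick the argmax of acoustic score plus weighted LM score."""
    return max(hypotheses,
               key=lambda h: hypotheses[h]
               + lm_weight * lm_logprob(h, bigram_counts))

print(rescore(acoustic))  # "the soup was hot" wins despite the acoustic tie
```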

When it comes to game playing, I suspect advances in machine players will continue to exhibit these distinguishable features. Those algorithms are optimized for winning, not for mimicking human behavior. At this point, mimicking human behavior would actually handicap those game-playing algorithms! However, nothing technically prevents a system designer from regularizing their game-playing model to exhibit more human-like behavior. I suspect this tactic will become increasingly common in the design of video game AI in the future.
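One way a designer might do this (a schematic sketch, not any published system’s objective) is to add an imitation term to the training loss, penalizing the policy for straying from observed human move frequencies:

```python
# Schematic "humanized" training objective: the usual win-oriented loss
# plus a cross-entropy penalty pulling the policy toward a human move
# distribution. All values below are hypothetical.
import numpy as np

def humanized_loss(policy_logits, win_loss, human_move_probs, lam=0.5):
    """win_loss: the engine's usual objective (e.g. negative expected reward).
    The added term pulls the policy toward observed human move frequencies."""
    # softmax over candidate moves
    z = policy_logits - policy_logits.max()
    policy = np.exp(z) / np.exp(z).sum()
    # cross-entropy between the human move distribution and the policy
    imitation = -np.sum(human_move_probs * np.log(policy + 1e-12))
    return win_loss + lam * imitation

# Hypothetical numbers: the engine prefers move 0, humans usually play move 2.
logits = np.array([2.0, 0.5, 1.0])
human = np.array([0.1, 0.2, 0.7])
print(humanized_loss(logits, win_loss=1.0, human_move_probs=human))
```

The weight lam controls the trade-off: at zero the agent simply plays to win, while larger values push it toward human-like (and likely weaker) play.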

As for conversational agents, the point is to engage with humans in the manner humans find suitable. Sure, machines might find it more convenient to converse amongst themselves in some exotic way. But to be effective conversational agents, their errors should be minimized and, ideally, any errors they do produce should be hard to distinguish from human ones. In other words, ideal systems would make “common” mistakes instead of blunders.