It is irritating that this needs to be said, but apparently it does.
Not Passed as Described by Turing
Along with many others, a random Twitter user claims that the Turing Test has already been passed:
Two remarks. First, a test has not already been passed merely because you personally believe that, if the test were taken, it would be passed. And no one has pointed to any attempt at the task suggested by Antony Raphael here.
Second, the suggested task is not the test “as Turing formulated it” in any way, shape or form. Let us look at his formulation.
I PROPOSE to consider the question, ‘Can machines think?’ This should begin with definitions of the meaning of the terms ‘machine’ and ‘think’. The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words ‘machine’ and ‘think’ are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, ‘Can machines think?’ is to be sought in a statistical survey such as a Gallup poll. But this is absurd. Instead of attempting such a definition I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous words.
The new form of the problem can be described in terms of a game which we call the ‘imitation game’. It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. He knows them by labels X and Y, and at the end of the game he says either ‘X is A and Y is B’ or ‘X is B and Y is A’. The interrogator is allowed to put questions to A and B thus:
C: Will X please tell me the length of his or her hair?

Now suppose X is actually A, then A must answer. It is A's object in the game to try and cause C to make the wrong identification. His answer might therefore be
‘My hair is shingled, and the longest strands are about nine inches long.’
In order that tones of voice may not help the interrogator the answers should be written, or better still, typewritten. The ideal arrangement is to have a teleprinter communicating between the two rooms. Alternatively the question and answers can be repeated by an intermediary. The object of the game for the third player (B) is to help the interrogator. The best strategy for her is probably to give truthful answers. She can add such things as ‘I am the woman, don’t listen to him!’ to her answers, but it will avail nothing as the man can make similar remarks.
We now ask the question, ‘What will happen when a machine takes the part of A in this game?’ Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, ‘Can machines think?’
There are three players here: A, the machine; B, the “woman,” i.e. the human who must be identified; and C, the interrogator, who must determine which of the two is the human and which is the machine. Make note of several points:
B is explicitly assigned the role of helping C determine which is which. Thus (among other possibilities) if B does not know about the game, this would not be the test as formulated by Turing.
When C asks a question, “A must answer.” It is required by the rules of the game. A cannot say “I don’t want to talk to you,” and walk away. Similarly, and more to the point (in regard to our Twitter user’s example), if A refuses to engage in generic conversation and will only discuss certain topics, this is not the test as formulated by Turing.
In order to be a test, the game must be played multiple times. “We now ask the question, ‘What will happen when a machine takes the part of A in this game?’ Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman?” This is actually a rather weak requirement; the machine does not have to cause the interrogator to make a mistake 50% of the time, just at least as frequently as when the game is played with a man and a woman.
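Stated in notation that Turing himself does not use, this pass criterion amounts to a comparison of two error rates:

$$\Pr(\text{C misidentifies} \mid A \text{ is the machine}) \;\geq\; \Pr(\text{C misidentifies} \mid A \text{ is the man})$$

Both probabilities can only be estimated by playing the game repeatedly; a single conversation, however convincing, estimates neither.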
It can be seen that the test suggested by our Twitter user does not match the test formulated by Turing in any way. First, customer service representatives are not assigned the task of helping anyone distinguish them from machines. Second, customer service representatives, whether machine or human, have a job to do, and will not engage in arbitrary conversation on topics such as whether “shall I compare thee to a winter’s day” is a good line of poetry or not. Third, the requirement to measure error rates implies some formal setup, which is evidently missing in the example. Fourth, the interrogator (in Turing’s formulation) has to be explicitly comparing a human and a machine; it is not enough to be talking to one of them alone and trying to determine their identity.
Perhaps the closest to an attempt to follow Turing’s formulation is the Loebner Prize competition. But here too (1) the rules are not followed very closely, and (2) the setup makes no serious attempt to arrive at a good estimate of the machine’s ability to convincingly appear human. For one thing, while Turing does not specifically mention the amount of time, the time allowed in the competition is very short: at one time it was 5 minutes total (so approximately 2.5 minutes to question A and 2.5 minutes to question B), and more recently 25 minutes total, which is still only 12.5 minutes per participant.
The Turing Test has surely not been passed “as Turing formulated it.”
The “Spirit” of the Test Has Not Been Passed
Others who are more sensible admit that Turing’s test has not been passed literally, but would like to say that the test has been passed in its basic meaning or spirit.
While technically ChatGPT doesn’t pass the official Turing test, it turns out that’s only because the original Turing test is about subterfuge—the AI must pretend to be a human, and the question is whether a good human judge could ever tell the difference via just talking to it. This has always struck me as a strange definition of artificial intelligence because it involves lying. It’s really about how good the AI is at acting, and Turing’s interest in it, at least as I’ve always read it, was in service of a philosophical position about how minds work (what we would now call “substrate independence”), and is based on taking the most extreme case possible. As a practical matter, Turing’s test turns out to be a bad benchmark for AI. …
The issue with Turing’s test is not that it’s wrong, but that the standards are far too high: after all, the judge is supposed to be an expert, and trying to ferret out the answer as quickly as possible—they are in the mode of an inquisitor. With an expert and suspicious judge, it is incredibly hard to pass the imitation game—even for humans swapping in for other humans. Turing’s own analogy breaks down here. Against a discerning and suspicious judge, could the average man playing the imitation game as a woman really last 5 minutes? A lot of men would be stumped by the simple question of: “Name three brands of makeup that you’ve used, and what for.” Or, to move from Turing’s gender swap example, could an American really pass as a Canadian to a discerning judge familiar with Canadian culture? And so on. Just thinking through the details of the test makes it seem ill-defined, because one can just increase the distance between imitator and the imitated (e.g., Chinese for American) and ask simple questions about capitals and geography.
There may be a valid criticism of Turing here regarding the comparison between the frequency with which a man can fake being a woman and the frequency with which a machine can fake being human. But leave that aside: the criticism is wrong to attack the requirement of deception as a standard that is “too high.” Turing discusses this issue himself:
The game may perhaps be criticised on the ground that the odds are weighted too heavily against the machine. If the man were to try and pretend to be the machine he would clearly make a very poor showing. He would be given away at once by slowness and inaccuracy in arithmetic. May not machines carry out something which ought to be described as thinking but which is very different from what a man does? This objection is a very strong one, but at least we can say that if, nevertheless, a machine can be constructed to play the imitation game satisfactorily, we need not be troubled by this objection.
Recall that the intention behind Turing’s test was to “replace” the question “can machines think?” The point of his test is that if we cannot distinguish the machines from things that we know to think (in respect to “thinking” behaviors such as speech), then we have no reason to deny that the machines themselves can think.
Deception is therefore required for his test; you can object to the morality of this, you can object to its fairness, you can object however you like. But if you don’t have deception, you don’t have Turing’s test, not even the spirit of it. For without the deception, it cannot be said that you are unable to distinguish the things we know to think (humans) from the things we are testing for thinking. In that case, you have simply backed off from Turing’s entire project here. You will then have to explain what you believe “thinking” to be, and whether or not the machines can do it. Which is fine: but please do not say that anything has passed his test in some spiritual sense.
Without the deception, you no longer have Turing’s test, but a subjective test such as “does it feel to me like this thing is thinking?”
Not Almost Passed (Probably)
Others appear to believe that even if Turing’s test hasn’t been passed officially, surely that is just a technicality. At any rate we obviously (according to them) have intelligent machines:
It is good at game playing? Is it good at playing Turing’s game? Is it just a question of taking the time to actually try it?
I asked GPT-4 (I have nothing to add to this post from two years ago) whether it believed it could pass a real Turing Test. It immediately responded that it would expect to fail, due to its inability to discuss recent events.
Technicality easily fixed with web search? I suggested another test. I said I would ask it to compose a text, pause for 30 seconds, and then continue, without waiting for any reply from me. It admitted that it cannot do this; the structure just does not allow for it.
This also could be easily fixed by the programmers, by giving the model access to tokens that would cause some real-world effect instead of being printed on the screen; e.g. “text1 … [PAUSE30] … text2” could be programmed to have the desired effect.
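As a rough illustration, here is a minimal sketch of what such a mechanism might look like: a wrapper that scans the generated text for control tokens and executes them as real-world effects rather than printing them. The [PAUSE30] token comes from the example above; the function name and everything else in the code are hypothetical, not a feature of any actual model or API.

```python
import re
import time

# Control tokens of the form [PAUSEn], where n is a number of seconds.
# The token format follows the example in the text; this wrapper is a
# hypothetical sketch, not an existing feature of any model.
PAUSE_TOKEN = re.compile(r"\[PAUSE(\d+)\]")

def emit_with_effects(generated_text: str) -> None:
    """Print generated text, executing each [PAUSEn] token as a real pause."""
    position = 0
    for match in PAUSE_TOKEN.finditer(generated_text):
        # Print everything up to the token, then perform the effect
        # (an actual delay) instead of printing the token itself.
        print(generated_text[position:match.start()], end="", flush=True)
        time.sleep(int(match.group(1)))
        position = match.end()
    print(generated_text[position:])

emit_with_effects("text1 … [PAUSE30] … text2")
```

The same pattern would extend to any other effect, such as a token that triggers a web search; the point in each case is that the output is attached to something happening in the world rather than merely appearing on the screen.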
Note that in both of these cases, what is required is to attach the model’s outputs to real-world effects; both web search and the pause feature require this. Whether you do it a little or a lot, this is a significant architectural change. In fact, I pointed out in the linked post on the GPT models that the ability to act to affect the world, and to perceive the effects of its actions, was a fundamental requirement for an intelligent being that was missing from the original architecture. The base architecture cannot possibly pass the test; not even almost.
But, one says, this is not a discussion of GPT-4, but of the general issue. So as long as these architectural changes can be easily made, surely the test has almost been passed.
That is why the heading says “probably.” Even in the general post on “GPT-N,” we noted that in a sense only “small” architectural changes would be required for true intelligence. The problem is that, small or large, no one seems to know how to make them. And all of this assumes that simply as a matter of the text itself, the machine can present itself convincingly as human. This is unlikely with any current model. OpenAI itself says of GPT-4:
Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors). Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of a specific use-case.
But humans make errors too, you will say. The problem is that this makes no difference for our present question. The errors that humans make are recognizably different from the types of errors made by this model, and thus the machine will not be able to present itself convincingly.
Maybe someday, but we’re not there, and we’re not almost there.