Winograd Schema Challenge

The Winograd Schema Challenge (WSC) is a machine intelligence test proposed by Hector Levesque, a computer scientist at the University of Toronto. It is designed as an improvement on the Turing Test, which plays a central role in the philosophy of artificial intelligence.

General Idea

Turing suggested that instead of debating what intelligence is, the science of AI should focus on demonstrating intelligent behavior that can be tested. But the exact nature of Turing’s proposed test has come under scrutiny, especially after a chatbot named Eugene Goostman was claimed to have passed it in 2014. The Winograd Schema Challenge was proposed in part to address problems that had come to light due to the nature of the programs that performed well in the test.

Turing’s original proposal was what he called the imitation game, which involves free, unrestricted conversation in English between human judges and computer programs over a text-only channel (such as a teletype). In general, a machine passes the test if the judges cannot reliably tell it apart from a human in a five-minute conversation.


Eugene Goostman

On June 7, 2014, the computer program Eugene Goostman was announced as the first AI to pass the Turing test, in a competition held by the University of Reading in England. At the competition, Eugene convinced 33% of the judges that they were talking to a 13-year-old Ukrainian boy. This supposed victory of a thinking machine sparked controversy over the Turing test. Critics argued that Eugene passed the test simply by tricking the judges and taking advantage of its assumed identity. For example, the program could easily dodge key questions by joking and changing the subject, and the judges would forgive its mistakes because Eugene identified itself as a teenager who spoke English as a second language.

Weaknesses of the Turing Test

Eugene Goostman’s performance exposed several weaknesses of the Turing test. Levesque identifies the main problems, summarized as follows:

Deception: The machine is forced to construct a false identity, which is not part of intelligent behavior.

Conversation: Much of the interaction qualifies as “legitimate conversation” – jokes, witty remarks, points of order – without requiring rational reasoning.

Evaluation: Humans make mistakes, and judges often disagree on the results.


The WSC is a multiple-choice test that uses questions of a particular structure: they are instances of so-called Winograd schemas, named after Terry Winograd, professor of computer science at Stanford University.

On the surface, the machine must identify the antecedent of an ambiguous pronoun in a statement. This makes it a natural language processing task, but Levesque argues that for Winograd schemas the task requires the use of knowledge and commonsense reasoning.

A key factor in the WSC is the unique format of questions derived from Winograd schemas. Questions of this form can be adapted to require knowledge and commonsense reasoning in a variety of domains. They must also be written carefully so as not to give away their answers through selectional restrictions or statistical information about the words in the sentence.


The first cited example of a Winograd schema (and the reason for the name) comes from Terry Winograd:

The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.

The choices “feared” and “advocated” turn the schema into its two instances:

The city councilmen refused the demonstrators a permit because they feared violence.

The city councilmen refused the demonstrators a permit because they advocated violence.

The question is whether the pronoun “they” refers to the city councilmen or the demonstrators, and switching between the two instances of the schema changes the answer. The answer is obvious to a human reader but difficult to reproduce in machines. Levesque argues that knowledge plays a central role in these problems: the answer to this schema depends on our understanding of the typical relationship between councilmen and demonstrators and how each tends to behave.

Starting with the original proposal for the Winograd Schema Challenge, Ernest Davis, a professor at New York University, has compiled a list of over 140 Winograd Schemas from various sources as examples of the types of questions that should appear in the Winograd Schema Challenge.

Formal description

A Winograd Schema Challenge question consists of three parts:

  1. A sentence or brief discourse that contains the following:
  • two noun phrases of the same semantic class (masculine, feminine, inanimate, or a group of objects or people),
  • an ambiguous pronoun that may refer to either of the noun phrases, and
  • a special word and an alternate word, such that if the special word is replaced by the alternate word, the natural resolution of the pronoun changes.
  2. A question asking for the identity of the ambiguous pronoun, and
  3. Two answer choices corresponding to the two noun phrases.

The machine is given the problem in a standardized form that includes the answer choices, making it a binary decision problem.
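The three-part structure and the binary decision format described above can be sketched as a small data structure. This is an illustrative sketch only: the field names and representation are assumptions, not taken from any official WSC dataset format.

```python
from dataclasses import dataclass

@dataclass
class WinogradSchema:
    """One Winograd schema: a sentence template with a special/alternate
    word pair, an ambiguous pronoun, and the two candidate antecedents.
    (Field names are illustrative, not an official format.)"""
    template: str      # sentence with a {word} placeholder
    special: str       # the special word
    alternate: str     # the alternate word
    pronoun: str       # the ambiguous pronoun
    candidates: tuple  # the two noun phrases (the answer choices)
    answers: dict      # maps each word to the correct antecedent

    def instances(self):
        """Yield the two concrete binary-choice problems:
        (sentence, answer choices, correct answer)."""
        for word in (self.special, self.alternate):
            yield (self.template.format(word=word),
                   self.candidates,
                   self.answers[word])

# Winograd's original example, encoded as one schema:
schema = WinogradSchema(
    template=("The city councilmen refused the demonstrators a permit "
              "because they {word} violence."),
    special="feared",
    alternate="advocated",
    pronoun="they",
    candidates=("the city councilmen", "the demonstrators"),
    answers={"feared": "the city councilmen",
             "advocated": "the demonstrators"},
)

for sentence, choices, answer in schema.instances():
    print(sentence, "->", answer)
```

A test program would be handed each generated sentence plus the two answer choices and scored on whether it picks the correct antecedent.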


The Winograd Schema Challenge has the following purported advantages:

  • Solving them requires knowledge and common sense.
  • Winograd schemas of varying difficulty can be devised, involving anything from simple cause-and-effect relationships to complex narratives of events.
  • They can be created to test reasoning ability in certain areas (e.g. social/psychological or spatial reasoning).
  • There is no need for human judges.


One of the difficulties in mounting the Winograd Schema Challenge is developing the questions. They must be carefully tailored to ensure that common sense is required to solve them. For example, Levesque gives the following example of a Winograd schema that is “too easy”:

The women stopped taking the pills because they were [pregnant/carcinogenic]. Which were [pregnant/carcinogenic], the women or the pills?

The answer to this question can be determined from selectional restrictions alone: in any context, pills cannot be pregnant while women can, and women cannot be carcinogenic while pills can. Thus the answer can be obtained without reasoning, and without any understanding of the meaning of the sentence – all that is needed is data on the selectional restrictions of “pregnant” and “carcinogenic”.
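The point about selectional restrictions can be made concrete with a toy lookup table: a program with no understanding of the sentence can still answer the “too easy” schema correctly, simply because only one candidate is compatible with each predicate. The table and function names here are hypothetical, hand-coded for illustration.

```python
# A toy selectional-restrictions table: which predicates each noun
# phrase can plausibly satisfy. (Hand-coded for illustration only.)
CAN_BE = {
    "the women": {"pregnant"},
    "the pills": {"carcinogenic"},
}

def resolve_by_selection(candidates, predicate):
    """Pick the candidate compatible with the predicate.
    Returns None if the restrictions do not disambiguate,
    as in a well-constructed Winograd schema."""
    fits = [c for c in candidates if predicate in CAN_BE.get(c, set())]
    return fits[0] if len(fits) == 1 else None

print(resolve_by_selection(("the women", "the pills"), "pregnant"))      # the women
print(resolve_by_selection(("the women", "the pills"), "carcinogenic"))  # the pills
```

By contrast, in Winograd’s councilmen/demonstrators example both candidates can plausibly “fear” or “advocate” violence, so a lookup of this kind returns nothing and genuine commonsense reasoning is needed.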