Sapir–Whorf hypothesis (Linguistic Relativity Hypothesis)

Mia Belle Frothingham

Author, Researcher, Science Communicator

BA with minors in Psychology and Biology, MRes University of Edinburgh

Mia Belle Frothingham is a Harvard University graduate with a Bachelor of Arts in Sciences with minors in biology and psychology.

Saul Mcleod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul Mcleod, PhD, is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.

Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.

There are about seven thousand languages spoken around the world, each with its own sounds, vocabulary, and structure. As you know, language plays a significant role in our lives.

But one intriguing question is – can it actually affect how we think?

It is widely thought that spoken words simply express how one perceives the world, and that this perception is precisely the same as reality.

That is, perception and expression are understood to be synonymous, and it is assumed that speech is based on thoughts. On this view, what one says depends on how the world is encoded and decoded in the mind.

However, many believe the opposite.

That is, what one perceives depends on the spoken word: thought depends on language, not the other way around.

What Is The Sapir-Whorf Hypothesis?

Twentieth-century linguists Edward Sapir and Benjamin Lee Whorf are known for this very principle and its popularization. The theory associated with them, known as the Sapir-Whorf Hypothesis or the Theory of Linguistic Relativity, holds great significance across communication theory.

The Sapir-Whorf hypothesis states that the grammatical and verbal structure of a person’s language influences how they perceive the world. It emphasizes that language either determines or influences one’s thoughts.

On this account, people experience the world based on the structure of their language, and linguistic categories shape and limit cognitive processes. Differences in language are proposed to affect thought, perception, and behavior, so speakers of different languages think and act differently.

For example, words do not map neatly from one language to another: not every word has an exact one-to-one translation in a foreign language.

Because of these small but crucial differences, using the wrong word within a particular language can have significant consequences.

The Sapir-Whorf hypothesis is sometimes called “linguistic relativity” or the “principle of linguistic relativity.” Although the names differ slightly, they refer to the same basic proposal about the relationship between language and thought.

How Language Influences Culture

Culture is defined by the values, norms, and beliefs of a society. Our culture can be considered a lens through which we experience the world and develop a shared meaning of what occurs around us.

The language that we create and use arises in response to cultural and societal needs. In other words, there is an apparent relationship between how we talk and how we perceive the world.

One crucial question that many intellectuals have asked is how our society’s language influences its culture.

Linguist and anthropologist Edward Sapir and his then-student Benjamin Whorf were interested in answering this question.

Their work gave rise to the Sapir-Whorf hypothesis, which holds that our thought processes predominantly determine how we look at the world, and that our language in turn restricts those thought processes.

Put simply, the language that we use shapes the way we think and how we see the world.

Since the Sapir-Whorf hypothesis theorizes that our language use shapes our perspective of the world, it follows that people who speak different languages have different views of the world.

In the early 1930s, Benjamin Whorf was studying at Yale University with linguist Edward Sapir, who was considered the father of American linguistic anthropology.

Sapir was responsible for documenting and recording the cultures and languages of many Native American tribes, which were disappearing at an alarming rate. He and his predecessors were well aware of the close relationship between language and culture.

Anthropologists like Sapir need to learn the language of the culture they are studying to truly understand the worldview of its speakers. Whorf believed that the opposite is also true: that language affects culture by influencing how its speakers think.

His hypothesis proposed that the words and structures of a language influence how its speaker behaves and feels about the world and, ultimately, the culture itself.

Simply put, Whorf believed that you see the world differently from another person who speaks another language due to the specific language you speak.

As Sapir put it, human beings do not live in the objective world alone, nor alone in the world of social activity as ordinarily understood, but are very much at the mercy of the particular language which has become the medium of communication and expression for their society.

To a large extent, the real world is unconsciously built on the language habits of the group. We hear and see and otherwise experience broadly as we do because the language habits of our community predispose certain choices of interpretation.

Studies & Examples

The lexicon, or vocabulary, is the inventory of the things a culture talks about and has classified in order to understand the world around it and deal with it effectively.

For example, modern life is dictated for many by the need to travel by some vehicle – cars, buses, trucks, SUVs, trains, etc. We therefore have thousands of words for talking about them, including vehicle types, models, parts, and brands.

The most influential aspects of each culture are similarly reflected in the lexicon of its language. Among the societies living on the islands of the Pacific, fish have significant economic and cultural importance.

This is reflected in the rich vocabulary that describes all aspects of the fish and the environments that islanders depend on for survival.

For example, there are over 1,000 fish species in Palau, and Palauan fishers knew, even long before biologists existed, details about the anatomy, behavior, growth patterns, and habitat of most of them – far more than modern biologists know today.

Whorf’s studies at Yale involved working with many Native American languages, including Hopi. He discovered that the Hopi language is quite different from English in many ways, especially regarding time.

Western cultures and languages view time as a flowing river that carries us continuously through the present, away from the past, and to the future.

Our grammar and system of verbs reflect this concept with particular tenses for past, present, and future.

We perceive this concept of time as universal in that all humans see it in the same way.

A speaker of Hopi, however, has very different ideas, and the structure of the language both reflects and shapes the way they think about time. According to Whorf, the Hopi language has no present, past, or future tense; instead, it divides the world into manifested and unmanifest domains.

The manifested domain consists of the physical universe, including the present and the immediate past and future; the unmanifest domain consists of the remote past and future, along with the world of dreams, thoughts, desires, and life forces.

Also, there are no words for minutes, hours, or days of the week. Native Hopi speakers often had great difficulty adapting to life in the English-speaking world when it came to being on time for their job or other affairs.

This was due to the simple fact that it was not how they had been conditioned to behave concerning time in their Hopi world, which followed the phases of the moon and the movements of the sun.

Today, it is widely believed that some aspects of perception are affected by language.

One big problem with the original Sapir-Whorf hypothesis derives from the idea that if a person’s language has no word for a specific concept, then that person would not understand that concept.

In fact, the idea that a mother tongue can restrict one’s understanding has been largely rejected. For example, German has a term, Schadenfreude, that means to take pleasure in another person’s unhappiness.

While there is no translatable equivalent in English, it just would not be accurate to say that English speakers have never experienced or would not be able to comprehend this emotion.

Just because there is no word for this in the English language does not mean English speakers are less equipped to feel or experience the meaning of the word.

There is also a “chicken and egg” problem with the theory.

Languages are human creations, very much tools we invented and honed to suit our needs. Merely showing that speakers of diverse languages think differently does not tell us whether it is the language that shapes thought or the other way around.

Supporting Evidence

On the other hand, there is evidence that the language-associated habits we acquire play a role in how we view the world. This is especially true for languages that attach genders to inanimate objects.

One study looked at how German and Spanish speakers describe objects based on the grammatical gender those objects carry in each language.

The results showed that when describing things that are grammatically masculine in Spanish, Spanish speakers attributed to them more male characteristics like “strong” and “long.” Conversely, the same items, which are grammatically feminine in German, were described by German speakers with more feminine characteristics, like “beautiful” and “elegant.”

The findings imply that speakers of each language have developed preconceived notions of something being feminine or masculine, not due to the objects’ characteristics or appearances but because of how they are categorized in their native language.

It is important to remember that the Theory of Linguistic Relativity (Sapir-Whorf Hypothesis) also remains open-ended. It is offered as a window through which to view the cognitive process, not as an absolute.

It is set forth as a way of looking at a phenomenon differently than one usually would. Furthermore, the Sapir-Whorf Hypothesis is very simple and logically sound. Understandably, one’s environment and culture will affect how messages are decoded.

Likewise, studies done by the authors of the theory found that many Native American languages lack a word for particular things because those things do not exist in their speakers’ lives. The logical simplicity of this idea of relativism gives the theory parsimony.

Truly, the Sapir-Whorf Hypothesis makes sense. It can be used to describe a great many misunderstandings in everyday life. When a Pennsylvanian says “yuns,” it may not make any sense to a Californian, but on examination, it is just another word for “you all.”

The Linguistic Relativity Theory addresses this and suggests that it is all relative. This concept of relativity extends beyond dialect boundaries and into the world of language, from country to country and, consequently, from mind to mind.

Does language arise from thought, or does thought arise from language? The Sapir-Whorf Hypothesis presents a view of reality being expressed in language and thus forming in thought.

The principles it sets out offer a reasonable and even simple idea of how one perceives the world, but the question is still open: thought then language, or language then thought?

Modern Relevance

Despite its age, the Sapir-Whorf hypothesis, or the Linguistic Relativity Theory, has continued to find its way into linguistic conversations and even into pop culture.

The idea was revisited in the 2016 movie “Arrival,” a science fiction film that engagingly explores the ways in which an alien language can affect and alter human thinking.

And even if some of the most drastic claims of the theory have been debunked or argued against, the idea has continued its relevance, and that does say something about its importance.

Hypotheses, thoughts, and intellectual musings do not need to be totally accurate to remain in the public eye as long as they make us think and question the world – and the Sapir-Whorf Hypothesis does precisely that.

The theory makes us question not only linguistic theory and our own language but also our very existence and how our perceptions might shape what exists in this world.

There are generalities that we can expect every person to encounter in their day-to-day life – in relationships, love, work, sadness, and so on. But thinking about the more granular disparities experienced by those in diverse circumstances, linguistic or otherwise, helps us realize that there is more to the story than ours.

And beautifully, at the same time, the Sapir-Whorf Hypothesis reiterates the fact that we are more alike than we are different, regardless of the language we speak.

Isn’t it amazing that linguistic diversity reveals how ingenious and flexible the human mind is? Human minds have invented not one cognitive universe but, indeed, seven thousand.

Kay, P., & Kempton, W. (1984). What is the Sapir‐Whorf hypothesis?. American anthropologist, 86(1), 65-79.

Whorf, B. L. (1952). Language, mind, and reality. ETC: A review of general semantics, 167-188.

Whorf, B. L. (1997). The relation of habitual thought and behavior to language. In Sociolinguistics (pp. 443-463). Palgrave, London.

Whorf, B. L. (2012). Language, thought, and reality: Selected writings of Benjamin Lee Whorf. MIT press.


Supplement to Philosophy of Linguistics

Whorfianism

Emergentists tend to follow Edward Sapir in taking an interest in interlinguistic and intralinguistic variation. Linguistic anthropologists have explicitly taken up the task of defending a famous claim associated with Sapir that connects linguistic variation to differences in thinking and cognition more generally. The claim is very often referred to as the Sapir-Whorf Hypothesis (though this is a largely infelicitous label, as we shall see).

This topic is closely related to various forms of relativism—epistemological, ontological, conceptual, and moral—and its general outlines are discussed elsewhere in this encyclopedia; see the section on language in the Summer 2015 archived version of the entry on relativism (§3.1). Cultural versions of moral relativism suggest that, given how much cultures differ, what is moral for you might depend on the culture you were brought up in. A somewhat analogous view would suggest that, given how much language structures differ, what is thinkable for you might depend on the language you use. (This is actually a kind of conceptual relativism, but it is generally called linguistic relativism, and we will continue that practice.)

Even a brief survey of the vast literature on the topic is not remotely feasible in this article; and the primary literature is in any case more often polemical than enlightening. It certainly holds no general answer to what science has discovered about the influences of language on thought. Here we offer just a limited discussion of the alleged hypothesis and the rhetoric used in discussing it, the vapid and not so vapid forms it takes, and the prospects for actually devising testable scientific hypotheses about the influence of language on thought.

Whorf himself did not offer a hypothesis. He presented his “new principle of linguistic relativity” (Whorf 1956: 214) as a fact discovered by linguistic analysis:

When linguists became able to examine critically and scientifically a large number of languages of widely different patterns, their base of reference was expanded; they experienced an interruption of phenomena hitherto held universal, and a whole new order of significances came into their ken. It was found that the background linguistic system (in other words, the grammar) of each language is not merely a reproducing instrument for voicing ideas but rather is itself the shaper of ideas, the program and guide for the individual’s mental activity, for his analysis of impressions, for his synthesis of his mental stock in trade. Formulation of ideas is not an independent process, strictly rational in the old sense, but is part of a particular grammar, and differs, from slightly to greatly, between different grammars. We dissect nature along lines laid down by our native languages. The categories and types that we isolate from the world of phenomena we do not find there because they stare every observer in the face; on the contrary, the world is presented in a kaleidoscopic flux of impressions which has to be organized by our minds—and this means largely by the linguistic systems in our minds. We cut nature up, organize it into concepts, and ascribe significances as we do, largely because we are parties to an agreement to organize it in this way—an agreement that holds throughout our speech community and is codified in the patterns of our language. The agreement is, of course, an implicit and unstated one, but its terms are absolutely obligatory ; we cannot talk at all except by subscribing to the organization and classification of data which the agreement decrees. (Whorf 1956: 212–214; emphasis in original)

Later, Whorf’s speculations about the “sensuously and operationally different” character of different snow types for “an Eskimo” (Whorf 1956: 216) developed into a familiar journalistic meme about the Inuit having dozens or scores or hundreds of words for snow; but few who repeat that urban legend recall Whorf’s emphasis on its being grammar, rather than lexicon, that cuts up and organizes nature for us.

In an article written in 1937, posthumously published in an academic journal (Whorf 1956: 87–101), Whorf clarifies what is most important about the effects of language on thought and world-view. He distinguishes ‘phenotypes’, which are overt grammatical categories typically indicated by morphemic markers, from what he called ‘cryptotypes’, which are covert grammatical categories, marked only implicitly by distributional patterns in a language that are not immediately apparent. In English, the past tense would be an example of a phenotype (it is marked by the -ed suffix in all regular verbs). Gender in personal names and common nouns would be an example of a cryptotype, not systematically marked by anything. In a cryptotype, “class membership of the word is not apparent until there is a question of using it or referring to it in one of these special types of sentence, and then we find that this word belongs to a class requiring some sort of distinctive treatment, which may even be the negative treatment of excluding that type of sentence” (p. 89).

Whorf’s point is the familiar one that linguistic structure is comprised, in part, of distributional patterns in language use that are not explicitly marked. What follows from this, according to Whorf, is not that the existing lexemes in a language (like its words for snow) comprise covert linguistic structure, but that patterns shared by word classes constitute linguistic structure. In ‘Language, mind, and reality’ (1942; published posthumously in Theosophist, a magazine published in India for the followers of the 19th-century spiritualist Helena Blavatsky) he wrote:

Because of the systematic, configurative nature of higher mind, the “patternment” aspect of language always overrides and controls the “lexation” … or name-giving aspect. Hence the meanings of specific words are less important than we fondly fancy. Sentences, not words, are the essence of speech, just as equations and functions, and not bare numbers, are the real meat of mathematics. We are all mistaken in our common belief that any word has an “exact meaning.” We have seen that the higher mind deals in symbols that have no fixed reference to anything, but are like blank checks, to be filled in as required, that stand for “any value” of a given variable, like … the x, y, z of algebra. (Whorf 1942: 258)

Whorf apparently thought that only personal and proper names have an exact meaning or reference (Whorf 1956: 259).

For Whorf, it was an unquestionable fact that language influences thought to some degree:

Actually, thinking is most mysterious, and by far the greatest light upon it that we have is thrown by the study of language. This study shows that the forms of a person’s thoughts are controlled by inexorable laws of pattern of which he is unconscious. These patterns are the unperceived intricate systematizations of his own language—shown readily enough by a candid comparison and contrast with other languages, especially those of a different linguistic family. His thinking itself is in a language—in English, in Sanskrit, in Chinese. [footnote omitted] And every language is a vast pattern-system, different from others, in which are culturally ordained the forms and categories by which the personality not only communicates, but analyzes nature, notices or neglects types of relationship and phenomena, channels his reasoning, and builds the house of his consciousness. (Whorf 1956: 252)

He seems to regard it as necessarily true that language affects thought, given

  • the fact that language must be used in order to think, and
  • the facts about language structure that linguistic analysis discovers.

He also seems to presume that the only structure and logic that thought has is grammatical structure. These views are not the ones that after Whorf’s death came to be known as ‘the Sapir-Whorf Hypothesis’ (a sobriquet due to Hoijer 1954). Nor are they what was called the ‘Whorf thesis’ by Brown and Lenneberg (1954) which was concerned with the relation of obligatory lexical distinctions and thought. Brown and Lenneberg (1954) investigated this question by looking at the relation of color terminology in a language and the classificatory abilities of the speakers of that language. The issue of the relation between obligatory lexical distinctions and thought is at the heart of what is now called ‘the Sapir-Whorf Hypothesis’ or ‘the Whorf Hypothesis’ or ‘Whorfianism’.

1. Banal Whorfianism

No one is going to be impressed with a claim that some aspect of your language may affect how you think in some way or other; that is neither a philosophical thesis nor a psychological hypothesis. So it is appropriate to set aside entirely the kind of so-called hypotheses that Steven Pinker presents in The Stuff of Thought (2007: 126–128) as “five banal versions of the Whorfian hypothesis”:

  • “Language affects thought because we get much of our knowledge through reading and conversation.”
  • “A sentence can frame an event, affecting the way people construe it.”
  • “The stock of words in a language reflects the kinds of things its speakers deal with in their lives and hence think about.”
  • “[I]f one uses the word language in a loose way to refer to meanings,… then language is thought.”
  • “When people think about an entity, among the many attributes they can think about is its name.”

These are just truisms, unrelated to any serious issue about linguistic relativism.

We should also set aside some methodological versions of linguistic relativism discussed in anthropology. It may be excellent advice to a budding anthropologist to be aware of linguistic diversity, and to be on the lookout for ways in which your language may affect your judgment of other cultures; but such advice does not constitute a hypothesis.

2. The so-called Sapir-Whorf hypothesis

The term “Sapir-Whorf Hypothesis” was coined by Harry Hoijer in his contribution (Hoijer 1954) to a conference on the work of Benjamin Lee Whorf in 1953. But anyone looking in Hoijer’s paper for a clear statement of the hypothesis will look in vain. Curiously, despite his stated intent “to review and clarify the Sapir-Whorf hypothesis” (1954: 93), Hoijer did not even attempt to state it. The closest he came was this:

The central idea of the Sapir-Whorf hypothesis is that language functions, not simply as a device for reporting experience, but also, and more significantly, as a way of defining experience for its speakers.

The claim that “language functions…as a way of defining experience” appears to be offered as a kind of vague metaphysical insight rather than either a statement of linguistic relativism or a testable hypothesis.

And if Hoijer seriously meant that what qualitative experiences a speaker can have are constituted by that speaker’s language, then surely the claim is false. There is no reason to doubt that non-linguistic sentient creatures like cats can experience (for example) pain or heat or hunger, so having a language is not a necessary condition for having experiences. And it is surely not sufficient either: a robot with a sophisticated natural language processing capacity could be designed without the capacity for conscious experience.

In short, it is a mystery what Hoijer meant by his “central idea”.

Vague remarks of the same loosely metaphysical sort have continued to be a feature of the literature down to the present. The statements made in some recent papers, even in respected refereed journals, contain non-sequiturs echoing some of the remarks of Sapir, Whorf, and Hoijer. And they come from both sides of the debate.

3. Anti-Whorfian rhetoric

Lila Gleitman is an Essentialist on the other side of the contemporary debate: she is against linguistic relativism, and against the broadly Whorfian work of Stephen Levinson’s group at the Max Planck Institute for Psycholinguistics. In the context of criticizing a particular research design, Li and Gleitman (2002) quote Whorf’s claim that “language is the factor that limits free plasticity and rigidifies channels of development”. But in the claim cited, Whorf seems to be making a psychological point that holds universally of human conceptual development, not claiming that linguistic relativism is true.

Li and Gleitman then claim (p. 266) that such (Whorfian) views “have diminished considerably in academic favor” in part because of “the universalist position of Chomskian linguistics, with its potential for explaining the striking similarity of language learning in children all over the world.” But there is no clear conflict or even a conceptual connection between Whorf’s views about language placing limits on developmental plasticity, and Chomsky’s thesis of an innate universal architecture for syntax. In short, there is no reason why Chomsky’s I-languages could not be innately constrained, but (once acquired) cognitively and developmentally constraining.

For example, the supposedly deep linguistic universal of ‘recursion’ (Hauser et al. 2002) is surely quite independent of whether the inventory of color-name lexemes in your language influences the speed with which you can discriminate between color chips. And conversely, universal tendencies in color naming across languages (Kay and Regier 2006) do not show that color-naming differences among languages are without effect on categorical perception (Thierry et al. 2009).

4. Strong and weak Whorfianism

One of the first linguists to defend a general form of universalism against linguistic relativism, thus presupposing that they conflict, was Julia Penn (1972). She was also an early popularizer of the distinction between ‘strong’ and ‘weak’ formulations of the Sapir-Whorf Hypothesis (and an opponent of the ‘strong’ version).

‘Weak’ versions of Whorfianism state that language influences or defeasibly shapes thought. ‘Strong’ versions state that language determines thought, or fixes it in some way. The weak versions are commonly dismissed as banal (because of course there must be some influence), and the stronger versions as implausible.

The weak versions are considered banal because they are not adequately formulated as testable hypotheses that could conflict with relevant evidence about language and thought.

Why would the strong versions be thought implausible? For a language to make us think in a particular way, it might seem that it must at least temporarily prevent us from thinking in other ways, and thus make some thoughts not only inexpressible but unthinkable. If this were true, then strong Whorfianism would conflict with the Katzian effability claim. There would be thoughts that a person couldn’t think because of the language(s) they speak.

Some are fascinated by the idea that there are inaccessible thoughts; and the notion that learning a new language gives access to entirely new thoughts and concepts seems to be a staple of popular writing about the virtues of learning languages. But many scientists and philosophers intuitively rebel against violations of effability: thinking about concepts that no one has yet named is part of their job description.

The resolution lies in seeing that language could affect certain aspects of our cognitive functioning without making certain thoughts unthinkable for us.

For example, Greek has separate terms for what we call light blue and dark blue, and no word meaning what ‘blue’ means in English: Greek forces a choice on this distinction. Experiments have shown (Thierry et al. 2009) that native speakers of Greek react faster when categorizing light blue and dark blue color chips—apparently a genuine effect of language on thought. But that does not make English speakers blind to the distinction, or imply that Greek speakers cannot grasp the idea of a hue falling somewhere between green and violet in the spectrum.

There is no general or global ineffability problem. There is, though, a peculiar aspect of strong Whorfian claims, giving them a local analog of ineffability: the content of such a claim cannot be expressed in any language it is true of . This does not make the claims self-undermining (as with the standard objections to relativism); it doesn’t even mean that they are untestable. They are somewhat anomalous, but nothing follows concerning the speakers of the language in question (except that they cannot state the hypothesis using the basic vocabulary and grammar that they ordinarily use).

If there were a true hypothesis about the limits that basic English vocabulary and constructions put on what English speakers can think, the hypothesis would turn out to be inexpressible in English, using basic vocabulary and the usual repertoire of constructions. That might mean it would be hard for us to discuss it in an article in English unless we used terminological innovations or syntactic workarounds. But that doesn’t imply anything about English speakers’ ability to grasp concepts, or to develop new ways of expressing them by coining new words or elaborated syntax.

5. Constructing and evaluating Whorfian hypotheses

A number of considerations are relevant to formulating, testing, and evaluating Whorfian hypotheses.

Genuine hypotheses about the effects of language on thought will always have a duality: there will be a linguistic part and a non-linguistic one. The linguistic part will involve a claim that some feature is present in one language but absent in another.

Whorf himself saw that only obligatory features of languages could establish “mental patterns” or “habitual thought” (Whorf 1956: 139). If a feature were optional, speakers could express themselves either with it or without it, so it would not be a case of “constraining the conceptual structure”. We will likewise restrict our attention to obligatory features here.

Examples of relevant obligatory features would include lexical distinctions like the light vs. dark blue forced choice in Greek, or the forced choice between “in (fitting tightly)” vs. “in (fitting loosely)” in Korean. They also include grammatical distinctions like the forced choice in Spanish 2nd-person pronouns between informal/intimate and formal/distant (informal tú vs. formal usted in the singular; informal vosotros vs. formal ustedes in the plural), or the forced choice in Tamil 1st-person plural pronouns between inclusive (“we = me and you and perhaps others”) and exclusive (“we = me and others not including you”).

The non-linguistic part of a Whorfian hypothesis will contrast the psychological effects that habitually using the two languages has on their speakers. For example, one might conjecture that the habitual use of Spanish induces its speakers to be sensitive to the formal and informal character of the speaker’s relationship with their interlocutor while habitually using English does not.

So testing Whorfian hypotheses requires testing two independent hypotheses with the appropriate kinds of data. In consequence, evaluating them requires the expertise of both linguistics and psychology, and is a multidisciplinary enterprise. Clearly, the linguistic hypothesis may hold up where the psychological hypothesis does not, or conversely.

In addition, if linguists discovered that some linguistic feature was optional in two different languages, then even if psychological experiments showed differences between the two populations of speakers, this would not show linguistic determination or influence. The cognitive differences might depend on (say) cultural differences.

A further important consideration concerns the strength of the inducement relationship that a Whorfian hypothesis posits between a speaker’s language and their non-linguistic capacities. The claim that your language shapes or influences your cognition is quite different from the claim that your language makes certain kinds of cognition impossible (or obligatory) for you. The strength of any Whorfian hypothesis will vary depending on the kind of relationship being claimed, and the ease of revisability of that relation.

A testable Whorfian hypothesis will have a schematic form something like this:

  • Linguistic part: Feature F is obligatory in L1 but optional in L2.
  • Psychological part: Speaking a language with obligatory feature F bears relation R to the cognitive effect C.

The relation R might in principle be causation or determination, but it is important to see that it might merely be correlation, or slight favoring; and the non-linguistic cognitive effect C might be readily suppressible or revisable.
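As a rough illustration, the two-part schema can be written down as a small data structure. This is only a sketch under stated assumptions: the class and field names (`WhorfianHypothesis`, `Relation`, `effect_is_revisable`) are hypothetical labels introduced here, and the sample instance simply restates the Spanish pronoun conjecture discussed earlier, with "slight favoring" chosen arbitrarily as the value of R.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """Candidate readings of the relation R (hypothetical labels)."""
    DETERMINATION = "determines"
    CAUSATION = "causes"
    SLIGHT_FAVORING = "slightly favors"
    CORRELATION = "is correlated with"


@dataclass(frozen=True)
class WhorfianHypothesis:
    feature: str                      # F: the linguistic feature
    obligatory_in: str                # L1: language where F is obligatory
    optional_in: str                  # L2: language where F is optional
    relation: Relation                # R: how F relates to the effect
    cognitive_effect: str             # C: the non-linguistic effect
    effect_is_revisable: bool = True  # assumption: effects may be suppressible

    def linguistic_part(self) -> str:
        # The claim a linguist would need to check.
        return (f"{self.feature} is obligatory in {self.obligatory_in} "
                f"but optional in {self.optional_in}.")

    def psychological_part(self) -> str:
        # The claim a psychologist would need to check.
        return (f"Speaking a language with obligatory {self.feature} "
                f"{self.relation.value} {self.cognitive_effect}.")


# Illustrative instance only; the choice of R here is not a finding.
spanish_pronouns = WhorfianHypothesis(
    feature="the formal/informal 2nd-person pronoun distinction",
    obligatory_in="Spanish",
    optional_in="English",
    relation=Relation.SLIGHT_FAVORING,
    cognitive_effect="sensitivity to the formality of one's relationship with an interlocutor",
)

print(spanish_pronouns.linguistic_part())
print(spanish_pronouns.psychological_part())
```

Keeping the two parts as separate methods mirrors the point that each is an independent hypothesis: either one may hold up where the other does not.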

Dan Slobin (1996) presents a view that competes with Whorfian hypotheses as standardly understood. He hypothesizes that when the speakers are using their cognitive abilities in the service of a linguistic ability (speaking, writing, translating, etc.), the language they are planning to use to express their thought will have a temporary online effect on how they express their thought. The claim is that as long as language users are thinking in order to frame their speech or writing or translation in some language, the mandatory features of that language will influence the way they think.

On Slobin’s view, these effects quickly attenuate as soon as the activity of thinking for speaking ends. For example, if a speaker is thinking for writing in Spanish, then Slobin’s hypothesis would predict that given the obligatory formal/informal 2nd-person pronoun distinction they would pay greater attention to the formal/informal character of their social relationships with their audience than if they were writing in English. But this effect is not permanent. As soon as they stop thinking for speaking, the effect of Spanish on their thought ends.

Slobin’s non-Whorfian linguistic relativist hypothesis raises the importance of psychological research on bilinguals, or people who currently use two or more languages with a native or near-native facility. This is because one clear way to test Slobin-like hypotheses relative to Whorfian hypotheses would be to find out whether language-correlated non-linguistic cognitive differences between speakers hold for bilinguals only when they are thinking for speaking in one language, but not when they are thinking for speaking in some other language. If the relevant cognitive differences appeared and disappeared depending on which language speakers were planning to express themselves in, it would go some way to vindicate Slobin-like hypotheses over more traditional Whorfian hypotheses. Of course, one could alternately accept a broadening of Whorfian hypotheses to include Slobin-like evanescent effects. Either way, attention must be paid to the persistence and revisability of the linguistic effects.
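The logic of this bilingual test can be sketched as a simple decision rule. The function below is a hypothetical illustration, not an established procedure: it assumes a bilingual who speaks a language with obligatory feature F and another language without it, and it assumes an experimenter has already decided (by whatever statistical criterion) whether the cognitive effect was observed in each condition.

```python
def reading_favored(effect_when_using_f_language: bool,
                    effect_when_using_other_language: bool) -> str:
    """Say which family of hypotheses a bilingual result is consistent with.

    Both arguments are assumed outcomes for the same bilingual speakers:
    whether the cognitive effect showed up while they were thinking for
    speaking in the language with obligatory feature F, and while they
    were thinking for speaking in their other language.
    """
    if effect_when_using_f_language and not effect_when_using_other_language:
        # The effect appears and disappears with the language in use.
        return "Slobin-like: the effect tracks the language in use"
    if effect_when_using_f_language and effect_when_using_other_language:
        # The effect persists even when the other language is in use.
        return "traditional Whorfian: the effect persists across languages"
    if not effect_when_using_f_language and not effect_when_using_other_language:
        return "neither: no effect observed"
    # Effect only when the obligatory feature is NOT in play.
    return "anomalous: not predicted by either family"


print(reading_favored(True, False))
print(reading_favored(True, True))
```

A real study would replace the booleans with measured effect sizes and would still need to rule out cultural or task-related explanations; the sketch only captures the contrast between transient and persistent effects.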

Kousta et al. (2008) show that “for bilinguals there is intraspeaker relativity in semantic representations and, therefore, [grammatical] gender does not have a conceptual, non-linguistic effect” (843). Grammatical gender is obligatory in the languages in which it occurs and has been claimed by Whorfians to have persistent and enduring non-linguistic effects on representations of objects (Boroditsky et al. 2003). However, Kousta et al. support the claim that bilinguals’ semantic representations vary depending on which language they are using, so that the effects of grammatical gender are transient. This suggests that although some semantic representations of objects may vary from language to language, their non-linguistic cognitive effects are transitory.

Some advocates of Whorfianism have held that if Whorfian hypotheses were true, then meaning would be globally and radically indeterminate. Thus, the truth of Whorfian hypotheses is equated with global linguistic relativism, a well-known self-undermining form of relativism. But as we have seen, not all Whorfian hypotheses are global hypotheses: they are about what is induced by particular linguistic features. And the associated non-linguistic perceptual and cognitive differences can be quite small, perhaps insignificant. For example, Thierry et al. (2009) provide evidence that an obligatory lexical distinction between light and dark blue affects Greek speakers’ color perception in the left hemisphere only. And the question of the degree to which this affects sensuous experience is not addressed.

The fact that Whorfian hypotheses need not be global linguistic relativist hypotheses means that they do not conflict with the claim that there are language universals. Structuralists of the first half of the 20th century tended to disfavor the idea of universals: Martin Joos’s characterization of structuralist linguistics as claiming that “languages can differ without limit as to either extent or direction” (Joos 1966, 228) has been much quoted in this connection. If the claim that languages can vary without limit were conjoined with the claim that languages have significant and permanent effects on the concepts and worldview of their speakers, a truly profound global linguistic relativism would result. But neither conjunct should be accepted. Joos’s remark is regarded by nearly all linguists today as overstated (and merely a caricature of the structuralists), and Whorfian hypotheses do not have to take a global or deterministic form.

John Lucy, a conscientious and conservative researcher of Whorfian hypotheses, has remarked:

We still know little about the connections between particular language patterns and mental life—let alone how they operate or how significant they are…a mere handful of empirical studies address the linguistic relativity proposal directly and nearly all are conceptually flawed. (Lucy 1996, 37)

Although further empirical studies on Whorfian hypotheses have been completed since Lucy published his 1996 review article, it is hard to find any that have satisfied the criteria of:

  • adequately utilizing both the relevant linguistic and psychological research,
  • focusing on obligatory rather than optional linguistic features,
  • stating hypotheses in a clear testable way, and
  • ruling out relevant competing Slobin-like hypotheses.

There is much important work yet to be done on testing the range of Whorfian hypotheses and other forms of linguistic conceptual relativism, and on understanding the significance of any Whorfian hypotheses that turn out to be well supported.

Copyright © 2024 by Barbara C. Scholz Francis Jeffry Pelletier < francisp @ ualberta . ca > Geoffrey K. Pullum < pullum @ gmail . com > Ryan Nefdt < ryan . nefdt @ uct . ac . za >

The Stanford Encyclopedia of Philosophy is copyright © 2024 by The Metaphysics Research Lab , Department of Philosophy, Stanford University

Library of Congress Catalog Data: ISSN 1095-5054

The Sapir-Whorf Hypothesis: How Language Influences How We Express Ourselves

Rachael is a New York-based freelance writer for Verywell Mind, where she leverages her decades of personal experience with and research on mental illness, particularly ADHD and depression, to help readers better understand how their mind works and how to manage their mental health.

What to Know About the Sapir-Whorf Hypothesis

The Sapir-Whorf Hypothesis, also known as linguistic relativity, refers to the idea that the language a person speaks can influence their worldview, thought, and even how they experience and understand the world.

While more extreme versions of the hypothesis have largely been discredited, a growing body of research has demonstrated that language can meaningfully shape how we understand the world around us and even ourselves.

Keep reading to learn more about linguistic relativity, including some real-world examples of how it shapes thoughts, emotions, and behavior.  

The hypothesis is named after anthropologist and linguist Edward Sapir and his student, Benjamin Lee Whorf. Although it bears both their names, the two never co-authored a formal statement of it.

This Hypothesis Aims to Figure Out How Language and Culture Are Connected

Sapir was interested in charting the differences in language and cultural worldviews, including how language and culture influence each other. Whorf took this work a step further to explore how different languages might shape thought and behavior.

Since then, the concept has evolved into multiple variations, some more credible than others.

Linguistic Determinism Is an Extreme Version of the Hypothesis

Linguistic determinism, for example, is a more extreme version suggesting that a person’s perception and thought are limited to the language they speak. An early example of linguistic determinism comes from Whorf himself who argued that the Hopi people in Arizona don’t conjugate verbs into past, present, and future tenses as English speakers do and that their words for units of time (like “day” or “hour”) were verbs rather than nouns.

From this, he concluded that the Hopi don’t view time as a physical object that can be counted out in minutes and hours the way English speakers do. Instead, Whorf argued, the Hopi view time as a formless process.

This was then taken by others to mean that the Hopi don’t have any concept of time—an extreme view that has since been repeatedly disproven.

There is some evidence for a more nuanced version of linguistic relativity, which suggests that the structure and vocabulary of the language you speak can influence how you understand the world around you. To understand this better, it helps to look at real-world examples of the effects language can have on thought and behavior.

Different Languages Express Colors Differently

Color is one of the most common examples of linguistic relativity. Most known languages have somewhere between two and twelve color terms, and the way colors are categorized varies widely. In English, for example, there are distinct categories for blue and green.

But in Korean, there is one word that encompasses both. This doesn’t mean Korean speakers can’t see blue; it just means blue is understood as a variant of green rather than a distinct color category all its own.

In Russian, meanwhile, the colors that English speakers would lump under the umbrella term of “blue” are further subdivided into two distinct color categories, “siniy” and “goluboy.” They roughly correspond to dark blue and light blue in English, respectively. But to Russian speakers, they are as distinct as orange and brown.

In one study comparing English and Russian speakers, participants were shown a color square and then asked to choose which of the two color squares below it was the closest in shade to the first square.

The test specifically focused on varying shades of blue ranging from “siniy” to “goluboy.” Russian speakers were not only faster at selecting the matching color square but were more accurate in their selections.

The Way Location Is Expressed Varies Across Languages

This same variation occurs in other areas of language. For example, in Guugu Yimithirr, a language spoken by Aboriginal Australians, spatial orientation is always described in absolute terms of cardinal directions. While an English speaker would say the laptop is “in front of” you, a Guugu Yimithirr speaker would say it was north, south, west, or east of you.

As a result, Guugu Yimithirr speakers have to be constantly attuned to cardinal directions because their language requires it (just as Russian speakers develop a more instinctive ability to discern between shades of what English speakers call blue because their language requires it).

So when you ask a Guugu Yimithirr speaker to tell you which way south is, they can point in the right direction without a moment’s hesitation. Meanwhile, most English speakers would struggle to accurately identify south without the help of a compass or taking a moment to recall grade school lessons about how to find it.

The concept of these cardinal directions exists in English, but English speakers aren’t required to think about or use them on a daily basis, so it’s not as intuitive or ingrained in how they orient themselves in space.

Just as with other aspects of thought and perception, the vocabulary and grammatical structure we have for thinking about or talking about what we feel doesn’t create our feelings, but it does shape how we understand them and, to an extent, how we experience them.

Words Help Us Put a Name to Our Emotions

For example, the ability to detect displeasure from a person’s face is universal. But in a language that has the words “angry” and “sad,” you can further distinguish what kind of displeasure you observe in their facial expression. This doesn’t mean humans never experienced anger or sadness before words for them emerged. But they may have struggled to understand or explain the subtle differences between different dimensions of displeasure.

In one study of English speakers, toddlers were shown a picture of a person with an angry facial expression. Then, they were given a set of pictures of people displaying different expressions including happy, sad, surprised, scared, disgusted, or angry. Researchers asked them to put all the pictures that matched the first angry face picture into a box.

The two-year-olds in the experiment tended to place all faces except happy faces into the box. But four-year-olds were more selective, often leaving out sad or fearful faces as well as happy faces. This suggests that as our vocabulary for talking about emotions expands, so does our ability to understand and distinguish those emotions.

But some research suggests the influence is not limited to just developing a wider vocabulary for categorizing emotions. Language may “also help constitute emotion by cohering sensations into specific perceptions of ‘anger,’ ‘disgust,’ ‘fear,’ etc.,” said Dr. Harold Hong, a board-certified psychiatrist at New Waters Recovery in North Carolina.

Words for emotions, like words for colors, are an attempt to categorize a spectrum of sensations into a handful of distinct categories. And, as with color, there’s no objective or hard rule on where the boundaries between emotions should be, which can lead to variation across languages in how emotions are categorized.

Emotions Are Categorized Differently in Different Languages

Just as different languages categorize color a little differently, researchers have also found differences in how emotions are categorized. In German, for example, there’s an emotion called “gemütlichkeit.”

While it’s usually translated as “cozy” or “friendly” in English, there really isn’t a direct translation. It refers to a particular kind of peace and sense of belonging that a person feels when surrounded by the people they love or feel connected to in a place they feel comfortable and free to be who they are.

You may have felt gemütlichkeit when staying up with your friends to joke and play games at a sleepover. You may feel it when you visit home for the holidays and spend your time eating, laughing, and reminiscing with your family in the house you grew up in.

In Japanese, the word “amae” is just as difficult to translate into English. Usually, it’s translated as "spoiled child" or "presumed indulgence," as in making a request and assuming it will be indulged. But both of those have strong negative connotations in English and amae is a positive emotion.

Instead of being spoiled or coddled, it’s referring to that particular kind of trust and assurance that comes with being nurtured by someone and knowing that you can ask for what you want without worrying whether the other person might feel resentful or burdened by your request.

You might have felt amae when your car broke down and you immediately called your mom to pick you up, without having to worry for even a second whether or not she would drop everything to help you.

Regardless of which languages you speak, though, you’re capable of feeling both of these emotions. “The lack of a word for an emotion in a language does not mean that its speakers don't experience that emotion,” Dr. Hong explained.

What This Means For You

“While having the words to describe emotions can help us better understand and regulate them, it is possible to experience and express those emotions without specific labels for them.” Without the words for these feelings, you can still feel them but you just might not be able to identify them as readily or clearly as someone who does have those words. 

Rhee S. Lexicalization patterns in color naming in Korean. In: Raffaelli I, Katunar D, Kerovec B, eds. Studies in Functional and Structural Linguistics. Vol 78. John Benjamins Publishing Company; 2019:109-128. doi:10.1075/sfsl.78.06rhe

Winawer J, Witthoft N, Frank MC, Wu L, Wade AR, Boroditsky L. Russian blues reveal effects of language on color discrimination. Proc Natl Acad Sci USA. 2007;104(19):7780-7785. doi:10.1073/pnas.0701644104

Lindquist KA, MacCormack JK, Shablack H. The role of language in emotion: predictions from psychological constructionism. Front Psychol. 2015;6. doi:10.3389/fpsyg.2015.00444

By Rachael Green

The Sapir-Whorf Hypothesis Linguistic Theory

The Sapir-Whorf hypothesis is the linguistic theory that the semantic structure of a language shapes or limits the ways in which a speaker forms conceptions of the world. It came about in 1929. The theory is named after the American anthropological linguist Edward Sapir (1884–1939) and his student Benjamin Whorf (1897–1941). It is also known as the theory of linguistic relativity, linguistic relativism, linguistic determinism, Whorfian hypothesis, and Whorfianism.

History of the Theory

The idea that a person's native language determines how he or she thinks was popular among behaviorists from the 1930s until cognitive psychology theories emerged in the 1950s and grew in influence in the 1960s. (Behaviorism taught that behavior is a result of external conditioning and doesn't take feelings, emotions, and thoughts into account as affecting behavior. Cognitive psychology studies mental processes such as creative thinking, problem-solving, and attention.)

Author Lera Boroditsky gave some background on ideas about the connections between languages and thought:

"The question of whether languages shape the way we think goes back centuries; Charlemagne proclaimed that 'to have a second language is to have a second soul.' But the idea went out of favor with scientists when Noam Chomsky's theories of language gained popularity in the 1960s and '70s. Dr. Chomsky proposed that there is a universal grammar for all human languages—essentially, that languages don't really differ from one another in significant ways...." ("Lost in Translation." "The Wall Street Journal," July 30, 2010)

The Sapir-Whorf hypothesis was taught in courses through the early 1970s and had become widely accepted as truth, but then it fell out of favor. By the 1990s, the Sapir-Whorf hypothesis was left for dead, author Steven Pinker wrote. "The cognitive revolution in psychology, which made the study of pure thought possible, and a number of studies showing meager effects of language on concepts, appeared to kill the concept in the 1990s... But recently it has been resurrected, and 'neo-Whorfianism' is now an active research topic in psycholinguistics." ("The Stuff of Thought." Viking, 2007)

Neo-Whorfianism is essentially a weaker version of the Sapir-Whorf hypothesis and says that language influences a speaker's view of the world but does not inescapably determine it.

The Theory's Flaws

One big problem with the original Sapir-Whorf hypothesis stems from the idea that if a person's language has no word for a particular concept, then that person would not be able to understand that concept, which is untrue. Language doesn't necessarily control humans' ability to reason or have an emotional response to something or some idea. For example, take the German word sturmfrei, which essentially is the feeling when you have the whole house to yourself because your parents or roommates are away. Just because English doesn't have a single word for the idea doesn't mean that Americans can't understand the concept.

There's also the "chicken and egg" problem with the theory. "Languages, of course, are human creations, tools we invent and hone to suit our needs," Boroditsky continued. "Simply showing that speakers of different languages think differently doesn't tell us whether it's language that shapes thought or the other way around."

Social Sci LibreTexts

3.1: Linguistic Relativity- The Sapir-Whorf Hypothesis

  • Manon Allard-Kropp
  • University of Missouri–St. Louis

Learning Objectives

After completing this module, students will be able to:

1. Define the concept of linguistic relativity

2. Differentiate linguistic relativity and linguistic determinism

3. Define the Sapir-Whorf Hypothesis (against more pop-culture takes on it) and situate it in a broader theoretical context/history

4. Illustrate linguistic relativity with examples related to time, space, metaphors, etc.

In this part, we will look at language(s) and worldviews at the intersection of language & thoughts and language & cognition (i.e., the mental system with which we process the world around us, and with which we learn to function and make sense of it). Our main question, which we will not entirely answer but which we will examine in depth, is a chicken-and-egg one: does thought determine language, or does language inform thought?

We will talk about the Sapir-Whorf Hypothesis; look at examples that support the notion of linguistic relativity (pronouns, kinship terms, grammatical tenses, and what they tell us about culture and worldview); and then we will more specifically look into how metaphors are a structural component of worldview, if not cognition itself; and we will wrap up with memes. (Can we analyze memes through an ethnolinguistic, relativist lens? We will try!)

3.1 Linguistic Relativity: The Sapir-Whorf Hypothesis

In the early 1930s, Benjamin Whorf was a graduate student studying with linguist Edward Sapir at Yale University in New Haven, Connecticut. Sapir, considered the father of American linguistic anthropology, was responsible for documenting and recording the languages and cultures of many Native American tribes, which were disappearing at an alarming rate. This was due primarily to the deliberate efforts of the United States government to force Native Americans to assimilate into the Euro-American culture. Sapir and his predecessors were well aware of the close relationship between culture and language because each culture is reflected in and influences its language. Anthropologists need to learn the language of the culture they are studying in order to understand the world view of its speakers. Whorf believed that the reverse is also true, that a language affects culture as well, by actually influencing how its speakers think. His hypothesis proposes that the words and the structures of a language influence how its speakers think about the world, how they behave, and ultimately the culture itself. (See our definition of culture in Part 1 of this document.) Simply stated, Whorf believed that human beings see the world the way they do because the specific languages they speak influence them to do so.

He developed this idea through both his work with Sapir and his work as a chemical engineer for the Hartford Insurance Company investigating the causes of fires. One of his cases while working for the insurance company was a fire at a business where there were a number of gasoline drums. Those that contained gasoline were surrounded by signs warning employees to be cautious around them and to avoid smoking near them. The workers were always careful around those drums. On the other hand, empty gasoline drums were stored in another area, but employees were more careless there. Someone tossed a cigarette or lighted match into one of the “empty” drums, it went up in flames, and started a fire that burned the business to the ground. Whorf theorized that the meaning of the word empty implied to the worker that “nothing” was there to be cautious about so the worker behaved accordingly. Unfortunately, an “empty” gasoline drum may still contain fumes, which are more flammable than the liquid itself.

Whorf’s studies at Yale involved working with Native American languages, including Hopi. The Hopi language is quite different from English in many ways. For example, let’s look at how the Hopi language deals with time. Western languages (and cultures) view time as a flowing river in which we are being carried continuously away from a past, through the present, and into a future. Our verb systems reflect that concept with specific tenses for past, present, and future. We think of this concept of time as universal, that all humans see it the same way. A Hopi speaker has very different ideas, and the structure of their language both reflects and shapes the way they think about time. The Hopi language has no present, past, or future tense. Instead, it divides the world into what Whorf called the manifested and unmanifest domains. The manifested domain deals with the physical universe, including the present, the immediate past and future; the verb system uses the same basic structure for all of them. The unmanifest domain involves the remote past and the future, as well as the world of desires, thought, and life forces. The set of verb forms dealing with this domain is consistent for all of these areas, and is different from the manifested ones. Also, there are no words for hours, minutes, or days of the week. Native Hopi speakers often had great difficulty adapting to life in the English-speaking world when it came to being “on time” for work or other events. It is simply not how they had been conditioned to behave with respect to time in their Hopi world, which followed the phases of the moon and the movements of the sun.

In a book about the Abenaki who lived in Vermont in the mid-1800s, Trudy Ann Parker described their concept of time, which very much resembled that of the Hopi and many of the other Native American tribes. “They called one full day a sleep, and a year was called a winter. Each month was referred to as a moon and always began with a new moon. An Indian day wasn’t divided into minutes or hours. It had four time periods—sunrise, noon, sunset, and midnight. Each season was determined by the budding or leafing of plants, the spawning of fish, or the rutting time for animals. Most Indians thought the white race had been running around like scared rabbits ever since the invention of the clock.”

The lexicon, or vocabulary, of a language is an inventory of the items a culture talks about and has categorized in order to make sense of the world and deal with it effectively. For example, modern life is dictated for many by the need to travel by some kind of vehicle: cars, trucks, SUVs, trains, buses, etc. We therefore have thousands of words to talk about them, including types of vehicles, models, brands, or parts.

The most important aspects of each culture are similarly reflected in the lexicon of its language. Among the societies living in the islands of Oceania in the Pacific, fish have great economic and cultural importance. This is reflected in the rich vocabulary that describes all aspects of the fish and the environments that islanders depend on for survival. For example, in Palau there are about 1,000 fish species and Palauan fishermen knew, long before biologists existed, details about the anatomy, behavior, growth patterns, and habitat of most of them, in many cases far more than modern biologists know even today. Much of fish behavior is related to the tides and the phases of the moon. Throughout Oceania, the names given to certain days of the lunar months reflect the likelihood of successful fishing. For example, in the Caroline Islands, the name for the night before the new moon is otolol, which means “to swarm.” The name indicates that the best fishing days cluster around the new moon. In Hawaiʻi and Tahiti two sets of days have names containing the particle ʻole or ʻore; one occurs in the first quarter of the moon and the other in the third quarter. The same name is given to the prevailing wind during those phases. The words mean “nothing,” because those days were considered bad for fishing as well as planting.

Parts of Whorf’s hypothesis, known as linguistic relativity, were controversial from the beginning, and still are among some linguists. Yet Whorf’s ideas now form the basis for an entire sub-field of cultural anthropology: cognitive or psychological anthropology. A number of studies have been done that support Whorf’s ideas. Linguist George Lakoff’s work looks at the pervasive existence of metaphors in everyday speech that can be said to predispose a speaker’s world view and attitudes on a variety of human experiences. A metaphor is an expression in which one kind of thing is understood and experienced in terms of another entirely unrelated thing; the metaphors in a language can reveal aspects of the culture of its speakers. Take, for example, the concept of an argument. In logic and philosophy, an argument is a discussion involving differing points of view, or a debate. But the conceptual metaphor in American culture can be stated as ARGUMENT IS WAR. This metaphor is reflected in many expressions of the everyday language of American speakers: I won the argument. He shot down every point I made. They attacked every argument we made. Your point is right on target. I had a fight with my boyfriend last night. In other words, we use words appropriate for discussing war when we talk about arguments, which are certainly not real war. But we actually think of arguments as a verbal battle that often involves anger, and even violence, which then structures how we argue.

To illustrate that this concept of argument is not universal, Lakoff suggests imagining a culture where an argument is not something to be won or lost, with no strategies for attacking or defending, but is instead viewed as a dance where the dancers’ goal is to perform in an artful, pleasing way. No anger or violence would occur or even be relevant to speakers of this language, because the metaphor for that culture would be ARGUMENT IS DANCE.

3.1 Adapted from Perspectives, Language (Linda Light, 2017)

You can either watch the video, How Language Shapes the Way We Think, by linguist Lera Boroditsky, or read the script below.

Watch the video: How Language Shapes the Way We Think (Boroditsky, 2018)

There are about 7,000 languages spoken around the world—and they all have different sounds, vocabularies, and structures. But do they shape the way we think? Cognitive scientist Lera Boroditsky shares examples of language—from an Aboriginal community in Australia that uses cardinal directions instead of left and right to the multiple words for blue in Russian—that suggest the answer is a resounding yes. “The beauty of linguistic diversity is that it reveals to us just how ingenious and how flexible the human mind is,” Boroditsky says. “Human minds have invented not one cognitive universe, but 7,000.”

Video transcript:

So, I’ll be speaking to you using language ... because I can. This is one of these magical abilities that we humans have. We can transmit really complicated thoughts to one another. So what I’m doing right now is, I’m making sounds with my mouth as I’m exhaling. I’m making tones and hisses and puffs, and those are creating air vibrations in the air. Those air vibrations are traveling to you, they’re hitting your eardrums, and then your brain takes those vibrations from your eardrums and transforms them into thoughts. I hope.

I hope that’s happening. So because of this ability, we humans are able to transmit our ideas across vast reaches of space and time. We’re able to transmit knowledge across minds. I can put a bizarre new idea in your mind right now. I could say, “Imagine a jellyfish waltzing in a library while thinking about quantum mechanics.”

Now, if everything has gone relatively well in your life so far, you probably haven’t had that thought before.

But now I’ve just made you think it, through language.

Now of course, there isn’t just one language in the world, there are about 7,000 languages spoken around the world. And all the languages differ from one another in all kinds of ways. Some languages have different sounds, they have different vocabularies, and they also have different structures—very importantly, different structures. That begs the question: Does the language we speak shape the way we think? Now, this is an ancient question. People have been speculating about this question forever. Charlemagne, Holy Roman emperor, said, “To have a second language is to have a second soul”—strong statement that language crafts reality. But on the other hand, Shakespeare has Juliet say, “What’s in a name? A rose by any other name would smell as sweet.” Well, that suggests that maybe language doesn’t craft reality.

These arguments have gone back and forth for thousands of years. But until recently, there hasn’t been any data to help us decide either way. Recently, in my lab and other labs around the world, we’ve started doing research, and now we have actual scientific data to weigh in on this question.

So let me tell you about some of my favorite examples. I’ll start with an example from an Aboriginal community in Australia that I had a chance to work with. These are the Kuuk Thaayorre people. They live in Pormpuraaw at the very west edge of Cape York. What’s cool about Kuuk Thaayorre is, in Kuuk Thaayorre, they don’t use words like “left” and “right,” and instead, everything is in cardinal directions: north, south, east, and west. And when I say everything, I really mean everything. You would say something like, “Oh, there’s an ant on your southwest leg.” Or, “Move your cup to the north-northeast a little bit.” In fact, the way that you say “hello” in Kuuk Thaayorre is you say, “Which way are you going?” And the answer should be, “North-northeast in the far distance. How about you?”

So imagine as you’re walking around your day, every person you greet, you have to report your heading direction.

But that would actually get you oriented pretty fast, right? Because you literally couldn’t get past “hello,” if you didn’t know which way you were going. In fact, people who speak languages like this stay oriented really well. They stay oriented better than we used to think humans could. We used to think that humans were worse than other creatures because of some biological excuse: “Oh, we don’t have magnets in our beaks or in our scales.” No; if your language and your culture trains you to do it, actually, you can do it. There are humans around the world who stay oriented really well.

And just to get us in agreement about how different this is from the way we do it, I want you all to close your eyes for a second and point southeast.

Keep your eyes closed. Point. OK, so you can open your eyes. I see you guys pointing there, there, there, there, there ... I don’t know which way it is myself—

You have not been a lot of help.

So let’s just say the accuracy in this room was not very high. This is a big difference in cognitive ability across languages, right? Where one group—very distinguished group like you guys—doesn’t know which way is which, but in another group, I could ask a five-year-old and they would know.

There are also really big differences in how people think about time. So here I have pictures of my grandfather at different ages. And if I ask an English speaker to organize time, they might lay it out this way, from left to right. This has to do with writing direction. If you were a speaker of Hebrew or Arabic, you might do it going in the opposite direction, from right to left.

But how would the Kuuk Thaayorre, this Aboriginal group I just told you about, do it? They don’t use words like “left” and “right.” Let me give you a hint. When we sat people facing south, they organized time from left to right. When we sat them facing north, they organized time from right to left. When we sat them facing east, time came towards the body. What’s the pattern? East to west, right? So for them, time doesn’t actually get locked on the body at all, it gets locked on the landscape. So for me, if I’m facing this way, then time goes this way, and if I’m facing this way, then time goes this way. I’m facing this way, time goes this way: very egocentric of me to have the direction of time chase me around every time I turn my body. For the Kuuk Thaayorre, time is locked on the landscape. It’s a dramatically different way of thinking about time.

Here’s another really smart human trick. Suppose I ask you how many penguins are there. Well, I bet I know how you’d solve that problem if you solved it. You went, “One, two, three, four, five, six, seven, eight.” You counted them. You named each one with a number, and the last number you said was the number of penguins. This is a little trick that you’re taught to use as kids. You learn the number list and you learn how to apply it. A little linguistic trick. Well, some languages don’t do this, because some languages don’t have exact number words. They’re languages that don’t have a word like “seven” or a word like “eight.” In fact, people who speak these languages don’t count, and they have trouble keeping track of exact quantities. So, for example, if I ask you to match this number of penguins to the same number of ducks, you would be able to do that by counting. But folks who don’t have that linguistic trick can’t do that.

Languages also differ in how they divide up the color spectrum, the visual world. Some languages have lots of words for colors, some have only a couple words, “light” and “dark.” And languages differ in where they put boundaries between colors. So, for example, in English, there’s a word for blue that covers all of the colors that you can see on the screen, but in Russian, there isn’t a single word. Instead, Russian speakers have to differentiate between light blue, goluboy, and dark blue, siniy. So Russians have this lifetime of experience of, in language, distinguishing these two colors. When we test people’s ability to perceptually discriminate these colors, what we find is that Russian speakers are faster across this linguistic boundary. They’re faster to be able to tell the difference between a light and a dark blue. And when you look at people’s brains as they’re looking at colors (say you have colors shifting slowly from light to dark blue), the brains of people who use different words for light and dark blue will give a surprised reaction as the colors shift from light to dark, as if, “Ooh, something has categorically changed,” whereas the brains of English speakers, for example, that don’t make this categorical distinction, don’t give that surprise, because nothing is categorically changing.

Languages have all kinds of structural quirks. This is one of my favorites. Lots of languages have grammatical gender; so every noun gets assigned a gender, often masculine or feminine. And these genders differ across languages. So, for example, the sun is feminine in German but masculine in Spanish, and the moon, the reverse. Could this actually have any consequence for how people think? Do German speakers think of the sun as somehow more female-like, and the moon somehow more male-like? Actually, it turns out that’s the case. So if you ask German and Spanish speakers to, say, describe a bridge, like the one here—“bridge” happens to be grammatically feminine in German, grammatically masculine in Spanish—German speakers are more likely to say bridges are “beautiful,” “elegant,” and stereotypically feminine words. Whereas Spanish speakers will be more likely to say they’re “strong” or “long,” these masculine words.

Languages also differ in how they describe events, right? You take an event like this, an accident. In English, it’s fine to say, “He broke the vase.” In a language like Spanish, you might be more likely to say, “The vase broke,” or “The vase broke itself.” If it’s an accident, you wouldn’t say that someone did it. In English, quite weirdly, we can even say things like, “I broke my arm.” Now, in lots of languages, you couldn’t use that construction unless you are a lunatic and you went out looking to break your arm—[laughter] and you succeeded. If it was an accident, you would use a different construction.

Now, this has consequences. So, people who speak different languages will pay attention to different things, depending on what their language usually requires them to do. So we show the same accident to English speakers and Spanish speakers, English speakers will remember who did it, because English requires you to say, “He did it; he broke the vase.” Whereas Spanish speakers might be less likely to remember who did it if it’s an accident, but they’re more likely to remember that it was an accident. They’re more likely to remember the intention. So, two people watch the same event, witness the same crime, but end up remembering different things about that event. This has implications, of course, for eyewitness testimony. It also has implications for blame and punishment. So if you take English speakers and I just show you someone breaking a vase, and I say, “He broke the vase,” as opposed to “The vase broke,” even though you can witness it yourself, you can watch the video, you can watch the crime against the vase, you will punish someone more, you will blame someone more if I just said, “He broke it,” as opposed to, “It broke.” The language guides our reasoning about events.

Now, I’ve given you a few examples of how language can profoundly shape the way we think, and it does so in a variety of ways. So language can have big effects, like we saw with space and time, where people can lay out space and time in completely different coordinate frames from each other. Language can also have really deep effects—that’s what we saw with the case of number. Having count words in your language, having number words, opens up the whole world of mathematics. Of course, if you don’t count, you can’t do algebra, you can’t do any of the things that would be required to build a room like this or make this broadcast, right? This little trick of number words gives you a stepping stone into a whole cognitive realm.

Language can also have really early effects, what we saw in the case of color. These are really simple, basic, perceptual decisions. We make thousands of them all the time, and yet, language is getting in there and fussing even with these tiny little perceptual decisions that we make. Language can have really broad effects. So the case of grammatical gender may be a little silly, but at the same time, grammatical gender applies to all nouns. That means language can shape how you’re thinking about anything that can be named by a noun. That’s a lot of stuff.

And finally, I gave you an example of how language can shape things that have personal weight to us—ideas like blame and punishment or eyewitness memory. These are important things in our daily lives.

Now, the beauty of linguistic diversity is that it reveals to us just how ingenious and how flexible the human mind is. Human minds have invented not one cognitive universe, but 7,000—there are 7,000 languages spoken around the world. And we can create many more—languages, of course, are living things, things that we can hone and change to suit our needs. The tragic thing is that we’re losing so much of this linguistic diversity all the time. We’re losing about one language a week, and by some estimates, half of the world’s languages will be gone in the next hundred years. And the even worse news is that right now, almost everything we know about the human mind and human brain is based on studies of usually American English-speaking undergraduates at universities. That excludes almost all humans. Right? So what we know about the human mind is actually incredibly narrow and biased, and our science has to do better.

I want to leave you with this final thought. I’ve told you about how speakers of different languages think differently, but of course, that’s not about how people elsewhere think. It’s about how you think. It’s how the language that you speak shapes the way that you think. And that gives you the opportunity to ask, “Why do I think the way that I do?” “How could I think differently?” And also, “What thoughts do I wish to create?”

Thank you very much.

Read the following text on what lexical differences between languages can tell us about those languages’ cultures.

Linguistic Relativity, by Peggy Li and David Barner. Last reviewed: 28 October 2011. Last modified: 28 October 2011. DOI: 10.1093/obo/9780199772810-0026

Linguistic relativity, sometimes called the Whorfian hypothesis, posits that properties of language affect the structure and content of thought and thus the way humans perceive reality. A distinction is often made between strong Whorfian views, according to which the categories of thought are determined by language, and weak views, which argue that language influences thought without entirely determining its structure. Each view presupposes that for language to affect thought, the two must in some way be separable. The modern investigation of linguistic relativity began with the contributions of Benjamin Lee Whorf and his mentor, Edward Sapir. Until recently, much experimental work has focused on determining whether any reliable Whorfian effects exist and whether effects truly reflect differences in thought caused by linguistic variation. Many such studies compare speakers of different languages or test subjects at different stages of language acquisition. Other studies explore how language affects cognition by testing prelinguistic infants or nonhuman animals and comparing these groups to children or adults. Significant progress has been made in several domains, including studies of color, number, objects, and space. In many areas, the status of findings is hotly debated.

Often, leading researchers in the field summarize their newest findings and views in edited collections. These volumes are good places to begin research into the topic of linguistic relativity. The listed volumes arose from papers presented at conferences, symposia, and workshops devoted to the topic. Gumperz and Levinson 1996 arose from a symposium that revived interest in the linguistic relativity hypothesis, leading to a wave of new research on the topic. Highlights of this work are reported in Bowerman and Levinson 2001, Gentner and Goldin-Meadow 2003, and Malt and Wolff 2010.

Bowerman, Melissa, and Stephen C. Levinson, eds. 2001. Language acquisition and conceptual development. Cambridge, UK: Cambridge Univ. Press.

DOI: 10.1017/CBO9780511620669

This volume brings together research on language acquisition and conceptual development and asks about the relation between them in early childhood.

Gentner, Dedre, and Susan Goldin-Meadow, eds. 2003. Language in mind: Advances in the study of language and thought. Cambridge, MA: MIT Press.

The volume starts with a collection of perspective papers and then showcases papers that bring data to bear to test claims of linguistic relativity. The papers are delineated on the basis of the types of language effects on thought: language as a tool kit, language as a lens, and language as a category maker.

Gumperz, John J., and Stephen C. Levinson, eds. 1996. Rethinking linguistic relativity. Papers presented at the Wenner-Gren Symposium 112, held in Ocho Rios, Jamaica, in May 1991. Cambridge, UK: Cambridge Univ. Press.

A collection of papers arising from the “Rethinking Linguistic Relativity” Wenner-Gren Symposium in 1991 that brought about renewed interest in the topic.

Malt, Barbara C., and Phillip M. Wolff, eds. 2010. Words and the mind: How words capture human experience. Oxford: Oxford Univ. Press.

Researchers across disciplines (linguists, psychologists, and anthropologists) contributed to this collection of papers documenting new advances in language-thought research in various domains (space, emotions, body parts, causation, etc.).

Whorfian Hypothesis

  • Reference work entry
  • pp 1566–1567

  • Seongwon Yun
  • Shelia M. Kennison

Linguistic determinism; Linguistic relativity; Sapir-Whorf hypothesis

The hypothesis suggests that human thought is influenced by the language one speaks.

Description

The term Whorfian Hypothesis takes its name from Benjamin Lee Whorf (1897–1941), who claimed that the language one speaks influences one’s thinking [7]. Whorf was an amateur linguist who studied with the anthropologist Edward Sapir in the 1920s and 1930s. The term Sapir-Whorf Hypothesis is also used to refer to their view that language determines thinking. Linguistic determinism and linguistic relativity are also terms referring to the notion that the characteristics of one’s language shape one’s cognition.

The hypothesis was developed following observations of cross-language differences in linguistic structure and speculations about how such differences might affect speakers’ thinking. For example, Whorf noted that the Hopi language had little to no grammatical marking for tense. Utterances...

References

1. Berlin, B., & Kay, P. (1969). Basic color terms: Their universality and evolution. Berkeley, CA: University of California Press.

2. Gordon, P. (2004). Numerical cognition without words: Evidence from Amazonia. Science, 306, 496–499.

3. Heider, E., & Olivier, D. (1972). The structure of the color space in naming and memory for two languages. Cognitive Psychology, 3, 337–354.

4. Hunt, E., & Agnoli, F. (1991). The Whorfian hypothesis: A cognitive approach. Psychological Review, 98, 377–389.

5. Pica, P., Lemer, C., Izard, V., & Dehaene, S. (2004). Exact and approximate calculation in an Amazonian indigene group with a reduced number lexicon. Science, 306, 499–503.

6. Roberson, D., Davies, I., & Davidoff, J. (2000). Colour categories are not universal: Replications and new evidence from a Stone-Age culture. Journal of Experimental Psychology: General, 129, 369–398.

7. Whorf, B. L. (1956). Language, thought, and reality. Cambridge, MA: MIT Press.

Author information

Authors and affiliations.

Department of Psychology, Oklahoma State University, 116 North Murray Hall, Stillwater, OK, 74078, USA

Seongwon Yun & Shelia M. Kennison

Editor information

Editors and affiliations.

Neurology, Learning and Behavior Center, 230 South 500 East, Suite 100, Salt Lake City, Utah, 84102, USA

Sam Goldstein Ph.D.

Department of Psychology MS 2C6, George Mason University, Fairfax, VA, 22030, USA

Jack A. Naglieri Ph.D. (Professor of Psychology)

Copyright information

© 2011 Springer Science+Business Media, LLC

About this entry

Cite this entry.

Yun, S., Kennison, S.M. (2011). Whorfian Hypothesis. In: Goldstein, S., Naglieri, J.A. (eds) Encyclopedia of Child Behavior and Development. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-79061-9_3087

DOI: https://doi.org/10.1007/978-0-387-79061-9_3087

Publisher Name: Springer, Boston, MA

Print ISBN: 978-0-387-77579-1

Online ISBN: 978-0-387-79061-9

eBook Packages: Behavioral Science

From Whorf to Montague: Explorations in the Theory of Language

2 The Whorf hypothesis

  • Published: October 2013

This chapter analyses Whorf’s (hypo)thesis that language influences or determines thought. Besides a great deal of notional unclarity about ‘language’ and ‘thought’, it is found that its main weakness is its failure to establish that the direction of causality is from language to thought and not vice versa. The arguments put forward in its defence typically confuse the how and the what: what is said may well influence thought, but how it is said, which is the real question of the Whorf hypothesis, has not been shown to have an effect on thought. Whorf’s own arguments are easily countered and new arguments, based on psychological and anthropological experiments, are shown to be inconclusive, though there is a possibility that the frequent use of certain locutions has a marginal effect on mental processes in the peripheral area where thought is prepared for linguistic expression. Such effects do not confirm the Whorf hypothesis, as this hypothesis is about central thought processes and categories. The conclusion is that the Whorf hypothesis must be taken to be unconfirmed and probably false.

The Whorf Hypothesis

The Basic Idea


If you’ve spent any amount of time online, you might have seen an article titled along the lines of “10 words we wish we had in English”; even major outlets like the BBC 1 and The Guardian 2 have indulged! The format is simple. The article lists 10 words, like sobremesa (Spanish for lingering at the table to chat after a meal) or Kummerspeck (German for the weight gained from emotional eating), and probably a few relatable, witty remarks. The article usually ends by mourning the lack of an English translation.

As behavioral scientists, we might think to ask: Are there  really  untranslatable words? Are there certain thoughts you can only have in certain languages?

The “Whorf Hypothesis” (also known as the “Sapir-Whorf Hypothesis” or “Linguistic Relativism”) is an umbrella term for the claim that the language you speak determines or influences what you can think. If you speak English, there are certain thoughts you can have; if you speak Spanish or German, there are different thoughts you can have. Certain words are untranslatable because only certain languages can convey those thoughts.


Does the language we speak shape or determine what we can think? What is the relationship between language and thought?

The Strong Whorf Hypothesis: the claim that the language you speak determines which thoughts you can have. 3 It is rejected by most linguists, psychologists, and cognitive scientists today. 4, 5, 6

The Weak Whorf Hypothesis: the claim that the language you speak influences, but does not determine, which thoughts you can have. 3 This claim is currently being studied, and many behavioral scientists accept some form of it. 5, 7, 8

Nativism: the claim that language is largely an innate cognitive faculty, virtually identical across individuals and cultures. Versions of this claim have been defended by linguists like Noam Chomsky, 9 psychologists like Steven Pinker, 6 and philosophers like Jerry Fodor. 10

While some scholars argue that the Whorf hypothesis dates back to Aristotle’s Rhetoric or to German philosopher Gottfried Leibniz’s writings on language, we can safely start with Wilhelm von Humboldt, 11 an early 19th-century German linguist and political theorist. Before and during his fieldwork, Humboldt often wrote about the relationship between language and thought. To him, language was not merely the means through which we convey what is going on in our minds. Instead, language established a worldview: languages were the means through which we understood ourselves and the world. 11

Humboldt’s ideas became influential in the late-19th century through the work of the German-trained Franz Boas: 11  a professor at Columbia and founder of the American Anthropological Association. 12  Boas’s work in linguistic anthropology (mainly on what we now call “Inuit languages”) followed Humboldt in arguing that different languages classify how we experience the world in different, subconscious ways. 11,13  Crucially, though, he did not think that language  determines  how we view the world. Instead, he thought that our languages’ grammatical categories  reflect  the ways our  culture  classifies the world. 13

Moving on to the early 20th century, one of Boas’s own students, Edward Sapir, would also be one of the main contributors to the development of the Whorf hypothesis. (This is why it is sometimes called the “Sapir-Whorf” hypothesis.) Sapir followed Boas in arguing that different languages classify how we experience the world, but he stressed that languages are complete systems, often untranslatable between each other. 13  He also pushed further than Boas: he thought that language was necessary for us to fully develop the ability to think because our ability to think arises from our ability to interpret the language we speak. 13  Different languages yield different interpretations, and those different interpretations place constraints on what we can think. 13

This progressive strengthening of Humboldt’s original idea was completed by one of Sapir’s students, Benjamin Lee Whorf. 11, 13 While not a professional linguist, Whorf was interested in documenting previous and current forms of the indigenous languages of North America, especially Nahuatl and Hopi. Whorf’s main contribution to the hypothesis was to point out that not all linguistic categories are overt; sometimes, a language encodes information implicitly. Whorf also accepted Sapir’s claim that languages place constraints on what we can think, based on the interpretations we give them. But because languages also mark things implicitly, these interpretations are widespread and pervasive: we do not have to actively use our language to be interpreting things through our language. As Whorf would put it:

“[…]users of markedly different grammars are pointed by their grammars toward different types of observations and different evaluations of externally similar acts of observation, and hence are not equivalent as observers but must arrive at somewhat different views of the world.” 15

From the 1960s onward, with the rise of nativism in linguistics (especially Chomsky’s theory of Universal Grammar), the Whorf hypothesis began to come under scrutiny. Languages were believed to be too similar to yield the kinds of effects on thought that Sapir and Whorf hypothesized. Further empirical work also showed the strong form of the Whorf hypothesis to be flawed: humans and other primates display the ability to think without language, 5 refuting Sapir and Whorf’s claim that we need to interpret language to be able to think.

However, researchers in the 1990s started studying whether language still influenced thought in any interesting ways. Among other things, behavioral scientists began looking at language’s effect on color perception, spatial cognition, and more. Many studies suggest that language does have some effect on which kinds of processing are easier for a speaker. 16, 17 The research on these weaker versions of the Whorf hypothesis is still ongoing, but many behavioral scientists, even ones who reject the stronger forms, accept one version or another. 14

Wilhelm von Humboldt

Wilhelm von Humboldt (1767 – 1835) was a philosopher and political theorist who made great contributions to philosophy, linguistics, education, anthropology, and more. 18 Many theorists of language (including Boas, Sapir, Whorf, and, paradoxically enough, Chomsky) claim to have been influenced by his views. 18 In particular, Humboldt is often credited with arguing that a language’s grammar is best studied by looking at the forms and procedures it uses to generate actual speech, and with arguing that thought without language is impossible. 18

Franz Boas

Franz Boas (1858 – 1942) is usually credited as the founder of the American anthropological tradition, and he is the founder of the American Anthropological Association. 19 His work focused on the indigenous languages of the United States, where he contributed to both our anthropological and linguistic understanding of them. 19 Boas was also among the first white social scientists to argue that racial differences were due to historical events, not genetics, and that racial categories were themselves culturally constructed. 19

Edward Sapir

Edward Sapir (1884 – 1939) is often considered one of the most important figures in linguistics and anthropology in the United States. 20  He was the founder of “ethnolinguistics,” which focused on the relationship between language and culture, and he is often credited as a key developer of American structural linguistics. 20  His work focused on the indigenous languages of all of North America. 20

Benjamin Lee Whorf

Benjamin Lee Whorf (1897 – 1941) was an American linguist whose work, like that of his mentor Edward Sapir, focused on the indigenous languages of North America. 21  Whorf is most well-known due to his arguments in favor of linguistic relativity (which came to be known as “the Whorf hypothesis”), based on his work on Hopi and other indigenous languages. 21

Noam Chomsky

Noam Chomsky (1928 – present) is an American linguist, political theorist, and cognitive scientist. 21 Chomsky’s 1959 review of B.F. Skinner’s Verbal Behavior is often credited as the moment of death for behaviorism. 21 From the 1960s onward, Chomsky founded and contributed to the Generativist approach to linguistics, which holds that language is a separate cognitive faculty unique to humans, which children are born with and use to acquire their native language without much stimulus. 21 He also argues that this linguistic faculty is universal: all humans are born with the same “Universal Grammar,” which allows them to learn language quickly and makes all human languages the same at bottom. 21 This approach to language remains standard and influential to this day, especially in theoretical syntax and semantics. 8, 22

Consequences

If true, the strong form of the Whorf hypothesis would have massive ripple effects on our understanding of how the human mind works. If the language we spoke determined the kinds of thoughts we could have, it would be incredibly hard to find any cognitive universals. Our world speaks over 6500 languages, so the strong Whorf hypothesis predicts that we would have radically different, and untranslatable, thoughts.

Thankfully for cognitive scientists worldwide, the strong form of the Whorf hypothesis has been falsified for decades. However, we might still ask: what about the consequences of the weak form?

The research is still ongoing, but one general trend is that the language we speak makes certain thoughts slightly easier to access in non-trivial ways. For instance, if our language marks space using the cardinal directions (e.g., “the office is north of the coffee shop”), it makes it easier for us to think in terms of north and south. 16 If, in contrast, our language marks space using speaker-focused directions (e.g., “the office is to my left”), it makes it easier for us to think in terms of left and right. 16

Controversies

The Whorf hypothesis cuts to the core of what linguists, psychologists, and behavioral scientists in general want to know about language. So, it should be no surprise that it has been the subject of much (very passionate) debate across a great number of topics.

While we cannot take a stand on which side is right, we can walk through some of the research in one topic of debate: the linguistic relativity (or lack thereof) of color categories. Color categories are a natural place to look for language’s effect on thought because there is nothing in the physics of light that requires us to draw the color boundaries at one place or another; we can split up the wavelengths in any way we would like. 15 Furthermore, different languages do mark color boundaries differently. English treats “light blue” and “dark blue” as shades of one color, whereas Spanish distinguishes between “celeste” and “azul.” If the strong Whorf hypothesis were true, we would expect speakers of different languages to literally perceive colors differently, in accordance with their specific language’s boundaries. If the weak Whorf hypothesis were true, we would expect to see some linguistic influence on color perception.

In the 1970s, many researchers argued that universals in color categories and perception across different languages falsified both versions of the Whorf hypothesis. For example, Eleanor Heider’s 1972 study found no difference in how well speakers of languages with different color categories could memorize “focal,” or easily remembered, colors. 23 Brent Berlin and Paul Kay’s work in the 60s and 70s found that, while different languages have different color categories, these color categories all follow the same patterns: they come from 11 universal categories, and they follow the same historical progression. 24 These results sharply contradict any strong version of the Whorf hypothesis: it seems as though speakers of different languages perceive colors the same way, and that languages might not differ much in how they categorize color at all!

However, recent work has come to the defense of the weak form of the Whorf hypothesis. For instance, a landmark 2007 study by Jonathan Winawer and his colleagues found that Russian speakers are significantly faster than English speakers at discriminating particular shades of blue. The Russian language, like Spanish, marks lighter shades of “blue” and darker shades of “blue” as different color categories. 17 As it turns out, when tasked with discriminating between these sorts of shades of blue, Russian speakers were able to do so faster than English speakers. 17 More importantly, when Winawer and his coauthors added a verbal interference task, such as asking speakers to memorize a series of numbers while discriminating between different colors, the difference went away. 17 This suggests that Russian speakers are faster at discriminating between these shades of blue because they speak Russian. 17

We all know that good decisions are often future-oriented. We save money now so we can have a better retirement later. We exercise now so we are healthier long-term. But can the language we speak influence how prone we are to make future-oriented decisions?

According to economist M. Keith Chen’s 2013 study titled “The Effect of Language on Economic Behavior: Evidence from Savings Rates, Health Behaviors, and Retirement Assets,” the answer seems to be “yes.” In this study, Chen examined the future-oriented decisions of English speakers and German speakers. English requires speakers to mark the future tense in a way that German does not. To say something about the future, English requires us to add the word “will.” 24 For example, to turn “it rains” into the future tense, we say “it will rain.” German, in contrast, does not require an additional word: present tense “Morgen regnet es” means “it rains tomorrow,” allowing German speakers to communicate about the future in the present tense. 25

Chen’s hypothesis was that this difference in whether a language marks the future through its own grammatical category could lead to a difference in decision-making. 25 If a language forces speakers to separate the present from the future, as English does, speakers might be influenced into thinking of the future as more distant, making them less prone to make future-oriented decisions. 20 In contrast, if speakers are not forced to grammatically mark the difference between the present and the future, as in German, speakers might see the future as closer to the present, making them more prone to make future-oriented decisions. 25

Surprisingly, the hypothesis was borne out: German speakers were more likely to save, exercise, etc. than English speakers. 25 Even more strikingly, this effect does not seem to be merely a by-product of cultural or institutional differences between English-speaking countries and German-speaking countries. 25 What Chen found is that language and culture can influence decision-making independently: people can be influenced into more future-oriented decisions either by the society they live in or by the language they speak. 25


References

  • Special Words That Don’t Exist in English (Yet). (2018). BBC News. https://www.bbc.com/news/av/world-45685575
  • 10 of the Best Words in the World (That Don’t Translate Into English). (2018). The Guardian. https://www.theguardian.com/world/2018/jul/27/10-of-the-best-words-in-the-world-that-dont-translate-into-english
  • Scholz, B. C., Pelletier, F. J., & Pullum, G. K. (2020). Philosophy of Linguistics. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Summer 2020). Metaphysics Research Lab, Stanford University.
  • Boutonnet, B., Dering, B., Viñas-Guasch, N., & Thierry, G. (2013). Seeing Objects through the Language Glass.  Journal of Cognitive Neuroscience ,  25 (10), 1702–1710.  https://doi.org/10.1162/jocn_a_00415
  • Pinker, S. (2010).  The language instinct: How the mind creates language  (Nachdr.). Harper Perennial.
  • Lucy, J. A. (1992).  Language diversity and thought: A reformulation of the linguistic relativity hypothesis . Cambridge University Press.  https://doi.org/10.1017/CBO9780511620843
  • Lupyan, G. (2012). Linguistically Modulated Perception and Cognition: The Label-Feedback Hypothesis.  Frontiers in Psychology ,  3 .  https://doi.org/10.3389/fpsyg.2012.00054 .
  • Chomsky, N., & Smith, N. (2000).  New Horizons in the Study of Language and Mind  (1st ed.). Cambridge University Press.  https://doi.org/10.1017/CBO9780511811937
  • Fodor, J. A. (1983).  The Modularity of Mind: an Essay on Faculty Psychology .  https://doi.org/10.7551/mitpress/4737.001.0001
  • Koerner, E. F. K. (1992). The Sapir-Whorf Hypothesis: A Preliminary History and a Bibliographical Essay.  Journal of Linguistic Anthropology ,  2 (2), 173–198.  https://doi.org/10.1525/jlin.1992.2.2.173
  •  Advance Your Career. (2000).  American Anthropological Association .  https://www.americananthro.org/AdvanceYourCareer/Content.aspx?ItemNumber=1581
  • McWhorter, J. H. (2014).  The language hoax: Why the world looks the same in any language . Oxford University Press.
  • Baghramian, M., & Carter, J. A. (2021). Relativism. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Spring 2021). Metaphysics Research Lab, Stanford University.
  • Haun, D. B. M., Rapold, C. J., Janzen, G., & Levinson, S. C. (2011). Plasticity of human spatial cognition: Spatial language and cognition covary across cultures.  Cognition ,  119 (1), 70–80.  https://doi.org/10.1016/j.cognition.2010.12.009
  • Winawer, J., Witthoft, N., Frank, M. C., Wu, L., Wade, A. R., & Boroditsky, L. (2007). Russian blues reveal effects of language on color discrimination.  Proceedings of the National Academy of Sciences ,  104 (19), 7780–7785.  https://doi.org/10.1073/pnas.0701644104
  • Mueller-Vollmer, K., & Messling, M. (2017). Wilhelm von Humboldt. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Spring 2017). Metaphysics Research Lab, Stanford University.
  • Tax, S. (2021, July 5).  Franz Boas .  Encyclopedia Britannica .  https://www.britannica.com/biography/Franz-Boas
  • Britannica, T. Editors of Encyclopaedia (2021, January 31).  Edward Sapir .  Encyclopedia Britannica . https://www.britannica.com/biography/Edward-Sapir
  • Britannica, T. Editors of Encyclopaedia (2021, July 22).  Benjamin Lee Whorf .  Encyclopedia Britannica . https://www.britannica.com/biography/Benjamin-Lee-Whor
  • McGilvray, J. A. (2021, March 23).  Noam Chomsky .  Encyclopedia Britannica .  https://www.britannica.com/biography/Noam-Chomsky
  • Heim, I., & Kratzer, A. (1998).  Semantics in generative grammar . Blackwell.
  • Heider, E. R. (1972). Universals in color naming and memory.  Journal of Experimental Psychology ,  93 (1), 10–20.  https://doi.org/10.1037/h0032606
  • Cook, R. S., Kay, P., & Regier, T. (n.d.).  The World Color Survey Database: History and Use . 22.  http://www1.icsi.berkeley.edu/~kay/claire7.pdf
  • Chen, M. K. (2013). The Effect of Language on Economic Behavior: Evidence from Savings Rates, Health Behaviors, and Retirement Assets.  American Economic Review ,  103 (2), 690–731.  https://doi.org/10.1257/aer.103.2.690

About the Author

Juan Ignacio Murillo


Juan was a Summer Associate at The Decision Lab. He recently graduated from the University of Toronto with a Bachelor’s degree in philosophy and linguistics, and starting this upcoming fall he will be pursuing an MA in Philosophy at the University of Wisconsin-Milwaukee. He is passionate about integrating and applying traditional philosophical thinking—especially in metaethics, the philosophy of language, and the philosophy of science—to empirical research and problems in everyday life. Currently, he is interested in what values are, and how they feature in what we say and how we think. He is also interested in how understanding the role values play in our lives may help us deal with broader societal issues, such as vaccine hesitancy.


Sapir-Whorf Hypothesis: Examples, Definition, Criticisms


Developed in 1929 by Edward Sapir, the Sapir-Whorf hypothesis (also known as linguistic relativity) states that a person’s perception and experience of the world around them is both determined and influenced by the language that they speak.

The theory proposes that differences in grammatical and verbal structures, and the nuanced distinctions in the meanings that are assigned to words, create a unique reality for the speaker. We also call this idea the linguistic determinism theory.

Sapir-Whorf Hypothesis Definition and Overview

Cibelli et al. (2016) reiterate the tenets of the hypothesis by stating:

“…our thoughts are shaped by our native language, and that speakers of different languages therefore think differently”(para. 1).

Kay & Kempton (1984) explain it a bit more succinctly. They explain that the hypothesis itself is based on the:

“…evolutionary view prevalent in 19th century anthropology based in both linguistic relativity and determinism” (pp. 66, 79).

Edward Sapir, an American linguist with a strong interest in anthropology, taught at Yale University, where Benjamin Whorf studied under him in the 1930s.

Sapir & Whorf began to consider lexical and grammatical patterns and how these factored into the construction of different cultures’ views of the world around them.

For example, they compared how thoughts and behavior differed between English speakers and Hopi speakers in regard to the concept of time, arguing that the absence of a future tense in the Hopi language has significant relevance (Kay & Kempton, 1984, pp. 78-79).

Whorf (2021), in his own words, asserts:

“Every language is a vast pattern-system, different from others, in which are culturally ordained the forms and categories by which the personality not only communicates, but also analyzes nature, notices or neglects types of relationship and phenomena, channels his reasoning, and builds the house of his consciousness” (p. 252).

10 Sapir-Whorf Hypothesis Examples

  • Constructions of food in language: A language may ascribe many words to explain the same concept, item, or food type. This shows that they perceive it as extremely important in their society, in comparison to a culture whose language only has one word for that same concept, item, or food.
  • Descriptions of color in language: Different cultures may visually perceive colors in different ways according to how the colors are described by the words in their language.
  • Constructions of gender in language: Many languages are “gendered”, creating word associations that pertain to the roles of men or women in society.
  • Perceptions of time in language: Depending upon how the tenses are structured in a language, it may dictate how the people that speak that language perceive the concept of time.
  • Categorization in language: The ways concepts and items in a given culture are categorized (and what words are assigned to them) can affect the speaker’s perception of the world around them.
  • Politeness is encoded in language: Levels of politeness in a language and the pronoun combinations to express these levels differ between languages. How languages express politeness with words can dictate how they perceive the world around them.
  • Indigenous words for snow: A popular example used to justify this hypothesis is the Inuit people, whose languages are said to have a multitude of words for snow. Following Sapir’s reasoning, this would suggest that the Inuit have a profoundly deeper understanding of snow than other cultures.
  • Use of idioms in language: An expression or well-known saying in one culture has an acute meaning implicitly understood by those that speak the particular language but is not understandable when expressed in another language.
  • Values are engrained in language: Each country and culture has beliefs and values as a direct result of the language it uses.
  • Slang in language: The slang used by younger people evolves from generation to generation in all languages. Generational slang carries with it perceptions and ideas about the world that members of that generation share.


Two Ways Language Shapes Perception

1. Perception of Categories and Categorization

How concepts and items in a culture are categorized (and what words are assigned to them) can affect the speaker’s perception of the world around them.

Although the examples of this phenomenon are too numerous to cite, a clear example is the extremely contextual, nuanced, and hyper-categorized Japanese language.

In the English language, the concept of “you” and “I” is narrowed to these two forms. However, Japanese has numerous ways to express you and I, each having various levels of politeness and appropriateness in relation to age, gender, and stature in society.

In everyday conversation, pronouns are often left out and inferred from context, but misusing or omitting the proper pronoun when one is expected can be perceived as rude or ill-mannered.

In other ways, the complexity of the categorical lexicons can often leave English speakers puzzled. This could come in the form of classifications of differently shaped bowls and plates that serve different functions; traces of the ancient Japanese calendar from the 7th century, which divided the year into 72 micro-seasons; or any number of subdivided word listings that another language might cover with one blanket term.

Masuda et al. (2017) give a clear example:

“People conceptualize objects along the lines drawn between existing categories in their native language. That is, if two concepts fall into the same linguistic category, the perception of similarity between these objects would be stronger than if the two concepts fall into different linguistic categories.”

They then go on to give the example of how Japanese vs English speakers might categorize an everyday object – the bell:

“For example, in Japanese, the kind of bell found in a bell tower generally corresponds to the word kane—a large bell—which is categorically different from a small bell, suzu. However, in English, these two objects are considered to belong within the same linguistic category, “bell.” Therefore, we might expect English speakers to perceive these two objects as being more similar than would Japanese speakers” (para. 5).

2. Perception of the Concept of Time

Depending on how tenses are structured in a language, the language may dictate how the people who speak it perceive the concept of time.

One of the most famous applications of the theory, made by Whorf, is to the language of the Hopi, a Native American people of Arizona.

He claimed, although this has since been vehemently refuted by linguistic scholars, that the Hopi have no general notion of time: that they cannot distinguish between the past, present, and future because of the grammatical structures used within their language.

As Engle (2016) asserts, Whorf believed that the Hopi language “encodes on ordinal value, rather than a passage of time”.

He concluded that, “a day followed by a night is not so much a new day, but a return to daylight” (p. 96).

However, it is not only Hopi culture that has a different perception of time embedded in its language; Thai culture has a non-linear concept of time, and the Malagasy people of Madagascar believe that time is in motion around human beings, not that human beings are passing through time (Engle, 2016, p. 99).

Criticism of Sapir-Whorf Hypothesis

1. Language as context-dependent

Iwamoto (2005) expresses that the Sapir-Whorf hypothesis fails to recognize that language is used within context. Its purely decontextualized textual analysis of language is too one-dimensional and doesn’t consider how we actually use language:

“Whorf’s “neat and simplistic” linguistic relativism presupposes the idea that an entire language or entire societies or cultures are categorizable or typable in a straightforward, discrete, and total manner, ignoring other variables such as contextual and semantic factors.” (Iwamoto, 2005, p. 95)

2. Not universally applicable

Another criticism of the hypothesis is that Sapir & Whorf’s hypothesis cannot be transferred or applied to all languages.

It is difficult to find empirical studies confirming that other cultures do not perceive concepts in similar ways, even when they lack an equivalent word or expression for a given concept.

3. Thoughts can be independent of language

Steven Pinker, one of Sapir & Whorf’s most emphatic critics, argues that language does not determine our thoughts and is not a cultural invention that creates perceptions; it is, in his opinion, a part of human biology (Meier & Pinker, 1995, pp. 611-612).

He suggests that the acquisition and development of sign language show that languages are instinctual, and therefore biological; he even goes so far as to say that “all speech is an illusion” (p. 613).

References

Cibelli, E., Xu, Y., Austerweil, J. L., Griffiths, T. L., & Regier, T. (2016). The Sapir-Whorf Hypothesis and Probabilistic Inference: Evidence from the Domain of Color. PLOS ONE, 11(7), e0158725. https://doi.org/10.1371/journal.pone.0158725

Engle, J. S. (2016). Of Hopis and Heptapods: The Return of Sapir-Whorf.  ETC.: A Review of General Semantics ,  73 (1), 95.  https://www.questia.com/library/journal/1G1-544562276/of-hopis-and-heptapods-the-return-of-sapir-whorf

Iwamoto, N. (2005). The Role of Language in Advancing Nationalism.  Bulletin of the Institute of Humanities ,  38 , 91–113.

Meier, R. P., & Pinker, S. (1995). The Language Instinct: How the Mind Creates Language.  Language ,  71 (3), 610.  https://doi.org/10.2307/416234

Masuda, T., Ishii, K., Miwa, K., Rashid, M., Lee, H., & Mahdi, R. (2017). One Label or Two? Linguistic Influences on the Similarity Judgment of Objects between English and Japanese Speakers. Frontiers in Psychology , 8 . https://doi.org/10.3389/fpsyg.2017.01637

Kay, P., & Kempton, W. (1984). What Is the Sapir-Whorf Hypothesis?  American Anthropologist ,  86 (1), 65–79. http://www.jstor.org/stable/679389

Whorf, B. L. (2021).  Language, Thought, and Reality: Selected Writings of Benjamin Lee Whorf . Hassell Street Press.

Gregory Paul C. (MA)

Gregory Paul C. is a licensed social studies educator who has been teaching the social sciences in some capacity for 13 years. He currently works at a university, in an international liberal arts department, teaching cross-cultural studies in the Chuugoku Region of Japan. He also manages semester study abroad programs for Japanese students and prepares them for the challenges they may face living in various countries short term.

Chris Drew (PhD)

This article was peer-reviewed and edited by Chris Drew (PhD). The review process on Helpful Professor involves having a PhD level expert fact check, edit, and contribute to articles. Reviewers ensure all content reflects expert academic consensus and is backed up with reference to academic studies. Dr. Drew has published over 20 academic articles in scholarly journals. He is the former editor of the Journal of Learning Development in Higher Education and holds a PhD in Education from ACU.

Adam Becker

Author and astrophysicist

weak forms and strong forms

For Cameron Neylon, because he kept asking me for this…

The Sapir-Whorf hypothesis[1] states that language affects thought: how we speak influences how we think. Or, at least, that’s one form of the hypothesis, the weak form. The strong form of Sapir-Whorf says that language determines thought, that how we speak forms a hard boundary on how and what we think. The weak form of Sapir-Whorf says that we drive an ATV across the terrain of thought; language can smooth the path in some areas and create rocks and roadblocks in others, but it doesn’t fundamentally limit where we can go. The strong form, in contrast, says we drive a steam train of thought, and language lays down the rails. There’s an intricate maze of forks and switchbacks spanning the continent, but at the end of the day we can only go where the rails will take us; we can’t lay down new track, no matter how we might try.

Most linguists today accept that some form of the weak Sapir-Whorf hypothesis must be true: the language(s) we speak definitely affect how we think and act. But most linguists also accept that the strong Sapir-Whorf hypothesis can’t be true, just as a matter of empirical fact. New words are developed, new concepts formed, new trails blazed on the terrain of thought. Some tasks may be easier or harder depending on whether your language is particularly suited for them, though even this is in dispute. But it’s simply not the case that we can’t think about things if we don’t have the words for them, nor that language actually determines our thought. In short, while the weak form of Sapir-Whorf is probably correct, the strong form is wrong. And this makes some sense: it certainly seems like language affects our thoughts, but it doesn’t seem like language wholly determines our thoughts.

But the Sapir-Whorf hypothesis isn’t the only theory with strong and weak forms — in fact, there’s a whole pattern of theories like this, and associated rhetorical dangers that go along with them. The pattern looks like this:

  1. Start with a general theoretical statement about the world, where…
  2. …there are two forms, a weak form and a strong form, and…
  3. …the weak form is obviously true (how could it not be?) and…
  4. …the strong form is obviously false, or at least much more controversial. Then, the rhetorical danger rears its head, and…
  5. …arguments for the (true) weak form are appropriated, unmodified or nearly so, as arguments for the strong form by the proponents of the latter. (You also sometimes see this in reverse: people who are eager to deny the strong form rejecting valid arguments for the weak form.)

I don’t know why (5) happens, but I suspect (with little to no proof) that this confusion stems from rejection of a naive view of the world. Say you start with a cartoonishly simple picture of some phenomenon — for example, say you believe that thought isn’t affected by language in any way at all. Then you hear (good!) arguments for the weak form of the Sapir-Whorf hypothesis, which shows this cartoon picture is too simple to capture reality. With your anchor line to your old idea cut, you veer to the strong form of Sapir-Whorf. Then, later, when arguing for your new view, you use the same arguments that convinced you your old naive idea was false — namely, arguments for the weak form. (This also suggests that when (5) happens in reverse, this is founded in the same basic confusion: people defend themselves from the strong form by attacking the weak form because they would feel unmoored from their (naive) views if the weak form were true.) But why this happens is all speculation on my part. All I know for sure is that it does happen.

Cultural relativism about scientific truth is another good example. The two forms look something like this:

Weak form: Human factors like culture, history, and economics influence the practice of science, and thereby the content of our scientific theories.

Strong form: Human factors like culture, history, and economics wholly determine the content of our scientific theories.

It’s hard to see how the weak form could be wrong. Science is a human activity, and like any human activity, it’s affected by culture, economics, history, and other human factors. But the strong form claims that science is totally disconnected from anything like a “real world,” is simply manufactured by a variety of cultural and social forces, and has no special claim to truth. This is just not true. In her excellent book Brain Storm — itself about how the weak form of this thesis has played out in the spurious science of innate gender differences in the development of the human brain — Rebecca Jordan-Young forcefully rejects the strong form of relativism about science, and addresses both directions of the rhetorical confusion that arises from confounding the weak form with the strong:

The fact that science is not, and can never be, a simple mirror of the world also does not imply that science is simply “made up” and is not constrained by material phenomena that actually exist—the material world “pushes back” and exerts its own effects in science, even if we accept the postmodern premise that we humans have no hope of a direct access to that world that is unmediated by our own practices and culturally determined cognitive and linguistic structures. There is no need to dogmatically insist (against all evidence) that science really is objective in order to believe in science as a good and worthwhile endeavor, and even to believe in science as a particularly useful and trustworthy way of learning about the world.[2]

Successful scientific theories, in general, must bear some resemblance to the world at large. Indeed, the success of scientific theories in predicting phenomena in the world would be nothing short of a miracle if there were absolutely no resemblance between the content of those theories and the content of the world.[3] That’s not to say that our theories are perfect representations of the world, nor that they are totally unaffected by cultural and political factors: far from it. I’m writing a book right now that’s (partly) about the cultural and historical factors influencing the debate on the foundations of quantum physics. But the content of our scientific theories is certainly not solely determined by human factors. Science is our best attempt to learn about the nature of the world. It’s not perfect. That’s OK.

There are many people, working largely in Continental philosophy and critical theory of various stripes, who advocate the strong form of relativism about science.[4] Yet most of their arguments which are ostensibly in favor of this strong form are actually arguments for the weak form: that culture plays some role in determining the content of our best scientific theories.[5] And that’s simply not the same thing.

Another, much more popular example of a strong and weak form problem is the set of claims around the “power of positive thinking.” The weak form suggests that being more confident and positive can make you happier, healthier, and more successful. This is usually true, and it’s hard to see how it couldn’t be usually true, though there are many specific counterexamples. For example, positive thinking can’t keep your house from being destroyed by a hurricane. Yet the strong form of positive-thinking claims (known as “the law of attraction,” and popularized by The Secret) suggests exactly that. This states that positive thinking, and positive thinking alone, can literally change the world around you for the better, preventing and reversing all bad luck and hardship.[6] Not only is this manifestly untrue, but the logical implications are morally repugnant: if bad things do happen to you, it must be a result of not thinking positively enough. For example, if you have cancer, and it’s resistant to treatment, that must be your fault. While this kind of neo-Calvinist victim-blaming is bad enough, it becomes truly monstrous, and the flaw in the reasoning particularly apparent, when extended from unfortunate individual circumstances to systematically disadvantaged groups. The ultimate responsibility for slavery, colonialism, genocide, and institutionalized bigotry quite obviously does not lie with the victims’ purported inability to wish hard enough for a better world.

In short, easily-confused strong and weak forms of a theory abound. I’m not claiming that this is anything like an original idea. All I’m saying is that some theories come in strong and weak forms, that sometimes the weak forms are obviously true and the strong obviously false, and that in those cases, it’s easy to take rhetorical advantage (deliberately or not) of this confusion. You could argue that the weak form directly implies the strong form in some cases, and maybe it does. But that’s not generally true, and you have to do a lot of work to make that argument — work that often isn’t done.

Again, I strongly suspect other people have come up with this idea. When I’ve talked with people about this, they’ve generally picked it up very quickly and come up with examples I didn’t think of. This seems to be floating around. If someone has a good citation for it, I’d be immensely grateful.

Image credit: Zink Dawg at English Wikipedia, CC-BY 3.0. I was strongly tempted to use this image instead.

  1. This is apparently a historical misnomer, but we’ll ignore that for now.
  2. Rebecca M. Jordan-Young, in Brain Storm: The Flaws in the Science of Sex Differences, Harvard University Press, 2011, pp. 299-300. Emphasis in the original.
  3. See J.J.C. Smart, Philosophy and Scientific Realism, and Hilary Putnam, Mathematics, Matter, and Method.
  4. Bruno Latour is the first name that comes to mind.
  5. See, for example, Kuhn, who even seems to have confused himself about whether he was advocating the strong or the weak version.
  6. The “arguments” in favor of this kind of nonsense take advantage of more than just the confusion between the strong and weak forms of the thesis about positive thinking. They also rely on profound misunderstandings about quantum physics and other perversions of science. But let’s put that aside for now.

One thought on “weak forms and strong forms”

There’s Occam’s Rusty Razor at work. Weak versions of theories necessitate lots of conditionals. Simpler just to eschew all conditionals. But simplicity itself is a virtue only with lots of subtlety and conditionality. Rusty razors butcher. Eschew Occam’s Rusty Razor.

Yale Linguistics

Benjamin Lee Whorf

Whorf’s publications include The Comparative Linguistics of Uto-Aztecan (1935), Maya Writing and Its Decipherment (1935), Discussion of Hopi Linguistics (1937), Science and Linguistics (1940), Linguistics as an Exact Science (1940), Languages and Logic (1941), Grammatical Categories (1945), An American Indian Model of the Universe (1950), and A Review of General-Semantics (1950).

  • Open access
  • Published: 20 May 2024

Testing theory of mind in large language models and humans

  • James W. A. Strachan (ORCID: orcid.org/0000-0002-8618-3834)
  • Dalila Albergo (ORCID: orcid.org/0000-0002-8039-5414)
  • Giulia Borghini
  • Oriana Pansardi (ORCID: orcid.org/0000-0001-6092-1889)
  • Eugenio Scaliti (ORCID: orcid.org/0000-0002-4977-2197)
  • Saurabh Gupta (ORCID: orcid.org/0000-0001-6978-4243)
  • Krati Saxena (ORCID: orcid.org/0000-0001-7049-9685)
  • Alessandro Rufo (ORCID: orcid.org/0009-0003-8565-4192)
  • Stefano Panzeri (ORCID: orcid.org/0000-0003-1700-8909)
  • Guido Manzi (ORCID: orcid.org/0009-0009-2927-3380)
  • Michael S. A. Graziano
  • Cristina Becchio (ORCID: orcid.org/0000-0002-6845-0521)

Nature Human Behaviour (2024)

At the core of what defines us as humans is the concept of theory of mind: the ability to track other people’s mental states. The recent development of large language models (LLMs) such as ChatGPT has led to intense debate about the possibility that these models exhibit behaviour that is indistinguishable from human behaviour in theory of mind tasks. Here we compare human and LLM performance on a comprehensive battery of measurements that aim to measure different theory of mind abilities, from understanding false beliefs to interpreting indirect requests and recognizing irony and faux pas. We tested two families of LLMs (GPT and LLaMA2) repeatedly against these measures and compared their performance with those from a sample of 1,907 human participants. Across the battery of theory of mind tests, we found that GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas. Faux pas, however, was the only test where LLaMA2 outperformed humans. Follow-up manipulations of the belief likelihood revealed that the superiority of LLaMA2 was illusory, possibly reflecting a bias towards attributing ignorance. By contrast, the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference. These findings not only demonstrate that LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans but also highlight the importance of systematic testing to ensure a non-superficial comparison between human and artificial intelligences.

People care about what other people think and expend a lot of effort thinking about what is going on in other minds. Everyday life is full of social interactions that only make sense when considered in light of our capacity to represent other minds: when you are standing near a closed window and a friend says, ‘It’s a bit hot in here’, it is your ability to think about her beliefs and desires that allows you to recognize that she is not just commenting on the temperature but politely asking you to open the window [1].

This ability for tracking other people’s mental states is known as theory of mind. Theory of mind is central to human social interactions, from communication to empathy to social decision-making, and has long been of interest to developmental, social and clinical psychologists. Far from being a unitary construct, theory of mind refers to an interconnected set of notions that are combined to explain, predict, and justify the behaviour of others [2]. Since the term ‘theory of mind’ was first introduced in 1978 (ref. 3), dozens of tasks have been developed to study it, including indirect measures of belief attribution using reaction times [4, 5, 6] and looking or searching behaviour [7, 8, 9], tasks examining the ability to infer mental states from photographs of eyes [10], and language-based tasks assessing false belief understanding [11, 12] and pragmatic language comprehension [13, 14, 15, 16]. These measures are proposed to test early, efficient but inflexible implicit processes as well as later-developing, flexible and demanding explicit abilities that are crucial for the generation and comprehension of complex behavioural interactions [17, 18] involving phenomena such as misdirection, irony, implicature and deception.

The recent rise of large language models (LLMs), such as generative pre-trained transformer (GPT) models, has shown some promise that artificial theory of mind may not be too distant an idea. Generative LLMs exhibit performance that is characteristic of sophisticated decision-making and reasoning abilities [19, 20] including solving tasks widely used to test theory of mind in humans [21, 22, 23, 24]. However, the mixed success of these models [23], along with their vulnerability to small perturbations to the provided prompts, including simple changes in characters’ perceptual access [25], raises concerns about the robustness and interpretability of the observed successes. Even in cases where these models are capable of solving complex tasks [20] that are cognitively demanding even for human adults [17], it cannot be taken for granted that they will not be tripped up by a simpler task that a human would find trivial [26]. As a result, work in LLMs has begun to question whether these models rely on shallow heuristics rather than robust performance that parallels human theory of mind abilities [27].

In the service of the broader multidisciplinary study of machine behaviour [28], there have been recent calls for a ‘machine psychology’ [29] that have argued for using tools and paradigms from experimental psychology to systematically investigate the capacities and limits of LLMs [30]. A systematic experimental approach to studying theory of mind in LLMs involves using a diverse set of theory of mind measures, delivering multiple repetitions of each test, and having clearly defined benchmarks of human performance against which to compare [31]. In this Article, we adopt such an approach to test the performance of LLMs in a wide range of theory of mind tasks. We tested the chat-enabled version of GPT-4, the latest LLM in the GPT family of models, and its predecessor ChatGPT-3.5 (hereafter GPT-3.5) in a comprehensive set of psychological tests spanning different theory of mind abilities, from those that are less cognitively demanding for humans such as understanding indirect requests to more cognitively demanding abilities such as recognizing and articulating complex mental states like misdirection or irony [17]. GPT models are closed, evolving systems. In the interest of reproducibility [32], we also tested the open-weight LLaMA2-Chat models on the same tests. To understand the variability and boundary limitations of LLMs’ social reasoning capacities, we exposed each model to multiple repetitions of each test across independent sessions and compared their performance with that of a sample of human participants (total N = 1,907). Using variants of the tests considered, we were able to examine the processes behind the models’ successes and failures in these tests.

Theory of mind battery

We selected a set of well-established theory of mind tests spanning different abilities: the hinting task [14], the false belief task [11, 33], the recognition of faux pas [13], and the strange stories [15, 16]. We also included a test of irony comprehension using stimuli adapted from a previous study [34]. Each test was administered separately to GPT-4, GPT-3.5 and LLaMA2-70B-Chat (hereafter LLaMA2-70B) across 15 chats. We also tested two other sizes of LLaMA2 model (7B and 13B), the results of which are reported in Supplementary Information section 1. Because each chat is a separate and independent session, and information about previous sessions is not retained, this allowed us to treat each chat (session) as an independent observation. Responses were scored in accordance with the scoring protocols for each test in humans (Methods) and compared with those collected from a sample of 250 human participants. Tests were administered by presenting each item sequentially in a written format that ensured a species-fair comparison [35] (Methods) between LLMs and human participants.
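As a rough illustration of treating each session as one observation, the sketch below averages item scores within a session and then summarizes the per-session values by their median and interquartile range. It is not the authors' analysis code: the function names, the 0/1 item scores and the three example sessions are assumptions made up for illustration, and the actual scoring protocols differ between tests.

```python
from statistics import mean, median, quantiles


def session_scores(responses):
    """Average the item scores within each session.

    `responses` is a list of sessions; each session is a list of numeric
    item scores. One value is returned per session (chat or participant).
    """
    return [mean(items) for items in responses]


def summarize(scores):
    """Return the median and interquartile range of per-session scores."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {"median": median(scores), "iqr": (q1, q3)}


# Hypothetical data: 3 sessions x 4 items, each item scored 0 or 1.
example = [[1, 1, 0, 1], [1, 1, 1, 1], [0, 1, 1, 1]]
scores = session_scores(example)
summary = summarize(scores)
print(scores, summary)
```

Keeping one averaged value per session means that a model tested in 15 chats contributes 15 data points, which is what allows its distribution to be compared with the distribution of human participants.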

Performance across theory of mind tests

Except for the irony test, all other tests in our battery are publicly available tests accessible within open databases and scholarly journal articles. To ensure that models did not merely replicate training set data, we generated novel items for each published test (Methods). These novel test items matched the logic of the original test items but used a different semantic content. The text of original and novel items and the coded responses are available on the OSF (methods and resource availability).

Figure 1a compares the performance of LLMs against the performance of human participants across all tests included in the battery. Differences in performance on original items versus novel items, separately for each test and model, are shown in Fig. 1b.

Figure 1

a, Original test items for each test showing the distribution of test scores for individual sessions and participants. Coloured dots show the average response score across all test items for each individual test session (LLMs) or participant (humans). Black dots indicate the median for each condition. P values were computed from Holm-corrected Wilcoxon two-way tests comparing LLM scores (n = 15 LLM observations) against human scores (irony, N = 50 human participants; faux pas, N = 51 human participants; hinting, N = 48 human participants; strange stories, N = 50 human participants). Tests are ordered in descending order of human performance. b, Interquartile ranges of the average scores on the original published items (dark colours) and novel items (pale colours) across each test (for LLMs, n = 15 LLM observations; for humans, false belief, N = 49 human participants; faux pas, N = 51 human participants; hinting, N = 48 human participants; strange stories, N = 50 human participants). Empty diamonds indicate the median scores, and filled circles indicate the upper and lower bounds of the interquartile range. P values shown are from Holm-corrected Wilcoxon two-way tests comparing performance on original items against the novel items generated as controls for this study.
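The caption above reports Holm-corrected P values. As a minimal sketch of what that correction does (not the authors' analysis code; the function name and the raw P values are invented for illustration), the Holm step-down procedure can be written in plain Python. The Wilcoxon tests that would produce the raw P values are not shown here.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on.
    A running maximum keeps the adjusted values monotone, and each value
    is capped at 1.
    """
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three comparisons.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
print(adjusted)  # roughly [0.03, 0.06, 0.06]
```

In this made-up example, the comparison with a raw P value of 0.04 would no longer fall below a 0.05 threshold after correction, which is the kind of protection against multiple comparisons the procedure is meant to give.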


False belief

Both human participants and LLMs performed at ceiling on this test (Fig. 1a). All LLMs correctly reported that an agent who left the room while the object was moved would later look for the object in the place where they remembered seeing it, even though it no longer matched the current location. Performance on novel items was also near perfect (Fig. 1b), with only 5 human participants out of 51 making one error, typically by failing to specify one of the two locations (for example, ‘He’ll look in the room’; Supplementary Information section 2).

In humans, success on the false belief task requires inhibiting one’s own belief about reality in order to use one’s knowledge about the character’s mental state to derive predictions about their behaviour. However, with LLMs, performance may be explained by lower-level explanations than belief tracking [27]. Supporting this interpretation, LLMs such as ChatGPT have been shown to be susceptible to minor alterations to the false belief formulation [25, 27], such as making the containers where the object is hidden transparent or asking about the belief of the character who moved the object rather than the one who was out of the room. Such perturbations of the standard false belief structure are assumed not to matter for humans (who possess a theory of mind) [25]. In a control study using these perturbation variants (Supplementary Information section 4 and Supplementary Appendix 1), we replicated the poor performance of GPT models found in previous studies [25]. However, we found that human participants (N = 757) also failed on half of these perturbations. Understanding these failures and the similarities and differences in how humans and LLMs may arrive at the same outcome requires further systematic investigation. For example, because these perturbations also involve changes in the physical properties of the environment, it is difficult to establish whether LLMs (and humans) failed because they were sticking to the familiar script and were unable to automatically attribute an updated belief, or because they did not consider physical principles (for example, transparency).

Irony

GPT-4 performed significantly better than human levels (Z = 0.00, P = 0.040, r = 0.32, 95% confidence interval (CI) 0.14–0.48). By contrast, both GPT-3.5 (Z = −0.17, P = 2.37 × 10⁻⁶, r = 0.64, 95% CI 0.49–0.77) and LLaMA2-70B (Z = −0.42, P = 2.39 × 10⁻⁷, r = 0.70, 95% CI 0.55–0.79) performed below human levels (Fig. 1a). GPT-3.5 performed perfectly at recognizing non-ironic control statements but made errors at recognizing ironic utterances (Supplementary Information section 2). Control analysis revealed a significant order effect, whereby GPT-3.5 made more errors on earlier trials than later ones (Supplementary Information section 3). LLaMA2-70B made errors when recognizing both ironic and non-ironic control statements, suggesting an overall poor discrimination of irony.

Faux pas

On this test, GPT-4 scored notably lower than human levels (Z = −0.40, P = 5.42 × 10⁻⁵, r = 0.55, 95% CI 0.33–0.71) with isolated ceiling effects on specific items (Supplementary Information section 2). GPT-3.5 scored even worse, with its performance nearly at floor (Z = −0.80, P = 5.95 × 10⁻⁸, r = 0.72, 95% CI 0.58–0.81) on all items except one. By contrast, LLaMA2-70B outperformed humans (Z = 0.10, P = 0.002, r = 0.44, 95% CI 0.24–0.61), achieving 100% accuracy in all but one run.

The pattern of results for novel items was qualitatively similar (Fig. 1b). Compared with original items, the novel items proved slightly easier for humans (Z = −0.10, P = 0.029, r = 0.29, 95% CI 0.10–0.50) and more difficult for GPT-3.5 (Z = 0.10, P = 0.002, r = 0.69, 95% CI 0.49–0.88), but not for GPT-4 and LLaMA2-70B (P > 0.462; Bayes factor (BF₁₀) of 0.77 and 0.43, respectively). Given the poor performance of GPT-3.5 on the original test items, this difference was unlikely to be explained by a prior familiarity with the original items. These results were robust to alternative coding schemes (Supplementary Information section 5).

Hinting

On this test, GPT-4 performance was significantly better than humans (Z = 0.00, P = 0.040, r = 0.32, 95% CI 0.12–0.50). GPT-3.5 performance did not significantly differ from human performance (Z = 0.00, P = 0.626, r = 0.06, 95% CI 0.01–0.33, BF₁₀ 0.33). Only LLaMA2-70B scored significantly below human levels of performance on this test (Z = −0.20, P = 5.42 × 10⁻⁵, r = 0.57, 95% CI 0.41–0.72).

Novel items proved easier than original items for both humans (Z = −0.10, P = 0.008, r = 0.34, 95% CI 0.14–0.53) and LLaMA2-70B (Z = −0.20, P = 9.18 × 10⁻⁴, r = 0.73, 95% CI 0.50–0.87) (Fig. 1b). Scores on novel items did not differ from the original test items for GPT-3.5 (Z = −0.03, P = 0.955, r = 0.24, 95% CI 0.02–0.59, BF₁₀ 0.61) or GPT-4 (Z = −0.10, P = 0.123, r = 0.44, 95% CI 0.07–0.75, BF₁₀ 0.91). Given that better performance on novel items is the opposite of what a prior familiarity explanation would predict, it is likely that this difference for LLaMA2-70B was driven by differences in item difficulty.

Strange stories

GPT-4 significantly outperformed humans on this test (Z = 0.13, P = 1.04 × 10⁻⁵, r = 0.60, 95% CI 0.46–0.72). The performance of GPT-3.5 did not significantly differ from humans (Z = −0.06, P = 0.110, r = 0.24, 95% CI 0.03–0.44, BF₁₀ 0.47), while LLaMA2-70B scored significantly lower than humans (Z = −0.13, P = 0.005, r = 0.41, 95% CI 0.24–0.60). There were no differences between original and novel items for any model (all P > 0.085; BF₁₀: human 0.22, GPT-3.5 1.46, LLaMA2-70B 0.46; the variance for GPT-4 was too low to compute a Bayes factor). As reported in Supplementary Information section 6, partial successes were infrequent and more likely for LLaMA2-70B than for other models.

Understanding faux pas

In line with previous findings that GPT models struggle with faux pas [36], in our battery, faux pas was the only test in which GPT-4 did not match or exceed human performance. Surprisingly, faux pas was also the only test in which LLaMA2-70B, which was otherwise the poorest-performing model, scored better than humans (Fig. 1).

The faux pas test consists of vignettes describing an interaction where one character (the speaker) says something they should not have said, not knowing or not realizing that they should not say it. To understand that a faux pas has occurred, one must represent two mental states: that the speaker does not know that they should not say what they said, and that the person hearing it (the victim) would feel insulted or hurt by what was said. Consider the following example of a story describing a faux pas:

Jill had just moved into a new house. She went shopping with her Mum and bought some new curtains. When Jill had just put them up, her best friend Lisa came round and said, "Oh, those curtains are horrible, I hope you're going to get some new ones." Jill asked, "Do you like the rest of my bedroom?"

Following the presentation of the story, four comprehension questions are asked. The first question is, ‘In the story, did somebody say something they should not have said?’ to which the correct answer is always yes. The second question asks the respondent to report what the person said that they should not have said, and the third question is a comprehension question specific to the content of the story. The fourth and key question, which was our focus for coding, relates to the speaker’s belief when they made the inappropriate utterance: ‘Did [Lisa] know that [the curtains were new]?’ The correct answer to this final question is always no.

Both GPT-4 and GPT-3.5 correctly identified that the victim would feel insulted or hurt, sometimes going so far as to provide additional details about why the utterance might cause offence. However, when asked whether the speaker was aware of the context that made their statement offensive (such as Lisa knowing that Jill had just bought the curtains), they failed to produce a correct answer. A closer look revealed that the overwhelming majority of the errors on this question reported that there was not enough information provided to be sure, for example:

Did Lisa know the curtains were new? GPT-4: […] It is unclear from the story whether Lisa knew the curtains were new or not.

Only two responses out of 349 reported that, yes, the character did know. We consider three alternative hypotheses for why GPT models, and specifically GPT-4, fail to answer this question correctly.

The first hypothesis, which we term the failure of inference hypothesis, is that models fail to generate inferences about the mental state of the speaker (note that we refer to inference here not in the sense of the processes by which biological organisms infer hidden states from their environment, but rather as any process of reasoning whereby conclusions are derived from a set of propositional premises). Recognizing a faux pas in this test relies on contextual information beyond that encoded within the story (for example, about social norms). In the story above, for instance, nothing indicates that saying that the newly bought curtains are horrible is inappropriate, but this is a necessary proposition that must be accepted in order to accurately infer the mental states of the characters. An inability to use such non-embedded information would fundamentally impair the ability of GPT-4 to compute inferences.

The second hypothesis, which we term the Buridan’s ass hypothesis, is that models are capable of inferring mental states but cannot choose between them, as with the eponymous rational agent caught between two equally appetitive bales of hay that starves because it cannot resolve the paradox of making a decision in the absence of a clear preference 37 . Under this hypothesis, GPT models can propose the correct answer (a faux pas) as one among several possible alternatives but do not rank these alternatives in terms of likelihood. In partial support of this hypothesis, responses from both GPT models occasionally indicate that the speaker may not know or remember but present this as one hypothesis among alternatives (Supplementary Information section 5 ).

The third hypothesis, which we term the hyperconservatism hypothesis, is that GPT models are able both to compute inferences about the mental states of characters and to recognize a false belief or lack of knowledge as the likeliest explanation among competing alternatives, but refrain from committing to a single explanation out of an excess of caution. GPT models are powerful language generators, but they are also subject to inhibitory mitigation processes 38 . It is possible that such processes could lead to an overly conservative stance where GPT models do not commit to the likeliest explanation despite being able to generate it.

To differentiate between these hypotheses, we devised a variant of the faux pas test where the question assessing performance on the faux pas test was formulated in terms of likelihood (hereafter, the faux pas likelihood test). Specifically, rather than ask whether the speaker knew or did not know, we asked whether it was more likely that the speaker knew or did not know. Under the hyperconservatism hypothesis, GPT models should be able to both make the inference that the speaker did not know and identify it as more likely among alternatives, and so we would expect the models to respond accurately that it was more likely that the speaker did not know. In case of uncertainty or incorrect responses, we further prompted models to describe the most likely explanation. Under the Buridan’s ass hypothesis, we expected this question would elicit multiple alternative explanations that would be presented as equally plausible, while under the failure of inference hypothesis, we expected that GPT would not be able to generate the right answer at all as a plausible explanation.
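The predictions of the three hypotheses for the likelihood framing can be summarized as a small decision rule. The sketch below is only an illustration of that logic: the function name and its two boolean arguments are hypothetical, and it assumes that a coder has already judged whether a response endorses 'didn't know' as more likely and whether it at least lists 'didn't know' among the alternatives.

```python
def supported_hypothesis(endorses_didnt_know: bool,
                         lists_didnt_know_as_option: bool) -> str:
    """Map a coded response pattern to the hypothesis it is consistent with.

    Both arguments are hypothetical coder judgements, not fields from the
    study's data files.
    """
    if endorses_didnt_know:
        # The model infers the mental state AND ranks it as most likely.
        return "hyperconservatism"
    if lists_didnt_know_as_option:
        # The inference is generated but presented as one equally
        # plausible alternative among several.
        return "Buridan's ass"
    # The correct explanation is never produced as a plausible option.
    return "failure of inference"


example = supported_hypothesis(endorses_didnt_know=True,
                               lists_didnt_know_as_option=True)
```

In practice a pattern like this would be assessed over many responses rather than a single one, so the rule is best read as a compact restatement of the predictions.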

As shown in Fig. 2a , on the faux pas likelihood test GPT-4 demonstrated perfect performance, with all responses identifying without any prompting that it was more likely that the speaker did not know the context. GPT-3.5 also showed improved performance, although it did require prompting in a few instances (~3% of items) and occasionally failed to recognize the faux pas (~9% of items; see Supplementary Information section 7 for a qualitative analysis of response types).

Fig. 2. a, Scores of the two GPT models on the original framing of the faux pas question (‘Did they know…?’) and the likelihood framing (‘Is it more likely that they knew or didn’t know…?’). Dots show average score across trials (n = 15 LLM observations) on particular items to allow comparison between the original faux pas test and the new faux pas likelihood test. Halfeye plots show distributions, medians (black points), 66% (thick grey lines) and 99% quantiles (thin grey lines) of the response scores on different items (n = 15 different stories involving faux pas). b, Response scores to three variants of the faux pas test: faux pas (pink), neutral (grey) and knowledge-implied variants (teal). Responses were coded as categorical data as ‘didn’t know’, ‘unsure’ or ‘knew’ and assigned a numerical coding of −1, 0 and +1. Filled balloons are shown for each model and variant, and the size of each balloon indicates the count frequency, which was the categorical data used to compute chi-square tests. Bars show the direction bias score computed as the average across responses of the categorical data coded as above. On the right of the plot, P values (one-sided) of Holm-corrected chi-square tests are shown comparing the distribution of response type frequencies in the faux pas and knowledge-implied variants against neutral.

Taken together, these results support the hyperconservatism hypothesis, as they indicate that GPT-4, and to a lesser but still notable extent GPT-3.5, successfully generated inferences about the mental states of the speaker and identified that an unintentional offence was more likely than an intentional insult. Thus, failure to respond correctly to the original phrasing of the question does not reflect a failure of inference, nor indecision among alternatives the model considered equally plausible, but an overly conservative approach that prevented commitment to the most likely explanation.

Testing information integration

A potential confound of the above results is that, as the faux pas test includes only items where a faux pas occurs, any model biased towards attributing ignorance would demonstrate perfect performance without having to integrate the information provided by the story. This potential bias could explain the perfect performance of LLaMA2-70B in the original faux pas test (where the correct answer is always ‘no’) as well as GPT-4’s perfect and GPT-3.5’s good performance on the faux pas likelihood test (where the correct answer is always ‘more likely that they didn’t know’).

To control for this, we developed a novel set of variants of the faux pas likelihood test manipulating the likelihood that the speaker knew or did not know (hereafter the belief likelihood test). For each test item, all newly generated for this control study, we created three variants: a ‘faux pas’ variant, a ‘neutral’ variant, and a ‘knowledge-implied’ variant ( Methods ). In the faux pas variant, the utterance suggested that the speaker did not know the context. In the neutral variant, the utterance suggested neither that they knew nor did not know. In the knowledge-implied variant, the utterance suggested that the speaker knew (for the full text of all items, see Supplementary Appendix 2 ).

If the models’ responses reflect a true discrimination of the relative likelihood of the two explanations (that the person knew versus that they didn’t know, hereafter ‘knew’ and ‘didn’t know’), then the distribution of ‘knew’ and ‘didn’t know’ responses should be different across variants. Specifically, relative to the neutral variant, ‘didn’t know’ responses should predominate for the faux pas, and ‘knew’ responses should predominate for the knowledge-implied variant. If the responses of the models do not discriminate between the three variants, or discriminate only partially, then it is likely that responses are affected by a bias or heuristic unrelated to the story content.

We adapted the three variants (faux pas, neutral and knowledge implied) for six stories, administering each test item separately to each LLM and a new sample of human participants (total N = 900). Responses were coded using a numeric code to indicate which, if either, of the knew/didn’t know explanations the response endorsed (−1, didn’t know; 0, unsure or impossible to tell; +1, knew). These coded scores were then averaged for each story to give a directional score for each variant such that negative values indicated the model was more likely to endorse the ‘didn’t know’ explanation, while positive values indicated the model was more likely to endorse the ‘knew’ explanation. These results are shown in Fig. 2b. As expected, humans were more likely to report that the speaker did not know for faux pas than for neutral (χ²(2) = 56.20, P = 3.82 × 10⁻¹²) and more likely to report that the speaker did know for knowledge implied than for neutral (χ²(2) = 143, P = 6.60 × 10⁻³¹). Humans also reported uncertainty on a small proportion of trials, with a higher proportion in the neutral condition (28 out of 303 responses) than in the other variants (11 out of 303 for faux pas, and 0 out of 298 for knowledge implied).
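The coding and comparison described above can be sketched in a few lines. The following is a minimal illustration, not the authors' analysis code: it assumes the comparison against neutral is a Pearson chi-square on a 2 × k table of response-type counts (which gives 2 degrees of freedom when all three response types occur and 1 when ‘unsure’ never occurs), it omits the Holm correction, and the counts used are hypothetical.

```python
import math

# Numeric codes from the text: -1 didn't know, 0 unsure, +1 knew.
CODES = {"didn't know": -1, "unsure": 0, "knew": 1}


def direction_score(counts):
    """Average of the -1/0/+1 codes over all responses in `counts`."""
    total = sum(counts.values())
    return sum(CODES[label] * n for label, n in counts.items()) / total


def chi_square_vs_neutral(variant, neutral):
    """Pearson chi-square statistic and df for variant vs neutral counts."""
    # Drop response types that never occur in either condition.
    labels = [c for c in CODES if variant.get(c, 0) + neutral.get(c, 0) > 0]
    rows = [[table.get(c, 0) for c in labels] for table in (variant, neutral)]
    grand_total = sum(sum(row) for row in rows)
    col_totals = [sum(col) for col in zip(*rows)]
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat, len(labels) - 1


def p_value(stat, df):
    """Upper-tail chi-square probability, closed forms for df = 1 and 2."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("only df = 1 or 2 are handled in this sketch")


# Hypothetical counts, chosen for illustration only.
faux_pas = {"didn't know": 30, "unsure": 5, "knew": 5}
neutral = {"didn't know": 10, "unsure": 20, "knew": 10}

stat, df = chi_square_vs_neutral(faux_pas, neutral)
p = p_value(stat, df)
```

A negative `direction_score` corresponds to a lean towards ‘didn’t know’ and a positive one to a lean towards ‘knew’, matching the bars in Fig. 2b. A statistics library would normally be used for the test itself; the closed forms here only keep the sketch dependency-free.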

Similarly to humans, GPT-4 was more likely to endorse the ‘didn’t know’ explanation for faux pas than for neutral (χ²(2) = 109, P = 1.54 × 10⁻²³) and more likely to endorse the ‘knew’ explanation for knowledge implied than for neutral (χ²(2) = 18.10, P = 3.57 × 10⁻⁴). In the neutral condition, GPT-4 was also more likely to report uncertainty than to respond randomly (42 out of 90 responses, versus 6 and 17 in the faux pas and knowledge-implied variants, respectively).

The pattern of responses for GPT-3.5 was similar, with the model being more likely to report that the speaker didn’t know for faux pas than for neutral (χ²(1) = 8.44, P = 0.007) and more likely that the character knew for knowledge implied than for neutral (χ²(1) = 21.50, P = 1.82 × 10⁻⁵). Unlike GPT-4, GPT-3.5 never reported uncertainty in response to any variants and always selected one of the two explanations as the likelier even in the neutral condition.

LLaMA2-70B was also more likely to report that the speaker didn’t know in response to faux pas than neutral (χ²(1) = 20.20, P = 2.81 × 10⁻⁵), which was consistent with this model’s ceiling performance in the original formulation of the test. However, it showed no differentiation between neutral and knowledge implied (χ²(1) = 1.80, P = 0.180, BF₁₀ 0.56). As with GPT-3.5, LLaMA2-70B never reported uncertainty in response to any variants and always selected one of the two explanations as the likelier.

Furthermore, the responses of LLaMA2-70B and, to a lesser extent, GPT-3.5 appeared to be subject to a response bias towards affirming that someone had said something they should not have said. The responses to the first question (which involved recognizing that an offensive remark had been made) were of secondary interest to our study. Nevertheless, it was notable that, while all models correctly identified that an offensive remark had been made in the faux pas condition (all LLMs 100%, humans 83.61%), only GPT-4 reliably reported that there was no offensive statement in the neutral and knowledge-implied conditions (15.47% and 27.78%, respectively), with similar proportions to human responses (neutral 19.27%, knowledge implied 30.10%). GPT-3.5 was more likely to report that somebody made an offensive remark in all conditions (neutral 71.11%, knowledge implied 87.78%), and LLaMA2-70B always reported that somebody in the story had made an offensive remark.

Discussion

We collated a battery of tests to comprehensively measure performance in theory of mind tasks in three LLMs (GPT-4, GPT-3.5 and LLaMA2-70B) and compared these against the performance of a large sample of human participants. Our findings validate the methodological approach taken in this study using a battery of multiple tests spanning theory of mind abilities, exposing language models to multiple sessions and variations in both structure and content, and implementing procedures to ensure a fair, non-superficial comparison between humans and machines 35 . This approach enabled us to reveal the existence of specific deviations from human-like behaviour that would have remained hidden using a single theory of mind test, or a single run of each test.

Both GPT models exhibited impressive performance in tasks involving beliefs, intentions and non-literal utterances, with GPT-4 exceeding human levels in the irony, hinting and strange stories tests. Both GPT-4 and GPT-3.5 failed only on the faux pas test. Conversely, LLaMA2-70B, which was otherwise the poorest-performing model, outperformed humans on the faux pas test. Understanding a faux pas involves two aspects: recognizing that one person (the victim) feels insulted or upset and understanding that another person (the speaker) holds a mistaken belief or lacks some relevant knowledge. To examine the nature of models’ successes and failures on this test, we developed and tested new variants of the faux pas test in a set of control experiments.

Our first control experiment, using a likelihood framing of the belief question (faux pas likelihood test), showed that GPT-4, and to a lesser extent GPT-3.5, correctly identified the mental state of both the victim and the speaker and selected as the most likely explanation the speaker not knowing or remembering the relevant knowledge that made their statement inappropriate. Despite this, both models consistently provided an incorrect response (at least when compared against human responses) when asked whether the speaker knew or remembered this knowledge, responding that there was insufficient information provided. In line with the hyperconservatism hypothesis, these findings imply that, while GPT models can identify unintentional offence as the most likely explanation, their default responses do not commit to this explanation. This finding is consistent with longitudinal evidence that GPT models have become more reluctant to answer opinion questions over time 39 .

Further supporting the view that the failures of GPT at recognizing faux pas were due to hyperconservatism in answering the belief question rather than a failure of inference, a second experiment using the belief likelihood test showed that GPT responses integrated information in the story to accurately interpret the speaker’s mental state. When the utterance suggested that the speaker knew, GPT responses acknowledged the higher likelihood of the ‘knew’ explanation. LLaMA2-70B, on the other hand, did not differentiate between scenarios where the speaker was implied to know and scenarios where there was no information one way or another, raising the concern that the perfect performance of LLaMA2-70B on this task may be illusory.

The pattern of failures and successes of GPT models on the faux pas test and its variants may be the result of their underlying architecture. In addition to transformers (generative algorithms that produce text output), GPT models also include mitigation measures to improve factuality and avoid users’ overreliance on them as sources 38 . These measures include training to reduce hallucinations, the propensity of GPT models to produce nonsensical content or fabricate details that are not true in relation to the provided content. Failure on the faux pas test may be an exercise of caution driven by these mitigation measures, as passing the test requires committing to an explanation that lacks full evidence. This caution can also explain differences between tasks: both the faux pas and hinting tests require speculation to generate correct answers from incomplete information. However, while the hinting task allows for open-ended generation of text in ways to which LLMs are well suited, answering the faux pas test requires going beyond this speculation in order to commit to a conclusion.

The cautionary epistemic policy guiding the responses of GPT models introduces a fundamental difference in the way that humans and GPT models respond to social uncertainty 40 . In humans, thinking is, first and last, for the sake of doing 41 , 42 . Humans generally find uncertainty in social environments to be aversive and will incur additional costs to reduce it 43 . Theory of mind is crucial in reducing such uncertainty; the ability to reason about mental states (in combination with information about context, past experience and knowledge of social norms) helps individuals reduce uncertainty and commit to likely hypotheses, allowing for successful navigation of the social environment as active agents 44 , 45 . GPT models, on the other hand, respond conservatively despite having access to tools to reduce uncertainty. The dissociation we describe between speculative reasoning and commitment mirrors recent evidence that, while GPT models demonstrate sophisticated and accurate performance in reasoning tasks about belief states, they struggle to translate this reasoning into strategic decisions and actions 46 .

These findings highlight a dissociation between competence and performance 35 , suggesting that GPT models may be competent, that is, have the technical sophistication to compute mentalistic-like inferences but perform differently from humans under uncertain circumstances as they do not compute these inferences spontaneously to reduce uncertainty. Such a distinction can be difficult to capture with quantitative approaches that code only for target response features, as machine failures and successes are the result of non-human-like processes 30 (see Supplementary Information section 7 for a preliminary qualitative breakdown of how GPT models’ successes on the new version of the faux pas test may not necessarily reflect perfect or human-like reasoning).

While LLMs are designed to emulate human-like responses, this does not mean that this analogy extends to the underlying cognition giving rise to those responses 47 . In this context, our findings imply a difference in how humans and GPT models trade off the costs associated with social uncertainty against the costs associated with prolonged deliberation 48 . This difference is perhaps not surprising considering that resolving uncertainty is a priority for brains adapted to deal with embodied decisions, such as deciding whether to approach or avoid, fight or flee, or cooperate or defect. GPT models and other LLMs do not operate within an environment and are not subject to the processing constraints that biological agents face to resolve competition between action choices, so may have limited advantages in narrowing the future prediction space 46 , 49 , 50 .

The disembodied cognition of GPT models could explain their failures in recognizing faux pas, but it may also underlie their success on other tests. One example is the false belief test, one of the most widely used tools so far for testing the performance of LLMs on social cognitive tasks 19 , 21 , 22 , 23 , 25 , 51 , 52 . In this test, participants are presented with a story where a character’s belief about the world (the location of the item) differs from the participant’s own belief. The challenge in these stories is not remembering where the character last saw the item but rather reconciling the incongruence between conflicting mental states. This is challenging for humans, who have their own perspective, their own sense of self and their own ability to track out-of-sight objects. However, if a machine does not have its own self-perspective because it is not subject to the constraints of navigating a body through an environment, as with GPT 53 , then tracking the belief of a character in a story does not pose the same challenge.

An important direction for future research will be to examine the impact of these non-human decision behaviours on second-person, real-time human–machine interactions 54 , 55 . Failure of commitment by GPT models, for example, may lead to negative affect in human conversational partners. However, it may also foster curiosity 40 . Understanding how GPTs’ performance on mentalistic inferences (or their absences) influences human social cognition in dynamically unfolding social interactions is an open challenge for future work.

The LLM landscape is fast-moving. Our findings highlight the importance of systematic testing and proper validation in human samples as a necessary foundation. As artificial intelligence (AI) continues to evolve, it also becomes increasingly important to heed calls for open science and open access to these models 32 . Direct access to the parameters, data and documentation used to construct models can allow for targeted probing and experimentation into the key parameters affecting social reasoning, informed by and building on comparisons with human data. As such, open models can not only serve to accelerate the development of future AI technologies but also serve as models of human cognition.

Methods

Ethical compliance

The research was approved by the local ethical committee (ASL 3 Genovese; protocol no. 192REG2015) and was carried out in accordance with the principles of the revised Helsinki Declaration.

Experimental model details

We tested two versions of OpenAI’s GPT: version 3.5, which was the default model at the time of testing, and version 4, which was the state-of-the-art model with enhanced reasoning, creativity and comprehension relative to previous models ( https://chat.openai.com/ ). Each test was delivered in a separate chat: GPT is capable of learning within a chat session, as it can remember both its own and the user’s previous messages to adapt its responses accordingly, but it does not retain this memory across new chats. As such, each new iteration of a test may be considered a blank slate with a new naive participant. The dates of data collection for the different stages are reported in Table 1 .

Three LLaMA2-Chat models were tested. These models differ in size, with 70, 13 and 7 billion parameters. All LLaMA2-Chat responses were collected using set parameters with the prompt, ‘You are a helpful AI assistant’, a temperature of 0.7, the maximum number of new tokens set at 512, a repetition penalty of 1.1, and a Top P of 0.9. Langchain’s conversation chain was used to create a memory context within individual chat sessions. Responses from all LLaMA2-Chat models were found to include a number of non-codable responses (for example, repeating the question without answering it), and these were regenerated individually and included with the full response set. For the 70B model, these non-responses were rare, but for the 13B and 7B models they were common enough to cause concern about the quality of these data. As such, only the responses of the 70B model are reported in the main manuscript and a comparison of this model against the smaller two is reported in Supplementary Information section 1 . Details and dates of data collection are reported in Table 1 .
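For readers who want to reproduce the sampling setup, the reported settings can be collected in one place. The sketch below only records the values stated above; the class and field names are hypothetical (chosen to resemble naming commonly used by text-generation libraries) and it does not reproduce the authors' script or their Langchain setup.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Values reported for the LLaMA2-Chat runs; field names are assumed."""
    system_prompt: str = "You are a helpful AI assistant"
    temperature: float = 0.7
    max_new_tokens: int = 512
    repetition_penalty: float = 1.1
    top_p: float = 0.9


def sampling_kwargs(settings: GenerationSettings) -> dict:
    """Return the numeric sampling options, leaving the prompt aside."""
    kwargs = asdict(settings)
    kwargs.pop("system_prompt")
    return kwargs


settings = GenerationSettings()
kwargs = sampling_kwargs(settings)
```

Keeping the settings frozen makes it harder to change a value by accident between sessions, which matters when the same configuration is meant to be used for every run.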

For each test, we collected 15 sessions for each LLM. A session involved delivering all items of a single test within the same chat window. GPT-4 was subject to a 25-message limit per 3 h; to minimize interference, a single experimenter delivered all tests for GPT-4, while four other experimenters shared the duty of collecting responses from GPT-3.5.

Human participants were recruited online through the Prolific platform and the study was hosted on SoSci. We recruited native English speakers between the ages of 18 and 70 years with no history of psychiatric conditions and no history of dyslexia in particular. Further demographic data were not collected. We aimed to collect around 50 participants per test (theory of mind battery) or item (belief likelihood test, false belief perturbations). Thirteen participants who appeared to have generated their answers using LLMs or whose responses did not answer the questions were excluded. The final human sample was N = 1,907 (Table 1). All participants provided informed consent through the online survey and received monetary compensation in return for their participation at a rate of GBP£12 h⁻¹.

We selected a series of tests typically used in evaluating theory of mind capacity in human participants.

False belief

False belief tests assess the ability to infer that another person possesses knowledge that differs from the participant’s own (true) knowledge of the world. These tests consist of test items that follow a particular structure: character A and character B are together, character A deposits an item inside a hidden location (for example, a box), character A leaves, character B moves the item to a second hidden location (for example, a cupboard) and then character A returns. The question asked to the participant is: when character A returns, will they look for the item in the new location (where it truly is, matching the participant’s true belief) or the old location (where it was, matching character A’s false belief)?

In addition to the false belief condition, the test also uses a true belief control condition where, rather than moving the item that character A hid, character B moves a different item to a new location. This control is important for interpreting performance on false belief items, as it helps ensure that responses reflect accurate belief tracking rather than a recency effect (referring to the last location reported).

We adapted four false/true belief scenarios from the sandbox task used by Bernstein 33 and generated three novel items, each with false and true belief versions. These novel items followed the same structure as the original published items but with different details such as names, locations or objects to control for familiarity with the text of published items. Two story lists (false belief A, false belief B) were generated for this test such that each story only appeared once within a testing session and alternated between false and true belief depending on the session. In addition to the standard false/true belief scenarios, two additional catch stories were tested that involved minor alterations to the story structure. The results of these items are not reported here as they go beyond the goals of the current study.
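The two-list design can be illustrated with a short sketch. It assumes that conditions alternate by position in list A and are flipped in list B; the text only states that each story appears once per session and switches between false and true belief across sessions, so the exact assignment here is a guess, and the story identifiers are placeholders.

```python
def build_story_lists(stories):
    """Return two counterbalanced lists of (story, condition) pairs.

    Assumption: list A alternates false/true belief by position and
    list B uses the opposite condition for every story.
    """
    list_a = [
        (story, "false belief" if i % 2 == 0 else "true belief")
        for i, story in enumerate(stories)
    ]
    list_b = [
        (story, "true belief" if cond == "false belief" else "false belief")
        for story, cond in list_a
    ]
    return list_a, list_b


# Placeholder identifiers for the four adapted and three novel stories.
stories = [f"story_{n}" for n in range(1, 8)]
list_a, list_b = build_story_lists(stories)
```

Across the two lists every story is seen once in each condition, while within a session no story is repeated.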

Irony

Comprehending an ironic remark requires inferring the true meaning of an utterance (typically the opposite of what is said) and detecting the speaker’s mocking attitude, and this has been raised as a key challenge for AI and LLMs 19 .

Irony comprehension items were adapted from an eye-tracking study 34 in which participants read vignettes where a character made an ironic or non-ironic statement. Twelve items were taken from these stimuli that in the original study were used as comprehension checks. Items were abbreviated to end following the ironic or non-ironic utterance.

Two story lists were generated for this test (irony A, irony B) such that each story only appeared once within a testing session and alternated between ironic and non-ironic depending on the session. Responses were coded as 1 (correct) or 0 (incorrect). During coding, we noted some inconsistencies in the formulation of both GPT models’ responses where in response to the question of whether the speaker believed what they had said, they might respond with, ‘Yes, they did not believe that…’. Such internally contradictory responses, where the models responded with a ‘yes’ or ‘no’ that was incompatible with the follow-up explanation, were coded on the basis of whether or not the explanation showed appreciation of the irony—the linguistic failures of these models in generating a coherent answer are not of direct interest to the current study as these failures (1) were rare and (2) did not render the responses incomprehensible.

Faux pas

The faux pas test 13 presents a context in which one character makes an utterance that is unintentionally offensive to the listener because the speaker does not know or does not remember some key piece of information.

Following the presentation of the scenario, we presented four questions:

1. ‘In the story did someone say something that they should not have said?’ [The correct answer is always ‘yes’]

2. ‘What did they say that they should not have said?’ [Correct answer changes for each item]

3. A comprehension question to test understanding of story events [Question changes for every item]

4. A question to test awareness of the speaker’s false belief phrased as, ‘Did [the speaker] know that [what they said was inappropriate]?’ [Question changes for every item. The correct answer is always ‘no’]

These questions were asked at the same time as the story was presented. Under the original coding criteria, participants must answer all four questions correctly for their answer to be considered correct. However, in the current study we were interested primarily in the response to the final question testing whether the responder understood the speaker’s mental state. When examining the human data, we noticed that several participants responded incorrectly to the first question owing to an apparent unwillingness to attribute blame (for example ‘No, he didn’t say anything wrong because he forgot’). To focus on the key aspect of faux pas understanding that was relevant to the current study, we restricted our coding to only the last question, scored as 1 (correct, if the answer was no) or 0 (for anything else). See Supplementary Information section 5 for an alternative coding that follows the original criteria, as well as a recoding in which responses were coded as correct if the correct answer was mentioned as a possible explanation but was not explicitly endorsed.
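The three coding schemes mentioned above differ only in which answers count towards a correct score, which a short sketch makes concrete. The function names and the category labels for the final question are hypothetical; in the study, responses were coded by the researchers rather than by a script.

```python
def code_original(q1_correct, q2_correct, q3_correct, q4_correct):
    """Original criteria: all four questions must be answered correctly."""
    return int(q1_correct and q2_correct and q3_correct and q4_correct)


def code_belief_question(q4_label):
    """Coding used in the main analysis: only the final question counts.

    `q4_label` is an assumed coder label such as "no", "yes" or "unsure".
    """
    return 1 if q4_label == "no" else 0


def code_lenient(q4_label):
    """Recoding that also accepts 'no' raised as a possibility only."""
    return 1 if q4_label in {"no", "no_mentioned_as_possible"} else 0


# A response that blames nobody on question 1 but gets the belief right.
strict = code_original(False, True, True, True)
focused = code_belief_question("no")
hedged = code_lenient("no_mentioned_as_possible")
```

The example shows why the restricted coding was chosen: a participant who declines to attribute blame on the first question fails under the original criteria even though the belief question is answered correctly.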

As well as the 10 original items used in Baron-Cohen et al. 13 , we generated five novel items for this test that followed the same structure and logic as the original items, resulting in 15 items overall.

Hinting task

The hinting task 14 assesses the understanding of indirect speech requests through the presentation of ten vignettes depicting everyday social interactions that are presented sequentially. Each vignette ends with a remark that can be interpreted as a hint.

A correct response identifies both the intended meaning of the remark and the action that it is attempting to elicit. In the original test, if the participant failed to answer the question fully the first time, they were prompted with additional questioning 14 , 56 . In our adapted implementation, we removed this additional questioning and coded responses as a binary (1 (correct) or 0 (incorrect)) using the evaluation criteria listed in Gil et al. 56 . Note that this coding offers more conservative estimates of hint comprehension than in previous studies.

In addition to 10 original items sourced from Corcoran 14 , we generated a further 6 novel hinting test items, resulting in 16 items overall.

Strange stories

The strange stories 15 , 16 offer a means of testing more advanced mentalizing abilities such as reasoning about misdirection, manipulation, lying and misunderstanding, as well as second- or higher-order mental states (for example, A knows that B believes X …). The advanced abilities that these stories measure make them suitable for testing higher-functioning children and adults. In this test, participants are presented with a short vignette and are asked to explain why a character says or does something that is not literally true.

Each question comes with a specific set of coding criteria, and each response can be awarded 0, 1 or 2 points depending on how fully it explains the utterance and whether or not it does so in mentalistic terms 16 . See Supplementary Information section 6 for a description of the frequency of partial successes.

In addition to the 8 original mental stories, we generated 4 novel items, resulting in 12 items overall. The maximum number of points possible was 24, and individual session scores were converted to a proportional score for analysis.
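The conversion from item points to a proportional session score can be sketched as follows. This is a minimal illustration rather than the authors' analysis code: the function name and the example session are hypothetical, and the only details taken from the text are the 0 to 2 point range per item and the 12-item, 24-point maximum.

```python
def proportional_score(points, max_per_item=2):
    """Convert per-item points into a proportion of the maximum attainable score."""
    if not points:
        raise ValueError("a session needs at least one scored item")
    for p in points:
        if not 0 <= p <= max_per_item:
            raise ValueError(f"item score {p} outside 0..{max_per_item}")
    return sum(points) / (max_per_item * len(points))


# Hypothetical session: 12 strange stories items, each scored 0, 1 or 2.
example_session = [2, 2, 1, 0, 2, 2, 1, 2, 2, 2, 1, 2]
example_score = proportional_score(example_session)  # 19 points out of 24
```

Dividing by the maximum attainable score puts every test on the same 0 to 1 scale, which is what allows sessions from different tests to be compared directly.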

Testing protocol

For the theory of mind battery, the order of items was set for each test, with original items delivered first and novel items delivered last. Each item was preceded by a preamble that remained consistent across all tests. This was then followed by the story description and the relevant question(s). After each item was delivered, the model would respond and then the session advanced to the next item.

For GPT models, items were delivered using the chat web interface. For LLaMA2-Chat models, delivery of items was automated through a custom script. For humans, items were presented with free text response boxes on separate pages of a survey so that participants could write out their responses to each question (with a minimum character count of 2).

Faux pas likelihood test

To test alternative hypotheses of why the tested models performed poorly at the faux pas test, we ran a follow-up study replicating just the faux pas test. This replication followed the same procedure as the main study with one major difference.

The original wording of the question was phrased as a straightforward yes/no question that tested the subject’s awareness of a speaker’s false belief (for example, ‘Did Richard remember James had given him the toy aeroplane for his birthday?’). To test whether the low scores on this question were due to the models’ refusing to commit to a single explanation in the face of ambiguity, we reworded this to ask in terms of likelihood: ‘Is it more likely that Richard remembered or did not remember that James had given him the toy aeroplane for his birthday?’

Another difference from the original study was that we included a follow-up prompt in the rare cases where the model failed to provide clear reasoning on an incorrect response. The coding criteria for this follow-up were in line with coding schemes used in other studies with a prompt system 14 , where an unprompted correct answer was given 2 points, a correct answer following a prompt was given 1 point and incorrect answers following a prompt were given 0 points. These points were then rescaled to a proportional score to allow comparison against the original wording.

During coding by the human experimenters, a qualitative description of different subtypes of response (beyond 0–1–2 points) emerged, particularly noting recurring patterns in responses that were marked as successes. This exploratory qualitative breakdown is reported along with further detail on the prompting protocol in Supplementary Information section 7.

Belief likelihood test

To manipulate the likelihood that the speaker knew or did not know, we developed a new set of variants of the faux pas likelihood test. For each test item, all newly generated for this control study, we created three variants: a faux pas variant, a neutral variant and a knowledge-implied variant. In the faux pas variant, the utterance suggested that the speaker did not know the context. In the neutral variant, the utterance suggested neither that they knew nor that they did not know. In the knowledge-implied variant, the utterance suggested that the speaker knew (for the full text of all items, see Supplementary Appendix 2). For each variant, the core story remained unchanged, for example:

Michael was a very awkward child when he was at high school. He struggled with making friends and spent his time alone writing poetry. However, after he left he became a lot more confident and sociable. At his ten-year high school reunion he met Amanda, who had been in his English class. Over drinks, she said to him,

followed by the utterance, which varied across conditions:

Faux pas:

'I don't know if you remember this guy from school. He was in my English class. He wrote poetry and he was super awkward. I hope he isn't here tonight.'

Neutral:

'Do you know where the bar is?'

Knowledge implied:

'Do you still write poetry?'

The belief likelihood test was administered in the same way as the previous tests, except that responses were kept independent so that there was no risk of responses being influenced by other variants. For ChatGPT models, this involved delivering each item within a separate chat session for 15 repetitions of each item. For LLaMA2-70B, this involved removing the LangChain conversation chain that allowed for within-session memory context. Human participants were recruited separately to answer a single test item, with at least 50 responses collected for each item (total N = 900). All other details of the protocol were the same.

Quantification and statistical analysis

Response coding

After each session in the theory of mind battery and faux pas likelihood test, the responses were collated and coded by five human experimenters according to the pre-defined coding criteria for each test. Each experimenter was responsible for coding 100% of sessions for one test and 20% of sessions for another. Inter-coder per cent agreement was calculated on the 20% of shared sessions, and items where coders showed disagreement were evaluated by all raters and recoded. The data available on the OSF are the results of this recoding. Experimenters also flagged individual responses for group evaluation if they were unclear or unusual cases, as and when they arose. Inter-rater agreement was computed by calculating the item-wise agreement between coders as 1 or 0 and using this to calculate a percentage score. Initial agreement across all double-coded items was over 95%. The lowest agreement was for the human and GPT-3.5 responses of strange stories, but even here agreement was over 88%. Committee evaluation by the group of experimenters resolved all remaining ambiguities.
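The item-wise agreement calculation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names and the example codes are hypothetical, and the only detail taken from the text is that each double-coded item counts as 1 (coders agree) or 0 (coders disagree) before a percentage is computed.

```python
def percent_agreement(coder_a, coder_b):
    """Percentage of double-coded items on which two coders assigned the same code."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty set of items")
    matches = [1 if a == b else 0 for a, b in zip(coder_a, coder_b)]
    return 100.0 * sum(matches) / len(matches)


def disagreement_indices(coder_a, coder_b):
    """Positions of items that would go back to the full group for recoding."""
    return [i for i, (a, b) in enumerate(zip(coder_a, coder_b)) if a != b]


# Hypothetical codes from two experimenters on ten shared items.
codes_a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
codes_b = [1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
agreement = percent_agreement(codes_a, codes_b)
to_review = disagreement_indices(codes_a, codes_b)
```

Per cent agreement does not correct for chance agreement, so it is best read as a descriptive check on the coding criteria rather than a chance-adjusted reliability coefficient.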

For the belief likelihood test, responses were coded according to whether they endorsed the ‘knew’ explanation or ‘didn’t know’ explanation, or whether they did not endorse either as more likely than the other. Outcomes ‘knew’, ‘unsure’ and ‘didn’t know’ were assigned a numerical coding of +1, 0 and −1, respectively. GPT models adhered closely to the framing of the question in their answer, but humans were more variable and sometimes provided ambiguous responses (for example, ‘yes’, ‘more likely’ and ‘not really’) or did not answer the question at all (‘It doesn’t matter’ and ‘She didn’t care’). These responses were rare, constituting only ~2.5% of responses and were coded as endorsing the ‘knew’ explanation if they were affirmative (‘yes’) and the ‘didn’t know’ explanation if they were negative.
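The numerical coding of belief likelihood responses can be sketched as a simple lookup. The label strings, function names and example responses below are assumptions chosen for illustration; only the mapping of 'knew', 'unsure' and 'didn't know' to +1, 0 and −1 comes from the text.

```python
# Mapping taken from the text; the exact label strings are an assumption.
CODES = {"knew": 1, "unsure": 0, "didn't know": -1}


def code_response(label):
    """Return the numerical code for one categorised response."""
    if label not in CODES:
        raise ValueError(f"unrecognised response category: {label!r}")
    return CODES[label]


def mean_code(labels):
    """Average code for one item variant; values above 0 lean towards 'knew'."""
    if not labels:
        raise ValueError("no responses to average")
    return sum(code_response(label) for label in labels) / len(labels)


# Hypothetical responses to one variant of one item.
example_labels = ["knew", "knew", "unsure", "didn't know"]
example_mean = mean_code(example_labels)
```

Ambiguous free-text answers such as 'yes' or 'not really' would need to be categorised by a human coder first, as the text describes, before a lookup like this applies.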

Statistical analysis

Comparing LLMs against human performance

Scores for individual responses were scaled and averaged to obtain a proportional score for each test session in order to create a performance metric that could be compared directly across different theory of mind tests. Our goal was to compare LLMs’ performance across different tests against human performance to see how these models performed on theory of mind tests relative to humans. For each test, we compared the performance of each of the three LLMs against human performance using a set of Holm-corrected two-way Wilcoxon tests. Effect sizes for Wilcoxon tests were calculated by dividing the test statistic Z by the square root of the total sample size, and 95% CIs of the effect size were bootstrapped over 1,000 iterations. All non-significant results were further examined using corresponding Bayesian tests represented as a Bayes factor (BF10) under a continuous prior distribution (Cauchy prior width r = 0.707). Bayes factors were computed in JASP 0.18.3 with a random seed value of 1. The results of the false belief test were not subjected to inferential statistics owing to the ceiling performance and lack of variance across models.
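Two of the quantities in this paragraph are simple enough to sketch directly: the effect size r = Z / sqrt(N) and the Holm adjustment of a family of p-values. The sketch below uses only the Python standard library and hypothetical input values; it is not the authors' code (their analysis is provided as an R project), and it assumes the Z statistic and raw p-values have already been obtained from the Wilcoxon tests.

```python
import math


def wilcoxon_effect_size(z, n_total):
    """Effect size r = Z / sqrt(N), where N is the total sample size."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return z / math.sqrt(n_total)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical inputs: one Z statistic and three raw p-values (one per model).
example_r = wilcoxon_effect_size(z=2.5, n_total=100)
example_adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A comparison is then treated as significant when its adjusted p-value falls below the chosen alpha level.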

Novel items

For each publicly available test (all tests except for irony), we generated novel items that followed the same logic as the original text but with different details and text to control for low-level familiarity with the scenarios through inclusion in the LLM training sets. For each of these tests, we compared the performance of all LLMs on these novel items against the validated test items using Holm-corrected two-way Wilcoxon tests. Non-significant results were followed up with corresponding Bayesian tests in JASP. Significantly poorer performance on novel items than original items would indicate a strong likelihood that the good performance of a language model can be attributed to inclusion of these texts in the training set. Note that, while the open-ended format of more complex tasks like hinting and strange stories makes this a convincing control for these tests, it is of limited strength for tasks like false belief and faux pas that use a regular internal structure that makes heuristics or ‘Clever Hans’ solutions possible 27 , 36 .

Belief likelihood test

We calculated the count frequency of the different response types (‘didn’t know’, ‘unsure’ and ‘knew’) for each variant and each model. Then, for each model we conducted two chi-square tests that compared the distribution of these categorical responses to the faux pas variant against the neutral, and to the neutral variant against the knowledge implied. A Holm correction was applied to the eight chi-square tests to account for multiple comparisons. The non-significant result was further examined with a Bayesian contingency table in JASP.
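A chi-square comparison of two variants can be sketched as a test of independence on a 2 × 3 table of response counts. The sketch below is a standard-library illustration with hypothetical counts, not the authors' code. It assumes a Pearson chi-square test without continuity correction, and it relies on closed-form tail probabilities that hold only for 1 or 2 degrees of freedom, which covers a table with two variants and at most three response types.

```python
import math


def chi_square_test(table):
    """Pearson chi-square test of independence for a 2 x k table of counts (k <= 3)."""
    if len(table) != 2:
        raise ValueError("expected exactly two rows (one per variant)")
    col_totals = [sum(col) for col in zip(*table)]
    # Drop response types that nobody used; they carry no information.
    keep = [j for j, total in enumerate(col_totals) if total > 0]
    rows = [[row[j] for j in keep] for row in table]
    row_totals = [sum(row) for row in rows]
    if min(row_totals) == 0:
        raise ValueError("each variant needs at least one response")
    col_totals = [col_totals[j] for j in keep]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = len(keep) - 1
    if df == 1:
        p_value = math.erfc(math.sqrt(stat / 2))  # chi-square tail, 1 df
    elif df == 2:
        p_value = math.exp(-stat / 2)  # chi-square tail, 2 df
    else:
        raise ValueError("this sketch only handles 1 or 2 degrees of freedom")
    return stat, df, p_value


# Hypothetical counts in the order: didn't know, unsure, knew.
faux_pas_counts = [40, 8, 2]
neutral_counts = [10, 30, 10]
stat, df, p_value = chi_square_test([faux_pas_counts, neutral_counts])
```

The resulting p-values for all comparisons would then be passed through a Holm correction before being interpreted.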

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All resources are available on a repository stored on the Open Science Framework (OSF) under a Creative Commons Attribution Non-Commercial 4.0 International (CC-BY-NC) license at https://osf.io/fwj6v. This repository contains all test items, data and code reported in this study. Test items and data are available in an Excel file that includes the text of every item delivered in each test, the full text responses to each item and the code assigned to each response. This file is available at https://osf.io/dbn92. Source data are provided with this paper.

Code availability

The code used for all analysis in the main manuscript and Supplementary Information is included as a Markdown file at https://osf.io/fwj6v. The data used by the analysis files are available as a number of CSV files under ‘scored_data/’ in the repository, and all materials necessary for replicating the analysis can be downloaded as a single .zip file within the main repository titled ‘Full R Project Code.zip’ at https://osf.io/j3vhq.

References

Van Ackeren, M. J., Casasanto, D., Bekkering, H., Hagoort, P. & Rueschemeyer, S.-A. Pragmatics in action: indirect requests engage theory of mind areas and the cortical motor network. J. Cogn. Neurosci. 24 , 2237–2247 (2012).

Apperly, I. A. What is ‘theory of mind’? Concepts, cognitive processes and individual differences. Q. J. Exp. Psychol. 65 , 825–839 (2012).

Premack, D. & Woodruff, G. Does the chimpanzee have a theory of mind? Behav. Brain Sci. 1 , 515–526 (1978).

Apperly, I. A., Riggs, K. J., Simpson, A., Chiavarino, C. & Samson, D. Is belief reasoning automatic? Psychol. Sci. 17 , 841–844 (2006).

Kovács, Á. M., Téglás, E. & Endress, A. D. The social sense: susceptibility to others’ beliefs in human infants and adults. Science 330 , 1830–1834 (2010).

Apperly, I. A., Warren, F., Andrews, B. J., Grant, J. & Todd, S. Developmental continuity in theory of mind: speed and accuracy of belief–desire reasoning in children and adults. Child Dev. 82 , 1691–1703 (2011).

Southgate, V., Senju, A. & Csibra, G. Action anticipation through attribution of false belief by 2-year-olds. Psychol. Sci. 18 , 587–592 (2007).

Kampis, D., Kármán, P., Csibra, G., Southgate, V. & Hernik, M. A two-lab direct replication attempt of Southgate, Senju and Csibra (2007). R. Soc. Open Sci. 8 , 210190 (2021).

Kovács, Á. M., Téglás, E. & Csibra, G. Can infants adopt underspecified contents into attributed beliefs? Representational prerequisites of theory of mind. Cognition 213 , 104640 (2021).

Baron-Cohen, S., Wheelwright, S., Hill, J., Raste, Y. & Plumb, I. The ‘Reading the Mind in the Eyes’ Test revised version: a study with normal adults, and adults with Asperger syndrome or high-functioning autism. J. Child Psychol. Psychiatry Allied Discip. 42 , 241–251 (2001).

Wimmer, H. & Perner, J. Beliefs about beliefs: representation and constraining function of wrong beliefs in young children’s understanding of deception. Cognition 13 , 103–128 (1983).

Perner, J., Leekam, S. R. & Wimmer, H. Three-year-olds’ difficulty with false belief: the case for a conceptual deficit. Br. J. Dev. Psychol. 5 , 125–137 (1987).

Baron-Cohen, S., O’Riordan, M., Stone, V., Jones, R. & Plaisted, K. Recognition of faux pas by normally developing children and children with asperger syndrome or high-functioning autism. J. Autism Dev. Disord. 29 , 407–418 (1999).

Corcoran, R. Inductive reasoning and the understanding of intention in schizophrenia. Cogn. Neuropsychiatry 8 , 223–235 (2003).

Happé, F. G. E. An advanced test of theory of mind: understanding of story characters’ thoughts and feelings by able autistic, mentally handicapped, and normal children and adults. J. Autism Dev. Disord. 24 , 129–154 (1994).

White, S., Hill, E., Happé, F. & Frith, U. Revisiting the strange stories: revealing mentalizing impairments in autism. Child Dev. 80 , 1097–1117 (2009).

Apperly, I. A. & Butterfill, S. A. Do humans have two systems to track beliefs and belief-like states? Psychol. Rev. 116 , 953 (2009).

Wiesmann, C. G., Friederici, A. D., Singer, T. & Steinbeis, N. Two systems for thinking about others’ thoughts in the developing brain. Proc. Natl Acad. Sci. USA 117 , 6928–6935 (2020).

Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://doi.org/10.48550/arXiv.2303.12712 (2023).

Srivastava, A. et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Preprint at https://doi.org/10.48550/arXiv.2206.04615 (2022).

Dou, Z. Exploring GPT-3 model’s capability in passing the Sally-Anne Test: a preliminary study in two languages. Preprint at OSF https://doi.org/10.31219/osf.io/8r3ma (2023).

Kosinski, M. Theory of mind may have spontaneously emerged in large language models. Preprint at https://doi.org/10.48550/arXiv.2302.02083 (2023).

Sap, M., LeBras, R., Fried, D. & Choi, Y. Neural theory-of-mind? On the limits of social intelligence in large LMs. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) 3762–3780 (Association for Computational Linguistics, 2022).

Gandhi, K., Fränken, J.-P., Gerstenberg, T. & Goodman, N. D. Understanding social reasoning in language models with language models. In Advances in Neural Information Processing Systems Vol. 36 (MIT Press, 2023).

Ullman, T. Large language models fail on trivial alterations to theory-of-mind tasks. Preprint at https://doi.org/10.48550/arXiv.2302.08399 (2023).

Marcus, G. & Davis, E. How Not to Test GPT-3. Marcus on AI https://garymarcus.substack.com/p/how-not-to-test-gpt-3 (2023).

Shapira, N. et al. Clever Hans or neural theory of mind? Stress testing social reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2305.14763 (2023).

Rahwan, I. et al. Machine behaviour. Nature 568 , 477–486 (2019).

Hagendorff, T. Machine psychology: investigating emergent capabilities and behavior in large language models using psychological methods. Preprint at https://doi.org/10.48550/arXiv.2303.13988 (2023).

Binz, M. & Schulz, E. Using cognitive psychology to understand GPT-3. Proc. Natl Acad. Sci. USA 120 , e2218523120 (2023).

Webb, T., Holyoak, K. J. & Lu, H. Emergent analogical reasoning in large language models. Nat. Hum. Behav. 7 , 1526–1541 (2023).

Frank, M. C. Openly accessible LLMs can help us to understand human cognition. Nat. Hum. Behav. 7 , 1825–1827 (2023).

Bernstein, D. M., Thornton, W. L. & Sommerville, J. A. Theory of mind through the ages: older and middle-aged adults exhibit more errors than do younger adults on a continuous false belief task. Exp. Aging Res. 37 , 481–502 (2011).

Au-Yeung, S. K., Kaakinen, J. K., Liversedge, S. P. & Benson, V. Processing of written irony in autism spectrum disorder: an eye-movement study: processing irony in autism spectrum disorders. Autism Res. 8 , 749–760 (2015).

Firestone, C. Performance vs. competence in human–machine comparisons. Proc. Natl Acad. Sci. USA 117 , 26562–26571 (2020).

Shapira, N., Zwirn, G. & Goldberg, Y. How well do large language models perform on faux pas tests? In Findings of the Association for Computational Linguistics: ACL 2023 10438–10451 (Association for Computational Linguistics, 2023).

Rescher, N. Choice without preference. A study of the history and of the logic of the problem of ‘Buridan’s ass’. Kant Stud. 51 , 142–175 (1960).

OpenAI. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2023).

Chen, L., Zaharia, M. & Zou, J. How is ChatGPT’s behavior changing over time? Preprint at https://doi.org/10.48550/arXiv.2307.09009 (2023).

FeldmanHall, O. & Shenhav, A. Resolving uncertainty in a social world. Nat. Hum. Behav. 3 , 426–435 (2019).

James, W. The Principles of Psychology Vol. 2 (Henry Holt & Co, 1890).

Fiske, S. T. Thinking is for doing: portraits of social cognition from daguerreotype to laserphoto. J. Personal. Soc. Psychol. 63 , 877–889 (1992).

Plate, R. C., Ham, H. & Jenkins, A. C. When uncertainty in social contexts increases exploration and decreases obtained rewards. J. Exp. Psychol. Gen. 152 , 2463–2478 (2023).

Frith, C. D. & Frith, U. The neural basis of mentalizing. Neuron 50 , 531–534 (2006).

Koster-Hale, J. & Saxe, R. Theory of mind: a neural prediction problem. Neuron 79 , 836–848 (2013).

Zhou, P. et al. How far are large language models from agents with theory-of-mind? Preprint at https://doi.org/10.48550/arXiv.2310.03051 (2023).

Bonnefon, J.-F. & Rahwan, I. Machine thinking, fast and slow. Trends Cogn. Sci. 24 , 1019–1027 (2020).

Hanks, T. D., Mazurek, M. E., Kiani, R., Hopp, E. & Shadlen, M. N. Elapsed decision time affects the weighting of prior probability in a perceptual decision task. J. Neurosci. 31 , 6339–6352 (2011).

Pezzulo, G., Parr, T., Cisek, P., Clark, A. & Friston, K. Generating meaning: active inference and the scope and limits of passive AI. Trends Cogn. Sci. 28 , 97–112 (2023).

Chemero, A. LLMs differ from human cognition because they are not embodied. Nat. Hum. Behav. 7 , 1828–1829 (2023).

Brunet-Gouet, E., Vidal, N. & Roux, P. In Human and Artificial Rationalities. HAR 2023. Lecture Notes in Computer Science (eds. Baratgin, J. et al.) Vol. 14522, 107–126 (Springer, 2024).

Kim, H. et al. FANToM: a benchmark for stress-testing machine theory of mind in interactions. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) 14397–14413 (Association for Computational Linguistics, 2023).

Yiu, E., Kosoy, E. & Gopnik, A. Transmission versus truth, imitation versus innovation: what children can do that large language and language-and-vision models cannot (yet). Perspect. Psychol. Sci. https://doi.org/10.1177/17456916231201401 (2023).

Redcay, E. & Schilbach, L. Using second-person neuroscience to elucidate the mechanisms of social interaction. Nat. Rev. Neurosci. 20 , 495–505 (2019).

Schilbach, L. et al. Toward a second-person neuroscience. Behav. Brain Sci. 36 , 393–414 (2013).

Gil, D., Fernández-Modamio, M., Bengochea, R. & Arrieta, M. Adaptation of the hinting task theory of the mind test to Spanish. Rev. Psiquiatr. Salud Ment. Engl. Ed. 5 , 79–88 (2012).

Acknowledgements

This work is supported by the European Commission through Project ASTOUND (101071191—HORIZON-EIC-2021-PATHFINDERCHALLENGES-01 to A.R., G.M., C.B. and S.P.). J.W.A.S. was supported by a Humboldt Research Fellowship for Experienced Researchers provided by the Alexander von Humboldt Foundation. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Open access funding provided by Universitätsklinikum Hamburg-Eppendorf (UKE).

Author information

Authors and affiliations

Department of Neurology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany

James W. A. Strachan, Oriana Pansardi, Eugenio Scaliti & Cristina Becchio

Cognition, Motion and Neuroscience, Italian Institute of Technology, Genoa, Italy

Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti & Cristina Becchio

Center for Mind/Brain Sciences, University of Trento, Rovereto, Italy

Dalila Albergo

Department of Psychology, University of Turin, Turin, Italy

Oriana Pansardi

Department of Management, ‘Valter Cantino’, University of Turin, Turin, Italy

Eugenio Scaliti

Human Science and Technologies, University of Turin, Turin, Italy

Alien Technology Transfer Ltd, London, UK

Saurabh Gupta, Krati Saxena, Alessandro Rufo & Guido Manzi

Institute for Neural Information Processing, Center for Molecular Neurobiology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany

Stefano Panzeri

Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA

Michael S. A. Graziano

Contributions

J.W.A.S., A.R., G.M., M.S.A.G. and C.B. conceived the study. J.W.A.S., D.A., G.B., O.P. and E.S. designed the tasks and performed the experiments including data collection with humans and GPT models, response coding and curation of the dataset. S.G., K.S. and G.M. collected data from LLaMA2-Chat models. J.W.A.S. performed the analyses and wrote the manuscript with input from C.B., S.P. and M.S.A.G. All authors contributed to the interpretation and editing of the manuscript. C.B. supervised the work. A.R., G.M., S.P. and C.B. acquired the funding. D.A., G.B., O.P. and E.S. contributed equally to the work.

Corresponding authors

Correspondence to James W. A. Strachan or Cristina Becchio.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Human Behaviour thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information.

Supplementary Figs. 1–8, Tables 1–4, additional methodological details, analyses and discussion, Appendix 1 (full text of false belief perturbations adapted from Ullman (2023)) and Appendix 2 (full text of items generated for the belief likelihood test).

Reporting Summary

Peer Review File

Source Data Fig. 1

Raw score data on the full theory of mind battery for all models used to generate Fig. 1a,b.

Source Data Fig. 2

Zip file containing two CSV files used to generate Fig. 2. Fig2A_data.csv: raw score data with GPT models’ performance in the Faux Pas Likelihood test, used to generate Fig. 2a. Fig2B_data.csv: raw score data on the belief likelihood test for all models used to generate Fig. 2b.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article

Strachan, J.W.A., Albergo, D., Borghini, G. et al. Testing theory of mind in large language models and humans. Nat Hum Behav (2024). https://doi.org/10.1038/s41562-024-01882-z

Received: 14 August 2023

Accepted: 05 April 2024

Published: 20 May 2024

DOI: https://doi.org/10.1038/s41562-024-01882-z

strong whorf hypothesis

COMMENTS

  1. Linguistic relativity

    The idea of linguistic relativity, known also as the Whorf hypothesis, the Sapir-Whorf hypothesis (/səˌpɪər ˈhwɔːrf/ sə-PEER WHORF), or Whorfianism, is a principle suggesting that the structure of a language influences its speakers' worldview or cognition, and thus individuals' languages determine or influence their perceptions of the world. The hypothesis has long been ...

  2. Sapir-Whorf hypothesis (Linguistic Relativity Hypothesis)

    The Sapir-Whorf hypothesis states that the grammatical and verbal structure of a person's language influences how they perceive the world. It emphasizes that language either determines or influences one's thoughts. ... speakers of the language marked them as having more male characteristics like "strong" and "long." Similarly, these ...

  3. Whorfianism

    The term "Sapir-Whorf Hypothesis" was coined by Harry Hoijer in his contribution (Hoijer 1954) to a conference on the work of Benjamin Lee Whorf in 1953. But anyone looking in Hoijer's paper for a clear statement of the hypothesis will look in vain. ... 'Strong' versions state that language determines thought, or fixes it in some way.

  4. The Sapir-Whorf Hypothesis: How Language Influences How We Express

    The Sapir-Whorf Hypothesis, also known as linguistic relativity, refers to the idea that the language a person speaks can influence their worldview, thought, and even how they experience and understand the world. While more extreme versions of the hypothesis have largely been discredited, a growing body of research has demonstrated that ...

  5. Sapir-Whorf Hypothesis

    The strong form of the Sapir-Whorf hypothesis claims that people from different cultures think differently because of differences in their languages. So, native speakers of Hopi perceive reality differently from native speakers of English because they use different languages, Whorf claimed. Few sociolinguists would accept such a strong claim ...

  6. Definition and History of the Sapir-Whorf Hypothesis

    The Sapir-Whorf hypothesis is the linguistic theory that the semantic structure of a language shapes or limits the ways in which a speaker forms conceptions of the world. It came about in 1929. The theory is named after the American anthropological linguist Edward Sapir (1884-1939) and his student Benjamin Whorf (1897-1941).

  7. 3.1: Linguistic Relativity- The Sapir-Whorf Hypothesis

    After completing this module, students will be able to: 1. Define the concept of linguistic relativity. 2. Differentiate linguistic relativity and linguistic determinism. 3. Define the Sapir-Whorf Hypothesis (against more pop-culture takes on it) and situate it in a broader theoretical context/history. 4.

  8. PDF 2 opposing ideas about language, thought, and culture

    The Sapir-Whorf Hypothesis, in its "strong version," consists of 2 paired principles: linguistic determinism: the language we use determines the way in which we view and think about the world around us. Linguistic relativity: people who speak different languages perceive and think about the world quite differently.

  9. Linguistic Relativity

    Linguistic relativity, sometimes called the Whorfian hypothesis, posits that properties of language affect the structure and content of thought and thus the way humans perceive reality. A distinction is often made between strong Whorfian views, according to which the categories of thought are determined by language, and weak views, which argue ...

  10. Linguistic Relativity

    KEY WORDS: Sapir-Whorf hypothesis, linguistic determinism, language and thought, language and cognition, language and culture ABSTRACT The linguistic relativity hypothesis, the proposal that the particular language we speak influences the way we think about reality, forms one part of the broader question of how language influences thought.

  11. Whorfian Hypothesis

    The term Whorfian Hypothesis takes its name from Benjamin Lee Whorf (1897-1941) who claimed that the language one speaks influences one's thinking [ 7 ]. Whorf was an amateur linguist who studied with the anthropologist Edward Sapir in the 1920s and 1930s. The term Sapir-Whorf Hypothesis is also used to refer to their view that language ...

  12. 2 The Whorf hypothesis

    The 'strong' version is easier to deal with, ... Consequently, the hypothesis is often called the Humboldt-Sapir-Whorf hypothesis. I will refrain from a discussion of how de Condillac, Herder, von Humboldt and others contributed to the Whorfian paradigm, mainly because these authors are too remote for the present purpose. ...

  13. Linguistic determinism

    Linguistic determinism is the strong form of linguistic relativity (popularly known as the Sapir-Whorf hypothesis), which argues that individuals experience the world based on the structure of the language they habitually use. Since the 20th century, linguistic determinism has largely been discredited by studies and abandoned within ...

  14. The Whorf Hypothesis

    The Strong Whorf Hypothesis: the claim that the language you speak determines which thoughts you can have [3]. It is generally rejected by most linguists, psychologists, and cognitive scientists today [4, 5, 6]. The Weak Whorf Hypothesis: the claim that the language you speak influences, but does not determine, which thoughts you can have [3]. This is a claim currently being studied, and many ...

  15. Sapir-Whorf Hypothesis

    The Sapir-Whorf Hypothesis can be divided into two basic components: Linguistic Determinism and Linguistic Relativity. The first part, linguistic determinism, refers to the concept that what is said has only some effect on how concepts are recognized by the mind. ... (The Sapir-Whorf Hypotheses, 2002, p.1). Strong determinism refers to a ...

  16. Sapir‐Whorf Hypothesis

    The Sapir-Whorf Hypothesis, also known as the linguistic relativity hypothesis, states that the language one knows affects how one thinks about the world. The hypothesis is most strongly associated with Benjamin Lee Whorf, a fire prevention engineer who became a scholar of language under the guidance of linguist and anthropologist Edward Sapir ...

  17. Sapir-Whorf Hypothesis: Examples, Definition, Criticisms

    Developed in 1929 by Edward Sapir, the Sapir-Whorf hypothesis (also known as linguistic relativity) states that a person's perception of the world around them and how they experience the world is both determined and influenced by the language that they speak. The theory proposes that differences in grammatical and verbal structures, and the ...

  18. Further evidence that Whorfian effects are stronger in the right ...

    The Whorf hypothesis holds that semantic differences between languages induce differences in perception and/or cognition in their speakers. Much of the experimental work pursuing this idea has focused on the domain of color and has centered on the issue of whether linguistically coded color categories influence color discrimination (2-13). A new perspective has been cast on the debate by ...

  19. Whorfian hypothesis

    Other articles where Whorfian hypothesis is discussed: North American Indian languages: Language and culture: …now often known as the Whorfian (or Sapir-Whorf) hypothesis. Whorf's initial arguments focused on the striking differences between English and Native American ways of saying "the same thing." From such linguistic differences, Whorf inferred underlying differences in habits of ...

  20. Weak Forms and Strong Forms

    The Sapir-Whorf hypothesis [1] states that language affects thought: how we speak influences how we think. Or, at least, that's one form of the hypothesis, the weak form. The strong form of Sapir-Whorf says that language determines thought, that how we speak forms a hard boundary on how and what we think. The weak form of Sapir-Whorf says ...

  21. Benjamin Lee Whorf

    Benjamin Lee Whorf (April 24, 1897 in Winthrop, Massachusetts - July 26, 1941) was an American linguist. Whorf is widely known for his ideas about linguistic relativity, the hypothesis that language influences thought. Linguistic relativity is an important theme in many of his publications, and he has been credited as one of the fathers of this approach, often referred to as the "Sapir-Whorf hypothesis", named ...

  22. The Whorfian hypothesis: A cognitive psychology perspective.

    The linguistic relativity (Whorfian) hypothesis states that language influences thought. In its strongest form, the hypothesis states that language controls both thought and perception. Several experiments have shown that this is false. The weaker form of the hypothesis, which states that language influences thought, has been held to be so vague that it is unprovable. The argument presented ...

  23. Testing theory of mind in large language models and humans

    Testing two families of large language models (LLMs) (GPT and LLaMA2) on a battery of measurements spanning different theory of mind abilities, Strachan et al. find that the performance of LLMs ...