Phonetics and Phonology: The Basics

Antônio Roberto Monteiro Simões

Part of the book series: Prosody, Phonology and Phonetics (PRPHPH)

This chapter presents some of the foundations of Phonetics and Phonology. Practice exercises for all sections appear at the end of the chapter. Going forward, the terms Mainstream Spanish and Mainstream Brazilian Portuguese are used interchangeably with their abbreviations MSp and MBP as needed.

Nothing is the way it looks. Not even the sound of a phoneme.


The term “aspiration” (or “aspirated”) is traditionally used in Linguistics. Although it is technically incorrect to use the term to denote a sudden, tiny puff of air, this usage continues to prevail. See the Glossary and other sections for more details.

It should be noted that lenguaje, linguagem, and langage have other connotations in addition to “human language.” In these languages, the terms may refer to the spoken or the written language, that is, lenguaje hablado, lenguaje escrito, as well as to different styles or registers, that is, lenguaje formal, lenguaje informal, and so on.



About this chapter

Simões, A.R.M. (2022). Phonetics and Phonology: The Basics. In: Spanish and Brazilian Portuguese Pronunciation. Prosody, Phonology and Phonetics. Springer, Singapore. https://doi.org/10.1007/978-981-13-1996-9_1



  • D. H. Whalen, City University of New York and Haskins Laboratories, Yale University
  • https://doi.org/10.1093/acrefore/9780199384655.013.57
  • Published online: 29 July 2019

Phonetics is the branch of linguistics that deals with the physical realization of meaningful distinctions in spoken language. Phoneticians study the anatomy and physics of sound generation, acoustic properties of the sounds of the world’s languages, the features of the signal that listeners use to perceive the message, and the brain mechanisms involved in both production and perception. Therefore, phonetics connects most directly to phonology and psycholinguistics, but it also engages a range of disciplines that are not unique to linguistics, including acoustics, physiology, biomechanics, hearing, evolution, and many others. Early theorists assumed that phonetic implementation of phonological features was universal, but it has become clear that languages differ in their phonetic spaces for phonological elements, with systematic differences in acoustics and articulation. Such language-specific details place phonetics solidly in the domain of linguistics; any complete description of a language must include its specific phonetic realization patterns. The description of what phonetic realizations are possible in human language continues to expand as more languages are described; many of the under-documented languages are endangered, lending urgency to the phonetic study of the world’s languages.

Phonetic analysis can consist of transcription, acoustic analysis, measurement of speech articulators, and perceptual tests, with recent advances in brain imaging adding detail at the level of neural control and processing. Because of its dual nature as a component of a linguistic system and a set of actions in the physical world, phonetics has connections to many other branches of linguistics, including not only phonology but syntax, semantics, sociolinguistics, and clinical linguistics as well. Speech perception has been shown to integrate information from both vision and tactile sensation, indicating an embodied system. Sign language, though primarily visual, has adopted the term “phonetics” to represent the realization component, highlighting the linguistic nature both of phonetics and of sign language. Such diversity offers many avenues for studying phonetics, but it presents challenges to forming a comprehensive account of any language’s phonetic system.

  • psycholinguistics
  • neurolinguistics
  • sign language
  • sociolinguistics

1. History and Development of Phonetics

Much of phonetic structure is available to direct inspection or introspection, allowing a long tradition in phonetics (see also articles in Asher & Henderson, 1981 ). The first true phoneticians were the Indian grammarians of about the 8th or 7th century BCE. In their works, called the Prātiśākhyas, they organized the sounds of Sanskrit according to places of articulation, and they also described all the physiological gestures that were required in the articulation of each sound. Every writing system, even those not explicitly phonetic or phonological, includes elements of the phonetic systems of the languages denoted. Early Semitic writing (from Phoenician onward) primarily encoded consonants, while the Greek system added vowels explicitly. The Chinese writing system includes phonetic elements in many, if not most, characters (DeFrancis, 1989 ), and modern readers access phonology while reading Mandarin (Zhou & Marslen-Wilson, 1999 ). The Mayan orthography was based largely on syllables (Coe, 1992 ). All of this required some level of awareness of phonetics.

Attempts to describe phonetics universally are more recent in origin, and they fall into the two domains of transcription and measurement. For transcription, the main development was the creation of the International Phonetic Alphabet (IPA) (e.g., International Phonetic Association, 1989 ). Initiated in 1886 as a tool for improving language teaching and, relatedly, reading (Macmahon, 2009 ), the IPA was modified and extended both in terms of the languages covered and the theoretical underpinnings (Ladefoged, 1990 ). This system is intended to provide a symbol for every distinctive sound in the world’s languages. The first versions addressed languages familiar to the European scholars primarily responsible for its development, but new sounds were added as more languages were described. The 79 consonantal and 28 vowel characters can be modified by an array of diacritics, allowing greater or lesser detail in the transcription. There are diacritics for suprasegmentals, both prosodic and tonal, as well. Additions have been made for the description of pathological speech (Duckworth, Allen, Hardcastle, & Ball, 1990 ). It is often the case that transcriptions for two languages using the same symbol nonetheless have perceptible differences in realization. Although additional diacritics can be used in such cases, it is more often useful to ignore such differences for most analysis purposes. Despite some limitations, the IPA continues to be a valuable tool in the analysis of languages, language use, and language disorders throughout the world.
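Because the IPA's base symbols and diacritics map onto Unicode characters, broad and narrow transcriptions can also be handled programmatically. Below is a minimal Python sketch, not drawn from the article: it attaches combining diacritics and modifier letters (aspiration, dental articulation, nasalization, length) to base symbols and prints their Unicode decomposition. The example segments are illustrative only.

```python
import unicodedata

# A few IPA diacritics and modifier letters (Unicode code points).
ASPIRATED = "\u02b0"        # modifier letter small h (aspiration)
DENTAL = "\u032a"           # combining bridge below (dental)
NASALIZED = "\u0303"        # combining tilde (nasalization)
LONG = "\u02d0"             # modifier letter triangular colon (length)

def narrow(base: str, *marks: str) -> str:
    """Attach diacritics/modifier letters to a base IPA symbol."""
    return base + "".join(marks)

# Illustrative narrow-transcription segments (hypothetical, not from any cited language).
segments = [
    narrow("t", ASPIRATED),   # aspirated voiceless alveolar stop
    narrow("t", DENTAL),      # dental variant
    narrow("a", NASALIZED),   # nasalized vowel
    narrow("i", LONG),        # long vowel
]

for seg in segments:
    names = ", ".join(unicodedata.name(ch) for ch in seg)
    print(f"{seg}  ->  {names}")
```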

For measurement, there are two main signals to record, the acoustic and the articulatory. Although articulation is inherently more complex and difficult to capture completely, it was more accessible to early techniques than were the acoustics. Various ingenious devices were created by Abbé Rousselot ( 1897–1908 ) and E. W. Scripture ( 1902 ). Rousselot’s devices for measuring the velum (Figure 1 ) and the tongue (Figure 2 ) were not, unfortunately, terribly successful. Pliny Earl Goddard ( 1905 ) used more successful devices and was ambitious enough to take his equipment into the field to record dynamic air pressure and static palatographs of such languages as Hupa [ISO 639-3 code hup] and Chipewyan [ISO 639-3 code chp]. Despite these early successes, relatively little physiological work was done until the second half of the 20th century . Technological advances have made it possible to examine muscle activity, airflow, tongue-palate contact, and location and movement of the tongue and other articulators via electromagnetic articulometry, ultrasound, and real-time magnetic resonance imaging (see Huffman, 2016 ). These measurements have advanced our theories of speech production and have addressed both phonetic and phonological issues.


Figure 1. Device to measure the height of the velum. From Rousselot ( 1897–1908 ).


Figure 2. Device to measure the height of the tongue from external shape of the area under the chin. From Rousselot ( 1897–1908 ).

Acoustic recordings became possible with the Edison disks, but the ability to measure and analyze these recordings was much longer in coming. Some aspects of the signal could be somewhat reasonably rendered via flame recordings, in which photographs were taken of flames flickering in response to various frequencies (König, 1873 ). These records were of limited value, because of the limitations of the waveform itself and the difficulties of the recordings, including the time and expense of making them. Further, the ability to see the spectral properties in detail was greatly enhanced by the declassification (after World War II) of the spectrograph (Koenig, Dunn, & Lacy, 1946 ; Potter, Kopp, & Green, 1947 ). New methods of analysis are constantly being explored, with greater accuracy and refinement of data categories being the result.
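The kind of short-time spectral analysis the spectrograph made routine is now a few lines of code. The following is a minimal sketch, assuming scipy and numpy are available; it analyzes a crude synthetic vowel-like signal rather than a real recording, and the fundamental frequency and formant values are invented for illustration.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16_000                                  # sampling rate (Hz)
t = np.arange(0, 0.5, 1 / fs)                # 500 ms of signal

# Synthetic "vowel": a crude 120 Hz pulse train shaped by three decaying formant sinusoids.
f0, formants = 120, (700, 1200, 2600)
source = (np.sin(2 * np.pi * f0 * t) > 0.99).astype(float)
signal = sum(np.convolve(source,
                         np.sin(2 * np.pi * f * t[:200]) * np.exp(-t[:200] * 60),
                         mode="same")
             for f in formants)

# Short-time spectral analysis: 25 ms windows with a 10 ms hop.
freqs, times, sxx = spectrogram(signal, fs=fs, nperseg=int(0.025 * fs),
                                noverlap=int(0.015 * fs))

# Report the strongest frequency in each analysis frame (a rough spectral peak track).
peak_track = freqs[np.argmax(sxx, axis=0)]
print("frames:", len(times), "median peak frequency (Hz):", float(np.median(peak_track)))
```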

Sound is the most obvious carrier of language (and is etymologically embedded in “phonetics”), and the recognition that vision also plays a role in understanding speech came relatively late. Not only do those with typical hearing use vision when confronted with noisy speech (Sumby & Pollack, 1954 ), they can even be misled by vision with speech that is clearly audible (McGurk & MacDonald, 1976 ). Although the lips and jaw are the most salient carriers of speech information, areas of the face outside the lip region co-vary with speech segments (Yehia, Kuratate, & Vatikiotis-Bateson, 2002 ). Audiovisual integration continues as an active area of research in phonetics.

Sign language, a modality largely devoid of sound, has also adopted the term “phonetics” to describe the system of realization of the message (Goldin-Meadow & Brentari, 2017 ; Goldstein, Whalen, & Best, 2006 ). Similarities between reduction of speech articulators and American Sign Language (ASL) indicate that both systems allow for (indeed, may require) reduction in articulation when content is relatively predictable (Tyrone & Mauk, 2010 ). There is evidence that unrelated sign languages use the same realization of telicity, that is, whether an action has an inherent (“telic”) endpoint (e.g., “decide”) or not (“atelic”, e.g., “think”) (Strickland et al., 2015 ). Phonetic constraints, such as maximum contrastiveness of hand shapes, have been explored in an emerging sign language, Al-Sayyid Bedouin Sign Language (Sandler, Aronoff, Meir, & Padden, 2011 ). As further studies are completed, we can expect to see more insights into the aspects of language realization that are shared across modalities, and to be challenged by those that differ.

2. Phonetics in Relation to Phonology

Just as phonetics describes the realization of words in a language, so phonology describes the patterns of elements that make meaningful distinctions in a language. The relation between the two has been, and continues to be, a topic for theoretical debate (e.g., Gouskova, Zsiga, & Boyer, 2011 ; Keating, 1988 ; Romero & Riera, 2015 ). Positions range from a strict separation in which the phonology completes its operations before the phonetics becomes involved (e.g., Chomsky & Halle, 1968 ) to a complete dissolution of the distinction (e.g., Flemming, 2001 ; Ohala, 1990 ). Many intermediate positions are proposed as well.

Regardless of the degree of separation, the early assumption that phonetic implementation was merely physical and universal (e.g., Chomsky & Halle, 1968 ; Halliday, 1961 ), which may invoke the mind-body problem (e.g., Fodor, 1981 ), has proven to be inadequate. Keating ( 1985 ) examined three phonetic effects—intrinsic vowel duration, extrinsic vowel duration, and voicing timing—and found that they were neither universal nor physiologically necessary. Further examples of language- and dialect-specific effects have been found in the fine-grained detail in Voice Onset Time (Cho & Ladefoged, 1999 ), realization of focus (Peters, Hanssen, & Gussenhoven, 2014 ), and even the positions of speech articulators before speaking (Gick, Wilson, Koch, & Cook, 2004 ). Whatever the interface between phonetics and phonology may be, there exist language-specific phonetic patterns, thus ensuring the place of phonetics within linguistics proper.
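Language-specific differences of the kind found for Voice Onset Time become visible with very simple arithmetic once burst and voicing-onset times have been annotated. The sketch below uses made-up annotation values for two unnamed languages; the numbers are not from the cited studies.

```python
from statistics import mean

# Hypothetical annotations: (burst_time_s, voicing_onset_time_s) for word-initial stop tokens.
tokens = {
    "language_A": [(0.112, 0.178), (0.095, 0.171), (0.130, 0.201)],   # long-lag style
    "language_B": [(0.104, 0.119), (0.088, 0.106), (0.121, 0.139)],   # short-lag style
}

for lang, annots in tokens.items():
    vots_ms = [(voice - burst) * 1000 for burst, voice in annots]     # VOT = voicing onset - burst
    print(f"{lang}: mean VOT = {mean(vots_ms):.1f} ms  (n = {len(vots_ms)})")
```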

The overwhelming evidence for “language-specific phonetics” has prompted a reconsideration of the second traditionally assumed distinction between phonology and phonetics: phonetics is continuous, phonology is discrete. This issue has been raised less often in discussions of the phonetics-phonology interface. One approach that has addressed it is Articulatory Phonology, which reconceives purely representational phonological elements so that they are logically consistent with their physical realization; its elements have been claimed to be available for “public” (phonetic) use and yet categorical for making linguistic distinctions (Goldstein & Fowler, 2003 ). Iskarous ( 2017 ) shows how dynamical systems analysis unites discrete phonological contrast and continuous phonetic movement into one non-dualistic description. Gafos and his colleagues provide the formal mechanisms that use such a system to address a range of phonological processes (Gafos, 2002 ; Gafos & Beňuš, 2006 ; Gafos, Roeser, Sotiropoulou, Hoole, & Zeroual, 2019 ). The problem of relating categorical, meaningful distinctions to continuous physical realizations will continue to be worked on and, one hopes, ultimately be resolved.
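The dynamical-systems point can be illustrated with the critically damped mass-spring ("point attractor") model commonly used in gestural accounts: the discrete phonological element is the target of the system, while the continuous phonetic movement is the trajectory toward it. This is a minimal sketch under that general idea, not an implementation of any cited model; the stiffness value and targets are invented.

```python
import numpy as np

def gesture(x0: float, target: float, k: float = 400.0,
            dt: float = 0.001, dur: float = 0.4) -> np.ndarray:
    """Critically damped second-order system: x'' = -k (x - target) - 2*sqrt(k) * x'."""
    b = 2.0 * np.sqrt(k)                 # critical damping coefficient
    x, v = x0, 0.0
    traj = []
    for _ in range(int(dur / dt)):
        a = -k * (x - target) - b * v    # restoring force toward the discrete target
        v += a * dt                      # simple Euler integration
        x += v * dt
        traj.append(x)
    return np.array(traj)

# Discrete contrast: two articulator "targets" (arbitrary units), e.g., high vs. low tongue body.
high = gesture(x0=0.0, target=1.0)
low = gesture(x0=0.0, target=0.3)

# The continuous trajectories differ throughout, but each settles at its categorical target.
print("final positions:", round(float(high[-1]), 3), round(float(low[-1]), 3))
```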

3. Phonetics in Relation to Other Aspects of Language

Phonetic research has had far-reaching effects, many of which are outlined in individual articles in this encyclopedia. Here are several issues of particular interest.

Perception: Until the advent of an easily manipulated acoustic signal, it was very difficult to determine which aspects of the speech signal are taken into account perceptually. The Pattern Playback (Cooper, 1953 ) was an early machine that allowed the synthesis of speech from acoustic parameters. The resulting acoustic patterns did not sound completely natural, but they elicited speech percepts that allowed discoveries to be made, ones that have been replicated in many other studies (cf. Shankweiler & Fowler, 2015 ). These findings have led to a sizable range of research in linguistics and psychology (see Beddor, 2017 ). Many studies of brain function also take these results as a starting point.
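In the same spirit as the Pattern Playback, speech-like stimuli can be generated from a handful of acoustic parameters. The sketch below is not a model of the original device: it is a generic source-filter synthesis, with an impulse-train source passed through second-order resonators at assumed formant frequencies (the values for an [a]-like vowel are illustrative), using scipy.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16_000
dur, f0 = 0.4, 110                       # 400 ms vowel at 110 Hz
n = int(fs * dur)

# Glottal source: an impulse train at the fundamental frequency.
source = np.zeros(n)
source[::fs // f0] = 1.0

def resonator(freq_hz: float, bw_hz: float):
    """Second-order IIR resonator (formant filter) coefficients."""
    r = np.exp(-np.pi * bw_hz / fs)
    theta = 2 * np.pi * freq_hz / fs
    a = [1.0, -2 * r * np.cos(theta), r ** 2]
    b = [1.0 - r]                        # rough gain normalization
    return b, a

# Assumed formant frequencies and bandwidths (Hz) for an [a]-like vowel.
signal = source
for freq, bw in [(700, 90), (1220, 110), (2600, 160)]:
    b, a = resonator(freq, bw)
    signal = lfilter(b, a, signal)       # cascade the formant filters

signal /= np.max(np.abs(signal))         # normalize amplitude
print("synthesized", len(signal), "samples; peak amplitude", float(np.max(np.abs(signal))))
```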

Acquisition: Learning to speak is natural for neurologically typical infants, with no formal instruction necessary for the process. Just how this process takes place depends on phonetic findings, so that the acoustic output of early productions can be compared with the target values in the adult language. Whether or not linguistic categories are “innate,” the development of links between what the learner hears and what s/he speaks is a matter of ongoing debate that would not be possible without phonetic analysis.

Much if not most of the world’s population is bi- or multilingual, and the phonetic effects of second language learning have received a great deal of attention (Flege, 2003 ). The phonetic character of a first language (L1) usually has a great influence on the production and perception of a second one (L2). The effects are smaller when L2 is acquired earlier in life than later, and there is a great deal of individual variability. Degree of L2 accent has been shown to be amenable to improvement via biofeedback (d’Apolito, Sisinni, Grimaldi, & Gili Fivela, 2017 ; Suemitsu, Dang, Ito, & Tiede, 2015 ).

Sociolinguistics: Phonetic variation within a language is a strong indicator of community membership (Campbell-Kibler, 2010 ; Foulkes & Docherty, 2006 ). From the biblical story of the shibboleth to modern everyday experience, speech indicates origin. Perception of an accent can thus lead to stereotypical judgements based on origin, such as assigning less intelligence to speakers who use “-in” rather than “-ing” (Campbell-Kibler, 2007 ). Accents can work in two directions at once, as when an African American dialect is simultaneously recognized as disfavored by mainstream society yet valued as a marker of social identity (Wolfram & Schilling, 2015 , p. 238). The level of detail that is available to and used by speakers and listeners is massive, requiring large studies with many variables. This makes sociophonetics both exciting and challenging (Hay & Drager, 2007 ).

Speech therapy: Not every instance of language acquisition is a smooth one, and some individuals face challenges in speaking their language. The tools that are developed in phonetics help with assessment of the differences and, in some cases, provide a means of remediation. One particularly exciting development that depends on articulation rather than acoustics is the use of ultrasound biofeedback (using images of the speaker’s tongue) to improve production (e.g., Bernhardt, Gick, Bacsfalvi, & Ashdown, 2003 ; Preston et al., 2017 ).

Speech technology: Speech synthesis was an early goal of phonetic research (e.g., Holmes, Mattingly, & Shearme, 1964 ), and research continues to the present. Automatic speech recognition made use of phonetic results, though modern systems rely on more global treatments of the acoustic signal (e.g., Furui, Deng, Gales, Ney, & Tokuda, 2012 ). Man-machine interactions have benefited greatly from phonetic findings, helping to shape the modern world. Further advances may begin, once again, to make less use of machine learning and more of phonetic knowledge.

4. Future Directions

Phonetics as a field of study began with the exceptional discriminative power of the human ear, but recent developments have been increasingly tied to technology. As our ability to record and analyze speech increases, our use of larger and larger data sets increases as well. Many of those data sets consist of acoustic recordings, which are excellent but incomplete records for phonetic analysis. Greater attention is being paid to variability in the signal, both in terms of covariation (Chodroff & Wilson, 2017 ; Kawahara, 2017 ) and intrinsic lack of consistency (Tilsen, 2015 ; Whalen, Chen, Tiede, & Nam, 2018 ). Assessing variability depends on the accuracy of individual measurements, and our current automatic formant analyses are known to be inaccurate (Shadle, Nam, & Whalen, 2016 ). Future developments in this domain are needed; in the meantime, studies that rely on current techniques must be interpreted with appropriate caution.
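Once formant values are in hand, the variability questions raised here reduce to simple summary statistics: spread across tokens, covariation between formants, and disagreement between measurement passes. A minimal sketch with invented F1/F2 measurements (not data from the cited studies):

```python
import numpy as np

# Hypothetical repeated measurements (Hz) of F1 and F2 for one speaker's /ɛ/ tokens.
f1 = np.array([580, 602, 575, 610, 595, 588, 571, 605])
f2 = np.array([1790, 1825, 1770, 1840, 1810, 1800, 1765, 1830])

# Intrinsic variability: spread of each formant across tokens.
print("F1 mean/SD:", float(f1.mean()), round(float(f1.std(ddof=1)), 1))
print("F2 mean/SD:", float(f2.mean()), round(float(f2.std(ddof=1)), 1))

# Covariation: do tokens with higher F1 also tend to have higher F2?
r = np.corrcoef(f1, f2)[0, 1]
print("F1-F2 correlation across tokens:", round(float(r), 2))

# Measurement error: if each token is re-measured (e.g., by a second formant tracker),
# the mean absolute difference between passes bounds the usable precision.
f1_remeasured = f1 + np.array([12, -8, 15, -20, 5, -11, 18, -6])   # invented discrepancies
print("mean |dF1| between passes (Hz):", float(np.mean(np.abs(f1 - f1_remeasured))))
```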

Large data sets are rare for physiological data, though there are some exceptions (Narayanan et al., 2014 ; Tiede, 2017 ; Westbury, 1994 ). Quantification of articulator movement is easier than in the past, but it remains challenging in both collection and analysis. Mathematical tools for image processing and pattern detection are being adapted to the problem, and the future understanding of speech production should be enhanced. Although many techniques are too demanding for some populations, ultrasound has been found to allow investigations of young children (Noiray, Abakarova, Rubertus, Krüger, & Tiede, 2018 ), speakers in remote areas (Gick, Bird, & Wilson, 2005 ), and disordered populations (Preston et al., 2017 ). Thus the amount of data and the range of populations that can be measured can be expected to increase significantly in the coming years.

Understanding the brain mechanisms that underlie the phonetic effects being studied by other means will continue to expand. Improvements in the specificity of brain imaging techniques will allow narrower questions to be addressed. Techniques such as electrocorticographic (ECoG) signals (Hill et al., 2012 ), functional near-infrared spectroscopy (Yücel, Selb, Huppert, Franceschini, & Boas, 2017 ), and the combination of multiple modalities will allow more direct assessments of phonetic control in production and effects in perception. As with other levels of linguistic structure, theories will be both challenged and enhanced by evidence of brain activation in response to language.

Addressing more data allows a deeper investigation of the speech process, and technological advances will continue to play a major role. The study does, ultimately, return to the human perception and production ability, as each newly born speaker/hearer begins to acquire speech and the language it makes possible.

Further Reading

  • Fant, G. (1960). Acoustic theory of speech production . The Hague, The Netherlands: Mouton.
  • Hardcastle, W. J. , & Hewlett, N. (Eds.). (1999). Coarticulation models in recent speech production theories . Cambridge, UK: Cambridge University Press.
  • Ladefoged, P. (2001). A course in phonetics (4th ed.). Fort Worth, TX: Harcourt College Publishers.
  • Ladefoged, P. , & Maddieson, I. (1996). The sounds of the world’s languages . Oxford, UK: Blackwell.
  • Liberman, A. M. (1996). Speech: A special code . Cambridge, MA: MIT Press.
  • Lisker, L. , & Abramson, A. S. (1964). A cross-language study of voicing in initial stops: Acoustical measurements . Word , 20 , 384–422. doi:10.1080/00437956.1964.11659830
  • Ohala, J. J. (1981). The listener as a source of sound change. In M. F. Miller (Ed.), Papers from the parasession on language behavior (pp. 178–203). Chicago, IL: Chicago Linguistic Association.
  • Peterson, G. E. , & Barney, H. L. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America , 24 , 175–184.
  • Stevens, K. N. (1998). Acoustic phonetics . Cambridge, MA: MIT Press.
References

  • Asher, R. E. , & Henderson, J. A. (Eds.). (1981). Towards a history of phonetics . Edinburgh, UK: Edinburgh University Press.
  • Beddor, P. S. (2017). Speech perception in phonetics . In M. Aronoff (Ed.), Oxford research encyclopedia of linguistics . Oxford University Press. doi:10.1093/acrefore/9780199384655.013.62
  • Bernhardt, B. M. , Gick, B. , Bacsfalvi, P. , & Ashdown, J. (2003). Speech habilitation of hard of hearing adolescents using electropalatography and ultrasound as evaluated by trained listeners. Clinical Linguistics and Phonetics , 17 , 199–216.
  • Campbell-Kibler, K. (2007). Accent, (ing), and the social logic of listener perceptions . American Speech , 82 (1), 32–64. doi:10.1215/00031283-2007-002
  • Campbell-Kibler, K. (2010). Sociolinguistics and perception . Language and Linguistics Compass , 4 (6), 377–389. doi:10.1111/j.1749-818X.2010.00201.x
  • Cho, T. , & Ladefoged, P. (1999). Variation and universals in VOT: Evidence from 18 languages. Journal of Phonetics , 27 , 207–229.
  • Chodroff, E. , & Wilson, C. (2017). Structure in talker-specific phonetic realization: Covariation of stop consonant VOT in American English . Journal of Phonetics , 61 , 30–47. doi:10.1016/j.wocn.2017.01.001
  • Chomsky, N. , & Halle, M. (1968). The sound pattern of English . New York, NY: Harper and Row.
  • Coe, M. D. (1992). Breaking the Maya code . London, UK: Thames and Hudson.
  • Cooper, F. S. (1953). Some instrumental aids to research on speech. In A. A. Hill (Ed.), Fourth Annual Round Table Meeting on Linguistics and Language Teaching (pp. 46–53). Washington, DC: Georgetown University.
  • d’Apolito, I. S. , Sisinni, B. , Grimaldi, M. , & Gili Fivela, B. (2017). Perceptual and ultrasound articulatory training effects on English L2 vowels production by Italian learners. International Journal of Social, Behavioral, Educational, Economic, Business and Industrial Engineering , 11 (8), 2159–2167.
  • DeFrancis, J. (1989). Visible speech: The diverse oneness of writing systems . Honolulu: University of Hawai‘i Press.
  • Duckworth, M. , Allen, G. , Hardcastle, W. , & Ball, M. (1990). Extensions to the International Phonetic Alphabet for the transcription of atypical speech . Clinical Linguistics and Phonetics , 4 , 273–280. doi:10.3109/02699209008985489
  • Flege, J. E. (2003). Assessing constraints on second-language segmental production and perception. In N. O. Schiller & A. S. Meyer (Eds.), Phonetics and phonology in language comprehension and production: Differences and similarities (pp. 319–355). Berlin, Germany: Mouton de Gruyter.
  • Flemming, E. (2001). Scalar and categorical phenomena in a unified model of phonetics and phonology. Phonology , 18 , 7–44.
  • Fodor, J. A. (1981). The mind-body problem. Scientific American , 244 , 114–123.
  • Foulkes, P. , & Docherty, G. (2006). The social life of phonetics and phonology . Journal of Phonetics , 34 , 409–438. doi:10.1016/j.wocn.2005.08.002
  • Furui, S. , Deng, L. , Gales, M. , Ney, H. , & Tokuda, K. (2012). Fundamental technologies in modern speech recognition. IEEE Signal Processing Magazine , 29 (6), 16–17.
  • Gafos, A. I. (2002). A grammar of gestural coordination. Natural Language and Linguistic Theory , 20 , 269–337.
  • Gafos, A. I. , & Beňuš, Š. (2006). Dynamics of phonological cognition . Cognitive Science , 30 , 905–943. doi:10.1207/s15516709cog0000_80
  • Gafos, A. I. , Roeser, J. , Sotiropoulou, S. , Hoole, P. , & Zeroual, C. (2019). Structure in mind, structure in vocal tract . Natural Language and Linguistic Theory . doi:10.1007/s11049-019-09445-y
  • Gick, B. , Bird, S. , & Wilson, I. (2005). Techniques for field application of lingual ultrasound imaging. Clinical Linguistics and Phonetics , 19 , 503–514.
  • Gick, B. , Wilson, I. , Koch, K. , & Cook, C. (2004). Language-specific articulatory settings: Evidence from inter-utterance rest position. Phonetica , 61 , 220–233.
  • Goddard, P. E. (1905). Mechanical aids to the study and recording of language . American Anthropologist , 7 , 613–619. doi:10.1525/aa.1905.7.4.02a00050
  • Goldin-Meadow, S. , & Brentari, D. (2017). Gesture, sign, and language: The coming of age of sign language and gesture studies . Behavioral and Brain Sciences , 40 , e46. doi:10.1017/S0140525X15001247
  • Goldstein, L. M. , & Fowler, C. A. (2003). Articulatory phonology: A phonology for public language use. In N. Schiller & A. Meyer (Eds.), Phonetics and phonology in language comprehension and production: Differences and similarities (pp. 159–207). Berlin, Germany: Mouton de Gruyter.
  • Goldstein, L. M. , Whalen, D. H. , & Best, C. T. (Eds.). (2006). Papers in laboratory phonology 8 . Berlin, Germany: Mouton de Gruyter.
  • Gouskova, M. , Zsiga, E. , & Boyer, O. T. (2011). Grounded constraints and the consonants of Setswana . Lingua , 121 (15), 2120–2152. doi:10.1016/j.lingua.2011.09.003
  • Halliday, M. A. K. (1961). Categories of the theory of grammar . Word , 17 , 241–292. doi:10.1080/00437956.1961.11659756
  • Hay, J. , & Drager, K. (2007). Sociophonetics . Annual Review of Anthropology , 36 (1), 89–103. doi:10.1146/annurev.anthro.34.081804.120633
  • Hill, N. J. , Gupta, D. , Brunner, P. , Gunduz, A. , Adamo, M. A. , Ritaccio, A. , & Schalk, G. (2012). Recording human electrocorticographic (ECoG) signals for neuroscientific research and real-time functional cortical mapping . Journal of Visualized Experiments (64), 3993. doi:10.3791/3993
  • Holmes, J. N. , Mattingly, I. G. , & Shearme, J. N. (1964). Speech synthesis by rule. Language and Speech , 7 (3), 127–143.
  • Huffman, M. K. (2016). Articulatory phonetics. In M. Aronoff (Ed.), Oxford Research Encyclopedia of Linguistics . Oxford University Press.
  • International Phonetic Association . (1989). Report on the 1989 Kiel Convention. Journal of the International Phonetic Association , 19 , 67–80.
  • Iskarous, K. (2017). The relation between the continuous and the discrete: A note on the first principles of speech dynamics . Journal of Phonetics , 64 , 8–20. doi:10.1016/j.wocn.2017.05.003
  • Kawahara, S. (2017). Durational compensation within a CV mora in spontaneous Japanese: Evidence from the Corpus of Spontaneous Japanese . Journal of the Acoustical Society of America , 142 , EL143–EL149. doi:10.1121/1.4994674
  • Keating, P. A. (1985). Universal phonetics and the organization of grammars. In V. A. Fromkin (Ed.), Phonetic linguistics: Essays in honor of Peter Ladefoged (pp. 115–132). New York, NY: Academic Press.
  • Keating, P. A. (1988). The phonology-phonetics interface. In F. Newmeyer (Ed.), Linguistics: The Cambridge survey: Vol. 1. Grammatical theory (pp. 281–302). Cambridge, UK: Cambridge University Press.
  • Koenig, W. , Dunn, H. K. , & Lacy, L. Y. (1946). The sound spectrograph. Journal of the Acoustical Society of America , 18 , 19–49.
  • König, R. (1873). I. On manometric flames. London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science , 45 (297), 1–18.
  • Ladefoged, P. (1990). The Revised International Phonetic Alphabet . Language , 66 , 550–552. doi:10.2307/414611
  • Macmahon, M. K. C. (2009). The International Phonetic Association: The first 100 years . Journal of the International Phonetic Association , 16 , 30–38. doi:10.1017/S002510030000308X
  • McGurk, H. , & MacDonald, J. (1976). Hearing lips and seeing voices. Nature , 264 , 746–748.
  • Narayanan, S. , Toutios, A. , Ramanarayanan, V. , Lammert, A. , Kim, J. , Lee, S. , . . . Proctor, M. (2014). Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (TC) . Journal of the Acoustical Society of America , 136 , 1307–1311. doi:10.1121/1.4890284
  • Noiray, A. , Abakarova, D. , Rubertus, E. , Krüger, S. , & Tiede, M. K. (2018). How do children organize their speech in the first years of life? Insight from ultrasound imaging. Journal of Speech, Language, and Hearing Research , 61 , 1355–1368.
  • Ohala, J. J. (1990). There is no interface between phonology and phonetics: A personal view. Journal of Phonetics , 18 , 153–172.
  • Peters, J. , Hanssen, J. , & Gussenhoven, C. (2014). The phonetic realization of focus in West Frisian, Low Saxon, High German, and three varieties of Dutch . Journal of Phonetics , 46 , 185–209. doi:10.1016/j.wocn.2014.07.004
  • Potter, R. K. , Kopp, G. A. , & Green, H. G. (1947). Visible speech . New York, NY: Van Nostrand.
  • Preston, J. L. , McAllister Byun, T. , Boyce, S. E. , Hamilton, S. , Tiede, M. K. , Phillips, E. , . . . Whalen, D. H. (2017). Ultrasound images of the tongue: A tutorial for assessment and remediation of speech sound errors . Journal of Visualized Experiments , 119 , e55123. doi:10.3791/55123
  • Romero, J. , & Riera, M. (Eds.). (2015). The Phonetics–Phonology Interface: Representations and methodologies . Amsterdam, The Netherlands: John Benjamins.
  • Rousselot, P.‐J. (1897–1908). Principes de phonétique expérimentale . Paris, France: H. Welter.
  • Sandler, W. , Aronoff, M. , Meir, I. , & Padden, C. (2011). The gradual emergence of phonological form in a new language . Natural Language and Linguistic Theory , 29 , 503–543. doi:10.1007/s11049-011-9128-2
  • Scripture, E. W. (1902). The elements of experimental phonetics . New York, NY: Charles Scribner’s Sons.
  • Shadle, C. H. , Nam, H. , & Whalen, D. H. (2016). Comparing measurement errors for formants in synthetic and natural vowels. Journal of the Acoustical Society of America , 139 , 713–727.
  • Shankweiler, D. , & Fowler, C. A. (2015). Seeking a reading machine for the blind and discovering the speech code. History of Psychology , 18 , 78–99.
  • Strickland, B. , Geraci, C. , Chemla, E. , Schlenker, P. , Kelepir, M. , & Pfau, R. (2015). Event representations constrain the structure of language: Sign language as a window into universally accessible linguistic biases . Proceedings of the National Academy of Sciences , 112 (19), 5968–5973. doi:10.1073/pnas.1423080112
  • Suemitsu, A. , Dang, J. , Ito, T. , & Tiede, M. K. (2015). A real-time articulatory visual feedback approach with target presentation for second language pronunciation learning . Journal of the Acoustical Society of America , 138 , EL382–EL387. doi:10.1121/1.4931827
  • Sumby, W. H. , & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America , 26 , 212–215.
  • Tiede, M. K. (2017). Haskins_IEEE_Rate_Comparison_DB .
  • Tilsen, S. (2015). Structured nonstationarity in articulatory timing. In The Scottish Consortium for ICPhS 2015 (Ed.), Proceedings of the 18th International Congress of Phonetic Sciences (Vol. Paper number 78, pp. 1–5). Glasgow, UK: University of Glasgow.
  • Tyrone, M. E. , & Mauk, C. E. (2010). Sign lowering and phonetic reduction in American Sign Language . Journal of Phonetics , 38 , 317–328. doi:10.1016/j.wocn.2010.02.003
  • Westbury, J. R. (1994). X-ray microbeam speech production database user’s handbook . Madison: Waisman Center, University of Wisconsin.
  • Whalen, D. H. , Chen, W.‐R. , Tiede, M. K. , & Nam, H. (2018). Variability of articulator positions and formants across nine English vowels. Journal of Phonetics , 68 , 1–14.
  • Wolfram, W. , & Schilling, N. (2015). American English: Dialects and variation . Malden, MA: John Wiley & Sons.
  • Yehia, H. C. , Kuratate, T. , & Vatikiotis-Bateson, E. S. (2002). Linking facial animation, head motion and speech acoustics . Journal of Phonetics , 30 , 555–568. doi:10.1006/jpho.2002.0165
  • Yücel, M. A. , Selb, J. J. , Huppert, T. J. , Franceschini, M. A. , & Boas, D. A. (2017). Functional Near Infrared Spectroscopy: Enabling routine functional brain imaging . Current Opinion in Biomedical Engineering , 4 , 78–86. doi:10.1016/j.cobme.2017.09.011
  • Zhou, X. , & Marslen-Wilson, W. D. (1999). Phonology, orthography, and semantic activation in reading Chinese. Journal of Memory and Language , 41 , 579–606.

Related Articles

  • Theoretical Phonology
  • Contrastive Specification in Phonology
  • Articulatory Phonetics
  • Speech Perception and Generalization Across Talkers and Accents
  • Korean Phonetics and Phonology
  • Clinical Linguistics
  • Speech Perception in Phonetics
  • Coarticulation
  • Sign Language Phonology
  • Second Language Phonetics
  • The Phonetics of Babbling
  • Direct Perception of Speech
  • Phonetics of Singing in Western Classical Style


Phonetic Transcription Exercise

Course: Language Variation and Change, Prof. Edward Flemming, Department of Linguistics and Philosophy (MIT OpenCourseWare)

Due session 4

1. Transcribe the sentences in the following sound files:

  • Recording at www.dialectsarchive.com/scotland-5 , 1:03 to 1:05 (speaker from South Queensferry, near Edinburgh, Scotland)
  • speaker32.wav (from Michigan, I think) [not available for OCW use]
  • Recording at www.dialectsarchive.com/texas-1 , 2:01 to 2:07 (from San Marcos, Texas)

2. The Texan speaker in the recording at www.dialectsarchive.com/texas-1, 0:06 to 1:05, uses two allophones of the /ɪ/ phoneme (the vowel in words like ‘this’, ‘it’, etc.). Transcribe these two allophones. Do you think this is free variation, or could it be conditioned variation? Explain why. If you think it might be conditioned variation, suggest a hypothesis about the conditioning factors (one way to tabulate the evidence is sketched after the transcript below).

Here is a transcript to help you locate relevant words:

First part of recording:

When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow. The rainbow is a division of white light into many beautiful colors. These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon. There is, according to legend, a boiling pot of gold at one end. People look, but no one ever finds it. When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.

(public-domain text from International Dialects of English Archive )

Second part of recording:

(Uh) This is (oh)—it’s just a jail delivery that we (uh)—when we went through the (uh), the Heritage Association’s tour, this last year, (uh) we obtained copy of this. And (uh) this—I remember when my grandparents (uh), who had a farm…

(transcription © International Dialects of English Archive . This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/ .)
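For question 2, once the /ɪ/ tokens have been transcribed, one quick way to check whether the variation looks conditioned is to tally each allophone against a candidate conditioning factor, such as the following segment. The words, allophone symbols, and contexts below are placeholders, not an answer key.

```python
from collections import Counter, defaultdict

# Hypothetical transcribed tokens: (word, allophone, following segment).
tokens = [
    ("this", "ɪ", "s"), ("it", "ɪ", "t"), ("strikes", "ɪ", "k"),
    ("prism", "ɪ̈", "z"), ("division", "ɪ̈", "ʒ"), ("finds", "ɪ", "n"),
]

by_context = defaultdict(Counter)
for _, allophone, following in tokens:
    by_context[following][allophone] += 1

# If each context strongly favors one allophone, conditioned variation is a better
# hypothesis than free variation.
for context, counts in sorted(by_context.items()):
    print(f"before /{context}/: {dict(counts)}")
```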


Chapter 2 - Phonology and Phonetic Transcription

Example 2.1 - Table 2.1: The Transcription of Consonants.

Example 2.2 - Table 2.2: The Transcription of Vowels.

Example 2.3 - Unstressed vowels

Homework Exercises

Performance Exercises


ORIGINAL RESEARCH article

Echoes of L1 Syllable Structure in L2 Phoneme Recognition

Kanako Yasufuku*

  • Department of Linguistics and Asian/Middle Eastern Languages, San Diego State University, San Diego, CA, United States

Learning to move from auditory signals to phonemic categories is a crucial component of first, second, and multilingual language acquisition. In L1 and simultaneous multilingual acquisition, learners build up phonological knowledge to structure their perception within a language. For sequential multilinguals, this knowledge may support or interfere with acquiring language-specific representations for a new phonemic categorization system. Syllable structure is a part of this phonological knowledge, and language-specific syllabification preferences influence language acquisition, including early word segmentation. As a result, we expect to see language-specific syllable structure influencing speech perception as well. Initial evidence of an effect appears in Ali et al. (2011) , who argued that cross-linguistic differences in McGurk fusion within a syllable reflected listeners’ language-specific syllabification preferences. Building on a framework from Cho and McQueen (2006) , we argue that this could reflect the Phonological-Superiority Hypothesis (differences in L1 syllabification preferences make some syllabic positions harder to classify than others) or the Phonetic-Superiority Hypothesis (the acoustic qualities of speech sounds in some positions make it difficult to perceive unfamiliar sounds). However, their design does not distinguish between these two hypotheses. The current research study extends the work of Ali et al. (2011) by testing Japanese, and adding audio-only and congruent audio-visual stimuli to test the effects of syllabification preferences beyond just McGurk fusion. Eighteen native English speakers and 18 native Japanese speakers were asked to transcribe nonsense words in an artificial language. English allows stop consonants in syllable codas while Japanese heavily restricts them, but both groups showed similar patterns of McGurk fusion in stop codas. This is inconsistent with the Phonological-Superiority Hypothesis. However, when visual information was added, the phonetic influences on transcription accuracy largely disappeared. This is inconsistent with the Phonetic-Superiority Hypothesis. We argue from these results that neither acoustic informativity nor interference of a listener’s phonological knowledge is superior, and sketch a cognitively inspired rational cue integration framework as a third hypothesis to explain how L1 phonological knowledge affects L2 perception.

Introduction

Second language acquisition and representation is extensively affected by a learner’s knowledge of other languages. These effects appear at many levels, including lexical ( Schulpen et al., 2003 ; Weber and Cutler, 2004 ; Otake, 2007 ) and phonological recognition ( Dupoux et al., 1999 ; Carlson et al., 2016 ). This can happen even at an abstract level, with first language (L1) knowledge helping second/later language (L2) learners identify which phonetic dimensions are used for phonemic discrimination in the new language ( Lin, 2001 ; Pajak and Levy, 2012 ). In this paper, we examine how difficulties in L2 phonemic categorization may arise from the difference between the L1 and L2 phonological structure (in the form of syllabification preferences) and the quality of the acoustic/visual input. We find evidence that both phonological and phonetic cues are used by L2 learners to identify sound categories, and that the learners’ use of these cues varies depending on the input they are given and their L1 experience. This argues against previous proposals, Phonetic- and Phonological-Superiority Hypotheses, that claim one cue consistently outweighs the other ( Cho and McQueen, 2006 ; Ali et al., 2011 ), and suggests that learners may be capable of combining information from their L1 structure with perceived cue-reliability as part of a rational cue integration framework for L2 perception and categorization.

Whether we consider first or later language acquisition, learning to effectively represent the acoustic signals available in language input and building these observations up into phonological structure for the target language are key components of language learning. L1 learners build their fundamental understanding of language-specific sound patterns by analyzing acoustic data in the input and applying this knowledge in later speech perception and language comprehension ( Maye et al., 2002 ; Werker et al., 2007 ). L2 learners follow a similar process, but may also be influenced by the structure they have already developed from their L1. In some cases, this may be helpful; if a learner’s L1 has the same structure as the target L2, the L1 can provide a head-start for learning the target L2 structure ( Best and Tyler, 2007 ; Pajak and Levy, 2012 ). If, on the other hand, the L1 has a different structure, the L1 could impede L2 acquisition. This second case is a common pitfall for language learners anecdotally, and has been backed up by research showing that learners encounter more difficulty perceiving and processing linguistic features that are present in the L2 but either absent or significantly different in their L1 ( Best and Tyler, 2007 ). This difficulty may be due to the linguistic knowledge of the languages previously acquired, which L2 learners activate during the course of language processing and comprehension ( Dupoux et al., 1999 ; Carlson et al., 2016 ).

L1 knowledge can influence perceived L2 phonetic and phonological structure in many ways. At perhaps the most basic level, L2 phonemic distinctions that are present in a learner’s L1 can be easier to perceive than those that are not present or merely allophonic in the L1 ( Flege, 2003 ). For instance, Spanish-Catalan bilinguals whose L1 is Spanish find it more difficult to distinguish word pairs that differ in only one phoneme (i.e., /e/ vs. /ε/) than Catalan-dominant Catalan-Spanish bilinguals ( Pallier et al., 2001 ). This reflects the influence of Catalan phonology, since both /e/ and /ε/ are phonemically contrastive in Catalan while only /e/ exists in the Spanish phonemic inventory. In addition, distinctions that are present in both L1 and L2, but with different boundaries in the two languages, also show an effect of L1. L2 learners tend to draw category boundaries between ambiguous sounds that are in line with the acoustic boundaries of their native language ( Escudero and Vasiliev, 2011 ; Kartushina and Frauenfelder, 2013 ).

These effects extend beyond the L1 phonemic inventory and boundaries as well. Higher-level phonological knowledge from the L1, such as syllable structure and phonotactic restrictions, can also contribute to inaccuracies in perceiving or representing L2 speech sounds. L2 learners often report illusory segments in L2 perception that bring the L2 percept closer to L1 phonotactics. Carlson et al. (2016) asked L1 Spanish speakers to listen to L2 words with an initial consonant cluster composed of /s/ followed by another consonant (i.e., #sC_). Such initial clusters are phonotactically illicit in Spanish, and the participants reported hearing an illusory vowel /e/ before the word-initial #sC cluster, which would be acceptable in Spanish phonotactics. Dupoux et al. (1999) found that Japanese listeners reported hearing an illusory vowel [u] between consonants when they were presented with VCCV as well as VCuCV nonwords, while French listeners, whose L1 allows consonant clusters, were able to judge when the vowel was absent. A likely explanation for this epenthetic repair pattern is that an illusory vowel is inserted after the syllable-final consonant, breaking the illicit VC syllable into a V syllable followed by a CV syllable, fitting the standard Japanese syllable structure, (C)V, and avoiding a phonotactic violation. Indeed, this epenthetic repair has also been observed in Japanese loanword adaptation (/i/ is inserted after word-final [tʃ] and [dʒ], /o/ after [t] and [d], and /ɯ/ in other phonetic environments).
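The loanword pattern just described can be stated as a small rule set. The sketch below implements only the epenthesis rules quoted in this paragraph, over a toy phone-string representation; everything else about Japanese adaptation (gemination, vowel length, consonant substitution) is ignored, and the segmentations are hypothetical.

```python
def epenthesize(phones: list[str]) -> list[str]:
    """Insert an epenthetic vowel after consonants that would otherwise close a syllable.

    Rules from the text: /i/ after [tʃ] and [dʒ], /o/ after [t] and [d], /ɯ/ elsewhere.
    Toy assumption: every consonant not already followed by a vowel gets repaired.
    """
    vowels = {"a", "i", "ɯ", "e", "o"}
    out = []
    for idx, phone in enumerate(phones):
        out.append(phone)
        nxt = phones[idx + 1] if idx + 1 < len(phones) else None
        if phone not in vowels and (nxt is None or nxt not in vowels):
            if phone in ("tʃ", "dʒ"):
                out.append("i")
            elif phone in ("t", "d"):
                out.append("o")
            else:
                out.append("ɯ")
    return out

# Illustrative adaptations.
print(epenthesize(["b", "e", "d"]))    # -> ['b', 'e', 'd', 'o']
print(epenthesize(["m", "a", "tʃ"]))   # -> ['m', 'a', 'tʃ', 'i']
print(epenthesize(["b", "a", "s"]))    # -> ['b', 'a', 's', 'ɯ']
```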

A similar example comes from English listeners, who miscategorize certain onset phonemes in a way that fits better with English phonotactics. English-speaking listeners exposed to onset consonant clusters composed of a fricative consonant (e.g., [ʃ] and [s]) followed by a liquid consonant (e.g., [ɹ] and [l]) tend to report hearing [ʃɹi] and [sli], compared to the phonotactically worse (within English) [sɹi] and [ʃli], even though the same fricative was being played. That is, even though the actual sounds were acoustically close to [ʃ], when it appears in the phonetic context [_li], English speakers reported that they heard [sli] rather than [ʃli]. In contrast, when an [s]-like sound was played in a phonetic context such as [_ɹi], English speakers perceived it as [ʃɹi] rather than [sɹi] ( Massaro and Cohen, 2000 ).

Building on this idea of abstract L1 influences, some studies have revealed that listeners of L1s with different syllable structure preferences syllabify unfamiliar strings of sounds differently, in line with the predominant syllable structure of their L1 ( Cutler et al., 1986 ; Otake et al., 1993 ). These differences have often been tested by asking listeners to syllabify words in their L1 or an L2, but this does not directly get at the question of whether the syllabification biases from L1 have any effects on perception itself.

An alternative way of testing this that directly measures perception relies on the McGurk effect, in which observers are played audio and video information that do not agree, and report hearing a sound that is inconsistent with the audio information. For instance, a common cross-linguistic McGurk “fusion” effect is perceiving a [t] sound when presented with audio of a [p] sound but video of a [k] sound ( McGurk and MacDonald, 1976 ; Tiippana et al., 2004 ). Researchers have examined cross-linguistic differences in English and Arabic speakers’ perception of monosyllabic CVC words, using McGurk fusion rate differences to argue for an influence of L1 syllable structure preferences on sound perception in onset and coda positions ( Ali and Ingleby, 2002 ; Ali, 2003 ; Ali and Ingleby, 2005 ). The results show that English-speaking participants fused audio-visually incongruent English consonant recordings more when they were in coda position than onset position, while Arabic speakers did not show such positional differences on similar stimuli in Arabic. Based on these L1 results, they argued that, despite both languages allowing surface-form coda consonants, the abstract representation of Arabic coda consonants is as an onset in a syllable with an illusory vowel (similar to Dupoux et al., 1999 and Carlson et al., 2016 ’s findings).
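The dependent measure in these studies comes down to a fusion rate per syllable position per listener group. A minimal tabulation sketch over invented trial records (the counts are illustrative, not data from the cited experiments):

```python
from collections import defaultdict

# Hypothetical trial records: (listener_group, syllable_position, fused?).
trials = [
    ("English", "onset", False), ("English", "onset", True), ("English", "onset", False),
    ("English", "coda", True), ("English", "coda", True), ("English", "coda", False),
    ("Arabic", "onset", True), ("Arabic", "onset", False), ("Arabic", "onset", False),
    ("Arabic", "coda", False), ("Arabic", "coda", True), ("Arabic", "coda", False),
]

counts = defaultdict(lambda: [0, 0])          # (group, position) -> [fused, total]
for group, position, fused in trials:
    counts[(group, position)][0] += int(fused)
    counts[(group, position)][1] += 1

for (group, position), (fused, total) in sorted(counts.items()):
    print(f"{group:7s} {position:5s}: fusion rate = {fused / total:.2f}  ({fused}/{total})")
```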

To argue that this difference is the result of syllable representation and a preference for certain syllable structure, Ali et al. (2011) further tested English learners of Arabic (with an average of 3 years of part-time Arabic instruction) in addition to English and Arabic monolinguals. English speakers with significant Arabic instruction showed a similar response to Arabic monolinguals when they were presented with Arabic stimuli: they fused audio and visual information almost equally in onset and coda positions. English speakers without Arabic exposure showed a similar response on both English and Arabic stimuli: increased fusion in codas versus onsets. Based on these results, they argued that an L1 syllabic structure effect was present, that English and Arabic had different mental representations of syllabic structure, and that the L1 effect could be overcome with sufficient exposure to the L2. This suggests that high-level phonological structure from L1 can introduce significant difficulties for L2 perception and categorization. However, there are at least two possible explanations for this data.

The first explanation is that, as Ali et al. (2011) claimed, the difference in the magnitude of McGurk fusion rate is dependent on the phonological syllabic structure, and thus reflects language-specific phonological preferences inherited from L1. This supports the Phonological-Superiority Hypothesis, which claims that cross-linguistic differences in consonant perception in different syllable positions arise from differences in the native phonological structure preferences ( Cho and McQueen, 2006 ). When learners encounter L2 speech, especially when sounds appear in unfamiliar sequences or phonotactically illicit positions according to the L1, they re-analyze it to fit an L1-acceptable syllable structure rather than maintaining the more accurate L2 structure. Ali et al. (2011) argue that English monolinguals show more fusion on Arabic codas than Arabic speakers do because they represent the Arabic coda (which does not show increased fusion) as if it were an English coda (which does show increased fusion). Given the findings that listeners can re-interpret L2 sounds to fit their L1 (as with illusory epenthetic vowels of Dupoux et al. (1999) and Carlson et al. (2016) ), such differences in representation are reasonable, and could induce some phonological effects on perception. However, Ali et al. (2011) do not propose a mechanism to explain how abstract differences in phonological structure representations (as an onset versus a coda) create strong perceptual differences in audio-visual integration.

An alternative, more explicitly mechanistic, approach proposes that the perceptual differences across syllabic positions come from differences in the actual phonetic realizations of the sounds in different positions. The English coda consonants tested in the previous studies may be acoustically less informative compared to onset consonants; English speakers do not have to audibly release word-final stops, and unreleased stops are harder to identify ( Lisker, 1999 ). In addition, although we use the same phonetic symbol to represent a sound category in onset and coda positions, the acoustic signal for onset and coda versions of a phone can be very different. This is especially true for stops, where the acoustically informative transition between vowel and closure occurs at the beginning of a coda stop but at the end of an onset stop. Recognizing the representational equivalence of the acoustically distinct onset and coda forms of the same L2 phoneme may not be trivial to the learner ( Tsukada, 2004 ). This would fit the Phonetic-Superiority Hypothesis, which argues that difficulties arise from differences in the quality of the phonetic identifiability of some sounds, independent of the phonological system of the speaker’s native language ( Cho and McQueen, 2006 ).

Previous work has found cases where L2 perception is relatively independent of L1 phonotactic restrictions, suggesting that phonetic differences can be perceptually salient despite cross-linguistic phonological differences. For instance, Dutch speakers are able to perceive and discriminate voicing differences in English word-final fricatives (i.e., /s/-/z/ and /f/-/v/ contrasts) and stops (i.e., /p/-/b/ and /t/-/d/ contrasts) as accurately as in word-initial position, even though Dutch neutralizes voicing contrasts in word-final position ( Broersma, 2005 ). Similarly, native Japanese speakers are able to discriminate English /r/ and /l/ more accurately in syllable-final position than in other positions, despite this not representing a phonemic distinction in any position in Japanese ( Sheldon and Strange, 1982 ). This seems to argue against the Phonological-Superiority Hypothesis, since Japanese is primarily a CV syllable language, yet the best performance was occurring in a phonotactically illicit position. Sheldon and Strange attributed the finding to the actual acoustic difference between English /r/ and /l/ sounds, which may be more acoustically distinct in word-final position than word-initial position, supporting the dominance of phonetic over phonological perceptual influences. Especially in the case of the audio-visual integration of Ali et al. (2011) , acoustically less informative audio input may have caused English and Arabic listeners to pay more attention to the visually presented information in English coda productions, increasing McGurk fusion rates because of less phonetic clarity. Because languages differ in their phonemic inventories and divisions, difficulties that appear to depend on the L1 structure may actually stem primarily from the joint difficulties of the task and the individual phone productions.

Some research studies have carefully investigated the interaction of phonetic and phonological information when distinguishing phoneme contrasts in different phonotactic contexts. Tsukada (2004) investigated how speakers of different languages discriminate Thai and English voiceless stop contrasts (i.e., /p/-/t/, /p/-/k/, and /t/-/k/) in syllable-final positions. English and Thai both have three-way stop contrasts in syllable-final position, but English final stops can be released or unreleased, while Thai final stops are obligatorily unreleased. Tsukada played recordings of English (released) and Thai (unreleased) CVC words to both English and Thai listeners. English and Thai listeners showed equal accuracy in discriminating English speakers’ stop contrasts, but Thai listeners were more accurate than English listeners in discriminating Thai speakers’ contrasts. Tsukada argued that the asymmetry may be due to the difference in acoustic information available in the Thai case, since unreleased stops are acoustically less informative. Tsukada and Ishihara (2007) followed up on this idea by exposing Japanese listeners to word-final English and Thai stop contrasts. Japanese has the same voiceless stop inventory as Thai and English, but Japanese phonotactics reject stops in coda position. Their results show that all three language groups correctly discriminate English word-final stop contrasts, but for unreleased Thai-language recordings, Thai listeners were more accurate than English listeners who, in turn, were more accurate than Japanese listeners.

These results suggest that the difficulties in perceiving L2 speech sounds may be motivated by the listeners’ L1 structure biases, but the degree of difficulty may vary depending on the actual acoustic cue informativity. In the present study, in addition to investigating how phonetic information and phonological information would influence L2 speech recognition, we would like to address the question of how listeners balance their reliance on phonetic and phonological factors when categorizing L2 phonemes. Phonetic and phonological influences on L2 phoneme categorizations suggest two different potential problems facing the L2 learner, one general and one specific to the L1. On the phonetic side, L2 learners may have to deal with a language that provides data of differing acoustic quality, making certain contrasts or phonotactic positions difficult to classify accurately. Learners then would have to learn to extract the information that they need to categorize sounds correctly. On the phonological side, L2 learners may need to learn to represent the phonological and syllabic structure of the L2 (if their L1 uses a different representation) to properly identify the phones. In the present work, we will look for signs of each of these problems in L2 representation. We will test to see if one difficulty is stronger than the other, as the Phonetic- and Phonological-Superiority Hypotheses propose, or if the two have a more complex relationship.

To see how such factors may interact in L1 and L2 perception, consider the possible explanations that each hypothesis offers for the findings of Ali et al. (2011) that English listeners show increased McGurk fusion (i.e., decreased ability to identify the correct phoneme from acoustic information) in codas versus onsets, while Arabic speakers (even advanced-L2 Arabic speakers) show no such difference. At first, this may seem to be clear-cut evidence of phonological superiority. The English speakers phonologically represent the stop consonants as onsets and codas, and categorize them differently. The Arabic speakers might represent onsets and codas identically, and thus show no difference in categorization accuracy. However, as mentioned above, this presupposes an unstated mechanism to explain why codas would be inherently inclined toward less accurate representation than onsets. How could phonetic superiority explain this? Suppose onsets are more acoustically identifiable than codas in general. English speakers’ accuracies are as expected, with worse accuracy on codas. Arabic speakers’ accuracies require an explanation, since they seem unaffected by this acoustic difference. Here we encounter two confounds in Ali et al.’s experimental design: the Arabic stimuli were real Arabic words, and Arabic speakers were only tested on Arabic, not English. Since Arabic speakers knew the potential words they were trying to identify, they may have been able to rely on prior knowledge about the wordforms to convert this to a lexical recognition task instead of a phonemic categorization task. Without Arabic listeners’ data on English wordforms, it is difficult to identify a specific mechanism for the differences, and each relies in part on supposition.

To fill these gaps and more directly test the relative influences of phonetic and phonological representation on phoneme categorization, we propose a study that examines L1 effects on listeners’ representation of an artificial language. This removes the potential lexical recognition confound and equalizes the listeners’ inherent familiarity with the test words. We also independently assess listeners’ accuracy on audio-only, audio-visual congruent, and audio-visual incongruent tasks. This builds a baseline to understand if the proposed onset-coda differences in acoustic informativity are real, and if they differ due to previous language exposure. We extend to congruent and incongruent audio-visual data to test both how L1 knowledge may affect L2 phonemic categorization in real-world applications, and how the phonetic and phonological structures interact.

We compare L1 English and Japanese speakers, taking advantage of the different underlying phonological representations of the syllable that they have. Whereas both Arabic and English allow stop consonants in the surface coda position (requiring Ali et al., 2011 , to consider potential differences in the abstract syllable structure), Japanese blocks stop consonants from almost all coda positions at the surface, presenting a more robust phonotactic difference. Japanese and English are ideal languages to address the current research questions because they are languages with substantially different phonologically defined syllables. Japanese, aside from limited exceptions for nasals and geminates, disallows coda consonants. The complete syllable inventory of Japanese consists of V, VV, CV, CVV, CVN (i.e., consonant-vowel-nasal), and CVQ (i.e., consonant-vowel-first half of a geminate consonant, extending into a following syllable). In contrast, English has relatively flexible syllable structures and allows consonant clusters consisting of a wide range of onset or coda consonants (including CCCVCCC, as in strengths ). This makes Japanese a good comparison language for examining how language-specific syllable structure preferences influence phonetic perception, and whether the perception of audio-visually incongruent sound is influenced primarily by listeners' L1 phonological knowledge, the actual acoustic information available in the test items, or some other linguistic factor such as lexical knowledge. Using a set of nonsense monosyllabic CVC words and introducing them to participants as a novel language limits participants' ability to rely on any other factors that may influence speech perception.

Consequently, we would like to address two research questions in the current study: (1) Do we see signs of both phonetic and phonological influence in audio-visual information integration during speech perception, or does one dominate the other, as the Phonetic- and Phonological-Superiority Hypotheses predict? (2) If we see an interaction of phonetic and phonological cues, what causes the shift between phonetic and phonological preferences in speech perception? If Japanese listeners are relatively less accurate in detecting coda consonants than onset consonants in the audio-only condition while English listeners perceive consonants in both positions accurately in the same condition, and Japanese listeners' response patterns to audio-visual incongruent stimuli differ from those of English listeners, this will be evidence for the Phonological-Superiority Hypothesis, since L1 structures drive the performance differences. In contrast, if the Japanese and English listeners show smaller differences due to linguistic influence than due to the specific stimuli they are classifying, this will be evidence for the Phonetic-Superiority Hypothesis, since differences in phonetic clarity drive the performance differences. Finally, if the response patterns show L1-dependent differences in some stimuli but not others, this may be evidence for a third possibility: that both phonetic information and phonological knowledge influence speech perception, and their relative importance may vary depending on the informativity and reliability of each cue available during the course of speech perception, rather than one dominating the other overall.

Methodology

Participants.

Twenty native English speakers (9 male, 11 female, age range 11–44 years, mean age = 23.1 years) and 20 native Japanese speakers (6 male, 14 female, age range 19–53 years, mean age = 22.1 years) were recruited from San Diego State University and the associated American Language Institute (ALI). Data from two English-speaking participants were excluded from the data analysis due to their failure to understand the task. Data from two Japanese-speaking participants were excluded from the data analysis due to significant self-reported L3 exposure. None of the participants in either language group had lived in a foreign country for longer than a year, and all self-reported normal hearing and vision. Averages of self-reported language proficiency levels in their native language and second language(s) are provided in Table 1 .


Table 1. Participants’ Self-Reported Language Proficiency in their native language, their second best language, and any third language experience.

The Japanese speakers had studied English as a second language at a Japanese educational institution for at least 3 years, as it is common for Japanese middle schools to require English as a foreign language class. Averages of self-reported English proficiency levels among the native Japanese speakers were 4.53 for speaking, 5.18 for understanding, and 5.06 for writing on a 10-point scale. Only one English-speaking participant had studied Japanese at college. This participant was kept in the analysis, as his overall self-reported proficiency level in Japanese was only 3 of 10. The additional languages reported by Japanese speakers were Korean, Mandarin Chinese, Spanish, French, an unspecified Nigerian language, and Arabic; those reported by English speakers were Vietnamese, Spanish, French, Korean, American Sign Language, Mandarin Chinese, German, Gujarati, and Japanese. Four participants whose L1 is English reported some exposure to a third language in childhood. They reported that English is now their dominant language, and their childhood language exposure was to languages that allow the coda consonants used in the current study; thus, their phonotactics should be similar to those of monolingual English speakers within the context of this experiment. As such, data from those participants were included in the analysis.

To minimize the influence of word frequency, word recognition effects, and familiarity, and to control the lexical and phonetic information that could possibly influence the phonemic recognition and McGurk fusion effect, we created six monosyllabic (i.e., CVC) and six bisyllabic (i.e., CVCV) nonsense words in the current study. A complete list of nonsense words used as target stimuli can be found in Table 2 .


Table 2. Target stimuli, in IPA (Monosyllabic CVC words and Bisyllabic CVCV words, which were treated as additional fillers in the current data analysis).

To focus on investigating the main factors influencing the difference in McGurk fusion rate depending on consonant position, the CVCV words were used as extra fillers in addition to 14 monosyllabic and 14 bisyllabic filler words. All the words were phonemically and phonotactically valid in both English and Japanese, except for the coda consonants of the CVC words in Japanese. A linguistically trained female native English speaker was video-recorded with a video camera (Panasonic 4K PROFESSIONAL) and audio-recorded with a separate microphone (SHURE KSM44A) while pronouncing these nonsense words, which were presented to her in IPA transcription. Each word was pronounced three times by the talker. We selected the second pronunciation of each word as the target stimulus to avoid list-final intonation. The video recordings were made in a soundproof booth at San Diego State University. Each word's recording was cropped to approximately 1500 ms.

For the audio-only stimuli, the audio track of each video was extracted and stored as a WAV file. For the audio-visual congruent files, the video kept its original audio and was exported to a QuickTime MOV file. For incongruent audio-visual pairs, each CVC stimulus containing audio /p/ was paired with the corresponding visual /t/ and /k/, and likewise, audio /k/ with visual /p/ and /t/, resulting in eight monosyllabic incongruent audio-visual stimuli. The audio was dubbed onto the video by aligning the onset of the target consonant in the audio and video tracks to within 20 ms. As a result, a total of eight audio-visually incongruent CVC words and six audio-visually congruent CVC words were created as test stimuli, in addition to 42 filler words (8 audio-visually incongruent bisyllabic words, 6 audio-visually congruent bisyllabic words, 14 monosyllabic congruent filler words, and 14 bisyllabic congruent filler words). All stimulus editing was done in Adobe Premiere Pro CC.

To discourage participants from attending exclusively to the audio information, noise was added to the audio recordings. Multi-talker babble (MTB) was used as background noise to mimic a situation in which participants are listening to a conversation in a busy café or public space. Since listeners with normal hearing can be more adversely affected by multi-talker babble in their native language than by babble in other languages ( Van Engen and Bradlow, 2007 ), we created the MTB with recordings of both native Japanese and native English speakers. The MTB consisted of three "stories" created by randomly sequencing the filler words used in the experimental phase. Each story consisted of three to four nonsense sentences and was approximately 20 s long. Three native English speakers (1 male, 2 female) and three native Japanese speakers (1 male, 2 female) were recorded for each version. All six recordings were combined and loudness-normalized using Audacity 2.3.0.

The experimental session was conducted in the participant’s native language, either in English or Japanese. After filling out the consent form and language background questionnaire, each participant took four practice trials, consisting of two congruent audio-visual filler bisyllabic words and two audio-only filler bisyllabic words, so that participants would understand both types of stimuli in the experiment. The first instance of each was presented without MTB, so that the participant could recognize the voice of the main speaker apart from the background noise. After completing the practice trials, the investigator confirmed that the participant understood the experiment and then participants were left to complete the experiment alone, in a closed, quiet room.

The experiment consisted of two blocks: an audio-only block and an audio-visual block. The order of these blocks was counterbalanced among participants. In the audio-only block, each participant listened to a total of 40 words: 6 monosyllabic targets, 14 monosyllabic fillers, and 20 bisyllabic fillers. In the audio-visual block, each participant watched 56 words: 14 monosyllabic targets, 14 monosyllabic fillers, and 28 bisyllabic fillers. The order of the stimuli was randomized within each block. In both blocks, a static picture of the speaker was shown for 1500 ms before the target recording played. In the audio-only block, the static picture of the speaker continued to be shown while the target audio played, while in the audio-visual block, the video played at the same time as the audio. After the target audio, a different static picture of the speaker appeared for another 1000 ms. MTB played at a constant volume throughout all three segments. After the 1000 ms static picture, the MTB stopped, and text on the screen asked participants to type in the word that they thought they heard. The next trial started whenever participants hit the space key, so that participants could take breaks as needed, and no feedback was given during the experiment.

Each participant was asked to listen to a series of words in an unfamiliar language played through Labsonic LS255 School Headphones while watching either a video or a static picture of the main speaker on the 13″ screen of a MacBook Air laptop. The laptop was placed on a 24″ TaoTronics standing desk riser so that participants could adjust the screen to their eye level if needed. Each stimulus was presented using the experimental control software PsychoPy3. In each trial, participants typed their response directly on the display using the Roman-alphabet keyboard (as Japanese orthography has no adequate way of encoding coda consonants). Participants' responses and response times for each trial were logged by PsychoPy3. At the end of the experimental session, the primary investigator informally asked each participant whether there was anything they noticed during the experiment.

Data Analysis

Each participant's typed responses were transcribed before the actual data analysis. Since Japanese orthography does not have a spelling convention to express word-final consonants, the Roman alphabet was used by both language groups to report what they heard. Because all of the Japanese speakers had some exposure to alphabetic writing systems, they did not report significant difficulty with this request, and their answers generally adhered to the forms of English orthography (e.g., participants did not provide ill-formed sequences like "ihpk"). As we will see in the audio-visual congruent results, the Japanese speakers performed near ceiling when the data were sufficiently clear, suggesting that responding in English orthography did not present a significant difficulty for the Japanese-speaking participants.

As English orthography sometimes maps multiple letters onto a single sound, some responses required translation into a phonemic form (e.g., word-initial "c" and "ch" as representations of the sound [k]). Generally, these spelling conventions were straightforward, and we developed a set of shared rules for both languages' participants to convert the data from letter strings to sounds (e.g., word-initial "c" and "ch" and word-final "ck" and "c" transcribed as /k/; see Appendix A for the full spelling conversions). After this conversion, the reported sounds were checked against the true values of the stimuli. Each response was categorized into one of the seven categories listed below. Some categories are restricted to certain conditions, which are listed in brackets (e.g., an audio-only recording cannot be in category V, as it cannot be consistent with non-existent video):

• Audio response (A): response that shares place and manner of articulation features of the audio component [audio-only or audio-visual incongruent].

• Visual response (V): response that shares place and manner of articulation features of the video component [audio-visual incongruent only].

• Audio-visual response (AV): response that shares place and manner of articulation features of both audio and video components [audio-visual congruent only].

• No consonant response (NA): response lacks a consonant in the onset or coda position being tested [all conditions].

• Mid-fusion response (midF): response of a single consonant that reflects the perception of a third sound which is different from both audio and visual information but at an articulatory place between them [audio-visual incongruent only].

• Two-letter fusion response (bothF): response containing multiple consonants, one consistent with the audio information and another consistent with the visual information, when the stimulus is audio-visually incongruent [audio-visual incongruent only].

• Other response (O): response that had at least one consonant in the target position but does not fit any other above category [all conditions].

Some research treats audio-visual fusion effects, where the perceived sound lies articulatorily between the audio and video phones, as the only relevant McGurk effect, while others include any deviation from the audio information as a McGurk effect ( Tiippana, 2014 ). In the present study, in order to make our data comparable to those of Ali et al., only consonants that represent the articulatory midpoint of the auditorily and visually presented phones are considered McGurk fusion. Voicing values were disregarded (e.g., "b" and "p" were treated as equivalent), since all target sounds were voiceless, and the intended McGurk effects only affected the perceived place of articulation. Lastly, participants occasionally included multiple consonants in the target positions. When these agreed in place and manner, or when one was a stop and the other was not, they were treated as a single stop and classified as above. When they formed a common English digraph (e.g., "ch," "sh"), they were also treated as a single sound. Other cases are listed in Appendix A.
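As a concrete illustration of these decision rules, the following R sketch classifies a single response given the audio and, optionally, visual consonant of a stimulus. It is our own reconstruction of the rules described above, not the authors' analysis code: only place of articulation is checked for the stop consonants used in the study, voicing is collapsed as described, and the manner, digraph, and cluster conventions are omitted for brevity.

```r
# Places of articulation ordered so that [t] is the midpoint of [p] and [k];
# voiced symbols map to the same values because voicing was disregarded.
place <- c(p = 1, b = 1, t = 2, d = 2, k = 3, g = 3)

categorize <- function(response_cons, audio_cons, visual_cons = NA) {
  r <- place[response_cons]                      # consonant(s) reported in the target slot
  a <- place[audio_cons]
  v <- if (is.na(visual_cons)) NA else place[visual_cons]

  if (length(r) == 0 || all(is.na(r))) return("NA")   # no consonant in the target position
  if (is.na(v)) {                                     # audio-only stimulus
    return(if (any(r == a)) "A" else "O")
  }
  if (a == v) {                                       # audio-visual congruent stimulus
    return(if (any(r == a)) "AV" else "O")
  }
  # audio-visual incongruent stimulus
  if (length(r) > 1 && any(r == a) && any(r == v)) return("bothF")
  if (all(r == a)) return("A")
  if (all(r == v)) return("V")
  if (all(r == (a + v) / 2)) return("midF")           # e.g., [t] for audio [p] + visual [k]
  "O"
}

categorize("t", audio_cons = "p", visual_cons = "k")   # "midF"
categorize("k", audio_cons = "k", visual_cons = "p")   # "A"
```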

We will discuss each of the three conditions (audio-only, audio-visual congruent, and audio-visual incongruent) in order. The audio-only data will establish the baseline confusability of the auditory components for audio-visual cases, in addition to directly testing the Phonetic- and Phonological-Superiority Hypotheses. The audio-visual congruent case will establish how participants are able to incorporate additional cues, providing a sense of how much they rely on each modality. Finally, considering the audio-visual incongruent case will test potential Phonological-Superiority effects.

Audio-Only Block

Table 3 summarizes the participants' overall percentage of correct responses for the target consonants, split by consonantal position, in the audio-only stimuli. Binomial tests showed that English speakers perceived both onset and coda target consonants in monosyllabic CVC words significantly above chance (onsets: p < 0.001, codas: p < 0.05), as expected. Japanese speakers perceived the target consonants significantly above chance when they appear in onset position (p < 0.001), but perceived them significantly below chance in coda position (p < 0.01). Furthermore, both language groups showed significant differences in onset and coda accuracy, with both groups having lower accuracy on codas (English speakers: χ² = 6.1295, df = 1, p < 0.05; Japanese speakers: χ² = 25.115, df = 1, p < 0.001). Thus, overall accuracy patterns in identifying the target consonants [p], [t], and [k] were similar across the two language groups, with lower overall accuracy in Japanese, and in coda positions. This seems to be consistent with a primarily phonetic influence on perception, since English speakers have no phonotactic reason to underperform on coda identification.


Table 3. Audio-only accuracies for onsets and codas in CVC stimuli.
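For readers who want to reproduce this style of analysis, the R sketch below shows the form the reported tests can take. The counts, the assumed chance level of 1/3, and the choice about continuity correction are illustrative assumptions, not values from the study.

```r
# Hypothetical counts for one language group: rows = position, columns = outcome.
eng <- matrix(c(95, 13,    # onset: correct, incorrect
                78, 30),   # coda:  correct, incorrect
              nrow = 2, byrow = TRUE,
              dimnames = list(position = c("onset", "coda"),
                              outcome  = c("correct", "incorrect")))

chisq.test(eng)                  # onset vs. coda accuracy within one group
binom.test(95, 108, p = 1 / 3)   # onset accuracy against an assumed 1/3 chance level
```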

The phonetic influence on perception accuracy suggests that different phones have different baseline accuracies, either due to differences in how easy it is to distinguish the phones, or due to differences in the quality of the individual recordings of the phones in our experiment. Further analysis split the above results based on the specific target consonants, as shown in Table 4 . This analysis found that some consonants had significantly different accuracies, which depended on consonant identity, onset/coda position, and language.


Table 4. Audio-only accuracies for onsets and codas in CVC stimuli, split by correct consonant identity.

Binomial tests show that the English speakers' accuracies in perceiving onset [t] and [k] were significantly higher than chance (p < 0.01), while onset [p] perception was not significantly different from chance. For English coda perception, [k] was perceived significantly more accurately than chance (p < 0.01), but [p] and [t] were not. Japanese speakers' response patterns for onset target consonants were similar to those of English speakers, with significantly greater-than-chance performance on [t] and [k], but not [p]. In coda position, no consonant was perceived above chance by the Japanese speakers. Numerically, however, the same basic pattern appears in both languages, in both positions, with the exception of Japanese coda [k] perception. Again, there is no obvious evidence for large-scale phonotactic influences on accuracy across the two languages; instead, individual phonetic differences seem to be the best explanation for this pattern.

However, binomial tests compare each accuracy against chance, rather than against one another. The Phonological-Superiority Hypothesis predicts that speakers of Japanese will be relatively more affected by coda-position difficulties than English speakers are. We can test this directly with a logistic regression model, fit with R's glm function, that predicts participant accuracy from three control factors: participant's L1, consonant position, and consonant identity. We also include two interaction terms: the interaction between L1 and consonant position, and between L1 and consonant identity. The first interaction is where we expect to see an effect if there is a phonotactic influence on the accuracies; the second is included as a control in case some of the consonants are easier or harder to identify based on L1 phonology. Table 5 contains the results of this model.
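A minimal sketch of how such a model can be specified with R's glm, following the description above. The data frame d, its column names, and the explicit reference levels (English, onset, [k], matching the reported "default" case) are our assumptions rather than details given in the paper.

```r
# Trial-level data assumed to be in a data frame d with one row per response:
# correct (0/1), l1, position, and consonant.
d$l1        <- relevel(factor(d$l1),        ref = "English")
d$position  <- relevel(factor(d$position),  ref = "onset")
d$consonant <- relevel(factor(d$consonant), ref = "k")

m <- glm(correct ~ l1 * position + l1 * consonant,
         family = binomial, data = d)
summary(m)   # coefficients are log-odds changes relative to the default case
```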


Table 5. Logistic regression coefficients (log-odds change, with standard error in parentheses) for the audio-only model.

The values in Table 5 are the log-odds effects of changes from the "default" case of an English-speaking participant identifying a [k] in onset position, the situation with the best performance. Based on these values, for instance, an English speaker identifying a [t] in onset position would have log-odds of a correct identification of 1.62 (the default 3.29, minus 1.67 for switching from [k] to [t]). Interpreting log-odds is slightly complex, but the key findings here are that switching from onset to coda has a significant negative effect on accuracy, as does switching between the individual phones. However, the effect of Japanese as an L1 instead of English is numerically negative but not significant. Likewise, our key factor for identifying a phonological effect—the interaction between L1 and consonant position—is also not quite significant (p = 0.10). (The L1-by-consonant interaction was also not significant and is omitted from the table.)
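For reference, the quoted log-odds convert to predicted probabilities via R's logistic function plogis; the lines below simply restate the arithmetic of the example in the text.

```r
plogis(3.29)          # default case (English speaker, onset [k]): about 0.96
plogis(3.29 - 1.67)   # English speaker, onset [t]: log-odds 1.62, about 0.83
```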

Overall, it seems that there may be a small phonotactic/phonological effect in the audio-only data, but the strongest, and only significant, effects appear to be more consistent with phonetic influences. [k] appears to be an especially identifiable consonant, compared to [p] and [t], and onsets appear to be more recognizable regardless of a listener's preferred phonotactics. However, the large numeric drop-off in Japanese coda performance suggests that phonological knowledge plays an important role as well. We will continue this analysis with the audio-visual congruent data below.

Audio-Visual Block

Audio-Visual Congruent

The McGurk effects that Ali et al. (2011) tested rely on an audio-video mismatch. To establish a baseline performance, we first look at audio-visual congruent data, where the participant sees the actual video of the speaker pronouncing the words. The perception accuracies for each target consonant in different positions are provided in Table 6 , and they are almost uniformly above chance.


Table 6. Audio-visual congruent accuracies for onsets and codas in CVC stimuli, split by correct consonant identity.

Participants benefited greatly from the visual information, with uniformly higher scores than in the audio-only condition. In aggregate, this is unsurprising, since the audio-only stimuli were embedded in multi-talker babble that introduced significant noise. The relatively clean visual data were especially helpful for identifying the lip closure that differentiates [p] from the non-labial [t] and [k] sounds, but surprisingly, visual information was also extremely helpful in distinguishing between [t] and [k] sounds, despite them looking very visually similar. Visual information also appears to have been used effectively by speakers of both languages, with the only accuracy that was not significantly greater than chance being Japanese [t] codas, which had the lowest accuracy in the audio-only condition. Even so, visual information had the single greatest impact for Japanese coda identification, and Japanese [t] coda accuracy was significantly higher in the audio-visual than the audio-only condition by a chi-squared test (χ² = 7.3143, df = 1, p < 0.01). All the consonants that were not perceived accurately in the audio-only condition were perceived correctly when compatible visual information was provided. This confirms that listeners, especially when the audio input is noisy, rely on visual information in order to perceive and process speech cross-linguistically.

These results are more in line with the Phonetic-Superiority Hypothesis. The only conditions that are not essentially at ceiling are Japanese listeners identifying coda consonants. The visual information seems to be sufficient to overcome the phonetic noise, and the remaining difficulty is focused on the one phonotactically illicit position. This suggests that both the Phonetic- and Phonological-Superiority Hypotheses may be slightly off target. Instead, it appears that different tasks can induce stronger phonetic or phonological effects. Furthermore, the ability of both language groups to marshal visual information when the audio information is weak suggests that phonetic and phonological information are only part of the information used in phonemic identification. This suggests a more complex alternative hypothesis: that phonetic and phonological factors are integrated in an information-based framework, based on the perceived reliability of each cue type in the present task. We will discuss this possibility further in the section "Discussion," but first, let us turn to the audio-visual incongruent data to examine McGurk fusion effects across the languages.

Audio-Visual Incongruent

Given how large a role visual information played in the AV-congruent data, we expect to see significant McGurk effects in the AV-incongruent data. We categorize the responses to understand how exactly participants resolved the conflicts, focusing on four critical categories of responses in these data, based on the division in section "Data Analysis":

• A: an answer consistent with the audio, rather than visual, information.

• V: an answer consistent with the visual, rather than audio, information.

• F: an answer that fuses the two modalities by producing a place of articulation between the audio and visual components (e.g., reporting [t] when hearing [p] but seeing [k]).

• O: all other answers.

The more fine-grained distinction will help reveal the relative importance and confidence that participants assigned to each of the modalities. An A or V response reflects higher confidence in audio or visual information, respectively. An F response reflects comparable importance assigned to both modalities, leading the participant to seek middle ground that is partially consistent with each. An O response reflects low confidence in both, leading participants to choose options that are inconsistent with both modalities. Overall, we will see that participants change their responses in line with this trade-off between confidence levels in the information that each modality supplies, suggesting the need for a more nuanced integration of phonetic, phonological, and other sources of information in L2 phonetic categorization than the superiority hypotheses offer.

We analyzed two types of audio-visual incongruent pairs separately: (a) fusible incongruent stimuli and (b) non-fusible incongruent stimuli. Fusible incongruent stimuli consisted of an audio /p/ and visual /k/ (AV[pk]) or audio /k/ and visual /p/ (AV[kp]). McGurk fusion on these cases results in perception of /t/ (midF response category), an articulatory midpoint between the two inputs. The non-fusible incongruent stimuli are ones composed of audio /p/ or /k/ with visual /t/ (AV[pt] and AV[kt]). There is no articulatory midpoint between the inputs, so “fusion” responses are not possible (see Table 7 ).


Table 7. Proportion of responses in each category for fusible audio-visual incongruent stimuli.

Beginning with the fusible results, shown in Table 7, the most obvious pattern is that the choice of audio and video data has a significant effect on participants' preferences. In onset positions, for both Japanese and English listeners, audio [p] with visual [k] (the AV[pk] condition) induced primarily fusion responses of [t]. Listeners appear to have similar trust in the audio and visual datastreams, and seek out a compromise response that does not favor one over the other. This was significantly different, by a chi-squared test (χ² = 30.066, df = 1, p < 0.001), from onsets with audio [k] and visual [p] (the AV[kp] condition) for each L1. Audio [k] in onset was favored by listeners from both L1s, with no fusion responses, suggesting that this audio information was viewed as more reliable than the video information.

The audio-only results are helpful in understanding this switch. In onset positions, listeners from both L1s found [p] harder to identify than [k]. When the audio was less reliable ([p]), listeners incorporated the visual data to help their categorization, leading to fusion. When the audio was more reliable ([k]), the visual data were overruled by the audio. It appears that listeners, regardless of language, reach their categorization decisions by balancing the information that each modality provides against the perceived reliability of that cue’s information.

A similar pattern emerges for codas. In the AV[pk] condition, both English and Japanese listeners respond primarily with visually consistent or other responses (critically, they only rarely respond with audio-consistent or fusion responses). Again, this reflects the participants’ perception of the cue reliability; coda [p] had very low accuracy in the audio-only condition in both languages, so non-audio information dominates. The AV[kp] condition may be a little surprising; coda [k] accuracy was high for English listeners, yet both English and Japanese listeners favor visual information in the incongruent presentation. However, visual [p] with lip closure is stronger visual information than visual [t] or [k], and this may explain the phenomenon.

Both English and Japanese speakers fused significantly more when the audio-visually incongruent AV[pk] was presented than AV[kp] (χ² = 30.066, df = 1, p < 0.001). In AV[pk] perception, fusion was significantly more common in onset position than in coda position. In AV[kp] perception, speakers of both languages preferred auditorily consistent answers in onset position, whereas they showed a strong preference for visually consistent answers in coda position. Note that McGurk fusion is quite rare in this dataset, and only occurs in appreciable amounts in onset positions with [p] audio.

Non-fusible stimuli showed a similar overall pattern to the fusible stimuli. Visual [t] dominated auditory [p] regardless of language background and syllabic position, although in coda position, many responses were inconsistent with either the auditory or visual information. On the other hand, auditory [k] dominated visual [t] regardless of language background or syllabic position. Again, this appears to be consistent with participants preferring the input cue that they find more reliable, rather than relying on phonetic or phonological information across the board.

To test the idea that the perceived auditory reliability influences how closely participants adhere to the auditory and visual inputs, we performed a post-hoc correlation test. For each of the incongruent stimuli, we used the audio-only accuracies from Table 4 to predict the rate of audio-consistent responses in Tables 7 and 8. The basic idea is that when participants are able to use the audio information to categorize the sound, they prefer to rely on it. When the audio information is noisy or otherwise unreliable, they turn to whatever other information is available, whether that is visual information, phonological preferences, or something else. If none is perceived as reliable, they guess.


Table 8. Proportion of responses in each category for non-fusible audio-visual incongruent stimuli.

The correlation pattern is shown in Figure 1 . Each dot represents an L1, syllabic position, and consonant identity combination. Correlations were fit separately for each language, to account for individual preferences toward one modality over the other; for instance, Hayashi and Sekiyama (1998) found different preferences for visual information cross-linguistically. In both languages, the correlation test found a significant correlation between audio-only accuracy and audio-consistent responses (Japanese: t = 4.18, df = 6, p < 0.01; English: t = 3.51, df = 6, p < 0.05), which fits with the idea that participants adjust their answers to account for the reliability of the auditory information they are receiving.


Figure 1. Scatterplot of audio-only accuracy and audio-consistent responses across L1, syllable position, and consonants. Correlations were significant for both languages.
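A sketch of this post-hoc test in R. Each language contributes eight points (df = 6); the vectors below are placeholders standing in for the relevant cells of Tables 4, 7, and 8, not the study's actual values.

```r
# Placeholder values: audio-only accuracy for each incongruent stimulus type,
# and the rate of audio-consistent responses to that stimulus type (one language).
audio_only_acc   <- c(0.55, 0.90, 0.35, 0.70, 0.50, 0.85, 0.30, 0.65)
audio_consistent <- c(0.20, 0.75, 0.10, 0.55, 0.15, 0.70, 0.05, 0.50)

cor.test(audio_only_acc, audio_consistent)   # Pearson correlation, df = n - 2 = 6
```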

Overall, unlike the results from previous studies (e.g., Ali et al., 2011 ), the results from the audio-visual incongruent condition show more fusion in the onsets than in the codas, and do not show an interaction of consonant position and listeners’ L1 phonology. This seems to be an argument against the Phonological-Superiority Hypothesis. However, participants’ reliance on auditory information is significantly positively correlated with the audio-only accuracy, and patterns slightly differently in each language.

Discussion

The current study was conducted to further assess the influence of L1 phonology on non-native speech perception and the McGurk fusion effect, beyond the findings of Ali et al. (2011). To focus on the influence of syllable structure on speech perception, we tested Japanese alongside English: English freely permits CVC syllables, whereas Japanese predominantly prefers CV syllables and strictly restricts consonants in coda position. Moreover, in order to investigate the influence of acoustic quality and of listeners' phonological knowledge (e.g., phonotactic constraints), we used monosyllabic nonsense words, rather than real words, composed of phonemes that are common to the English and Japanese phonemic inventories.

We first conducted the audio-only condition to further investigate the influence of the stimuli's acoustic quality and of listeners' L1 phonotactic knowledge over the course of speech perception. We used this condition to look for phonetic and phonological influences, testing the Phonetic- and Phonological-Superiority Hypotheses. There was clear evidence of phonetic influences, depending on syllable position and phoneme identity. Phonological influence, in the form of a difference between Japanese and English listeners' performance on onset versus coda consonant identification, was numerically present but did not reach significance. Within this experiment, the Phonological-Superiority Hypothesis does not appear to hold, but the Phonetic-Superiority Hypothesis may not hold either.

The overall results from the audio-only condition demonstrated that both the acoustic quality of the audio input and the listener's phonological knowledge can influence perception of unfamiliar speech sounds, with clearer evidence for an effect of acoustic quality. Although Japanese listeners showed lower accuracy in coda consonant perception than English listeners, the fact that English listeners did not perceive all the coda consonants accurately suggests that listeners combine both the acoustic information available in the input and the phonological information in their mental representations over the course of speech perception, especially when the input data are noisy.

To probe the nature of this relationship more carefully, we introduced visual information that was either consistent with the audio information or inconsistent with it. When the visual information was consistent with the audio information, participants were able to achieve high rates of phonemic categorization accuracy, even for consonants that were very difficult in the audio-only condition. This indicates that people "listen" to more than just the acoustic signal available in the input and integrate multiple cues in order to extract more precise estimates of phonemic identity, even in unfamiliar languages. The relatively higher accuracy for consonant perception in both syllable positions by both language groups in the audio-visual congruent condition suggests that visual information provides strong cues, and is especially important when the auditory information is incomplete. With sufficiently clear L2 input (i.e., auditorily and visually presented information), phonetic and phonological effects on phonemic categorization can be overcome—at least when the phonemic categories are familiar from the L1.

Taking these observations into consideration, we argue that neither the Phonetic- nor the Phonological-Superiority Hypothesis can fully explain non-native speech perception. Rather, when listeners perceive speech sounds in natural settings, they combine multiple cues, such as acoustic information, phonological knowledge of the language(s) previously acquired, and visually presented information about articulation, in order to resolve the difficulty of identifying unfamiliar sounds. In some situations, this can manifest as a Phonetic-Superiority effect (e.g., when the phonetic information is strong enough to minimize phonological difficulties). In others, it can manifest as Phonological-Superiority (e.g., when phonetic information is comparable across syllabic positions). In still other situations, other factors can dominate, such as the visual information in some of the congruent and incongruent conditions.

So far, these results show that speech perception is influenced by multiple types of information available both in the input and in the listener's mental representation. Indeed, it is well established that listeners attend to multiple cues to different extents, depending on the task and the quality of the input, during non-native speech perception ( Detey and Nespoulous, 2008 ; Escudero and Wanrooij, 2010 ). What is interesting, though, is that listeners seem to combine not only auditorily and visually presented cues but also their L1 and/or L2 knowledge, even in situations where they are provided with inconsistent inputs from different modalities.

When the inputs conflict, how do learners resolve the conflicts? Unlike the findings reported in previous studies (e.g., Ali et al., 2011 ), in the audio-visual incongruent condition in the present study, both English and Japanese speakers showed very similar McGurk effects, including in the particular case of McGurk fusion, in each consonant position. The only stimulus that elicited a strong McGurk fusion effect was AV[pk] (audio /p/ with visual /k/) in onset position but not in coda position. This is the opposite of what was originally found among English speakers in the previous studies, which found increased fusion in codas. Instead, we found elevated rates of responses that were not consistent with the audio information, but were not necessarily fusion responses.

There are three potential explanations for the difference between the results reported in previous studies and those of the current study. The first is that, in the previous studies, information other than acoustic or phonological information (e.g., semantic information) may have been available in the stimuli, since the researchers used real words in English and Arabic. As a result, their English listeners' perception of the Arabic stimuli may have been influenced not only by phonetic or phonological factors but also by something else, such as lexical knowledge or word frequency. Consequently, the previous results may have shown a different fusion pattern because of these additional factors influencing speech perception. The second possible explanation, for the Japanese speakers in the current study showing fusion patterns similar to those of English listeners, is that exposure to English may have changed the way the Japanese participants perceive non-native sounds. Although none of our Japanese participants had lived in a country where English is spoken as a common language for more than a year prior to participation, almost all of them were students studying English at the ALI (American Language Institute) in San Diego. Though most of them were tested within 30 days after their arrival in the U.S., this exposure to English may have made them more familiar with coda consonants. Thus, it is reasonable to argue that although Japanese phonology does not allow /p/, /t/, or /k/ in coda position unless geminated, the Japanese participants in the present study perceived the stimuli in a way similar to English speakers. Consequently, there was no significant cross-language difference in McGurk fusion rates between the English-speaking and Japanese-speaking participants. Lastly, the difference between the results of the present study and those of previous studies may reflect a phonetic influence of the specific stimuli being used. The audio stimuli used in the present study were less reliable, due to the added noise, than those used in previous studies. Since the visual information was essentially noiseless compared to the audio files, our participants may have been more reliant on visual information.

The present study provides strong evidence that listeners integrate multiple cues available in the input, such as acoustic and visual information, together with phonological knowledge of the language or languages they have previously acquired. Although how and when listeners shift their reliance on each cue remains an open question, the results from the three conditions suggest that listeners integrate acoustically and visually presented information, as well as phonological knowledge, over the course of speech perception, and that they unconsciously balance their reliance on these different sources of information depending on the certainty they have established about, and the reliability of, each available cue.

In order to further investigate how listeners integrate acoustic, visual, and phonological cues available in the input when the audio and visual cues disagree in place of articulation, separate data from a visual-only condition are required. Is visual information more dominant than audio information, or vice versa? Or do listeners attempt to resolve this disagreement between two simultaneously presented cues in a way that is consistent with both? A complete answer to these questions will require further research. However, the findings from the present study provide insights into the bilingual and multilingual speech perception process and the influences of L1 and L2 structure when listeners encounter speech sounds in an unfamiliar language.

Data Availability Statement

The datasets generated for this study are available on request to the corresponding author.

Ethics Statement

The studies involving human participants were reviewed and approved by Graduate and Research Affairs, San Diego State University. The participants provided their written informed consent to participate in this study.

Author Contributions

KY and GD contributed to conception and design of the study, performed the data analysis, contributed to manuscript revision, and read and approved the submitted version. KY collected data. KY wrote the first draft of the manuscript. Both authors contributed to the article and approved the submitted version.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Ali, A. N. (2003). "Perception difficulties and errors in multimodal speech: the case of consonants," in Proceedings of the 15th International Congress of Phonetic Sciences , eds D. Recasens and J. Romero (Adelaide, Australia: Causal Productions), 2317–2320.


Ali, A. N., and Ingleby, M. (2002). Perception difficulties and errors in multimodal speech: the case of vowels. In Proceedings of the 9th Australian International Conference on Speech Science and Technology. Canberra, Australia: Australian Speech Science and Technology Association, 438–444.

Ali, A. N., and Ingleby, M. (2005). Probing the syllabic structure of words using the audio-visual McGurk effect. Proc. Ann. Meet. Cognit. Sci. Soc. 27, 21–23.

Ali, A. N., Ingleby, M., and Peebles, D. (2011). “Anglophone perceptions of Arabic syllable structure,” in Handbook of the syllable , eds C. E. Cairns and E. Raimy (Boston, MA: Brill), 329–351. doi: 10.1163/ej.9789004187405.i-464.93


Best, C. T., and Tyler, M. D. (2007). "Nonnative and second-language speech perception: commonalities and complementarities," in Language experience in second language speech learning: In honor of James Emil Flege , eds M. Munro and O.-S. Bohn (Western Sydney: MARCS Auditory Laboratories), 13–34. doi: 10.1075/lllt.17.07bes


Broersma, M. (2005). Perception of familiar contrasts in unfamiliar positions. J. Acoust. Soc. Am. 117, 3890–3901. doi: 10.1121/1.1906060

Carlson, M. T., Goldrick, M., Blasingame, M., and Fink, A. (2016). Navigating conflicting phonotactic constraints in bilingual speech perception. Bilingualism Lang. Cognit. 19, 939–954. doi: 10.1017/s1366728915000334

Cho, T., and McQueen, J. M. (2006). Phonological versus phonetic cues in native and nonnative listening: Korean and Dutch listeners' perception of Dutch and English consonants. J. Acoust. Soc. Am. 119, 3085–3096. doi: 10.1121/1.2188917

Cutler, A., Mehler, J., Norris, D., and Segui, J. (1986). The syllable's differing role in the segmentation of French and English. J. Mem. Lang. 25, 385–400. doi: 10.1016/0749-596x(86)90033-1

Detey, S., and Nespoulous, J. L. (2008). Can orthography influence second language syllabic segmentation? Japanese epenthetic vowels and French consonantal clusters. Lingua 118, 66–81. doi: 10.1016/j.lingua.2007.04.003

Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., and Mehler, J. (1999). Epenthetic vowels in Japanese: a perceptual illusion? J. Exp. Psychol. Hum. Percep. Perform. 25, 1568–1578. doi: 10.1037/0096-1523.25.6.1568

Escudero, P., and Vasiliev, P. (2011). Cross-language acoustic similarity predicts perceptual assimilation of Canadian English and Canadian French vowels. J. Acoust. Soc. Am. 130, 277–283.

Escudero, P., and Wanrooij, K. (2010). The effect of L1 orthography on non-native vowel perception. Lang. Speech 53, 343–365. doi: 10.1177/0023830910371447

Flege, J. E. (2003). “Assessing constraints on second-language segmental production and perception,” in Phonetics and phonology in language comprehension and production, differences and similarities , eds N. Schiller and A. Meyer (Berlin: Mouton De Gruyter), 319–355.

Hayashi, T., and Sekiyama, K. (1998). Native-foreign language effect in the McGurk effect: A test with Chinese and Japanese. In Proceedings of auditory-visual speech processing (AVSP’98). Baixas, France: International Speech Communication Association, 61–66.

Kartushina, N., and Frauenfelder, U. H. (2013). On the role of L1 speech production in L2 perception: evidence from Spanish learners of French. Proc. Interspeech 14, 2118–2122.

Lin, H. (2001). A grammar of Mandarin Chinese. München: Lincom Europa.

Lisker, L. (1999). Perceiving final voiceless stops without release: effects of preceding monophthongs versus nonmonophthongs. Phonetica 56, 44–55. doi: 10.1159/000028440

Massaro, D. W., and Cohen, M. M. (2000). Tests of auditory–visual integration efficiency within the framework of the fuzzy logical model of perception. J. Acoust. Soc. Am. 108, 784–789. doi: 10.1121/1.429611

Maye, J., Werker, J. F., and Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition 82, B101–B111.

McGurk, H., and MacDonald, J. (1976). Hearing lips and seeing voices. Nature 264, 746–748. doi: 10.1038/264746a0

Otake, T. (2007). “Interlingual near homophonic words and phrases in L2 listening: evidence from misheard song lyrics,” in Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS 2007) , eds J. Trouvain and W. J. Barry (Saarbrücken: University of Glasgow), 777–780.

Otake, T., Hatano, G., Cutler, A., and Mehler, J. (1993). Mora or syllable? Speech segmentation in Japanese. J. Mem. Lang. 32, 258–278. doi: 10.1006/jmla.1993.1014

Pajak, B., and Levy, R. (2012). Distributional learning of L2 phonological categories by listeners with different language backgrounds. In Proceedings of the 36th Boston University conference on language development. Somerville, MA: Cascadilla Press, 400–413.

Pallier, C., Colomé, A., and Sebastián-Gallés, N. (2001). The influence of native-language phonology on lexical access: exemplar-based versus abstract lexical entries. Psychol. Sci. 12, 445–449. doi: 10.1111/1467-9280.00383

Schulpen, B., Dijkstra, T., Schriefers, H., and Hasper, M. (2003). Recognition of interlingual homophones in bilingual auditory word recognition. J. Exp. Psychol. Hum. Percep. Perform. 29, 1155–1178. doi: 10.1037/0096-1523.29.6.1155

Sheldon, A., and Strange, W. (1982). The acquisition of /r/ and /l/ by Japanese learners of English: evidence that speech production can precede speech perception. Appl. Psycholinguist. 3, 243–261. doi: 10.1017/s0142716400001417

Tiippana, K. (2014). What is the McGurk effect? Front. Psychol. 5:725. doi: 10.3389/fpsyg.2014.00725

Tiippana, K., Andersen, T. S., and Sams, M. (2004). Visual attention modulates audiovisual speech perception. Eur. J. Cognit. Psychol. 16, 457–472. doi: 10.1080/09541440340000268

Tsukada, K. (2004). Cross-language perception of final stops in Thai and English: A comparison of native and non-native listeners. In Proceedings of the 10th Australian International Conference on Speech Science and Technology. Canberra, Australia: Australasian Speech Science and Technology Association, 563–568.

Tsukada, K., and Ishihara, S. (2007). The effect of first language (L1) in cross-language speech perception: comparison of word-final stop discrimination by English, Japanese and Thai listeners. J. Phonetic Soc. Japan 11, 82–92.

Van Engen, K. J., and Bradlow, A. R. (2007). Sentence recognition in native and foreign-language multi-talker background noise. J. Acoust. Soc. Am. 121, 519–526. doi: 10.1121/1.2400666

Weber, A., and Cutler, A. (2004). Lexical competition in nonnative spoken-word recognition. J. Mem. Lang. 50, 1–24.

Werker, J. F., Pons, F., Dietrich, C., Kajikawa, S., Fais, L., and Amano, S. (2007). Infant-directed speech supports phonetic category learning in English and Japanese. Cognition 103, 147–162. doi: 10.1016/j.cognition.2006.03.006

1. “c” and “ch” in word initial position are mostly transcribed as /k/ (this is because both “c” and “ch” responses were common in “k” stimuli but virtually unattested for “t” stimuli) except when the response started with “cer-” or “cea-,” in which case /s/ was used, since the letter “c” in those contexts is usually pronounced as [s] in English.

2. “ck” and “c” occurring word-finally were transcribed as /k/ (since in English word-final “ck” and “c” are usually recognized as /k/, as in “dock” or “tactic”); a rough code sketch of rules 1 and 2 follows this list.

3. “gh” or “ght” at the end of the word were transcribed as /w/ or /f/.

4. Homorganic consonant clusters (a sequence of two or more consonants that have the same place of articulation, like [dt], appearing in word-medial positions) were coded as a combination of a glottal stop followed by the consonant (since, in Japanese, a word-medial homorganic consonant cluster is usually recognized as a combination of a glottal stop followed by the consonant, as in “katto” pronounced as [kaʔto]).
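
The following is an editorial illustration, not the authors' actual coding script, of how rules 1 and 2 above might be automated in Python for a typed response; rules 3 and 4 are omitted because they need phonological context that spelling alone does not supply.

    # Rough illustration (not the authors' code) of coding rules 1 and 2 above.
    def code_initial_c(response):
        """Rule 1: word-initial 'c'/'ch' -> /k/, except 'cer-'/'cea-' -> /s/."""
        r = response.lower()
        if r.startswith(("cer", "cea")):
            return "s"
        if r.startswith(("ch", "c")):
            return "k"
        return None  # rule does not apply to this response

    def code_final_c(response):
        """Rule 2: word-final 'ck' or 'c' -> /k/."""
        r = response.lower()
        if r.endswith(("ck", "c")):
            return "k"
        return None

    for word in ["cherry", "ceasing", "kept", "dock", "tactic"]:
        print(word, code_initial_c(word), code_final_c(word))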

Learning to map auditory signals onto phonemic categories is a crucial component of first, second, and multilingual language acquisition. Yet, especially in second or later language acquisition, learners often have difficulty perceiving unfamiliar sounds accurately. This difficulty may stem from the informativity of the acoustic signal itself (the Phonetic-Superiority Hypothesis) or from interference by the phonological knowledge listeners have built up through previous language exposure (the Phonological-Superiority Hypothesis). The present study tested the influence of acoustic informativity and of learners' phonological preferences during speech perception. The findings suggest that listeners integrate the multiple cues available (acoustic, visual, and phonological) in the course of speech perception. On this basis, we propose a cognitively inspired rational cue-integration framework as a third hypothesis to explain how L1 phonological knowledge affects L2 perception. The findings provide insights into bilingual and multilingual speech perception and into the influence of L1 and L2 structure when listeners encounter speech sounds in an unfamiliar language.

Keywords : phonetics, phonology, phonotactics, language acquisition, McGurk effect, cue combination, syllables

Citation: Yasufuku K and Doyle G (2021) Echoes of L1 Syllable Structure in L2 Phoneme Recognition. Front. Psychol. 12:515237. doi: 10.3389/fpsyg.2021.515237

Received: 27 November 2019; Accepted: 23 March 2021; Published: 20 July 2021.


Copyright © 2021 Yasufuku and Doyle. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


how to pronounce hypothesis

/haɪˈpɑːθəsəs/.


the above transcription of hypothesis is a broad (phonemic) transcription according to the conventions of the International Phonetic Association.

hypothesis is pronounced in four syllables



An example use of hypothesis in a speech by a native speaker of american english:

“… a hypothesis is evoked right and then …”

meaning of hypothesis

A hypothesis is a proposed explanation for a phenomenon.

hypothesis frequency in english - C2 level of CEFR

the word hypothesis occurs in English on average 4.2 times per million words; this frequency places it in the study list for the C2 level of the CEFR (Common European Framework of Reference).

topics hypothesis can be related to

it is hard to classify words neatly into topics, since each word can be used in many contexts, but our machine-learning models suggest that hypothesis is often used in the following areas:

1) communication, information, and media;

2) education, science, and technology;


Transcription: a teach-yourself course

This guide also forms Strand 6 of the Teacher Development section.

This guide concerns transcription, not a description of the sounds of English.  For a description of how the sounds of English are made and what mouth parts do, see the in-service guides to pronunciation (new tab). This course is:

  • Firstly, a guide to teaching yourself to transcribe words in what are called their citation forms, i.e., the way they are pronounced when you ask someone to read a list carefully.
  • Secondly, a guide to how things are pronounced in normal but not very rapid or mumbled connected speech.
  • Thirdly, a source of practice in transcribing what you read and what you hear (see the link at the end for the latter).
  • Free.  All materials on this site are covered by a Creative Commons licence which means that you are free to share, copy and amend any of the materials but under certain conditions. You may not use this material for commercial purposes.  The material may be used with fee-paying learners of English but may not be used on fee-paying courses for teachers.  Small excerpts from materials, conventionally attributed, may be used on such courses but wholesale lifting of materials is explicitly forbidden.  There is, of course, no objection at all to providing fee-paying course participants with a link to this guide or any other materials on this site.  Indeed, that is welcomed.

This is what the course covers. If you cannot transcribe or your transcription skills are shaky, do the whole course. If you are here for a specific area of transcription or returning for some revision, the following will help you find what you need. Clicking on - index - at the end of each section will return you to this menu.

The sounds transcribed here are generally those of an educated southern British-English speaker.  That is not intended to imply that the dialect is somehow better than others.  It is one of the conventional ways to do these things. American English pronunciation and any of the other multiple standard forms of the language would be different, especially, but not solely, concerning the vowel sounds. Where there are significant differences (for example in the use of a rhotic or non-rhotic pronunciation) it will be noted.

There are teachers of English who have successful careers in the classroom without ever using more than a minimal amount of phonemic transcription.  Some use none at all.  There are, however, seven good reasons why knowing how to transcribe sounds is a useful skill for a teacher and knowing how to read transcriptions is a useful skill for learners.  Here they are:

  • For the teacher, the ability to transcribe what is heard allows rapid identification of troublesome sounds and other issues that need to be brought to the attention of learners.  One can, of course, rely on a pronouncing or other form of dictionary to do the work but that is time consuming and not always possible.  Freeing yourself from the need to consult a website or a dictionary for the pronunciation of words allows you to focus on what's important.
  • For the learner, the ability to read the transcriptions of pronunciation in a dictionary, mono- or bi-lingual, gives autonomous access to how the word should sound without reference to the spelling or to a model.  Many people, dazzled by the spelling, are unaware that, for example, no and know are identically pronounced or that the words right, rite and write also share a single pronunciation. By the same token, it may not at first be obvious that in the words troupe, bought, should, cough and tourist the combination of the letters o and u is differently pronounced in each case. It is, of course, possible for the teacher to model the pronunciations in the classroom but the ability to note them down in phonemic script is a valuable learning tool.
  • Systematicity Phonemic transcription is independent of the language insofar as it is systematic.  Many attempts have been made to spell English words phonetically but without some unambiguous system of symbols such attempts fail.  Unless we can rely on a generally accepted system, there is, for example, no easy way to show the pronunciation of diphthongs and the difference between long and short vowels without resorting to a range of odd and obscure marks over letters such as ç, â, œ and so on.  Having a single system works.
  • Reliability As you may know, spelling in English is not a reliable guide to how a word is pronounced so, even if a learner can correctly recognise and produce the different pronunciations of the o in love and move that is not a guide to knowing how shove or hove are pronounced at all.  Access to the phonemic script allows learners instantly to relate the pronunciation of words to one another and not to pronounce hove as if it rhymed with love or shave as if it rhymed with have . Teachers who can transcribe have the ability to make this clear and learners who can read transcription are able to make a note of the difference.
  • Ambiguity Even when words are spelled the same, they may be differently pronounced (a phenomenon known as homography) so we get, for example, entrance meaning a way in and a verb meaning bewitch or hold someone's complete attention .  The words are very differently pronounced.  Other examples will include row, minute, live and hundreds more.  Being able instantly to spot the difference is a skill learners need to develop if they are sensibly to use a dictionary and teachers need instantly to point out when teaching.  The best way to do that is via phonemic transcription.
  • Professionalism The ability to use a simple, if technical, area of linguistics is an indicator of professional competence.  An inability to read or write a transcription of how something is pronounced is a handicap when it comes to teaching pronunciation and most learners expect formal pronunciation work to be part of what happens in the classroom.
  • Correction and needs analysis The ability to transcribe how learners pronounce certain phonemes and words is helpful when it comes to researching the needs of your learners.  Having a transcription of what they actually produce next to a transcription of what they should produce helps you to identify what needs work and what is adequate. For learners, too, a transcription of what they say and the model to compare it to helps them to notice the gap.

If even some of that sounds convincing, read on.

We are talking about English sounds here.  The study of language sounds (phonemic analysis) is language-specific.  This mini-course is concerned with the transcription of English sounds. You will not, therefore, find mention of the vowel /ɯ/ (which occurs in Turkish, Korean, Irish and many other languages) or /ɾ/, the tapped 'r' sound of Spanish, which is not a phoneme of English but is common in, e.g., Japanese and other languages.  The chart below does not, therefore, describe all the sounds of language, just the ones that are used in English (and not all of them as we shall shortly see).

  • the sound /t/ can be pronounced with or without a following puff of air, rather like a brief /h/ sound (aspiration).  Compare the sounds in track and tack .  In the first, where the /t/ is followed by another consonant, the sound has little aspiration (transcribed as /t/) but in the second, the sound is aspirated and transcribed as /tʰ/.  In English, these are not separate phonemes: you can change /t/ to /tʰ/ without changing the meaning of a word.  In some languages, Mandarin, for example, /t/ and /tʰ/ are separate phonemes and swapping them around will change the meaning of what you say.
  • The same applies to /k/ vs. /kʰ/ ( ski vs. cat ).  When /k/ is not the first sound in the word, it lacks much aspiration but when it is the first sound, or, like /t/ when it is the first sound of a syllable, it carries the aspiration and is transcribed that way.
  • The sound /p/ exhibits the same characteristic.  Make it the first sound in a word or a syllable, such as in copper and it will be aspirated to /pʰ/.
  • the light [l] as in lap which is simply transcribed as /læp/ with the sound pronounced with the tongue tip on the alveolar ridge behind the top teeth.  This pronunciation occurs when /l/ is followed by any vowel sound so pull it is transcribed as /pʊl.ɪt/.
  • the dark version (which has the symbol [ɫ]) occurs at the end of words like full , transcribed as [fʊɫ], with the tongue further back in the mouth approaching the velum or soft palate.  The sound is referred to, incidentally, as velarised.  This pronunciation is used when /l/ is not followed by a vowel, so pull that is transcribed as /pʊɫ.ðæt/.  The transcription may safely be left as /l/ in all cases because it is simpler and we have a rule for the pronunciation of the allophones (a small code sketch of that rule appears after this list).
  • in Standard British English, the word nurse is transcribed with a long vowel (as /nɜːs/) but in rapid speech the vowel may be shortened to give /nɜs/.  No-one listening will mistake the word or assume that the word with a shorter vowel carries a different meaning so the transcription need not distinguish too carefully.  The sounds are allophones. In Standard American English the word is transcribed as /ˈnɝːs/ with the tiny /r/ denoting that the 'r' sound is pronounced by most American-English speakers but, again, that is an allophonic, not phonemic, difference because the word remains the same with the same meaning.  In some varieties of British English, too, the /r/ will be pronounced so we will have /nɜːrs/ as the transcription.
  • the words beauty and booty may be pronounced identically as /ˈbuː.ti/ although the standard form for beauty is /ˈbjuː.ti/ and for booty , it is /ˈbuː.ti/.  It makes no difference to meaning if your dialect does not distinguish.
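
Picking up the light and dark /l/ rule from the list above, here is a minimal sketch in Python; the vowel set is illustrative only, and the phonemes are supplied as a list of IPA symbols.

    # Light/dark /l/: /l/ is light when a vowel sound follows, dark ([ɫ]) otherwise.
    VOWELS = set("aeiouæɑɒʌəɜɪʊɛɔ")

    def mark_dark_l(phonemes):
        """Return a copy of the phoneme list with allophonic [ɫ] marked."""
        out = []
        for i, p in enumerate(phonemes):
            nxt = phonemes[i + 1] if i + 1 < len(phonemes) else ""
            if p == "l" and not (nxt and nxt[0] in VOWELS):
                out.append("ɫ")   # dark: no vowel follows (full, pull that)
            else:
                out.append(p)     # light: a vowel follows (lap, pull it)
        return out

    print(mark_dark_l(["l", "æ", "p"]))            # light /l/ in lap
    print(mark_dark_l(["f", "ʊ", "l"]))            # dark [ɫ] in full
    print(mark_dark_l(["p", "ʊ", "l", "ɪ", "t"]))  # light /l/ in pull it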

For a list of useful minimal pairs (for your own practice and use in the classroom), click here (new tab).

Click here to take a short test to see if you can match minimal pairs.  There are no transcriptions in this test so you will have to say the words aloud or to yourself to find the pairs. You can click on the other answers to see what feedback you get.
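
A minimal pair is a pair of words whose transcriptions differ in exactly one phoneme. A small sketch of that check in Python, working on transcriptions given as lists of phoneme symbols rather than spellings:

    def is_minimal_pair(trans_a, trans_b):
        """True when the two transcriptions differ in exactly one phoneme."""
        if len(trans_a) != len(trans_b):
            return False
        differences = sum(1 for a, b in zip(trans_a, trans_b) if a != b)
        return differences == 1

    print(is_minimal_pair(["ʃ", "ɪ", "p"], ["ʃ", "iː", "p"]))   # True  (ship / sheep)
    print(is_minimal_pair(["s", "uː"], ["z", "uː"]))            # True  (sue / zoo)
    print(is_minimal_pair(["k", "æ", "t"], ["k", "æ", "t"]))    # False (identical)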

Here's the list you'll learn.  If you want to download this chart as a PDF document to keep by you as reference, click here .

The symbols we are using in the course are those introduced by Gimson (1962) in the first edition of An Introduction to the Pronunciation of English .

The consonants are the easiest so we can start there.  Most of them are transcribed using the normal letters of the alphabet and they are often the same as the written form of the letter but remember that spelling in English is not a reliable guide to pronunciation.  There are five sounds which are denoted by special symbols and these you have to learn:

  • /ʒ/ which is the sound represented by the letter 's' in plea s ure (/ˈple.ʒə/)
  • /ʃ/ which is the sound represented by the letters 'sh' in sham (/ʃæm/)
  • /θ/ which is the sound represented by the letters 'th' in thank (/θæŋk/)
  • /ð/ which is the sound represented by the letters 'th' in mother (/ˈmʌð.ə/)
  • /ŋ/ which is the sound represented by the letters 'ng' in ring (/rɪŋ/)

Additionally, there is one anomaly, and there are two combination consonant sounds:

  • /j/ is not a representation of the sound at the beginning of jug but is the sound which the letter 'y' represents in young (/jʌŋ/)
  • /dʒ/ is a combination of /d/ and /ʒ/ and is the sound represented by the letter 'j' in jump (/dʒʌmp/)
  • /tʃ/ is a combination of /t/ and /ʃ/ and is the sound represented by the letters 'ch' in chat (/tʃæt/)

Voicing describes how phonemes may be different depending on whether the vocal cords vibrate or not at the time of pronunciation.  (There are those who will aver that the technically correct term is vocal folds not vocal cords.)  Voicing is also called sonorisation. For example, the /k/ sound is made without voicing but the /ɡ/ sound is made with the mouth parts in the same place but with voice added.  Here are some examples of words containing voiced and unvoiced consonants.  The consonant in question is underlined, in bold .  Say them aloud and you will hear the differences.

In all the words above, the place of articulation (i.e., where in the mouth the sound is made) is identical for both pairs of consonants.  All that changes is whether or not the vocal cords or folds vibrate. If you put your hand on your throat and say the words sue and zoo , you will see what is meant and feel a slight vibration on the second word (/s/ is unvoiced but /z/ is voiced). Try saying the words and examples in the table above out loud and you will see that you need to pronounce the voiced consonants with a vibration of the vocal cords and a little more energy than the sounds in the unvoiced cases. A check is to try saying ZZZZZZZZZZZZZZZZZZZSSSSSSSSSSSSSSSSSSSSSSSSZZZZZZZZZZZZZZZZZZZZZSSSSSSSSSSSSSSSSSS with a hand on your throat so you can feel the vibration.

Of the consonants, 16 form eight voiced-unvoiced pairs.
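
The standard pairings are /p b/, /t d/, /k ɡ/, /f v/, /θ ð/, /s z/, /ʃ ʒ/ and /tʃ dʒ/. As a quick reference, here they are as a small Python lookup table (an editorial sketch, not part of the original course):

    UNVOICED_TO_VOICED = {
        "p": "b", "t": "d", "k": "ɡ", "f": "v",
        "θ": "ð", "s": "z", "ʃ": "ʒ", "tʃ": "dʒ",
    }
    VOICED_TO_UNVOICED = {v: k for k, v in UNVOICED_TO_VOICED.items()}

    def partner(consonant):
        """Return the voiced/unvoiced counterpart, or None for unpaired sounds."""
        return UNVOICED_TO_VOICED.get(consonant) or VOICED_TO_UNVOICED.get(consonant)

    print(partner("s"))   # z    (sue / zoo)
    print(partner("ɡ"))   # k
    print(partner("m"))   # None - the nasals have no unvoiced partner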

You have to listen out for voicing when you are transcribing because voiced and unvoiced consonants are full phonemes in English. There is, in fact, a cline between fully unvoiced phonemes and those which are heavily voiced but for the purposes of a phonemic rather than phonetic transcription, we simply have to draw a line and have unvoiced and voiced consonants on either side of it. Individual speakers, too, vary in the amount of voicing they exhibit and in which consonants they voice in which environments. For more, see the section below under assimilation concerning voicing and devoicing.

Click here for a little test to see if you can match voiced and unvoiced sounds by saying some words aloud.

To get us started with transcribing consonants, take a piece of paper and transcribe the consonants only in these words, using the right-hand side of the phoneme chart.  Look at the example words and check to see if the pronunciation is the same as the words in this test. Click on the table when you have done that.

All the other sounds are transcribed using ordinary English alphabetic letters taking on their usual pronunciation.

Now transcribe the underlined CONSONANTS only in these words.  Do not worry now about the rest of the words.

You should have:

If you have included an /r/ sound at the end of your transcription of the word chair , that's fine because it would be pronounced that way if the following sound were a vowel.  If not, in BrE the /r/ is not included but it is in many other varieties of English including AmE .

As a very simple check, try these three tests which just ask you to match the transcription of the consonant with a word containing only that consonant. If you would like to try an exercise in transcribing the consonants you hear rather than ones you read, click here (new tab).

The pronunciation of /w/ If you are transcribing the speech of someone from Scotland, Ireland or parts of the southern United States, listen out for how, for example, they pronounce the initial consonant on where, when, whether, whine, what etc . Although the sound is now almost extinct except in some varieties, a variant of /w/ usually transcribed as /hw/ (or you may see it as [ʍ]), appears at the beginning of words spelled wh- but has for almost all speakers of English now merged with /w/.  The result is that apart from a small minority of speakers, there is no distinction in pronunciation between weather and whether, wine and whine etc.  The merger is generally called the whine-wine merger.

Here's a list of the vowels in English (authorities may differ slightly about how many there are, incidentally).

* This diphthong in the example words is not pronounced by all speakers.  For example, sure may be pronounced with the diphthong as /ʃʊə/ or with a monophthong as /ʃɔː/ † /i/ may be transcribed as /iː/ in some analyses. The schwa (/ə/) is the commonest sound in English but there is no letter for its representation.

The first two columns contain the 13 pure vowels in English. The right-hand column contains the 8 diphthongs making a total of 21 in all.

If you haven't already done so, to do this exercise, you may want to download the chart as a PDF document so you can have it at your elbow.  Click here to do that .

Using the chart, transcribe the following words and then click on the table to check your answers.

If you didn't get the first vowel of ago , or the final one of happy , that doesn't matter (yet).  In the first case the vowel is the schwa , transcribed as /ə/, and in the second case, the final vowel is transcribed as /i/ and lies between the short vowel in sit (/sɪt/) and the longer one in seat (/siːt/). Try another short recognition test by clicking here .

There are 8 of these and they are combinations of pure vowels which merge together.  We have, e.g., /ɪ/ + /ə/ (the sounds we know from bid and ago ) following one another to produce /ɪə/ as in merely (mee-err-ly).  You can usually work out what the diphthong is by saying the word it contains very slowly and distinctly.

There is another test of your ability to recognise all the diphthongs here .

An issue to note is with the transcription here of tour.   Here, we use the diphthong /ʊə/ but there are many speakers who pronounce, especially, short words such as sure, poor etc. with the monophthong /ɔː/ so, for example, sure as /ʃʊə/ or with a monophthong as /ʃɔː/, poor / pour as /pʊə/ or with a monophthong as /pɔː/ and tour as /tʊə/ or with a monophthong as /tɔː/. This sound is more often present in longer words such as individual (/ˌɪn.dɪ.ˈvɪ.dʒʊəl/). If you are being careful to transcribe exactly what someone says, this is worth listening out for.

Finally, there is a set of three tests of your ability to recognise some commonly confused vowel transcriptions. Click here to go to it (new tab).

You have now transcribed words using all the vowels and consonant sounds of English.

As a check of your knowledge, try the following.

Did you get it right?  One thing to notice is that in rapid connected speech, the transcription of come with me would probably be /kʌm wɪ miː/ without the /ð/ because we usually leave it out.  You may also, depending on how you say things, have had /ɪɡˈz-/ or even /ɪkˈs-/ at the beginning of exactly .  That doesn't matter too much but note the convention for marking the stress on multisyllabic words: it's a ˈ inserted before the stressed syllable. There is also the convention of putting a stop (.) between syllables (as in, e.g., sentence /ˈsen.təns/).  Your students may not need that but many find it helpful.  More on that in a moment.

The most common vowel in the spoken language has no letter to represent it. It is, of course, the humble schwa.  If you teach no other phoneme symbol, teach this one.  Including it in your transcriptions is simply a matter of listening out for it and making sure that you aren't being influenced by the spelling of words.  You should also note that the schwa only occurs in unstressed syllables .  You can't stress the schwa. The schwa may be how any of the traditionally spelled vowels are pronounced:

How many schwa sounds can you detect when you say and transcribe this sentence?  Click on the bar when you have an answer.

No fewer than 12 in 11 words.  Note:

  • The last bits of the words celebration and official are known as syllabic consonants .  That is to say, there is no proper vowel sound between the /ʃ/ and the /l/ in official and between the /ʃ/ and the /n/ in celebration .  Other examples are table , doable and so on where there is no proper vowel between the /b/ and the /l/. Some transcriptions would remove the schwa, transcribing them as /ə.'fɪʃ.l/ and /se.lə.'breɪʃ.n/.  An alternative is to insert a raised schwa for these very short vowels (e.g., /ˌse.lə.ˈbreɪʃ.ᵊn/). A third alternative you may see is to place a dot below the final consonant to indicate the pronunciation, using, e.g., /l̩/ and /n̩/.  You choose but be consistent with your learners. There is more on this below in the bit about crushing the schwa. (Many would omit or use the raised symbol in a word like simple (/'sɪm.pl/ or /'sɪm.pᵊl/), but the comparative form, simpler , is pronounced with nothing between the two consonants (/'sɪm.plə/).  In simple , the /l/ is dark ([ɫ]), in simpler , it is light.)
  • The first instance of the definite article is not transcribed with a schwa because it is followed by a vowel sound (so it's pronounced /ði/).  The rules for article form and pronunciation (especially before /h/) are not simple and are set out in the guide to them here (new tab).
  • You may have preferred to have the second vowel sound in celebration as /ɪ/ rather than the schwa and that just depends on your preferred pronunciation.
  • The syllables containing the schwa are all unstressed.
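
A small sketch of a check you can run on your own transcriptions: count the schwas and flag any that appear in a stressed syllable, since the schwa should occur only in unstressed syllables. It assumes the dotted syllable and ˈ/ˌ stress conventions used in this course.

    def schwa_report(transcription):
        """Return (schwa count, list of stressed syllables containing a schwa)."""
        count = 0
        stressed_schwas = []
        for syllable in transcription.replace(" ", ".").split("."):
            n = syllable.count("ə")
            count += n
            if n and (syllable.startswith("ˈ") or syllable.startswith("ˌ")):
                stressed_schwas.append(syllable)
        return count, stressed_schwas

    print(schwa_report("ə.ˈfɪʃ.əl"))          # (2, [])
    print(schwa_report("ˌse.lə.ˈbreɪʃ.ən"))   # (2, [])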

As a check, we'll look again at an exercise from the section on consonants and ask you to try the test again but this time, transcribe the whole of each word, putting in the correct vowel transcriptions, the stress marks and the schwa. Here's the list again:

You may have transcribed three of these words slightly differently.  They are:

  • Google : which can be transcribed either as /ˈɡuː.ɡl̩/ or as /ˈɡuː.ɡəl/
  • baffle : which can be transcribed either as /ˈbæf.l̩/or as /ˈbæf.əl/
  • rabble : which can be transcribed either as /ˈræb.l̩/ or as /ˈræb.əl/

The reason for this is that the final syllable is so short that most people do not pronounce the schwa at all in rapid speech and instead produce what is known as a syllabic consonant, which is transcribed as /l̩/ with a mark below the phoneme to show that it is a syllable in its own right.  There is more on this below.  Be patient. By the way, if you transcribed chair as /tʃeər/, that's OK, too, because it would be pronounced that way if the following sound were a vowel.  If not, in BrE the /r/ is not included but it is in many other varieties of English including AmE.  The usual way to transcribe the word in AmE is as /ˈtʃer/.
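
Since both notations are acceptable, a small normalisation helper can convert the /əl/, /ən/, /əm/ spellings of these endings into the syllabic-consonant notation. This is a sketch of the mechanical substitution only (the combining mark below the letter is U+0329), not a rule of English.

    SYLLABIC_MARK = "\u0329"

    def to_syllabic(transcription):
        """Rewrite syllable-final əl/ən/əm as a syllabic consonant."""
        out = []
        for syllable in transcription.split("."):
            if len(syllable) >= 2 and syllable[-2] == "ə" and syllable[-1] in "lnm":
                syllable = syllable[:-2] + syllable[-1] + SYLLABIC_MARK
            out.append(syllable)
        return ".".join(out)

    print(to_syllabic("ˈɡuː.ɡəl"))   # ˈɡuː.ɡl̩
    print(to_syllabic("ˈbæf.əl"))    # ˈbæf.l̩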

Now you can get a little practice in transcribing the vowels you hear in some simple words.  Click here to do that.

There are those who argue (Wells, for example) that there is actually no such thing as a triphthong in English.  They take the view, roughly summarised, that the vowels in, e.g., player break into two syllables so what we have is simply a diphthong followed by another vowel so the transcription should be:     /ˈpleɪ.ə/, not /ˈpleɪə/ and that means the diphthong /eɪ/ as in day followed by the schwa in the second syllable. Wells puts it like this:

I would argue that part of the definition of a true triphthong must be that it constitutes a single V unit, making with any associated consonants just a single syllable. Given that, do we have triphthongs in English? I claim that generally, at the phonetic level, we don’t. I treat the items we are discussing as basically sequences of a strong vowel plus a weak vowel. Wells, 2009

Roach, on the other hand, argues differently and states that:

The most complex English sounds of the vowel type are the triphthongs. They can be rather difficult to pronounce, and very difficult to recognise. A triphthong is a glide from one vowel to another and then to a third, all produced rapidly and without interruption . Roach, 2009:29 (emphasis added)

Crystal states:

The distinction between triphthongs and the more common diphthongs is sometimes phonetically unclear. Crystal, 2008:497

This is not the place to pit two esteemed phoneticians against each other so we'll stick with the simplest explanation, the one proposed by Wells, and suggest that what is sometimes called a triphthong is, in fact, a glide from a diphthong to another vowel, the schwa, and that there are (or can be) two syllables in such pronunciations. Here, we will recognise five of these combinations of sounds.  Whether the speaker you are transcribing produces all five is a matter of the accent and background of the speaker as well as how carefully and slowly the words are spoken. Here's the list:

  • /eɪə/ as in player (/ˈpleɪə/) or mayor (/ˈmeɪə/).  Start with the diphthong /eɪ/ as in say (/seɪ/) and glide from the end of that to the /ə/.
  • /aɪə/ as in liar (/ˈlaɪə/) or shyer (/ˈʃaɪə/).  Start with the diphthong /aɪ/ as in nice (/naɪs/) and glide to the /ə/.
  • /ɔɪə/ as in soil (/ˈsɔɪəl/) or loyal (/ˈlɔɪəl/).  Start with the diphthong /ɔɪ/ as in toy (/tɔɪ/) and glide to the /ə/.
  • /əʊə/ as in lower (/ˈləʊ.ə/) or knower (/ˈnəʊ.ə/).  This one has a schwa at both ends.  Start with the diphthong /əʊ/ as in coat (/kəʊt/) and glide to the /ə/.
  • /aʊə/ as in tower (/ˈtaʊə/) or our (/ˈaʊə/).  Start with the diphthong /aʊ/ as in mouth (/maʊθ/) and glide to the /ə/.

You should have:     mower : /ˈməʊə/ or /ˈməʊ.ə/     tyre : /ˈtaɪə/ or /ˈtaɪ.ə/     slayer : /ˈsleɪə/ or /ˈsleɪ.ə/     toil : /ˈtɔɪəl/ or /ˈtɔɪ.əl/ (but many pronounce that as /tɔɪl/, a single-syllable word with a diphthong vowel sound)     shower : /ˈʃaʊə/ or /ˈʃaʊ.ə/

As far as transcription is concerned, you do not have to take sides in the Roach-Wells debate and can equally well have the transcription with the syllable-marking '.' or without.  It just depends on whether you hear the sound as a single vowel or two syllables and that will vary from speaker to speaker. See the next section for how we recognise syllables.

As we saw, the main stressed syllable is conventionally indicated by ˈ before the syllable (e.g., /ˈsɪl.əb.l̩/). It is sometimes helpful to mark secondary stress in longer words like incontrovertible by a lowered symbol like this:     /ɪn.ˌkɒn.trə.ˈvɜː.təb.l̩/ in which you can see a small ˌ before the /k/ sound indicating that the second syllable carries secondary stress, while the main stress falls on the fourth syllable and is shown by the ˈvɜː in the transcription.  Most learners find just one stressed syllable enough to cope with. If we want to show that non-phonemically, we might write:     in con tro VER tible on the board, with underlined lower-case for secondarily stressed syllables but bold, underlined CAPITALS for the main stress. (An alternative way to mark stress sometimes used by professional phoneticians is to place an acute accent over the onset vowel of a stressed syllable and a grave accent over a secondarily stressed item.  In this case, the syllable borders are usually ignored.)
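
A small sketch that reads these stress marks back out of a dotted transcription and reports which syllables carry primary (ˈ) and secondary (ˌ) stress:

    def stress_pattern(transcription):
        """Return the 1-based positions of primary and secondary stressed syllables."""
        primary, secondary = [], []
        for i, syllable in enumerate(transcription.split("."), start=1):
            if syllable.startswith("ˈ"):
                primary.append(i)
            elif syllable.startswith("ˌ"):
                secondary.append(i)
        return {"primary": primary, "secondary": secondary}

    print(stress_pattern("ɪn.ˌkɒn.trə.ˈvɜː.təb.l̩"))
    # {'primary': [4], 'secondary': [2]} - main stress on the fourth syllable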

However, before we can decide where to put the stress mark, we need to identify the syllables in an utterance.  That is not always as easy as it sounds. A syllable is a unit of pronunciation having one vowel sound, with or without surrounding consonants. By that definition, all of the following are single syllables:     or     go     ask     bus although syllables come in various kinds (there is a guide to the differences on this site which you can access in the guide to syllables and phonotactics (new tab)). You can transcribe these individual words without any stress or syllable marks because there is no stress to note and only one syllable in question.  In connected speech, of course, we may need to insert a stress mark if the word carries stress in a longer string of text.  The transcription of those words is, therefore:     or: /ɔː/     go: /ɡəʊ/     ask: /ɑːsk/     bus: /bʌs/ and there are no other markings.

However, words or utterances of more than one syllable pose a problem because the transcription needs to show both the division into syllables and the place where the stress appears.

You should have: There are 7 syllables so the word is broken down as de-na-tio-na-li-za-tion . The transcription is:     /ˌdiː.ˌnæ.ʃə.nə.laɪ.ˈzeɪʃ.n̩/ In rapid speech, however, many speakers will omit some of the syllables in very long words so the transcription might easily be:     /ˌdiː.ˌnæʃ.nə.laɪ.ˈzeɪʃ.n̩/ with the third syllable dropped. Notice, too, that the final syllable in both cases is simply /n̩/ which denotes that there is no obvious vowel between the /ʃ/ and the /n/ sounds.  That's called a syllabic consonant because the single consonant forms a syllable. If you transcribed that with /ən/ at the end, that's fine (and correct).  It can also be transcribed as /ᵊn/ to show that the vowel is very short.  There is a bit more on syllabic consonants below.

The rules for deciding where a syllable starts and stops are quite complex in English but there is a rule of thumb we can use to decide, for example, how to divide a word like tumbler . We could have:     /ˈtʌm.blə/ or     /ˈtʌ.mblə/ or     /ˈtʌmb.lə/ so how do we decide? Here are the rules (a short code sketch applying them follows the list):

  • If there is a choice, attach the consonant to the right-hand syllable, not the left. That would mean that the transcription would be:     /ˈtʌm.blə/ and that's fine, but why don't we attach both consonants to the right-hand syllable to give:     /ˈtʌ.mblə/? Here, we need rule 2:
  • If attaching the consonants to the right-hand element produces a syllable which is forbidden as the beginning of a word in English, move one of them to the left. In English, no word can begin /mb/ (although that is allowable in some languages).  We can, however, have a word beginning /bl/, of course, such as black, blur, block etc. Therefore, applying both rules, we end up with     /ˈtʌm.blə/
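
A minimal Python sketch of these two rules, using a tiny illustrative set of legal word onsets; a real implementation would need the full inventory of permissible English onsets.

    # Rule 1: prefer to attach as many consonants as possible to the right.
    # Rule 2: the right-hand syllable must start with a legal English onset.
    LEGAL_ONSETS = {"", "b", "l", "r", "t", "bl", "br", "tr", "pl"}

    def divide(left_part, cluster, right_part):
        """Split an intervocalic consonant cluster between two syllables."""
        for split in range(len(cluster) + 1):
            onset = cluster[split:]
            if onset in LEGAL_ONSETS:
                return left_part + cluster[:split], onset + right_part
        return left_part + cluster, right_part  # fallback: everything left

    print(divide("ˈtʌ", "mbl", "ə"))  # ('ˈtʌm', 'blə')  ->  /ˈtʌm.blə/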

Now the stress marking. Once we have applied the rules (or used a bit of common sense and intuition) we can divide multisyllabic items up conventionally and then decide where the stresses fall.

You should have:     /ɪm.ˌpɒ.sə.ˈbɪ.lɪ.ti/ and     /ˌɪnt.ə.ˌnæʃ.n̩.əl.aɪ.ˈzeɪʃ.n̩/ We cannot, of course, following Rule 2 above have     /ɪ.ˌmpɒ.sə.ˈbɪ.lɪ.ti/ or     /ˌɪnt.ə.ˌnæ.ʃn̩.əl.aɪ.ˈzeɪʃ.n̩/ because no word in English can begin with /mp/ or /ʃn/ so we move the /m/ and the /ʃ/ to the left and then we have something acceptable. Do not worry if your transcription was not exactly the same.  If you have identified the syllable divisions and put the main and subsidiary stresses in the right places, that's OK for now.

Transcribing connected speech spoken at normal speed, rather than someone reading from a list of words, requires attention to a variety of new factors.  Four are considered here.

There are three sounds which speakers insert between vowels in connected speech: /r/, /w/ and /j/.  They need to be included in your transcriptions.

You may see an intrusive sound put in superscript (ʳ, ʷ, ʲ) and that's a good way to draw your learners' attention to the sounds.  There is, however, a case to be made that you don't have to teach these at all because they are the inevitable effects of vowel-vowel combinations in speech.  They aren't, of course, only applicable to English.

Try this next mini-test.  Click on the table to get the answer.

There are times when you have to listen extremely carefully to hear whether a speaker is actually producing the intrusive sound or inserting /ʔ/, a glottal stop (see next section). For example, many will pronounce      Go out as /gəʊʔaʊt/ rather than /ɡəʊ.ˈwaʊt/,     The gorilla and me as /ðə.ɡə.ˈrɪ.ləʔənd.miː/ rather than /ðə.ɡə.ˈrɪ.lə.rənd.miː/ and     I am here as /ˈaɪʔæm.hɪə/ rather than /ˈaɪ.jæm.hɪə/.

As we saw above with the transcription of suet , many speakers of all varieties will insert an intrusive /w/ in the middle of the word, producing /ˈsuːwɪt/ instead of /ˈsuːɪt/ and so we also hear fuel as /ˈfjuːwəl/ but that word, even without the intrusive /w/, is pronounced with an intrusive /j/ as the transcriptions show. If you listen carefully to some British English speakers pronouncing words such as tune, fortune, produce, century, nature, mixture, picture, creature, opportunity, situation, actually you may hear an intrusive /j/ sound after the /t/ or /d/ not shown in the spelling. Therefore, the transcription is actually:     tune /tjuːn/     actually /ˈæk.tjuə.li/     situation /ˌsɪ.tjʊ.ˈeɪʃ.n̩/ etc. although /ˈæk.tʃuə.li/ and /ˌsɪ.tʃʊ.ˈeɪʃ.n̩/ are also heard.  Not all speakers do this.

A further issue to listen for is the linking /r/ sound. In British English, the final 'r' on many words is unsounded so, for example, harbour is pronounced as /ˈhɑː.bə/, whereas in AmE, the standard pronunciation includes the /r/ sound and the pronunciation is /ˈhɑːr.bər/. However, when a word ending in 'r' immediately precedes a word with an initial vowel, we get the linking /r/ and the sound is produced so, for example:      My father asked will be pronounced as     /maɪ.ˈfɑːð.ər.ˈɑːskt/ in BrE and as     /maɪ.ˈfɑːð.r̩.ˈæskt/ in AmE.
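
A minimal sketch of the linking /r/ rule for a non-rhotic accent: the /r/ is sounded only when the word is spelled with a final 'r' (or 're') and the next word begins with a vowel sound. The inputs pair a spelling with a non-rhotic transcription, and the vowel test is deliberately simplified.

    VOWELS = set("aeiouæɑɒʌəɜɪʊɛɔ")

    def link_r(word_spelling, word_trans, next_trans):
        """Append /r/ to word_trans when linking applies."""
        spelled_r = word_spelling.lower().rstrip("e").endswith("r")
        next_starts_with_vowel = bool(next_trans) and next_trans.lstrip("ˈˌ")[0] in VOWELS
        if spelled_r and not word_trans.endswith("r") and next_starts_with_vowel:
            return word_trans + "r"
        return word_trans

    print(link_r("father", "ˈfɑːð.ə", "ˈɑːskt"))   # ˈfɑːð.ər  (linking /r/)
    print(link_r("father", "ˈfɑːð.ə", "təʊld"))    # ˈfɑːð.ə   (no linking /r/)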

The guide to connected speech contains more detail on the different forms of assimilation.  For the purposes of transcribing sounds in connected speech, the various types are not as important as the ability to step away from the written word and transcribe only what you hear. You must be aware, however, that not all speakers will pronounce everything the same way and the phenomena listed here are not consistently produced by everyone.  Much will depend on how careful speakers are and what variety of English they use. Assimilation describes the alteration of sounds under the influence of other sounds in the vicinity; the guide to connected speech sets out the main types in a table, and the exercises below illustrate several of them.

In the case of the assimilation of /s/ and /z/ to /ʃ/, some would aver that the /s/ and /z/ sounds are simply being omitted and that is elision, the topic of the next section.  Others believe that the /ʃ/ sound is, in fact, being extended to nearly double its usual length, so this is a case of assimilation.  The distinction, such as it is, is not vital for teaching purposes.

Consonant lengthening is a minor area in English (but not so in some languages).  There are times when two non-plosive consonants occur together and, normally in rapid speech, one of them is assimilated (or elided, depending on your point of view).  So, for example, some milk is usually pronounced as /səm.ɪlk/ with only one /m/ sound. However, when people are being slightly more careful and speaking a little more slowly, both /m/ sounds are heard, so the transcription is /səm.mɪlk/ and it would appear from that that there are two separate sounds in the middle of the phrase.  What in fact frequently happens is not that we have two /m/ sounds but that we have a single sound slightly lengthened. The transcription is sometimes adjusted to take this into account and a length mark is inserted after the consonant, giving /səmː.ɪlk/. The phenomenon is called gemination (from the Latin gemini , meaning twins). This sort of lengthening occurs most frequently with certain consonants because plosives such as /p/ cannot usually be lengthened. Some examples are:
    club bar /klʌb.bɑː/
    mad demons /mæd.ˈdiː.mənz/
    safe fire /seɪf.ˈfaɪə/
    big gate /bɪɡ.ɡeɪt/
    full label /fʊl.ˈleɪb.l̩/
    warm margarine /wɔːm.ˌmɑː.dʒə.ˈriːn/
    gin next /dʒɪn.nekst/
    car research /kɑː.rɪ.ˈsɜːtʃ/
    less sense /les.sens/
    mash shop /mæʃ.ʃɒp/
    cave visit /keɪv.ˈvɪ.zɪt/
Here we have followed the convention of transcribing both consonants although we are aware that in rapid speech one will not usually be sounded.  If that is the case in what you hear, delete the second of the consonant sounds but retain the syllable marker.  If you wish to use the length marker after the consonant, that is fine too, providing that it is what you heard, but be aware that it is not a widely used convention.

In some languages, including Arabic, Danish, Estonian, Hindi, Hungarian, Italian, Japanese, Polish and Turkish, consonant lengthening carries meaning, so short and long consonants are independent phonemes.  In English, no such meaning attaches to a longer consonant, so we are dealing with allophonic differences and the transcription may therefore be unaffected.

You should have:     /ˈɡəʊl.dəm.bɒks/     /ˈtʃɪl.drəm.mʌst/ (or /ˈtʃɪl.drəm.mʌs/ with the elision of the final /t/ on must )     /ˈpʊt.paɪ/ or /ˈpʊʔ.baɪ/     /həʔ.ˈmæ.nɪdʒd/ With /n/ assimilated to /m/, /t/ to /ʔ/ and /d/ to /ʔ/. Do not worry if your transcription was not exactly the same.

You should have:     /faɪŋ.ˈkɑːs.l̩/ (with /ŋ/ not /n/)     /sɪʔ.ˈkʌmf.tə.bli/     /həd.ˈɡʌ.vəd/ With /n/ assimilated to /ŋ/, /t/ to /ʔ/ and /k/ to /ɡ/. Do not worry if your transcription was not exactly the same.

You should have:     /peɪnʃ.ˈje.ləʊ/     /wʊdʒ.et/ With /t/ assimilated to /ʃ/ and /d/ to /dʒ/. Do not worry if your transcription was not exactly the same.

You should have:     /le.ˈʃʊ.ɡə/     /wə.ˈʃʊə/ (or /wə.ˈʃɔːr/ depending on your accent) With /s/ assimilated to /ʃ/ (and lengthened, often) and /z/ also assimilated into a lengthened /ʃ/. Do not worry if your transcription was not exactly the same.

Voicing and devoicing

Assimilation, both progressive and regressive, also affects voicing (sometimes known as sonorisation).  For example:

  • the s following an unvoiced consonant will be pronounced as /s/ so we get hat and hats (/hæt/ and /hæts/), make and makes (/ˈmeɪk/ and /ˈmeɪks/) and so on.
  • following a voiced consonant, however, the s is usually voiced and pronounced /z/, so we get rug and rugs (/rʌɡ/ and /rʌɡz/), cab and cabs (/kæb/ and /kæbz/) and so on (a short code sketch of this rule follows this list).
  • some speakers carry this over to other sounds, particularly the /θ/ and may pronounce, for example, baths as /bɑːðz/ and youths as /juːðz/.  Others will retain the /θ/ in the plural forms.  You simply have to listen out for which the speaker is doing.
  • regressively, the /v/ in, for example, have is often devoiced before a voiceless consonant such as /t/ so the pronunciation of have to is /həf.tuː/ or /həf.tə/ and love camping is /ˈlʌf.ˈkæmp.ɪŋ/.  Not all speakers do this and many retain the voiced /v/ in such expressions so, again, you have to listen for which variety the speaker is producing.
  • a teaching point is that in some languages, German, Dutch, Polish and Russian, for example, a final consonant is always devoiced so, e.g., bag, club, has, had and cave may be pronounced as /bæk/, /klʌp/, /hæs/, /hət/, /keɪf/, respectively, instead of /bæɡ/, /klʌb/, /hæz/, /həd/ and /keɪv/. If you are transcribing learner English, that is something to listen out for.
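
Returning to the first two points above, here is a minimal sketch of the -s voicing rule in Python. The /ɪz/ form after sibilants (as in buses or wishes) is standard but is not discussed in the list above, so treat that branch as an addition for completeness.

    VOICELESS = {"p", "t", "k", "f", "θ"}
    SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}

    def s_ending(final_sound):
        """Return the pronunciation of a regular -s ending after the given final sound."""
        if final_sound in SIBILANTS:
            return "ɪz"
        if final_sound in VOICELESS:
            return "s"
        return "z"   # all other (voiced) sounds, including vowels and nasals

    print(s_ending("t"))   # s   hats   /hæts/
    print(s_ending("ɡ"))   # z   rugs   /rʌɡz/
    print(s_ending("ʃ"))   # ɪz  wishes /ˈwɪʃɪz/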

It is important, too, to listen carefully for what is not pronounced and this also involves releasing oneself from the spell of the written word and hearing only what is being said, not what one expects to be said. Again, the guide to connected speech has more detail in this area but here it will be enough to present some examples:

  • English uses a variety of contracted forms, leaving out whole sections of words ( hasn't, can't, wouldn't've etc.).  You need to listen carefully to detect whether the speaker has used these or not (as, e.g., /ˈhæznt/, /kɑːnt/, /ˈwʊdnt.əv/ etc.).  There are other examples such as:     the loss of the /d/ in sandwich (/ˈsæn.wɪdʒ/)     the pronunciation of library as /ˈlaɪ.bri/, comfortable as /ˈkʌmf.təb.l̩/ or probably as /ˈprɒbli/.
  • The initial /h/ sound is often dropped in rapid speech so, e.g.:     Give it to him may be correctly transcribed as /ˈɡɪv.ɪt.tu.ɪm/.
  • The initial vowel in if is often elided in conditional clauses so, for example:     If I were you ... is pronounced as: /ˈfaɪ.wə.ju/
  • Function words are often reduced.  We saw above that the schwa often appears in reduced function words such as of, and, to and so on.  Reduction also occurs, however, when all or part of a function word such as of is omitted as in, e.g.:     cup of coffee being pronounced as cuppa coffee (/kʌpə ˈkɒ.fi/).  In many cases the word and is reduced to 'n' and the /d/ is dropped as in, e.g.:     tea 'n' cakes as /tiː n̩ keɪks/.
  • Clusters of consonants are often simplified so, e.g.:     sixths may usually be transcribed as /sɪkθs/ or even /sɪkfs/ and     text message becomes /teks.ˈme.sɪdʒ/ and so on. More examples are in the guide to connected speech.
  • Adjacent sound elision occurs frequently.  When the sound at the end of one stretch of language is the same as the one at the beginning of the next item, they are usually reduced to a single sound in connected speech so, for example:     I'm meeting Mary is pronounced as: /aɪ.ˈmiːt.ɪŋ.ˈmeər.i/ not /aɪm.ˈmiːt.ɪŋ.ˈmeər.i/ and     Don't take that table is pronounced as /dəʊn.teɪk.ðæ.ˈteɪb.l̩/ not /dəʊnt.teɪk.ðæt.ˈteɪb.l̩/ In the transcription here, we have removed the first of the sounds but you can decide whether it is the first or the second which is elided. Speakers are not consistent in this and some will retain both sounds or, when it is possible, as with /m/ to extend the sound slightly.  That is not possible with stops such as /t/, /k/ /d/ etc. but occurs with fricatives like /f/ and /s/ and with the nasal sounds.  When it happens both phonemes appear in the transcription so, e.g.,     She makes sandwiches can be transcribed either as /ʃi.ˈmeɪk.ˈsæn.wɪdʒ.ɪz/ or as /ʃi.ˈmeɪks.ˈsæn.wɪdʒ.ɪz/
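
A minimal sketch of that last point: when the final phoneme of one word is identical to the first phoneme of the next, only one is kept in the joined transcription. Speakers vary, so this models only the common reduced form.

    def join_with_elision(first_word, second_word):
        """Join two phoneme lists, dropping one of two identical boundary sounds."""
        if first_word and second_word and first_word[-1] == second_word[0]:
            return first_word[:-1] + second_word   # keep a single shared sound
        return first_word + second_word

    print(join_with_elision(["aɪ", "m"], ["m", "iː", "t", "ɪ", "ŋ"]))
    # ['aɪ', 'm', 'iː', 't', 'ɪ', 'ŋ'] - one /m/, as in /aɪ.ˈmiːt.ɪŋ/
    print(join_with_elision(["d", "əʊ", "n", "t"], ["t", "eɪ", "k"]))
    # ['d', 'əʊ', 'n', 't', 'eɪ', 'k'] - one /t/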

Again, speakers vary in this with some being more careful and correct and others less so (or sloppy as writers to newspapers often describe them).  You have to listen hard to hear what is really being said.

You may have:     /bæɡ.ə.pə.ˈteɪ.təʊz/, with the /v/ elided, or /bæɡ.əv.pə.ˈteɪ.təʊz/

You should have:     /hi.ʃʊdnt.əv.ˈbɪn.ðeə/ or /hi.ʃʊdənt.əv.ˈbɪn.ðeə/ with the /ə/ included.

You should have either:     /pɑːs.ˈðə.tə.hə/ or /pɑːs.ˈðəʔ.tə.hə/ (with the assimilation of /t/ to /ʔ/).  You may also have omitted the /h/ on her . If you transcribed that as /ðæ/ (with the full vowel sound and the elided /t/), that's OK, too.  Many speakers, even when they are speaking quite quickly, avoid the schwa for the vowel.

You could have either:     /ˈsevnθs/ or /ˈsevns/ (with the elision of /θ/).

You could have either:     /dʒɪm.meɪ/ or /dʒɪ.meɪ/ (with the elision of /m/).

In your larynx, at the top of the windpipe, there is a space between the vocal folds called the glottis, and this is where the glottal stop is produced, hence its name. A glottal stop is formed by briefly closing the vocal folds to block the airflow and then releasing them. The symbol for this sound is /ʔ/ and we have seen a lot of examples of how some sounds are replaced by the glottal stop above.

You probably have:     /ˈpʊʔ ɒn/, /ˈpɪʔ ʌp/ and /ˈhɪʔ ɪm/ instead of the more careful forms of     /ˈpʊt ɒn/, /ˈpɪk ʌp/ and /ˈhɪt ɪm/

We can also have butter as /ˈbʌʔ.ə/ not /ˈbʌt.ə/, or fatter as /ˈfæʔ.ə/ not /ˈfæ.tə/, in some common dialects (London and Scots, for example).
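
A rough sketch of that t-glottalling pattern: /t/ between two vowel sounds may be replaced by /ʔ/ in these dialects. The substitution is dialect-dependent, so the function shows only the mechanical replacement, with phonemes supplied as a list of symbols.

    VOWELS = set("aeiouæɑɒʌəɜɪʊɛɔ")

    def glottalise(phonemes):
        """Replace intervocalic /t/ with /ʔ/ in a list of phoneme symbols."""
        out = list(phonemes)
        for i in range(1, len(out) - 1):
            before, after = out[i - 1], out[i + 1]
            if out[i] == "t" and before[0] in VOWELS and after[0] in VOWELS:
                out[i] = "ʔ"
        return out

    print(glottalise(["b", "ʌ", "t", "ə"]))   # ['b', 'ʌ', 'ʔ', 'ə']  butter
    print(glottalise(["f", "æ", "t", "ə"]))   # ['f', 'æ', 'ʔ', 'ə']  fatter
    print(glottalise(["s", "t", "ɒ", "p"]))   # unchanged - /t/ not intervocalic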

See also the use of the glottal stop to avoid a linking /r/, /w/ or /j/ sound, above.

Dropping the /h/ on him is not always sloppy speech; it is very common and widely acceptable (though not in all dialects). The /h/ in I have , when not contracted, is often replaced by an intrusive /j/ as in /ˈaɪjæv/ and this happens frequently elsewhere, too ( they have, we have , e.g., rendered as /ˈðeɪjəv/, /ˈwijæv/).  Notice, too, the tendency to pronounce have as /həv/ in they have but as /hæv/ in we have . Hello is often pronounced /hə.ˈləʊ/, sometimes /hæ.ˈləʊ/, but often /ə.ˈləʊ/ or /æ.ˈləʊ/.  It may be safer to stick with /haɪ/.

Similarly, in many dialects the final /ŋ/ in words ending with -ing is often rendered as /n/ but this is generally considered low status.  We get, e.g., /'ɡəʊɪn 'aʊt/ instead of /'ɡəʊɪŋ 'aʊt/.  Oddly, some high-status British accents also make this conversion, exemplified by the so-called huntin', fishin' and shootin' set (the /'hʌnt.ɪn 'fɪʃ.ɪn ən 'ʃuːt.ɪn set/).

You could have:     /ˈdeɪ.zi.əz.ə.dɒɡ/ instead of the more careful     /ˈdeɪ.zi.hæz.ə.dɒɡ/

You could have:     /ə.ju.ˈɡəʊɪn.ˈaʊt.tə.ˈnaɪt/ instead of the more careful     /ə.ju.ˈɡəʊɪŋ.ˈaʊt.tə.ˈnaɪt/ or even the very careful     /ɑː.ju.ˈɡəʊɪŋ.ˈaʊt.tə.ˈnaɪt/

A rhotic dialect or variety of English is one in which the letter 'r' is pronounced as /r/ before a consonant or at the end of a word.  For example, a rhotic accent, in this case General American is our example, will pronounce:     card as /kɑːrd/     far as /fɑːr/     murder as /ˈmɝː.dər/ The symbol /ɝ/ is called an R-coloured or rhotic vowel. A non-rhotic variety such as BrE will pronounce those three words as /kɑːd/, /fɑː/ and /ˈmɜː.də/. The rhotic pronunciation is standard in American English and is becoming slightly more frequent in BrE, too. You sometimes need to listen hard to recognise whether an 'r' is being sounded in the middle of a word so, for example the AmE pronunciation of diversion is:     /daɪ.ˈvɝː.ʒən/ whereas the BrE pronunciation lacks rhoticity:     /daɪ.ˈvɜːʃ.n̩/. There are other differences, too, which are covered later. The symbol /ɝː/ to show the sound (a rhotic vowel) may also appear as /ɜːr/

We saw above that in most southern British dialects, the /r/ sound is only pronounced when the following sound is a vowel so we get, e.g.:      My father asked pronounced as     /maɪ.ˈfɑːð.ər.ˈɑːskt/ in BrE and as     /maɪ.ˈfɑːð.r̩.ˈæskt/ in AmE. When the following sound is non-vocalic (not a vowel), this linking /r/ does not occur so, e.g.:     My father told me is pronounced as:     /maɪ.ˈfɑːð.ə.təʊld.miː/ in non-rhotic accents but as     /maɪ.ˈfɑːð.ər.təʊld.miː/ in rhotic accents. In some speakers, the linking /r/ is avoided in favour of a glottal /ʔ/ (see above). In transcribing what is actually said, either by speakers of the language or by learners, it is important to be alert to whether the speaker is using a rhotic accent or a non-rhotic accent.

There are three influences which determine the use of a rhotic accent and using knowledge of them can help you to listen out when transcribing speakers' production.

  • Geographical location As a rule of thumb, the following varieties of English are non-rhotic:
        Southern British English
        BBC English
        Welsh English
        New Zealand English (although there is some evidence of the influence of Scottish settlers contributing to rhoticity in some areas)
        Australian English
        Malaysian English
        Singaporean English
        South African English and English spoken elsewhere in Africa as a lingua franca
        Trinidadian and Tobagonian English
    and the following varieties are generally rhotic:
        Most varieties of Scottish English (although non-rhotic accents are common in Edinburgh and latterly in Glasgow)
        South West English and some varieties in and around Manchester, parts of Yorkshire and Lincolnshire and on the Scottish Borders
        Irish English
        American English
        Barbadian English
        Indian English
        Pakistani English
        Bangladeshi English.
  • Social class and perceived status The status of rhotic vs . non-rhotic accents has a somewhat chequered history. In most varieties of British English a non-rhotic production is associated with high status dialects such as BBC English and rhotic varieties are confined to rural areas and the north.  This is changing with an identifiable move towards rhotic varieties everywhere.  The degree of rhoticity closely matches socio-economic class. In North America, the situation is more complex. Up until the end of the 19th century, a non-rhotic pronunciation was considered prestigious, especially in East coast cities such as New York and Boston. The American Civil War (1861-1865) and, arguably, the previous War of Independence (1775-1783), changed things, removing the prestige of r -dropping and, since the Second World War, a rhotic pronunciation has become a standard high-prestige accent in the USA. African-American Vernacular English retains the non-rhotic pronunciation but some speakers with Hispanic heritages may use a Spanish-influenced trill for the /r/. Canadian English is generally rhotic.
  • Influences Many speakers of English as a second language will use a rhotic accent either because of the influence of AmE on their learning of the language or because their first languages are rhotic and they carry over the pronunciation of /r/ into English. For example, speakers of English who have Hindi or a Dravidian language as their first will often use a rhotic pronunciation in English but speakers of Cantonese (which lacks the /r/ consonant) will often use a non-rhotic pronunciation.  Hence, for example, Hong Kong English is generally non-rhotic although the influence of American media and cultural issues are contributing to more rhoticity even there.

16 Phrase-level Phonological and Phonetic Phenomena

Stefanie Shattuck-Hufnagel is a Principal Research Scientist in the Speech Communication Group at MIT. She received her PhD in psycholinguistics from MIT in 1974, taught in the Department of Psychology at Cornell University, and returned to MIT in 1979. Her research is focused on the cognitive processes and representations that underlie speech production planning, using behaviour such as speech errors, context-governed systematic variation in surface phonetic form, prosody, and co-speech gesture to test hypotheses about the planning process and to derive constraints on models of that process. Additional interests include developmental and clinical aspects of speech production, and the role of individual acoustic cues to phonological features in the processing of speech perception. She is a proud founding member of the Zelma Long Society.

  • Published: 01 July 2014

For many decades, investigators emphasized the search for invariant aspects of the speech signal that might explain the ability to extract a speaker’s intended words despite wide variation in their acoustic shape. In contrast, over the past few decades, as the extraordinary range of phonetic variation has been revealed, the focus has shifted to documenting variation and determining the factors governing it. This review identifies two seemingly contradictory types of findings about connected speech: the extreme loss versus the concurrent preservation of word-form information. Based on these observations, a productive research strategy for understanding the planning and production of connected speech may be to focus on (1) the systematic nature of phonetic reduction patterns, making them a source of information rather than noise; (2) the ability of human listeners to interpret reduced and overlapped forms; and (3) the implications of these two ideas for speech production models.

For many decades after the development of technical tools enabled detailed analysis of the pronunciation of words in connected speech, investigators emphasized the search for invariant aspects of the speech signal that might explain the listener’s striking ability to extract the speaker’s intended words despite wide variation in their acoustic shape across different contexts. In contrast, over the past few decades, as the widespread availability of speech analysis freeware running on personal computers and of recorded utterances from corpora of typical communicative speech began to reveal the extraordinary range of this variation, the focus has shifted to documenting its nature and determining the factors that govern it, such as prosodic structure (constituent boundaries and prominences) and frequency of word use. This review identifies two seemingly contradictory types of findings about the surface forms of words in continuous speech (i.e., the extreme loss of word-form information vs. the concurrent preservation of word-form information) and discusses the implications of these two observations for models of speech production planning at the sound level. Current knowledge about phrasally induced phonetic variation suggests that a productive research strategy for understanding how human speakers plan and produce connected speech may be to focus on (1) the systematic nature of phonetic reduction patterns, which makes them a source of information rather than noise; (2) the corresponding ability of human listeners to interpret reduced and overlapped forms in terms of the cues they nevertheless contain to the speaker’s intended lexical items; and (3) the implications of these two ideas for speech production planning models.

Introduction: Phonetic Variation and Phonological Invariance

Human beings have a remarkable skill in the use of spoken language: speakers adapt word forms to different contexts, and listeners deal with these differences, recognizing the speaker’s intended words without noticeable effort. Understanding the nature and extent of this contextual variation in word form has not been easy, because to a listener the intended words seem transparently available in the speech signal. It is only when the signal is particularly challenging (e.g., low in amplitude, produced by a nonnative speaker, or conveyed over a poor channel) that the listener becomes aware of doing any cognitive work to understand what is said. Thus, when technical developments achieved by the mid-1900s, such as the sound recorder, the oscillograph, and the sound spectrograph, allowed speech scientists to study the fleeting acoustical events of spoken utterances in detail, the degree of variation revealed by these technologies was surprising. Instead of an orderly array of sequentially organized sound segments with their acoustic cues temporally aligned, these new tools revealed that information about successive speech sounds was spread more widely across time, with cues to adjacent parts of words overlapping each other (see, e.g., Lehiste, 1967 ). For example, the place of articulation cues for a stop consonant are not entirely contained within the temporal interval between the closure and release of the consonant; for voiceless stops this interval can be almost completely silent. Instead, cues are also found in the regions before and after these events, such as in the temporal course of changes in the resonant frequencies of the vocal tract (formant transitions) in the voiced regions associated with the adjacent vowels, and in the spectrum of the release noise. Similarly, in such words as can or some , the nasal quality of the coda consonant sometimes begins in the preceding vowel, well before the oral closure for the nasal. Another example is the migration of rhotic quality from the /r/ in utterances of Africa across the intervening /f/, to appear during the initial /ae/ ( Espy-Wilson, 1987 ).

In addition to evidence for the temporal overlap of information about multiple speech sounds, these signal analysis tools also supported an idea that had long been promulgated by phoneticians attempting to capture the sound systems of dialects and undocumented languages: that the same contrastive sound category (e.g., the voiceless alveolar stop /t/ or the rounded high back vowel /u/) could be realized in quite different ways in different contexts. Such variants were called allophones, reflecting the idea that they were alternative implementations of the same phoneme. Among the first dimensions that investigators used to characterize the contexts that evoked such distinctions in the phonetic realization of a sound category were position (e.g., structural location in a constituent, such as initial, medial, or final in a word or syllable), adjacent segments (e.g., singleton onset vs. consonant cluster), and adjacent stressed versus reduced vowels. A classic example is found in the positional allophones of /t/ in American English, which include at least the following:

Word-initial or prestressed /t/ ( top, today, return ), where /t/ takes the form of an aspirated stop

Following an /s/ in an onset cluster ( stop ), where the aspiration is reduced or missing

Preceding an /r/ in an onset cluster ( try ), where the release is lengthened to resemble an affricate like /tʃ/

Word-medially between a strong and a weak vowel ( city, lotto ), where the /t/ is usually shortened or even produced without a full closure to form a flap (ranging from a short closure with pressure buildup and release, through a dip in amplitude with continuous voicing and sometimes a small release burst riding on one pitch period, to no obvious amplitude reduction)

In final position ( pot, repeat ), where the /t/ may be produced with closure but without release, may be strongly released, may be implemented acoustically as a sequence of irregular pitch periods at the end of the vowel with no obvious acoustic indication of oral closure, pressure buildup or release, or may be seemingly omitted altogether (e.g., in a cluster of alveolar coda consonants, such as the /st/ in lost )

Between two nasals (as in some variants of Clinton or mountain ), in which the following [-ən] is produced as a syllabic /n/ and the /t/ is implemented as a glottal closure and release

This is only one of many possible examples; a wide variety of English phonemes are realized in different ways depending on their segmental and structural contexts, and it was a natural assumption that these differences could be captured in terms of the same kinds of distinctive feature differences as those that relate the various forms of a morpheme that occur in different words. However, the weight of the evidence from increasingly detailed acoustic analysis of phrase-level phonetic phenomena suggests that a different vocabulary may be more appropriate for capturing these contextual variations.

Any discussion of phonetic variation and the factors that govern it must begin with the explicit recognition that words are made up of different combinations of a small number of elements. Early writing systems reflected linguistic units of different sizes, sometimes morphemes, sometimes syllables, and at least once, for the Phoenicians, individual sound segments, and the development of orthographic symbols for these elements largely ignored the contextually governed phonetic differences. Early grammarians of sound (dubbed phonologists by Trubetzkoy, 1939 ) called these individual sound segments phonemes ( Baudouin de Courtenay, 1880s/1972 ), and later practitioners hypothesized that they were defined by distinctive features that differentiate the contrastive sound categories of a language and relate them to each other ( Jakobson, 1941/1968 ). Several different frameworks for defining the distinctive features have been proposed, including both acoustic and articulatory characteristics ( Jakobson, Fant, & Halle, 1952 ), or more purely acoustic ( Stevens, 1998 ) or articulatory ( Chomsky & Halle, 1968 ) characteristics. These distinctive features define natural classes of sounds (e.g., vowels and consonants; stops, fricatives, and nasals; high and low vowels; labial and velar consonants) whose members undergo similar processes in similar contexts, and describe the kinds of systematic changes that occur when morphemes are combined into words. Chomsky and Halle’s (1968) volume The Sound Pattern of English described feature-changing phonological processes, such as those reflected in the relationship between the final /k/ of electric or domestic and the corresponding /s/ in electricity or domesticity , as well as the relationship between their respective stress patterns.

The postulation of phonemes as bundles of distinctive features, some of which can change in within-word contexts, seemed to provide a natural way of describing the kinds of sound changes that occur in word combinations. For example, it seemed reasonable to draw a parallel between the change that occurs in the lexical form of the /n/ of the prefix in- , in such words as important (where it becomes the labial nasal /m/ under the influence of the following labial /p/), indeed (where it maintains its alveolar place feature under the influence of the following alveolar /d/), and income (where it becomes the velar nasal /ŋ/ under the influence of the following velar /k/), and the change that occurs in the /n/ of the preposition in , in phrase-level word combinations such as in Boston, in Denver , and in Ghana . Similarly, it was widely assumed that the ways in which positional allophones in a language, or differences in the implementation of a single contrastive category across languages, relate to each other is by changes in the value of individual distinctive features.

However, this picture of a single feature-changing mechanism as the only means of relating a word form to its variants in a language, or a sound category to its variants across languages, began to change as the understanding of phonetic variation in spoken word combinations grew deeper. This evolution resulted in part from the development of convenient tools for the display and analysis of speech signals. Following on the advent of acoustic recording devices in the late 1800s, important further tools were the sound spectrogram developed in the 1940s (which displayed changes in the distribution of energy across the frequency spectrum over time), and digital tools for further analysis and display of speech signals on mainframe computers and desktops, such as xWaves ( Talkin, 1995 ) and the Klattools ( Truslow, 2010 ), which enabled the convenient computation of an individual spectral slice at a particular time point, with quantitative estimates of the frequencies of peaks in the spectral distribution of energy. Finally, the advent of laptop computers and a powerful piece of analysis freeware called Praat ( Boersma, 2001 , Boersma & Weenik, 2012 ) brought convenient acoustic-phonetic analysis to just about anyone. These tools revealed even more clearly that speakers do not produce words as strings of discrete sounds, but as a sequence of overlapping and somewhat contextually distorted remnants of the original target sounds that defined the words. An extreme form of this view was described by Hockett (1955) , who proposed the analogy of the individual sounds of a target word as being like a line of brightly patterned Easter eggs, and the speaking process as being like moving that line of distinct eggs down a conveyor belt toward a pair of rollers, which squeezes and smashes them into a flattened pattern of randomly arranged pieces. Although this metaphor captures the fact that spoken word forms do not preserve the temporal alignment of cues to each individual sound segment, it fails to do justice to the informative systematicity of phrase-level phonetic processes.
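
To make the kind of analysis these tools provide concrete, the short sketch below computes a wide-band spectrogram of a recording, the time-by-frequency display of energy from which formant transitions and overlapping cues of the sort described above are read. It is a minimal illustration in Python using general-purpose signal-processing routines rather than any of the specific tools cited above, and the file name is hypothetical.

```python
# A minimal sketch (not any specific tool mentioned above) of the kind of
# time-frequency analysis a sound spectrograph or Praat provides: energy as a
# function of time and frequency. The filename is hypothetical.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("africa_token.wav")   # hypothetical recording
if samples.ndim > 1:                               # mix to mono if stereo
    samples = samples.mean(axis=1)

# Wide-band settings (short analysis window) resolve formant movement in time.
freqs, times, power = spectrogram(
    samples.astype(float), fs=rate,
    window="hamming", nperseg=int(0.005 * rate),   # ~5 ms window
    noverlap=int(0.004 * rate))

power_db = 10 * np.log10(power + 1e-12)            # log scale, as in standard displays
print(power_db.shape)   # (n_frequencies, n_frames): the familiar spectrogram grid
```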

The Nature and Extent of Phonetic Variation Unveiled

The nondiscrete nature of sound segments in the speech signal inspired several proposals that moved away from the assumption that sound variation was best captured in terms of differences in categorical features. One particularly productive proposal was the importation, from task dynamics approaches to general motor control ( Saltzman & Kelso, 1987 ), of the idea that variations in the phonetic implementation of word forms could be usefully described in terms of timing overlap and spatial reduction of the articulatory configurations that create the acoustic speech signal ( Fowler, Rubin, Remez, & Turvey, 1980 ; Saltzman & Munhall, 1989 ; Browman & Goldstein, 1986 , 1989 , 1990 , 1992 ). It seemed natural to press the matter further (i.e., to propose that lexical representations themselves are stored in the form of such gestural scores, coordinated sequences of articulatory gestures). Joining the phonological representation in the lexicon and the phonetic adjustments in the speech signal into a common articulatory representation was felicitous in another way: more and more investigations began to reveal that phonetic variation is not always categorical (as feature-changing processes might predict), but instead is often continuous-valued.

For example, Zsiga (1997) described two different vowel-changing processes in the Igbo language: vowel harmony (in which the addition of a suffix to a root results in changes in the features of earlier vowels to harmonize with those of the suffix) and interaction between two adjacent vowels. Zsiga reported that although a harmonized vowel was indistinguishable from a nonharmonized vowel of the same category, a vowel influenced by an adjacent vowel showed a gradient distribution in measured formant values. Such continuous-valued changes are much more easily accounted for in a model that can adjust temporal and spatial specifications gradiently than in a model that has only a single categorical feature-changing mechanism. Thus, Zsiga’s result supports a model in which both mechanisms are available to a speaker: a feature-changing mechanism that results in categorical change, and a mechanism for computing articulatory overlap, which results in gradient change.
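
The categorical/gradient distinction Zsiga drew can be made concrete with a small illustration. The sketch below, which uses invented numbers rather than Zsiga's data, compares the F1 distribution of context-affected tokens with that of canonical tokens of the same vowel category: a categorical (feature-changing) process should leave the two distributions indistinguishable, whereas a gradient (overlap-based) process should shift tokens partway toward the neighboring category.

```python
# A schematic way (with made-up numbers, not Zsiga's data) to operationalize the
# categorical/gradient distinction: compare the F1 distribution of context-affected
# vowel tokens with that of canonical tokens of the same category.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
canonical_f1   = rng.normal(500, 30, 60)          # canonical tokens of the category (Hz)
harmonized_f1  = rng.normal(500, 30, 60)          # harmony: same distribution (categorical change)
assimilated_f1 = rng.normal(560, 45, 60)          # adjacent-vowel influence: shifted, more variable

for label, tokens in [("harmonized", harmonized_f1), ("assimilated", assimilated_f1)]:
    stat, p = stats.ks_2samp(canonical_f1, tokens)
    print(f"{label}: mean F1 = {tokens.mean():.0f} Hz, KS p = {p:.3f}")
# Indistinguishable distributions are consistent with a feature-changing (categorical)
# account; a reliable shift with intermediate values suggests gradient articulatory overlap.
```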

An additional advantage for mechanisms that could account for systematic multi-valued or continuous-valued variation lay in the ability to deal with the fact that speakers produce word forms with a range of phonetic values that vary with their location in hierarchical prosodic structures. In the 1970s and 1980s, several linguists had proposed the existence of a prosodic hierarchy of constituents, ranging from the utterance down through the intonational phrase, the prosodic word, and subword constituents, such as the foot, the syllable, and the mora (Hayes, 1984; Nespor & Vogel, 1986; Selkirk, 1984 ). Similarly, Beckman and Edwards (1994) proposed a hierarchy of prosodic prominences, ranging from the most prominent word or syllable in an intonational phrase (the nuclear accent; Halliday, 1967 ), down through prenuclear accents and unaccented full-vowel syllables bearing primary word stress, to full-vowel syllables without primary lexical stress and reduced syllables ( Okobi, 2006 ). Various investigators tested the hypothesis that such hierarchical representations (derived in part from, but yet still independent of, the morphosyntactic structures of the sentence; Ferreira, 1993 , 2007 ) play a role in speech production planning and implementation. These investigations uncovered systematic variation in the nature of phonetic implementation with level in the prosodic hierarchy.

For example, Pierrehumbert and Talkin (1992) and Dilley, Shattuck-Hufnagel, and Ostendorf (1996) reported more frequent occurrence of irregular pitch periods (“glottalization”) for word-onset vowels that occur at the beginning of intonational phrases and of pitch accented words; Wightman, Shattuck-Hufnagel, Ostendorf, and Price (1992) and Ladd and Campbell (1991) reported monotonically increasing constituent-final lengthening with higher levels in the prosodic constituent hierarchy; and Jun (1993) , studying Korean, reported longer voice onset times (the delay between release of a stop consonant and the onset of vocal fold vibration for the following vowel) at the onset of increasingly higher-level prosodic constituents. Keating and Fougeron (1997 ; see also Fougeron, 1998 , 2001 ) reported a corresponding hierarchical articulatory correlate of prosodic structure in the form of articulatory strengthening at the onsets of prosodic constituents.

Variation of this kind is well-suited to signaling other kinds of information that are also highly relevant to acts of communication. For example, in a groundbreaking series of studies, Labov and colleagues (1966 , 1972) demonstrated highly systematic variation in the subfeatural characteristics of vowels in different neighborhoods within a city. In addition to signaling the geographic origin and tribal affiliation of a speaker, several investigators have argued that such dimensions as noncontrastive variations in vowel quality, voice quality, and hyperarticulation can signal the attitudinal and emotional state of the speaker. Especially noteworthy in this regard is the work of Kohler (1990 , 2011 ), Coleman (2002 , 2003 ), and Hawkins (2003 , 2011 ), discussed later.

Moreover, even noncategorical variation may nevertheless contribute to the listener’s ability to identify the speaker’s intended linguistic constituents and structure. For example, Klatt (1976) pointed out that the duration of a segment (in his case, /s/ in American English) can be influenced by many factors, including position in constituent structure, adjacent segments, and lexical stress. A well-known example of segmental cue ambiguity is found in some utterances of the sentence The sky is blue , which (with enough reduction in the first vowel) can be heard as This guy is blue , when lack of aspiration for the /k/ of sky (because of the preceding tautomorphemic /s/) and the ambiguous nature of the duration of /s/ (shortened because of its position in a word-initial /s/+stop cluster) make the cue pattern also consistent with a pronunciation of This guy . Like the visually ambiguous Necker cube, a listener can find such an utterance flipping back and forth between the two possible interpretations on repeated listening. Phenomena like this are consistent with the view that perception involves the detection of individual cues rather than entire segments, with subsequent parsing and interpretation of those cues in ways that can be influenced by other aspects of the perceiver’s knowledge ( Cutler, 2010 ; Gow, 2002 ; Stevens, 2002 ).
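
The way multiple factors jointly determine a segment's duration can be illustrated with a toy calculation, loosely modeled on the multiplicative duration-rule scheme Klatt later used for synthesis by rule, in which each factor scales the compressible portion of an inherent duration above an incompressible minimum. All of the numbers below are illustrative only.

```python
# A toy illustration of how several factors can jointly determine segment duration,
# loosely following a multiplicative duration-rule scheme. All numbers are illustrative.
def segment_duration_ms(inherent_ms, minimum_ms, factors):
    """The compressible portion is scaled by the product of all applicable factors."""
    scale = 1.0
    for f in factors:
        scale *= f
    return minimum_ms + (inherent_ms - minimum_ms) * scale

# Hypothetical /s/ with a 160 ms inherent duration and a 60 ms incompressible minimum.
casual_cluster = segment_duration_ms(160, 60, factors=[0.7, 0.85])  # unstressed, in /s/+stop cluster
phrase_final   = segment_duration_ms(160, 60, factors=[1.4])        # phrase-final lengthening
print(f"{casual_cluster:.0f} ms vs. {phrase_final:.0f} ms")
# The same underlying /s/ surfaces with very different durations, so duration alone is an
# ambiguous cue unless the listener also takes the structural context into account.
```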

A second example illustrates a parallel ambiguity for some prosodic contours: when the final syllables of an intonational phrase contain a sequence of two full vowels, as in words like compact or digest , realized with an F0 movement, such as a High→Low, it is sometimes difficult to determine whether or not the boundary-related intonation is also conveying a pitch accent. Thus, some pronunciations sound consistent with both interpretations: the noun COMpact , with a pitch accent on its initial and main-stressed syllable, or the verb comPACT , with a different type of pitch accent on its second and main-stressed syllable. Such ambiguities and ambiguity resolutions ( Dilley & Shattuck-Hufnagel, 1998 , 1999 ) provide a window into how listeners parse cues in context, and raise the possibility that speakers might plan the phonetics of their utterances in terms of such cue patterns (discussed later).

Another set of studies that illustrated continuous rather than categorical behavior examined the effects of word frequency/predictability on the phonetic implementation of phonological word forms. Fowler and Housum (1987) showed that the second mention of a word in a short discourse is more reduced than the first mention, and Bybee and colleagues (e.g., Bybee, 2001 ) took this concept further, documenting the effects of frequency on the phonetic implementation of words. For example, they examined the phonetics of homophonous words, such as the seldom-reduced numeral four and the often-reduced function word for , the various lexical items associated with that (e.g., the seldom-reduced pronoun, as in That is the one , vs. the often-reduced complementizer, as in I knew that he was coming ), and found striking differences in the phonetic implementation of these separate words, despite their identical phonemic form. Johnson (2004) described a range of different types of phonetic variation in his landmark study. In a series of corpus studies, Jurafsky and colleagues (e.g., Jurafsky, Bell, Fosler-Lussier, Girand, & Raymond, 1998 ) described related phenomena and pointed out that it was difficult to disentangle the effects of word frequency from predictability in context, because high-frequency words are more predictable overall. Nonetheless, these studies revealed that the gradient nature of phonetic implementation of word forms was strongly influenced by patterns of use.

Interestingly, many blending and reduction phenomena occur over portions of utterances that are larger than a single word, and even over sequences of several words. Examples, described here in symbolic terms for convenience, include such sequences as cuppa for cup of , doncha for don’t you , [goinə] for going to , and such longer sequences as [amənə] for I’m going to , waintʃə for why don’t you , and haudʒə for how did you . Like earlier examples cited to illustrate interactions within words and across word boundaries, these phenomena resist description in terms of sequences of segments or feature bundles, in part because the acoustic-phonetic cues are not segmentally aligned.

It should be noted that these empirical studies have not simply remained unconnected phonetic observation but have been incorporated into several theoretical perspectives. The theory of autosegmental phonology, proposed by Goldsmith (1976) , explicitly suggested that distinctive features are not aligned into autonomous phonemic segments or feature bundles, but instead are represented on separate tiers, so that a given feature value could extend across material drawn from several traditional segments, including the segments of more than one word. At about the same time, Ogden and Local and their colleagues (e.g., Ogden & Local, 1994 ) were giving new emphasis to an idea proposed earlier by Firth (1948) , termed “prosodies” (a term that should, for the present at least, be sharply distinguished from the hierarchy of prosodic structures described previously). A Firthian prosody is a region of an utterance over which a “phonetic exponent,” such as nasalization or palatalization, is realized. Ogden and Local observed that such characteristics can extend over a relatively long region of an utterance, and that this region did not always correspond to a morphosyntactic constituent, just as Firth had proposed. Such a view, which has also been extensively explored in the work of Kohler and colleagues (German) and of Hawkins and colleagues (British English), is reminiscent of the multiword blending phenomena seen in the previous American English examples.

The foregoing brief and necessarily limited presentation highlights two contrasting trends in how speakers implement word forms in continuous speech. First, there is a high degree of apparent elimination of important information about the speaker’s intended words, including blending, temporal overlap, apparently missing parts of words, and extreme reduction of articulatory gesture size and acoustic differentiation. Second, because this structure-governed variation is systematic, it is highly informative. Thus, the field was simultaneously moving deeper into phonetic detail (and documenting apparent loss of information), and higher into the structures that govern it (and finding gains in information). Additionally, as researchers attempted to delineate the range and nature of phonetic variation and how it arises, the field was also moving outside of traditional syntactic structures (to address prosodic structures) and traditional distinctive feature representations (to address continuous-valued variation). As Cho and Jun (2000) remark, this makes it necessary to control for position in prosodic structures when eliciting or selecting utterances for analysis, but it also indicates that speakers are providing information about those prosodic structures in the systematic nature of some aspects of that variation.

What has been learned about the patterns of phonetic variation in connected speech phenomena as a result of these developments? First, the kind of positional variation captured by the categories of the International Phonetic Alphabet (IPA; and other categorical, invariant-based theories) is just the tip of the iceberg. Many changes take place through interactions across word boundaries; others involve severe reduction (as when and in black and white becomes a single syllabic nasal, or did you eat in did you eat yet becomes jeet in something like jeet jet ). Moreover, these changes do not always lend themselves to being captured by the traditional feature-changing rules of generative phonology. Instead, careful acoustic analysis has revealed that information about different target segments may overlap in the signal, as when the word can’t is produced with a partially nasalized vowel, no oral closure for the /n/, and a shortened nucleus combined with period of irregular phonation indicating the voiceless final /t/. Feature-changing rules and phonetic-category-selection mechanisms alike presumed a complete change in the category of a segment, whereas a growing body of evidence suggested that a more complex process involving more fine-grained specifications is at work. In fact, many types of variation that had been described in categorical terms were shown to leave traces of the original target segment in the signal, and further experiments showed that listeners could use this information to infer the original target string. Some of this evidence is described in the following section.

Information about Word Form Preserved Despite Reduction and Overlap

An example of information-preserving change in continuous speech is the production of the interdental voiced fricative /ð/ in American English. This sound, common at the onset of function words, such as the, this , and that , is often produced in a stop-like manner, and transcribed as /d/. However, Zhao (2010) showed in a careful analysis of the spectra of the release of stop-like /ð/ that such tokens are acoustically distinct from /d/; their spectra reflect an interdental constriction location. Another pervasive phenomenon that involves /ð/ in American English is commonly observed in /n+ð/ sequences, such as in the . When /ð/ is preceded by the nasal /n/, even across a word boundary, speakers often produce a single nasal segment, usually transcribed as /n/. This transcription suggests that the /ð/ has been deleted, leaving just an /n/ between the two vowels. However, Manuel et al. (1992) reported that, in a corpus of such tokens, the sequence in the was distinguished from the sequence in a by the longer duration of the intervocalic nasal region, indicating that there might be a perceptually useful duration cue to the presence of the /ð/, despite the absence of a region of voiced frication in the signal. Manuel (1995) took this question one step further, by asking whether, even in sequences like win those versus win No’s , where both of the intervocalic sequences contain two consonants, there was an acoustic cue to the /ð/ in the form of a higher frequency for the second formant at the onset of the second vowel. The answer was yes; in cases where the /n+ð/ sequence was produced with no voiced fricative in the signal but only a nasal, the initial frequencies of following vowel formants differed significantly from /n+n/, consistent with the smaller front cavity for the interdental place of articulation versus the larger front cavity for an alveolar constriction. Thus, speakers seem to combine feature cues from two adjacent segments into a single interdental nasal articulation.

A particularly interesting aspect of Manuel’s results was that the combination of cues to features of both /n/ and /ð/ was easily parsed by the listener into the underlying target sound sequence. When she tested listeners’ perceptions (using synthesized stimuli of win those and win No’s , which differed only in the single cue of the frequency of F2 at release), they consistently distinguished the two sequences. Thus, the cue to the place of articulation feature of /ð/, which overlapped completely with the cues to the following vowel, allowed the listener to infer the underlying /n + ð/ sequence.
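
The measurement underlying Manuel's comparison can be sketched as follows. The example assumes parselmouth, a Python interface to Praat that is not part of the work cited here, and its formant-query methods; the file names and the hand-labeled vowel onset times are hypothetical.

```python
# A minimal sketch of the kind of measurement behind Manuel's comparison: F2 at the
# onset of the vowel following the nasal in "win those" vs. "win No's". Assumes the
# parselmouth package (Python interface to Praat); filenames and times are hypothetical.
import parselmouth

def f2_at_vowel_onset(wav_path, vowel_onset_s):
    snd = parselmouth.Sound(wav_path)
    formants = snd.to_formant_burg()                       # Burg formant tracker, default settings
    return formants.get_value_at_time(2, vowel_onset_s)    # F2 (Hz) at the labeled vowel onset

f2_nth = f2_at_vowel_onset("win_those_token.wav", 0.412)   # /n+dh/ realized as a single nasal
f2_nn  = f2_at_vowel_onset("win_nos_token.wav", 0.398)     # /n+n/ control
print(f"F2 at vowel onset: n+dh {f2_nth:.0f} Hz vs. n+n {f2_nn:.0f} Hz")
# A consistently higher F2 after the /n+dh/ sequence would be the acoustic trace of the
# interdental constriction, even when no voiced frication appears in the signal.
```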

Related work by Gow (2002 , 2003 ) similarly suggests that the speaker’s tendency to overlap individual feature cues is complemented by the listener’s ability to parse them to infer the intended segments. Gow elicited casual productions of sequences, such as right berries and ripe berries , and asked listeners to transcribe them orthographically. In this overt transcription task, many tokens of right berries were transcribed as ripe berries , presumably because the closure for the /b/ of berries began before or co-occurred with the closure for the preceding coda /t/. Such transcription results suggest that a substitution of /p/ for /t/ had occurred. However, using an online lexical decision task, Gow showed that for listeners, tokens of right berries that were transcribed as ripe berries nevertheless activated the word right , rather than ripe . This was shown by faster reaction times to decide that a word related to right was an existing lexical item of English, than for a word related to ripe . Like Manuel’s finding for win those , this result supports the view that individual cues to a feature of an apparently missing segment can be preserved in production, overlapping with cues for a different segment; and listeners can disentangle these cues and infer the speaker’s intended sequence of segments.

Other lines of work also support the view that speakers leave behind cues to apparently deleted segments, or to the original identity of apparently changed segments. For example, there is growing evidence that cases of “neutralization,” such as for the voiced-voiceless distinction in final position in German ( Port, Mitleb, & O’Dell, 1981 ) and in Dutch ( Ernestus, Baayen, & Schreuder, 2002 ), may not be entirely neutralized. In a different domain regarding sound-level speech errors, Goldrick and Blumstein (2006) found acoustic evidence for the “missing” target segment in apparent sound substitutions, and Pouplier and Goldstein (2010) found articulatory evidence for target and intrusion segments in errors elicited by repetitive regularly rhythmic tongue twisters. Kohler (1999) , working on spontaneous speech in German, uses the term “articulatory residues” for such phenomena in continuous communicative speech:

Nasalization may become a feature of a syllable or a whole syllable chain and not be tied to a delimitable nasal consonant. The same applies to labi(odent)alization. Both nasal and labi(odent)al consonants may be absent as separate units as long as the nasal and labial gestures are integrated in the total articulatory complex. This results in articulatory residues in the fusion of words. (p. 92)

Speakers not only provide cues to apparently missing or changed segments and their features; they also often signal the way those segments should be grouped into words. Lehiste (1960) documented a number of cues that signal the presence of a word boundary, disambiguating such sequences as my seat versus mice eat , and several investigators have reported an increased probability of irregular pitch periods for word-initial vowels (and sometimes sonorant consonants) at the beginnings of intonational phrases ( Pierrehumbert & Talkin, 1992 ) and at accented syllables ( Dilley et al., 1996 ), potentially signaling the constituent structure of word sequences; Surana and Slifka (2006) argued that this phenomenon could also aid in word segmentation. Phrase-final creak and phrase-final lengthening can presumably also reinforce the intonation cues to phrasing in languages like English. Thus, despite sometimes severe reduction or merging of sounds across constituent boundaries, the speaker nevertheless often produces enough cues to those boundaries to signal the intended structure.

Another such example is an early finding by Cooper (1991) showing that word-onset voiceless stops like /t/ are normally produced with an open glottis (and thus considerable aspiration noise) even before a schwa vowel, as in today, tomorrow, tonight, Toledo, potato, polite, Canadian , and collection . That is, the reduction to weaker versions of these stops that is usually seen before schwa vowels in word-medial position (e.g., papa, mocha, beta ) or across a word boundary ( keep a, met a, block a ) is avoided in word-onset position, providing a cue to the word affiliation of the stop. Yet another example is found in a study by Manuel (1991) that compared the English word sport with reduced forms of the word support , which can sometimes be produced with no visible or audible first vowel. In such tokens of s’port , the /p/ was produced with aspiration after the release, rather than with the nonaspirated form seen after the tautosyllabic /s/ in sport . Thus, evidence for the original reduced vowel, and thus for the correct contrastive word form, is contained in the signal despite the lack of a “vocalic” region for the first syllable. Davidson (2006) reported a continuum of possible elision degrees for such vowels with increasing speaking rate, in such words as potato , arguing that this supports a gestural overlap rather than a schwa-deletion account.

Four additional lines of evidence are mentioned here. The first comes from analyses of possible “resyllabification” in American English by de Jong, Lim, and Nagao (2004) . Although this term is often used, and deJong’s work (and others) shows that some degree of reorganization does occur in repeated utterances of a VC syllable, so that they somewhat resemble CV syllables, the reorganization is not complete. That is, the characteristics of the C in such a transformed production are not exactly like those of the onset in an original CV. Shattuck-Hufnagel (2006) showed similar lack of full resyllabification for final /t/ in such phrases as edit us . Spinelli, McQueen, and Cutler (2003) studied the effects of elision in French, as when the final /t/ of petit restructures to become prevocalic in petit agneau , and found that this process had little if any effect on word recognition in French listeners, despite the seeming erosion of boundaries between the speaker’s intended words. This result is expected if even after words were linked in elision, the signal nevertheless cued the original word affiliation of the features and segments. Additional support for the notion that it does comes from Scharenborg (2010) : her laboratory’s FastTracker algorithm detects boundary- and grouping-related phonetic markers in spoken Dutch, distinguishing syllables that make up monosyllabic words from those that are part of polysyllabic words, and word-initial from word-final /s/.

Taken together, the evidence for (a) widespread and sometimes extreme phonetic variation of word forms in continuous speech, (b) a role for both prosodic structure and frequency of word use (and lexical word structure) in governing that variation, (c) the gradient nature of at least some of that variation, and (d) the observation that the cues left in the signal often permit recognition of segments and their features, and the structural organization of these linguistic elements, inspired new approaches to modeling the cognitive process of speech production planning. Some highlights of these modeling developments are reviewed in the next section.

Models of Connected Speech Production

Up until the 1980s, speech production models were of three major types. The first type focused on broad characterizations of the planning process, largely based on evidence from studies of the systematic ways in which the system can go wrong, such as in speech errors ( Fromkin, 1971 , 1973 ; Garrett, 1975 ; Shattuck-Hufnagel, 1992 ) or in aphasia ( Morton, 1969 ). These models provided constraints on what might constitute an adequate planning model, from the lexical retrieval process ( Dell, 1986 ) through serial ordering of words and sounds. However, they did not extend to the phrase-level phonetic planning process, and there had been few efforts to model the entire process of speech planning and implementation from beginning to end. Most germane to the concerns here, there had been almost no modeling of how context-governed phonetic variation in connected speech might arise, other than the proposition that these processes were mechanical and universal (as suggested in Chomsky & Halle, 1968 ) or that they involved selection among allophonic categories (as suggested by IPA transcription conventions).

These gaps were filled by two modeling developments: Levelt’s (1989) comprehensive model of the entire planning process that underlies speech production, from message conceptualization to motor movements of the articulatory system, and Browman and Goldstein’s (1986 , et seq.) proposal for articulatory phonology (discussed previously). Articulatory phonology modeled how at least some aspects of systematic phonetic variation might come about. It began with assumptions about the primacy of articulation in understanding the phonetics of speech, and worked its way back through the phonetic phenomena of word combinations to claims about word form representations in the lexicon and perhaps even earlier in the production process. The persuasive fit between this model and articulatory data made it unlikely that selection among symbolic categories would ever again be put forward as a serious model for all of phonetic variation.

In contrast, in his monograph Speaking , Levelt (1989) started at the other end of the production planning process, working his way down from message generation to articulation, and likewise changed the landscape of continuous speech planning models irrevocably. That volume integrated the available literature into a model of the entire speech production process, from generation of the message to movement of the articulators—marking a watershed in its degree of comprehensiveness and detail. This was followed a decade later by a paper with two of his colleagues, Antje Meyer and Ardi Roelofs ( Levelt, Roelofs, & Meyer, 1999 ). This paper (henceforth LRM99; see also Levelt 2001 for a shorter summary statement) described Roelofs’ computer implementation of the model, and a series of behavioral experiments that tested some of its predictions regarding reaction times to begin an utterance under various priming conditions. This implementation contained several important changes from the original 1989 description, including the explicit incorporation of articulatory phonology’s mechanisms ( Browman & Goldstein, 1986 , inter alia) for storing and retrieving syllabic articulatory plans. Although it did not deal with adjusting these plans to fit their contexts, it represented a substantial advance over earlier “black box” models that simply fed phonological plans into a module labeled “articulation.” Thus, even though Levelt et al. did not model the actual production of sound, they moved things substantially closer to that goal.

As discussed, prosodic grouping and prominence influence phrase-level phonetics, and in this domain, the LRM99 model made two important moves. First, it computed sound-level plans one prosodic word at a time, accounting for phonetic variation that involves constituents slightly larger than the lexical word, arguing principally from patterns of production of verb+pronoun sequences, such as escort us or heat it . Levelt et al. noted that in some varieties of British English, the final /t/ in such word combinations is produced with a noisy release, suggesting that it has been resyllabified to become an onset in the following vowel-initial word. The ability of a model to account for such cross-word-boundary interactions is critical, and the move to the prosodic word as the articulatory planning unit was an important step in this direction (although it is unlikely that this restructuring is complete).

A second important aspect of the LRM99 model was the inclusion of a later stage of phrase-level prosodic processing, as in the 1989 version. This component accounted for the planning of intonational contours as well as the structures (prosodic constituents and accents) that govern those contours. It could also, in theory, provide an account of systematic boundary- and prominence-related adjustments to the proposed syllable-sized articulatory plans, such as the articulatory strengthening at constituent onsets and duration lengthening at boundaries described previously. However, no explicit mechanism was proposed for the adjustment of the selected syllable-sized articulatory plans to their prosodic contexts.

In addition to articulatory phonology and the LRM model of speech planning, a third development in the modeling of speech production has taken the form of exemplar-based models of production ( Johnson, 1997 ; Pierrehumbert, 2001 , 2002 ). These were inspired by several findings, starting with experimental results showing that listeners process auditorally presented words better (i.e., more accurately and more quickly) when they have previously heard them produced in the same voice ( Goldinger, 1996 ). This suggested that listeners store tokens of earlier auditory experiences of each word in a multidimensional space that includes parameters of individual speaker productions, recognizing incoming words by accessing the best-fitting token from this cloud of stored memories ( Goldinger, 1997 ; MacLennan, Luce, & Charles-Luce, 2003 ). This approach to modeling speech perception, which imported concepts from category learning ( Nosofsky, 1986 ; Hintzman, 1986 ) into the cognition of language processing, opened the door to the possibility of formulating a production model that made use of a similar mechanism, such as selection of the most appropriate token from a set of memories of previously produced tokens stored in a multidimensional space that reflects the dimensions of variation that matter to the listener. Additional inspiration for this proposal came from findings of gradient and lexical-item-specific effects of frequency of word use and word predictability on the production of phonetic variation as described previously. Such word-specific differences are naturally accommodated in an exemplar-based framework, because each word in the lexicon has its own cloud of stored word forms. An exemplar-based approach is also consistent with results reported by Labov and colleagues (e.g., Labov, 1966 , 1972 ) and others (e.g., Foulkes & Docherty, 2006 ), who used detailed acoustic analysis of vowel formants to establish the gradient nature of sociolinguistic variation and sound change. In exemplar models, such patterns can be captured by associating detailed phonetic forms to social/indexical labels.
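
The basic selection mechanism just described can be sketched schematically: each word is associated with a cloud of remembered productions stored as points in a multidimensional space, and production retrieves the stored token that best fits the current context. In the sketch below the dimensions, weights, and values are all hypothetical, and only speaking rate is used as the context dimension.

```python
# A schematic sketch of exemplar selection: stored tokens of a word live in a
# multidimensional space (here duration, degree of reduction, speech rate), and
# production picks the token closest to the current context. All values hypothetical.
import numpy as np

# Stored exemplars of "for": (duration in ms, vowel reduction 0-1, speech rate in syll/s)
exemplars_for = np.array([
    [180, 0.1, 3.0],   # careful, slow production
    [120, 0.4, 4.5],
    [ 70, 0.9, 6.0],   # heavily reduced, fast conversational production
    [ 90, 0.7, 5.5],
])

def select_exemplar(cloud, target_rate, weights=np.array([0.0, 0.0, 1.0])):
    """Pick the stored token whose (weighted) context dimensions best match the
    current speaking context; here only speech rate is treated as context."""
    context = np.array([0.0, 0.0, target_rate])
    distances = np.sqrt(((cloud - context) ** 2 * weights).sum(axis=1))
    return cloud[np.argmin(distances)]

print(select_exemplar(exemplars_for, target_rate=5.8))   # -> the fast, reduced token
# Frequency effects fall out naturally: a frequent word's cloud is dominated by
# reduced conversational tokens, so the selected exemplar is more often reduced.
```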

The advantage of an exemplar-based model for production over a model based on feature-changing rewrite rules (i.e., selection from among phonemically distinct categories) is its ability to account for the nonbinary gradient nature of surface variation. It does this by storing so many remembered tokens of how a word has been produced in different contexts that there is always one available to fit the current context, at least for an adult who has had many decades of speaking experience. Pure exemplar-based theories, however, face several challenges in accounting for speaker behavior in the production of continuous speech. First, exactly how does the speaker decide which exemplar in the cloud of exemplars for a particular word to select as appropriate for a given context? Second, how is it that serial ordering errors involving word subconstituents can occur, if exemplars are retrieved as whole units? Thirdly, by precisely what mechanism does frequency of use lead to selection of exemplars with greater reduction? Fourth, how can such a model account for the speaker’s ability to choose to produce any phrase, no matter how frequently it has been produced in the past, in a clear canonical (i.e., unreduced) or even hyperarticulated manner? Finally, when a speaker knows a word, he or she can do many things that an exemplar model might have trouble accounting for, such as manipulate the subparts of the word in language games and experimental tasks, judge the nature and location of similarities and differences among word forms, transform the word by inflection or derivation, and produce it in contexts that the speaker has not produced before, all abilities that can be accounted for in a unified manner by reference to a traditional phonological lexicon in which word forms are represented as sequences of discrete phonological segments (or feature bundles). Such considerations led to proposals for a hybrid model of production, in which each entry in a traditional phonological lexicon is also associated with a cloud of stored exemplars of that word as already produced ( Pierrehumbert, 2002 , 2003a , 2003b ; Ernestus, in press) . This set of exemplars stores previous experiences that include contextual variation and reduction patterns resulting from frequency of use, and that could be accessed during production, while still providing a mechanism to account for more traditional phonological abilities of the speaker. This approach is discussed further later.

In sum, the past few decades have seen the emergence of serious attempts to quantify the nature and extent of systematic surface variation in word forms, to identify the factors which govern it, and to model the mechanisms by which it occurs during speech production planning and articulation. There have been striking advances in the comprehensiveness of proposed models, and in their ability to make testable predictions about measurable aspects of the acoustics and the articulation of spoken utterances, and about the relationship between these two intertwined aspects of speech. However, one characteristic shared by many of these models is the rejection of traditional lexical representations of word form defined by sequences of phonemes, each of which is defined in turn by a bundle of distinctive features. The impetus to reject this traditional view has come from findings that feature cues are not aligned in the signal, that cue values are often distributed in a continuous-valued rather than a categorical way, and that entire chunks of words are sometimes apparently lost or changed unrecognizably in continuous speech. However, despite the indubitable occurrence of such phenomena, the remaining minimal cues to the distinctive features, phonemic segments, and grouping pattern of the speaker’s intended words are often sufficient to allow the listener to infer those word forms with considerable accuracy. Thus, it might be time to ask in greater detail how the traditional model of lexical word form might be extended to account for the effects of experience on reduction patterns, particularly in light of how well it accounts for the wide variety of things that a speaker knows how to do with a word form.

Feature Cues as a Bridge between Abstract Invariant Lexical Representations and Surface Phonetic Variation

Insights gained from articulatory phonology show that many aspects of phonetic variation in continuous speech result from changes in the temporal overlap of adjacent articulatory movements and in their spatial and temporal extent. Relatedly, insights gained from work in exemplar theory, and a host of studies of phonetic variation in continuous speech, show that speakers produce systematic language-specific, dialect-specific, context-specific, and lexical-item-specific phonetic patterns when they implement a given word form or sound category. Moreover, countless analyses of the speech signal show that acoustic cues to the features of a given segment in a word are not reliably aligned together temporally in the signal. Do these collective observations render untenable the idea that speakers represent words as sequences of phonemes made up of feature bundles? The answer is no; not only are these observations compatible with this traditional view of the lexicon, but if one considers the full range of what a speaker knows about how to process words, then a phoneme-based lexicon (in the sense of phoneme as an abstract symbol for a bundle of distinctive features) may well provide the simplest account.

Several investigators working in various frameworks, such as Cutler (2010) , Ernestus (in press) , Munson, Beckman, and Edwards (in press) , and Pierrehumbert (2001 , 2002 , 2006 ), have made a similar point: that the observed facts about phrase-level phonetic variation seem to require a hybrid model, with the advantages of both a phonemic lexicon and some way of storing more detailed phonetic knowledge. This section explores two aspects of a proposal for how this might work: that when speakers plan the surface phonetic form for a word in a particular utterance, they manipulate not segments or features, but individual acoustic cues to distinctive features; and that knowledge about frequency and its phonetic effects may be stored not for lexical items but for prosodic constituents that can correspond to more than one word.

The type of hybrid production model envisioned here is, like those proposed previously, built around a lexicon whose word forms are represented in traditional phonological terms (i.e., as sequences of abstract symbols that correspond to bundles of distinctive features). Such a representation allows the speaker to generate a production plan for each word that specifies the effectors that will be used to articulate the acoustic cues for each feature bundle. This translation from abstract symbol to proto-motor plan produces a parameterizable representation (i.e., one for which temporal and spatial specifications can be developed for a particular utterance or speaking situation). Studies carried out in the Articulatory Phonology framework suggest that these parameterizable representations are gestural, specifying a sequence of articulatory configurations whose values for temporal overlap and reduction can be adjusted to fit the goals for a given utterance. Evidence from online acoustic perturbation studies (e.g., Cai, 2012 ; Houde & Jordan, 1998 ; Villacorta, Perkell, & Guenther, 2007 ), however, suggests that the goals are sensory, as assumed in Guenther’s DIVA model ( Guenther & Perkell, 2004 ; Guenther, this volume). In some way that is not yet entirely clear, there must be a mapping between the sensory (auditory and proprioceptive) goals and the motor movements that create the articulatory configurations to accurately and appropriately produce those sensory goals.

Under some speaking conditions, such individual word plans are fitted together with appropriate phonetic adjustments for interactions across word boundaries and for the prosodic constituent structure and prominence structure of the particular utterance intended by the speaker. In other conditions, however, and perhaps typically in conversational speech, there is an additional route available, via retrieval from a set of stored but parameterizable representations of certain constituents that the speaker has produced before. The proposal here is that these stored representations of past production plans, with their parameterizable spatial and temporal targets, are the locus of frequency-based effects on production, specifying greater articulatory reduction and overlap for high-frequency constituents that can nevertheless be overcome, if the speaker prefers it, by the less-reduced instructions that emerge from the encoding process that starts with the traditional phonemic lexicon. However, these proposed stored plans differ from existing hybrid models in one significant dimension: they need not correspond to individual words, because they can correspond to prosodic constituents at any level of the prosodic hierarchy, from the highest to the lowest. Thus, they can in principle account for “reduction constituents” ranging from a pair of adjacent phonemes (as in the coda cluster /st/ in lost produced without a /t/ closure or release) to an entire utterance (as in I don’t know produced as a low-high-low-high intonation contour on a single nasalized vowel sound [ Hawkins, 2003] ). Because the stored elements that support this second planning mechanism are not associated with single lexical items, their direct link to a single lexical meaning is severed. Instead, they form what might be called a prosodicon of constituents, separate from the form-meaning pairings in the lexicon. Activated by their corresponding lexical word sequences, these stored prosodic constituent plans facilitate specification of the phonetic shape of an utterance, because they are represented not in terms of abstract symbolic categories but in a vocabulary closer to what the motor system can use, such as abstract articulatory gestures.
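
The two-route architecture proposed here can be sketched as a simple data structure: a "prosodicon" of stored, parameterizable plans for frequently produced prosodic constituents, with composition from the phonemic lexicon as the always-available (and less reduced) fallback that the speaker can choose when a canonical or hyperarticulated rendition is desired. All names, fields, and values in the sketch are hypothetical placeholders for the much richer representations discussed above.

```python
# A schematic sketch of the proposed two-route planning architecture: retrieve a stored,
# parameterizable constituent plan from a "prosodicon" when one exists, or fall back to
# composing a canonical plan from the phonemic lexicon. All names and values hypothetical.
from dataclasses import dataclass

@dataclass
class StoredPlan:
    words: tuple              # the word sequence this constituent plan corresponds to
    gestures: list            # placeholder for a parameterizable gestural/cue plan
    default_reduction: float  # how much overlap/reduction past use has licensed

PROSODICON = {
    ("i", "don't", "know"): StoredPlan(("i", "don't", "know"),
                                       gestures=["nasalized-vowel contour"],
                                       default_reduction=0.8),
}

def plan_constituent(words, style="conversational"):
    """Retrieve a stored constituent plan when one exists and the style allows it;
    otherwise compose a canonical plan word by word from the phonemic lexicon."""
    stored = PROSODICON.get(tuple(words))
    if stored is not None and style != "hyperarticulated":
        return ("stored plan", stored.default_reduction)
    return ("composed from lexicon", 0.0)   # canonical route: no frequency-licensed reduction

print(plan_constituent(["i", "don't", "know"]))                      # ('stored plan', 0.8)
print(plan_constituent(["i", "don't", "know"], "hyperarticulated"))  # ('composed from lexicon', 0.0)
```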

Such a mechanism is consistent with the concept of optimization, as suggested in the DIVA model ( Guenther & Perkell, 2004 ) and Keating’s window model ( Keating 1990 ), although it suggests that additional factors other than the balancing of efficient articulatory paths with meeting sensory goals may be involved. Note that this proposal provides a mechanism that frees the speaker from deterministic control by the mere arithmetic accumulation of exemplars of reduced forms. Past experience can be overridden by the speaker, because even high-frequency constituents can be produced in canonical or even hyperarticulated form when desired. The proposal that these stored elements are prosodic constituents is consistent with approaches like the Prosody First model of Keating and Shattuck-Hufnagel (2002) , in which the planning frames for spoken utterances are made up of a hierarchy of prosodic constituents; elements of the prosodicon would fit neatly into such planning frames.

A second way in which this sketch of a hybrid model differs from existing models is in the elements that are manipulated in the setting of phonetic goals: it is proposed that these are the acoustic landmarks (i.e., the moments of abrupt spectral change associated with quantal changes in the acoustics at certain points during the continuous-valued changes in the articulatory configuration during the production of an utterance). Examples include the release burst of a stop consonant or the onset of frication noise for a fricative. In the perceptual domain, landmarks have been proposed as critical events that can be robustly detected by the perceptual system, so that their detection serves as the first step in the speech perception process. Stevens (2000 , 2002 ) and Stevens and Keyser (1989 ; Keyser & Stevens, 2006 ) outline a model based on individual acoustic cues to the distinctive features that define, differentiate, and relate the word forms in the lexicon. In their model, a given feature contrast can be signaled by a number of different acoustic cues, and the speaker’s cue selection can vary systematically with the other features in the feature bundle (e.g., labial is signaled differently for stops than for fricatives in English), with the nature of adjacent segments (different cues for a word-medial voiceless stop consonant before a stressed vowel vs. a reduced vowel), and with position in the larger context (word-initial /t/ has a different range of cues than word-final /t/). Although Stevens and Keyser do not propose a production model based on acoustic cue selection, one can envision extending their proposals for speech perception in this way. Individual feature-cue manipulation would be consistent with constraints on phonetic reduction that are not necessarily predicted by the physiology of the vocal tract; speakers may be constrained to preserve at least one cue to at least one feature of certain phonemes in a word in many reduction processes (although perhaps not in all of the most extreme cases), providing listeners with minimal cues to distinguish one contrastive word form from another. For example, informal observation suggests that when speakers produce a reduced form of the English phrase why did you , they can fully palatalize the cross-word-boundary sequence /-d+y-/ to produce something very close to the voiced palatal affricate /dʒ/, whereas when they produce a reduced form of why do you , there is much less overlap, resulting in less affrication and more of a glide-like production after the stop. This preserves cues to the difference between an interaction of the /y/ with a final /d/ in why did you but with the initial /d/ in why do you . These preserved cues are informative about the lexical items, whereas the pattern of cue loss and change may be informative about other aspects of the communicative act, such as the state of mind and social affiliations of the speaker. Moreover, intervals between acoustic landmark cues are good candidates for the computation of timing patterns for individual utterances, which are influenced by many phrase-level factors.

A critical aspect of this idea is that the speaker formulates a plan for phonetic implementation in terms of which cues to a distinctive feature will be realized, and with what amplitude and timing they will be realized. On this view, the cues in one particularly informative class (i.e., the abrupt changes associated with articulatory closures and releases, called landmarks) are good candidates for the events that might be timed to implement duration shortening or lengthening. This move to the individual feature cue, rather than the feature or the segment, makes it possible to deal with many aspects of subfeatural variation.
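
Although the landmark-based planning proposal itself is not computational, the underlying notion of abrupt spectral change can be illustrated with a rough sketch: frame-to-frame spectral change (spectral flux) peaks at events such as stop releases and frication onsets, and those peaks serve as candidate landmarks whose timing could then be manipulated. This is not Stevens's landmark detector, only a minimal approximation of the idea, and the file name is hypothetical.

```python
# Not Stevens's landmark detector, but a minimal sketch of the underlying idea: abrupt
# spectral change (e.g., at stop releases or frication onsets) shows up as peaks in
# frame-to-frame spectral difference, yielding candidate landmarks. Filename hypothetical.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram, find_peaks

rate, samples = wavfile.read("utterance.wav")          # hypothetical recording
if samples.ndim > 1:
    samples = samples.mean(axis=1)

freqs, times, power = spectrogram(samples.astype(float), fs=rate,
                                  nperseg=int(0.010 * rate))
log_power = 10 * np.log10(power + 1e-12)

# Spectral flux: summed positive frame-to-frame change across frequency bands.
flux = np.maximum(np.diff(log_power, axis=1), 0).sum(axis=0)
peaks, _ = find_peaks(flux, height=flux.mean() + 2 * flux.std(),
                      distance=int(0.02 / (times[1] - times[0])))   # keep peaks >= 20 ms apart
print("candidate landmark times (s):", np.round(times[1:][peaks], 3))
```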

The history of phonology and phonetics can be looked at as the evolution of attempts to find the appropriate set of elements to capture the invariance and the variability in word forms. From very early grammarians, such as Panini, who described sets of sounds with common characteristics in Sanskrit, through Baudouin de Courtenay, Trubetzkoy, and de Saussure to Jakobson, and the development of the IPA as a system for describing allophonic variation, there has been a search for the linguistic elements that define word forms, describe the relationships among them, and capture the range of systematic phonetic variation in their production. More recently, phonology and phonetics have focused on systematic subfeatural variation among word forms and their sounds, highlighting such phenomena as duration cues and articulatory strengthening at prosodic boundaries and prominences; differences in articulatory overlap with structural position; and language-specific, dialect-specific, and speaker-specific patterns of variation in phonetic implementation. For some investigators, these observations have cast doubt on the need for the phoneme (bundle of distinctive features) as a representational unit. Throughout this process, however, one bedrock fact has remained: one word form is distinct from other word forms in the language, and yet related to them and to the various surface forms it can take in speech, in highly systematic ways that a native speaker understands.

The best way to account for this set of facts is in terms of the speaker’s knowledge of a lexicon of word forms made up of phonemes defined by contrasting features, and the set of acoustic-phonetic cues that are appropriate for signaling those contrasts in different contexts. Evidence consistent with this view is found in the observation that, even in reduced and overlapped tokens, speakers often provide cues to the features of segments that are apparently “missing” from the utterance, and to the higher-level structural affiliations of those features and segments. This suggests that speakers have the capacity to explicitly represent and manipulate individual cues during the production planning process, and that the goal of preserving cues to features and structures constrains phrase-level adjustments to word forms. In addition, speakers may store and retrieve precompiled motor programs for frequently produced prosodic components, permitting the further reduction of these elements.
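
To illustrate the kind of constraint this implies, the following sketch (with an invented lexical entry and cue inventory) represents each phoneme of a stored word form as a bundle of features paired with candidate cues, and licenses a planned reduction only if every phoneme keeps at least one cue.

```python
# Hypothetical lexical entry: each phoneme is listed with the cues that can signal it.
lexicon = {
    "did": [
        {"phoneme": "d", "cues": {"voice_bar", "release_burst", "F2_transition"}},
        {"phoneme": "ɪ", "cues": {"steady_formants", "duration"}},
        {"phoneme": "d", "cues": {"voice_bar", "release_burst", "F2_transition"}},
    ],
}

def reduction_is_licit(word, deleted_cues):
    """A planned reduction (a set of cues to drop) is acceptable only if
    every phoneme of the word retains at least one cue."""
    return all(segment["cues"] - deleted_cues for segment in lexicon[word])

# Dropping the release bursts alone is fine: voice bars and formant
# transitions still cue the /d/s.
print(reduction_is_licit("did", {"release_burst"}))                                # True
# Dropping every cue to /d/ would leave the listener nothing, so it is blocked.
print(reduction_is_licit("did", {"release_burst", "voice_bar", "F2_transition"}))  # False
```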

The acid test of such a view will come from the development of a synthesis algorithm that combines lexical access, prosodic planning, and sound generation to produce natural-sounding speech. The resulting acoustic signals will allow evaluation of this approach by the most sensitive and appropriate instrument: the human speech perception system, which is exquisitely tuned to the appropriateness of phrase-level phonetic phenomena.

Baudouin de Courtenay, J. (1800s/1972). A Baudouin de Courtenay anthology: The beginnings of structural linguistics (E. Stankiewicz, Ed. & Trans.). Bloomington: Indiana University Press.

Beckman, M. E. , & Edwards, J. ( 1994 ). Articulatory evidence for differentiating stress categories. In P. A. Keating (Ed.), Phonological structure and phonetic form: Papers in laboratory phonology III (pp. 7–33). Cambridge: Cambridge University Press.

Boersma, P. ( 2001 ). Praat, a system for doing phonetics by computer.   Glot International , 5 (9/10), 341–345.

Boersma, P. , & Weenink, D. (2012). Praat: doing phonetics by computer [Computer program]. Version 5.3.04. Retrieved from http://www.praat.org/

Browman, C. P. , & Goldstein, L. ( 1986 ). Towards an articulatory phonology.   Phonology Yearbook , 3 , 219–252.

Browman, C. P. , & Goldstein, L. ( 1989 ). Articulatory gestures as phonological units.   Phonology , 6 , 201–251.

Browman, C. P., & Goldstein, L. (1990). Tiers in articulatory phonology, with some implications for casual speech. In J. Kingston & M. E. Beckman (Eds.), Papers in laboratory phonology I: Between the grammar and physics of speech (pp. 341–376). Cambridge: Cambridge University Press.

Browman, C. P. , & Goldstein, L. ( 1992 ). Articulatory phonology: An overview.   Phonetica , 49 , 155–180.

Bybee, J. ( 2001 ). Phonology and language use . Cambridge, England: Cambridge University Press.

Cai, S. (2012). Online control of articulation based on auditory feedback in normal speech and stuttering: Behavioral and modeling studies (Unpublished doctoral thesis). MIT, Cambridge, MA.

Cho, T. , & Jun, S.-A. ( 2000 ). Domain-initial strengthening as enhancement of laryngeal features: Aerodynamic evidence from Korean.   CLS , 36 , 31–44.

Chomsky, N. , & Halle, M. ( 1968 ). The sound pattern of English . New York: Harper & Row.

Coleman, J. ( 2002 ). Phonetic representations in the mental lexicon. In J. Durand , & B. Laks (Eds.), Phonetics, phonology and cognition (pp. 96–130). Oxford: Oxford University Press.

Coleman, J. ( 2003 ). Discovering the acoustic correlates of phonological contrasts.   Journal of Phonetics , 31 , 351–372.

Cooper, A. M. (1991). An articulatory account of aspiration in English (Unpublished doctoral thesis). Yale University, New Haven, CT.

Cutler, A. ( 2010 ). Abstraction-based efficiency in the lexicon.   Laboratory Phonology , 1 (2), 301–318. doi:10.1515/LABPHON.2010.016.

Davidson, L. ( 2006 ). Schwa elision in fast speech: segmental deletion or gestural overlap?   Phonetica , 63 , 79–112.

deJong, K. J. , Lim, B. L. , & Nagao, K. ( 2004 ). The perception of syllable affiliation of singleton stops in repetitive speech.   Language and Speech , 47 , 241–266.

Dell, G. ( 1986 ) A spreading activation theory of retrieval in sentence production,   Psychological Review , 93 , 283–321.

Dilley, L. C. , & Shattuck-Hufnagel S. (1998). Ambiguity in prominence perception in spoken utterances of American English. In Proceedings of the 16th International Congress on Acoustics and 135th Meeting of the Acoustical Society of America, Vol. 2 (pp. 1237–1238).

Dilley, L. C. , & Shattuck-Hufnagel, S. (1999). Effects of repeated intonation patterns on perceived word-level organization. In Proceedings of the ICPhS , San Francisco .

Dilley, L. , Shattuck-Hufnagel, S. , & Ostendorf, M. ( 1996 ). Glottalization of vowel-initial syllables as a function of prosodic structure.   Journal of Phonetics , 24 , 423–444.

Ernestus, M. ( in press ). Acoustic reduction and the roles of abstractions and exemplars in speech processing.   Lingua .

Ernestus, M. , Baayen, H. R. , & Schreuder, R. ( 2002 ). The recognition of reduced word forms.   Brain and Language , 81 , 162–173.

Espy-Wilson, C. (1987). An acoustic-phonetic approach to speech recognition: application to the semivowels (Unpublished doctoral thesis). Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA.

Ferreira, F. ( 1993 ). The creation of prosody during sentence production.   Psychological Review , 100 , 233–253.

Ferreira, F. ( 2007 ). Prosody and performance in language production.   Language and Cognitive Processes , 22 , 1151–1177.

Firth, J. R. ( 1948 ). Sounds and prosodies.   Transactions of the Philological Society . (Reprinted from Prosodic analysis , pp. 1–26, by F. R. Palmer , Ed., 1970, Oxford: Oxford University Press)

Fougeron, C. (1998). Variations articulatoires en début de constituants prosodiques de différents niveaux en français. Université Paris III.

Fougeron, C. ( 2001 ). Articulatory properties of initial consonants in several prosodic constituents in French.   J Phonetics , 29 , 109–135.

Foulkes, P. , & Docherty, G. ( 2006 ). The social life of phonetics and phonology.   Journal of Phonetics , 34 , 409–438.

Fowler, C. A. ( 1988 ). Differential shortening of repeated content words produced in various communicative contexts.   Language and Speech , 3 (4), 307–319.

Fowler, C. A. , & Housum, J. ( 1987 ). Talkers’ signalling of “new” and “old” words in speech and listeners’ perception and use of the distinction.   Journal of Memory and Language , 26 , 489–504.

Fowler, C. A. , Rubin, P. , Remez, R. E. , & Turvey, M. T. ( 1980 ). Implications for speech production of a general theory of action. In B. Butterworth (Ed.), Language production (pp. 373–420). New York: Academic Press.

Fromkin, V. A. ( 1971 ). The nonanomalous nature of anomalous utterances.   Language , 47 , 27–52.

Fromkin, V. A. (Ed., 1973 ). Speech errors as linguistic evidence . The Hague: Mouton.

Garrett, M. F. ( 1975 ). The analysis of sentence production. In: G. Bower (Ed.), Psychology of learning and motivation (Vol. 9, pp. 133–177). New York: Academic Press.

Goldinger, S. D. ( 1996 ). Words and voices: Episodic traces in spoken word identification and recognition memory.   Journal of Experimental Psychology: Learning, Memory, and Cognition , 22 , 1166–1183.

Goldinger, S. ( 1997 ). Echoes of echoes: An episodic theory of lexical access.   Psychological Review , 105 , 251–279.

Goldrick, M. , & Blumstein, S. E. ( 2006 ). Cascading activation from phonological planning to articulatory processes: Evidence from tongue twisters.   Language and Cognitive Processes , 21 (6), 649–683.

Goldsmith, J. A. (1976). An overview of autosegmental phonology.   Linguistic Analysis , 2, 23–68.

Gow, D. W. ( 2002 ). Does English coronal place assimilation create lexical ambiguity?   Journal of Experimental Psychology: Human Perception and Performance , 28 , 163–179.

Gow, D. W. ( 2003 ). Feature parsing: Feature cue mapping in spoken word recognition.   Perception & Psychophysics , 65 , 575–590.

Guenther, F. , & Perkell, J. (2004, June). A neural model of speech production and supporting experiments . In S. Manuel & J. Slifka (Eds.), From sound to sense: 50 years of speech research . Conference held at MIT, Cambridge, MA.

Halliday, M. ( 1967 ). Intonation and grammar in British English (Janua Linguarum, series practica, 48). The Hague: Mouton.

Hawkins, S. ( 2003 ). Roles and representations of systematic fine phonetic detail in speech understanding.   Journal of Phonetics , 31 , 373–405.

Hawkins, S. (2011). On the robustness of speech perception. Plenary lecture presented at the International Congress of Phonetic Sciences 2011, Hong Kong.

Hayes, B. ( 1989 ). The prosodic hierarchy in meter. In P. Kiparsky & G. Youmans (Eds.), Rhythm and meter (pp. 201–260). Orlando, FL: Academic Press.

Hintzman, D. L. ( 1986 ). “Schema abstraction” in a multiple-trace memory model.   Psychological Review , 93 , 411–428.

Hockett, C. ( 1955 ). A manual of phonology. Indiana University Publications in Anthropology and Linguistics 11 . Baltimore, MD: Waverly Press.

Houde, J. F. , & Jordan, M. I. ( 1998 ). Sensorimotor adaptation in speech production.   Science , 279 , 1213–1216.

Jakobson, R. ( 1941 /1968). Child language, aphasia and phonological universals . The Hague: Mouton

Jakobson, R. , Fant, C. G. M. , & Halle, M. ( 1952 ). Preliminaries to speech analysis: The distinctive features and their correlates . Cambridge, MA: MIT Press.

Johnson, K. ( 1997 ). Speech perception without speaker normalization: An exemplar model. In K. Johnson & J. Mullenix (Eds.), Talker variability in speech processing (pp. 145–166). San Diego, CA: Academic Press.

Johnson, K. (2004). Massive reduction in conversational American English. Spontaneous speech: data and analysis. Proceedings of the 1st session of the 10th international symposium (pp. 29–54). Tokyo: The National International Institute for Japanese Language.

Jun, S.-A. (1993). The phonetics and phonology of Korean prosody (Doctoral dissertation). Ohio State University, Columbus, OH. [Published in 1996 by Garland, New York].

Jurafsky, D. , Bell, A. , Fosler-Lussier, E. , Girand, C. , & Raymond, W. D. (1998). Reduction of English function words in Switchboard. In ICSLP-98 , Sydney (Vol. 7, pp. 3111–3114). Canberra City, Australia: Australian Speech Science and Technology Association.

Jurafsky, D. , Bell, A. , & Girand, C. ( 2002 ). The role of the lemma in form variation. In C. Gussenhoven & N. Warner (Eds.), Papers in laboratory phonology VII (pp. 1–34). Berlin: Mouton de Gruyter.

Keating, P. A. ( 1990 ). The window model of coarticulation: Articulatory evidence. In J. Kingston & M. E. Beckman (Eds.), Papers in laboratory phonology I (pp. 451–470). Cambridge: Cambridge University Press.

Keating, P. A. , & Fougeron, C. ( 1997 ). Articulatory strengthening at edges of prosodic domains.   Journal of the Acoustical Society of America , 101 , 3728–3740.

Keating, P. A. , & Shattuck-Hufnagel, S. ( 2002 ). A prosodic view of word form encoding for speech production.   UCLA Working Papers in Phonetics , 101 , 112–156.

Keyser, S. J. , & Stevens, K. N. ( 2006 ). Enhancement and overlap in the speech chain.   Language , 82 (1), 33–63.

Klatt, D. H. ( 1976 ). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence.   Journal of the Acoustical Society of America , 59 , 1208–1221.

Kohler, K. ( 1990 ). Segmental reduction in connected speech in German: Phonological facts and phonetic explanations. In W. J. Hardcastle & A. Marchal (Eds.), Speech production and speech modelling (pp. 21–33). Dordrecht: Kluwer Academic Publishers.

Kohler, K. (1999). Articulatory prosodies in German reduced speech. Proceedings of the XIVth International Congress of Phonetic Sciences , San Francisco.

Kohler, K. (2011). Does phonetic detail guide situation-specific speech recognition. Proceedings of the XVIIth International Congress of Phonetic Sciences , Hong Kong.

Labov, W. ( 1966 ). The social stratification of English in New York City . Washington: Center for Applied Linguistics.

Labov, W. ( 1972 ). Sociolinguistic patterns . Philadelphia: University of Pennsylvania Press.

Ladd, D. R. , & Campbell, N. (1991). Theories of prosodic structure: Evidence from syllable duration. Proceedings of the XII ICPhS. Aix-en-Provence, France.

Lehiste, I. ( 1960 ). An acoustic-phonetic study of internal open juncture.   Phonetica , 5 (Suppl.1), 5–54.

Lehiste, I. (Ed.). ( 1967 ). Readings in acoustic phonetics . Cambridge, MA: MIT Press.

Levelt, W. J. M. ( 1989 ). Speaking: From intention to articulation . Cambridge, MA: MIT Press.

Levelt, W. J. M. ( 2001 ). Spoken word production: A theory of lexical access.   Proceedings of the National Academy of Sciences , 98 , 13464–13471.

Levelt, W. J. M. , Roelofs, A. , & Meyer, A. ( 1999 ). A theory of lexical access in speech production.   Behavioral and Brain Sciences , 22 , 1–38.

Manuel, S. ( 1991 ). Recovery of “deleted” schwa. In O. Engstrand & C. Kylander (Eds.), Current phonetic research paradigms: Implications for speech motor control (PERILUS XIV) . Stockholm: University of Stockholm.

Manuel, S. Y. ( 1995 ). Speakers nasalize /ð/ after /n/, but listeners still hear /ð/.   Journal of Phonetics , 23 , 453–476.

Manuel, S. , Shattuck-Hufnagel, S. , Huffman, M. , Stevens, K. , Carlsson, R. , & Hunnicutt, S. (1992). Studies of vowel and consonant reduction. In Proceedings of the 1992 International Congress on Spoken Language Processing , 943–946.

MacLennan, C. , Luce, P. , & Charles-Luce, J. ( 2003 ). Representation of lexical form.   Journal of Experimental Psychology: Learning, Memory and Cognition , 29 , 539–553.

McQueen, J. M. , Dahan, D. , & Cutler, A. ( 2003 ). Continuity and gradedness in speech processing. In N. O. Schiller , & A. S. Meyer (Eds.), Phonetics and phonology in language comprehension and production: Differences and similarities (pp. 39–78). Berlin: Mouton de Gruyter.

Morton, J. ( 1969 ). Interaction of information in word recognition.   Psychological Review , 76 (2), 165–178.

Munson, B. , Beckman, M. , & Edwards, J. (2011). Phonological representations in language acquisition: Climbing the ladder of abstraction . In A. Cohn , C. Fougeron , & M. Huffman (Eds.), Oxford handbook in laboratory phonology (pp. 288–309). Oxford: Oxford University Press.

Nosofsky, R. ( 1986 ). Attention, similarity and the identification-categorization relationship.   Journal of Experimental Psychology: General , 115 (1), 39–57.

Nespor, M. , & Vogel, I. ( 1986 ). Prosodic phonology . Dordrecht: Foris Publications.

Ogden, R. , & Local, J. K. ( 1994 ). Disentangling autosegments from prosodies: A note on the misrepresentation of a research tradition in phonology.   Journal of Linguistics , 30 , 477–498.

Okobi, A. O. (2006). Acoustic correlates of word stress in American English (Unpublished doctoral thesis). MIT, Cambridge, MA.

Pierrehumbert, J. ( 2001 ). Exemplar dynamics: Word frequency, lenition, and contrast. In J. Bybee & P. Hopper (Eds.), Frequency effects and the emergence of linguistic structure (pp. 137–157). Amsterdam: John Benjamins.

Pierrehumbert, J. ( 2002 ). Word-specific phonetics. In C. Gussenhoven & N. Warner (Eds.), Laboratory Phonology 7 (pp. 101–139). Berlin: Mouton de Gruyter.

Pierrehumbert, J. ( 2003 a). Probabilistic phonology: Discrimination and robustness. In R. Bod , J. Hay , & S. Jannedy (Eds.), Probability theory in linguistics (pp. 177– 228). Cambridge, MA: The MIT Press.

Pierrehumbert, J. ( 2003 b). Phonetic diversity, statistical learning, and acquisition of phonology.   Language and Speech , 46 (2–3), 115–154.

Pierrehumbert, J. ( 2006 ). The statistical basis of an unnatural alternation . In L. Goldstein , D. H. Whalen , & C. Best (Eds.), Laboratory phonology VIII, varieties of phonological competence (pp. 81–107). Berlin: Mouton de Gruyter.

Pierrehumbert, J. B. , & Talkin, D. ( 1992 ). Lenition of /h/ and glottal stop. In G. J. Docherty & D. R. Ladd (Eds.), Papers in laboratory phonology II: Gesture, segment, prosody (pp. 90–127). Cambridge: Cambridge University Press.

Port, R. F. , Mitleb, F. M. , & O’Dell, M. ( 1981 ). Neutralization of obstruent voicing is incomplete.   Journal of the Acoustical Society of America , 70 , S10.

Pouplier, M. , & Goldstein, L. ( 2010 ). Intention in articulation: Articulatory timing of coproduced gestures and its implications for models of speech production.   Language and Cognitive Processes , 25 (5), 616–664.

Saltzman, E. , & Kelso, J. A. S. ( 1987 ). Skilled actions: A task-dynamic approach.   Psychological Review , 94 , 84–106.

Saltzman, E. , & Munhall, K. ( 1989 ). A dynamical approach to gestural patterning in speech production.   Ecological Psychology , 1 , 333–382.

Scharenborg, O. ( 2010 ). Modeling the use of durational information in human spoken-word recognition.   Journal of the Acoustical Society of America , 127 (6), 3758–3770.

Selkirk, E. O. ( 1984 ). Phonology and syntax: The relation between sound and structure . Cambridge, MA: MIT Press.

Shattuck-Hufnagel, S. ( 1992 ). The role of word structure in segmental serial ordering.   Cognition , 42 (1–3), 213–259.

Shattuck-Hufnagel, S. ( 2006 ). Prosody first or prosody last? Evidence from the phonetics of word-final /t/ in American English. In L. M. Goldstein , D. H. Whalen , & C. T. Best (Eds.), Laboratory phonology 8: Varieties of phonological competence (pp. 445–472). Berlin: de Gruyter.

Spinelli, E. , McQueen, J. , & Cutler, A. ( 2003 ). Processing resyllabified words in French.   Journal of Memory and Language , 48 , 233–254.

Stevens, K. N. ( 1998 ). Acoustic phonetics . Cambridge, MA: MIT Press.

Stevens, K. N. ( 2000 ). Diverse acoustic cues at consonantal landmarks.   Phonetica , 57 (2–4), 139–151.

Stevens, K. N. ( 2002 ). Toward a model for lexical access based on acoustic landmarks and distinctive features.   Journal of the Acoustical Society of America , 111 , 1872–1891.

Stevens, K. N. , & Keyser, S. J. ( 1989 ). Primary features and their enhancement in consonants.   Language , 65 , 81–106.

Surana, K. , & Slifka, J. (2006). Is irregular phonation a reliable cue towards the segmentation of continuous speech in American English? In Proceedings of Speech Prosody 2006 (Paper 177). Dresden, Germany: ISCA Special Interest Group on Speech Prosody.

Talkin, D. ( 1995 ). A robust algorithm for pitch tracking (RAPT). In W. B. Kleijn & K. K. Paliwal (Eds.) Speech coding and synthesis (pp. 495–518). New York, NY: Elsevier. [xWaves is now embodied in Wavesurfer, www.speech.kth.se/wavesurfer/ ]

Trubetzkoy, N. S. ( 1939 ). Grundzüge der Phonologie . Göttingen: Vandenhoeck & Ruprecht.

Truslow, E. (2010). Xkl: A Tool for Speech Analysis (Thesis, Union College). Retrieved from http://hdl.handle.net/10090/16913 [Xkl is the Unix implementation of the Klattools, which were developed by Dennis Klatt at MIT in the 1980s].

Villacorta, V. M. , Perkell, J. S. , & Guenther, F. H. ( 2007 ). Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception.   Journal of the Acoustical Society of America , 122 (4), 2306–2319.

Wightman, C. , Shattuck-Hufnagel, S. , Ostendorf, M. , & Price, P. ( 1992 ). Segmental durations in the vicinity of prosodic phrase boundaries.   Journal of the Acoustical Society of America , 91 (3), 1707–1717.

Zhao, S. ( 2010 ). Stop-like modification of the dental fricative /dh/: an acoustic analysis.   Journal of the Acoustical Society of America , 128 , 2009–2020.

Zsiga, E. ( 1997 ). Features, gestures and Igbo vowels: An approach to the phonetics/phonology interface.   Language , 73 (2), 227–274.
