The Auditory Cognitive Neuroscience of Speech Perception in Context
- First Online: 22 February 2022
- Lori L. Holt &
- Jonathan E. Peelle
Part of the book series: Springer Handbook of Auditory Research (SHAR, volume 74)
Speech is undeniably significant as a conspecific human communication signal, and it is also perhaps the most ubiquitous class of acoustic signals encountered by the human auditory system. However, historically there was little integration between speech research and the field of auditory neuroscience. Much of this divide can be traced back to the Motor Theory of speech perception, which framed speech not as an auditory process but as one grounded in motor gestures. Recent decades have seen a marked shift in perspective, with mutual interest from researchers in understanding both how neuroscientific principles can be used to study speech perception and, conversely, how speech as a complex acoustic stimulus can advance auditory neuroscience. This introductory chapter reviews this historical context for the modern field of auditory cognitive neuroscience before placing the remaining chapters of the book in context. A number of important themes emerge: methodological improvements, particularly in human brain imaging; the ability to study more natural speech (stories and conversations, rather than isolated stimuli); an appreciation for ways in which different listeners (e.g., of different ages or hearing levels) perceive speech; and incorporation of regions outside traditional auditory and language networks into our neuroanatomical frameworks for speech perception. Evolving techniques, theories, and approaches have provided unprecedented progress in understanding speech perception. These opportunities challenge researchers to ask new questions and to fully integrate speech perception into auditory neuroscience.
Acknowledgments
This work was supported in part by grants R01 DC014281, R21 DC016086, R21 DC015884, and R56 AG059265 from the US National Institutes of Health to JEP as well as R01DC017734, R03HD099382, and R21DC019217 from the US National Institutes of Health and BCS1950054 and BCS1655126 from the US National Science Foundation to LLH.
Compliance with Ethics Requirements
Lori L. Holt declares that she has no conflict of interest.
Jonathan E. Peelle declares that he has no conflict of interest.
Author information
Authors and Affiliations
Department of Psychology, Carnegie Mellon University, Pittsburgh, PA, USA
Lori L. Holt
Department of Otolaryngology, Washington University in St. Louis, St. Louis, MO, USA
Jonathan E. Peelle
Corresponding author
Correspondence to Lori L. Holt.
Editor information
Editors and Affiliations
Department of Psychology, Carnegie Mellon University, Pittsburgh, PA, USA
Lori L. Holt
Department of Otolaryngology, Washington University in St. Louis, St. Louis, MO, USA
Jonathan E. Peelle
Integrative Physiology and Neuroscience, Washington State University, Vancouver, WA, USA
Allison B. Coffin
Department of Biology, University of Maryland, Silver Spring, MD, USA
Arthur N. Popper
Department of Psychology, Loyola University Chicago, Chicago, IL, USA
Richard R. Fay
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Holt, L.L., Peelle, J.E. (2022). The Auditory Cognitive Neuroscience of Speech Perception in Context. In: Holt, L.L., Peelle, J.E., Coffin, A.B., Popper, A.N., Fay, R.R. (eds) Speech Perception. Springer Handbook of Auditory Research, vol 74. Springer, Cham. https://doi.org/10.1007/978-3-030-81542-4_1
DOI: https://doi.org/10.1007/978-3-030-81542-4_1
Published: 22 February 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81541-7
Online ISBN: 978-3-030-81542-4
eBook Packages: Biomedical and Life Sciences (R0)
A Brief Primer on Experimental Designs for Speech Perception Research
- Grant L. McGuire
- Published 2010
- Linguistics, Psychology
Speech Perception as a Multimodal Phenomenon
Lawrence D. Rosenblum
University of California, Riverside
Speech perception is inherently multimodal. Visual speech (lip-reading) information is used by all perceivers and readily integrates with auditory speech. Imaging research suggests that the brain treats auditory and visual speech similarly. These findings have led some researchers to consider that speech perception works by extracting amodal information that takes the same form across modalities. From this perspective, speech integration is a property of the input information itself. Amodal speech information could explain the reported automaticity, immediacy, and completeness of audiovisual speech integration. However, recent findings suggest that speech integration can be influenced by higher cognitive properties such as lexical status and semantic context. Proponents of amodal accounts will need to explain these results.
We all read lips. We read lips to better understand someone speaking in a noisy environment or speaking with a heavy foreign accent (for a review, see Rosenblum, 2005). Even with clear speech, reading lips enhances our comprehension of a speaker discussing a conceptually dense topic. While wide individual differences exist in lip-reading skill, evidence suggests that all sighted individuals from every culture use visual speech information. Virtually any time we are speaking with someone in person, we use information from seeing movement of their lips, teeth, tongue, and non-mouth facial features, and have likely been doing so all our lives. Research shows that, even before they can speak themselves, infants detect characteristics of visual speech, including whether it corresponds to heard speech and whether it contains one or more languages. Infants, like adults, also automatically integrate visual with auditory speech streams.
Speech perception is inherently multimodal. Despite our intuitions of speech as something we hear, there is overwhelming evidence that the brain treats speech as something we hear, see, and even feel. Brain regions once thought sensitive to only auditory speech (primary auditory cortex, auditory brainstem) are now known to respond to visual speech input (Fig. 1; e.g., Calvert et al., 1997; Musacchia, Sams, Nicol, & Kraus, 2005). Visual speech automatically integrates with auditory speech in a number of different contexts. In the McGurk effect (McGurk & MacDonald, 1976), an auditory speech utterance (e.g., a syllable or word) dubbed synchronously with a video of a face articulating a discrepant utterance induces subjects to report "hearing" an utterance that is influenced by the mismatched visual component. The "heard" utterance can take a form in which the visual information overrides the auditory (audio "ba" + visual "va" = heard "va") or in which the two components fuse to create a new perceived utterance (audio "ba" + visual "ga" = heard "da").
Fig. 1. Functional magnetic resonance imaging (fMRI) scans depicting average cerebral activation of five individuals when listening to words (blue voxels) and when lip-reading a face silently mouthing numbers (purple voxels; adapted from Calvert et al., 1997). The yellow voxels depict the overlapping areas activated by both the listening and lip-reading tasks. The three panels represent the average activation measured at different vertical positions, and the left side of each image corresponds to the right side of the brain. The images reveal that the silent lip-reading task, like the listening task, activates primary auditory and auditory-association cortices.
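To make the fusion pattern just described concrete, here is a minimal sketch (not from the article) that combines auditory and visual category support multiplicatively, in the spirit of fuzzy-logical integration models such as Massaro (1998). The support values are invented for illustration, not measured data.

```python
# Minimal sketch of multiplicative audiovisual integration (fuzzy-logical style).
# The support values below are illustrative, not measured data.

candidates = ["ba", "da", "ga"]

# Degree to which each signal supports each candidate syllable (0..1).
auditory_support = {"ba": 0.80, "da": 0.15, "ga": 0.05}   # audio token: "ba"
visual_support   = {"ba": 0.05, "da": 0.35, "ga": 0.60}   # video token: "ga"

# Multiply per-modality support and renormalize to get response probabilities.
combined = {c: auditory_support[c] * visual_support[c] for c in candidates}
total = sum(combined.values())
probabilities = {c: combined[c] / total for c in candidates}

best = max(probabilities, key=probabilities.get)
print(probabilities)       # roughly {'ba': 0.33, 'da': 0.43, 'ga': 0.24}
print("perceived:", best)  # 'da' -- a fusion response
```

With these toy numbers the fused "da" response wins even though neither modality favors it most strongly, which is the signature of a McGurk fusion.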
Even felt speech, accessed either through touching a speaker’s lips, jaw, and neck or through the kinesthetic feedback from one’s own speech movements, readily integrates with heard speech (e.g., Fowler & Dekle, 1991; Sams, Mottonen, & Sihvonen, 2005). The former of these effects occurs with naive subjects who have no experience perceiving speech through touch. This finding suggests that our skill with multimodal speech perception is likely based not on learned cross-modal associations but, rather, on a more ingrained sensitivity to lawfully structured speech information.
It is also likely that human speech evolved as a multimodal medium (see Rosenblum, 2005, for a review). Most theories of speech evolution incorporate a critical influence of visuofacial information, often bridging the stages of manuo-gestural and audible language. Also, multimodal speech has a traceable phylogeny. Rhesus monkeys and chimpanzees are sensitive to audible–facial correspondences of different types of calls (alarm, coo, hoot). Brain imaging shows that the neural substrate for integrating audiovisual utterances is analogous across monkeys and humans (Ghazanfar, Maier, Hoffman, & Logothetis, 2005). Finally, there is speculation that the world’s languages have developed to take advantage of visual as well as auditory sensitivities to speech. Languages typically show a complementarity between the audibility and visibility of speech segments such that segment distinctions that are harder to hear (“m” vs. “n”) are easier to see and vice versa.
Together, these findings suggest a multimodal primacy of speech. Nonauditory recognition of speech is not simply a function piggybacked on auditory speech perception; instead, the relevant operations and associated neurophysiology of speech are likely designed for multimodal input. The multimodal primacy of speech is consistent with recent findings in general perceptual psychology showing the predominance of cross-modal influences in both behavioral and neurophysiological contexts (Shimojo & Shams, 2001, for a review). This has led a number of researchers to suggest that the perceptual brain is designed around multimodal input. Audiovisual speech is considered a prototypic example of the general primacy of multimodal perception, and the McGurk effect is one of the most oft-cited phenomena in this literature. For these reasons, multimodal speech research has implications that go well beyond the speech domain.
AMODAL THEORIES OF MULTIMODAL SPEECH PERCEPTION
Findings supporting the primacy of multimodal speech have influenced theories of speech integration. In “amodal” or “modality-neutral” accounts, speech perception is considered to be blind to the modality specifics of the input from the very beginning of the process (e.g., Rosenblum, 2005). From this perspective, the physical movements of a speech gesture can shape the acoustic and optic signals in a similar way, so that the signals take on the same overall form. Speech perception then involves the extraction of this common, higher-order information from both signals, rendering integration a consequence and property of the input information itself. In other words, for the speech mechanism, the auditory and visual information is functionally never really separate. While the superficial details of the acoustic and optic signals—along with the associated peripheral physiology—are distinct, the overall informational form of these signals is the same. This fact would obviate any need for the speech function to translate or actively bind one modality’s information to another’s prior to speech-segment recognition.
Fortunately for speech perception, the optic and acoustic structures almost always specify the same articulatory gesture. However, when faced with McGurk-type stimuli, amodal speech perception could extract whatever informational components are common across modalities, which could end up spuriously specifying either a “hybrid” segment or a segment closer to that specified in one or the other of the two modalities.
SUPPORT FOR AMODAL ACCOUNTS
Support for amodal accounts comes from the aforementioned evidence for the neurophysiological and behavioral primacy of multimodal speech perception. If the modalities are functionally never separate, then evidence that the system is designed around multimodal input would be expected. For similar reasons, amodal theories predict evidence for an automaticity, completeness, and immediacy of audiovisual speech integration. Support for these predictions has come from research using the McGurk effect (see Rosenblum, 2005, for a review). It turns out that the effect works even when the audio and visual components are made conspicuously distinct by spatial or temporal separation, or by using audio and visual components taken from speakers of different genders. These facts provide evidence for the automaticity of speech integration. The McGurk effect also occurs when subjects are told of the dubbing procedure or are told to concentrate on the audio channel, suggesting that perceivers do not have access to the unimodal components once integration occurs: Integration seems functionally complete.
There is also evidence that audiovisual speech integrates at the earliest observable stage, before phonemes or even phoneme features are determined. Research shows that visible information can affect auditory perception of the delay between when a speaker initiates a consonant (e.g., separating their lips for “b” or for “p”) and when their vocal cords start vibrating. This voice-onset time is considered a critical speech feature for distinguishing a voiced from a voiceless consonant (e.g., “b” from “p”; Green, 1998). Relatedly, the well-known perceptual compensation of phoneme features based on influences of adjacent phonemes (coarticulation) occurs even if the feature and adjacent phoneme information are from different modalities (Green, 1998). Thus, cross-modal speech influences seem to occur at the featural level, which is the earliest stage observable using perceptual methodologies. This evidence is consistent with neurophysiological evidence that visual speech modulates the auditory brain’s peripheral components (e.g., the auditory brainstem; Musacchia et al., 2005) and supports the amodal theory’s claim that the audio and visual streams are functionally integrated from the start.
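As an illustration of how a single feature such as voice-onset time can support sharply categorical identification, the sketch below fits nothing to data; the ~25 ms boundary and the slope are assumed values chosen only to produce a steep, category-like labeling function.

```python
# Minimal sketch of a categorical identification function over voice-onset time (VOT).
# The boundary (~25 ms) and slope are illustrative assumptions, not fitted values.
import math

def p_voiceless(vot_ms, boundary_ms=25.0, slope=0.5):
    """Probability of labeling a stop as voiceless ('p') rather than voiced ('b')."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary_ms)))

for vot in (0, 10, 20, 25, 30, 40, 60):
    label = "p" if p_voiceless(vot) >= 0.5 else "b"
    print(f"VOT {vot:2d} ms -> P(voiceless) = {p_voiceless(vot):.2f} -> '{label}'")
```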
MODALITY-NEUTRAL SPEECH INFORMATION
Additional support for amodal theories of speech comes from evidence for similar informational forms across modalities—that is, evidence for modality-neutral information. Macroscopic descriptions of auditory and visual speech information reveal how utterances that involve reversals in articulator movements structure corresponding reversals in both sound and light. For example, the lip reversal in the utterance “aba” structures an amplitude reversal in the acoustic signal (loud to soft to loud) as well as a corresponding reversal in the visual information for the lip movements (Summerfield, 1987). Similar modality-neutral descriptions have been applied to quantal (abrupt and substantial) changes in articulation (shifts from contact of articulators to no contact, as in “ba”) and repetitive articulatory motions. More recently, measurements of speech movements on the front of the face have revealed an astonishingly close correlation between movement parameters of visible articulation and the produced acoustic signal’s amplitude and spectral parameters (Munhall & Vatikiotis-Bateson, 2004).
Other research shows how correlations in cross-modal information are perceptually useful and promote integration. It is known that the ability to detect the presence of auditory speech in a background of noise can be improved by seeing a face articulating the same utterance. Importantly, this research shows that the amount of improvement depends on the degree to which the visible extent of mouth opening is correlated with the changing auditory amplitude of the speech (Grant & Seitz, 2000). Thus, cross-modal correspondences in articulatory amplitude facilitate detection of an auditory speech signal.
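The sketch below illustrates the kind of cross-modal correlation at issue, using synthetic stand-ins: a lip-aperture track and an acoustic amplitude envelope oscillating near a syllable-like rate, compared with a Pearson correlation. The signals, the 4 Hz rate, and the 10 ms lag are illustrative assumptions, not Grant and Seitz's stimuli or analysis.

```python
# Minimal sketch: correlating a visible lip-aperture track with the acoustic
# amplitude envelope of the same utterance. Signals here are synthetic toys.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

t = [i / 100.0 for i in range(300)]  # 3 s sampled at 100 Hz
aperture = [1.0 + math.sin(2 * math.pi * 4 * ti) for ti in t]                  # ~4 Hz syllable rate
envelope = [1.0 + 0.9 * math.sin(2 * math.pi * 4 * (ti - 0.01)) for ti in t]   # slightly lagged

print(f"lip-envelope correlation: {pearson_r(aperture, envelope):.2f}")  # close to 1.0
```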
Perceivers also seem sensitive to cross-modal correlations informative about more subtle articulator motions. Growing evidence shows that articulatory characteristics once considered invisible to lip-reading (e.g., tongue-back position, intra-oral air pressure) are actually visible in subtle jaw, lip, and cheek movements (Munhall & Vatikiotis-Bateson, 2004). Also, the prosodic dimensions of word stress and sentence intonation (distinguishing statements from questions), typically associated with pitch and loudness changes of heard speech, can be recovered from visual speech. Even the pitch changes associated with lexical tone (salient in Mandarin and Cantonese) can be perceived from visual speech (Burnham, Ciocca, Lauw, Lau, & Stokes, 2000). These new results not only suggest the breadth of visible speech information that is available but are also encouraging in showing that the visible dimensions closely correlated with acoustic characteristics have perceptual salience.
There are other commonalities in cross-modal information that take a more general form. Research on both modalities reveals that the speaker properties available in the signals can facilitate speech perception. Whether listening or lip-reading, people are better at perceiving the speech of familiar speakers (Rosenblum, 2005, for a review). For both modalities, some of this facilitating speaker information seems available in the specified phonetic attributes: that is, in the auditory and visual information for a speaker’s idiolect (idiosyncratic manner of articulating speech segments). Research shows that usable speaker information is maintained in auditory and visual stimuli that have had the most obvious speaker information (voice quality and pitch, facial features and feature configurations) removed, but maintain phonetic information. For auditory speech, removal of speaker information is accomplished by replacing the spectrally complex signal with simple transforming sine waves that track speech formants (intense bands of acoustic energy composing the speech signal) (Remez, Fellowes, & Rubin, 1997). For visual speech, a facial point-light technique, in which only movements of white dots (placed on the face, lips, and teeth) are visible, accomplishes the analogous effect (Rosenblum, 2005). Despite missing information typically associated with person recognition, speakers can be recognized from these highly reduced stimuli. Thus, whether hearing or reading lips, we can recognize speakers from the idiosyncratic way they articulate phonemes. Moreover, these reduced stimuli support cross-modal speaker matching, suggesting that perceivers are sensitive to the modality-neutral idiolectic information common to both modalities.
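For readers unfamiliar with sine-wave speech, the sketch below shows the basic synthesis idea under stated assumptions: the signal is replaced by a few sinusoids whose frequencies follow formant tracks. The linear formant trajectories here are invented; real sine-wave speech uses formant frequencies and amplitudes measured from a natural utterance.

```python
# Minimal sketch of sine-wave speech synthesis: sum a few sinusoids whose
# frequencies follow (here, invented) formant tracks.
import math

SR = 16000           # sample rate (Hz)
DUR = 0.5            # seconds
N = int(SR * DUR)

def formant_track(f_start, f_end, i):
    """Linearly interpolate a formant frequency across the utterance."""
    return f_start + (f_end - f_start) * i / N

tracks = [(700, 300), (1200, 2200), (2500, 2600)]   # invented F1-F3 trajectories (Hz)
phase = [0.0, 0.0, 0.0]
samples = []
for i in range(N):
    s = 0.0
    for k, (f0, f1) in enumerate(tracks):
        phase[k] += 2 * math.pi * formant_track(f0, f1, i) / SR  # accumulate phase per sinusoid
        s += math.sin(phase[k]) / 3.0
    samples.append(s)

print(len(samples), "samples; peak amplitude", round(max(abs(x) for x in samples), 2))
```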
Recent research also suggests that our familiarity with a speaker might be partly based on this modality-neutral idiolectic information. Our lab has shown that becoming familiar with a speaker through silent lip-reading later facilitates perception of that speaker’s auditory speech (Fig. 2; Rosenblum, Miller, & Sanchez, 2007). This cross-modal transfer of speaker familiarity suggests that some of the information allowing familiarity to facilitate speech perception takes a modality-neutral form.
Fig. 2. Data from an experiment testing the influence of lip-reading from a specific talker on the ability to later hear speech produced by either that same talker or a different talker, embedded in varying amounts of noise (adapted from Rosenblum, Miller, & Sanchez, 2007). Sixty subjects screened for minimal lip-reading skill first lip-read 100 simple sentences from a single talker. Subjects were then asked to identify a set of 150 auditory sentences produced by either the talker from whom they had just lip-read or a different talker. The heard sentences were presented against a background of noise that varied in signal-to-noise ratio: +5 dB (decibels), 0 dB, and −5 dB. For all levels of noise, the subjects who heard sentences produced by the talker from whom they had previously lip-read were better able to identify the auditory sentences than were subjects who heard sentences from a different talker.
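A methodological aside on the signal-to-noise ratios in Fig. 2: mixing speech and noise at a target SNR only requires scaling the noise so that 10·log10(P_speech / P_noise) equals the requested value. The sketch below applies that standard definition; the random stand-in signals and the helper names are ours, not taken from the study.

```python
# Minimal sketch: scale a noise signal so that speech + noise has a target
# signal-to-noise ratio, using SNR_dB = 10 * log10(P_speech / P_noise).
# Signals here are random stand-ins for real speech and noise recordings.
import math
import random

def power(x):
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Return speech + noise with the noise scaled to hit the requested SNR."""
    target_noise_power = power(speech) / (10 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]

random.seed(0)
speech = [random.gauss(0, 1.0) for _ in range(16000)]
noise = [random.gauss(0, 0.3) for _ in range(16000)]

for snr in (+5, 0, -5):  # the SNRs used in Fig. 2
    mixed = mix_at_snr(speech, noise, snr)
    achieved = 10 * math.log10(power(speech) / power([m - s for m, s in zip(mixed, speech)]))
    print(f"target {snr:+d} dB -> achieved {achieved:+.1f} dB")
```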
In sum, amodal accounts of multimodal speech perception claim that, in an important way, speech information is the same whether instantiated as acoustic or optic energy. This is not to say that speech information is equally available across modalities: A greater range of speech information is generally available through hearing than vision. Still, the information that is available takes a common form across modalities, and as far as speech perception is concerned, the modalities are never really separate.
ALTERNATIVE THEORIES OF MULTIMODAL SPEECH PERCEPTION
While amodal accounts have been adopted by a number of audiovisual speech researchers, other researchers propose that the audio and visual streams are analyzed individually, and maintain that they are separated up through the stages of feature determination (e.g., Massaro, 1998) or even through word recognition (Bernstein, Auer, & Moore, 2004). These late-integration theories differ on how the evidence for early integration is explained, but some propose influences of top-down feedback from multimodal brain centers to the initial processing of individual modalities (Bernstein et al., 2004).
In fact, some very recent findings hint that speech integration might not be as automatic and immediate as amodal perspectives would claim. These new results have been interpreted as revealing higher-cognitive, or “upstream,” influences on speech integration—an interpretation consistent with late-integration theories. For example, lexical status (whether or not an utterance is a word) can bear on the strength of McGurk-type effects. Visual influences on subject responses are greater if the influenced segment (audio “b” + visual “v” = “v”) is part of a word (valve) rather than a nonword (vatch; Brancazio, 2004). Similarly, semantic context can affect the likelihood of reporting a visually influenced segment (Windmann, 2004). Attentional factors can also influence responses to McGurk-type stimuli. Observers presented with stimuli composed of speaking-face videos dubbed with sine-wave speech will only report a visual influence if instructed to hear the sine waves as speech (Tuomainen, Andersen, Tiippana, & Sams, 2005). Other results challenge the presumed completeness of audiovisual speech integration. When subjects are asked to shadow (quickly repeat) a McGurk-type utterance (audio “aba” + visual “aga” = shadowed “ada”), the formant structure of the production response shows remnants of the individual audio and visual components (Gentilucci & Cattaneo, 2005).
These new results might challenge the presumed automaticity and completeness of audiovisual speech integration, and could be interpreted as more consistent with late-integration than with amodal accounts. However, other explanations for these findings exist. Perhaps the observed upstream effects bear not on integration itself but, instead, on the recognition of phonemes that are already integrated (which, if composed of incongruent audio and visual components, can be more ambiguous and thus more susceptible to outside influences). Further, evidence that attending to sine-wave signals as speech is necessary for visual influences might simply show that while attention can influence whether amodal speech information is detectable, its recovery, once detected, is automatic and impervious to outside influences. Future research will be needed to test these alternative explanations. At the least, these new results will force proponents of amodal accounts to more precisely articulate the details of their approach.
FUTURE DIRECTIONS
As I have suggested, multimodal speech perception research has become paradigmatic for the field of general multimodal integration. In so far as an amodal theory can account for multimodal speech, it might also explain multimodal integration outside of the speech domain. There is growing evidence for an automaticity, immediacy, and neurophysiological primacy of nonspeech multimodal perception (Shimojo & Shams, 2001). In addition, modality-neutral descriptions have been applied to nonspeech information (e.g., for perceiving the approach of visible and audible objects) to help explain integration phenomena (Gordon & Rosenblum, 2005). Future research will likely examine the suitability of amodal accounts to explain general multimodal integration.
Finally, mention should be made of how multimodal-speech research has been applied to practical issues. Evidence for the multimodal primacy of speech has informed our understanding of brain injury, autism, and schizophrenia, as well as the use of cochlear implant devices. Rehabilitation programs in each of these domains have incorporated visual-speech stimuli. Future research testing the viability of amodal accounts should further illuminate these and other practical issues.
Acknowledgments
This research was supported by the National Institute on Deafness and Other Communication Disorders Grant 1R01DC008957-01. The author would like to thank Rachel Miller, Mari Sanchez, Harry Reis, and two anonymous reviewers for helpful comments.
Recommended Reading
Bernstein, L.E., Auer, E.T., Jr., & Moore, J.K. (2004). (See References). Presents a “late integration” alternative to amodal accounts as well as a different interpretation of the neurophysiological data on multimodal speech perception.
Brancazio, L. (2004). (See References). This paper presents experiments showing lexical influences on audiovisual speech responses and discusses multiple explanations.
Calvert, G.A., & Lewis, J.W. (2004). Hemodynamic studies of audiovisual interactions. In G.A. Calvert, C. Spence, & B.E. Stein (Eds.), The handbook of multisensory processing (pp. 483–502). Cambridge, MA: MIT Press. Provides an overview of research on neurophysiological responses to speech and nonspeech cross-modal stimuli.
Fowler, C.A. (2004). Speech as a supramodal or amodal phenomenon. In G.A. Calvert, C. Spence, & B.E. Stein (Eds.), The handbook of multisensory processing (pp. 189–202). Cambridge, MA: MIT Press. Provides an overview of multimodal speech research and its relation to speech production and the infant multimodal perception literature; also presents an argument for an amodal account of cross-modal speech.
Rosenblum, L.D. (2005). (See References). Provides an argument for a primacy of multimodal speech and a modality-neutral (amodal) theory of integration.
References
- Bernstein LE, Auer ET Jr, Moore JK. Audiovisual speech binding: Convergence or association. In: Calvert GA, Spence C, Stein BE, editors. Handbook of multisensory processing. Cambridge, MA: MIT Press; 2004. pp. 203–223.
- Brancazio L. Lexical influences in audiovisual speech perception. Journal of Experimental Psychology: Human Perception & Performance. 2004;30:445–463.
- Burnham D, Ciocca V, Lauw C, Lau S, Stokes S. Perception of visual information for Cantonese tones. In: Barlow M, Rose P, editors. Proceedings of the Eighth Australian International Conference on Speech Science and Technology. Canberra: Australian Speech Science and Technology Association; 2000. pp. 86–91.
- Calvert GA, Bullmore E, Brammer MJ, Campbell R, Iversen SD, Woodruff P, et al. Silent lipreading activates the auditory cortex. Science. 1997;276:593–596.
- Fowler CA, Dekle DJ. Listening with eye and hand: Cross-modal contributions to speech perception. Journal of Experimental Psychology: Human Perception & Performance. 1991;17:816–828.
- Gentilucci M, Cattaneo L. Automatic audiovisual integration in speech perception. Experimental Brain Research. 2005;167:66–75.
- Ghazanfar AA, Maier JX, Hoffman KL, Logothetis NK. Multisensory integration of dynamic faces and voices in rhesus monkey auditory cortex. The Journal of Neuroscience. 2005;25:5004–5012.
- Gordon MS, Rosenblum LD. Effects of intra-stimulus modality change on audiovisual time-to-arrival judgments. Perception & Psychophysics. 2005;67:580–594.
- Grant KW, Seitz P. The use of visible speech cues for improving auditory detection of spoken sentences. Journal of the Acoustical Society of America. 2000;108:1197–1208.
- Green KP. The use of auditory and visual information during phonetic processing: Implications for theories of speech perception. In: Campbell R, Dodd B, editors. Hearing by eye II: Advances in the psychology of speechreading and auditory-visual speech. London: Erlbaum; 1998. pp. 3–25.
- Massaro DW. Perceiving talking faces: From speech perception to a behavioral principle. Cambridge, MA: MIT Press; 1998.
- McGurk H, MacDonald JW. Hearing lips and seeing voices. Nature. 1976;264:746–748.
- Munhall K, Vatikiotis-Bateson E. Spatial and temporal constraint on audiovisual speech perception. In: Calvert GA, Spence C, Stein BE, editors. The handbook of multisensory processing. Cambridge, MA: MIT Press; 2004. pp. 177–188.
- Musacchia G, Sams M, Nicol T, Kraus N. Seeing speech affects acoustic information processing in the human brainstem. Experimental Brain Research. 2005;168:1–10.
- Remez RE, Fellowes JM, Rubin PE. Talker identification based on phonetic information. Journal of Experimental Psychology: Human Perception & Performance. 1997;23:651–666.
- Rosenblum LD. The primacy of multimodal speech perception. In: Pisoni D, Remez R, editors. Handbook of speech perception. Malden, MA: Blackwell; 2005. pp. 51–78.
- Rosenblum LD, Miller RM, Sanchez K. Lip-read me now, hear me better later: Cross-modal transfer of talker-familiarity effects. Psychological Science. 2007;18:392–396.
- Sams M, Mottonen R, Sihvonen T. Seeing and hearing others and oneself talk. Cognitive Brain Research. 2005;23:429–435.
- Shimojo S, Shams L. Sensory modalities are not separate modalities: Plasticity and interactions. Current Opinion in Neurobiology. 2001;11:505–509.
- Summerfield Q. Some preliminaries to a comprehensive account of audio-visual speech perception. In: Dodd B, Campbell R, editors. Hearing by eye: The psychology of lip-reading. London: Erlbaum; 1987. pp. 53–83.
- Tuomainen J, Andersen TS, Tiippana K, Sams M. Audiovisual speech perception is special. Cognition. 2005;96:B13–B22.
- Windmann S. Effects of sentence context and expectation on the McGurk illusion. Journal of Memory and Language. 2004;50:212–230.
10 The Motor Theory of Speech Perception
- Published: November 2009
This chapter has two parts. The first is concerned with the Motor Theory of Speech Perception's explanandum, and shows that it is rather hard to give a precise account of what the Motor Theory is a theory of. The second part of the chapter identifies problems with the explanans: There are difficulties in finding a plausible account of what the content of the Motor Theory is supposed to be. The agenda of both parts is rather negative, and problems will be uncovered rather than solved. In the concluding section, it is suggested where one might look if one wants to solve the Motor Theory's problems, but it is unclear whether the Motor Theory's problems ought to be solved, or whether the whole theory should be abandoned.