Audio- Aero Tactile Integration in Speech Perception Using an Open Choice Paradigm
Chapter I
Background
Humans are equipped with a broad range of senses, the classic five being vision, hearing, smell, taste and touch, which together make up the sensory system. Experiences, whether enjoyable or miserable, reach us through these sensory modalities. Our sense organs provide the interface for the brain, our coordinating centre for sensation and intellect, enabling us to interpret and understand our surrounding environment. There are obvious benefits associated with having multiple senses. Each sense is of optimal use in particular circumstances, and collectively they increase the likelihood of identifying and understanding events and objects in our everyday life. For example, integration of auditory and tactile stimuli helps us understand our environment when we are in a dark room. Risberg and Lubker (1978) found that integrating information from different sensory channels had a supra-additive effect on comprehension of speech as well. This interaction among the senses and the fusion of their information is described by the phrase 'multisensory or multimodal integration'. Multisensory integration thus refers to the influence of one sensory modality over another in the form of enhancement or suppression relative to the strongest "unimodal" response (Stein and Meredith, 1993).
Stimulus identification and localization have been found to be enhanced significantly due to multisensory integration (Stein et al., 1988; 1989). The behavioural outcome of multisensory integration in speech perception has been studied extensively in numerous perception studies (for reviews, see Calvert et al., 1998; Gick & Derrick, 2009).
When an auditory stimulus is presented simultaneously with, but at a different location from, a visual stimulus, the sound is perceived to be located at the position of the visual stimulus. This perceptual illusion in stimulus localisation arising from multisensory integration is generally termed the "ventriloquism effect" (Howard & Templeton, 1966). A similar type of illusion has been replicated in speech perception studies as well. The McGurk effect, in which one modality modifies the perceptual outcome of another during speech perception, is another typical illustration of multisensory integration in humans (McGurk & MacDonald, 1976). This study will be discussed in detail in section 2.2.1 of Chapter II.
The ventriloquist's illusion and the McGurk effect both arise from the interaction between auditory and visual stimuli, and thereby highlight the impact that interaction among the senses has on an individual's perception.
In addition to behavioural studies, many electrophysiological studies using neuroimaging techniques have made it possible to investigate multisensory processes in humans. Macaluso and colleagues (2000) conducted an fMRI study showing that activity in the visual cortex is enhanced by tactile stimulation of the hand on the same side as the visual stimulus. A similar finding by Sadato et al. (1996), using positron emission tomography (PET), indicated activation of primary and secondary visual cortical areas induced by Braille reading in early blind subjects. Likewise, an event-related potential (ERP) study suggests that activity in the visual cortical areas is modulated by sound (Shams et al., 2001).
This array of behavioural and electrophysiological studies attests to the influence that interaction among the senses can have. Even though the auditory and visual modalities dominate over the others, other modalities such as touch have also been shown to contribute to better perception. This thesis concerns how multisensory integration can enhance communication, more specifically how tactile information can help us perceive speech better.
Initially, multisensory studies in speech perception focused primarily on the integration of audio and visual information (for reviews, see McGurk & MacDonald, 1976; Green & Kuhl, 1989, 1991). Research along this line has shown that visual information not only enhances our perception of speech but can also alter it. More recently, researchers have started to focus on the effect of tactile sensation alongside the auditory signal, beginning to reveal the impact of tactile information on speech perception (for reviews, see Reed et al., 1989; Gick & Derrick, 2009). Moreover, different modes of response elicitation, such as open choice and forced choice, have been used to study these multisensory interactions (for reviews, see Colin et al., 2008; Sekiyama & Tohkura, 1991; Van Wassenhove, Grant, & Poeppel, 2005). In the following sections the associated findings will be elaborated, citing evidence from relevant behavioural studies, thereby substantiating our ability to perceive speech as a multimodal sensation.
Outline of thesis
In this thesis, I employ open choice as the response elicitation paradigm and investigate whether tactile stimuli, when presented with an auditory signal, improve perception of monosyllables, in order to explore how the human perceptual system synthesizes speech.
In Chapter II, section 2.1 discusses the production of speech in relation to the speech characteristics that distinguish the perception of phonemes. Section 2.2 reviews speech perception as a multisensory event with evidence from behavioural studies, and in 2.3 methodological aspects such as response type and stimulus type that can influence speech perception are elaborated with supporting literature. Chapters III and IV describe the statement of the problem and the methodology, respectively. In Chapter V, the results of the study are presented and discussed. Finally, in the General Discussion (Chapter VI), I summarize the empirical findings and suggest further directions of research.
Chapter II
Review of Literature
2.1 Speech Production
The production of speech is underpinned by a complex series of interacting processes, beginning in the brain with phonetic and motoric planning, followed by the expelling of air from the lungs that leads to vibration of the vocal cords, termed phonation. Puffs of air released after phonation are then routed to the oral or nasal cavities, which provide the needed resonance quality. Before being expelled as speech sounds, these air puffs undergo further shaping by various oral structures called articulators. These physical mechanisms of the articulators are studied as articulatory phonetics by experts in the fields of linguistics and phonetics. The basic unit of speech sound is called the phoneme, which comprises vowels and consonants. Perceptual studies and spectrographic analysis revealed that speech sounds can be specified in terms of a number of simple and independent dimensions and that they are not grouped along a single complex dimension (Liberman, 1957). Linguists and phoneticians have therefore classified phonemes according to features of the articulation process used to generate the sounds. These features of speech production are reflected in certain acoustic characteristics which are presumably discriminated by the listener.
The articulatory features that describe a consonant are its place and manner of articulation, whether it is voiced or voiceless, and whether it is nasal or oral. For example, [b] is made at the lips by stopping the airstream, is voiced, and is oral. These features are represented as:
Consonants | [p] | [b] | [k] | [g] |
Voicing | Voiceless | Voiced | Voiceless | Voiced |
Place | Labial | Labial | Velar | Velar |
Manner | Stop | Stop | Stop | Stop |
Nasality | Oral | Oral | Oral | Oral |
Figure: place and manner of articulation of labial and velar stops.
Much of the featural study of speech has focused on the stop consonants of English. The stop consonants are a set of speech sounds that share the same manner of articulation. Their production begins with a build-up of pressure behind some point in the vocal tract, followed by a sudden release of that pressure. Two articulatory features in the description of stop consonant production are place of articulation and voicing. They have well-defined acoustic properties with minimal differences in production (Delattre, 1951). Place of articulation refers to the point of constriction in the vocal tract, especially in the oral cavity, where closure occurs. Formant transitions are the acoustic cues that underlie place of articulation in CV syllables (Liberman et al., 1967). The voicing feature, on the other hand, is the presence or absence of periodic vocal cord vibration. The acoustic cue that underlies the voicing feature is voice onset time (Lisker & Abramson, 1964), which corresponds to the time interval between the release from stop closure and the onset of laryngeal pulsing. In English, the six stops form three cognate pairs that share place of articulation but differ in voicing: /p/ and /b/ are labial, /t/ and /d/ are alveolar, and /k/ and /g/ are velar, where the first of each pair is voiceless and the second is voiced.
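For illustration, the voice onset time definition above can be written as a simple relation (the notation is added here for clarity and does not come from the cited sources):

$$\mathrm{VOT} = t_{\text{voicing onset}} - t_{\text{release}}$$

Long positive (long-lag) values characterize voiceless aspirated stops such as English /p, t, k/, while values near zero or negative values characterize voiced stops such as /b, d, g/.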
Initially, the processing of place and voicing features was thought to be independent in stops. Miller and Nicely (1955) investigated perceptual confusions among 16 syllables presented to listeners at various signal-to-noise ratios. Information transmitted by several features, including place and voicing, was calculated individually and in combination, and the values were found to be approximately equal, leading them to the conclusion that these features are mutually independent. This study therefore suggests that features are extracted separately during preliminary perceptual processing and recombined later in the response.
In contrast, the non-independence of articulatory features, especially place and voicing, has also been reported in the literature.
Lisker and Abramson (1970) found evidence for a change in voicing as a function of place of articulation. Voice onset time (VOT) is the aspect of the voicing feature that depends on place of articulation: the typical voice onset time lag at the boundary between voiced and voiceless stop consonants produced under natural conditions becomes longer as the place of articulation moves further back in the vocal tract (i.e., from /ba/ to /da/ to /ga/).
Pisoni and Sawusch (1974) reached a similar conclusion regarding the dependency of the place and voicing features. They constructed two models of the interaction of phonetic features in speech perception to predict syllable identification for a bi-dimensional series of synthetic CV stop consonants (/pa/, /ba/, /ta/, /da/), systematically varying the acoustic cues underlying the place and voicing features in order to examine the nature of the featural integration process. Based on their comparison and evaluation of the two models, they concluded that the acoustic cues underlying phonetic features are not combined independently to form phonetic segments.
Eimas and colleagues (1978) also reported a mutual, but unequal, dependency of place and manner information for speech perception using syllables (/ba/, /da/, /va/, /za/), in which they concluded that place is more dependent on manner than manner is on place.
In addition, Miller's (1977) comparative study of the mutual dependency of phonetic features, using the labial-alveolar and nasal-stop distinctions (/ba/, /da/, /ma/, /na/), also provided strong support for the claim that the place and manner features undergo similar forms of processing even though they are specified by different acoustic parameters.
In summary, there is considerable evidence for dependency effects in the analysis of phonetic features from several experimental paradigms, indicating that the combination of phonetic features can aid perception of speech sounds.
In face-to-face communication between normally hearing people, the manner of articulation of consonantal utterances is detected primarily by ear (e.g., whether the utterance is voiced or voiceless, oral or nasal, stopped or continuant, etc.); place of articulation, on the other hand, is detected largely by eye.
Binnie et al. (1974) found that the lip movements for the syllables /da/, /ga/, /ta/ and /ka/ were visually difficult to discriminate from each other; similarly, lip movements for /ba/, /pa/ and /ma/ were frequently confused with each other. Conversely, labial and nonlabial consonants were never confused. They also stated that place of articulation is more efficiently detected by vision than manner of articulation. At the same time, they corroborated Miller and Nicely's (1955) finding that the voicing and nasality features of consonants are readily perceived auditorily, even at low signal-to-noise ratios. Furthermore, they noted that the auditory-visual confusions indicate that the visual channel in bi-sensory presentations reduced errors in phoneme identification when the stimuli varied by place of articulation.
Hence, it can be concluded that speech perception is ideally a multisensory process involving simultaneous syncing of the visual and acoustic cues generated during phonetic production. Therefore, investigation of speech perceptual performance in humans requires the use of conflicting information, which can be achieved by using congruent and incongruent stimulus conditions.
2.1.1 Congruent and incongruent condition
In a stimulus identification task, a multisensory representation of the stimulus is more effective than a unisensory one, and the effect is markedly more pronounced for congruent stimuli. Stimulus congruency is defined in terms of the properties of the stimulus: stimuli are congruent when a dimension or feature, within a single modality or across different modalities, is common to them.
The Stroop effect (Stroop, 1935) is a classic example for understanding stimulus congruency. In the Stroop task, participants respond to the ink colour of coloured words. Typically, participants perform better, in terms of reaction times and accuracy, if the word's meaning is congruent with its colour (e.g., the word "BLUE" in blue ink) than if colour and meaning are incongruent (e.g., the word "BLUE" in green ink). Similarly, when perceiving speech, if the stimuli are congruent, i.e. the stimulus from one modality (e.g. visual) is in sync with that of another modality (e.g. auditory), then perception of speech is found to be substantially better. Sumby and Pollack (1954) showed that, in a noisy background, watching congruent articulatory gestures improves the perception of degraded acoustic speech stimuli. Similar findings have been reported for auditory-tactile stimulus congruency conditions as well, which will be discussed later.
So far, we have discussed how various speech sounds are produced and the articulatory features that define them, and reviewed studies suggesting that these features are encoded in the brain in an integrated fashion, making speech perception a unified phenomenon. We have also clarified the terms congruent and incongruent stimuli, which will be used frequently in the next section. In the subsequent section, we continue to review speech perception as a multisensory event with evidence from behavioural studies.
2.2 Bimodal Speech perception
Speech perception is the process by which the sounds of language are heard, interpreted and understood. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Just like the speech production process, the perception of speech is a very complicated, multifaceted process that is not yet fully understood.
As discussed in previous sections, speech can be perceived through the functional collaboration of the sensory modalities. In ideal conditions, hearing a speaker's words is sufficient to identify the auditory information. Auditory perception may be understood as the processing of those attributes that allow an organism to deal with the source that produced the sound, rather than simply the processing of attributes of the sound such as pitch and loudness. Accordingly, researchers have recently viewed speech perception as a specialized aspect of a general human ability: the ability to seek and recognize patterns. These patterns can be acoustic, visual, tactile or a combination of these. Speech perception as a unified phenomenon involving the association of various sensory modalities (e.g. auditory-visual, auditory-tactile, visual-tactile) will now be discussed in detail, reviewing significant behavioural studies from the multisensory literature.
2.2.1 Auditory – visual (AV) integration in speech perception
The pioneering work on multisensory integration showed that, while audition remains vital in perceiving speech, our understanding of auditory speech is supported by visual cues in everyday life, that is, by seeing the articulating mouth movements of the speaker. This is especially true when the signal is degraded or distorted (e.g., due to hearing loss, environmental noise or reverberation). The literature suggests that individuals with moderate to severe hearing impairment can achieve high levels of oral communication skills.
Thornton and Erber (1979) evaluated 55 hearing-impaired children aged between 9 and 15 years using a sentence identification task. Their written responses were scored, and it was found that their speech comprehension and language acquisition were markedly better with auditory-visual perception.
Grant et al. (1998) arrived at a similar conclusion when they studied auditory-visual speech recognition in 29 hearing-impaired adults through consonant and sentence recognition tasks. Responses were made by choosing the stimulus heard from a screen, and were obtained for auditory, visual and auditory-visual conditions. The results are suggestive of an AV benefit in speech recognition for the hearing impaired.
The influence of visual cues on auditory perception was studied by Sumby and Pollack (1954) in adults using bi-syllabic words presented in a noisy background. Their results demonstrate that the visual contribution to speech recognition is most evident at low signal-to-noise ratios. Moreover, findings from Macleod and Summerfield's (1990) investigation of speech perception in noise using sentences also confirm that speech is better understood in noise with visual cues.
On the other hand, even with fully intact speech signals, visual cues can have an impact on speech recognition. Based on a general observation made for a film, McGurk and MacDonald (1976) designed an AV study that led to a remarkable breakthrough in speech perception research. They presented incongruent AV combinations of four monosyllables (/pa/, /ba/, /ga/, /ka/) to school children and adults in auditory-only and AV conditions. Subjects' responses were elicited by asking them to repeat the utterances they heard. The findings demonstrated that when the auditory production of a syllable is synchronized with the visual production of an incongruent syllable, most subjects perceive a third syllable that is not represented by either the auditory or the visual modality. For instance, when the visual stimulus /ga/ is presented with the auditory stimulus /ba/, many subjects report perceiving /da/, while reverse combinations of such incongruent monosyllabic stimuli usually elicit response combinations like /baga/ or /gaba/ (McGurk & MacDonald, 1976; Hardison, 1996; Massaro, 1987). This unified, integrated illusory percept, formed as a result of either fusion or combination of the stimulus information, is termed the "McGurk effect". This phenomenon stands as a cornerstone of the speech perception literature substantiating multisensory integration.
Conversely, studies suggest that the McGurk effect cannot be induced as easily in other languages (e.g. Japanese) as in English. Sekiyama and Tohkura (1991) evidenced this in their investigation of how Japanese perceivers respond to the McGurk effect. Ten Japanese monosyllables were presented in AV and audio-only conditions in noisy and noise-free environments, and perceivers had to write down the syllable they perceived. When the AV and auditory-only conditions were compared, the results suggested that the McGurk effect depended on auditory intelligibility: it was induced when auditory intelligibility was lower than 100%, and otherwise was absent or weak.
The early existence of audio-visual correspondence ability was established when the literature evidenced AV integration in pre-linguistic children (Kuhl & Meltzoff, 1982; Burnham & Dodd, 2004; Pons et al., 2009) and in non-human primates (Ghazanfar and Logothetis, 2003).
Kuhl and Meltzoff (1982) investigated AV integration of vowels (/a/ and /i/) in 18- to 20-week-old normally developing infants by scoring their visual fixation. The results showed a significant effect of auditory-visual correspondence, and vocal imitation by some infants, which is suggestive of a multimodal representation of speech. Similarly, the McGurk effect was replicated in pre-linguistic infants aged four months using a visual fixation paradigm by Burnham and Dodd (2004). These infant studies indicate that newborns possess sophisticated innate speech perception abilities, demonstrated by their multisensory syncing capacity, which lay the foundations for subsequent language learning.
In addition, it is quite interesting that other, non-human primates exhibit a similar AV synchronisation to humans in their vocal communication system. Ghazanfar and Logothetis (2003) assessed whether rhesus monkeys could recognize the auditory-visual correspondence between their 'coo' and 'threat' calls, using a preferential-looking technique to elicit responses during the AV task. Their findings are suggestive of an inherent ability in rhesus monkeys to match their species-typical vocalizations, presented acoustically, with the appropriate facial articulation posture. The presence of multimodal perception in an animal's communication signals may represent an evolutionary precursor of humans' ability to make the multimodal associations necessary for speech perception.
The complementarity of the visual signal to the acoustic signal, and how it benefits speech perception, has been reviewed so far. We have also considered the fact that AV association is an early-existing ability, as evidenced by animal and infant studies. Recently, research has progressed beyond AV perception by examining how the tactile modality can influence auditory perception, which is discussed in the following section.
2.2.2 Auditory – tactile (AT) integration in speech perception
Although less intuitive than the auditory and visual modalities, the tactile modality also has an influence on speech perception. Remarkably, the literature suggests that speech can be perceived not only by eye and ear but also by hand (feeling). Previous research on the effect of tactile information on speech perception focused primarily on enhancing the communication abilities of deaf-blind individuals (Chomsky, 1986; Reed et al., 1985). Robust evidence for manual tactile speech perception derives mainly from research on the Tadoma method (Alcorn, 1932; Reed, Rubin, Braida, & Durlach, 1978). The Tadoma method is a technique in which oro-facial speech gestures are felt and monitored through manual-tactile (haptic) contact with the speaker's face.
Numerous behavioural studies have also shown the influence of tactile cues on speech recognition in healthy individuals. Treille and colleagues (2014) showed that speech perception is altered, in terms of reaction time, when haptic (manual) information is provided in addition to the audio signal in untrained healthy adult participants. Fowler and Dekle (1991), on the other hand, described a subject who perceived /va/ when feeling a tactile (mouthed) /ba/ presented simultaneously with an acoustic /ga/. So, when manual-tactile contact with a speaker's face was coupled with incongruous auditory input, integration of the audio and tactile information evoked a fused percept of /va/, representing an audio-tactile McGurk effect.
Manual-tactile contact is not the only form of touch shown to influence speech perception. Work by Derrick and colleagues demonstrated that puffs of air on the skin, when combined with an auditory signal, can enhance speech perception. In a study by Gick and Derrick (2009), untrained and uninformed healthy adult perceivers received puffs of air (aero-tactile stimuli) on their neck or hand while simultaneously hearing aspirated or unaspirated English plosives (i.e., /pa/ or /ba/). Participant responses for the monosyllable identification task were recorded by pressing keys corresponding to the syllable they heard. Participants perceived /pa/ more often in the aero-tactile condition, indicating that listeners integrate this tactile and auditory speech information in much the same way as they do synchronous visual and auditory information. A similar effect was replicated using puffs of air at the ankle (Derrick & Gick, 2013), demonstrating that the effect does not depend on spatial ecological validity.
In addition, in the supplementary methods of their original article ("Aero-tactile integration in speech perception"), Derrick and Gick (2009b) demonstrated the validity of tactile integration in speech perception by replicating their original hand experiment with 22 participants, but replacing the puffs of air with taps delivered by a metallic solenoid plunger. No significant effect of the tap stimulation on speech perception was observed, confirming that participants were not merely responding to generalized tactile information, nor was the original effect the result of increased attention. This indicates that listeners respond specifically to aero-tactile cues of the kind normally produced during speech, which evidences multisensory ecological validity.
Moreover, to illustrate how airflow can help distinguish fine differences between speech sounds, Derrick et al. (2014b) designed a battery of eight experiments comparing different combinations of voiced and voiceless English monosyllables with stop, fricative and affricate onsets (/pa/, /ba/, /ta/, /da/, /fa/, /ʃa/, /va/, /t͡ʃa/, /d͡ʒa/). The study was run with 24 healthy participants, for whom the auditory stimuli were presented simultaneously with an air puff on the head. Participants were asked to choose which of two syllables they heard in each experiment. Analysis of the data showed that perception of stops and fricatives was enhanced. These results suggest that aero-tactile information can be extracted from the audio signal and used to enhance speech perception for a large class of speech sounds found in many languages of the world.
Extending Gick and Derrick's (2009) findings on air puffs, Goldenberg and colleagues (2015) investigated the effect of air puffs during identification of syllables on a voicing continuum, rather than using the voiced and voiceless exemplars of the original work (Gick & Derrick, 2009). The English syllables /pa/, /ba/, /ka/ and /ga/ were used for the study. The auditory signal was presented with and without a simultaneous air puff delivered to the participant's hand. The 18 adults who took part pressed a key corresponding to their response from a choice of two syllables. Their findings showed an increase in voiceless responses when co-occurring puffs of air were presented on the skin. In addition, this effect became less pronounced at the endpoints of the continuum. This suggests that tactile stimuli exert greater influence where the auditory voicing cues are ambiguous, and that the perceptual system weighs auditory and aero-tactile inputs differently.
Temporal asynchrony was used by Gick and colleagues (2010) to establish the temporal ecological validity of auditory-tactile integration during speech perception. They assessed whether asynchronous cross-modal information is integrated across modalities in a similar way as in audio-visual perception, by presenting auditory (aspirated "pa" and unaspirated "ba" stops) and tactile (slight, inaudible, cutaneous air puffs) signals synchronously and asynchronously. The experiment was conducted with 13 healthy participants, who chose the speech sound they heard using a button box. The study concluded that subjects integrate audio-tactile speech over a wide range of asynchronies, as in audio-visual speech events. Furthermore, because sound and air flow travel at different speeds, the perceptual system appears to accommodate different physical transmission speeds for different multimodal signals, leading to the asymmetry in multisensory enhancement observed in this study.
Since aero-tactile integration was found to be effective for the perception of syllables, as evidenced in the studies above, researchers extended their evaluation to see whether a similar effect could be elicited for more complex stimulus types such as words and sentences.
Derrick et al. (2016) studied the effect of speech air flow on syllable and word identification with onset fricatives and affricates in two languages, Mandarin and English, where all 24 participants had to choose between two expected responses when the auditory stimulus was presented with and without air puffs. The results show that air flow helps in distinguishing syllables: the greater the distinction between the two choices, the more useful air flow is in helping to distinguish them. Though this effect was stronger in English than in Mandarin, it was significantly present in both languages.
Recently, the benefit of AT integration in continuous speech perception was examined with highly complex stimuli and increased task complexity by Derrick and colleagues (2016). The study was conducted with hearing-typical and hearing-impaired adult perceivers who received air puffs on the temple while simultaneously hearing five-word English sentences (e.g. "Amy bought eight big bikes"). Data were recorded for puff and no-puff conditions, and participants were asked to say aloud the words after perceiving each sentence. The outcome suggests that air flow does not enhance recognition of continuous speech, as no benefit could be demonstrated for either the hearing-typical or the hearing-impaired group. Thus, in the continuous speech study the beneficial effect of air flow on speech perception could not be replicated.
The line of speech perception literature discussed above clearly illustrates the effect of AT integration in speech perception for several stimulus types (syllables and words), in different languages and over a wide range of temporal asynchronies. However, a similar benefit was not observed for continuous speech, and the factors that altered this result remain unclear and need further investigation. Just as AT and AV integration were found to enhance the understanding of speech, investigations have been carried out to see whether visuo-tactile integration can aid perception of speech, which is discussed in the upcoming section.
2.2.3 Visuo – tactile (VT) integration in speech perception
It is quite interesting that speech, once believed to be an aural phenomenon, can be perceived without the actual presence of an auditory signal. Gick et al. (2008) examined the influence of tactile information on visual speech perception using the Tadoma method. They found that syllable perception of untrained perceivers improved by around 10% when they felt the speaker’s face whilst watching them silently speak, when compared to visual speech information alone.
Recently, the effect of aero-tactile information on visual speech perception of the English labials /pa/ and /ba/, in the absence of an audible speech signal, was investigated by Bicevskis, Derrick and Gick (2016). Participants received the visual signal of the syllables alone or together with air puffs on the neck at various timings, and had to identify the perceived syllable from a choice of two. Even with temporal asynchrony between air flow and video signal, perceivers were more likely to respond that they perceived /pa/ when air puffs were present. The findings show that perceivers can utilise aero-tactile information to distinguish speech sounds when presented with an ambiguous visual speech signal, which in turn suggests that visual-tactile integration occurs in the same way as audio-visual and audio-tactile integration.
Research into multimodal speech perception thus shows that perceptual integration can occur with audio-visual, audio-(aero)tactile, and visual-tactile modality combinations. These findings support the view that speech perception is the sum of all the information from the different modalities, rather than being primarily an auditory signal that is merely supplemented by information from other modalities. However, multisensory integration could not be demonstrated in all of the studies described, even though most of them did show it. Hence a thorough understanding of the factors that disrupted those study outcomes is essential, and these are elaborated in the next section.
2.3 Methodological factors that affect speech perception
In the previous sections of this chapter, we discussed the wide range of behavioural studies across different sensory modality combinations (AV, VT and AT) that support multisensory integration. However, a similar result could not be replicated in the continuous speech perception study described in 2.2.2 (Derrick et al., 2016). The next step is to determine the factors that potentially produced this variability. From a careful literature review, I found that methodological differences, including the response type used to collect reports from the participants and specific stimulus differences such as stimulus type, ranging from syllable identification to five-word sentence identification, might be those factors.
2.3.1 Effect of Response Type on Speech Perception
Speech perception studies have predominantly adopted two different ways of asking individuals to report what they perceived. These response types are open-choice and forced-choice responses. Open-choice responses allow individuals to independently produce a response to a question, by repeating it aloud or writing it down, while forced-choice responses provide preselected response alternatives from which individuals are instructed to select the best (or correct) option. Forced-choice responses therefore deliver cues that may not be spontaneously considered, whereas in open-choice responses generated by the participants no cues are made available. However, open-choice responses can sometimes be difficult to code, whereas forced-choice responses are easier to code and work with (Cassels & Birch, 2014).
Consequently, different strategies may be used under these two response types. For example, an eliminative strategy might be adopted by participants in the case of forced-choice responses, whereby participants continue to refine the alternatives by eliminating the least likely one until they arrive at the expected alternative (Cassels & Birch, 2014). A task that exhibits demand characteristics or experimenter expectations ("did the stimulus sound like pa?") might give different results than one that does not ("what did the stimulus sound like?") (Orne, 1962). Forced-choice responses are often influenced by demand characteristics, whereas with open-choice responses the expression of demand characteristics is easily avoided by using non-directed questions.
In addition, the literature shows that the experimental paradigm can affect the behavioural response. Colin, Radeau and Deltenre (2005) examined how sensory and cognitive factors regulate mechanisms of speech perception using the McGurk effect. They calculated the McGurk response percentage while manipulating the auditory intensity of the speech, the face size of the speaker, and the participant instructions, which aligned responses with a forced-choice or an open-choice format. Like many other studies, they instructed their participants to report what they heard. They found a significant effect of the instruction manipulation, with a higher percentage of McGurk responses for forced-choice responses. In the open-choice task, participant responses were more diverse because no response alternatives were provided, and the reduction in the number of McGurk responses may be attributed to participants being more conservative so that they could report exactly what they perceived. They also found an interaction between the instructions used and the intensity of the auditory speech. Likewise, Massaro (1998, pp. 184-188) found that stimuli were identified correctly more frequently when responses were elicited from a limited set (forced choice) than in an open-choice task.
Recently, Mallick and colleagues (2015) replicated the findings of Colin et al. (2005) when they examined the McGurk effect while modifying parameters such as population, stimuli, time, and response type. They demonstrated that the frequency of the McGurk effect can be significantly altered by the response type manipulation, with forced-choice responses increasing the frequency of McGurk perception by approximately 18% compared with open choice for identical stimuli.
In the literature reviewed, the multisensory integration effect is not entirely consistent. The table below summarizes the studies described in section 2.2, providing an overview of the stimuli used, the paradigm, and whether multisensory integration was present.
Article/ Study | Response type/ Paradigm used | Stimuli used | Multisensory integration present or not |
Thornton, N. E., & Erber, N. (1979) | Open choice (write down responses) | Sentences | Significant AV Integration |
Grant, K. W., Walden, B. E., & Seitz, P. F. (1998) | Forced choice | Consonant and sentences | Significant AV benefit |
Sumby, W. H., & Pollack, I. (1954) | Forced choice | Bi-syllabic words | Significant AV benefit |
Macleod, A., & Summerfield, Q. (1990) | Open choice (write down response) | Sentences | Significant AV benefit |
McGurk, H., & MacDonald, J. (1976) | Open choice (say aloud responses) | Monosyllables | Significantly strong AV benefit |
Sekiyama, K., & Tohkura, Y. i. (1991) | Open choice (write down response) | Monosyllables | AV integration present but weaker than in English |
Kuhl, P. K., & Meltzoff, A. N. (1982) | Forced choice (visual fixation) | Vowels | Significant AV benefit |
Burnham, D., & Dodd, B. (2004). | Forced choice (visual fixation) | Monosyllables | Significant AV benefit |
Ghazanfar, A. A., & Logothetis, N. K. (2003) | Forced choice (preferential looking) | Species specific vocalization | Significant AV benefit |
Treille, A., Cordeboeuf, C., Vilain, C., & Sato, M. (2014) | Forced choice | Monosyllables | Significant auditory -haptic (tactile) benefit |
Fowler, C. A., & Dekle, D. J. (1991) | Forced choice | Monosyllables | Significant auditory -haptic (tactile) benefit |
Gick, B., & Derrick, D. (2009) | Forced choice | Monosyllables | Significantly strong auditory – tactile benefit |
Derrick, D., & Gick, B. (2013) | Forced choice | Monosyllables | Significantly strong auditory – tactile benefit |
Derrick, D., & Gick, B. (2009b) | Forced choice | Monosyllables presented with metallic taps instead of air puff | No auditory – tactile benefit |
Derrick, D., O'Beirne, G. A., de Rybel, T., & Hay, J. (2014) | Forced choice | Monosyllables | Significantly strong auditory – tactile benefit |
Goldenberg, D., Tiede, M. K., & Whalen, D. (2015) | Forced choice | Monosyllables | Significant auditory – tactile benefit |
Gick, B., Ikegami, Y., & Derrick, D. (2010) | Forced choice | Monosyllables | Significantly strong auditory – tactile benefit |
Derrick, D., Heyne, M., O'Beirne, G. A., de Rybel, T., Hay, J., & Fiasson, R. (2016) | Forced choice | Syllables and words | AT integration present in both languages but weaker in Mandarin than in English |
Derrick, D., O'Beirne, G. A., de Rybel, T., Hay, J., & Fiasson, R. (2016) | Open choice (say aloud) | 5-word sentences | No auditory – tactile benefit |
Gick, B., Jóhannsdóttir, K. M., Gibraiel, D., & Mühlbauer, J. (2008) | Open choice (say aloud) | Disyllables | Significant visuo – tactile benefit |
Bicevskis, K., Derrick, D., & Gick, B. (2016) | Forced choice | Monosyllables | Significant visuo – tactile benefit |
The conclusions of Colin et al. (2005) and the literature review suggest that response choice may be a significant contributor to variability in speech perception. In the case of forced-choice responses, participants can compare their percept with the available response alternatives, whereas in the case of open-choice responses participants attempt to retrieve the syllable that most closely matches their percept from an unrestricted number of possible syllables.
2.3.2 Effect of stimulus type on Speech Perception
As outlined in previous sections (2.2.1 and 2.2.2) and illustrated in the table, the type of stimulus may also affect the behavioural outcome. The use of continuous speech such as phrases or sentences, which contains confounding factors (semantic information, context information, utterance length, etc.) and requires more complex central processing than syllables, may have disrupted the findings of the study that showed no effect of tactile information on speech perception (Derrick et al., 2016).
Similarly, Liu and Kewley-Port (2004) measured vowel formant discrimination in syllables, phrases, and sentences for high-fidelity speech and found that thresholds of formant discrimination were poorest for the sentence context and best for the syllable context, with the isolated vowel in between, indicating that task complexity increases with more complex stimulus types.
In summary, the literature suggests that stimulus type, representing a context effect, is a product of multiple levels of processing combining both general auditory and speech-related processing (e.g., phonetic, phonological, semantic and syntactic processing). Hence the type of stimulus used in a study (e.g. sentences) can contribute to an increase in task complexity.
The methodological factors discussed above, response type and stimulus type, are therefore possible features that interfered with the multisensory integration tasks of previous speech perception studies. In this study, I take these aspects into consideration to design the most appropriate experiment for studying aero-tactile integration in speech perception.
Chapter III
Statement of the problem
Studies investigating audio-tactile integration using syllables as stimuli and a closed-choice response paradigm demonstrated multisensory integration, but audio-tactile integration could not be replicated when investigated using sentences as stimuli and an open-choice response paradigm. The approach of that study did not follow a continuous hierarchical pattern: the stimulus type was upgraded abruptly from monosyllables to five-word sentences, which was a radical change. In addition, a comparatively sophisticated response type, the open-choice paradigm, was also chosen for the study. Thus, the unanswered question is which of these factors, the use of sentences or the open-choice paradigm, led to the null result in the continuous speech perception paper (Derrick et al., 2016). To determine this, it is essential to take a step back and investigate multisensory integration using syllables in an open-choice paradigm.
3.1 Study aim
The present study aims to identify whether the benefits of audio-tactile integration hold for a monosyllable identification task at varying signal-to-noise ratios (SNRs) when the participants do not have to make a forced choice between two alternatives but are instead presented with a more ecologically valid open-choice condition.
3.2 Hypothesis
The research question is whether aero-tactile information influences syllable perception in an open-choice identification task. This will be investigated by testing the following two hypotheses.
SNR at 80% accuracy level will interact with phoneme and air flow such that:
Hypothesis 1: The SNR at the 80% accuracy level will be lower when listening to congruent audio-tactile stimuli than to audio-only stimuli.
Hypothesis 2: The SNR at the 80% accuracy level will be higher when listening to incongruent audio-tactile stimuli than to audio-only stimuli.
3.3 Justification
The proposed study is an extended version of the previous work of Gick and Derrick (2009), but with the mode of response being an open-choice design instead of a two-way forced-choice paradigm. This response format was chosen because it has been shown to provide a more conservative estimate of the participant's percept in previous studies (Colin et al., 2005; Massaro, 1998). Moreover, an open-choice design allows a better assessment of the precision of audio-tactile integration in speech perception, as the possibility of the subject guessing is minimized. Hence, the outcome of this study will extend our knowledge of the effectiveness of integrating tactile information in the enhancement of auditory speech perception, in a more natural setting with minimal cues.
In addition, simplifying the stimulus type to monosyllables allows us to identify the conditions under which audio-tactile integration occurs, without the confounding factors requiring higher cognitive and linguistic processing (semantic information, context information, utterance length, etc.) that were present in the continuous speech studies (Derrick et al., 2016).
3.4 Significance
This study would be a valuable contribution to the multisensory speech perception literature, both for fundamental scientific work and for further research. Moreover, insights from the study could be used for evidence-based clinical practice, especially for training communication skills in individuals with sensory deficits.
Chapter IV
Methodology
The current study builds on the methodology of the original aero-tactile integration paper (Gick & Derrick, 2009), coupling an acoustic speech signal with small puffs of air on the skin. The difference in the present study is that this time the participants are free to choose their response, without any constraints, based on their own perceptual judgement rather than having to choose between two response alternatives. The University of Canterbury Human Ethics Committee reviewed and approved this study on 15 May 2017 (approval number 2017-21 LR). See Appendix for a copy of the approval letter.
4.1 Participants
Forty-four (44) healthy participants (40 females and 4 males), with a mean age of 23.34 years, were recruited for the study. The inclusion criteria for recruitment were:
- Native English speaker
- Aged between 18 and 45
- No history of speech, language or hearing issues
Of the 44 participants tested, seven did not meet the language criterion (New Zealand, Canadian, United States or United Kingdom English) and three had pure tone thresholds greater than 25 dB in either ear, leaving 34 participants. In addition, five of these participants reached a ceiling effect for some of the conditions, so their data could not be included fully: they were unable to correctly identify some of the stimuli at a +10 dB SNR level, suggesting they had difficulty doing the task even in an effectively noiseless environment. None had to be excluded completely, but participant 2's "ka" and "ba", participant 6's "pa", participant 8's "ka", participant 14's "ka", and participant 37's "ga", "da" and "ba" data had to be excluded due to these ceiling effects. None of the participants had a history of speech or language delays.
Participants were primarily undergraduate speech-language therapy students (n = 34); the remaining participants (n = 10) were recruited via email, Facebook, advertisements on the New Zealand Institute of Language, Brain and Behaviour (NZILBB) website and around the university. Undergraduate students received course credit for their research participation, while other volunteers were given a $10 gift voucher as compensation for their time. As part of the recruitment process, participants received an information sheet (Appendix), which was discussed with them before beginning any of the procedures. Following this discussion, if they chose to participate, they were asked to sign a written consent form (Appendix).
All participants were asked to complete a questionnaire (Appendix) detailing demographic information on age, dialect and history of speech, language and hearing difficulties. As part of the initial protocol, participants underwent an audiological screening. Pure tone audiometry was carried out at 500 Hz, 1 kHz, 2 kHz and 4 kHz using an Interacoustics AS608 screening audiometer. Pure tone thresholds were determined, and hearing sensitivity was considered within the normal range if thresholds were less than or equal to 25 dB HL. Participants not meeting the inclusion criteria could choose to still complete the study to gain research experience. This resulted in data for 7 non-native English speakers.
4.2 Recording procedure and stimulus
The speaker was recorded in a sound-attenuated booth in the lab; speech audio was captured using a Sennheiser MKH-416 microphone attached to a Sound Devices USB-Pre2 microphone amplifier fed into a PC. Video recordings of the English syllables, labials (/pa/ and /ba/) and velars (/ka/ and /ga/), spoken by a female native New Zealand English speaker, were made using a video camera (Panasonic Lumix DMC-LX100), with the speaker's lips ~1 cm away from a custom-made airflow estimator system that does not interfere with audio speech production. The speaker produced twenty repetitions of each stimulus, and the stimuli were presented in randomized order to be read aloud off a screen.
To produce the air puff, an 80 ms long 12 kHz sine wave was used to drive the pump action of the Aerotak system (Derrick & De Rybel, 2015). This system stores the audio signal and the air flow signal in the left and right channels of a stereo audio output, respectively. The stored audio drives a conversion unit that splits the output into a headphone signal (to both ears) and a right-channel air pump drive signal sent to a piezoelectric pump mounted on a tripod.
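The sketch below illustrates this channel layout in R: a stereo file with the speech audio on the left channel and an 80 ms, 12 kHz drive burst on the right channel. This is an illustration only, not the Aerotak implementation; the file names, sample rate and scaling are assumptions.

```r
# Minimal sketch of the stereo layout described above (audio left, pump drive right).
library(tuneR)

fs    <- 44100                                            # assumed sampling rate (Hz)
drive <- sin(2 * pi * 12000 * seq(0, 0.08, by = 1 / fs))  # 80 ms, 12 kHz drive burst

speech <- readWave("pa_token.wav")                        # hypothetical mono speech token
left   <- speech@left / max(abs(speech@left))             # normalised audio channel

n     <- max(length(left), length(drive))                 # pad the shorter channel with silence
left  <- c(left,  rep(0, n - length(left)))
right <- c(drive, rep(0, n - length(drive)))

stereo <- Wave(left = round(left * 32000), right = round(right * 32000),
               samp.rate = fs, bit = 16)
writeWave(stereo, "pa_with_drive.wav")                    # left: headphones, right: pump drive
```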
- Auditory stimuli
The speech stimuli of the English syllables were matched for duration (390-450 ms each), fundamental frequency (falling pitch from 90 Hz to 70 Hz) and intensity (70 dB). Using an automated process, the speech token recordings were randomly superimposed 10,000 times within a 10-second looped sound file to generate speech noise for the speaker. According to Jansen and colleagues (2010) and Smits and colleagues (2004), this method of noise generation results in a noise spectrum virtually identical to the long-term spectrum of the speaker's speech tokens, thus ensuring accurate signal-to-noise ratios for each speaker and token. The speech tokens and the noise samples were adjusted to the same A-weighted sound level prior to mixing at different signal-to-noise ratios.
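The following sketch illustrates the noise-generation idea described above. It is not the custom R/FFMPEG code used in the study; the token file name and sample rate are assumptions.

```r
# Illustrative speech-noise generation: superimpose a speech token 10,000 times at
# random positions within a 10-second looped buffer, so the noise inherits the
# token's long-term spectrum.
fs    <- 44100                                        # assumed sampling rate (Hz)
loop  <- numeric(10 * fs)                             # 10-second circular buffer
token <- tuneR::readWave("pa_token.wav")@left         # hypothetical speech token
token <- token / max(abs(token))

for (i in 1:10000) {
  start <- sample.int(length(loop), 1)                # random onset, wrapping around the loop
  idx   <- ((start + seq_along(token) - 2) %% length(loop)) + 1
  loop[idx] <- loop[idx] + token
}
noise <- loop / max(abs(loop))                        # rescale before level matching and mixing
```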
- Tactile stimuli
In order to best match the airflow produced by the airflow generation system to the dynamics of airflow produced in speech, the air flow outputs were generated by a Murata MZB1001T02 piezoelectric device (Tokyo, Japan), controlled through the Aerotak system, as described in Derrick, de Rybel, and Fiasson (2015). The system extracts a signal representing turbulent airflow during speech from the recorded speech samples. The stimulus syllables were passed through this air flow extraction algorithm to generate a signal for driving the pump, presenting air flow to the skin of participants simultaneously with the audio stimuli.
4.3 Stimulus presentation
The experiment was run individually for each participant, and the entire procedure lasted approximately 40 minutes. Data were collected using an Apple MacBook Air laptop in a sound-attenuated room, using four underlying tokens each of 'ba', 'pa', 'ga', and 'ka'. Stimuli were embedded in speech noise generated using the same techniques described in Derrick et al. (2016), with the exception that the software used was custom R and FFMPEG. The speech in noise ranged from -20 to +10 dB SNR in 0.1 dB increments. From -20 to 0 dB SNR, the signal was decreased and the noise kept stable; from 0 to +10 dB SNR, the signal was kept at the same volume and the noise decreased. Thus, the overall amplitude was maintained stable throughout the experiment.
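A hedged sketch of this gain scheme is given below, assuming the signal and noise have already been matched to the same level (as described in 4.2); the function name is illustrative, not taken from the study's software.

```r
# Sketch of the assumed SNR gain scheme: below 0 dB SNR the signal is attenuated and
# the noise held constant; above 0 dB SNR the noise is attenuated and the signal held
# constant, keeping the overall level roughly stable.
mix_at_snr <- function(signal, noise, snr_db) {
  if (snr_db <= 0) {
    sig_gain   <- 10 ^ (snr_db / 20)    # attenuate signal below 0 dB SNR
    noise_gain <- 1
  } else {
    sig_gain   <- 1
    noise_gain <- 10 ^ (-snr_db / 20)   # attenuate noise above 0 dB SNR
  }
  sig_gain * signal + noise_gain * noise
}

snr_steps <- seq(-20, 10, by = 0.1)     # SNR range and step size used in the experiment
```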
The pump has the following specifications: a 5-95% rise time of 30 ms (Derrick et al., 2015), a maximum pressure of 1.5 kPa during loud speech, and a maximum flow rate of 0.8 L/min, which corresponds to about a twelfth of that of actual speech.
The correct responses were the lower-case strings 'pa', 'ba', 'ga', and 'ka' (or 'ca'), based on the underlying audio signal. Whenever the participant responded accurately, the SNR was lowered, thereby increasing the task difficulty; for every incorrect response, the SNR was raised, making the signal clearer and the task easier. Thus, the result of each trial allows re-tuning of the SNR for each syllable, compensating for how easy the individual recording was for perceivers to detect in noise. This method of assigning stimulus values based on the preceding responses is an adaptive staircase procedure. The auditory signals were degraded with speech-based noise, and the signal-to-noise ratio was varied using software implementing an adaptive staircase to obtain a psychometric estimate of the SNR corresponding to 80% response accuracy in noise. The QUEST adaptive staircase (Watson & Pelli, 1983) was adopted as it is a reasonably fast and well-established method. Eight adaptive staircases were set up (one for each combination of syllable and air flow condition), and each QUEST staircase had 32 repetitions.
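The sketch below illustrates, in simplified form, how a QUEST-style staircase converges on the 80%-accuracy SNR. It is not the PsychoPy implementation used in the study: the psychometric function, slope and the simulated listener's threshold are assumptions for demonstration only.

```r
# Simplified QUEST-style staircase: keep a posterior over the 80%-correct SNR
# threshold on a grid, update it after every trial, and test at the posterior mean.
snr_grid  <- seq(-20, 10, by = 0.1)                        # candidate thresholds (dB SNR)
posterior <- rep(1 / length(snr_grid), length(snr_grid))   # flat prior

# Psychometric function, parameterised so p = 0.8 when snr equals the threshold
p_correct <- function(snr, threshold, slope = 1) {
  plogis(slope * (snr - threshold) + qlogis(0.8))
}

# Simulated listener with an assumed "true" 80% threshold of -8 dB SNR
run_trial <- function(snr) runif(1) < p_correct(snr, threshold = -8)

for (trial in 1:32) {                                      # 32 repetitions per staircase
  snr_now <- sum(snr_grid * posterior)                     # next test SNR = posterior mean
  correct <- run_trial(snr_now)
  lik     <- p_correct(snr_now, snr_grid)                  # likelihood of a correct response
  if (!correct) lik <- 1 - lik
  posterior <- posterior * lik / sum(posterior * lik)      # Bayesian update
}
threshold_estimate <- sum(snr_grid * posterior)            # estimated 80%-accuracy SNR
```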
4.4 Procedure
The study was designed to examine the influence of audio-aero-tactile integration on speech perception using an open-choice task. Each participant's perception was assessed using randomized presentation of six possible combinations of auditory-only, congruent, and incongruent auditory and aero-tactile stimuli of the English syllables /pa/, /ba/, /ga/ and /ka/. Each participant heard 32 tokens of each syllable without air flow and 32 tokens of each syllable with air flow generated from the underlying sound file, for a total of 256 tokens. The length of time per token was 6.5 seconds on average per participant. Once the initial protocol was completed, participants were seated in a sound-attenuated booth wearing sound-isolating headphones (Panasonic Stereo Headphones RP-HT265). They were presented with the auditory stimuli via the headphones at a comfortable loudness level through an experiment designed in PsychoPy software (Peirce, 2007, 2009). Tactile stimuli were delivered at the suprasternal notch via the air pump aimed at the subject's neck at a pressure of ~7 cm H2O, fixed at ~2.2 cm from the skin surface. This location was chosen because it is one where participants typically receive no direct airflow during their own speech production. Participants' integrated perception was estimated by asking them to type the perceived syllable into the experiment control program, which indicated whether the answer was correct or not based on the software-provided expected outcome.
Participants were told that they might experience some noise and unexpected puffs of air along with the syllables, each consisting of a consonant and a vowel, during the task. Participants were asked to type the syllables that they heard and to push the enter key to record their responses. Since the experimental part of the study, requiring active listening, lasted about 20 minutes of the total procedure, participants could take short listening breaks if they required one. The researcher stayed inside the experiment room with the participant during the experiment, to monitor that the placement was not disturbed and to ensure that participants were comfortable.
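A hedged sketch of how a typed open-choice response could be scored against the underlying audio token is shown below ('ca' accepted as an alternative spelling of 'ka', as described in 4.3). The function and variable names are illustrative and are not taken from the actual PsychoPy experiment code.

```r
# Normalise a typed response and compare it with the target syllable.
score_response <- function(typed, target) {
  typed <- tolower(trimws(typed))     # ignore case and stray whitespace
  if (typed == "ca") typed <- "ka"    # accept the alternative spelling of /ka/
  typed == target
}

score_response("Ka ", "ka")   # TRUE
score_response("ta",  "ka")   # FALSE
```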
4.5 Data Analysis
Of the forty-four participants who took part in the study, data from the thirty-four (34) participants who fit the inclusion criteria were analyzed to answer the research question. Initially, the data were entered and sorted in a Microsoft Excel 2016 spreadsheet. The 32 repetitions were extracted for each staircase, and statistics were run on the final SNR, which corresponds to the 80% accuracy level. Descriptive statistics were run first.
Box plots were used to visualize the variation of SNR with place of articulation.
The variation of SNR between the audio-only and audio-tactile conditions for each target stimulus was also plotted using box plots.
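A sketch of these plots is given below, assuming a long-format data frame df with columns SNR, place, airflow and syllable (the data frame and column names are assumptions, not the study's actual analysis script).

```r
# Box plots of the 80%-accuracy SNR, as described above.
library(ggplot2)

ggplot(df, aes(x = place, y = SNR)) +        # SNR by place of articulation
  geom_boxplot()

ggplot(df, aes(x = airflow, y = SNR)) +      # audio-only vs. audio-tactile condition
  geom_boxplot() +
  facet_wrap(~ syllable)                     # one panel per target syllable
```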
Generalized linear mixed-effects models (GLMM), specified in R format (R Core Team, 2016), were run on the interaction between aspiration [aspirated ('pa' and 'ka') vs. unaspirated ('ba' and 'ga') stops], place of articulation [labial ('pa' and 'ba') vs. velar ('ga' and 'ka')], and artificial air puff (present vs. absent).
Model fitting was then performed in a stepwise backwards iterative fashion, and models were back-fit using the Akaike information criterion (AIC) to measure quality of fit. This technique isolates the statistical model that provides the best fit for the data, allowing interactions to be eliminated in a statistically appropriate manner. The final model was:
SNR ~ place * manner + (1 + (place * manner) | participant)
In this model, the SNR at 80% accuracy was compared against the fixed effects. These included: 1) place of articulation (labial vs. velar), 2) manner of articulation (voiced vs. voiceless), 3) the interaction of place and manner, and 4) a full-factorial random effect of place and manner of articulation by participant.
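For illustration, the final model can be fitted with the lme4 package in R; because the response (SNR at 80% accuracy) is continuous, a linear mixed model via lmer is sketched here. The data frame df and its column names are assumptions, and the single AIC comparison stands in for the full stepwise back-fitting procedure.

```r
# Sketch of the final mixed-effects model and one AIC back-fitting step.
library(lme4)

full <- lmer(SNR ~ place * manner + (1 + place * manner | participant),
             data = df, REML = FALSE)

# Example back-fitting step: drop the place-by-manner interaction from the fixed effects
reduced <- lmer(SNR ~ place + manner + (1 + place * manner | participant),
                data = df, REML = FALSE)

AIC(full, reduced)   # retain the model with the lower AIC at each step
```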