Fundamentals of Sound
MODULE 7: SOUND SOURCE AUDITORY LOCALIZATION

 

Perceptual attributes of acoustic waves

SOUND SOURCE AUDITORY LOCALIZATION

   Introduction / Definitions
   Auditory Localization Cues
      Interaural Intensity/Level Differences (IID / ILD)
      Interaural Time/Phase Differences (ITD / IPD)
      Monaural & Interaural Spectral Differences - HRTFs / ATFs
      Binaural Cues & "Release from Masking"
      "Cone of Confusion" -  Head-Movement Localization Cues
      Judging Sound Source Distance
   The Precedence Effect
   Optional Section
      Sound source localization neural mechanisms
 

 

 


 

   

Sound-Source Auditory Localization

Introduction / Definitions

 

 
The term auditory localization describes judgments on the location (orientation & distance), movement, and size of a sound source, based solely on auditory cues.

Systematic identification of these cues and their physical/acoustical correlates supports:

_ microphone selection and placement, appropriate to a given sonic context and desired spatial effect outcome.
 
_ postproduction audio techniques and processes able to convey spatial imagery through sonic means.

_ deliberate creation of realistic or imaginary soundscapes in multisensory experiences 
    (e.g. films; computer games; other A/V composites; immersive AR/VR experiences)

_ 3-D audio: video illustration (by Sennheiser - use headphones - start at 3'10").
_ 3-D audio: outline (by Sony).
_ 3-D Audio Lab (Princeton University).
 

Terminology
 
   Monaural signal
       Signal entering a single ear (usually through headphones - earplug in one ear and signal/stimulus in the other)
 
   Binaural signal
       Signal entering both ears via the air (most common type of listening in real-world environments - can be simulated via headphones)

           Diotic (special case of  Binaural)
               Exactly the same sound signal entering both ears (artificial type of binaural listening - e.g. listening to a mono
               recording through stereo headphones)
           Dichotic (special case of  Binaural - also referred to as Binaural Splitting)
               Completely different sound signals entering each ear (artificial type of binaural listening - through headphones).
 
   Ipsilateral ear
       Ear closest to the sound source
 
   Contralateral ear
       Ear furthest from the sound source

 
Auditory localization judgments
are mainly described in terms of the apparent position of a sound source relative to the listener's head (see to the right).

Azimuth describes sound-source location in the horizontal (transverse) plane, measured relative to the median (mid-sagittal) plane.
Elevation describes sound-source location in the median (mid-sagittal) plane, measured relative to the horizontal plane.
      Front center: 0°
      Left/Up: positive angles
      Right/Down: negative angles

Lateralization describes judgments on the apparent position of a sound source when listening through stereo headphones; in this case the apparent position is confined within the listener's head.
3D Headphones circumvent this limitation (article by Waves Audio Ltd. - concepts apply broadly).
 

(in Plack, 2005: 174)

The JND of auditory localization is defined in terms of minimum audible angles (MAAs) of rotation in azimuth, elevation, or both, necessary to convey a corresponding change in sound-source positioning relative to the head (see below and to the right)

MAAs depend on:  i) rotation orientation,   ii) starting angle,  &   iii) frequency.

Smallest possible MAA ≈ 1° ( in azimuth;   0° starting angle;    low frequencies ).
MAAs are larger in all other cases.
 

(Source: Freigang et al., 2014)


(in Plack, 2005: 175).


We localize sound sources based on three kinds of interaural differences
(i.e. differences between the portion of sonic energy arriving at each ear):

_ intensity level differences

_ arrival time (more specifically: period-related time- or phase-) differences

_ spectral differences

Interaural differences in intensity and arrival time (phase) constitute the most important sound-source localization cues. The theory outlining their contribution to sound-source localization judgments is referred to as the duplex theory of sound-source localization and was introduced in the early 20th century by British physicist, Lord Rayleigh (John William Strutt, 3rd Baron Rayleigh).

 


 

   

Interaural Intensity/Level Differences (IID / ILD)

 

 


 

For high frequency sound signals (>~1,500Hz), whose wavelengths (<~0.23m/cycle) are shorter than the average head's half circumference (~0.28m), auditory localization judgments are based mainly on interaural intensity/level differences (IIDs or ILDs).

Frequencies >~1,500Hz cannot diffract efficiently around a listener's head, which therefore blocks enough acoustic energy to produce interpretable interaural intensity level differences.

The higher the frequency the larger the portion of the signal energy that will be blocked by the head, producing increasingly stronger and more salient interaural intensity differences.

For frequencies below 500Hz, IIDs are negligible (why?) and increase gradually with frequency. 
For frequencies >1,500Hz, IIDs change systematically with azimuth changes and provide reliable localization cues on the horizontal plane, except for front-to-back confusion, where IID=0 (why?).
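The frequency/wavelength reasoning above can be summarized in a short sketch (Python; the boundary values are this module's approximations, not exact perceptual thresholds):

```python
# Which interaural cue dominates for a pure tone? A minimal sketch using the
# ~500Hz / ~1,500Hz boundaries and the ~0.28m average head half circumference
# discussed above (all values are the module's approximations).

SPEED_OF_SOUND = 345.0           # m/s, as used throughout this module
HEAD_HALF_CIRCUMFERENCE = 0.28   # m, average adult head

def wavelength(frequency_hz):
    """Wavelength in metres per cycle."""
    return SPEED_OF_SOUND / frequency_hz

def dominant_cue(frequency_hz):
    """Primary horizontal-plane localization cue for a sine tone."""
    if frequency_hz < 500.0:
        return "IPD"       # wavelength > ~2x head half circumference: diffraction, phase cues
    elif frequency_hz > 1500.0:
        return "IID"       # head blocks enough energy to create level differences
    else:
        return "IID+IPD"   # intermediate region: both cues present, both weak
```

For example, `wavelength(1500.0)` returns ~0.23 m/cycle, matching the boundary quoted above, while `dominant_cue(250.0)`, `dominant_cue(1000.0)`, and `dominant_cue(4000.0)` return the three regimes in turn.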
 


Artificial IIDs (i.e. imposed through headphones and not due to sound-source positioning relative to the head) can provide localization cues at all frequencies. 

Example: Three successive 300Hz tones are presented in stereo (must use headphones)
_In the first, the intensity level is higher on the left channel (by 7dB);
_In the second, the level is the same across channels;
_In the third, the level is higher on the right channel (by 7dB). 
This arrangement results in signal lateralization that appears to move from the left, to the middle, to the right, following the signals' IID.
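A minimal sketch of how such an artificial-IID stimulus could be synthesized (illustrative Python, not the module's actual audio files; the 300Hz and 7dB figures come from the example above):

```python
import numpy as np

SR = 44100  # sample rate, Hz

def tone(freq_hz, dur_s, sr=SR):
    """Mono sine tone."""
    t = np.arange(int(sr * dur_s)) / sr
    return np.sin(2 * np.pi * freq_hz * t)

def stereo_iid(mono, iid_db):
    """Stack a mono signal into two channels; positive iid_db favours the
    left ear by attenuating the right channel (and vice versa), so no
    channel ever exceeds the original level."""
    att = 10.0 ** (-abs(iid_db) / 20.0)
    if iid_db > 0:
        return np.stack([mono, att * mono])   # left louder
    if iid_db < 0:
        return np.stack([att * mono, mono])   # right louder
    return np.stack([mono, mono])             # centred

m = tone(300.0, 1.0)
left_biased = stereo_iid(m, +7.0)    # lateralizes left
centred = stereo_iid(m, 0.0)         # centred image
right_biased = stereo_iid(m, -7.0)   # lateralizes right
```

Writing the three arrays to a stereo file in sequence reproduces the left-middle-right motion described above.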

Audio engineers often use artificial IIDs to simulate different sound source positions.
This practice may introduce complications in outdoor, live performance settings, where audiences listen through left and right loudspeaker arrays positioned far (>25ft) apart.
One such complication is related to a perceptual strategy developed to address auditory localization ambiguities when listening in reflective environments. The manifestation of this strategy (referred to as "precedence effect") is addressed at the end of the module.

   

Interaural Time/Phase Differences (ITD / IPD)

 


For low frequency sound signals (e.g. <~500Hz, with wavelengths >~0.69m, equivalent to > twice the average head's half circumference of ~0.28m), auditory localization judgments are based mainly on single-cycle interaural time differences (ITDs), equivalent to interaural phase differences (IPDs).

The minimum detectable interaural time difference is ~10 microseconds (0.000010s), corresponding to a shift in sound-source location of ~1° in azimuth relative to straight ahead (consistent with the minimum audible angle).

Explanation: Low frequency signals with wavelengths that are large enough --compared to the average listener's head-- to diffract efficiently, do not produce perceptible interaural level differences.
However, they do reach the two ears at different times. For frequencies <500Hz, for example, the interaural time difference is shorter than the signal's period and the entire head fits within < half a wavelength. Consequently, at low frequencies, sound-source location produces unambiguous and interpretable interaural phase differences or IPDs (save for locations on the median plane, where IPD=0), because sounds will arrive at each ear at a different phase within the same cycle.

The maximum possible interaural time difference that can occur due to sonic energy travelling around the head is ~0.0008s (assuming an average head half circumference of ~0.28m and speed of sound = 345m/s).
For a sine signal to reliably take advantage of IPD cues, its period must be >1.5 times this value (>0.0012s), so that the entire head fits within <2/3 of a cycle. IPDs are therefore reliable localization cues for frequencies <~700-800Hz. The highest frequency for which IPDs provide any cues is ~1,500Hz.
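The arithmetic behind these figures, as a sketch:

```python
# Reproduce the module's arithmetic: longest possible ITD around the head,
# and the frequency range over which IPDs stay unambiguous.

SPEED_OF_SOUND = 345.0   # m/s
HEAD_PATH = 0.28         # m, half-circumference path around the head

max_itd = HEAD_PATH / SPEED_OF_SOUND    # ~0.0008 s, the longest possible ITD
min_period = 1.5 * max_itd              # head must fit within < 2/3 of a cycle
max_reliable_freq = 1.0 / min_period    # ~820 Hz, i.e. "reliable below ~700-800Hz"
```

Running the numbers gives `max_itd` ≈ 0.0008s and `max_reliable_freq` ≈ 820Hz, consistent with the values quoted above.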

IPDs also provide useful localization cues for complex signals whose spectrum includes components with frequencies <~750Hz.

For complex signals with no low-frequency content (or for amplitude-modulated high-frequency-content signals), IPDs remain useful as long as several of the frequency components are separated by (or as long as the modulation rate is) <~750Hz.  In such cases, the complex signal's envelope (in Plack, 2005:178) will display amplitude fluctuations at rates <~750Hz and the IPDs between the signal envelopes arriving in each ear will provide interpretable localization cues.

At higher frequencies, single-cycle IPDs do not provide useful localization cues because they depend not only on sound source location on the horizontal & median planes but also on frequency and, most importantly, distance. Different sound source locations can result in the same IPDs, while the same angular location in the localization coordinate system may result in different IPDs, depending on distance.

 

In Plack, 2005: 176

Strongest separation in lateralization (i.e. in virtual location inside the head) occurs for IPDs = 1/4 cycle, corresponding to signals with:
Period T ~0.0032s/cycle
(= 4 x "longest possible ITD" = 4 x 0.0008s) and Frequency f ~315Hz (= 1/T = 1/0.0032).

Example: Three successive 300Hz tones are presented in stereo (must use headphones - no onset difference)
_In the first, the left channel leads by 1/4 cycle;
_In the second, the phase is the same across channels; 
_In the third, the right channel leads by 1/4 cycle.
This arrangement results in signal lateralization that appears to move from the left, to the middle, to the right, following the signals' IPD.
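A sketch of how the quarter-cycle IPD stimulus above could be generated (illustrative Python; equal levels and simultaneous onsets, as specified in the example):

```python
import numpy as np

SR = 44100  # sample rate, Hz

def stereo_ipd(freq_hz, dur_s, ipd_cycles, sr=SR):
    """Equal-level, simultaneous-onset stereo tone; positive ipd_cycles
    makes the left channel lead in phase (0.25 = 1/4 cycle)."""
    t = np.arange(int(sr * dur_s)) / sr
    left = np.sin(2 * np.pi * (freq_hz * t + ipd_cycles))
    right = np.sin(2 * np.pi * freq_hz * t)
    return np.stack([left, right])

# The three-tone sequence from the example: left-leading, in-phase, right-leading
sequence = [stereo_ipd(300.0, 1.0, ipd) for ipd in (+0.25, 0.0, -0.25)]
```

Swapping in 100Hz or 8000Hz for the 300Hz argument recreates the comparison signals referred to below.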

Listen to three stereo signals with the same phase relationships, as above, at 100Hz and 8000Hz.
What types of lateralization do they result in? Do you still get a clear sense of motion?

Interaural phase difference of exactly 1/2 cycle (180°) results in a wider stereo image rather than in lateralization changes.

Example: Two successive 300Hz tones are presented in stereo (must use headphones - no onset difference)
_In the first, both channels are in phase (i.e. IPD = 0°);
_In the second, there is a 1/2 cycle phase difference between channels (i.e. IPD = 180°).
 

Interaural Time Difference (ITD) vs. Interaural Phase Difference (IPD)
(must listen over headphones)

* When IPD=0, the signal appears to come from the center, regardless of whether ITD=0 (top-left graph) or ITD≠0 (bottom-left graph).
* When IPD≠0, the signal appears to come from one side, regardless of whether ITD≠0 (top-right graph) or ITD=0 (bottom-right graph).
In other words, the apparent location of the sound source is determined by the IPD rather than the ITD values.
ITDs are relevant only in terms of the IPD values they impose.

 

 
INTERLUDE: DICHOTIC BEATS
 

The phenomenon of dichotic beats (often inaccurately referred to as 'binaural beats') describes a beating-like sensation arising when two signals with slightly different frequencies are presented dichotically (i.e. one per ear, through headphones).

Contrary to the beating sensations that accompany signals with amplitude fluctuation rates <~15 fluctuations/sec., dichotic beats are not the result of periodic alternations between constructive and destructive interference; dichotic presentation does not permit physical interaction. Rather, they are the result of periodic changes in IPDs and a direct manifestation of our ability to detect the systematic IPD changes that, at low frequencies, accompany sound-source motion. So, the sensation of dichotic beats is based on our hearing mechanism's use of static IPDs as sound-source localization cues and of dynamic IPDs as sound-source motion-detection cues, at low frequencies.

Dichotic/binaural beats have acquired cult status, with hundreds of websites and videos linking them to a variety of mental effects. The apparent fascination is partially due to the disorienting sensation elicited by this unnatural form of listening (dichotic listening) and partially due to the coincidence between the most salient dichotic beat rates and the frequencies of some brain-waves (e.g. Theta and Delta brain-waves).

For very small frequency differences (<~4Hz) between ears, the resulting IPD modulations give rise to a "rotating" sensation inside the head that can be easily identified as such (e.g. Channel 1: 250Hz & Channel 2: 250.5Hz). 

For larger frequency differences, (>~4Hz), the sensation does resemble the loudness fluctuations (beating) that would result if the tones in each ear were allowed to interfere, because the rotation rate is faster than the hearing mechanism's ability to follow it. However, this beating-like sensation is much less pronounced than it would be if actual interference had taken place (e.g. Channel 1: 250Hz & Channel 2: 257Hz - listen via headphones and via loudspeakers; is there a perceptual difference?).
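A sketch of how such dichotic pairs could be synthesized (illustrative Python; the frequency values come from the examples above):

```python
import numpy as np

SR = 44100  # sample rate, Hz

def dichotic_pair(f_left_hz, f_right_hz, dur_s, sr=SR):
    """One pure tone per ear (headphones only); the two signals never mix
    in the air, so any beating-like sensation must arise neurally, from the
    continuously drifting interaural phase difference."""
    t = np.arange(int(sr * dur_s)) / sr
    return np.stack([np.sin(2 * np.pi * f_left_hz * t),
                     np.sin(2 * np.pi * f_right_hz * t)])

# The IPD drifts by (f_right - f_left) full cycles per second, so:
rotating = dichotic_pair(250.0, 250.5, 4.0)  # 0.5 Hz drift: in-head "rotation"
beating = dichotic_pair(250.0, 257.0, 4.0)   # 7 Hz drift: beating-like sensation
```

Summing each pair's channels into one signal and playing it over loudspeakers recreates the interference-based comparison suggested above.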

Dichotic and interference-based beating sensations are manifestations of different physical, physiological, and perceptual phenomena.
(For additional information see here and here)

 


 

   

Monaural & Interaural Spectral Differences - HRTF / ATF

 

 
For sound signals of intermediate frequencies
(~ 500Hz<f<~1,500Hz)
, IID and IPD cues do provide some useful localization information, but only in the azimuth, and only if combined (IID cues are perceivable down to ~500Hz and IPD cues are perceivable up to ~1,500Hz).
For such frequencies, where IID and IPD cues are ambiguous/unreliable in the azimuth, as well as for all frequencies in the case of sound-source elevation changes, where IID=IPD=0, the auditory system relies on monaural spectral cues and interaural spectral difference cues. 

Monaural spectral cues and interaural spectral difference cues are due to the torso, head, and outer ear performing azimuth- and, most importantly, elevation-dependent spectral filtering on signals. This filtering 'colors' the spectral composition of the signal arriving in each ear differently, depending on sound-source location.
It is commonly referred to as Head Related Transfer Function or HRTF. The more accurate Anatomic Transfer Function or ATF captures the contribution of parts of the body other than the head but is less commonly used.

As is the case with most sound source localization cues, spectral cues are not foolproof. For example, in the median plane, signals rich in high frequencies tend to be localized higher than signals rich in low frequencies, even when the source elevation remains the same.

At frequencies <~200Hz, whose wavelengths are larger than the dimensions of the structures involved (head, torso, pinnae), interaural spectral differences and the associated HRTFs do not provide useful localization cues at any orientation.

Interaural spectral differences due specifically to structural differences between the pinnae of the two ears contribute to better lateralization (i.e. more accurate virtual location of the sound within the head), especially in the elevation plane. Pinna-related spectral filtering alone does not help us 'construct' a complete aural 'image' of the outside world.
Experiments that apply HRTFs (ATFs) obtained from dummy heads/torsos on the equalization and reproduction of signals indicate that interaural spectral differences due to the entire torso/head/pinnae system contain information that helps us perceive and reconstruct the actual source location outside the head.

Head Related Transfer Function (HRTF) is defined as the ratio of the sound pressure spectrum measured at the eardrum to the sound pressure spectrum that would exist at the same location if the listener was removed. The figure to the right displays HRTFs as a function of sound-source elevation angle. From this and other similar datasets it has been inferred that:

• The 8kHz region seems to correlate with overhead perception (i.e. spectral changes in this region correlate with changes in the perceived location of sound sources above our heads) 
• Regions in the frequency bands 300-600Hz & 3000-6000Hz seem to correlate with frontal perception
• Regions centered at around 1200Hz & 12000Hz seem to correlate with rear perception.
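The ratio definition above can be illustrated numerically (a minimal sketch; the impulse responses below are made up for demonstration and do not represent measured HRTF data):

```python
import numpy as np

def hrtf_magnitude(eardrum_ir, freefield_ir, n_fft=1024):
    """Magnitude HRTF: spectrum measured at the eardrum divided by the
    spectrum that would exist at the same location with the listener
    removed (free field)."""
    ear = np.fft.rfft(eardrum_ir, n_fft)
    ref = np.fft.rfft(freefield_ir, n_fft)
    return np.abs(ear) / np.maximum(np.abs(ref), 1e-12)

# Hypothetical impulse responses: the "eardrum" response adds a small
# pinna-like reflection a few samples after the direct sound, which
# produces elevation-style comb filtering (peaks and notches) in the ratio.
freefield = np.zeros(256)
freefield[0] = 1.0
eardrum = freefield.copy()
eardrum[7] += 0.5
H = hrtf_magnitude(eardrum, freefield)   # varies between ~0.5 and ~1.5
```

Real HRTF measurements replace these toy impulse responses with recordings made at the eardrum (or at a dummy head's microphone) for each source direction.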

HRTFs are personalized, as they depend on variable pinna, head, and torso construction among individuals (e.g. data in the figure, below).
Consequently, spectral sound-source localization cues and the associated HRTFs are most likely learned through experience, with individuals generally localizing better with their own cues than with those of others.

At the same time, it has been shown that physiological differences may result in some listeners performing much better than others on auditory localization tasks. 
Imposing 'good' HRTFs (through appropriately equalized headphone listening) on individuals who have difficulty localizing sound sources can improve their localization performance by providing more salient interaural spectral differences, assuming comparable head size between 'donor' and 'recipient' and sufficient learning time.

Experimental explorations of HRTFs/ATFs use specially-designed binaural heads (e.g. KEMAR, by G.R.A.S. Sound & Vibration, Denmark) to record signals at various positions in their path to the ear drum and tease out the various anatomical spectral-filtering contributions.
 

 

 


 

   

Binaural Cues & "Release from Masking"

 

IID and IPD cues also help us perceive tones that would have otherwise been masked. The following listening examples illustrate this point and correspond to the three scenarios described in the figure to the right (must use headphones).

(A) Example 1: a 300Hz sine signal with no IIDs or IPDs is presented along with a 600Hz-wide noise band, centered at 300Hz. Due to the level difference between signal and noise (the signal is 15dB below the noise) the sine tone is masked.

(B) Example 2: same as in (A) but with a 180° IPD for the sine signal. In spite of the level difference between noise and signal, the sine tone is now perceivable.

(C) Example 3: same as in (A) but with the left channel of the sine signal removed (extreme case of IID for the sine signal). In spite of the now increased level difference between noise and signal, the sine tone is again perceivable. In other words, a masked tone can become audible by reducing its overall level.

[ signals used for the audio examples: Noise  /  300Hz  /  300Hz 180° IPD  /  300Hz no left channel ]
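A sketch of how the three stimuli could be synthesized (illustrative Python, not the module's actual audio files; for simplicity the masker here is broadband noise rather than the 600Hz-wide band used in the examples, and the condition labels are ad hoc):

```python
import numpy as np

SR = 44100  # sample rate, Hz
rng = np.random.default_rng(0)

def bmld_stimulus(condition, dur_s=1.0, sr=SR):
    """300Hz tone 15dB below a diotic (identical-in-both-ears) noise masker.
    'S0'   : no interaural difference for the tone (masked, as in A)
    'Spi'  : tone phase-inverted in one ear, 180° IPD (audible, as in B)
    'Smono': tone in one ear only, extreme IID (audible, as in C)"""
    t = np.arange(int(sr * dur_s)) / sr
    noise = rng.standard_normal(t.size)
    noise /= np.abs(noise).max()
    tone = 10.0 ** (-15.0 / 20.0) * np.sin(2 * np.pi * 300.0 * t)
    if condition == "S0":
        return np.stack([noise + tone, noise + tone])
    if condition == "Spi":
        return np.stack([noise + tone, noise - tone])
    if condition == "Smono":
        return np.stack([noise + tone, noise])
    raise ValueError(condition)
```

Only the interaural configuration of the tone changes across conditions; the masker is identical in both ears throughout, which is what makes the release from masking a binaural effect.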

IPDs (for low frequencies) and IIDs (for high frequencies) can reduce a signal's detection threshold by up to ~15dB. The effect is less salient in "IPD & high frequencies" and "IID & low frequencies" contexts (why?).

The release from masking of complex signals is facilitated further by interaural spectral differences.

The described release from masking is not due to our ability to localize a signal thanks to the imposed interaural differences. Rather, it is due to signal de-correlation between ears, supported by the interaural changes and supporting the employment of cognitive strategies (e.g. attention focusing) for signal detection in complex sonic environments.

(in Plack, 2005: 180)

 


 

   

Cone of Confusion - Head-Movement Localization Cues

 

Cone of confusion

Sound sources moving on a sagittal plane (i.e. changing in elevation) produce no IID or IPD changes and only moderate interaural spectral difference changes. Consequently, their movement is difficult to track by purely auditory means. More generally, for any given IID or IPD value, there will be a conical surface extending out of the ear that will produce identical IIDs and IPDs, hindering sound source localization over the surface (see to the right). 

The most effective strategy in resolving sound-source localization ambiguities is head movement, assuming the sound signal lasts long enough, unchanged, to allow for such a movement to be of use. Moving the head in the horizontal plane can help resolve front-to-back ambiguities (e.g. see below-left), while head tilting can help resolve top-to-bottom ambiguities.

Interaural spectral differences and the associated HRTFs do help resolve localization ambiguities, whether in the median sagittal plane or within any cone of confusion, but only partially.

As is the case with most sound-source localization cues, the salience of head-movement cues depends largely upon long-term learning and experience.
Auditory localization experiments using headphones and conflicting source and head movements exploit our reliance on previous experience, resulting in revealing illusions. 
Explore the figures, below, and read the explanations in the captions.

(in Plack, 2005: 184)

(in Plack, 2005: 185)

 

 


 

   

Judging Sound-Source Distance

 

 
Distance and loudness

Judging sound-source distance can be partially aided by loudness cues. In general, softer sounds are more likely to be associated with sources farther away and louder sounds with sources nearer. 

Loudness cues are only reliable when comparing the loudness of a sound to a known reference that precedes it and/or when judging familiar sounds. In all other cases, loudness cues cannot reliably support distance judgments.

Even in reliable contexts, distance changes are underestimated when judged based solely on loudness changes. 
More specifically, although distance doubling in the free field (i.e. where there are no reflections) corresponds to an ~6dB SPL reduction, listeners require an ~20dB SPL reduction in order to report that their distance to a sound-source has doubled.  
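The free-field arithmetic, as a sketch:

```python
import math

def spl_change_db(d1_m, d2_m):
    """Free-field SPL change when moving from distance d1 to d2: pressure
    falls as 1/distance, so the change is -20*log10(d2/d1) dB."""
    return -20.0 * math.log10(d2_m / d1_m)

doubling = spl_change_db(1.0, 2.0)        # ~-6 dB per distance doubling
# Listeners, however, need ~20 dB of attenuation before reporting a doubled
# distance; in the free field, -20 dB corresponds to moving 10x farther away:
tenfold = spl_change_db(1.0, 10.0)        # -20 dB
```

The mismatch between the physical -6dB and the perceptual ~-20dB criterion is the underestimation described above.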

Distance and reverberation

In reflective environments, distance judgments are aided by reverberation cues. In general, the greater the distance of the source the greater the proportion of the reverberant (relative to the direct) sound.

Direct-to-reverberant-sound cues provide coarse source-distance information and only become perceptible for distance changes by a factor of two or larger. In addition, distance-change judgments based on this cue alone tend, again, to be underestimated.

Distance and spectral composition (timbre)

Changes in sound-source distance are linked to timbral changes, mainly due to corresponding changes in a sound's high-to-low frequency SPL ratio. In general, increasing the distance from a sound source tends to reduce this ratio because air absorption reduces high frequencies far more than low frequencies. This cue is most perceptible for large changes in distance. 

In addition, the greater likelihood that higher frequencies will be blocked by obstacles and that lower frequencies will diffract around them further reduces a sound's high-to-low frequency energy ratio with distance. Absorption (by air) and blocking (by obstacles) of high frequencies can therefore explain the observation that low frequencies travel much further than high frequencies and that, at very large distances, the level of high frequencies effectively drops to zero.

NOTE
In the absence of obstacles and for short-to-middle distances (~20-40m from a source, where air absorption is negligible), all frequencies lose ~6dB SPL for each doubling in distance (why?) and the high-to-low frequency SPL ratio remains fixed.

However:
A given change in dB corresponds to a larger loudness change at low vs. high frequencies, as can be inferred from the equal loudness contours (why?).
In the above scenario, increasing the distance from a source will therefore reduce the loudness of low frequencies more than that of high frequencies, even though the SPL of both frequency ranges will be reduced by the same amount. In other words, while the SPL ratio of high-to-low frequencies remains fixed, their loudness ratio increases.

 

Sound-source distance and overall location judgments based solely on aural cues are not precise and require a combination of experience/familiarity with any given sound source and with changes in context, loudness, reverberation, timbral, and visual cues that may accompany changes in sound source location.
This is one of the reasons why, for example, early attempts to apply sound-source localization cues in self-driving vehicles failed. The most promising approaches, based on neural networks and deep learning, employ the discussed sound-source localization cues differently than human listeners do. See, for example, these studies:
_ Three-Dimensional Sound Source Localization for Unmanned Ground Vehicles with a Self-Rotational Two-Microphone Array
_ Sound Source Localization of Cars at Intersections Based on Deep Learning

   


 

   

The Precedence Effect

 

 
The precedence effect describes a learned strategy employed implicitly by listeners in order to address conflicting or ambiguous localization cues occurring in environments where sound wave reflections play an important role (e.g. all rooms other than anechoic environments). This strategy is believed to develop through exposure to reverberant listening contexts and is eventually applied automatically to all listening contexts, leading to possible auditory localization "illusions."

According to this effect, listeners make their sound-source localization judgments based on the earliest-arriving sound onset. The term "precedence" indicates that the direct sound, with presumably accurate localization information, is given precedence over the subsequent reflections and reverberation, which convey inaccurate localization information and blur the source's aural "image." In fact, in reflective/reverberant environments, both IPD and ILD cues may be diffused to such an extent that they become unusable, making the precedence effect a necessary localization strategy.

The Haas effect (named after mid-20th-century German psychoacoustician Helmut Haas) describes the observation that, for arrival-time differences up to 30–40 milliseconds, the precedence effect persists (i.e. listeners localize the sound source based on the earlier-arriving signal) even if the delayed signal is up to 12dB stronger than the first (see the image, below, and watch this demonstration).

The Franssen effect (named after mid-20th-century Dutch physicist and inventor Nico V. Franssen) describes the observation that an intense, fast-rising tone presented in a reverberant space via a single loudspeaker will continue to be perceived as originating from that loudspeaker even after it has gradually moved to a second loudspeaker, 45 degrees away.

The figure to the left illustrates a precedence effect demonstration with two loudspeakers reproducing the same pulsed wave. The pulse from the left speaker leads in the left ear by a few microseconds, suggesting that the source is on the left. The pulse from the right speaker leads in the right ear by a similar amount, which provides a contradictory localization cue. Because the listener is closer to the left speaker, the left pulse arrives sooner and wins the competition—the listener perceives just a single pulse coming from the left.
From "How we Localize Sound" by American physicist, W. M. Hartmann; standard reference resource on the topic of sound source localization
.

[ Optional: Watch Prof. Hartmann's 2018 presentation on the state of sound source localization research at McGill University's Center for Interdisciplinary Research in Music Media and Technology. ]

  

The figure, below, illustrates the interaction between interaural time & intensity differences, observed in experiments exploring the precedence effect. 

  • For small delays (<~1ms) between ears (i.e. small differences in the distance between a sound source and each ear), localization is biased towards the side producing the louder sound. 

  • For delays between 1-40ms, localization is biased towards the earlier sound. 

  • For larger delays we hear two sounds, with the second perceived as an echo of the first, even if the 'echo' is stronger than the original.
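A sketch of a two-channel click stimulus for demonstrating these regions over headphones (illustrative Python; the delay and gain values follow the figures above):

```python
import numpy as np

SR = 44100  # sample rate, Hz

def click_pair(delay_ms, lag_gain_db=0.0, sr=SR, dur_s=0.1):
    """Lead click in the left channel, delayed copy in the right. For delays
    of ~1-40ms the pair is heard as ONE event at the lead position, even if
    the lag is up to ~12dB stronger (Haas effect); beyond ~40ms the lag is
    heard as a separate echo."""
    n = int(sr * dur_s)
    lead, lag = np.zeros(n), np.zeros(n)
    lead[0] = 1.0
    lag[int(sr * delay_ms / 1000.0)] = 10.0 ** (lag_gain_db / 20.0)
    return np.stack([lead, lag])

fused = click_pair(10.0, lag_gain_db=6.0)   # one click, localized to the lead
echoed = click_pair(60.0)                   # lead click plus a distinct echo
```

Sweeping `delay_ms` from below 1ms up past 40ms walks a listener through all three regions described above.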

[Optional: Brown et al., 2015. The Precedence Effect in Sound Localization.]

   


 

OPTIONAL: Sound source localization neural mechanisms

To explain ITD (IPD) detection, American psychologist Lloyd Jeffress hypothesized the presence of a coincidence detector, at the neural level, that uses delay lines to compare arrival times at each ear. His theory is illustrated in the figure, below left (Jeffress, 1948, in Plack, 2005: 182).

Assuming the presence of neuron arrays, which are tuned to different delays and encode signal arrival time differences between ears, the Jeffress model is equivalent to cross-correlation, comparing the inputs from each ear at different time delays [ video explanation ].
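The cross-correlation interpretation of the Jeffress model can be sketched directly (illustrative Python; the test signal is hypothetical and the ~0.0008s lag limit follows this module's figures):

```python
import numpy as np

SR = 44100  # sample rate, Hz

def estimate_itd(left, right, sr=SR, max_itd_s=0.0008):
    """Jeffress-style cross-correlation: compare the two ear signals at every
    physically possible lag and return the best-matching delay in seconds
    (positive = left ear leads), mimicking an array of coincidence detectors
    tuned to different delays."""
    max_lag = int(sr * max_itd_s)
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            score = np.dot(left[:len(left) - lag], right[lag:])
        else:
            score = np.dot(left[-lag:], right[:len(right) + lag])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sr

# Hypothetical test: noise reaching the right ear 20 samples late, as if
# the source were on the listener's left.
rng = np.random.default_rng(1)
sig = rng.standard_normal(4096)
left_ear = sig
right_ear = np.concatenate([np.zeros(20), sig[:-20]])
itd = estimate_itd(left_ear, right_ear)   # ~20/44100 s, i.e. ~0.00045 s
```

The brute-force loop stands in for the parallel array of delay-tuned neurons; a physiological implementation would evaluate all lags simultaneously.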

Jeffress's model has been partially confirmed by physiological evidence from birds (e.g. barn owl).

However, evidence from mammals (e.g. the gerbil) suggests broad sensitivity to interaural time differences at each characteristic frequency, corresponding to the period of that frequency (see above; in Plack, 2005: 183).

This observation fits with the already discussed observations that
a) ITDs do not provide sound source localization at high frequencies
    (the Jeffress model cannot explain this) and
b) it is a very special type of ITDs that is of importance: IPDs
    (only up to 1/2-cycle IPDs are well represented by the gerbil data). 

[Advanced review in Ashida & Carr, 2011]

IID detection can be explained in terms of the associated difference between excitatory and inhibitory activity in the two ears when stimulated by signals that display IIDs (see Park et al., 1996).
 

Explore this list of links to relevant publications by Prof. W. Hartmann and his colleagues (Michigan State University).

   


 

  

Loyola Marymount University - School of Film & Television