PATTERN


Spectral patterns

From a physical point of view, it is convenient to model the wave-like quality of speech signals as a finite sum of harmonic oscillations at discrete vibration rates, that is, a fundamental oscillation with higher harmonics superimposed on it. Each of these "partial tones" is uniquely characterized by 2 qualities: its frequency ("pitch") and its amplitude ("loudness").
      In terms of this model, speech signals are composed of a series of partial tones ranging in frequency from 64 to 16384 Hz (8 octaves). Specifically, speech signals encompass (1) characteristic sounds made up of a limited number of partial tones whose vibration rates are integer multiples of a fundamental tone, and (2) characteristic noise made up of a large number of partial tones with vibration rates covering the whole frequency range. The amplitudes of partial tones are not constant but fluctuate around average values. The same is true, though less pronounced, of frequency: partial tones shift subtly towards lower and higher pitch, resulting in characteristic intonation patterns that exhibit variations around a mean vocal pitch.
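The partial tone model can be illustrated with a minimal sketch that sums a few harmonics of a fundamental tone. The function names, the sample rate, and the example amplitudes are assumptions chosen for illustration only; they are not taken from the study.

```python
import math

def partial_frequencies(f0, count):
    """Frequencies of the first `count` partial tones: integer multiples of f0."""
    return [f0 * (k + 1) for k in range(count)]

def synthesize_partials(f0, amplitudes, duration=0.01, sample_rate=32768):
    """Sum harmonic oscillations k*f0 (k = 1, 2, ...) with the given amplitudes.

    `sample_rate` and `duration` are illustrative defaults, not values
    from the source.
    """
    n_samples = int(duration * sample_rate)
    signal = []
    for i in range(n_samples):
        t = i / sample_rate
        # Each partial tone is fully described by its frequency and amplitude.
        sample = sum(a * math.sin(2 * math.pi * (k + 1) * f0 * t)
                     for k, a in enumerate(amplitudes))
        signal.append(sample)
    return signal

# Example: a 128 Hz fundamental with one weaker overtone at 256 Hz.
tone = synthesize_partials(128.0, [1.0, 0.5])
```

In this idealized sketch the amplitudes and frequencies are constant; the fluctuations described in the text would correspond to letting both vary slowly over time.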
      Even though the fluctuations of partial tones possess a random character (a number of factors involved in the speech production process cannot be controlled precisely enough), detailed analyses also reveal strong intrinsic regularities. These make it possible, for example, to distinguish between (1) "natural" fluctuations which are characteristic of the speaker, and (2) fluctuations which reflect responses to stimuli or indicate interactions with the immediate environment. Moreover, all fluctuations of partial tones display marked inter-individual differences, thus pointing to the distinct individuality of human speech.
      Based on the partial tone model and taking into account the specific properties of time-dependent spectra derived from non-stationary time series, we developed our concept of spectral patterns (Stassen 1980; Stassen et al. 1985). This approach generalizes the notion of a "spectrum" in the sense that spectral intensities are regarded as fluctuating rather than fixed-valued, thus incorporating the non-stationary nature of speech signals into the model. Specifically, minimum and maximum intensity bounds as a function of frequency are used to define the narrowest possible bounded region in the frequency domain which entirely encompasses the upper and lower limits of the spectral distributions derived from consecutive epochs. In order to determine these minimum and maximum intensity bounds, spectral analyses are carried out independently for non-overlapping neighbourhoods of appropriately chosen time points t1, t2, ..., tk. The region in the spectral domain bounded by the 2 distribution curves is called a spectral pattern and measures the fluctuations in the frequency composition of a time series.
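As a rough sketch of this construction, the lower and upper bounds can be taken as the band-wise minimum and maximum over the spectra of consecutive epochs. How each epoch spectrum is estimated is not shown here; the function name and the input layout (one list of intensities per epoch) are assumptions made for illustration.

```python
def spectral_pattern(epoch_spectra):
    """Return (lower, upper) intensity bounds per frequency band.

    epoch_spectra: list of spectra, one per non-overlapping epoch;
    each spectrum is a list of intensities, one value per frequency band.
    The region between `lower` and `upper` is the narrowest band-wise
    envelope that contains every epoch spectrum.
    """
    if not epoch_spectra:
        raise ValueError("at least one epoch spectrum is required")
    n_bands = len(epoch_spectra[0])
    if any(len(spectrum) != n_bands for spectrum in epoch_spectra):
        raise ValueError("all epoch spectra must cover the same bands")
    # zip(*...) regroups the values band by band across all epochs.
    lower = [min(values) for values in zip(*epoch_spectra)]
    upper = [max(values) for values in zip(*epoch_spectra)]
    return lower, upper

# Example with 3 epochs and 3 frequency bands (made-up intensities).
lower, upper = spectral_pattern([[1.0, 5.0, 3.0],
                                 [2.0, 4.0, 6.0],
                                 [0.5, 7.0, 3.0]])
```

The width `upper[b] - lower[b]` then measures how strongly band `b` fluctuates across epochs.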
      Figures 4.1 and 4.2 show spectral patterns derived from an average male speaker (Figure 4.1) and an average female speaker (Figure 4.2). In these figures, the variability of spectral intensities is plotted as a shaded area on log-proportional scales along the vertical axis as a function of frequency (horizontal axis). The frequency resolution is 1 quartertone over the full range of 64 - 8192 Hz (7 octaves).
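The quartertone resolution can be made concrete with a small sketch. It assumes equal-tempered quartertones, that is, 24 equal logarithmic steps per octave, which gives 7 x 24 = 168 bands between 64 and 8192 Hz. The function name is hypothetical.

```python
def quartertone_edges(f_low=64.0, octaves=7, steps_per_octave=24):
    """Band edges spaced one quartertone apart (assumed: 24 steps per octave)."""
    n_bands = octaves * steps_per_octave
    # Each step multiplies the frequency by 2**(1/24).
    return [f_low * 2 ** (k / steps_per_octave) for k in range(n_bands + 1)]

edges = quartertone_edges()
n_bands = len(edges) - 1   # number of quartertone bands between 64 and 8192 Hz
```

Because the bands are equidistant on a logarithmic frequency axis, a uniform change in pitch corresponds to moving every partial tone by the same number of bands.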


Figure 4.1
Fig. 4.1: Spectral voice pattern derived from a male speaker. The variability of spectral intensities is plotted as a shaded area on log-proportional scales along the vertical axis. The spectral resolution is 1 quartertone over the frequency range of 64 - 8192 Hz (7 octaves).



Regarded as a function of frequency, the width of the shaded area is for the most part constant and apparently independent of the actual intensity values. In other words, the chosen logarithmic transformation almost perfectly compensates for the high correlation between intensities and corresponding variabilities. As to the specific properties of spectral voice patterns, we will later see that they are quasi-stationary quantities, closely related to each individual, and even allow a computerized identification of individuals at high reliability.


Figure 4.2
Fig. 4.2: Spectral voice pattern derived from a female speaker. The variability of spectral intensities is plotted as a shaded area on log-proportional scales along the vertical axis. The spectral resolution is 1 quartertone over the frequency range of 64 - 8192 Hz (7 octaves).



Similarity between spectral patterns

Spectral patterns are constructed in analogy to feature vectors in the field of pattern recognition and require the definition of an appropriate similarity measure in order to quantify inter-individual differences or intra-individual agreement. Because of the specific nature of spectral patterns, set-theoretical similarity functions are well suited for measuring their overall agreement or lack thereof (Levandowsky and Winter 1971; Tversky 1977). The characteristics of such a similarity measure are displayed in Figures 4.3 and 4.4 for 168 equally weighted frequency bands of 1 quartertone width between 64 and 8192 Hz. An essentially constant similarity over all 168 quartertones (lying, of course, above a certain threshold value) suggests that the patterns under comparison were produced by the same person (Figure 4.3), whereas a more irregular similarity curve with pronounced dips (Figure 4.4) indicates that patterns of different speakers are being compared.
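One possible set-theoretical similarity, shown here only as an illustration, treats each band of a spectral pattern as an interval [lower, upper] and takes the ratio of the overlap to the combined extent of the 2 intervals. This is an assumed stand-in; the exact similarity function used in the study is not given in the text, and all names below are hypothetical.

```python
def band_similarity(a, b):
    """Overlap-to-union ratio of two intervals a = (lo, hi), b = (lo, hi)."""
    lo_a, hi_a = a
    lo_b, hi_b = b
    overlap = max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    union = (hi_a - lo_a) + (hi_b - lo_b) - overlap
    if union == 0.0:
        # Both intervals are single points: identical or disjoint.
        return 1.0 if lo_a == lo_b else 0.0
    return overlap / union

def similarity_curve(pattern_a, pattern_b):
    """Band-wise similarity of two patterns given as lists of (lo, hi)."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must have the same number of bands")
    return [band_similarity(a, b) for a, b in zip(pattern_a, pattern_b)]

def overall_similarity(pattern_a, pattern_b):
    """Mean of the band-wise similarities (all bands equally weighted)."""
    curve = similarity_curve(pattern_a, pattern_b)
    return sum(curve) / len(curve)
```

Under this sketch, a flat curve close to 1 corresponds to the same-speaker case, while bands with little or no overlap produce the dips seen when different speakers are compared.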


Figure 4.3
Fig. 4.3: Similarity between spectral patterns as a function of frequency resulting from a comparison of 2 recordings from the same person at a 14-day interval. The spectral resolution is 1 quartertone over the frequency range of 64 - 8192 Hz (7 octaves).



Although this generalized similarity measure works quite well in most cases, it does not cover one serious problem: when comparing spectral patterns obtained from the same person but under different experimental conditions, a tonal shift of mean vocal pitch by one or more quartertones considerably reduces the overall agreement of the patterns. Such tonal shifts are due, for example, to psychological factors like stress or to physical factors like fatigue or diseases of the throat. Since our spectral analysis provides a resolution of equidistant quartertones over the full tonal range of 64 - 8192 Hz, the locations of overtones relative to each other remain unchanged when mean vocal pitch changes. Thus, a simple linear translation (which shifts the spectrum as a whole) is automatically used to compensate for differences in the mean vocal pitch of a speaker. In our studies, allowing for a tonal shift of quartertone turned out to be optimum.
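The shift compensation can be sketched as a search over whole-band translations of one pattern against the other, keeping the shift with the highest mean similarity. The interval-overlap similarity and the parameter `max_shift` are assumptions for illustration; in particular, the default of 1 band is a placeholder and not the value reported by the study.

```python
def band_similarity(a, b):
    """Overlap-to-union ratio of two intervals (assumed similarity measure)."""
    lo_a, hi_a = a
    lo_b, hi_b = b
    overlap = max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    union = (hi_a - lo_a) + (hi_b - lo_b) - overlap
    if union == 0.0:
        return 1.0 if lo_a == lo_b else 0.0
    return overlap / union

def mean_similarity(p, q, shift=0):
    """Compare band i of p with band i + shift of q where both exist."""
    pairs = [(p[i], q[i + shift]) for i in range(len(p))
             if 0 <= i + shift < len(q)]
    if not pairs:
        return 0.0
    return sum(band_similarity(a, b) for a, b in pairs) / len(pairs)

def best_shift(p, q, max_shift=1):
    """Return (shift, similarity) maximizing the mean similarity.

    Shifts are tried in order of increasing magnitude, so ties favour
    the smallest translation.
    """
    candidates = sorted(range(-max_shift, max_shift + 1), key=abs)
    shift = max(candidates, key=lambda s: mean_similarity(p, q, s))
    return shift, mean_similarity(p, q, shift)
```

Because the bands are equidistant quartertones, a translation by `shift` bands moves all overtones by the same musical interval, which is why a single integer is enough to describe the compensation.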


Figure 4.4
Fig. 4.4: Similarity between spectral patterns as a function of frequency resulting from a comparison of 2 unrelated persons whose voices had been recorded under comparable experimental conditions. The spectral resolution is 1 quartertone over the frequency range of 64 - 8192 Hz (7 octaves).



Iterative optimization of spectral patterns

Once spectral patterns have been designed and a suitable similarity measure has been selected, we need to determine the free parameters inherent to the approach: (1) an appropriate recording time containing enough information for a reliable estimation of spectral patterns, (2) a subdivision of the recording time into epochs long enough that the actual composition of partial tones is revealed, yet short enough that the individual variability of each spectral component can be estimated from a sequence of consecutive epochs, and (3) a subdivision of the frequency domain into intervals compatible with the relative importance of frequency bands in their dependence upon psycho-acoustic functions. This task is a typical problem of pattern recognition and, accordingly, can be carried out by means of trainable algorithms in connection with a design sample set, a test sample set, and an appropriate criterion function ("supervised learning"). Such procedures merely assume that if all training samples are correctly processed, few mistakes will be made on the test sample set.
      To determine the above quantities by means of optimizing procedures, we could rely upon our calibration study based on a sufficiently large and representative sample of a total of 192 male and female speakers. Since the speakers' voices were recorded twice at an interval of 14 days, one set of recordings could be used as design samples whereas the other set could serve as test samples in order to derive sample-independent and reproducible calibration parameters. Specifically, the optimization was carried out on the basis of a subsample of 97 persons (age group I). Then, in a second step, the resulting calibration values were applied to the other subsample of 90 persons (age group II) in order to test the overall performance of the method under discussion.
      Lacking appropriate a priori knowledge, we determined the problem-specific parameter setting by using the computerized identification of persons by means of their spectral voice patterns as the external validation criterion. During optimization, the principal design parameters (1) epoch length, (2) number of consecutive epochs, and (3) number, width and location of frequency bands were systematically varied within prespecified margins (Table 4.1).
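A systematic variation of design parameters against an external criterion can be sketched as an exhaustive grid search. This is a generic illustration, not the optimization procedure actually used: the parameter names, the candidate values, and the toy criterion below are placeholders, and the real margins are those of Table 4.1.

```python
from itertools import product

def grid_search(criterion, grid):
    """Evaluate `criterion` on every parameter combination in `grid`.

    criterion: callable taking a dict of parameters and returning a score
               (for example, a recognition rate on the design samples).
    grid:      dict mapping parameter names to lists of candidate values.
    Returns the best parameter dict and its score.
    """
    names = sorted(grid)
    best_params, best_score = None, None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = criterion(params)
        if best_score is None or score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy criterion with a known maximum; it stands in for the recognition
# rate and carries no information about the study's results.
def toy_criterion(params):
    return -(params["a"] - 2) ** 2 - (params["b"] - 10) ** 2

best, score = grid_search(toy_criterion, {"a": [1, 2, 3], "b": [5, 10, 20]})
```

In a design/test setup such as the one described above, the criterion would be evaluated on the design samples only, and the selected parameters would then be applied unchanged to the test samples.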


Table 4.1
Tab. 4.1: Variation of calibration parameters during optimization.



Our optimization (age group I, N=97) yielded clear and reproducible (age group II, N=90) maxima for the following parameters:


Table 4.2
Tab. 4.2: Recognition rate as a function of 2 equally weighted frequency bands derived from a bisection of the frequency domain (fixed text, 2 measurements at 14-day intervals).


Table 4.3
Tab. 4.3: Reproducibility of results: recognition rates derived independently from the 2 age groups on the basis of the same calibration values.



Systematic intra- and interindividual comparisons between spectral voice patterns were carried out in order to determine the degree to which the sampled interspeaker variability was greater than the sampled intraspeaker variability. Both forms of variability were quantified by computing the underlying distribution curves (Figures 4.5, 4.6). Based on our sample of 97 persons (age group I), we found, with respect to a given cutoff-value, a total of 5/97 (5.2%) false-negative and 257/9312 (2.8%) false-positive comparisons. The results derived from our second sample of 90 persons (age group II) are almost identical (Table 4.3).
      These classification errors, however, suggest an overly optimistic picture. Indeed, the true rate of uniquely recognized persons is slightly lower: using all 3 texts simultaneously, 90/97 (92.8%) of persons were uniquely recognized, 1/97 (1.0%) remained in doubt, and 6/97 (6.2%) could not be recognized. Based on one text only (a fixed text, no matter which of the 3 available texts), the rate of uniquely recognized persons dropped to about 85%. It dropped further to about 71% when recognition was attempted on the basis of different texts (text-independent speaker recognition). All these results were highly stable and reproducible. Almost no differences showed up, either between "backward" recognition (2nd recording served as reference) and "forward" recognition (1st recording served as reference) or between age groups. Moreover, our findings did not seem to depend on sample size: reducing the sample at random to 60 and 40 persons did not significantly improve the rate of uniquely recognized persons.
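The false-negative and false-positive rates quoted above can be reproduced with a short sketch. The denominators are consistent with 97 intra-individual comparisons and 97 x 96 = 9312 ordered inter-individual comparisons; reading the 9312 as ordered pairs is an inference from the numbers, and the function names and the cutoff convention (similarity at or above the cutoff counts as "same person") are assumptions.

```python
def comparison_counts(n_speakers):
    """Number of intra-individual and ordered inter-individual comparisons."""
    return n_speakers, n_speakers * (n_speakers - 1)

def error_rates(intra, inter, cutoff):
    """False-negative and false-positive rates for a similarity cutoff.

    intra: similarity coefficients of same-speaker comparisons
    inter: similarity coefficients of different-speaker comparisons
    """
    false_negatives = sum(1 for s in intra if s < cutoff)
    false_positives = sum(1 for s in inter if s >= cutoff)
    return false_negatives / len(intra), false_positives / len(inter)

n_intra, n_inter = comparison_counts(97)       # 97 and 9312
fn_percent = round(100 * 5 / n_intra, 1)       # 5.2
fp_percent = round(100 * 257 / n_inter, 1)     # 2.8
unique_percent = round(100 * 90 / 97, 1)       # 92.8
```

Moving the cutoff trades one error type against the other: a higher cutoff rejects more impostor comparisons but also more genuine ones.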


Figure 4.5
Fig. 4.5: Discrimination between the distributions of intra-individual (upper) and inter-individual (lower) similarity coefficients based on 97 healthy subjects and recordings at 14-day intervals (age group I).


Figure 4.6
Fig. 4.6: Discrimination between the distributions of intra-individual (upper) and inter-individual (lower) similarity coefficients based on 90 healthy subjects and recordings at 14-day intervals (age group II).



Conclusions

In the spectral pattern approach, our interest has focused on the individual sound characteristics of speakers ("timbre") rather than on speech behaviour. Since the timbre of a voice is primarily determined by the overtone distributions of the underlying sound waves, we determined the optimum parameter setting for a problem-specific, reliable estimation of time-dependent spectra in order to be able to differentiate between those parts of overtone distributions which are invariant over time and represent the "identity" of a speaker, and those parts which reflect responses to or interactions with the immediate environment.
      An interval of 1 second was found to be optimum for reproducibly assessing formants and corresponding bandwidths in >95% of cases. Based on these findings, we adapted the concept of "spectral patterns" to speech analysis. It turned out that spectral voice patterns are stable over time and measure the fine gradations of mutual differences between human voices. Even a computerized recognition of persons by means of these quantities and on the basis of 16 - 32 second time series was possible with high reliability: 92.8% of persons could be uniquely recognized at 14-day intervals. Hence, we succeeded in developing specific means for modelling intra-individual changes of voice timbre over time. This is of particular interest for investigations into the speech characteristics of affectively disturbed patients since the tonal expressiveness of human voices, or the lack thereof, essentially depends on the actual distribution of overtones and the corresponding variabilities.

