From a physical point of view, it is convenient to model the
wave-like quality of speech signals as a finite sum of harmonic
oscillations at discrete vibration rates, that is, a fundamental
oscillation with higher harmonics superimposed on it. Each of these
"partial tones" is uniquely characterized by 2 qualities: its
frequency ("pitch") and its amplitude ("loudness").
In terms of this model, speech signals are composed of a
series of partial tones ranging in frequency from 64 to 16384
Hz (8 octaves). Specifically, speech signals encompass (1)
characteristic sounds made up of a limited number of partial
tones whose vibration rates are integer multiples of a
fundamental tone and (2) characteristic noise made up of a
large number of partial tones with vibration rates covering the
whole frequency range. The amplitudes of partial tones are not
constant but, rather, fluctuate around average values. The same
is true, though less pronounced, of frequency. Here we find
subtle shifts of partial tones towards lower and higher pitch,
resulting in characteristic intonation patterns that exhibit
variations around a mean vocal pitch.
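The partial tone model described above can be illustrated with a minimal sketch that synthesizes a signal as a finite sum of harmonics of a fundamental. The function name `partial_tone_signal`, the sample rate, and the example amplitudes are assumptions made for illustration only, and the sketch holds amplitudes and frequencies constant, whereas real speech shows the fluctuations discussed in the text.

```python
import math

def partial_tone_signal(f0, amplitudes, duration, sample_rate=8000):
    """Return samples of a sum of harmonics of f0.

    amplitudes[k] is the amplitude of the partial tone at (k + 1) * f0.
    """
    n_samples = int(duration * sample_rate)
    signal = []
    for n in range(n_samples):
        t = n / sample_rate
        value = sum(
            a * math.sin(2.0 * math.pi * (k + 1) * f0 * t)
            for k, a in enumerate(amplitudes)
        )
        signal.append(value)
    return signal

# Hypothetical example: a 128 Hz fundamental with three weaker harmonics.
samples = partial_tone_signal(128.0, [1.0, 0.5, 0.25, 0.125], duration=0.5)
```

Characteristic noise, the second component named in the text, is not modelled here.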
Even though the fluctuations of partial tones possess a
random character (a number of factors involved in
the speech production process cannot be controlled
precisely enough), detailed analyses also reveal strong
intrinsic regularities that make it possible, for example, to distinguish
between (1) "natural" fluctuations which are characteristic of
the speaker and, (2) fluctuations which reflect responses to
stimuli or indicate interactions with the immediate
environment. Moreover, all fluctuations of partial tones display
marked inter-individual differences, thus pointing to the
distinct individuality of human speech.
Based on the partial tone model and taking into account the
specific properties of time-dependent spectra derived from
non-stationary time series, we developed our concept of
spectral patterns (Stassen 1980; Stassen et al. 1985). This
approach generalizes the notion "spectrum" in the sense that
corresponding spectral intensities are regarded as fluctuating
rather than being fixed-valued, thus incorporating the
non-stationary nature of speech signals into the model. Specifically,
minimum and maximum intensity bounds as a function of
frequency are used to define the narrowest possible bounded
region in the frequency domain which entirely encompasses
the upper and lower limits of spectral distributions derived
from consecutive epochs. In order to determine such minimum
and maximum intensity bounds, spectral analyses are carried
out independently for non-overlapping neighbourhoods of
appropriately chosen time points t1, t2, ..., tk. The
region in the spectral domain bounded by the 2 distribution
curves is called a spectral pattern and measures the
fluctuations in the frequency composition of a time series.
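The construction of a spectral pattern from consecutive epochs can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the direct Fourier projection used to estimate intensities, and the toy signal are all assumptions.

```python
import math

def band_intensity(epoch, freq, sample_rate):
    """Power of one epoch at a single frequency (direct Fourier projection)."""
    re = sum(x * math.cos(2 * math.pi * freq * n / sample_rate)
             for n, x in enumerate(epoch))
    im = sum(x * math.sin(2 * math.pi * freq * n / sample_rate)
             for n, x in enumerate(epoch))
    return (re * re + im * im) / len(epoch) ** 2

def spectral_pattern(signal, sample_rate, epoch_len, freqs):
    """Per-frequency lower and upper intensity bounds over non-overlapping epochs."""
    epochs = [signal[i:i + epoch_len]
              for i in range(0, len(signal) - epoch_len + 1, epoch_len)]
    spectra = [[band_intensity(e, f, sample_rate) for f in freqs] for e in epochs]
    lower = [min(col) for col in zip(*spectra)]
    upper = [max(col) for col in zip(*spectra)]
    return lower, upper

# Toy signal (assumption): a 100 Hz tone whose amplitude changes per epoch.
sample_rate, epoch_len = 1000, 100
signal = [
    amp * math.sin(2 * math.pi * 100 * n / sample_rate)
    for amp in (1.0, 0.5, 2.0)
    for n in range(epoch_len)
]
lower, upper = spectral_pattern(signal, sample_rate, epoch_len, [100.0, 200.0])
```

For a sine of amplitude A spanning whole cycles, this intensity estimate equals A²/4, so the bounds at 100 Hz reflect the smallest and largest epoch amplitudes, while the bounds at 200 Hz stay near zero.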
Figures 4.1 and 4.2 show
spectral patterns derived in one case from an average male
speaker (Figure 4.1) and, in the other case, from an average
female speaker (Figure 4.2). In these figures, the variability of
spectral intensities is plotted as a shaded area on a
log-proportional scale along the vertical axis as a function of
frequency (horizontal axis). The frequency resolution is 1
quartertone over the full range of 64 - 8192 Hz (7 octaves).
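A quartertone resolution over 7 octaves corresponds to 24 equal steps per octave, that is, 7 × 24 = 168 bands between 64 and 8192 Hz. A minimal sketch of such a frequency grid follows; the function names are hypothetical, and treating each band as a half-open interval is an assumption.

```python
import math

def quartertone_edges(f_low=64.0, octaves=7, steps_per_octave=24):
    """Band edges spaced one quartertone apart (24 equal steps per octave)."""
    n_bands = octaves * steps_per_octave
    return [f_low * 2.0 ** (k / steps_per_octave) for k in range(n_bands + 1)]

def band_index(freq, f_low=64.0, steps_per_octave=24):
    """Index of the quartertone band [edge_k, edge_k+1) that contains freq."""
    return math.floor(steps_per_octave * math.log2(freq / f_low))

edges = quartertone_edges()
n_bands = len(edges) - 1
```

Because the grid is logarithmic, each band is wider in Hz than the one below it, but all bands span the same musical interval.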
Fig. 4.1: Spectral voice pattern derived from a male speaker.
The variability of spectral intensities is plotted as a shaded area
on a log-proportional scale along the vertical axis. The spectral
resolution is 1 quartertone over the frequency range of
64-8192 Hz (7 octaves).
Regarded as a function of frequency, the width of the shaded area is for the most part constant and apparently independent of the actual intensity values. In other words, the chosen logarithmic transformation almost perfectly compensates for the high correlation between intensities and corresponding variabilities. As to the specific properties of spectral voice patterns, we will later see that they are quasi-stationary quantities, closely related to each individual, and even allow a computerized identification of individuals with high reliability.
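Why a logarithmic scale can equalize the width of the shaded area is illustrated by the sketch below. It assumes, purely for illustration, that fluctuations are proportional to intensity, so that the bounds are I(1 - c) and I(1 + c); the function name and the value c = 0.2 are hypothetical.

```python
import math

def pattern_width(intensity, rel_fluct, log_scale):
    """Width of the band between the lower and upper intensity bounds."""
    lower = intensity * (1.0 - rel_fluct)
    upper = intensity * (1.0 + rel_fluct)
    if log_scale:
        return math.log10(upper) - math.log10(lower)
    return upper - lower

intensities = [1.0, 100.0, 10000.0]
linear_widths = [pattern_width(i, 0.2, log_scale=False) for i in intensities]
log_widths = [pattern_width(i, 0.2, log_scale=True) for i in intensities]
```

Under this assumption the linear width grows with intensity, whereas the logarithmic width reduces to log10((1 + c) / (1 - c)), which no longer depends on the intensity.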
Fig. 4.2: Spectral voice pattern derived from a female
speaker. The variability of spectral intensities is plotted as a
shaded area on a log-proportional scale along the vertical axis.
The spectral resolution is 1 quartertone over the frequency
range of 64 - 8192 Hz (7 octaves).
Spectral patterns are constructed analogously to feature vectors in the field of pattern recognition and require the definition of an appropriate similarity measure in order to quantify inter-individual differences or intra-individual coincidences. Because of the specific nature of spectral patterns, set-theoretical similarity functions are well-suited for measuring their overall agreement or lack thereof (Levandowsky and Winter 1971; Tversky 1977). The characteristics of such a similarity measure are displayed in Figures 4.3 and 4.4 for 168 equally weighted frequency bands of 1 quartertone width between 64 and 8192 Hz. An essentially constant similarity over all 168 quartertones (which is, of course, above a certain threshold value) suggests that the patterns under comparison were produced by the same person (Figure 4.3), whereas a more irregular similarity curve with pronounced dips (Figure 4.4) indicates that patterns of different speakers are compared.
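One possible set-theoretical similarity of this kind is sketched below. It treats each frequency band of a pattern as an interval (lower bound, upper bound) and measures the ratio of intersection to union of the two intervals. The exact similarity function used in the study is not given in this section, so this Jaccard-type ratio, the function names, and the example values are assumptions.

```python
def band_similarity(a, b):
    """Intersection-over-union of two intensity intervals a=(lo, hi), b=(lo, hi)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    if union == 0:
        # Both intervals are single points: similar only if they coincide.
        return 1.0 if a[0] == b[0] else 0.0
    return inter / union

def similarity_curve(pattern_a, pattern_b):
    """Similarity as a function of frequency band."""
    return [band_similarity(a, b) for a, b in zip(pattern_a, pattern_b)]

def overall_similarity(pattern_a, pattern_b):
    """Mean over equally weighted bands."""
    curve = similarity_curve(pattern_a, pattern_b)
    return sum(curve) / len(curve)

# Hypothetical three-band patterns.
pattern_a = [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0)]
pattern_b = [(1.0, 3.0), (1.0, 3.0), (5.0, 6.0)]
curve = similarity_curve(pattern_a, pattern_b)
```

A flat curve close to 1 corresponds to the same-speaker case described in the text, and a curve with bands near 0 corresponds to the pronounced dips seen for different speakers.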
Fig. 4.3: Similarity between spectral patterns as a function of
frequency resulting from a comparison of 2 recordings from
the same person at a 14-day interval. The spectral resolution is
1 quartertone over the frequency range of 64 - 8192 Hz (7 octaves).
Although this generalized similarity measure works quite well in most cases, it leaves one serious problem uncovered: when comparing spectral patterns obtained from the same person but under different experimental conditions, a tonal shift of mean vocal pitch by one or more quartertones considerably reduces the overall agreement of patterns. Such tonal shifts are due, for example, to psychological factors like stress or to physical factors like fatigue or diseases of the throat. Since our spectral analysis provides for a resolution of equidistant quartertones over the full tonal range 64 - 8192 Hz, the location of overtones relative to each other remains unchanged if mean vocal pitch changes. Thus, a simple linear translation (which shifts the spectrum as a whole) is automatically used to compensate for differences in mean vocal pitch of a speaker. In our studies, allowing for a tonal shift of quartertone turned out to be optimum.
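Compensation for a tonal shift can be sketched as a search over small band offsets: one pattern is shifted by a whole number of quartertone bands against the other, and the offset with the highest mean similarity is kept. The function names, the interval-overlap similarity, and the example patterns are assumptions, and `max_shift` is a free parameter of the sketch rather than the value used in the study.

```python
def interval_overlap(a, b):
    """Intersection-over-union of two intervals (assumed similarity measure)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def best_shift_similarity(pattern_a, pattern_b, max_shift, similarity=interval_overlap):
    """Return (shift, mean similarity) for the best alignment within +/- max_shift bands.

    max_shift must be smaller than the number of bands so that every
    candidate shift leaves at least one overlapping band.
    """
    best = None
    for shift in range(-max_shift, max_shift + 1):
        pairs = [
            (pattern_a[i], pattern_b[i + shift])
            for i in range(len(pattern_a))
            if 0 <= i + shift < len(pattern_b)
        ]
        score = sum(similarity(a, b) for a, b in pairs) / len(pairs)
        if best is None or score > best[1]:
            best = (shift, score)
    return best

# Hypothetical patterns: pattern_b is pattern_a moved up by one band.
pattern_a = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0), (6.0, 7.0)]
pattern_b = [(9.0, 10.0)] + pattern_a[:-1]
shift, score = best_shift_similarity(pattern_a, pattern_b, max_shift=1)
```

Shifting by whole bands is sufficient here because, on a quartertone grid, a change in mean vocal pitch moves all overtones by the same number of bands.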
Fig. 4.4: Similarity between spectral patterns as a function of
frequency resulting from a comparison of 2 unrelated persons
whose voices had been recorded under comparable
experimental conditions. The spectral resolution is 1
quartertone over the frequency range of 64 - 8192 Hz (7 octaves).
Once spectral patterns have been designed and a suitable
similarity measure has been selected, we need to determine the
free parameters inherent to the approach: (1) an appropriate
recording time containing enough information for a reliable
estimation of spectral patterns, (2) a subdivision of the
recording time into epochs large enough that the actual
composition of partial tones is revealed, yet small enough that
the individual variability of each spectral component can be
estimated from a sequence of consecutive epochs and, (3) a
subdivision of the frequency domain into intervals compatible
with the relative importance of frequency bands in their
dependence upon psycho-acoustic functions. This task is a
typical problem of pattern recognition and, accordingly, can be
carried out by means of trainable algorithms in connection with
a design sample set, a test sample set, and an appropriate
criterion function ("supervised learning"). Such procedures
merely assume that if all training samples are correctly
processed, few mistakes will be made on the test sample set.
To determine the above quantities by means of
optimization procedures, we could rely upon our calibration
study based on a sufficiently large and representative sample of
192 male and female speakers in total. Since the speakers'
voices were recorded twice at an interval of 14 days, one set of
recordings could be used as design samples whereas the other
set of recordings could be referred to as test samples in order to
derive sample independent and reproducible calibration
parameters. Specifically, the optimization was carried out on
the basis of a subsample of 97 persons (age group I). Then, in a
second step, the resulting calibration values were applied to the
other subsample of 90 persons (age group II) in order to test
the overall performance of the method under discussion.
Lacking appropriate a-priori knowledge, we determined the
problem-specific parameter setting by using the computerized
identification of persons by means of their spectral voice
patterns as external validation criterion. During optimization,
the principal design parameters (1) epoch length, (2) number of
consecutive epochs, and (3) number, width and location of
frequency bands were systematically varied within prespecified
margins (Table 4.1).
Tab. 4.1: Variation of calibration parameters during
optimization.
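The systematic variation of design parameters can be sketched as an exhaustive grid search that keeps the setting with the highest recognition rate on the design samples. Everything below is hypothetical: the function names, the candidate values, and in particular `toy_rate`, which is a made-up stand-in for the actual identification run and does not reproduce any result of the study.

```python
from itertools import product

def grid_search(epoch_lengths, epoch_counts, band_counts, recognition_rate):
    """Try every parameter combination and keep the best-scoring one."""
    best_params, best_rate = None, float("-inf")
    for params in product(epoch_lengths, epoch_counts, band_counts):
        rate = recognition_rate(*params)
        if rate > best_rate:
            best_params, best_rate = params, rate
    return best_params, best_rate

def toy_rate(epoch_len_s, n_epochs, n_bands):
    """Made-up criterion for illustration only (not the study's criterion)."""
    return (1.0
            - 0.1 * abs(epoch_len_s - 1.0)
            - 0.01 * abs(n_epochs - 16)
            - 0.001 * abs(n_bands - 168))

best_params, best_rate = grid_search(
    epoch_lengths=[0.5, 1.0, 2.0],   # seconds (placeholder values)
    epoch_counts=[8, 16, 32],        # placeholder values
    band_counts=[84, 168],           # placeholder values
    recognition_rate=toy_rate,
)
```

In an actual calibration, `recognition_rate` would run the speaker identification on the design samples for the given setting, and the selected setting would then be checked on the independent test samples.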
Our optimization (age group I, N=97) yielded clear and reproducible (age group II, N=90) maxima for the following parameters:
Tab. 4.2: Recognition rate as a function of 2 equally weighted
frequency bands derived from a bisection of the frequency
domain (fixed text, 2 measurements at 14 day intervals).
Tab. 4.3: Reproducibility of results: recognition rates derived
independently from the 2 age groups on the basis of the same
calibration values.
Systematic intra- and interindividual comparisons between
spectral voice patterns were carried out in order to determine
the degree to which the sampled interspeaker variability was
greater than the sampled intraspeaker variability. Both forms of
variability were quantified by computing the underlying
distribution curves (Figures 4.5, 4.6). Based on our sample of
97 persons (age group I), we found, with respect to a given
cutoff-value, a total of 5/97 (5.2%) false-negative and
257/9312 (2.8%) false-positive comparisons. The results
derived from our second sample of 90 persons (age group II)
are almost identical (Table 4.3).
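The error rates quoted above follow from counting comparisons against a cutoff: with 97 persons there are 97 intra-individual comparisons and 97 × 96 = 9312 ordered inter-individual comparisons. A minimal sketch of this bookkeeping follows; the function name, the convention that a similarity at or above the cutoff counts as a match, and the toy matrix are assumptions.

```python
def error_rates(similarity, cutoff):
    """similarity[i][j]: first recording of person i vs second recording of person j.

    Returns (false-negative rate, false-positive rate) for the given cutoff.
    """
    n = len(similarity)
    false_neg = sum(1 for i in range(n) if similarity[i][i] < cutoff)
    false_pos = sum(
        1 for i in range(n) for j in range(n)
        if i != j and similarity[i][j] >= cutoff
    )
    return false_neg / n, false_pos / (n * (n - 1))

# Hypothetical 3-person similarity matrix.
toy = [
    [0.90, 0.20, 0.80],
    [0.10, 0.40, 0.30],
    [0.20, 0.10, 0.95],
]
fn_rate, fp_rate = error_rates(toy, cutoff=0.5)

# Arithmetic behind the figures reported in the text.
n_inter = 97 * 96
fn_percent = round(100 * 5 / 97, 1)
fp_percent = round(100 * 257 / n_inter, 1)
```

Raising the cutoff trades false positives for false negatives, which is why both rates are reported for one given cutoff value.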
The latter classification errors, however, suggest an overly
optimistic picture. Indeed, the true rate of uniquely recognized
persons is slightly worse: using all 3 texts simultaneously, we
found 90/97 (92.8%) of persons being uniquely recognized,
1/97 (1.0%) of persons in doubt, whereas 6/97 (6.2%) of
persons could not be recognized.
Based on one text only (a fixed text, no matter which of the
3 available texts), the rate of uniquely recognized persons was
reduced to about 85%. The rate of uniquely recognized
persons even dropped to about 71% if a recognition was tried
on the basis of different texts (text-independent speaker
recognition). All these results were highly stable and
reproducible. Almost no differences showed up, either
between "backward" recognition (2nd recording served as
reference) and "forward" recognition (1st recording served as
reference) or between age groups. Moreover, our findings did
not seem to depend on sample sizes: reducing the sample size
at random to 60 and 40 persons did not significantly improve
the rate of uniquely recognized persons.
Fig. 4.5: Discrimination between the distributions of
intra-individual (upper) and inter-individual (lower) similarity
coefficients based on 97 healthy subjects and recordings at
14-day intervals (age group I).
Fig. 4.6: Discrimination between the distributions of
intra-individual (upper) and inter-individual (lower) similarity
coefficients based on 90 healthy subjects and recordings at
14-day intervals (age group II).
In the spectral pattern approach, our interest has focused on
the individual sound characteristics of speakers ("timbre") rather
than on speech behaviour. Since the timbre of a voice is
primarily determined by the overtone distributions of the
underlying sound waves, we determined the optimum
parameter setting for a problem-specific, reliable estimation of
time-dependent spectra in order to be able to differentiate
between those parts of overtone distributions which are
invariant over time and represent the "identity" of a speaker,
and those parts which reflect responses to or interactions with
the immediate environment.
An interval of 1 second in length was found to be optimum for
reproducibly assessing formants and corresponding bandwidths
in >95% of cases. Based on these findings, we adapted the
concept of "spectral patterns" to speech analysis. It turned out
that spectral voice patterns are stable over time and measure
the fine graduations of mutual differences between human
voices. Even a computerized recognition of persons by means
of these quantities and on the basis of 16 - 32 second time
series was possible with a high reliability: 92.8% of persons
could be uniquely recognized at 14 day intervals. Hence, we
succeeded in developing specific means for modelling
intra-individual changes of voice timbres over time. This is of
particular interest for investigations into the speech
characteristics of affectively disturbed patients since the tonal
expressiveness of human voices, or the lack thereof, essentially
depends on the actual distribution of overtones and the
corresponding variabilities.