Acoustic Instrument Resynthesis
From DSPWiki
Old Theories
From 1885 until the 1960s, physics books held that instruments got their signature sounds from the relative intensities of the steady-state harmonics of a tone. This theory was presented by Helmholtz in 1885 in On the Sensations of Tone as a Physiological Basis for the Theory of Music, and was partially correct, but didn't tell the whole story. The stereotyped harmonic profiles of acoustic instruments weren't very consistent, and attempts to resynthesize instrumental sounds with banks of oscillators were unsuccessful.
Partials of a church bell
A famous frequency analysis of a church bell was conducted by A. Lehr, and published in 1965. The analysis was done with filters to determine the exact partials that sound in a bell, which are:
sub-octave of fundamental
fundamental
minor third
fifth
octave
octave + major third
octave + fifth
2 octaves
2 octaves + flat fourth
2 octaves + sixth
3 octaves
...
Inharmonic Sound
The attempt to synthesize traditional harmonic instrument tones failed, so inharmonic composition became popular in electronic music. For instance, Karlheinz Stockhausen decided that there was a soft boundary between timbre and rhythm, and began to compose music that treated waveforms as microrhythms related to the macrorhythm of his piece.
Successful Resynthesis of an Acoustic Sound
Max Mathews and Jean-Claude Risset came to the rescue in 1969 with their article Analysis of musical instrument tones, published in Physics Today, which contained a Fourier analysis of a trumpet tone. They found that the relative levels of the harmonics in the tone changed over time, and proposed that the dynamic components of the spectrum were more important to human recognition of an instrument than the steady-state components of a tone. Mathews and Risset demonstrated this principle by synthesizing a fairly accurate trumpet tone using additive synthesis. They combined several sine-wave oscillators, created a separate amplitude envelope for each oscillator, and generated the very first synthesized sound that resembled an acoustic instrument. The key lesson is that the energy of each harmonic changes over time, and that this change is the most important distinguishing characteristic of a tone produced by an acoustic instrument.
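The oscillator-plus-envelope idea described above can be sketched in a few lines. This is only an illustration, not the original analysis data: the sample rate, the function names (envelope, additive_tone) and the breakpoint values in ENVELOPES are all assumptions made up for the example.

```python
import math

SAMPLE_RATE = 8000  # Hz; assumed, kept low so the example runs quickly


def envelope(breakpoints, t):
    """Piecewise-linear amplitude at time t from (time, amplitude) pairs."""
    if t <= breakpoints[0][0]:
        return breakpoints[0][1]
    for (t0, a0), (t1, a1) in zip(breakpoints, breakpoints[1:]):
        if t0 <= t <= t1:
            return a0 + (a1 - a0) * (t - t0) / (t1 - t0)
    return breakpoints[-1][1]


def additive_tone(fundamental, envelopes, duration):
    """Sum one sine oscillator per harmonic, each scaled by its own envelope."""
    out = []
    for n in range(int(duration * SAMPLE_RATE)):
        t = n / SAMPLE_RATE
        sample = 0.0
        for k, bps in enumerate(envelopes, start=1):
            sample += envelope(bps, t) * math.sin(2 * math.pi * k * fundamental * t)
        out.append(sample)
    return out


# Illustrative (not measured) envelopes: upper harmonics enter later, fade sooner
ENVELOPES = [
    [(0.0, 0.0), (0.02, 1.0), (0.40, 0.8), (0.50, 0.0)],  # harmonic 1
    [(0.0, 0.0), (0.04, 0.6), (0.35, 0.5), (0.50, 0.0)],  # harmonic 2
    [(0.0, 0.0), (0.08, 0.4), (0.30, 0.3), (0.50, 0.0)],  # harmonic 3
]
tone = additive_tone(220.0, ENVELOPES, 0.5)
```

Because every harmonic has its own envelope, the spectrum of the result changes from moment to moment, which a single fixed waveform cannot do.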
They also discovered that the trumpet tones were different every time, but there were similarities between all trumpet tones. The beginning of the tone had more energy in low harmonics, and energy was transferred to higher harmonics as the tone progressed, returning to lower harmonics again at the end of the tone. The beginning of a sound carries the most information about what made the sound, and most of the important information in a tone is generated within 0.08 seconds. The auditory system is very adept at distinguishing and identifying signature sounds.
In addition to dynamic amplitudes, harmonics also change frequency over time, especially at the beginning of a tone, where the harmonic frequencies are very erratic. For instance, the beginning of a violin tone is a mess of random frequencies, but the harmonics all come into line as soon as the string starts its resonant vibration. This all happens very quickly, but it's enough for us to determine that what we're listening to is a violin.
The relative energy put into the system (dynamics: f, mf, mp, p, etc.) also varies the levels of the harmonics. Timbre typically changes with volume level, so loud tones produced on acoustic instruments tend to have higher intensity levels in the upper harmonics.
Additive Synthesis
Additive Synthesis is one of the oldest synthesis techniques, and it relies on the summation of elementary waveforms to create a more complex waveform. The concept of additive synthesis began with pipe organs, with their multiple register stops. Additive synthesis has also been used in electronic music since its beginning, such as the Telharmonium synthesizer, unveiled in 1906, which summed sound from dozens of electrical tone generators. Using a smaller version of the Telharmonium's rotating tone generators, the Hammond organs are also pure additive synthesis instruments.
Fixed-waveform Additive Synthesis: Some software packages and synthesizers let musicians create waveforms by harmonic addition. Once a desirable spectrum is tuned, the software calculates a waveform that reproduces the spectrum when played on a digital oscillator.
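A minimal sketch of the fixed-waveform approach, assuming a simple table-lookup oscillator with no interpolation. The function names (build_wavetable, table_oscillator), the table size and the harmonic amplitudes are illustrative choices, not taken from any particular package.

```python
import math


def build_wavetable(harmonic_amps, table_size=1024):
    """One cycle of a waveform whose k-th harmonic has amplitude harmonic_amps[k-1]."""
    table = []
    for i in range(table_size):
        phase = 2 * math.pi * i / table_size
        table.append(sum(a * math.sin(k * phase)
                         for k, a in enumerate(harmonic_amps, start=1)))
    return table


def table_oscillator(table, freq, sample_rate, n_samples):
    """Read the table at a rate set by freq (truncating lookup, no interpolation)."""
    out, phase = [], 0.0
    increment = freq * len(table) / sample_rate
    for _ in range(n_samples):
        out.append(table[int(phase) % len(table)])
        phase = (phase + increment) % len(table)
    return out


# Odd harmonics at 1/k: a rough, band-limited square-like spectrum (illustrative)
table = build_wavetable([1.0, 0.0, 1 / 3, 0.0, 1 / 5])
samples = table_oscillator(table, 440.0, 44100, 100)
```

The spectrum is baked into the table once, so playback costs one lookup per sample regardless of how many harmonics were summed. The trade-off is that the spectrum cannot change during the note.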
Phase: Depending on its context, phase may or may not be a significant factor in additive synthesis. In a steady tone, the different starting phases of the individual harmonics are not audible to the listener, although they do have a significant effect on the visual appearance of the waveform. Phase relationships do play a role, though, in the attacks, grains, and transients of a tone. The ear is also sensitive to phase relationships that change over time. Proper phase data help reassemble short-lived components of a sound in their correct order, and are therefore essential in reconstructing an analyzed sound.
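To illustrate the point about waveform appearance, the following sketch builds two steady waveforms from the same harmonic amplitudes but different starting phases, then measures their harmonic magnitudes with a naive DFT. The period length N, the amplitudes and the phase values are arbitrary assumptions.

```python
import cmath
import math

N = 64  # samples in one period (assumed)


def waveform(amps, phases):
    """One period of a sum of sine harmonics with the given amplitudes and phases."""
    return [sum(a * math.sin(2 * math.pi * k * n / N + p)
                for k, (a, p) in enumerate(zip(amps, phases), start=1))
            for n in range(N)]


def magnitudes(x, n_harmonics):
    """Magnitude of DFT bins 1..n_harmonics, scaled to sine amplitude."""
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N))) * 2 / N
            for k in range(1, n_harmonics + 1)]


amps = [1.0, 0.5, 0.33]
in_phase = waveform(amps, [0.0, 0.0, 0.0])
shifted = waveform(amps, [0.0, math.pi / 2, math.pi])

# The sample values differ, but the measured harmonic magnitudes do not
largest_difference = max(abs(a - b) for a, b in zip(in_phase, shifted))
mags_in_phase = magnitudes(in_phase, 3)
mags_shifted = magnitudes(shifted, 3)
```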
Demands: Time-varying additive synthesis makes heavy demands on a digital music system. If we make the assumption that each sound event in a piece may have up to 24 partials, and that up to sixteen events can be playing simultaneously, we need 384 oscillators at any given time. If the system's sampling rate is 48 kHz, it must be capable of generating 48,000 x 384 = 18,432,000 samples per second. If each sample requires about 768 operations, the total computational load is over 14 billion operations per second (18,432,000 x 768 = 14,155,776,000). This is all without counting table-lookup operations, or control data.
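The arithmetic can be checked directly. With the figures quoted above (24 partials, sixteen events, 48 kHz, 768 operations per sample), the product comes to roughly 14.2 billion operations per second. The variable names are just labels for those figures.

```python
partials_per_event = 24
simultaneous_events = 16
sample_rate = 48_000      # Hz
ops_per_sample = 768      # figure quoted in the text

oscillators = partials_per_event * simultaneous_events
samples_per_second = sample_rate * oscillators
ops_per_second = samples_per_second * ops_per_sample

print(oscillators)         # 384
print(samples_per_second)  # 18432000
print(ops_per_second)      # 14155776000
```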
Additive Analysis / Resynthesis
Analysis/resynthesis encompasses many techniques that have a three-step process in common:
1. A recorded sound is analyzed
2. The musician modifies the analysis data
3. The modified data is used in resynthesizing the altered sound
Analysis of a waveform is performed by successively segmenting an input signal (called windowing), and sending each windowed segment through a bank of bandpass filters. Each bandpass filter is tuned to a specific center frequency, and provides frequency and amplitude information for an individual sine wave oscillator tuned to that frequency. In practice, a Fast Fourier Transform usually replaces the filter bank, but performs essentially the same task (measuring the energy in each frequency band).
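As a rough sketch of one channel of such a filter bank, the code below windows a single segment and correlates it with a complex sinusoid at the channel's center frequency, which is what one DFT bin computes. The Hann window, the test signal and the names hann and band_amplitude are assumptions chosen for the example. The test frequencies fit a whole number of cycles into the segment, so the readings come out clean.

```python
import cmath
import math


def hann(n_points):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points) for n in range(n_points)]


def band_amplitude(segment, center_freq, sample_rate):
    """Amplitude of the component near center_freq in one windowed segment."""
    window = hann(len(segment))
    acc = sum(w * x * cmath.exp(-2j * math.pi * center_freq * n / sample_rate)
              for n, (w, x) in enumerate(zip(window, segment)))
    # Divide by the window's total gain to undo the attenuation it introduces
    return 2 * abs(acc) / sum(window)


sample_rate = 8000
segment = [0.7 * math.sin(2 * math.pi * 500 * n / sample_rate)
           + 0.2 * math.sin(2 * math.pi * 1500 * n / sample_rate)
           for n in range(512)]

# A three-channel "bank": one amplitude reading per center frequency
bank = {f: band_amplitude(segment, f, sample_rate) for f in (500, 1000, 1500)}
```

Each reading could then drive the amplitude of a sine oscillator tuned to the same frequency during resynthesis.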
Methods for Sound Analysis
Various spectrum analysis methods, such as pitch-synchronous analysis, the phase vocoder, and constant-Q analysis, are variations on the basic technique of Fourier analysis. The practical form of Fourier analysis is the Short-Time Fourier Transform (STFT), which extracts successive short-duration overlapping segments (shaped by a window function), and applies a bank of filters to each segment. At the core of the STFT is the FFT, a computationally efficient implementation of Fourier analysis.
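A small STFT sketch follows, assuming a Hann window and a naive DFT in place of the FFT so that it needs no external library. The frame size, hop size and test signal are arbitrary. The signal is a sine whose amplitude ramps up, so the magnitude in its bin should rise from frame to frame.

```python
import cmath
import math


def stft(signal, frame_size, hop):
    """Return a list of frames; each frame lists bin magnitudes 0..frame_size/2."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    gain = sum(window)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        seg = [w * x for w, x in zip(window, signal[start:start + frame_size])]
        mags = []
        for k in range(frame_size // 2 + 1):
            # Naive DFT for clarity; an FFT computes the same values faster
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            # Scaled so a steady sine of amplitude A reads about A at its bin
            mags.append(2 * abs(acc) / gain)
        frames.append(mags)
    return frames


# Test signal: 8 cycles per 64-sample frame, amplitude ramping from 0 towards 1
signal = [(n / 512) * math.sin(2 * math.pi * 8 * n / 64) for n in range(512)]
frames = stft(signal, frame_size=64, hop=32)
track = [frame[8] for frame in frames]  # amplitude of bin 8 over time
```

The list track is a time-varying amplitude envelope for one partial, which is the kind of data a time-varying additive resynthesis would consume.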
The Phase Vocoder is a popular method of sound analysis/resynthesis that converts a sampled input signal into a time-varying spectral format that may be edited and resynthesized. The phase vocoder can be used for time compression/expansion without altering pitch.
Data Reduction
The amount of data produced by analyzing complex waveforms can be far too much to be useful for the composer. An important goal of data reduction is to compact the data without eliminating perceptually salient features of the input signal, and to make the data-reduced material easy to manipulate.
Line-segment Approximation:
Line-segment approximation reduces data by storing only a set of breakpoint pairs, which are time and amplitude points where there is significant change in the waveform. In resynthesis, the system connects the dots with straight lines interpolated between the breakpoint pairs. This method gives an approximate representation of the original signal, with a great deal less data. Initially, this reduction was done by hand, but can be partially automated. Beauchamp (1975) also developed a heuristic technique for inferring the approximate amplitude curve of all harmonics from the curve of the first harmonic.
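One way to partially automate the breakpoint selection is sketched below. It uses a simple recursive split rule (keep the point that deviates most from the straight line, then repeat on each side), which is an assumption made for illustration and not necessarily the method used historically. The function names and the test envelope are hypothetical.

```python
import math


def reduce_envelope(points, tolerance):
    """Keep the endpoints, plus (recursively) the point that deviates most from
    the straight line between them, until every deviation is within tolerance."""
    (t0, a0), (t1, a1) = points[0], points[-1]
    worst_i, worst_err = None, tolerance
    for i in range(1, len(points) - 1):
        t, a = points[i]
        line = a0 + (a1 - a0) * (t - t0) / (t1 - t0)
        if abs(a - line) > worst_err:
            worst_i, worst_err = i, abs(a - line)
    if worst_i is None:
        return [points[0], points[-1]]
    left = reduce_envelope(points[:worst_i + 1], tolerance)
    right = reduce_envelope(points[worst_i:], tolerance)
    return left[:-1] + right  # drop the duplicated split point


def interpolate(breakpoints, t):
    """Connect the dots: straight-line interpolation between breakpoint pairs."""
    for (t0, a0), (t1, a1) in zip(breakpoints, breakpoints[1:]):
        if t0 <= t <= t1:
            return a0 + (a1 - a0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the envelope")


# Dense test envelope: linear attack to 1.0, then an exponential decay
dense = [(i * 0.01, i / 10 if i <= 10 else math.exp(-0.03 * (i - 10)))
         for i in range(101)]
reduced = reduce_envelope(dense, tolerance=0.02)
```

A larger tolerance keeps fewer breakpoints at the cost of a coarser approximation.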
Principal Components Analysis:
PCA breaks down a waveform using covariance matrix calculation. This results in a set of principal components, and a set of weighting coefficients for these basic components. When the components are summed according to their weights, the result is a close approximation of the original waveform.
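The sketch below shows the covariance-and-components idea on a toy data set. As an assumption for the example, the data are four harmonic amplitude envelopes rather than a raw waveform, only the first principal component is kept, and the eigenvector is found by plain power iteration to avoid external libraries. All names and values are illustrative.

```python
import math


def top_component(rows, iterations=100):
    """Leading eigenvector of the covariance matrix of rows (power iteration)."""
    n_vars, n_obs = len(rows), len(rows[0])
    means = [sum(r) / n_obs for r in rows]
    centered = [[x - m for x in r] for r, m in zip(rows, means)]
    cov = [[sum(a * b for a, b in zip(ci, cj)) / (n_obs - 1) for cj in centered]
           for ci in centered]
    v = [1.0] * n_vars
    for _ in range(iterations):
        v = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # Weighting coefficient of the component at each time frame
    weights = [sum(v[i] * centered[i][t] for i in range(n_vars))
               for t in range(n_obs)]
    return means, v, weights


def reconstruct(means, v, weights):
    """Rebuild every row from its mean, the component, and the weights."""
    return [[m + vi * w for w in weights] for m, vi in zip(means, v)]


# Toy data: four envelopes sharing one shape, plus a small extra wiggle on the last
shape = [math.sin(math.pi * t / 49) for t in range(50)]
gains = [1.0, 0.6, 0.3, 0.1]
envelopes = [[g * s for s in shape] for g in gains]
envelopes[3] = [x + 0.01 * math.cos(6 * math.pi * t / 49)
                for t, x in enumerate(envelopes[3])]

means, component, weights = top_component(envelopes)
approx = reconstruct(means, component, weights)
max_error = max(abs(a - x)
                for ra, rx in zip(approx, envelopes) for a, x in zip(ra, rx))
```

Here 200 envelope values are replaced by 4 means, 4 component entries and 50 weights. The reconstruction misses only the small wiggle that the single component cannot represent.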
Spectral Interpolation Synthesis:
SIS generates time-varying sounds by interpolating between analyzed spectra. It starts from analyses of recorded sounds, and uses additive synthesis to crossfade between the analyses of successive spectra in the frequency domain. The main difficulty with this method is the handling of the attack portion of sounds.
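A minimal sketch of the crossfade, assuming each analyzed spectrum is just a list of harmonic amplitudes and that the interpolation is linear over the whole note. The function names, the sample rate and the two spectra are made up for the example.

```python
import math


def interpolate_spectra(spec_a, spec_b, mix):
    """Linear crossfade of harmonic amplitudes; mix=0 gives spec_a, mix=1 gives spec_b."""
    return [(1 - mix) * a + mix * b for a, b in zip(spec_a, spec_b)]


def sis_tone(spec_a, spec_b, fundamental, duration, sample_rate=8000):
    """Additive synthesis whose harmonic amplitudes glide from spec_a to spec_b."""
    n_samples = int(duration * sample_rate)
    out = []
    for n in range(n_samples):
        amps = interpolate_spectra(spec_a, spec_b, n / (n_samples - 1))
        t = n / sample_rate
        out.append(sum(a * math.sin(2 * math.pi * k * fundamental * t)
                       for k, a in enumerate(amps, start=1)))
    return out


# Illustrative spectra: energy moves from the fundamental into upper harmonics
dull = [1.0, 0.2, 0.05]
bright = [0.4, 0.6, 0.5]
tone = sis_tone(dull, bright, 220.0, 0.25)
```

Only the two spectra and the crossfade rule need to be stored, rather than a full envelope for every harmonic.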
Spectral Modeling Synthesis:
SMS reduces analysis data into a deterministic component and a stochastic component. The deterministic component is a data-reduced version of the analysis that models the most prominent frequencies in the spectrum, which are then resynthesized with sine waves. This part is much the same as is used in phase vocoders. The stochastic component, however, analyzes the difference between the deterministic component and the original signal, and forms a series of envelopes that control a bank of filters through which white noise is passed. This reduces the amount of data that needs to be stored, because the noise portion of the signal would otherwise have to be represented by many sine waves, whereas white noise can be generated at resynthesis time. This method helps noisy components of a sound remain noisy even after transformations are applied to them, but approximating the original noise of a signal with uniform noise frequently leaves room for improvement.
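The split into deterministic and stochastic parts can be sketched in a heavily simplified form. The assumptions are significant: there is a single sinusoid at a known frequency, and the stochastic part is stored as one broadband RMS level per frame rather than as envelopes driving a bank of filters. Function names, frame size and the test signal are all hypothetical.

```python
import cmath
import math
import random


def fit_sinusoid(signal, freq, sample_rate):
    """Estimate amplitude and (sine) phase of a component at a known frequency."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq * i / sample_rate)
              for i, x in enumerate(signal))
    return 2 * abs(acc) / len(signal), cmath.phase(acc) + math.pi / 2


def sms_analyze(signal, freq, sample_rate, frame=64):
    amp, phase = fit_sinusoid(signal, freq, sample_rate)
    deterministic = [amp * math.sin(2 * math.pi * freq * i / sample_rate + phase)
                     for i in range(len(signal))]
    residual = [x - d for x, d in zip(signal, deterministic)]
    # Stochastic part: keep only one RMS level per frame, not the samples
    noise_env = [math.sqrt(sum(r * r for r in residual[s:s + frame]) / frame)
                 for s in range(0, len(residual) - frame + 1, frame)]
    return amp, phase, noise_env


def sms_resynthesize(amp, phase, noise_env, freq, sample_rate, frame=64, seed=1):
    rng = random.Random(seed)
    out = []
    for f, level in enumerate(noise_env):
        for j in range(frame):
            i = f * frame + j
            sine = amp * math.sin(2 * math.pi * freq * i / sample_rate + phase)
            out.append(sine + level * rng.gauss(0.0, 1.0))  # fresh noise, same level
    return out


# Test signal: a 500 Hz sine plus white noise (illustrative values)
source_rng = random.Random(0)
signal = [0.8 * math.sin(2 * math.pi * 500 * i / 8000 + 0.3) + source_rng.gauss(0.0, 0.1)
          for i in range(1024)]
amp, phase, noise_env = sms_analyze(signal, 500, 8000)
resynth = sms_resynthesize(amp, phase, noise_env, 500, 8000)
```

The 1024 input samples are reduced to an amplitude, a phase and 16 noise levels. The resynthesized noise is not the original noise, only noise of the same level, which is the limitation noted above.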
Walsh Function Synthesis
Walsh function synthesis is a different way of looking at synthesis with square waves instead of Fourier's sine waves. A family of square waves called the Walsh functions can be used to approximate a signal after it has been analyzed by means of the Walsh-Hadamard transform. The signal can then be resynthesized with the Fast Walsh Transform (FWT). It is also possible to convert between the Fourier domain and the Walsh domain mathematically.
Sine wave additive synthesis and Walsh function synthesis are conceptual opposites. With sine wave synthesis, the hardest waveform to synthesize is a waveform with rectangular corners, such as a square wave. With Walsh function synthesis, the hardest waveform to synthesize is the sine wave, since it will always remain partially jagged. The main advantage of Walsh function synthesis is the ease with which square waves can be generated in the digital domain, but as the price of processor speed and memory comes down, the usefulness of Walsh function synthesis is diminishing.
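The contrast can be seen with a short fast Walsh-Hadamard transform. This sketch uses the unnormalized transform in natural (Hadamard) ordering and requires the input length to be a power of two. The function names and the 16-sample test signals are assumptions for the example.

```python
import math


def fwht(values):
    """Fast Walsh-Hadamard transform (natural ordering, unnormalized).
    The input length must be a power of two."""
    a = list(values)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a


def inverse_fwht(coeffs):
    """The transform is its own inverse up to a factor of 1/N."""
    return [c / len(coeffs) for c in fwht(coeffs)]


def count_nonzero(coeffs, eps=1e-9):
    return sum(1 for c in coeffs if abs(c) > eps)


N = 16
square = [1.0] * (N // 2) + [-1.0] * (N // 2)
sine = [math.sin(2 * math.pi * (n + 0.5) / N) for n in range(N)]

square_terms = count_nonzero(fwht(square))  # a single Walsh function suffices
sine_terms = count_nonzero(fwht(sine))      # the sine needs several
round_trip = inverse_fwht(fwht(sine))       # analysis followed by resynthesis
```

The transform uses only additions and subtractions, which is the source of the computational convenience mentioned above.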
Timbre and Perception
Perception of timbre has been broken down into three dimensions:
Dimension 1: Dark / Bright (amount of energy in high harmonics)
Dimension 2: How well the harmonics move in a correlated fashion
Dimension 3: The amount of attack noise (not completely independent, though)
The timbre characteristics of acoustic instruments can be plotted in these three dimensions, which shows the relative characteristics of instruments.
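As a rough illustration of the first dimension, brightness is often summarized by a spectral centroid. Using the centroid here is an assumption, since the text only says "amount of energy in high harmonics". The function name and the two example spectra are hypothetical.

```python
def spectral_centroid(harmonic_amps):
    """Amplitude-weighted mean harmonic number; a higher value means a brighter tone."""
    total = sum(harmonic_amps)
    return sum(k * a for k, a in enumerate(harmonic_amps, start=1)) / total


dark = [1.0, 0.3, 0.1, 0.0]    # energy concentrated in the fundamental
bright = [0.5, 0.6, 0.8, 0.7]  # energy spread into the upper harmonics

dark_score = spectral_centroid(dark)
bright_score = spectral_centroid(bright)
```

A single number like this gives each instrument tone a coordinate along the dark/bright axis. The other two dimensions would need their own measures.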
Attempts to interpolate the empty spaces on the graph (e.g. cello + French horn) create tones that sound like a combination of the instruments, and not like a new tone, because of "categorical perception". Categorical perception is the idea that we tend to want to fit sounds we hear into categories with which we're already familiar. Thus, if we hear an interpolation of a cello and a French horn, we tend to hear both the cello and the French horn, and not a completely new instrument. I'm not sure I totally buy this reasoning, since there are plenty of timbres in electronic music that don't sound anything like acoustic instruments. Even for something like a square wave, which is remarkably similar to the sound of a clarinet, most people don't say, "oh, that sounds like a clarinet", when they hear a square wave unless somebody tells them that it sounds a bit like a clarinet. It also seems possible to me that the problem lies in a poor interpolation algorithm.
