Fundamentals of Digital Audio
Sound waves are essentially a series of pressure changes that move through the atmosphere at the speed of sound - around 343.2 metres per second (1,126 ft/s) in dry air at 20°C, the exact value depending mainly on temperature. Sound waves are conducted to the fluid-filled inner ear by means of the eardrum and associated mechanisms, and there the vibrations are turned into neural impulses by hair cells which are sensitive to different frequencies. In this sense one could say that the hearing system is an analogue-to-digital (A-D) converter, but it's more helpful to think of hearing as an analogue system, as this is how we experience sound.
Until the development of digital audio systems, sound transmission and recording technology was analogue. Analogue tape recording, for example, relies on re-positioning magnetic particles to a greater or lesser extent depending on the instantaneous amplitude of the pressure waves captured by microphones. Vinyl discs have a groove cut in them whose displacement from a centre line is proportional to the instantaneous amplitude of the signal.
However, analogue audio techniques - recording in particular - have disadvantages that are mainly to do with the introduction of noise and distortion, although they also have advantages, notably the ability to capture timing information very accurately. Analogue signals vary continuously over time, so they can be referred to as "continuous-time" signals.
The promise of digital audio was that once the signal was sampled from the analogue to the digital domain, the stream of ones and zeroes thus generated could be transferred unchanged from one end of a recording system (the A-D) to the digital-to-analogue (D-A) converter at the receiving end, where the digital stream would be turned back into analogue electrical activity and, via a transducer, back into audible sound.
While this premise is true, the tricky bit is getting the A-D and D-A converters to do their job flawlessly.
The A-D converter is fed with an analogue signal and measures its instantaneous value many times a second; the number of measurements per second is the "sample rate". This series of measurements is expressed as a binary code (we can call these "discrete-time" signals), which is stored (recorded), transmitted, modified where desired with Digital Signal Processing (DSP), and ultimately decoded by the D-A, which generates an instantaneous voltage depending on the value of the digital input, thus recreating the original analogue signal or its processed derivative.
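As a rough illustration of the measuring step described above, the following Python sketch samples a sine wave at regular intervals and rounds each measurement to a signed integer code. The function name, the 1kHz test tone and the 48kHz/16-bit settings are assumptions chosen for the example, not a description of how any particular converter works.

```python
import math

def sample_and_quantise(freq_hz, sample_rate_hz, bits, num_samples):
    """Sample a full-scale sine wave and round each sample to a signed integer code."""
    max_code = 2 ** (bits - 1) - 1  # largest positive code, 32767 for 16 bits
    codes = []
    for n in range(num_samples):
        t = n / sample_rate_hz  # time of the n-th sample, in seconds
        value = math.sin(2 * math.pi * freq_hz * t)  # stand-in for the analogue value, -1..+1
        codes.append(round(value * max_code))
    return codes

# One full cycle of a 1kHz tone at 48kHz takes exactly 48 samples.
codes = sample_and_quantise(1000, 48000, 16, 48)
for n in (0, 12, 24, 36):
    # Show each code as the 16-bit two's-complement word that would be stored.
    print(n, codes[n], format(codes[n] & 0xFFFF, "016b"))
```

Each printed line shows the sample index, its integer value and the binary word; the stream of such words is what gets stored, transmitted or processed.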
At the heart of traditional digital audio is the Nyquist–Shannon sampling theorem, whose name honours Harry Nyquist and Claude Shannon, who were among its discoverers. For our purposes here, we can say that it indicates that the highest frequency you can capture with a digital audio system is half the sampling frequency. The human hearing range extends up to around 20,000 Hz (20kHz) at best, and as a result, when Compact Disc was developed in the 1970s, it was determined that the sample rate need not be much higher than twice that. In fact the value chosen was 44.1 kHz (ie 44,100 samples per second).
An important requirement for this kind of digital audio system is a lowpass filter (ie one that allows through lower frequencies) at each end: an anti-aliasing filter ahead of the A-D and a matching reconstruction filter after the D-A. Nyquist-Shannon requires that the bandwidth being sampled be limited, so that frequencies beyond it are not captured, because if they are, their value is indeterminate. In fact, a signal above half the sample rate behaves as if it were the same amount below half the sample rate, which is a form of distortion (called aliasing). In early digital audio systems, these filters had to be quite sharp. In the case of CD, if we want to capture all the frequencies up to 20kHz, and half the sample rate is 22.05kHz, the filter has to go from wide open to totally closed in about 2kHz - a so-called "brick wall" filter. This is a tall order, especially for the analogue filters in the original digital systems, which caused all manner of phase shifts and other distortions and were largely responsible for the bad reputation that early digital audio systems gained in some quarters. Even with modern digital filters this is still a significant challenge to achieve without introducing unwanted artefacts, and getting filters well 'out of the way' of the audio (where they also have the space to be more gentle, and thus have fewer negative impacts) is one reason for the move towards higher sample rates over recent years.
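The folding behaviour of aliasing can be checked numerically. The sketch below (Python, with function names invented for the example) samples a 30kHz cosine at 44.1kHz and compares it with a 14.1kHz cosine sampled at the same rate. 30kHz is 7.95kHz above half the sample rate and 14.1kHz is 7.95kHz below it, so the two sets of samples should be indistinguishable.

```python
import math

def alias_frequency(freq_hz, sample_rate_hz):
    """Frequency between 0 and half the sample rate that a sampled tone folds onto."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

def sampled_cosine(freq_hz, sample_rate_hz, num_samples):
    """Sample values of a unit-amplitude cosine at the given sample rate."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(num_samples)]

fs = 44100
alias = alias_frequency(30000, fs)       # 14100
above = sampled_cosine(30000, fs, 64)    # tone above half the sample rate
below = sampled_cosine(alias, fs, 64)    # its mirror image below half the sample rate

# The two sample sequences agree to within floating-point rounding.
max_difference = max(abs(a - b) for a, b in zip(above, below))
print(alias, max_difference)
```

Once the samples have been taken there is no way to tell the two tones apart, which is why the out-of-band content has to be removed by the filter before sampling.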
Arguably, the 'standard' sample rates established for CD (44.1kHz) and DVD (48kHz for video synchronisation reasons) were on the edge of possibility at the time these systems were originally developed, and soon after there was a call for higher sample rates, which has continued up to the present-day calls for "high-resolution audio" (96kHz sampling, 24-bit or better). On the face of it, you don't need sample rates to be a whole lot higher, because we simply can't hear much above 20kHz. However, the story is not quite as simple as that. As noted above, a good reason for higher sample rates is to move the filters well out of the audio range so they can be more gentle and don't impact the audible signal. However, you can do this simply by multiplying up the sample rate and inserting interpolated samples between the real ones - a technique called 'oversampling' - and this is today common practice. But increasingly, higher sampling rates are being included in studio equipment - typically up to 192kHz (4x 48kHz). Why?
There has long been debate about the value or otherwise of higher sample rates. Certainly, we can't hear anything much above the limit of the 'standard' sample rates. However, there are numerous claims that we can hear the difference between standard and higher sample rates - some with a good basis in research (this AES paper for example, which establishes that "there exist audible signals that cannot be encoded transparently by a standard CD").
The problem with higher sample rates, however, is that they capture a whole lot of nothing. There is no music up there, but if we want to maximise timing accuracy and keep filters out of the way, we will have to suffer the large file sizes that result. Luckily, storage is increasingly cheap, and recording at 96kHz is well within the capability of many music computer systems.
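To put rough numbers on those file sizes, the sketch below works out the storage needed for one minute of uncompressed audio. It assumes plain stereo PCM with no container overhead or compression, and the function name is made up for the example.

```python
def pcm_bytes_per_minute(sample_rate_hz, bits, channels=2):
    """Bytes needed for one minute of uncompressed PCM audio."""
    return sample_rate_hz * bits * channels * 60 // 8

for label, rate, bits in (("CD", 44100, 16),
                          ("96kHz/24-bit", 96000, 24),
                          ("192kHz/24-bit", 192000, 24)):
    size = pcm_bytes_per_minute(rate, bits)
    print(f"{label}: {size / 1_000_000:.1f} MB per minute")
```

Under these assumptions a minute of stereo audio takes about 10.6 MB at CD quality, 34.6 MB at 96kHz/24-bit and 69.1 MB at 192kHz/24-bit, so moving from CD to 96kHz/24-bit roughly triples the storage required.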
There is also a degree of a "numbers game" in that, just as with specifications such as dynamic range and noise floor, impressive numbers sell units. But everything has a trade-off: if you strive for the best noise floor, you may compromise the distortion, say, and the question would be whether the lower noise floor is worth it - if you're well below the noise floor of the input system already, lower noise in the converter is no benefit at all. The same may be true of sample rates to an extent. Happily, however, the need to retain as much information as possible does justify higher rates, especially for archiving. You want to archive recordings based on the assumption that replay systems will improve in the future. If you've over-specified the accuracy going into the archive, then the chances are that future listeners will get more out of your recording than you can experience now.
One thing to bear in mind is that to properly reconstruct the original analogue waveform in a traditional digital audio system, the D-A needs low-pass filtering. It's commonly believed that a digital signal when reconstructed contains a large number of steps, each representing an instantaneous sample and the whole approximating to the original analogue waveform. This is not the case: the lowpass filter at the output (often called a reconstruction filter) smooths these steps out so that the result accurately conforms to the original waveform.
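The smoothing can be illustrated with an idealised model. In sampling theory a perfect lowpass filter corresponds to replacing each sample with a sinc-shaped pulse and adding the pulses together. The Python sketch below does this for a 1kHz tone sampled at 44.1kHz and evaluates the result halfway between two samples. It is a mathematical idealisation with a finite number of samples, not a model of a real converter's filter, and the names are invented for the example.

```python
import math

def sinc(x):
    """Normalised sinc function, sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def reconstruct(samples, sample_rate_hz, t_seconds):
    """Estimate the waveform at time t by summing one sinc pulse per sample."""
    position = t_seconds * sample_rate_hz  # time measured in sample periods
    return sum(s * sinc(position - n) for n, s in enumerate(samples))

fs = 44100
tone_hz = 1000
samples = [math.sin(2 * math.pi * tone_hz * n / fs) for n in range(2000)]

t = 1000.5 / fs                              # halfway between samples 1000 and 1001
exact = math.sin(2 * math.pi * tone_hz * t)  # what the original waveform was doing
estimate = reconstruct(samples, fs, t)       # filtered (sinc-interpolated) value
staircase = samples[1000]                    # value a simple "hold the last sample" model gives

print(abs(estimate - exact), abs(staircase - exact))
```

The interpolated value should land much closer to the original waveform than the held "staircase" value does; the small remaining error comes from using only 2,000 samples rather than an infinitely long sequence.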
Bit depths and word lengths
We've discussed the question of sample rates - how many times in a second the A-D converter measures the instantaneous voltage of the analogue waveform. But there is another fundamental parameter too - the "bit depth" or "word length" (ie the number of 'binary digits' or 'bits' in each digital 'word' or sample value). This is the number of bits that are used to store the measurement of each sample - or to put it another way, the size of the "steps" between adjacent values that can be captured. On the face of it, the more bits that are used, the higher the resolution of the system, but this is not quite the case. This is because in a real digital system, the steps are deliberately smoothed out so the difference between adjacent values isn't so much a step as a probability. This is due to "dither" and we will come to that shortly.
In a properly-dithered system, the word length determines the dynamic range of the digital system (the difference in level between the loudest and softest sounds the system can handle, expressed in dB). Each bit contributes roughly 6dB: Compact Disc uses 16 bits, corresponding to around 96dB of dynamic range, while more advanced systems use 24 (around 144dB) and sometimes more. It has been estimated that we can actually hear around 20 bits of dynamic range (though that actually depends on what frequencies we are talking about), so 16 is rather too few, and 24 is too many - the quietest sound a 24-bit system can capture is well below what we can hear, below the limit of current electronic components, and a good deal quieter than anything in your studio. 24-bit operation is common - going to 32 bits and beyond is really not worth the effort.
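The dynamic range figures quoted above follow from the ratio between the largest value a word can hold and its smallest step, expressed in dB. A minimal sketch, using the simple 20 x log10(2^N) form and ignoring the small corrections that depend on signal type and dither:

```python
import math

def dynamic_range_db(bits):
    """Approximate dynamic range of an N-bit word: 20 * log10(2 ** N)."""
    return 20 * math.log10(2 ** bits)

for bits in (16, 20, 24):
    print(f"{bits} bits: {dynamic_range_db(bits):.1f} dB")
```

This gives about 96.3dB for 16 bits, 120.4dB for 20 bits and 144.5dB for 24 bits, or a little over 6dB for each additional bit.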
As signal levels going into a digital system drop, say during a fade or a reverb tail, at some point you will reach the point where the very last bit is reached - where the digital signal value goes from all zeroes and a 1 to all zeroes. This is a very small step if there is a normal number of bits in the system - say 16 or 24 - but even so, it's often clearly audible. This is called quantisation error and it manifests as a kind of distortion (quantisation distortion). As the level of the original analogue signal drifts up and down as it fades out, for example, so the digital output switches the last few bits on and off. This was a problem with early digital systems until, the story goes, it was discovered that when recording using high-gain microphone preamplifiers with a certain amount of inherent noise, the problem went away. And so, 'dither' was discovered - the addition of random noise to a digital system randomises those last few tiny transitions, smoothing them out so you don't hear them. (In fact, dither was first discovered in the world of analogue computing during the Second World War, where vibrations were found to loosen up the movements of mechanical computers.) Exactly what flavour of dither noise you should add to a system is a matter of debate, and different approaches may be suitable for different applications, but "TPDF" ("Triangular Probability Density Function") dither is the most common. This is an extensive and complex subject, and in most cases you don't have any choice in the matter - nor, arguably, do you need to.
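The effect of dither on a very low-level signal can be sketched in a few lines of Python. The example below quantises a steady level of 0.3 of one step (one least significant bit, LSB), first with no dither and then with TPDF dither made by adding two uniform random values. The level, the number of trials and the fixed random seed are assumptions chosen to make the example repeatable.

```python
import math
import random

def quantise(value_lsb, rng=None):
    """Round a value (measured in LSBs) to the nearest integer step.

    If a random generator is supplied, TPDF dither is added first: the sum of
    two uniform values, each spanning +/-0.5 LSB, has a triangular distribution
    spanning +/-1 LSB.
    """
    if rng is not None:
        value_lsb += rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)
    return math.floor(value_lsb + 0.5)

rng = random.Random(1)   # fixed seed so the run is repeatable
level = 0.3              # a signal sitting at 0.3 of one LSB
trials = 20000

undithered = [quantise(level) for _ in range(trials)]
dithered = [quantise(level, rng) for _ in range(trials)]

mean_undithered = sum(undithered) / trials
mean_dithered = sum(dithered) / trials
print(mean_undithered, mean_dithered)
```

Without dither every output is zero, so the signal has simply vanished. With dither the individual outputs jump between neighbouring steps, but their average sits close to 0.3: the low-level information survives as signal plus noise rather than being lost or turned into distortion.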
When a system is properly dithered, the word length defines the dynamic range of the system, not its resolution (ie the size of the steps). Whenever a digital operation requantises the signal, for example a gain change or a reduction in word length, dither must be applied to avoid quantisation distortion.
Clocking and jitter
Fundamental to the operation of a digital sampling system is the requirement for an accurate clock that ensures that each sample is measured at exactly the right time, and that samples are perfectly evenly spaced. If this is not the case, when the samples are clocked out at the other end with another clock, the values will not come out correctly to recreate the original waveform.
Digital clocks are generally based on highly accurate crystal oscillators - sometimes contained in their own constant-temperature environment. In recent years, some manufacturers have investigated using atomic clock technology - used to create clocks that will be accurate over millions of years - but this is in fact overkill: a digital sampling clock doesn't need million-year accuracy, it needs short-term accuracy during the course of a recording. Crystals are far better at this.
Unwanted short-term variations in clocking accuracy are referred to as "jitter" and timing inaccuracies of this kind can audibly detract from sound quality as the timing accuracy of the capture of musical events is compromised.
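To get a feel for the scale involved, a common textbook approximation (not taken from this article) relates random clock jitter to the best signal-to-noise ratio a converter can achieve on a full-scale sine wave: SNR = -20 x log10(2 x pi x f x tj), where f is the signal frequency and tj is the RMS jitter in seconds. The sketch below evaluates it for a 20kHz tone; the function name and the jitter values are illustrative assumptions.

```python
import math

def jitter_limited_snr_db(signal_hz, rms_jitter_seconds):
    """Approximate SNR ceiling set by random clock jitter for a full-scale sine."""
    return -20 * math.log10(2 * math.pi * signal_hz * rms_jitter_seconds)

for jitter in (1e-9, 100e-12, 10e-12):
    snr = jitter_limited_snr_db(20000, jitter)
    print(f"{jitter * 1e12:.0f} ps RMS jitter at 20kHz: about {snr:.0f} dB")
```

On this approximation, 1ns of RMS jitter limits a 20kHz tone to about 78dB, well short of the roughly 96dB a 16-bit system offers, while 100ps gives about 98dB. The error grows with signal frequency, so high-frequency content is affected most.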
There has been a lot of controversy regarding the best way to handle clocking in a digital audio environment. Some - generally mastering engineers - hold the view that a converter needs a rock-solid internal clock and internal clocking is the way to go. Others - generally studio engineers - maintain that all the digital systems should be clocked by a rock-solid external clock to which all their devices are synchronised. Some mastering engineers have found that when they clock their converters externally, the sound doesn't improve: either there is no change in performance or it actually gets worse. Conversely, some studio engineers have found that external clocking audibly improves the sound. Which is right? Probably both. A converter with a good internal clock will not tend to change for the better when clocked externally by an accurate clocking system. A converter with a poor internal clock will probably sound better when clocked externally. Mastering facilities often have quite exotic audiophile-quality converters that pay special attention to their internal clocks, and this might explain what's going on.
Most conversion systems offer both internal and external (DARS or Word Clock) clocking: in theory, a good converter should not change its sound when clocked externally, and you should be able to choose either internal or external clocking based on how your studio needs to work, getting optimum performance in either case. For more on the internal/external clocking controversy, see this Pink Paper.
The discussion of digital audio in this article is largely based on traditional digital audio theory and practice. But the Nyquist-Shannon theorem is over 65 years old now, and in the last decade or so, new sampling technologies have been developed and more research has been done both into how we hear and how we perceive sound. These advances may signal the ultimate development of significantly higher-quality digital audio systems and converters, but it will take time for these new technologies to be available to the professional audio community.