Information Theory and Channel Capacity

Information theory establishes the mathematical foundation on which all of digital communication rests. Formulated by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication," the discipline answers two fundamental questions: how compactly can a source of information be represented, and how fast can information be transmitted reliably over a noisy channel. The answers take the form of precise limits, expressed in bits, that no coding scheme can exceed yet that well-designed schemes can approach arbitrarily closely. These limits frame every practical decision a communication engineer makes, from the choice of modulation and coding to the allocation of bandwidth and transmit power.

The central quantity throughout information theory is the bit, a measure of information rather than merely a binary digit of storage. One bit corresponds to the resolution of a single equally likely binary choice. From this simple notion grow the related measures of entropy, mutual information, and channel capacity, each capturing a different aspect of how much information a random quantity carries or how much of it can survive transmission. The sections below develop these ideas, derive the celebrated Shannon-Hartley capacity formula, and apply the results to the additive white Gaussian noise channel and to the fading channels that dominate wireless communication.

A recurring theme is the separation between what is possible and how to achieve it. Shannon's theorems are existence results: they prove that codes attaining a given performance exist without exhibiting them. Decades of work in coding theory, treated elsewhere in this guide, have since produced explicit codes that operate within a fraction of a decibel of these theoretical bounds. Understanding the bounds themselves is therefore the necessary first step toward understanding why modern systems are designed as they are.

Entropy and Information Measures

Entropy quantifies the average uncertainty, or equivalently the average information content, of a random variable. For a discrete random variable X taking values from an alphabet with probabilities p(x), the entropy is defined as H(X) = -∑ p(x) log₂ p(x), measured in bits when the logarithm is taken to base two. Entropy reaches its maximum, equal to the logarithm of the alphabet size, when all outcomes are equally likely, and it falls to zero when one outcome is certain. A fair coin therefore carries exactly one bit of entropy per flip, while a heavily biased coin carries less.
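The definition can be checked numerically with a few lines of Python. The following is a minimal sketch; the function name entropy_bits and the example distributions are illustrative choices, not part of the original text.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 are skipped, following the convention 0 * log(0) = 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])      # one bit per flip
biased_coin = entropy_bits([0.9, 0.1])    # less than one bit per flip
uniform_8 = entropy_bits([1 / 8] * 8)     # log2(8) = 3 bits, the maximum for 8 outcomes
certain = entropy_bits([1.0, 0.0])        # zero: the outcome is known in advance

print(fair_coin, biased_coin, uniform_8, certain)
```

The four cases correspond to the statements above: the uniform distribution attains the maximum, the biased coin falls below one bit, and a certain outcome carries no information.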

The definition follows from a small set of intuitive requirements: information should be continuous in the probabilities, should increase as the number of equally likely outcomes grows, and should be additive for independent events. Shannon proved that the logarithmic form is the unique measure satisfying these conditions up to the choice of logarithm base. The use of base two ties the measure directly to binary representation, so that an entropy of H bits indicates that the source can in principle be represented using H binary digits per symbol on average.

Joint and Conditional Entropy

When two random variables are considered together, the joint entropy H(X, Y) measures their combined uncertainty, while the conditional entropy H(Y | X) measures the uncertainty remaining in Y once X is known. These quantities satisfy the chain rule H(X, Y) = H(X) + H(Y | X), which states that the total uncertainty equals the uncertainty in the first variable plus the residual uncertainty in the second given the first. Conditional entropy never exceeds unconditional entropy, formalizing the principle that side information can only reduce, never increase, average uncertainty.
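The chain rule and the effect of conditioning can be verified on a small example. The sketch below assumes a hypothetical joint distribution over two binary variables; the numbers are chosen only for illustration.

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) with X and Y both binary.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal distributions obtained by summing the joint over the other variable.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

h_xy = H(joint.values())
h_x = H(p_x.values())
h_y = H(p_y.values())

# H(Y | X) = sum over x of p(x) * H(Y | X = x), computed directly from the definition.
h_y_given_x = sum(
    px * H([joint[(x, y)] / px for y in p_y])
    for x, px in p_x.items()
)

print(h_xy, h_x + h_y_given_x)   # the two sides of the chain rule
print(h_y_given_x, h_y)          # conditioning does not increase entropy
```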

Conditional entropy plays a central role in the analysis of noisy channels. If X denotes a transmitted symbol and Y the corresponding received symbol, then H(X | Y) measures the uncertainty about what was sent given what was received. This residual uncertainty, sometimes called the equivocation, represents the information lost to noise and forms the basis for defining how much information the channel conveys.

Differential Entropy

For continuous random variables, the summation in the entropy definition becomes an integral, yielding the differential entropy h(X) = -∫ f(x) log₂ f(x) dx, where f is the probability density function. Differential entropy differs from discrete entropy in important respects: it can be negative, and it depends on the choice of coordinate scale. Despite these subtleties, differences of differential entropies, such as those appearing in mutual information, remain well defined and physically meaningful.

A key result concerns the Gaussian distribution. Among all continuous distributions with a given variance, the Gaussian maximizes differential entropy, attaining the value h = (1/2) log₂(2πeσ²) bits. This maximum-entropy property explains why Gaussian noise represents the worst case for many communication problems and why the capacity of the Gaussian channel serves as a benchmark against which other channels are measured.
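A quick comparison illustrates the maximum-entropy property. The sketch below uses the standard closed-form differential entropies of the Gaussian, uniform, and Laplace densities, all scaled to the same standard deviation; the uniform and Laplace cases are added here as comparison points and are not discussed in the text above.

```python
import math

def gaussian_entropy_bits(sigma):
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

def uniform_entropy_bits(sigma):
    width = sigma * math.sqrt(12)   # a uniform density of this width has variance sigma^2
    return math.log2(width)

def laplace_entropy_bits(sigma):
    scale = sigma / math.sqrt(2)    # Laplace scale b with variance 2 * b^2 = sigma^2
    return math.log2(2 * math.e * scale)

for sigma in (1.0, 0.1):
    print(sigma,
          gaussian_entropy_bits(sigma),
          laplace_entropy_bits(sigma),
          uniform_entropy_bits(sigma))
```

At equal variance the Gaussian value is the largest of the three. The run with a standard deviation of 0.1 also shows that differential entropy can be negative and shifts when the scale changes.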

Mutual Information

Mutual information measures the amount of information that one random variable conveys about another. For variables X and Y it is defined as I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X), the reduction in uncertainty about one variable achieved by observing the other. Mutual information is symmetric, is never negative, and equals zero precisely when the two variables are statistically independent. When Y is a noisy observation of X, mutual information quantifies how much the observation reveals about the original.

An equivalent expression writes mutual information as the relative entropy, or Kullback-Leibler divergence, between the joint distribution of X and Y and the product of their marginals. This formulation emphasizes that mutual information measures how far the two variables are from independence. Relative entropy itself, defined as D(p || q) = ∑ p(x) log₂[p(x)/q(x)], measures the inefficiency incurred when a code optimized for distribution q is used for data actually drawn from distribution p, and it underlies many results in both information theory and statistics.
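The divergence form translates directly into code. The following sketch computes I(X; Y) as the relative entropy between a joint distribution and the product of its marginals; the function name and the two test distributions are illustrative assumptions.

```python
import math

def mutual_information_bits(joint):
    """I(X;Y) = D( p(x,y) || p(x) p(y) ) for a joint pmf given as {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items()
        if p > 0
    )

# Independent variables: the joint equals the product of marginals, so I(X;Y) is zero.
independent = {
    (x, y): px * py
    for x, px in enumerate([0.3, 0.7])
    for y, py in enumerate([0.6, 0.4])
}

# Y is an exact copy of a fair bit X: observing Y reveals the full bit.
identical = {(0, 0): 0.5, (1, 1): 0.5}

mi_independent = mutual_information_bits(independent)
mi_identical = mutual_information_bits(identical)
print(mi_independent, mi_identical)
```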

Discrete Memoryless Channels

A discrete memoryless channel is specified by a set of transition probabilities p(y | x) giving the chance that input symbol x produces output symbol y, with each use of the channel independent of all others. Two canonical examples illustrate the model. The binary symmetric channel flips each transmitted bit independently with crossover probability p, so its transition matrix is fully described by that single parameter. The binary erasure channel instead replaces each bit with an erasure symbol with probability p, leaving the receiver uncertain about the value but aware that an erasure occurred.

For a given channel, the mutual information between input and output depends on the input distribution chosen by the transmitter. Maximizing this mutual information over all possible input distributions yields the channel capacity, the largest rate at which information can be conveyed reliably. For the binary symmetric channel the capacity equals 1 - H(p) bits per use, where H(p) = -p log₂ p - (1 - p) log₂(1 - p) is the binary entropy function, and for the binary erasure channel it equals 1 - p bits per use. Both expressions fall smoothly to zero as the channel becomes maximally noisy, which occurs at p = 1/2 for the binary symmetric channel and at p = 1 for the binary erasure channel. A crossover probability above 1/2 is no worse than its complement, because the receiver can simply invert every received bit.
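Both capacity expressions are simple enough to evaluate directly. The sketch below is illustrative; the function names are not from the original text.

```python
import math

def binary_entropy(p):
    """Binary entropy function H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of the binary symmetric channel with crossover probability p, in bits per use."""
    return 1.0 - binary_entropy(p)

def bec_capacity(p):
    """Capacity of the binary erasure channel with erasure probability p, in bits per use."""
    return 1.0 - p

for p in (0.0, 0.11, 0.5, 1.0):
    print(p, bsc_capacity(p), bec_capacity(p))
```

The binary symmetric channel reaches zero capacity at a crossover probability of 1/2, where the output is independent of the input, while the erasure channel reaches zero only when every bit is erased.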

Channel Capacity

Channel capacity is the highest rate (formally, the supremum of achievable rates), measured in bits per channel use or bits per second, at which information can be transmitted with arbitrarily small probability of error. Shannon defined it as the maximum of the mutual information between channel input and output, taken over all admissible input distributions: C = max I(X; Y). This definition transforms a vague engineering aspiration into a precise number determined entirely by the statistical description of the channel. Capacity depends only on the channel itself, not on any particular coding or modulation scheme.
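The maximization in C = max I(X; Y) can be made concrete for a channel with two inputs, where the input distribution has a single free parameter. The sketch below uses a brute-force grid search, which is an illustrative shortcut rather than an efficient solver. The Z-channel used as a second example, in which one input is always received correctly and the other is flipped half the time, is an assumption added here to show a case where the best input distribution is not uniform.

```python
import math

def mutual_information(p_in, channel):
    """I(X;Y) in bits for input pmf p_in and transition matrix channel[x][y] = p(y | x)."""
    n_out = len(channel[0])
    p_out = [sum(p_in[x] * channel[x][y] for x in range(len(p_in))) for y in range(n_out)]
    total = 0.0
    for x, px in enumerate(p_in):
        for y in range(n_out):
            pyx = channel[x][y]
            if px > 0 and pyx > 0:
                total += px * pyx * math.log2(pyx / p_out[y])
    return total

def binary_input_capacity(channel, steps=10000):
    """Search q = P(X = 0) on a uniform grid and return (best rate, best q)."""
    best_rate, best_q = 0.0, 0.0
    for i in range(steps + 1):
        q = i / steps
        rate = mutual_information([q, 1 - q], channel)
        if rate > best_rate:
            best_rate, best_q = rate, q
    return best_rate, best_q

bsc = [[0.9, 0.1], [0.1, 0.9]]          # binary symmetric channel, crossover 0.1
z_channel = [[1.0, 0.0], [0.5, 0.5]]    # input 0 is noiseless, input 1 flips half the time

rate_bsc, q_bsc = binary_input_capacity(bsc)
rate_z, q_z = binary_input_capacity(z_channel)
print(rate_bsc, q_bsc)
print(rate_z, q_z)
```

For the symmetric channel the search settles on the uniform input, while for the Z-channel it favors the noiseless input, showing that the capacity-achieving distribution is a property of the channel.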

The operational meaning of capacity is profound. It does not merely indicate a rate that is comfortable or convenient; it marks a sharp boundary. At any rate below capacity, reliable communication is achievable, while at any rate above capacity, the error probability cannot be driven to zero no matter how cleverly the system is designed. The proof that this boundary is meaningful and attainable constitutes Shannon's noisy channel coding theorem, the cornerstone of the entire field.

The Channel Coding Theorem

The noisy channel coding theorem states that for any rate R less than the capacity C, there exist codes of increasing block length whose probability of decoding error approaches zero. Conversely, for any rate greater than C, the error probability is bounded away from zero regardless of the code. The forward part is proved through a random coding argument: Shannon showed that the average error probability over an ensemble of randomly chosen codes vanishes, which guarantees that at least one good code exists even though the argument identifies none explicitly.

The theorem overturned the prevailing intuition that reliable communication over a noisy channel inevitably demanded vanishing data rates, with reliability bought only by endless repetition. Shannon demonstrated instead that a fixed positive rate, namely any rate below capacity, can be sustained with arbitrarily high reliability by coding over sufficiently long blocks. The cost is paid in latency and decoder complexity rather than in throughput, a trade-off that has shaped communication engineering ever since.
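The older intuition is easy to reproduce. The sketch below assumes a binary symmetric channel with crossover probability 0.1 and an n-fold repetition code decoded by majority vote, both chosen here purely for illustration. Reliability improves only as the rate 1/n shrinks toward zero, whereas the capacity of this channel is about 0.53 bit per use.

```python
import math

def repetition_error_probability(n, p):
    """Bit error probability of an n-fold repetition code (n odd), majority vote, on a BSC."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that the majority vote cannot tie")
    # A decoding error occurs when more than half of the n copies are flipped.
    return sum(
        math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

p = 0.1
table = [(n, 1 / n, repetition_error_probability(n, p)) for n in (1, 3, 5, 7, 9)]
for n, rate, error in table:
    print(n, rate, error)
```

Each row trades rate for reliability. The coding theorem says that better codes avoid this trade entirely for any rate below capacity, at the cost of longer blocks.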

The Source Coding Theorem

The companion source coding theorem governs data compression rather than transmission. It states that a discrete source with entropy H can be encoded without loss at an average rate arbitrarily close to, but never less than, H bits per symbol. Entropy therefore sets the ultimate limit of lossless compression: no scheme can represent the source in fewer bits per symbol on average, while practical entropy coders such as Huffman and arithmetic coding approach the limit closely. The theorem explains why already compressed data resists further compression, having been reduced near its entropy.
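The relationship between entropy and achievable code length can be seen with a small Huffman coder. The following is a minimal sketch that computes only codeword lengths, not the codewords themselves; the two example distributions are hypothetical.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the codeword length of each symbol under a binary Huffman code."""
    # Heap entries: (probability, unique tie-breaker, symbol indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1   # every symbol in the merged subtree moves one level deeper
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def average_length(probs):
    return sum(p * l for p, l in zip(probs, huffman_code_lengths(probs)))

dyadic = [0.5, 0.25, 0.125, 0.125]   # powers of 1/2: Huffman meets the entropy exactly
skewed = [0.4, 0.3, 0.2, 0.1]        # general case: average length slightly above entropy

print(entropy_bits(dyadic), average_length(dyadic))
print(entropy_bits(skewed), average_length(skewed))
```

Coding symbols one at a time leaves a gap of less than one bit per symbol. Coding longer blocks of symbols narrows that gap, which is how the limit is approached arbitrarily closely.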

Taken together, the source and channel coding theorems justify the separation of compression and error protection into independent stages, a principle known as the separation theorem. For a single point-to-point link with a stationary source and channel, one may first compress the source to its entropy and then protect the result with a channel code, losing nothing relative to a jointly designed scheme. This separation underlies the layered architecture of communication systems, although it can break down in networks, under delay constraints, or for non-stationary sources, motivating the study of joint source-channel coding.

The Shannon-Hartley Theorem

The Shannon-Hartley theorem specializes the capacity result to the most important continuous channel, the band-limited channel corrupted by additive white Gaussian noise. It states that the capacity in bits per second is C = B log₂(1 + S/N), where B is the channel bandwidth in hertz, S is the average received signal power, and N is the average noise power within that bandwidth. The ratio S/N is the signal-to-noise ratio expressed as a linear power ratio rather than in decibels. This compact formula ranks among the most consequential results in all of engineering.
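Because signal-to-noise ratio is usually quoted in decibels, the conversion to a linear ratio is the step most often overlooked when applying the formula. The sketch below makes it explicit; the function name and the example link parameters are illustrative.

```python
import math

def shannon_capacity_bps(bandwidth_hz, snr_db):
    """Shannon-Hartley capacity in bits per second for a given bandwidth and SNR in dB."""
    snr_linear = 10 ** (snr_db / 10)   # convert decibels to a linear power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)

# Example: a 1 MHz channel at 0 dB and at 20 dB signal-to-noise ratio.
capacity_0db = shannon_capacity_bps(1e6, 0.0)
capacity_20db = shannon_capacity_bps(1e6, 20.0)
print(capacity_0db, capacity_20db)
```

At 0 dB the signal and noise powers are equal and the capacity is exactly one bit per second per hertz. A hundredfold increase in signal-to-noise ratio raises it to only about 6.7.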

The theorem draws its name from two contributions. Ralph Hartley had argued in 1928 that the achievable information rate scales with bandwidth and with the logarithm of the number of distinguishable signal levels, capturing the qualitative structure of the result. Shannon supplied the rigorous noise model and the proof that the bound is both necessary and sufficient, replacing Hartley's heuristic count of levels with a precise function of the signal-to-noise ratio. The combined statement gives the exact capacity of the Gaussian channel.

Interpreting the Formula

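The formula exposes the two levers available to a system designer. Capacity grows linearly with bandwidth at a fixed signal-to-noise ratio, but only logarithmically with signal-to-noise ratio at a fixed bandwidth. Doubling the bandwidth while holding the signal-to-noise ratio constant therefore doubles capacity, whereas doubling the signal power at high signal-to-noise ratio adds only a fixed increment of about one bit per second per hertz. If the signal power and noise density are fixed instead, widening the band also admits more noise, so the gain from extra bandwidth is smaller. Because each additional unit of bandwidth contributes a full proportional share of capacity while each additional decibel of power contributes a diminishing share, bandwidth is the more powerful resource when it is available, and raising the signal-to-noise ratio becomes the only remaining lever when bandwidth is scarce.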

The logarithmic dependence on signal-to-noise ratio also implies that pushing for higher spectral efficiency becomes progressively more expensive. At high signal-to-noise ratio, raising the capacity per unit bandwidth, the spectral efficiency in bits per second per hertz, by one bit requires roughly doubling the signal-to-noise ratio, an increase of about three decibels. Systems that aim for very high spectral efficiency, such as cable and microwave backhaul links employing dense quadrature amplitude modulation, must therefore maintain correspondingly large signal-to-noise ratios.
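Inverting the capacity formula gives the minimum signal-to-noise ratio for a target spectral efficiency, SNR = 2^η - 1, where η denotes bits per second per hertz. The sketch below tabulates it; the symbol η and the function name are introduced here for illustration.

```python
import math

def snr_db_for_efficiency(efficiency):
    """Minimum SNR in dB at which capacity per hertz reaches `efficiency` bit/s/Hz."""
    return 10 * math.log10(2 ** efficiency - 1)

previous = None
for efficiency in range(1, 13):
    snr_db = snr_db_for_efficiency(efficiency)
    step = None if previous is None else snr_db - previous
    print(efficiency, round(snr_db, 2), step)
    previous = snr_db
```

The step between successive rows starts near 4.8 dB and settles toward about 3 dB, the doubling of signal-to-noise ratio per additional bit described above.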

The Bandwidth-Limited and Power-Limited Regimes

Two limiting regimes clarify the trade-offs. In the bandwidth-limited regime, signal-to-noise ratio is high and bandwidth is the binding constraint; here capacity grows only logarithmically with signal-to-noise ratio, and designers invest in high-order modulation to extract many bits per channel use. In the power-limited regime, signal-to-noise ratio is low and power is scarce; here capacity becomes nearly linear in signal-to-noise ratio, and the priority shifts to energy efficiency and robust low-rate coding rather than dense constellations.
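The two regimes correspond to two standard approximations of log₂(1 + SNR): roughly SNR / ln 2 when the ratio is small, and roughly log₂(SNR) when it is large. The sketch below compares them with the exact value; the function names are illustrative.

```python
import math

def exact_efficiency(snr):
    """Capacity per hertz, log2(1 + SNR), for a linear SNR."""
    return math.log2(1 + snr)

def low_snr_approx(snr):
    """Power-limited approximation: capacity is nearly linear in SNR."""
    return snr / math.log(2)

def high_snr_approx(snr):
    """Bandwidth-limited approximation: capacity is nearly logarithmic in SNR."""
    return math.log2(snr)

for snr in (0.01, 0.1, 1.0, 10.0, 1000.0):
    print(snr, exact_efficiency(snr), low_snr_approx(snr), high_snr_approx(snr))
```

Neither approximation is accurate near an SNR of one, which is the transition region between the two regimes.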

Deep-space links and many sensor networks operate in the power-limited regime, where every decibel of link margin is precious and spreading the signal over a wide band costs little. Terrestrial cellular and fixed wireless links typically operate closer to the bandwidth-limited regime, where licensed spectrum is expensive and the emphasis falls on packing the maximum number of bits into each hertz. Recognizing which regime applies guides the entire design of a communication link.

Capacity of the AWGN Channel

The additive white Gaussian noise channel models the received signal as the transmitted signal plus an independent Gaussian noise term whose power spectral density is flat across the band of interest. This model accurately describes thermal noise in receiver electronics and serves as the standard reference channel for evaluating modulation and coding schemes. Its capacity is exactly the Shannon-Hartley value, and the input distribution that achieves capacity is Gaussian, consistent with the maximum-entropy property of the Gaussian density.

A particularly illuminating form expresses capacity in terms of the energy per bit relative to the noise power spectral density, written Eb/N₀. Combining the Shannon-Hartley formula with the relationships among power, rate, and bandwidth yields a fundamental lower bound on this ratio. As the spectral efficiency approaches zero, meaning the system spreads each bit over unlimited bandwidth, the minimum Eb/N₀ converges to ln 2 ≈ 0.693, or approximately -1.59 decibels. No communication system, however sophisticated, can achieve reliable transmission below this Shannon limit, which therefore defines the ultimate energy efficiency of communication.
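The bound follows in a few steps from the Shannon-Hartley formula. The derivation below is a sketch under the usual assumptions: R denotes the bit rate, η = R/B the spectral efficiency, the signal power is the energy per bit times the bit rate, and the noise power is the noise density times the bandwidth. The symbols R and η are introduced here for convenience.

```latex
\begin{aligned}
S &= E_b R, \qquad N = N_0 B, \qquad \eta = \frac{R}{B},\\[4pt]
R \le C = B \log_2\!\left(1 + \frac{E_b R}{N_0 B}\right)
  &\;\Longrightarrow\; \eta \le \log_2\!\left(1 + \eta\,\frac{E_b}{N_0}\right),\\[4pt]
\frac{E_b}{N_0} &\ge \frac{2^{\eta} - 1}{\eta},\\[4pt]
\lim_{\eta \to 0} \frac{2^{\eta} - 1}{\eta} &= \ln 2 \approx 0.693
  \quad\left(10 \log_{10} \ln 2 \approx -1.59\ \text{dB}\right).
\end{aligned}
```

The right-hand side of the third line increases with η, so the smallest required energy per bit is reached only in the limit of vanishing spectral efficiency.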

Capacity with Discrete Inputs

Practical systems do not transmit Gaussian-distributed signals; they select symbols from finite constellations such as phase-shift keying or quadrature amplitude modulation. Constraining the input to a discrete constellation reduces the achievable mutual information below the Gaussian capacity, and the resulting constellation-constrained capacity saturates at the logarithm of the constellation size once the signal-to-noise ratio grows large. Even before saturation, equally likely constellation points fall short of the unconstrained Shannon limit. This gap is known as the shaping loss, and for large square constellations at high signal-to-noise ratio it approaches about 1.53 decibels, a power ratio of πe/6.
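The 1.53 decibel figure corresponds to the power ratio πe/6, the asymptotic penalty of a uniform distribution over a square region relative to a Gaussian of the same entropy. The short check below simply evaluates that constant; it is a numerical confirmation, not a derivation.

```python
import math

# Ratio of the power needed by a uniform distribution to that of a Gaussian
# carrying the same differential entropy, in the high-resolution limit.
shaping_ratio = math.pi * math.e / 6
shaping_loss_db = 10 * math.log10(shaping_ratio)

print(shaping_ratio, shaping_loss_db)
```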

Constellation shaping techniques recover much of this loss by transmitting outer constellation points less frequently than inner ones, approximating the Gaussian input distribution. Probabilistic amplitude shaping, adopted in recent high-throughput optical and wireless standards, implements this idea efficiently and allows finite constellations to operate close to the true Gaussian capacity. The technique illustrates how careful attention to the input distribution, and not only to the code, contributes to approaching fundamental limits.

Capacity of Fading Channels

Wireless channels rarely resemble the steady additive Gaussian model, because multipath propagation causes the received signal amplitude to fluctuate, a phenomenon called fading. The channel gain becomes a random process, and capacity must be reconsidered in light of this randomness. Two notions of capacity arise, depending on how rapidly the channel varies relative to the duration of a codeword and on whether the transmitter knows the instantaneous channel state.

When the channel changes quickly enough that a single codeword experiences many independent fading states, the relevant measure is the ergodic capacity, obtained by averaging the instantaneous Gaussian capacity over the distribution of the channel gain. A coded transmission spanning many fades effectively sees the average channel, and reliable communication at the ergodic rate becomes possible. When instead the channel remains fixed for the entire codeword, as in a slowly moving terminal, a single deep fade can render any fixed rate undeliverable, and the appropriate measure is the outage capacity.
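Ergodic capacity can be estimated by averaging the instantaneous capacity over many channel realizations. The sketch below assumes Rayleigh fading, so the power gain is exponentially distributed with unit mean, and assumes that only the receiver knows the channel. Both are modeling assumptions added here, and the Monte Carlo approach is an illustrative choice.

```python
import math
import random

def ergodic_capacity_rayleigh(snr_linear, trials=100_000, seed=0):
    """Monte Carlo estimate of E[log2(1 + SNR * |h|^2)] in bit/s/Hz under Rayleigh fading."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        gain = rng.expovariate(1.0)   # |h|^2 is exponential with mean 1
        total += math.log2(1 + snr_linear * gain)
    return total / trials

snr = 10.0   # average SNR of 10 dB, expressed as a linear ratio
ergodic = ergodic_capacity_rayleigh(snr)
awgn = math.log2(1 + snr)
print(ergodic, awgn)
```

The fading average comes out below the capacity of a non-fading channel at the same average signal-to-noise ratio, as expected from the concavity of the logarithm.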

Outage Capacity

Outage capacity addresses the slow-fading case, where the channel gain is random but constant over a codeword. Because there is a nonzero probability that the gain is too small to support a chosen rate, one cannot guarantee reliable communication at any positive rate with certainty. Instead, designers specify an acceptable outage probability and define the outage capacity as the largest rate that can be supported while the probability of the channel falling below the required level stays within that target.
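For a specific fading model the outage capacity has a closed form. The sketch below assumes Rayleigh fading with unit mean power gain, an assumption added here for illustration. An outage occurs when log₂(1 + SNR·g) falls below the chosen rate R, which happens with probability 1 - exp(-(2^R - 1)/SNR); solving for the rate at a target outage probability ε gives the second function.

```python
import math

def outage_probability(rate, snr_linear):
    """P(log2(1 + SNR * g) < rate) when g is exponential with mean 1 (Rayleigh fading)."""
    return 1 - math.exp(-(2 ** rate - 1) / snr_linear)

def outage_capacity(epsilon, snr_linear):
    """Largest rate in bit/s/Hz whose outage probability does not exceed epsilon."""
    return math.log2(1 - snr_linear * math.log1p(-epsilon))

snr = 100.0                                  # 20 dB average SNR
rate_1pct = outage_capacity(0.01, snr)       # rate sustainable with 1 % outage
rate_10pct = outage_capacity(0.10, snr)      # rate sustainable with 10 % outage
awgn = math.log2(1 + snr)
print(rate_1pct, rate_10pct, awgn)
```

At a 1 % outage target the supportable rate is roughly one bit per second per hertz, far below the non-fading capacity at the same average signal-to-noise ratio, which shows how costly deep fades are without diversity.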

This formulation reflects engineering reality in cellular and broadcast systems, which tolerate occasional brief interruptions in exchange for a useful sustained rate. The outage framework converts the fundamental randomness of the wireless channel into a clear design parameter, the tolerable outage probability, that can be traded against transmit power, coverage area, and data rate. Diversity techniques, which provide the receiver with several independently faded copies of the signal, sharply reduce outage probability by making deep simultaneous fades unlikely.

Multiple-Antenna Channels

Equipping both transmitter and receiver with multiple antennas creates a multiple-input multiple-output channel whose capacity can far exceed that of a single-antenna link. Under favorable scattering conditions, the channel decomposes into several parallel spatial streams, and capacity grows approximately in proportion to the smaller of the numbers of transmit and receive antennas. This multiplexing gain, identified in the late 1990s, transformed wireless system design and underpins the high data rates of modern cellular and local-area networks.

Realizing the multiplexing gain requires a rich multipath environment that renders the spatial streams sufficiently independent, together with accurate channel knowledge to separate them at the receiver. When the channel is known at the transmitter, power can be allocated across the spatial streams to maximize the total rate. The same multiple-antenna structure can alternatively be devoted to diversity, trading some multiplexing gain for greater robustness against fading, and practical systems balance these two uses according to the operating conditions.
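When the transmitter knows the channel, the standard way to allocate power across parallel streams is water-filling: each stream receives power equal to a common water level minus the inverse of its gain, and streams too weak to reach the water level receive none. The sketch below assumes the channel has already been decomposed into parallel streams, with each entry of gains representing a stream's power gain divided by the noise power. The numeric values are hypothetical.

```python
import math

def water_filling(gains, total_power):
    """Powers maximizing sum(log2(1 + g_i * p_i)) subject to sum(p_i) = total_power.

    `gains` are positive gain-to-noise ratios of the parallel streams.
    """
    order = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    powers = [0.0] * len(gains)
    # Start with all streams active and drop the weakest until every active power is positive.
    for k in range(len(order), 0, -1):
        active = order[:k]
        water_level = (total_power + sum(1.0 / gains[i] for i in active)) / k
        if water_level - 1.0 / gains[active[-1]] > 0:
            for i in active:
                powers[i] = water_level - 1.0 / gains[i]
            break
    return powers

def sum_rate(gains, powers):
    return sum(math.log2(1 + g * p) for g, p in zip(gains, powers))

gains = [4.0, 1.0, 0.1]           # hypothetical stream gains relative to noise
powers = water_filling(gains, total_power=1.0)
rate_wf = sum_rate(gains, powers)
rate_equal = sum_rate(gains, [1.0 / 3] * 3)
print(powers, rate_wf, rate_equal)
```

In this example the weakest stream is switched off and most of the power goes to the strongest one, giving a higher total rate than an equal split.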

Rate-Distortion Theory

Rate-distortion theory extends information theory to lossy compression, where some controlled degradation of the reconstructed data is acceptable in exchange for a lower bit rate. It answers the question of how few bits per symbol suffice to represent a source within a specified average distortion. The rate-distortion function R(D) gives the minimum rate required to reconstruct the source with average distortion no greater than D, and it is obtained by minimizing the mutual information between source and reconstruction subject to the distortion constraint.

The rate-distortion function decreases monotonically as the permitted distortion grows, reaching zero at the distortion attainable by ignoring the source entirely. At zero distortion it reduces, for a discrete source, to the source entropy, recovering the lossless result as a special case. For a Gaussian source measured under mean-squared-error distortion, the function takes the explicit form R(D) = (1/2) log₂(σ²/D) for distortions below the source variance, showing that each additional bit of rate reduces the reconstruction distortion by a factor of four, a six-decibel improvement in signal-to-distortion ratio per bit.
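The Gaussian rate-distortion function and its inverse are easy to tabulate. The sketch below is illustrative; the function names are not from the original text.

```python
import math

def gaussian_rate(variance, distortion):
    """R(D) in bits per sample for a Gaussian source under mean-squared error."""
    if distortion >= variance:
        return 0.0   # sending nothing already achieves distortion equal to the variance
    return 0.5 * math.log2(variance / distortion)

def gaussian_distortion(variance, rate_bits):
    """Inverse relation D(R) = variance * 2^(-2R)."""
    return variance * 2 ** (-2 * rate_bits)

variance = 1.0
for rate in range(0, 5):
    d = gaussian_distortion(variance, rate)
    sdr_db = 10 * math.log10(variance / d)   # signal-to-distortion ratio in dB
    print(rate, d, round(sdr_db, 2))
```

Each row divides the distortion by four and adds about 6.02 dB of signal-to-distortion ratio, matching the six-decibel-per-bit rule.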

Applications to Source Compression

Rate-distortion theory provides the conceptual foundation for the lossy compression of audio, images, and video. Practical codecs cannot in general reach the rate-distortion bound, but the theory establishes the target and clarifies the trade-off between bit rate and reconstruction quality. Transform coding, the dominant practical approach, decorrelates the source with a transformation such as the discrete cosine transform and then allocates bits across the resulting coefficients in a manner that approximates the optimal water-filling solution suggested by the theory.

The theory also explains why perceptual coding succeeds. By measuring distortion with a metric that reflects human perception rather than raw signal error, codecs concentrate bits where they matter most to the observer and discard information the senses cannot perceive. This perceptual weighting effectively redefines the distortion measure in the rate-distortion problem, allowing dramatic rate reductions at a given perceived quality, as exemplified by the audio and video formats in everyday use.

SNR and Bandwidth Trade-offs

The Shannon-Hartley theorem makes the exchange between signal-to-noise ratio and bandwidth quantitative, and this exchange governs the architecture of every communication system. Because capacity rises only logarithmically with signal-to-noise ratio but linearly with bandwidth, a designer can hold capacity constant while trading one resource for the other. Spreading a signal over a wider band permits operation at a lower signal-to-noise ratio for the same rate, the principle exploited by spread-spectrum and ultra-wideband systems to achieve robustness and low spectral power density.
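Holding the rate fixed and solving the capacity formula for signal-to-noise ratio gives SNR = 2^(R/B) - 1, which shows the exchange directly. The sketch below evaluates it for a hypothetical target rate of 1 Mbit/s over several bandwidths.

```python
import math

def snr_db_for_rate(rate_bps, bandwidth_hz):
    """Minimum SNR in dB at which the Shannon-Hartley capacity equals rate_bps."""
    return 10 * math.log10(2 ** (rate_bps / bandwidth_hz) - 1)

rate = 1e6   # hypothetical target rate of 1 Mbit/s
for bandwidth in (0.25e6, 0.5e6, 1e6, 10e6, 100e6):
    print(bandwidth, round(snr_db_for_rate(rate, bandwidth), 2))
```

Narrow bands demand a large positive signal-to-noise ratio, while very wide bands allow operation well below 0 dB, where the signal sits beneath the noise floor as in spread-spectrum systems.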

Conversely, when bandwidth is constrained by regulation or hardware, the only route to higher capacity is higher signal-to-noise ratio, achieved through greater transmit power, larger antennas, or shorter links. The diminishing return of the logarithm means that each successive increment of rate demands a disproportionately larger investment in signal-to-noise ratio, which sets a practical ceiling on the spectral efficiency that a real system pursues. The choice of modulation order is in essence a decision about where on this trade-off curve to operate.

The Bandwidth-Power Exchange in Practice

Real systems exploit the trade-off in both directions according to their constraints. Optical fiber links enjoy enormous bandwidth and therefore favor wideband, power-efficient operation, whereas satellite links contend with strict power budgets and lean on wide bandwidth and strong coding to compensate. Cellular networks, operating under both spectrum scarcity and power limits, adapt their modulation and coding to the instantaneous channel, selecting dense constellations when the signal-to-noise ratio is high and falling back to robust low-rate schemes when it is low.

Adaptive modulation and coding embodies this trade-off as a dynamic control loop. By continuously estimating the channel quality and choosing the highest modulation and code rate that the current signal-to-noise ratio can support, a system tracks the Shannon-Hartley curve in real time, maximizing throughput while maintaining a target error rate. This adaptation, central to every contemporary broadband wireless standard, turns the static capacity formula into an operating policy that responds to a changing channel.
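The selection step of such a control loop can be sketched as a table lookup. Everything specific in the example below is hypothetical: the table of modulation and coding combinations, their spectral efficiencies, and the assumed 3 dB gap between a practical scheme and the Shannon limit are illustrative values, not taken from any standard.

```python
import math

# Hypothetical modulation-and-coding table: (label, spectral efficiency in bit/s/Hz),
# sorted by increasing efficiency.
MCS_TABLE = [
    ("QPSK rate 1/2", 1.0),
    ("16-QAM rate 1/2", 2.0),
    ("16-QAM rate 3/4", 3.0),
    ("64-QAM rate 3/4", 4.5),
]
GAP_DB = 3.0   # assumed implementation gap to the Shannon limit, illustrative only

def required_snr_db(efficiency, gap_db=GAP_DB):
    """Shannon-limit SNR for the given efficiency, plus the assumed implementation gap."""
    return 10 * math.log10(2 ** efficiency - 1) + gap_db

def select_mcs(snr_db):
    """Return the highest-efficiency table entry the estimated SNR supports, or None."""
    chosen = None
    for label, efficiency in MCS_TABLE:
        if snr_db >= required_snr_db(efficiency):
            chosen = (label, efficiency)
    return chosen

for estimate_db in (0.0, 5.0, 10.0, 20.0):
    print(estimate_db, select_mcs(estimate_db))
```

A real system would derive its thresholds from measured error-rate curves rather than from the capacity formula, and would add hysteresis so that the selection does not oscillate when the estimate hovers near a threshold.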

Summary

Information theory supplies the fundamental limits that bound all digital communication and compression. Entropy measures the information content of a source and sets the limit of lossless compression, mutual information measures how much one variable reveals about another, and channel capacity, the maximum mutual information over input distributions, marks the sharp boundary between achievable and unachievable transmission rates. Shannon's source and channel coding theorems prove that these limits are both binding and attainable, and that compression and error protection may be designed separately for a single link.

The Shannon-Hartley theorem renders the capacity of the band-limited Gaussian channel as a concise function of bandwidth and signal-to-noise ratio, exposing the linear value of bandwidth, the logarithmic value of power, and the universal energy-per-bit limit of about -1.59 decibels. Extending the analysis to fading and multiple-antenna channels yields the ergodic and outage notions of capacity and explains the spatial multiplexing gains of modern wireless systems, while rate-distortion theory governs lossy compression and underlies the audio, image, and video codecs in daily use. Together these results convert the engineering of communication from intuition into a discipline grounded in precise, achievable bounds.

Electronics Guide