What Is Information Theory? Entropy, Bits, and Communication
Information theory is the mathematical study of how information is quantified, stored, and transmitted. Learn about entropy, bits, channel capacity, and Shannon's foundational contributions.
What Is Information Theory?
Information theory is a branch of applied mathematics and electrical engineering that deals with the quantification, storage, and communication of information. Founded by Claude Shannon with his landmark 1948 paper A Mathematical Theory of Communication, information theory provides the mathematical foundation for modern digital communication, data compression, cryptography, and artificial intelligence.
Shannon's insight was to separate the meaning of a message from its informational content: to treat information as a purely mathematical quantity that could be measured, compressed, and transmitted reliably regardless of what the message actually said. This abstraction, radical at the time, enabled engineers to design communication systems with provable performance guarantees.
The Bit: The Basic Unit of Information
The fundamental unit of information theory is the bit (short for binary digit), a term coined by John Tukey in a 1947 Bell Labs memo and first used in print by Shannon in his 1948 paper. A bit represents the information gained from observing an event that can take one of two equally probable outcomes, like a fair coin flip.
More precisely, the information content of an event is measured in bits as the negative logarithm (base 2) of its probability:
I(x) = −log₂(P(x))
An event with probability 1 (certainty) carries 0 bits of information: it tells us nothing we did not already know. An event with probability 1/2 carries 1 bit. An event with probability 1/8 carries 3 bits. The rarer the event, the more information its occurrence conveys.
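The three values quoted above can be checked directly from the formula. The following is a minimal sketch; the function name `self_information` is chosen here for illustration and does not come from any particular library.

```python
import math

def self_information(p: float) -> float:
    """Information content, in bits, of an event that occurs with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # log2(1/p) is the same quantity as -log2(p).
    return math.log2(1.0 / p)

for p in (1.0, 0.5, 0.125):
    print(f"P = {p}: {self_information(p):.1f} bits")
# P = 1.0: 0.0 bits
# P = 0.5: 1.0 bits
# P = 0.125: 3.0 bits
```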
Shannon Entropy
Shannon entropy (H) measures the average information content (or equivalently, the uncertainty) of a probability distribution. For a random variable X with possible outcomes x₁, x₂, ..., xₙ:
H(X) = −Σ P(xᵢ) log₂ P(xᵢ)
Entropy is maximized, at log₂(n) bits for n outcomes, when all outcomes are equally probable, and minimized (zero) when one outcome is certain.
| Source | Probability Distribution | Entropy (bits) |
|---|---|---|
| Fair coin | P(H) = P(T) = 0.5 | 1.0 bit |
| Biased coin (P(H) = 0.9) | P(H) = 0.9, P(T) = 0.1 | ~0.47 bits |
| Fair die (6 sides) | P = 1/6 for each face | ~2.58 bits |
| English letter distribution | Non-uniform; e is most common | ~4.11 bits per letter |
The English language, for example, has substantial structure and redundancy: not all letter combinations are equally likely. Once context and predictability are taken into account, its effective entropy is approximately 1.0–1.5 bits per character, well below both the single-letter figure in the table and the log₂(26) ≈ 4.7 bits that a uniformly random choice among 26 letters would require.
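The coin and die entries in the table can be reproduced with a few lines of code. This is a sketch only; `shannon_entropy` is an illustrative name, and the distributions are the ones listed in the table.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with probability 0 contribute nothing to the sum.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
fair_die = [1 / 6] * 6

print(round(shannon_entropy(fair_coin), 2))    # 1.0
print(round(shannon_entropy(biased_coin), 2))  # 0.47
print(round(shannon_entropy(fair_die), 2))     # 2.58
```

A distribution with one certain outcome, such as `[1.0, 0.0]`, gives an entropy of zero.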
Data Compression
Shannon's source coding theorem establishes the fundamental limit on data compression: no lossless compression scheme can encode messages from a source with entropy H using fewer than H bits per symbol on average. This is a lower bound on the encoded size, and therefore a ceiling on how much compression is possible.
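One way to see the bound in practice is to run a general-purpose compressor on data whose entropy is known. The sketch below uses Python's standard `zlib` module on two synthetic sources; the source definitions, sample size, and seed are assumptions made for illustration. The compressed size per symbol should come out above the entropy in both cases.

```python
import math
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000

# Source A: two symbols with probabilities 0.9 and 0.1 (about 0.47 bits per symbol).
biased = bytes(rng.choices(b"ab", weights=[0.9, 0.1], k=n))
# Source B: uniformly random bytes (8 bits per symbol).
uniform = bytes(rng.getrandbits(8) for _ in range(n))

def bits_per_symbol(data: bytes) -> float:
    """Compressed size in bits divided by the number of input symbols."""
    return 8 * len(zlib.compress(data, 9)) / len(data)

entropy_biased = -(0.9 * math.log2(0.9) + 0.1 * math.log2(0.1))
print(f"biased:  entropy {entropy_biased:.2f}, zlib {bits_per_symbol(biased):.2f} bits/symbol")
print(f"uniform: entropy 8.00, zlib {bits_per_symbol(uniform):.2f} bits/symbol")
```

The uniform source is effectively incompressible, while the biased source shrinks considerably but not below its entropy.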
Practical compression algorithms approach this limit:
- Huffman coding: Assigns shorter binary codes to more frequent symbols and longer codes to rarer ones; used in JPEG, ZIP, and many other formats
- Arithmetic coding: Can spend a fractional number of bits per symbol, so it gets closer to the entropy than Huffman coding and handles adaptive symbol probabilities well; used in video compression standards
- LZ algorithms (LZ77, LZ78, LZW): Dictionary-based compression exploiting repeated patterns; foundation of ZIP, GIF, and DEFLATE compression
- Modern codecs (HEVC, AV1): Combine multiple techniques with transform coding to achieve high video compression ratios
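As an illustration of the first technique in the list, here is a minimal Huffman sketch that computes only the code lengths (not the bit patterns) and compares the average length with the entropy of the input. The function name and the sample string are illustrative choices, not taken from any specific format or library.

```python
import heapq
import math
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code built from symbol counts."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    # Heap entries: (total weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "abracadabra"
freqs = Counter(text)
total = sum(freqs.values())
lengths = huffman_code_lengths(freqs)

avg_len = sum(freqs[s] * lengths[s] for s in freqs) / total
entropy = -sum((c / total) * math.log2(c / total) for c in freqs.values())
print(lengths)
print(f"entropy = {entropy:.3f} bits/symbol, Huffman average = {avg_len:.3f} bits/symbol")
```

The most frequent symbol receives the shortest code, and the average length lands between the entropy and the entropy plus one bit, as the source coding theorem predicts for an optimal symbol-by-symbol code.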
Channel Capacity and the Noisy Channel Theorem
Shannon's second major theorem addresses reliable communication over noisy channels. Every communication channel, whether a telephone line, wireless radio link, or fiber optic cable, is subject to noise that can corrupt transmitted bits.
The channel capacity C is the maximum rate at which information can be transmitted through a channel with arbitrarily low error probability. For a channel with bandwidth B (in Hz) subject to additive white Gaussian noise, with signal-to-noise ratio SNR expressed as a linear power ratio rather than in decibels:
C = B × log₂(1 + SNR) bits per second
This is the Shannon-Hartley theorem. Remarkably, Shannon proved that as long as the transmission rate stays below channel capacity, appropriate encoding can make the error probability arbitrarily small, however noisy the channel is, provided its capacity is greater than zero.
| Channel Type | Typical Bandwidth | Typical Capacity |
|---|---|---|
| Traditional phone line (PSTN) | 3.4 kHz | ~30–50 kbps |
| 4G LTE cellular | Up to 20 MHz per carrier (up to 100 MHz with LTE-Advanced carrier aggregation) | ~100 Mbps peak (higher with LTE-Advanced) |
| 5G millimeter wave | Up to 800 MHz | ~1–10 Gbps peak |
| Single-mode optical fiber | ~50 THz | Theoretically hundreds of Tbps |
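The phone-line row can be sanity-checked against the Shannon-Hartley formula. The sketch below takes the 3.4 kHz bandwidth from the table; the SNR values of 30 dB and 40 dB are assumptions picked for illustration, since real lines vary.

```python
import math

def shannon_capacity(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley capacity in bits per second, given bandwidth and SNR in decibels."""
    # The formula needs a linear power ratio, so convert from decibels first.
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)

# Assumed SNR values; the bandwidth is the table's figure for a phone line.
for snr_db in (30, 40):
    capacity = shannon_capacity(3400, snr_db)
    print(f"SNR {snr_db} dB: {capacity / 1000:.1f} kbps")
```

Under these assumptions the result is roughly 34 kbps and 45 kbps, which falls inside the range given in the table. Note that capacity grows linearly with bandwidth but only logarithmically with SNR.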
Applications of Information Theory
Information theory's influence extends far beyond communications engineering:
- Error-correcting codes: Reed-Solomon codes (used in CDs, DVDs, QR codes) protect against burst errors, while turbo and LDPC codes (used in 4G/5G) come close to Shannon's theoretical limits
- Cryptography: Shannon's concept of perfect secrecy (the one-time pad) and entropy analysis of cryptographic systems
- Machine learning: Cross-entropy loss functions, mutual information for feature selection, and information-theoretic analysis of neural networks
- Genomics: Measuring information content of DNA sequences and gene expression patterns
- Physics: Connections between thermodynamic entropy (Boltzmann) and Shannon entropy; the physics of information (Landauer's principle, Maxwell's demon)
- Neuroscience: Quantifying information processing efficiency in neural circuits
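To make the machine learning entry concrete, the sketch below computes cross-entropy and the related Kullback-Leibler divergence for small hand-written distributions. The distributions are made up for illustration, and the result is in bits for consistency with the rest of this article; machine learning libraries typically use natural logarithms instead.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits: average code length when data from p is coded for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra bits per symbol paid for assuming q when the true distribution is p."""
    return cross_entropy(p, q) - cross_entropy(p, p)

true_label = [1.0, 0.0, 0.0]        # one-hot target for a 3-class problem
good_prediction = [0.8, 0.1, 0.1]   # hypothetical model outputs
poor_prediction = [0.2, 0.4, 0.4]

print(f"good: {cross_entropy(true_label, good_prediction):.3f} bits")
print(f"poor: {cross_entropy(true_label, poor_prediction):.3f} bits")
print(f"KL(good || poor): {kl_divergence(good_prediction, poor_prediction):.3f} bits")
```

A prediction that puts more probability on the correct class yields a lower cross-entropy, which is why the quantity works as a training loss.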