What Is Information Theory? Entropy, Bits, and Communication

What Is Information Theory?

Information theory is a branch of applied mathematics and electrical engineering that deals with the quantification, storage, and communication of information. Founded by Claude Shannon with his landmark 1948 paper "A Mathematical Theory of Communication", information theory provides the mathematical foundation for modern digital communication, data compression, cryptography, and artificial intelligence.

Shannon's insight was to separate the meaning of a message from its informational content — to treat information as a purely mathematical quantity that could be measured, compressed, and transmitted reliably regardless of what the message actually said. This abstraction, radical at the time, enabled engineers to design communication systems with provable performance guarantees.

The Bit: The Basic Unit of Information

The fundamental unit of information theory is the bit (short for binary digit), a term coined by John Tukey in a 1947 Bell Labs memo and first used in print by Shannon in his 1948 paper. A bit represents the information gained from observing an event that can take one of two equally probable outcomes, such as a fair coin flip.

More precisely, the information content of an event is measured in bits as the negative logarithm (base 2) of its probability:

I(x) = −log₂(P(x))

An event with probability 1 (certainty) carries 0 bits of information — it tells us nothing we did not already know. An event with probability 1/2 carries 1 bit. An event with probability 1/8 carries 3 bits. The rarer the event, the more information its occurrence conveys.
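
The three values above can be checked with a few lines of Python. This is a minimal sketch; the function name self_information is chosen here for illustration and does not come from any library.

```python
import math

def self_information(p: float) -> float:
    """Information content, in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    # log2(1/p) is the same quantity as -log2(p).
    return math.log2(1.0 / p)

for p in (1.0, 0.5, 0.125):
    print(f"P(x) = {p}: {self_information(p):.1f} bits")
```

Halving the probability always adds exactly one bit, which is why the logarithm is taken in base 2.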

Shannon Entropy

Shannon entropy (H) measures the average information content — or equivalently, the uncertainty — of a probability distribution. For a random variable X with possible outcomes x₁, x₂, ..., xₙ:

H(X) = −Σ P(xᵢ) log₂ P(xᵢ)

Entropy is maximized, at log₂(n) bits, when all n outcomes are equally probable, and minimized (zero) when one outcome is certain.

Source	Probability Distribution	Entropy (bits)
Fair coin	P(H) = P(T) = 0.5	1.0 bit
Biased coin (P(H) = 0.9)	P(H) = 0.9, P(T) = 0.1	~0.47 bits
Fair die (6 sides)	P = 1/6 for each face	~2.58 bits
English letter distribution	Non-uniform; e is most common	~4.11 bits per letter
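
The first three rows of the table can be reproduced directly from the entropy formula. The sketch below assumes each distribution is given as a plain list of probabilities; shannon_entropy is an illustrative name, not a library function.

```python
import math

def shannon_entropy(probs):
    """Entropy, in bits, of a discrete distribution given as probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with probability 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

sources = {
    "fair coin": [0.5, 0.5],
    "biased coin": [0.9, 0.1],
    "fair die": [1 / 6] * 6,
}
for name, probs in sources.items():
    print(f"{name}: {shannon_entropy(probs):.2f} bits")
```

The biased coin comes out well below 1 bit because its outcome is usually predictable, even though it still has two possible results.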

The English language, for example, has substantial structure and redundancy — not all letter combinations are equally likely. Its effective entropy is approximately 1.0–1.5 bits per character (accounting for context and predictability), even though a random choice among 26 letters would require log₂(26) ≈ 4.7 bits.

Data Compression

Shannon's source coding theorem establishes the fundamental limit on data compression: no lossless compression scheme can encode messages from a source with entropy H using fewer than H bits per symbol on average. H is therefore a lower bound on the average code length, and equivalently a limit on how much compression is possible.

Practical compression algorithms approach this limit:

Huffman coding: Assigns shorter binary codes to more frequent symbols and longer codes to rarer ones; used in JPEG, ZIP, and many other formats
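Arithmetic coding: Encodes a whole message as a single number, so it is not restricted to a whole number of bits per symbol and adapts to changing symbol probabilities more efficiently than Huffman coding; used in video compression standards such as H.264 and HEVC (as CABAC)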
LZ algorithms (LZ77, LZ78, LZW): Dictionary-based compression exploiting repeated patterns; foundation of ZIP, GIF, and DEFLATE compression
Modern codecs (HEVC, AV1): Combine multiple techniques with transform coding to achieve high video compression ratios
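
To illustrate how a practical code approaches the entropy limit, here is a minimal sketch of Huffman coding using only the Python standard library. The sample string "abracadabra" and the function name huffman_code are arbitrary choices for illustration, and the symbol probabilities are simply estimated from the sample itself.

```python
import heapq
import math
from collections import Counter

def huffman_code(freqs):
    """Build a prefix code {symbol: bitstring} from {symbol: weight}."""
    if len(freqs) == 1:
        # Degenerate source: a single symbol still needs a one-bit codeword.
        return {sym: "0" for sym in freqs}
    # Heap entries are (weight, tie_breaker, {symbol: partial codeword}).
    # The unique tie_breaker keeps the dictionaries from ever being compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codewords.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "abracadabra"
freqs = Counter(text)
code = huffman_code(freqs)

total = sum(freqs.values())
avg_len = sum(freqs[s] * len(code[s]) for s in freqs) / total
entropy = sum((n / total) * math.log2(total / n) for n in freqs.values())

print(code)
print(f"average code length: {avg_len:.3f} bits/symbol")
print(f"entropy:             {entropy:.3f} bits/symbol")
```

The average code length should land at or slightly above the entropy and always within one bit of it, which is the gap that arithmetic coding is designed to narrow.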

Channel Capacity and the Noisy Channel Theorem

Shannon's second major theorem addresses reliable communication over noisy channels. Every communication channel — whether a telephone line, wireless radio link, or fiber optic cable — is subject to noise that can corrupt transmitted bits.

The channel capacity C is the maximum rate at which information can be transmitted through a channel with arbitrarily low error probability. For a channel with bandwidth B Hz, additive white Gaussian noise, and signal-to-noise ratio SNR (expressed as a linear power ratio, not in decibels):

C = B × log₂(1 + SNR) bits per second

This is the Shannon-Hartley theorem. Remarkably, Shannon proved that as long as the transmission rate stays below channel capacity, appropriate encoding can make the error probability arbitrarily small, however noisy the channel is (provided its capacity is greater than zero). Above capacity, no code can achieve this.
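
As a worked example of the capacity formula, the sketch below evaluates it for a voice-grade telephone channel. The 30 dB signal-to-noise ratio is an assumed figure for illustration only, and the helper names are hypothetical. Since SNR is usually quoted in decibels, it is converted to a linear power ratio first.

```python
import math

def db_to_linear(snr_db: float) -> float:
    """Convert a signal-to-noise ratio from decibels to a linear power ratio."""
    return 10.0 ** (snr_db / 10.0)

def shannon_capacity(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon-Hartley capacity in bits per second: C = B * log2(1 + SNR)."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)

# Assumed figures: 3.4 kHz of bandwidth and a 30 dB SNR (a ratio of 1000).
capacity = shannon_capacity(3400.0, db_to_linear(30.0))
print(f"{capacity / 1000:.1f} kbit/s")  # about 33.9 kbit/s
```

Because capacity grows only logarithmically with SNR but linearly with bandwidth, doubling the bandwidth helps far more than doubling the signal power.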

Channel Type	Typical Bandwidth	Typical Capacity
Traditional phone line (PSTN)	3.4 kHz	~30–50 kbps
4G LTE cellular	Up to 20 MHz per carrier	~100 Mbps peak
5G millimeter wave	Up to 800 MHz	~1–10 Gbps peak
Single-mode optical fiber	~50 THz	Theoretically hundreds of Tbps

Applications of Information Theory

Information theory's influence extends far beyond communications engineering:

Error-correcting codes: Reed-Solomon codes (used in CDs, DVDs, QR codes) correct bursts of errors, while turbo and LDPC codes (used in 4G/5G) closely approach Shannon's theoretical limits
Cryptography: Shannon's concept of perfect secrecy (the one-time pad) and entropy analysis of cryptographic systems
Machine learning: Cross-entropy loss functions, mutual information for feature selection, and information-theoretic analysis of neural networks
Genomics: Measuring information content of DNA sequences and gene expression patterns
Physics: Connections between thermodynamic entropy (Boltzmann) and Shannon entropy; the physics of information (Landauer's principle, Maxwell's demon)
Neuroscience: Quantifying information processing efficiency in neural circuits
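
The cross-entropy named in the machine learning item builds directly on the entropy formula. The sketch below is a minimal illustration with made-up distributions. It measures in bits to match the rest of this article, whereas machine learning frameworks typically use natural logarithms (nats). It also assumes the model assigns nonzero probability to every outcome that can actually occur.

```python
import math

def cross_entropy(p, q):
    """Average bits needed to encode samples from p with a code built for q."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra bits per symbol paid for using q when the true distribution is p."""
    return cross_entropy(p, q) - cross_entropy(p, p)

# Hypothetical distributions chosen so the arithmetic is easy to follow.
true_dist = [0.5, 0.25, 0.25]
model = [0.25, 0.25, 0.5]

print(cross_entropy(true_dist, true_dist))  # entropy of true_dist: 1.5
print(cross_entropy(true_dist, model))      # 1.75
print(kl_divergence(true_dist, model))      # 0.25
```

Minimizing cross-entropy during training therefore pushes the model's distribution toward the true one, since the penalty disappears only when the two match.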
