What Is Information Theory? Entropy, Bits, and Communication

Information theory is the mathematical study of how information is quantified, stored, and transmitted. Learn about entropy, bits, channel capacity, and Shannon's foundational contributions.

The InfoNexus Editorial Team | May 7, 2026 | 8 min read

What Is Information Theory?

Information theory is a branch of applied mathematics and electrical engineering that deals with the quantification, storage, and communication of information. Founded by Claude Shannon with his landmark 1948 paper A Mathematical Theory of Communication, information theory provides the mathematical foundation for modern digital communication, data compression, cryptography, and artificial intelligence.

Shannon's insight was to separate the meaning of a message from its informational content: to treat information as a purely mathematical quantity that could be measured, compressed, and transmitted reliably regardless of what the message actually said. This abstraction, radical at the time, enabled engineers to design communication systems with provable performance guarantees.

The Bit: The Basic Unit of Information

The fundamental unit of information theory is the bit (short for binary digit), a term coined by John Tukey and credited to him in Shannon's 1948 paper, which gave it its formal meaning. A bit represents the information gained from observing an event that can take one of two equally probable outcomes, like a fair coin flip.

More precisely, the information content of an event is measured in bits as the negative logarithm (base 2) of its probability:

I(x) = −log₂ P(x)

An event with probability 1 (certainty) carries 0 bits of information: it tells us nothing we did not already know. An event with probability 1/2 carries 1 bit. An event with probability 1/8 carries 3 bits. The rarer the event, the more information its occurrence conveys.
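
The three cases above can be checked with a minimal Python sketch. The function name self_information is an illustrative choice, not a term from Shannon's paper.

```python
import math

def self_information(p: float) -> float:
    """Information content, in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # log2(1/p) equals -log2(p); this form avoids printing "-0.0" for p = 1.
    return math.log2(1.0 / p)

# The three probabilities discussed in the text: certainty, 1/2, and 1/8.
for p in (1.0, 0.5, 0.125):
    print(f"P(x) = {p}: {self_information(p):.1f} bits")
```

The loop prints 0.0, 1.0, and 3.0 bits, matching the values quoted above.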

Shannon Entropy

Shannon entropy (H) measures the average information content, or equivalently the uncertainty, of a probability distribution. For a random variable X with possible outcomes x₁, x₂, ..., xₙ:

H(X) = −Σᵢ P(xᵢ) log₂ P(xᵢ)

Entropy is maximized when all outcomes are equally probable and minimized (zero) when one outcome is certain.

Source | Probability Distribution | Entropy
Fair coin | P(H) = P(T) = 0.5 | 1.0 bit
Biased coin (P(H) = 0.9) | P(H) = 0.9, P(T) = 0.1 | ~0.47 bits
Fair die (6 sides) | P = 1/6 for each face | ~2.58 bits
English letter distribution | Non-uniform; e is most common | ~4.11 bits per letter

The English language, for example, has substantial structure and redundancy: not all letter combinations are equally likely. Its effective entropy is approximately 1.0–1.5 bits per character (accounting for context and predictability), even though a random choice among 26 letters would require log₂(26) ≈ 4.7 bits.
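
The coin and die entropies quoted above, and the 4.7-bit figure for 26 equally likely letters, can be reproduced with a short sketch of the entropy formula. The function and variable names are illustrative assumptions.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped: p * log2(1/p) tends to 0 as p -> 0.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
fair_die = [1 / 6] * 6
uniform_letters = [1 / 26] * 26  # 26 equally likely letters, no context

for name, dist in [("fair coin", fair_coin), ("biased coin", biased_coin),
                   ("fair die", fair_die), ("uniform letters", uniform_letters)]:
    print(f"{name}: {shannon_entropy(dist):.2f} bits")
```

The uniform cases reduce to log₂ of the number of outcomes, which is why the fair die gives log₂(6) and the uniform alphabet gives log₂(26).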

Data Compression

Shannon's source coding theorem establishes the fundamental limit on data compression: no lossless compression scheme can encode messages from a source with entropy H using fewer than H bits per symbol on average. This is a theoretical lower bound on the average code length, and therefore a ceiling on how much lossless compression is possible.

Practical compression algorithms approach this limit:

  • Huffman coding: Assigns shorter binary codes to more frequent symbols and longer codes to rarer ones; used in JPEG, ZIP, and many other formats
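  • Arithmetic coding: Encodes an entire message as a single number rather than one codeword per symbol, so it is not limited to a whole number of bits per symbol and handles adaptive symbol probabilities more efficiently than Huffman coding; used in video compression standards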
  • LZ algorithms (LZ77, LZ78, LZW): Dictionary-based compression exploiting repeated patterns; foundation of ZIP, GIF, and DEFLATE compression
  • Modern codecs (HEVC, AV1): Combine multiple techniques with transform coding to achieve high video compression ratios
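
As an illustration of the first item in the list, here is a minimal sketch of Huffman coding in Python. It assumes at least two symbols and a made-up four-symbol source; the names huffman_code and probs are illustrative, and real formats such as DEFLATE add further constraints that are not modeled here.

```python
import heapq
import math

def huffman_code(probs):
    """Build a binary Huffman code from a {symbol: probability} mapping.

    Assumes at least two symbols; a single symbol would get an empty codeword.
    """
    # Heap entries: (probability, unique tie-breaker, {symbol: codeword so far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their codewords.
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical source whose probabilities are all powers of 1/2.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
entropy = sum(p * math.log2(1.0 / p) for p in probs.values())
print(code)
print(f"average length = {avg_len} bits, entropy = {entropy} bits")
```

For this source the average codeword length is 1.75 bits per symbol, exactly equal to the entropy, because every probability is a power of 1/2. For other distributions a Huffman code's average length is at least the entropy and exceeds it by less than one bit per symbol.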

Channel Capacity and the Noisy Channel Theorem

Shannon's second major theorem addresses reliable communication over noisy channels. Every communication channel, whether a telephone line, wireless radio link, or fiber optic cable, is subject to noise that can corrupt transmitted bits.

The channel capacity C is the maximum rate at which information can be transmitted through a channel with arbitrarily low error probability. For a channel with bandwidth B (in Hz) and additive white Gaussian noise, where SNR is the signal-to-noise ratio expressed as a linear power ratio (not in decibels):

C = B × log₂(1 + SNR) bits per second

This is the Shannon-Hartley theorem. Remarkably, Shannon proved that as long as the transmission rate stays below channel capacity, appropriate encoding can make the probability of error as small as desired, however noisy the channel is. At rates above capacity, no encoding can achieve this.
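
The capacity formula can be evaluated directly. The sketch below assumes a 3.4 kHz voice channel with a 30 dB signal-to-noise ratio; the 30 dB figure is an illustrative assumption rather than a value from the text, and the function name is hypothetical.

```python
import math

def shannon_hartley_capacity(bandwidth_hz: float, snr_db: float) -> float:
    """Capacity in bits per second of a Gaussian-noise channel."""
    # The formula needs SNR as a linear power ratio, so convert from decibels.
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)

# Assumed example: 3.4 kHz bandwidth, 30 dB SNR (a power ratio of 1000).
capacity = shannon_hartley_capacity(3400, 30)
print(f"{capacity / 1000:.1f} kbps")
```

Under these assumptions the result is about 33.9 kbps. Because capacity grows linearly with bandwidth but only logarithmically with SNR, doubling the bandwidth doubles the capacity, while doubling the signal power adds at most B bits per second.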

Channel Type | Typical Bandwidth | Typical Capacity
Traditional phone line (PSTN) | 3.4 kHz | ~30–50 kbps
4G LTE cellular | Up to 100 MHz | ~100 Mbps peak
5G millimeter wave | Up to 800 MHz | ~1–10 Gbps peak
Single-mode optical fiber | ~50 THz | Theoretically hundreds of Tbps

Applications of Information Theory

Information theory's influence extends far beyond communications engineering:

  • Error-correcting codes: Reed-Solomon codes (used in CDs, DVDs, QR codes) and turbo/LDPC codes (used in 4G/5G) approach Shannon's theoretical limits
  • Cryptography: Shannon's concept of perfect secrecy (the one-time pad) and entropy analysis of cryptographic systems
  • Machine learning: Cross-entropy loss functions, mutual information for feature selection, and information-theoretic analysis of neural networks
  • Genomics: Measuring information content of DNA sequences and gene expression patterns
  • Physics: Connections between thermodynamic entropy (Boltzmann) and Shannon entropy; the physics of information (Landauer's principle, Maxwell's demon)
  • Neuroscience: Quantifying information processing efficiency in neural circuits
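
To make the machine learning item concrete, the following sketch computes cross-entropy in bits between a true distribution p and a model distribution q. The names and example distributions are illustrative assumptions, and the code assumes q assigns nonzero probability wherever p does. Machine learning libraries typically use the natural logarithm, which gives the same quantity in nats rather than bits.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits: expected code length when data
    drawn from p is encoded with a code designed for q."""
    # Assumes q_i > 0 wherever p_i > 0; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.5, 0.5]    # a fair coin
model_dist = [0.9, 0.1]   # a model that wrongly expects a biased coin

print(f"H(p, p) = {cross_entropy(true_dist, true_dist):.3f} bits")
print(f"H(p, q) = {cross_entropy(true_dist, model_dist):.3f} bits")
```

When the model matches the true distribution, the cross-entropy equals the Shannon entropy (1 bit here). Any mismatch increases it, which is why minimizing cross-entropy pushes a model's predicted probabilities toward the true ones.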