The Life of a Data Byte

Communications of the ACM 

A byte of data has been stored in many different ways over the years as newer, better, and faster storage media have been introduced. A byte is a unit of digital information that most commonly refers to eight bits. A bit is a unit of information that can be expressed as 0 or 1, representing a logical state.

Let's take a brief walk down memory lane to learn about the origins of bits and bytes. Going back in time to Babbage's Analytical Engine, you can see that a bit was stored as the position of a mechanical gear or lever. In the case of paper cards, a bit was stored as the presence or absence of a hole in the card at a specific place. For magnetic storage devices, such as tapes and disks, a bit is represented by the polarity of a certain area of the magnetic film. In modern DRAM (dynamic random-access memory), a bit is typically represented by one of two levels of electrical charge stored in a capacitor, a device that stores electrical energy in an electric field.

In June 1956, Werner Buchholz coined the word byte to refer to a group of bits used to encode a single character of text. Let's address character encoding, starting with ASCII (American Standard Code for Information Interchange). ASCII was based on the English alphabet; therefore, every letter, digit, and symbol (a-z, A-Z, 0-9, +, -, /, ", !, among others) was represented as a seven-bit integer, with the printable characters occupying codes 32 through 126. To support other languages, Unicode extends ASCII by assigning each character a code point; for example, a lowercase j is U+006A, where U stands for Unicode and is followed by a hexadecimal number. UTF-8 is a standard encoding that represents each code point as one or more eight-bit units, allowing every code point from 0 to 127 to be stored in a single byte. This is fine for English characters, but characters in other languages are often encoded as two or more bytes.
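
The relationship between a character, its code point, and its UTF-8 bytes can be checked directly. The following is a minimal sketch, assuming Python 3; the helper name describe is hypothetical and chosen only for illustration. It reports the code point in U+ notation, the number of bytes UTF-8 needs, and the individual bits of each byte.

```python
def describe(ch: str) -> dict:
    """Return the code point, UTF-8 byte count, and bit pattern of one character.

    Hypothetical helper for illustration only; not part of any library.
    """
    encoded = ch.encode("utf-8")  # the bytes that would actually be stored
    return {
        "code_point": f"U+{ord(ch):04X}",           # hexadecimal code point
        "utf8_bytes": len(encoded),                  # 1 for ASCII, more otherwise
        "bits": " ".join(f"{b:08b}" for b in encoded),  # eight bits per byte
    }


# "j" is ASCII; "\u00e9" (e with acute accent) and "\u20ac" (euro sign) are not.
for ch in ("j", "\u00e9", "\u20ac"):
    print(ch, describe(ch))
```

Under these assumptions, the lowercase j fits in a single byte (01101010), while the accented e takes two bytes and the euro sign takes three, which matches the point that non-English characters often need more than one byte.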