Binary-to-text encoding

A binary-to-text encoding is an encoding of data in plain text. More precisely, it is an encoding of binary data in a sequence of printable characters. These encodings are necessary for transmission of data when the channel does not allow binary data (such as email or NNTP) or is not 8-bit clean. PGP documentation (RFC 4880) uses the term ASCII armor for binary-to-text encoding when referring to Base64.

Description

The ASCII text-encoding standard uses 128 unique values (0–127) to represent the alphabetic, numeric, and punctuation characters commonly used in English, plus a selection of control codes which do not represent printable characters. For example, the capital letter A is ASCII character 65, the numeral 2 is ASCII 50, the character } is ASCII 125, and the control character carriage return is ASCII 13. Systems based on ASCII use seven bits to represent these values digitally.
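As a small illustration, the values quoted above can be checked with Python's built-in ord. The helper name ascii_code is hypothetical and is introduced here only for the example.

```python
def ascii_code(ch: str) -> int:
    """Return the ASCII value of one character, rejecting non-ASCII input."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return code


for ch in ["A", "2", "}", "\r"]:
    code = ascii_code(ch)
    # Seven binary digits are enough for every value from 0 to 127.
    print(f"{ch!r:6} {code:3d} {code:07b}")
```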

In contrast, most computers store data in memory organized in eight-bit bytes. Files that contain machine-executable code and non-textual data typically contain all 256 possible eight-bit byte values. Many computer programs came to rely on this distinction between seven-bit text and eight-bit binary data, and would not function properly if non-ASCII characters appeared in data that was expected to include only ASCII text. For example, if the value of the eighth bit is not preserved, the program might interpret a byte value above 127 as a flag telling it to perform some function.

It is often desirable, however, to send non-textual data through text-based systems, such as when attaching an image file to an e-mail message. To accomplish this, the eight-bit data is encoded into seven-bit ASCII characters (generally using only the ASCII printable characters: alphanumeric and punctuation characters). Upon safe arrival at its destination, it is decoded back to its eight-bit form. This process is referred to as binary-to-text encoding. Many programs, such as PGP and GNU Privacy Guard (GPG), perform this conversion to allow for data transport.
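The round trip described above can be sketched with Python's standard base64 module. Base64 is chosen here only as one example of a binary-to-text encoding, and the function names to_text and from_text are hypothetical.

```python
import base64


def to_text(data: bytes) -> str:
    """Encode arbitrary bytes as a string of ASCII characters."""
    return base64.b64encode(data).decode("ascii")


def from_text(text: str) -> bytes:
    """Recover the original bytes from the encoded text."""
    return base64.b64decode(text)


payload = bytes(range(256))  # every possible eight-bit value once
armored = to_text(payload)

# Every character of the encoded form fits in seven bits.
assert all(ord(ch) < 128 for ch in armored)
# Decoding restores the eight-bit data unchanged.
assert from_text(armored) == payload
```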

Encoding plain text

Binary-to-text encoding methods are also used as a mechanism for encoding plain text. For example:

By using a binary-to-text encoding on messages that are already plain text, then decoding on the other end, one can make such systems appear to be completely transparent. This is sometimes referred to as 'ASCII armoring'. For example, the ViewState component of ASP.NET uses base64 encoding to safely transmit text via HTTP POST, in order to avoid delimiter collision.

Encoding standards

The table below compares the most commonly used forms of binary-to-text encoding.

| Encoding | Data type | Efficiency | Programming language implementations | Comments |
|---|---|---|---|---|
| Ascii85 | Arbitrary | 80% | awk, C, C (2), C#, F#, Go, Java, Perl, Python, Python (2) | |
| Base16 (hexadecimal) | Arbitrary | 50% | Probably any language around | |
| Base32 | Arbitrary | 63% | ANSI C, Java, Python | |
| Base58 | Integer | ~73% | C++, Python | Similar to Base64, but modified to avoid both non-alphanumeric characters and letters which might look ambiguous when printed. |
| Base64 | Arbitrary | 75% | awk, C, C (2), Python, many others | |
| Base85 (RFC 1924) | Arbitrary | 80% | C, Python, Python (2) | Revised version of Ascii85. |
| Base91 | Arbitrary | ~82% | C, Java, PHP, 8086 assembly, AWK | |
| Base122 | Arbitrary | ~86% | JavaScript | |
| BinHex | Arbitrary | 75% | Perl, C, C (2) | MacOS Classic |
| .boo | Arbitrary | 75+%[1] | C, BASIC, assembly, Pascal[2] | Developed by Columbia University for its Kermit protocol[3] |
| Btoa | Arbitrary | 80% | | Early form of Ascii85 |
| Intel HEX | Arbitrary | ~<50% | C library, C++ | Typically used to program EPROM, ROM, NOR-Flash memory chips |
| MIME | Arbitrary | See Quoted-printable and Base64 | See Quoted-printable and Base64 | Encoding container for e-mail-like formatting |
| S-record (Motorola hex) | Arbitrary | ~<50% | C library, C++ | Typically used to program EPROM, ROM, NOR-Flash memory chips |
| Percent encoding | Text (URIs), Arbitrary (RFC1738) | ~40%[4] (33%-70%[5]) | C, Python, probably many others | |
| Quoted-printable | Text | ~33%-100%[6] | Probably many | Preserves line breaks; cuts lines at 76 characters |
| Uuencoding | Arbitrary | ~60% (up to 70%) | Perl, C, probably many others | Largely replaced by MIME and yEnc |
| Xxencoding | Arbitrary | ~75% (similar to Uuencoding) | C | Proposed (and occasionally used) as replacement for Uuencoding to avoid character set translation problems between ASCII and the EBCDIC systems that could corrupt Uuencoded data |
| yEnc | Arbitrary, mostly non-text | ~98% | C | Includes a CRC checksum |
| Z85 | Arbitrary | 80% | C, C/C++, Python, Ruby, Node.js, Go | ZeroMQ base85; safe for inclusion as string in source code |
| RFC 1751 (S/KEY) | Arbitrary | 33% | C,[7] Python, ... | "A Convention for Human-readable 128-bit Keys". A series of small English words is easier for humans to read, remember, and type in than decimal or other binary-to-text encoding systems.[8] Each 64-bit number is mapped to six short words, of one to four characters each, from a public 2048-word dictionary.[7] |

The 95 codes from 32 to 126 (those for which the C function isprint returns true) are known as the ASCII printable characters.
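The size of the output alphabet limits how efficient an encoding can be. The sketch below assumes that "efficiency" in the table means input bytes divided by output characters, which the source does not state explicitly; under that assumption, an alphabet of N characters carries at most log2(N) bits per eight-bit output character. The function names are hypothetical, and the encoders come from Python's standard base64 module.

```python
import base64
import math


def max_efficiency(alphabet_size: int) -> float:
    """Upper bound: bits carried per character, divided by 8 bits stored."""
    return math.log2(alphabet_size) / 8


def measured_efficiency(encode, data: bytes) -> float:
    """Input length divided by encoded length for one sample."""
    return len(data) / len(encode(data))


# 60 bytes divides evenly into 3-, 4- and 5-byte groups, so no padding is
# added; there are no zero bytes, so Ascii85's "z" shortcut is not triggered.
sample = bytes(range(1, 61))

for name, size, encode in [
    ("Base16", 16, base64.b16encode),
    ("Base32", 32, base64.b32encode),
    ("Base64", 64, base64.b64encode),
    ("Ascii85", 85, base64.a85encode),
]:
    bound = max_efficiency(size)
    actual = measured_efficiency(encode, sample)
    print(f"{name}: bound {bound:.1%}, measured {actual:.1%}")

# Using all 95 printable characters would still stay below 83%.
print(f"95 printable characters: bound {max_efficiency(95):.1%}")
```

Ascii85 measures slightly below its bound because it packs 4 bytes into 5 characters rather than using fractional bits.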

Some older and now uncommon formats include BOO, BTOA, and USR encoding.

Most of these encodings generate text containing only a subset of all ASCII printable characters: for example, the Base64 encoding generates text that contains only upper-case and lower-case letters (A–Z, a–z), numerals (0–9), and the "+", "/", and "=" symbols.
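A quick check of this claim, using Python's standard base64 module as the encoder. The set name BASE64_CHARS is hypothetical and the sample input is arbitrary.

```python
import base64
import string

# The 64 data characters plus the "=" padding symbol.
BASE64_CHARS = set(
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/="
)

encoded = base64.b64encode(bytes(range(256))).decode("ascii")
used = set(encoded)

# Every character in the output belongs to the 65-character set above.
assert used <= BASE64_CHARS
print(f"{len(used)} distinct characters used out of {len(BASE64_CHARS)} allowed")
```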

Some of these encodings (quoted-printable and percent encoding) are based on a set of allowed characters and a single escape character. The allowed characters are left unchanged, while all other characters are converted into a string starting with the escape character. This kind of conversion allows the resulting text to be almost readable, in that letters and digits are part of the allowed characters, and are therefore left as they are in the encoded text. These encodings produce the shortest plain ASCII output for input that is mostly printable ASCII.
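The escape-character approach can be seen with Python's standard quopri module and urllib.parse.quote. The sample string is an arbitrary choice for illustration; quoted-printable escapes with "=" and percent encoding with "%".

```python
import quopri
from urllib.parse import quote, unquote

text = "café = ok"          # mostly ASCII, with one non-ASCII letter
raw = text.encode("utf-8")   # "é" becomes the two bytes C3 A9

qp = quopri.encodestring(raw)  # quoted-printable, escape character "="
pct = quote(text)              # percent encoding, escape character "%"

print(qp)
print(pct)

# The ASCII letters pass through unchanged, so the result stays readable.
assert qp.startswith(b"caf")
assert pct.startswith("caf")

# Both encodings are reversible.
assert quopri.decodestring(qp) == raw
assert unquote(pct) == text
```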

Some other encodings (Base64, uuencoding) are based on mapping all possible sequences of six bits into different printable characters. Since there are more than 2^6 = 64 printable characters, this is possible. A given sequence of bytes is translated by viewing it as a stream of bits, breaking this stream into chunks of six bits, and generating the sequence of corresponding characters. The different encodings differ in the mapping between sequences of bits and characters and in how the resulting text is formatted.
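The six-bit chunking can be sketched directly. The version below uses the Base64 alphabet and "=" padding so that its output can be compared with Python's standard base64 module; it favours clarity over speed, and the names ALPHABET and encode_six_bit are hypothetical.

```python
import base64
import string

# Base64 alphabet: index 0-63 maps to one printable character.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_six_bit(data: bytes) -> str:
    # View the input as one long stream of bits.
    bits = "".join(format(byte, "08b") for byte in data)
    # Pad with zero bits so the stream splits evenly into six-bit chunks.
    bits += "0" * (-len(bits) % 6)
    # Map each six-bit chunk to its character.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Base64 formatting rule: pad the text to a multiple of four characters.
    return "".join(chars) + "=" * (-len(chars) % 4)


for sample in [b"", b"M", b"Ma", b"Man", bytes(range(256))]:
    assert encode_six_bit(sample) == base64.b64encode(sample).decode("ascii")
```

Uuencoding applies the same chunking but with a different character mapping and line format, which this sketch does not cover.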

Some encodings (the original version of BinHex and the recommended encoding for CipherSaber) use four bits instead of six, mapping all possible sequences of 4 bits onto the 16 standard hexadecimal digits. Using 4 bits per encoded character leads to a 50% longer output than Base64, but simplifies encoding and decoding: expanding each byte in the source independently to two encoded bytes is simpler than Base64's expanding 3 source bytes to 4 encoded bytes.
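A minimal sketch of the four-bit approach, compared against Base64 from Python's standard library. The function name hex_encode is hypothetical, and the 48-byte sample is chosen only because it needs no Base64 padding.

```python
import base64


def hex_encode(data: bytes) -> str:
    # Each byte becomes two hexadecimal digits, independently of its neighbours.
    return "".join(format(byte, "02x") for byte in data)


data = bytes(range(48))
hex_text = hex_encode(data)
b64_text = base64.b64encode(data).decode("ascii")

# The per-byte expansion matches the built-in hexadecimal encoder.
assert hex_text == data.hex()

# 96 hexadecimal characters against 64 Base64 characters: 50% longer.
print(len(hex_text), len(b64_text), len(hex_text) / len(b64_text))
```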

Out of PETSCII's first 192 codes, 164 have visible representations when quoted: 5 (white), 17-20 and 28-31 (colors and cursor controls), 32-90 (ASCII equivalent), 91-127 (graphics), 129 (orange), 133-140 (function keys), 144-159 (colors and cursor controls), and 160-192 (graphics).[9] In theory, this permits encodings such as base128 between PETSCII-speaking machines.

References

  1. http://www.columbia.edu/kermit/ftp/boo/ckboo.txt
  2. Doupnik, Joe; da Cruz, Frank (1988-01-11). "Announcing MS-DOS Kermit 2.30". Info-Kermit Digest (Mailing list). Kermit Project, Columbia University. Retrieved 3 March 2016.
  3. da Cruz, Frank (1986-03-20). "Re: Printable Encodings for Binary Files". Info-Kermit Digest (Mailing list). Kermit Project, Columbia University. Retrieved 1 March 2016.
  4. For arbitrary data; encoding all 189 non-unreserved characters with three bytes, and the remaining 66 characters with one
  5. For text; only encoding each of the 18 reserved characters
  6. One byte stored as =XX. Encoding all but the 94 characters which don't need it (incl. space and tab)
  7. RFC 1760 "The S/KEY One-Time Password System".
  8. RFC 1751 "A Convention for Human-Readable 128-bit Keys"
  9. http://sta.c64.org/cbm64pet.html et al
This article is issued from Wikipedia - version of the 11/28/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.