What is Unicode? Definition and Explanation

Definition

Unicode is an international standard for character encoding. It allows characters from most of the world's writing systems, as well as symbols and emojis, to be encoded and displayed correctly. The Unicode Standard defines several encoding forms: UTF-8, UTF-16, and UTF-32.
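As a rough illustration of how the three encoding forms differ, the Python sketch below encodes the same character in each and counts the bytes. The helper name `encoded_sizes` and the sample characters are chosen here for illustration only.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes needed to store text in each encoding form."""
    # The "-le" (little-endian) variants are used so that Python does not
    # prepend a byte order mark, which would inflate the counts.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# "A", "é", "€" and the shrug emoji, written as escapes to avoid ambiguity.
for sample in ["A", "\u00e9", "\u20ac", "\U0001F937"]:
    print(f"U+{ord(sample):04X}", encoded_sizes(sample))
```

UTF-8 uses between one and four bytes per character, UTF-16 uses two or four, and UTF-32 always uses four.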

Architecture of Unicode

Unicode allows for the encoding and display of virtually any character or symbol by assigning each character its own code in the Unicode format. On a fundamental level, it works similarly to other character encoding formats.

The problem with the many character encoding formats that came before Unicode is that most of them covered only a subset of characters. ASCII, for example, supports only the unaccented Latin letters used in English, along with digits, punctuation, and control codes. In addition, the number assigned to each character differed between formats, so selecting the wrong encoding format could display completely different characters.
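The mismatch between legacy formats can be reproduced in a few lines of Python. This is a minimal sketch: the byte value 0xE9 and the two legacy encodings (Latin-1 and Windows-1251) are picked here as examples and are not taken from any particular system.

```python
raw = bytes([0xE9])  # a single byte with the value 233

# ASCII only defines values 0-127, so this byte is not valid ASCII at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Two legacy 8-bit encodings assign different characters to the same number.
as_western = raw.decode("latin-1")   # Western European: "é" (U+00E9)
as_cyrillic = raw.decode("cp1251")   # Windows-1251 Cyrillic: "й" (U+0439)

print(ascii_ok, as_western, as_cyrillic)
```

The same stored byte produces a Latin letter under one format and a Cyrillic letter under another, which is the kind of garbling that a single shared numbering avoids.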

Unicode fixes these problems by including characters from as many writing systems as possible and assigning each character a unique number. Because Unicode has become the industry standard, these numbers are now the standard codes for the characters as well.

Planes

Unicode is divided into contiguous groups of 65,536 code points, known as planes. There are 17 planes, numbered 0 to 16, and each is further subdivided into blocks. Each plane is intended to hold loosely related characters. For example, Plane 0, the Basic Multilingual Plane, contains characters for virtually all modern languages. Plane 1, the Supplementary Multilingual Plane, contains characters from historic and extinct writing systems, among other things.
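Because every plane holds 65,536 code points, the plane of a character follows from integer division of its code point. The sketch below assumes that layout; `plane_of` is a hypothetical helper name, not part of any library.

```python
PLANE_SIZE = 0x10000  # 65,536 code points per plane

def plane_of(char: str) -> int:
    """Return the plane number (0-16) that holds a single character."""
    return ord(char) // PLANE_SIZE

# A Latin letter, a Chinese character, the shrug emoji, an Egyptian hieroglyph.
for ch in ["A", "\u4e2d", "\U0001F937", "\U00013000"]:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```

The first two samples fall in Plane 0 and the last two in Plane 1.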

Although 17 planes have been defined, not all of them contain assigned characters. There is ample room for more, and the Unicode Consortium continues to add symbols, characters, and emojis. The intention is for people of all languages to be able to use computers in their native writing system, but Unicode also includes codes for characters from extinct writing systems, like Egyptian hieroglyphs and the Khitan script once used in northern China. It is intended to be a truly universal character encoding system with a universal code for each of its symbols.

Code points

Code points are the smallest unit in Unicode, with each one usually representing a single character. A code point is a number rather than a location in memory: it tells software which character is meant, and a font then supplies the shape to display. Unicode has 1,114,112 code points (17 planes of 65,536 each), numbered from 0 to 10FFFF in hexadecimal. This works well for writing systems with discrete characters, but it becomes more difficult to encode writing systems in which what a reader sees as one character is composed of several parts, such as a base letter with combining marks or glyphs that merge together.
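The figures above, and the difficulty with composed characters, can be checked with Python's standard library. This is a small sketch; the accented "é" is used only as a convenient example of one visible character that can be made of two code points.

```python
import sys
import unicodedata

# 17 planes of 65,536 code points each give the full codespace.
total_code_points = 17 * 0x10000
highest_code_point = sys.maxunicode  # 0x10FFFF on Python 3.3 and later

# ord() maps a character to its code point; chr() goes the other way.
code_point = ord("A")
same_char = chr(code_point)

# One visible character can be built from several code points:
# "e" followed by U+0301 COMBINING ACUTE ACCENT renders as "é".
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(total_code_points, hex(highest_code_point))
print(code_point, same_char)
print(len(decomposed), len(composed))
```

The decomposed form has two code points and the normalized form has one, even though both display as the same letter.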

Unicode in practice

Virtually all major devices, from desktop PCs to tablets and mobile phones, and all major software now support Unicode. This means that any of the symbols contained within Unicode can be used when writing on a modern device.

Most of the time, the character encoding is done automatically. Your keyboard and operating system translate the keys pressed into the correct characters. But how do you write symbols that don’t have a key on your keyboard? A common example is typing emojis on a desktop PC with a physical keyboard.

Unicode code points can be written using a shorthand code form. They always start with ‘U+’, followed by four to six hexadecimal digits (0-9 and A-F). Within supported programs, such as Microsoft Word, you can type in the code, highlight it, and press ‘Alt’ + ‘X’ to convert it into the Unicode symbol. For example, the code ‘U+1F937’ generates a ‘shrug’ emoji 🤷. Unicode is now a widely accepted standard, but there are still many programs using legacy encoding systems, so this may not work everywhere.
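The ‘U+’ notation is simply a hexadecimal number, so it can be converted to a character in code as well. The sketch below is one possible way to do this in Python; `from_u_notation` and `to_u_notation` are hypothetical helper names introduced for illustration.

```python
import re

def from_u_notation(code: str) -> str:
    """Convert a string such as 'U+1F937' into the character it names."""
    match = re.fullmatch(r"U\+([0-9A-Fa-f]{4,6})", code)
    if match is None:
        raise ValueError(f"not a valid U+ code: {code!r}")
    value = int(match.group(1), 16)
    if value > 0x10FFFF:
        raise ValueError(f"outside the Unicode range: {code!r}")
    return chr(value)

def to_u_notation(char: str) -> str:
    """Format a single character as 'U+' plus at least four hex digits."""
    return f"U+{ord(char):04X}"

shrug = from_u_notation("U+1F937")
print(shrug, to_u_notation(shrug))
```

Whether the resulting character displays correctly still depends on the terminal or program having a font that contains it.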

Critical view of the Unicode standard

The proposition of encoding all extant and extinct writing systems is not an easy one, and there are some issues that come along with that. One major point of criticism comes from Unicode’s handling of CJK (Chinese, Japanese, and Korean) characters.

Several East Asian countries developed writing systems that borrowed Chinese characters in some way but introduced regional variations in both their form and meaning. Unicode adopted a system of ‘Han unification’, in which a character and its regional derivations share a single code, with the derivations treated as variations rather than as individual characters. This was done in an attempt to save space, as encoding all CJK characters individually would easily exceed 100,000 characters.

This was controversial for a number of reasons. Some groups objected to the subordination of their characters to Chinese symbols. Another issue is that the initial proposal was made by a consortium of companies and organizations from North America, but none from East Asia.

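Another issue stemming from this is that Unicode encodes characters rather than glyphs. A glyph is the specific visual shape in which a character is drawn, and in some writing systems the shape on the page is formed by combining or joining several elements. Because unified characters share one code, the regional or historical form that appears depends on the font, which makes reading and writing historical texts in older versions of CJK languages very difficult. This issue also comes up in other languages, like Arabic.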
