A short overview of
ISO/IEC 10646 and Unicode

By Olle Järnefors <ojarnef@admin.kth.se>

Summary

The purpose of this text is to give a brief technical overview of the new character set standard ISO/IEC 10646 and the closely related Unicode standard. I have omitted descriptions of the history of the standard as well as general talk about why a standard of this type is badly needed.

Previous knowledge

The reader should have some knowledge about coded character sets, have seen an ASCII table, and know of some 8-bit character sets, like Latin-1 (ISO/IEC 8859-1).

Document history

Various drafts of this text have previously been available over the Internet, the latest of which is version Ap4 (from 1993-09-14).

1993-09-14, version Ap4: Last draft

1996-02-24, version A: Final document, prepared for the IAB character set workshop 1996-02-29/1996-03-01

1996-02-26, version Ar1: Added one item to the author presentation. HTML home added. Section 1: added three limitations of plain text removed by UCS. Section 5: paragraph about privat

About the author

Having joined SIS-ITS/AG2 (the Swedish standardization working group corresponding to ISO/IEC JTC1/SC2 -- Character sets and information coding) in 1988, I made contributions to the Swedish comments on several drafts of the ISO/IEC 10646 standard. I also had the pleasure of taking part in the big merger of Unicode and ISO/IEC 10646 that was accomplished at three meetings during 1991 in San Francisco, Geneva and Paris, representing Sweden on the ISO side. I have also worked with character set standardization in European standardization (CEN/TC304) and within IETF. Lately, I have provided character set knowledge to, and edited, the first proposal for extending ISO/IEC 10646 with a major historical script, the Runic script.

Original home

The latest version of this text is available at
<URL:ftp://ftp.admin.kth.se/pub/misc/ucs/unicode-iso10646-oview.txta;type=A>

HTML home

An HTML version of this text is available at
<URL:http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html>

Table of contents with synopsis

1. Most important facts
ISO/IEC 10646 = UCS. Universal in scope. Multi-octet character set. Relation to Unicode. Plain and rich text

2. The structure of the coding space
The half-filled UCS-2. The unused UCS-4. Cell, row, plane, group. Relation to ISO/IEC 8859-1. UCS-2 = BMP = plane 0 of group 0.

3. Implementation levels
Level 1 (enough for Europe, the Middle East, East Asia). Level 2 (needed for South Asia). Bi-directional text. Precomposed characters, combining characters, composite sequences.

4. Adaptation to data communication needs
UCS transformation formats. UTF-8: UCS represented in 8-bit text. UTF-7: UCS-2 represented in 7-bit text. UTF-16: Part of UCS-4 represented in UCS-2.

5. What is accepted as a character in UCS?
Existing coded character sets amalgamated. CJK unification. Characters not shapes, not meanings. Compatibility characters. Private use characters.

6. References

7. Annex: Overview of the BMP (group=00, plane=00)

1. Most important facts

ISO/IEC 10646 is a relatively new character set standard, published in 1993 by the International Organization for Standardization (ISO). Its name is "Universal Multiple-Octet Coded Character Set". Throughout this overview I use its acronym, UCS.

UCS is the first officially standardized coded character set whose purpose is eventually to include all characters used in all the written languages of the world (and, in addition, all mathematical and other symbols). This is certainly a very ambitious goal, but the current first edition at least covers all major languages and all commercially important languages.

To be able to give every character of this grand repertoire a unique coded representation, the designers of UCS chose a uniform encoding, using bit sequences consisting of 16 or 31 bits (in the two coding forms, UCS-2 and UCS-4). This is the reason for the phrase "multi-octet" in the name of the standard.

Unicode is a coded character set specified by a consortium of major American computer manufacturers, primarily to overcome the chaos of different coded character sets in use when creating multilingual programs and internationalizing software. From version 1.1 on, Unicode is scrupulously kept compatible with ISO/IEC 10646 and its extensions. The consortium is also an important contributor to the ISO work to further develop ISO/IEC 10646.

In short, Unicode can be characterized as the (restricted) 2-octet form of UCS on (the most general) implementation level 3, with addition of a more precise specification of the bi-directional behavior of characters, when used in the Arabic and Hebrew scripts. Unicode is presently at version 1.1. Extensions in the soon forthcoming version 2.0 will make it possible to access also the wider coding space of UCS-4, within this 16-bit encoding.

UCS is intended to be usable both for internal data representation in computer systems and in data communication. UCS is already employed in commercial products from Microsoft, Novell, Apple and others. It is implemented in free software like Linux, and is proposed for inclusion in advanced data communication standards like HTML.

Strong but in my opinion ill-founded criticism has met UCS from programmer groups in Japan. It has, however, recently been adopted as a Japanese national standard.

ISO/IEC 10646 is a fundamental standard, potentially affecting almost all parts of information technology. But it specifies only a coded character set, not a complete system for text representation. It provides the basis for internationalization, but does not in itself give a complete solution of the problems in this field.

The simple kind of text for whose representation a coded character set standard is sufficient, plain text, is essentially only a linear sequence of graphic characters, with a fixed division into lines and possibly pages.

ISO/IEC 10646 and Unicode remove some assumptions often made about plain text, assumptions which simplify implementations but are untenable in multilingual text and in monolingual text in some languages:

For several important aspects of text, as treated in modern text processing programs, UCS needs to be supplemented by further standards or rules, so-called higher-level text protocols. Some examples of these aspects are tables, mathematical formulas, information about the language of text fragments, text variations like italic text and different text sizes, choice of particular fonts, content mark-up, document structure, hyperlinks. This is called rich text. (Some standards for rich text are HTML, SGML, Microsoft RTF.)

The evolution of ISO/IEC 10646 and, in parallel, Unicode will continue for a long period of time, mostly by additions of scripts and symbol collections. This overview describes the first edition of the standard from 1993, but some of the extensions that are about to be adopted are also touched upon.

2. The structure of the coding space

In the first version of UCS, 34203 different characters are included. Of these, 21204 are ideographic characters used in Chinese, Japanese and Korean, and 6656 are Korean Hangul syllabograms. To guarantee that the coding space will not be filled up even in the future -- 2 octets give 65536 different character positions -- a 4-octet form of UCS (UCS-4) is also defined.

The 65536 positions in the 2-octet form of UCS are divided into 256 rows with 256 cells in each. The first octet of a character representation gives the row number, the second the cell number. The first row, row 0, contains exactly the same characters as ISO/IEC 8859-1. The first 128 characters are thus the ASCII characters. The octet representing an ISO/IEC 8859-1 character is easily transformed to the representation in UCS, by putting a 0 octet in front of it. UCS includes the same control characters as ISO/IEC 8859 and these are also in row 0. An overview of the content of all rows is found in the annex.
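
The row/cell structure and the Latin-1 conversion described above can be illustrated with a minimal Python sketch. The function names latin1_to_ucs2 and row_and_cell are hypothetical and chosen only for this illustration; the sketch assumes the row octet comes first, as stated in the text.

```python
def latin1_to_ucs2(data: bytes) -> bytes:
    """Prefix every ISO/IEC 8859-1 octet with a zero octet (row 0)."""
    out = bytearray()
    for octet in data:
        out.append(0x00)   # row number: all Latin-1 characters are in row 0
        out.append(octet)  # cell number: the original Latin-1 octet
    return bytes(out)


def row_and_cell(ucs2_char: bytes):
    """Split one 2-octet character representation into (row, cell)."""
    if len(ucs2_char) != 2:
        raise ValueError("a 2-octet character representation is required")
    return ucs2_char[0], ucs2_char[1]


if __name__ == "__main__":
    ucs2 = latin1_to_ucs2(b"J\xe4rnefors")   # Latin-1 octets for "Järnefors"
    print(ucs2.hex(" "))
    print(row_and_cell(ucs2[2:4]))           # row and cell of the second character
```

The conversion never inspects the octet values, which is exactly why the relation between ISO/IEC 8859-1 and row 0 is so convenient.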

In the 4-octet form more than 2 billion (2147483648) different characters can be represented. (The first bit of the first octet must be 0 so only 31 of the 32 bits are used by UCS.) This coding space is subdivided into 128 groups, each containing 256 planes. The first octet in a character representation indicates the group number and the second the plane number. The third and fourth octets give the row number and the cell number of the character. Those characters that can be represented by the 2-octet form of UCS belong to plane 0 of group 0, which is called the Basic Multilingual Plane, BMP. The 4-octet representation of a character in the BMP is produced by putting two 0 octets before its 2-octet representation.
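
The subdivision of the 4-octet form can be sketched in Python as follows. This is only an illustration of the arithmetic described above; the names split_ucs4, ucs2_to_ucs4 and is_bmp are hypothetical and do not come from the standard.

```python
def split_ucs4(code: int):
    """Split a 31-bit UCS-4 code into (group, plane, row, cell)."""
    if not 0 <= code < 2**31:
        raise ValueError("UCS-4 codes use only 31 bits")
    group = (code >> 24) & 0x7F   # first octet, top bit always 0: 128 groups
    plane = (code >> 16) & 0xFF   # second octet: 256 planes per group
    row = (code >> 8) & 0xFF      # third octet: 256 rows per plane
    cell = code & 0xFF            # fourth octet: 256 cells per row
    return group, plane, row, cell


def ucs2_to_ucs4(ucs2_char: bytes) -> bytes:
    """Put two zero octets (group 0, plane 0) before a 2-octet character."""
    if len(ucs2_char) != 2:
        raise ValueError("a 2-octet character representation is required")
    return b"\x00\x00" + ucs2_char


def is_bmp(code: int) -> bool:
    """True if the code lies in plane 0 of group 0."""
    group, plane, _row, _cell = split_ucs4(code)
    return group == 0 and plane == 0


if __name__ == "__main__":
    print(128 * 256 * 256 * 256)       # size of the 4-octet coding space
    print(split_ucs4(0x000000E4))
    print(ucs2_to_ucs4(b"\x00\xe4").hex(" "))
```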

Still no characters have been allocated to positions outside the BMP, and only the 2-octet form is used in practice.

3. Implementation levels

Independently of the two encoding forms of UCS, the standard ISO/IEC 10646 also draws a distinction between three different implementation levels. The full coded character set is available on level 3. On the lower levels certain subsets of the characters are not usable. This restricts the range of languages that can be coded on these levels. On the other hand it makes simpler implementations possible.

A full implementation of the Unicode standard amounts to an implementation at level 3 of UCS.

4. Adaptation to data communication needs

Many data communication protocols treat octets with values in the hexadecimal range 00-1F specially; they represent control characters in most 7-bit and 8-bit character sets. The most widely used protocol for electronic mail, classical SMTP, even explicitly forbids the 128 octets above hex 7F. In certain datatypes used in data communication, e.g. domain names on the Internet, even harder restrictions are imposed on the allowed octets. In some important operating systems, notably Unix, even some octets that in ASCII represent graphic characters cannot be used in file names.

When UCS is used in these contexts, the simple solution of just partitioning the 16-bit or 31-bit codes into 2 or 4 octets does not work. For many graphic characters this will produce octets in the ranges forbidden by the above-mentioned protocols and operating system designs.
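
To make the problem concrete, the following Python sketch partitions 16-bit codes into two octets and reports those octets that fall in the ranges mentioned above (hex 00-1F and above hex 7F). The helper names ucs2_octets and troublesome are hypothetical, and the choice of exactly these two ranges is an assumption made for the illustration; real protocols differ in what they forbid.

```python
def ucs2_octets(text: str):
    """Partition each 16-bit code into a row octet and a cell octet."""
    octets = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("character is outside the 2-octet form")
        octets.append(code >> 8)     # row number
        octets.append(code & 0xFF)   # cell number
    return octets


def troublesome(octets):
    """Return the octets in the control range 00-1F or above hex 7F."""
    return [o for o in octets if o <= 0x1F or o > 0x7F]


if __name__ == "__main__":
    # Even a plain ASCII letter yields a zero octet in the 2-octet form.
    print(troublesome(ucs2_octets("A")))
    # A Greek letter from row 03 yields two problematic octets.
    print(troublesome(ucs2_octets("\u03a9")))
```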

For these reasons, several algorithmic transformation methods have been defined for UCS data. The UTF-1 method (UCS Transformation Format No. 1), defined in an annex to ISO/IEC 10646, is of little interest and will be withdrawn. More important are the following:

UTF-8 and UTF-16 will be added to ISO/IEC 10646 in the next revision of the standard, and are included in the forthcoming Unicode version 2.0. UTF-7 is a specification of IETF, the Internet Engineering Task Force, and formally unrelated to ISO/IEC 10646.
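
The effect of the three transformation formats can be observed with the codecs built into Python. Note the assumption: these codecs follow later definitions of the formats than the drafts referenced in this overview, so the sketch is illustrative only and is not a rendering of the 1996 specifications. The function names transformed and only_7bit are hypothetical.

```python
def transformed(text: str):
    """Return the text in three transformation formats, as octet strings."""
    return {
        "UTF-8": text.encode("utf-8"),
        "UTF-7": text.encode("utf-7"),
        "UTF-16": text.encode("utf-16-be"),  # big-endian, no byte-order mark
    }


def only_7bit(octets: bytes) -> bool:
    """True if no octet is above hex 7F."""
    return all(o <= 0x7F for o in octets)


if __name__ == "__main__":
    forms = transformed("K\u00e4se")
    # ASCII letters stay single, unchanged octets in UTF-8;
    # the non-ASCII letter becomes octets above hex 7F.
    print(forms["UTF-8"])
    # UTF-7 output stays within 7-bit text.
    print(forms["UTF-7"], only_7bit(forms["UTF-7"]))
    # A code outside the BMP becomes two 16-bit units in UTF-16.
    print(transformed("\U00010000")["UTF-16"].hex(" "))
```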

5. What is accepted as a character in UCS?

The character repertoire of the first version of UCS is based on an amalgamation of all internationally standardized coded character sets and the most important company-defined de facto standards for coded character sets that existed in 1991. Whenever what was deemed to be the same character was found in different coded character sets, these were unified into one character with one code in UCS. But two different characters in the same coded character set were never unified. Also the letters of some scripts with no existing standard coded character set, and vast collections of mathematical symbols, technical symbols, geometric shapes, dingbats and other conventional signs were included in the repertoire of UCS.

When deciding whether a graphic character should be added to UCS, the most important principle has been that a new character must differ from all already included characters both in meaning and in appearance to be accepted.

Alternative graphic forms of existing characters (font variants, glyphs) are consequently not given UCS codes of their own. In Chinese, Japanese and Korean there is a very large number of ideographic characters which have the same historical origin and only minor differences in appearance between the three languages. These national variants of the same ideographic character have been given a joint UCS code, a solution which is known as CJK unification.

On the other hand, not even a completely new way of using an existing character -- the same appearance but different meanings -- is sufficient justification to get it included in UCS as a separate character. For example the punctuation mark asterisk, "*", of considerable age in itself, has in recent years also been used as multiplication sign in different programming languages. This case is regarded as two different uses of the same character, which is given only one UCS representation.

There are two important exceptions from the criteria for character sameness outlined above:

What is said here is only a general outline of the principles used to identify individual characters to be given a code position in UCS and Unicode. These are unfortunately not described at all in the text of ISO/IEC 10646. In many specific cases it is of course not at all clear how to apply them. Quite a number of the decisions made are fairly arbitrary.

One important feature of UCS is that a large number of code positions are reserved for private use characters. No future revision of ISO/IEC 10646 will use these positions. There is room for 6400 private characters in the 2-octet form, and more in the 4-octet form.
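
The figure of 6400 private use positions in the 2-octet form agrees with the rows listed for the Private Use Area in the annex (rows E0 to F8, with 256 cells each). A small Python sketch of this arithmetic, with hypothetical names and the row range taken from the annex table:

```python
# Row range of the Private Use Area in the BMP, as listed in the annex.
PRIVATE_USE_FIRST_ROW = 0xE0
PRIVATE_USE_LAST_ROW = 0xF8
CELLS_PER_ROW = 256


def private_use_positions() -> int:
    """Number of private use code positions in the 2-octet form."""
    rows = PRIVATE_USE_LAST_ROW - PRIVATE_USE_FIRST_ROW + 1
    return rows * CELLS_PER_ROW


def is_bmp_private_use(code: int) -> bool:
    """True if a 16-bit code lies in one of the private use rows."""
    if not 0 <= code <= 0xFFFF:
        return False
    return PRIVATE_USE_FIRST_ROW <= (code >> 8) <= PRIVATE_USE_LAST_ROW


if __name__ == "__main__":
    print(private_use_positions())
    print(is_bmp_private_use(0xE000), is_bmp_private_use(0xF900))
```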

6. References

UCS is defined in:
ISO/IEC International Standard 10646-1:1993(E): Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. International Organization for Standardization, Geneva, 1993.

Unicode version 1.0 is defined in two books:
The Unicode Consortium: The Unicode Standard Worldwide Character Encoding. Version 1.0. Volume 1 (Architecture, non-ideographic characters). Addison-Wesley, 1991.

The Unicode Consortium: The Unicode Standard Worldwide Character Encoding. Version 1.0. Volume 2 (Ideographic characters). Addison-Wesley, 1992.

The changes made between version 1.0 and version 1.1 are specified in:
Unicode Technical Report #4: The Unicode Standard, Version 1.1. The Unicode Consortium, 1993.

Definitions of the various transformation formats proposed to be included in ISO/IEC 10646 and Unicode 2.0 are available on the Internet:
UTF-7 Encoding Form [HTML-version of RFC 1642] http://www.stonehand.com/unicode/standard/utf7.html

UCS Transformation Format 8 (UTF-8) [HTML-version of ISO-document ISO/IEC JTC1/SC2/WG2 N1036] http://www.stonehand.com/unicode/standard/wg2n1036.html

UCS Transformation Format 16 (UTF-16) [HTML-version of ISO-document ISO/IEC JTC1/SC2/WG2 N1035] http://www.stonehand.com/unicode/standard/wg2n1035.html

Internet sites with much information about Unicode:
http://www.stonehand.com/unicode/

ftp://ftp.stonehand.com/pub/

ftp://unicode.org/pub/


A good account of the history of ISO work on multi-octet character sets and the merger between ISO/IEC 10646 and Unicode can be found in:
Michael Y. Ksar: Untying tongues. ISO/IEC breaks down computer barriers in processing worldwide languages. ISO Bulletin, No. 6 (June 1993).

Annex: Overview of the BMP (group=00, plane=00)

_______ ___________________________________________________________________

Row(s)  Content (script, other groups of characters, reserved area)
_______ ___________________________________________________________________

======= A-ZONE (alphabetical characters and symbols) =======================
00      (Control characters,) Basic Latin, Latin-1 Supplement (=ISO/IEC 8859-1)
01      Latin Extended-A, Latin Extended-B
02      Latin Extended-B, IPA Extensions, Spacing Modifier Letters
03      Combining Diacritical Marks, Basic Greek, Greek Symbols and Coptic
04      Cyrillic
05      Armenian, Hebrew
06      Basic Arabic, Arabic Extended
07--08  (Reserved for future standardization)
09      Devanagari, Bengali
0A      Gurmukhi, Gujarati
0B      Oriya, Tamil
0C      Telugu, Kannada
0D      Malayalam
0E      Thai, Lao
0F      (Reserved for future standardization)
10      Georgian
11      Hangul Jamo
12--1D  (Reserved for future standardization)
1E      Latin Extended Additional
1F      Greek Extended
20      General Punctuation, Super/subscripts, Currency, Combining Symbols
21      Letterlike Symbols, Number Forms, Arrows
22      Mathematical Operators
23      Miscellaneous Technical Symbols
24      Control Pictures, OCR, Enclosed Alphanumerics
25      Box Drawing, Block Elements, Geometric Shapes
26      Miscellaneous Symbols
27      Dingbats
28--2F  (Reserved for future standardization)
30      CJK Symbols and Punctuation, Hiragana, Katakana
31      Bopomofo, Hangul Compatibility Jamo, CJK Miscellaneous
32      Enclosed CJK Letters and Months
33      CJK Compatibility
34--4D  Hangul

======= I-ZONE (ideographic characters) ===================================
4E--9F  CJK Unified Ideographs

======= O-ZONE (open zone) ================================================
A0--DF  (Reserved for future standardization)

======= R-ZONE (restricted use zone) ======================================
E0--F8  (Private Use Area)
F9--FA  CJK Compatibility Ideographs
FB      Alphabetic Presentation Forms, Arabic Presentation Forms-A
FC--FD  Arabic Presentation Forms-A
FE      Combining Half Marks, CJK Compatibility Forms, Small Forms, Arabic-B
FF      Halfwidth and Fullwidth Forms, Specials
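
The four zones in the table above are determined entirely by the row number, i.e. the first octet of the 2-octet form. As a rough sketch, a lookup could be written in Python like this; the name zone_of and the ZONES table are hypothetical, with the row ranges copied from the table.

```python
# (first row, last row, zone name), copied from the annex table.
ZONES = [
    (0x00, 0x4D, "A-zone"),   # alphabetical characters and symbols
    (0x4E, 0x9F, "I-zone"),   # ideographic characters
    (0xA0, 0xDF, "O-zone"),   # open zone
    (0xE0, 0xFF, "R-zone"),   # restricted use zone
]


def zone_of(code: int) -> str:
    """Return the BMP zone of a 16-bit code, based on its row number."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("only codes in the BMP belong to a zone")
    row = code >> 8
    for first_row, last_row, name in ZONES:
        if first_row <= row <= last_row:
            return name
    raise AssertionError("the zones cover all 256 rows")


if __name__ == "__main__":
    for code in (0x0041, 0x4E00, 0xA000, 0xE000):
        print(f"{code:04X}", zone_of(code))
```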



Author: Olle Järnefors <ojarnef@admin.kth.se>
Maintainer: Peter Svanberg <psv@nada.kth.se>
Organization: Royal Institute of Technology (KTH), Stockholm, Sweden
Version: Ar1
Document type: overview
Newest version at: ftp://ftp.admin.kth.se/pub/misc/ucs/unicode-iso10646-oview.txta
URL: http://www.nada.kth.se/i18n/unicode-iso10646-oview.html
This version updated: 1996-02-26