1993-09-14, version Ap4: Last draft
1996-02-24, version A: Final document, prepared for the IAB character set workshop 1996-02-29/1996-03-01
1996-02-26, version Ar1: Added one item to the author presentation. HTML home added. Section 1: added three limitations of plain text removed by UCS. Section 5: paragraph about privat
2. The structure of the coding space
The half-filled UCS-2. The unused UCS-4.
Cell, row, plane, group. Relation to ISO/IEC 8859-1.
UCS-2 = BMP = plane 0 of group 0.
3. Implementation levels
Level 1 (enough for Europe, the Middle East, East Asia). Level 2 (needed for South Asia). Bi-directional text. Precomposed characters, combining characters, composite sequences.
4. Adaptation to data communication needs
UCS transformation formats. UTF-8: UCS represented in
8-bit text. UTF-7: UCS-2 represented in 7-bit text.
UTF-16: Part of UCS-4 represented in UCS-2.
5. What is accepted as a character in UCS?
Existing coded character sets amalgamated. CJK
unification. Characters not shapes, not meanings.
Compatibility characters. Private use characters.
7. Annex: Overview of the BMP (group=00, plane=00)
UCS is the first officially standardized coded character set intended eventually to include all characters used in all the written languages of the world (and, in addition, all mathematical and other symbols). This is certainly a very ambitious goal, but the current first edition at least covers all major languages and all commercially important languages.
To be able to give every character of this grand repertoire a unique coded representation, the designers of UCS chose a uniform encoding, using bit sequences consisting of 16 or 31 bits (in the two coding forms, UCS-2 and UCS-4). This is the reason for the phrase "multi-octet" in the name of the standard.
Unicode is a coded character set specified by a consortium of major American computer manufacturers, primarily to overcome the chaos of different coded character sets in use when creating multilingual programs and internationalizing software. From version 1.1 on, Unicode is scrupulously kept compatible with ISO/IEC 10646 and its extensions. The consortium is also an important contributor to the ISO work to further develop ISO/IEC 10646.
In short, Unicode can be characterized as the (restricted) 2-octet form of UCS on (the most general) implementation level 3, with the addition of a more precise specification of the bi-directional behavior of characters when used in the Arabic and Hebrew scripts. Unicode is presently at version 1.1. Extensions in the forthcoming version 2.0 will also make it possible to access the wider coding space of UCS-4 within this 16-bit encoding.
UCS is intended to be usable both for internal data representation in computer systems and in data communication. UCS is already employed in commercial products from Microsoft, Novell, Apple and others. It is implemented in free software like Linux, and is proposed for inclusion in advanced data communication standards like HTML.
Strong but in my opinion ill-founded criticism has met UCS from programmer groups in Japan. It has, however, recently been adopted as a Japanese national standard.
ISO/IEC 10646 is a fundamental standard, potentially affecting almost all parts of information technology. But it specifies only a coded character set, not a complete system for text representation. It provides the basis for internationalization, but does not in itself give a complete solution of the problems in this field.
The simple kind of text for whose representation a coded character set standard is sufficient, plain text, is essentially only a linear sequence of graphic characters, with a fixed division into lines and possibly pages.
ISO/IEC 10646 and Unicode remove some assumptions often made about plain text, assumptions which simplify implementations but are untenable in multilingual text and in monolingual text in some languages:
The evolution of ISO/IEC 10646 and, in parallel, Unicode will continue for a long period of time, mostly by additions of scripts and symbol collections. This overview describes the first edition of the standard from 1993, but some of the extensions that are about to be adopted are also touched upon.
The 65536 positions in the 2-octet form of UCS are divided into 256 rows with 256 cells in each. The first octet of a character representation gives the row number, the second the cell number. The first row, row 0, contains exactly the same characters as ISO/IEC 8859-1. The first 128 characters are thus the ASCII characters. The octet representing an ISO/IEC 8859-1 character is easily transformed to the representation in UCS by putting a 0 octet in front of it. UCS includes the same control characters as ISO/IEC 8859, and these are also in row 0. An overview of the content of all rows is found in the annex.
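The row/cell split and the ISO/IEC 8859-1 conversion described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `split_ucs2` and `latin1_to_ucs2` are invented here, and the octet order (row first, then cell) is the one stated in the paragraph above.

```python
from typing import Tuple


def split_ucs2(code: int) -> Tuple[int, int]:
    """Split a 16-bit UCS-2 code position into (row, cell)."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a UCS-2 code position")
    # High octet is the row number, low octet is the cell number.
    return code >> 8, code & 0xFF


def latin1_to_ucs2(data: bytes) -> bytes:
    """Convert ISO/IEC 8859-1 octets to UCS-2 by prefixing each with a 0 octet."""
    out = bytearray()
    for octet in data:
        out += bytes((0x00, octet))  # row 0, cell = the original octet
    return bytes(out)


print(split_ucs2(0x01FA))                # row 1, cell 250
print(latin1_to_ucs2(b"\xc5b").hex())    # each octet now preceded by 00
```

Because row 0 mirrors ISO/IEC 8859-1 exactly, no lookup table is needed for the conversion; the reverse direction only works for code positions whose row number is 0.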
In the 4-octet form more than 2 billion (2147483648) different characters can be represented. (The first bit of the first octet must be 0, so only 31 of the 32 bits are used by UCS.) This coding space is subdivided into 128 groups, each containing 256 planes. The first octet in a character representation indicates the group number and the second the plane number. The third and fourth octets give the row number and the cell number of the character. Those characters that can be represented by the 2-octet form of UCS belong to plane 0 of group 0, which is called the Basic Multilingual Plane, BMP. The 4-octet representation of a character in the BMP is produced by putting two 0 octets before its 2-octet representation.
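The group/plane/row/cell structure of the 4-octet form can be illustrated the same way. The sketch below assumes nothing beyond the layout described above; `split_ucs4` and `bmp_to_ucs4` are hypothetical helper names.

```python
from typing import Tuple


def split_ucs4(code: int) -> Tuple[int, int, int, int]:
    """Split a 31-bit UCS-4 code position into (group, plane, row, cell)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS-4 uses only 31 bits; the first bit must be 0")
    group = code >> 24           # first octet: 0..127
    plane = (code >> 16) & 0xFF  # second octet
    row = (code >> 8) & 0xFF     # third octet
    cell = code & 0xFF           # fourth octet
    return group, plane, row, cell


def bmp_to_ucs4(ucs2: bytes) -> bytes:
    """Extend one 2-octet BMP representation to its 4-octet form."""
    if len(ucs2) != 2:
        raise ValueError("expected exactly two octets")
    return b"\x00\x00" + ucs2    # group 0, plane 0


# Size of the coding space: 128 groups x 256 planes x 256 rows x 256 cells.
print(128 * 256 * 256 * 256)
print(split_ucs4(0x000001FA))
print(bmp_to_ucs4(b"\x01\xfa").hex())
```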
No characters have yet been allocated to positions outside the BMP, and only the 2-octet form is used in practice.
A full implementation of the Unicode standard amounts to an implementation at level 3 of UCS.
01FA
(the simple representation that must be used on level 1 and 2)
00C5 0301
("A with ring above" + combining acute accent)
0041 030A 0301
("A" + combining ring above + combining acute accent)
(The code positions in UCS are usually given in hexadecimal
notation. 01FA indicates two octets, first the octet with the
value 1, corresponding to row 1, then the octet with the
hexadecimal value FA, corresponding to cell 250 in that row.)
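That the three representations above denote the same character can be checked mechanically. The sketch below relies on Python's standard `unicodedata` module and on Unicode normalization (forms NFC and NFD), a mechanism from later versions of Unicode that is not described in this overview; it is used here only as a convenient way to compare the sequences.

```python
import unicodedata

precomposed = "\u01FA"              # the single code position 01FA
partly_composed = "\u00C5\u0301"    # 00C5 0301
fully_decomposed = "A\u030A\u0301"  # 0041 030A 0301


def code_positions(text: str) -> str:
    """Render a string as space-separated hexadecimal code positions."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


for form in (precomposed, partly_composed, fully_decomposed):
    composed = unicodedata.normalize("NFC", form)    # compose as far as possible
    decomposed = unicodedata.normalize("NFD", form)  # decompose completely
    print(code_positions(form), "|", code_positions(composed),
          "|", code_positions(decomposed))
```

All three inputs compose to the same single code position and decompose to the same three-element composite sequence, which is the sense in which they are alternative representations of one character.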
When UCS is used in these contexts, the simple solution of just partitioning the 16-bit or 31-bit codes into 2 or 4 octets does not work. For many graphic characters this will produce octets in the ranges forbidden by the above-mentioned protocols and operating system designs.
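A short Python sketch makes the problem concrete. It assumes, for illustration only, that the forbidden ranges include the control octets 00 to 1F (hexadecimal); the exact restrictions depend on the protocol or operating system in question. The helper names are invented for this example.

```python
def naive_octets(code: int) -> bytes:
    """Partition a 16-bit UCS-2 code into two octets, row first."""
    return code.to_bytes(2, "big")


def has_control_octet(octets: bytes) -> bool:
    """True if any octet falls in the assumed forbidden range 00..1F."""
    return any(octet < 0x20 for octet in octets)


# 0041 is "A", 010A is a Latin Extended-A letter, 2122 is a letterlike symbol.
for code in (0x0041, 0x010A, 0x2122):
    octets = naive_octets(code)
    print(f"{code:04X} -> {octets.hex()} control octet: {has_control_octet(octets)}")
```

Even the plain letter "A" produces a 00 octet, and 010A produces an octet identical to the line feed control character, so software that treats such octets specially would misinterpret the data.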
For these reasons, several algorithmic transformation methods have been defined for UCS data. The UTF-1 method (UCS Transformation Format No. 1), defined in an annex to ISO/IEC 10646, is of little interest and will be withdrawn. More important are the following:
When deciding whether a graphic character should be added to UCS, the most important principle has been that a new character must differ from all already included characters both in meaning and in appearance to be accepted.
Alternative graphic forms of existing characters (font variants, glyphs) are consequently not given UCS codes of their own. In Chinese, Japanese and Korean there is a very large number of ideographic characters which have the same historical origin and only minor differences in appearance between the three languages. These national variants of the same ideographic character have been given a joint UCS code, a solution which is known as CJK unification.
On the other hand, not even a completely new way of using an existing character -- the same appearance but different meanings -- is sufficient justification to get it included in UCS as a separate character. For example the punctuation mark asterisk, "*", of considerable age in itself, has in recent years also been used as a multiplication sign in different programming languages. This case is regarded as two different uses of the same character, which is given only one UCS representation.
There are two important exceptions to the criteria for character sameness outlined above:
One important feature of UCS is that a large number of code positions are reserved for private use characters. No future revision of ISO/IEC 10646 will use these positions. There is room for 6400 private characters in the 2-octet form, and more in the 4-octet form.
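The figure of 6400 follows from the row range given for the Private Use Area in the annex overview of the BMP (rows E0 to F8). A minimal sketch of the arithmetic, with constant and function names chosen for this example only:

```python
FIRST_PRIVATE_ROW = 0xE0  # first Private Use Area row, per the annex overview
LAST_PRIVATE_ROW = 0xF8   # last Private Use Area row, per the annex overview
CELLS_PER_ROW = 256

private_rows = LAST_PRIVATE_ROW - FIRST_PRIVATE_ROW + 1
private_positions = private_rows * CELLS_PER_ROW


def is_private_use(code: int) -> bool:
    """True if a UCS-2 code position lies in the BMP Private Use Area."""
    return FIRST_PRIVATE_ROW <= (code >> 8) <= LAST_PRIVATE_ROW


print(private_rows, private_positions)
print(is_private_use(0xE000), is_private_use(0xF900))
```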
_______ ___________________________________________________________________
Row(s)  Content (script, other groups of characters, reserved area)
_______ ___________________________________________________________________
======= A-ZONE (alphabetical characters and symbols) =======================
00      (Control characters,) Basic Latin, Latin-1 Supplement (=ISO/IEC 8859-1)
01      Latin Extended-A, Latin Extended-B
02      Latin Extended-B, IPA Extensions, Spacing Modifier Letters
03      Combining Diacritical Marks, Basic Greek, Greek Symbols and Coptic
04      Cyrillic
05      Armenian, Hebrew
06      Basic Arabic, Arabic Extended
07--08  (Reserved for future standardization)
09      Devanagari, Bengali
0A      Gurmukhi, Gujarati
0B      Oriya, Tamil
0C      Telugu, Kannada
0D      Malayalam
0E      Thai, Lao
0F      (Reserved for future standardization)
10      Georgian
11      Hangul Jamo
12--1D  (Reserved for future standardization)
1E      Latin Extended Additional
1F      Greek Extended
20      General Punctuation, Super/subscripts, Currency, Combining Symbols
21      Letterlike Symbols, Number Forms, Arrows
22      Mathematical Operators
23      Miscellaneous Technical Symbols
24      Control Pictures, OCR, Enclosed Alphanumerics
25      Box Drawing, Block Elements, Geometric Shapes
26      Miscellaneous Symbols
27      Dingbats
28--2F  (Reserved for future standardization)
30      CJK Symbols and Punctuation, Hiragana, Katakana
31      Bopomofo, Hangul Compatibility Jamo, CJK Miscellaneous
32      Enclosed CJK Letters and Months
33      CJK Compatibility
34--4D  Hangul
======= I-ZONE (ideographic characters) ===================================
4E--9F  CJK Unified Ideographs
======= O-ZONE (open zone) ================================================
A0--DF  (Reserved for future standardization)
======= R-ZONE (restricted use zone) ======================================
E0--F8  (Private Use Area)
F9--FA  CJK Compatibility Ideographs
FB      Alphabetic Presentation Forms, Arabic Presentation Forms-A
FC--FD  Arabic Presentation Forms-A
FE      Combining Half Marks, CJK Compatibility Forms, Small Forms, Arabic-B
FF      Halfwidth and Fullwidth Forms, Specials