Unicode, unicode, universal.utf8, UCS-2, UCS-4, UTF-8,
UTF-16, UTF-32, iso10646 - Support for the Unicode and
ISO/IEC 10646 standards
The operating system provides locales and codeset converters
that support the following standards:
The Unicode Standard, Version 3.0, Unicode, Inc., 2000
The Unicode Standard, Version 3.1, Unicode, Inc., 2001
Information Technology-Universal Multiple-Octet Coded
Character Set, ISO/IEC 10646:2001
The Basic Multilingual Plane defined by this standard
is identical with the main body of Unicode
character encoding.
These standards define generalized character encoding
rules that can be applied to characters in most native
language scripts. The Unicode standard specifies a universal
character set (UCS). Version 3.0 of the Unicode standard
contains definitions for 49,194 characters and also
includes a Private Use Area for vendor- or user-defined
characters. Version 3.1 of the Unicode standard adds
44,946 new character definitions, incorporates UTF-32
(32-bit encoding) into the standard, and adds three new
planes beyond the 16-bit codespace of Plane 0 (Basic Multilingual
Plane). Plane 1 (Supplementary Multilingual
Plane) contains code positions U+10000 to U+1FFFF; Plane 2
(Supplementary Ideographic Plane) contains code positions
U+20000 to U+2FFFF; Plane 14 (Supplementary Special-Purpose
Plane) contains code positions U+E0000 to U+EFFFF.
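The plane layout described above follows directly from the code position: each plane holds 0x10000 positions, so the plane number is the code position shifted right by 16 bits. The following Python sketch illustrates this; the function names are hypothetical and are not part of any operating system interface.

```python
# Names of the planes mentioned in the text; other planes are reported by number.
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    14: "Supplementary Special-Purpose Plane",
}


def plane_of(code_point):
    """Return the number of the plane that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    # Each plane spans 0x10000 code positions, so the plane is the high bits.
    return code_point >> 16


def plane_name(code_point):
    """Return a readable name for the plane that contains code_point."""
    plane = plane_of(code_point)
    return PLANE_NAMES.get(plane, "Plane %d" % plane)
```

For example, plane_of(0x20000) returns 2, the Supplementary Ideographic Plane.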
See the Unicode web site at http://www.unicode.org/ for
more information on the Unicode standard. See the Unicode
ReadMe document in /usr/share/unidata/, which describes
the Unicode standard version currently supported on the
operating system.
The following list summarizes the main features of the
Unicode character set:
Characters have properties, such as base, numeric, spacing,
combination, and directionality.
The Unicode standard provides rules for ordering characters
with different properties so that parsing of character
sequences is unambiguous.
The relationship between Unicode characters and the glyphs
in the native language script that users see, type, or
print is not necessarily one-to-one. A glyph may be mapped
to a single abstract character or to a composed character.
Conversely, more than one glyph can be mapped to a character.
Certain sequences of Unicode characters in a text stream are
transformed into other characters, called composed characters.
The ISO 8859-1 character set occupies the first 256
code positions (and the ASCII character set the first 128
positions) of the UCS.
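Two of the features listed above can be checked with a short Python sketch. It relies on Python's built-in codecs and its unicodedata module, not on the operating system's locales or converters, and it uses NFC normalization as one illustrative form of composition; the text above does not say which composition rules the operating system applies.

```python
import unicodedata

# Every ISO 8859-1 byte value decodes to the code position with the same number.
latin1_bytes = bytes(range(256))
code_points = [ord(ch) for ch in latin1_bytes.decode("latin-1")]

# The ASCII character set covers the first 128 of those positions.
ascii_points = [ord(ch) for ch in bytes(range(128)).decode("ascii")]

# A base letter followed by a combining mark can be replaced by one
# composed character (here through NFC normalization, as an assumption).
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
```

Here code_points equals the integers 0 through 255 in order, and composed is the single character U+00E9.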
The Unicode and ISO/IEC 10646 standards specify a universal
repertoire of characters that can be used by all major
languages and that allow character units to be processed
for all languages under the same set of rules. Therefore,
system support for the universal character set does not
need to include multiple algorithms (one or more per
language) for converting between file code and internal
process code. However, the two different character sizes
(16-bit or 32-bit) that the standards support require different
parsing schemes for data input and output. Universal
character encoding that an implementation parses in
16-bit units (2 octets) is known as UCS-2. Universal character
encoding that an implementation parses in 32-bit
units (4 octets) is known as UCS-4. This is the canonical
ISO/IEC 10646 encoding that is in use on systems that can
support the larger data unit size.
Because UCS-2 is a subset of UTF-16, the operating system
supports UCS-2 with UTF-16 codeset converters. The operating
system supports UCS-4 with both codeset converters and
locales. (Keep in mind that UCS-2 cannot be used to encode
characters outside of the Basic Multilingual Plane.)
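The difference between the two unit sizes can be sketched in Python as follows. The function names are hypothetical, and big-endian byte order is assumed purely for illustration.

```python
import struct


def encode_ucs4_be(text):
    """Encode each character as one big-endian 32-bit unit (4 octets)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


def encode_ucs2_be(text):
    """Encode each character as one big-endian 16-bit unit (2 octets).

    Characters outside the Basic Multilingual Plane do not fit in a
    single 16-bit unit and are rejected.
    """
    units = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            raise ValueError("U+%04X is outside the Basic Multilingual Plane"
                             % code_point)
        units.append(struct.pack(">H", code_point))
    return b"".join(units)
```

With these definitions, encode_ucs2_be fails for U+10000, while encode_ucs4_be encodes it as the four octets 00 01 00 00.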
In terms of locales, the operating system supports both
Unicode and dense code. The two types of locales differ in
their manner of wide character encoding support. See
l10n_intro(5) for information comparing the two locale
types and for information on switching between Unicode and
dense code locales.
The Unicode and ISO/IEC 10646 standards define a number of
transformation formats for the universal character set
(UTF-8 and UTF-32 are the preferred transformation formats
for the operating system):
UTF-8, the standard method for transforming UCS-4 process
encoding into a sequence of 8-bit bytes and ensuring
interchange transparency for characters in C0 code
positions (0 to 31), the SPACE (32) character, and the
DEL (127) character
The operating system supports UTF-8 with both codeset
converters and locales.
UTF-7, an obsolete interchange format for environments
that strip the eighth bit from each byte
The operating system does not support UTF-7.
UTF-1, an obsolete interchange format that is similar
to UTF-8 but also ensures interchange transparency
of characters in C1 code positions (128 to 159)
The operating system does not support UTF-1.
UTF-16, which uses the surrogate character extension
technique defined by Version 2.0 and later of
the Unicode standard and represents characters in
16-bit units
UTF-16 is a superset of UCS-2. As with UCS-2,
UTF-16 encodes characters in the range U+0000 to
U+FFFF as single 16-bit units. For characters in
the range U+10000 to U+10FFFF, UTF-16 transforms
them into a surrogate pair. The result of this
transformation is that the high surrogate (the
first of the pair) is in the range U+D800 to
U+DBFF, while the low surrogate (the second part of
the pair) is in the range U+DC00 to U+DFFF. These
two 16-bit values represent a single character.
Although UTF-16 does not support representation of
the entire UCS-4 code space (including private-use
ranges for character values above U+10FFFF), it
does support all characters that are currently
defined for the languages covered by both
standards.
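The surrogate pair transformation described above is simple arithmetic on the 20-bit offset of the character from U+10000. The following Python sketch shows both directions; the function names are hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a character in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # upper 10 bits, U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)     # lower 10 bits, U+DC00..U+DFFF
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair into a single character value."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate out of range")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate out of range")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, U+1D11E becomes the pair U+D834, U+DD1E, and the last representable character, U+10FFFF, becomes U+DBFF, U+DFFF.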
Byte orientation in file code can differ and,
depending on the platform on which the file was
generated, can be little-endian (LE) or big-endian
(BE). UTF-16 uses a byte order mark (BOM), which
is not part of the file text data, to indicate byte
orientation. The code point of the BOM is U+FEFF.
The Unicode standard also defines UTF-16LE and
UTF-16BE, which are specific to the little-endian
and big-endian orientations, respectively, and do
not include a byte order mark.
The operating system supports UTF-16, UTF-16LE, and
UTF-16BE through codeset converters. The codeset
converter name UCS-2 is recognized as an alias for
UTF-16*, but with a restricted repertoire of characters.
Note
By default, the operating system uses UTF-16 rather
than UTF-16LE or UTF-16BE.
In an input file, the software first looks for a
BOM. If a BOM is not found, the converter assumes
UTF-16BE. This means that you must explicitly specify
UTF-16LE to the converter (convert files manually)
when UTF-16LE applies to an input file.
For an output file, the converter automatically
inserts a BOM. This means that you must explicitly
specify UTF-16LE or UTF-16BE (convert files manually)
when you want conversion output to be
UTF-16LE or UTF-16BE rather than UTF-16.
UTF-32 allows character representation in 4-byte encoding
units
UTF-32 is a restricted subset of UCS-4. UTF-32 is
restricted in values to the range U+0000 to
U+10FFFF, which precisely matches the range of
character values defined by UTF-16. Like UTF-16,
UTF-32 does not support private-use ranges for
character values above U+10FFFF.
UTF-32 uses a BOM to indicate little-endian or big-endian
byte orientation. The Unicode standard also
defines UTF-32LE and UTF-32BE, which are specific
to the little-endian and big-endian orientations,
respectively, and do not include a BOM. As with
UTF-16, big-endian is the default byte order when a
BOM is not generated.
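The byte order rules for UTF-16 and UTF-32 input can be sketched in Python as follows: look for the BOM (U+FEFF) at the start of the data, and fall back to big-endian when none is found. This is an illustration of the rule only, not the converter's actual implementation, and the function names are hypothetical.

```python
def detect_byte_order(data, unit_size):
    """Return (byte_order, bom_length) for UTF-16 or UTF-32 data.

    unit_size is 2 for UTF-16 and 4 for UTF-32. byte_order is "big"
    or "little"; bom_length is the number of leading bytes to skip.
    """
    if unit_size not in (2, 4):
        raise ValueError("unit_size must be 2 (UTF-16) or 4 (UTF-32)")
    bom_big = (0xFEFF).to_bytes(unit_size, "big")
    bom_little = (0xFEFF).to_bytes(unit_size, "little")
    if data.startswith(bom_big):
        return "big", unit_size
    if data.startswith(bom_little):
        return "little", unit_size
    return "big", 0  # no BOM found: big-endian is assumed


def decode_units(data, unit_size):
    """Return the encoding units in data as integers, honoring any BOM."""
    byte_order, skip = detect_byte_order(data, unit_size)
    body = data[skip:]
    return [int.from_bytes(body[i:i + unit_size], byte_order)
            for i in range(0, len(body), unit_size)]
```

The unit size must be supplied by the caller because the little-endian UTF-32 BOM (FF FE 00 00) begins with the same two bytes as the little-endian UTF-16 BOM (FF FE).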
UTF-32 is almost the same as UCS-4, so you can use
UCS-4 codeset converters to process UTF-32. UCS-4
converter software includes support for UTF-32,
UTF-32LE, and UTF-32BE.
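Of the transformation formats listed above, UTF-8 is the one that leaves the C0 positions, SPACE, and DEL (and all other ASCII values) as single unchanged bytes. The following Python sketch shows the byte layout for one character value; it is limited to the range U+0000 to U+10FFFF as an assumption, does not reject surrogate values, and its function name is hypothetical.

```python
def utf8_encode(code_point):
    """Return the UTF-8 byte sequence for one character value."""
    if code_point < 0:
        raise ValueError("negative character value")
    if code_point < 0x80:
        # 7-bit values, including C0, SPACE, and DEL, pass through unchanged.
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("character value above U+10FFFF")
```

Every byte of a multibyte sequence has its eighth bit set, so a byte in the range 0 to 127 always represents itself.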
Codeset Conversion
Codeset converters are available to convert data in all
the major encoding formats that the operating system supports
to and from UCS-2, UTF-16, UCS-4, and UTF-8. If the
worldwide support subsets are installed on your system,
you can enter the following commands to find the names of
these converters:
% cd /usr/lib/nls/loc/iconv
% ls | grep UTF
% ls | grep UCS
Among the converters listed, you will find some that handle
conversion of data in the code-page format used on PC
systems. See code_page(5) for more information about converting
between codeset and code-page formats. You can use
all codeset converters with the iconv command and associated
library functions.
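The directory listing commands shown earlier in this section can also be approximated from a program. The following Python sketch returns the entries of a directory whose names contain UTF or UCS; the function name is hypothetical, and the directory to search is supplied by the caller.

```python
import os


def list_unicode_converters(directory):
    """Return the sorted entry names in directory that contain UTF or UCS."""
    return sorted(name for name in os.listdir(directory)
                  if "UTF" in name or "UCS" in name)
```

Like the grep commands, the match is a case-sensitive substring test, so a name that contains both strings is reported once.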
Note
The mapping of Korean Hangul characters changed between
Version 1.1 and Version 2.0 of the Unicode standard. By
default, UTF-16, UCS-4, and UTF-8 conversion assumes Version
2.0 character mapping for Hangul characters. Therefore,
if data is in Version 1.1 format, you must first
convert the data to Version 2.0 format before converting
from UTF-16, UCS-4, or UTF-8 to an entirely different format.
The format of a codeset converter name is from-codeset_to-codeset.
In converter names, the Version 1.1 codeset formats
for UCS-2, UCS-4, and UTF-8 are represented by UNICODE-1-1,
UNICODE-1-1-UCS-4, and UNICODE-1-1-UTF-8,
respectively. The Version 2.0 codeset names are represented
by UTF-16, UCS-4, and UTF-8.
For example, if Korean data is currently in UCS-4 Version
1.1 format, the data must first be processed by the UNICODE-1-1-UCS-4_UCS-4
converter before being processed by
the UCS-4_deckorean converter.
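The naming rule and the two-step Korean conversion can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that the source codeset name contains no underscore, and that each Version 1.1 codeset is upgraded to the Version 2.0 name listed beside it above; only the UNICODE-1-1-UCS-4_UCS-4 converter is named explicitly in the text.

```python
# Version 1.1 codeset names and the Version 2.0 names they correspond to.
# Pairing them into converter names is an assumption for the first and
# last entries; the text confirms only the UCS-4 case.
VERSION_1_1_UPGRADES = {
    "UNICODE-1-1": "UTF-16",
    "UNICODE-1-1-UCS-4": "UCS-4",
    "UNICODE-1-1-UTF-8": "UTF-8",
}


def converter_name(from_codeset, to_codeset):
    """Build a converter name in the form from-codeset_to-codeset."""
    return "%s_%s" % (from_codeset, to_codeset)


def split_converter_name(name):
    """Split a converter name at its first underscore."""
    from_codeset, separator, to_codeset = name.partition("_")
    if not separator or not from_codeset or not to_codeset:
        raise ValueError("not a from-codeset_to-codeset name: %r" % name)
    return from_codeset, to_codeset


def conversion_chain(source_codeset, target_codeset):
    """Return the converter names to apply, in order."""
    upgraded = VERSION_1_1_UPGRADES.get(source_codeset)
    if upgraded is None:
        return [converter_name(source_codeset, target_codeset)]
    # Version 1.1 data is first converted to the Version 2.0 format.
    return [converter_name(source_codeset, upgraded),
            converter_name(upgraded, target_codeset)]
```

For the example above, conversion_chain("UNICODE-1-1-UCS-4", "deckorean") yields UNICODE-1-1-UCS-4_UCS-4 followed by UCS-4_deckorean.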
See iconv_intro(5) for general information on codeset conversion.
Locales
The following locales use UTF-32 as internal processing
code:
universal.UTF-8
This locale is used by applications. It converts
data in UTF-8 file format to UCS-4 process code and
can be used to test any UCS-4 character to determine
if it is included in one of the following
classes defined for the LC_CTYPE category: alnum,
alpha, blank, cntrl, digit, graph, lower, print,
punct, space, upper, or xdigit.
In the universal.UTF-8 locale, the LC_MESSAGES,
LC_MONETARY, LC_NUMERIC, and LC_TIME category definitions
match those for the POSIX (C) locale.
language_territory.UTF-8
These locales limit classification information to
the characters in a particular native language,
make country-specific data available to the application,
and assume file data follows UTF-8 encoding
rules. The operating system locales that support
the euro monetary symbol use either the UTF-8 or
ISO8859-15 codeset. See euro(5) for more information.
Note
The X locale database file used by applications
running in the universal.UTF-8, en_US.UTF-8, or
Asian locales (Chinese, Japanese, or Korean) contains
font definitions that include all the fonts
used with the operating system. This enables applications
under en_US.UTF-8 to display all the font
characters installed with Worldwide Language Software
(WLS). Applications under the Asian locales
display all the font characters installed with WLS,
except for ISO8859-2, -4, -5, -7, -8, -9, -15, and
TACTIS.
native_locale_name
These locales are installed in the default Unicode
path, /usr/i18n/lib/nls/ucsloc/ and use UTF-32 as
internal processing code. However, they differ in
the following ways:
The file code is specified by the codeset portion (for
example, ISO8859-1) of native_locale_name.
Classification information is not provided for the full
set of UTF-32 characters, but only for those in a
particular native language (for example, French).
Country-specific data is also available to the application.
The LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and
LC_TIME category definitions match those defined in
native_locale_name.
native_locale_name@ucs4
These locales are installed in
/usr/i18n/lib/nls/loc/ and are the same as the
native_locale_name locales installed in
/usr/i18n/lib/nls/ucsloc/ except that they are not
a complete set of locales and will not be enhanced
in future versions of the operating system. They
are provided for compatibility with existing applications.
You cannot select @ucs4 locales from the
CDE login menu; you must specify the locale name in
the LANG environment variable.
CDE desktop users can select locales by choosing names
followed by (Unicode) from the CDE language menu at session
startup. In this case, the locale setting applies by
default to all applications run during the CDE session.
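The character class tests listed for the universal.UTF-8 locale can be illustrated by analogy in Python. The sketch below uses Python's own Unicode tables rather than the operating system's LC_CTYPE definitions, so class membership may differ from what a locale reports, and it covers only some of the classes. The function name is hypothetical.

```python
def character_classes(ch):
    """Return the set of class names that apply to a one-character string."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    tests = {
        "alnum": ch.isalnum(),
        "alpha": ch.isalpha(),
        "digit": ch.isdigit(),
        "lower": ch.islower(),
        "upper": ch.isupper(),
        "space": ch.isspace(),
        "print": ch.isprintable(),
    }
    return {name for name, matched in tests.items() if matched}
```

The same function works for any character value, which mirrors the idea of testing any UCS-4 character against one set of class definitions.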
Unicode Character Database
For the convenience of programmers, the source file for
the Unicode character database is available online. This
source file is the one used to build the locales provided
in optional software subsets included with the operating
system product. When the locales are installed on your
system, both the Unicode character database and an associated
ReadMe file are also installed in the
/usr/share/unidata directory. The ReadMe file discusses
the character properties supported by Unicode.
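As a minimal sketch of reading such a file, the following Python function parses one record, assuming the installed database uses the semicolon-delimited layout of the UnicodeData file published by the Unicode Consortium. The name of the file in /usr/share/unidata is not given here, so the sketch works on a line of text rather than opening a file. The field labels and the function name are hypothetical.

```python
# Labels for the first ten fields of a record, in order (assumed layout).
FIELD_NAMES = (
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal_digit", "digit", "numeric", "mirrored",
)


def parse_unicode_data_line(line):
    """Parse one semicolon-delimited record into a dictionary."""
    fields = line.rstrip("\n").split(";")
    if len(fields) < len(FIELD_NAMES):
        raise ValueError("too few fields in record")
    record = dict(zip(FIELD_NAMES, fields))
    record["code"] = int(record["code"], 16)            # hexadecimal value
    record["combining_class"] = int(record["combining_class"])
    return record


sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
record = parse_unicode_data_line(sample)
```

For the sample record, the parsed code is 0x41, the general category is Lu (uppercase letter), and the directionality class is L (left-to-right).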
Font Support
The operating system provides the following types of
bitmap fonts for UCS characters:
Public domain Unicode fonts:
-etl-fixed-medium-r-normal--14-140-72-72-c-70-iso10646-1
-etl-fixed-medium-r-normal--16-160-72-72-c-80-iso10646-1
-etl-fixed-medium-r-normal--24-240-72-72-c-120-iso10646-1
Composite fonts that the libfr_FGC font renderer creates
by combining fonts available for other codesets
Two sets of monospaced fonts (a 16x18 pixel set and a
24x24 pixel set) for UTF-8 locales with the following
CDE font aliases (where -n is -1, -2, -3, -4, -5, -7,
-8, -9, or -15):
-dt-interface-*-*-*-*-*-*-*-*-*-*-*-iso8859-n@mono
-dt-interface-*-*-*-*-*-*-*-*-*-*-*-iso10646-1@mono
These fonts currently cover only a subset of the characters
in UCS. Each of the ETL public domain fonts supports
about 1000 characters, but does not include any characters
for Chinese, Japanese, or Korean. The composite fonts created
by the font renderer are generated only from fonts
available for the ISO 8859-1 (Latin-1) and ISO 8859-15
(Latin-9) codesets.
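The public domain font names above follow the X logical font description layout of fourteen hyphen-separated fields, the last two of which identify the character set (iso10646-1). The following Python sketch splits such a name into its fields; the field labels and the function name are hypothetical, and the sketch is meant for fully specified names rather than the wildcard aliases.

```python
XLFD_FIELDS = (
    "foundry", "family", "weight", "slant", "setwidth", "add_style",
    "pixel_size", "point_size", "resolution_x", "resolution_y",
    "spacing", "average_width", "registry", "encoding",
)


def parse_font_name(name):
    """Split a fully specified font name into its fourteen fields."""
    parts = name.split("-")
    # A leading hyphen produces an empty first element before the fields.
    if len(parts) != len(XLFD_FIELDS) + 1 or parts[0] != "":
        raise ValueError("expected a leading hyphen and 14 fields")
    return dict(zip(XLFD_FIELDS, parts[1:]))


font = parse_font_name(
    "-etl-fixed-medium-r-normal--14-140-72-72-c-70-iso10646-1")
```

For this font, the registry and encoding fields are iso10646 and 1, the pixel size is 14, and the point size field of 140 is expressed in tenths of a point.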
See iso8859-1(5) and iso8859-15(5) for the names of fonts
available for Latin-1 and Latin-9 characters. The Latin-9
fonts, which include glyphs for the euro character, provide
the best support for the language_territory.UTF-8
locales, which also support this character.
See i18n_printing(5) and wwpsof(8) for information on
printer support and converting bitmap font encoding to
PostScript.
Commands: locale(1), wwpsof(8)
Others: ascii(5), code_page(5), iso8859-1(5),
iso8859-15(5), i18n_intro(5), i18n_printing(5),
iconv_intro(5), l10n_intro(5)
Using International Software
Unicode(5)