UTF-8 - Tru64

Unicode(5)

NAME

       Unicode,  unicode,  universal.utf8,  UCS-2,  UCS-4, UTF-8,
       UTF-16, UTF-32, iso10646 - Support  for  the  Unicode  and
       ISO/IEC 10646 standards

DESCRIPTION

       The operating system provides locales and codeset converters
       that support the following standards:

       The Unicode Standard, Version 3.0, Unicode, Inc., 2000

       The Unicode Standard, Version 3.1, Unicode, Inc., 2001

       Information Technology-Universal Multiple-Octet Coded
       Character Set, ISO/IEC 10646:2001

              The Basic Multilingual Plane defined by this standard
              is identical with the main body of Unicode character
              encoding.

       These standards define generalized character encoding rules
       that can be applied to characters in most native language
       scripts. The Unicode standard specifies a universal character
       set (UCS). Version 3.0 of the Unicode standard contains
       definitions for 49,194 characters and also includes a Private
       Use Area for vendor- or user-defined characters. Version 3.1
       of the Unicode standard adds 44,946 new character definitions,
       incorporates UTF-32 (32-bit encoding) into the standard, and
       adds three new planes beyond the 16-bit codespace of Plane 0
       (Basic Multilingual Plane). Plane 1 (Supplementary
       Multilingual Plane) contains code positions U+10000 to
       U+1FFFF; Plane 2 (Supplementary Ideographic Plane) contains
       code positions U+20000 to U+2FFFF; Plane 14 (Supplementary
       Special-Purpose Plane) contains code positions U+E0000 to
       U+EFFFF.
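
       The plane ranges listed above follow a regular pattern: each
       plane spans 65,536 (0x10000) code positions, so the plane
       number is the code position with its low 16 bits removed. The
       following Python sketch illustrates that arithmetic. The names
       plane_of and PLANE_NAMES are hypothetical, chosen for this
       example only; they are not part of the operating system.

```python
# Plane names as given in the text above (only the planes it mentions).
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    14: "Supplementary Special-Purpose Plane",
}


def plane_of(code_point):
    """Return the number of the plane that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+0000 to U+10FFFF")
    # Each plane holds 0x10000 positions, so shifting away the low
    # 16 bits leaves the plane number.
    return code_point >> 16


for cp in (0x0041, 0x10000, 0x2FFFF, 0xE0001):
    plane = plane_of(cp)
    print("U+%04X is in Plane %d (%s)" % (cp, plane, PLANE_NAMES[plane]))
```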

       See the Unicode web site  at  http://www.unicode.org/  for
       more information on the Unicode standard.  See the Unicode
       ReadMe document in  /usr/share/unidata/,  which  describes
       the  Unicode  standard  version currently supported on the
       operating system.

       The following list summarizes the main features of the
       Unicode character set:

       Characters have properties, such as base, numeric, spacing,
       combination, and directionality. The Unicode standard
       provides rules for ordering characters with different
       properties so that parsing of character sequences is
       unambiguous.

       The relationship between Unicode characters and the glyphs
       in the native language script that users see, type, or print
       is not necessarily one-to-one. A glyph may be mapped to a
       single abstract character or to a composed character.
       Conversely, more than one glyph can be mapped to a character.

       Certain sequences of Unicode characters in a text stream are
       transformed into other characters, called composed
       characters.

       The ISO 8859-1 character set occupies the first 256 code
       positions (and the ASCII character set the first 128
       positions) of the UCS.
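
       The last feature in the list can be checked directly: decoding
       each possible byte value as ISO 8859-1 should yield the code
       position with the same number. The following Python sketch
       does this with Python's own codecs ("latin-1" is Python's name
       for ISO 8859-1). It is an illustration only and does not use
       the operating system's codeset converters.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decode as ISO 8859-1 and collect the resulting code positions.
latin1_code_points = [ord(ch) for ch in all_bytes.decode("latin-1")]

# Decode the lower half as ASCII and do the same.
ascii_code_points = [ord(ch) for ch in all_bytes[:128].decode("ascii")]

# Both mappings are the identity: byte value N becomes code position N.
latin1_is_identity = latin1_code_points == list(range(256))
ascii_is_identity = ascii_code_points == list(range(128))
print(latin1_is_identity, ascii_is_identity)
```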

       The Unicode and ISO/IEC 10646 standards specify a universal
       repertoire of characters that can be used by all major
       languages and that allow character units to be processed
       for all languages under the same set of rules. Therefore,
       system support for the universal character set does not
       need to include multiple algorithms (one or more per
       language) for converting between file code and internal
       process code. However, the two different character sizes
       (16-bit or 32-bit) that the standards support require
       different parsing schemes for data input and output.
       Universal character encoding that an implementation parses
       in 16-bit units (2 octets) is known as UCS-2. Universal
       character encoding that an implementation parses in 32-bit
       units (4 octets) is known as UCS-4. UCS-4 is the canonical
       ISO/IEC 10646 encoding and is in use on systems that can
       support the larger data unit size.

       Because UCS-2 is a subset of UTF-16, the operating system
       supports UCS-2 with UTF-16 codeset converters. The operating
       system supports UCS-4 with both codeset converters and
       locales. (Keep in mind that UCS-2 cannot be used to encode
       characters outside of the Basic Multilingual Plane.)
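
       The UCS-2 restriction can be made concrete. A character in
       the Basic Multilingual Plane fits in one 16-bit unit, while
       UTF-16 represents any other character as two 16-bit units (a
       surrogate pair, described with the UTF-16 format later on
       this page). The following Python sketch shows the arithmetic
       under those rules. The function names fits_in_ucs2 and
       to_utf16_units are hypothetical and are not operating system
       interfaces.

```python
def fits_in_ucs2(code_point):
    """True if code_point can be stored in a single 16-bit unit."""
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return 0 <= code_point <= 0xFFFF and not is_surrogate


def to_utf16_units(code_point):
    """Return the list of 16-bit units UTF-16 uses for code_point."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code positions are not characters")
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("UTF-16 cannot represent this value")
    if code_point <= 0xFFFF:
        return [code_point]              # same as UCS-2
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits: U+D800 to U+DBFF
    low = 0xDC00 + (offset & 0x3FF)      # low 10 bits: U+DC00 to U+DFFF
    return [high, low]


for cp in (0x00E9, 0x20AC, 0x1D11E):
    units = " ".join("%04X" % u for u in to_utf16_units(cp))
    print("U+%04X  UCS-2 ok: %s  UTF-16 units: %s"
          % (cp, fits_in_ucs2(cp), units))
```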

       In terms of locales, the operating  system  supports  both
       Unicode and dense code. The two types of locales differ in
       their manner  of  wide  character  encoding  support.  See
       l10n_intro(5)  for  information  comparing  the two locale
       types and for information on switching between Unicode and
       dense code locales.

       The Unicode and ISO/IEC 10646 standards define a number of
       transformation formats for the universal character set
       (UTF-8 and UTF-32 are the preferred transformation formats
       for the operating system):

       UTF-8, the standard method for transforming UCS-4 process
       encoding into a sequence of 8-bit bytes and ensuring
       interchange transparency for characters in C0 code positions
       (0 to 31), the SPACE (32) character, and the DEL (127)
       character

              The operating system supports UTF-8 with both codeset
              converters and locales.

       UTF-7, an obsolete interchange format for environments that
       strip the eighth bit from each byte

              The operating system does not support UTF-7.

       UTF-1, an obsolete interchange format that is similar to
       UTF-8 but also ensures interchange transparency of
       characters in C1 code positions (128 to 159)

              The operating system does not support UTF-1.

       UTF-16, which uses the surrogate character extension
       technique defined by Version 2.0 and later of the Unicode
       standard and represents characters in 16-bit units

              UTF-16  is  a  superset  of  UCS-2.  As with UCS-2,
              UTF-16 encodes characters in the  range  U+0000  to
              U+FFFF  as  single  16-bit units. For characters in
              the range U+10000 to  U+10FFFF,  UTF-16  transforms
              them  into  a  surrogate  pair.  The result of this
              transformation is  that  the  high  surrogate  (the
              first  of  the  pair)  is  in  the  range U+D800 to
              U+DBFF, while the low surrogate (the second part of
              the  pair)  is in the range U+DC00 to U+DFFF. These
              two 16-bit values represent a single character.

              Although UTF-16 does not support representation of
              the entire UCS-4 code space (including private-use
              ranges for character values above U+10FFFF), it
              does support all characters that are currently
              defined for the languages covered by both
              standards.

              Byte  orientation  in  file  code  can  differ and,
              depending on the platform on  which  the  file  was
              generated,  can be little-endian (LE) or big-endian
              (BE).  UTF-16 uses a byte order mark  (BOM),  which
              is not part of the file text data, to indicate byte
              orientation. The code point of the BOM  is  U+FEFF.
              The  Unicode  standard  also  defines  UTF-16LE and
              UTF-16BE, which are specific to  the  little-endian
              and  big-endian  orientations, respectively, and do
              not include a byte order mark.

              The operating system supports UTF-16, UTF-16LE, and
              UTF-16BE through codeset converters. The codeset
              converter name UCS-2 is recognized as an alias for
              UTF-16*, but with a restricted repertoire of
              characters.


                                     Note

              By default, the operating system uses UTF-16 rather
              than UTF-16LE or UTF-16BE.

              In an input file, the software first looks for a
              BOM. If a BOM is not found, the converter assumes
              UTF-16BE. This means that you must explicitly
              specify UTF-16LE to the converter (convert files
              manually) when UTF-16LE applies to an input file.

              For an output file, the converter automatically
              inserts a BOM. This means that you must explicitly
              specify UTF-16LE or UTF-16BE (convert files
              manually) when you want conversion output to be
              UTF-16LE or UTF-16BE rather than UTF-16.

       UTF-32, which allows character representation in 4-byte
       encoding units

              UTF-32  is  a restricted subset of UCS-4. UTF-32 is
              restricted  in  values  to  the  range  U+0000   to
              U+10FFFF,  which  precisely  matches  the  range of
              character values defined by  UTF-16.  Like  UTF-16,
              UTF-32  does  not  support  private-use  ranges for
              character values above U+10FFFF.

              UTF-32 uses a BOM to indicate little-endian or
              big-endian byte orientation. The Unicode standard
              also defines UTF-32LE and UTF-32BE, which are
              specific to the little-endian and big-endian
              orientations, respectively, and do not include a
              BOM. As with UTF-16, big-endian is the default byte
              order when a BOM is not generated.

              UTF-32 is almost the same as UCS-4, so you can use
              UCS-4 codeset converters to process UTF-32. UCS-4
              converter software includes support for UTF-32,
              UTF-32LE, and UTF-32BE.
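
       The input rule described in the UTF-16 note above (look for a
       BOM first, otherwise assume big-endian) can be sketched in
       Python as follows. This is an illustration of the rule, not
       the converter's actual implementation. The function names are
       hypothetical, and the sketch handles UTF-16 only; it does not
       try to tell a UTF-16LE BOM apart from a UTF-32LE BOM.

```python
import codecs


def detect_utf16_byte_order(data):
    """Return (codec_name, payload) for UTF-16 file data.

    The BOM (U+FEFF) is not part of the text, so it is removed
    from the returned payload.
    """
    if data.startswith(codecs.BOM_UTF16_BE):      # bytes FE FF
        return "utf-16-be", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):      # bytes FF FE
        return "utf-16-le", data[2:]
    # No BOM: assume big-endian, as the note above describes.
    # (Python's own "utf-16" codec would assume the native byte
    # order of the machine instead.)
    return "utf-16-be", data


def decode_utf16(data):
    """Decode UTF-16 file data using the rule sketched above."""
    codec_name, payload = detect_utf16_byte_order(data)
    return payload.decode(codec_name)


print(decode_utf16(b"\xfe\xff\x00A"))   # big-endian with BOM
print(decode_utf16(b"\xff\xfeA\x00"))   # little-endian with BOM
print(decode_utf16(b"\x00A"))           # no BOM, treated as big-endian
```

       Little-endian data without a BOM would be misread under this
       rule, which is why the note says to name UTF-16LE explicitly
       for such files.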

   Codeset Conversion
       Codeset converters are available to convert data in all
       the major encoding formats that the operating system
       supports to and from UCS-2, UTF-16, UCS-4, and UTF-8. If the
       worldwide support subsets are installed on your system,
       you can enter the following commands to find the names of
       these converters:

       % cd /usr/lib/nls/loc/iconv
       % ls | grep UTF
       % ls | grep UCS

       Among the converters listed, you will find some that
       handle conversion of data in the code-page format used on PC
       systems. See code_page(5) for more information about
       converting between codeset and code-page formats. You can use
       all codeset converters with the iconv command and associated
       library functions.

                                  Note

       The mapping of Korean Hangul characters changed between
       Version 1.1 and Version 2.0 of the Unicode standard. By
       default, UTF-16, UCS-4, and UTF-8 conversion assumes
       Version 2.0 character mapping for Hangul characters.
       Therefore, if data is in Version 1.1 format, you must first
       convert the data to Version 2.0 format before converting
       from UTF-16, UCS-4, or UTF-8 to an entirely different
       format.


       The format of a codeset converter name is
       from-codeset_to-codeset. In converter names, the Version 1.1
       codeset formats for UCS-2, UCS-4, and UTF-8 are represented
       by UNICODE-1-1, UNICODE-1-1-UCS-4, and UNICODE-1-1-UTF-8,
       respectively. The Version 2.0 codeset names are represented
       by UTF-16, UCS-4, and UTF-8.

       For example, if Korean data is currently in UCS-4 Version
       1.1 format, the data must first be processed by the
       UNICODE-1-1-UCS-4_UCS-4 converter before being processed by
       the UCS-4_deckorean converter.
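
       The naming rule and the two-step Korean example above can be
       sketched as a small Python helper that builds the converter
       names for a chain of codesets. The helper only joins names
       with an underscore. It is a hypothetical illustration that
       does not run iconv and does not check whether a converter
       with that name is installed in /usr/lib/nls/loc/iconv.

```python
def converter_name(from_codeset, to_codeset):
    """Build a converter name in the from-codeset_to-codeset form."""
    return "%s_%s" % (from_codeset, to_codeset)


def conversion_chain(codesets):
    """Return the converter names needed to step through codesets."""
    # Pair each codeset with the one that follows it in the list.
    return [converter_name(a, b) for a, b in zip(codesets, codesets[1:])]


# Version 1.1 UCS-4 data, converted to Version 2.0 UCS-4, then to
# the deckorean codeset, as in the example above.
for name in conversion_chain(["UNICODE-1-1-UCS-4", "UCS-4", "deckorean"]):
    print(name)
```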

       See iconv_intro(5) for general information on codeset conversion.


   Locales
       The following locales use UTF-32 as internal processing
       code:

       universal.UTF-8

              This locale is used by applications. It converts
              data in UTF-8 file format to UCS-4 process code and
              can be used to test any UCS-4 character to
              determine if it is included in one of the following
              classes defined for the LC_CTYPE category: alnum,
              alpha, blank, cntrl, digit, graph, lower, print,
              punct, space, upper, or xdigit.

              In the universal.UTF-8 locale, the LC_MESSAGES,
              LC_MONETARY, LC_NUMERIC, and LC_TIME category
              definitions match those for the POSIX (C) locale.

       language_territory.UTF-8


              These locales limit classification information to
              the characters in a particular native language,
              make country-specific data available to the
              application, and assume file data follows UTF-8
              encoding rules. The operating system locales that
              support the euro monetary symbol use either the
              UTF-8 or ISO8859-15 codeset. See euro(5) for more
              information.

                                     Note

              The X locale database file used by applications
              running in the universal.UTF-8, en_US.UTF-8, or
              Asian locales (Chinese, Japanese, or Korean)
              contains font definitions that include all the fonts
              used with the operating system. This enables
              applications under en_US.UTF-8 to display all the
              font characters installed with Worldwide Language
              Software (WLS). Applications under the Asian locales
              display all the font characters installed with WLS,
              except for ISO8859-2, -4, -5, -7, -8, -9, -15, and
              TACTIS.

       native_locale_name

              These locales are installed in the default Unicode
              path, /usr/i18n/lib/nls/ucsloc/, and use UTF-32 as
              internal processing code. However, they differ in
              the following ways:

              The file code is specified by the codeset portion
              (for example, ISO8859-1) of native_locale_name.

              Classification information is not provided for the
              full set of UTF-32 characters, but only for those
              in a particular native language (for example,
              French). Country-specific data is also available to
              the application.

              The LC_COLLATE, LC_MESSAGES, LC_MONETARY,
              LC_NUMERIC, and LC_TIME category definitions match
              those defined in native_locale_name.

       native_locale_name@ucs4

              These locales are installed in
              /usr/i18n/lib/nls/loc/ and are the same as the
              native_locale_name locales installed in
              /usr/i18n/lib/nls/ucsloc/ except that they are not
              a complete set of locales and will not be enhanced
              in future versions of the operating system. They
              are provided for compatibility with existing
              applications. You cannot select @ucs4 locales from
              the CDE login menu; you must specify the locale name
              in the LANG environment variable.
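
       The universal.UTF-8 entry above describes two steps:
       converting UTF-8 file code to UCS-4 process code, and testing
       each resulting character against LC_CTYPE classes. The
       following Python sketch mimics both steps. It relies on an
       assumption that should be stated plainly: Python's string
       methods are used as rough stand-ins for a few of the classes,
       and they may not agree in every case with what the locale
       itself reports. The names utf8_to_ucs4, CLASS_TESTS, and
       classes_of are hypothetical.

```python
def utf8_to_ucs4(data):
    """Decode UTF-8 file code into a list of UCS-4 code values."""
    return [ord(ch) for ch in data.decode("utf-8")]


# Approximate stand-ins for some LC_CTYPE classes (assumption: these
# str methods only resemble the locale's own class definitions).
CLASS_TESTS = {
    "alnum": str.isalnum,
    "alpha": str.isalpha,
    "digit": str.isdigit,
    "lower": str.islower,
    "upper": str.isupper,
    "space": str.isspace,
    "print": str.isprintable,
}


def classes_of(code_point):
    """Return the set of class names that apply to code_point."""
    ch = chr(code_point)
    return {name for name, test in CLASS_TESTS.items() if test(ch)}


# "A", e with acute accent (two UTF-8 bytes), a space, and "9".
for cp in utf8_to_ucs4(b"A\xc3\xa9 9"):
    print("U+%04X %s" % (cp, sorted(classes_of(cp))))
```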

       CDE desktop users can select locales by choosing names
       followed by (Unicode) from the CDE language menu at
       session startup. In this case, the locale setting applies by
       default to all applications run during the CDE session.

   Unicode Character Database
       For the convenience of programmers, the source file for
       the Unicode character database is available on line. This
       source file is the one used to build the locales provided
       in optional software subsets included with the operating
       system product. When the locales are installed on your
       system, both the Unicode character database and an
       associated ReadMe file are also installed in the
       /usr/share/unidata directory. The ReadMe file discusses
       the character properties supported by Unicode.

   Font Support
       The operating system provides the following types of
       bitmap fonts for UCS characters:

       Public domain Unicode fonts:

              -etl-fixed-medium-r-normal--14-140-72-72-c-70-iso10646-1
              -etl-fixed-medium-r-normal--16-160-72-72-c-80-iso10646-1
              -etl-fixed-medium-r-normal--24-240-72-72-c-120-iso10646-1

       Composite fonts that the libfr_FGC font renderer creates by
       combining fonts available for other codesets

       Two sets of monospaced fonts (a 16x18 pixel set and a 24x24
       pixel set) for UTF-8 locales with the following CDE font
       aliases (where -n is -1, -2, -3, -4, -5, -7, -8, -9, or
       -15):

              -dt-interface-*-*-*-*-*-*-*-*-*-*-*-iso8859-n@mono
              -dt-interface-*-*-*-*-*-*-*-*-*-*-*-iso10646-1@mono


       These fonts currently cover only a subset of the
       characters in UCS. Each of the ETL public domain fonts
       supports about 1000 characters, but does not include any
       characters for Chinese, Japanese, or Korean. The composite
       fonts created by the font renderer are generated only from
       fonts available for the ISO 8859-1 (Latin-1) and ISO 8859-15
       (Latin-9) codesets.

       See iso8859-1(5) and iso8859-15(5) for the names of fonts
       available for Latin-1 and Latin-9 characters. The Latin-9
       fonts, which include glyphs for the euro character,
       provide the best support for the language_territory.UTF-8
       locales, which also support this character.

       See  i18n_printing(5)  and  wwpsof(8)  for  information on
       printer support and converting  bitmap  font  encoding  to
       PostScript.
