About Unicode

Before Unicode was developed, there were many different encoding
systems, many of which conflicted with each other. For example,
the same number could represent different characters in different
encoding systems. Unicode provides a unique number for each character
in all supported written languages. For languages that can be written
in several scripts, Unicode provides a unique number for each character
in each supported script.
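As a rough illustration of the "unique number" idea, the following Python sketch prints the code point of a few characters. Python is used here only because it exposes code points directly; the sample characters and variable names are chosen for illustration. The last two samples are the Latin and Cyrillic letters that Serbian uses for the same sound, which receive separate numbers because they belong to different scripts.

```python
# Illustrative sketch: each character has exactly one code point.
latin_zh = "\u017D"     # Latin capital letter Z with caron
cyrillic_zh = "\u0416"  # Cyrillic capital letter ZHE

samples = ["A", "\u00E9", "\u4E2D", latin_zh, cyrillic_zh]

# Map each character to its code point in the usual U+XXXX notation.
code_points = {ch: f"U+{ord(ch):04X}" for ch in samples}

for ch, cp in code_points.items():
    print(ch, cp)

# chr() is the inverse of ord(): a number maps back to one character.
round_trip = chr(0x4E2D)
```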

For more information about the supported languages and scripts,
see the Unicode Web site.

Encoding forms

There are three Unicode encoding forms: UTF-8, UTF-16, and
UTF-32. UTF originally stood for Unicode Transformation Format.
The acronym now appears in the names of both the encoding forms,
which map each character in a character set definition to the code
units that represent it, and the encoding schemes, which are encoding
forms with a specific byte serialization.

  • UTF-8 uses a sequence of one to four unsigned bytes to
    represent each Unicode character.

  • UTF-16 uses one or two unsigned 16-bit code units,
    depending on the range of the scalar value of the character, to
    represent each Unicode character.

  • UTF-32 uses a single unsigned 32-bit code unit to
    represent each Unicode character.
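The difference between the three forms can be seen by counting code units per character. The sketch below is a minimal illustration in Python, not PowerBuilder code; the helper name code_unit_counts is hypothetical. It encodes one character at a time and divides the byte length by the size of the code unit.

```python
def code_unit_counts(ch: str) -> dict:
    """Return how many code units each encoding form needs for one character."""
    return {
        "UTF-8": len(ch.encode("utf-8")),             # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,   # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,   # 32-bit code units
    }

# U+0041, U+00E9, U+20AC (euro sign), U+1F600 (outside the 16-bit range)
for ch in ["A", "\u00E9", "\u20AC", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_unit_counts(ch))
```

Characters above U+FFFF, such as the last sample, need two UTF-16 code units, while UTF-32 always needs exactly one.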

Encoding schemes

An encoding scheme specifies how the code units of an encoding
form are serialized into bytes. When you manipulate files, convert
blobs and strings, and save DataWindow data in PowerBuilder, you can
choose to use ANSI encoding or one of three Unicode encoding schemes:

  • UTF-8 serializes a UTF-8 code unit sequence in exactly
    the same order as the code unit sequence itself.

  • UTF-16BE serializes a UTF-16 code unit sequence
    as a byte sequence in big-endian format.

  • UTF-16LE serializes a UTF-16 code unit sequence
    as a byte sequence in little-endian format.
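To make the three schemes concrete, the following sketch serializes the same two-character string under each of them. It uses Python's built-in codecs rather than PowerBuilder functions, and the variable names are illustrative only.

```python
text = "A\u20AC"  # "A" followed by the euro sign (U+20AC)

utf8 = text.encode("utf-8")         # bytes in code unit order
utf16be = text.encode("utf-16-be")  # high byte of each code unit first
utf16le = text.encode("utf-16-le")  # low byte of each code unit first

print("UTF-8   ", utf8.hex())
print("UTF-16BE", utf16be.hex())
print("UTF-16LE", utf16le.hex())

# Decoding with the matching scheme recovers the original string.
decoded = utf16le.decode("utf-16-le")
```

The two UTF-16 results contain the same code units (0041 and 20AC); only the order of the two bytes inside each code unit differs.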

UTF-8 is frequently used in Web requests and responses. The
big-endian format, where the most significant byte of each code
unit is stored at the lowest storage address, is typically used
on UNIX systems. The little-endian format, where the least significant
byte of each code unit is stored first, is used on Windows.
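The two byte orders can also be shown for a single 16-bit value. This is a small Python sketch under the assumption that the value is treated as an unsigned integer; sys.byteorder reports the native order of whatever machine runs it, which depends on the hardware.

```python
import sys

code_unit = 0x20AC  # one 16-bit value, used here as a sample

big = code_unit.to_bytes(2, "big")        # most significant byte first
little = code_unit.to_bytes(2, "little")  # least significant byte first

print("big-endian:   ", big.hex())
print("little-endian:", little.hex())
print("native order: ", sys.byteorder)   # "little" or "big"
```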

