About Unicode
Before Unicode was developed, there were many different encoding
systems, many of which conflicted with each other. For example, the same
number could represent different characters in different encoding
systems. Unicode provides a unique number for each character in all
supported written languages. For languages that can be written in
several scripts, Unicode provides a unique number for each character in
each supported script.
For more information about the supported languages and scripts,
see the Unicode website at http://www.unicode.org/cldr/charts/latest/supplemental/scripts_and_languages.html.
Encoding forms
There are three Unicode encoding forms: UTF-8, UTF-16, and UTF-32. UTF originally stood for Unicode Transformation Format. The acronym now appears in the names of the encoding forms, which map from the character set definition to the actual code units that represent the data, and in the names of the encoding schemes, which are encoding forms with a specific byte serialization. Each form is described in the following list; the sketch after the list shows how the forms differ in storage size.
- UTF-8 uses an unsigned byte sequence of one to four bytes to represent each Unicode character.
- UTF-16 uses one or two unsigned 16-bit code units, depending on the range of the scalar value of the character, to represent each Unicode character.
- UTF-32 uses a single unsigned 32-bit code unit to represent each Unicode character.
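For illustration, the following PowerScript sketch converts the same text to blobs in two encoding forms and compares the byte counts. It assumes PowerBuilder 10 or later, where Blob accepts an encoding argument; the text value is only an example, and UTF-32 appears only in a comment because the PowerBuilder encoding enumeration does not include a UTF-32 value.

// A minimal sketch: compare how many bytes the same text occupies
// in different encoding forms.
string ls_text
blob   lblb_utf8, lblb_utf16
long   ll_utf8_bytes, ll_utf16_bytes

ls_text = "café"   // the é is outside the ASCII range

lblb_utf8  = Blob(ls_text, EncodingUTF8!)    // one to four bytes per character
lblb_utf16 = Blob(ls_text, EncodingUTF16LE!) // one or two 16-bit code units per character

ll_utf8_bytes  = Len(lblb_utf8)   // 5 bytes: one each for c, a, f and two for é
ll_utf16_bytes = Len(lblb_utf16)  // 8 bytes: four 16-bit code units

// UTF-32 would use a single 32-bit code unit per character:
// 16 bytes for this string.
MessageBox("Encoding forms", &
    "UTF-8: "  + String(ll_utf8_bytes)  + " bytes~r~n" + &
    "UTF-16: " + String(ll_utf16_bytes) + " bytes")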
Encoding schemes
An encoding scheme specifies how the bytes in an encoding form are serialized. When you manipulate files, convert blobs and strings, and save DataWindow data in PowerBuilder, you can choose ANSI encoding or one of three Unicode encoding schemes (see the sketch after the following list):
- UTF-8 serializes a UTF-8 code unit sequence in exactly the same order as the code unit sequence itself.
- UTF-16BE serializes a UTF-16 code unit sequence as a byte sequence in big-endian format.
- UTF-16LE serializes a UTF-16 code unit sequence as a byte sequence in little-endian format.
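The sketch below shows where this choice is typically made in PowerScript. The file names, the dw_1 DataWindow control, and the sample text are placeholders; the encoding arguments are the enumerated values available in PowerBuilder 10 and later (EncodingANSI!, EncodingUTF8!, EncodingUTF16LE!, EncodingUTF16BE!).

// A minimal sketch of choosing an encoding scheme in PowerScript.
blob    lblb_data
string  ls_back
integer li_file

// Convert a string to a blob and back using the UTF-16LE scheme.
lblb_data = Blob("Sample text", EncodingUTF16LE!)
ls_back   = String(lblb_data, EncodingUTF16LE!)

// Write a file in UTF-8; FileOpen accepts the encoding as its last argument.
li_file = FileOpen("C:\temp\sample.txt", TextMode!, Write!, &
    LockWrite!, Replace!, EncodingUTF8!)
IF li_file <> -1 THEN
    FileWrite(li_file, "Sample text")
    FileClose(li_file)
END IF

// Save DataWindow data as text using the UTF-16BE scheme.
dw_1.SaveAs("C:\temp\sample_dw.txt", Text!, true, EncodingUTF16BE!)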
UTF-8 is frequently used in Web requests and responses. The big-endian format, where the most significant byte in the sequence is stored at the lowest storage address, is typically used on UNIX systems. The little-endian format, where the least significant byte in the sequence is stored at the lowest address, is used on Windows, as the example below illustrates.
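To make the byte order concrete, consider the Latin capital letter A (U+0041), which is a single 16-bit code unit in UTF-16. The byte values in the comments below follow from the scheme definitions above rather than from anything PowerBuilder-specific.

// The same code unit, serialized by the two UTF-16 encoding schemes.
blob lblb_be, lblb_le
lblb_be = Blob("A", EncodingUTF16BE!)  // bytes 00 41: most significant byte first
lblb_le = Blob("A", EncodingUTF16LE!)  // bytes 41 00: least significant byte first
// Both blobs are two bytes long; only the byte order differs.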