Using Unicode
Unicode is a character encoding standard that enables text display
for most of the world’s languages. Support for Unicode
characters is built into PowerBuilder. This means that you can display
characters from multiple languages on the same page of your application,
create a flexible user interface suitable for deployment to different
countries, and process data in multiple languages.
About Unicode
Before Unicode was developed, there were many different encoding
systems, many of which conflicted with each other. For example,
the same number could represent different characters in different
encoding systems. Unicode provides a unique number for each character
in all supported written languages. For languages that can be written
in several scripts, Unicode provides a unique number for each character
in each supported script.
For more information about the supported languages and scripts,
see the Unicode Web site.
Encoding forms
There are three Unicode encoding forms: UTF-8, UTF-16, and
UTF-32. UTF originally stood for Unicode Transformation Format.
The acronym now appears in the names of both the encoding forms, which
map each character's number to the code units that represent the
data, and the encoding schemes, which are encoding forms with a
specific byte serialization.
- UTF-8 uses an unsigned byte sequence of one to four bytes to
  represent each Unicode character.
- UTF-16 uses one or two unsigned 16-bit code units, depending on
  the range of the scalar value of the character, to represent each
  Unicode character.
- UTF-32 uses a single unsigned 32-bit code unit to represent each
  Unicode character.
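The code unit counts described above can be checked with a short, language-neutral sketch. It is written in Python rather than PowerScript purely for illustration; the helper name code_units and the sample characters are assumptions made for this example.

```python
def code_units(ch: str) -> dict:
    """Return how many code units each encoding form needs for one character."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# Sample characters: Latin A, e with acute accent, a CJK ideograph, an emoji.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

Characters outside the Basic Multilingual Plane, such as U+1F600, are the ones that need two UTF-16 code units (a surrogate pair) and four UTF-8 bytes.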
Encoding schemes
An encoding scheme specifies how the bytes in an encoding
form are serialized. When you manipulate files, convert blobs and
strings, and save DataWindow data in PowerBuilder, you can choose
to use ANSI encoding, or one of three Unicode encoding schemes:
- UTF-8 serializes a UTF-8 code unit sequence in exactly
  the same order as the code unit sequence itself.
- UTF-16BE serializes a UTF-16 code unit sequence
  as a byte sequence in big-endian format.
- UTF-16LE serializes a UTF-16 code unit sequence
  as a byte sequence in little-endian format.
UTF-8 is frequently used in Web requests and responses. The
big-endian format, where the most significant byte of each code unit
is stored at the lowest storage address, is typically used
on UNIX systems. The little-endian format, where the least significant
byte of each code unit is stored first, is used on Windows.
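To make the byte order difference concrete, the following sketch serializes the same two characters with each of the three Unicode encoding schemes. It is an illustration in Python, not PowerScript; the helper name hex_bytes and the sample text are assumptions made for this example.

```python
def hex_bytes(text: str, scheme: str) -> str:
    """Serialize text with the given encoding scheme and show the bytes in hex."""
    return text.encode(scheme).hex(" ")


sample = "A\u4e2d"  # U+0041 followed by U+4E2D

print(hex_bytes(sample, "utf-8"))      # 41 e4 b8 ad
print(hex_bytes(sample, "utf-16-be"))  # 00 41 4e 2d
print(hex_bytes(sample, "utf-16-le"))  # 41 00 2d 4e
```

The two UTF-16 results contain the same 16-bit code units; only the order of the two bytes inside each code unit differs.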
Unicode support in PowerBuilder
PowerBuilder uses UTF-16LE encoding internally. The source
code in PBLs is encoded in UTF-16LE, any text entered in an application
is automatically converted to Unicode, and the string and character PowerScript
datatypes hold Unicode data only. Any ANSI or DBCS characters assigned
to these datatypes are converted internally to Unicode encoding.
Support for Unicode databases
Most PowerBuilder database interfaces support both ANSI and
Unicode databases.
A Unicode database is a database whose character set is set
to a Unicode format, such as UTF-8 or UTF-16. All data in the database
is in Unicode format, and any data saved to the database must be
converted to Unicode data implicitly or explicitly.
A database that uses ANSI (or DBCS) as its character set can
use special datatypes to store Unicode data. These datatypes are
NChar, NVarChar, and NVarChar2. Columns with one of these datatypes
can store Unicode data, but data saved to such a column must be
converted to Unicode explicitly.
For more specific information about each interface, see Connecting
to Your Database.
String functions
PowerBuilder string functions, such as Fill, Len, Mid,
and Pos, measure their parameters and return values in characters
rather than bytes, and return the same results in all environments.
These functions have a “wide” version (such as FillW)
that is obsolete and will be removed in a future version of PowerBuilder
because it produces the same results as the standard version of
the function. Some of these functions also have an ANSI version
(such as FillA). This version is provided for
backwards compatibility for users in DBCS environments who used
the standard version of the string function in previous versions
of PowerBuilder to return bytes instead of characters.
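The gap between a character count and a byte count is what separates the standard string functions from their ANSI versions in a DBCS environment. The sketch below illustrates that gap in Python rather than PowerScript; the helper names char_count and byte_count and the choice of code page 932 (Japanese) are assumptions made for this example.

```python
def char_count(text: str) -> int:
    """Count characters, as the standard string functions do."""
    return len(text)


def byte_count(text: str, code_page: str = "cp932") -> int:
    """Count bytes after encoding in a DBCS code page, as a byte-based function would."""
    return len(text.encode(code_page))


sample = "PB\u65e5\u672c\u8a9e"  # two ASCII letters followed by three kanji

print(char_count(sample))  # 5
print(byte_count(sample))  # 8: one byte per ASCII letter, two per kanji
```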
You can use the GetEnvironment function
to determine the character set used in the environment:
    environment env
    getenvironment(env)

    choose case env.charset
    case charsetdbcs!
        // DBCS processing
        ...
    case charsetunicode!
        // Unicode processing
        ...
    case charsetansi!
        // ANSI processing
        ...
    case else
        // Other processing
        ...
    end choose
Encoding enumeration
Several functions, including Blob, BlobEdit, FileEncoding, FileOpen, SaveAs, and String,
have an optional encoding parameter. These
functions let you work with blobs and files with ANSI, UTF-8, UTF-16LE,
and UTF-16BE encoding. If you do not specify this parameter, the
default encoding used for SaveAs and FileOpen is
ANSI. For other functions, the default is UTF-16LE.
The following examples illustrate how to open different kinds
of files using FileOpen:
    // Read an ANSI File
    Integer li_FileNum
    String s_rec
    li_FileNum = FileOpen("Employee.txt")
    // or:
    // li_FileNum = FileOpen("Employee.txt", &
    //    LineMode!, Read!)
    FileRead(li_FileNum, s_rec)

    // Read a Unicode File
    Integer li_FileNum
    String s_rec
    li_FileNum = FileOpen("EmployeeU.txt", LineMode!, &
       Read!, EncodingUTF16LE!)
    FileRead(li_FileNum, s_rec)

    // Read a Binary File
    Integer li_FileNum
    blob bal_rec
    li_FileNum = FileOpen("Employee.imp", StreamMode!, &
       Read!)
    FileRead(li_FileNum, bal_rec)
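Before choosing an encoding argument, it helps to know which of the four encodings a file uses. One common approach is to inspect the byte order mark (BOM) at the start of the file. The sketch below shows that idea in Python rather than PowerScript; the function name sniff_encoding is hypothetical, and treating a file without a BOM as ANSI is an assumption of this example, not a statement of how PowerBuilder detects encodings.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of file contents from a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "UTF-16LE"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "UTF-16BE"
    return "ANSI"  # assumption: no BOM means a single-byte or DBCS code page


print(sniff_encoding(codecs.BOM_UTF16_LE + "Employee".encode("utf-16-le")))
print(sniff_encoding(b"Employee"))
```

A BOM check is only a heuristic: UTF-8 files are often written without a BOM, so a file reported as ANSI by this sketch may still contain UTF-8 text.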
Initialization files
The SetProfileString function can write
to initialization files with ANSI or UTF-16LE encoding on Windows
systems, and ANSI or UTF-16BE encoding on UNIX systems. The
ProfileInt and ProfileString PowerScript
functions and DataWindow expression functions can read files with
these encoding schemes.
Exporting and importing source
The Export Library Entry dialog box lets you select the type
of encoding for an exported file. The choices are ANSI/DBCS,
which lets you import the file into PowerBuilder 9 or earlier, HEXASCII,
UTF8, or Unicode LE.
The HEXASCII export format is used for source-controlled files.
Unicode strings are represented by hexadecimal/ASCII strings
in the exported file, which has the letters HA at the beginning
of the header to identify it as a file that might contain such strings.
You cannot import HEXASCII files into PowerBuilder 9 or earlier.
If you import an exported file from PowerBuilder 9 or earlier,
the source code in the file is converted to Unicode before the object
is added to the PBL.
External functions
When you call an external function that returns an ANSI string
or has an ANSI string argument, you must use an ALIAS clause in
the external function declaration and add ;ansi to
the function name. For example:
    FUNCTION int MessageBox(int handle, string content, string title, int showtype) &
       LIBRARY "user32.dll" ALIAS FOR "MessageBoxA;ansi"
The following declaration is for the “wide” version
of the function, which uses Unicode strings:
    FUNCTION int MessageBox(int handle, string content, string title, int showtype) &
       LIBRARY "user32.dll" ALIAS FOR "MessageBoxW"
If you are migrating an application from PowerBuilder 9 or
earlier, PowerBuilder replaces function declarations that use ANSI
strings with the correct syntax automatically.
Setting fonts for multiple language support
The default font in the System Options and Design Options
dialog boxes is Tahoma.
Setting the font in the System Options dialog box to Tahoma
ensures that multiple languages display correctly in the Layout
and Properties views in the Window, User Object, and Menu painters
and in the wizards.
If the font on the Editor Font page in the Design Options
dialog box is not set to Tahoma, multiple languages cannot be
displayed in Script views, the File and Source editors, the ISQL
view in the Database painter, and the Debug window.
You can select a different font for printing on the Printer
Font tab page of the Design Options dialog box for Script views,
the File and Source editors, and the ISQL view in the Database painter.
If the printer font is set to Tahoma and the Tahoma font is not
installed on the printer, PowerBuilder downloads the entire font
set to the printer when it encounters a multilanguage character.
If you need to print multilanguage characters, specify a printer
font that is installed on your printer.
To support multiple languages in DataWindow objects, set the
font in every column and text control to Tahoma.
The default font for print functions is the system font. Use
the PrintDefineFont and PrintSetFont functions
to specify a font that is available on users’ printers and
supports multiple languages.
PBNI
The PowerBuilder Native Interface is Unicode based. PBNI extensions
must be compiled with the _UNICODE preprocessor symbol defined
in your C++ development environment.
Your extension’s code must use TCHAR, LPTSTR,
or LPCTSTR instead of char, char*,
and const char* to ensure that it
works correctly in a Unicode environment. Alternatively, you can
use the MultiByteToWideChar function to map character
strings to Unicode strings. For more information about enabling Unicode
in your application, see the documentation for your C++ development environment.
Unicode enabling for Web services
In a PowerScript target, the PBNI extension classes instantiated
by Web service client applications use Unicode for all internal
processing. However, calls to component methods are converted to
ANSI for processing by EasySoap, and data returned from these calls
is converted to Unicode.
In a JSP target, the authoring tool (HTML editor) is Unicode-enabled
so you can input text in multiple languages on a single page. When
you type in the editor, the text is saved in UTF-16LE encoding.
However, JSP files with other encoding schemes can still be imported
in the editor. Text with these encodings is automatically converted
to UTF-16LE.
XML string encoding
The XML parser cannot parse a string that uses an eight-bit
character code such as windows-1253. For example, a string
with the following declaration cannot be parsed:
    string ls_xml
    ls_xml += '<?xml version="1.0" encoding="windows-1253"?>'
You must use a Unicode encoding value such as UTF-16LE.
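If XML text arrives with an eight-bit encoding declared, one possible fix is to rewrite the declaration before handing the string to the parser. The sketch below shows the idea in Python rather than PowerScript; the function name redeclare_utf16le and the regular expression are assumptions made for this example, and it assumes the text itself is already held as a Unicode string, so only the declared value (written here as UTF-16LE) needs to change.

```python
import re

# Matches the encoding pseudo-attribute inside the XML declaration.
DECL_ENCODING = re.compile(r'(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2')


def redeclare_utf16le(xml_text: str) -> str:
    """Replace the declared encoding in an XML declaration with UTF-16LE."""
    return DECL_ENCODING.sub(r"\g<1>\g<2>UTF-16LE\g<2>", xml_text, count=1)


original = '<?xml version="1.0" encoding="windows-1253"?><root/>'
print(redeclare_utf16le(original))
```

Text that has no encoding declaration is returned unchanged.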