World-Wide Java

This article is the first of a series that will describe the technology and discuss the work of key organizations and individuals involved in using state-of-the-art localization technology for Java software. Be warned that the technology for internationalizing and localizing software is very much a work in progress, and the state of the art is advancing very quickly.

Java offers programmers an advanced mechanism to develop internationalized and localized software, that is, software that can be targeted for different languages, cultures and geographic areas. To take advantage of Java's features, however, you need to be sure you don't make some fatal assumptions - assumptions you probably don't even know you make.

Top 10 Internationalization Errors

Assuming that all letters lie between a-z and A-Z. (For example, the Danish alphabet is abcdefghijklmnopqrstuvwxyzæøå.)
Assuming that all languages use only one character to represent a letter (Spanish uses "ch" and "ll" as distinct letters - for example, "llama" is pronounced "yama", and collates between "loma" and "mañana".)
Assuming that a letter only represents one character (German sorts "ß" as "ss".)
Assuming that all characters can be converted to upper or lower case (Chinese, Japanese and Korean characters do not have the concept of case.)
Assuming that words are separated by spaces (words in many Asian languages, such as Chinese, Japanese and Thai, are not usually separated.)
Assuming that sentences are read from left to right, then top to bottom (Japanese is usually, but not always, written from top to bottom, then right to left.)
Assuming that sentences are read in only one direction (Arabic has a bi-directional writing system, in which sentences are mostly written from right to left, with the exception that numbers are written from left to right.)
Assuming that there are twelve months in a year (the Hebrew calendar has a thirteenth month every leap year, and its months have only 29 or 30 days.)
Assuming that the year 1998 means the same thing to everyone (India and Thailand's calendars number from Buddha's birth. Right now, in Japan, it is the year 10 Heisei; in Moslem countries it is the year 1418, and it is the year 5759 according to the Hebrew calendar.)
Assuming that time zones are multiples of one hour apart (Newfoundland, Canada, has a time zone which is half an hour different from the mainland, and Guyana's time zone is offset by 45 minutes from its neighbors.)
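
A few of these assumptions can be demonstrated in a handful of lines. What follows is only a minimal sketch, written against a current Java runtime rather than Java 1.1; the class and method names (I18nPitfalls, sortNaively, sortFor, hasCase, offsetMinutes) are invented for illustration, and "America/St_Johns" is the time zone database ID I am assuming your runtime provides for Newfoundland. It contrasts sorting by raw character codes with sorting through java.text.Collator, shows case conversion changing the length of a German word, and reads a half-hour time zone offset.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;
import java.util.TimeZone;

public class I18nPitfalls {

    // Sorts by raw UTF-16 code values: the "all letters lie between a-z" assumption.
    public static String[] sortNaively(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts with the collation rules of the given locale.
    public static String[] sortFor(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    // True only if the character has an upper-case or lower-case form.
    public static boolean hasCase(char c) {
        return Character.isUpperCase(c) || Character.isLowerCase(c);
    }

    // Standard (non-daylight) offset from UTC, in minutes, for a zone ID.
    public static int offsetMinutes(String zoneId) {
        return TimeZone.getTimeZone(zoneId).getRawOffset() / 60000;
    }

    public static void main(String[] args) {
        // The second word is the French "ecole" with an acute accent (U+00E9).
        String[] words = {"zebra", "\u00e9cole", "apple"};

        // Raw code order puts the accented word after "zebra".
        System.out.println(Arrays.toString(sortNaively(words)));
        // French collation puts it between "apple" and "zebra".
        System.out.println(Arrays.toString(sortFor(Locale.FRENCH, words)));

        // Six characters in, seven out: the sharp s (U+00DF) upper-cases to "SS".
        System.out.println("stra\u00dfe".toUpperCase(Locale.GERMAN)); // STRASSE

        // The Chinese character U+4E2D is a letter, but it has no case.
        System.out.println(hasCase('\u4e2d')); // false

        // Newfoundland standard time is three and a half hours behind UTC.
        System.out.println(offsetMinutes("America/St_Johns")); // -210
    }
}
```

The strings use Unicode escapes so that the source file stays pure ASCII and compiles the same way regardless of the editor's encoding.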

Motivations for Internationalizing

In today's world, Americans are not the only people who can write good software. Customers want to work in their own language with software that works the way they do. Best of all, there is a whole lot more money to be made selling a core product to the world as a whole, instead of being restricted to the domestic market. Let's look at some key facts:

Less than 7% of all people in the world speak English.
Seven-bit ASCII is only suitable for writing US English, Swahili and Hawaiian.
Over 30% of US jobs are export related, triple the number of 15 years ago.
If a major U.S. manufacturer is not receiving over half of its revenue from abroad, it is probably doing something wrong.

So how does one write software that can express the diverse writing systems of the world and accept input for those languages?

Unicode: The Foundation for Java's Internationalization Classes

It takes a lot of fonts to express all the world's written languages. Some writing systems move from left to right, others move from right to left. Still others are bi-directional writing systems, and there are writing systems that move from top to bottom. There is even an ancient Irish system that spirals inwards! Each language needs at least one font to express sentences, and more fonts allow for greater expression within a language.

It's remarkable that Unicode 2.0, a standard published in 1996 by The Unicode Consortium, lets you write software that can use sets of fonts to display any combination of languages simultaneously. Even more remarkable are Java 1.1's Internationalization Classes, which extend host operating systems in a platform-independent manner to support the full promise of Unicode. Java's internationalization support elevates Java beyond a mere programming language to a programming environment.

Unicode provides a 16-bit addressing space for a total of 64K characters, including representation of all Latin-based languages, and the entire CJK (Chinese, Japanese and Korean) set of ideographs, together with related scripts and symbols such as Hiragana, Katakana, Bopomofo, Jamo and Kanbun. Unicode also includes complete alphabets for ancient Greek, Cyrillic, Hangul syllables (Korean), Thai, the scripts of India (Devanagari, Bengali, Gurmukhi, etc.), Arabic, Hebrew, Armenian, geometric symbols, Dingbats, and many, many more… and there are still over 18,000 characters to spare.
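
To make the 16-bit address space concrete, here is a small sketch showing that a Java char is a 16-bit Unicode value and that the runtime can report which Unicode block a character belongs to. The class and method names (UnicodeCharSketch, blockOf) are invented for illustration, and Character.UnicodeBlock is, as far as I know, a later addition to the platform than the Java 1.1 release discussed here, so treat this as an illustration on a current runtime rather than as 1.1 code.

```java
public class UnicodeCharSketch {

    // Returns the Unicode block that a single 16-bit char falls in.
    public static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        // A Java char is 16 bits wide.
        System.out.println(Character.SIZE); // 16

        // The three extra Danish letters mentioned earlier, written as escapes.
        String danish = "\u00e6\u00f8\u00e5";
        System.out.println(danish.length()); // 3

        // One sample character each from several of the blocks named above:
        // Latin small a, Danish ae, Hiragana a, Katakana a, Bopomofo b, Hangul Jamo kiyeok.
        char[] samples = {'a', '\u00e6', '\u3042', '\u30a2', '\u3105', '\u1100'};
        for (char c : samples) {
            // Print the code value in hexadecimal next to the block name.
            System.out.println(Integer.toHexString(c) + " " + blockOf(c));
        }
    }
}
```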

You can even define a private font and load it, so your favorite ancient script or Klingon saying can be displayed proudly.

Existing fonts are mapped into the Unicode address space and extended to 16 bits if required, so Unicode can use fonts such as ASCII, UCS-2, JIS X0208 and many others. If you'd like to learn more about fonts and internationalization in general, a good book is "Programming for the World", by Sandra Martin O'Donnell (Prentice Hall, 440 pages). This book was published in 1994, when Unicode 1.1 was current, and when Sandra was responsible for all internationalization-related activities at the Open Software Foundation. Java programmers will find the first half of the book to be a solid foundation for further study. It also describes parsing issues for Asian languages, something I have not seen in other computer books. Unfortunately, the book's programming examples are all written in C and assume Unix, and the methodology used in the Java Internationalization classes has advanced far beyond what is described in the book. As an introduction, however, this book is one of the best. Among other things, it explains multibyte character sets and wide characters. I, an old-time C and C++ programmer, really appreciated learning why multibyte characters are used for storage in files, and why they are often converted to wide characters for parsing.

Another good introduction is "Developing International Software for Windows 95 and Windows NT", by Nadine Kano (Microsoft Press, 1995, 743 pages). This book, although very specifically targeted towards Windows 95 and Windows NT, is full of useful information. It begins with a ten-page overview of the internationalization and localization process. Chapter 2, 'Designing a Global Program', contains material applicable to any operating system and any programming language. Half of Chapter 3 discusses double-byte encoding and Unicode; the remainder of that chapter, and the next ('Preparing the User Interface for Localization'), are only partially useful for localizing platform-independent Java programs. The subject of Chapter 5, 'Supporting Local Conventions', is well covered in the Java tutorial materials. Chapter 6, 'Accommodating Multilingual I/O on Microsoft Windows', is quite helpful to anyone needing to input multilingual data for any language. Chapter 7, 'Processing Far Eastern Writing Systems', is a very good overview of the Chinese, Japanese and Korean written languages, and covers input methods, fonts, sort rules and line breaking rules. Unfortunately, the book does not cover NT 4.0, only NT 3.51. The remaining half of the book consists of appendices, including an extensive glossary (good to have if you are new to the subject!), Latin diacritics and ligatures, international punctuation symbols, tables of code pages, Unicode to double-byte mapping tables, and hundreds of pages of other Microsoft-specific information (useless to Java programmers).

Pick either O'Donnell's or Kano's book to start your education on the subject. Next on your reading list should be "The Unicode Standard, Version 2.0", by The Unicode Consortium. This is the bible for programmers localizing software. There is no substitute. You should be familiar with the material contained in O'Donnell's or Kano's books before tackling this one, however.

The Lineage of Java 1.1's Internationalization Classes

Out of Unicode, Out of Taligent/IBM, Out of Javasoft

Java has been based on Unicode since the beginning, and all Java strings are in fact Unicode strings. Java 1.1 deprecated many of Java 1.0.2's ASCII-based input/output methods in favor of Unicode-compliant methods.
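
As an illustration of what Unicode-aware input/output looks like, the sketch below uses the Reader and Writer classes of java.io; I am assuming these are the kind of "Unicode-compliant methods" meant here, since they convert between Unicode characters in memory and encoded bytes outside the program. The class and method names (UnicodeIoSketch, encode, decode) are invented for illustration. The sketch writes the three extra Danish letters in two encodings and reads them back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

public class UnicodeIoSketch {

    // Converts Unicode text to bytes in the named encoding via a Writer.
    public static byte[] encode(String text, String encoding) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            Writer out = new OutputStreamWriter(sink, encoding);
            out.write(text);
            out.close(); // closing flushes the encoder
            return sink.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException("cannot encode as " + encoding, e);
        }
    }

    // Converts bytes in the named encoding back to Unicode text via a Reader.
    public static String decode(byte[] data, String encoding) {
        try {
            Reader in = new InputStreamReader(new ByteArrayInputStream(data), encoding);
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                text.append((char) c);
            }
            in.close();
            return text.toString();
        } catch (IOException e) {
            throw new IllegalStateException("cannot decode as " + encoding, e);
        }
    }

    public static void main(String[] args) {
        String danish = "\u00e6\u00f8\u00e5"; // three characters in memory

        // One byte per character in Latin-1, two bytes per character in UTF-8.
        System.out.println(encode(danish, "ISO-8859-1").length); // 3
        System.out.println(encode(danish, "UTF-8").length);      // 6

        // Decoding with the encoding that was used for writing restores the text.
        System.out.println(decode(encode(danish, "UTF-8"), "UTF-8").equals(danish)); // true

        // Decoding with the wrong encoding yields six garbage characters instead of three.
        System.out.println(decode(encode(danish, "UTF-8"), "ISO-8859-1").length()); // 6
    }
}
```

The same three characters occupy a different number of bytes in each encoding, which is the multibyte storage versus wide-character processing distinction in miniature.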

IBM used the Unicode 2.0 standard as a building block for their Internationalization Classes for C++ and Java products. Reading IBM's white paper, you immediately see the basis for Java 1.1's Internationalization Classes. IBM has worked closely with Sun and the Unicode Consortium; in fact, the President of the Unicode Consortium, Dr. Mark Davis, is also Program Director of the IBM Center for Java Technology in Silicon Valley, and was the director of the Core Technologies department at Taligent, Inc. Taligent originally developed the Internationalization Classes before IBM absorbed the company back into Big Blue. (Did you know that IBM is the world's largest employer of Java programmers?) Dr. Davis published "Java Cookbook: Creating Global Applications", which is a good introduction to the Internationalization Classes. Sun's Java Internationalization Home Page is another good place to start.

Possibly the best material available for learning Java's Internationalization API is Core Java 1.1 - Volume II, by Horstmann and Cornell (Sunsoft Press, 1998, 661 pages). In the 66 pages of Chapter 9 ('Internationalization'), this book contains an extremely clear development of the Java internationalization API. The explanation of the different types of resource bundles is particularly good. Together with Volume I ('Fundamentals', 630 pages), these two books form a solid basis for learning Java, and are a good reference to have.
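
For readers who have not met resource bundles yet, here is a minimal sketch of one of the types, the class-based java.util.ListResourceBundle. It is not taken from the book; the class names (BundleSketch, Messages, Messages_fr), the keys and the lookup method are invented for illustration. The base bundle holds the default strings, the bundle with the _fr suffix overrides some of them for French, and ResourceBundle.getBundle picks the closest match for the requested locale.

```java
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

public class BundleSketch {

    // Base bundle: used when nothing more specific matches.
    public static class Messages extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"greeting", "Hello"},
                {"farewell", "Goodbye"},
            };
        }
    }

    // French bundle: found through the "_fr" suffix on the base name.
    // It translates only one key; the other is inherited from the base bundle.
    public static class Messages_fr extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"greeting", "Bonjour"},
            };
        }
    }

    // Looks up a key in the bundle that best matches the locale.
    public static String lookup(String key, Locale locale) {
        ResourceBundle bundle = ResourceBundle.getBundle(Messages.class.getName(), locale);
        return bundle.getString(key);
    }

    public static void main(String[] args) {
        System.out.println(lookup("greeting", Locale.FRENCH));        // Bonjour
        // Canadian French has no bundle of its own, so the French one is used.
        System.out.println(lookup("greeting", Locale.CANADA_FRENCH)); // Bonjour
        // A key missing from the French bundle comes from its parent, the base bundle.
        System.out.println(lookup("farewell", Locale.FRENCH));        // Goodbye
    }
}
```

Keeping translatable strings out of the program logic in this way means a translator can add a new locale by supplying one more bundle, without touching the code that calls lookup.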

Warning! If your mother tongue is written using an alphabetic script (like English, French, German or Russian) and you try to understand how to program the Java Internationalization API without reading two of the three books suggested in the section above, you will not have the background to make design decisions for programs that need to be localized to an Asian language.

There are two monthly magazines dedicated to internationalizing and localizing software. One is "MultiLingual Communications & Technology". Its web site was not ready when this page was prepared, but you can phone the magazine at 1-208-263-8178, or fax it at 1-208-263-6310. The other is Language International.

More to come…

Michael Slinn, P. Eng., is a freelance software engineer and journalist who has produced localized software for English, French, German and Spanish.