Published 1998-08-11.
Time to read: 5 minutes.
Java offers programmers an advanced mechanism to develop internationalized and localized software, that is, software that can be targeted for different languages, cultures and geographic areas. To take advantage of Java’s features, however, you need to be sure you don’t make some fatal assumptions – assumptions you probably aren’t even aware you are making.
Unicode: The Foundation for Java’s Internationalization Classes
It takes many fonts to display all the world’s written languages. Some writing systems run from left to right, others from right to left. Still others are bi-directional, and some run from top to bottom. There is even an ancient Irish system that spirals inwards! Each written language needs at least one font before its text can be displayed, and additional fonts allow for greater typographic expression within that language.
It’s remarkable that Unicode 2.0, a standard published in 1996 by The Unicode Consortium, lets you write software that can use sets of fonts to display any combination of languages simultaneously. Even more remarkable are Java 1.1’s Internationalization Classes, which extend host operating systems in a platform-independent manner to support the full promise of Unicode. Java’s internationalization support elevates Java beyond a mere programming language into a programming environment.
Unicode provides a 16-bit code space for a total of 64K characters, including representations of all Latin-based languages and the entire CJK (Chinese, Japanese and Korean) set of ideographs, along with related scripts such as Hiragana, Katakana, Bopomofo, Jamo and Kanbun. Unicode also includes complete alphabets and syllabaries for ancient Greek, Cyrillic, Hangul syllables (Korean), Thai, the Indic scripts used for Sanskrit and other languages (Devanagari, Bengali, Gurmukhi, etc.), Arabic, Hebrew, Armenian, geometric symbols, Dingbats, and many, many more, and there are still over 18,000 code points to spare.
You can even define characters in Unicode’s Private Use Area and supply a font for them, so your favorite ancient script or Klingon saying can be displayed proudly. Existing character sets are mapped into the Unicode code space and widened to 16 bits if required, so text encoded as ASCII, UCS-2, JIS X 0208 and many others can be converted to and from Unicode.
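As a minimal illustration of how this code space appears to a Java programmer (the specific characters below are arbitrary examples, and whether they actually render depends on the fonts installed on your system), every Java char is a 16-bit Unicode value that can be named with a \u escape:

```java
public class UnicodeDemo {
    public static void main(String[] args) {
        // Java chars are 16-bit Unicode values; \u escapes name them by code point.
        char aWithRing  = '\u00C5';   // Å  (Latin)
        char alpha      = '\u03B1';   // α  (Greek)
        char katakanaA  = '\u30A2';   // ア (Katakana)
        char privateUse = '\uE000';   // first code point of the Private Use Area

        String mixed = "" + aWithRing + alpha + katakanaA + privateUse;
        for (int i = 0; i < mixed.length(); i++) {
            char c = mixed.charAt(i);
            // Print each code point in hex alongside the character itself.
            System.out.println(Integer.toHexString(c) + " -> " + c);
        }
    }
}
```

Whether the characters print correctly depends on the encoding and fonts of your console, which is exactly the kind of platform variation the Internationalization Classes are designed to manage.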
The Lineage of Java 1.1’s Internationalization Classes
Out of Unicode, Out of Taligent/IBM, Out of JavaSoft
Java has been based on Unicode since its inception, and all Java strings are in fact Unicode strings. Java 1.1 deprecated many of Java 1.0.2’s ASCII-based input/output methods in favor of Unicode-compliant methods.
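As a sketch of what those Unicode-compliant methods look like (the file names and the choice of encodings here are assumptions for illustration only), Java 1.1’s character streams convert between a byte encoding and Unicode strings:

```java
import java.io.*;

public class ConvertEncoding {
    public static void main(String[] args) throws IOException {
        // Read a Shift-JIS encoded file; the Reader converts the byte
        // sequences into Unicode characters as it reads.
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream("input-sjis.txt"), "SJIS"));

        // Write the same text back out as UTF-8.
        PrintWriter out = new PrintWriter(
            new OutputStreamWriter(new FileOutputStream("output-utf8.txt"), "UTF8"));

        String line;
        while ((line = in.readLine()) != null) {
            out.println(line);   // the String itself is always Unicode
        }
        in.close();
        out.close();
    }
}
```

Contrast this with the deprecated DataInputStream.readLine(), which blindly widens each byte into a char and so cannot handle multibyte encodings such as Shift-JIS.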
IBM used the Unicode 2.0 standard as a building block for their Internationalization Classes for C++ and Java products. Reading IBM’s white paper, you immediately see the basis for Java 1.1’s Internationalization Classes. IBM has worked closely with Sun and the Unicode Consortium. In fact, the President of the Unicode Consortium, Dr. Mark Davis, is also Program Director of the IBM Center for Java Technology in Silicon Valley, and was the director of the Core Technologies department at Taligent, Inc. Taligent originally developed the Internationalization Classes before IBM absorbed the company back into itself. (Did you know that IBM is the world’s largest employer of Java programmers?)
For Further Study
If you’d like to learn more about fonts and internationalization in general, a good book is “Programming for the World,” by Sandra Martin O’Donnell (Prentice Hall, 440 pages). It was published in 1994, when Unicode 1.1 was current and the author was responsible for all internationalization-related activities at the Open Software Foundation. Java programmers will find the first half of the book to be a solid foundation for further study. The book also describes parsing issues for Asian languages, something I have not seen in other computer books. Unfortunately, the programming examples are all written in C and the author assumes the operating system is Unix. The methodology used in the Java Internationalization classes has advanced far beyond what is described by the author. As an introduction, however, this book is one of the best. Among other things, it explains multibyte character sets and wide characters. I, an old-time C and C++ programmer, really appreciated learning why multibyte characters are used for storage in files, and why they are often converted to wide characters for parsing.
Another good introduction to the subject is “Developing International Software for Windows 95 and Windows NT,” by Nadine Kano (Microsoft Press, 1995, 743 pages). This book, although very specifically targeted at Windows 95 and Windows NT, is full of useful information. It begins with a ten-page overview of the internationalization and localization process. Chapter 2, entitled “Designing a Global Program,” contains material applicable to any operating system and any programming language. Half of Chapter 3 discusses double-byte encodings and Unicode; the remainder of that chapter and the next (“Preparing the User Interface for Localization”) are only partially useful for localizing platform-independent Java programs. The material in Chapter 5, “Supporting Local Conventions,” is well covered in the Java tutorial materials. Chapter 6, “Accommodating Multilingual I/O on Microsoft Windows,” is quite helpful to anyone needing to handle multilingual input in any language. Chapter 7, “Processing Far Eastern Writing Systems,” is an excellent overview of the Chinese, Japanese and Korean written languages, and covers input methods, fonts, sort rules and line breaking rules. Unfortunately, the book does not cover NT 4.0, only NT 3.51. The remaining half of the book consists of appendices, including an extensive glossary (good to have if you are new to the subject!), Latin diacritics and ligatures, international punctuation symbols, tables of code pages, Unicode to double-byte mapping tables, and hundreds of pages of other Microsoft-specific information (useless to Java programmers).
Pick either O’Donnell’s or Kano’s book to start your education on the subject. Next on your reading list should be “The Unicode Standard, Version 2.0,” by The Unicode Consortium. This is the bible for programmers localizing software. There is no substitute. You should be familiar with the material contained in O’Donnell’s or Kano’s books before tackling this one, however.
Dr. Davis published “Java Cookbook: Creating Global Applications,” which is a good introduction to the Internationalization Classes. Sun’s Java Internationalization Home Page is another good place to start.
Perhaps the best material available for learning Java’s Internationalization API is “Core Java 1.1 – Volume II,” by Horstmann and Cornell (SunSoft Press, 1998, 661 pages). In the 66 pages of Chapter 9 (“Internationalization”), this book clearly describes the development of the Java internationalization API. The explanation of the different types of resource bundles is excellent. Together with Volume I (“Fundamentals,” 630 pages), these two books form a solid basis for learning Java, and are a good reference to have.
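For a taste of how resource bundles are used (the bundle name Messages and the key greeting are hypothetical, and the corresponding Messages.properties and Messages_fr.properties files would have to exist on the classpath), a minimal sketch looks like this:

```java
import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.ResourceBundle;

public class LocalizedGreeting {
    public static void main(String[] args) {
        Locale locale = Locale.FRANCE;

        // getBundle() looks for the most specific match (Messages_fr_FR, then
        // Messages_fr), falling back toward the base Messages bundle. Bundles
        // may be ListResourceBundle subclasses or plain .properties files.
        ResourceBundle bundle = ResourceBundle.getBundle("Messages", locale);
        String greeting = bundle.getString("greeting");

        // Dates (and numbers) are formatted according to the locale's conventions.
        String today = DateFormat.getDateInstance(DateFormat.LONG, locale)
                                 .format(new Date());
        System.out.println(greeting + ", " + today);
    }
}
```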
Warning! If your mother tongue is written using an alphabetic script (like English, French, German or Russian) and you try to understand how to program the Java Internationalization API without reading two of the three books suggested in the section above, you will not have the background to make sound design decisions for programs that need to be localized to an Asian language.
There are two monthly magazines dedicated to internationalizing and localizing software:
- MultiLingual Communications & Technology
- Language International
About the Author
Michael Slinn, P. Eng., is a software engineer and journalist who has produced localized commercial software products in English, French, German and Spanish.