i18n_of_R

Localization and Internationalization of R
Hints to R users in non-English speaking countries

Page Contents (Last modified 10, August, 2003 by S. Mase)

(1) Introduction
(2) Japanese and R
(3) Localization and Internationalization of R
(4) Prerequisites
(5) Patches of Nakama and Okada for EUC and SJIS environments
- (5-1) Japanese strings
- (5-2) Japanese object names and device outputs for Unix-like and EUC-* environments
(6) Binaries
(7) Present progress status summary by Nakama (for R-1.7.1)
(8) Misc R-KNOPPIX

This site is based on PukiWiki, a Japanese variant of the Wiki (Internet collaboration) system. It was established in June 2003 by courtesy of M. Okada (Tsukuba University, Japan). Its aim is to exchange and accumulate information in Japanese (documents and tips) on R. It has been quite successful so far, gaining the support of previously isolated and hidden Japanese R users. The other pages are written solely in Japanese, and your browser may not display them correctly.

Feedback and reports are welcome; inquiries and complaints are not. Since I am poor both in these wizardry skills and in technical English, please do not expect replies from me. Please do not send personal mail to E. Nakama (he prefers the C language to the E(nglish) language) or to M. Okada. If you (from abroad) want to comment on this page, please use the companion comment page CommentsOnI18nOfR, although I cannot guarantee the reply you hope for. For comments in Japanese, please use the page JapaneseCommentOnI18nOfR.

(1) Introduction

This page is about the "L10N" (localization) and "i18n" (internationalization) patches for R by E. Nakama and M. Okada, who have succeeded in "making R speak Japanese". Although the patches are still incomplete, we hope they will give useful hints to R users in countries that use multi-byte characters, like Japan. If you are interested, please study the patches carefully yourself.

It is probably necessary to explain to R users in single-byte character countries the specific difficulties of using R in Japan (and in other multi-byte character countries such as China and Korea). In Japan, we use several completely different sets of characters simultaneously and interchangeably:

"Roman alphabet" characters. Single-byte characters, as used in English-speaking countries.
"Kata-Kana" characters. A series of phonograms. About 70 kinds.
"Hira-Kana" characters. Another series of phonograms. About 70 kinds. Apart from their shapes, they are essentially equivalent to Kata-Kana characters. Why are there two kinds, you may ask. Originally, Kata-Kana was for men, and Hira-Kana for women :-)
"Kanji" characters. Ideograms. They were originally Chinese characters. Many of them are still the same as those used in China today, but many others are of Japanese origin. About 3,000 kinds are in common use now (over the long history of China, Japan, and neighboring areas, at least 50,000 kinds were used at one time or another!).

Complex enough already, isn't it? But the story does not end there. As computer technology was adopted in Japan, several incompatible character coding systems, each assigning byte codes to the Japanese characters above, were proposed, and they are still in use in parallel. The three main coding systems in use now are:

EUC-JP codes. It was originally used in IBM mainframes and is now the main coding system for Unix-like OSes. Similar codes are also used in several other countries (e.g. EUC-CN, EUC-TW and EUC-KR).
JIS codes. The Japanese Industrial Standards code.
SJIS (Shift-JIS) codes. It is now mainly used on Microsoft Windows machines.

We also have to add relatively new international codes such as Unicode. One Japanese character is represented by 1 to 3 bytes (EUC-TW seems to have some 4-byte codes). It may further be necessary to note that Japanese PCs have only alphabetic keyboards, which can also be used as Hira-Kana keyboards. Japanese phrases are first typed phonetically in the alphabet (or in Hira-Kana), and then special software called an FEP (Front End Processor) converts them into the final Japanese phrases. Since most alphabetic (phonetic) representations correspond to several Kanji words, it is usually necessary to choose the correct one from the candidates the FEP suggests.
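To make the difference between these coding systems concrete, the following shell sketch writes out one and the same Kanji character (U+8868, "table") in four encodings and counts the bytes. The byte values are typed in as octal escapes, so the script itself is plain ASCII and needs no Japanese-capable terminal. It assumes that "JIS codes" above means the 7-bit encoding commonly known as ISO-2022-JP; the helper name byte_len is made up for this illustration.

```shell
#!/bin/sh
# byte_len FORMAT: print how many bytes the printf format string FORMAT produces.
# (byte_len is a hypothetical helper, not part of R or of the patches.)
byte_len() {
    printf "$1" | wc -c | tr -d ' '
}

# The same character, U+8868, in four coding systems (octal escapes):
euc_jp='\311\275'        # 0xC9 0xBD
sjis='\225\134'          # 0x95 0x5C
utf8='\350\241\250'      # 0xE8 0xA1 0xA8
jis='\033$BI=\033(B'     # ESC $ B, then 0x49 0x3D, then ESC ( B (ISO-2022-JP, assumed)

echo "EUC-JP: $(byte_len "$euc_jp") bytes"
echo "SJIS:   $(byte_len "$sjis") bytes"
echo "UTF-8:  $(byte_len "$utf8") bytes"
echo "JIS:    $(byte_len "$jis") bytes (including the escape sequences)"
```

The four counts are 2, 2, 3 and 8. The JIS figure is large only because the character has to be wrapped in escape sequences that switch into and out of the Kanji character set.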

(2) Japanese and R

Regrettably, R is not yet as popular in Japan as it deserves to be. One of the main reasons, I think, is that R cannot handle Japanese. At present, the use of Japanese in R is confined to the following:

It can handle Japanese character strings if the terminal can display Japanese characters and has Japanese fonts installed (though I do not understand why this already works),
Paul Murrell kindly made it possible to use hundreds of Japanese characters, as well as other characters, as graphical symbols in graphics devices (Hershey vector fonts).

But, as explained above, we want and need to use about three thousand characters in object names, in file I/O, and, in particular, in graphical objects such as titles. This limitation keeps Japanese end users from using R. Why not use English, you might ask. The reason is simple: ordinary Japanese are, in general, poor at English.

(3) Localization and Internationalization of R

Localization (often abbreviated as L10N) means adapting R to a particular language such as Japanese, while internationalization (often abbreviated as i18n) means making R capable of handling many (if not all) languages simultaneously. Of course, the former is much easier. Recently, much software, such as the X Window System, has been i18n'ed. But fully adapting software such as R to this mechanism is by no means easy. The following patches are for L10N and i18n of R. Several relevant remarks:

Terminals have to be L10N'ed and i18n'ed (for Japanese object names as well as file I/O with Japanese).
Graphical devices such as X11, postscript, png, etc. have to be considered separately. It is a prerequisite that companion software such as GS and LaTeX is already L10N'ed or i18n'ed.
Different codings and OSes have to be treated separately. The case of Unix-like OSes with the EUC-JP code seems to be the simplest. The Microsoft Windows case is more difficult. Okada has started an experiment on Mac OS, but has no success to report yet.

(4) Prerequisites

In order to use L10N'ed or i18n'ed R, please note the following prerequisites:

Your OS should already be capable of handling your language.
Companion software such as terminals, GS and LaTeX should already be L10N'ed or i18n'ed. In particular, it should respect the eighth bit of each character code byte (that is, be 8-bit clean).
Local font files which the companion software can use, of course.
Some (or a lot of) patience and knowledge about compiling and installing software.
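The first prerequisite can be checked from the shell before any compilation is attempted. The following is a minimal sketch, assuming a system that provides the POSIX locale command; the helper name locale_available is invented here, and ja_JP.eucJP is only the example locale used elsewhere on this page.

```shell
#!/bin/sh
# locale_available NAME: succeed if NAME is listed by "locale -a".
# The comparison ignores case because systems spell the codeset differently
# (for example eucJP versus eucjp). Hypothetical helper.
locale_available() {
    locale -a 2> /dev/null | grep -qixF "$1"
}

if locale_available ja_JP.eucJP; then
    echo "ja_JP.eucJP is installed"
else
    echo "ja_JP.eucJP is not installed on this system"
fi
```

Once the locale is selected, the command locale charmap reports the coding system actually in effect, which should agree with the one the patches are built for.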

(5) Patches of Nakama and Okada for EUC and SJIS environments

Warning: The following patches and the resulting binaries may potentially cause trouble for your OS. They are offered with no warranty. Please note that, although they seem to work fine (still with several restrictions) in Japan so far, we do not guarantee that they also work in other countries that use multi-byte codes. You had better regard them as hints for the L10N and/or i18n of R needed in your country.

(5-1) Japanese strings

In an EUC environment, there is no problem even now (at least in Japan). In an SJIS environment (e.g. Japanese MS Windows), however, characters having 0x5c as their second byte cannot be handled correctly. Applying the following Japanized patch makes it possible to handle them.
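The 0x5c problem can be seen without any Japanese software: 0x5c is the ASCII code of the backslash, so a program that scans SJIS text one byte at a time will presumably mistake the second byte of such a character for the start of an escape sequence. The sketch below dumps a byte string with od and looks for that byte. The example character is U+8868, whose SJIS code is 0x95 0x5C and whose EUC-JP code is 0xC9 0xBD; the helper name has_0x5c_byte is made up for this illustration.

```shell
#!/bin/sh
# has_0x5c_byte FORMAT: succeed if the bytes produced by the printf format
# string FORMAT include 0x5c (ASCII backslash). Hypothetical helper.
has_0x5c_byte() {
    printf "$1" | od -An -tx1 | grep -q '5c'
}

# The same character in both coding systems, written as octal escapes.
for enc in 'SJIS:\225\134' 'EUC-JP:\311\275'; do
    name=${enc%%:*}
    bytes=${enc#*:}
    if has_0x5c_byte "$bytes"; then
        echo "$name: contains 0x5c, unsafe for byte-wise parsing"
    else
        echo "$name: no 0x5c byte"
    fi
done
```

In EUC-JP every byte of a multi-byte character has its eighth bit set, so 0x5c can never occur inside one, which is consistent with the EUC environment being trouble-free.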

(5-2) Japanese object names and device outputs for Unix-like and EUC-* environments

First download the R source file, and Nakama and Okada's patches:

  R-1.7.1.tgz
  http://r.nakama.ne.jp/R-1.7.1/patchs/
  R.l10n.YYYYMMDD.patch
  R.l10n.PSXFIG.YYYYMMDD.patch
  R.i18n.x11_mb.YYYYMMDD.patch

The first two are integrated patches applicable both to the EUC-JP case and to the SJIS case. The third one is for the postscript (L10N) and xfig (i18n) devices; as to i18n'ed xfig, see http://wwwusr.obspm.fr/departement/demirm/xfig/japanese/i18n.html . Following Nakama's instructions, issue the following commands in an appropriate working directory, the one where R's source directory resides. Please note that they are for Unix-like OSes with an EUC environment. The command rm -f src/main/gram.c is mandatory (gram.y will be used instead).

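# unpack the R source into ./R-1.7.1
gzip -d -c R-1.7.1.tgz | tar xvf -
cd R-1.7.1
# apply the patches (assumed here to have been saved in your home directory)
patch -p1 < ~/R.l10n.YYYYMMDD.patch
patch -p1 < ~/R.l10n.PSXFIG.YYYYMMDD.patch
patch -p1 < ~/R.i18n.x11_mb.YYYYMMDD.patch
# force the parser to be regenerated from the patched gram.y
rm -f src/main/gram.c
MAIN_CFLAGS="-DL10N_JP" R_BROWSER="/usr/bin/mozilla" ./configure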

For SJIS, use the flag MAIN_CFLAGS="-DL10N_JP -DL10N_SJIS_JP" instead of MAIN_CFLAGS="-DL10N_JP". (The SJIS case has not been checked yet.)

Now follow the R installation instructions for the rest. Because i18n of xfig has been done only for Japan (ja_JP) and Korea (ko_KR) at present, the third patch may be unnecessary.
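Since the patches were written for R-1.7.1, it may be worth checking that each one applies cleanly before the source tree is modified. The following is only a sketch: it assumes GNU patch (for the --dry-run option) and absolute paths to the patch files. The function name apply_checked is invented here and is not part of Nakama's instructions.

```shell
#!/bin/sh
# apply_checked SRCDIR PATCH...: apply each patch with -p1 inside SRCDIR,
# but only after a dry run shows that it applies cleanly.
# Hypothetical helper; assumes GNU patch and absolute patch paths.
apply_checked() {
    srcdir=$1
    shift
    # Refuse to start if any patch file is missing or unreadable.
    for p in "$@"; do
        if [ ! -r "$p" ]; then
            echo "cannot read patch: $p" >&2
            return 1
        fi
    done
    # Dry-run and apply one patch at a time, because later patches
    # may depend on changes made by earlier ones.
    for p in "$@"; do
        if ! (cd "$srcdir" && patch -p1 --dry-run < "$p" > /dev/null); then
            echo "patch would not apply cleanly: $p" >&2
            return 1
        fi
        (cd "$srcdir" && patch -p1 < "$p") || return 1
    done
}
```

It would be called as apply_checked "$PWD/R-1.7.1" followed by the full paths of the three patch files, in place of the three patch commands; the rm and configure steps stay as given in the listing.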

Nakama made it possible to specify the fonts used by R flexibly via an X resource file. The following is an example that uses the free Japanese TrueType fonts kochi-mincho and kochi-gothic. You should change them as appropriate. <R_HOME> is the full path to R's home directory, which is the value of the environment variable R_HOME if it is already set. <locale> is your present locale, which is the value of the environment variable LANG if it is already (and correctly) set.

 <R_HOME>/etc/R_X11.<locale>

For example, it is /usr/lib/R/etc/R_X11.ja_JP.eucJP on my Debian GNU/Linux. The contents of this file may be as follows. You can list as many available fonts (those which X programs can use) as you like.

 *fontSet0: -kochi-kochi gothic-medium-r-*-*-%d-*-*-*-*-*-iso8859-1, \
                 -kochi-kochi gothic-medium-r-*-*-%d-*-*-*-*-*-jisx0201.1976-0, \
                 -kochi-kochi gothic-medium-r-*-*-%d-*-*-*-*-*-jisx0208.1983-0
 *fontSet1: -kochi-kochi gothic-bold-r-*-*-%d-*-*-*-*-*-iso8859-1, \
                 -kochi-kochi gothic-bold-r-*-*-%d-*-*-*-*-*-jisx0201.1976-0, \
                 -kochi-kochi gothic-bold-r-*-*-%d-*-*-*-*-*-jisx0208.1983-0
 *fontSet2: -kochi-kochi mincho-medium-r-*-*-%d-*-*-*-*-*-iso8859-1, \
                 -kochi-kochi mincho-medium-r-*-*-%d-*-*-*-*-*-jisx0201.1976-0, \
                 -kochi-kochi mincho-medium-r-*-*-%d-*-*-*-*-*-jisx0208.1983-0
 *fontSet3: -kochi-kochi mincho-bold-r-*-*-%d-*-*-*-*-*-iso8859-1, \
                 -kochi-kochi mincho-bold-r-*-*-%d-*-*-*-*-*-jisx0201.1976-0, \
                 -kochi-kochi mincho-bold-r-*-*-%d-*-*-*-*-*-jisx0208.1983-0
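The location of the resource file described above follows mechanically from R_HOME and the locale, so it can be computed rather than typed. A small sketch, assuming both values are either passed as arguments or available in the environment; r_x11_resource_path is a name invented for this example.

```shell
#!/bin/sh
# r_x11_resource_path [R_HOME [LOCALE]]: print <R_HOME>/etc/R_X11.<locale>.
# Missing arguments fall back to the R_HOME and LANG environment variables.
# Hypothetical helper, not part of R or of the patches.
r_x11_resource_path() {
    r_home=${1:-${R_HOME:-}}
    locale=${2:-${LANG:-}}
    if [ -z "$r_home" ] || [ -z "$locale" ]; then
        echo "R_HOME and LANG must be set or passed as arguments" >&2
        return 1
    fi
    printf '%s/etc/R_X11.%s\n' "$r_home" "$locale"
}

# Example with the values quoted in the text:
r_x11_resource_path /usr/lib/R ja_JP.eucJP
```

With the arguments shown, this prints /usr/lib/R/etc/R_X11.ja_JP.eucJP, the Debian path given above.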

Nakama believes that, with appropriate changes, these recipes will also work for Chinese (including Taiwanese) and Korean, and maybe even for European languages, and will enable your R to display local fonts both on the console and on graphical devices (with some restrictions). However, we have had no chance to test this so far.

Remarks:

The postscript device cannot properly show strings containing both single- and multi-byte characters. Nakama commented that a full i18n of the postscript device will be extremely difficult. If you cannot get satisfactory postscript output, try the png device. You can convert the result into a postscript file afterwards, if necessary, using an appropriate tool such as ImageMagick. It works fine, at least for me.
The pictex device is only L10N'ed.
Korean R users can get hints from Nakama's web page. Please note that he cannot understand Korean.

The following problems were reported as side effects of the patches:
- The X fonts fixed-bold-r, fixed-medium and fixed-bold-o cannot be used for plotting. Please add the corresponding fonts to the above resource file.
- Symbol fonts (the Adobe symbol fonts used in plotmath) cannot be used. Nakama's newest patch can display these symbol fonts.
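The PNG workaround mentioned in the remarks above can be sketched as follows, assuming ImageMagick is installed. The function names ps_name_for and png_to_ps are made up for this example; the conversion commands themselves (magick in ImageMagick 7, convert in older releases) are the standard ImageMagick front ends.

```shell
#!/bin/sh
# ps_name_for FILE.png: derive the output name FILE.ps (hypothetical helper).
ps_name_for() {
    printf '%s.ps\n' "${1%.png}"
}

# png_to_ps FILE.png: convert a PNG written by R's png device to PostScript
# with whichever ImageMagick front end is installed. Hypothetical helper.
png_to_ps() {
    out=$(ps_name_for "$1")
    if command -v magick > /dev/null 2>&1; then
        magick "$1" "$out"
    elif command -v convert > /dev/null 2>&1; then
        convert "$1" "$out"
    else
        echo "ImageMagick not found" >&2
        return 127
    fi
}
```

Calling png_to_ps plot.png writes plot.ps next to the input. Note that the result is a bitmap wrapped in PostScript, not vector graphics, so the PNG should be produced at a resolution high enough for printing.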

(6) Binaries

Nakama and Okada kindly offer patched binaries of R on their web pages. Since these are primarily for Japanese users, they may be of little interest to R users abroad.

(6-1) RPM binary (by E. Nakama)

RPM packages. These can use the X resource file described above. R-1.7.1-1vl20.nosrc.rpm and R-1.7.1-1vl20.i386.rpm

(6-2) Debian packages (woody and sid, by M. Okada)

Debian binaries. At present, these binaries cannot use the X resource file described above and use default fonts instead.

(6-3) Japanese MS Windows binary (by M. Okada)

Integrated patch R-1.7.1-windows-Japanese.patch for Japanese MS Windows, including all presently available patches.
Japanized R binary Rdll-1.7.1-jVar-jPIC-jPS-jGraph.lzh for Japanese MS Windows, including all presently available patches. It is compressed with lha, a free archiver commonly used in Japan. You can get the Windows binary of lha from the Internet (search Google for the keywords lha or LHarc). After extracting this lzh file, you get the executable binary R.dll. Use it to replace the original R.dll in the bin directory of R's home directory on your PC.
You should already have installed the official R-1.7.1. The version of this R.dll (that is, 1.7.1) should be the same as that of your preinstalled R.dll.
If your Japanized R cannot display Japanese correctly, replace the file etc\Rdevga in R's home directory with Rdevga-JapaneseFont.txt.

This binary is still at a testing stage, and you should install it at your own risk.

(7) Present progress status summary by Nakama (for R-1.7.1)

Subjects \ OS	*nix	MS Windows	Classic MacOS	MacOS X
Parser	L10N 100%	L10N 100%	?	?
Regexp	?	?	?	?
Graphical devices	i18n 80%	?	?	?
Dataentry (Spreadsheet)	?	?	?	?
POSTSCRIPT	?	?	?	?
xfig	i18n 100%(needs i18n'ed xfig)	---	---	---
PDF	?	?	?	?

(8) Misc R-KNOPPIX

KNOPPIX is a Linux distribution on a single CD-ROM. Since all files are compressed, it actually contains 1.6GB of files, enough for an almost complete Linux environment. It is based on Debian GNU/Linux with the KDE desktop. The most remarkable feature of KNOPPIX is that it is bootable. It is also marvellously good at recognizing the hardware of your PC. Since it does not reside on the hard disk, it leaves nothing behind after shutdown. KNOPPIX is of German origin and was Japanized by A. Suzaki.

S. Tanimura (Nagasaki University, Japan) rebuilt KNOPPIX-jp (based on knoppix_20030606-20030625) to include a partially Japanized R. It can be downloaded from his web page or its mirror. R-KNOPPIX lets you try R on a home MS Windows PC quite easily and safely.