My Tech Notes: June 2007

Monday, June 11, 2007

Unicode for Ruby

I am trying to get comfortable with Ruby (having already written a fair amount of code in Perl, Python, and PHP). One apparent and serious problem with Ruby is the lack of Unicode support (in the current 1.8.x version). Class String in Ruby is just a byte string and nothing else.

A few things of interest:

The standard Ruby distribution does support the Iconv library. For example, to convert a string of text from UTF-16LE (used by Windows) to UTF-8, one can use:
```
require 'iconv'
# Iconv.iconv(to, from, *strings) returns an array of converted strings; join yields one String
utf8_str = Iconv.iconv( "UTF-8", "UTF-16LE", win_str ).join
```
(Obviously, this is merely a wrapper around the "standard" GNU iconv library. To see the full list of currently installed encodings, use `iconv -l'.)
Also, there is a special option 'U' to the standard String 'unpack' function, which treats the string as UTF-8 encoded and splits it into an array of Unicode code points (integers);
Building on the above, there is an attempt on RubyForge to create a well-behaved Unicode String class purely in Ruby (the only way to install it seems to be to download its single source file directly). However, this project was evidently never finished and is barely used: the last modification occurred 18 months ago and there seems to be no activity on the project's public forums. While many features of "regular" strings are supported and others could be added, it is not clear whether, for example, full regular expression support is feasible;
Interestingly, Ruby, having been developed in Japan, was designed to deal with non-ASCII encodings from day one. In particular, Ruby supports the notion of a "default encoding", held in the global variable $KCODE. It can be set with the command line option -K, e.g. `-Ku' for UTF-8, and it can be assigned at any moment to override the default value. The only trouble is that in my (Cygwin) Ruby and iconv installation, the value of $KCODE that corresponds to UTF-8 is "UTF8", whereas iconv only knows "UTF-8" (note the dash) and refuses to understand "UTF8". Explicitly assigning `$KCODE = "UTF-8"' does not help, as Ruby still resets $KCODE to "UTF8". This means that to use the Ruby Unicode library mentioned above, I had to change it to pass "UTF-8" to Iconv wherever it previously passed "UTF8";
In this e-mail thread, someone explains why he thinks Ruby is fine without any (more) Unicode support;
Here, some Unicode-related changes planned for Ruby 2.0 are described;
However, here is a detailed description of what has already been done in Ruby 1.9, with the understanding that 2.0 will be a subset of 1.9; I cannot see Unicode mentioned anywhere there;
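
The 'U' option mentioned above can be tried out with a minimal sketch like the one below. It assumes the input bytes are valid UTF-8; the helper names and the sample string are made up for illustration and are not part of any library.

```ruby
# Hypothetical helpers, named here only for illustration.

# Decode a UTF-8 byte string into an array of Unicode code points.
def utf8_to_codepoints(str)
  str.unpack('U*')
end

# Encode an array of code points back into a UTF-8 string.
def codepoints_to_utf8(codepoints)
  codepoints.pack('U*')
end

# Reverse text character by character rather than byte by byte.
def utf8_reverse(str)
  codepoints_to_utf8(utf8_to_codepoints(str).reverse)
end

sample = "na\xC3\xAFve"   # "naïve" spelled out as raw UTF-8 bytes

puts sample.unpack('C*').length          # 6 bytes
puts utf8_to_codepoints(sample).length   # 5 characters
puts utf8_reverse(sample)
```

Working on the array of integers sidesteps the byte-oriented String methods, at the cost of converting back and forth around every character-level operation.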

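The $KCODE workaround described above amounts to translating Ruby's spelling of an encoding name into the one iconv accepts before each call. A minimal sketch of that idea follows; `iconv_name' is a hypothetical helper, not a function of the RubyForge library, and only the "UTF8" case is taken from the note above. Any other name is passed through unchanged.

```ruby
# Hypothetical helper: map a $KCODE-style name to the spelling iconv expects.
# Only "UTF8" -> "UTF-8" is known to be needed; anything else is returned as is.
def iconv_name(kcode)
  kcode.to_s.upcase == 'UTF8' ? 'UTF-8' : kcode
end

puts iconv_name('UTF8')       # UTF-8
puts iconv_name('UTF-16LE')   # UTF-16LE
```
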
In a nutshell, all of the above means that, at least until Ruby 2.0 is released, the best option for dealing with Unicode strings is to make all Strings (= byte arrays) in a program share the same Unicode-compatible encoding, e.g. UTF-16LE. Then we need to always keep that in mind while dealing with such strings. For example, to split a TAB-separated string, we would use:

words = line.split( "\t\0" )
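
To see why the separator has to be "\t\0" rather than "\t", here is a small self-contained sketch. It builds the UTF-16LE bytes with pack('v*') (16-bit little-endian), which covers only characters in the Basic Multilingual Plane; all helper names are hypothetical.

```ruby
# Hypothetical helpers for illustration; BMP characters only (no surrogate pairs).

# Encode an array of code points as a UTF-16LE byte string.
def to_utf16le(codepoints)
  codepoints.pack('v*')
end

# Decode a UTF-16LE byte string into a UTF-8 string.
def utf16le_to_utf8(bytes)
  bytes.unpack('v*').pack('U*')
end

# Split a UTF-16LE line on TAB, which is the two bytes 0x09 0x00.
def split_utf16le_tabs(line)
  line.split("\t\0")
end

line  = to_utf16le("foo\tbar".unpack('U*'))
words = split_utf16le_tabs(line).map { |w| utf16le_to_utf8(w) }
puts words.inspect

# Splitting on a bare "\t" leaves the TAB's zero byte glued to the next field
puts line.split("\t")[1].unpack('C*').inspect
```

One caveat: a byte-level split can in principle match across a character boundary (a character whose high byte is 0x09 followed by one whose low byte is 0x00), so this approach works for typical text but is not bulletproof.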

Appendix 1. Some reference material on Ruby:

Two similarly brief introductions to Ruby: Programming Ruby: The Pragmatic Programmer's Guide and Ruby User's Guide (by language creator);
As part of the Pragmatic Guide mentioned above, there is a library reference and a built-in reference; seemingly the same documents are also available from here; I am not sure whether there is any difference;
The Ruby Language FAQ;
Ruby Cheatsheet;
Ruby-doc.org server features, among other things, standard Ruby documentation and customized Google search;
Ruby QuickRef;
The book Programming Ruby (2nd edition) is not available for free (you can buy it as a regular book or download a PDF file for $25), but a number of useful excerpts are available, and judging from these excerpts the book appears to be well done and useful;
The rubylearning.com server offers an online tutorial, as well as a (huge) PDF file download (the link is only provided via e-mail, but see this for the book and this for the accompanying Ruby programs);
One more PDF book from the Scribd collection.

Appendix 2. On the main Ruby web site, there is a special section, Ruby From Other Languages. Especially interesting to us are the possible confusions and incompatibilities compared to Python and Perl. A few things, however, are missing from the list of incompatibilities.

Two-dot ranges in Ruby are inclusive while Python ranges are not; that is, in Python `range(2,10)' consists of all numbers from 2 to 9, while in Ruby the similar-in-spirit notation `2..10' denotes all numbers from 2 to 10 inclusive (the three-dot form `2...10' excludes the end). As a result, to drop the last character from a string, for example, in Python we use `str[:-1]' while in Ruby this translates to `str[0..-2]';
Including functions from a file in Ruby has no consequences in terms of namespaces whatsoever. To take advantage of namespaces, you must use `module' explicitly;
Apparently Ruby does not use the popular Perl regular expression library, so its regular expression syntax, while very similar to Perl's, might not be fully compatible. E.g., Ruby (1.8.x) does not support zero-width look-behind assertions;
On the positive side, Ruby does implement the Python-style '%' operator for parameter substitution, though it is not used anywhere in the standard documentation; instead, Ruby's own parameter substitution (`x = "#{my_var + 57}"', which requires double quotes) is used. I prefer the Python style; besides, it allows for formatted output;
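
The range difference is easy to check with a few lines. This is only a sketch; `chop_last' is a hypothetical helper name, and in Ruby 1.8 the indices count bytes, so it drops a whole character only for single-byte text.

```ruby
# Two dots include the end point, three dots exclude it
inclusive = (2..10).to_a    # 2 through 10
exclusive = (2...10).to_a   # 2 through 9, like Python's range(2, 10)

# Hypothetical helper: drop the last element of a string, as str[:-1] does in Python
def chop_last(str)
  str[0..-2]
end

puts inclusive.last       # 10
puts exclusive.last       # 9
puts chop_last("hello")   # hell
```

The three-dot form `str[0...-1]' gives the same result and reads closer to the Python slice.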

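The two substitution styles can be compared side by side in a short sketch; the method name and the sample values are hypothetical. Note that interpolation with #{...} only happens inside double quotes.

```ruby
# Hypothetical example: Python-style formatting with String#%,
# which takes a single value or an array of values.
def format_report(count, ratio, name)
  "x = %d, ratio = %.2f, name = %-6s|" % [count, ratio, name]
end

my_var = 5

# Ruby's own interpolation: evaluated in double quotes, kept literally in single quotes
interpolated = "x = #{my_var + 57}"
literal      = 'x = #{my_var + 57}'

puts interpolated
puts literal
puts format_report(my_var + 57, 2.0 / 3, "ruby")
```

The format string accepts the usual printf-style width and precision flags, which is what makes it handy for aligned or rounded output.
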
(Happy Ruby programming! :-)

UPD [18-Jun-07]. Slashdot reviews new book "Practical Ruby Gems"

Labels: ruby, unicode

# posted by Kostya @ 5:38 PM 0 comments
