Natural Code

Code, science and politics.

Ruby 1.9 and Code Generation: How I Learned to Stop Worrying and Love Unicode

So I was working on this Ruby-based tool for generating Netbeans-compliant Swing app projects. Basically, I create a file that looks like this:

require 'java_swing'

Swing.app 'Project03AK', :subtitle => 'Laptop lending tracker',
:desc => 'This program keeps track of laptops borrowed by students.' do
  # Insert code here
end

When I run this script, it generates a NetBeans project whose main class is a Swing window, automatically centered and titled, and both the project and the window get nice, clean, standardized names. Everything was going great until I got to the part where I started inserting comments into the generated Java code.

Basically, I have this Ruby script that inserts the arguments passed to Swing.app into a bunch of templates and uses the resulting text to generate both the Java code and the related NetBeans project files. The problem here is that both Ruby and SciTE, my text editor, encode text in ASCII by default, whereas NetBeans encodes text in UTF-8.
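To make the template step concrete, here is a minimal sketch of how arguments like the ones passed to Swing.app could be dropped into a Java template using ERB from Ruby's standard library. The template text, the JAVA_TEMPLATE constant and the render_main_class helper are all made up for illustration; the tool's real templates are not shown in this post and may work quite differently.

```ruby
# encoding: utf-8
require 'erb'

# Hypothetical template: the tool's real templates are not shown in the post.
JAVA_TEMPLATE = ERB.new(<<-TEMPLATE)
/**
 * <%= name %>: <%= subtitle %>
 * <%= desc %>
 */
public class <%= name %> extends javax.swing.JFrame {
}
TEMPLATE

# Hypothetical helper: fills the template with the arguments given to Swing.app.
# ERB reads the local variables name, subtitle and desc through the binding.
def render_main_class(name, subtitle, desc)
  JAVA_TEMPLATE.result(binding)
end

java_source = render_main_class(
  'Project03AK',
  'Laptop lending tracker',
  'This program keeps track of laptops borrowed by students.'
)
puts java_source
```

Whatever text ends up in desc is copied straight into the Java comment, which is exactly where accented characters start to matter.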

That’s fine as long as Ruby is only generating code that uses the 26 English letters and regular English punctuation, but as soon as you start using things like àccéntêd characters, NetBeans interprets them as gibberish. I go to a French school, and my professors do not accept me handing in gibberish (except for VB code), so this is a problem.
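Here is a small sketch of why plain English text survives while accented characters get mangled. It assumes the editor was saving in a single-byte regional encoding such as ISO-8859-1, which is only an assumption and not something established in this post. The variable names are just for illustration.

```ruby
# encoding: utf-8

utf8_e   = 'é'                           # UTF-8, per the magic comment above
latin1_e = utf8_e.encode('ISO-8859-1')   # same character in a single-byte encoding (assumed)

utf8_bytes   = utf8_e.bytes.to_a
latin1_bytes = latin1_e.bytes.to_a
p utf8_bytes     # prints [195, 169], i.e. 0xC3 0xA9
p latin1_bytes   # prints [233], i.e. 0xE9

# Relabel the single ISO-8859-1 byte as UTF-8, roughly what a UTF-8 reader
# does with a file that was saved in the other encoding.
misread = latin1_e.dup.force_encoding('UTF-8')
p misread.valid_encoding?   # prints false: 0xE9 alone is not valid UTF-8

# Text limited to English letters has identical bytes in both encodings.
ascii = 'Cafe'
ascii_same = ascii.encode('ISO-8859-1').bytes.to_a == ascii.bytes.to_a
p ascii_same   # prints true
```

The same é is one byte in one encoding and two bytes in the other, so a program that guesses the wrong encoding has no way of showing it correctly.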

If you don’t know or care about any of these encoding schemes or non-English characters, you need to read this. I did a few hours ago, and it helped me figure all of this out.

Basically, the solution is to install Ruby 1.9, which has Unicode support, and then go to File->Encoding->UTF-8 in SciTE. An é in the text editor will then be written to the generated Java files as a UTF-8 é, which will then be correctly interpreted as an é by NetBeans.
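A rough sketch of what the fix can look like on the Ruby side, under a few assumptions that are not spelled out above: the script itself is saved as UTF-8, an "encoding: utf-8" magic comment on the first line tells Ruby 1.9 to read the source as UTF-8, and the output file is opened with an explicit UTF-8 external encoding. The Main.java file name and the French comment are placeholders, not the tool's real output.

```ruby
# encoding: utf-8
require 'tmpdir'

# Placeholder text standing in for a generated Java comment.
comment = '// Suivi des prêts de portables aux étudiants'

# Write to a scratch directory so the example leaves no files in the project.
path = File.join(Dir.mktmpdir, 'Main.java')

# 'w:UTF-8' sets the external encoding of the file to UTF-8.
File.open(path, 'w:UTF-8') { |f| f.puts comment }

# Read it back the way a UTF-8 aware tool would.
read_back = File.open(path, 'r:UTF-8') { |f| f.read }
puts read_back.encoding   # prints UTF-8
puts read_back

# Look at the raw bytes: é should be stored as the two bytes 0xC3 0xA9.
raw = File.binread(path)
puts raw.include?("\xC3\xA9".b)   # prints true
```

Checking the raw bytes is a quick way to confirm what actually landed on disk, independent of what any editor chooses to display.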

August 24, 2008 - Posted by naturalcode | Technology | 4 Comments

4 Comments

  1. ASCII is a strict subset of UTF-8, i.e. every ASCII textfile is also a valid UTF-8 textfile, so you’re obviously mixing it up with some other charset. ISO-8859-1(5) maybe?

    Comment by ak | August 24, 2008

  2. Doesn’t look like it. This is what happens when my text editor is on its default settings:

    puts 'Café'.encoding
    >ASCII-8BIT

    Not only that, but when I do File->Encoding->UTF-8, the é and the last apostrophe become an Asian character. Once I erase it and type the é again, this is what happens:

    puts 'Café'.encoding
    >UTF-8

    And NetBeans doesn’t interpret the é properly unless I do it that way. Seems pretty clear-cut to me. I think if anyone is mixing up their charsets, it must be NetBeans or SciTE.

    Comment by naturalcode | August 25, 2008

  3. I think what ak was saying is that 7-bit ASCII (i.e. plain ASCII) is a strict subset of UTF-8. Anything with an accent is not ASCII but belongs to one of the various regional encodings (like ISO-8859-1 for France), which are sometimes referred to as just ASCII-8bit, but which are not compatible with each other.

    Comment by guilhem | August 25, 2008

  4. That makes sense. Probably the text editor has a France regional encoding and NetBeans uses a Canada French encoding, or something like that.

    Comment by naturalcode | August 25, 2008

