Natural Code

Code, science and politics.

Ruby 1.9 and Unicode: The BOM Will Fuck Your Shit Up

So I’ve been playing around with the things mentioned in the title, and I found out something unfortunate when I moved a UTF-8 encoded file from a Ruby 1.9 machine to a Ruby 1.8 machine.

There’s this thing called a Byte Order Mark (BOM) that some text editors put in a file, apparently to remind themselves of the file’s UTF-8 encoding. I’m pretty sure it’s useless, because UTF-8 doesn’t actually have a variable byte order to keep track of, but there you go.

Basically, it’s 3 bytes (EF BB BF in hex) that the text editor inserts at the beginning of a text file, and then hides from you. It might look like a plain text file, but it’s actually got 3 hidden bytes for no good reason. When you try to run it through the Ruby 1.8 interpreter, it’ll see 3 invalid characters on line 1 and throw an error right away.
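Since editors hide those bytes, the easiest way to see them is to ask Ruby to read the start of the file in binary mode. Here's a minimal sketch of that check. The helper name `bom?` and the constant `UTF8_BOM` are made up for this example, not part of Ruby.

```ruby
# The UTF-8 BOM is the byte sequence EF BB BF.
UTF8_BOM = [0xEF, 0xBB, 0xBF].freeze

# Hypothetical helper: returns true if the file at `path` starts with a UTF-8 BOM.
def bom?(path)
  # "rb" reads raw bytes, so nothing gets decoded or hidden along the way.
  first_bytes = File.open(path, "rb") { |f| f.read(3) }
  # read(3) returns nil for an empty file, hence the to_s.
  first_bytes.to_s.bytes.to_a == UTF8_BOM
end

# Usage (assuming a script called my_script.rb exists):
#   puts bom?("my_script.rb")
```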

This sort of error message is pretty unhelpful, especially when you appear to have nothing at all on Line 1. You might enable visible whitespace: still nothing. You might try opening it in another text editor or IDE: you will likely still not see the problem, as the only program I’ve tried so far that doesn’t hide the BOM is NetBeans.

SciTE has two different UTF-8 encoding settings: UTF-8 and UTF-8 Cookie. In theory, the plain UTF-8 setting uses a Byte Order Mark, while the UTF-8 Cookie setting doesn’t. In practice, the choice doesn’t seem to affect whether or not the Ruby interpreter chokes on the file, at least not with Ruby 1.8.

With 1.9 I’ve still had problems once or twice, but only of the kind that could be fixed by closing the text editor, opening the file in NetBeans, removing the BOM, and restarting the text editor.
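If you'd rather not fire up an IDE just to delete 3 bytes, the same cleanup can be sketched in Ruby itself. This is only a rough example, assuming a Ruby recent enough to have `File.binread` and `File.binwrite`. The helper name `strip_bom` and the constant `BOM_BYTES` are made up for illustration.

```ruby
# The UTF-8 BOM as a raw (binary) string: EF BB BF.
BOM_BYTES = [0xEF, 0xBB, 0xBF].pack("C*")

# Hypothetical helper: removes a leading UTF-8 BOM from the file at `path`.
# Returns true if a BOM was removed, false if the file had none.
def strip_bom(path)
  data = File.binread(path)              # raw bytes, no decoding
  return false unless data.start_with?(BOM_BYTES)

  File.binwrite(path, data.byteslice(3..-1))  # everything after the 3 BOM bytes
  true
end

# Usage (assuming a script called my_script.rb exists):
#   strip_bom("my_script.rb")
```

Close the file in your text editor first, or the editor may happily write the BOM back on the next save.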

It’s not perfect, but at least it works now, even if it’s very slightly buggy.

August 30, 2008 Posted by naturalcode | Technology | 3 Comments

Ruby 1.9 and Code Generation: How I Learned to Stop Worrying and Love Unicode

So I was working on this Ruby-based tool for generating NetBeans-compliant Swing app projects. Basically, I create a file that looks like this:

require 'java_swing'

Swing.app 'Project03AK', :subtitle => 'Laptop lending tracker',
    :desc => 'This program keeps track of laptops borrowed by students.' do
  # Insert code here
end

I run this script, and it generates a NetBeans project with a main class that’s a Swing window, automatically centered and titled. The project and the window all have nice, clean, standardized names. Everything was going great until I got to the part where I started inserting comments in the generated Java code.

Basically, I have this Ruby script that inserts the arguments passed to Swing.app into a bunch of templates, and uses the resulting text to generate both the Java code and the related NetBeans project files. The problem here is that both Ruby and SciTE, my text editor, treat text as a single-byte, ASCII-based encoding by default, whereas NetBeans encodes text in UTF-8.

That’s fine as long as Ruby is only generating code that uses the 26 English letters and regular English punctuation, but as soon as you start using things like àccéntêd characters, NetBeans interprets it as gibberish. I go to a French school, and my professors do not accept me handing in gibberish (except for VB code), so this is a problem.
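To see why the accents turn into gibberish, it helps to look at the actual bytes. Here's a small sketch, assuming the editor's single-byte encoding is ISO-8859-1 (Latin-1). That's my guess for illustration; your editor's default code page may differ.

```ruby
# "\u00E9" is é, written as an escape so the example doesn't depend on
# how this file itself is saved.
e_acute = "\u00E9"

utf8_bytes   = e_acute.encode("UTF-8").bytes.to_a       # two bytes: 0xC3 0xA9
latin1_bytes = e_acute.encode("ISO-8859-1").bytes.to_a  # one byte:  0xE9

puts utf8_bytes.inspect    # [195, 169]
puts latin1_bytes.inspect  # [233]

# Pretend to be a UTF-8 reader that gets handed the Latin-1 byte:
misread = latin1_bytes.pack("C*").force_encoding("UTF-8")
puts misread.valid_encoding?  # false: a lone 0xE9 is not valid UTF-8
```

Plain English text never hits this, because ASCII characters are the same single byte in both encodings.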

If you don’t know/care about any of these encoding schemes or non-English characters, you need to read this. I did a few hours ago, and it helped me figure all of this out.

Basically, the solution is to install Ruby 1.9, which has Unicode support, and then go to File->Encoding->UTF-8 in SciTE. An é in the text editor will then be written to the generated Java files as a UTF-8 é, which will then be correctly interpreted as an é by NetBeans.
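On the Ruby side, you can also be explicit about the encoding instead of relying on defaults. This is a rough sketch, not the actual generator code: the helper `write_java_source` and the sample comment are made up for this example. The `# encoding: utf-8` magic comment and the `"w:UTF-8"` open mode are standard Ruby 1.9+ features.

```ruby
# encoding: utf-8
# The magic comment above tells Ruby 1.9 that string literals in this
# file are UTF-8. It has to be the first line of the script.

# Hypothetical helper: writes generated Java source to `path` as UTF-8.
def write_java_source(path, source)
  # "w:UTF-8" sets the file's external encoding, so Ruby writes UTF-8
  # bytes (transcoding the string first if it is in another encoding).
  File.open(path, "w:UTF-8") { |f| f.write(source) }
end

# Made-up sample of generated text containing accented characters.
SAMPLE_COMMENT = "// Suivi des prêts d'ordinateurs portables\n"

# Usage:
#   write_java_source("Main.java", SAMPLE_COMMENT)
```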

August 24, 2008 Posted by naturalcode | Technology | 4 Comments