Ruby 1.9 and Unicode: The BOM Will Fuck Your Shit Up
So I’ve been playing around with the things mentioned in the title, and I found out something unfortunate when I moved a UTF-8 encoded file from a Ruby 1.9 machine to a Ruby 1.8 machine.
There’s this thing called a Byte Order Mark (BOM) that some text editors write at the start of a file, apparently to flag the file as UTF-8. I’m pretty sure it’s useless as a byte order indicator, because UTF-8 doesn’t actually have a variable byte order to keep track of, but there you go.
Basically, it’s 3 bytes (0xEF 0xBB 0xBF) that the text editor inserts at the beginning of a text file, and then hides from you. It might look like a plain text file, but it’s actually got 3 hidden bytes for no good reason. When you try to run it through the Ruby 1.8 interpreter, it’ll see 3 invalid characters on Line 1 and throw an error right away.
This sort of error message is pretty unhelpful, especially when you appear to have nothing at all on Line 1. You might enable visible whitespace: still nothing. You might try opening it in another text editor or IDE: you will likely still not see the problem, as the only program I’ve tried so far that doesn’t hide the BOM is NetBeans.
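Since most editors won’t show you the BOM, one way to find out whether it’s there is to look at the raw bytes. Here’s a rough sketch that reads the first 3 bytes of a file in binary mode and compares them to 0xEF 0xBB 0xBF. The helper name has_utf8_bom? and the constant UTF8_BOM are just names I made up for this example.

```ruby
# The UTF-8 byte order mark as a raw 3-byte string.
UTF8_BOM = [0xEF, 0xBB, 0xBF].pack("C*")

# Returns true if the file at `path` starts with the UTF-8 BOM.
# (has_utf8_bom? is a hypothetical helper, not part of Ruby.)
def has_utf8_bom?(path)
  # "rb" reads raw bytes, so no encoding conversion gets in the way.
  # read(3) returns nil for an empty file, which compares as false.
  File.open(path, "rb") { |f| f.read(3) } == UTF8_BOM
end

# Example usage (the file name is only a placeholder):
#   puts has_utf8_bom?("script.rb")
```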
SciTE has two different UTF-8 encoding settings: UTF-8 and UTF-8 Cookie. In theory, the plain UTF-8 setting uses a Byte Order Mark, while the UTF-8 Cookie setting doesn’t. In practice, the choice doesn’t seem to affect whether or not the Ruby interpreter chokes on the file, at least not with Ruby 1.8.
With 1.9 I’ve still had problems one or two times, but of the kind that could be fixed by closing the text editor, opening the file in NetBeans, removing the BOM, and restarting the text editor.
It’s not perfect, but at least it works now, even if it’s very slightly buggy.
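If you’d rather not open NetBeans just to delete 3 bytes, the same cleanup can be scripted. This is a minimal sketch, assuming the file is small enough to read into memory; strip_utf8_bom is a hypothetical helper name, and it rewrites the file in place, so try it on a copy first.

```ruby
# Removes a leading UTF-8 BOM from the file at `path`, in place.
# Returns true if a BOM was removed, false if there was none.
# (strip_utf8_bom is a hypothetical helper, not part of Ruby.)
def strip_utf8_bom(path)
  bom = [0xEF, 0xBB, 0xBF].pack("C*")

  # Read the whole file as raw bytes.
  data = File.open(path, "rb") { |f| f.read }
  return false unless data[0, 3] == bom

  # Write back everything after the first 3 bytes.
  File.open(path, "wb") { |f| f.write(data[3..-1]) }
  true
end

# Example usage (the file name is only a placeholder):
#   strip_utf8_bom("script.rb")
```

If you only need to read such a file from Ruby rather than fix it, newer versions of Ruby also accept "r:BOM|UTF-8" as a File.open mode, which skips a leading BOM while reading.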
3 Comments


“I’m pretty sure it’s useless, because UTF-8 doesn’t actually have a variable byte order to keep track of”
This is incorrect; UTF-8 can get away with a single byte for commonly-used characters in Western European scripts, but there are many more characters in Unicode than a single byte can represent. As a result, UTF-8 can use anywhere from one to four bytes to encode a particular Unicode character; hence UTF-8 is a variable-length encoding (as is UTF-16; only UTF-32 can guarantee that it uses the same number of bytes for every character) and can encounter byte-order issues which the BOM solves.
Actually, no, naturalcode is correct. While UTF-8 is indeed a variable-length encoding, there are no byte order issues. The specification clearly describes the order in which the bytes must appear in the stream, and that order does not depend on the endianness of the processing machine.
Put another way, UTF-8 does not represent characters as multi-byte numerical words, as UTF-16 does. It’s a variable-length stream of bytes. The UTF-8 file cannot be interpreted differently if the BOM is 0xFE 0xFF or 0xFF 0xFE.
I also don’t understand why some editors put a BOM in UTF-8 files. I suppose that since the BOM is, itself, encoded as UTF-8 (and therefore doesn’t even appear as a raw, two-byte 0xFE 0xFF or 0xFF 0xFE in the file), a processor can see the 0xEF 0xBB 0xBF sequence and recognize the file as UTF-8. But really…
http://en.wikipedia.org/wiki/Byte-order_mark
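To make the byte sequences above concrete, here’s a small Ruby sketch (1.9 or later) that encodes the BOM character U+FEFF three ways and prints the resulting bytes. hex_bytes is a hypothetical helper written for this example. The UTF-16 output flips with the byte order, while the UTF-8 output is always the same three bytes.

```ruby
# Formats a string's bytes as space-separated hex.
# (hex_bytes is a hypothetical helper for this example.)
def hex_bytes(str)
  str.bytes.map { |b| "%02X" % b }.join(" ")
end

# U+FEFF, the BOM character, as a UTF-8 encoded string.
bom = [0xFEFF].pack("U")

puts hex_bytes(bom)                     # EF BB BF
puts hex_bytes(bom.encode("UTF-16BE"))  # FE FF
puts hex_bytes(bom.encode("UTF-16LE"))  # FF FE
```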