Waylaid by the BOM in UTF-8
From my install process, which had to be hacked, to my WordPress feeds, which are still acting up regularly, rather like brief blackouts, I was repeatedly told that there was something wrong with my code or file configuration. This caused me to spend all my usual blogging hours, and a few nights, digging through every single file. There are a couple of hundred of these, only a dozen of which I had initially worked with for the setup and theme.
All that time spent wasn’t necessarily a bad thing (although I’m suffering from withdrawal in missing many of my favorite reads) since it afforded me an even deeper appreciation of what has been created by the WordPress community.
Why didn’t my index files work? Nobody knew, so we just kept making new ones.
What were those way too long strings that doubled back in error messages? Must be a non-tech newbie code mistake I made somewhere.
What was that —>  thing that kept appearing before certain pages loaded? Tech people just shrugged.
That thing represents the hidden encoding characters at the very start of a UTF-8 file that has been saved with a BOM, or byte order mark. I’ve also now read that PHP doesn’t handle the BOM well.
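For anyone who wants to see the bytes themselves, here is a minimal sketch in Python (my choice for illustration only; it has nothing to do with WordPress or PHP). It shows the three bytes that make up the UTF-8 BOM and what they look like when software misreads them as Windows-1252 text. The function name is made up for this example.

```python
import codecs

# The UTF-8 byte order mark is the three bytes EF BB BF.
BOM = codecs.BOM_UTF8


def bom_as_seen_in_cp1252():
    """Decode the BOM bytes as Windows-1252, the way a confused page might."""
    return BOM.decode("cp1252")


print(BOM.hex())                # efbbbf
print(bom_as_seen_in_cp1252())  # ï»¿
```

If stray characters like these show up at the top of a page or feed, a BOM at the start of one of the files is a likely suspect.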
Although I found many entries in help forums by webmasters waylaid by the BOM, the only formal FAQs I’ve found on it are from Sun and Unicode. The Wikipedia entry refers to this being a problem with Unix and not Windows servers, and I’ve read that including the BOM in UTF-8 by default was one of those unilateral Microsoft decisions. Here also is a post by WordPress blogger Pierre, and a related issue post on translating character sets and collation in WordPress.
I first found evidence by downloading one of the suspected problem files and opening it in WebTide, which showed me Unicode hard break characters before the file content. They didn’t appear on the page, but rather in the code schema window.
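Checking a couple of hundred files by hand is slow, so here is a rough sketch of how the same check could be automated. It assumes you have downloaded a copy of the site files to a local folder and that the files of interest end in .php. The function names and the "wordpress" folder name are hypothetical, just for this example.

```python
import codecs
from pathlib import Path


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def find_bom_files(root, pattern="*.php"):
    """Return the files under root matching pattern that begin with a BOM."""
    return sorted(
        p for p in Path(root).rglob(pattern)
        if p.is_file() and has_utf8_bom(p)
    )


# Example use, assuming a local copy of the site in a folder named "wordpress":
# for path in find_bom_files("wordpress"):
#     print(path)
```

The files are opened in binary mode on purpose, so that no editor or decoder gets a chance to hide the BOM before the check runs.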
Getting rid of the BOM is fairly simple, although time-consuming. Open a new file that can be saved as plain text with no encoding, and paste the problem file’s contents into it. After saving it, copy the contents again and paste them into a new file in an editor that can save without a BOM. Not all editors will do that; even the fanciest professional ones might require you to know how to script that function. I’m using Notepad++, one of the top downloads on sourceforge.net, and really liking it so far.
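If the copy-and-paste routine gets tedious, the same cleanup can be sketched in a few lines of Python. This is only an illustration, not a tested tool: it assumes the file is UTF-8 with the BOM at the very start, it rewrites the file in place, and the function name is made up. Back up your files before trying anything like it.

```python
import codecs


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM from the file. Return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # No BOM at the start, so leave the file untouched.
    with open(path, "wb") as f:
        # Write everything after the three BOM bytes, unchanged.
        f.write(data[len(codecs.BOM_UTF8):])
    return True
```

Because it works on raw bytes, nothing else in the file is re-encoded or altered; only the first three bytes are dropped.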
There’s a possibility that db files can become corrupted as well. There is a WordPress plugin called UTF-8 DB converter which I think was developed for upgrading from versions earlier than 2.0 (which I think is when WordPress switched from Latin-1 encoding to UTF-8) but I don’t know whether it would help with this issue.
It’ll still be a while before I can get to setting up all my blogrolls and links here, and to joining in as many conversations as usual, but in the meantime, I hope that this post on the BOM issue might make it a bit easier for someone else to learn about than it was for me. I do wonder how much buggy behavior remains a mystery for the moment because of BOM.




March 17th, 2008 at 1:41 am
Cheers for writing this up.
Been tearing my hair out about this with a site I’m building and your write-up about this weird bug has returned some of my sanity.