Who am I?

Hi, I’m Graeme and these are my notes, from my messy desk. I started this blog because Google proved to be more useful at finding content than anything else I’ve used.

So I started adding my own content in the hopes that Google would index it and allow me to find things again in the future.

It works.

You can find out more about me here, and you should follow me on Twitter here.


Wednesday, 25 October 2006

Migrating your Rails application to Unicode

**Update** Make sure you read the comments on this post before following it. In particular, [Pete](/2006/10/25/migrating-your-rails-application-to-unicode/#comment-13156) raises the concern that applications whose data is already UTF-8, but marked as Latin1 in the database, may run into problems.
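
To make that concern concrete, here is a minimal sketch in plain Ruby. It assumes a modern Ruby (1.9 or later, not the 1.8.x used in the rest of this post) and uses Ruby's Windows-1252 encoding as a stand-in for MySQL's latin1; both are assumptions made for illustration only. It contrasts transcoding bytes that are wrongly labelled as Latin1 with simply relabelling them.

```ruby
# Bytes as they sit on disk: "café" encoded as UTF-8 (the é is 0xC3 0xA9).
stored_bytes = "caf\u00E9".b

# The column claims these bytes are Latin1. Windows-1252 is used here as a
# stand-in for MySQL's latin1 (an assumption for this sketch).
mislabelled = stored_bytes.dup.force_encoding("Windows-1252")

# Transcoding treats every byte as a separate Latin1 character and
# re-encodes it, so the two bytes of é turn into two characters.
transcoded = mislabelled.encode("UTF-8")

# Relabelling leaves the bytes alone and only changes the declared encoding.
relabelled = stored_bytes.dup.force_encoding("UTF-8")

puts transcoded  # mangled text
puts relabelled  # intact text
```

If your data looks like the first line after a conversion, it was probably UTF-8 all along and needed relabelling rather than converting.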

So you've got this Rails application you've been developing and all of a sudden you need to support Unicode. After all, not everybody speaks English. And some really awkward people like all sorts of typographic symbols in their medical articles. In fact, you wouldn't believe all the weird characters these print-production-oriented people like to use…

Most of the instructions here were gleaned from a [jabbering giraffe](http://happygiraffe.net/blog/archives/2006/09/16/unicode-for-rails) and the [notes I wrote up from his talk](/2006/10/11/railsconf-europe-2006-unicode-for-rails-dominic-mitchell/). But I like to think I've had a bright idea of my own. :-) Note that these instructions assume you're using Ruby 1.8.x, MySQL >= 5 and edge (soon to be 1.2) Rails.

OK, so to get Rails basically talking UTF-8, you have to do a couple of things. Firstly, make Ruby itself a little bit Unicode-aware, by sticking the following in `config/environment.rb`:

    $KCODE = 'u'

We also need to tell ActiveRecord that the connection it should open to MySQL should be UTF-8 encoded. This is done by putting the following in each of your database stanzas in `config/database.yml`:

    encoding: utf8
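
Since the setting has to appear in every stanza, it is easy to miss one. The following is a rough sketch of a sanity check, assuming a modern Ruby; the sample configuration, database names and the `stanzas_missing_utf8` helper are all hypothetical and only there to show the idea.

```ruby
require 'yaml'

# Hypothetical database.yml contents, inlined so the sketch is self-contained.
SAMPLE_CONFIG = <<~YAML
  development:
    adapter: mysql
    database: myapp_development
    encoding: utf8
  test:
    adapter: mysql
    database: myapp_test
    encoding: utf8
  production:
    adapter: mysql
    database: myapp_production
YAML

# Returns the names of the stanzas that do not ask for a utf8 connection.
def stanzas_missing_utf8(config)
  config.reject { |_name, settings| settings['encoding'] == 'utf8' }.keys
end

missing = stanzas_missing_utf8(YAML.load(SAMPLE_CONFIG))
puts "Stanzas without encoding: utf8: #{missing.inspect}"
```

In a real application you would load `config/database.yml` from disk instead of the inline sample.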

Finally, from a setup perspective, we need to migrate the current database to one which uses UTF-8 encoding internally. This is what I consider to be my 'smart' bit. :-) Create yourself a migration:

    script/generate migration make_unicode_friendly

then paste in the following code:

    class MakeUnicodeFriendly < ActiveRecord::Migration
      def self.up
        alter_database_and_tables_charsets "utf8", "utf8_general_ci"
      end

      # Revert to the server's default character set and collation.
      def self.down
        alter_database_and_tables_charsets
      end

      # Note: `private` has no effect on methods defined with `def self.`;
      # it is kept here only to mark the helpers below as internal.
      private
      def self.alter_database_and_tables_charsets charset = default_charset, collation = default_collation
        case connection.adapter_name
        when 'MySQL'
          # Change the default for tables created from now on...
          execute "ALTER DATABASE #{connection.current_database} CHARACTER SET #{charset} COLLATE #{collation}"

          # ...and convert the contents of every existing table.
          connection.tables.each do |table|
            execute "ALTER TABLE #{table} CONVERT TO CHARACTER SET #{charset} COLLATE #{collation}"
          end
        else
          # OK, not quite irreversible but can't be done if there's not
          # the code here to support it...
          raise ActiveRecord::IrreversibleMigration.new("Migration error: Unsupported database for migration to UTF-8 support")
        end
      end

      def self.default_charset
        case connection.adapter_name
        when 'MySQL'
          execute("show variables like 'character_set_server'").fetch_hash['Value']
        else
          nil
        end
      end

      def self.default_collation
        case connection.adapter_name
        when 'MySQL'
          execute("show variables like 'collation_server'").fetch_hash['Value']
        else
          nil
        end
      end

      def self.connection
        ActiveRecord::Base.connection
      end
    end

This migrates the current database to using UTF-8 with general, case-insensitive collation, which affects the creation of future tables. It also updates each of the current tables, converting their contents to UTF-8 too.
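
If you want to see what the migration will send to MySQL before running it, the statement building can be pulled out into a plain function. This is only a sketch: `charset_statements` is a hypothetical helper, and the database and table names are made up.

```ruby
# Builds the statements the migration would pass to `execute`, without
# touching a database. The database and table names are hypothetical.
def charset_statements(database, tables, charset, collation)
  statements = ["ALTER DATABASE #{database} CHARACTER SET #{charset} COLLATE #{collation}"]
  tables.each do |table|
    statements << "ALTER TABLE #{table} CONVERT TO CHARACTER SET #{charset} COLLATE #{collation}"
  end
  statements
end

statements = charset_statements("myapp_development", %w[users articles], "utf8", "utf8_general_ci")
puts statements
```

One `ALTER DATABASE` sets the default for future tables, followed by one `ALTER TABLE ... CONVERT TO` per existing table.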

And it's reversible. Well, mostly. It assumes that the character set you were using previously was the server's default (which will be the case unless you explicitly specified a character set or collation when creating the database), and reverts to that. Of course, a backward migration may well be lossy, so be careful if you try it.
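
To get a feel for what "lossy" means here, this small sketch round-trips a string through a single-byte encoding in plain Ruby. It assumes a modern Ruby and uses Windows-1252 as a stand-in for MySQL's latin1; the `?` replacement is chosen for illustration and MySQL's exact behaviour may differ.

```ruby
original = "caf\u00E9 \u03BB"  # "café λ": é exists in Latin1, λ does not

# Windows-1252 stands in for MySQL's latin1 (an assumption for this sketch).
# Characters with no equivalent in the target encoding become "?".
downgraded = original.encode("Windows-1252", undef: :replace, replace: "?")

# Converting back to UTF-8 cannot recover what was replaced.
round_tripped = downgraded.encode("UTF-8")

puts round_tripped
puts round_tripped == original
```

Anything that only exists in Unicode is gone for good after the downward conversion, so take a backup first.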

The next bit is the tricky one. Most of the Ruby string functions aren't Unicode-aware. They'll quite happily `slice` up multi-byte characters. Fortunately, edge Rails now extends `String` to provide a `chars` method which returns an [`ActiveSupport::Multibyte::Chars`](http://multibyterails.org/documentation/activesupport_multibyte/classes/ActiveSupport/Multibyte/Chars.html) object. It walks like a string and talks like a string, but is multibyte-aware. Nice. Apparently there's active work going on in the core to get internal Rails stuff to use this new functionality, so hopefully it should be pretty good soon.
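
Here is a small sketch of the difference between slicing bytes and slicing characters. It does not use the `ActiveSupport::Multibyte` API itself; it assumes a modern Ruby (1.9 or later), where `byteslice` imitates the byte-oriented strings of Ruby 1.8 and ordinary indexing is already character-aware.

```ruby
text = "na\u00EFve"  # "naïve": the ï takes two bytes in UTF-8

# Byte-oriented slicing, as Ruby 1.8 strings behave: the third byte is only
# the first half of ï, so the result is not valid UTF-8.
byte_slice = text.byteslice(0, 3)

# Character-aware slicing keeps the ï whole.
char_slice = text[0, 3]

puts byte_slice.valid_encoding?
puts char_slice
```

The byte slice ends in half a character, which is exactly the kind of damage a multibyte-aware wrapper is meant to prevent.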

Hopefully it should be good enough for me to use just now...


Reader Comments (10)

Pete: Thanks for pointing out a scenario I hadn't considered. I didn't realise Rails defaults to serving pages in UTF-8 so I've now done a little investigation.

With the current stable Rails (1.1.6), running in development mode, lighttpd doesn't specify a character set in the Content-Type header. However, WEBrick does specify UTF-8. In edge Rails, you're absolutely right. lighttpd, mongrel and WEBrick all serve content as UTF-8 by default.

Of course, this might or might not be the case in your production environment, depending upon how it's set up. Fortunately in my production environment, the content type is not specified -- I wonder what it defaults to if it's not specified? ASCII? -- because up until this application I still have in development, I'd never had to consider 'funny' characters. :-)

So how do you deal with the situation you describe?

October 27, 2006 | Unregistered Commenter mathie

Great post, and I have shamelessly stolen your idea of raising an error based on the adapter name for some MySQL specific migrations I have.

October 27, 2006 | Unregistered Commenter Dave Verwer

[...] source: es.wikipedia source: woss.name [...]

November 1, 2006 | Unregistered Commenter UTF8 en RoR « 3eq11

The migration approach was a great idea.

November 1, 2006 | Unregistered Commenter Edgar

[...] And finally, if you have a Rails application that isn't UTF8, Graeme Mathieson wrote some migrations that help you migrate your Rails application to Unicode. [...]

November 1, 2006 | Unregistered Commenter RubyOnRails y UTF8

Say your database was encoded as latin1, but you fed it utf8 data without telling it, which is what Pete above is talking about.

So, instead of telling mysql to ALTER TABLE contacts CONVERT TO CHARACTER SET utf8, you should do the following:

    mysqldump --default-character-set=latin1 {database} {table} | sed 's/latin1/utf8/' | mysql {database}

It's very important to use the right value for --default-character-set in the call to mysqldump. It should be your currently incorrect encoding (latin1, probably). By telling mysqldump to use the same encoding the tables are using, the output will be the untranscoded byte stream, which is exactly what you want. Then you take that output and change it so that it actually claims to be utf8 data, and feed it back to mysql.

Done, problem solved. Your data is now properly encoded utf8 in properly encoded tables.

November 6, 2006 | Unregistered Commenter Sebastian

[...] 3. Drop your database and rake db:migrate again. You might be able to convert the DB to UTF-8 (see this post, but also read the warning before proceeding). [...]

February 1, 2007 | Unregistered Commenter UTF-8 and Rails « Riff W

What if I want to set the default encoding and collation before creating any tables? I am hoping it will use these when new tables and string columns are created. Would this work:

    def self.up
      execute "ALTER DATABASE #{ActiveRecord::Base.connection.current_database} CHARACTER SET utf8 COLLATE utf8_bin"
    end

Also, would it indeed (in MySQL) set all subsequent tables and columns to these settings?

Thanks.

May 10, 2008 | Unregistered Commenter G. Gibson

Based on my research, I wrote a migration that actually recreates the DB, you can check it out here:
http://snippets.dzone.com/posts/show/6070

September 13, 2008 | Unregistered Commenter Remi

Thank you, your post was really helpful to me. I'd been trying to do exactly this for over an hour, always missing out on some detail or another :)

October 14, 2008 | Unregistered Commenter Mihail Minkov
