Migrating your Rails application to Unicode
Wednesday, October 25, 2006 at 4:46PM

So you've got this Rails application you've been developing and all of a sudden you need to support Unicode. After all, not everybody speaks English. And some really awkward people like all sorts of typographic symbols in their medical articles. In fact, you wouldn't believe all the weird characters these print-production-oriented people like to use…
Most of the instructions here were gleaned from a [jabbering giraffe](http://happygiraffe.net/blog/archives/2006/09/16/unicode-for-rails) and the [notes I wrote up from his talk](/2006/10/11/railsconf-europe-2006-unicode-for-rails-dominic-mitchell/). But I like to think I've had a bright idea of my own. :-) Note that these instructions assume you're using Ruby 1.8.x, MySQL >= 5 and edge (soon to be 1.2) Rails.
OK, so to get Rails basically talking UTF-8, you have to do a couple of things. Firstly, make Ruby itself a little bit Unicode-aware, by sticking the following in `config/environment.rb`:
$KCODE = 'u'
We also need to tell ActiveRecord that the connection it should open to MySQL should be UTF-8 encoded. This is done by putting the following in each of your database stanzas in `config/database.yml`:
encoding: utf8
Finally, from a setup perspective, we need to migrate the current database to one which uses UTF-8 encoding internally. This is what I consider to be my 'smart' bit. :-) Create yourself a migration:
script/generate migration make_unicode_friendly
then paste in the following code:
class MakeUnicodeFriendly < ActiveRecord::Migration
  def self.up
    alter_database_and_tables_charsets "utf8", "utf8_general_ci"
  end

  def self.down
    alter_database_and_tables_charsets
  end

  private

  def self.alter_database_and_tables_charsets charset = default_charset, collation = default_collation
    case connection.adapter_name
    when 'MySQL'
      execute "ALTER DATABASE #{connection.current_database} CHARACTER SET #{charset} COLLATE #{collation}"
      connection.tables.each do |table|
        execute "ALTER TABLE #{table} CONVERT TO CHARACTER SET #{charset} COLLATE #{collation}"
      end
    else
      # OK, not quite irreversible but can't be done if there's not
      # the code here to support it...
      raise ActiveRecord::IrreversibleMigration.new("Migration error: Unsupported database for migration to UTF-8 support")
    end
  end

  def self.default_charset
    case connection.adapter_name
    when 'MySQL'
      execute("show variables like 'character_set_server'").fetch_hash['Value']
    else
      nil
    end
  end

  def self.default_collation
    case connection.adapter_name
    when 'MySQL'
      execute("show variables like 'collation_server'").fetch_hash['Value']
    else
      nil
    end
  end

  def self.connection
    ActiveRecord::Base.connection
  end
end
This migrates the current database to using UTF-8 with general, case-insensitive collation, which affects the creation of future tables. It also updates each of the current tables, converting their contents to UTF-8 too.
And it's reversible. Well, mostly. It assumes that the character set you were using previously was the server's default (which will be the case unless you explicitly specified a character set or collation when you created the database), and reverts to that. Of course, a backward migration may well be lossy, because characters with no equivalent in the old character set can't survive the trip, so be careful trying that.
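For a rough picture of what the migration sends to MySQL, here's a minimal sketch that only builds the SQL strings, so it runs without a database. The `charset_statements` helper and the database and table names are made up for illustration; the real migration gets those values from `connection.current_database` and `connection.tables`.

```ruby
# Hypothetical helper, not part of the migration: it builds the same
# statements the migration executes, but returns them as strings.
def charset_statements(database, tables, charset, collation)
  # One statement changes the database default (used by future tables)...
  statements = ["ALTER DATABASE #{database} CHARACTER SET #{charset} COLLATE #{collation}"]
  # ...and one per existing table converts its contents.
  tables.each do |table|
    statements << "ALTER TABLE #{table} CONVERT TO CHARACTER SET #{charset} COLLATE #{collation}"
  end
  statements
end

# Database and table names below are assumptions for the example.
puts charset_statements("blog_development", %w[articles comments], "utf8", "utf8_general_ci")
```

Printing the statements first like this is also a cheap way to check what would run before pointing the real thing at a database you care about.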
The next bit is the tricky one. Most of the Ruby string functions aren't Unicode-aware. They'll quite happily `slice` up multi-byte characters. Fortunately edge rails now extends `String` to provide a `chars` method which returns an [`ActiveSupport::Multibyte::Chars`](http://multibyterails.org/documentation/activesupport_multibyte/classes/ActiveSupport/Multibyte/Chars.html) object. It walks like a string and talks like a string, but is multibyte aware. Nice. Apparently there's active work going on in the core to get internal Rails stuff to use this new functionality, so hopefully it should be pretty good soon.
Hopefully it should be good enough for me to use just now...
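To see what slicing up a multi-byte character looks like, here's a small sketch. It's an approximation under one big assumption: it runs on Ruby 1.9 or later, where `String` is already encoding-aware, and uses `byteslice` to mimic the byte-oriented slicing that Ruby 1.8 does. The `byte_cut` and `char_cut` helper names are mine, purely for illustration.

```ruby
# Assumes Ruby 1.9+; byteslice stands in for Ruby 1.8's byte-based slicing.

# Take the first n bytes, ignoring character boundaries.
def byte_cut(str, n)
  str.byteslice(0, n)
end

# Take the first n characters, keeping multi-byte characters whole.
def char_cut(str, n)
  str[0, n]
end

word = "h\u00e9llo"   # "héllo"; the é occupies two bytes in UTF-8

puts word.bytesize                        # 6
puts word.length                          # 5
puts byte_cut(word, 2).valid_encoding?    # false: the é was cut in half
puts char_cut(word, 2)                    # hé
```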
Geekery,
Ruby and Rails,
Work
Reader Comments (10)
Pete: Thanks for pointing out a scenario I hadn't considered. I didn't realise Rails defaults to serving pages in UTF-8 so I've now done a little investigation.
With the current stable Rails (1.1.6), running in development mode, lighttpd doesn't specify a character set in the Content-Type header. However, WEBrick does specify UTF-8. In edge Rails, you're absolutely right. lighttpd, mongrel and WEBrick all serve content as UTF-8 by default.
Of course, this might or might not be the case in your production environment, depending upon how it's set up. Fortunately in my production environment, the content type is not specified -- I wonder what it defaults to if it's not specified? ASCII? -- because up until this application I still have in development, I'd never had to consider 'funny' characters. :-)
So how do you deal with the situation you describe?
Great post, and I have shamelessly stolen your idea of raising an error based on the adapter name for some MySQL specific migrations I have.
[...] source: es.wikipedia source: woss.name [...]
The migration approach was a great idea
[...] And finally, if you have a Rails application that isn't UTF-8, Graeme Mathieson wrote some migrations that help you migrate your Rails application to Unicode. [...]
Say your database was encoded as latin1, but you fed it utf8 data without telling it, which is what Pete above is talking about.
So, instead of telling mysql to ALTER TABLE contacts CONVERT TO CHARACTER SET utf8, you should do the following:
mysqldump --default-character-set=latin1 {database} {table} | sed 's/latin1/utf8/' | mysql {database}
It's very important to use the right value for --default-character-set in the call to mysqldump. It should be your currently incorrect encoding (latin1, probably). By telling mysqldump to use the same encoding the tables are using, the output will be the untranscoded byte stream, which is exactly what you want. Then you take that output and change it so that it actually claims to be utf8 data, and feed it back to mysql.
Done, problem solved. Your data is now properly encoded utf8 in properly encoded tables.
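The difference between converting and relabelling can be mimicked in plain Ruby. This is only a sketch of the idea, not of what MySQL does internally: it assumes Ruby 1.9 or later, uses ISO-8859-1 as a stand-in for MySQL's latin1, and the two helper names are invented for the example.

```ruby
# Assumes Ruby 1.9+. ISO-8859-1 is a stand-in for MySQL's latin1.

# What CONVERT TO effectively does to mislabelled data: treat every byte
# as a latin1 character and transcode it to UTF-8.
def transcode_from_latin1(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# What the dump-and-relabel trick does: keep the bytes, change the label.
def relabel_as_utf8(bytes)
  bytes.dup.force_encoding("UTF-8")
end

# The raw bytes of "é" as a UTF-8 client would have sent them.
stored = "\u00e9".b

puts transcode_from_latin1(stored)   # Ã© (double-encoded)
puts relabel_as_utf8(stored)         # é
```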
[...] 3. Drop your database and rake db:migrate again. You might be able to convert the DB to UTF-8 (see this post, but also read the warning before proceeding). [...]
What if I want to set the default encoding and collation before creating any tables? I am hoping MySQL will use these when new tables and string columns are created. Would this work:
def self.up
  execute "ALTER DATABASE #{ActiveRecord::Base.connection.current_database} CHARACTER SET utf8 COLLATE utf8_bin"
end
Also, would it indeed (in MySQL) set all subsequent tables and columns to these settings?
Thanks.
Based on my research, I wrote a migration that actually recreates the DB, you can check it out here:
http://snippets.dzone.com/posts/show/6070
Thank you, your post was really helpful to me. I'd been trying to do exactly this for over an hour, always missing out on some detail or another :)