Search

Enter a search word or two and press return to see the search results.

Who am I?

Hi, I’m Graeme and these are my notes, from my messy desk. I started this blog because Google proved to be more useful at finding content than anything else I’ve used.

So I started adding my own content in the hopes that Google would index it and allow me to find things again in the future.

It works.

You can find out more about me here, and you should follow me on Twitter here.

Keeping up

You can automatically receive new content here by subscribing to the “Blog RSS” (link below). This is the easiest way to keep up with what I write here.  See this BBC article for a good introduction on RSS and keeping up with the goings on of the Internet more easily.

« Yet another hectic weekend! | Main | Just not my day »
Friday
Nov042005

Dire not-understanding of Unicode

OK, I think I just plain don't understand Unicode, character sets, encodings, or anything of that nature. What I'm actually trying to achieve is to store Unicode data in a MySQL 4.1.x database. And I'd really like to store it as UTF-8, since I keep being told this is a good thing. (This already works fine on PostgreSQL, by the way.)

So, what I don't understand is when I do:

[code lang="sql"]mysql> CREATE DATABASE foo CHARACTER SET UTF8;
Query OK, 1 row affected (0.00 sec)[/code]

in my MySQL client, then do:

[code lang="python"]>>> import MySQLdb
>>> c = MySQLdb.connect(db = 'foo', use_unicode = True)
>>> c.character_set_name()
'latin1'[/code]

What's happening there? Is it that it encodes the Unicode string in the Latin-1 (ISO 8859-1) encoding for it to go over the connection between the MySQL client and server, only to decode it back to Unicode, only to be encoded as UTF-8 before it hits the database? Shouldn't the encoding for the connection match that of the database table? Aren't there characters in UTF-8 which aren't available in Latin-1 (like, for example, the € -- Euro -- symbol!)?

I wonder if the MySQL Python library ought to be using the C API function mysql_set_character_set() to set the desired character set in the connection (perhaps deriving it from the character set the database defaults to?), but I really don't feel confident enough with this stuff to ask upstream...

Would you please form an orderly queue in the comments to hit me with a clue-stick? :-)

PrintView Printer Friendly Version

EmailEmail Article to Friend

Reader Comments (7)

ooh i just had to look up the unicode for the euro wotsit... its Alt 0128
:o)

though i get the feeling that you're probably askin about sommat a bit more technical :oP

November 4, 2005 | Unregistered Commenterannabel

For MySQL, the easiest solution is:

cursor = connection.cursor()
cursor.execute('SET CHARACTER SET UTF8')

Another solution is to use read_default_file keyword option to connect. That's how "deriving from the character set the database defaults to" is done. For example, connect(read_default_file='/etc/my.cnf'), where /etc/my.cnf has

[mysql]
default-character-set = utf8

See also MySQL reference manual. You want to read chapter 10, Character Set Support.

November 4, 2005 | Unregistered CommenterSeo Sanghyeon

OK, I'm still not getting it...

[code]mathie@Tandoori:mailmanager-2.0.1$ cat /Users/mathie/.my.cnf
[mysql]
deafult-character-set = utf8
mathie@Tandoori:mailmanager-2.0.1$ python
Python 2.4.1 (#1, Aug 30 2005, 11:16:54)
[GCC 4.0.0 (Apple Computer, Inc. build 5026)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import MySQLdb
>>> c = MySQLdb.connect(db = 'foo', use_unicode = True, read_default_file="/Users/mathie/.my.cnf")
>>> c.character_set_name()
'latin1'
>>> c.cursor().execute('SET CHARACTER SET UTF8')
0L
>>> c.character_set_name()
'latin1'[/code]

Thanks for the pointer to the MySQL docs -- I'll take a closer look this afternoon.

November 4, 2005 | Unregistered Commentermathie

character_set_name() does not get updated, when you change the encoding for the connection. I use

conn.cursor().execute("SET NAMES 'utf8'")
conn.charset = "utf-8"

Setting conn.charset looks like an pretty ugly hack to me... But it works so far.
The only thing I that confuses me, is that i have to manually decode strings comming from TEXT fields. VARCHAR fields are correctly decoded by MySQLdb, but TEXT fields arrive as array objects...

November 4, 2005 | Unregistered CommenterBenjamin Niemann

For reference, the Euro character is only ALT 0128 on Windows. Its actual Unicode codepoint is U+20AC. If you get used to thinking of the Euro as ALT-0128, you're going to set yourself up for problems later on. See http://www.cs.tut.fi/~jkorpela/html/euro.html and http://www.cs.tut.fi/~jkorpela/www/windows-chars.html for more details.

Oh, and my clue-stick has "Stay away from MySQL; use PostgreSQL instead if you want to keep your sanity" carved on it. I still have to beat *myself* with it from time to time... :-)

November 4, 2005 | Unregistered CommenterRobin Munn

Just do:

self.cur.execute('SET character_set_results="utf8"')

and INSERT utf8strings and you get those back.

also consider using sqlite or poste. the first at least does all that by default.

December 30, 2005 | Unregistered CommenterNikos Kouremenos

A SET NAMES 'x' statement is equivalent to these three statements:

mysql> SET character_set_client = x;
mysql> SET character_set_results = x;
mysql> SET character_set_connection = x;

A SET CHARACTER SET x statement is equivalent to these three statements:

mysql> SET character_set_client = x;
mysql> SET character_set_results = x;
mysql> SET collation_connection = @@collation_database;

(from http://www.informit.com/articles/article.asp?p=328641&seqNum=3#336)

January 25, 2006 | Unregistered Commenterbur

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>