16 messages in com.googlegroups.pylons-discussRe: SQLAlchemy + MySQL + UTF-8 suppor...
FromSent OnAttachments
andyprog10 May 2007 19:46 
bjpirt11 May 2007 02:42 
bjpirt11 May 2007 02:42 
Max Ischenko11 May 2007 04:44 
andyprog11 May 2007 07:28 
David Smith28 May 2007 23:44 
Shannon -jj Behrens28 May 2007 23:53 
David Smith29 May 2007 01:13 
Max Ischenko29 May 2007 02:02 
David Smith29 May 2007 02:27 
Shannon -jj Behrens29 May 2007 07:23 
Max Ischenko29 May 2007 09:56 
David Smith29 May 2007 10:02 
Mike Orr29 May 2007 17:09 
Mike Orr31 May 2007 14:19 
ysamlan12 Jun 2007 15:49 
Subject:Re: SQLAlchemy + MySQL + UTF-8 support - how?
From:David Smith (davi@public.gmane.org)
Date:05/29/2007 10:02:14 AM
List:com.googlegroups.pylons-discuss

Hi Mike,

Thank you for all the details. I've seen the merits of the arguments and I guess the conclusion is these points:

1) The MySQL specific option of use_unicode is not needed anymore. The MySQL specific charset option also is deprecated.

2) The SQLAlchemy convert_unicode option is recommended, and is needed when you're autoloading tables and you want to get unicode objects. It obsoletes the use_unicode option. Correspondingly, the SQLAlchemy encoding option obsoletes the MySQL specific charset option.

3) No options will help you when doing raw engine.execute operations; you must convert for yourself.

4) You should use Unicode fields inside of your application in order for your internal encoding/decoding logic to not depend on the database configuration. I.e., whether you're using convert_unicode or not, when using Unicode field you will always get a unicode object from SQLAlchemy.

5) Pylons does not properly pass the convert_unicode option to create_engine. This has been discussed on this list in many forms, most recently by Mike himself, and everyone seems to be agreed it needs fixing and soon. I've created ticket #250 for this.

There was one point you made that I believe is inaccurate and so did not include it in the above. Where you said that if you want to use the convert_unicode option, you must know what encoding the database is in, but my tests don't suggest that. My MySQL settings are currently:

character_set_client latin1 character_set_connection latin1 character_set_database latin1 character_set_filesystem binary character_set_results latin1 character_set_server latin1 character_set_system utf8 character_sets_dir /usr/share/mysql/charsets/ collation_connection latin1_swedish_ci collation_database latin1_swedish_ci collation_server latin1_swedish_ci

which are the defaults on Debian, and I'm happy in my ignorance. AFAIK as long as the character_set_system is utf-8, the data itself is stored in utf-8, and other variables like client and server encoding can be changed at will to force MySQL (client or server respectively) to convert from utf-8 to whatever else.

I've verified that with the above MySQL settings and only using Unicode field types, no covert_unicode or use_unicode or any other settings, the data in the DB is utf-8 and I get unicode objects and everything is groovy.

That suggests to me that if you really care, you should set the backend encoding to utf-8 or whatever with the SQLAlchemy encoding settings, but it is not necessary nor especially recommended.

Does that tie up the loose ends on this discussion?

Thanks again, David

Mike Orr wrote:

The first thing is to see what MySQL thinks its server and client character sets are:

$ paster shell development.ini

import pylons.database engine = pylons.database.create_engine() engine.execute("SHOW VARIABLES").fetchall()

Look for the variables that start with "character_set_" and "collation_". On my workstation they show:

(u'character_set_client', u'utf8'), (u'character_set_connection', u'utf8'), (u'character_set_database', u'utf8'), (u'character_set_filesystem', u'binary'), (u'character_set_results', u'utf8'), (u'character_set_server', u'utf8'), (u'character_set_system', u'utf8'), (u'character_sets_dir', u'/usr/share/mysql/charsets/'), (u'collation_connection', u'utf8_general_ci'), (u'collation_database', u'utf8_general_ci'), (u'collation_server', u'utf8_general_ci'),

Then you have a choice of "?use_unicode=1" (a MySQLdb option in the DSN) or:

create_engine(dsn, convert_unicode=True, encoding="utf8")

SQLAlchemy wants to standardize on 'convert_unicode' because it's the same for all engines, but currently there are some problems with that:

1) You have to know what character set the MySQL connection is.

2) If you're using pylons.database.session_context, you can't add arguments to the create_engine call without hacking pylons.database or reimplementing the functions yourself. Hopefully this will be improved in a future version of Pylons -- as soon as we figure out what to do.

3) convert_unicode works at the Table level, so raw engine.execute() queries will still be 'str' strings.

* * * * I have MySQLdb 1.2.1 because Kubuntu is still on the older version and I can't get the newer one to compile:

$ easy_install MySQL-python File "/tmp/easy_install-p8zEac/MySQL-python-1.2.2/setup_posix.py", line 24, in mysql_config EnvironmentError: mysql_config not found

Thank you, now what is "mysql_config"?

* * * * With autoloaded tables you'll get unicode values if you use either use_unicode or convert_unicode, or str values if you don't. With declared columns you'll also have to use the Unicode column type.

I had to use a my.cnf file to get the client and server charsets to match. Otherwise I was getting:

(u'character_set_client', u'latin1'), (u'character_set_connection', u'latin1'), (u'character_set_database', u'utf8'), (u'character_set_filesystem', u'binary'), (u'character_set_results', u'latin1'), (u'character_set_server', u'utf8'), (u'character_set_system', u'utf8'), (u'character_sets_dir', u'/usr/share/mysql/charsets/'), (u'collation_connection', u'latin1_swedish_ci'), (u'collation_database', u'utf8_general_ci'), (u'collation_server', u'utf8_general_ci'),

The my.cnf is a symlink to /etc/mysql/my.cnf and contains among other things:

[client] default-character-set = utf8

[mysqld] character-set-server = utf8 default-character-set = utf8

The DSN looks like this:

sqlalchemy.dbapi = mysql://...?read_default_file=%(here)s/my.cnf&use_unicode=1

Unfortunately MySQLdb does not raise an exception if the file is missing, so it's worth creating a unit test that checks whether all the character sets are correct. I haven't tried that yet.

* * * * If you have existing records in the database which are in a different charset than what the column purports to be, and you want to keep the actual values as is, change the column type to BINARY in MySQL, then change it to the desired charset:

http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html

If you want to change the data to the charset the column purports to be, write a Python program that connects WITHOUT Unicode support, read the values, decode them from the source charset, encode them into the destination charset, and write them back to the database.

With "mysql://...?use_unicode=1" and reflected tables, I get Unicode values. Without use_unicode, I get string values. If I hack pylons.database to pass 'convert_unicode=True' to create_engine() in a way that it'll be propagated to the session_context, I get Unicode values again. Curiously, in all three cases the column instances are MSString with convert_unicode=False:

Column(u'name',MSString(length=255,convert_unicode=False),nullable=False)

* * * * The following didn't work:

"mysql://...?use_unicode=1&charset=utf8"

sqlalchemy.exceptions.DBAPIError: (Connection failed) (OperationalError) (2019, 'Can\'t initialize character set utf8" (path: /usr/share/mysql/charsets/)')

This may be because I'm using the older MySQLdb, or maybe it's a bug Ubuntu's packaging of MySQL.