2 messages in com.googlegroups.pylons-discuss: Re: urls with non-ascii chars in vari...
From    Sent On
kettle  19 Mar 2008 21:59
kettle  20 Mar 2008 03:31
Subject: Re: urls with non-ascii chars in various encodings
From: kettle (Jose@public.gmane.org)
Date: 03/20/2008 03:31:17 AM
List: com.googlegroups.pylons-discuss

I managed to solve this with a hack: removing the default urllib.unquote_plus(val) clause from Routes-1.7.1-py2.4.egg/routes/base.py and handling the unquoting myself. This way I can write my own code to handle multiple different encodings: utf8 if that is provided, shiftjis if that is provided, etc.
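A minimal sketch of the kind of handling described above, not the actual patch. It assumes Python 3's urllib.parse (the thread itself concerns urllib.unquote_plus on Python 2.4), and the helper name decode_path_segment and the list of candidate encodings are hypothetical. The idea is to unquote to raw bytes first and only then try each encoding in turn:

```python
from urllib.parse import unquote_to_bytes

# Hypothetical candidate list; order matters, since the first encoding
# that decodes without error wins.
CANDIDATE_ENCODINGS = ("utf-8", "shift_jis")


def decode_path_segment(raw, encodings=CANDIDATE_ENCODINGS):
    """Percent-decode a URL path segment and guess its text encoding."""
    data = unquote_to_bytes(raw)  # raw bytes, no encoding assumed yet
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so the
    # original bytes can still be recovered by the caller.
    return data.decode("latin-1")


print(decode_path_segment("%93%FA%96%7B%8C%EA"))           # shiftjis form
print(decode_path_segment("%E6%97%A5%E6%9C%AC%E8%AA%9E"))  # utf8 form
```

Trying utf8 first is only a heuristic: some byte sequences are valid in more than one encoding, so the order of the candidate list decides the result in ambiguous cases.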

On Mar 20, 1:59 pm, kettle <Jose@public.gmane.org> wrote:

Hi, I have a Pylons application in which I would like users to be able to access their private pages through a unique url, similar to a subdomain, which contains their unique username or id. So far so good, so long as everything is ascii, or I provide a utf8 encoded version of the link, something akin to (not real): http://www.localhost/user/日本語

and the user clicks on or copies the link into their browser. In this case Pylons correctly escapes the utf8 chars and I am able to reconstruct the original string on the server side, and pull out the appropriate user info to render their personal pages. However, I figure it would also be nice to be able to type one's personal url into the browser address bar directly. Trying this in Firefox, it appears that the default encoding for the browser/OS is used to urlencode the string. That is fine again if I'm using utf8, but if I switch to something like iso-2022-jp or shiftjis, things start misbehaving. Say, using shiftjis as my default encoding, I type the following url directly into the address bar in Firefox: http://www.localhost/user/日本語

This is handled by a routes map that looks like: map.connect('user/:id', controller='test', action="index")

then I get back a response like: http://www.localhost/user/%93%FA%96%7B%8C%EA

instead of the utf8 style url, which would be: http://www.doofer.tv/user/%E6%97%A5%E6%9C%AC%E8%AA%9E
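For illustration, the two escaped forms can be reproduced with urllib.parse.quote. This assumes Python 3, which the thread predates, and the variable names are made up for the example:

```python
from urllib.parse import quote

name = "日本語"

# The same text, percent-encoded from two different byte encodings.
sjis_form = quote(name, encoding="shift_jis")
utf8_form = quote(name, encoding="utf-8")

print(sjis_form)  # %93%FA%96%7B%8C%EA
print(utf8_form)  # %E6%97%A5%E6%9C%AC%E8%AA%9E
```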

Nothing too terrible there, as this is the proper shiftjis rendering of the Japanese text. The issue seems to be that on the server side, by the time I try to write the :id portion to my log, it has already been interpreted as utf8 or some other incorrect codec and mangled beyond repair, such that I can no longer hope to retrieve the user info based on the original string in any encoding.

If I generate an error and check the path info it shows this on the debug page: 'PATH_INFO': '/user/\x93\xfa\x96{\x8c\xea'

So then, looking up the \x7B, I find that yes, indeed, that happens to be the ascii/utf8 hex code for '{' (left curly bracket).
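A short sketch of why the '{' shows up and why the value ends up mangled, again assuming Python 3 rather than the Python 2.4 setup in this thread. The second byte of the shiftjis encoding of 本 is 0x7B, and decoding the raw bytes as utf8 either fails outright or, in lossy mode, destroys the original bytes:

```python
raw = "日本語".encode("shift_jis")
print(repr(raw))     # b'\x93\xfa\x96{\x8c\xea'
print(hex(raw[3]))   # 0x7b, the trail byte of 本, which prints as '{'

# Strict utf8 decoding rejects these bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy decoding substitutes U+FFFD for the invalid bytes, so the
# original byte values can no longer be recovered from the result.
mangled = raw.decode("utf-8", errors="replace")

# Decoding with the right codec recovers the text.
recovered = raw.decode("shift_jis")
print(strict_ok, recovered)
```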

This issue does not crop up in POST or GET requests but seems to be unique to stuff dealt with in routes.

Am I missing something? Is there a solution? Do I have to muck around in paste to fix this?

Cheers, Joe