Subject: Re: [Developers] mql_escape and UTF-8
From: Nick Thompson (ni@metaweb.com)
Date: Jul 30, 2008 10:26:08 am
List: com.freebase.developers

The best way to think of the $xxxx encoding is as "MQL key encoding". I believe these encoded keys are only used for /type/key/value properties.

The original reason for MQL key encoding was to allow slash-separated MQL ids. It is also used to allow "." in the sort syntax and to allow the use of comparison suffixes like "<" and ">" without ambiguity. So we needed some syntax to escape these characters so that arbitrary text could be stored in /type/key.

I will add that MQL key escaping should really be thought of as an aspect of the string encoding of /type/key, not as inherent in /type/key itself. Because MQL key encoding and decoding are one-to-one mappings, it should be possible to provide unencoded access to MQL keys. I think there is some low-level support for this in MQL enumerations, but it's not exposed through /type/key as far as I know.

So why not URL encoding?

Unfortunately URL encoding is the worst quoting syntax in common use. Decoding is straightforward, but implementations differ about which characters need to be encoded, and the rules are different in different parts of the URL. We didn't want to define a new encoding, but URL encoding seemed like a very risky choice. You would be able to use a stock URL decoder, but your stock URL encoder might produce confusing problems with some characters.
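To make the encoder disagreement concrete, here is a small Python sketch that uses only the standard library's urllib.parse. The sample string is made up for illustration. The point is just that two stock encoders in the same library disagree about spaces and slashes, and that a decoder has to know which convention was used.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

sample = "a b/c"  # made-up input, not a real Freebase key

# Path-style quoting leaves "/" alone and turns the space into %20.
path_style = quote(sample)
# Form-style quoting escapes "/" and turns the space into "+".
form_style = quote_plus(sample)

print(path_style)
print(form_style)

# Decoding form-style output with the path-style decoder keeps the "+".
mismatched = unquote(form_style)
matched = unquote_plus(form_style)
print(mismatched)
print(matched)
```

Neither encoder is wrong for its intended part of a URL, which is exactly why picking "URL encoding" as a storage format would have been ambiguous.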

Furthermore there are cases where you have to layer URL encoding on top of MQL key encoding. Double URL encoding gets really really ugly. This is the reason for the choice of '$' as an escape character - $ does not require escaping in URLs. Since the key encoding is stricter than URL encoding about which characters are escaped, MQL ids should all be valid in URLs, and URL-decoding of a MQL id should not change the MQL id at all.
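A quick Python sketch of both points, using the standard library's urllib.parse. The id below is only an illustrative string built from the key discussed in this thread, not a verified Freebase id.

```python
from urllib.parse import quote, unquote

# Illustrative MQL-style id (assumed for this example).
mql_id = "/wikipedia/en/Gabriel_Garc$00EDa_M$00E1rquez"

# "$" is not a percent-escape, so a stock URL decoder leaves the id alone.
decoded = unquote(mql_id)
print(decoded == mql_id)

# By contrast, layering URL encoding on URL encoding escapes the "%" signs.
once = quote("García")
twice = quote(once)
print(once)
print(twice)
```

Note that some stock encoders, including Python's quote with its default arguments, will still percent-encode "$" if asked to encode the id; the guarantee described above is about decoding.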

So yes, it's a pain, but it is a pretty reasonable solution to some tricky problems. We do need a better public definition of the encoding, test cases, and a library of implementations in various languages. At this point the Python code that Kurt posted is probably the best starting point for implementors.

nick

Shug Boabby wrote:

Thanks Chris... I think I'd already worked all that out, but I was just wondering if anybody had actually written a Java encoder/decoder between UTF-8/MW Hex. I realise it should be simple to convert the $xxxx syntax, but it is troublesome to have to write this code myself. I really wish you'd decided to just use the URL encoding scheme, as that would require no additional work on our side of things (despite it looking ugly). The $ syntax is just not standard enough (although, admittedly, prettier).

2008/7/30 Christopher R. Maden <cri@metaweb.com>:

I am going to be somewhat overly detailed in this reply so that it will be archived for anyone else who is wondering.

Wikipedia article names can include Unicode characters:

Gabriel García Márquez

They do not include underscores, HTML entities, URL encodings, or anything else.

To refer to a Wikipedia article in a URL, one must turn spaces into underscores and URL-encode the result. This is true of any URL on any Web site, not just Wikipedia. The canonical URL for the above-mentioned article is:

http://en.wikipedia.org/wiki/Gabriel_Garc%C3%ADa_M%C3%A1rquez

The acute accented i is Unicode character U+00ED. Some broken systems will encode the character as %ED; this is wrong, though some Web servers will accept it. The correct URL encoding is to turn the character into a UTF-8 byte sequence (whose details I am not going to go into here). The UTF-8 byte sequence for í is C3 AD, so the URL encoding is %C3%AD.
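The same steps in Python 3, as a minimal sketch using urllib.parse.quote (which encodes to UTF-8 by default). The Latin-1 call is included only to reproduce the broken %ED form described above.

```python
from urllib.parse import quote

title = "Gabriel García Márquez"

# Spaces become underscores, then the result is URL-encoded via UTF-8.
url_path = quote(title.replace(" ", "_"))
url = "http://en.wikipedia.org/wiki/" + url_path
print(url)

# The correct encoding goes through the UTF-8 bytes C3 AD.
right = quote("í")
# Encoding the code point as a single Latin-1 byte gives the broken form.
wrong = quote("í", encoding="latin-1")
print(right, wrong)
```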

Unfortunately, many standard URL escaping libraries do not correctly handle characters with Unicode codepoints above 127 (U+007F), which is why Kurt and I wrote the code that he posted.

The byte-wise URL encoding is horribly annoying, which is why Freebase uses a simpler escape mechanism. Every character in a key is either represented by itself, or by a dollar sign and four hex digits. The four hex digits are the Unicode codepoint for that character; í becomes $00ED. A literal space is encoded as $0020, but spaces are rarely used in keys. To reduce annoyance when dealing with Wikipedia names, and to make our URLs look prettier, we copied the convention of turning spaces into _ before key-encoding them.

The key corresponding to the canonical name for the article about Gabriel García Márquez is thus Gabriel_Garc$00EDa_M$00E1rquez.
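A minimal Python sketch of the scheme as described here. This is not the code Kurt posted. The function names are hypothetical, and the set of characters that "represent themselves" is an assumption (ASCII letters, digits, "_" and "-"), since the thread does not spell it out. Characters above U+FFFF do not fit in four hex digits, so this sketch simply rejects them.

```python
import re

# Assumption: only these characters are stored unescaped.
_SAFE = re.compile(r"[A-Za-z0-9_-]")
_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def key_encode(name: str) -> str:
    """Turn spaces into "_", then escape every other unsafe character as $hhhh."""
    out = []
    for ch in name.replace(" ", "_"):
        if _SAFE.fullmatch(ch):
            out.append(ch)
        elif ord(ch) <= 0xFFFF:
            out.append("$%04X" % ord(ch))
        else:
            raise ValueError("code point does not fit in four hex digits: %r" % ch)
    return "".join(out)


def key_decode(key: str) -> str:
    """Replace each $hhhh escape with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


key = key_encode("Gabriel García Márquez")
print(key)
print(key_decode(key))
```

Decoding gives back the underscored form, Gabriel_García_Márquez, because the space-to-underscore step happens before key encoding and is not itself an escape.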

When working between these systems in Python, it is important to use Unicode strings, not normal (byte) strings, at all times. Similarly, in Java, remember that all strings are UTF-16 (sequences of 16-bit code units; characters outside the Basic Multilingual Plane occupy two units). Encoding or decoding Freebase $hhhh syntax should be straightforward in either case.
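A short Python 3 sketch of why this matters (in Python 3, str is the Unicode string type and bytes is the byte string type). Escaping the UTF-8 bytes instead of the code points produces two bogus escapes for a single character. The last lines show how many 16-bit units a character outside the Basic Multilingual Plane needs in UTF-16, which is what a Java string would hold.

```python
ch = "í"

code_points = [ord(c) for c in ch]      # one code point: U+00ED
utf8_bytes = list(ch.encode("utf-8"))   # two bytes: C3 AD

# Escaping code points gives the intended key text.
good = "".join("$%04X" % cp for cp in code_points)
# Escaping raw UTF-8 bytes gives two unrelated escapes.
bad = "".join("$%04X" % b for b in utf8_bytes)
print(good, bad)

# U+1D11E (musical G clef) lies outside the Basic Multilingual Plane.
clef = "\U0001D11E"
utf16_units = len(clef.encode("utf-16-be")) // 2
print(len(clef), utf16_units)
```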