

15 messages in com.freebase.developers: Re: [Developers] /wikipedia/en_id not...

| From | Sent On | Attachments |
|---|---|---|
| Shug Boabby | Jul 23, 2008 3:40 pm | |
| Christopher R. Maden | Jul 23, 2008 3:44 pm | |
| Brian Karlak | Jul 23, 2008 3:55 pm | |
| Kurt Bollacker | Jul 23, 2008 4:01 pm | |
| Kurt Bollacker | Jul 23, 2008 4:01 pm | |
| Shug Boabby | Jul 24, 2008 2:29 am | |
| Brian Karlak | Jul 24, 2008 10:53 am | |
| Alec Flett | Jul 24, 2008 11:11 am | |
| Shug Boabby | Jul 24, 2008 2:34 pm | |
| Kurt Bollacker | Jul 24, 2008 3:47 pm | |
| Shug Boabby | Jul 26, 2008 4:29 am | |
| Shug Boabby | Jul 26, 2008 12:02 pm | |
| Alexander Marks | Jul 26, 2008 12:29 pm | |
| Alexander Marks | Jul 26, 2008 12:38 pm | |
| Shug Boabby | Jul 26, 2008 3:59 pm | |

| Subject: | Re: [Developers] /wikipedia/en_id not giving results for redirected ids |
|---|---|
| From: | Kurt Bollacker (ku...@metaweb.com) |
| Date: | Jul 24, 2008 3:47:19 pm |
| List: | com.freebase.developers |

On Thu, Jul 24, 2008 at 10:35:10PM +0100, Shug Boabby wrote:
> and I would still like somebody to clarify for me how to get the actual "wikipedia/en" name, given the article.name from the WEX dumps (spaces to underscores, but what else?).
Here are some Python functions we often use to convert Wikipedia (WP) names to Freebase keys. mql_escape() may be all you need, but cleanwikiword() helps with messy WP names. You can also use utf8unescape() when you need to handle a URL-escaped WP name that you got from an HTTP GET.
Keep in mind that not all WP names resolve to Freebase IDs: some are not topics (e.g. disambiguation articles), and some are brand-new names that Freebase hasn't synced yet.
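One way to act on this caveat is to treat a missing result as a normal outcome rather than an error. The sketch below is not from the original message: it assumes the MQL key-lookup shape (a `key` clause with `namespace` and `value`) and the `/wikipedia/en` namespace named in the thread subject, and the helper names `build_key_lookup` and `topic_id_or_none` are hypothetical. It only builds the query text and interprets an already-decoded result; it makes no network call.

```python
import json

# Assumed namespace, taken from the thread subject rather than from any API reference.
WIKIPEDIA_NAMESPACE = "/wikipedia/en"


def build_key_lookup(key):
    """Return JSON text for a query asking which topic owns `key` (assumed MQL shape)."""
    query = {
        "id": None,  # None serialises to null, i.e. "please fill this in"
        "key": {"namespace": WIKIPEDIA_NAMESPACE, "value": key},
    }
    return json.dumps(query, sort_keys=True)


def topic_id_or_none(result):
    """Return the topic id from a decoded result, or None when the key did not resolve."""
    if not result:
        # Disambiguation pages and not-yet-synced names are expected to land here.
        return None
    return result.get("id")
```

A caller would pass an already escaped key to `build_key_lookup()` and skip or log any name for which `topic_id_or_none()` returns `None`.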
Let me know if you have questions. Kurt :-)

    ######################################################################
    import codecs, struct, re

    # Let's deal with URL escaping of UTF
    utf8decode = codecs.getdecoder('utf-8')
    isodecode = codecs.getdecoder('iso8859_1')

    def sub3(mo):
        # Decode a 3-byte UTF-8 sequence captured as three hex pairs.
        return utf8decode(struct.pack("BBB", int(mo.group(1), 16), int(mo.group(2), 16), int(mo.group(3), 16)))[0]

    def sub2(mo):
        # Decode a 2-byte UTF-8 sequence captured as two hex pairs.
        return utf8decode(struct.pack("BB", int(mo.group(1), 16), int(mo.group(2), 16)))[0]

    def sub1(mo):
        # Decode a single byte, falling back to ISO 8859-1 if UTF-8 decoding fails.
        try:
            return utf8decode(struct.pack("B", int(mo.group(1), 16)))[0]
        except UnicodeDecodeError:
            return isodecode(struct.pack("B", int(mo.group(1), 16)))[0]

    def utf8unescape(s):
        '''
        Converts UTF-8 strings that have been URL escaped back into UTF-8.
        '''
        # The (?i) flag must lead the pattern; recent Python versions reject it elsewhere.
        # 4-byte sequences (U+10000 and above) are not handled and stay escaped.
        # Get 3-byte UTF-8 sequences 1110xxxx 10yyyyyy 10zzzzzz
        s = re.sub('(?i)%(e[0-9a-f])%([89ab][0-9a-f])%([89ab][0-9a-f])', sub3, s)
        # Get 2-byte UTF-8 sequences 110xxxxx 10yyyyyy
        s = re.sub('(?i)%([cd][0-9a-f])%([89ab][0-9a-f])', sub2, s)
        # Get 1-byte UTF-8 sequences 0xxxxxxx
        s = re.sub('(?i)%([0-7][0-9a-f])', sub1, s)
        # Nuke any illegal characters.
        s = re.sub(u'[\ud800-\udfff\ufdd0-\ufdef\ufffe\uffff]', '', s)
        s = re.sub('[\x00-\x08\x0b\x0c\x0e-\x1f]', '', s)
        return s

    # Clean up whitespace
    def cleanwikiword(s):
        '''
        Clean up the spacing chars of a Wikipedia name
        '''
        s = utf8unescape(s)
        s = re.sub('^[_ \t\r\n]+', '', s)
        s = re.sub('[_ \t\r\n]+$', '', s)
        s = re.sub('[_ \t\r\n]+', '_', s)
        # Slicing keeps an empty name from raising IndexError.
        s = s[:1].upper() + s[1:]
        return s

    # Do the MW hex encoding
    def dollarhex(mo):
        return ("$%04x" % ord(mo.group(1))).upper()

    def mql_escape(s):
        '''
        Convert a string into a valid Freebase key value.
        '''
        s = re.sub('([^-A-Za-z0-9_])', dollarhex, s)
        s = re.sub('(^[-_])', dollarhex, s)
        s = re.sub('([-_])$', dollarhex, s)
        return s

    # This function shows how the above can be used together.
    # cleanwikiword() already calls utf8unescape(), so it is not repeated here.
    def wikiurltomwkey(s):
        return mql_escape(cleanwikiword(s))

    # Do a test
    if __name__ == '__main__':
        s0 = '----%e9%80%81 %d4%90 %45 %1f -_-_ blah____'
        s1 = utf8unescape(s0)
        s2 = cleanwikiword(s1)
        s3 = mql_escape(s2)
        print(':' + repr(s0) + ':\n:' + repr(s1) + ':\n:' + repr(s2) + ':\n:' + repr(s3) + ':\n')
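
For readers on Python 3, the same three steps (URL-unescape, clean the spacing, escape to a key) might be sketched with the standard library's `urllib.parse.unquote` doing the unescaping. This is an illustrative variant, not the code from the message: the names `clean_wiki_title`, `to_freebase_key`, and `wiki_url_to_key` are hypothetical, and unlike `utf8unescape()` it does not strip control or other illegal characters, while invalid byte sequences become U+FFFD instead of staying escaped.

```python
import re
from urllib.parse import unquote


def clean_wiki_title(title):
    """Collapse runs of spacing characters to '_', trim them, capitalise the first letter."""
    title = re.sub(r'[_ \t\r\n]+', '_', title).strip('_')
    # Slicing (rather than indexing) keeps an empty title from raising IndexError.
    return title[:1].upper() + title[1:]


def dollar_hex(match):
    """Encode one character as '$' followed by at least four uppercase hex digits."""
    return '$%04X' % ord(match.group(1))


def to_freebase_key(title):
    """Apply the same three substitutions as mql_escape() in the message."""
    key = re.sub(r'([^-A-Za-z0-9_])', dollar_hex, title)
    key = re.sub(r'^([-_])', dollar_hex, key)   # a key may not start with '-' or '_'
    key = re.sub(r'([-_])$', dollar_hex, key)   # nor end with one
    return key


def wiki_url_to_key(url_title):
    """URL-unescape, clean, and escape a Wikipedia title in one call."""
    return to_freebase_key(clean_wiki_title(unquote(url_title)))


print(wiki_url_to_key('%C3%89dith_Piaf'))  # $00C9dith_Piaf
```

Here `É` (U+00C9) falls outside the allowed `[-A-Za-z0-9_]` set, so it becomes `$00C9`, while the underscore and ASCII letters pass through unchanged.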







