Subject:Re: HTTP URL clustering/truncating proposal (based on me links)
From:Joseph Smarr (jsm@gmail.com)
Date:07/07/2008 09:01:39 AM
List:com.googlegroups.social-graph-api

I like it, and I think it's well-founded (I also heard this sub-path argument from Tantek, and I buy it). Couldn't you also Is that for domains like flickr? I think I could make the same argument there.

js

On Mon, Jul 7, 2008 at 8:26 AM, Brad Fitzpatrick <brad@google.com> wrote:

While most URLs in the Social Graph API are canonicalized using the open source sgnodemapper code, a fair number of nodes in the graph aren't, and never will be. Namely, "vanity domains".

Currently, these are all unique nodes in the graph:

http://bradfitz.com/
http://bradfitz.com/foaf.xml
http://identi.ca/bradfitz
http://identi.ca/bradfitz/foaf
http://factoryjoe.com/
http://factoryjoe.com/hcard.html
http://factoryjoe.com/blog
http://factoryjoe.com/blog/2006/02/10/uspto-to-hold-open-source-meeting/
http://factoryjoe.com/blog/2006/07/25/hresume-plugin-now-available/
......

And so on.

I'd like to cluster the three logical sets above, truncated as follows:

http://bradfitz.com/
http://identi.ca/bradfitz
http://factoryjoe.com/

But where to do the truncation? Nobody likes brittle heuristics like hacky one-off regexp rules or similar.

Fortunately we have a much better data source: "me" links (whether they're XFN, an OpenID delegate tag, an RSS/Atom/FOAF link, etc.).

Talking to Tantek Çelik a while back, he'd mentioned there's an implicit me link from a URL to its parent (http://foo.com/bar/ ---me--> http://foo.com/), but not vice versa (which might seem more intuitive), because (as he roughly said) "A root must always be able to partition its namespace." Consider that if http://foo.com/ implied a me link to http://foo.com/users/attacker/, then user "attacker" could me link back to foo.com and cluster the whole site together.

Unfortunately, I don't see an explanation of this at http://gmpg.org/xfn/11 so I'm afraid I might be remembering it wrong.

But ideally what I'd like to do, if I'm not grossly confused:

If a URL ${prefix} has a me link to the URL ${prefix} + ${suffix}, and the latter URL has more path components than the former, then truncate at ${prefix}.

That is, whenever a site http://foo.com/ has a me link (XFN or otherwise: RSS/Atom/FOAF) to http://foo.com/anything, we truncate at http://foo.com/ and any links in the graph to http://foo.com/* now become http://foo.com/
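To make the rule concrete, here is a minimal Python sketch. The names `path_components`, `truncation_point` and `truncate` are hypothetical and are not part of sgnodemapper or the Social Graph API. It reads the rule component-wise (the target's path must extend the source's path by whole components), which is a slightly stricter assumption than the literal string-prefix wording, and when several truncation points cover a URL it picks the shortest, which is also an assumption.

```python
from urllib.parse import urlsplit


def path_components(url):
    """Return (scheme, host, list of non-empty path segments) for a URL."""
    parts = urlsplit(url)
    segments = [seg for seg in parts.path.split("/") if seg]
    return parts.scheme, parts.netloc.lower(), segments


def truncation_point(source, target):
    """Return `source` if a me link source -> target justifies truncating
    at `source`, else None."""
    s_scheme, s_host, s_segs = path_components(source)
    t_scheme, t_host, t_segs = path_components(target)
    if (s_scheme, s_host) != (t_scheme, t_host):
        return None
    # Assumption: the target must sit strictly below the source,
    # matching one whole path component at a time.
    if len(t_segs) > len(s_segs) and t_segs[:len(s_segs)] == s_segs:
        return source
    return None


def truncate(url, points):
    """Map `url` to a known truncation point that covers it, or return it
    unchanged. Assumption: if several points cover it, use the shortest."""
    scheme, host, segs = path_components(url)
    best = None
    for point in points:
        p_scheme, p_host, p_segs = path_components(point)
        if (p_scheme, p_host) == (scheme, host) and segs[:len(p_segs)] == p_segs:
            if best is None or len(p_segs) < len(path_components(best)[2]):
                best = point
    return best if best is not None else url


points = [
    truncation_point("http://bradfitz.com/", "http://bradfitz.com/foaf.xml"),
    truncation_point("http://identi.ca/bradfitz", "http://identi.ca/bradfitz/foaf"),
]
print(truncate("http://bradfitz.com/foaf.xml", points))
print(truncate("http://identi.ca/bradfitz/foaf", points))
```

With those two me links, both example URLs should collapse to their truncation points, while an unrelated URL on the same host is left alone.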

The path component part is necessary because of all the sites which, for what I imagine are aesthetic reasons, have URLs like this:

http://identi.ca/bradfitz

... instead of what one could argue is a bit more technically correct, like this:

http://identi.ca/bradfitz/

So considering that people are going to use things like /username as the URL, we need to guard against this case:

http://foo.com/dude
http://foo.com/dude2_unrelated

If the rule were purely prefix-based, then the first dude, being naive or malicious, could "me"-link to dude2_unrelated and cluster with him, stealing all his outgoing and incoming edges, dirtying up the data.
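The difference between the two rules can be checked with a small Python sketch. Both function names are hypothetical, and the second one follows the literal wording of the proposal (string prefix plus a larger path-component count); it is an illustration, not the actual Social Graph API code.

```python
from urllib.parse import urlsplit


def naive_prefix_rule(source, target):
    """Pure string-prefix check: the version that the dude example breaks."""
    return target != source and target.startswith(source)


def component_rule(source, target):
    """String prefix plus the requirement that the target has strictly
    more path components than the source."""
    def count(url):
        # Count non-empty path segments, so "/" has 0 and "/dude" has 1.
        return len([seg for seg in urlsplit(url).path.split("/") if seg])
    return target.startswith(source) and count(target) > count(source)


dude = "http://foo.com/dude"
other = "http://foo.com/dude2_unrelated"

print(naive_prefix_rule(dude, other))   # the naive rule accepts the link
print(component_rule(dude, other))      # the component rule rejects it
print(component_rule("http://identi.ca/bradfitz",
                     "http://identi.ca/bradfitz/foaf"))
```

Under the purely prefix-based rule the first dude's me link is accepted; once the path-component count is compared, both URLs have one component, so nothing is truncated, while the legitimate identi.ca case still passes.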

If this is technically sound, then http://factoryjoe.com/ will have one node in the graph for his site, rather than the hundreds or more he has today. Likewise, a lot of people with a domain + FOAF file (like me) will have one node on their vanity domain, not two, when doing simple fme=1 queries from it.

Thoughts?

- Brad