Subject: [Freebase-discuss] Increasing the quality of Freebase data
From: Shawn Simister (simi@google.com)
Date: Jan 18, 2012 2:00:49 pm
List: com.freebase.freebase-discuss

High quality, reconciled data is what makes Freebase such an important resource. Over the past year Freebase has taken a much closer look at the quality of its data, particularly for our most visible and most “popular” topics.

To that end, our data quality workers have reviewed 10,000 of the most popular topics in Freebase from the following types: Film Actors, Films, Musical Artists, TV Programs, Authors and Visual Artists. As a result of this work they’ve contributed 1,199,885 new facts to Freebase and individually verified that each of these topics is factually correct. This work strengthens many of the existing apps built on Freebase data and gives us a solid base of reconciled data to build upon for future data loads.

As much as we strive to hold every data load to the 99% quality standard, it’s inevitable that the more data we add to Freebase, the more outliers and exceptions we will have to clean up. Our community has been instrumental in identifying these outliers by flagging duplicate topics, bad merges and candidates for deletion in the review queue <http://wiki.freebase.com/wiki/Review_queue>. The oDesk team was able to take on one of the most time-consuming cleanup tasks by splitting apart 2,048 improperly merged topics. In addition, they’ve resolved 10,651 merges or splits to clean up our IMDb keys.

The next cleanup task that our data quality workers are tackling is incompatible types <http://www.google.com/url?q=http%3A%2F%2Fwww.freebase.com%2Fview%2Fdataworld%2Fincompatible_types>. With the community’s help, we’ve defined a set of constraints that prevent entities from having conflicting types, such as Person and Organization, applied to them. We’ve identified over 100,000 instances in Freebase where these constraints have been violated, and the oDesk team has already resolved 25,000 of these conflicts. There are another 25,000 which we believe can be rolled back programmatically and 38,000 which are considered to be minor violations.
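
To illustrate the idea of an incompatible-types constraint, here is a minimal Python sketch. It is not the actual Freebase tooling: the constraint set, the helper name find_violations, and the sample topic ids are all hypothetical, and the only pair listed is the Person/Organization example mentioned above.

```python
from itertools import combinations

# Hypothetical constraint set: each frozenset names two types that
# should never be applied to the same topic. Only the Person /
# Organization example from the text is listed here.
INCOMPATIBLE_PAIRS = {
    frozenset({"/people/person", "/organization/organization"}),
}


def find_violations(topic_types):
    """Return the incompatible type pairs present on a single topic."""
    violations = []
    # Sort so that each pair is reported in a stable, alphabetical order.
    for a, b in combinations(sorted(set(topic_types)), 2):
        if frozenset({a, b}) in INCOMPATIBLE_PAIRS:
            violations.append((a, b))
    return violations


# Made-up sample topics, for illustration only.
topics = {
    "/en/example_person": ["/people/person", "/film/actor"],
    "/en/example_conflict": ["/people/person", "/organization/organization"],
}

# Keep only the topics that violate at least one constraint.
flagged = {}
for topic_id, types in topics.items():
    found = find_violations(types)
    if found:
        flagged[topic_id] = found

for topic_id, pairs in flagged.items():
    print(topic_id, pairs)
```

A check like this only finds candidates. Deciding whether a flagged topic should be split, rolled back, or treated as a minor violation is still a separate step.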

As the diversity and scale of data being contributed to Freebase continue to increase, I’ll continue to update you on our data quality efforts here at Google. Please don’t hesitate to jump in and let us know when we mess up, but also understand that we’re working very hard to make Freebase bigger and better.

Developer Programs Engineer
Google, San Francisco
http://freebase.com