Abstract

Wolfgang Nejdl
Social Computing for Libraries - Data De-Duplication through the Crowd


The FreeSearch project develops new methods for simple literature search that work on any collection of catalogues, without requiring in-depth knowledge of the metadata schema. FreeSearch assists users proactively and unobtrusively by guessing at each step what the user's real information need is and providing precise suggestions. In this way we combine simplicity and ease of use with powerful search algorithms.

In addition to features such as faceted search, an intelligent search interface and thematic clustering, FreeSearch includes an advanced duplicate detection algorithm that works in three steps: (1) To obtain a rough duplicate detection, we use several methods to create signatures for documents (metadata normalization, Soundex, etc.) and group duplicates very efficiently by signature. (2) As grouping by signatures sometimes yields false positives, we compare the duplicate candidates from the first step using more complex algorithms that identify records referring to the same entity, relying on attribute similarities and relationships. (3) We integrate social computing into the duplicate detection process. Users help us by marking records as duplicates or non-duplicates through a simple feedback mechanism in the interface. This feedback helps us decide whether to present records grouped as duplicates, and also focuses our algorithms on the results that are relevant for users. In a further step, the use of crowdsourcing systems (Amazon Mechanical Turk) extends the input of our users and incorporates methods for spam detection. Thus, we exploit the huge potential of social computing for digital libraries.
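The three steps can be illustrated with a minimal Python sketch. It is not the FreeSearch implementation: the record fields, the signature (a normalized title prefix plus the year, without Soundex), the similarity measure, the threshold and the rule that user feedback overrides the automatic decision are all assumptions made for illustration, and the sample records are invented.

```python
import re
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", ascii_only.lower()).split())


def signature(record):
    """Step 1: cheap grouping key (hypothetical: title prefix plus year)."""
    return (normalize(record["title"])[:20], record["year"])


def similarity(a, b):
    """Step 2: mean string similarity over title and author (illustrative)."""
    fields = ("title", "author")
    scores = [
        SequenceMatcher(None, normalize(a[f]), normalize(b[f])).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def find_duplicates(records, threshold=0.75, feedback=None):
    """Return the set of (id_a, id_b) pairs, id_a < id_b, judged duplicates.

    `feedback` maps a sorted id pair to True (duplicate) or False
    (non-duplicate). Step 3 is modelled here as a simple override of the
    automatic decision, which is an assumption of this sketch.
    """
    feedback = feedback or {}
    buckets = defaultdict(list)
    for rec_id, rec in records.items():
        buckets[signature(rec)].append(rec_id)

    pairs = set()
    for ids in buckets.values():
        ids = sorted(ids)
        # Only records sharing a signature are compared pairwise.
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                verdict = feedback.get((a, b))
                if verdict is None:
                    verdict = similarity(records[a], records[b]) >= threshold
                if verdict:
                    pairs.add((a, b))
    return pairs


# Invented sample records, for illustration only.
records = {
    "r1": {"title": "Data De-Duplication through the Crowd",
           "author": "Nejdl, Wolfgang", "year": 2012},
    "r2": {"title": "Data de-duplication through the crowd.",
           "author": "Nejdl, W.", "year": 2012},
    "r3": {"title": "Data De-Duplication through the Cloud",
           "author": "Smith, Jane", "year": 2012},
}

print(find_duplicates(records))
print(find_duplicates(records, feedback={("r1", "r2"): False}))
```

All three sample records share a signature, so the first step alone would group them together; the pairwise comparison in the second step separates the record by a different author, and the feedback dictionary shows how a user's "non-duplicate" mark could veto a pair the algorithm accepted.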



Bielefeld University Library - last update: 21/02/2012