TR-IIS-06-017


Extracting Citation Relationships from Web Documents for Author Disambiguation

Kai-Hsiang Yang, Jian-Yi Jiang, Hahn-Ming Lee, Jan-Ming Ho

Abstract

Disambiguating the citation records of authors who share the same name is a challenging problem that affects many research and application fields, such as digital libraries. However, current bibliographic digital libraries such as CiteSeer cannot correctly disambiguate citation records because of two problems: information sparsity (citations for the same individual have few or no common features) and information noise (citations for different individuals share coauthor names, title words, or venue words). To resolve these problems, we propose a novel author disambiguation scheme that searches the Web for authors’ publication lists to enrich citation information. A binary classifier and a cluster separator are used to filter out noise. The experimental results show that disambiguation accuracy improves from 51% to 73% when Web information is used in the disambiguation task. Furthermore, for most datasets, the clustering precision is satisfactory (more than 90%).

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Clustering; H.3.7 [Digital Libraries]: Dissemination