Welcome to D
SIGMOD 2005
PODS 2005
SIGMOD-RECOR
CIDR 2005
CIKM 2005
COMAD 2005
CVDB 2005
DaMoN 2005
Data Enginee
DEBS05
DMSN 2005
DOLAP 2005
GIR 2005
GIS 2005
Hypertext 20
ICDE 2005
ICDM 2005
IHIS 2005
IQIS 2005
JCDL 2005
KRAS 2005
MDM 2005
MIR 2005
MobiDE 2005
P2PIR 2005
RIDE 2005
SBBD 2005
SIGIR 2005
SIGIR-FORUM
SIGKDD 2005
<<< = SIGKDD'05 Pa>>>
SIGKDD-EXP
SSDBM 2005
TIME 2005
TKDE 2005
TODS 2005
VLDB 2005
VLDBJ 2005
WebDB 2005
WIDM 2005

Finding Similar Files in Large Document Repositories


George Forman, Kave Eshghi, and Stephane Chiocchetti

  View Paper (PDF)  

Return to Industrial/Govt Track Session 4 [Document Analysis]


Abstract

Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction.The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing.We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.


©2006 Association for Computing Machinery