Welcome to D
SIGMOD 2005
PODS 2005
SIGMOD-RECOR
CIDR 2005
CIKM 2005
COMAD 2005
CVDB 2005
DaMoN 2005
Data Enginee
DEBS05
DMSN 2005
DOLAP 2005
GIR 2005
GIS 2005
Hypertext 20
ICDE 2005
ICDM 2005
IHIS 2005
IQIS 2005
JCDL 2005
KRAS 2005
MDM 2005
MIR 2005
MobiDE 2005
P2PIR 2005
RIDE 2005
SBBD 2005
SIGIR 2005
SIGIR-FORUM
SIGKDD 2005
SIGKDD-EXP
SSDBM 2005
TIME 2005
TKDE 2005
<<< = TKDE'05 Pape>>>
TODS 2005
VLDB 2005
VLDBJ 2005
WebDB 2005
WIDM 2005

Fast detection of XML structural similarity


Sergio Flesca, Giuseppe Manco, Elio Masciari, Luigi Pontieri, and Andrea Pugliese

  View Paper (PDF)  

Return to February 2005, Volume 17, Issue 2


Abstract

Because of the widespread diffusion of semistructured data in XML format, much research effort is currently devoted to support the storage and retrieval of large collections of such documents. XML documents can be compared as to their structural similarity, in order to group them into clusters so that different storage, retrieval, and processing techniques can be effectively exploited. In this scenario, an efficient and effective similarity function is the key of a successful data management process. We present an approach for detecting structural similarity between XML documents which significantly differs from standard methods based on graph-matching algorithms, and allows a significant reduction of the required computation costs. Our proposal roughly consists of linearizing the structure of each XML document, by representing it as a numerical sequence and, then, comparing such sequences through the analysis of their frequencies. First, some basic strategies for encoding a document are proposed, which can focus on diverse structural facets. Moreover, the theory of discrete Fourier transform is exploited to effectively and efficiently compare the encoded documents (i.e., signals) in the domain of frequencies. Experimental results reveal the effectiveness of the approach, also in comparison with standard methods.


©2006 Association for Computing Machinery