<<< = ICDE'00 Pape>>>

An Extensible Framework for Data Cleaning


H. Galhardas, D. Florescu, D. Shasha, and E. Simon




Abstract


Data quality concerns arise when one wants to correct anomalies in a single data source (e.g., duplicate elimination in a file), or when one wants to integrate data coming from multiple sources into a single new data source (e.g., data warehouse construction). Three data quality problems are typically encountered: (1) the absence of universal keys across different databases, known as the object identity problem, (2) the existence of keyboard errors in the data, and (3) the presence of inconsistencies in data coming from multiple sources. Dealing with these problems is collectively called the data cleaning process. We propose a framework that models a data cleaning application as a directed graph of data transformations. Transformations are divided into four distinct classes: mapping, matching, clustering, and merging, each of which is implemented by a macro-operator. Moreover, we propose an SQL extension for specifying each of the macro-operators. One important feature of the framework is the ability to include human interaction explicitly in the process. Finally, we study performance optimizations that are tailored for data cleaning applications: mixed evaluation, neighborhood hash join, decision push-down, and short-circuited computation.

Keywords: data quality, data cleaning, query language, query optimization, data transformation, duplicate elimination, approximate join, object matching
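
To make the four transformation classes named in the abstract more concrete, the following is a minimal Python sketch of a linear mapping, matching, clustering, and merging pipeline for duplicate elimination. It is an illustration only and is not the SQL extension or the macro-operators proposed in the paper. The record layout (`id`, `name`), the function names, the similarity measure (`difflib.SequenceMatcher`), the 0.85 threshold, the transitive-closure clustering, and the "keep the longest name" merge rule are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mapping(records):
    """Mapping (assumed rule): lowercase names and collapse whitespace."""
    return [
        {"id": r["id"], "name": " ".join(r["name"].lower().split())}
        for r in records
    ]


def matching(records, threshold=0.85):
    """Matching: naive approximate self-join over all record pairs.

    The threshold and the similarity function are hypothetical choices.
    """
    pairs = []
    for a, b in combinations(records, 2):
        score = SequenceMatcher(None, a["name"], b["name"]).ratio()
        if score >= threshold:
            pairs.append((a["id"], b["id"]))
    return pairs


def clustering(records, pairs):
    """Clustering: group matched records by transitive closure (union-find)."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), []).append(r)
    return list(clusters.values())


def merging(clusters):
    """Merging (assumed rule): keep the record with the longest name."""
    return [max(c, key=lambda r: len(r["name"])) for c in clusters]


def clean(records):
    """Chain the four steps; a real application could be a general graph."""
    mapped = mapping(records)
    return merging(clustering(mapped, matching(mapped)))


if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "John  Smith"},
        {"id": 2, "name": "john smith"},
        {"id": 3, "name": "Jon Smith"},
        {"id": 4, "name": "Mary Jones"},
    ]
    for record in clean(sample):
        print(record)
```

The all-pairs comparison in `matching` is quadratic in the number of records, which is the kind of cost that the optimizations listed in the abstract are meant to reduce; the sketch makes no attempt to reproduce them. A point where a user reviews borderline pairs could be added between `matching` and `clustering`, in the spirit of the human interaction the abstract mentions.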



DiSC'01 Copyright ©2002 ACM Inc.