2003 Digital Symposium Collection

Systematic Development of Data Mining-Based Data Quality Tools

Dominik Luebbers, Udo Grimmer, and Matthias Jarke

Session B5: Data Quality, Data Mining

Abstract

Data quality problems have been a persistent concern, especially for large, historically grown databases. If such databases are maintained over long periods, the interpretation and usage of their schemas often shift. Therefore, traditional data scrubbing techniques based on existing schema and integrity constraint documentation are hardly applicable. So-called data auditing environments circumvent this problem by using machine learning techniques to induce semantically meaningful structures from the actual data, and then classifying outliers that do not fit the induced schema as potential errors. However, as the quality of the analyzed database is unknown a priori, the design of data auditing environments requires special methods for calibrating error measurements based on the induced schema. In this paper, we present a data audit test generator that systematically generates and pollutes artificial benchmark databases for this purpose. The test generator has been implemented as part of a data auditing environment based on the well-known machine learning algorithm C4.5. Validation in the partial quality audit of a large service-related database at DaimlerChrysler shows the usefulness of the approach as a complement to standard data scrubbing.
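The calibration idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' test generator or their C4.5-based environment: the attribute names (`model`, `engine`), the pollution rate, and all function names are hypothetical, and a simple majority-value rule per attribute value stands in for a full decision tree learner. The sketch builds an artificial database that obeys a known rule, pollutes a known set of records, induces structure from the polluted data, and measures how well the flagged outliers match the injected errors.

```python
import random
from collections import Counter, defaultdict

# Hypothetical ground-truth rule for the artificial database.
RULE = {"A": "petrol", "B": "diesel", "C": "electric"}

def generate_clean(n):
    """Build n artificial records in which 'model' determines 'engine'."""
    models = sorted(RULE)
    return [{"model": models[i % len(models)],
             "engine": RULE[models[i % len(models)]]} for i in range(n)]

def pollute(records, rate, rng):
    """Corrupt the 'engine' field of a random fraction of the records.

    Returns a polluted copy and the set of corrupted row indices."""
    engines = sorted(set(RULE.values()))
    polluted = [dict(r) for r in records]
    bad = set(rng.sample(range(len(records)), round(rate * len(records))))
    for i in bad:
        wrong = [e for e in engines if e != polluted[i]["engine"]]
        polluted[i]["engine"] = rng.choice(wrong)
    return polluted, bad

def audit(records):
    """Induce the majority 'engine' per 'model' and flag deviating rows.

    This one-level rule is a stand-in for a decision tree, not C4.5."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r["model"]][r["engine"]] += 1
    majority = {m: c.most_common(1)[0][0] for m, c in counts.items()}
    return {i for i, r in enumerate(records)
            if r["engine"] != majority[r["model"]]}

def evaluate(flagged, bad):
    """Precision and recall of flagged rows against the injected errors."""
    true_positives = len(flagged & bad)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(bad) if bad else 1.0
    return precision, recall

rng = random.Random(0)
clean = generate_clean(1000)
polluted, bad = pollute(clean, 0.05, rng)
flagged = audit(polluted)
precision, recall = evaluate(flagged, bad)
print(f"injected={len(bad)} flagged={len(flagged)} "
      f"precision={precision:.2f} recall={recall:.2f}")
```

Because the injected errors are known, precision and recall can be computed exactly, which is not possible on a real database of unknown quality. In this toy setting the rule is deterministic and the pollution rate is low, so every injected error is recovered; noisier rules or higher pollution rates would lower both measures, and that variation is the kind of signal a benchmark of this sort is meant to expose.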