SIGKDD Explorations: Newsletter of ACM SIG on Knowledge Discovery & Data Mining

Association for
Computing Machinery

SIGKDD

Newsletter of the Special
Interest Group (SIG) on
Knowledge Discovery &
Data Mining

January 2000. Volume 1, Issue 2

SIGKDD Explorations

Editor-in-Chief:
Usama Fayyad
Microsoft Research
Fayyad@acm.org

Associate Editor:
Sunita Sarawagi
I.I.T., Bombay
sunita@it.iitb.ernet.in

SIGKDD Explorations

KDD Conferences

Data Mining & Knowledge Discovery, An International Journal

Survey Articles

Data Mining for Hypertext: A Tutorial Survey
S. Chakrabarti
(available in PDF)
ABSTRACT: With over 800 million pages covering most areas of human endeavor, the World-wide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and dealing with a search result in more structured ways than are available now. Data mining and machine learning have significant roles to play towards this end.

In this paper we will survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We will review the continuum of supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges which distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of recent and ongoing research.

Web usage mining: discovery and applications of web usage patterns from web data
J. Srivastava, R. Cooley, M. Deshpande, P. Tan
(available in PDF)
ABSTRACT: Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. This paper describes each of these phases in detail. Given its application potential, Web usage mining has seen a rapid increase in interest, from both the research and practice communities. This paper provides a detailed taxonomy of the work in this area, including research efforts as well as commercial offerings. An up-to-date survey of the existing work is also provided. Finally, a brief overview of the WebSIFT system as an example of a prototypical Web usage mining system is given.

Position papers

Phenomenal Data mining
J. McCarthy
(available in PDF)
ABSTRACT: Phenomenal data mining finds relations between the data and the phenomena that give rise to data rather than just relations among the data. For example, suppose supermarket cash register data does not identify cash customers. Nevertheless, there really are customers, and these customers are characterized by sex, age, ethnicity, tastes, income distribution, and sensitivity to price changes. A data mining program might be able to identify which baskets of purchases are likely to have been made by the same customers. In this example, the receipts are the data, and the customers are phenomena not directly represented in the data. Once the "baskets" of purchases are grouped by customer, the way is open to infer further phenomena about the customers, e.g. their sex, age, etc. This article concerns what can be inferred by programs about phenomena from data and what facts are relevant to doing this. We work mainly with the supermarket example, but the idea is general. In order to infer phenomena from data, facts about their relations must be supplied. Sometimes these facts can be implicit in the programs that look for the phenomena, but more generality is achieved if the facts are represented as sentences of logic in a knowledge base used by the programs. The result of phenomenal data-mining might include an extended database with additional fields on existing relations and new relations. Thus the relations describing supermarket baskets might be extended with a customer field, and new relations about customers and their properties might be introduced.

Theoretical frameworks of data mining
H. Mannila
(available in PDF)
ABSTRACT: Is data mining a collection of different techniques and tricks, or is there a common background that would be useful in developing the field? The paper looks at theoretical approaches that might be used to describe the different data mining problems and techniques in a unified way.

Artificial neural networks - a science in trouble
A. Roy
(available in PDF)
ABSTRACT: This article points out some very serious misconceptions about the brain in connectionism and artificial neural networks. Some of the connectionist ideas have been shown to have logical flaws, while others are inconsistent with some commonly observed human learning processes and behavior. For example, the connectionist ideas have absolutely no provision for learning from stored information, something that humans do all the time. The article also argues that there is definitely a need for some new ideas about the internal mechanisms of the brain. It points out that a very convincing argument can be made for a "control theoretic" approach to understanding the brain. A "control theoretic" approach is actually used in all connectionist and neural network algorithms and it can also be justified from recent neurobiological evidence. A control theoretic approach proposes that there are subsystems within the brain that control other subsystems. Hence a similar approach can be taken in constructing learning algorithms and other intelligent systems.
Keywords: Connectionism, artificial neural networks, brain-like learning, data mining, intelligent systems, automated learning.

SIGKDD INFORMATION:

http://www.acm.org/sigkdd

SIGKDD NEWSLETTER CONTACT INFORMATION:

Usama Fayyad
Microsoft Research
One Microsoft Way
Redmond, WA 98052

Fax: +1-425-936-7329
fayyad@acm.org

Related Links

SIGKDD Explorations

Contributed articles

Discovering matrix association from biological databases
G. B. Singh
(available in PDF)
ABSTRACT: Biological databases have continued their exponential growth over the last decade, and data mining holds considerable promise for knowledge discovery in these databases. Discovery of the elements of locus control from genetic sequences is a significant problem, as these elements are responsible for gene expression and viability of an organism. Matrix Attachment Regions, or MARs, are one such type of element, and their detection has been hampered by limited knowledge about their structure. A discovery approach utilizing statistical estimation of "interestingness" has been implemented in the MARWiz software described in this paper. The strategy described is of general applicability for detecting other classes of signals in time-series or DNA sequence data.

A note on "Beyond Market Baskets: Generalizing association rules to correlations"
K. M. Ahmed, N. M. El-Makky, Y. Taha
(available in PDF)
ABSTRACT: In their paper, S. Brin, R. Motwani and C. Silverstein discussed measuring the significance of (generalized) association rules via the support and the chi-squared test for correlation. They provided some illustrative examples and pointed out that the chi-squared test needs to be augmented by a measure of interest, which they also suggested. This paper presents a further elaboration and extension of their discussion. As suggested by Brin et al., the chi-squared test succeeds in measuring the cell dependencies in a 2x2 contingency table. However, it can be misleading in cases of bigger contingency tables. We will give some illustrative examples based on those presented in their paper. We will also propose a more appropriate reliability measure of association rules.

Reports from the KDD-99 Conference

KDD-99: The Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
S. Chaudhuri, D. Madigan, U. Fayyad
(available in PDF)
ABSTRACT: KDD-99 was the fifth conference in the KDD series, attracting over 200 high-quality submissions and almost 600 attendees. Here we describe some of the highlights of the technical program.
Keywords: KDD Conference overview, ACM SIGKDD.

Data Snooping, Dredging and Fishing: The Dark Side of Data Mining
D. Jensen
(available in PDF)
ABSTRACT: This article briefly describes a panel discussion at SIGKDD99.
Keywords: Overfitting, SIGKDD99, Panels

Integrating Data Mining into Vertical Solutions
R. Kohavi, M. Sahami
(available in PDF)
ABSTRACT: At KDD-99, the panel on Integrating Data Mining into Vertical Solutions addressed a series of questions regarding future trends in industrial applications. Panelists were chosen to represent different viewpoints from a variety of industry segments, including data providers (Jim Bozik), horizontal and vertical tool providers (Ken Ono and Steve Belcher respectively), and data mining consultants (Rob Gerritsen and Dorian Pyle). Questions presented to the panelists included whether data mining companies should sell solutions or tools, who are the users of data mining, will data mining functionality be integrated into databases, do models need to be interpretable, what is the future of horizontal and vertical tool providers, and will industry-standard APIs be adopted?

Knowledge Discovery in Databases: Ten years after
G. Piatetsky-Shapiro
(available in PDF)
ABSTRACT: In this paper, we describe the past 10 years of KDD and outline predictions for the next 10 years.
Keywords: Knowledge Discovery in Databases, Data Mining, KDD, History.

Knowledge Discovery in Databases: A discussion on the last 10 and next 10 years
R. Quinlan
(available in PDF)
ABSTRACT: This paper presents the author's impressions of the panel with the above title held at KDD-99.

KDD-99 Classifier learning contest
Overview: C. Elkan
(available in PDF)
ABSTRACT: This paper presents a summary of the results of the classifier learning track of the KDD cup competition held at KDD-99.

First winner report: B. Pfahringer
(available in PDF)
ABSTRACT: The first place winners of the classifier learning contest describe their method in this report.

Second winner report: I. Levin
(available in PDF)
ABSTRACT: Kernel Miner is a new data-mining tool based on building the optimal decision forest. The tool won second place in the KDD'99 Classifier Learning Contest, August 1999. We describe Kernel Miner's approach and the method used for solving the contest task. The results obtained are analyzed and explained.
Keywords: Data Mining competition, decision trees, optimal decision forest, classification, prediction.

Third winner report: M. Vladimir, V. Alexei, S. Ivan
(available in PDF)
ABSTRACT: The MP13 method is best summarized as recognition based on voting decision trees using "pipes" in potential space.
Keywords: Voting; Decision Tree; Potential Space

KDD-99 Knowledge discovery contest
Overview: C. Elkan
(available in PDF)
ABSTRACT: This paper presents a summary of the results of the knowledge discovery track of the KDD cup competition held at KDD-99.

Co-winner 1: J. Georges, A. H. Milley
(available in PDF)
ABSTRACT: In this paper, we expand on the 1998 KDD cup competition findings: exploratory data analysis reveals unusual data anomalies; a two-stage prediction model yields superior results to those obtained in the 1998 competition; we use a decision tree to better understand the model (the decision boundary); and we apply a confidence interval to establish a range upon which we can reasonably judge model performance.
Keywords: Two-stage prediction, neural network, decision tree, model performance.

Co-winner 2: S. Rosset and A. Inger
(available in PDF)
ABSTRACT: This report describes the results of our knowledge discovery and modeling on the data of the 1997 donation campaign of an American charitable organization.

Honorable mention: P. Sebastiani, M. Ramoni, and A. Crea
(available in PDF)
ABSTRACT: This report describes a complete Knowledge Discovery session using Bayesware Discoverer, a program for the induction of Bayesian networks from incomplete data. We build two causal models to help an American Charitable Organization understand the characteristics of respondents to direct mail fund raising campaigns. The first model is a Bayesian network induced from the database of 96,376 Lapsed donors to the June '97 renewal mailing. The network describes the dependency of the probability of response to the renewal mail on a subset of the variables in the database. The second model is a Bayesian network representing the dependency of the dollar amount of the gift on the variables in the same reduced database. This model is induced from the 5% of cases in the database corresponding to the respondents to the renewal campaign. The two models are used for both predicting the expected gift of a donor and understanding the characteristics of donors. These two uses can help the charitable organization to maximize the profit.
Keywords: Bayesian Networks, Customer Profiling, Missing Data

Other conference reports

Interface ’99: A Data Mining Overview
A. Goodman
(available in PDF)

Discovering geographic knowledge in data rich environments: a report on a specialist meeting
H.J. Miller and J. Han
(available in PDF)
ABSTRACT: On 18-20 March 1999, a Specialist Meeting on “Discovering geographic knowledge in data-rich environments” was convened under the auspices of the Varenius Project of the National Center for Geographic Information and Analysis (NCGIA). This workshop brought together a diverse group of researchers and practitioners with interests in developing and applying new techniques for exploring large and diverse geographic datasets. The interaction prior to, during and after the three-day workshop resulted in the identification of research priorities and directions for continued development of “geographic knowledge discovery” (GKD) theory and techniques.
Keywords: Geographic data mining, spatio-temporal data mining, geographic information systems, geographic research.

WebKDD-99: Workshop on Web Usage Analysis and User Profiling
Brij Masand, Dr. Myra Spiliopoulou
(available in PDF)
ABSTRACT: The WEBKDD'99 workshop on "Web Usage Analysis and User Profiling" took place on Aug. 15, 1999 under the auspices of the SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99). We report on the topics addressed in the workshop, the contributions and the discussions that took place in its framework.
Keywords: Web usage mining

KDD-99 Workshop on Large-Scale Parallel KDD systems
M. Zaki, C.T. Ho
(available in PDF)

SIGMOD 99 Workshop on research issues in data mining and knowledge discovery
K. Shim, R. Srikant
(available in PDF)

Interesting KDD news from SIGMOD 99
D. Keim
(available in PDF)

Book Reviews

Data Mining Methods for Knowledge Discovery
by K. Cios, W. Pedrycz and R. Swiniarski, Kluwer
(available in PDF)
ABSTRACT: This paper is a review of the book "Data Mining Methods for Knowledge Discovery", by K. Cios, W. Pedrycz and R. Swiniarski, Kluwer 1998, 495 pp.
Keywords: Data mining, Book review.

News, Events and Announcements
(available in PDF)