PODS '00: Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems

The Web as a graph

The pages and hyperlinks of the World-Wide Web may be viewed as nodes and edges in a directed graph. This graph has about a billion nodes today, several billion links, and appears to grow exponentially with time. There are many reasons—mathematical, sociological, and commercial—for studying the evolution of this graph. We first review a set of algorithms that operate on the Web graph, addressing problems from Web search, automatic community discovery, and classification. We then recall a number of measurements and properties of the Web graph. Noting that traditional random graph models do not explain these observations, we propose a new family of random graph models.

Typechecking for XML transformers

We study the typechecking problem for XML transformers: given an XML transformation program and a DTD for the input XML documents, check whether every result of the program conforms to a specified output DTD. We model XML transformers using a novel device called a k-pebble transducer, which can express most queries without data-value joins in XML-QL, XSLT, and other XML query languages. Types are modeled by regular tree languages, a robust extension of DTDs. The main result of the paper is that typechecking for k-pebble transducers is decidable. Consequently, typechecking can be performed for a broad range of XML transformation languages, including XML-QL and a fragment of XSLT.

Integrity constraints for XML

Integrity constraints are useful for semantic specification, query optimization and data integration. The ID/IDREF mechanism provided by XML DTDs relies on a simple form of constraint to describe references. Yet, this mechanism is not sufficient to express semantic constraints, such as keys or inverse relationships, or stronger, object-style references. In this paper, we investigate integrity constraints for XML, both for semantic purposes and to improve its current reference mechanism. We extend DTDs with several families of constraints, including key, foreign key, inverse constraints and constraints specifying the semantics of object identities. These constraints are useful both for native XML documents and to preserve the semantics of data originating in relational or object databases. Complexity and axiomatization results are established for the (finite) implication problems associated with these constraints. These results also extend relational dependency theory on the interaction between (primary) keys and foreign keys. In addition, we investigate implication of more general constraints, such as functional, inclusion and inverse constraints defined in terms of navigation paths.

DTD inference for views of XML data

We study the inference of Data Type Definitions (DTDs) for views of XML data, using an abstraction that focuses on document content structure. The views are defined by a query language that produces a list of documents selected from one or more input sources. The selection conditions involve vertical and horizontal navigation, thus querying explicitly the order present in input documents. We point out several strong limitations in the descriptive ability of current DTDs and the need to extend them with (i) a subtyping mechanism and (ii) a more powerful specification mechanism than regular languages, such as context-free languages. With these extensions, we show that one can always infer tight DTDs that precisely characterize a selection view on sources satisfying given DTDs. We also show important special cases where one can infer a tight DTD without requiring extension (ii). Finally, we consider related problems such as verifying conformance of a view definition with a predefined DTD. Extensions to more powerful views that construct complex documents are also briefly discussed.

On the content of materialized aggregate views

We consider the problem of answering queries using only materialized views. We first show that if the views subsume the query from the point of view of information content, then the query can be answered using only the views, but the resulting query might be extremely inefficient. We then focus on aggregate views and queries over a single relation, which are fundamental in many applications such as data warehousing. We show that in this case it is possible to guarantee that, as soon as the views subsume the query, it can be completely rewritten in terms of the views in a simple query language. Our main contribution is the design of several rewriting algorithms that run in polynomial time, together with proofs of their completeness, which rely on combinatorial arguments. Finally, we discuss whether or not to materialize ratio views such as averages and percentages, a question that is important for the design of materialized views. We show that this choice has an impact on the information content, which can be used to protect data, as well as on the maintenance of views.
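
As a hedged illustration of the flavor of rewriting studied here (not the paper's algorithms), the Python sketch below answers a coarser-grained SUM query using only a materialized finer-grained SUM view; the table, view, and city-to-state mapping are hypothetical.

    # Hypothetical example: answer SUM(sales) grouped by state using only a
    # materialized view holding SUM(sales) grouped by city.
    from collections import defaultdict

    # Materialized view: city -> SUM(sales), assumed already computed.
    view_sum_by_city = {"Paris": 120, "Lyon": 80, "Berlin": 95, "Munich": 60}

    # Schema-level mapping city -> state (not base data).
    state_of = {"Paris": "FR", "Lyon": "FR", "Berlin": "DE", "Munich": "DE"}

    def sum_by_state(view):
        """SUM re-aggregates, so the coarser query is answerable from the view alone."""
        out = defaultdict(int)
        for city, s in view.items():
            out[state_of[city]] += s
        return dict(out)

    print(sum_by_state(view_sum_by_city))   # {'FR': 200, 'DE': 155}

By contrast, a materialized AVG-by-city view alone would not subsume this query; an average is only recoverable when SUM and COUNT are both available, which is one facet of the ratio-view question raised above.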

View-based query processing for regular path queries with inverse

View-based query processing is the problem of computing the answer to a query based on a set of materialized views, rather than on the raw data in the database. The problem comes in two different forms, called query rewriting and query answering, respectively. In the first form, we are given a query and a set of view definitions, and the goal is to reformulate the query into an expression that refers only to the views. In the second form, besides the query and the view definitions, we are also given the extensions of the views and a tuple, and the goal is to check whether the knowledge on the view extensions logically implies that the tuple satisfies the query. In this paper we address the problem of view-based query processing in the context of semistructured data, in particular for the case of regular-path queries extended with the inverse operator. Several authors point out that the inverse operator is one of the fundamental extensions for making regular-path queries useful in real settings. We present a novel technique based on the use of two-way finite-state automata. Our approach demonstrates the power of this kind of automaton in dealing with the inverse operator, allowing us to show that both query rewriting and query answering with the inverse operator have the same computational complexity as in the case of standard regular-path queries.

Query containment for data integration systems

The problem of query containment is fundamental to many aspects of database systems, including query optimization, determining independence of queries from updates, and rewriting queries using views. In the data integration framework, however, the standard notion of query containment does not suffice. We define relative containment, which formalizes the notion of query containment relative to the sources available to the integration system. First we provide optimal bounds for relative containment for several important classes of datalog queries, including the common case of conjunctive queries. Next we provide bounds for the case when sources enforce access restrictions in the form of binding pattern constraints. Surprisingly, we show that relative containment for conjunctive queries is still decidable in this case, even though it is known that finding all answers to such queries may require a recursive datalog program over the sources. Finally, we provide tight bounds for variants of relative containment when the queries and source descriptions may contain comparison predicates.

Constraint satisfaction and database theory: a tutorial

A large class of problems in AI and other areas of computer science can be viewed as constraint-satisfaction problems. This includes problems in machine vision, belief maintenance, scheduling, temporal reasoning, type reconstruction, graph theory, and satisfiability. In general, the constraint-satisfaction problem is NP-complete, so searching for tractable cases is an active research area. It turns out that constraint satisfaction has an intimate connection with database theory: constraint-satisfaction problems can be recast as database problems, and database problems can be recast as constraint-satisfaction problems. In this tutorial, I will cover the fundamentals of constraint satisfaction and describe its intimate relationship with database theory from various perspectives.

Auditing Boolean attributes

We study the problem of auditing databases which support statistical sum queries to protect the security of sensitive information; we focus on the special case in which the sensitive information is Boolean. Principles and techniques developed for the security of statistical databases in the case of continuous attributes do not apply here. We prove strong complexity results suggesting that there is no general efficient solution for the auditing problem in this case. We propose two efficient algorithms: the first is applicable when the sum queries are one-dimensional range queries (we prove that the problem is NP-hard even in the two-dimensional case). The second is an approximate algorithm that maintains security, although it may be too restrictive. Finally, we consider a “dual” variant, with continuous data but an aggregate function that is combinatorial in nature. Specifically, we provide algorithms for two natural definitions of the auditing condition when the aggregate function is MAX.
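
A minimal, hypothetical sketch of the auditing condition for Boolean data: a record's value is compromised exactly when every 0/1 assignment consistent with the answered sum queries agrees on it. The brute-force check below enumerates all assignments, so it is exponential; it is only meant to make the definition concrete, and the paper's complexity results suggest that no general efficient procedure is likely.

    from itertools import product

    def compromised_positions(n, queries):
        """n Boolean attributes x[0..n-1]; queries is a list of
        (index_set, answered_sum) pairs. Return the positions whose
        value is identical in every consistent 0/1 assignment."""
        consistent = [x for x in product((0, 1), repeat=n)
                      if all(sum(x[i] for i in idx) == s for idx, s in queries)]
        if not consistent:
            return set()   # inconsistent answers: nothing is disclosed
        return {i for i in range(n)
                if len({x[i] for x in consistent}) == 1}

    # Two one-dimensional range-sum answers over 4 Boolean records:
    qs = [({0, 1}, 2), ({1, 2}, 1)]
    print(compromised_positions(4, qs))   # {0, 1, 2}: x0 = x1 = 1 and x2 = 0 are forced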

Verification of relational transducers for electronic commerce

Motivated by recent work of Abiteboul, Vianu, Fordham, and Yesha [3], we investigate the verifiability of transaction protocols specifying the interaction of multiple parties via a network, where each party is equipped with an (active) database that participates in the interaction. Such transaction protocols typically occur in the context of electronic commerce applications and can be formalized as relational transducers. We introduce a class of powerful relational transducers based on Gurevich's abstract state machines and show that several verification problems related to electronic commerce applications can be solved for these transducers. Our approach is, in some sense, complementary to the approach in [3].

Reachability and connectivity queries in constraint databases

It is known that standard query languages for constraint databases lack the power to express connectivity properties. Such properties are important in the context of geographical databases, where one naturally wishes to ask queries about connectivity (what are the connected components of a given set?) or reachability (is there a path from A to B that lies entirely in a given region?). No existing constraint query languages that allow closed-form evaluation can express these properties. In the first part of the paper, we show that in principle there is no obstacle to getting closed languages that can express connectivity and reachability queries. In fact, we show that adding any topological property to standard languages like FO+LIN and FO+POLY results in a closed language. In the second part of the paper, we look for tractable closed languages for expressing reachability and connectivity queries. We introduce path logic, which allows one to state properties of paths with respect to given regions. We show that it is closed, has polynomial-time data complexity for linear and polynomial constraints, and can express a large number of reachability properties beyond simple connectivity. Query evaluation in the logic involves obtaining a discrete abstraction of a continuous path, and model-checking of temporal formulae on the discrete structure.

Fixed-point query languages for linear constraint databases

We introduce a family of query languages for linear constraint databases over the reals. The languages are defined over two-sorted structures, the first sort being the real numbers and the second sort consisting of a decomposition of the input relation into regions. The languages are defined as extensions of first-order logic by transitive closure or fixed-point operators, where the fixed-point operators are defined over the set of regions only. It is shown that the query languages capture precisely the queries definable in various standard complexity classes including PTIME.

Linear approximation of planar spatial databases using transitive-closure logic

We consider spatial databases in the plane that can be defined by polynomial constraint formulas. Motivated by applications in geographic information systems, we investigate linear approximations of spatial databases and study in which language they can be expressed effectively. Specifically, we show that they cannot be expressed in the standard first-order query language for polynomial constraint databases but that an extension of this first-order language with transitive closure suffices to express the approximation query in an effective manner. Furthermore, we introduce an extension of transitive-closure logic and show that this logic is complete for the computable queries on linear spatial databases. This result together with our first result implies that this extension of transitive-closure logic can express all computable topological queries on arbitrary spatial databases in the plane.

Computational aspects of resilient data extraction from semistructured sources (extended abstract)

Automatic data extraction from semistructured sources such as HTML pages is rapidly growing into a problem of significant importance, spurred by the growing popularity of the so-called “shopbots” that enable end users to compare prices of goods and other services at various web sites without having to manually browse and fill out forms at each of these sites. The main problem one has to contend with when designing data extraction techniques is that the content of a web page changes frequently, either because its data is generated dynamically, in response to filling out a form, or because of changes to its presentation format. This makes the problem of data extraction particularly challenging, since a desirable requirement of any data extraction technique is that it be “resilient”, i.e., using it we should always be able to locate the object of interest in a page (such as a form or an element in a table generated by a form fill-out) in spite of changes to the page's content and layout. In this paper we propose a formal computation model for developing resilient data extraction techniques for semistructured sources. Specifically, we formalize the problem of data extraction as one of generating unambiguous extraction expressions, which are regular expressions with some additional structure. The problem of resilience is then formalized as one of generating a maximal extraction expression of this kind. We present characterization theorems for maximal extraction expressions, complexity results for testing them, and algorithms for synthesizing them.
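
As a hedged, simplified illustration of an extraction expression in the regular-expression spirit described above (not the paper's formal device), the Python snippet below anchors on structural context around the object of interest rather than on its absolute position, so it survives some layout changes; the HTML fragments and the pattern are hypothetical.

    import re

    # Two renderings of the same page: the layout differs, but the price is
    # still announced by the label "Total:".
    page_v1 = "<table><tr><td>Total:</td><td>$19.99</td></tr></table>"
    page_v2 = "<div class='new-skin'><span>Total:</span> <b>$21.50</b></div>"

    # Extraction expression: anchor on the label, skip intervening markup,
    # capture the amount.
    PRICE = re.compile(r"Total:\s*(?:<[^>]*>\s*)*\$(\d+\.\d{2})")

    for page in (page_v1, page_v2):
        m = PRICE.search(page)
        print(m.group(1) if m else "not found")   # 19.99, then 21.50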

Expressive and efficient pattern languages for tree-structured data (extended abstract)

It would be desirable to have a query language for tree-structured data that is (1) as easily usable as SQL, (2) as expressive as monadic second-order logic (MSO), and (3) efficiently evaluable. The paper develops some ideas in this direction. Towards (1), the specification of sets of vertices of a tree by combining conditions on their induced subtree with conditions on their path to the root is proposed. Existing query languages allow regular expressions (hence MSO logic) in path conditions but are limited in expressing subtree conditions. It is shown that such query languages fall short of capturing all MSO queries. On the other hand, allowing a certain guarded fragment of MSO logic in the specification of subtree conditions results in a language fulfilling (2), (3) and, arguably, (1).

Expressive power and data complexity of nonrecursive query languages for lists and trees (extended abstract)

We extend traditional query languages with primitives for handling lists and trees. Our main results characterize the expressive power and data complexity of the following extended languages: (1) relational algebra with lists and trees, (2) nonrecursive Datalog with lists and trees, (3) nonrecursive Prolog with lists and trees, and (4) first-order logic over lists and trees. Languages (2)-(4) turn out to have the same expressive power; their range-restricted fragments have the same expressive power as (1). Every query in these languages is a Boolean combination of range-restricted queries. We also prove that these query languages have polynomial data complexity under any “reasonable” encoding of inputs. Furthermore, under a natural encoding of inputs, languages (2)-(4) have the same expressive power as first-order logic over finite structures, and therefore their data complexity is in AC0. Thus, the use of lists and trees in nonrecursive query languages gives no gain in expressiveness. This contrasts with the huge difference between the nonelementary program complexity of the extended languages (2)-(4) and the PSPACE program complexity of their relational counterparts. Our results partly explain why lists and trees are not as widely used in nonrecursive query languages as other collection types.

Indexing the edges—a simple and yet efficient approach to high-dimensional indexing

In this paper, we propose a new tunable index scheme, called iMinMax(θ), that maps points in high-dimensional spaces to single-dimension values determined by their maximum or minimum values among all dimensions. By varying the tuning “knob” θ, we can obtain different families of iMinMax structures that are optimized for different distributions of data sets. For a d-dimensional space, a range query needs to be transformed into d subqueries. However, some of these subqueries can be pruned away without evaluation, further enhancing the efficiency of the scheme. Experimental results show that iMinMax(θ) can outperform the more complex Pyramid technique by a wide margin.
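
A minimal Python sketch of the kind of mapping just described: each d-dimensional point collapses to a single sortable value built from its maximum or minimum coordinate, with the tuning parameter biasing which edge is chosen. The exact tie-breaking and key encoding below are simplifying assumptions, not the paper's precise definition.

    def iminmax_key(point, theta=0.0):
        """Map a point in [0,1]^d to one indexing value: choose the 'min edge'
        or the 'max edge' depending on which the point is closer to (biased by
        theta), and encode the choice as dimension + coordinate."""
        d_min, x_min = min(enumerate(point), key=lambda t: t[1])
        d_max, x_max = max(enumerate(point), key=lambda t: t[1])
        if x_min + theta < 1.0 - x_max:     # closer to the min edge
            return d_min + x_min
        return d_max + x_max

    # Points in [0,1]^3 collapse to one value each; a standard B+-tree over
    # these keys can then serve the d transformed range subqueries.
    pts = [(0.1, 0.6, 0.4), (0.9, 0.2, 0.5), (0.3, 0.35, 0.4)]
    print([round(iminmax_key(p), 2) for p in pts])   # [0.1, 0.9, 0.3]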

Indexing moving points (extended abstract)

We propose three indexing schemes for storing a set S of N points in the plane, each moving along a linear trajectory, so that a query of the following form can be answered quickly: given a rectangle R and a real value t_q, report all K points of S that lie inside R at time t_q. We first present an indexing structure that, for any given constant ε > 0, uses O(N/B) disk blocks, where B is the block size, and answers a query in O((N/B)^{1/2+ε} + K/B) I/Os. It can also report all the points of S that lie inside R during a given time interval. A point can be inserted or deleted, or the trajectory of a point can be changed, in O(log_B^2 N) I/Os. Next, we present a general approach that improves the query time if the queries arrive in chronological order, by allowing the index to evolve over time. We obtain a trade-off between the query time and the number of times the index needs to be updated as the points move. We also describe an indexing scheme in which the number of I/Os required to answer a query depends monotonically on the difference between t_q and the current time. Finally, we develop an efficient indexing scheme to answer approximate nearest-neighbor queries among moving points.
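
As a small grounding example (not the paper's index structures), the predicate being indexed is simple: a point moving along a linear trajectory p(t) = p0 + v*t lies in the query rectangle R at time t_q. The hypothetical sketch below evaluates it by brute force; the schemes above answer the same query in far fewer I/Os.

    def report(points, rect, tq):
        """points: list of (x0, y0, vx, vy) linear trajectories.
        rect: (x_lo, x_hi, y_lo, y_hi). Return positions of points inside rect at time tq."""
        x_lo, x_hi, y_lo, y_hi = rect
        hits = []
        for (x0, y0, vx, vy) in points:
            x, y = x0 + vx * tq, y0 + vy * tq      # position at query time
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
                hits.append((x, y))
        return hits

    pts = [(0.0, 0.0, 1.0, 0.5), (5.0, 5.0, -1.0, 0.0)]
    print(report(pts, rect=(1.0, 4.0, 0.0, 4.0), tq=2.0))   # [(2.0, 1.0)]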

On Herbrand semantics and conflict serializability of read-write transactions (extended abstract)

The quest for unified correctness criteria in database concurrency control is addressed from a new perspective. A family of Herbrand semantics is presented, where each semantics provides an interpretation for operations in the read-write model of transactions. Using commutativity arguments, each semantics leads to a notion of conflict, which then gives rise to distinct classes of serializable schedules. Surprisingly, the classical notion of serializability with respect to two of these semantics, update-in-place and deferred-update semantics, already embodies a unified correctness criterion; moreover, prefix-closed variants of it allow for a higher degree of transaction parallelism than, for example, prefix-reducibility. Finally, it is shown that previous criteria may permit undesirable schedules, which are ruled out by a stronger notion of serializability that captures all intuitively correct schedules, but is incomparable to prefix-reducibility and the classes of schedules recognized by optimistic protocols.

Entrepreneurship for information systems researchers (invited tutorial) (abstract only)

Today's environment offers tremendous opportunity for researchers to contribute to the advancement of commercial enterprises. Traditionally this involvement has been in the form of research grants from larger organizations and combined projects with research laboratories or advanced development centers. More recently the involvement has been in the form of researchers starting or helping start small companies that are productising cutting-edge technology. This tutorial will discuss the logistics of start-up companies, funding, and other issues associated with commercialising technology through the start-up route. Questions will be answered by the speaker and by knowledgeable members of the audience.

Optimal histograms for hierarchical range queries (extended abstract)

(Almost) optimal parallel block access to range queries

This guarantee is true for any number of dimensions. Subsequent to this work, Bhatia et al. [4] have proved that such a performance bound is essentially optimal for this kind of scheme, and have also extended our results to the case where the number of disks is a product of the form k_1 * k_2 * ... * k_t, where the k_i need not all be 2.

Selectivity estimation for Boolean queries

In a variety of applications, ranging from optimizing queries on alphanumeric attributes to providing approximate counts of documents containing several query terms, there is an increasing need to quickly and reliably estimate the number of strings (tuples, documents, etc.) matching a Boolean query. Boolean queries in this context consist of substring predicates composed using Boolean operators. While there has been some work on estimating the selectivity of substring queries, the more general problem of estimating the selectivity of Boolean queries over substring predicates has not been studied. Our approach is to extract selectivity estimates from relationships between the substring predicates of the Boolean query. However, storing the correlation between all possible predicates in order to provide exact answers is clearly infeasible, as there is a super-exponential number of possible combinations of these predicates. Instead, our novel idea is to capture correlations in a space-efficient but approximate manner. We employ a Monte Carlo technique called set hashing to succinctly represent the set of strings containing a given substring as a signature vector of hash values. Correlations among substring predicates can then be generated on the fly by operating on these signatures. We formalize our approach and propose an algorithm for estimating the selectivity of any Boolean query using the signatures of its substring predicates. We then experimentally demonstrate the superiority of our approach over a straightforward approach based on the independence assumption, wherein correlations are not explicitly captured.
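
A hedged Python sketch of the set-hashing (min-hash) idea mentioned above: each substring predicate's set of matching row ids is summarized by a short signature of minimum hash values, and the fraction of agreeing signature slots estimates the Jaccard similarity of the two sets, from which joint selectivities can be approximated. The signature length, hash family, and row sets below are illustrative assumptions.

    import random

    K = 64                               # signature length (illustrative)
    P = 2_147_483_647                    # a large prime modulus
    random.seed(0)
    HASHES = [(random.randrange(1, P), random.randrange(P)) for _ in range(K)]

    def signature(ids):
        """Min-hash signature of a set of matching row ids."""
        return [min((a * x + b) % P for x in ids) for a, b in HASHES]

    def est_jaccard(sig1, sig2):
        """Fraction of slots where the minima agree estimates |A∩B| / |A∪B|."""
        return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

    # Hypothetical rows matching two substring predicates:
    A = set(range(0, 1000))      # rows whose string contains "data"
    B = set(range(500, 1500))    # rows whose string contains "base"
    j = est_jaccard(signature(A), signature(B))
    # |A ∩ B| can then be recovered as roughly j * |A ∪ B|, without storing the intersection.
    print(round(j, 2), "true Jaccard:", round(len(A & B) / len(A | B), 2))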

Traversing itemset lattices with statistical metric pruning

We study how to efficiently compute significant association rules according to common statistical measures such as the chi-squared value or the correlation coefficient. For this purpose, one might consider using the Apriori algorithm, but the algorithm would need major modification, because none of these statistical metrics is anti-monotone, and raising the support threshold to reduce the search space cannot guarantee that all solutions remain in the reduced space. We present a method for estimating a tight upper bound on the statistical metric associated with any superset of an itemset, together with a novel use of these upper bounds to prune unproductive supersets while traversing itemset lattices. Experimental tests demonstrate the efficiency of this method.
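
To make the statistic concrete, here is a hedged sketch of the chi-squared value of an association between itemsets X and Y, computed from a 2x2 contingency table of support counts; the counts are hypothetical, and the paper's upper-bounding technique is not reproduced here.

    def chi_squared(n, n_x, n_y, n_xy):
        """Chi-squared statistic of the 2x2 table for itemsets X and Y:
        n = total transactions, n_x = support(X), n_y = support(Y),
        n_xy = support(X and Y together)."""
        observed = [n_xy, n_x - n_xy, n_y - n_xy, n - n_x - n_y + n_xy]
        px, py = n_x / n, n_y / n
        expected = [n * px * py, n * px * (1 - py),
                    n * (1 - px) * py, n * (1 - px) * (1 - py)]
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Support drops monotonically as items are added, but the chi-squared
    # value can still rise, which is why anti-monotone (Apriori-style)
    # pruning does not apply directly.
    print(round(chi_squared(n=1000, n_x=300, n_y=400, n_xy=200), 1))   # 127.0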

Computational properties of metaquerying problems

Metaquerying is a data mining technique by which hidden dependencies among several database relations can be discovered. This tool has already been successfully applied to several real-world applications. Recent papers provide only very preliminary results about the complexity of metaquerying. In this paper we define several variants of metaquerying that encompass, as far as we know, all variants defined in the literature. We study both the combined complexity and the data complexity of these variants. We show that, under the combined complexity measure, metaquerying is generally intractable (unless P=NP), but we are able to single out some interesting tractable cases of metaquerying (whose combined complexity is LOGCFL-complete). As for the data complexity of metaquerying, we prove that, in general, it is in P, but lies within AC0 in some interesting cases. Finally, we discuss the issue of equivalence between metaqueries, which is useful for optimization purposes.

Information dependencies

This paper uses the tools of information theory to examine and reason about the information content of the attributes within a relation instance. For two sets of attributes X and Y, an information dependency measure (InD measure) characterizes the uncertainty remaining about the values for the set Y when the values for the set X are known. A variety of arithmetic inequalities (InD inequalities) are shown to hold among InD measures; InD inequalities hold in any relation instance. Numeric constraints (InD constraints) on InD measures, consistent with the InD inequalities, can be applied to relation instances. Remarkably, functional and multivalued dependencies correspond to setting certain constraints to zero, with Armstrong's axioms shown to be consequences of the arithmetic inequalities applied to constraints. As an analog of completeness, for any set of constraints consistent with the inequalities, we may construct a relation instance that approximates these constraints within any positive ε. InD measures suggest many valuable applications in areas such as data mining.
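
A small hedged sketch of the measure being described: for a relation instance, the InD measure of X -> Y can be read as the conditional entropy H(Y | X), and a functional dependency X -> Y corresponds to that measure being zero. The relation below is hypothetical.

    from collections import Counter
    from math import log2

    def ind_measure(rows, X, Y):
        """H(Y | X) over a relation instance given as a list of dicts: the
        uncertainty remaining about the Y-values once the X-values are known."""
        n = len(rows)
        cx  = Counter(tuple(r[a] for a in X) for r in rows)
        cxy = Counter((tuple(r[a] for a in X), tuple(r[a] for a in Y)) for r in rows)
        return sum((c / n) * log2(cx[x] / c) for (x, _), c in cxy.items())

    r = [{"emp": "a", "dept": "d1", "mgr": "m1"},
         {"emp": "b", "dept": "d1", "mgr": "m1"},
         {"emp": "c", "dept": "d2", "mgr": "m2"},
         {"emp": "d", "dept": "d2", "mgr": "m3"}]   # dept d2 has two managers

    print(ind_measure(r, ["dept"], ["mgr"]))   # 0.5: dept -> mgr is almost, but not quite, an FD
    print(ind_measure(r, ["emp"],  ["mgr"]))   # 0.0: emp -> mgr holds as a functional dependency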

Uniform generation in spatial constraint databases and applications (extended abstract)

We study the efficient approximation of queries in linear constraint databases using sampling techniques. We define the notion of an almost uniform generator for a generalized relation and extend the classical generator of Dyer, Frieze and Kannan for convex sets to the union and the projection of relations. For the intersection and the difference, we give sufficient conditions for the existence of such generators. We show how such generators give relative estimations of the volume and approximations of generalized relations as the composition of convex hulls obtained from the samples.

Analysis and application of adaptive sampling

An estimation algorithm for a query is a probabilistic algorithm that computes an approximation of the size (number of tuples) of the query. The main question studied is which classes of logically definable queries have fast estimation algorithms. Evidence from descriptive complexity theory is provided indicating that not all such queries have fast estimation algorithms. However, it is shown that on classes of structures of bounded degree, all first-order queries have fast estimation algorithms. These estimation algorithms use a form of statistical sampling known as adaptive sampling. Several versions of adaptive sampling have been developed by other researchers. The original version has been surpassed in some ways by a newer version and by a more specialized Monte Carlo algorithm. An analysis of the average running time of the original version is given, and the different algorithms are compared. The analysis is used to compute what appears to be the best known upper bound on the efficiency of the original algorithm. Also, contrary to what seems to be a commonly held opinion, the two methods of adaptive sampling are incomparable: which method is superior depends on the query being estimated and the criteria that are applied. Lastly, adaptive sampling can be more efficient than the Monte Carlo algorithm if knowledge about the maximum values of the data being sampled is available.
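
For concreteness, here is a hedged sketch in the spirit of adaptive sampling (following the well-known scheme of Lipton and Naughton, not necessarily the exact algorithm analyzed above): partition the query answer into buckets, draw bucket sizes at random until their sum crosses a threshold or a sample limit is reached, then scale the sample mean up. The partition, threshold, and data are illustrative.

    import random

    def adaptive_estimate(bucket_size, n, threshold, max_samples):
        """Estimate sum_i bucket_size(i) over n buckets: sample buckets uniformly
        until the sampled sizes exceed `threshold` (or `max_samples` draws are
        used), then scale the sample mean up by n."""
        total, m = 0, 0
        while total < threshold and m < max_samples:
            total += bucket_size(random.randrange(n))
            m += 1
        return n * total / m

    # Hypothetical query: bucket i contributes i % 5 tuples (true size 2000 for n = 1000).
    random.seed(1)
    print(round(adaptive_estimate(lambda i: i % 5, n=1000, threshold=200, max_samples=5000)))

The adaptive element is the stopping rule: inputs with a few large buckets tend to cross the threshold after few draws, while flatter inputs require many more, which is roughly what makes the average running time analysis above nontrivial.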

Towards estimation error guarantees for distinct values

We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions unless it examines a large fraction of the input data; in fact, any estimator must incur significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparisons indicate their effectiveness both on real data sets and on synthetic data sets.
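
As a hedged illustration in the spirit of the estimators discussed (a simplified reading, not necessarily the exact estimator proposed above), a sample-based distinct-value estimator can scale up only the values seen exactly once in the sample, since those are the ones whose population frequency is most uncertain: d_hat = sqrt(N/r) * f_1 + sum over i >= 2 of f_i, where r is the sample size, N the table size, and f_i the number of values occurring exactly i times in the sample.

    import random
    from collections import Counter

    def distinct_estimate(sample, table_size):
        """Scale up the values seen exactly once by sqrt(N/r); count the values
        seen two or more times as-is."""
        r = len(sample)
        f = Counter(Counter(sample).values())   # f[i] = #values seen exactly i times
        return (table_size / r) ** 0.5 * f.get(1, 0) + sum(c for i, c in f.items() if i >= 2)

    # Hypothetical skewed column: 100 frequent values plus 500 values occurring once.
    column = [i % 100 for i in range(9_500)] + list(range(1_000, 1_500))
    random.seed(3)
    sample = random.sample(column, 500)
    print(round(distinct_estimate(sample, len(column))), "true:", len(set(column)))

The sizeable error such a run typically shows on skewed inputs is consistent with the negative result above: with a 5% sample, a ratio error on the order of sqrt(N/r) cannot be ruled out for any estimator.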