In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying *data streams.* In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.

Research on information extraction from Web pages (wrapping) has seen much activity in recent times (particularly systems implementations), but little work has been done on formally studying the expressiveness of the proposed formalisms or the theoretical foundations of wrapping. In this paper, we first study monadic datalog as a wrapping language (over ranked or unranked tree structures). Using previous work by Neven and Schwentick, we show that this simple language is equivalent to full monadic second-order logic (MSO) in its ability to specify wrappers. We believe that MSO has the right expressiveness required for Web information extraction and thus propose MSO as a yardstick for evaluating and comparing wrappers. Using the above result, we study the kernel fragment Elog^{-} of the Elog wrapping language used in the Lixto system (a visual wrapper generator). The striking fact here is that Elog^{-} exactly captures MSO, yet is easier to use. Indeed, programs in this language can be specified entirely visually. We also formally compare Elog to other wrapping languages proposed in the literature.

Declustering schemes allocate data blocks among multiple disks to enable parallel retrieval. Given a declustering scheme *D,* its *response time* with respect to a query *Q,* *rt*(*Q*), is defined to be the maximum number of disk blocks of the query stored by the scheme on any one of the disks. If |*Q*| is the number of data blocks in *Q* and *M* is the number of disks, then *rt*(*Q*) is at least |*Q*|/*M.* One way to evaluate the performance of *D* with respect to a set of queries 𝑄 is to measure its *additive error* - the maximum difference between *rt*(*Q*) and |*Q*|/*M* over all range queries *Q* ∈ 𝑄. In this paper, we consider the problem of designing declustering schemes for uniform multidimensional data arranged in a *d*-dimensional grid so that their additive errors with respect to range queries are as small as possible. It has been shown that such declustering schemes will have an additive error of Ω(log *M*) when *d* = 2 and Ω(log^{(d-1)/2} *M*) when *d* > 2 with respect to range queries. Asymptotically optimal declustering schemes exist for 2-dimensional data. For data in larger dimensions, however, the best bound for additive errors is *O*(*M*^{d-1}), which is extremely large. In this paper, we propose two declustering schemes based on low-discrepancy points in *d* dimensions. When *d* is fixed, both schemes have an additive error of *O*(log^{d-1} *M*) with respect to range queries, provided certain conditions are satisfied: the first scheme requires *d* ≥ 3 and *M* to be a power of a prime, where the prime is at least *d,* while the second scheme requires the size of the data to grow within some polynomial of *M,* with no restriction on *M.* These are the first known multidimensional declustering schemes with near-optimal additive errors.
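As a concrete illustration of the metrics defined above, the sketch below computes the response time and 2-D additive error of the classic disk-modulo scheme; this scheme is purely illustrative and is not one of the paper's low-discrepancy constructions.

```python
import math
from itertools import product

def disk_modulo(cell, M):
    # Classic "disk modulo" declustering: grid cell (i, j, ...) goes to disk (i + j + ...) mod M.
    return sum(cell) % M

def response_time(query_cells, scheme, M):
    # rt(Q): the maximum number of Q's blocks that land on any single disk.
    load = [0] * M
    for cell in query_cells:
        load[scheme(cell, M)] += 1
    return max(load)

def additive_error(n, scheme, M):
    # Max over all 2-D range queries on an n x n grid of rt(Q) - ceil(|Q| / M).
    worst = 0
    for i1, j1 in product(range(n), repeat=2):
        for i2 in range(i1, n):
            for j2 in range(j1, n):
                q = [(i, j) for i in range(i1, i2 + 1) for j in range(j1, j2 + 1)]
                worst = max(worst, response_time(q, scheme, M) - math.ceil(len(q) / M))
    return worst

print(additive_error(6, disk_modulo, 4))
```

Exhaustive enumeration of range queries is exponential in `d` and only feasible for tiny grids; the point of the paper's schemes is to bound the additive error analytically rather than by measurement.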

Modern search engines answer keyword-based queries extremely efficiently. The impressive speed is due to clever inverted index structures, caching, a domain-independent knowledge of strings, and thousands of machines. Several research efforts have attempted to generalize keyword search to keytree and keygraph searching, because trees and graphs have many applications in next-generation database systems. This paper surveys both algorithms and applications, giving some emphasis to our own work.

This paper investigates the on-line validation of streaming XML documents with respect to a DTD, under memory constraints. We first consider validation using constant memory, formalized by a finite-state automaton (FSA). We examine two flavors of the problem, depending on whether or not the XML document is assumed to be well-formed. The main results of the paper provide conditions on the DTDs under which validation of either flavor can be done using an FSA. For DTDs that cannot be validated by an FSA, we investigate two alternatives. The first relaxes the constant-memory requirement by allowing a stack bounded in the depth of the XML document, while maintaining the deterministic, one-pass requirement. The second approach consists in refining the DTD to provide additional information that allows validation by an FSA.
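A minimal sketch of the first alternative (a stack bounded by document depth), assuming a hypothetical DTD whose content models are given directly as DFAs over child element names; this is an illustration of the idea, not the paper's construction.

```python
# Streaming validator: one DFA state is kept per open element, so memory is
# bounded by the document depth rather than constant.
# Hypothetical DTD: <!ELEMENT book (title, chapter+)>
#                   <!ELEMENT title EMPTY>  <!ELEMENT chapter EMPTY>
# Each content model is a DFA: start state, accepting states, and
# transitions keyed by (state, child element name).
DTD = {
    "book":    {"start": 0, "accept": {2}, "delta": {(0, "title"): 1, (1, "chapter"): 2, (2, "chapter"): 2}},
    "title":   {"start": 0, "accept": {0}, "delta": {}},
    "chapter": {"start": 0, "accept": {0}, "delta": {}},
}

def validate(events, dtd, root="book"):
    """events: iterable of ('open', name) / ('close', name) pairs (assumed well-formed)."""
    stack = []  # (element name, current DFA state in its parent's content model)
    for kind, name in events:
        if kind == "open":
            if stack:
                elem, state = stack[-1]
                nxt = dtd[elem]["delta"].get((state, name))
                if nxt is None:
                    return False          # this child is not allowed here
                stack[-1] = (elem, nxt)
            elif name != root:
                return False
            if name not in dtd:
                return False
            stack.append((name, dtd[name]["start"]))
        else:  # close
            if not stack:
                return False
            elem, state = stack.pop()
            if elem != name or state not in dtd[elem]["accept"]:
                return False
    return not stack

doc = [("open", "book"), ("open", "title"), ("close", "title"),
       ("open", "chapter"), ("close", "chapter"), ("close", "book")]
print(validate(doc, DTD))  # True
```

The paper's constant-memory question is exactly when this per-depth state can be collapsed into a single FSA state over the whole tag stream.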

XPath is a simple language for navigating an XML document and selecting a set of element nodes. XPath expressions are used to query XML data, describe key constraints, express transformations, and reference elements in remote documents. This paper studies the containment and equivalence problems for a fragment of the XPath query language, with applications in all these contexts. In particular, we study a class of XPath queries that contain branching and label wildcards, and that can express descendant relationships between nodes. Prior work has shown that languages combining any two of these three features have efficient containment algorithms. However, we show that for the combination of all three features, containment is coNP-complete. We provide a sound and complete EXPTIME algorithm for containment, and study parameterized PTIME special cases. While we identify two parameterized classes of queries for which containment can be decided efficiently, we also show that even with some bounded parameters, containment is coNP-complete. In response to these negative results, we describe a sound algorithm which is efficient for all queries, but may return false negatives in some cases.

XSLT is the prime example of an XML query language based on tree-walking. Indeed, stripped down, XSLT is just a tree-walking tree-transducer equipped with registers and look-ahead. Motivated by this connection, we want to pinpoint the computational power of devices based on tree-walking. We show that in the absence of unique identifiers, even very powerful extensions of the tree-walking paradigm are not relationally complete; that is, these extensions do not capture all of first-order logic. In contrast, when unique identifiers are available, we show that various restrictions allow these devices to capture LOGSPACE, PTIME, PSPACE, and EXPTIME. These complexity classes are defined w.r.t. a Turing machine model working directly on (attributed) trees. When no attributes are present, relational storage does not add power; whether look-ahead adds power is related to the open question of whether tree-walking captures the regular tree languages.

This paper takes a first step towards a design and normalization theory for XML documents. We show that, like relational databases, XML documents may contain redundant information, and may be prone to update anomalies. Furthermore, such problems are caused by certain functional dependencies among paths in the document. Our goal is to find a way of converting an arbitrary DTD into a well-designed one that avoids these problems. We first introduce the concept of a functional dependency for XML, and define its semantics via a relational representation of XML. We then define an XML normal form, XNF, that avoids update anomalies and redundancies. We study its properties and show that it generalizes BCNF and a normal form for nested relations when those are appropriately coded as XML documents. Finally, we present a lossless algorithm for converting any DTD into one in XNF.

We introduce and investigate a distributed computation model for querying the Web. Web queries are computed by interacting automata running at different nodes in the Web. The automata which we are concerned with can be viewed as register automata equipped with an additional communication component. We identify conditions necessary and sufficient for systems of automata to compute Web queries, and investigate the computational power of such systems.

We consider the fundamental operation of applying a conjunction of selection conditions to a set of records. With large main memories available cheaply, systems may choose to keep the data entirely in main memory, in order to improve query and/or update performance. The design of a data-intensive algorithm in main memory needs to take into account the architectural characteristics of modern processors, just as a disk-based method needs to consider the physical characteristics of disk devices. An important architectural feature that influences the performance of main memory algorithms is the branch misprediction penalty. We demonstrate that branch misprediction has a substantial impact on the performance of an algorithm for applying selection conditions. We describe a space of "query plans" that are logically equivalent, but differ in terms of performance due to variations in their branch prediction behavior. We propose a cost model that takes branch prediction into account, and develop a query optimization algorithm that chooses a plan with optimal estimated cost. We also develop an efficient heuristic optimization algorithm. We provide experimental results for a case study based on an event notification system. Our results show the effectiveness of the proposed optimization techniques. Our results also demonstrate that significant improvements in performance can be obtained by applying a methodology that takes branch misprediction latency into account.
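The plan space can be conveyed by two logically equivalent plans for a two-condition selection; the function names and data below are illustrative. In Python the difference is only notional, but in compiled code the bitwise variant evaluates both conditions and avoids a data-dependent branch, trading extra work for predictability.

```python
# Two logically equivalent plans for the selection  a[i] < c1 AND b[i] < c2.

# Plan 1 ("branching-AND"): short-circuit evaluation. One conditional branch
# per condition; cheap when the first condition is selective, but each
# data-dependent branch can be mispredicted.
def plan_branching(a, b, c1, c2):
    out = []
    for i in range(len(a)):
        if a[i] < c1 and b[i] < c2:   # second test skipped when the first fails
            out.append(i)
        # i is appended only when both conditions hold
    return out

# Plan 2 ("no-branch / logical-AND"): evaluate both conditions and combine
# them with bitwise &, so no data-dependent branch chooses between the tests.
# In C this can compile to branch-free code; Python merely mirrors the logic.
def plan_no_branch(a, b, c1, c2):
    out = []
    for i in range(len(a)):
        if (a[i] < c1) & (b[i] < c2):
            out.append(i)
    return out

a = [5, 1, 9, 3]
b = [2, 8, 1, 0]
print(plan_branching(a, b, 4, 5))   # [3]
print(plan_no_branch(a, b, 4, 5))   # [3]
```

A cost model of the kind the paper proposes would weigh, per condition, its selectivity and evaluation cost against the expected misprediction penalty, and pick the cheapest mix of branching and non-branching combinations.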

We examine the problem of efficiently computing sum/count/avg aggregates over objects with non-zero extent. Recent work on computing multi-dimensional aggregates has concentrated on objects with zero extent (points) on a multi-dimensional grid, or one-dimensional intervals. However, in many spatial and/or spatio-temporal applications objects have extent in various dimensions, while they can be located anywhere in the application space. The aggregation predicate is typically described by a multi-dimensional box (*box-sum* aggregation). We examine two variations of the problem. In the simple case an object's value contributes to the aggregation result as a whole as long as the object intersects the query box. More complex is the *functional box-sum* aggregation introduced in this paper, where objects participate in the aggregation proportionally to the size of their intersection with the query box. We first show that both problems can be reduced to *dominance-sum* queries. Traditionally dominance-sum queries are addressed in main memory by a static structure, the ECDF-tree. We then propose two extensions, namely the ECDF-B-trees, that make this structure disk-based and dynamic. Finally, we introduce the BA-tree that combines the advantages of both ECDF-B-trees. We run experiments comparing the performance of the ECDF-B-trees, the BA-tree and a traditional R*-tree (which has been augmented to include aggregation information on its index nodes) over spatial datasets. Our evaluation reaffirms that the BA-tree has more robust performance. Compared against the augmented R*-tree, the BA-tree offers drastic improvement in query performance at the expense of some limited extra space.
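The core reduction can be sketched in main memory with a 2-D prefix-sum array standing in for the ECDF-tree: a box-sum is recovered from four dominance-sums by inclusion-exclusion. The disk-based ECDF-B-trees and the BA-tree are not shown.

```python
# Dominance-sum over a 2-D grid of values via a prefix-sum array
# (a main-memory stand-in for the ECDF-tree).

def build_prefix(grid):
    n, m = len(grid), len(grid[0])
    P = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            P[i + 1][j + 1] = grid[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P

def dominance_sum(P, x, y):
    # Sum of all values at cells (i, j) with i <= x and j <= y.
    return P[x + 1][y + 1]

def box_sum(P, x1, y1, x2, y2):
    # Inclusion-exclusion over the four corners of the query box.
    return (dominance_sum(P, x2, y2) - dominance_sum(P, x1 - 1, y2)
            - dominance_sum(P, x2, y1 - 1) + dominance_sum(P, x1 - 1, y1 - 1))

grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
P = build_prefix(grid)
print(box_sum(P, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

For objects with extent, each object contributes corner terms at its box corners rather than a single point, which is how the paper folds both the simple and the functional variants into dominance-sums.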

Users of decision support systems typically submit batches of range-sum queries simultaneously rather than issuing individual, unrelated queries. We propose a wavelet-based technique that exploits I/O sharing across a query batch to evaluate the set of queries progressively and efficiently. The challenge is that controlling the structure of errors across query results now becomes more critical than minimizing the error of each individual query. Consequently, we define a class of structural error penalty functions and show how they are controlled by our technique. Experiments demonstrate that our technique is efficient as an exact algorithm, and that the progressive estimates are accurate even after less than one I/O per query.

Views as a means to describe parts of a given data collection play an important role in many database applications. In dynamic environments, where data is updated, not only the information provided by views, but also the information provided by data sources yet *missing* from views turns out to be relevant: previously, this missing information was characterized in terms of *view complements;* recently, it was shown that view complements can be exploited in the context of data warehouses to guarantee desirable warehouse properties such as independence and self-maintainability. As the complete source information is a trivial complement for any given view, a natural interest in "small" or even "minimal" complements arises. However, the computation of minimal complements is still not well understood. In this paper, we show how to compute reasonably small (and in special cases even minimal) complements for *monotonic* relational views, where the complexity of constructing complements is *polynomial* in the size of the schema information.

We study two classes of view update problems in relational databases. We are given a source database *S*, a monotone query *Q*, and the view *Q*(*S*) generated by the query. The first problem that we consider is the classical view deletion problem, where we wish to identify a minimal set *T* of tuples in *S* whose deletion will eliminate a given tuple *t* from the view. We study the complexity of optimizing two natural objectives in this setting, namely, finding *T* to minimize the side-effects on the view and on the source, respectively. For both objective functions, we show a dichotomy in the complexity: interestingly, for queries in the same class, the problem is either in P or NP-hard under either objective function. The second problem in our study is the annotation placement problem. Suppose we annotate an attribute of a tuple in *S*. The rules for carrying the annotation forward through a query are easily stated. On the other hand, suppose we annotate an attribute of a tuple in the view *Q*(*S*): what annotation(s) in *S* will cause this annotation to appear in the view, minimizing the propagation to other attributes in *Q*(*S*)? View annotation is becoming an increasingly useful method of communicating meta-data among users of shared scientific data sets, and to our knowledge, there has been no formal study of this problem. Our study of these problems gives us important insights into computational issues involved in data provenance or lineage --- the process by which data moves through databases. We show that the two problems correspond to two fundamentally distinct notions of provenance, *why*- and *where*-provenance.
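The first problem can be made concrete with a brute-force sketch on a toy two-relation join (the relation contents and attribute names are made up). The enumeration is exponential; the paper's point is precisely when this can or cannot be done in polynomial time.

```python
from itertools import chain, combinations

# View Q(R, S) = natural join of R and S on the shared attribute B:
# {(a, b, c) : (a, b) in R, (b, c) in S}.
def view(R, S):
    return {(a, b, c) for (a, b) in R for (b2, c) in S if b == b2}

def min_view_side_effect_deletion(R, S, t):
    """Find source tuples T whose deletion removes view tuple t while losing
    as few OTHER view tuples as possible (brute force, purely illustrative)."""
    V = view(R, S)
    assert t in V
    best, best_effect = None, None
    src = list(R | S)
    for T in chain.from_iterable(combinations(src, k) for k in range(len(src) + 1)):
        T = set(T)
        V2 = view(R - T, S - T)
        if t not in V2:
            effect = len(V - V2) - 1   # view tuples lost besides t itself
            if best_effect is None or effect < best_effect:
                best, best_effect = T, effect
    return best, best_effect

R = {("a1", "b1"), ("a2", "b1")}
S = {("b1", "c1"), ("b1", "c2")}
# Every way to kill ("a1","b1","c1") also kills one other join result here,
# so the minimum view side-effect is 1.
T, effect = min_view_side_effect_deletion(R, S, ("a1", "b1", "c1"))
print(effect)  # 1
```

Minimizing source side-effects instead would replace the objective `len(V - V2) - 1` with `len(T)`, which is the second objective the paper analyzes.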

The view-selection problem is to design and materialize a set of views over a database schema, such that the choice of views minimizes the cost of evaluating the selected workload of queries, and the combined size of the materialized views does not exceed a prespecified storage limit. Important applications of the view-selection problem include query optimization, data warehouse design, and information integration. We consider the view-selection problem in relational databases, for conjunctive queries and views. Suppose somebody wants to design a view-selection algorithm that outputs a polynomial number of views for all query workloads and storage limits and produces optimal selections of views independently of actual database contents. In previous work it was shown that it is impossible to design such an algorithm when the *product* (as for nested-loop joins) cost model is used. That is, there exist databases for which the number of views in an optimal viewset is exponential in the size of the database schema and query workload. As a consequence, under the product-cost model the view-selection problem has an exponential-time lower bound. Efficient join algorithms have a cost that is proportional to the *sum* of the sizes of the input and output relations. In this paper we prove that under the more practical sum-cost model, the view-selection problem also has an exponential-time lower bound. As a consequence, under the sum-cost model it is also impossible to come up with a view-selection algorithm that outputs a polynomial number of views for all query workloads and databases, yet produces optimal selections of views.

In multidimensional data models intended for online analytic processing (OLAP), data are viewed as points in a multidimensional space. Each dimension has structure, described by a directed graph of categories, a set of members for each category, and a child/parent relation between members. An important application of this structure is to use it to infer summarizability, that is, whether an aggregate view defined for some category can be correctly derived from a set of precomputed views defined for other categories. A dimension is called heterogeneous if two members in a given category are allowed to have ancestors in different categories. In previous work, we studied the problem of inferring summarizability in a particular class of heterogeneous dimensions. In this paper, we propose a class of integrity constraints and schemas that allow us to reason about summarizability in general heterogeneous dimensions. We introduce the notion of frozen dimensions, which are minimal homogeneous dimension instances representing the different structures that are implicitly combined in a heterogeneous dimension. Frozen dimensions provide the basis for efficiently testing implication of dimension constraints, and are a useful aid to understanding heterogeneous dimensions. We give a sound and complete algorithm for the implication problem for dimension constraints, which uses heuristics based on the structure of the dimension and the constraints to speed up its execution. We study the intrinsic complexity of the implication problem, and the running time of our algorithm.

Data warehousing and OLAP applications typically view data as having multiple logical dimensions (e.g., product, location) with natural hierarchies defined on each dimension. OLAP queries usually involve hierarchical selections on some of the dimensions, and often aggregate measure attributes (e.g., sales, volume). Accurately estimating the distribution of measure attributes under hierarchical selections is important in a variety of scenarios, including approximate query evaluation and cost-based optimization of queries. In this paper, we propose fast (near-linear-time) algorithms for the problem of approximating the distribution of measure attributes with hierarchies defined on them, using histograms. Our algorithms are based on dynamic programming and a novel notion of sparse intervals that we introduce, and are the first *practical* algorithms for this problem. They effectively trade space for construction time without compromising histogram accuracy. We complement our analytical contributions with an experimental evaluation using real data sets, demonstrating the superiority of our approach.
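For context, the baseline such algorithms improve on can be sketched as the classic V-optimal histogram dynamic program over a flat (hierarchy-free) frequency vector; the paper's sparse-interval and hierarchy machinery is not shown.

```python
# Classic V-optimal histogram DP: partition a 1-D frequency vector into B
# buckets minimizing total within-bucket squared error (approximating each
# bucket by its mean).  O(B * n^2) time with prefix sums.

def v_optimal(freqs, B):
    n = len(freqs)
    ps = [0.0] * (n + 1)    # prefix sums of values
    pss = [0.0] * (n + 1)   # prefix sums of squared values
    for i, f in enumerate(freqs):
        ps[i + 1] = ps[i] + f
        pss[i + 1] = pss[i] + f * f

    def sse(i, j):
        # squared error of approximating freqs[i:j] by its mean
        s, ss = ps[j] - ps[i], pss[j] - pss[i]
        return ss - s * s / (j - i)

    INF = float("inf")
    # dp[b][j] = min error covering the first j values with b buckets
    dp = [[INF] * (n + 1) for _ in range(B + 1)]
    dp[0][0] = 0.0
    for b in range(1, B + 1):
        for j in range(b, n + 1):
            dp[b][j] = min(dp[b - 1][i] + sse(i, j) for i in range(b - 1, j))
    return dp[B][n]

print(v_optimal([1, 1, 1, 9, 9, 9], 2))   # 0.0: the split lands between the two runs
```

The quadratic inner loop is what makes the plain DP impractical at warehouse scale; the paper's contribution is bringing construction down to near-linear time for the hierarchical setting without giving up accuracy.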

Database applications for moving objects pose new challenges in modeling, querying, and maintenance of objects whose locations change rapidly over time. Previous work on modeling and querying spatio-temporal databases and constraint databases focuses primarily on snapshots of changing databases. In this paper we study query evaluation techniques for moving object databases where moving objects are updated frequently. We consider a constraint database approach to moving objects and queries. We classify moving object queries into "past", "continuing", and "future" queries. We argue that while traditional constraint query evaluation techniques are suitable for past queries, new techniques are needed for continuing and future queries. Motivated by nearest-neighbor queries, we define a query language based on a single "generalized distance" function *f* mapping objects to continuous functions from time to ℝ. Queries in this language may be past, continuing, or future. We show that if *f* maps to polynomials, queries can be evaluated efficiently using the plane-sweep technique from computational geometry. Consequently, many known distance-based queries can be evaluated efficiently.
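The sweep idea can be sketched for linear generalized-distance functions (the paper allows polynomials): pairwise intersection times of the distance lines serve as sweep events, and between consecutive events the nearest object is fixed, which is exactly the lower envelope.

```python
# Plane-sweep sketch for a "continuing" nearest-neighbor query.
# Each object i has a generalized distance f_i(t) = a_i * t + b_i
# (linear for simplicity; closed-form intersections).

def nearest_over_time(objs, t0, t1):
    """objs: {name: (a, b)}; returns [(start, end, nearest_name)] covering [t0, t1]."""
    # Sweep events: times where two distance lines cross inside the window.
    events = {t0, t1}
    items = list(objs.items())
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (_, (a1, b1)), (_, (a2, b2)) = items[i], items[j]
            if a1 != a2:
                t = (b2 - b1) / (a1 - a2)
                if t0 < t < t1:
                    events.add(t)
    times = sorted(events)
    answer = []
    for s, e in zip(times, times[1:]):
        mid = (s + e) / 2   # the winner is constant on each open interval
        nearest = min(objs, key=lambda k: objs[k][0] * mid + objs[k][1])
        answer.append((s, e, nearest))
    return answer

# p approaches the query point (distance shrinking), q recedes; they swap at t = 2.
print(nearest_over_time({"p": (-1.0, 5.0), "q": (1.0, 1.0)}, 0.0, 4.0))
```

With polynomial distance functions the events become real roots of polynomial differences, but the sweep structure, and hence the efficiency argument, is the same.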

In Knowledge Compilation (KC) an intractable deduction problem *KB* ⊨ *f* is split into two phases: 1) *KB* is preprocessed, thus obtaining a data structure *D*_{KB}; 2) the problem is efficiently solved using *D*_{KB} and *f.* Our goal is to study KC in the context of relational databases: both *KB* and *f* are represented as databases, and '⊨' is represented as a query *Q* in second-order logic. *D*_{KB} is a database, to be *synthesized* from *KB* by means of an appropriate *view.* *Q* is *rewritten,* thus obtaining *Q*_{r}. We show syntactic restrictions on *Q* implying that a polynomial-size *D*_{KB} and a first-order *Q*_{r} exist, which implies that phase 2 can be done in polynomial time. We also present classes of queries (in some sense complementary to the former ones) for which either no polynomial-size *D*_{KB} or no first-order *Q*_{r} exists (unless the PH collapses). Compilation to other complexity classes is also addressed.

We consider the problem of answering queries using views, where queries and views are conjunctive queries with arithmetic comparisons (CQACs) over dense orders. Previous work only considered limited variants of this problem, without giving a complete solution. We have developed a novel algorithm to obtain maximally-contained rewritings (MCRs) for queries having left (or right) semi-interval-comparison predicates. For semi-interval queries, we show that the language of finite unions of CQAC rewritings is not sufficient to find a maximally-contained solution, and identify cases where datalog is sufficient. Finally, we show that it is decidable to obtain equivalent rewritings for CQAC queries.

We consider conjunctive queries with arithmetic comparisons over multiple continuous data streams. We specify an algorithm for determining whether or not a query can be evaluated using a bounded amount of memory for all possible instances of the data streams. When a query can be evaluated using bounded memory, we produce an execution strategy based on constant-sized synopses of the data streams.
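One instance of such a strategy, for the specific query "EXISTS s IN S, r IN R WITH s.a > r.b" (an illustrative example, not the paper's general decision algorithm): the answer at any point depends only on max(S.a) and min(R.b) seen so far, a constant-size synopsis.

```python
# A constant-memory synopsis for one bounded-memory query over two streams.

class ExistsGreaterQuery:
    """Maintains the answer to: EXISTS s IN S, r IN R WITH s.a > r.b."""

    def __init__(self):
        self.max_a = None   # synopsis of stream S: largest a seen
        self.min_b = None   # synopsis of stream R: smallest b seen

    def on_s(self, a):
        self.max_a = a if self.max_a is None else max(self.max_a, a)

    def on_r(self, b):
        self.min_b = b if self.min_b is None else min(self.min_b, b)

    def answer(self):
        # A witness pair exists iff the largest a beats the smallest b.
        if self.max_a is None or self.min_b is None:
            return False
        return self.max_a > self.min_b

q = ExistsGreaterQuery()
for stream, value in [("S", 3), ("R", 7), ("S", 5), ("R", 4)]:
    (q.on_s if stream == "S" else q.on_r)(value)
print(q.answer())   # True: max(S.a) = 5 > min(R.b) = 4
```

The paper's algorithm decides, from the query's comparison structure alone, whether such a finite collection of extreme values suffices for every possible stream instance, and if so emits the corresponding synopsis-based execution strategy.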

Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data. The problem of designing data integration systems is important in current real-world applications, and is characterized by a number of issues that are interesting from a theoretical point of view. This document presents an overview of the material to be presented in a tutorial on data integration. The tutorial is focused on some of the theoretical issues that are relevant for data integration. Special attention will be devoted to the following aspects: modeling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries.

If the only information we have on a certain database is through a set of views, the question arises of whether this is sufficient to answer completely a given query. We say that the set of views is lossless with respect to the query, if, no matter what the database is, we can answer the query by solely relying on the content of the views. The question of losslessness has various applications, for example in query optimization, mobile computing, data warehousing, and data integration. We study this problem in a context where the database is semistructured, and both the query and the views are expressed as regular path queries. The form of recursion present in this class prevents us from applying known results to our case. We first address the problem of checking losslessness in the case where the views are materialized. The fact that we have the view extensions available makes this case solvable by extending known techniques. We then study a more complex version of the problem, namely the one where we abstract from the specific view extension. More precisely, we address the problem of checking whether, for every database, the answer to the query over such a database can be obtained by relying only on the view extensions. We show that the problem is solvable by utilizing, via automata-theoretic techniques, the known connection between view-based query answering and constraint satisfaction. We also investigate the computational complexity of both versions of the problem.

XML specifications often consist of a type definition (typically, a DTD) and a set of integrity constraints. It has been shown previously that such specifications can be inconsistent, and thus it is often desirable to check consistency at compile-time. It is known that for general keys and foreign keys, and DTDs, the consistency problem is undecidable; however, it becomes NP-complete when all keys are one-attribute (unary), and tractable if no foreign keys are used. In this paper, we consider a variety of constraints for XML data, and study the complexity of the consistency problem. Our main conclusion is that in the presence of foreign keys, compile-time verification of consistency is usually infeasible. We look at two types of constraints: absolute (that hold in the entire document), and relative (that only hold in a part of the document). For absolute constraints, we extend earlier decidability results to the case of multi-attribute keys and unary foreign keys, and to the case of constraints involving regular expressions, providing lower and upper bounds in both cases. For relative constraints, we show that even for unary constraints, the consistency problem is undecidable. We also establish a number of restricted decidable cases.

We present algorithms to label the nodes of an XML tree which is subject to insertions and deletions of nodes. The labeling is done such that (1) we label each node immediately when it is inserted and this label remains unchanged, and (2) from a pair of labels alone, we can decide whether one node is an ancestor of the other. This problem arises in the context of XML databases that support queries on the structure of the documents as well as on the changes made to the documents over time. We prove that our algorithms assign the shortest possible labels (up to a constant factor) which satisfy these requirements. We also consider the same problem when "clues" that provide guarantees on possible future insertions are given together with newly inserted nodes. Such clues can be derived from the DTD or from statistics on similar XML trees. We present algorithms that use the clues to assign shorter labels. We also prove that the length of our labels is close to the minimum possible.
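A simple persistent labeling scheme conveys the flavor of requirements (1) and (2), though its labels are longer than the near-optimal ones the paper achieves: the root gets the empty label, and the k-th child ever inserted under a node v gets label(v) + "0"*k + "1". Sibling suffixes are prefix-free, so ancestry is exactly the proper-prefix relation on labels, and labels never change on later insertions.

```python
# Persistent prefix-based labeling: ancestor(u, v) iff label(u) is a proper
# prefix of label(v).  Illustrative scheme, not the paper's construction.

class Tree:
    def __init__(self):
        # label -> number of children inserted under it so far; root label is ""
        self.children_count = {"": 0}

    def insert_child(self, parent_label):
        k = self.children_count[parent_label]
        self.children_count[parent_label] = k + 1
        # the suffixes "1", "01", "001", ... are prefix-free among siblings
        label = parent_label + "0" * k + "1"
        self.children_count[label] = 0
        return label

def is_ancestor(u, v):
    return v.startswith(u) and u != v

t = Tree()
a = t.insert_child("")   # first child of the root  -> "1"
b = t.insert_child("")   # second child of the root -> "01"
c = t.insert_child(a)    # child of a               -> "11"
print(is_ancestor(a, c), is_ancestor(b, c), is_ancestor(a, b))  # True False False
```

The weakness is label length: a node with many earlier siblings inherits a long run of zeros, growing with both depth and fan-in order. This is the gap the paper's algorithms (and their clue-assisted variants) close, with matching lower bounds.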

In this work, we study the complexity of approximate query optimization. We show that, for any δ > 0, the problem of finding a join order sequence whose cost is within a factor 2^{Θ(log^{1-δ} *K*)} of *K,* where *K* is the cost of the optimal join order sequence, is NP-hard. The complexity gap remains if the number of edges in the query graph is constrained to be a given function *e*(*n*) of the number of vertices *n* of the query graph, where *n*(*n* - 1)/2 - Θ(*n*^{τ}) ≥ *e*(*n*) ≥ *n* + Θ(*n*^{τ}) and τ is any constant between 0 and 1. These results show that, unless P = NP, the query optimization problem cannot be approximately solved by an algorithm that runs in polynomial time and has a competitive ratio within some polylogarithmic factor of the optimal cost.

A standard assumption in the database query optimization literature is that it suffices to optimize for the "typical" case---that is, the case in which various parameters (e.g., the amount of available memory, the selectivities of predicates, etc.) take on their "typical" values. It was claimed in [CHS99] that we could do better by choosing plans based on their *expected cost.* Here we investigate this issue more thoroughly. We show that in many circumstances of interest, a "typical" value of the parameter often does give acceptable answers, provided that it is chosen carefully and we are interested only in minimizing expected running time. However, by minimizing the expected running time, we are effectively assuming that if plan *p*_{1} runs three times as long as plan *p*_{2}, then *p*_{1} is exactly three times as bad as *p*_{2}. An assumption like this is not always appropriate. We show that focusing on least expected cost can lead to significant improvement for a number of cost functions of interest.