Several commercial applications, such as online comparison shopping and process automation, require integrating information that is scattered across multiple websites or XML documents. Much research has been devoted to this problem, resulting in several research prototypes and commercial implementations. Such systems rely on wrappers that provide relational or other structured interfaces to websites. Traditionally, wrappers have been constructed by hand on a per-website basis, constraining the scalability of the system.

We introduce a website structure inference mechanism called *compact skeletons* that is a step in the direction of automated wrapper generation. Compact skeletons provide a transformation from websites or other hierarchical data, such as XML documents, to relational tables. We study several classes of compact skeletons and provide polynomial-time algorithms and heuristics for automated construction of compact skeletons from websites. Experimental results show that our heuristics work well in practice. We also argue that compact skeletons are a natural extension of commercially deployed techniques for wrapper construction.

Flexible queries facilitate, in a novel way, easy and concise querying of databases that have varying structures. Two different semantics, flexible and semiflexible, are introduced and investigated. The complexity of evaluating queries under the two semantics is analyzed. Query evaluation is polynomial in the size of the query, the database and the result in the following two cases. First, a semiflexible DAG query and a tree database. Second, a flexible tree query and a database that is any graph. Query containment and equivalence are also investigated. For the flexible semantics, query equivalence is always polynomial. For the semiflexible semantics, query equivalence is polynomial for DAG queries and exponential when the queries have cycles. Under the semiflexible and flexible semantics, two databases could be equivalent even when they are not isomorphic. Database equivalence is formally defined and characterized. The complexity of deciding equivalences among databases is analyzed. The implications of database equivalence on query evaluation are explained.

The optimization of queries in distributed database systems is known to be subject to delicate trade-offs. For example, the Mariposa database system allows users to specify a desired delay-cost tradeoff (that is, to supply a decreasing function *u*(*d*), specifying how much the user is willing to pay in order to receive the query results within time *d*); Mariposa divides a query graph into horizontal “strides,” analyzes each stride, and uses a greedy heuristic to find the “best” plan for all strides. We show that Mariposa's greedy heuristic can be arbitrarily far from the desired optimum. Applying a recent approach in multiobjective optimization algorithms to this problem, we show that the optimum cost-delay trade-off (Pareto) curve in Mariposa's framework can be approximated fast within any desired accuracy. We also present a polynomial algorithm for the general multiobjective query optimization problem, which approximates arbirarily well the optimum cost-delay tradeoff (without the restriction of Mariposa's heuristic stride subdivision).

Database systems frequently have to execute a set of related queries, which share several common subexpressions. Multi-query optimization exploits this, by finding evaluation plans that share common results. Current approaches to multi-query optimization assume that common subexpressions are materialized. Significant performance benefits can be had if common subexpressions are pipelined to their uses, without being materialized. However, plans with pipelining may not always be realizable with limited buffer space, as we show. We present a general model for schedules with pipelining, and present a necessary and sufficient condition for determining validity of a schedule under our model. We show that finding a valid schedule with minimum cost is *NP*-hard. We present a greedy heuristic for finding good schedules. Finally, we present a performance study that shows the benefit of our algorithms on batches of queries from the TPCD benchmark.

The need to search for complex and recurring patterns in database sequences is shared by many applications. In this paper, we discuss how to express and support efficiently sophisticated sequential pattern queries in databases. Thus, we first introduce SQL-TS, an extension of SQL, to express these patterns, and then we study how to optimize search queries for this language. We take the optimal text search algorithm of Knuth, Morris and Pratt, and generalize it to handle complex queries on sequences. Our algorithm exploits the inter-dependencies between the elements of a sequential pattern to minimize repeated passes over the same data. Experimental results on typical sequence queries, such as double bottom queries, confirm that substantial speedups are achieved by our new optimization techniques.

This paper gives a short introduction into parameterized complexity theory, aimed towards database theorists interested in this area. The main results presented here classify the evaluation of first-order queries and conjunctive queries as hard parameterized problems.

Data structures with relaxed balance differ from standard structures in that rebalancing can be delayed and interspersed with updates. This gives extra flexibility in both sequential and parallel applications.

We study the version of multi-way trees called (*a*,*b*)-trees (which includes B-trees) with the operations insertion, deletion, and group insertion. The latter has applications in for instance document databases and WWW search engines. We prove that we obtain the optimal asymptotic rebalancing complexities of amortized constant time for insertion and deletion and amortized logarithmic time in the size of the group for group insertion. These results hold even for the relaxed version.

Our results also demonstrate that a binary tree scheme with the same complexities can be designed. This is an improvement over the existing results in the most interesting cases.

Assume that each object in a database has *m* grades, or scores, one for each of *m* attributes. For example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. For each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). There is some monotone *aggregation function*, or *combining rule*, such as min or average, that combines the individual grades to obtain an overall grade.

To determine objects that have the best overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. Fagin has given an algorithm (“Fagin's Algorithm”, or FA) that is much more efficient. For some distributions on grades, and for some monotone aggregation functions, FA is optimal in a high-probability sense.

We analyze an elegant and remarkably simple algorithm (“the threshold algorithm”, or TA) that is optimal in a much stronger sense than FA. We show that TA is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability sense, but over every database. Unlike FA, which requires large buffers (whose size may grow unboundedly as the database size grows), TA requires only a small, constant-size buffer.

We distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of object in a list, and obtains it in one step). We consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well.

The paper investigates XML document specifications with DTDs and integrity constraints, such as keys and foreign keys. We study the consistency problem of checking whether a given specification is meaningful: that is, whether there exists an XML document that both conforms to the DTD and satisfies the constraints. We show that DTDs interact with constraints in a highly intricate way and as a result, the consistency problem in general is undecidable. When it comes to unary keys and foreign keys, the consistency problem is shown to be NP-complete. This is done by coding DTDs and integrity constraints with linear constraints on the integers. We consider the variations of the problem (by both restricting and enlarging the class of constraints), and identify a number of tractable cases, as well as a number of additional NP-complete ones. By incorporating negations of constraints, we establish complexity bounds on the implication problem, which is shown to be coNP-complete for unary keys and foreign keys.

Query languages for XML often use path expressions to locate elements in XML documents. Path expressions are regular expressions such that underlying alphabets represent conditions on nodes. Path expressions represent conditions on paths from the root, but do not represent conditions on siblings, siblings of ancestors, and descendants of such siblings. In order to capture such conditions, we propose to extend underlying alphabets. Each symbol in an extended alphabet is a triplet (*e**a*, *e**a* is a condition on nodes, and *e**e**e**e*

We investigate the *type checking* problem for XML queries: statically verifying that every answer to a query conforms to a given output DTD, for inputs satisfying a given input DTD. This problem had been studied by a subset of the authors in a simplified framework that captured the structure of XML documents but ignored data values. We revisit here the type checking problem in the more realistic case when data values are present in documents and tested by queries. In this extended framework, type checking quickly becomes undecidable. However, it remains decidable for large classes of queries and DTDs of practical interest. The main contribution of the present paper is to trace a fairly tight boundary of decidability for type checking with data values. The complexity of type checking in the decidable cases is also considered.

We study the representation and querying of XML with incomplete information. We consider a simple model for XML data and their DTDs, a very simple query language, and a representation system for incomplete information in the spirit of the representations systems developed by Imielinski and Lipski for relational databases. In the scenario we consider, the incomplete information about an XML document is continuously enriched by successive queries to the document. We show that our representation system can represent partial information about the source document acquired by successive queries, and that it can be used to intelligently answer new queries. We also consider the impact on complexity of enriching our representation system or query language with additional features. The results suggest that our approach achieves a practically appealing balance between expressiveness and tractability. The research presented here was motivated by the Xyleme project at INRIA, whose objective it to develop a data warehouse for Web XML documents.

When gathering data from multiple data sources, users need uniform, transparent access to data. Also, when extracting data from several independent, often only partially sound and complete data sources, it is useful to present users with meta-information about the confidence in the answer to a query, based on the number and quality of the sources that participated in constructing the answer. We consider the problem of querying collections of sources with incomplete and partially sound data. We provide a method for checking the consistency of a source collection, we give a tableaux-based characterization for the set of possible worlds consistent with a given source collection and we propose a probabilistic semantics for query answers.

We study the problem of computing a function *f*(*x*_{1},…, *x _{n}*) given that the actual values of the variables

We design online algorithms for this problem when *f* is either the selection function or an aggregation function such as sum or average. We consider three natural models of precision and give algorithms for each model. We analyze our algorithms in the framework of competitive analysis and show that our algorithms are asymptotically optimal. Finally, we also study online algorithms for functions that are obtained by composing together selection and aggregation functions.

We study relational calculi with support for string operations. While SQL restricts the ability to mix string pattern-matching and relational operations, prior proposals for embedding SQL in a compositional calculus were based on adding the operation of concatenation to first-order logic. These latter proposals yield compositional query languages extending SQL, but are unfortunately computationally complete. The unbounded expressive power in turn implies strong limits on the ability to perform optimization and static analysis of properties such as query safety in these languages.

In contrast, we look at compositional extensions of relational calculus that have nice expressiveness, decidability, and safety properties, while capturing string-matching queries used in SQL. We start with an extension based on the string ordering and LIKE predicates. This extension shares some of the attractive properties of relational calculus (e.g. effective syntax for safe queries, low data complexity), but lacks the full power of regular-expression pattern-matching. When we extend this basic model to include string length comparison, we get a natural string language with great expressiveness, but one which includes queries with high (albeit bounded) data complexity. We thus explore the space between these two languages. We consider two intermediate languages: the first extends our base language with functions that trim/add leading characters, and the other extends it by adding the full power of regular-expression pattern-matching. We show that both these extensions inherit many of the attractive properties of the basic model: they both have corresponding algebras expressing safe queries, and low complexity of query evaluation.

In a previous paper [10], the authors introduced the notion of hypertree decomposition and the corresponding concept of hypertree width and showed that the conjunctive queries whose hypergraphs have bounded hypertree-width can be evaluated in polynomial time. Bounded hypertree-width generalizes the notions of acyclicity and bounded treewidth and corresponds to larger classes of tractable queries. In the present paper, we provide natural characterizations of hypergraphs and queries having bounded hypertree-width in terms of game-theory and logic.

First we define the Robber and Marshals game, and prove that a hypergraph H has hypertree-width at most *k* iff *k* marshals have a winning strategy on H, allowing them to trap a robber who moves along the hyperedges. This game is akin the well-known Robber and Cops game (which characterizes bounded treewidth), except that marshals are more powerful than cops: They can control entire hyperedges instead of just vertices.

Kolaitis and Vardi [17] recently gave an elegant characterization of the conjunctive queries having treewidth < *k* in terms of the *k*-variable fragment of a certain logic L ( = existential-conjunctive fragment of positive FO). We use the Robber and Marshals game to derive a surprisingly simple and equally elegant characterization of the class HW[*k*] of queries of hypertree-width at most *k* in terms of guarded logic. In particular, we show that HW[*k*] = GF*k**k**k*-guarded fragment of L. In this fragment, conjunctions of *k* atoms rather than just single atoms are allowed to act as guards. Note that, for the particular case *k* = 1, our results provide new characterizations of the class of acyclic queries.

We extend the notion of bounded hypertreewidth to nonrecursive stratified datalog and show that the *k*-guarded fragment GF*k**k*.

We consider the complexity of join problems, focusing on equijoins, spatial-overlap joins, and set-containment joins. We use a graph pebbling model to characterize these joins combinatorially, by the length of their optimal pebbling strategies and computationally, by the complexity of discovering these strategies. Our results show that equijoins are the easiest of all joins, with optimal pebbling strategies that meet the lower bound over all join problems and that can be found in linear time. By contrast, spatial-overlap and set-containment joins are the hardest joins, with instances where optimal pebbling strategies reach the upper bound over all join problems and with the problem of discovering optimal pebbling strategies being NP-complete. For set-containment joins, we show that discovering the optimal pebbling is also MAX-SNP-Complete. As a consequence, we show that unless NP = P, there is a constant ∈

Query equivalence is investigated for disjunctive aggregate queries with negated subgoals, constants and comparisons. A full characterization of equivalence is given for the aggregation functions *count*, *max*, *sum*, *prod*, *top*2 and *parity*. A related problem is that of determining, for a given natural number *N*, whether two given queries are equivalent over all databases with at most *N* constants. We call this problem *bounded equivalence*. A complete characterization of decidability of bounded equivalence is given. In particular, it is shown that this problem is decidable for all the above aggregation functions as well as for *cntd* (count distinct), *stdev* (standard deviation), *median* and *avg*. For quasilinear queries (i.e., queries in which predicates that occur positively are not repeated) it is shown that equivalence can be decided in polynomial time for the aggregation functions *count*, *max*, *sum*, *prod*, *top*2, *parity*, and *avg*. A similar result holds for *cntd* provided that a few additional conditions hold. The results are couched in terms of abstract characteristics of aggregation functions, and new proof techniques are used. Finally, the results presented also imply that equivalence, under bag-set semantics, is decidable for nonaggregate queries with negation.

Fast estimates for aggregate queries are useful in database query optimization, approximate query answering and online query processing. Hence, there has been a lot of focus on “selectivity estimation”, that is, computing summary statistics on the underlying data and using that to answer aggregate queries fast and to a reasonable approximation. We present two sets of results for range aggregate queries, which are amongst the most common queries.

First, we focus on a histogram as summary statistics and present algorithms for constructing histograms that are provably optimal (or provably approximate) for range queries; these algorithms take (pseudo-) polynomial time. These are the first known optimality or approximation results for arbitrary range queries; previously known results were optimal only for restricted range queries (such as equality queries, hierarchical or prefix range queries).

Second, we focus on wavelet-based representations as summary statistics and present fast algorithms for picking wavelet statistics that are provably optimal for range queries. No previously-known wavelet-based methods have this property.

We perform an experimental study of the various summary representations show the benefits of our algorithms over the known methods.

A temporal aggregation query is an important but costly operation for applications that maintain time-evolving data (data warehouses, temporal databases, etc.). Due to the large volume of such data, performance improvements for temporal aggregation queries are critical. In this paper we examine techniques to compute temporal aggregates that include key-range predicates (*range temporal aggregates*). In particular we concentrate on SUM, COUNT and AVG aggregates. This problem is novel; to handle arbitrary key ranges, previous methods would need to keep a separate index for every possible key range. We propose an approach based on a new index structure called the *Multiversion SB-Tree*, which incorporates features from both the SB-Tree and the Multiversion B-Tree, to handle arbitrary key-range temporal SUM, COUNT and AVG queries. We analyze the performance of our approach and present experimental results that show its efficiency.

In this talk, we will give an overview of how content is distributed on the internet, with an emphasis on the approach being used by Akamai. We will describe some of the technical challenges involved in operating a network of thousands of content servers across multiple geographies on behalf of thousands of customers. The talk will be introductory in nature and should be accessible to a broad audience.

The increasing ability to track and collect large amounts of data with the use of current hardware technology has lead to an interest in the development of data mining algorithms which preserve user privacy. A recently proposed technique addresses the issue of privacy preservation by perturbing the data and reconstructing distributions at an aggregate level in order to perform the mining. This method is able to retain privacy while accessing the information implicit in the original attributes. The distribution reconstruction process naturally leads to some loss of information which is acceptable in many practical situations. This paper discusses an Expectation Maximization (EM) algorithm for distribution reconstruction which is more effective than the currently available method in terms of the level of information loss. Specifically, we prove that the EM algorithm converges to the maximum likelihood estimate of the original distribution based on the perturbed data. We show that when a large amount of data is available, the EM algorithm provides robust estimates of the original distribution. We propose metrics for quantification and measurement of privacy-preserving data mining algorithms. Thus, this paper provides the foundations for measurement of the effectiveness of privacy preserving data mining algorithms. Our privacy metrics illustrate some interesting results on the relative effectiveness of different perturbing distributions.

The dimensionality curse has profound effects on the effectiveness of high-dimensional similarity indexing from the performance perspective. One of the well known techniques for improving the indexing performance is the method of dimensionality reduction. In this technique, the data is transformed to a lower dimensional space by finding a new axis-system in which most of the data variance is preserved in a few dimensions. This reduction may also have a positive effect on the quality of similarity for certain data domains such as text. For other domains, it may lead to loss of information and degradation of search quality. Recent research indicates that the improvement for the text domain is caused by the re-enforcement of the semantic concepts in the data. In this paper, we provide an intuitive model of the effects of dimensionality reduction on arbitrary high dimensional problems. We provide an effective diagnosis of the causality behind the qualitative effects of dimensionality reduction on a given data set. The analysis suggests that these effects are very data dependent. Our analysis also indicates that currently accepted techniques of picking the reduction which results in the least loss of information are useful for maximizing precision and recall, but are not necessarily optimum from a qualitative perspective. We demonstrate that by making simple changes to the implementation details of dimensionality reduction techniques, we can considerably improve the quality of similarity search.

Given a large set of data, a common data mining problem is to extract the frequent patterns occurring in this set. The idea presented in this paper is to extract a condensed representation of the frequent patterns called disjunction-free sets, instead of extracting the whole frequent pattern collection. We show that this condensed representation can be used to regenerate all frequent patterns and their exact frequencies. Moreover, this regeneration can be performed without any access to the original data. Practical experiments show that this representation can be extracted very efficiently even in difficult cases. We compared it with another representation of frequent patterns previously investigated in the literature called frequent closed sets. In nearly all experiments we have run, the disjunction-free sets have been extracted much more efficiently than frequent closed sets.

A classic result of Johnson and Lindenstrauss asserts that any set of *n* points in *d*-dimensional Euclidean space can be embedded into *k*-dimensional Euclidean space where *k* is logarithmic in *n* and independent of *d* so that all pairwise distances are maintained within an arbitrarily small factor. All known constructions of such embeddings involve projecting the *n* points onto a random *k*-dimensional hyperplane. We give a novel construction of the embedding, suitable for database applications, which amounts to computing a simple aggregate over *k* random attribute partitions.

As databases have expanded in scope to storing string data (XML documents, product catalogs), it has become increasingly important to search databases based on matching substrings, often on multiple, correlated dimensions. While string B-trees are I/O optimal in one dimension, no index structure with non-trivial query bounds is known for two-dimensional substring indexing.

In this paper, we present a technique for two-dimensional substring indexing based on a reduction to the geometric problem of identifying common colors in two ranges containing colored points. We develop an I/O efficient algorithm for solving the common colors problem, and use it to obtain an I/O efficient (poly-logarithmic query time) algorithm for the two-dimensional substring indexing problem. Our techniques result in a family of secondary memory index structures that trade space for time, with no loss of accuracy. We show how our technique can be practically realized using a combination of string B-trees and R-trees.

In this paper, we propose *process locking*, a dynamic scheduling protocol based on ideas of ordered shared locks, that allows for the correct concurrent and fault-tolerant execution of transactional processes. Transactional processes are well defined, complex structured collections of transactional services. The process structure comprises flow of control between single process steps and also considers alternatives for failure handling purposes. Moreover, the individual steps of a process may have different termination characteristics, i.e., they cannot be compensated once they have committed. All these constraints have to be taken into consideration when deciding how to interleave processes. However, due to the higher level semantics of processes, standard locking techniques based on shared and exclusive locks on data objects cannot be applied. Yet, process locking addresses both atomicity and isolation simultaneously at the appropriate level, the scheduling of processes, and accounts for the various constraints imposed by processes. In addition, process locking aims at providing a high degree of concurrency while, at the same time, minimizing execution costs. This is done by allowing cascading aborts for rather simple processes white this is prevented for complex, long-running processes within the same framework.