We make the case for developing a *web of concepts* by starting with the current view of web (comprised of hyperlinked pages, or documents, each seen as a bag of words), extracting concept-centric metadata, and stitching it together to create a semantically rich aggregate view of all the information available on the web for each concept instance. The goal of building and maintaining such a web of concepts presents many challenges, but also offers the promise of enabling many powerful applications, including novel search and information discovery paradigms. We present the goal, motivate it with example usage scenarios and some analysis of Yahoo! logs, and discuss the challenges in building and leveraging such a web of concepts. We place this ambitious research agenda in the context of the state of the art in the literature, and describe various ongoing efforts at Yahoo! Research that are related.

Data-Exchange is the problem of creating new databases according to a high-level specification called a schema-mapping while preserving the information encoded in a source database. This paper introduces a notion of generalized schema-mapping that enriches the standard schema-mappings (as defined by Fagin et al) with more expressive power. It then proposes a more general and arguably more intuitive notion of semantics that rely on three criteria: Soundness, Completeness and Laconicity (non-redundancy and minimal size). These semantics are shown to coincide precisely with the notion of cores of universal solutions in the framework of Fagin, Kolaitis and Popa. It is also well-defined and of interest for larger classes of schema-mappings and more expressive source databases (with null-values and equality constraints). After an investigation of the key properties of generalized schema-mappings and their semantics, a criterion called Termination of the Oblivious Chase (TOC) is identified that ensures polynomial data-complexity. This criterion strictly generalizes the previously known criterion of Weak-Acyclicity. To prove the tractability of TOC schema-mappings, a new polynomial time algorithm is provided that, unlike the algorithm of Gottlob and Nash from which it is inspired, does not rely on the syntactic property of Weak-Acyclicity. As the problem of deciding whether a Schema-mapping satisfies the TOC criterion is only recursively enumerable, a more restrictive criterion called Super-weak Acylicity (SwA) is identified that can be decided in Polynomial-time while generalizing substantially the notion of Weak-Acyclicity.

An inverse of a schema mapping *M* is intended to "undo" what *M* does, thus providing a way to perform "reverse" data exchange. In recent years, three different formalizations of this concept have been introduced and studied, namely, the notions of an inverse of a schema mapping, a quasi-inverse of a schema mapping, and a maximum recovery of a schema mapping. The study of these notions has been carried out in the context in which source instances are restricted to consist entirely of constants, while target instances may contain both constants and labeled nulls. This restriction on source instances is crucial for obtaining some of the main technical results about these three notions, but, at the same time, limits their usefulness, since reverse data exchange naturally leads to source instances that may contain both constants and labeled nulls.

We develop a new framework for reverse data exchange that supports source instances that may contain nulls, thus overcoming the semantic mismatch between source and target instances of the previous formalizations. The development of this new framework requires a careful reformulation of all the important notions, including the notions of the identity schema mapping, inverse, and maximum recovery. To this effect, we introduce the notions of extended identity schema mapping, extended inverse, and maximum extended recovery, by making systematic use of the homomorphism relation on instances. We give results concerning the existence of extended inverses and of maximum extended recoveries, and results concerning their applications to reverse data exchange and query answering. Moreover, we show that maximum extended recoveries can be used to capture in a quantitative way the amount of information loss embodied in a schema mapping specified by source-to-target tuple-generating dependencies.

Relational schema mappings have been extensively studied in connection with data integration and exchange problems, but mappings between XML schemas have not received the same amount of attention. Our goal is to develop a theory of expressive XML schema mappings. Such mappings should be able to use various forms of navigation in a document, and specify conditions on data values. We develop a language for XML schema mappings, and concentrate on three types of problems: static analysis of mappings, their complexity, and their composition. We look at static analysis problems related to various flavors of consistency: for example, whether it is possible to map some document of a source schema into a document of the target schema, or whether all documents of a source schema can be mapped. We classify the complexity of these problems. We then move to the complexity of mappings themselves, i.e., recognizing pairs of documents such that one can be mapped into the other, and provide a classification based on sets of features used in mappings. Finally we look at composition of XML schema mappings. We study its complexity and show that it is harder to achieve closure under composition for XML than for relational mappings. Nevertheless, we find a robust class of XML schema mappings that have good complexity properties and are closed under composition.

This paper provides new worst-case bounds for the size and treewith of the result *Q(D)* of a conjunctive query *Q* to a database *D*. We derive bounds for the result size |*Q(D)*| in terms of structural properties of *Q*, both in the absence and in the presence of keys and functional dependencies. These bounds are based on a novel "coloring" of the query variables that associates a *coloring number* *C(Q)* to each query *Q*. Using this coloring number, we derive tight bounds for the size of *Q(D)* in case (i) no functional dependencies or keys are specified, and (ii) simple (one-attribute) keys are given. These results generalize recent size-bounds for join queries obtained by Atserias, Grohe, and Marx (FOCS 2008). An extension of our coloring technique also gives a lower bound for |*Q(D)*| in the general setting of a query with arbitrary functional dependencies. Our new coloring scheme also allows us to precisely characterize (both in the absence of keys and with simple keys) the treewidth-preserving queries--the queries for which the output treewidth is bounded by a function of the input treewidth. Finally we characterize the queries that preserve the sparsity of the input in the general setting with arbitrary functional dependencies.

We consider a fragment of XPath 1.0, where attribute and text values may be compared. We show that for any unary query in this fragment, the set of nodes that satisfy the query can be calculated in time linear in the document size and polynomial in the size of the query. The previous algorithm for this fragment also had linear data complexity but exponential complexity in the query size.

For many years, finite model theory was viewed as the backbone of database theory, and database theory in turn supplied finite model theory with key motivations and problems. By now, finite model theory has built a large arsenal of tools that can easily be used by database theoreticians without going to the basics such as combinatorial games. We survey such tools here, focusing not on how they are proved, but rather on how to apply them, as-is, in various questions that come up in database theory.

In this paper, we introduce a family of expressive extensions of Datalog, called Datalog+/-, as a new paradigm for query answering over ontologies. The Datalog+/- family admits existentially quantified variables in rule heads, and has suitable restrictions to ensure highly efficient ontology querying. We show in particular that Datalog+/- generalizes the DL-Lite family of tractable description logics, which are the most common tractable ontology languages in the context of the Semantic Web and databases. We also show how stratified negation can be added to Datalog+/- while keeping ontology querying tractable. Furthermore, the Datalog+/- family is of interest in its own right and can, moreover, be used in various contexts such as data integration and data exchange.

ManyWeb applications are based on dynamic interactions between Web components exchanging flows of information. Such a situation arises for instance in mashup systems [22] or when monitoring distributed autonomous systems [6]. This is a challenging problem that has generated recently a lot of attention; see Web 2.0 [38]. For capturing interactions between Web components, we use active documents interacting with the rest of the world via streams of updates. Their input streams specify updates to the document (in the spirit of RSS feeds), whereas their output streams are defined by queries on the document. In most of the paper, the focus is on input streams where the updates are only insertions, although we do consider also deletions.

We introduce and study two fundamental concepts in this setting, namely, satisfiability and relevance. Some fact is *satisfiable* for an active document and a query if it has a chance to be in the result of the query in some future state. Given an active document and a query, a call in the document is *relevant* if the data brought by this call has a chance to impact the answer to the query. We analyze the complexity of computing satisfiability in our core model (insertions only) and for extensions (e.g., with deletions). We also analyze the complexity of computing relevance in the core model.

The paper investigates the question of whether a partially closed database has complete information to answer a query. In practice an enterprise often maintains master data Dm, a closed-world database. We say that a database D is partially closed if it satisfies a set V of containment constraints of the form "q(D) is a subset of p(Dm)", where q is a query in a language Lc and p is a projection query. The part of D not constrained by (Dm,V) is open, from which some tuples may be missing. The database D is said to be complete for a query Q relative to (Dm,V) if for all partially closed extensions D' of D, Q(D')=Q(D), i.e., adding tuples to D either violates some constraints in V or does not change the answer to Q.

We first show that the proposed model can also capture the consistency of data, in addition to its relative completeness. Indeed, integrity constraints studied for consistency can be expressed as containment constraints. We then study two problems. One is to decide, given Dm, V, a query Q in a language Lq and a partially closed database D, whether D is complete for Q relative to (Dm,V). The other is to determine, given Dm, V and Q, whether there exists a partially closed database that is complete for Q relative to (Dm,V). We establish matching lower and upper bounds on these problems for a variety of languages Lq and Lc. We also provide characterizations for a database to be relatively complete, and for a query to allow a relatively complete database, when Lq and Lc are conjunctive queries.

We study privacy-preserving query answering over data containing relationships. A social network is a prime example of such data, where the nodes represent individuals and edges represent relationships. Nearly all interesting queries over social networks involve joins, and for such queries, existing output perturbation algorithms severely distort query answers. We propose an algorithm that significantly improves utility over competing techniques, typically reducing the error bound from polynomial in the number of nodes to polylogarithmic. The algorithm is, to the best of our knowledge, the first to answer such queries with acceptable accuracy, even for worst-case inputs.

The improved utility is achieved by relaxing the privacy condition. Instead of ensuring strict differential privacy, we guarantee a weaker (but still quite practical) condition based on adversarial privacy. To explain precisely the nature of our relaxation in privacy, we provide a new result that characterizes the relationship between ε-indistinguishability~(a variant of the differential privacy definition) and adversarial privacy, which is of independent interest: an algorithm is ε-indistinguishable iff it is private for a particular class of adversaries (defined precisely herein). Our perturbation algorithm guarantees privacy against adversaries in this class whose prior distribution is numerically bounded.

As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold *s*^{*} for a dataset, such that the number of itemsets with support at least *s*^{*} represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate.

Our methodology hinges on a Poisson approximation to the distribution of the number of itemsets in a random dataset with support at least *s*, for any *s* greater than or equal to a minimum threshold *s*_{min}. We obtain this result through a novel application of the Chen-Stein approximation method, which is of independent interest. Based on this approximation, we develop an efficient parametric multi-hypothesis test for identifying the desired threshold *s*^{*}. A crucial feature of our approach is that, unlike most previous work, it takes into account the entire dataset rather than individual discoveries. It is therefore better able to distinguish between significant observations and random fluctuations. We present extensive experimental results to substantiate the effectiveness of our methodology.

We introduce the *similarity caching problem*, a variant of classical caching in which an algorithm can return an element from the cache that is similar, but not necessarily identical, to the query element. We are motivated by buffer management questions in approximate nearest-neighbor applications, especially in the context of caching targeted advertisements on the web. Formally, we assume the queries lie in a metric space, with distance function *d*(.,.). A query *p* is considered a cache hit if there is a point *q* in the cache that is sufficiently close to *p*, i.e., for a threshold radius *r*, we have *d*(*p,q*) ≤ *r*. The goal is then to minimize the number of cache misses, vis-à-vis the optimal algorithm. As with classical caching, we use the competitive ratio to measure the performance of different algorithms.

While similarity caching is a strict generalization of classical caching, we show that unless the algorithm is allowed extra power (either in the size of the cache or the threshold *r*) over the optimal offline algorithm, the problem is intractable. We then proceed to quantify the hardness as a function of the complexity of the underlying metric space. We show that the problem becomes easier as we proceed from general metric spaces to those of bounded doubling dimension, and to Euclidean metrics. Finally, we investigate several extensions of the problem: dependence of the threshold *r* on the query and a smoother trade-off between the cache-miss cost and the query-query similarity.

Querying uncertain data has emerged as an important problem in data management due to the imprecise nature of many measurement data. In this paper we study answering range queries over uncertain data. Specifically, we are given a collection *P* of *n* points in R, each represented by its one-dimensional probability density function (pdf). The goal is to build an index on *P* such that given a query interval *I* and a probability threshold τ, we can quickly report all points of *P* that lie in *I* with probability at least τ. We present various indexing schemes with linear or near-linear space and logarithmic query time. Our schemes support pdf's that are either histograms or more complex ones such as Gaussian or piecewise algebraic. They also extend to the external memory model in which the goal is to minimize the number of disk accesses when querying the index.

A *sliding windows* model is an important case of the streaming model, where only the most "recent" elements remain active and the rest are discarded in a stream. The sliding windows model is important for many applications (see, e.g., Babcock, Babu, Datar, Motwani and Widom (PODS 02); and Datar, Gionis, Indyk and Motwani (SODA 02)). There are two equally important types of the sliding windows model -- windows with fixed size, (e.g., where items arrive one at a time, and only the most recent *n* items remain active for some fixed parameter *n*), and bursty windows (e.g., where many items can arrive in "bursts" at a single step and where only items from the last *t* steps remain active, again for some fixed parameter *t*).

*Random sampling* is a fundamental tool for data streams, as numerous algorithms operate on the sampled data instead of on the entire stream. Effective sampling from sliding windows is a nontrivial problem, as elements eventually expire. In fact, the deletions are *implicit*; i.e., it is not possible to identify deleted elements without storing the entire window. The implicit nature of deletions on sliding windows does not allow the existing methods (even those that support explicit deletions, e.g., Cormode, Muthukrishnan and Rozenbaum (VLDB 05); Frahling, Indyk and Sohler (SOCG 05)) to be directly "translated" to the sliding windows model. One trivial approach to overcoming the problem of implicit deletions is that of over-sampling. When *k* samples are required, the over-sampling method maintains *k'*>*k* samples in the hope that at least *k* samples are not expired. The obvious disadvantages of this method are twofold:

(a) It introduces additional costs and thus decreases the performance; and

(b) The memory bounds are not deterministic, which is atypical for streaming algorithms (where even small probability events may eventually happen for a stream that is big enough).

Babcock, Datar and Motwani (SODA 02), were the first to stress the importance of improvements to over-sampling. They formally introduced the problem of sampling from sliding windows and improved the over-sampling method for *sampling with replacement*. Their elegant solutions for sampling with replacement are optimal *in expectation*, and thus resolve disadvantage *(a)* mentioned above. Unfortunately, the randomized bounds do not resolve disadvantage *(b)* above. Interestingly, all algorithms that employ the ideas of Babcock, Datar and Motwani have the same central problem of having to deal with randomized complexity (see e.g., Datar and Muthukrishnan (ESA 02); Chakrabarti, Cormode and McGregor (SODA 07)). Further, the proposed solutions of Babcock, Datar and Motwani for *sampling without replacement* are based on the criticized over-sampling method and thus do not solve problem *(a)*. Therefore, the question of whether we can solve sampling on sliding windows optimally (i.e., resolving both disadvantages) is implicit in the paper of Babcock, Datar and Motwani and has remained open for all variants of the problem.

In this paper we answer these questions affirmatively and provide optimal sampling schemas for all variants of the problem, i.e., sampling with or without replacement from fixed or bursty windows. Specifically, for fixed-size windows, we provide optimal solutions that require *O*(*k*) memory; for bursty windows, we show algorithms that require *O*(*k*log*n*), which is optimal since it matches the lower bound by Gemulla and Lehner (SIGMOD 08). In contrast to the work of of Babcock, Datar and Motwani, our solutions have deterministic bounds. Thus, we prove a perhaps somewhat surprising fact: the memory complexity of the sampling-based algorithm for all variants of the sliding windows model is comparable with that of streaming models (i.e., without the sliding windows). This is the first result of this type, since all previous "translations" of sampling-based algorithms to sliding windows incur randomized memory guarantees only.

The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worst-case guarantees on real data. This leads to the question of whether some stronger bounds can be guaranteed. We answer this in the positive by showing that a class of "counter-based algorithms" (including the popular and very space-efficient FREQUENT and SPACESAVING algorithms) provide much stronger approximation guarantees than previously known. Specifically, we show that errors in the approximation of individual elements do not depend on the frequencies of the most frequent elements, but only on the frequency of the remaining "tail." This shows that counter-based methods are the most space-efficient (in fact, space-optimal) algorithms having this strong error bound.

This tail guarantee allows these algorithms to solve the "sparse recovery" problem. Here, the goal is to recover a faithful representation of the vector of frequencies, *f*. We prove that using space *O*(*k*), the algorithms construct an approximation *f** to the frequency vector *f* so that the L1 error ||*f* -- *f**||_{1} is close to the best possible error min_{f2} ||*f*2 -- *f*||_{1}, where *f*2 ranges over all vectors with at most *k* non-zero entries. This improves the previously best known space bound of about *O*(*k* log *n*) for streams without element deletions (where *n* is the size of the domain from which stream elements are drawn). Other consequences of the tail guarantees are results for skewed (Zipfian) data, and guarantees for accuracy of merging multiple summarized streams.

We consider the the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let *A* be a multiset of elements, drawn from the universe *U*={1,...,*u*}. For a given 0 ≤ Φ ≤ 1, the Φ-heavy hitters are those elements of *A* whose frequency in *A* is at least Φ |*A*|; the Φ-quantile of *A* is an element *x* of *U* such that at most Φ|*A*| elements of *A* are smaller than *A* and at most (1-Φ)|*A*| elements of *A* are greater than *x*. Suppose the elements of *A* are received at *k* remote *sites* over time, and each of the sites has a two-way communication channel to a designated *coordinator*, whose goal is to track the set of Φ-heavy hitters and the Φ-quantile of *A* approximately at all times with minimum communication. We give tracking algorithms with worst-case communication cost *O*(*k*/ε ⋅ log *n*) for both problems, where *n* is the total number of items in *A*, and ε is the approximation error. This substantially improves upon the previous known algorithms. We also give matching lower bounds on the communication costs for both problems, showing that our algorithms are optimal. We also consider a more general version of the problem where we simultaneously track the Φ-quantiles for all *0* ≤ Φ ≤ 1.

In this tutorial we will describe some of the recent advances in the development of worst-case efficient range search indexing structures, that is, structures for storing a set of data points such that the points in a axis-parallel (hyper-) query rectangle can be found efficiently (with as few disk accesses - or I/Os - as possible). We first quickly discuss the well-known and optimal structure for the one-dimensional version of the problem, the B-tree [10, 12], along with its variants weight-balanced B-trees [9], multi-version (or persistent) B-trees [6, 11, 13, 22] and buffer-trees [4]. Then we discuss the external priority search tree [8], which solves a restricted version of the two-dimensional version of the problem where the query rectangle is unbounded on one side. This structure is then used in a range tree index structure [8, 21] that answers general two-dimensional queries in the same number of I/Os as the B-tree in the one-dimensional case, but using super-linear space. We also describe the linear space kdB-tree [19, 20] and O-tree [17] index structures that also solve the problem efficiently (but using more I/Os than the range tree). A detailed presentation of all the the above structures can be found in lecture notes by the author [5]. Finally, we also discuss lower bounds techniques, most notably the theory of indexability [16], that can be used to prove that both the range tree and kdB-tree/O-tree are optimal among query efficient and linear space structures, respectively [2, 8, 17], as well as recent index structures for higher-dimensional range search indexing [1]. We end by mentioning various R-tree variant [7, 18, 15] that can be used to solve the extended version of range search indexing where the queries as well as the data are (hyper-) rectangles. More comprehensive surveys of efficient index structures can be found in [3, 14, 23].

Let ∑ be a finite, ordered alphabet, and consider a string **x**=χ_{1}χ_{2}... χ_{n} ∈ ∑^{n}. A *secondary index* for **x** answers alphabet range queries of the form: Given a range [α_{l},α_{r}] ⊆ ∑, return the set *I*_{[αl,αr]} = {*i* |χ_{i} ∈ >[α_{l},α_{r}]}. Secondary indexes are heavily used in relational databases and scientific data analysis. It is well-known that the obvious solution, storing a dictionary for the set ∪_{i}{χ_{i}} with a position set associated with each character, does not always give optimal query time. In this paper we give the first theoretically optimal data structure for the secondary indexing problem. In the I/O model, the amount of data read when answering a query is within a constant factor of the minimum space needed to represent the set *I*_{[αl,αr]}, assuming that the size of internal memory is (|∑| lg *n*)^{δ} blocks, for some constant δ > *0*. The space usage of the data structure is *O*(*n*lg |∑|) bits in the worst case, and we further show how to bound the size of the data structure in terms of the *0*th order entropy of **x**. We show how to support updates achieving various time-space trade-offs.

We also consider an approximate version of the basic secondary indexing problem where a query reports a superset of *I*_{[αl,αr]} containing each element not in *I*_{[αl,αr]} with probability at most ∈, where ∈ > *0* is the false positive probability. For this problem the amount of data that needs to be read by the query algorithm is reduced to *O*(|*I*_{(αl,αr]}| lg(1/∈)) bits.

The *B-tree* is a fundamental external index structure that is widely used for answering one-dimensional range reporting queries. Given a set of *N* keys, a range query can be answered in *O*(log_{B} *N*over*M* + *K*over*B*) I/Os, where *B* is the disk block size, *K* the output size, and *M* the size of the main memory buffer. When keys are inserted or deleted, the B-tree is updated in *O*(log_{B} *N*) I/Os, if we require the resulting changes to be committed to disk right away. Otherwise, the memory buffer can be used to buffer the recent updates, and changes can be written to disk in batches, which significantly lowers the amortized update cost. A systematic way of batching up updates is to use the *logarithmic method,* combined with *fractional cascading,* resulting in a dynamic B-tree that supports insertions in *O*(1over*B* log *N*over*M*) I/Os and queries in *O*(log *N*over*M* + *K*over*B*) I/Os. Such bounds have also been matched by several known dynamic B-tree variants in the database literature. Note that, however, the query cost of these dynamic B-trees is substantially worse than the *O*(log_{B} *N*over*M* + *K*over*B*) bound of the static B-tree by a factor of ?(log *B*).

In this paper, we prove that for any dynamic one dimensional range query index structure with query cost *O*(*q* + *K*over*B*) and amortized insertion cost *O*(*u*/*B*), the tradeoff *q* · log(*u*/*q*) = ©(log *B*) must hold if *q* = *O*(log *B*). For most reasonable values of the parameters, we have *N*over*M* = *B*^{O(1)}, in which case our query-insertion tradeoff implies that the bounds mentioned above are already optimal. We also prove a lower bound of *u* · log *q* = ©(log *B*), which is relevant for larger values of *q*. Our lower bounds hold in a dynamic version of the *indexability model*, which is of independent interests. Dynamic indexability is a clean yet powerful model for studying dynamic indexing problems, and can potentially lead to more interesting complexity results.

In this work we investigate the satisfiability problem for the logic XPath(↓*, ↓,=), that includes all downward axes as well as equality and inequality tests. We address this problem in the absence of DTDs and the sibling axis. We prove that this fragment is decidable, and we nail down its complexity, showing the problem to be ExpTime-complete. The result also holds when path expressions allow closure under the Kleene star operator. To obtain these results, we introduce a new automaton model over data trees that captures XPath(↓*, ↓,=) and has an ExpTime emptiness problem. Furthermore, we give the exact complexity of several downward-looking fragments.

We consider the problem of deciding query equivalence for a conjunctive language in which queries output complex objects composed from a mixture of nested, unordered collection types. Using an encoding of nested objects as flat relations, we translate the problem to deciding the equivalence between encodings output by relational conjunctive queries. This encoding equivalence cleanly unifies and generalizes previous results for deciding equivalence of conjunctive queries evaluated under various processing semantics. As part of our characterization of encoding equivalence, we define a normal form for encoding queries and contend that this normal form offers new insight into the fundamental principles governing the behaviour of nested aggregation.

We consider the problem of finding equivalent minimal-size reformulations of SQL queries in presence of embedded dependencies [1]. Our focus is on select-project-join (SPJ) queries with equality comparisons, also known as safe conjunctive (CQ) queries, possibly with grouping and aggregation. For SPJ queries, the semantics of the SQL standard treats query answers as *multisets (bags),* whereas the stored relations are treated either as sets, which is called *bag-set semantics,* or as bags, which is called *bag semantics*. (Under *set semantics*, both query answers and stored relations are treated as sets.)

In the context of the above Query-Reformulation Problem, we develop a comprehensive framework for equivalence of CQ queries under bag and bag-set semantics in presence of embedded dependencies, and make a number of conceptual and technical contributions. Specifically, we develop equivalence tests for CQ queries in presence of arbitrary sets of embedded dependencies under bag and bag-set semantics, under the condition that chase [10] under set semantics *(set-chase)* on the inputs terminates. We also present equivalence tests for CQ queries *with grouping and aggregation* in presence of embedded dependencies. We use our equivalence tests to develop sound and complete (whenever set-chase on the inputs terminates) algorithms for solving instances of the Query-Reformulation Problem with CQ queries under each of bag and bag-set semantics, as well as for instances of the problem with aggregate queries.

Our contributions are clearly applicable beyond the Query-Reformulation Problem considered in this paper. Specifically, the results of this paper can be used in developing algorithms for rewriting CQ queries and queries in more expressive languages (e.g., including grouping and aggregation, or arithmetic comparisons) using views in presence of embedded dependencies, under bag or bag-set semantics for query evaluation.

Tree automata (specifically, bottom-up and unranked) form a powerful tool for querying and maintaining validity of XML documents. XML with uncertain data can be modeled as a probability space of labeled trees, and that space is often represented by a tree with distributional nodes. This paper investigates the problem of evaluating a tree automaton over such a representation, where the goal is to compute the probability that the automaton accepts a random possible world. This problem is generally intractable, but for the case where the tree automaton is deterministic (and its transitions are defined by deterministic string automata), an efficient algorithm is presented. The paper discusses the applications of this result, including the ability to sample and to evaluate queries (e.g., in monadic second-order logic) while requiring a-priori conformance to a schema (e.g., DTD). XML schemas also include attribute constraints, and the complexity of key, foreign-key and inclusion constraints are studied in the context of probabilistic XML. Finally, the paper discusses the generalization of the results to an extended data model, where distributional nodes can repeatedly sample the same subtree, thereby adding another exponent to the size of the probability space.

We study models of incomplete information for XML, their computational properties, and query answering. While our approach is motivated by the study of relational incompleteness, incomplete information in XML documents may appear not only as null values but also as missing structural information. Our goal is to provide a classification of incomplete descriptions of XML documents, and separate features - or groups of features - that lead to hard computational problems from those that admit efficient algorithms. Our classification of incomplete information is based on the combination of null values with partial structural descriptions of documents. The key computational problems we consider are consistency of partial descriptions, representability of complete documents by incomplete ones, and query answering. We show how factors such as schema information, the presence of node ids, and missing structural information affect the complexity of these main computational problems, and find robust classes of incomplete XML descriptions that permit tractable query evaluation.

A *distributed XML document* is an XML document that spans several machines or Web repositories. We assume that a distribution design of the document tree is given, providing an XML tree some of whose leaves are "docking points", to which XML subtrees can be attached. These subtrees may be provided and controlled by peers at remote locations, or may correspond to the result of function calls, e.g., Web services. If a global type τ, e.g. a DTD, is specified for a distributed document *T*, it would be most desirable to be able to break this type into a collection of local types, called a local typing, such that the document satisfies τ if and only if each peer (or function) satisfies its local type. In this paper we lay out the fundamentals of a theory of local typing and provide formal definitions of three main variants of locality: local typing, maximal local typing, and perfect typing, the latter being the most desirable. We study the following relevant decision problems: (i) given a typing for a design, determine whether it is local, maximal local, or perfect; (ii) given a design, establish whether a (maximal) local, or perfect typing does exist. For some of these problems we provide tight complexity bounds (polynomial space), while for the others we show exponential upper bounds. A main contribution is a polynomial-space algorithm for computing a perfect typing in this context, if it exists.

We address the problem of finding a "best" deterministic query answer to a query over a probabilistic database. For this purpose, we propose the notion of a consensus world (or a consensus answer) which is a deterministic world (answer) that minimizes the expected distance to the possible worlds (answers). This problem can be seen as a generalization of the well-studied inconsistent information aggregation problems (e.g. rank aggregation) to probabilistic databases. We consider this problem for various types of queries including SPJ queries, Top-k ranking queries, group-by aggregate queries, and clustering. For different distance metrics, we obtain polynomial time optimal or approximation algorithms for computing the consensus answers (or prove NP-hardness). Most of our results are for a general probabilistic database model, called and/xor tree model, which significantly generalizes previous probabilistic database models like x-tuples and block-independent disjoint models, and is of independent interest.

Database technology is playing an increasingly important role in understanding and solving large-scale and complex scientific and societal problems and phenomena, for instance, understanding biological networks, climate modeling, electronic markets, etc. In these settings, uncertainty or imprecise information is a pervasive issue that becomes a serious impediment to understanding and effectively utilizing such systems. Clustering is one of the key problems in this context.

In this paper we focus on the problem of clustering, specifically the *k*-center problem. Since the problem is NP-Hard in deterministic setting, a natural avenue is to consider approximation algorithms with a bounded performance ratio. In an earlier paper Cormode and McGregor had considered certain variants of this problem, but failed to provide approximations that preserved the number of centers. In this paper we remedy the situation and provide true approximation algorithms for a wider class of these problems.

However, the key aspect of this paper is to devise general techniques for optimization under uncertainty. We show that a particular formulation which uses the contribution of a random variable above its expectation is useful in this context. We believe these techniques will find wider applications in optimization under uncertainty.

Skyline computation is widely used in multi-criteria decision making. As research in uncertain databases draws increasing attention, skyline queries with uncertain data have also been studied, e.g. probabilistic skylines. The previous work requires "thresholding" for its efficiency -- the efficiency relies on the assumption that points with skyline probabilities below a certain threshold can be ignored. But there are situations where "thresholding" is not desirable -- low probability events cannot be ignored when their consequences are significant. In such cases it is necessary to compute skyline probabilities of all data items. We provide the first algorithm for this problem whose worst-case time complexity is sub-quadratic. The techniques we use are interesting in their own right, as they rely on a space partitioning technique combined with using the existing dominance counting algorithm. The effectiveness of our algorithm is experimentally verified.