PODS '04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems


SESSION: Invited talk

The Lixto data extraction project: back and forth between theory and practice

We present the Lixto project, which is both a research project in database theory and a commercial enterprise that develops Web data extraction (wrapping) and Web service definition software. We discuss the project's main motivations and ideas, in particular the use of a logic-based framework for wrapping. Then we present theoretical results on monadic datalog over trees and on Elog, its close relative, which is used as the internal wrapper language in the Lixto system. These results include a characterization of both the expressive power and the complexity of these languages. We describe the visual wrapper specification process in Lixto and various practical aspects of wrapping. We discuss work on the complexity of query languages for trees that grew out of our theoretical study of logic-based languages for wrapping. Then we return to the practice of wrapping and the Lixto Transformation Server, which allows for streaming integration of data extracted from Web pages. This is a natural requirement in complex services based on Web wrapping. Finally, we discuss industrial applications of Lixto and point to open problems for future study.
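
As a hedged illustration of the logic-based framework (toy rules of our own, not Lixto/Elog syntax), a monadic datalog wrapper over an HTML parse tree might populate unary node-selecting predicates from the tree's child relation and node-label predicates:

```latex
% Toy monadic datalog wrapping rules over an HTML parse tree (hypothetical
% example; predicate names are ours, not Lixto/Elog syntax).  All defined
% predicates (row, price) are unary, i.e., they select sets of tree nodes.
\mathit{row}(x)   \leftarrow \mathit{label}_{\mathrm{table}}(y),\ \mathit{child}(y,x),\ \mathit{label}_{\mathrm{tr}}(x)
\mathit{price}(x) \leftarrow \mathit{row}(y),\ \mathit{child}(y,x),\ \mathit{label}_{\mathrm{td}}(x)
```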

SESSION: XML processing

Conditional XPath, the first order complete XPath dialect

XPath is the W3C-standard node addressing language for XML documents. XPath is still under development and its technical aspects are intensively studied. What is missing at present is a clear characterization of the expressive power of XPath, be it semantic or with reference to some well-established existing (logical) formalism. Core XPath (the logical core of XPath 1.0 defined by Gottlob et al.) cannot express queries with conditional paths, as exemplified by "do a child step, while test is true at the resulting node." Such queries are expressible in a first-order complete extension of Core XPath. We add conditional axis relations to Core XPath and show that the resulting language, called conditional XPath, is as expressive as first-order logic when interpreted on ordered trees. Both the result, the extended XPath language, and the proof are closely related to temporal logic. Specifically, while Core XPath may be viewed as a simple temporal logic, conditional XPath extends this with (counterparts of) the since and until operators.
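
As a hedged illustration (our own example and notation, which may differ from the paper's concrete syntax), such a conditional path corresponds to a transitive closure of a test-guarded child step, mirroring a temporal until operator:

```latex
% Hypothetical conditional-axis notation (illustrative only).
(\mathrm{child}::{*}[p])^{+}/\mathrm{child}::a
% "select a-labelled nodes reached by one or more child steps through nodes
%  satisfying p, followed by a final child step"
```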

Frontiers of tractability for typechecking simple XML transformations

Typechecking consists of statically verifying whether the output of an XML transformation always conforms to an output type for documents satisfying a given input type. We focus on complete algorithms which always produce the correct answer. We consider top-down XML transformations incorporating XPath expressions, and abstract document types by grammars and tree automata. By restricting schema languages and transformations, we identify several practical settings for which typechecking is in polynomial time. Moreover, the resulting framework provides a rather complete picture, as we show that most scenarios cannot be enlarged without rendering the typechecking problem intractable. So, the present research sheds light on when to use fast complete algorithms and when to resort to sound but incomplete ones.

Positive active XML

The increasing popularity of XML and Web services has given rise to a new generation of documents, called Active XML documents (AXML), where some of the data is given explicitly while other parts are given intensionally, by means of embedded calls to Web services. Web services in this context can exchange intensional information, using AXML documents as parameters and results. The goal of this paper is to provide a formal foundation for this new generation of AXML documents and services, and to study fundamental issues they raise. We focus on Web services that are (1) monotone and (2) defined declaratively as conjunctive queries over AXML documents. We study the semantics of documents and queries, the confluence of computations, termination, and lazy query evaluation.

SESSION: Invited tutorial 1

The past, present and future of web information retrieval

Web search engines have emerged as one of the central applications on the Internet. In fact, search has become one of the most important activities that people engage in on the Internet. Even beyond becoming the number one source of information, a growing number of businesses are depending on web search engines for customer acquisition. The first generation of web search engines used text-only retrieval techniques. Google revolutionized the field by deploying the PageRank technology, an eigenvector-based analysis of the hyperlink structure, to analyze the web in order to produce relevant results. Moving forward, our goal is to achieve a better understanding of a page with a view towards producing even more relevant results. An exciting new form of search for the future is query-free search: while a user performs her daily tasks, searches are automatically performed to supply her with information that is relevant to her activity. We present one type of query-free search, namely query-free news search: while a user watches TV news, the system finds, in real time, web pages that are relevant to the news stories.

SESSION: Ranking

Comparing and aggregating rankings with ties

Rank aggregation has recently been proposed as a useful abstraction with several applications, including meta-search, synthesizing rank functions from multiple indices, similarity search, and classification. In database applications (catalog searches, fielded searches, parametric searches, etc.), the rankings are produced by sorting an underlying database according to various fields. Typically, there are a number of fields that each have very few distinct values, and hence the corresponding rankings have many ties in them. Known methods for rank aggregation are poorly suited to this context, and the difficulties can be traced back to the fact that we do not have sound mathematical principles for comparing two partial rankings, that is, rankings that allow ties. In this work, we provide a comprehensive picture of how to compare partial rankings. We propose several metrics to compare partial rankings, present algorithms that efficiently compute them, and prove that they are within constant multiples of each other. Based on these concepts, we formulate aggregation problems for partial rankings, and develop a highly efficient algorithm to compute the top few elements of a near-optimal aggregation of multiple partial rankings. In a model of access that is suitable for databases, our algorithm reads essentially as few elements of each partial ranking as are necessary to determine the winner(s).
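
One natural Kendall-style way to score disagreement between partial rankings is to charge a fractional penalty when a pair of items is tied in one ranking but ordered in the other. The sketch below is a hedged illustration of this idea only; the exact metrics, parameters, and constants studied in the paper may differ.

```python
from itertools import combinations

def kendall_with_ties(r1, r2, p=0.5):
    """Hedged sketch of a Kendall-style distance between two partial rankings.

    r1, r2 map each item to a bucket index; smaller index = ranked higher,
    equal indices = tied.  For every unordered pair of items we charge:
      0  if the pair is ordered the same way (or tied) in both rankings,
      1  if the two rankings order the pair in opposite ways,
      p  if the pair is tied in exactly one of the rankings.
    (Illustrative only; not the paper's exact definitions.)
    """
    items = sorted(set(r1) & set(r2))
    cost = 0.0
    for i, j in combinations(items, 2):
        a, b = r1[i] - r1[j], r2[i] - r2[j]
        if a * b > 0 or (a == 0 and b == 0):
            continue                      # agree, or tied in both
        elif a != 0 and b != 0:
            cost += 1.0                   # strict disagreement
        else:
            cost += p                     # tied in one, ordered in the other
    return cost

# Example: two fielded-search rankings with many ties.
print(kendall_with_ties({"x": 0, "y": 0, "z": 1}, {"x": 0, "y": 1, "z": 1}))  # 1.0
```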

Using non-linear dynamical systems for web searching and ranking

In recent years there has been a surge of research activity in the area of information retrieval on the World Wide Web, using link analysis of the underlying hypertext graph topology. Most of the algorithms in the literature can be described as dynamical systems, that is, the repetitive application of a function on a set of weights. Algorithms that rely on eigenvector computations, such as HITS and PAGERANK, correspond to linear dynamical systems. In this work we consider two families of link analysis ranking algorithms that no longer enjoy the linearity property of the previous approaches. We study in depth an interesting special case of these two families. We prove that the corresponding non-linear dynamical system converges for any initialization, and we provide a rigorous characterization of the combinatorial properties of the stationary weights. The study of the weights provides a clear and insightful view of the mechanics of the algorithm. We also present extensive experimental results that demonstrate that our algorithm performs well in practice.

SESSION: Data exchange I

Specification and verification of data-driven web services

We study data-driven Web services provided by Web sites interacting with users or applications. The Web site can access an underlying database, as well as state information updated as the interaction progresses, and receives user input. The structure and contents of Web pages, as well as the actions to be taken, are determined dynamically by querying the underlying database as well as the state and inputs. The properties to be verified concern the sequences of events (inputs, states, and actions) resulting from the interaction, and are expressed in linear or branching-time temporal logics. The results establish under what conditions automatic verification of such properties is possible and provide the complexity of verification. This brings into play a mix of techniques from logic and automatic verification.

Composing schema mappings: second-order dependencies to the rescue

A schema mapping is a specification that describes how data structured under one schema (the source schema) is to be transformed into data structured under a different schema (the target schema). Schema mappings play a key role in numerous areas of database systems, including database design, information integration, and model management. A fundamental problem in this context is composing schema mappings: given two successive schema mappings, derive a schema mapping between the source schema of the first and the target schema of the second that has the same effect as applying successively the two schema mappings. In this paper, we give a rigorous semantics to the composition of schema mappings and investigate the definability and computational complexity of the composition of two schema mappings. We first study the important case of schema mappings in which the specification is given by a finite set of source-to-target tuple-generating dependencies (source-to-target tgds). We show that the composition of a finite set of full source-to-target tgds with a finite set of tgds is always definable by a finite set of source-to-target tgds, but the composition of a finite set of source-to-target tgds with a finite set of full source-to-target tgds may not be definable by any set (finite or infinite) of source-to-target tgds; furthermore, it may not be definable by any formula of least fixed-point logic, and the associated composition query may be NP-complete. After this, we introduce a class of existential second-order formulas with function symbols, which we call second-order tgds, and make a case that they are the "right" language for composing schema mappings. To this effect, we show that the composition of finite sets of source-to-target tgds is always definable by a second-order tgd. Moreover, the composition of second-order tgds is also definable by a second-order tgd. Our second-order tgds allow equalities, even though the "obvious" way to define them does not require equalities. Allowing equalities in second-order tgds turns out to be of the essence, because we show that second-order tgds without equalities are not sufficiently expressive to define even the composition of finite sets of source-to-target tgds. Finally, we show that second-order tgds possess good properties for data exchange. In particular, the chase procedure can be extended to second-order tgds so that it produces polynomial-time computable universal solutions in data exchange settings specified by second-order tgds.
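
A small example in the spirit of the paper's motivating scenario (relation names and details here are our own) shows why function symbols help: composing a mapping that invents a manager value for every employee with a mapping that inspects that invented value requires naming it via an existentially quantified function.

```latex
% First mapping: every employee is assigned some (unknown) manager.
\Sigma_{12} = \{\ \mathrm{Emp}(e) \rightarrow \exists m\, \mathrm{Rep}(e,m)\ \}
% Second mapping: copy the relation and flag self-managed employees.
\Sigma_{23} = \{\ \mathrm{Rep}(e,m) \rightarrow \mathrm{Mgr}(e,m),\quad \mathrm{Rep}(e,e) \rightarrow \mathrm{SelfMgr}(e)\ \}
% The composition is not definable by source-to-target tgds, but it is
% definable by a second-order tgd with an existentially quantified function f:
\exists f \Big(\, \forall e\, \big(\mathrm{Emp}(e) \rightarrow \mathrm{Mgr}(e, f(e))\big)
  \ \wedge\ \forall e\, \big(\mathrm{Emp}(e) \wedge e = f(e) \rightarrow \mathrm{SelfMgr}(e)\big) \Big)
```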

Foundations of semantic web databases

The Semantic Web is based on the idea of adding more machine-readable semantics to web information via annotations written in a language called the Resource Description Framework (RDF). RDF resembles a subset of binary first-order logic including the ability to refer to anonymous objects. Its extended version, RDFS, supports reification, typing and inheritance. These features introduce new challenges into the formal study of sets of RDF/RDFS statements and languages for querying them. Although several such query languages have been proposed, there has been little work on foundational aspects. We investigate these, including computational aspects of testing entailment and redundancy. We propose a query language with well-defined semantics and study the complexity of query processing, query containment, and simplification of answers.

SESSION: Spatial data

A characterization of first-order topological properties of planar spatial data

Closed semi-algebraic sets in the plane form a powerful model of planar spatial datasets. We establish a characterization of the topological properties of such datasets expressible in the relational calculus with real polynomial constraints. The characterization is in the form of a query language that can only talk about points in the set and the "cones" around these points.

Roads, codes, and spatiotemporal queries

We present a novel coding-based technique for answering spatial and spatiotemporal queries on objects moving along a system of curves on the plane such as many road networks. We handle join, range, intercept, and other spatial and spatiotemporal queries under these assumptions, with distances being measured along the trajectories. Most work to date has studied the significantly simpler case of objects moving in straight lines on the plane. Our work is an advance toward solving the problem in its more general form. Central to our approach is an efficient coding technique, based on hypercube embedding, for assigning labels to nodes in the network. The Hamming distance between codes corresponds to the physical distance between nodes, so that we can determine shortest distances in the network extremely quickly. The coding method also efficiently captures many properties of the network relevant to spatial and spatiotemporal queries. Our approach also yields a very effective spatial hashing method for this domain. Our analytical results demonstrate that our methods are space- and time-efficient. We have studied the performance of our method for large planar graphs designed to represent road networks. Experiments show that our methods are efficient and practical.
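
To make the distance-preserving labels concrete, here is a hedged toy illustration (the node codes below are made up; the paper's hypercube-embedding construction is considerably more involved): once every node carries a code, network distance between two nodes reduces to a Hamming-distance computation on their codes.

```python
def hamming(code_u, code_v):
    """Hamming distance between two equal-length bit strings."""
    return sum(b1 != b2 for b1, b2 in zip(code_u, code_v))

# Hypothetical labels for four nodes on a road path u - v - w - x, chosen so
# that Hamming distance equals hop distance along the road (toy example only).
codes = {"u": "000", "v": "001", "w": "011", "x": "111"}

print(hamming(codes["u"], codes["x"]))   # 3 hops between u and x
print(hamming(codes["v"], codes["w"]))   # 1 hop between v and w
```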

Replicated declustering of spatial data

The problem of disk declustering is to distribute data among multiple disks to reduce query response times through parallel I/O. A strictly optimal declustering technique is one that achieves optimal parallel I/O for all possible queries. In this paper, we focus on techniques that are optimized for spatial range queries. Current declustering techniques, which keep a single copy of the data, have been shown to be suboptimal for range queries. The lower bound on extra disk accesses is proved to be Ω(log N) for N disks even in the restricted case of an N-by-N grid, and all current approaches have been trying to achieve this bound. Replication is a well-known and effective solution for several problems in databases, especially for availability and load balancing. In this paper, we explore the idea of replication in the context of declustering and propose a framework where strictly optimal parallel I/O is achievable using a small amount of replication. We provide some theoretical foundations for replicated declustering, e.g., a bound on the number of copies needed for strict optimality on any number of disks, and propose a class of replicated declustering schemes, periodic allocations, which are shown to be strictly optimal. The results for optimal disk allocation are extended to larger numbers of disks by increasing replication. Our techniques and results are valid for arbitrary a-by-b grids, and any declustering scheme can be further improved using our replication framework. Using the framework, we perform experiments to identify a strictly optimal disk access schedule for any given arbitrary range query. In addition to the theoretical bounds, we compare the proposed replication-based scheme to other existing techniques by performing experiments on real datasets.

SESSION: Clustering, data mining, approximations

Clustering via matrix powering

Given a set of n points with a matrix of pairwise similarity measures, one would like to partition the points into clusters so that similar points are together and dissimilar ones apart. We present an algorithm requiring only matrix powering that performs well in practice and bears an elegant interpretation in terms of random walks on a graph. Under a certain mixture model, in which a partition is planted via randomized rounding of tailored matrix entries, the algorithm can be proven effective using only a single squaring. It is shown that the clustering performance of the algorithm degrades with larger values of the exponent, thus revealing that a single squaring is optimal.
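
As a hedged sketch of the mechanics (our own rounding scheme and threshold, not the paper's analysis), squaring the similarity matrix accumulates two-step similarities, which corresponds to a two-step random walk and sharpens within-cluster entries; clusters can then be read off the strengthened rows.

```python
import numpy as np

def cluster_by_squaring(S, threshold):
    """Hedged sketch of clustering via one matrix squaring (illustrative only).

    S is an n x n symmetric similarity matrix.  After one squaring, entries
    measure two-step similarity; points whose squared-similarity exceeds the
    threshold are greedily placed in the same cluster.
    """
    S2 = S @ S                      # one squaring: two-step similarities
    n = S2.shape[0]
    labels = [-1] * n
    next_label = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        for j in range(i + 1, n):
            if labels[j] == -1 and S2[i, j] >= threshold:
                labels[j] = next_label
        next_label += 1
    return labels

# Two planted blocks with noisy similarities.
S = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.0, 0.1],
              [0.1, 0.0, 1.0, 0.8],
              [0.0, 0.1, 0.8, 1.0]])
print(cluster_by_squaring(S, threshold=1.0))   # [0, 0, 1, 1]
```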

Computational complexity of itemset frequency satisfiability

Computing frequent itemsets is one of the most prominent problems in data mining. We introduce a new, related problem, called FREQSAT: given some itemset-interval pairs, does there exist a database such that for every pair the frequency of the itemset falls in the interval? It is shown in this paper that FREQSAT is not finitely axiomatizable and that it is NP-complete. We also study cases in which other characteristics of the database are given as well. These characteristics can complicate FREQSAT even more. For example, when the maximal number of duplicates of a transaction is known, FREQSAT becomes PP-hard. We describe applications of FREQSAT in frequent itemset mining algorithms and privacy in data mining.

k-means projective clustering

In many applications it is desirable to cluster high dimensional data along various subspaces, which we refer to as projective clustering. We propose a new objective function for projective clustering, taking into account the inherent trade-off between the dimension of a subspace and the induced clustering error. We then present an extension of the k-means clustering algorithm for projective clustering in arbitrary subspaces, and also propose techniques to avoid local minima. Unlike previous algorithms, ours can choose the dimension of each cluster independently and automatically. Furthermore, experimental results show that our algorithm is significantly more accurate than the previous approaches.
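
The sketch below is a hedged illustration of the general idea of k-means-style projective clustering: alternate between fitting a low-dimensional subspace per cluster and reassigning points by residual distance. The subspace dimension is fixed here, whereas the paper's objective chooses it per cluster and trades it off against clustering error; the local-minima avoidance techniques are also omitted.

```python
import numpy as np

def projective_kmeans(X, k, dim, iters=20, seed=0):
    """Hedged sketch of k-means-style projective clustering (illustrative only).

    Each cluster is a dim-dimensional affine subspace (mean plus top principal
    directions); points are assigned to the subspace with smallest residual.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))
    for _ in range(iters):
        bases = []
        for c in range(k):
            P = X[labels == c]
            if len(P) == 0:                      # restart an empty cluster
                P = X[rng.integers(len(X), size=1)]
            mu = P.mean(axis=0)
            # Top dim right singular vectors span the cluster's subspace.
            _, _, Vt = np.linalg.svd(P - mu, full_matrices=False)
            bases.append((mu, Vt[:dim]))
        # Reassign each point to the subspace with the smallest residual.
        resid = np.empty((len(X), k))
        for c, (mu, V) in enumerate(bases):
            D = X - mu
            resid[:, c] = np.linalg.norm(D - (D @ V.T) @ V, axis=1)
        labels = resid.argmin(axis=1)
    return labels

# Toy data: two noisy 1-D lines embedded in 3-D.
t = np.linspace(0, 1, 50)
X = np.vstack([np.c_[t, 2 * t, 0 * t], np.c_[t, -t, 1 + 0 * t]])
X += 0.01 * np.random.default_rng(1).normal(size=X.shape)
labels = projective_kmeans(X, k=2, dim=1)
print(labels[:5], labels[-5:])   # the two lines end up in different clusters
```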

Deterministic wavelet thresholding for maximum-error metrics

Several studies have demonstrated the effectiveness of the wavelet decomposition as a tool for reducing large amounts of data down to compact wavelet synopses that can be used to obtain fast, accurate approximate answers to user queries. While conventional wavelet synopses are based on greedily minimizing the overall root-mean-squared (i.e., L2-norm) error in the data approximation, recent work has demonstrated that such synopses can suffer from important problems, including severe bias and wide variance in the quality of the data reconstruction, and lack of non-trivial guarantees for individual approximate answers. As a result, probabilistic thresholding schemes have been recently proposed as a means of building wavelet synopses that try to probabilistically control other approximation-error metrics, such as the maximum relative error in data-value reconstruction, which is arguably the most important for approximate query answers and meaningful error guarantees. One of the main open problems posed by this earlier work is whether it is possible to design efficient deterministic wavelet-thresholding algorithms for minimizing non-L2 error metrics that are relevant to approximate query processing systems, such as maximum relative or maximum absolute error. Obviously, such algorithms can guarantee better wavelet synopses and avoid the pitfalls of probabilistic techniques (e.g., "bad" coin-flip sequences) leading to poor solutions. In this paper, we address this problem and propose novel, computationally efficient schemes for deterministic wavelet thresholding with the objective of optimizing maximum-error metrics. We introduce an optimal low polynomial-time algorithm for one-dimensional wavelet thresholding: our algorithm is based on a new dynamic-programming (DP) formulation, and can be employed to minimize the maximum relative or absolute error in the data reconstruction. Unfortunately, directly extending our one-dimensional DP algorithm to multi-dimensional wavelets results in a super-exponential increase in time complexity with the data dimensionality. Thus, we also introduce novel, polynomial-time approximation schemes (with tunable approximation guarantees for the target maximum-error metric) for deterministic wavelet thresholding in multiple dimensions.
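
For context, here is a hedged sketch of the conventional baseline the abstract contrasts with: compute the Haar decomposition and greedily keep the largest coefficients. The paper's contribution replaces the greedy step with a dynamic program that optimizes maximum error, which is not reproduced here.

```python
import numpy as np

def haar_coeffs(data):
    """Unnormalized Haar wavelet decomposition of a length-2^k array."""
    coeffs, avgs = [], np.asarray(data, dtype=float)
    while len(avgs) > 1:
        pairs = avgs.reshape(-1, 2)
        coeffs.append((pairs[:, 0] - pairs[:, 1]) / 2.0)   # detail coefficients
        avgs = pairs.mean(axis=1)                           # running averages
    return np.concatenate([avgs] + coeffs[::-1])            # overall average first

def greedy_synopsis(data, B):
    """Conventional baseline: keep the B largest coefficients in absolute value
    (for brevity we skip the per-level normalization needed for exact L2
    optimality).  The paper's deterministic maximum-error thresholding would
    choose the retained coefficients by dynamic programming instead.
    """
    c = haar_coeffs(data)
    keep = np.argsort(-np.abs(c))[:B]
    synopsis = np.zeros_like(c)
    synopsis[keep] = c[keep]
    return synopsis

print(greedy_synopsis([2, 2, 0, 2, 3, 5, 4, 4], B=3))
```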

SESSION: Query execution and optimization

On the memory requirements of XPath evaluation over XML streams

The important challenge of evaluating XPath queries over XML streams has sparked much interest in the past two years. A number of algorithms have been proposed, supporting wider fragments of the query language and exhibiting better performance and memory utilization. Nevertheless, all the algorithms known to date use a prohibitively large amount of memory for certain types of queries. A natural question then is whether this memory bottleneck is inherent or just an artifact of the proposed algorithms. In this paper we initiate the first systematic and theoretical study of lower bounds on the amount of memory required to evaluate XPath queries over XML streams. We present a general lower bound technique, which, given a query, specifies the minimum amount of memory that any algorithm evaluating the query on a stream must use. The lower bounds are stated in terms of new graph-theoretic properties of queries. The proof is based on tools from communication complexity. We then exploit insights learned from the lower bounds to obtain a new algorithm for XPath evaluation on streams. The algorithm uses space close to the optimum. Our algorithm deviates from the standard paradigm of using automata or transducers, thereby avoiding the need to store large transition tables.

Conjunctive queries over trees

We study the complexity and expressive power of conjunctive queries over unranked labeled trees, where the tree structure is represented using "axis relations" such as "child", "descendant", and "following" (we consider a superset of the XPath axes) as well as unary relations for node labels. (Cyclic) conjunctive queries over trees occur in a wide range of data management scenarios related to XML, the Web, and computational linguistics. We establish a framework for characterizing structures representing trees for which conjunctive queries can be evaluated efficiently. Then we completely chart the tractability frontier of the problem for our axis relations, i.e., we find all subset-maximal sets of axes for which query evaluation is in polynomial time. All polynomial-time results are obtained immediately using the proof techniques from our framework. Finally, we study the expressiveness of conjunctive queries over trees and compare it to the expressive power of fragments of XPath. We show that for each conjunctive query, there is an equivalent acyclic positive query (i.e., a set of acyclic conjunctive queries), but that in general this query is not of polynomial size.
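
A hedged example of what such a cyclic conjunctive query looks like (our own query, not one from the paper): it asks for a-labeled nodes with a c-labeled child and a b-labeled descendant such that the descendant follows the child in document order; its three binary atoms form a cycle over the variables x, y, z.

```latex
% Illustrative (cyclic) conjunctive query over a tree, using axis relations
% and unary label predicates.
Q(x) \leftarrow \mathrm{Label}_a(x),\ \mathrm{Child}(x,y),\ \mathrm{Label}_c(y),\
               \mathrm{Descendant}(x,z),\ \mathrm{Label}_b(z),\ \mathrm{Following}(y,z)
```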

Synopses for query optimization: a space-complexity perspective

Database systems use precomputed synopses of data to estimate the cost of alternative plans during query optimization. A number of alternative synopsis structures have been proposed, but histograms are by far the most commonly used. While histograms have proved to be very effective in (cost estimation for) single-table selections, queries with joins have long been seen as a challenge; under a model where histograms are maintained for individual tables, a celebrated result of Ioannidis and Christodoulakis observes that errors propagate exponentially with the number of joins in a query. In this paper, we make two main contributions. First, we study the space complexity of using synopses for query optimization from a novel information-theoretic perspective. In particular, we offer evidence in support of histograms for single-table selections, and illustrate their limitations for join queries. Second, for a broad class of common queries involving joins (specifically, all queries involving only key-foreign key joins) we show that the strategy of storing a small pre-computed sample of the database yields probabilistic guarantees that are almost space-optimal, in the sense that in order to provide the same guarantee as sampling, any strategy requires almost the same amount of space. This is an important property if these samples are to be used as database statistics. This is the first such optimality result, to our knowledge, and suggests that pre-computed samples might be an effective way to circumvent the error propagation problem for queries with key-foreign key joins. We support this result empirically through an experimental study that demonstrates the effectiveness of pre-computed samples, and also shows the increasing difference in the effectiveness of samples versus multi-dimensional histograms as the number of joins in the query grows.

Weighted hypertree decompositions and optimal query plans

Hypertree width [22, 25] is a measure of the degree of cyclicity of hypergraphs. A number of relevant problems from different areas, e.g., the evaluation of conjunctive queries in database theory or constraint satisfaction in AI, are tractable when their underlying hypergraphs have bounded hypertree width. However, in practical contexts like the evaluation of database queries, we have more information besides the structure of queries. For instance, we know the number of tuples in relations, the selectivity of attributes, and so on. In fact, all commercial query optimizers are based on quantitative methods and do not care about structural properties. In this paper, we define the notion of weighted hypertree decomposition, in order to combine structural decomposition methods with quantitative approaches. Weighted hypertree decompositions are equipped with cost functions that can be used for modelling many situations where we have further information on the given problem, besides its hypergraph representation. We analyze the complexity of computing the hypertree decompositions having the smallest weights, called minimal hypertree decompositions. We show that, in many cases, adding weights makes us lose tractability. However, we prove that, under some (not very severe) restrictions on the allowed cost functions and on the target hypertrees, optimal weighted hypertree decompositions can be computed in polynomial time. For some easier hypertree weighting functions, this problem is also highly parallelizable. Then, we provide a cost function that models query evaluation costs and show how to exploit weighted hypertree decompositions for determining (logical) query plans for answering conjunctive queries. Finally, we present the results of an experimental comparison of this query optimization technique with the query optimization of a commercial DBMS. These preliminary results are very promising, as for some large queries (with many joins) our hybrid technique clearly outperforms the commercial optimizer.

SESSION: Invited tutorial 2

Trees, automata and XML

Formal languages play an important role in many aspects of XML processing. This is obvious for type specifications (such as DTDs), which use context-free grammars, and for navigation in documents (as in XPath), which is based on regular expressions. But the investigation of query, typing, navigation, and transformation languages for XML has used many more concepts from formal language theory, in particular many different kinds of string and tree automata. The close connection between automata and logics allows a declarative specification of queries and transformations that can be evaluated or performed by tree automata. This connection also facilitates the investigation of the expressive power of query and transformation languages. Furthermore, in many cases automata characterizations enable static analysis, such as containment and satisfiability tests for queries or type checking for transformations. The tutorial will give a gentle introduction to the connections between XML languages and various kinds of automata and will survey some classical and recent results in this area.

SESSION: Data exchange II

On the complexity of optimal K-anonymity

The technique of k-anonymization has been proposed in the literature as an alternative way to release public information while ensuring both data privacy and data integrity. We prove that two general versions of optimal k-anonymization of relations are NP-hard, including the suppression version, which amounts to choosing a minimum number of entries to delete from the relation. We also present a polynomial-time algorithm for optimal k-anonymity that achieves an approximation ratio independent of the size of the database when k is constant. In particular, it is an O(k log k)-approximation where the constant in the big-O is no more than 4. However, the runtime of the algorithm is exponential in k. A slightly more clever algorithm removes this condition, but is an O(k log m)-approximation, where m is the degree of the relation. We believe this algorithm could potentially be quite fast in practice.
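
For concreteness, here is a hedged sketch of the definition behind the suppression version (the column names, data, and the '*' suppression convention are our own illustration): a relation is k-anonymous when every combination of quasi-identifier values that appears, appears at least k times.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Check k-anonymity with respect to the chosen quasi-identifier columns.

    Suppressed entries are represented by '*'; a row group is the multiset of
    rows sharing the same quasi-identifier tuple.  (Hedged illustration only.)
    """
    groups = Counter(tuple(r[c] for c in quasi_identifiers) for r in rows)
    return all(count >= k for count in groups.values())

rows = [
    {"zip": "4740*", "age": "2*", "disease": "flu"},
    {"zip": "4740*", "age": "2*", "disease": "cold"},
    {"zip": "4790*", "age": "3*", "disease": "flu"},
    {"zip": "4790*", "age": "3*", "disease": "asthma"},
]
print(is_k_anonymous(rows, ["zip", "age"], k=2))   # True: every group has 2 rows
```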

Locally consistent transformations and query answering in data exchange

Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema. Given a source instance, there may be many solutions: target instances that satisfy the constraints of the data exchange problem. Previous work has identified two classes of desirable solutions: canonical universal solutions, and their cores. Query answering in data exchange amounts to rewriting a query over the target schema to another query that, over a materialized target instance, gives the result that is semantically consistent with the source. A basic question is then whether there exists a transformation sending a source instance into a solution over which target queries can be answered. We show that the answer is negative for many data exchange transformations that have structural properties similar to canonical universal solutions and cores. Namely, we prove that many such transformations preserve the local structure of the data. Using this notion, we further show that every target query rewritable over such a transformation cannot distinguish tuples whose neighborhoods in the source are similar. This gives us a first tool for checking whether a query is rewritable. We also show that these results are robust: they hold for an extension of relational calculus with grouping and aggregates, and for two different semantics of query answering.

Logical foundations of peer-to-peer data integration

In peer-to-peer data integration, each peer exports data in terms of its own schema, and data interoperation is achieved by means of mappings among the peer schemas. Peers are autonomous systems and mappings are dynamically created and changed. One of the challenges in these systems is answering queries posed to one peer taking into account the mappings. Obviously, query answering strongly depends on the semantics of the overall system. In this paper, we compare the commonly adopted approach of interpreting peer-to-peer systems using a first-order semantics, with an alternative approach based on epistemic logic. We consider several central properties of peer-to-peer systems: modularity, generality, and decidability. We argue that the approach based on epistemic logic is superior with respect to all the above properties. In particular, we show that, in systems in which peers have decidable schemas and conjunctive mappings, but are arbitrarily interconnected, the first-order approach may lead to undecidability of query answering, while the epistemic approach always preserves decidability. This is a fundamental property, since the actual interconnections among peers are not under the control of any actor in the system.

SESSION: Data streams

Adaptive sampling for geometric problems over data streams

Geometric coordinates are an integral part of many data streams. Examples include sensor locations in environmental monitoring, vehicle locations in traffic monitoring or battlefield simulations, scientific measurements of earth or atmospheric phenomena, etc. How can one summarize such data streams using limited storage so that many natural geometric queries can be answered faithfully? Some examples of such queries are: report the smallest convex region in which a chemical leak has been sensed, or track the diameter of the dataset. One can also pose queries over multiple streams: track the minimum distance between the convex hulls of two data streams; or report when datasets A and B are no longer linearly separable. In this paper, we propose an adaptive sampling scheme that gives provably optimal error bounds for extremal problems of this nature. All our results follow from a single technique for computing the approximate convex hull of a point stream in a single pass. Our main result is this: given a stream of two-dimensional points and an integer r, we can maintain an adaptive sample of at most 2r + 1 points such that the distance between the true convex hull and the convex hull of the sample points is O(D/r²), where D is the diameter of the sample set. With our sample convex hull, all the queries mentioned above can be answered in either O(log r) or O(r) time.
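
A hedged, simplified sketch of the one-pass idea follows (our simplification: the directions below are fixed rather than adapted to the data, which is where the paper's scheme and its error bound are stronger): keep, for each of 2r + 1 directions, the point seen so far that is extreme in that direction.

```python
import math

def directional_hull_sample(points, r):
    """Maintain at most 2r + 1 sample points whose convex hull approximates the
    hull of the whole stream (hedged simplification of the paper's adaptive
    scheme; here the sampling directions are fixed).
    """
    dirs = [2 * math.pi * i / (2 * r + 1) for i in range(2 * r + 1)]
    extremes = [None] * len(dirs)
    best = [-math.inf] * len(dirs)
    for (x, y) in points:                       # single pass over the stream
        for i, theta in enumerate(dirs):
            score = x * math.cos(theta) + y * math.sin(theta)
            if score > best[i]:
                best[i], extremes[i] = score, (x, y)
    return extremes

stream = [(0, 0), (4, 1), (2, 5), (-1, 3), (3, 3)]
print(directional_hull_sample(stream, r=2))
```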

Flexible time management in data stream systems

Continuous queries in a Data Stream Management System (DSMS) rely on time as a basis for windows on streams and for defining a consistent semantics for multiple streams and updatable relations. The system clock in a centralized DSMS provides a convenient and well-behaved notion of time, but often it is more appropriate for a DSMS application to define its own notion of time: its own clock(s), sequence numbers, or other forms of ordering and timestamping. Flexible application-defined time poses challenges to the DSMS, since streams may be out of order and uncoordinated with each other, they may incur latency reaching the DSMS, and they may pause or stop. We formalize these challenges and specify how to generate heartbeats so that queries can be evaluated correctly and continuously in an application-defined time domain. Our heartbeat generation algorithm is based on parameters capturing skew between streams, unordering within streams, and latency in streams reaching the DSMS. We also describe how to estimate these parameters at run time, and we discuss how heartbeats can be used for processing continuous queries.
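
A hedged, simplified sketch of what a heartbeat computation might look like (our own simplification; the paper's algorithm treats skew, unordering, and latency more carefully and estimates the parameters at run time):

```python
def heartbeat(max_seen, unordering, latency, wall_clock):
    """Hedged sketch of heartbeat generation.

    A heartbeat tau promises that no tuple with application timestamp <= tau
    will arrive later, so continuous queries may safely emit results up to tau.
    max_seen maps each stream to the largest application timestamp seen so far.
    (Skew between stream clocks is ignored in this simplification.)
    """
    bounds = []
    for s, m in max_seen.items():
        from_order = m - unordering[s]        # late tuples trail the max by at most this much
        from_clock = wall_clock - latency[s]  # a silent stream still advances via its latency bound
        bounds.append(max(from_order, from_clock))
    return min(bounds)

print(heartbeat({"A": 105, "B": 98},
                unordering={"A": 3, "B": 1},
                latency={"A": 10, "B": 10},
                wall_clock=110))   # -> 100
```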

Power-conserving computation of order-statistics over sensor networks

We study the problem of power-conserving computation of order statistics in sensor networks. Significant power-reducing optimizations have been devised for computing simple aggregate queries such as COUNT, AVERAGE, or MAX over sensor networks. In contrast, aggregate queries such as MEDIAN have seen little progress over the brute-force approach of forwarding all data to a central server. Moreover, battery life of current sensors seems largely determined by communication costs; therefore we aim to minimize the number of bytes transmitted. Unoptimized aggregate queries typically impose extremely high power consumption on a subset of sensors located near the server. Metrics such as total communication cost underestimate the penalty of such imbalance: network lifetime may be dominated by the worst-case replacement time for depleted batteries. In this paper, we design the first algorithms for computing order statistics such that power consumption is balanced across the entire network. Our first main result is a distributed algorithm to compute an ε-approximate quantile summary of the sensor data such that each sensor transmits only O(log² n / ε) data values, irrespective of the network topology, an improvement over the current worst-case behavior of Ω(n). Second, we show an improved result when the height, h, of the network is significantly smaller than n. Our third result is that we can exactly compute any order statistic (e.g., the median) in a distributed manner such that each sensor needs to transmit O(log³ n) values. Further, we design the aggregates used by our algorithms to be decomposable. An aggregate Q over a set S is decomposable if there exists a function f such that for all S = S1 ∪ S2, Q(S) = f(Q(S1), Q(S2)). We can thus directly apply existing optimizations to decomposable aggregates that increase error-resilience and reduce communication cost. Finally, we validate our results empirically, through simulation. When we compute the median exactly, we show that, even for moderate-size networks, the worst communication cost for any single node is several times smaller than the corresponding cost in prior median algorithms. We show similar cost reductions when computing approximate order-statistic summaries with guaranteed precision. In all cases, our total communication cost over the entire network is smaller than or equal to the total cost of prior algorithms.
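
To make decomposability concrete, here is a hedged toy sketch (our own summary construction, much cruder than the paper's): each sensor or subtree produces a small quantile summary, and summaries are merged pairwise on the way up the routing tree, so the server never needs the raw data.

```python
import random

def summarize(values, eps):
    """Hedged sketch of an approximate quantile summary: keep roughly every
    (eps * n)-th element of the sorted data (illustrative only; the paper's
    in-network summaries are more refined and bound per-sensor traffic).
    """
    s = sorted(values)
    step = max(1, int(eps * len(s)))
    return s[::step]

def merge(summary_a, summary_b):
    """Decomposability in action: the merged summary depends only on the two
    child summaries, so it can be computed inside the network as data flows
    up the routing tree toward the server.
    """
    return sorted(summary_a + summary_b)

def approx_quantile(summary, phi):
    return summary[int(phi * (len(summary) - 1))]

random.seed(0)
left = [random.random() for _ in range(1000)]
right = [random.random() for _ in range(1000)]
merged = merge(summarize(left, 0.05), summarize(right, 0.05))
print(round(approx_quantile(merged, 0.5), 3))   # roughly the true median ~0.5
```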

Approximate counts and quantiles over sliding windows

We consider the problem of maintaining ε-approximate counts and quantiles over a stream sliding window using limited space. We consider two types of sliding windows depending on whether the number of elements N in the window is fixed (fixed-size sliding window) or variable (variable-size sliding window). In a fixed-size sliding window, both ends of the window slide synchronously over the stream. In a variable-size sliding window, an adversary slides the window ends independently, and therefore has the ability to vary the number of elements N in the window. We present various deterministic and randomized algorithms for approximate counts and quantiles. All of our algorithms require O(1/ε polylog(1/ε, N)) space. For quantiles, this space requirement is an improvement over the previous best bound of O(1/ε² polylog(1/ε, N)). We believe that no previous work on space-efficient approximate counts over sliding windows exists.

SESSION: Foundations of query languages

On the decidability of containment of recursive datalog queries - preliminary report

The problem of deciding query containment has important applications in classical query optimization and heterogeneous database systems. Query containment is undecidable for unrestricted recursive queries, and decidable for recursive monadic queries and conjunctive queries over regular path expressions. In this paper, we identify a new class of recursive queries with decidable containment. Our framework extends the aforementioned query classes by supporting recursive predicates with more than two arguments and nonlinear recursion.

Processing first-order queries under limited access patterns

We study the problem of answering queries over sources with limited access patterns. Given a first-order query Q, the problem is to decide whether there is an equivalent query which can be executed observing the access-pattern restrictions. If so, we say that Q is feasible. We define feasibility for first-order queries (previous definitions handled only some existential cases) and characterize the complexity of many first-order query classes. For each of them, we show that deciding feasibility is as hard as deciding containment. Since feasibility is undecidable in many cases and hard to decide in some others, we also define an approximation to it which can be computed in NP for any first-order query and in P for unions of conjunctive queries with negation. Finally, we outline a practical overall strategy for processing first-order queries under limited access patterns.

On preservation under homomorphisms and unions of conjunctive queries

Unions of conjunctive queries, also known as select-project-join-union queries, are the most frequently asked queries in relational database systems. These queries are definable by existential positive first-order formulas and are preserved under homomorphisms. A classical result of mathematical logic asserts that existential positive formulas are the only first-order formulas (up to logical equivalence) that are preserved under homomorphisms on all structures, finite and infinite. It is a long-standing open problem in finite model theory, however, to determine whether the same homomorphism-preservation result holds in the finite, that is, whether every first-order formula preserved under homomorphisms on finite structures is logically equivalent to an existential positive formula on finite structures. In this paper, we show that the homomorphism-preservation theorem holds for several large classes of finite structures of interest in graph theory and database theory. Specifically, we show that this result holds for all classes of finite structures of bounded degree, all classes of finite structures of bounded treewidth, and, more generally, all classes of finite structures whose cores exclude at least one minor.

Multi-valued dependencies in the presence of lists

Multi-valued dependencies (MVDs) are an important class of relational constraints. We axiomatise MVDs in data models that support nested list types. In order to capture several data models at once, an abstract approach based on nested attributes is taken. The set of subattributes of some fixed nested attribute carries the structure of a co-Heyting algebra. This enables us to generalise significant features of MVDs from the relational data model to the presence of lists. It is shown that an MVD is satisfied by some instance exactly when this instance can be decomposed without loss of information. The full power of the algebraic framework allows us to provide a sound and complete set of inference rules for the finite implication of MVDs in the context of lists. The presence of the list operator calls for a new inference rule which is not required in the relational data model. Further differences become apparent when the minimality of the inference rules is investigated. The extension of the relational theory of MVDs to the presence of lists makes it possible to specify more real-world constraints and therefore broadens the range of application domains.