PODS '06: Proceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems


Principles of dataspace systems

The most acute information management challenges today stem from organizations relying on a large number of diverse, interrelated data sources, but having no means of managing them in a convenient, integrated, or principled fashion. These challenges arise in enterprise and government data management, digital libraries, "smart" homes and personal information management. We have proposed dataspaces as a data management abstraction for these diverse applications and DataSpace Support Platforms (DSSPs) as systems that should be built to provide the required services over dataspaces. Unlike data integration systems, DSSPs do not require full semantic integration of the sources in order to provide useful services. This paper lays out specific technical challenges to realizing DSSPs and ties them to existing work in our field. We focus on query answering in DSSPs, the DSSP's ability to introspect on its content, and the use of human attention to enhance the semantic relationships in a dataspace.

SESSION: PODS award papers

Two-variable logic on data trees and XML reasoning

Motivated by reasoning tasks in the context of XML languages, the satisfiability problem of logics on data trees is investigated. The nodes of a data tree have a label from a finite set and a data value from a possibly infinite set. It is shown that satisfiability for two-variable first-order logic is decidable if the tree structure can be accessed only through the child and the next-sibling predicates and the access to data values is restricted to equality tests. From this main result, decidability follows both for satisfiability and containment of a data-aware fragment of XPath and for the implication problem for unary key and inclusion constraints.

An adaptive packed-memory array

The packed-memory array (PMA) is a data structure that maintains a dynamic set of N elements in sorted order in a Θ(N)-sized array. The idea is to intersperse Θ(N) empty spaces or gaps among the elements so that only a small number of elements need to be shifted around on an insert or delete. Because the elements are stored physically in sorted order in memory or on disk, the PMA can be used to support extremely efficient range queries. Specifically, the cost to scan L consecutive elements is O(1+L/B) memory transfers. This paper gives the first adaptive packed-memory array (APMA), which automatically adjusts to the input pattern. Like the original PMA, any pattern of updates costs only O(log² N) amortized element moves and O(1+(log² N)/B) amortized memory transfers per update. However, the APMA performs even better on many common input distributions, achieving only O(log N) amortized element moves and O(1+(log N)/B) amortized memory transfers. The paper analyzes sequential inserts, where the insertions are to the front of the APMA; hammer inserts, where the insertions "hammer" on one part of the APMA; random inserts, where the insertions are after random elements in the APMA; and bulk inserts, where for constant α∈[0,1], N^α elements are inserted after random elements in the APMA. The paper then gives simulation results that are consistent with the asymptotic bounds. For sequential insertions of roughly 1.4 million elements, the APMA has four times fewer element moves per insertion than the traditional PMA and running times that are more than seven times faster.
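
As a rough illustration of the underlying data-structure idea (not the paper's adaptive rebalancing analysis), the sketch below keeps a sorted array with interspersed gaps, shifts elements only up to the nearest gap on insert, and re-spreads the gaps when the array gets too dense; the class name, capacity, and density threshold are all invented for the example.

```python
class SimplePMA:
    """Toy packed-memory-array-style structure: elements stay sorted in an array
    interspersed with gaps (None), so an insert shifts elements only up to the
    nearest gap. A real PMA rebalances windows of doubling size; this sketch
    simply re-spreads the whole array when it becomes too dense."""

    def __init__(self, capacity=8, max_density=0.7):
        self.slots = [None] * capacity
        self.n = 0
        self.max_density = max_density

    def _respread(self, capacity=None):
        items = [x for x in self.slots if x is not None]
        capacity = capacity or len(self.slots)
        self.slots = [None] * capacity
        if items:
            step = capacity / len(items)          # leave evenly spaced gaps
            for i, x in enumerate(items):
                self.slots[int(i * step)] = x

    def insert(self, key):
        if self.n + 1 > self.max_density * len(self.slots):
            self._respread(2 * len(self.slots))   # grow and re-spread
        occupied = [i for i, x in enumerate(self.slots) if x is not None]
        succ = next((i for i in occupied if self.slots[i] >= key), None)
        pos = succ if succ is not None else (occupied[-1] + 1 if occupied else 0)
        gap = pos                                 # walk right to the nearest gap
        while gap < len(self.slots) and self.slots[gap] is not None:
            gap += 1
        if gap == len(self.slots):                # no gap to the right: re-spread, retry
            self._respread()
            return self.insert(key)
        self.slots[pos + 1:gap + 1] = self.slots[pos:gap]
        self.slots[pos] = key
        self.n += 1

    def range_query(self, lo, hi):
        """Range scan is a physically sequential pass over the array."""
        return [x for x in self.slots if x is not None and lo <= x <= hi]

pma = SimplePMA()
for k in [5, 1, 9, 3, 7, 2]:
    pma.insert(k)
print(pma.range_query(2, 7))   # [2, 3, 5, 7]
```

Because occupied slots stay in sorted physical order, a range query is just a sequential scan, which is where the O(1+L/B) scanning cost quoted above comes from.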

SESSION: Data exchange

The complexity of data exchange

Data exchange is the problem of transforming data structured under a source schema into data structured under a target schema in such a way that all constraints of a schema mapping are satisfied. At the heart of data exchange lies a basic decision problem, called the existence-of-solutions problem: given a source instance, is there a target instance that satisfies the constraints of the schema mapping at hand? Earlier work showed that for schema mappings specified by embedded implicational dependencies, this problem is solvable in polynomial time, assuming that (1) the schema mapping is kept fixed and (2) the constraints of the schema mapping satisfy a certain structural condition, called weak acyclicity. We investigate the effect of these assumptions on the complexity of the existence-of-solutions problem, and show that each one is indispensable in deriving polynomial-time algorithms for this problem. Specifically, using machinery from universal algebra, we show that if the weak acyclicity assumption is relaxed even in a minimal way, then the existence-of-solutions problem becomes undecidable. We also show that if, in addition to the source instance, the schema mapping is part of the input, then the existence-of-solutions problem becomes EXPTIME-complete. Thus, there is a provable exponential gap between the data complexity and the combined complexity of data exchange. Finally, we study restricted classes of schema mappings and develop a comprehensive picture for the combined complexity of the existence-of-solutions problem for these restrictions. In particular, depending on the restriction considered, the combined complexity of this problem turns out to be either EXPTIME-complete or coNP-complete.
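
As a concrete companion to the existence-of-solutions problem, here is a minimal chase sketch for source-to-target tgds only (no target constraints, which is where the complexity results above really bite): each triggered tgd adds target facts, with fresh labeled nulls for existential variables, producing a canonical universal solution. The relation names, the tgd encoding, and the restriction to binary atoms are inventions of the example.

```python
from itertools import count

# Toy chase for source-to-target tgds over binary relations, written as
#   ("R", [("S", ("x", "z")), ("T", ("z", "y"))])   # R(x,y) -> exists z. S(x,z) & T(z,y)
# Variables other than "x"/"y" are existential and become fresh labeled nulls.

fresh = (f"N{i}" for i in count())

def chase(source, st_tgds):
    target = []
    for rel, tuples in source.items():
        for (x, y) in tuples:
            for src_rel, head in st_tgds:
                if src_rel != rel:
                    continue
                nulls = {}       # one fresh null per existential variable, per trigger
                for tgt_rel, var_names in head:
                    args = []
                    for v in var_names:
                        if v in ("x", "y"):
                            args.append(x if v == "x" else y)
                        else:
                            if v not in nulls:
                                nulls[v] = next(fresh)
                            args.append(nulls[v])
                    target.append((tgt_rel, tuple(args)))
    return target

source = {"R": [("a", "b"), ("b", "c")]}
st_tgds = [("R", [("S", ("x", "z")), ("T", ("z", "y"))])]
print(chase(source, st_tgds))
# [('S', ('a', 'N0')), ('T', ('N0', 'b')), ('S', ('b', 'N1')), ('T', ('N1', 'c'))]
```

With target EGDs and TGDs in the picture, the chase may also merge nulls or fire further dependencies, and weak acyclicity is what keeps that process finite; that is exactly the assumption the undecidability result above shows cannot be relaxed.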

Data exchange: computing cores in polynomial time

Data exchange deals with inserting data from one database into another database having a different schema. We study and solve a central computational problem of data exchange, namely, computing the core of a universal solution to a data exchange problem. Fagin, Kolaitis, and Popa [9] have shown that among the universal solutions of a solvable data exchange problem, there exists a most compact one (up to isomorphism), "the core" (of any universal solution), and have convincingly argued that this core should be the solution of choice. They stated as an important open problem whether the core of a universal solution can be computed in polynomial time in the general setting where the source-to-target constraints are arbitrary tuple-generating dependencies (TGDs) and the target constraints consist of equality-generating dependencies (EGDs) and weakly acyclic TGDs. In this paper we solve this problem by developing new efficient methods for computing the core of a universal solution. This positive result shows that the core approach of Fagin, Kolaitis, and Popa is feasible and applicable in a very general setting and thus provides further momentum to the use of cores in data exchange.

Inverting schema mappings

A schema mapping is a specification that describes how data structured under one schema (the source schema) is to be transformed into data structured under a different schema (the target schema). Although the notion of an inverse of a schema mapping is important, the exact definition of an inverse mapping is somewhat elusive. This is because a schema mapping may associate many target instances with each source instance, and many source instances with each target instance. Based on the notion that the composition of a mapping and its inverse is the identity, we give a formal definition for what it means for a schema mapping M′ to be an inverse of a schema mapping M for a class S of source instances. We call such an inverse an S-inverse. A particular case of interest arises when S is the class of all instances, in which case an S-inverse is a global inverse. We focus on the important and practical case of schema mappings defined by source-to-target tuple-generating dependencies, and uncover a rich theory. When S is defined by a set of dependencies with a finite chase, we show how to construct an S-inverse when one exists. In particular, we show how to construct a global inverse when one exists. Given M and M′, we show how to define the largest class S such that M′ is an S-inverse of M.

Data exchange and incomplete information

Data exchange is the problem of finding an instance of a target schema, given an instance of a source schema and a specification of the relationship between the source and the target, and answering queries over target instances in a way that is semantically consistent with the information in the source. Theoretical foundations of data exchange have been actively explored recently. It was also noticed that the standard certain answers semantics may behave in very odd ways. In this paper I explain that this behavior is due to the fact that the presence of incomplete information in target instances has been ignored; in particular, proper query evaluation techniques for databases with nulls have not been used, and the distinction between closed and open world semantics has not been made. I present a concept of target solutions based on the closed world assumption, and show that the space of all solutions has two extreme points: the canonical universal solution and the core, well studied in data exchange. I show how to define semantics of query answering taking into account incomplete information, and show that the well-known anomalies go away with the new semantics. The paper also contains results on the complexity of query answering, upper approximations to queries (maybe-answers), and various extensions.

SESSION: Analysis of queries and workflows

Equivalence of queries combining set and bag-set semantics

The query equivalence problem has been studied extensively for set semantics and, more recently, for bag-set semantics. However, SQL queries often combine set and bag-set semantics. For example, an SQL query that returns a multiset of elements may call a subquery or view that returns a set of elements. As another example, in SQL one can compute a multiset-union of queries, each of which returns a set of answers. This paper presents combined semantics, which formally models query evaluation combining set and bag-set semantics. The equivalence problem for queries evaluated under combined semantics is studied. A sufficient condition for equivalence is presented. For several important common classes of queries, necessary and sufficient conditions for equivalence are presented.
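
One concrete way to see the distinction, as a small sketch with made-up relations (not an example from the paper): the two conjunctive queries below are equivalent under set semantics but not under bag-set semantics, and evaluating the subquery as a set (like a DISTINCT view in SQL) inside a multiset-returning outer query changes the multiplicities yet again.

```python
from collections import Counter

# Base relation R(a, b), stored as a set (bag-set semantics: inputs are duplicate-free).
R = {("x", 1), ("x", 2), ("y", 3)}

def q_proj(rel):
    """Q(a) :- R(a, b): project onto the first column, keeping duplicates (a bag)."""
    return Counter(a for (a, b) in rel)

def q_selfjoin(rel):
    """Q'(a) :- R(a, b), R(a, c): self-join on the first column, then project (a bag)."""
    return Counter(a1 for (a1, _) in rel for (a2, _) in rel if a1 == a2)

bag1, bag2 = q_proj(R), q_selfjoin(R)
print(bag1)                      # e.g. Counter({'x': 2, 'y': 1})
print(bag2)                      # e.g. Counter({'x': 4, 'y': 1})  -> not bag-set equivalent
print(set(bag1) == set(bag2))    # True -> equivalent under set semantics

# Combined semantics: the subquery is evaluated as a set, the outer query as a bag,
# so the duplicates that q_proj would have produced are lost.
print(Counter(set(q_proj(R))))   # Counter({'x': 1, 'y': 1})
```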

The containment problem for Real conjunctive queries with inequalities

Query containment is a fundamental algorithmic problem in database query processing and optimization. Under set semantics, the query-containment problem for conjunctive queries has long been known to be NP-complete. In real database systems, however, queries are usually evaluated under bag semantics, not set semantics. In particular, SQL queries are evaluated under bag semantics and return multisets as answers, since duplicates are not eliminated unless explicitly requested. The exact complexity of the query-containment problem for conjunctive queries under bag semantics has been an open problem for more than a decade; in fact, it is not even known whether this problem is decidable. Here, we investigate, under bag semantics, the query-containment problem for conjunctive queries with inequalities. It has been previously shown that, under set semantics, this problem is complete for the second level of the polynomial hierarchy. Our main result asserts that, under bag semantics, the query-containment problem for conjunctive queries with inequalities is undecidable. Actually, we establish the stronger result that this problem is undecidable even if the following two restrictions hold at the same time: (1) the queries use just a single binary relation; and (2) the total number of inequalities is bounded by a certain fixed value. Moreover, the same undecidability results hold under bag-set semantics.

Verification of communicating data-driven web services

We study the verification of compositions of Web Service peers which interact asynchronously by exchanging messages. Each peer has access to a local database and reacts to user input and incoming messages by performing various actions and sending messages. The reaction is described by queries over the database, internal state, user input and received messages. We consider two formalisms for specification of correctness properties of compositions, namely Linear Temporal First-Order Logic and Conversation Protocols. For both formalisms, we map the boundaries of verification decidability, showing that they include expressive classes of compositions and properties. We also address modular verification, in which the correctness of a composition is predicated on the properties of its environment.

Analyzing workflows implied by instance-dependent access rules

Recently proposed form-based web information systems liberate the capture and reuse of data in organizations by replacing the development of technical implementations of electronic forms with the conceptual modelling of forms' tree-structured schemas and their data access rules. Significantly, these instance-dependent rules also imply a workflow process associated with a form, eliminating the need for a costly workflow design phase. Instead, the workflows thus created in an ad hoc manner by unsophisticated end-users can be automatically analyzed, and incorrect forms rejected. This paper examines fundamental correctness properties of workflows that are implied by instance-dependent access rules. Specifically, we study the decidability of the form completability property and the semi-soundness of a form's workflow. These problems are affected by a choice of constraints on the path language used to express access rules and completion formulas, and on the depth of the form's schema tree. Hence, we study these problems by examining them in the context of several different fragments determined by such constraints.

TUTORIAL SESSION: Invited tutorial 1

From statistical knowledge bases to degrees of belief: an overview

An intelligent agent will often be uncertain about various properties of its environment, and when acting in that environment it will frequently need to quantify its uncertainty. For example, if the agent wishes to employ the expected-utility paradigm of decision theory to guide its actions, it will need to assign degrees of belief (subjective probabilities) to various assertions. Of course, these degrees of belief should not be arbitrary, but rather should be based on the information available to the agent. This paper provides a brief overview of one approach for inducing degrees of belief from very rich knowledge bases that can include information about particular individuals, statistical correlations, physical laws, and default rules. The approach is called the random-worlds method. The method is based on the principle of indifference: it treats all of the worlds the agent considers possible as being equally likely. It is able to integrate qualitative default reasoning with quantitative probabilistic reasoning by providing a language in which both types of information can be easily expressed. A number of desiderata that arise in direct inference (reasoning from statistical information to conclusions about individuals) and default reasoning follow directly from the semantics of random worlds. For example, random worlds captures important patterns of reasoning such as specificity, inheritance, indifference to irrelevant information, and default assumptions of independence. Furthermore, the expressive power of the language used and the intuitive semantics of random worlds allow the method to deal with problems that are beyond the scope of many other non-deductive reasoning systems. The relevance of the random-worlds method to database systems is also discussed.

SESSION: Database design

On redundancy vs dependency preservation in normalization: an information-theoretic study of 3NF

A recently introduced information-theoretic approach to analyzing redundancies in database design was used to justify normal forms like BCNF that completely eliminate redundancies. The main notion is that of an information content of each datum in an instance (which is a number in [0,1]): the closer to 1, the less redundancy it carries. In practice, however, one usually settles for 3NF which, unlike BCNF, may not eliminate all redundancies but always guarantees dependency preservation. In this paper we use the information-theoretic approach to prove that 3NF is the best normal form if one needs to achieve dependency preservation. For each dependency-preserving normal form, we define the price of dependency preservation as an information-theoretic measure of redundancy that gets introduced to compensate for dependency preservation. This is a number in the [0,1] range: the smaller it is, the less redundancy a normal form guarantees. We prove that for every dependency-preserving normal form, the price of dependency preservation is at least 1/2, and it is precisely 1/2 for 3NF. Hence, 3NF has the least amount of redundancy among all dependency-preserving normal forms. We also show that, information-theoretically, unnormalized schemas have at least twice the amount of redundancy as schemas in 3NF.

Tractable database design through bounded treewidth

Given that most elementary problems in database design are NP-hard, the currently used database design algorithms produce suboptimal results. For example, the current 3NF decomposition algorithms may continue further decomposing a relation even though it is already in 3NF. In this paper we study database design problems whose sets of functional dependencies have bounded treewidth. For such sets, which frequently occur in practice, we develop polynomial-time and highly parallelizable algorithms for a number of central database design problems, such as:
  • primality of an attribute,
  • the 3NF-test for a relational schema or subschema, and
  • the BCNF-test for a subschema.
For establishing these results, we propose a new characterization of keys and of the primality of a single attribute. In order to define the treewidth of a relational schema, we associate a hypergraph with it. Note that there are two main possibilities for defining the treewidth of a hypergraph H: one is via the primal graph of H and one is via the incidence graph of H. Our algorithms apply to the case where the primal graph is considered. However, we also show that the tractability results still hold when the incidence graph is considered instead.

SESSION: Popularity and privacy

Evolution of page popularity under random web graph models

The link structure of the Web can be viewed as a massive graph. The preferential attachment model and its variants are well-known random graph models that help explain the evolution of the web graph. However, those models assign more links to older pages without reference to the quality of web pages, which does not capture the real-world evolution of the web graph and renders the models inappropriate for studying the popularity evolution of new pages. We extend the preferential attachment model with page quality, where the probability of a page getting new links depends not only on its current degree but also on its quality. We study the distribution of degrees among different quality values, and prove that under discrete quality distributions, the degree sequence still follows a power law distribution. Then we use the model to study the evolution of page popularity. We show that for pages with the same quality, the older pages are more popular; if a younger page is better than an older page, then eventually the younger-and-better page will become more popular. We also use the model to study a randomized ranking scheme proposed earlier [18] and show that it accelerates popularity evolution of new pages.
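
A minimal simulation of this kind of model may help; the multiplicative degree-times-quality attachment rule, the quality values, and the number of links per new page are all chosen for illustration rather than taken from the paper.

```python
import random

def evolve(steps, qualities, links_per_step=3):
    """Toy quality-aware preferential attachment: each arriving page gets a random
    quality, and each of its links picks an existing page with probability
    proportional to degree * quality (one simple way to combine the two factors)."""
    quality = [random.choice(qualities), random.choice(qualities)]
    degree = [1, 1]                               # two initial, mutually linked pages
    for _ in range(steps):
        quality.append(random.choice(qualities))
        degree.append(0)
        new = len(degree) - 1
        weights = [d * q for d, q in zip(degree[:new], quality[:new])]
        for _ in range(links_per_step):
            target = random.choices(range(new), weights=weights, k=1)[0]
            degree[target] += 1
            degree[new] += 1
    return degree, quality

degree, quality = evolve(steps=5000, qualities=[1, 2, 4])
top = sorted(range(len(degree)), key=lambda i: -degree[i])[:10]
print([(i, degree[i], quality[i]) for i in top])  # early and/or high-quality pages dominate
```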

Privacy via pseudorandom sketches

Imagine a collection of individuals who each possess private data that they do not wish to share with a third party. This paper considers how individuals may represent and publish their own data so as to simultaneously preserve their privacy and to ensure that it is possible to extract large-scale statistical behavior from the original unperturbed data. Existing techniques for perturbing data are limited by the number of users required to obtain approximate answers to queries, the richness of preserved statistical behavior, the privacy guarantees given and/or the amount of data that each individual must publish. This paper introduces a new technique to describe parts of an individual's data that is based on pseudorandom sketches. The sketches guarantee that each individual's privacy is provably maintained assuming one of the strongest definitions of privacy that we are aware of: given unlimited computational power and arbitrary partial knowledge, the attacker cannot learn any additional private information from the published sketches. However, sketches from multiple users that describe a subset of attributes can be used to estimate the fraction of users that satisfy any conjunction over the full set of negated or unnegated attributes, provided that there are enough users. We show that the error of approximation is independent of the number of attributes involved and only depends on the number of users available. An additional benefit is that the size of the sketch is minuscule: O(log log M) bits, where M is the number of users. Finally, we show how sketches can be combined to answer more complex queries. An interesting property of our approach is that despite using cryptographic primitives, our privacy guarantees do not rely on any unproven cryptographic conjectures.

Achieving anonymity via clustering

Publishing data for analysis from a table containing personal records, while maintaining individual privacy, is a problem of increasing importance today. The traditional approach of de-identifying records is to remove identifying fields such as social security number, name, etc. However, recent research has shown that a large fraction of the US population can be identified using non-key attributes (called quasi-identifiers) such as date of birth, gender, and zip code [15]. Sweeney [16] proposed the k-anonymity model for privacy, where non-key attributes that leak information are suppressed or generalized so that, for every record in the modified table, there are at least k−1 other records having exactly the same values for quasi-identifiers. We propose a new method for anonymizing data records, where quasi-identifiers of data records are first clustered and then cluster centers are published. To ensure privacy of the data records, we impose the constraint that each cluster must contain no fewer than a pre-specified number of data records. This technique is more general since we have a much larger choice for cluster centers than in k-anonymity. In many cases, it lets us release a lot more information without compromising privacy. We also provide constant-factor approximation algorithms to come up with such a clustering. This is the first set of algorithms for the anonymization problem where the performance is independent of the anonymity parameter k. We further observe that a few outlier points can significantly increase the cost of anonymization. Hence, we extend our algorithms to allow an ε fraction of points to remain unclustered, i.e., deleted from the anonymized publication. Thus, by not releasing a small fraction of the database records, we can ensure that the data published for analysis has less distortion and hence is more useful. Our approximation algorithms for new clustering objectives are of independent interest and could be applicable in other clustering scenarios as well.
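
The sketch below is a naive greedy stand-in for the publishing scheme (only cluster centers and sizes are released, and every cluster holds at least r records); it is not the constant-factor approximation algorithm of the paper, and the helper names and the numeric encoding of quasi-identifiers are assumptions of the example.

```python
import random

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def centroid(cluster):
    dims, k = len(cluster[0]), len(cluster)
    return tuple(sum(rec[d] for rec in cluster) / k for d in range(dims))

def publish_centers(records, r):
    """Greedy toy anonymization: repeatedly pick a seed record, group it with its
    r-1 nearest remaining neighbours, and publish only (center, cluster size).
    Assumes numeric quasi-identifier vectors and len(records) >= r."""
    remaining = list(records)
    random.shuffle(remaining)
    published = []
    while len(remaining) >= 2 * r:        # keep at least r records for the final cluster
        seed = remaining.pop()
        remaining.sort(key=lambda p: dist(p, seed))
        cluster = [seed] + [remaining.pop(0) for _ in range(r - 1)]
        published.append((centroid(cluster), len(cluster)))
    published.append((centroid(remaining), len(remaining)))
    return published

# quasi-identifiers: (year of birth, zip-code prefix)
records = [(1970, 941), (1971, 941), (1972, 940), (1990, 100), (1991, 100), (1989, 101)]
print(publish_centers(records, r=3))      # two centers, each covering >= 3 records
```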

On the efficiency of checking perfect privacy

Privacy-preserving query-answering systems answer queries while provably guaranteeing that sensitive information is kept secret. One very attractive notion of privacy is perfect privacy: a secret is expressed through a query Q_S, and a query Q_V is answered only if it discloses no information about the secret query Q_S. However, if Q_S and Q_V are arbitrary conjunctive queries, the problem of checking whether Q_V discloses any information about Q_S is known to be Π^p_2-complete. In this paper, we show that for large, interesting subclasses of conjunctive queries, enforcing perfect privacy is tractable. Instead of giving different arguments for query classes of varying complexity, we make a connection between perfect privacy and the problem of checking query containment. We then use this connection to relate the complexity of enforcing perfect privacy to the complexity of query containment.

SESSION: Query optimization

Finding and approximating top-k answers in keyword proximity search

Various approaches for keyword proximity search have been implemented in relational databases, XML and the Web. Yet, in all of them, an answer is a Q-fragment, namely, a subtree T of the given data graph G, such that T contains all the keywords of the query Q and has no proper subtree with this property. The rank of an answer is inversely proportional to its weight. Three problems are of interest: finding an optimal (i.e., top-ranked) answer, computing the top-k answers and enumerating all the answers in ranked order. It is shown that, under data complexity, an efficient algorithm for solving the first problem is sufficient for solving the other two problems with polynomial delay. Similarly, an efficient algorithm for finding a θ-approximation of the optimal answer suffices for carrying out the following two tasks with polynomial delay, under query-and-data complexity. First, enumerating in a (θ+1)-approximate order. Second, computing a (θ+1)-approximation of the top-k answers. As a corollary, this paper gives the first efficient algorithms, under data complexity, for enumerating all the answers in ranked order and for computing the top-k answers. It also gives the first efficient algorithms, under query-and-data complexity, for enumerating in a provably approximate order and for computing an approximation of the top-k answers.

Efficiently ordering subgoals with access constraints

In this paper, we study the problem of ordering subgoals under binding pattern restrictions for queries posed as nonrecursive Datalog programs. We prove that, despite the limited expressive power of such programs, the problem is computationally hard: PSPACE-complete in the size of the nonrecursive Datalog program, even for fairly restricted cases. As a practical solution to this problem, we develop an asymptotically optimal algorithm that runs in time linear in the size of the query plan. We also study extensions of our algorithm that efficiently solve other query planning problems under binding pattern restrictions. These problems include conjunctive queries with nested grouping constraints, distributed conjunctive queries, and first-order queries.

Flow algorithms for two pipelined filter ordering problems

Pipelined filter ordering is a central problem in database query optimization, and has received renewed attention recently in the context of environments such as the web, continuous high-speed data streams, and sensor networks. We present algorithms for two natural extensions of the classical pipelined filter ordering problem: (1) a distributional type problem where the filters run in parallel and the goal is to maximize throughput, and (2) an adversarial type problem where the goal is to minimize the expected value of multiplicative regret. We show that both problems can be solved using similar flow algorithms, which find an optimal ordering scheme in time O(n²), where n is the number of filters. Our algorithm for (1) improves on an earlier O(n³ log n) algorithm of Kodialam.
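
For orientation, the classical single-pipeline version of the problem has a simple closed-form solution: order filters by cost divided by drop probability. The sketch below shows only that baseline (field names invented); the two extensions in the paper, parallel throughput maximization and multiplicative regret, are what call for the flow-based algorithms.

```python
def order_filters(filters):
    """Classical pipelined filter ordering: filter i has cost c_i and independent
    probability p_i of letting a tuple through; sorting by c_i / (1 - p_i)
    minimizes the expected per-tuple cost."""
    return sorted(filters, key=lambda f: f["cost"] / (1.0 - f["pass_prob"]))

def expected_cost(ordering):
    """Expected work per tuple: filter i runs only if all earlier filters passed."""
    cost, reach = 0.0, 1.0
    for f in ordering:
        cost += reach * f["cost"]
        reach *= f["pass_prob"]
    return cost

filters = [{"cost": 4.0, "pass_prob": 0.2},
           {"cost": 1.0, "pass_prob": 0.9},
           {"cost": 2.0, "pass_prob": 0.5}]
print(expected_cost(order_filters(filters)))   # no other ordering does better
```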

Asking the right questions: model-driven optimization using probes

In several database applications, parameters like selectivities and load are known only with some associated uncertainty, which is specified, or modeled, as a distribution over values. The performance of query optimizers and monitoring schemes can be improved by spending resources like time or bandwidth on observing or resolving these parameters, so that better query plans can be generated. In a resource-constrained situation, deciding which parameters to observe in order to best optimize the expected quality of the plan generated (or, in general, to optimize the expected value of a certain objective function) itself becomes an interesting optimization problem. We present a framework for studying such problems, and present several scenarios arising in anomaly detection in complex systems, monitoring extreme values in sensor networks, load shedding in data stream systems, and estimating rates in wireless channels and minimum-latency routes in networks, which can be modeled in this framework with the appropriate objective functions. Even for several simple objective functions, we show the problems are NP-hard. We present greedy algorithms with good performance bounds. The proofs of the performance bounds are via novel submodularity arguments.

TUTORIAL SESSION: Invited tutorial 2

Processing queries on tree-structured data efficiently

This is a survey of algorithms, complexity results, and general solution techniques for efficiently processing queries on tree-structured data. I focus on query languages that compute nodes or tuples of nodes—conjunctive queries, first-order queries, datalog, and XPath. I also point out a number of connections among previous results that have not been observed before.

SESSION: Query processing

Scalable computation of acyclic joins

The join operation of relational algebra is a cornerstone of relational database systems. Computing the join of several relations is NP-hard in general, whereas special (and typical) cases are tractable. This paper considers joins having an acyclic join graph, for which current methods initially apply a full reducer to efficiently eliminate tuples that will not contribute to the result of the join. From a worst-case perspective, previous algorithms for computing an acyclic join of k fully reduced relations, occupying a total of n≥k blocks on disk, use Ω((n+z)k) I/Os, where z is the size of the join result in blocks. In this paper we show how to compute the join in a time bound that is within a constant factor of the cost of running a full reducer plus sorting the output. For a broad class of acyclic join graphs this is O(sort(n+z)) I/Os, removing the dependence on k from previous bounds. Traditional methods decompose the join into a number of binary joins, which are then carried out one by one. Departing from this approach, our technique is based on computing the size of certain subsets of the result, and using these sizes to compute the location(s) of each data item in the result. Finally, as an initial study of cyclic joins in the I/O model, we show how to compute a join whose join graph is a 3-cycle, in O(n²/m + sort(n+z)) I/Os, where m is the number of blocks in internal memory.
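
As background for the full-reducer step mentioned above, here is a small in-memory sketch of semijoin reduction on a path-shaped acyclic join; the relation contents and the position-based join encoding are made up for the example, and the paper's actual contribution, computing the join itself in O(sort(n+z)) I/Os, is not attempted here.

```python
def semijoin(left, right, on):
    """Keep the tuples of `left` that join with some tuple of `right`;
    `on` = (positions of the shared attributes in left, positions in right)."""
    keys = {tuple(t[i] for i in on[1]) for t in right}
    return [t for t in left if tuple(t[i] for i in on[0]) in keys]

# Path-shaped acyclic join R(a,b) JOIN S(b,c) JOIN T(c,d): a full reducer sweeps
# semijoins up and then down the join tree; afterwards every surviving tuple
# participates in the final join result.
R = [(1, 10), (2, 20), (3, 30)]
S = [(10, 100), (20, 200), (40, 400)]
T = [(100, "x"), (300, "y")]

S = semijoin(S, T, on=([1], [0]))   # upward pass:   S := S semijoin T
R = semijoin(R, S, on=([1], [0]))   #                R := R semijoin S
S = semijoin(S, R, on=([0], [1]))   # downward pass: S := S semijoin R
T = semijoin(T, S, on=([0], [1]))   #                T := T semijoin S

print(R, S, T)   # [(1, 10)] [(10, 100)] [(100, 'x')]: only the joining chain survives
```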

Cache-oblivious string B-trees

B-trees are the data structure of choice for maintaining searchable data on disk. However, B-trees perform suboptimally
  • when keys are long or of variable length,
  • when keys are compressed, even when using front compression, the standard B-tree compression scheme,
  • for range queries, and
  • with respect to memory effects such as disk prefetching.
This paper presents a cache-oblivious string B-tree (COSB-tree) data structure that is efficient in all these ways:
  • The COSB-tree searches asymptotically optimally and inserts and deletes nearly optimally.
  • It maintains an index whose size is proportional to the front-compressed size of the dictionary. Furthermore, unlike standard front-compressed strings, keys can be decompressed in a memory-efficient manner.
  • It performs range queries with no extra disk seeks; in contrast, B-trees incur disk seeks when skipping from leaf block to leaf block.
  • It utilizes all levels of a memory hierarchy efficiently and makes good use of disk locality by using cache-oblivious layout strategies.
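
Front compression, referred to twice in the list above, is easy to state concretely. The sketch below shows the generic technique (store the shared-prefix length plus the differing suffix), not the COSB-tree's own key layout; note that decompressing a key this way means replaying all earlier entries, which is precisely the inefficiency the COSB-tree avoids.

```python
def front_compress(sorted_keys):
    """Front (prefix) compression: store, for each key, how many leading characters
    it shares with the previous key plus the remaining suffix."""
    out, prev = [], ""
    for key in sorted_keys:
        lcp = 0
        while lcp < min(len(prev), len(key)) and prev[lcp] == key[lcp]:
            lcp += 1
        out.append((lcp, key[lcp:]))
        prev = key
    return out

def front_decompress(entries):
    """Recover the keys; each key needs the previously reconstructed key."""
    keys, prev = [], ""
    for lcp, suffix in entries:
        prev = prev[:lcp] + suffix
        keys.append(prev)
    return keys

keys = ["database", "dataspace", "datastream", "datum"]
enc = front_compress(keys)
print(enc)   # [(0, 'database'), (4, 'space'), (5, 'tream'), (3, 'um')]
assert front_decompress(enc) == keys
```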

SESSION: Stream algorithms and complexity

Randomized computations on large data sets: tight lower bounds

We study the randomized version of a computation model (introduced in [9, 10]) that restricts random access to external memory and internal memory space. Essentially, this model can be viewed as a powerful version of a data stream model that puts no cost on sequential scans of external memory (as other models for data streams do) and, in addition (like other external memory models, but unlike streaming models), admits several large external memory devices that can be read and written in parallel. We obtain tight lower bounds for the decision problems set equality, multiset equality, and checksort. More precisely, we show that any randomized one-sided-error bounded Monte Carlo algorithm for these problems must perform Ω(log N) random accesses to external memory devices, provided that the internal memory size is at most O(N^{1/4}/log N), where N denotes the size of the input data. From the lower bound on the set equality problem we can infer lower bounds on the worst-case data complexity of query evaluation for the languages XQuery, XPath, and relational algebra on streaming data. More precisely, we show that there exist queries in XQuery, XPath, and relational algebra such that any (randomized) Las Vegas algorithm that evaluates these queries must perform Ω(log N) random accesses to external memory devices, provided that the internal memory size is at most O(N^{1/4}/log N).

Counting triangles in data streams

We present two space-bounded random sampling algorithms that compute an approximation of the number of triangles in an undirected graph given as a stream of edges. Our first algorithm does not make any assumptions on the order of edges in the stream. It uses space that is inversely related to the ratio between the number of triangles and the number of triples with at least one edge in the induced subgraph, and constant expected update time per edge. Our second algorithm is designed for incidence streams (all edges incident to the same vertex appear consecutively). It uses space that is inversely related to the ratio between the number of triangles and the number of length-2 paths in the graph, and expected update time O(log|V|⋅(1+s⋅|V|/|E|)), where s is the space requirement of the algorithm. These results significantly improve over previous work [20, 8]. Since the space complexity depends only on the structure of the input graph and not on the number of nodes, our algorithms scale very well with increasing graph size and so they provide a basic tool to analyze the structure of large graphs. They have many applications, for example, in the discovery of Web communities, the computation of clustering and transitivity coefficients, and discovery of frequent patterns in large graphs. We have implemented both algorithms and evaluated their performance on networks from different application domains. The sizes of the considered graphs varied from about 8,000 nodes and 40,000 edges to 135 million nodes and more than 1 billion edges. For both algorithms we ran experiments with parameter s = 1,000; 10,000; 100,000; and 1,000,000 to evaluate running time and approximation guarantee. Both algorithms appear to be time-efficient for these sample sizes. The approximation quality of the first algorithm varied significantly, and even for s = 1,000,000 we had more than 10% deviation for more than half of the instances. The second algorithm performed much better, and even for s = 10,000 we had an average deviation of less than 6% (taken over all but the largest instance, for which we could not compute the number of triangles exactly).
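
To make the sample-and-scale idea concrete, here is a deliberately naive offline estimator; it stores the whole edge set and is not either of the paper's one-pass bounded-space algorithms, but it shows the shape of the estimate: sample vertex triples uniformly and scale the fraction that close a triangle by the total number of triples.

```python
import random
from itertools import combinations

def sampled_triangle_estimate(edges, num_vertices, samples=100_000):
    """Offline sampling estimator: an unbiased estimate of the triangle count,
    obtained by checking random vertex triples against the (stored) edge set."""
    adj = set()
    for u, v in edges:
        adj.add((u, v))
        adj.add((v, u))
    hits = 0
    for _ in range(samples):
        a, b, c = random.sample(range(num_vertices), 3)
        if (a, b) in adj and (b, c) in adj and (a, c) in adj:
            hits += 1
    total_triples = num_vertices * (num_vertices - 1) * (num_vertices - 2) // 6
    return hits / samples * total_triples

# sanity check on a 6-vertex clique, which contains exactly 20 triangles
clique_edges = list(combinations(range(6), 2))
print(round(sampled_triangle_estimate(clique_edges, 6)))   # 20
```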

Space- and time-efficient deterministic algorithms for biased quantiles over data streams

Skew is prevalent in data streams, and should be taken into account by algorithms that analyze the data. The problem of finding "biased quantiles" (approximate quantiles that must be more accurate for more extreme values) is a framework for summarizing such skewed data on data streams. We present the first deterministic algorithms that answer biased-quantile queries accurately in one pass, with space and time bounds that are small (sublinear in the input size). The space bound is near-optimal, and the amortized update cost is close to constant, making it practical for handling high-speed network data streams. We not only demonstrate theoretical properties of the algorithm, but also show that it uses less space than existing methods in many practical settings, and is fast to maintain.

Approximate quantiles and the order of the stream

Recently, there has been an increased focus on modeling uncertainty by distributions. Suppose we wish to compute a function of a stream whose elements are samples drawn independently from some distribution. The distribution is unknown, but the order in which the samples are presented to us will not be completely adversarial. In this paper, we investigate the importance of the ordering of a data stream, without making any assumptions about the actual distribution of the data. Using quantiles as an example application, we show that we can design provably better algorithms, and settle several open questions on the impact of order on streams. With the recent impetus in the investigation of models for sensor networks, we believe that our approach will allow the construction of novel and significantly improved algorithms.

SESSION: Frequency counting and aggregation

Deterministic k-set structure

A k-set structure over data streams is a bounded-space data structure that supports stream insertion and deletion operations and returns the set of (item, frequency) pairs in the stream, provided the number of distinct items in the stream does not exceed k; it returns nil otherwise. This is a fundamental problem with applications in data streaming [14], data reconciliation in distributed systems [12], mobile computing [16], etc. In this paper, we present a deterministic algorithm for the k-set problem that matches the space lower bound to within a logarithmic factor.

A simpler and more efficient deterministic scheme for finding frequent items over sliding windows

In this paper, we give a simple scheme for identifying ε-approximate frequent items over a sliding window of size n. Our scheme is deterministic and does not make any assumption on the distribution of the item frequencies. It supports O(1/ε) update and query time, and uses O(1/ε) space. It is very simple; its main data structures are just a few short queues whose entries store the positions of some items in the sliding window. We also extend our scheme to variable-size windows. This extended scheme uses O((1/ε) log(εn)) space.
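
For comparison, the classic whole-stream deterministic scheme with the same O(1/ε) space flavour is the Misra-Gries summary, sketched below; it is a different algorithm and does not handle the sliding window, which is exactly what the scheme described above adds.

```python
import math

def misra_gries(stream, epsilon):
    """Misra-Gries summary with at most ceil(1/epsilon) - 1 counters: every item
    whose true frequency exceeds epsilon * n is guaranteed to be among the
    returned counters (counts are underestimates by at most epsilon * n)."""
    k = max(1, math.ceil(1 / epsilon) - 1)
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:                                   # decrement every counter by one
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = list("abracadabra" * 100)
print(misra_gries(stream, epsilon=0.2))         # 'a' is guaranteed to appear
```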

Finding global icebergs over distributed data sets

Finding icebergs–items whose frequency of occurrence is above a certain threshold–is an important problem with a wide range of applications. Most of the existing work focuses on iceberg queries at a single node. However, in many real-life applications, data sets are distributed across a large number of nodes. Two naïve approaches might be considered. In the first, each node ships its entire data set to a central server, and the central server uses single-node algorithms to find icebergs. But it may incur prohibitive communication overhead. In the second, each node submits local icebergs, and the central server combines local icebergs to find global icebergs. But it may fail because in many important applications, globally frequent items may not be frequent at any node. In this work, we propose two novel schemes that provide accurate and efficient solutions to this problem: a sampling-based scheme and a counting-sketch-based scheme. In particular, the latter scheme incurs a communication cost at least an order of magnitude smaller than the naïve scheme of shipping all data, yet is able to achieve very high accuracy. Through rigorous theoretical and experimental analysis we establish the statistical properties of our proposed algorithms, including their accuracy bounds.

Efficient gossip-based aggregate computation

Recently, there has been a growing interest in gossip-based protocols that employ randomized communication to ensure robust information dissemination. In this paper, we present a novel gossip-based scheme with which all the nodes in an n-node overlay network can compute the common aggregates of MIN, MAX, SUM, AVERAGE, and RANK of their values using O(n log log n) messages within O(log n log log n) rounds of communication. To the best of our knowledge, ours is the first result that shows how to compute these aggregates with high probability using only O(n log log n) messages. In contrast, the best known gossip-based algorithm for computing these aggregates requires O(n log n) messages and O(log n) rounds. Thus, our algorithm allows system designers to trade off a small increase in round complexity for a significant reduction in message complexity. This can lead to dramatically lower network congestion and longer node lifetimes in wireless and sensor networks, where channel bandwidth and battery life are severely constrained.
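
As a baseline illustration of gossip-based aggregation (and not the O(n log log n)-message scheme of the paper), the standard push-sum protocol for AVERAGE looks like the sketch below; the number of rounds and nodes are arbitrary choices for the example.

```python
import random

def push_sum(values, rounds=30):
    """Push-sum gossip for AVERAGE: every node keeps a (sum, weight) pair and in
    each round sends half of both to one uniformly random node; sum/weight at
    every node converges to the true average because the totals are conserved."""
    n = len(values)
    s = [float(v) for v in values]
    w = [1.0] * n
    for _ in range(rounds):
        inbox = [(0.0, 0.0)] * n
        for i in range(n):
            target = random.randrange(n)
            half_s, half_w = s[i] / 2, w[i] / 2
            inbox[i] = (inbox[i][0] + half_s, inbox[i][1] + half_w)                  # keep one half
            inbox[target] = (inbox[target][0] + half_s, inbox[target][1] + half_w)   # send the other
        s = [p[0] for p in inbox]
        w = [p[1] for p in inbox]
    return [si / wi for si, wi in zip(s, w)]

values = [random.uniform(0, 100) for _ in range(64)]
print(sum(values) / len(values), push_sum(values)[0])   # the two numbers are close
```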

SESSION: Query, update, and mining languages

Structural characterizations of the semantics of XPath as navigation tool on a document

Given a document D in the form of an unordered labeled tree, we study the expressibility on D of various fragments of XPath, the core navigational language on XML documents. We give characterizations, in terms of the structure of D, for when a binary relation on its nodes is definable by an XPath expression in these fragments. Since each pair of nodes in such a relation represents a unique path in D, our results therefore capture the sets of paths in D definable in XPath. We refer to this perspective on the semantics of XPath as the "global view." In contrast with this global view, there is also a "local view" where one is interested in the nodes to which one can navigate starting from a particular node in the document. In this view, we characterize when a set of nodes in D can be defined as the result of applying an XPath expression to a given node of D. All these definability results, both in the global and the local view, are obtained by using a robust two-step methodology, which consists of first characterizing when two nodes cannot be distinguished by an expression in the respective fragments of XPath, and then bootstrapping these characterizations to the desired results.

The expressivity of XPath with transitive closure

We extend Core XPath, the navigational fragment of XPath 1.0, with transitive closure and path equalities. The resulting language, Regular XPath, is expressively complete for FO* (first-order logic extended with a transitive closure operator that can be applied to formulas with exactly two free variables). As a corollary, we obtain that Regular XPath is closed under path intersection and complementation. We also provide characterizations for the *-positive fragment of Regular XPath, and for μRegular XPath (the extension of Regular XPath with least fixed points).

Relational lenses: a language for updatable views

We propose a novel approach to the classical view update problem. The view update problem arises from the fact that modifications to a database view may not correspond uniquely to modifications on the underlying database; we need a means of determining an "update policy" that guides how view updates are reflected in the database. Our approach is to define a bi-directional query language, in which every expression can be read both (from left to right) as a view definition and (from right to left) as an update policy. The primitives of this language are based on standard relational operators. Its type system, which includes record-level predicates and functional dependencies, plays a crucial role in guaranteeing that update policies are well-behaved, in a precise sense, and that they are total, i.e., able to handle arbitrary changes to the view.
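
A toy, non-relational lens in the get/put style may help fix intuitions; the record shapes, the projection view, and the crude "default salary" update policy are inventions of the example, whereas the paper derives well-behaved policies from relational operators, predicates, and functional dependencies.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Lens:
    """Bare-bones asymmetric lens: `get` extracts a view from a source and
    `put` merges an updated view back into the original source."""
    get: Callable
    put: Callable   # put(updated_view, original_source) -> updated_source

# source: list of (name, dept, salary); view: projection onto (name, dept)
project = Lens(
    get=lambda db: [(n, d) for (n, d, s) in db],
    put=lambda view, db: [
        (n, d, dict(((n0, d0), s0) for (n0, d0, s0) in db).get((n, d), 0))
        for (n, d) in view
    ],
)

db = [("ann", "sales", 40), ("bob", "it", 55)]
view = [("ann", "marketing"), ("bob", "it")]     # edit the view: ann changes department
print(project.put(view, db))                     # ann's salary falls back to the default 0

# "GetPut" round-tripping law: putting back an unchanged view leaves the source alone
assert project.put(project.get(db), db) == db
```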

Programmable clustering

We initiate a novel study of clustering problems. Rather than specifying an explicit objective function to optimize, our framework allows the user of a clustering algorithm to specify, via a first-order formula, what constitutes an acceptable clustering to them. While the resulting genre of problems includes, in general, NP-complete problems, we highlight three specific first-order formulae, and provide efficient algorithms for the resulting clustering problems.

TUTORIAL SESSION: Invited tutorial 3

The logic of RDF and SPARQL: a tutorial

The Resource Description Framework (RDF [Hayes, 2004]) is a W3C standard language for representing information about resources in the World Wide Web; RDF provides a common framework for expressing this information so it can be exchanged between applications without loss of meaning. In this tutorial, RDF will be presented as a data model in the database sense. Its motivations will be analysed, and its current formal status revised. The data model can be understood both from a graph theoretical perspective and from a logical perspective. While the former has been the focus of most theoretical (see, e.g., [Gutierrez et al., 2004]) and practical approaches to RDF, the logical view of RDF has been mostly neglected by the community so far. Two provably correct (w.r.t. the normative W3C definitions of RDF [Hayes, 2004]) logical reconstructions of RDF will be presented, by reducing (a fragment of) it to a classical first-order framework suitable for knowledge representation (first developed in [de Bruijn et al., 2005]), and by encoding the full RDF data model in the HiLog logic introduced by Kifer et al. several years ago [Chen et al., 1993]. An emphasis will be given to three main characteristics of RDF: the presence of anonymous bnodes, the non-well-foundedness of the basic rdf:type relation, and the presence of the RDF vocabulary in the model itself. In the second part of the tutorial, the relation of the logical reconstructions of RDF with a database perspective will be introduced. An RDF database is seen as a model of a suitable theory in first-order logic or in HiLog. While in the pure RDF sense the two approaches are equivalent, it will be shown how the difference becomes relevant whenever additional constraints (e.g., in the form of ontologies or database dependencies) are introduced in the framework. In order to allow for additional constraints (e.g., in the standard W3C OWL-DL ontology language [Patel-Schneider et al., 2004]) while keeping the framework first-order, only a fragment of RDF can be considered; this restriction is not needed if the framework is in HiLog (see, e.g., [Motik, 2005]). Various complexity and decidability results will be summarised. In the last part of the tutorial, the W3C standard query language for RDF (SPARQL [Prud'hommeaux and Seaborne, 2006]) will be presented. SPARQL is currently a candidate recommendation. The core of SPARQL is a conjunctive query language, with the added complication that the data model includes existential information in the form of bnodes, and that bnodes may be returned by the query. The formal semantics of the core query language will be given. The problem of the canonical representation of the answer set will be introduced, since bnodes introduce a behaviour similar to the null values in SQL. Complexity results for query answering will be given for different cases. Finally, the possible extensions of SPARQL with various classes of constraints will be discussed.

SESSION: Incomplete information

On the decidability and finite controllability of query processing in databases with incomplete information

In this paper we study queries over relational databases with integrity constraints (ICs). The main problem we analyze is OWA query answering, i.e., query answering over a database with ICs under open-world assumption. The kinds of ICs that we consider are functional dependencies (in particular key dependencies) and inclusion dependencies; the query languages we consider are conjunctive queries (CQs), union of conjunctive queries (UCQs), CQs and UCQs with negation and/or inequality. We present a set of results about the decidability and finite controllability of OWA query answering under ICs. In particular: (i) we identify the decidability/undecidability frontier for OWA query answering under different combinations of the ICs allowed and the query language allowed; (ii) we study OWA query answering both over finite databases and over unrestricted databases, and identify the cases in which such a problem is finitely controllable, i.e., when OWA query answering over finite databases coincides with OWA query answering over unrestricted databases. Moreover, we are able to easily turn the above results into new results about implication of ICs and query containment under ICs, due to the deep relationship between OWA query answering and these two classical problems in database theory. In particular, we close two long-standing open problems in query containment, since we prove finite controllability of containment of conjunctive queries both under arbitrary inclusion dependencies and under key and foreign key dependencies. Besides their theoretical interest, we believe that the results of our investigation are very relevant in many research areas which have recently dealt with databases under an incomplete information assumption: e.g., view-based information access, ontology-based information systems, data integration, data exchange, and peer-to-peer information systems.