PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Management of probabilistic data: foundations and challenges

Many applications today need to manage large data sets with uncertainties. In this paper we describe the foundations of managing data where the uncertainties are quantified as probabilities. We review the basic definitions of the probabilistic data model, present some fundamental theoretical results for query evaluation on probabilistic databases, and discuss several challenges, open problems, and research directions.

SESSION: Query processing and rewriting

Generalized hypertree decompositions: NP-hardness and tractable variants

The generalized hypertree width GHW(H) of a hypergraph H is a measure of its cyclicity. Classes of conjunctive queries or constraint satisfaction problems whose associated hypergraphs have bounded GHW are known to be solvable in polynomial time. However, it has been an open problem for several years whether, for a fixed constant k and input hypergraph H, it can be determined in polynomial time whether GHW(H) ≤ k. Here, this problem is settled by proving that even for k=3 the problem is already NP-hard. On the way to this result, another long-standing open problem, originally raised by Goodman and Shmueli in 1984 in the context of join optimization, is solved. It is proven that determining whether a hypergraph H admits a tree projection with respect to a hypergraph G is NP-complete. Our intractability results on generalized hypertree width motivate further research on more restrictive tractable hypergraph decomposition methods that approximate generalized hypertree decomposition (GHD). We show that each such method is dominated by a tractable decomposition method definable through a function that associates a set of partial edges to a hypergraph. By using one particular such function, we define the new Component Hypertree Decomposition method, which is tractable and strictly more general than other approximations to GHD published so far.

Queries determined by views: pack your views

A query Q is determined by a set of views V if, whenever V(I1) = V(I2) for two database instances I1, I2, then also Q(I1) = Q(I2). Does this imply that Q can be rewritten as a query Q' that only uses the views V?

For first-order (FO) queries and view definitions over possibly infinite databases, the answer is yes, as follows from old results of Beth and Craig. We say that FO is complete for FO-to-FO rewritings. However, Nash, Segoufin and Vianu (2007) prove that if the query and the view definitions are given by conjunctive queries, then it might not be possible to formulate Q' as a conjunctive query. In other words, CQ is not complete for CQ-to-CQ rewritings.

Here we consider queries and view definitions in the packed fragment (PF) of first-order logic. This is a generalization of the guarded fragment, a fragment of particular interest to database theory. Gottlob et al. (2002) show that the guarded conjunctive queries are exactly the acyclic queries. Leinders et al. (2005) characterize the entire guarded fragment by the semijoin algebra.

We show that for both finite and unrestricted databases, PF is complete for PF-to-PF rewritings. The same holds for packed (unions of) conjunctive queries. In both cases, we provide algorithms for testing whether a query is determined by a set of views, and for actually rewriting Q to Q'. To compare: these problems are undecidable for full FO, and still open for conjunctive queries.

Provenance semirings

We show that relational algebra calculations for incomplete databases, probabilistic databases, bag semantics and why-provenance are particular cases of the same general algorithms involving semirings. This further suggests a comprehensive provenance representation that uses semirings of polynomials. We extend these considerations to datalog and semirings of formal power series. We give algorithms for datalog provenance calculation as well as datalog evaluation for incomplete and probabilistic databases. Finally, we show that for some semirings containment of conjunctive queries is the same as for standard set semantics.
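
The idea that one algorithm covers several semantics can be illustrated with a small sketch. It is not the paper's implementation: the names `Semiring`, `BOOL`, `BAG`, `union`, `join`, and `project` are hypothetical, relations are modeled as dictionaries from tuples to annotations, and the join is hard-wired to binary relations R(x, y) and S(y, z). Swapping the semiring switches between set and bag semantics without touching the operators.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any                        # annotation of an absent tuple
    one: Any                         # neutral element for multiplication
    add: Callable[[Any, Any], Any]   # combines alternative derivations
    mul: Callable[[Any, Any], Any]   # combines joint use of tuples

# Two standard instances (illustrative names): Booleans give set semantics,
# natural numbers give bag semantics.
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
BAG = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)

def union(k, r, s):
    """Union of two annotated relations: annotations of equal tuples are added."""
    out = dict(r)
    for t, a in s.items():
        out[t] = k.add(out.get(t, k.zero), a)
    return out

def join(k, r, s):
    """Join R(x, y) with S(y, z) on y: annotations are multiplied."""
    out = {}
    for (x, y), a in r.items():
        for (y2, z), b in s.items():
            if y == y2:
                t = (x, y, z)
                out[t] = k.add(out.get(t, k.zero), k.mul(a, b))
    return out

def project(k, r, cols):
    """Projection onto the given column positions: merged tuples are added."""
    out = {}
    for t, a in r.items():
        key = tuple(t[c] for c in cols)
        out[key] = k.add(out.get(key, k.zero), a)
    return out

R = {("a", "b"): 2, ("a", "c"): 3}
S = {("b", "d"): 1, ("c", "d"): 4}
print(project(BAG, join(BAG, R, S), (0, 2)))  # {('a', 'd'): 14}
```

Under `BAG` the answer tuple gets multiplicity 2·1 + 3·4 = 14; running the same operators with `BOOL` and annotations set to `True` yields plain set semantics.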

SESSION: Machine models and data structures

Machine models and lower bounds for query processing

This paper gives an overview of recent work on machine models for processing massive amounts of data. The main focus is on generalizations of the classical data stream model where, apart from an "internal memory" of limited size, also a number of (potentially huge) streams may be used as "external memory devices".

Decision trees for entity identification: approximation algorithms and hardness results

We consider the problem of constructing decision trees for entity identification from a given relational table. The input is a table containing information about a set of entities over a fixed set of attributes and a probability distribution over the set of entities that specifies the likelihood of the occurrence of each entity. The goal is to construct a decision tree that identifies each entity unambiguously by testing the attribute values such that the average number of tests is minimized. This classical problem finds such diverse applications as efficient fault detection, species identification in biology, and efficient diagnosis in the field of medicine. Prior work mainly deals with the special case where the input table is binary and the probability distribution over the set of entities is uniform. We study the general problem involving arbitrary input tables and arbitrary probability distributions over the set of entities. We consider a natural greedy algorithm and prove an approximation guarantee of O(r_K · log N), where N is the number of entities and K is the maximum number of distinct values of an attribute. The value r_K is a suitably defined Ramsey number, which is at most log K. We show that it is NP-hard to approximate the problem within a factor of Ω(log N), even for binary tables (i.e., K=2). Thus, for the case of binary tables, our approximation algorithm is optimal up to constant factors (since r_2 = 2). In addition, our analysis indicates a possible way of resolving a Ramsey-theoretic conjecture by Erdős.
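
To make the problem statement concrete, here is a minimal sketch of one greedy construction together with the objective (expected number of tests). The splitting rule used here, picking the attribute that separates the most pairs of remaining entities, is an assumption chosen for illustration; the paper's exact greedy rule and its treatment of non-uniform probabilities may differ. The names `build_tree` and `expected_tests` are hypothetical, and entities are assumed to be named by strings.

```python
from itertools import combinations

def build_tree(rows, attrs):
    """rows: dict entity -> dict attribute -> value.
    Returns an entity name (leaf) or a pair (attribute, {value: subtree})."""
    entities = list(rows)
    if len(entities) == 1:
        return entities[0]

    def separated(a):
        # Number of entity pairs that testing attribute a tells apart.
        return sum(1 for e, f in combinations(entities, 2)
                   if rows[e][a] != rows[f][a])

    best = max(attrs, key=separated)
    if separated(best) == 0:
        raise ValueError("some entities cannot be distinguished")
    groups = {}
    for e in entities:
        groups.setdefault(rows[e][best], {})[e] = rows[e]
    return (best, {v: build_tree(g, attrs) for v, g in groups.items()})

def expected_tests(tree, prob, depth=0):
    """Average number of attribute tests, weighted by entity probability."""
    if not isinstance(tree, tuple):
        return prob[tree] * depth
    _, children = tree
    return sum(expected_tests(c, prob, depth + 1) for c in children.values())

table = {
    "e1": {"A": 0, "B": 0, "C": 0},
    "e2": {"A": 0, "B": 1, "C": 0},
    "e3": {"A": 1, "B": 0, "C": 0},
    "e4": {"A": 1, "B": 1, "C": 1},
}
prob = {"e1": 0.25, "e2": 0.25, "e3": 0.25, "e4": 0.25}
tree = build_tree(table, ["A", "B", "C"])
print(expected_tests(tree, prob))  # 2.0
```

On this table the sketch tests A and then B, so every entity is identified after two tests.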


XML transformation by tree-walking transducers with invisible pebbles

The pebble tree automaton and the pebble tree transducer are enhanced by additionally allowing an unbounded number of "invisible" pebbles (as opposed to the usual "visible" ones). The resulting pebble tree automata recognize the regular tree languages (i.e., can validate all generalized DTDs) and hence can find all matches of MSO definable n-ary patterns. Moreover, when viewed as a navigational device, they lead to an XPath-like formalism that has a path expression for every MSO definable binary pattern. The resulting pebble tree transducers can apply arbitrary MSO definable tests to (the observable part of) their configurations, they (still) have a decidable typechecking problem, and they can model the recursion mechanism of XSLT. The time complexity of the typechecking problem for conjunctive queries that use MSO definable binary patterns can often be reduced through the use of invisible pebbles.

The complexity of query containment in expressive fragments of XPath 2.0

Query containment has been studied extensively for fragments of XPath 1.0. For instance, the problem is known to be ExpTime-complete for CoreXPath, the navigational core of XPath 1.0. Much less is known about query containment in (fragments of) the richer language XPath 2.0. In this paper, we consider extensions of CoreXPath with the following operators, which are all part of XPath 2.0 (except the last): path intersection, path equality, path complementation, for-loops, and transitive closure. For each combination of these operators, we determine the complexity of query containment, both with and without DTDs. It turns out to range from ExpTime (for extensions with path equality) and 2-ExpTime (for extensions with path intersection) to non-elementary (for extensions with path complementation or for-loops). In almost all cases, adding transitive closure on top has no further impact on the complexity. We also investigate the effect of dropping the upward and/or sibling axes, and show that this sometimes leads to a reduction in complexity. Since the languages we study include negation and conjunction in filters, our complexity results can equivalently be stated in terms of satisfiability. We also analyze the above languages in terms of succinctness.

Expressiveness and complexity of XML publishing transducers

A number of languages have been developed for specifying XML publishing, i.e., transformations of relational data into XML trees. These languages generally describe the behaviors of a middleware controller that builds an output tree iteratively, issuing queries to a relational source and expanding the tree with the query results at each step. To study the complexity and expressive power of XML publishing languages, this paper proposes a notion of publishing transducers. Unlike automata for querying XML data, a publishing transducer generates a new XML tree rather than performing a query on an existing tree. We study a variety of publishing transducers based on what relational queries a transducer can issue, what temporary stores a transducer can use during tree generation, and whether or not some tree nodes are allowed to be virtual, i.e., excluded from the output tree. We first show how existing XML publishing languages can be characterized by such transducers. We then study the membership, emptiness and equivalence problems for various classes of transducers and existing publishing languages. We establish lower and upper bounds, all matching except one, ranging from PTIME to undecidable. Finally, we investigate the expressive power of these transducers and existing languages. We show that when treated as relational query languages, different classes of transducers capture either complexity classes (e.g., PSPACE) or fragments of datalog (e.g., linear datalog). For tree generation, we establish connections between publishing transducers and logical transductions.

SESSION: Randomization

Maintaining Bernoulli samples over evolving multisets

Random sampling has become a crucial component of modern data management systems. Although the literature on database sampling is large, there has been relatively little work on the problem of maintaining a sample in the presence of arbitrary insertions and deletions to the underlying dataset. Most existing maintenance techniques apply either to the insert-only case or to datasets that do not contain duplicates. In this paper, we provide a scheme that maintains a Bernoulli sample of an underlying multiset in the presence of an arbitrary stream of updates, deletions, and insertions. Importantly, the scheme never needs to access the underlying multiset. Such Bernoulli samples are easy to manipulate, and are well suited to parallel processing environments. Our method can be viewed as an enhancement of the "counting sample" scheme developed by Gibbons and Matias for estimating the frequency of highly frequent items. We show how the "tracking counters" used by our maintenance scheme can be exploited to estimate population frequencies, sums, and averages in an unbiased manner, with lower variance than the usual estimators based on a Bernoulli sample. The number of distinct items in the multiset can also be estimated without bias. Finally, we discuss certain problems of subsampling and merging that arise in systems with limited memory resources or distributed processing, respectively.

Finding near neighbors through cluster pruning

Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications in text and image search, in finding similar objects and matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing we randomly choose a subset of data points to be leaders; the remaining data points are partitioned by which leader is the closest. For query processing, we find the leader(s) closest to the query point. We then seek the nearest neighbors for the query point among only the points in the clusters of the closest leader(s). Recursion may be used in both preprocessing and in search. Such schemes seek approximate nearest neighbors that are "almost as good" as the nearest neighbors. How good are these approximations and how much do they save in computation?

Our contributions are: (1) we quantify metrics that allow us to study the tradeoff between processing and the quality of the approximate nearest neighbors; (2) we give rigorous theoretical analysis of our schemes, under natural generative processes (generalizing Gaussian mixtures) for the data points; (3) experiments on both synthetic data from such generative processes, as well as on data from a document corpus, confirming that we save orders of magnitude in query processing cost at modest compromises in the quality of retrieved points. In particular, we show that p-spheres, a state-of-the-art solution, is outperformed by our simple scheme whether the data points are stored in main or in external memory.
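
The preprocessing and query steps of cluster pruning described above can be sketched in a few lines. This is a non-recursive illustration under stated assumptions, not the authors' code: points are tuples in Euclidean space, each leader is also stored in its own cluster, and the names `preprocess`, `query`, `num_leaders`, and `num_probe` are hypothetical.

```python
import math
import random

def preprocess(points, num_leaders, rng):
    """Pick random leaders and assign every point to its closest leader."""
    leaders = rng.sample(points, num_leaders)
    clusters = {leader: [] for leader in leaders}
    for p in points:
        closest = min(clusters, key=lambda l: math.dist(p, l))
        clusters[closest].append(p)
    return clusters

def query(clusters, q, num_probe=1):
    """Search only the clusters of the num_probe leaders closest to q."""
    nearest_leaders = sorted(clusters, key=lambda l: math.dist(q, l))[:num_probe]
    candidates = [p for l in nearest_leaders for p in clusters[l]]
    return min(candidates, key=lambda p: math.dist(q, p))

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
clusters = preprocess(points, num_leaders=30, rng=rng)
approx = query(clusters, (0.5, 0.5), num_probe=2)
```

Raising `num_probe` trades query cost for quality: when every leader is probed, the search degenerates to an exact linear scan.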

SESSION: Data exchange and schema mappings

CWA-solutions for data exchange settings with target dependencies

Data exchange deals with the following problem: given an instance over a source schema, a specification of the relationship between the source and the target, and dependencies on the target, construct an instance over a target schema that satisfies the given relationships and dependencies. Recently, for data exchange settings without target dependencies, Libkin (PODS'06) introduced a new concept of solutions based on the closed world assumption (so-called CWA-solutions), and showed that, in some respects, this new notion behaves better than the standard notion of solutions considered in previous papers on data exchange. The present paper extends Libkin's notion of CWA-solutions to data exchange settings with target dependencies. We show that, when restricting attention to data exchange settings with weakly acyclic target dependencies, this new notion behaves similarly as before: the core is the unique "minimal" CWA-solution, and computing CWA-solutions as well as certain answers to positive queries is possible in polynomial time and can be PTIME-hard. However, there may be more than one "maximal" CWA-solution. And going beyond the class of positive queries, we obtain that there are conjunctive queries with (just) one inequality, for which evaluating the certain answers is coNP-hard. Finally, we consider the EXISTENCE-OF-CWA-SOLUTIONS problem: while the problem is tractable for data exchange settings with weakly acyclic target dependencies, it turns out to be undecidable for general data exchange settings. As a consequence, we obtain that also the EXISTENCE-OF-UNIVERSAL-SOLUTIONS problem is undecidable in general.

Quasi-inverses of schema mappings

Schema mappings are high-level specifications that describe the relationship between two database schemas. Two operators on schema mappings, namely the composition operator and the inverse operator, are regarded as especially important. Progress on the study of the inverse operator was not made until very recently, as even finding the exact semantics of this operator turned out to be a fairly delicate task. Furthermore, this notion is rather restrictive, since it is rare that a schema mapping possesses an inverse.

In this paper, we introduce and study the notion of a quasi-inverse of a schema mapping. This notion is a principled relaxation of the notion of an inverse of a schema mapping; intuitively, it is obtained from the notion of an inverse by not differentiating between instances that are equivalent for data-exchange purposes. For schema mappings specified by source-to-target tuple-generating dependencies (s-t tgds), we give a necessary and sufficient combinatorial condition for the existence of a quasi-inverse, and then use this condition to obtain both positive and negative results about the existence of quasi-inverses. In particular, we show that every LAV (local-as-view) schema mapping has a quasi-inverse, but that there are schema mappings specified by full s-t tgds that have no quasi-inverse. After this, we study the language needed to express quasi-inverses of schema mappings specified by s-t tgds, and we obtain a complete characterization. We also characterize the language needed to express inverses of schema mappings, and thereby solve a problem left open in the earlier study of the inverse operator. Finally, we show that quasi-inverses can be used in many cases to recover the data that was exported by the original schema mapping when performing data exchange.

On reconciling data exchange, data integration, and peer data management

Data exchange and virtual data integration have been the subject of several investigations in the recent literature. At the same time, the notion of peer data management has emerged as a powerful abstraction of many forms of flexible and dynamic data-centered distributed systems. Although research on the above issues has progressed considerably in the last years, a clear understanding of how to combine data exchange and data integration in peer data management is still missing. This is the subject of the present paper. We start our investigation by first proposing a novel framework for peer data exchange, showing that it is a generalization of the classical data exchange setting. We also present algorithms for all the relevant data exchange tasks, and show that they can all be done in polynomial time with respect to data complexity. Based on the motivation that typical mappings and integrity constraints found in data integration are not captured by peer data exchange, we extend the framework to incorporate these features. One of the main difficulties is that the constraints of this new class are not amenable to materialization. We address this issue by resorting to a suitable combination of virtual and materialized data exchange, showing that the resulting framework is a generalization of both classical data exchange and classical data integration, and that the new setting incorporates the most expressive types of mapping and constraints considered in the two contexts. Finally, we present algorithms for all the relevant data management tasks also in the new setting, and show that, again, their data complexity is polynomial.

SESSION: Types and schemas

A crash course on database queries

Complex database queries, like programs in general, can "crash", i.e., can raise runtime errors. We want to avoid crashes without losing expressive power, or we want to correctly predict the absence of crashes. We show how concepts and techniques from programming language theory, notably type systems and reflection, can be adapted to this end. Of course, the specific nature of database queries (as opposed to general programs) also requires some new methods and raises new questions.

The complexity of reasoning about pattern-based XML schemas

In a recent paper, Martens et al. introduced a specification mechanism for XML tree languages, based on rules of the form (r,s), where r, s are regular expressions. Sets of such rules can be interpreted in an existential or a universal fashion. An XML tree is existentially valid with respect to a rule set if for each node there is a rule such that the root path of the node matches r and the children sequence of the node matches s. It is universally valid if each node matching r also matches s. This paper investigates the complexity of reasoning about such rule sets, in particular the satisfiability and the implication problem. Whereas in general these reasoning problems are complete for EXPTIME, two important fragments are identified with PSPACE and PTIME complexity, respectively.
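
The two notions of validity defined above translate almost directly into code. The following is a minimal sketch, assuming single-character element labels so that root paths and children sequences are plain strings that Python's `re` module can match; `Node`, `existentially_valid`, and `universally_valid` are hypothetical names, and nothing here reflects the paper's algorithms for satisfiability or implication.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                       # assumed to be a single character
    children: list = field(default_factory=list)

def nodes_with_paths(node, prefix=""):
    """Yield each node together with its root path (labels from the root)."""
    path = prefix + node.label
    yield node, path
    for child in node.children:
        yield from nodes_with_paths(child, path)

def child_word(node):
    """The children sequence of a node, as a string of labels."""
    return "".join(c.label for c in node.children)

def existentially_valid(root, rules):
    """Every node has SOME rule (r, s) whose r matches the root path
    and whose s matches the children sequence."""
    return all(
        any(re.fullmatch(r, path) and re.fullmatch(s, child_word(n))
            for r, s in rules)
        for n, path in nodes_with_paths(root))

def universally_valid(root, rules):
    """For EVERY rule (r, s): if r matches the root path of a node,
    then s must match its children sequence."""
    return all(
        re.fullmatch(s, child_word(n))
        for n, path in nodes_with_paths(root)
        for r, s in rules
        if re.fullmatch(r, path))

# Tree a(b, c(b)) and a rule set that it satisfies under both readings.
tree = Node("a", [Node("b"), Node("c", [Node("b")])])
rules = [("a", "bc"), (".*b", ""), (".*c", "b*")]
print(existentially_valid(tree, rules), universally_valid(tree, rules))  # True True
```

The two readings diverge as soon as rules overlap: adding the rule `("a.*", "b*")` keeps the tree existentially valid but breaks universal validity, because the root matches `a.*` while its children sequence `bc` does not match `b*`.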

SESSION: Query languages

Monadic datalog over finite structures with bounded treewidth

Bounded treewidth and Monadic Second Order (MSO) logic have proved to be key concepts in establishing fixed-parameter tractability results. Indeed, by Courcelle's Theorem we know: Any property of finite structures, which is expressible by an MSO sentence, can be decided in linear time (data complexity) if the structures have bounded treewidth.

In principle, Courcelle's Theorem can be applied directly to construct concrete algorithms by transforming the MSO evaluation problem into a tree language recognition problem. The latter can then be solved via a finite tree automaton (FTA). However, this approach has turned out to be problematical, since even relatively simple MSO formulae may lead to a "state explosion" of the FTA.

In this work we propose monadic datalog (i.e., datalog where all intensional predicate symbols are unary) as an alternative method to tackle this class of fixed-parameter tractable problems. We show that if some property of finite structures is expressible in MSO then this property can also be expressed by means of a monadic datalog program over the structure plus the tree decomposition. Moreover, we show that the resulting fragment of datalog can be evaluated in linear time (both w.r.t. the program size and w.r.t. the data size). This new approach is put to work by devising a new algorithm for the PRIMALITY problem (i.e., testing if some attribute in a relational schema is part of a key). We also report on experimental results with a prototype implementation.
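
For readers unfamiliar with PRIMALITY, the following brute-force sketch states the problem in executable form. It is explicitly not the monadic datalog algorithm of the paper: it enumerates attribute subsets and is exponential in the schema size. It assumes single-character attribute names and functional dependencies given as (left-hand side, right-hand side) string pairs; `closure`, `keys`, and `is_prime` are hypothetical names.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under the functional dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def keys(schema, fds):
    """All keys (minimal attribute sets whose closure is the whole schema).
    Candidates are tried by increasing size, so supersets of a known key
    can be skipped."""
    attrs = sorted(schema)
    found = []
    for size in range(len(attrs) + 1):
        for cand in combinations(attrs, size):
            c = set(cand)
            if any(k <= c for k in found):
                continue
            if closure(c, fds) == set(attrs):
                found.append(c)
    return found

def is_prime(attr, schema, fds):
    """An attribute is prime if it belongs to at least one key."""
    return any(attr in k for k in keys(schema, fds))

# Schema R(a, b, c, d) with ab -> cd and c -> a: the keys are {a, b} and {b, c}.
fds = [("ab", "cd"), ("c", "a")]
print(is_prime("a", "abcd", fds), is_prime("d", "abcd", fds))  # True False
```

The exponential enumeration is what makes the general problem hard; the abstract's point is that bounded treewidth allows a linear-time alternative.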

Index-based multidimensional array queries: safety and equivalence

We propose a new multidimensional array query model giving array bounds and other shape-related metadata a central role. Arrays are treated as shaped maps from indices to values. Schemas are augmented by shape constraints. Queries also have shape preconditions. Within this framework, we introduce the index-based array queries expressing index reorganizations and value summarizations. We define them via a declarative, rule-based language with shape-membership constraints in its rule bodies and subscripting and aggregation in its rule heads. We explore safety (including bounds analysis) and query equivalence for various subclasses divided according to the aggregator type, whether we allow disjunctions, and whether we allow (limited) Presburger arithmetic in index and shape terms. We show safety is tractable in the nonarithmetic cases, while state safety remains in P in the arithmetic ones. We show that, for a class of monoid-based set and bag aggregators, equivalence reduces to equivalence of index-cores (core queries collecting array indices rather than values). For set-aggregator queries, we give complete characterizations of equivalence in terms of containment maps and show the equivalence problems are in P in the nonarithmetic, conjunctive case and in coNP in all others.

Non-linear prefixes in query languages

In first order logic there are two main extensions to quantification: generalized quantifiers and non-linear prefixes. While generalized quantifiers have been explored from a database perspective, non-linear prefixes have not, most likely because of complexity concerns. In this paper we first illustrate the usefulness of non-linear prefixes in query languages by means of example queries. We then introduce the subject formally, distinguishing between two forms of non-linearity: branching and cumulation. To escape complexity concerns, we focus on monadic quantifiers. In this context, we show that branching does not extend the expressive power of first order logic when it is interpreted over finite models, while cumulation does not extend the expressive power when it is interpreted over bounded models. Branching and cumulation do, however, allow us to formulate some queries in a succinct and elegant manner. When branching and cumulation are interpreted over infinite models, we show that the resulting language can be embedded in an infinitary logic proposed by Libkin. We also discuss non-linear prefixes from an algorithmic point of view.


Reasoning about XML update constraints

We introduce in this paper a class of constraints for describing how an XML document can evolve, namely XML update constraints. For these constraints, we study the implication problem, giving algorithms and complexity results for constraints of varying expressive power. Besides classical constraint implication, we also consider an instance-based approach. More precisely, we study implication with respect to a current tree instance, resulting from a series of unknown updates. The main motivation of our work is reasoning about data integrity under update restrictions in contexts where owners may lose control over their data, such as in publishing or exchange.

Polynomial time fragments of XPath with variables

Variables are the distinguishing new feature of XPath 2.0, permitting the selection of n-tuples of nodes in trees. It is known that the Core of XPath 2.0 captures n-ary first-order (FO) queries modulo linear time transformations. In this paper, we distinguish a fragment of Core XPath 2.0 that remains FO-complete with respect to n-ary queries while enjoying polynomial-time query answering.

SESSION: Data streams

Optimization of continuous queries with shared expensive filters

We consider the problem of optimizing and executing multiple continuous queries, where each query is a conjunction of filters and each filter may occur in multiple queries. When filters are expensive, significant performance gains are achieved by sharing filter evaluations across queries. A shared execution strategy in our scenario can either be fixed, in which filters are evaluated in the same predetermined order for all input, or adaptive, in which the next filter to be evaluated is chosen at runtime based on the results of the filters evaluated so far. We show that as filter costs increase, the best adaptive strategy is superior to any fixed strategy, despite the overhead of adaptivity. We show that it is NP-hard to find the optimal adaptive strategy, even if we are willing to approximate within any factor smaller than m, where m is the number of queries. We then present a greedy adaptive execution strategy and show that it approximates the best adaptive strategy to within a factor O(log² m log n), where n is the number of distinct filters. We also give a precomputation technique that can reduce the execution overhead of adaptive strategies.
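
A fixed shared strategy, as described above, can be sketched as follows. This is an illustration under assumptions, not the paper's algorithm: each filter is evaluated at most once per input item and its result is shared by all queries that use it, and a filter is skipped once every query containing it is already resolved. The names `evaluate_shared`, `queries`, `filters`, and `order` are hypothetical.

```python
def evaluate_shared(queries, filters, order, item):
    """queries: name -> set of filter names (a conjunction).
    filters: name -> predicate on an item.
    order:   predetermined evaluation order of filter names.
    Returns (set of satisfied queries, list of filters actually evaluated)."""
    # Unresolved queries, each with the filters it still needs to pass.
    alive = {q: set(fs) for q, fs in queries.items() if fs}
    satisfied = {q for q, fs in queries.items() if not fs}
    evaluated = []
    for f in order:
        if not any(f in fs for fs in alive.values()):
            continue  # no unresolved query depends on f, so skip it
        evaluated.append(f)
        outcome = filters[f](item)
        for q in list(alive):
            if f not in alive[q]:
                continue
            if not outcome:
                del alive[q]          # one failed filter falsifies the query
            else:
                alive[q].discard(f)
                if not alive[q]:
                    satisfied.add(q)  # all filters of q have passed
                    del alive[q]
    return satisfied, evaluated

queries = {"Q1": {"f1", "f2"}, "Q2": {"f2", "f3"}}
filters = {
    "f1": lambda x: x > 0,
    "f2": lambda x: x % 2 == 0,
    "f3": lambda x: x < 10,
}
print(evaluate_shared(queries, filters, ["f2", "f1", "f3"], 3))  # (set(), ['f2'])
```

For the item 3, the shared filter `f2` fails and resolves both queries after a single evaluation. An adaptive strategy would instead choose the next filter based on the outcomes seen so far.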

Variance estimation over sliding windows

Capturing characteristics of large data streams has received considerable attention. The constraints in space and time restrict the data stream processing to only one pass (or a small number of passes). Processing data streams over sliding windows makes the problem more difficult and challenging. In this paper, we address the problem of maintaining ε-approximate variance of data streams over sliding windows. To our knowledge, the best existing algorithm requires O(1/ε² log N) space, though the lower bound for this problem is Ω(1/ε log N). We propose the first ε-approximation algorithm for this problem that is optimal in both space and worst-case time. Our algorithm requires O(1/ε log N) space. Furthermore, its running time is O(1) in the worst case.

SESSION: Sequences, streams, events

Marrying words and trees

Traditionally, data that has both linear and hierarchical structure, such as annotated linguistic data, is modeled using ordered trees and queried using tree automata. In this paper, we argue that nested words and automata over nested words offer a better way to capture and process the dual structure. Nested words generalize both words and ordered trees, and allow both word and tree operations. We study various classes of automata over nested words, and show that while they enjoy expressiveness and succinctness benefits over word and tree automata, their analysis complexity and closure properties are analogous to the corresponding word and tree special cases. In particular, we show that finite-state nested word automata can be exponentially more succinct than tree automata, and pushdown nested word automata include the two incomparable classes of context-free word languages and context-free tree languages.

Estimating statistical aggregates on probabilistic data streams

The probabilistic-stream model was introduced by Jayram et al. [20]. It is a generalization of the data stream model that is suited to handling "probabilistic" data, where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilistic stream determines a distribution over a potentially exponential number of classical "deterministic" streams where each item is deterministically one of the domain values.

Designing efficient aggregation algorithms for probabilistic data is crucial for handling uncertainty in data-centric applications such as OLAP. Such algorithms are also useful in a variety of other settings, including analyzing search engine traffic and aggregation in sensor networks.

We present algorithms for computing commonly used aggregates on a probabilistic stream. We present the first one-pass streaming algorithms for estimating the expected mean of a probabilistic stream, improving upon results in [20]. Next, we consider the problem of estimating frequency moments for probabilistic data. We propose a general approach to obtain unbiased estimators working over probabilistic data by utilizing unbiased estimators designed for standard streams. Applying this approach, we extend a classical data stream algorithm to obtain a one-pass algorithm for estimating F2, the second frequency moment. We present the first known streaming algorithms for estimating F0, the number of distinct items on probabilistic streams. Our work also gives an efficient one-pass algorithm for estimating the median of a probabilistic stream.

Sketching unaggregated data streams for subpopulation-size queries

IP packet streams consist of multiple interleaving IP flows. Statistical summaries of these streams, collected for different measurement periods, are used for characterization of traffic, billing, anomaly detection, inferring traffic demands, configuring packet filters and routing protocols, and more. While queries are posed over the set of flows, the summarization algorithm is applied to the stream of packets. Aggregation of traffic into flows before summarization requires storage of per-flow counters, which is often infeasible. Therefore, the summary has to be produced over the unaggregated stream.

An important aggregate performed over a summary is to approximate the size of a subpopulation of flows that is specified a posteriori, for example, flows belonging to an application such as Web or DNS, or flows that originate from a certain Autonomous System. We design efficient streaming algorithms that summarize unaggregated streams and provide corresponding unbiased estimators for subpopulation sizes. Our summaries outperform, in terms of estimate accuracy, those produced by packet sampling deployed by Cisco's sampled NetFlow, the most widely deployed such system. Performance of our best method, step sample-and-hold, is close to that of summaries that can be obtained from pre-aggregated traffic.

What is "next" in event processing?

Event processing systems have wide applications ranging from managing events from RFID readers to monitoring RSS feeds. Consequently, there exists much work on them in the literature. The prevalent use of these systems is on-line recognition of patterns that are sequences of correlated events in event streams. Query semantics and implementation efficiency are inherently determined by the underlying temporal model: how events are sequenced (what is the "next" event), and how the time stamp of an event is represented. Many competing temporal models for event systems have been proposed, with no consensus on which approach is best.

We take a foundational approach to this problem. We create a formal framework and present event system design choices as axioms. The axioms are grouped into standard axioms and desirable axioms. Standard axioms are common to the design of all event systems. Desirable axioms are not always satisfied, but are useful for achieving high performance. Given these axioms, we prove several important results. First, we show that there is a unique model up to isomorphism that satisfies the standard axioms and supports associativity, so our axioms are a sound and complete axiomatization of associative time stamps in event systems. This model requires time stamps with unbounded representations. We present a slightly weakened version of associativity that permits a temporal model with bounded representations. We show that adding the boundedness condition also results in a unique model, so again our axiomatization is sound and complete. We believe this model is ideally suited to be the standard temporal model for complex event processing.

SESSION: Privacy, probabilistic databases

Privacy, accuracy, and consistency too: a holistic solution to contingency table release

The contingency table is a workhorse of official statistics, the format of reported data for the US Census, Bureau of Labor Statistics, and the Internal Revenue Service. In many settings such as these, privacy is mandated not only ethically but frequently legally as well. Consequently, there is an extensive and diverse literature dedicated to the problems of statistical disclosure control in contingency table release. However, all current techniques for reporting contingency tables fall short on at least one of privacy, accuracy, and consistency (among multiple released tables). We propose a solution that provides strong guarantees for all three desiderata simultaneously.

Our approach can be viewed as a special case of a more general approach for producing synthetic data: Any privacy-preserving mechanism for contingency table release begins with raw data and produces a (possibly inconsistent) privacy-preserving set of marginals. From these tables alone (and hence without weakening privacy) we will find and output the "nearest" consistent set of marginals. Interestingly, this set is no farther than the tables of the raw data, and consequently the additional error introduced by the imposition of consistency is no more than the error introduced by the privacy mechanism itself.

The privacy mechanism of [20] gives the strongest known privacy guarantees, with very little error. Combined with the techniques of the current paper, we therefore obtain excellent privacy, accuracy, and consistency among the tables. Moreover, our techniques are surprisingly efficient. Our techniques apply equally well to the logical cousin of the contingency table, the OLAP cube.

On the complexity of managing probabilistic XML data

In [3], we introduced a framework for querying and updating probabilistic information over unordered labeled trees, the probabilistic tree model. The data model is based on trees where nodes are annotated with conjunctions of probabilistic event variables. We briefly described an implementation and scenarios of usage. We develop here a mathematical foundation for this model. In particular, we present complexity results. We identify a very large class of queries for which simple variations of querying and updating algorithms from [3] compute the correct answer. A main contribution is a full complexity analysis of queries and updates. We also exhibit a decision procedure for the equivalence of probabilistic trees and prove it is in co-RP. Furthermore, we study the issue of removing less probable possible worlds, and that of validating a probabilistic tree against a DTD. We show that these two problems are intractable in the most general case.

The dichotomy of conjunctive queries on probabilistic structures

We show that for every conjunctive query, the complexity of evaluating it on a probabilistic database is either PTIME or #P-complete, and we give an algorithm for deciding whether a given conjunctive query is PTIME or #P-complete. The dichotomy property is a fundamental result on query evaluation on probabilistic databases and it gives a complete classification of the complexity of conjunctive queries.
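To make the PTIME side of the dichotomy concrete, here is a minimal sketch for one Boolean conjunctive query, q :- R(x), S(x, y). It assumes a tuple-independent probabilistic database, which the abstract does not state explicitly. The polynomial-time function multiplies out independent events, and the brute-force function enumerates all possible worlds as a cross-check. The relation encodings and function names are illustrative assumptions, not the paper's algorithm.

```python
from itertools import product


def prob_safe_plan(R, S):
    """P(q) for q :- R(x), S(x, y), in time polynomial in the data.

    R maps x -> probability; S maps (x, y) -> probability.
    Assumption: all tuples are independent events.
    """
    p_no_witness = 1.0
    for x, p_r in R.items():
        p_no_s = 1.0                       # P(no S(x, y) tuple is present)
        for (sx, _y), p_s in S.items():
            if sx == x:
                p_no_s *= 1.0 - p_s
        p_x = p_r * (1.0 - p_no_s)         # P(x is a witness for q)
        p_no_witness *= 1.0 - p_x          # witnesses for distinct x are independent
    return 1.0 - p_no_witness


def prob_brute_force(R, S):
    """Same probability by summing over all 2^n possible worlds."""
    tuples = [("R", k, p) for k, p in R.items()]
    tuples += [("S", k, p) for k, p in S.items()]
    total = 0.0
    for world in product([False, True], repeat=len(tuples)):
        weight = 1.0
        r_present, s_present = set(), set()
        for present, (rel, key, p) in zip(world, tuples):
            weight *= p if present else 1.0 - p
            if present:
                (r_present if rel == "R" else s_present).add(key)
        # q holds if some present S(x, y) has a present R(x)
        if any(x in r_present for (x, _y) in s_present):
            total += weight
    return total
```

The two functions agree on small inputs, but only the first scales: the enumeration is exponential in the number of tuples, which is the cost a query on the hard side of the dichotomy cannot be expected to avoid in general.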

Maximally joining probabilistic data

Conceptually, the common approach to manipulating probabilistic data is to evaluate relational queries and then calculate the probability of each tuple in the result. This approach ignores the possibility that the probabilities of complete answers are too low and, hence, partial answers (with sufficiently high probabilities) become important. Therefore, we consider the semantics in which answers are maximal (i.e., have the smallest degree of incompleteness), subject to the constraint that the probability is still above a given threshold.

We investigate the complexity of joining relations under the above semantics. In contrast to the deterministic case, this approach gives rise to two different enumeration problems. The first is finding all maximal sets of tuples that are join consistent, connected and have a joint probability above the threshold. The second is computing all maximal tuples that are answers of partial joins and have a probability above the threshold. Both problems are tractable under data complexity. We also consider query-and-data complexity, which rules out as efficient the following naive algorithm: compute all partial answers and then choose the maximal ones among those with probabilities above the threshold. We give efficient algorithms for several important special cases. We also show that, in general, the first problem is NP-hard whereas the second is #P-hard.