PODS '03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

E-services: a look behind the curtain

The emerging paradigm of electronic services promises to bring to distributed computation and services the flexibility that the web has brought to the sharing of documents. An understanding of fundamental properties of e-service composition is required in order to take full advantage of the paradigm. This paper examines proposals and standards for e-services from the perspectives of XML, data management, workflow, and process models. Key areas for study are identified, including behavioral service signatures, verification and synthesis techniques for composite services, analysis of service data manipulation commands, and XML analysis applied to service specifications. We give a sample of the relevant results and techniques in each of these areas.

An information-theoretic approach to normal forms for relational and XML data

Normalization as a way of producing good database designs is a well-understood topic. However, the same problem of distinguishing well-designed databases from poorly designed ones arises in other data models, in particular, XML. While in the relational world the criteria for being well-designed are usually very intuitive and clear to state, they become more obscure when one moves to more complex data models. Our goal is to provide a set of tools for testing when a condition on a database design, specified by a normal form, corresponds to a good design. We use techniques of information theory, and define a measure of information content of elements in a database with respect to a set of constraints. We first test this measure in the relational context, providing information-theoretic justification for familiar normal forms such as BCNF, 4NF, PJ/NF, 5NFR, DK/NF. We then show that the same measure applies in the XML context, which gives us a characterization of a recently introduced XML normal form called XNF. Finally, we look at information-theoretic criteria for justifying normalization algorithms.

Algorithms for data migration with cloning

Our work is motivated by the problem of managing data on storage devices, typically a set of disks. Such high-demand storage servers are used as web servers or multimedia servers that must handle high demand for data. As the system is running, it needs to dynamically respond to changes in demand for different data items. In this work we study the data migration problem, which arises when we need to quickly change one storage configuration into another. We show that this problem is NP-hard. In addition, we develop polynomial-time approximation algorithms for this problem and prove a worst-case bound of 9.5 on the approximation factor achieved by our algorithm. We also compare the algorithm to several heuristics for this problem.

Privacy in data systems

The explosive progress in networking, storage, and processor technologies is resulting in an unprecedented amount of digitization of information. In concert with this dramatic increase in digital data, concerns about the privacy of personal information have emerged globally. The concerns over massive collection of data are naturally extending to analytic tools applied to data. Data mining, with its promise to efficiently discover valuable, non-obvious information from large databases, is particularly vulnerable to misuse. The challenge for the database community is to design information systems that protect the privacy and ownership of individual data without impeding information flow. One way of preserving privacy of individual data values is to perturb them. Since the primary task in data mining is the development of models about aggregated data, we explore whether we can develop accurate models without access to precise information in individual data records. We consider the concrete case of building a decision-tree classifier from perturbed data. While it is not possible to accurately estimate original values in individual data records, we describe a reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data. We also discuss how to discover association rules over privacy-preserved data. Inspired by the privacy tenet of the Hippocratic Oath, we argue that future database systems must include responsibility for the privacy of data they manage as a founding tenet. We enunciate the key principles for such Hippocratic database systems, distilled from the principles behind current privacy legislation and guidelines. We identify the technical challenges and problems in designing Hippocratic databases, and also outline some solution approaches.
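
The abstract describes a perturb-then-reconstruct pipeline. The sketch below (hypothetical data, numpy, uniform additive noise, an EM-style Bayesian update over histogram bins) only illustrates that general idea; it is not the paper's exact reconstruction procedure, nor its decision-tree or association-rule machinery.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensitive values (say, ages) and additive uniform perturbation.
n = 10_000
original = rng.normal(50, 10, n)
half_width = 30.0
perturbed = original + rng.uniform(-half_width, half_width, n)

# Reconstruct the *distribution* of original values from the perturbed ones:
# discretize the range into bins and run an EM-style Bayesian update, which
# assumes only that the noise density is publicly known.
bins = np.linspace(0, 100, 41)
centers = 0.5 * (bins[:-1] + bins[1:])

def noise_density(d):
    return np.where(np.abs(d) <= half_width, 1.0 / (2 * half_width), 0.0)

lik = noise_density(perturbed[:, None] - centers[None, :])  # P(observed | true bin)
est = np.full(len(centers), 1.0 / len(centers))             # uniform prior over bins

for _ in range(100):
    posterior = lik * est                                    # unnormalized P(bin | observed)
    posterior /= posterior.sum(axis=1, keepdims=True) + 1e-12
    est = posterior.mean(axis=0)                             # updated bin probabilities

print(f"true mean {original.mean():.2f}, reconstructed mean {(est * centers).sum():.2f}")
```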

Materializing views with minimal size to answer queries

In this paper we study the following problem. Given a database and a set of queries, we want to find, in advance, a set of views that can compute the answers to the queries, such that the size of the viewset (i.e., the amount of space, in bytes, required to store the viewset) is minimal on the given database. This problem is important for many applications such as distributed databases, data warehousing, and data integration. We explore the decidability and complexity of the problem for workloads of conjunctive queries. We show that results differ significantly depending on whether the workload queries have self-joins. If queries can have self-joins, then a disjunctive viewset can be a better solution than any set of conjunctive views. We show that the problem of finding a minimal-size disjunctive viewset is decidable, and give an upper bound on its complexity. If workload queries cannot have self-joins, there is no need to consider disjunctive viewsets, and we show that the problem is in NP. We describe a very compact search space of conjunctive views, which contains all views in at least one optimal disjunctive viewset. We give a dynamic-programming algorithm for finding minimal-size disjunctive viewsets for queries without self-joins, and discuss heuristics to make the algorithm efficient.

The impact of the constant complement approach towards view updating

Views play an important role as a means to structure information with respect to specific users' needs. While read access through views is easy to handle, update requests through views are difficult in the sense that they have to be translated into appropriate updates on database relations. In this paper the "constant complement translator" approach towards view updating proposed by Bancilhon and Spyratos is revisited within the realm of SQL databases, and a novel characterization is established showing that constant complement translators exist precisely if users have a chance to undo all effects of their view updates using further view updates. Based on this characterization, view updates with and without constant complement translators are presented. As it turns out that users cannot fully understand updates on views violating the constant complement principle, the application of this principle in the context of external schema design is discussed.

View-based query containment

Query containment is the problem of checking whether for all databases the answer to a query is a subset of the answer to a second query. In several data management tasks, such as data integration, mobile computing, etc., the data of interest are only accessible through a given set of views. In this case, containment of queries should be determined relative to the set of views, as already noted in the literature. Such a form of containment, which we call view-based query containment, is the subject of this paper. The problem comes in various forms, depending on whether each of the two queries is expressed over the base alphabet or the alphabet of the view names. We present a thorough analysis of view-based query containment, by discussing all possible combinations from a semantic point of view, and by showing their mutual relationships. In particular, for the two settings of conjunctive queries and two-way regular path queries, we provide both techniques and complexity bounds for the different variants of the problem. Finally, we study the relationship between view-based query containment and view-based query rewriting.

The view selection problem for XML content based routing

We consider the view selection problem for XML content based routing: given a network, in which a stream of XML documents is routed and the routing decisions are taken based on results of evaluating XPath predicates on these documents, select a set of views that maximize the throughput of the network. While in view selection for relational queries the speedup comes from eliminating joins, here the speedup is obtained from gaining direct access to data values in an XML packet, without parsing that packet. The views in our context can be seen as a binary representation of the XML document, tailored for the network's workload. In this paper we define formally the view selection problem in the context of XML content based routing, and provide a practical solution for it. First, we formalize the problem; while the exact formulation is too complex to admit practical solutions, we show that it can be simplified to a manageable optimization problem, without loss in precision. Second, we show that the simplified problem can be reduced to the Integer Cover problem. The Integer Cover problem is known to be NP-hard, and to admit a log n greedy approximation algorithm. Third, we show that the same greedy approximation algorithm performs much better on a class of workloads called 'hierarchical workloads', which are typical in XML stream processing. Namely, it returns an optimal solution for hierarchical workloads, and degrades gracefully to the log n general bound as the workload becomes less hierarchical.

Computing full disjunctions

Under either the OR-semantics or the weak semantics, the answer to a query over semistructured data consists of maximal rather than complete matchings, i.e., some query variables may be assigned null values. In the relational model, a similar effect is achieved by computing the full disjunction (rather than the natural join or equijoin) of the given relations. It is shown that under either the OR-semantics or the weak semantics, query evaluation has a polynomial-time complexity in the size of the query, the database and the result. It is also shown that the evaluation of full disjunctions is reducible to query evaluation under the weak semantics and hence can be done in polynomial time in the size of the input and the output. Complexity results are also given for two related problems. One is evaluating a projection of the full disjunction and the other is evaluating the set of all tuples in the full disjunction that are non-null on some given attributes. In the special case of γ-acyclic relation schemes, both problems have polynomial-time algorithms in the size of the input and the output. In the general case, such algorithms do not exist, assuming that P ≠ NP. Finally, it is shown that the weak semantics can generalize full disjunctions by allowing tuples to be joined according to general types of conditions, rather than just equalities among attributes.
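
For two relations, the full disjunction coincides with the familiar full outer join; the tiny pandas example below (made-up data) shows the null-padded maximal matchings the abstract refers to. For three or more relations, pairwise outer joins no longer suffice in general, which is where the paper's polynomial-time results matter.

```python
import pandas as pd

# Made-up relations: tuples join where they can and are padded with nulls
# where they cannot, so every maximal matching appears exactly once.
emp = pd.DataFrame({"name": ["ann", "bob"], "dept": ["toys", "books"]})
dept = pd.DataFrame({"dept": ["toys", "garden"], "floor": [1, 3]})

full_disjunction_of_two = pd.merge(emp, dept, on="dept", how="outer")
print(full_disjunction_of_two)
# ann/toys joins; bob/books gets a null floor; garden/3 gets a null name.
```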

Data exchange: getting to the core

Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema that reflects the source data as accurately as possible. Given a source instance, there may be many solutions to the data exchange problem, that is, many target instances that satisfy the constraints of the data exchange problem. In an earlier paper, we identified a special class of solutions that we call universal. A universal solution has homomorphisms into every possible solution, and hence is a "most general possible" solution. Nonetheless, given a source instance, there may be many universal solutions. This naturally raises the question of whether there is a "best" universal solution, and hence a best solution for data exchange. We answer this question by considering the well-known notion of the core of a structure, a notion that was first studied in graph theory, but has also played a role in conjunctive-query processing. The core of a structure is the smallest substructure that is also a homomorphic image of the structure. All universal solutions have the same core (up to isomorphism); we show that this core is also a universal solution, and hence the smallest universal solution. The uniqueness of the core of a universal solution together with its minimality make the core an ideal solution for data exchange. Furthermore, we show that the core is the best among all universal solutions for answering unions of conjunctive queries with inequalities. After this, we investigate the computational complexity of producing the core. Well-known results by Chandra and Merlin imply that, unless P = NP, there is no polynomial-time algorithm that, given a structure as input, returns the core of that structure as output. In contrast, in the context of data exchange, we identify natural and fairly broad conditions under which there are polynomial-time algorithms for computing the core of a universal solution. Finally, we analyze the computational complexity of the following decision problem that underlies the computation of cores: given two graphs G and H, is H the core of G? Earlier results imply that this problem is both NP-hard and coNP-hard. Here, we pinpoint its exact complexity by establishing that it is a DP-complete problem.
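
As a concrete illustration of the central notion, here is a brute-force sketch (tiny made-up undirected graph, exponential search) of a core as "the smallest substructure that is also a homomorphic image"; the paper's point is precisely that cores of universal solutions can often be computed in polynomial time, which this naive search does not attempt.

```python
from itertools import product, combinations

def has_hom(nodes_g, edges_g, nodes_h, edges_h):
    """Is there a homomorphism from G to H? (checked by brute force)"""
    for images in product(nodes_h, repeat=len(nodes_g)):
        mapping = dict(zip(nodes_g, images))
        if all((mapping[u], mapping[v]) in edges_h for (u, v) in edges_g):
            return True
    return False

def core(nodes, edges):
    """Smallest induced subgraph C (by vertex count) with a homomorphism G -> C."""
    for size in range(1, len(nodes) + 1):
        for sub in combinations(nodes, size):
            keep = set(sub)
            sub_edges = {(u, v) for (u, v) in edges if u in keep and v in keep}
            if has_hom(nodes, edges, sub, sub_edges):
                return keep, sub_edges
    return set(nodes), edges

# A triangle with a pendant path retracts onto the triangle, so the triangle
# is its core. Edges are stored in both directions to model an undirected graph.
nodes = [1, 2, 3, 4, 5]
undirected = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
edges = {(u, v) for (a, b) in undirected for (u, v) in ((a, b), (b, a))}
print(core(nodes, edges))   # expected vertex set: {1, 2, 3}
```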

Soft stratification for magic set based query evaluation in deductive databases

In this paper we propose a new bottom-up query evaluation method for stratified deductive databases based on the Magic Set approach. As the Magic Sets rewriting may lead to unstratifiable rules, we propose to use Kerisit's weak consequence operator to compute the well-founded model of magic rules (guaranteed to be two-valued). We show that its application in combination with the concept of weak stratification, however, may lead to a set of answers which is neither sound nor complete with respect to the well-founded model. This problem is cured by instead introducing the new concept of soft stratification.

Query containment and rewriting using views for regular path queries under constraints

In this paper we consider general path constraints for semistructured databases. Our general constraints do not suffer from the limitations of the path constraints previously studied in the literature. We investigate the containment of regular path queries under general path constraints. We show that when the path constraints and queries are expressed by words, as opposed to languages, the containment problem becomes equivalent to the word rewrite problem for a corresponding semi-Thue system. Consequently, if the corresponding semi-Thue system has an undecidable word problem, the word query containment problem will be undecidable too. Also, we show that there are word constraints where the corresponding semi-Thue system has a decidable word rewrite problem, but general query containment under these word constraints is undecidable. In order to overcome this, we exhibit a large, practical class of word constraints with a decidable general query containment problem. Based on the query containment under constraints, we reason about constrained rewritings of regular path queries using views. We give a constructive characterization for computing optimal constrained rewritings using views.

Concise descriptions of subsets of structured sets

We study the problem of economical representation of subsets of structured sets, that is, sets equipped with a set cover. Given a structured set U, and a language L whose expressions define subsets of U, the problem of Minimum Description Length in L (L-MDL) is: "given a subset V of U, find a shortest string in L that defines V". We show that the simple set cover is enough to model a number of realistic database structures. We focus on two important families: hierarchical and multidimensional organizations. The former is found in the context of semistructured data such as XML, the latter in the context of statistical and OLAP databases. In the case of general OLAP databases, data organization is a mixture of multidimensionality and hierarchy, which can also be viewed naturally as a structured set. We study the complexity of the L-MDL problem in several settings, and provide an efficient algorithm for the hierarchical case. Finally, we illustrate the application of the theory to summarization of large result sets, (multi) query optimization for ROLAP queries, and XML queries.
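
To make the hierarchical case concrete, here is a small illustrative instance (made-up tree, and a deliberately simple description language consisting of unions of node names, which may be narrower than the paper's L): a shortest description consists of the maximal nodes whose leaves all lie in the target subset, found by one recursive pass.

```python
tree = {                      # node -> children; leaves have no entry
    "year": ["q1", "q2"],
    "q1": ["jan", "feb", "mar"],
    "q2": ["apr", "may", "jun"],
}

def leaves(node):
    kids = tree.get(node, [])
    if not kids:
        return {node}
    return set().union(*(leaves(k) for k in kids))

def shortest_description(node, target):
    """Minimal list of node names whose leaf sets cover exactly the wanted leaves in this subtree."""
    below = leaves(node)
    if below <= target:
        return [node]                      # the whole subtree is wanted: one name suffices
    if below.isdisjoint(target):
        return []
    out = []
    for k in tree.get(node, []):
        out += shortest_description(k, target)
    return out

wanted = {"jan", "feb", "mar", "apr", "may"}
print(shortest_description("year", wanted))   # ['q1', 'apr', 'may']
```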

On producing join results early

Support for exploratory interaction with databases in applications such as data mining requires that the first few results of an operation be available as quickly as possible. We study the algorithmic side of what can and what cannot be achieved for processing join operations. We develop strategies that modify the strict two-phase processing of the sort-merge paradigm, intermingling join steps with selected merge phases of the sort. We propose an algorithm that produces early join results for a broad class of join problems, including many not addressed well by hash-based algorithms. Our algorithm has no significant increase in the number of I/O operations needed to complete the join compared to standard sort-merge algorithms.

Correlating XML data streams using tree-edit distance embeddings

We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L1 vector space while guaranteeing an upper bound of O(log² n log* n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to: (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and, (2) approximate the result of tree-edit distance similarity joins over continuous XML document streams. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.

Numerical document queries

A query against a database behind a site like Napster may search, e.g., for all users who have downloaded more jazz titles than pop music titles. In order to express such queries, we extend classical monadic second-order logic by Presburger predicates which pose numerical restrictions on the children (content) of an element node and provide a precise automata-theoretic characterization. While the existential fragment of the resulting logic is decidable, it turns out that satisfiability of the full logic is undecidable. Decidable satisfiability and a querying algorithm even with linear data complexity can be obtained if numerical constraints are only applied to those contents of elements where ordering is irrelevant. Finally, it is sketched how these techniques can be extended also to answer questions like, e.g., whether the total price of the jazz music downloaded so far exceeds a user's budget.

Typing and querying XML documents: some complexity bounds

We study the complexity of validating XML documents, viewed as labeled unranked ordered trees, against various typing systems such as DTDs, XML Schemas, and tree automata. We also consider the complexity of query evaluation for various fragments of XPath. For both problems, validation and query evaluation, we consider data and combined complexity bounds.

The complexity of XPath query evaluation

In this paper, we study the precise complexity of XPath 1.0 query processing. Even though it is heavily used through its incorporation into a variety of XML-related standards, the precise cost of evaluating an XPath query is not yet well understood. The first polynomial-time algorithm for XPath processing (with respect to combined complexity) was proposed only recently, and even to this day all major XPath engines take time exponential in the size of the input queries. From the standpoint of theory, the precise complexity of XPath query evaluation is open, and it is thus unknown whether the query evaluation problem can be parallelized. In this work, we show that both the data complexity and the query complexity of XPath 1.0 fall into lower (highly parallelizable) complexity classes, but that the combined complexity is PTIME-hard. Subsequently, we study the sources of this hardness and identify a large and practically important fragment of XPath 1.0 for which the combined complexity is LOGCFL-complete and, therefore, in the highly parallelizable complexity class NC².

Query-preserving watermarking of relational databases and XML documents

Watermarking allows robust and unobtrusive insertion of information in a digital document. Very recently, techniques have been proposed for watermarking relational databases or XML documents, where information insertion must preserve a specific measure on the data (e.g., the mean and variance of numerical attributes). In this paper we investigate the problem of watermarking databases or XML documents while preserving a set of parametric queries in a specified language, up to an acceptable distortion. We first observe that unrestricted databases cannot be watermarked while preserving trivial parametric queries. We then exhibit query languages and classes of structures that allow guaranteed watermarking capacity, namely (1) local query languages on structures whose Gaifman graph has bounded degree, and (2) monadic second-order queries on trees or tree-like structures. We relate these results to an important topic in computational learning theory, the VC-dimension. We finally consider incremental aspects of query-preserving watermarking.

Revealing information while preserving privacy

We examine the tradeoff between privacy and usability of statistical databases. We model a statistical database by an n-bit string d_1, ..., d_n, with a query being a subset q ⊆ [n] to be answered by Σ_{i∈q} d_i. Our main result is a polynomial reconstruction algorithm of data from noisy (perturbed) subset sums. Applying this reconstruction algorithm to statistical databases, we show that in order to achieve privacy one has to add perturbation of magnitude Ω(√n). That is, smaller perturbation always results in a strong violation of privacy. We show that this result is tight by exemplifying access algorithms for statistical databases that preserve privacy while adding perturbation of magnitude Õ(√n). For time-T bounded adversaries we demonstrate a privacy-preserving access algorithm whose perturbation magnitude is ≈ √T.
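
The following numpy experiment (illustrative only, not the paper's LP-based algorithm) makes the query model concrete: answer many random subset-sum queries with perturbation much smaller than √n, then recover almost all bits by least squares and rounding.

```python
import numpy as np

rng = np.random.default_rng(1)

# The model: the database is an n-bit string d, a query is a subset q of [n],
# and the (perturbed) answer is sum_{i in q} d_i plus bounded noise.
n = 128
d = rng.integers(0, 2, n)

num_queries = 8 * n
Q = rng.integers(0, 2, (num_queries, n))        # each 0/1 row encodes one subset
noise_magnitude = 2                             # much smaller than sqrt(n) ~ 11
answers = Q @ d + rng.integers(-noise_magnitude, noise_magnitude + 1, num_queries)

# Stand-in reconstruction: least squares plus rounding (the paper's algorithm
# uses linear programming, but the effect is similar for such small noise).
estimate, *_ = np.linalg.lstsq(Q, answers, rcond=None)
reconstructed = (estimate > 0.5).astype(int)

print(f"fraction of bits recovered: {(reconstructed == d).mean():.3f}")
```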

Limiting privacy breaches in privacy preserving data mining

There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average. In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from [9] to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.
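
As a miniature of the randomize-then-compensate pipeline the abstract describes (and only that; the paper's amplification analysis, breach bounds, and association-rule specifics are not captured here), the sketch below applies plain per-item randomized response and then inverts the randomization at the aggregate level. All numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Each 0/1 entry ("does this transaction contain the item?") is flipped
# independently with probability rho before disclosure; the miner never sees
# the true bits, only the randomized ones.
n, rho = 100_000, 0.3
true_bits = rng.random(n) < 0.15                 # true item frequency: 15%
flips = rng.random(n) < rho
randomized = true_bits ^ flips

# Compensate at the aggregate level: E[observed] = f(1 - rho) + (1 - f) rho,
# so the true frequency f can be estimated from the observed frequency alone.
observed = randomized.mean()
estimated = (observed - rho) / (1 - 2 * rho)
print(f"observed frequency {observed:.3f}, compensated estimate {estimated:.3f}")
```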

Maintaining time-decaying stream aggregates

We formalize the problem of maintaining time-decaying aggregates and statistics of a data stream: the relative contribution of each data item to the aggregate is scaled down by a factor that depends on, and is non-decreasing with, elapsed time. Time-decaying aggregates are used in applications where the significance of data items decreases over time. We develop storage-efficient algorithms, and establish upper and lower bounds. Surprisingly, even though maintaining decayed aggregates has become a widely used tool, our work seems to be the first both to explore it formally and to provide storage-efficient algorithms for important families of decay functions, including polynomial decay.
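
For contrast with the decay functions the paper targets, here is the easy case in a few lines: an exponentially decayed sum folds into a single running value, which is exactly why families such as polynomial decay, where no such folding exists, need storage-efficient constructions of the kind the abstract refers to. The class and parameters below are illustrative.

```python
import math

class ExpDecayedSum:
    """Exponentially decayed sum: each item's contribution is scaled by
    exp(-lam * (now - arrival_time)). The whole aggregate folds into one
    running value, so constant space suffices; decay functions such as
    polynomial decay do not fold this way."""

    def __init__(self, lam):
        self.lam = lam
        self.value = 0.0
        self.last_time = 0.0

    def add(self, item_value, timestamp):
        # Age the running sum to the new timestamp, then add the new item.
        self.value *= math.exp(-self.lam * (timestamp - self.last_time))
        self.value += item_value
        self.last_time = timestamp

    def query(self, timestamp):
        return self.value * math.exp(-self.lam * (timestamp - self.last_time))

s = ExpDecayedSum(lam=0.1)
for t, v in [(0, 5.0), (1, 3.0), (4, 2.0)]:
    s.add(v, t)
print(round(s.query(10.0), 4))
```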

Maintaining variance and k-medians over data stream windows

The sliding window model is useful for discounting stale data in data stream applications. In this model, data elements arrive continually and only the most recent N elements are used when answering queries. We present a novel technique for solving two important and related problems in the sliding window model: maintaining variance and maintaining a k-median clustering. Our solution to the problem of maintaining variance provides a continually updated estimate of the variance of the last N values in a data stream with relative error of at most ε using O((1/ε²) log N) memory. We present a constant-factor approximation algorithm which maintains an approximate k-median solution for the last N data points using O((k/τ⁴) N log² N) memory, where τ < 1/2 is a parameter which trades off the space bound with the approximation factor of O(2^O(1/τ)).
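
For orientation, here is the naive exact baseline implied by the problem statement: keep the entire window so that expired elements can be subtracted from running sums. It stores all N values, which is the memory cost the paper's small-space sketch avoids; the code only illustrates the problem, not the paper's technique.

```python
from collections import deque

class SlidingWindowVariance:
    """Exact variance of the last N stream elements, maintained with running
    sums. This baseline keeps every value in the window so that the element
    that expires can be subtracted."""

    def __init__(self, N):
        self.N = N
        self.window = deque()
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        self.window.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.window) > self.N:
            old = self.window.popleft()
            self.total -= old
            self.total_sq -= old * old

    def variance(self):
        n = len(self.window)
        mean = self.total / n
        return self.total_sq / n - mean * mean

sv = SlidingWindowVariance(N=4)
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    sv.add(x)
print(sv.variance())   # population variance of the last 4 items [5.0, 5.0, 7.0, 9.0]
```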

Optimal indexing using near-minimal space

We consider the index selection problem. Given either a fixed query workload or an unknown probability distribution on possible future queries, and a bound B on how much space is available to build indices, we seek to build a collection of indices for which the average query response time is minimized. We give strong negative and positive performance bounds. Let m be the number of queries in the workload. We show how to obtain with high probability a collection of indices using space O(B ln m) for which the average query cost is opt_B, the optimal performance possible for indices using at most B total space. Moreover, this space relaxation is necessary: unless NP ⊆ DTIME(n^O(log log n)), no polynomial-time algorithm can guarantee average query cost less than M^(1-ε) opt_B using space αB, for any constant α, where M is the size of the dataset. We quantify the error in performance introduced by running the algorithm on a sample drawn from a query distribution.
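
To make the problem statement concrete, here is a generic benefit-per-space greedy over a made-up workload and made-up candidate indices. It is only a heuristic illustration of index selection under a space budget; the paper's algorithm and its space-relaxed guarantee rest on a more careful construction and analysis.

```python
SCAN_COST = 100.0

# Hypothetical candidate indices (space cost) and per-query response times
# when a given index is available; queries without a usable index pay SCAN_COST.
index_space = {"idx_a": 40, "idx_b": 70, "idx_c": 30}
query_cost = {
    "q1": {"idx_a": 10, "idx_b": 5},
    "q2": {"idx_b": 8},
    "q3": {"idx_c": 20, "idx_a": 60},
}

def workload_cost(chosen):
    total = 0.0
    for costs in query_cost.values():
        total += min([SCAN_COST] + [c for i, c in costs.items() if i in chosen])
    return total

def greedy_select(budget):
    chosen, remaining = set(), budget
    while True:
        base = workload_cost(chosen)
        best, best_ratio = None, 0.0
        for idx, space in index_space.items():
            if idx in chosen or space > remaining:
                continue
            gain = base - workload_cost(chosen | {idx})
            if gain > 0 and gain / space > best_ratio:
                best, best_ratio = idx, gain / space
        if best is None:
            return chosen
        chosen.add(best)
        remaining -= index_space[best]

print(greedy_select(budget=100))
```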

On nearest neighbor indexing of nonlinear trajectories

In recent years, the problem of indexing mobile objects has assumed great importance because of its relevance to a wide variety of applications. Most previous results in this area have proposed indexing schemes for objects with linear trajectories in one or two dimensions. In this paper, we present methods for indexing objects with nonlinear trajectories. Specifically, we identify a useful condition called the convex hull property and show that any trajectory satisfying this condition can be indexed by storing a careful representation of these objects in a traditional index structure. Since a wide variety of relevant nonlinear trajectories satisfy this condition, our result significantly expands the class of trajectories for which nearest neighbor indexing schemes can be devised. We also show that even though many nonlinear trajectories do not satisfy the convex hull condition, an approximate representation can often be found which satisfies it. We discuss examples of techniques which can be utilized to find representations that satisfy the convex hull property. We present empirical results to demonstrate the effectiveness of our indexing method.
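
The convex hull property can be illustrated with a quadratic Bézier trajectory (a hypothetical stand-in; the paper treats a broader class of nonlinear trajectories): the curve lies inside the convex hull of its control points, so a bounding box over the control points conservatively bounds the whole trajectory and can be stored in an ordinary spatial index for nearest-neighbor pruning.

```python
import numpy as np

# Control points of a hypothetical quadratic Bezier trajectory in the plane.
control = np.array([[0.0, 0.0], [4.0, 6.0], [8.0, 1.0]])   # P0, P1, P2

def bezier(t):
    # Quadratic Bezier: (1-t)^2 P0 + 2(1-t)t P1 + t^2 P2
    return ((1 - t) ** 2)[:, None] * control[0] + \
           (2 * (1 - t) * t)[:, None] * control[1] + \
           (t ** 2)[:, None] * control[2]

# A box over the control points bounds the curve, by the convex hull property.
lo, hi = control.min(axis=0), control.max(axis=0)
samples = bezier(np.linspace(0, 1, 1000))

inside = np.all((samples >= lo) & (samples <= hi))
print("all sampled trajectory points fall inside the control-point box:", inside)
```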

On the decidability and complexity of query answering over inconsistent and incomplete databases

In databases with integrity constraints, data may not satisfy the constraints. In this paper, we address the problem of obtaining consistent answers in such a setting, when key and inclusion dependencies are expressed on the database schema. We establish decidability and complexity results for query answering under different assumptions on data (soundness and/or completeness). In particular, after showing that the problem is in general undecidable, we identify the maximal class of inclusion dependencies under which query answering is decidable in the presence of key dependencies. Although obtained in a single database context, such results are directly applicable to data integration, where multiple information sources may provide data that are inconsistent with respect to the global view of the sources.

How to quickly find a witness

The subfield of itemset mining is essentially a collection of algorithms. Whenever a new type of constraint is discovered, a specialized algorithm is proposed to handle it. All of these algorithms are highly tuned to take advantage of the unique properties of their associated constraints, and so they are not very compatible with other constraints. In this paper we present a more unified view of mining constrained itemsets such that most existing algorithms can be easily extended to handle constraints for which they were not designed a priori. We apply this technique to mining itemsets with restrictions on their variance, a problem that has been open for several years in the data mining community.

Feasible itemset distributions in data mining: theory and application

Computing frequent itemsets and maximally frequent itemsets in a database are classic problems in data mining. The resource requirements of all extant algorithms for both problems depend on the distribution of frequent patterns, a topic that has not been formally investigated. In this paper, we study properties of length distributions of frequent and maximal frequent itemset collections and provide novel solutions for computing tight lower bounds for feasible distributions. We show how these bounding distributions can help in generating realistic synthetic datasets, which can be used for algorithm benchmarking.

What's hot and what's not: tracking most frequent items dynamically

Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the "hot items" in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in networking applications. We present a new algorithm for dynamically determining the hot items at any time in a relation that is undergoing deletion operations as well as inserts. Our algorithm maintains a small-space data structure that monitors the transactions on the relation and, when required, quickly outputs all hot items without rescanning the relation in the database. With user-specified probability, it is able to report all hot items. Our algorithm relies on the idea of "group testing", is simple to implement, and has provable quality, space and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithm is remarkably accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
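
As a hedged miniature of the decoding idea that group-testing approaches build on (not the paper's full algorithm, which layers randomized groups on top of this to report all items above a threshold with probabilistic guarantees): if a single item holds a strict majority of the current count, per-bit counters over item identifiers recover it exactly, and deletions are handled simply as negative updates.

```python
ID_BITS = 16

class MajorityDecoder:
    """Total count plus one counter per bit position of the item identifier."""

    def __init__(self):
        self.total = 0
        self.bit_counts = [0] * ID_BITS

    def update(self, item, delta):          # delta = +1 for insert, -1 for delete
        self.total += delta
        for b in range(ID_BITS):
            if (item >> b) & 1:
                self.bit_counts[b] += delta

    def candidate(self):
        """Identifier of the majority item, if one exists (otherwise garbage)."""
        item = 0
        for b in range(ID_BITS):
            if 2 * self.bit_counts[b] > self.total:
                item |= 1 << b
        return item

d = MajorityDecoder()
for x in [7, 7, 3, 7, 9, 7, 7]:
    d.update(x, +1)
d.update(9, -1)                             # a deletion is just a negative update
print(d.candidate())                        # 7 holds a strict majority, so this prints 7
```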