PODS '93: Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems

A call to order

Scientific applications are infrequent users of commercial database management systems. We feel that a key reason is that they do not offer good support for ordered data structures, such as multidimensional arrays, that are needed for natural representation of many scientific data types. In this paper, we lay out issues in database support of ordered structures, consider possible approaches along with their advantages and shortcomings, and direct the reader to the wide variety of prior work outside the data management field that might be successfully applied in this endeavor.

Reflective programming in the relational algebra

In reflective programming languages it is possible for a program to generate code that is integrated into the program's own execution. We introduce a reflective version of the relational algebra. Reflection is achieved by storing and manipulating relational algebra programs as relations in the database. We then study the expressibility and complexity of the reflective algebra thus obtained. It turns out that there is a close correspondence between reflection and bounded looping. We also discuss the applicability of the reflective algebra.
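The core idea, storing a relational algebra program as a relation and then executing it, can be sketched with a toy interpreter. The instruction layout, operator names, and evaluator below are our own illustration, not the paper's algebra:

```python
# A toy reflective evaluator: a program is itself a relation of
# numbered instruction tuples, so queries can build and run code.
db = {
    "R": {(1, 2), (2, 3)},
    "S": {(2, 3), (3, 4)},
    # the program relation: (step, op, arg1, arg2, result)
    "P": {(1, "union", "R", "S", "T"),
          (2, "diff", "T", "S", "U")},
}

def run(db, prog_name):
    """Execute the instructions stored in relation prog_name, in step
    order, writing each result back into the database."""
    for _, op, a, b, out in sorted(db[prog_name]):
        if op == "union":
            db[out] = db[a] | db[b]
        elif op == "diff":
            db[out] = db[a] - db[b]
    return db

run(db, "P")
print(db["U"])  # {(1, 2)}
```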

Normal forms and conservative properties for query languages over collection types

Strong normalization results are obtained for a general language for collection types. An induced normal form for sets and bags is then used to show that the class of functions whose input has height (that is, the maximal depth of nestings of sets/bags/lists in the complex object) at most i and output has height at most o definable in a nested relational query language without powerset operator is independent of the height of intermediate expressions used. Our proof holds regardless of whether the language is used for querying sets, bags, or lists, even in the presence of variant types. Moreover, the normal forms are useful in a general approach to query optimization. Paredaens and Van Gucht proved a similar result for the special case when i = o = 1. Their result is complemented by Hull and Su who demonstrated the failure of independence when powerset operator is present and i = o = 1. The theorem of Hull and Su was generalized to all i and o by Grumbach and Vianu. Our result generalizes Paredaens and Van Gucht's to all i and o, providing a counterpart to the theorem of Grumbach and Vianu.

Semantic representations and query languages for or-sets

Or-sets were introduced by Imielinski, Naqvi and Vadaparty for dealing with limited forms of disjunctive information in database queries. Independently, Rounds used a similar notion for representing disjunctive and conjunctive information in the context of situation theory. In this paper we formulate a query language with adequate expressive power for or-sets. Using the notion of normalization of or-sets, queries at the “structural” and “conceptual” levels are distinguished. Losslessness of normalization is established for a large class of queries. We have obtained upper bounds for the cost of normalization. An approach related to that of Rounds is used to provide semantics for or-sets.
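The structural versus conceptual distinction can be made concrete with a small sketch (the encoding of or-sets as tagged tuples is ours, not the paper's notation): normalizing a set that contains or-sets lifts the disjunction outward, producing the or-set of the set's possible readings.

```python
from itertools import product

# Encoding (ours): a value is a constant, a frozenset, or an or-set
# written ("or", (v1, v2, ...)), meaning "exactly one of the vi".

def worlds(value):
    """Enumerate the ordinary values a (possibly nested) value denotes."""
    if isinstance(value, tuple) and value and value[0] == "or":
        return [w for alt in value[1] for w in worlds(alt)]
    if isinstance(value, frozenset):
        return [frozenset(choice)
                for choice in product(*(worlds(m) for m in value))]
    return [value]

# The structural object {<1, 2>, 3} reads, conceptually, as the
# or-set of sets <{1, 3}, {2, 3}>:
s = frozenset({("or", (1, 2)), 3})
print(sorted(sorted(w) for w in worlds(s)))  # [[1, 3], [2, 3]]
```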

Towards tractable algebras for bags

Bags, i.e. sets with duplicates, are often used to implement relations in database systems. In this paper we study the expressive power of algebras for manipulating bags. The algebra we present is a simple extension of the nested relation algebra. Our aim is to investigate how the use of bags in the language extends its expressive power, and increases its complexity. We consider two main issues, namely (i) the relationship between the depth of bag nesting and the expressive power, and (ii) the relationship between the algebraic operations, and their complexity and expressive power. We show that the bag algebra is more expressive than the nested relation algebra (at all levels of nesting), and that the difference may be subtle. We establish a hierarchy based on the structure of algebra expressions. This hierarchy is shown to be highly related to the properties of the powerset operator.
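The extra expressive power comes from duplicates being observable. A few standard bag operations can be sketched with Python's Counter (the operation names are ours; the paper's algebra is over nested bags):

```python
from collections import Counter

# Counter as a flat bag: multiplicities distinguish queries that
# plain set algebra cannot express.

def additive_union(r, s):
    """Bag union: multiplicities add."""
    return r + s

def monus(r, s):
    """Bag difference: multiplicities subtract, floored at zero."""
    return r - s

def eps(r):
    """Duplicate elimination: every multiplicity becomes 1."""
    return Counter(dict.fromkeys(r, 1))

r = Counter({"a": 2, "b": 1})
s = Counter({"a": 1, "c": 1})
print(additive_union(r, s))  # a:3, b:1, c:1
print(monus(r, s))           # a:1, b:1
print(eps(r))                # a:1, b:1
```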

Optimization of real conjunctive queries

On the semantics of theory change: arbitration between old and new information

Katsuno and Mendelzon divide theory change, the problem of adding new information to a logical theory, into two types: revision and update. We propose a third type of theory change: arbitration. The key idea is the following: the new information is considered neither better nor worse than the old information represented by the logical theory. The new information is simply one voice against a set of others already incorporated into the logical theory. From this it follows that arbitration should be commutative. We first define arbitration by a set of postulates and then describe a model-theoretic characterization of arbitration for the case of propositional logical theories. We also study weighted arbitration, where different models of a theory can have different weights.

Extended commitment ordering, or guaranteeing global serializability by applying commitment order selectively to global transactions

The Extended Commitment Ordering (ECO) property of transaction histories (schedules) generalizes the Commitment Ordering (CO) property defined in [Raz 90]. In a multi-resource-manager (RM) environment, ECO guarantees global serializability when supported locally by each RM that participates in global transactions (i.e., transactions that span more than a single RM) and provides local serializability (by any mechanism). ECO assumes that an RM has the knowledge to distinguish local transactions (i.e., transactions confined to that RM) from global transactions. ECO imposes an order condition, similar to the CO condition, on the commit events of global transactions only, and thus, it is less constraining than CO. Like CO, ECO provides a fully distributed solution to the long-standing problem of guaranteeing global serializability across RMs with different concurrency control mechanisms. Also, like CO, no communication beyond atomic commitment (AC) protocol messages is required to enforce ECO. When RMs are provided with the information about transactions being local, and are coordinated solely via AC protocols (have the extended knowledge autonomy property), ECO, applied locally together with local serializability in each RM involved with global transactions, is a necessary condition for guaranteeing global serializability. ECO reduces to CO when all the transactions are assumed to be global (e.g., if no knowledge about the transactions being local is available).
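The CO condition itself is easy to state operationally: the order of commit events must agree with the order of conflicts. A minimal checker is sketched below (our illustration; ECO applies the same test to global transactions only):

```python
# Commitment Ordering, checked on a finished history: for every pair
# of committed transactions with a conflict, the commit events must
# occur in the same order as the conflict.

def respects_co(conflicts, commit_order):
    """conflicts: pairs (t1, t2) meaning an operation of t1 conflicted
    with and preceded an operation of t2.  commit_order: transaction
    ids in the order their commit events occurred."""
    pos = {t: i for i, t in enumerate(commit_order)}
    return all(pos[t1] < pos[t2]
               for t1, t2 in conflicts
               if t1 in pos and t2 in pos)

conflicts = [("T1", "T2"), ("T2", "T3")]
print(respects_co(conflicts, ["T1", "T2", "T3"]))  # True
print(respects_co(conflicts, ["T2", "T1", "T3"]))  # False
```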

On correctness of non-serializable executions

Equivalence, query-reachability and satisfiability in Datalog extensions

We consider the problems of equivalence, satisfiability and query-reachability for datalog programs with negation and dense-order constraints. These problems are important for optimizing datalog programs. We show that both query-reachability and satisfiability are decidable for programs with stratified negation provided that negation is applied only to EDB predicates or that all EDB predicates are unary. In the latter case, we show that equivalence is also decidable. The algorithms we present are also used to push constraints from a given query to the EDB predicates. Finally, we show that satisfiability is undecidable for datalog programs with unary IDB predicates, stratified negation and the interpreted predicate ≠.

An alternating fixpoint tailored to magic programs

We study applying the magic-sets transformation technique to Datalog programs with negation that may not have 2-valued well-founded models. In this general setting we encounter the problem that the well-founded model of the original program does not always agree with the well-founded model of the magic program derived by commonly used left-to-right sips on the query. In order to fix this disagreement we present a novel method that is obtained by slightly and naturally tailoring Van Gelder's alternating fixpoint technique [16] to a magic program.
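For readers unfamiliar with the underlying technique, Van Gelder's alternating fixpoint can be sketched on plain propositional programs. The encoding below is our own and omits the magic-sets side entirely:

```python
# A rule is (head, positive_body, negative_body).  tp_star derives
# everything provable when negated atoms are judged against a fixed
# set of "possibly true" atoms.

def tp_star(rules, assumed_true):
    derived, changed = set(), True
    while changed:
        changed = False
        for head, pos, neg in rules:
            if (head not in derived
                    and all(p in derived for p in pos)
                    and all(n not in assumed_true for n in neg)):
                derived.add(head)
                changed = True
    return derived

def well_founded(rules):
    """Alternate over- and under-estimates of the true atoms until the
    underestimate stabilizes (Van Gelder's alternating fixpoint)."""
    atoms = {a for h, pos, neg in rules for a in [h, *pos, *neg]}
    under = set()
    while True:
        over = tp_star(rules, under)       # overestimate of true atoms
        new_under = tp_star(rules, over)   # underestimate
        if new_under == under:
            return under, atoms - over     # (true, false); rest undefined
        under = new_under

# p :- not q.   q :- not p.   r :- not s.   (s has no rule)
rules = [("p", [], ["q"]), ("q", [], ["p"]), ("r", [], ["s"])]
true_atoms, false_atoms = well_founded(rules)
print(true_atoms, false_atoms)  # {'r'} {'s'}; p and q stay undefined
```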

Finding nonrecursive envelopes for Datalog predicates

In this paper, we study the ability of data-independent conjunctive expressions (envelopes) to approximate the fixpoints of Datalog predicates. We show that no effective procedure exists for finding envelopes that best approximate the fixpoint (tight envelopes). Moreover, the problem of determining the existence of tight envelopes is undecidable. The relationship between tight envelopes and the boundedness property is explored. Although the property of having tight envelopes seems weaker than boundedness, we note that a predicate can have a tight (lower) envelope iff it is bounded. On the other hand, there exist Datalog predicates that are not bounded but have tight (upper) envelopes. We relax our requirement for tight envelopes and settle for connected envelopes. An algorithm to determine connected envelopes for Datalog predicates is presented. We mention several applications of envelopes.

Negation and minimality in non-Horn databases

Two main approaches have been followed in the literature to give a semantics to non-Horn databases. The first one is based on considering the set of rules composing the programs as inference rules and interpreting the negation in the body as failure to prove. The other approach is based on the so-called closed-world assumption and its objective is to define a stronger notion of consequence from a theory than the classical one, where, very roughly, negative information can be inferred whenever its positive counterpart cannot be deduced from the theory. In this work we generalize the semantics for negation in logic programs, putting together the constructive nature of the rule-based deductive databases with the syntax-independence of the closed-world reasoning rules. These generalized semantics are shown to be a well-motivated and well-founded alternative to closed-world assumptions since they enjoy nice semantic and computational properties.

Complexity aspects of various semantics for disjunctive databases

This paper addresses complexity issues for important problems arising with disjunctive databases. In particular, the complexity of inference of a literal and a formula from a propositional disjunctive database under a variety of well-known disjunctive database semantics is investigated, as well as deciding whether a disjunctive database has a model under a particular semantics. The problems are located in appropriate slots of the polynomial hierarchy.

Query evaluation under the well-founded semantics

SLD resolution with negation as finite failure (or SLDNF) reflects the procedural interpretation of Horn-clause predicate logic as a programming language and forms the computational basis for Prolog systems. Despite its advantages in memory management, SLDNF is often not appropriate for query evaluation for three reasons: (a) it may not terminate due to infinite positive recursion; (b) it may not terminate due to infinite recursion through negation; and (c) it may repeatedly evaluate the same clause body literal, leading to unacceptable performance. We address all three problems for goal-oriented query evaluation of arbitrary programs by presenting an extension of SLDNF, called SLG resolution, with the following distinctive features: (i) SLG resolution is a partial deduction procedure, consisting of several transformations. Each query is transformed step by step into a set of answer clauses; (ii) SLG resolution is sound and ideally complete for all non-floundering queries with respect to all three-valued stable models (including the well-founded partial model); (iii) SLG resolution allows an arbitrary computation rule and an arbitrary control strategy for selecting transformations to apply; (iv) SLG resolution avoids both positive and negative loops and always terminates for programs with the bounded-term-size property; (v) SLG resolution has a polynomial time data complexity for well-founded negation. Restricted forms of SLG resolution are identified for definite, locally stratified, and modularly stratified programs, thereby shedding light on the role each transformation plays. To provide answers to a query under different three-valued stable models, SLG resolution can be enhanced by further processing of the derived set of answer clauses. SLG resolution makes many more clausal specifications into effective programs. With simple (user or computer generated) annotations, SLDNF resolution and SLG resolution can be fully integrated. Thus a system including SLG resolution is naturally upward compatible with Prolog. For all these reasons we believe that SLG resolution will provide the computational basis for the next generation of logic programming systems.
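The termination argument behind feature (iv) can be illustrated on ordinary reachability: left recursion over a cyclic graph makes SLDNF loop forever, whereas tabling the subgoals that have already been called, as SLG-style systems do, terminates on any finite graph. The worklist formulation below is our own sketch, not SLG itself:

```python
# Tabled reachability: 'called' records subgoals already issued, so a
# cyclic graph (which loops a naive left-recursive Prolog path/2)
# cannot cause infinite positive recursion.

edges = {("a", "b"), ("b", "a"), ("b", "c")}

def reach(src):
    called, answers = {src}, set()   # the table: subgoals and answers
    work = [src]
    while work:
        node = work.pop()
        for x, y in edges:
            if x == node and y not in answers:
                answers.add(y)
                if y not in called:  # never re-call a tabled subgoal
                    called.add(y)
                    work.append(y)
    return answers

print(sorted(reach("a")))  # ['a', 'b', 'c']
```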

Multiple join size estimation by virtual domains (extended abstract)

A model is described to estimate the size of intermediate relations produced by large relational algebra expressions, in particular, those containing several equi-joins. The intended application is within query optimization searches, where fast estimates are needed as many alternative plans are examined. It is shown that previous methods, which use an independence assumption when several attributes are joined, can lead to unrealistically low size estimates. This method attempts to overcome that problem by the introduction of “virtual domains”, which avoid the independence assumption. The method does not require extensive statistics about the database. After describing an “exact” version, an approximation that is simpler and faster is presented.
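The underestimation the paper targets is easy to reproduce numerically. Under the independence assumption, per-attribute join selectivities are multiplied; if the join attributes are in fact correlated, the estimate can be off by orders of magnitude. The numbers below are a made-up illustration, not the paper's model:

```python
# Classical equijoin estimate on several join attributes, assuming the
# attributes are independent: selectivities 1/d_i simply multiply.

def multi_attr_join_size(nr, ns, distinct_counts):
    sel = 1.0
    for d in distinct_counts:
        sel /= d
    return nr * ns * sel

# Two join attributes with 100 distinct values each.  If B is in fact
# a function of A (perfect correlation), the true selectivity is
# 1/100, but independence predicts 1/10000:
print(multi_attr_join_size(1000, 1000, [100, 100]))  # 100.0 (estimate)
print(1000 * 1000 / 100)                             # 10000 (true size)
```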

Fixed-precision estimation of join selectivity

We compare the performance of sampling-based procedures for estimation of the selectivity of an equijoin. While some of the procedures have been proposed in the database sampling literature, their relative performance has never been analyzed. A main result of this paper is a partial ordering that compares the variability of the estimators for the different procedures after an arbitrary fixed number of sampling steps. Prior to the current work, it was also unknown whether these fixed-step estimation procedures can be extended to asymptotically efficient fixed-precision estimation procedures. Our second main result is a general method for such an extension and a proof that the method is valid for all the estimation procedures under consideration. Finally, we show that, under reasonable assumptions on sampling costs, the partial ordering on the variability of the fixed-step estimation procedures implies a partial ordering on the cost of the corresponding fixed-precision estimation procedures. These results lead to a new algorithm for fixed-precision estimation of the selectivity of an equijoin. The algorithm appears to be the best available when there are no indices on the join key. Our results can be extended to general select-join queries.
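The fixed-step to fixed-precision extension can be sketched generically: keep taking sampling steps until the estimate's confidence interval, under a normal approximation, is narrow enough relative to the estimate. The stopping rule, constants, and toy data below are our own, not the paper's procedures:

```python
import math
import random

def estimate_selectivity(sample_step, eps=0.05, z=1.96, min_steps=100):
    """Sequential estimation: run unbiased sampling steps until the
    relative half-width of an approximate 95% interval drops below
    eps.  Assumes the true selectivity is nonzero."""
    n, total, total_sq = 0, 0.0, 0.0
    while True:
        x = sample_step()            # one step's selectivity estimate
        n += 1
        total += x
        total_sq += x * x
        if n >= min_steps:
            mean = total / n
            var = max(total_sq / n - mean * mean, 0.0)
            if mean > 0 and z * math.sqrt(var / n) <= eps * mean:
                return mean

# Toy data: join values uniform over 20 keys, so true selectivity is
# 1/20.  One step samples an R-tuple and scans S for its matches.
random.seed(1)
S = [random.randrange(20) for _ in range(500)]

def step():
    a = random.randrange(20)         # a random R-tuple's join value
    return sum(1 for b in S if b == a) / len(S)

print(round(estimate_selectivity(step), 3))  # close to 1/20 = 0.05
```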

On the feasibility of checking temporal integrity constraints

We analyze the computational feasibility of checking temporal integrity constraints formulated in some sublanguages of first-order temporal logic. Our results illustrate the impact of the quantification on the complexity of this problem. The presence of a single quantifier in the scope of a temporal operator makes the problem undecidable. On the other hand, if no quantifiers are in the scope of a temporal operator and all the quantifiers are universal, temporal integrity checking can be done in exponential time.

Towards an analysis of range query performance in spatial data structures

In this paper, we motivate four different user-defined window query classes and derive a probabilistic model for each of them. For each model, we characterize the efficiency of spatial data structures in terms of the expected number of data bucket accesses needed to perform a window query. Our analytical approach exhibits the performance phenomena independently of data structure and implementation details, and regardless of whether the objects are points or non-point objects.

Blocking for external graph searching

In this paper, we consider the problem of using disk blocks efficiently in searching graphs that are too large to fit in internal memory. Our model allows a vertex to be represented any number of times on the disk in order to take advantage of redundancy. We give matching upper and lower bounds for complete d-ary trees and d-dimensional grid graphs, as well as for classes of general graphs that intuitively speaking have a close to uniform number of neighbors around each vertex. We also show that for the special case of grid graphs blocked with isothetic hypercubes, there is a provably better speed-up if even a small amount of redundancy is permitted.

Indexing for data models with constraints and classes (extended abstract)

We examine I/O-efficient data structures that provide indexing support for new data models. The database languages of these models include concepts from constraint programming (e.g., relational tuples are generalized to conjunctions of constraints) and from object-oriented programming (e.g., objects are organized in class hierarchies). Let n be the size of the database, c the number of classes, B the secondary storage page size, and t the size of the output of a query. Indexing by one attribute in the constraint data model (for a fairly general type of constraints) is equivalent to external dynamic interval management, which is a special case of external dynamic 2-dimensional range searching. We present a semi-dynamic data structure for this problem which has optimal worst-case space O(n/B) pages and optimal query I/O time O(logBn + t/B) and has O(logBn + (log2Bn)/B) amortized insert I/O time. If the order of the insertions is random then the expected number of I/O operations needed to perform insertions is reduced to O(logBn). Indexing by one attribute and by class name in an object-oriented model, where objects are organized as a forest hierarchy of classes, is also a special case of external dynamic 2-dimensional range searching. Based on this observation we first identify a simple algorithm with good worst-case performance for the class indexing problem. Using the forest structure of the class hierarchy and techniques from the constraint indexing problem, we improve its query I/O time from O(log2c logBn + t/B) to O(logBn + log2c + t/B).
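The reduction mentioned in the abstract is worth spelling out: a one-attribute constraint tuple is an interval [lo, hi], and indexing it means answering stabbing queries. Viewing each interval as the 2-d point (lo, hi), the answers to a stab at q are exactly the points with lo <= q and hi >= q, which is why this is a special case of 2-dimensional range searching. Below is only the in-memory query the external structure supports, as a trivial point of reference (the paper's contribution is doing this I/O-efficiently on disk):

```python
# Stabbing query: report all intervals that contain the point q.
# Each interval (lo, hi) is a 2-d point; the answers lie in the
# quadrant lo <= q and hi >= q.

def stab(intervals, q):
    return [(lo, hi) for lo, hi in intervals if lo <= q <= hi]

ivls = [(1, 5), (3, 9), (6, 8), (2, 4)]
print(stab(ivls, 4))  # [(1, 5), (3, 9), (2, 4)]
```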

Completeness results for recursive data bases

We consider infinite recursive (i.e., computable) relational data bases. Since the set of computable queries on such data bases is not closed under even simple relational operations, one must either make do with a very humble class of queries or considerably restrict the class of allowed data bases. We define two query languages, one for each of these possibilities, and prove their completeness. The first is the language of quantifier-free first-order logic, which is shown to be complete for the non-restricted case. The second is an appropriately modified version of Chandra and Harel's complete language QL, which is proved complete for the case of “highly symmetric” data bases, i.e., ones whose set of automorphisms is of finite index for each tuple-width. We also address the related notion of BP-completeness.

Safety and translation of calculus queries with scalar functions

Database method schemas and object creation

The expressiveness of various object-oriented languages is investigated with respect to their ability to create new objects. We focus on database method schemas (dms), a model capturing the data manipulation capabilities of a large class of deterministic methods in object-oriented databases. The results clarify the impact of various language constructs on object creation. Several new constructs based on expanded notions of deep equality are introduced. In particular, we provide a tractable construct which yields a language complete with respect to object creation. The new construct is also relevant to query complexity. For example, it allows expressing in polynomial time some queries, like counting, requiring exponential space in dms alone.

Context-based synchronization: an approach beyond semantics for concurrency control

Strict histories in object-based database systems

Towards a unified theory of concurrency control and recovery