PODS '90: Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

Research directions in object-oriented database systems

The set of object-oriented concepts found in object-oriented programming languages forms a good basis for a data model for post-relational database systems that will extend the domain of database applications beyond conventional business data processing. However, despite the high level of research and development activities during the past several years, there is no standard object-oriented data model, and criticisms and concerns about the field still remain. In this paper, I will first provide a historical perspective on the emergence of object-oriented database systems in order to derive a definition of object-oriented database systems. I will then examine a number of major challenges that remain for researchers and implementers of object-oriented database systems.

Method schemas

The concept of method schemas is proposed as a simple model for object-oriented programming with features such as classes with methods and inheritance, method name overloading, and late binding. An important issue is to check whether a given method schema can possibly lead to inconsistencies in some interpretations. The consistency problem for method schemas is studied. The problem is shown to be undecidable in general. Decidability is obtained for monadic and/or recursion-free method schemas. The effect of covariance is considered. The issues of incremental consistency checking and of a sound algorithm for the general case are briefly discussed.
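
As a rough illustration of the consistency issue (the classes and methods below are invented for this sketch and are not the paper's formalism), name overloading combined with late binding means the code reached by a call depends on the receiver's run-time class, so a schema may be well-typed under one interpretation and fail under another:

```python
# Illustrative only: overloading plus late binding can make a composite
# method well-typed for some receivers and ill-typed for others.

class Box:
    def size(self):          # one variant of the overloaded name: a number
        return 42

class Label:
    def size(self):          # another variant: text
        return "large"

def grow(obj):
    # a composite method; consistent only if every receiver's `size`
    # resolves, under late binding, to a variant returning a number
    return obj.size() + 1

grow(Box())      # 43 -- fine
grow(Label())    # TypeError at run time: the kind of inconsistency a
                 # static consistency check on the schema should rule out
```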

Representability of design objects by ancestor-controlled hierarchical specifications

A simple model, called a VDAG, is proposed for representing hierarchically specified design data in CAD database systems where there are to be alternate expansions of hierarchically specified modules. The model uses an ancestor-based expansion scheme to control which instances of submodules are to be placed within each instance of a given module. The approach is aimed at reducing storage space in engineering design database systems, and providing a means for designers to specify alternate expansions of a module. The expressive power of the VDAG model is investigated, and the set of design forests which are VDAG-generable is characterized. The problem of determining whether a given design forest is VDAG-generable is shown to be NP-complete, even when the height of the forest is bounded. However, it is shown that determining whether a given forest is VDAG-generable and producing such a VDAG if it exists, can be partitioned into a number of simpler subproblems, each of which may not be too computationally difficult in practice. Furthermore, for forests in a special natural class that has broad applicability, a polynomial time algorithm is provided that determines whether a given forest is VDAG-generable, and produces such a VDAG if it exists. However, we show that it is NP-hard to produce a minimum-sized such VDAG for forests in this special class, even when the height of the forest is bounded.

Query size estimation by adaptive sampling (extended abstract)

We present an adaptive, random sampling algorithm for estimating the size of general queries. The algorithm can be used for any query Q over a database D such that 1) for some n, the answer to Q can be partitioned into n disjoint subsets Q1, Q2, …, Qn; 2) for 1 ≤ i ≤ n, the size of Qi is bounded by some function b(D, Q); and 3) there is some algorithm by which we can compute the size of Qi for a randomly chosen i. We consider the performance of the algorithm on three special cases: join queries, transitive closure queries, and general recursive Datalog queries.
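
A minimal sketch of the sampling loop implied by conditions 1-3 follows; the stopping rule and its constants are illustrative assumptions, not the paper's analysis:

```python
import random

def estimate_query_size(n, partition_size, bound, threshold_factor=10.0,
                        max_samples=100_000):
    """Estimate |Q| = |Q1| + ... + |Qn|.  `partition_size(i)` computes |Qi|
    for a randomly chosen i (condition 3); `bound` is b(D, Q) from
    condition 2.  Sampling stops once the accumulated size passes a
    threshold proportional to the bound, and the sample mean is then
    scaled up to all n partitions."""
    total, samples = 0, 0
    threshold = threshold_factor * bound
    while total < threshold and samples < max_samples:
        i = random.randint(1, n)
        total += partition_size(i)
        samples += 1
    return n * total / samples

# e.g. for a join R ⋈ S partitioned by the tuples of R, partition_size(i)
# counts the S-tuples that join with the i-th tuple of R.
```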

Deriving constraints among argument sizes in logic programs (extended abstract)

In a logic program the feasible argument sizes of derivable facts involving an n-ary predicate are viewed as a set of points in the positive orthant of R^n. We investigate a method of deriving constraints on the feasible set in the form of a polyhedral convex set in the positive orthant, which we call a polycone. Faces of this polycone represent inequalities proven to hold among the argument sizes. These inequalities are often useful for selecting an evaluation method that is guaranteed to terminate for a given logic procedure. The methods may be applicable to other languages in which the sizes of data structures can be determined syntactically. We introduce a generalized Tucker representation for systems of linear equations and show how the needed operations on polycones are performed in this representation. We prove that every polycone has a unique normal form in this representation, and give an algorithm to produce it. This in turn gives a decision procedure for the question of whether two sets of linear equations define the same polycone. When a predicate has several rules, the union of the individual rules' polycones gives the set of feasible argument size vectors for the predicate. Because this set is not necessarily convex, we instead operate with the smallest enclosing polycone, which is the closure of the convex hull of the union. Retaining convexity is one of the key features of our technique. Recursion is handled by finding a polycone that is a fixpoint of a transformation derived from both the recursive and nonrecursive rules. Some methods for finding a fixpoint are presented, but there are many unresolved problems in this area.
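
For concreteness, a minimal example of the kind of constraint such an analysis derives, assuming list length as the size measure (the example is not taken from the paper):

```latex
% Sizes of the arguments of append(X, Y, Z): every derivable fact satisfies
%   |Z| = |X| + |Y|,  |X| >= 0,  |Y| >= 0,
% so the feasible argument-size vectors form the polycone
\[
  P_{\mathrm{append}}
    = \{ (x, y, z) \in \mathbb{R}^{3} \mid x \ge 0,\; y \ge 0,\; z = x + y \},
\]
% and the face z = x + y shows that x and y are bounded whenever z is,
% which justifies an evaluation order in which the third argument is bound.
```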

On the expressive power of datalog: tools and a case study

We study here the language Datalog(≠), which is the query language obtained from Datalog by allowing equalities and inequalities in the bodies of the rules. We view Datalog(≠) as a fragment of an infinitary logic Lω and show that Lω can be characterized in terms of certain two-person pebble games. This characterization provides us with tools for investigating the expressive power of Datalog(≠). As a case study, we classify the expressibility of fixed subgraph homeomorphism queries on directed graphs. Fortune et al. [FHW80] classified the computational complexity of these queries by establishing two dichotomies, which are proper only if P ≠ NP. Without using any complexity-theoretic assumptions, we show here that the two dichotomies are indeed proper in terms of expressibility in Datalog(≠).

Load control for locking: the “half-and-half” approach

A number of concurrency control performance studies have shown that, under high levels of data contention, concurrency control algorithms can exhibit thrashing behavior which is detrimental to overall system performance. In this paper, we present an approach to eliminating thrashing in the case of two-phase locking, a widely used concurrency control algorithm. Our solution, which we call the 'Half-and-Half' Algorithm, involves monitoring the state of the DBMS in order to dynamically control the multiprogramming level of the system. Results from a performance study indicate that the Half-and-Half algorithm can be very effective at preventing thrashing under a wide range of operating conditions and workloads.
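
The general shape of such a conflict-driven admission control loop can be sketched as follows; the trigger and the amount by which the multiprogramming level is cut are illustrative assumptions, not the paper's actual Half-and-Half rule:

```python
# Sketch of conflict-driven control of the multiprogramming level (MPL).
# The 0.5 trigger and the "admit only the running half" reaction below are
# assumptions made for illustration.

def adjust_mpl(active, blocked, mpl, mpl_min=1, mpl_max=200):
    """active: transactions currently runnable; blocked: transactions waiting
    on locks; mpl: the admission limit returned by the previous call."""
    total = active + blocked
    if total == 0:
        return mpl
    blocked_fraction = blocked / total
    if blocked_fraction > 0.5:            # thrashing symptom: most are waiting
        mpl = max(mpl_min, total // 2)    # shrink admissions to the running half
    elif blocked_fraction < 0.2:          # little contention: admit more work
        mpl = min(mpl_max, mpl + 1)
    return mpl

# A transaction manager would call adjust_mpl periodically and defer newly
# arriving transactions whenever the number admitted has reached the limit.
```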

Locks with constrained sharing (extended abstract)

In this paper, we propose a new mode for locks that permits sharing in a constrained manner. We develop a family of locking protocols, the strictest of which is the two-phase locking protocol, while the most permissive recognizes all conflict-preserving serializable histories. This is the first locking-based protocol that can recognize the entire class of conflict-preserving serializable histories.

A serialization graph construction for nested transactions

This paper makes three contributions. First, we present a proof technique that offers system designers the same ease of reasoning about nested transaction systems as is given by the classical theory for systems without nesting, and yet can be used to verify that a system satisfies the robust “user view” definition of correctness of [10]. Second, as applications of the technique, we verify the correctness of Moss' read/write locking algorithm for nested transactions, and of an undo logging algorithm that has not previously been presented or proved for nested transaction systems. Third, we make explicit the assumptions used for this proof technique, assumptions that are usually made implicitly in the classical theory, and therefore we clarify the type of system for which the classical theory itself can reliably be used.

Multi-level recovery

Multi-level transactions have received considerable attention as a framework for high-performance concurrency control methods. An inherent property of multi-level transactions is the need for compensating actions, since state-based recovery methods no longer work correctly for transaction undo. The resulting requirement of operation logging adds to the complexity of crash recovery. In addition, multi-level recovery algorithms have to take into account that high-level actions are not necessarily atomic, e.g., if multiple pages are updated in a single action. In this paper, we present a recovery algorithm for multi-level transactions. Unlike typical commercial database systems, we have striven for simplicity rather than employing special tricks. It is important to note, though, that simplicity is not achieved at the expense of performance. We show how a high-performance multi-level recovery algorithm can be systematically developed based on a few fundamental principles. The presented algorithm has been implemented in the DASDBS database kernel system.
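
A minimal sketch of operation logging with compensating actions, the mechanism the abstract identifies as inherent to multi-level transactions (this is not the DASDBS algorithm; the names and structure are illustrative):

```python
# Each high-level action is logged together with its inverse, and transaction
# undo replays the inverses in reverse order, because the low-level page
# states may already have been overwritten by other committed subtransactions.

class OperationLog:
    def __init__(self):
        self.entries = []                      # (txn_id, undo_action)

    def log(self, txn_id, do_action, undo_action):
        do_action()                            # perform the high-level action
        self.entries.append((txn_id, undo_action))

    def undo(self, txn_id):
        # compensate in reverse chronological order
        for tid, undo_action in reversed(self.entries):
            if tid == txn_id:
                undo_action()

# Example: increment/decrement as a pair of compensating counter operations.
counters = {"x": 0}
log = OperationLog()
log.log("T1", lambda: counters.update(x=counters["x"] + 5),
              lambda: counters.update(x=counters["x"] - 5))
log.undo("T1")        # compensates by subtracting 5, whatever else happened
```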

On the optimality of strategies for multiple joins

Polynomial-time program transformations in deductive databases

We investigate the complexity of various optimization techniques for logic databases. In particular, we provide polynomial-time algorithms for restricted versions of common program transformations, and show that a minor relaxation of these restrictions leads to NP-hardness. To this end, we define the k-containment problem on conjunctive queries, and show that while the 2-containment problem is in P, the 3-containment problem is NP-complete. These results provide a complete description of the complexity of conjunctive query containment. We also extend these results to provide a natural characterization of certain optimization problems in logic databases, such as the detection of sequencability and commutativity among pairs of linear rules, the detection of 1-boundedness in sirups, and the detection of ZYT-linearizability in simple nonlinear recursions.
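
For background, the classical containment test for conjunctive queries reduces to finding a homomorphism between the queries; the brute-force sketch below illustrates the problem that the k-containment results refine, and is not the paper's polynomial-time algorithm:

```python
from itertools import product

# A query is (head, body): head is a tuple of terms, body a list of
# (predicate, args) atoms; variables start with an uppercase letter.

def is_var(t):
    return t[0].isupper()

def contained_in(q1, q2):
    """True iff q1 is contained in q2, i.e. some homomorphism maps q2 to q1."""
    head1, body1 = q1
    head2, body2 = q2
    vars2 = sorted({t for _, args in body2 for t in args if is_var(t)} |
                   {t for t in head2 if is_var(t)})
    targets = sorted({t for _, args in body1 for t in args} | set(head1))
    atoms1 = {(p, tuple(args)) for p, args in body1}
    for image in product(targets, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        f = lambda t: h[t] if is_var(t) else t
        if [f(t) for t in head2] != list(head1):
            continue
        if all((p, tuple(f(t) for t in args)) in atoms1 for p, args in body2):
            return True
    return False

# q1(X) :- e(X,Y), e(Y,X)   is contained in   q2(X) :- e(X,Y), but not vice versa
q1 = (("X",), [("e", ("X", "Y")), ("e", ("Y", "X"))])
q2 = (("X",), [("e", ("X", "Y"))])
print(contained_in(q1, q2), contained_in(q2, q1))   # True False
```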

Semigroup techniques in recursive query optimization

Independence of logic database queries and updates

A query is independent of an update if executing the update cannot change the result of evaluating the query. The theorems of this paper give methods for proving independence in concrete cases, taking into account integrity constraints, recursive rules, and arbitrary queries. First we define the notion of independence model-theoretically, and we prove basic properties of the concept. Then we provide proof-theoretic conditions for a conjunctive query to be independent of an update. Finally, we prove correct an induction scheme for showing that a recursive query is independent of an update.

Modular stratification and magic sets for DATALOG programs with negation

We propose a class of programs, called modularly stratified programs, that has several attractive properties. Modular stratification generalizes stratification and local stratification, while allowing programs that are not expressible by stratified programs. For modularly stratified programs the well-founded semantics coincides with the stable model semantics, and makes every ground literal true or false. Modularly stratified programs are all weakly stratified, but the converse is false. Unlike some weakly stratified programs, modularly stratified programs can be evaluated in a subgoal-at-a-time fashion. We demonstrate a technique for rewriting a modularly stratified program for bottom-up evaluation and extend this rewriting to include magic-set techniques. The rewritten program, when evaluated bottom-up, gives the same answers as the well-founded semantics. We discuss extending modular stratification to other operators, such as set-grouping and aggregation, that have traditionally been stratified to prevent semantic difficulties.
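
A standard example often used for this class is the game program below (shown here only as an illustration; the precise characterization of when it is modularly stratified is in the paper): win depends negatively on itself, so the program is not stratified, yet when the move relation is acyclic the negation can be resolved by processing the ground dependency graph component by component.

```latex
\[
  \mathit{win}(X) \;\leftarrow\; \mathit{move}(X, Y),\ \neg\,\mathit{win}(Y).
\]
% A position X is won if some move from X leads to a position Y that is not won.
```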

Three-valued formalization of logic programming: is it needed?

The central issue of this paper concerns the truth value undefined in Przymusinski's 3-valued formalization of nonmonotonic reasoning and logic programming. We argue that this formalization can lead to the problem of unintended semantics and loss of disjunctive information. We modify the formalization by proposing two general principles for logic program semantics: justifiability and minimal undefinedness. The former is shown to be a general property of almost all logic program semantics, and the latter requires the use of the undefined only when it is necessary. We show that there are three types of information embedded in the undefined: the disjunctive, the factoring, and the “difficult-to-be-assigned”. In the modified formalization, the first two can be successfully identified and branched into multiple models. This leaves only the “difficult-to-be-assigned” as the undefined. It is shown that the truth value undefined is needed only for a very special type of programs, whose practicality remains to be demonstrated.

Backward chaining evaluation in stratified disjunctive theories

The expressive powers of the logic programming semantics (extended abstract)

We compare the expressive powers of three semantics for deductive databases and logic programming: the 3-valued program completion semantics, the well-founded semantics, and the stable semantics. We identify the expressive power of the stable semantics, and in fairly general circumstances that of the well-founded semantics. Over infinite Herbrand models, where the three semantics have equivalent expressive power, we also consider a notion of uniform translatability between the 3-valued program completion and well-founded semantics. In this sense of uniform translatability, we show the well-founded semantics to be more expressive.

Stable models and non-determinism in logic programs with negation

Previous researchers have proposed generalizations of Horn clause logic to support negation and non-determinism as two separate extensions. In this paper, we show that the stable model semantics for logic programs provides a unified basis for the treatment of both concepts. First, we introduce the concepts of partial models, stable models, strongly founded models, and deterministic models, along with other interesting classes of partial models, and study their relationships. We show that the maximal deterministic model of a program is a subset of the intersection of all its stable models and that the well-founded model of a program is a subset of its maximal deterministic model. Then, we show that the use of stable models subsumes the use of the non-deterministic choice construct in LDL and provides an alternative definition of the semantics of this construct. Finally, we provide a constructive definition for stable models with the introduction of a procedure, called backtracking fixpoint, that non-deterministically constructs a total stable model, if such a model exists.

Non-deterministic languages to express deterministic transformations

The use of non-deterministic database languages is motivated using pragmatic and theoretical considerations. It is shown that non-determinism resolves some difficulties concerning the expressive power of deterministic languages: there are non-deterministic languages expressing low complexity classes of queries/updates, whereas no such deterministic languages exist. Various mechanisms yielding non-determinism are reviewed. The focus is on two closely related families of non-deterministic languages. The first consists of extensions of Datalog with negations in bodies and/or heads of rules, with non-deterministic fixpoint semantics. The second consists of non-deterministic extensions of first-order logic and fixpoint logics, using the witness operator. The ability of the various non-deterministic languages to express deterministic transformations is characterized. In particular, non-deterministic languages expressing exactly the queries/updates computable in polynomial time are exhibited, whereas it is conjectured that no analogous deterministic language exists. The connection between non-deterministic languages and determinism is also explored. Several problems of practical interest are examined, such as checking (statically or dynamically) whether a given program is deterministic, detecting coincidence of deterministic and non-deterministic semantics, and verifying termination for non-deterministic programs.

Graph-theoretic methods in database theory

Quasilinear algorithms for processing relational calculus expressions (preliminary report)

Throughout this paper, q will denote a query such that I is the number of tuples input to the query and U is the number of tuples in its output. We say that q has quasi-linear complexity iff, for some constant d, it is executable in time O(U + I log^d I) and space O(I + U). This article defines a large subset of the relational calculus, called RCS, and shows that all RCS queries are executable by quasi-linear algorithms. Our algorithm does not require the maintenance of any complex index, as it builds all the needed data structures during execution. Its exponent d can be large for some particular queries q, but in most practical cases it is a small constant equal to 0 or 1. Our algorithm is intended for databases stored in main memory, and its time O(U + I log^d I) should amount to only a few seconds of CPU time in many practical applications. Section 10 of this paper lists some open questions for further investigation.

On the optimality of disk allocation for Cartesian product files (extended abstract)

In this paper we present a coding-theoretic analysis of the disk allocation problem. We provide both necessary and sufficient conditions for the existence of strictly optimal allocation methods. Based on a class of optimal codes, known as maximum distance separable codes, strictly optimal allocation methods are constructed. Using the necessary conditions we prove, we argue that the standard definition of strict optimality is too strong and cannot be attained in general. A new criterion for optimality is therefore defined, whose objective is to design allocation methods that yield a response time of one for all queries with a minimum number of specified attributes. Using coding theory, we determine this minimum number for binary files, assuming that the number of disks is a power of two. In general, our approach provides better allocation methods than previous techniques.

Efficient processing of window queries in the pyramid data structure

Window operations serve as the basis of a number of queries that can be posed in a spatial database. Examples of these window-based queries include the exist query (i.e., determining whether or not a spatial feature exists inside a window) and the report query (i.e., reporting the identity of all the features that exist inside a window). Algorithms are described for answering window queries in O(n log log T) time for a window of size n × n in a feature space (e.g., an image) of size T × T (e.g., pixel elements). The significance of this result is that even though the window contains n² pixel elements, the worst-case time complexity of the algorithms is almost linearly proportional (and not quadratic) to the window diameter, and does not depend on other factors. The above complexity bounds are achieved via the introduction of the incomplete pyramid data structure (a variant of the pyramid data structure) as the underlying representation to store spatial features and to answer queries on them.

A framework for the performance analysis of concurrent B-tree algorithms

Many concurrent B-tree algorithms have been proposed, but they have not yet been satisfactorily analyzed. When transaction processing systems require high levels of concurrency, a restrictive serialization technique on the B-tree index can cause a bottleneck. In this paper, we present a framework for constructing analytical performance models of concurrent B-tree algorithms. The models can predict the response time and maximum throughput. We analyze three algorithms: Naive Lock-coupling, Optimistic Descent, and the Lehman-Yao algorithm. The analyses are validated by simulations of the algorithms on actual B-trees. Simple and instructive rules of thumb for predicting performance are also derived. We apply the analyses to determine the effect of database recovery on B-tree concurrency.

Querying constraints

The design of languages to tackle constraint satisfaction problems has a long history. Only more recently has the reverse problem, that of introducing constraints as primitive constructs in programming languages, been addressed. A main task that the designers and implementers of such languages face is to use and adapt the concepts and algorithms from the extensive studies on constraints in areas such as Mathematical Programming, Symbolic Computation, Artificial Intelligence, Program Verification, and Computational Geometry. In this paper, we illustrate this task in a simple and yet important domain: linear arithmetic constraints. We show how one can design a querying system for sets of linear constraints by using basic concepts from logic programming and symbolic computation, as well as algorithms from linear programming and computational geometry. We conclude by reporting briefly on how notions of negation and canonical representation used in linear constraints can be generalized to account for cases in term algebras, symbolic computation, affine geometry, and elsewhere.
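
As a small illustration of the kind of algorithm being adapted, two primitive operations of such a querying system, satisfiability of a conjunction of linear constraints and entailment of a further constraint, reduce to linear programming; the sketch below uses an off-the-shelf LP solver, and its interface and representation choices are assumptions made for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def satisfiable(A, b):
    """Is the conjunction of linear constraints A x <= b satisfiable?"""
    res = linprog(c=np.zeros(A.shape[1]), A_ub=A, b_ub=b,
                  bounds=[(None, None)] * A.shape[1])
    return res.status == 0            # 0 means an optimum was found: feasible

def entails(A, b, c, d):
    """Does A x <= b imply c . x <= d?  Maximize c . x over the constraint
    set (linprog minimizes, hence the sign flip) and compare with d."""
    if not satisfiable(A, b):
        return True                   # an unsatisfiable set entails anything
    res = linprog(c=-np.asarray(c, dtype=float), A_ub=A, b_ub=b,
                  bounds=[(None, None)] * A.shape[1])
    return res.status == 0 and -res.fun <= d + 1e-9

# x >= 0, y >= 0, x + y <= 4   entails   x <= 4
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 4.0])
print(satisfiable(A, b), entails(A, b, [1.0, 0.0], 4.0))   # True True
```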

Constraint query languages (preliminary report)

We discuss the relationship between constraint programming and database query languages. We show that bottom-up, efficient, declarative database programming can be combined with efficient constraint solving. The key intuition is that the generalization of a ground fact, or tuple, is a conjunction of constraints. We describe the basic Constraint Query Language design principles, and illustrate them with four different classes of constraints: polynomial, rational order, equality, and Boolean constraints.

Magic conditions

Much recent work has focussed on the bottom-up evaluation of Datalog programs. One approach, called Magic-Sets, is based on rewriting a logic program so that bottom-up fixpoint evaluation of the program avoids generation of irrelevant facts ([BMSU86, BR87, Ram88]). It is widely believed that the principal application of the Magic-Sets technique is to restrict computation in recursive queries using equijoin predicates. We extend the Magic-Set transformation to use predicates other than equality (X > 10, for example). This Extended Magic-Set technique has practical utility in “real” relational databases, not only for recursive queries, but for non-recursive queries as well; in ([MFPR90]) we use the results in this paper and those in [MPR89] to define a magic-set transformation for relational databases supporting SQL and its extensions, going on to describe an implementation of magic in Starburst ([HFLP89]). We also give preliminary performance measurements. In extending Magic-Sets, we describe a natural generalization of the common class of bound (b) and free (f) adornments. We also present a formalism to compare adornment classes.

On being optimistic about real-time constraints

Performance studies of concurrency control algorithms for conventional database systems have shown that, under most operating circumstances, locking protocols outperform optimistic techniques. Real-time database systems have special characteristics - timing constraints are associated with transactions, performance criteria are based on satisfaction of these timing constraints, and scheduling algorithms are priority driven. In light of these special characteristics, results regarding the performance of concurrency control algorithms need to be re-evaluated. We show in this paper that the following parameters of the real-time database system - its policy for dealing with transactions whose constraints are not met, its knowledge of transaction resource requirements, and the availability of resources - have a significant impact on the relative performance of the concurrency control algorithms. In particular, we demonstrate that under a policy that discards transactions whose constraints are not met, optimistic concurrency control outperforms locking over a wide range of system utilization. We also outline why, for a variety of reasons, optimistic algorithms appear well-suited to real-time database systems.

Token transactions: managing fine-grained migration of data

Executing a transaction in a conventional distributed database system involves the execution of several subtransactions, each at a remote site where the data reside, and running a two-phase commit protocol at the end of the transaction. With the advent of fast communication networks, we consider an alternative paradigm where the remote data being accessed are dynamically migrated to the initiation site of the transaction. One example of such a system is a distributed shared virtual memory system. In this paper, we examine the problem of recovery from system failure in data migration systems. Most data migration systems use the notion of tokens for the access rights a site has on the data elements it caches. Our goal is to recover the site's knowledge of the set of tokens it owned when a system failure occurred. Our approach is to consider the token knowledge at each site as a fragment of a global token database and the data migration activities as token transactions that update this distributed database. We have developed a unique commit protocol for token transactions, called unilateral commit (UCP), that efficiently achieves consistency and recoverability of the token state. A proof of the correctness of UCP with respect to the two-phase commit protocol is also presented.

Data-valued partitioning and virtual messages (extended abstract)

Network partition failures in traditional distributed databases cause severe problems for transaction processing. The only way to overcome the problems of “blocking” behavior for transaction processing in the event of such failures is, effectively, to execute them at single sites. A new approach to data representation and distribution is proposed and it is shown to be suitable for failure-prone environments. We propose techniques for transaction processing, concurrency control and recovery for the new representation. Several properties that arise as a result of these methods, such as non-blocking behavior, independent recovery and high availability, suggest that the techniques could be profitably implemented in a distributed environment.

A novel checkpointing scheme for distributed database systems

We present a new checkpointing scheme for a distributed database system. Our scheme records the states of some selected data items and can be executed at any time without stopping other activities in the database system. It makes use of “shadows” of data items to make sure that the collected data item values are “transaction-consistent”. Storage overhead is low, since at most one shadow is needed for each data item.
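
A hedged sketch of the shadow bookkeeping the abstract describes (the precise rules that make the recorded values transaction-consistent are in the paper; this only illustrates why at most one shadow per item suffices and why updaters never block):

```python
# While a checkpoint is in progress, the first update to a not yet recorded
# item saves its pre-update value as a shadow, so the checkpointer can later
# record a value as of the checkpoint's start without stopping updaters.

class ShadowCheckpointer:
    def __init__(self, db, items_to_record):
        self.db = db                       # item -> current value
        self.pending = set(items_to_record)
        self.shadows = {}                  # item -> value at checkpoint start

    def on_update(self, item, new_value):
        if item in self.pending and item not in self.shadows:
            self.shadows[item] = self.db[item]   # keep the old value once
        self.db[item] = new_value                # the updater never waits

    def record(self):
        snapshot = {}
        for item in self.pending:
            snapshot[item] = self.shadows.get(item, self.db[item])
        self.shadows.clear()
        return snapshot

db = {"a": 1, "b": 2}
cp = ShadowCheckpointer(db, ["a", "b"])
cp.on_update("a", 10)                      # concurrent update during checkpoint
print(cp.record())                         # {'a': 1, 'b': 2}
```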

Polynomial time query processing in temporal deductive databases

We study conditions guaranteeing polynomial time computability of queries in temporal deductive databases. We show that if, for a given set of temporal rules, the period of its least models is bounded from above by a polynomial in the database size, then the time to process yes-no queries (as well as to compute finite representations of all query answers) can also be polynomially bounded. We present a bottom-up query processing algorithm BT that is guaranteed to terminate in polynomial time if the periods are polynomially bounded. Polynomial periodicity is our most general criterion; however, it cannot be applied directly. Therefore, we exhibit two weaker criteria, defining inflationary and I-periodic sets of temporal rules. We show that it can be decided whether a set of temporal rules is inflationary. I-periodicity is undecidable (as we show), but it can be closely approximated by a syntactic notion of multi-separability.

Handling infinite temporal data

In this paper, we present a powerful framework for describing, storing, and reasoning about infinite temporal information. This framework is an extension of classical relational databases. It represents infinite temporal information by generalized tuples defined by linear repeating points and constraints on these points. We prove that relations formed from generalized tuples are closed under the operations of relational algebra. A characterization of the expressiveness of generalized relations is given in terms of predicates definable in Presburger arithmetic. Finally, we provide some complexity results.
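
For illustration, a generalized tuple of the kind described (the concrete syntax here is assumed, not the paper's): an event recurring every 24 time units starting at time 8, restricted to one week, is stored finitely as

```latex
\[
  \langle \mathrm{depart},\, t \rangle
  \quad\text{with}\quad
  t = 8 + 24k,\; k \in \mathbb{Z},\; 0 \le t \le 168 ,
\]
% a linear repeating point together with linear constraints on it; the paper
% shows that relational algebra applied to such tuples again yields relations
% with a finite representation of this form.
```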

GraphLog: a visual formalism for real life recursion

We present a query language called GraphLog, based on a graph representation of both data and queries. Queries are graph patterns. Edges in queries represent edges or paths in the database. Regular expressions are used to qualify these paths. We characterize the expressive power of the language and show that it is equivalent to stratified linear Datalog, first-order logic with transitive closure, and non-deterministic logarithmic space (assuming ordering on the domain). The fact that the latter three classes coincide was not previously known. We show how GraphLog can be extended to incorporate aggregates and path summarization, and describe briefly our current prototype implementation.

A graph-oriented object database model

A simple, graph-oriented database model, supporting object-identity, is presented. For this model, a transformation language based on elementary graph operations is defined. This transformation language is suitable for both querying and updates. It is shown that the transformation language supports both set-operations (except for the powerset operator) and recursive functions.