PODS '17: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems


SESSION: PODS Keynote

Session details: PODS Keynote

Data Citation: A Computational Challenge

Data citation is an interesting computational challenge, whose solution draws on several well-studied problems in database theory: query answering using views, and provenance. We describe the problem, suggest an approach to its solution, and highlight several open research problems, both practical and theoretical.

SESSION: PODS Session 1: New Formal Frameworks

Session details: PODS Session 1: New Formal Frameworks

A Relational Framework for Classifier Engineering

In the design of analytical procedures and machine-learning solutions, a critical and time-consuming task is that of feature engineering, for which various recipes and tooling approaches have been developed. In this framework paper, we embark on the establishment of database foundations for feature engineering. We propose a formal framework for classification in the context of a relational database. The goal of this framework is to open the way to research and techniques to assist developers with the task of feature engineering by utilizing the database's modeling and understanding of data and queries, and by deploying the well studied principles of database management. As a first step, we demonstrate the usefulness of this framework by formally defining three key algorithmic challenges. The first challenge is that of separability, which is the problem of determining the existence of feature queries that agree with the training examples. The second is that of evaluating the VC dimension of the model class with respect to a given sequence of feature queries. The third challenge is identifiability, which is the task of testing for a property of independence among features that are represented as database queries. We give preliminary results on these challenges for the case where features are defined by means of conjunctive queries, and in particular we study the implication of various traditional syntactic restrictions on the inherent computational complexity.

Querying Probabilistic Preferences in Databases

We propose a novel framework wherein probabilistic preferences can be naturally represented and analyzed in a probabilistic relational database. The framework augments the relational schema with a special type of a relation symbol---a preference symbol. A deterministic instance of this symbol holds a collection of binary relations. Abstractly, the probabilistic variant is a probability space over databases of the augmented form (i.e., probabilistic database). Effectively, each instance of a preference symbol can be represented as a collection of parametric preference distributions such as Mallows. We establish positive and negative complexity results for evaluating Conjunctive Queries (CQs) over databases where preferences are represented in the Repeated Insertion Model (RIM), Mallows being a special case. We show how CQ evaluation reduces to a novel inference problem (of independent interest) over RIM, and devise a solver with polynomial data complexity.

SESSION: PODS Session 2: Algorithms, Data Structures, Benchmarking

Session details: PODS Session 2: Algorithms, Data Structures, Benchmarking

Benchmarking the Chase

The chase is a family of algorithms used in a number of data management tasks, such as data exchange, answering queries under dependencies, query reformulation with constraints, and data cleaning. It is well established as a theoretical tool for understanding these tasks, and in addition a number of prototype systems have been developed. While individual chase-based systems and particular optimizations of the chase have been experimentally evaluated in the past, we provide the first comprehensive and publicly available benchmark---test infrastructure and a set of test scenarios---for evaluating chase implementations across a wide range of assumptions about the dependencies and the data. We used our benchmark to compare chase-based systems on data exchange and query answering tasks with one another, as well as with systems that can solve similar tasks developed in closely related communities. Our evaluation provided us with a number of new insights concerning the factors that impact the performance of chase implementations.

Efficient and Provable Multi-Query Optimization

Complex queries for massive data analysis jobs have become increasingly commonplace. Many such queries contain common subexpressions, either within a single query or among multiple queries submitted as a batch. Conventional query optimizers do not exploit these subexpressions and produce sub-optimal plans. The problem of multi-query optimization (MQO) is to generate an optimal combined evaluation plan by computing common subexpressions once and reusing them. Exhaustive algorithms for MQO explore an O(n^n) search space. Thus, this problem has primarily been tackled using various heuristic algorithms, without providing any theoretical guarantees on the quality of their solution.

In this paper, instead of the conventional cost minimization problem, we treat the problem as maximizing a linear transformation of the cost function. We propose a greedy algorithm for this transformed formulation of the problem, which, under weak, intuitive assumptions, provides an approximation factor guarantee for this formulation. We go on to show that this factor is optimal, unless P = NP. Another noteworthy point about our algorithm is that it can be easily incorporated into existing transformation-based optimizers. We finally propose optimizations which can be used to improve the efficiency of our algorithm.
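
As a rough illustration of greedy subexpression reuse (not the paper's algorithm or its approximation guarantee), the following toy heuristic repeatedly materializes the shared subexpression with the largest estimated saving; the subexpression names and cost figures are hypothetical.

```python
# Toy greedy heuristic for multi-query optimization (illustration only):
# repeatedly materialize the shared subexpression with the largest
# estimated saving, (uses - 1) * eval_cost - materialization_cost.

def greedy_materialize(subexprs):
    """subexprs: dict name -> (eval_cost, materialization_cost, uses)."""
    def saving(item):
        eval_cost, mat_cost, uses = item[1]
        return (uses - 1) * eval_cost - mat_cost

    chosen, remaining = [], dict(subexprs)
    while remaining:
        best = max(remaining.items(), key=saving)
        if saving(best) <= 0:
            break                      # nothing left pays for materialization
        chosen.append(best[0])
        del remaining[best[0]]
    return chosen

# Hypothetical shared subexpressions with made-up costs and usage counts.
demo = {
    "join_AB":  (100.0, 20.0, 3),
    "scan_C":   (10.0,  15.0, 2),
    "join_ABC": (250.0, 60.0, 2),
}
print(greedy_materialize(demo))        # ['join_ABC', 'join_AB']
```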

Write-Optimized Skip Lists

The skip list is an elegant dictionary data structure that is commonly deployed in RAM. A skip list with N elements supports searches, inserts, and deletes in O(log N) operations with high probability (w.h.p.) and range queries returning K elements in O(log N + K) operations w.h.p.

A seemingly natural way to generalize the skip list to external memory with block size B is to "promote" with probability 1/B, rather than 1/2. However, there are practical and theoretical obstacles to getting the skip list to retain its efficient performance, space bounds, and high-probability guarantees.
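
To make the promotion idea concrete, here is a minimal, illustrative level-assignment sketch in which the promotion probability is a parameter: p = 1/2 gives the classic RAM skip list, while p = 1/B gives the naive external-memory generalization discussed above. It only shows how the levels shrink geometrically; it is not the paper's write-optimized construction.

```python
import bisect
import random

def random_level(p, max_level=32):
    """Draw a node height: promote to the next level with probability p
    (p = 1/2 for the classic RAM skip list, p = 1/B for the naive
    external-memory generalization)."""
    level = 1
    while random.random() < p and level < max_level:
        level += 1
    return level

class SkipListSketch:
    """Insertion-only sketch that keeps, per level, a sorted Python list.
    A real skip list links nodes; this only shows how the promotion
    probability shapes the level structure."""
    def __init__(self, p=0.5):
        self.p = p
        self.levels = [[]]               # levels[0] holds every key

    def insert(self, key):
        height = random_level(self.p)
        while len(self.levels) < height:
            self.levels.append([])
        for lvl in range(height):
            bisect.insort(self.levels[lvl], key)

B = 1024
sl = SkipListSketch(p=1.0 / B)           # expect roughly N/B keys on level 1
for k in range(10000):
    sl.insert(k)
print([len(lvl) for lvl in sl.levels])   # level sizes decay with ratio ~1/B
```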

We give an external-memory skip list that achieves write-optimized bounds. That is, for 0 < ε < 1, range queries take O(log_{B^ε} N + K/B) I/Os w.h.p. and insertions and deletions take O((log_{B^ε} N) / B^(1-ε)) amortized I/Os w.h.p.

Our write-optimized skip list inherits the virtue of simplicity from RAM skip lists. Moreover, it matches or beats the asymptotic bounds of prior write-optimized data structures such as the B^ε-tree or LSM-trees, which are deployed in high-performance databases and file systems.

The main technical challenge in proving our bounds comes from the fact that there are so few levels in the skip list, an aspect of the data structure that is essential to getting strong external-memory bounds. We use extremal-graph coloring to show that it is possible to decompose paths in the skip list into uncorrelated groups, regardless of the insertion/deletion pattern. Thus, we achieve our bounds by averaging over these uncorrelated paths rather than by averaging over uncorrelated levels, as in the standard skip list.

Output-optimal Parallel Algorithms for Similarity Joins

Parallel join algorithms have received much attention in recent years, due to the rapid development of massively parallel systems such as MapReduce and Spark. In the database theory community, most efforts have been focused on studying worst-case optimal algorithms. However, the worst-case optimality of these join algorithms relies on the hard instances having very large output sizes. In the case of a two-relation join, the hard instance is just a Cartesian product, with an output size that is quadratic in the input size.

In practice, however, the output size is usually much smaller. One recent parallel join algorithm by Beame et al. [8] has achieved output-optimality, i.e., its cost is optimal in terms of both the input size and the output size, but their algorithm only works for a 2-relation equi-join, and has some imperfections. In this paper, we first improve their algorithm to true optimality. Then we design output-optimal algorithms for a large class of similarity joins. Finally, we present a lower bound, which essentially eliminates the possibility of having output-optimal algorithms for any join on more than two relations.

SESSION: Gems of PODS and Test-of-Time Award Session

Session details: Gems of PODS and Test-of-Time Award Session

2017 ACM PODS Alberto O. Mendelzon Test-of-Time Award

The Semiring Framework for Database Provenance

Imagine a computational process that uses a complex input consisting of multiple "items" (e.g., files, tables, tuples, parameters, configuration rules). The provenance analysis of such a process allows us to understand how the different input items affect the output of the computation. It can be used, for example, to derive confidence in the output (given confidences in the input items), to derive the minimum access clearance for the output (given input items with different classifications), to minimize the cost of obtaining the output (given a complex input item pricing scheme). It also applies to probabilistic reasoning about an output (given input item distributions), as well as to output maintenance, and to debugging.

Provenance analysis for queries, views, database ETL tools, and schema mappings is strongly influenced by their declarative nature, providing mathematically nice descriptions of the output-inputs correlation. In a series of papers starting with PODS 2007 we have developed an algebraic framework for describing such provenance based on commutative semirings and semimodules over such semirings. So far, the framework has usefully exploited the observation that, for database provenance, data use has two flavors: joint and alternative.
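
A minimal sketch of this viewpoint (simplified from the framework): joint use corresponds to semiring multiplication and alternative use to semiring addition, so evaluating one provenance expression under different commutative semirings yields multiplicities, Boolean possibility, or minimum cost. The concrete semirings and tuple annotations below are illustrative.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Semiring:
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]    # alternative use
    times: Callable[[Any, Any], Any]   # joint use

# A tiny provenance polynomial for an output tuple derived either by joining
# input tuples a and b, or by joining a and c:  a*b + a*c.
def provenance(ann, S: Semiring):
    a, b, c = ann["a"], ann["b"], ann["c"]
    return S.plus(S.times(a, b), S.times(a, c))

counting = Semiring(0, 1, lambda x, y: x + y, lambda x, y: x * y)       # (N, +, x)
boolean  = Semiring(False, True, lambda x, y: x or y, lambda x, y: x and y)
tropical = Semiring(float("inf"), 0.0, min, lambda x, y: x + y)         # min cost

# Hypothetical annotations: multiplicities, presence, and costs of the inputs.
print(provenance({"a": 2, "b": 3, "c": 1}, counting))           # 2*3 + 2*1 = 8
print(provenance({"a": True, "b": False, "c": True}, boolean))  # True
print(provenance({"a": 5.0, "b": 2.0, "c": 4.0}, tropical))     # min(7.0, 9.0)
```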

Here, we have selected several insights that we consider essential for the appreciation of this framework's nature and effectiveness and we also give some idea of its applicability.

Data Integration: After the Teenage Years

The field of data integration has expanded significantly over the years, from providing a uniform query and update interface to structured databases within an enterprise to the ability to search, exchange, and even update structured or unstructured data that are within or external to the enterprise. This paper describes the evolution in the landscape of data integration since the work on rewriting queries using views in the mid-1990s. In addition, we describe two important challenges for the field going forward. The first challenge is to develop good open-source tools for different components of data integration pipelines. The second challenge is to provide practitioners with viable solutions for the long-standing problem of systematically combining structured and unstructured data.

SESSION: PODS Session 3: Concurrency, JSON, Learning and Privacy

Session details: PODS Session 3: Concurrency, JSON, Learning and Privacy

How Fast can a Distributed Transaction Commit?

The atomic commit problem lies at the heart of distributed database systems. The problem requires a set of processes (database nodes) to agree on whether to commit or abort a transaction (agreement property). The commit decision can only be taken if all processes are initially willing to commit the transaction, and this decision must be taken if all processes are willing to commit and there is no failure (validity property). An atomic commit protocol is said to be non-blocking if every correct process (a database node that does not fail) eventually reaches a decision (commit or abort) even if there are failures elsewhere in the distributed database system (termination property).

Surprisingly, despite the importance of the atomic commit problem, little is known about its complexity. In this paper, we present, for the first time, a systematic study on the time and message complexity of the problem. We measure complexity in the executions that are considered the most frequent in practice, i.e., failure-free, with all processes willing to commit. In other words, we measure how fast a transaction can commit. Through our systematic study, we close many open questions like the complexity of synchronous non-blocking atomic commit. We also present optimal protocols which may be of independent interest. In particular, we present an effective protocol which solves what we call indulgent atomic commit that tolerates practical distributed database systems which are synchronous ``most of the time''.

JSON: Data model, Query languages and Schema specification

Despite the fact that JSON is currently one of the most popular formats for exchanging data on the Web, there are very few studies on this topic and there is no agreement upon a theoretical framework for dealing with JSON. Therefore in this paper we propose a formal data model for JSON documents and, based on the common features present in available systems using JSON, we define a lightweight query language allowing us to navigate through JSON documents. We also introduce a logic capturing the schema proposal for JSON and study the complexity of basic computational tasks associated with these two formalisms.

J-Logic: Logical Foundations for JSON Querying

We propose a logical framework, based on Datalog, to study the foundations of querying JSON data. The main feature of our approach, which we call J-Logic, is the emphasis on paths. Paths are sequences of keys and are used to access the tree structure of nested JSON objects. J-Logic also features "packing" as a means to generate a new key from a path or subpath. J-Logic with recursion is computationally complete, but many queries can be expressed without recursion, such as deep equality. We give a necessary condition for queries to be expressible without recursion. Most of our results focus on the deterministic nature of JSON objects as partial functions from keys to values. Predicates defined by J-Logic programs may not properly describe objects, however. Nevertheless we show that every object-to-object transformation in J-Logic can be defined using only objects in intermediate results. Moreover we show that it is decidable whether a positive, nonrecursive J-Logic program always returns an object when given objects as inputs. Regarding packing, we show that packing is unnecessary if the output does not require new keys. Finally, we show the decidability of query containment for positive, nonrecursive J-Logic programs.
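
To illustrate the path-centric view, the following small helper (hypothetical, not J-Logic syntax) follows a sequence of keys through nested JSON objects, treating objects as partial functions from keys to values.

```python
import json

def follow_path(obj, path):
    """Follow a sequence of keys through nested JSON objects, returning
    None when the path is undefined (objects are partial functions
    from keys to values)."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj

doc = json.loads('{"store": {"book": {"title": "PODS", "year": 2017}}}')
print(follow_path(doc, ["store", "book", "title"]))   # PODS
print(follow_path(doc, ["store", "dvd"]))             # None (undefined path)
```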

Reverse Engineering SPJ-Queries from Examples

This paper investigates the problem of reverse engineering, i.e., learning, select-project-join (SPJ) queries from a user-provided example set, containing positive and negative tuples. The goal is then to determine whether there exists a query returning all the positive tuples, but none of the negative tuples, and furthermore, to find such a query, if it exists. These are called the satisfiability and learning problems, respectively. The ability to solve these problems is an important step in simplifying the querying process for non-expert users.

This paper thoroughly investigates the satisfiability and learning problems in a variety of settings. In particular, we consider several classes of queries, which allow different combinations of the operators select, project and join. In addition, we compare the complexity of satisfiability and learning, when the query is, or is not, of bounded size. We note that bounded-size queries are of particular interest, as they can be used to avoid over-fitting (i.e., tailoring a query precisely to only the seen examples).

In order to fully understand the underlying factors which make satisfiability and learning (in)tractable, we consider different components of the problem, namely, the size of a query to be learned, the size of the schema and the number of examples. We study the complexity of our problems, when considering these as part of the input, as constants or as parameters (i.e., as in parameterized complexity analysis). Depending on the setting, the complexity of satisfiability and learning can vary significantly. Among other results, our analysis also provides new problems that are complete for W[3], for which few natural problems are known. Finally, by considering a variety of settings, we derive insight on how the different facets of our problem interplay with the size of the database, thereby providing the theoretical foundations necessary for a future implementation of query learning from examples.

Private Incremental Regression

Data is continuously generated by modern data sources, and a recent challenge in machine learning has been to develop techniques that perform well in an incremental (streaming) setting. A variety of offline machine learning tasks are known to be feasible under differential privacy, where generic constructions exist that, given a large enough input sample, perform tasks such as PAC learning, Empirical Risk Minimization (ERM), regression, etc. In this paper, we investigate the problem of private machine learning, where, as is common in practice, the data is not given at once, but rather arrives incrementally over time.

We introduce the problems of private incremental ERM and private incremental regression, where the general goal is to always maintain a good empirical risk minimizer for the history observed, under differential privacy. Our first contribution is a generic transformation of private batch ERM mechanisms into private incremental ERM mechanisms, based on the simple idea of invoking the private batch ERM procedure at some regular time intervals. We take this construction as a baseline for comparison. We then provide two mechanisms for the private incremental regression problem. Our first mechanism is based on privately constructing a noisy incremental gradient function, which is then used in a modified projected gradient procedure at every timestep. This mechanism has an excess empirical risk of ≈ √d, where d is the dimensionality of the data; geometric properties of the input and constraint set can, however, be used to derive significantly better results for certain interesting regression problems. Our second mechanism, which achieves this, is based on the idea of projecting the data to a lower-dimensional space using random projections, and then adding privacy noise in this low-dimensional space. The mechanism overcomes the issues of adaptivity inherent with the use of random projections in online streams, and uses recent developments in high-dimensional estimation to achieve an excess empirical risk bound of ≈ T^(1/3) W^(2/3), where T is the length of the stream and W is the sum of the Gaussian widths of the input domain and the constraint set that we optimize over.
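
A sketch of the baseline transformation described above, assuming a hypothetical private_batch_erm(history, epsilon) procedure; privacy-budget splitting across invocations and all accuracy accounting are deliberately elided.

```python
class IncrementalERMBaseline:
    """Baseline sketch: re-invoke a private batch ERM mechanism every
    `interval` arrivals and serve the most recent model in between.
    `private_batch_erm(history, epsilon)` is a hypothetical placeholder
    for any differentially private batch ERM procedure."""

    def __init__(self, private_batch_erm, epsilon, interval):
        self.private_batch_erm = private_batch_erm
        self.epsilon = epsilon
        self.interval = interval
        self.history = []
        self.model = None

    def observe(self, example):
        self.history.append(example)
        if len(self.history) % self.interval == 0:
            # Each invocation sees the whole history observed so far.
            self.model = self.private_batch_erm(self.history, self.epsilon)
        return self.model
```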

TUTORIAL SESSION: PODS Invited Tutorial 1

Session details: PODS Invited Tutorial 1

Statistical Relational Learning: Unifying AI & DB Perspectives on Structured Probabilistic Models

Machine learning and database approaches to structured probabilistic models share many commonalities, yet exhibit certain important differences. Machine learning methods focus on learning probabilistic models from (certain) data and efficient learning and inference, whereas probabilistic database approaches focus on storing and efficiently querying uncertain data. Nonetheless, the structured probabilistic models that both use are often (almost) identical. In this tutorial, I will overview the field of statistical relational learning (SRL) [1] and survey common approaches. I'll make connections to work in probabilistic databases [2], and highlight commonalities and differences among them. I'll close by describing some of our recent work on probabilistic soft logic [3].

SESSION: PODS Session 4: Best Paper Award, Ontologies and JSON

Session details: PODS Session 4: Best Paper Award, Ontologies and JSON

Dichotomies in Ontology-Mediated Querying with the Guarded Fragment

We study the complexity of ontology-mediated querying when ontologies are formulated in the guarded fragment of first-order logic (GF). Our general aim is to classify the data complexity on the level of ontologies where query evaluation w.r.t. an ontology O is considered to be in PTime if all (unions of conjunctive) queries can be evaluated in PTime w.r.t. O and coNP-hard if at least one query is coNP-hard w.r.t. O. We identify several large and relevant fragments of GF that enjoy a dichotomy between PTime and coNP, some of them additionally admitting a form of counting. In fact, almost all ontologies in the BioPortal repository fall into these fragments or can easily be rewritten to do so. We then establish a variation of Ladner's Theorem on the existence of NP-intermediate problems and use this result to show that for other fragments, there is provably no such dichotomy. Again for other fragments (such as full GF), establishing a dichotomy implies the Feder-Vardi conjecture on the complexity of constraint satisfaction problems. We also link these results to Datalog-rewritability and study the decidability of whether a given ontology enjoys PTime query evaluation, presenting both positive and negative results.

The Complexity of Ontology-Based Data Access with OWL 2 QL and Bounded Treewidth Queries

Our concern is the overhead of answering OWL 2 QL ontology-mediated queries (OMQs) in ontology-based data access compared to evaluating their underlying tree-shaped and, more generally, bounded treewidth conjunctive queries (CQs). We show that OMQs with bounded depth ontologies have nonrecursive datalog (NDL) rewritings that can be constructed and evaluated in LOGCFL for combined complexity, and even in NL if their CQs are tree-shaped with a bounded number of leaves. Thus, such OMQs incur no overhead in complexity-theoretic terms. For OMQs with arbitrary ontologies and bounded-leaf tree-shaped CQs, NDL-rewritings are constructed and evaluated in LOGCFL. We experimentally demonstrate feasibility and scalability of our rewritings compared to previously proposed NDL-rewritings. On the negative side, we prove that answering OMQs with tree-shaped CQs is not fixed-parameter tractable if the ontology depth or the number of leaves in the CQs is regarded as the parameter, and that answering OMQs with a fixed ontology (of infinite depth) is NP-complete for tree-shaped CQs and LOGCFL-complete for bounded-leaf CQs.

Conjunctive Queries on Probabilistic Graphs: Combined Complexity

Query evaluation over probabilistic databases is known to be intractable in many cases, even in data complexity, i.e., when the query is fixed. Although some restrictions of the queries and instances [4] have been proposed to lower the complexity, these known tractable cases usually do not apply to combined complexity, i.e., when the query is not fixed. This leaves open the question of which query and instance languages ensure the tractability of probabilistic query evaluation in combined complexity.

This paper proposes the first general study of the combined complexity of conjunctive query evaluation on probabilistic instances over binary signatures, which we can alternatively phrase as a probabilistic version of the graph homomorphism problem, or of a constraint satisfaction problem (CSP) variant. We study the complexity of this problem depending on whether instances and queries can use features such as edge labels, disconnectedness, branching, and edges in both directions. We show that the complexity landscape is surprisingly rich, using a variety of technical tools: automata-based compilation to d-DNNF lineages [4], β-acyclic lineages, the X-property for tractable CSP [25], graded DAGs [28], and various coding techniques for hardness proofs.

Circuit Treewidth, Sentential Decision, and Query Compilation

The evaluation of a query over a probabilistic database boils down to computing the probability of a suitable Boolean function, the lineage of the query over the database. The method of query compilation approaches the task in two stages: first, the query lineage is implemented (compiled) in a circuit form where probability computation is tractable; and second, the desired probability is computed over the compiled circuit. A basic theoretical quest in query compilation is that of identifying pertinent classes of queries whose lineages admit compact representations over increasingly succinct, tractable circuit classes.

Fostering previous work by Jha and Suciu (ICDT 2012) and Petke and Razgon (SAT 2013), we focus on queries whose lineages admit circuit implementations with small treewidth, and investigate their compilability within tame classes of decision diagrams. In perfect analogy with the characterization of bounded circuit pathwidth by bounded OBDD width, we show that a class of Boolean functions has bounded circuit treewidth if and only if it has bounded SDD width. Sentential decision diagrams (SDDs) are central in knowledge compilation, being essentially as tractable as OBDDs but exponentially more succinct. By incorporating constant width (linear size) SDDs and polynomial size SDDs in the picture, we refine the panorama of query compilation for unions of conjunctive queries with and without inequalities.

SESSION: PODS Session 5: Enumeration Problems

Session details: PODS Session 5: Enumeration Problems

2-3 Cuckoo Filters for Faster Triangle Listing and Set Intersection

We introduce new dynamic set intersection data structures, which we call 2-3 cuckoo filters and hash tables. These structures differ from the standard cuckoo hash tables and cuckoo filters in that they choose two out of three locations to store each item, instead of one out of two, ensuring that any item in an intersection of two structures will have at least one common location in both structures. We demonstrate the utility of these structures by using them in improved algorithms for listing triangles and answering set intersection queries in internal or external memory. For a graph G of n vertices and m edges, our internal-memory triangle listing algorithm runs in O(m⌈(α(G)log w)/w⌉ + k) expected time, where α(G) is the arboricity of G, w is the number of bits in a machine word, and k is the number of output triangles. Our external-memory algorithm uses O(sort(n,α(G))+ sort(m⌈(α(G)log w)/w⌉) + sort(k)) expected number of I/Os.
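
The placement invariant can be sketched as follows: each item has three hash-derived candidate buckets and is stored in two of them, so any two structures storing the same item share at least one bucket index (two 2-element subsets of a 3-element set must intersect). The hash choices below are illustrative, and the eviction logic of a real cuckoo structure is omitted.

```python
import hashlib

def candidate_buckets(item, num_buckets, choices=3):
    """Three hash-derived candidate bucket indices per item (illustrative)."""
    return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16)
            % num_buckets for i in range(choices)]

def place_in_two(item, num_buckets, skip=0):
    """Store the item in 2 of its 3 candidates; a real structure picks the
    pair by load and evictions, here we just drop the candidate at `skip`."""
    cands = candidate_buckets(item, num_buckets)
    return {b for i, b in enumerate(cands) if i != skip}

# Two structures may keep the same item in *different* pairs of candidates,
# yet by pigeonhole the pairs always share at least one bucket index.
a = place_in_two("edge(3,7)", 1 << 16, skip=2)
b = place_in_two("edge(3,7)", 1 << 16, skip=0)
assert a & b
print(a, b, a & b)
```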

On Asymptotic Cost of Triangle Listing in Random Graphs

Triangle listing has been a long-standing problem, with many heuristics, bounds, and experimental results, but not much asymptotically accurate complexity analysis. To address this issue, we introduce a novel stochastic framework, based on Glivenko-Cantelli results for functions of order statistics, that allows modeling cost of in-memory triangle enumeration in families of random graphs. Unlike prior work that usually studies the O(.) notation, we derive the exact limits of CPU complexity of all vertex/edge iterators under arbitrary acyclic orientations as graph size n → ∞. These results are obtained in simple closed form as functions of the degree distribution. This allows us to establish optimal orientations for all studied algorithms, compare them to each other, and discover the best technique within each class.

Efficiently Enumerating Minimal Triangulations

We present an algorithm that enumerates all the minimal triangulations of a graph in incremental polynomial time. Consequently, we get an algorithm for enumerating all the proper tree decompositions, in incremental polynomial time, where ``proper'' means that the tree decomposition cannot be improved by removing or splitting a bag. The algorithm can incorporate any method for (ordinary, single result) triangulation or tree decomposition, and can serve as an anytime algorithm to improve such a method. We describe an extensive experimental study of an implementation on real data from different fields. Our experiments show that the algorithm improves upon central quality measures over the underlying tree decompositions, and is able to produce a large number of high-quality decompositions.

Counting and Enumerating (Preferred) Database Repairs

In the traditional sense, a subset repair of an inconsistent database refers to a consistent subset of facts (tuples) that is maximal under set containment. Preferences between pairs of facts allow us to distinguish a set of preferred repairs based on relative reliability (source credibility, extraction quality, recency, etc.) of data items. Previous studies explored the problem of categoricity, where one aims to determine whether preferences suffice to repair the database unambiguously, or in other words, whether there is precisely one preferred repair. In this paper we study the ability to quantify ambiguity, by investigating two classes of problems. The first is that of counting the number of subset repairs, both preferred (under various common semantics) and traditional. We establish dichotomies in data complexity for the entire space of (sets of) functional dependencies. The second class of problems is that of enumerating (i.e., generating) the preferred repairs. We devise enumeration algorithms with efficiency guarantees on the delay between generated repairs, even for constraints represented as general conflict graphs or hypergraphs.

Answering Conjunctive Queries under Updates

We consider the task of enumerating and counting answers to k-ary conjunctive queries against relational databases that may be updated by inserting or deleting tuples. We exhibit a new notion of q-hierarchical conjunctive queries and show that these can be maintained efficiently in the following sense. During a linear time pre-processing phase, we can build a data structure that enables constant delay enumeration of the query results; and when the database is updated, we can update the data structure and restart the enumeration phase within constant time. For the special case of self-join free conjunctive queries we obtain a dichotomy: if a query is not q-hierarchical, then query enumeration with sublinear* delay and sublinear update time (and arbitrary preprocessing time) is impossible.

For answering Boolean conjunctive queries and for the more general problem of counting the number of solutions of k-ary queries we obtain complete dichotomies: if the query's homomorphic core is q-hierarchical, then the size of the query result can be computed in linear time and maintained with constant update time. Otherwise, the size of the query result cannot be maintained with sublinear update time.

All our lower bounds rely on the OMv-conjecture, a conjecture on the hardness of online matrix-vector multiplication that has recently emerged in the field of fine-grained complexity to characterise the hardness of dynamic problems. The lower bound for the counting problem additionally relies on the orthogonal vectors conjecture, which in turn is implied by the strong exponential time hypothesis.

*By sublinear we mean O(n^(1-ε)) for some ε > 0, where n is the size of the active domain of the current database.

TUTORIAL SESSION: PODS Invited Tutorial 2

Session details: PODS Invited Tutorial 2

Communication Cost in Parallel Query Evaluation: A Tutorial

We consider the following problem: what is the amount of communication required to compute a query in parallel on p servers, over a large input database? To study this problem we define a variant of Valiant's BSP model [10], called the Massively Parallel Communication (MPC) model, where servers are infinitely powerful and where the cost is measured in terms of the maximum communication per server, and the number of rounds. Query evaluation in this model has been studied for full conjunctive queries in [6, 7, 9]. The model is similar to the MapReduce model of computation, where full conjunctive queries were studied in [1–4].

SESSION: PODS Session 6: Best Student Paper Award, Streaming and Sketches

Session details: PODS Session 6: Best Student Paper Award, Streaming and Sketches

Tight Space-Approximation Tradeoff for the Multi-Pass Streaming Set Cover Problem

We study the classic set cover problem in the streaming model: the sets that comprise the instance are revealed one by one in a stream and the goal is to solve the problem by making one or few passes over the stream while maintaining a sublinear space o(mn) in the input size; here m denotes the number of the sets and n is the universe size. Notice that in this model, we are mainly concerned with the space requirement of the algorithms and hence do not restrict their computation time.

Our main result is a resolution of the space-approximation tradeoff for the streaming set cover problem: we show that any α-approximation algorithm for the set cover problem requires Ω(mn/α) space, even if it is allowed polylog(n) passes over the stream, and even if the sets are arriving in a random order in the stream. This space-approximation tradeoff matches the best known bounds achieved by the recent algorithm of Har-Peled et al. (PODS 2016) that requires only O(α) passes over the stream in an adversarial order, hence settling the space complexity of approximating the set cover problem in data streams in a quite robust manner. Additionally, our approach yields tight lower bounds for the space complexity of (1-ε)-approximating the streaming maximum coverage problem studied in several recent works.

Streaming Algorithms for Measuring H-Impact

We consider publication settings with positive user feedback, such as users publishing tweets and other users retweeting them, friends posting photos and others liking them, or even authors publishing research papers and others citing these publications. A well-accepted notion of "impact" for users in these settings is the H-index, which is the largest k such that at least k publications have k or more (positive) feedback.

We study how to calculate H-index on large streams of user publications and feedback. If all the items can be stored, H-index of a user can be computed by sorting. We focus on the streaming setting where as is typical, we do not have space to store all the items.
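
For reference, the offline computation mentioned above, when all feedback counts fit in memory, is a simple sort:

```python
def h_index(feedback_counts):
    """Largest k such that at least k publications have >= k feedback."""
    counts = sorted(feedback_counts, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))   # 4: four publications have >= 4 feedback
print(h_index([1, 1, 1]))          # 1
```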

We present the first known streaming algorithm for computing the H-index of a user in the cash register streaming model using space poly(1/ε, log(1/δ), log n); this algorithm provides an additive ε approximation. For the aggregated model, where feedback for a publication is collated, we present streaming algorithms that use much less space, either dependent only on ε or even just a small constant. We also address the problem of finding "heavy hitter" users in H-index without estimating everyone's H-index. We present randomized streaming algorithms for finding a 1+ε approximation to the heavy hitters that use space poly(1/ε, log(1/δ), log n) and succeed with probability at least 1-δ. Again, this is the first sublinear-space algorithm for this problem, despite extensive research on heavy hitters in general. Our work initiates the study of streaming algorithms for problems that estimate impact or identify impactful users.

Efficient Matrix Sketching over Distributed Data

A sketch or synopsis of a large dataset captures vital properties of the original data while typically occupying much less space. In this paper, we consider the problem of computing a sketch of a massive data matrix A ∈ ℝ^(n×d), which is distributed across a large number s of servers. Our goal is to output a matrix B ∈ ℝ^(ℓ×d) which is significantly smaller than, but still approximates, A well in terms of covariance error, i.e., ||A^T A - B^T B||_2. Here, for a matrix A, ||A||_2 is the spectral norm of A, which is defined as the largest singular value of A. Following previous works, we call B a covariance sketch of A. We are mainly focused on minimizing the communication cost, which is arguably the most valuable resource in distributed computations. We show a gap between deterministic and randomized communication complexity for computing a covariance sketch. More specifically, we first prove a tight deterministic lower bound, then show how to bypass this lower bound using randomization. In Principal Component Analysis (PCA), the goal is to find a low-dimensional subspace that captures as much of the variance of a dataset as possible. Based on a well-known connection between covariance sketch and PCA, we give a new algorithm for distributed PCA with improved communication cost. Moreover, in our algorithms, each server only needs to make one pass over the data with limited working space.
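
The covariance-error measure can be illustrated with a generic Gaussian random-projection sketch (not the paper's communication-efficient protocol); numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, ell = 5000, 50, 500

A = rng.standard_normal((n, d))              # full data matrix
S = rng.standard_normal((ell, n)) / np.sqrt(ell)
B = S @ A                                    # an ell x d sketch, ell << n

# Covariance error: spectral norm (largest singular value) of A^T A - B^T B.
err = np.linalg.norm(A.T @ A - B.T @ B, ord=2)
rel = err / np.linalg.norm(A.T @ A, ord=2)
print(f"relative covariance error of the sketch: {rel:.3f}")
```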

BPTree: An ℓ2 Heavy Hitters Algorithm Using Constant Memory

The task of finding heavy hitters is one of the best known and well-studied problems in the area of data streams. One is given a list i_1, i_2, ..., i_m ∈ [n] and the goal is to identify the items among [n] that appear frequently in the list. In sub-polynomial space, the strongest guarantee available is the ℓ2 guarantee, which requires finding all items that occur at least ε||ƒ||_2 times in the stream, where the vector ƒ ∈ R^n is the count histogram of the stream whose ith coordinate equals the number of times i appears, ƒ_i := #{j ∈ [m] : i_j = i}. The first algorithm to achieve the ℓ2 guarantee was the CountSketch of [11], which requires O(ε^(-2) log n) words of memory and O(log n) update time and is known to be space-optimal if the stream allows for deletions. The recent work of [7] gave an improved algorithm for insertion-only streams, using only O(ε^(-2) log ε^(-1) log log n) words of memory. In this work, we give an algorithm BPTree for ℓ2 heavy hitters in insertion-only streams that achieves O(ε^(-2) log ε^(-1)) words of memory and O(log ε^(-1)) update time, which is the optimal dependence on n and m. In addition, we describe an algorithm for tracking ||ƒ||_2 at all times with O(ε^(-2)) memory and update time. Our analyses rely on bounding the expected supremum of a Bernoulli process involving Rademachers with limited independence, which we accomplish via a Dudley-like chaining argument that may have applications elsewhere.
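
For intuition, an offline reference implementation of the ℓ2 guarantee computes the full count histogram ƒ and reports every item with count at least ε||ƒ||_2; BPTree achieves this in small space without storing ƒ, which this naive check does not attempt.

```python
import math
from collections import Counter

def l2_heavy_hitters(stream, eps):
    """Offline check: items i whose count f_i is at least eps * ||f||_2."""
    f = Counter(stream)
    l2 = math.sqrt(sum(c * c for c in f.values()))
    return {i for i, c in f.items() if c >= eps * l2}

stream = [1] * 100 + [2] * 30 + list(range(3, 500))
print(l2_heavy_hitters(stream, eps=0.25))    # {1, 2}
```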

SESSION: PODS Session 7: Dependencies, Graphs and Query Evaluation

Session details: PODS Session 7: Dependencies, Graphs and Query Evaluation

Stable Model Semantics for Tuple-Generating Dependencies Revisited

Normal tuple-generating dependencies (NTGDs) are TGDs enriched with default negation, a.k.a. negation as failure. Query answering under NTGDs, where negation is interpreted according to the stable model semantics, is an intriguing new problem that gave rise to flourishing research activity in the database and KR communities. So far, all the existing works that investigate this problem, except for one recent paper that adopts an operational semantics based on the chase, follow the so-called logic programming (LP) approach. According to the LP approach, the existentially quantified variables are first eliminated via Skolemization, which leads to a normal logic program, and then the standard stable model semantics for normal logic programs is applied. However, as we discuss in the paper, Skolemization is not appropriate in the presence of default negation since it fails to capture the intended meaning of NTGDs, while the operational semantics mentioned above fails to overcome the limitations of the LP approach. This reveals the need to adopt an alternative approach to stable model semantics that is directly applicable to NTGDs with existentially quantified variables. We propose such an approach based on a recent characterization of stable models in terms of second-order logic, which indeed overcomes the limitations of the LP approach. We then perform an in-depth complexity analysis of query answering under prominent classes of NTGDs based on the main decidability paradigms for TGDs, namely weak-acyclicity, guardedness and stickiness. Interestingly, weakly-acyclic NTGDs give rise to robust and highly expressive query languages that allow us to solve in a declarative way problems in the second level of the polynomial hierarchy.

Schema Mappings for Data Graphs

Schema mappings are a fundamental concept in data integration and exchange, and they have been thoroughly studied in different data models. For graph data, however, mappings have been studied in a restricted context that, unlike real-life graph databases, completely disregards the data they store. Our main goal is to understand query answering under graph schema mappings - in particular, in exchange and integration of graph data - for graph databases that mix graph structure with data. We show that adding data querying alters the picture in a significant way.

As the model, we use data graphs: a theoretical abstraction of property graphs employed by graph database implementations. We start by showing a very strong negative result: using the simplest form of nontrivial navigation in mappings makes answering even simple queries that mix navigation and data undecidable. This result suggests that for the purposes of integration and exchange, schema mappings ought to exclude recursively defined navigation over target data. For such mappings and analogs of regular path queries that take data into account, query answering becomes decidable, although intractable. To restore tractability without imposing further restrictions on queries, we propose a new approach based on the use of null values that resemble usual nulls of relational DBMSs, as opposed to marked nulls one typically uses in integration and exchange tasks. If one moves away from path queries and considers more complex patterns, query answering becomes undecidable again, even for the simplest possible mappings.

Dependencies for Graphs

This paper proposes a class of dependencies for graphs, referred to as graph entity dependencies (GEDs). A GED is a combination of a graph pattern and an attribute dependency. In a uniform format, GEDs express graph functional dependencies with constant literals to catch inconsistencies, and keys carrying id literals to identify entities in a graph.

We revise the chase for GEDs and prove its Church-Rosser property. We characterize GED satisfiability and implication, and establish the complexity of these problems and the validation problem for GEDs, in the presence and absence of constant literals and id literals. We also develop a sound and complete axiom system for finite implication of GEDs. In addition, we extend GEDs with built-in predicates or disjunctions, to strike a balance between the expressive power and complexity. We settle the complexity of the satisfiability, implication and validation problems for the extensions.

A Worst-Case Optimal Multi-Round Algorithm for Parallel Computation of Conjunctive Queries

We study the optimal communication cost for computing a full conjunctive query Q over p distributed servers. Two prior results were known. First, for one-round algorithms over skew-free data the optimal communication cost per server is m/p^(1/τ*), where m is the size of the largest input relation, and τ* is the fractional vertex covering number of the query hypergraph. Second, for multi-round algorithms and unrestricted database instances, it was shown that any algorithm requires at least m/p^(1/ρ*) communication cost per server, where ρ* is the fractional edge covering number of the query hypergraph; but no matching algorithms were known for this case (except for two restricted queries: chains and cycles).

In this paper we describe a multi-round algorithm that computes any query with load m/p^(1/ρ*) per server, in the case when all input relations are binary. Thus, we prove this to be the optimal load for all queries over binary input relations. Our algorithm represents a non-trivial extension of previous algorithms for chains and cycles, and exploits some unique properties of graphs, which no longer hold for hypergraphs.
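
For intuition about the exponent, ρ* is the optimum of a small linear program over the query hypergraph; the sketch below computes it with scipy (assumed available) and returns 1.5 for the triangle query, giving load m/p^(2/3).

```python
from scipy.optimize import linprog

def fractional_edge_cover(vertices, edges):
    """Minimize the total edge weight subject to: every vertex is covered
    by total weight >= 1, weights >= 0."""
    c = [1.0] * len(edges)                      # minimize sum of edge weights
    A_ub, b_ub = [], []
    for v in vertices:
        # -sum_{e containing v} x_e <= -1  encodes  sum_{e containing v} x_e >= 1
        A_ub.append([-1.0 if v in e else 0.0 for e in edges])
        b_ub.append(-1.0)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(edges))
    return res.fun

# Triangle query Q(x,y,z) = R(x,y), S(y,z), T(z,x): rho* = 1.5
print(fractional_edge_cover(["x", "y", "z"],
                            [{"x", "y"}, {"y", "z"}, {"z", "x"}]))
```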

What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another?

Recent works on bounding the output size of a conjunctive query with functional dependencies and degree bounds have shown a deep connection between fundamental questions in information theory and database theory. We prove analogous output bounds for disjunctive datalog rules, and answer several open questions regarding the tightness and looseness of these bounds along the way. The bounds are intimately related to Shannon-type information inequalities. We devise the notion of a "proof sequence" of a specific class of Shannon-type information inequalities called "Shannon flow inequalities". We then show how a proof sequence can be used as symbolic instructions to guide an algorithm called PANDA, which answers disjunctive datalog rules within the size bound predicted. We show that PANDA can be used as a black-box to devise algorithms matching precisely the fractional hypertree width and the submodular width runtimes for aggregate and conjunctive queries with functional dependencies and degree bounds.

Our results improve upon known results in three ways. First, our bounds and algorithms are for the much more general class of disjunctive datalog rules, of which conjunctive queries are a special case. Second, the runtime of PANDA matches precisely the submodular width bound, while the previous algorithm by Marx has a runtime that is polynomial in this bound. Third, our bounds and algorithms work for queries with input cardinality bounds, functional dependencies, and degree bounds.

Overall, our results show a deep connection between three seemingly unrelated lines of research; and our results on proof sequences for Shannon flow inequalities might be of independent interest.