PODS '19: Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems


SESSION: PODS Keynote

Differential Privacy and the US Census

Differential privacy is a mathematically rigorous definition of privacy tailored to statistical analysis of large datasets. Differentially private systems simultaneously provide useful statistics to the well-intentioned data analyst and strong protection against arbitrarily powerful adversarial system users -- without needing to distinguish between the two. Differentially private systems "don't care" what the adversary knows, now or in the future. Finally, differentially private systems can rigorously bound and control the cumulative privacy loss that accrues over many interactions with the confidential data. These unique properties, together with the abundance of auxiliary data sources and the ease with which they can be deployed by a privacy adversary, led the US Census Bureau to adopt differential privacy as the disclosure avoidance methodology of the 2020 decennial census. This talk will motivate the definition of differential privacy, reflect on the theory-meets-practice experiences of the decennial census, and highlight a few pressing challenges in the field.
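
To make the definition concrete, below is a minimal sketch of the classical Laplace mechanism for a counting query, the textbook example of a differentially private release. It illustrates the definition only; it is not the Census Bureau's production disclosure avoidance system, and the names (`laplace_count`, `predicate`) are illustrative.

```python
# A minimal sketch of the Laplace mechanism for a counting query; an
# illustration of the definition, not the Census Bureau's production system.
import numpy as np

def laplace_count(records, predicate, epsilon, rng=None):
    """Release |{r in records : predicate(r)}| with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one record changes
    the true count by at most 1), so Laplace noise with scale 1/epsilon gives
    the epsilon-DP guarantee.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: a private count of records with age >= 65.
people = [{"age": a} for a in (23, 67, 71, 34, 80)]
print(laplace_count(people, lambda p: p["age"] >= 65, epsilon=0.5))
```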

SESSION: PODS Session 1: Incomplete Information

Regularizing Conjunctive Features for Classification

We consider the feature-generation task wherein we are given a database with entities labeled as positive and negative examples, and the goal is to find feature queries that allow for a linear separation between the two sets of examples. We focus on conjunctive feature queries, and explore two fundamental problems: (a) deciding whether separating feature queries exist (separability), and (b) generating such queries when they exist. In the approximate versions of these problems, we allow a predefined fraction of the examples to be misclassified. To restrict the complexity of the generated classifiers, we explore various ways of regularizing (i.e., imposing simplicity constraints on) them by limiting their dimension, the number of joins in feature queries, and their generalized hypertree width (ghw). Among other results, we show that the separability problem is tractable in the case of bounded ghw; yet, the generation problem is intractable, simply because the feature queries might be too large. So, we explore a third problem: classifying new entities without necessarily generating the feature queries. Interestingly, in the case of bounded ghw we can efficiently classify without ever explicitly generating the feature queries.

Probabilistic Databases with an Infinite Open-World Assumption

Probabilistic databases (PDBs) introduce uncertainty into relational databases by specifying probabilities for several possible instances. Traditionally, they are finite probability spaces over database instances. Such finite PDBs inherently make a closed-world assumption: non-occurring facts are assumed to be impossible, rather than just unlikely. As convincingly argued by Ceylan et al. (KR '16), this results in implausibilities and clashes with intuition. An open-world assumption, where facts not explicitly listed may have a small positive probability, can yield more reasonable results. The corresponding open-world model of Ceylan et al., however, assumes that all entities in the PDB come from a fixed finite universe. In this work, we take one further step and propose a model of "truly" open-world PDBs with an infinite universe. This is natural when we consider entities from typical domains such as integers, real numbers, or strings. While the probability space might become infinitely large, all instances of a PDB remain finite. We provide a sound mathematical framework for infinite PDBs generalizing the existing theory of finite PDBs. Our main results are concerned with countable, tuple-independent PDBs; we present a generic construction showing that such PDBs exist in the infinite setting and provide a characterization of their existence. This construction can be used to give an open-world semantics to finite PDBs. The construction can also be extended to so-called block-independent-disjoint probabilistic databases. Algorithmic questions are not the focus of this paper, but we show how query evaluation algorithms can be lifted from finite PDBs to perform approximate evaluation (with an arbitrarily small additive approximation error) in countably infinite tuple-independent PDBs.

Query Evaluation in Election Databases

Election databases are the main elements of a recently introduced framework that aims to create bridges between the computational social choice and the data management communities. An election database consists of incomplete information about the preferences of voters, in the form of partial orders, alongside standard database relations that provide contextual information. Earlier work in computational social choice focused on the computation of possible winners and necessary winners that are determined by the available incomplete information and the voting rule at hand. The presence of the relational context, however, permits the formulation of sophisticated queries about voting rules, candidates, potential winners, issues, and positions on issues. Such queries can be given possible answer semantics and necessary answer semantics on an election database, where the former means that the query is true on some completion of the given partial orders and the latter means that the query is true on every such completion. We carry out a systematic investigation of query evaluation on election databases by analyzing how the interaction between the partial preferences, the voting rules, and the relational context affects the complexity of query evaluation. To this effect, we focus on positional scoring rules and unions of conjunctive queries. We establish a number of results that delineate the complexity of the possible answers and of the necessary answers for different positional scoring rules and for various classes of unions of conjunctive queries. Furthermore, we show that query evaluation is fixed-parameter tractable, where the parameter is the number of candidates in the election.

SESSION: Gems of PODS and Test-of-Time Award Session

2019 ACM PODS Alberto O. Mendelzon Test-of-Time Award

Database Repairs and Consistent Query Answering: Origins and Further Developments

In this article we review the main concepts around database repairs and consistent query answering, with emphasis on tracing back the origin, motivation, and early developments. We also describe some research directions that have spun off from those main concepts and the original line of research. We emphasize, in particular, fruitful and recent connections between repairs and causality in databases.

SESSION: PODS Session 2: Best Paper Award, Enumeration and Counting

Efficient Logspace Classes for Enumeration, Counting, and Uniform Generation

In this work, we study two simple yet general complexity classes, based on logspace Turing machines, which provide a unifying framework for efficient query evaluation in areas like information extraction and graph databases, among others. We investigate the complexity of three fundamental algorithmic problems for these classes: enumeration, counting and uniform generation of solutions, and show that they have several desirable properties in this respect. Both complexity classes are defined in terms of nondeterministic logspace transducers (NL transducers). For the first class, we consider the case of unambiguous NL transducers, and we prove constant delay enumeration, and both counting and uniform generation of solutions in polynomial time. For the second class, we consider unrestricted NL transducers, and we obtain polynomial delay enumeration, approximate counting in polynomial time, and polynomial-time randomized algorithms for uniform generation. More specifically, we show that each problem in this second class admits a fully polynomial-time randomized approximation scheme (FPRAS) and a polynomial-time Las Vegas algorithm for uniform generation. Interestingly, the key idea to prove these results is to show that the fundamental problem #NFA admits an FPRAS, where #NFA is the problem of counting the number of strings of length n accepted by a nondeterministic finite automaton (NFA). While this problem is known to be #P-complete and, more precisely, SpanL-complete, it was open whether this problem admits an FPRAS. In this work, we solve this open problem, and obtain as a welcome corollary that every function in SpanL admits an FPRAS.
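
For intuition on why #NFA is a natural target, here is a hedged sketch of the exact dynamic program that counts accepted strings by determinizing on the fly: it is correct, but the number of reachable state sets can grow exponentially with the automaton, which is precisely the blow-up an FPRAS avoids. The automaton encoding below is an illustrative choice, not the paper's construction.

```python
# Exact #NFA by on-the-fly determinization: track, for each reachable *set*
# of states, how many strings of the current length lead to it. Correct, but
# potentially exponential in the number of NFA states -- which is why an
# FPRAS for #NFA is the interesting question.
from collections import defaultdict

def count_accepted(n, alphabet, start, accepting, delta):
    """Count the strings of length n accepted by the NFA.

    delta maps (state, symbol) -> set of successor states.
    """
    dist = {frozenset(start): 1}  # dist[S] = #strings whose reachable set is S
    for _ in range(n):
        nxt = defaultdict(int)
        for S, cnt in dist.items():
            for a in alphabet:
                T = frozenset(q for s in S for q in delta.get((s, a), ()))
                nxt[T] += cnt
        dist = nxt
    return sum(cnt for S, cnt in dist.items() if S & accepting)

# Toy NFA over {0,1} accepting strings whose second-to-last symbol is 1.
delta = {("q0", "0"): {"q0"}, ("q0", "1"): {"q0", "q1"},
         ("q1", "0"): {"q2"}, ("q1", "1"): {"q2"}}
print(count_accepted(4, "01", {"q0"}, {"q2"}, delta))  # 8 of the 16 strings
```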

Ranked Enumeration of Minimal Triangulations

A tree decomposition of a graph facilitates computations by grouping vertices into bags that are interconnected in an acyclic structure; hence their importance in a plethora of problems such as query evaluation over databases and inference over probabilistic graphical models. The relative benefit from different tree decompositions is measured by diverse (sometimes complex) cost functions that vary from one application to another. For generic cost functions like width and fill-in, an optimal tree decomposition can be efficiently computed in some cases, notably when the number of minimal separators is bounded by a polynomial (due to Bouchitté and Todinca); we refer to this assumption as "poly-MS". To cover the variety of cost functions in need, it has recently been proposed to devise algorithms for enumerating many decomposition candidates for applications to choose from using specialized, or even machine-learned, cost functions. We explore the ability to produce a large collection of "high quality" tree decompositions. We present the first algorithm for ranked enumeration of the proper (non-redundant) tree decompositions, or equivalently minimal triangulations, under a wide class of cost functions that substantially generalizes the generic ones above. On the theoretical side, we establish the guarantee of polynomial delay if poly-MS is assumed, or if we are interested in tree decompositions of a width bounded by a constant. We describe an experimental evaluation on graphs of various domains (including join queries, Bayesian networks, treewidth benchmarks and random graphs), and explore both the applicability of the poly-MS assumption and the performance of our algorithm relative to the state of the art.

Enumeration on Trees with Tractable Combined Complexity and Efficient Updates

We give an algorithm to enumerate the results on trees of monadic second-order (MSO) queries represented by nondeterministic tree automata. After linear time preprocessing (in the input tree), we can enumerate answers with linear delay (in each answer). We allow updates on the tree to take place at any time, and we can then restart the enumeration after logarithmic time in the tree. Further, all our combined complexities are polynomial in the automaton. Our result follows our previous circuit-based enumeration algorithms based on deterministic tree automata, and is also inspired by our earlier result on words and nondeterministic sequential extended variable-set automata in the context of document spanners. We extend these results and combine them with a recent tree balancing scheme by Niewerth, so that our enumeration structure supports updates to the underlying tree in logarithmic time (with leaf insertions, leaf deletions, and node relabelings). Our result implies that, for MSO queries with free first-order variables, we can enumerate the results with linear preprocessing and constant delay, and update the underlying tree in logarithmic time, which improves on several known results for words and trees. Building on lower bounds from data structure research, we also show unconditionally that, up to a doubly logarithmic factor, the update time of our algorithm is optimal. Thus, unlike other settings, there can be no algorithm with constant update time.

Counting Database Repairs under Primary Keys Revisited

Consistent query answering (CQA) aims to deliver meaningful answers when queries are evaluated over inconsistent databases. Such answers must be certainly true in all repairs, which are consistent databases whose difference from the inconsistent one is somehow minimal. An interesting task in this context is to count the number of repairs that entail the query. This problem has already been studied for conjunctive queries and primary keys; we know that it is #P-complete in data complexity under polynomial-time Turing reductions (a.k.a. Cook reductions). However, as has already been observed in the literature of counting complexity, there are problems that are "hard-to-count-easy-to-decide", which cannot be complete (under reasonable assumptions) for #P under weaker reductions, and, in particular, under standard many-one logspace reductions (a.k.a. parsimonious reductions). For such "hard-to-count-easy-to-decide" problems, a crucial question is whether we can determine their exact complexity by looking for subclasses of #P to which they belong. Ideally, we would like to show that such a problem is complete for a subclass of #P under many-one logspace reductions. The main goal of this work is to perform such a refined analysis for the problem of counting the number of repairs under primary keys that entail the query.

The Complexity of Counting Cycles in the Adjacency List Streaming Model

We study the problem of counting cycles in the adjacency list streaming model, fully resolving in which settings there exist sublinear space algorithms. Our main upper bound is a two-pass algorithm for estimating triangles that uses $\widetilde{O}(m/T^{2/3})$ space, where m is the edge count and T is the triangle count of the graph. On the other hand, we show that no sublinear space multipass algorithm exists for counting $\ell$-cycles for $\ell \geq 5$. Finally, we show that counting 4-cycles is intermediate: sublinear space algorithms exist in multipass but not single-pass settings.
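
As a reference point for the space bound, the sketch below counts triangles exactly from adjacency lists by storing the whole graph; the paper's two-pass estimator replaces exact counting with sampling to get down to roughly $m/T^{2/3}$ space. This is the naive baseline, not the paper's algorithm.

```python
# Reference baseline only: exact triangle counting from adjacency lists,
# using memory linear in the graph. The paper's contribution is a two-pass
# estimator with roughly m / T^(2/3) space instead of storing the graph.
def count_triangles(adj):
    """adj: dict mapping vertex -> set of neighbours (undirected graph)."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if v > u:
                # Each triangle {u, v, w} with u < v < w is counted once.
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
print(count_triangles(adj))  # 2 triangles: {1,2,3} and {2,3,4}
```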

On the Enumeration Complexity of Unions of Conjunctive Queries

We study the enumeration complexity of Unions of Conjunctive Queries (UCQs). We aim to identify the UCQs that are tractable in the sense that the answer tuples can be enumerated with a linear preprocessing phase and a constant delay between successive tuples. It has been established that, in the absence of self joins and under conventional complexity assumptions, the CQs that admit such an evaluation are precisely the free-connex ones. A union of tractable CQs is always tractable. We generalize the notion of free-connexity from CQs to UCQs, thus showing that some unions containing intractable CQs are, in fact, tractable. Interestingly, some unions consisting of only intractable CQs are tractable too. The question of finding a full characterization of the tractability of UCQs remains open. Nevertheless, we prove that for several classes of queries, free-connexity fully captures the tractable UCQs.

SESSION: PODS Session 3: Information Extraction, Hashing, and Privacy

Split-Correctness in Information Extraction

Programs for extracting structured information from text, namely information extractors, often operate separately on document segments obtained from a generic splitting operation such as sentences, paragraphs, k-grams, HTTP requests, and so on. An automated detection of this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans with parallel evaluation on segments for accelerating the processing of large documents. Other applications include the incremental evaluation on dynamic content, where re-evaluation of information extractors can be restricted to revised segments, and debugging, where developers of information extractors are informed about potential boundary crossing of different semantic components. We propose a new formal framework for split-correctness within the formalism of document spanners. Our preliminary analysis studies the complexity of split-correctness over regular spanners. We also discuss different variants of split-correctness, for instance, in the presence of black-box extractors with "split constraints".

Robust Set Reconciliation via Locality Sensitive Hashing

We consider variations of set reconciliation problems where two parties, Alice and Bob, each hold a set of points in a metric space, and the goal is for Bob to conclude with a set of points that is close to Alice's set of points in a well-defined way. This setting has been referred to as robust set reconciliation. In one variation, the goal is for Bob to end with a set of points that is close to Alice's in earth mover's distance, and in another the goal is for Bob to have a point that is close to each of Alice's. The first problem has been studied before; while previous results achieved an $O(d)$ approximation, where d is the dimension of the space, we achieve an $O(\log n)$ approximation, where n is the number of points. The second problem appears new, and here we find schemes that, under certain conditions, use sublinear communication. Our primary novelty is utilizing Invertible Bloom Lookup Tables in combination with locality sensitive hashing. This combination allows us to cope with the geometric setting in a communication-efficient manner.
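
The following sketch illustrates the locality-sensitive ingredient in isolation: quantizing points to grid cells so that Alice's and Bob's nearly identical points hash to the same key, after which a set-reconciliation structure (the paper uses Invertible Bloom Lookup Tables) can reconcile the keys. The grid quantizer, cell size, and offset are illustrative simplifications; practical LSH schemes use random offsets and several hash functions to handle cell boundaries.

```python
# Illustrative grid-based locality-sensitive bucketing: nearby points share
# a key, so small perturbations do not inflate the symmetric difference that
# a set-reconciliation protocol (e.g., via an IBLT) must communicate.
import math

def lsh_key(point, cell=1.0, offset=0.0):
    """Quantize a point to its grid cell. Points much closer than `cell`
    usually collide; boundary cases are handled in practice by random
    offsets and multiple hash functions (omitted here)."""
    return tuple(math.floor((x + offset) / cell) for x in point)

alice = [(0.12, 3.40), (7.51, 2.20)]
bob   = [(0.11, 3.43), (5.00, 5.00)]   # first point: slight perturbation
a_keys = {lsh_key(p) for p in alice}
b_keys = {lsh_key(p) for p in bob}
print(a_keys - b_keys, b_keys - a_keys)  # only truly different points remain
```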

What Storage Access Privacy is Achievable with Small Overhead?

Oblivious RAM (ORAM) and private information retrieval (PIR) are classic cryptographic primitives used to hide the access pattern to data whose storage has been outsourced to an untrusted server. Unfortunately, both primitives require considerable overhead compared to plaintext access. For large-scale storage infrastructure with highly frequent access requests, the degradation in response time and the exorbitant increase in resource costs incurred by either ORAM or PIR prevent their usage. In an ideal scenario, a privacy-preserving storage protocol with small overhead would be implemented for these heavily trafficked storage systems to avoid negatively impacting performance or costs. In this work, we study the problem of the best storage access privacy that is achievable with only small overhead over plaintext access. To answer this question, we consider differentially private access, a generalization of the oblivious access security notion considered by ORAM and PIR. Quite surprisingly, we present strong evidence that constant overhead storage schemes may only be achieved with privacy budgets of $\epsilon = \Omega(\log n)$. We present asymptotically optimal constructions for differentially private variants of both ORAM and PIR with privacy budgets $\epsilon = \Theta(\log n)$ and only $O(1)$ overhead. In addition, we consider a more complex storage primitive called key-value storage, in which data is indexed by keys from a large universe (as opposed to consecutive integers in ORAM and PIR). We present a differentially private key-value storage scheme with $\epsilon = \Theta(\log n)$ and $O(\log\log n)$ overhead. This construction uses a new oblivious, two-choice hashing scheme that may be of independent interest.

SESSION: PODS Session 4: Streams

Tight Trade-offs for the Maximum k-Coverage Problem in the General Streaming Model

We study the maximum k-coverage problem in the general edge-arrival streaming model: given a collection of m sets F, each a subset of a ground set of elements U of size n, the task is to find k sets whose coverage is maximized. The sets are specified as a sequence of (element, set) pairs in an arbitrary order. Our main result is a tight (up to polylogarithmic factors) trade-off between the space complexity and the approximation factor $\alpha \in (1/(1-1/e), \widetilde{\Omega}(\sqrt{m})]$ of any single-pass streaming algorithm that estimates the maximum coverage size. Specifically, we show that the optimal space bound is $\widetilde{\Theta}(m/\alpha^2)$. Moreover, we design a single-pass algorithm that reports an $\alpha$-approximate solution in $\widetilde{O}(m/\alpha^2 + k)$ space. Our algorithm heavily exploits data stream sketching techniques, which could lead to further connections between vector sketching methods and streaming algorithms for combinatorial optimization tasks.
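
For orientation, the classical offline greedy below achieves the $(1 - 1/e)$ approximation that anchors this trade-off; the paper's contribution is streaming algorithms that see the input only as a stream of (element, set) pairs. This is the standard textbook greedy, not the streaming algorithm.

```python
# The classical offline greedy for maximum k-coverage, achieving a (1 - 1/e)
# approximation. Shown for orientation only; the paper's single-pass
# algorithm works in ~O(m / alpha^2 + k) space over an (element, set) stream.
def greedy_k_coverage(sets, k):
    """sets: dict mapping set id -> set of elements. Returns chosen ids."""
    covered, chosen = set(), []
    for _ in range(k):
        # Pick the set covering the most yet-uncovered elements.
        best = max(sets, key=lambda s: len(sets[s] - covered))
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}, "D": {1, 5}}
print(greedy_k_coverage(sets, k=2))  # picks C then A, covering 7 elements
```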

Weighted Reservoir Sampling from Distributed Streams

We consider message-efficient continuous random sampling from a distributed stream, where the probability of inclusion of an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, is well studied, and admits tight upper and lower bounds on message complexity. For weighted sampling with replacement, there is a simple reduction to unweighted sampling with replacement. However, in many applications the stream may have only a few heavy items which may dominate a random sample when chosen with replacement. Weighted sampling without replacement (weighted SWOR) eludes this issue, since such heavy items can be sampled at most once. In this work, we present the first message-optimal algorithm for weighted SWOR from a distributed stream. Our algorithm also has optimal space and time complexity. As an application of our algorithm for weighted SWOR, we derive the first distributed streaming algorithms for tracking heavy hitters with residual error. Here the goal is to identify stream items that contribute significantly to the residual stream, once the heaviest items are removed. Residual heavy hitters generalize the notion of $\ell_1$ heavy hitters and are important in streams that have a skewed distribution of weights. In addition to the upper bound, we also provide a lower bound on the message complexity that is nearly tight up to a $\log(1/\epsilon)$ factor. Finally, we use our weighted sampling algorithm to improve the message complexity of distributed $L_1$ tracking, also known as count tracking, which is a widely studied problem in distributed streaming. We also derive a tight message lower bound, which closes the message complexity of this fundamental problem.
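
As background, a common single-machine realization of weighted SWOR is the key-based scheme of Efraimidis and Spirakis: assign each item the key $u^{1/w}$ for uniform $u$ and keep the k largest keys. The sketch below shows this classical primitive only; the paper's actual contribution, the message-optimal distributed protocol, is not reproduced here.

```python
# Classical weighted sampling without replacement (Efraimidis & Spirakis,
# "A-Res"): item i gets key u_i^(1/w_i) for uniform u_i; keep the k largest
# keys. A single-stream sketch of the primitive, not the paper's
# message-optimal distributed algorithm.
import heapq
import random

def weighted_swor(stream, k, rng=random):
    """stream yields (item, weight) pairs; returns k items sampled without
    replacement with probability proportional to weight."""
    heap = []  # min-heap of (key, item); the root is the smallest kept key
    for item, w in stream:
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

print(weighted_swor([("a", 1.0), ("b", 10.0), ("c", 1.0), ("d", 5.0)], k=2))
```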

Distributed and Streaming Linear Programming in Low Dimensions

We study linear programming and general LP-type problems in several big data (streaming and distributed) models. We mainly focus on low dimensional problems in which the number of constraints is much larger than the number of variables. Low dimensional LP-type problems appear frequently in various machine learning tasks such as robust regression, support vector machines, and core vector machines. As supporting large-scale machine learning queries in database systems has become an important direction for database research, obtaining efficient algorithms for low dimensional LP-type problems on massive datasets is of great value. In this paper we give both upper and lower bounds for LP-type problems in distributed and streaming models. Our bounds are almost tight when the dimensionality of the problem is a fixed constant.

Better Sliding Window Algorithms to Maximize Subadditive and Diversity Objectives

The streaming computation model is a standard model for large-scale data analysis: the input arrives one element at a time, and the goal is to maintain an approximately optimal solution using only constant or, at worst, polylogarithmic space.

In practice, however, recency plays a large role, and one often wishes to consider only the last w elements that have arrived, the so-called sliding window problem. A trivial approach is to simply store the last w elements in a buffer; our goal is to develop algorithms with space and update time sublinear in w. In this regime, there are two frameworks: exponential histograms and smooth histograms, which can be used to obtain sliding window algorithms for families of functions satisfying certain properties.

Unfortunately, these frameworks have limitations and cannot always be applied directly. A prominent example is the problem of maximizing a submodular function under a cardinality constraint. While some of these difficulties can be rectified on a case-by-case basis, here we describe an alternative approach to designing efficient sliding window algorithms for maximization problems. We then instantiate this approach on a wide range of problems, yielding better algorithms for submodular function optimization, diversity optimization, and general subadditive optimization. In doing so, we improve state-of-the-art results obtained using problem-specific algorithms.
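
The trivial buffer baseline mentioned above is easy to state precisely; the sketch below spends space and query time linear in w, which is exactly the cost profile sublinear sliding window algorithms are designed to beat. The `solve` parameter stands in for any maximization routine and is an illustrative abstraction.

```python
# The trivial Theta(w)-space sliding-window baseline: buffer the last w
# elements and recompute from scratch on every query. Shown for contrast
# only; the goal in the paper is space and update time sublinear in w.
from collections import deque

class SlidingWindowBaseline:
    def __init__(self, w, solve):
        self.buf = deque(maxlen=w)  # evicts the oldest element automatically
        self.solve = solve          # any maximization routine over a window

    def update(self, item):
        self.buf.append(item)

    def query(self):
        return self.solve(list(self.buf))  # Theta(w) work per query

window = SlidingWindowBaseline(w=3, solve=max)  # toy objective: the maximum
for x in [5, 1, 9, 2, 3, 4]:
    window.update(x)
print(window.query())  # max over the last 3 elements -> 4
```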

SESSION: PODS Invited Tutorial 1

Making Consistency Protocols Serializable

This tutorial, given at PODS 2019, focuses on several research agendas from the past decade or so that examine weak isolation levels and obtain many (or all) of the benefits for application integrity traditionally achieved by serializable concurrency control. The tutorial presents both the mechanisms and the reasoning approaches from these research works.

SESSION: PODS Session 5: Semistructured Data and Knowledge Graphs, Logic, and Verification

The Space-Efficient Core of Vadalog

Vadalog is a system for performing complex reasoning tasks such as those required in advanced knowledge graphs. The logical core of the underlying Vadalog language is the warded fragment of tuple-generating dependencies (TGDs). This formalism ensures tractable reasoning in data complexity, while a recent analysis focusing on a practical implementation led to the reasoning algorithm around which the Vadalog system is built. A fundamental question that has emerged in the context of Vadalog is the following: can we limit the recursion allowed by wardedness in order to obtain a formalism that provides a convenient syntax for expressing useful recursive statements, and at the same time achieves space-efficiency? After analyzing several real-life examples of warded sets of TGDs provided by our industrial partners, as well as recent benchmarks, we observed that recursion is often used in a restricted way: the body of a TGD contains at most one atom whose predicate is mutually recursive with a predicate in the head. We show that this type of recursion, known as piece-wise linear in the Datalog literature, is the answer to our main question. We further show that piece-wise linear recursion alone, without the wardedness condition, is not enough as it leads to the undecidability of reasoning. We finally study the relative expressiveness of the query languages based on (piece-wise linear) warded sets of TGDs.

Decidable XPath Fragments in the Real World

XPath is arguably the most popular query language for selecting elements in XML documents. Besides query evaluation, query satisfiability and containment are the main computational problems for XPath; they are useful, for instance, to detect dead code or validate query optimisations. These problems are undecidable in general, but several fragments have been identified over time for which satisfiability (or query containment) is decidable: CoreXPath 1.0 and 2.0 without so-called data joins, fragments with data joins but limited navigation, etc. However, these fragments are often given in a simplified syntax, and sometimes w.r.t. a simplified XPath semantics. Moreover, they have been studied mostly with theoretical motivations, with little consideration for the practically relevant features of XPath. To investigate the practical impact of these theoretical fragments, we design a benchmark compiling thousands of real-world XPath queries extracted from open-source projects, and match them against syntactic fragments from the literature. We investigate how to extend these fragments with seldom-considered features such as free variables, data tests, data joins, and the last() and id() functions, for which we provide both undecidability and decidability results. We analyse the coverage of the original and extended fragments, and further provide a glimpse at which other practical features might be worth investigating in the future.

Containment of Shape Expression Schemas for RDF

We study the problem of containment of shape expression schemas (ShEx) for RDF graphs. We identify a subclass of ShEx that has a natural graphical representation in the form of shape graphs and whose semantics is captured with a tractable notion of embedding of an RDF graph in a shape graph. When applied to pairs of shape graphs, an embedding is a sufficient condition for containment, and for a practical subclass of deterministic shape graphs, it is also a necessary one, thus yielding a subclass with tractable containment. Containment for general shape graphs is EXP-complete. Finally, we show that containment for arbitrary ShEx is decidable.

Complexity Bounds for Relational Algebra over Document Spanners

We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e., consisting of only natural joins, projections, and unions); here, the RA expression is fixed and the input consists of both the regex formulas and the document. In this work, we explore the implications of two fundamental generalizations. The first is adopting the "schemaless" semantics for spanners, as proposed and studied by Maturana et al. The second is going beyond the positive RA by allowing the difference operator. We show that each of the two generalizations introduces computational hardness: it is intractable to compute the natural join of two regex formulas under the schemaless semantics, and the difference between two regex formulas under both the ordinary and schemaless semantics. Nevertheless, we propose and analyze syntactic constraints, on the RA expression and the regex formulas at hand, such that the expressive power is fully preserved and, yet, evaluation can be done with polynomial delay. Unlike the previous work on RA over regex formulas, our technique is not (and provably cannot be) based on the static compilation of regex formulas, but rather on an ad-hoc compilation into an automaton that incorporates both the query and the document. This approach also allows us to include black-box extractors in the RA expression.
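
As a toy illustration of regex formulas and their natural join, Python's named capture groups can play the role of capture variables, with each match contributing a tuple of spans; the set intersection below acts as a natural join only because both formulas bind exactly the same variable x. This miniature is meant to fix intuition and does not reflect the paper's automaton-based compilation.

```python
# Regex formulas in miniature: named groups stand in for capture variables,
# and evaluating a formula on a document yields a relation of spans.
import re

def extract(pattern, doc):
    """Evaluate a regex formula: one tuple of spans per match,
    ordered by variable name."""
    return {tuple(m.span(v) for v in sorted(m.groupdict()))
            for m in re.finditer(pattern, doc)}

doc = "Ada met Bob. bob met Eve."
R = extract(r"(?P<x>[A-Z]\w*)", doc)   # x = spans of capitalized words
S = extract(r"(?P<x>\w+) met", doc)    # x = spans of words before "met"
print(R & S)  # join on x: {((0, 3),)} -- the span of "Ada"
```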

Reachability in Database-driven Systems with Numerical Attributes under Recency Bounding

A prominent research direction of the database theory community is to develop techniques for verification of database-driven systems operating over relational and numerical data. Along this line, we lift the framework of database manipulating systems (Abdulla et al., PODS 2016), which handles relational data, to also accommodate numerical data and the natural order on them. We study an under-approximation called recency bounding under which the most basic verification problem, reachability, is decidable. Even under this under-approximation the reachability space is infinite in multiple dimensions, owing to the unbounded size of the active domain, the unbounded numerical domain it has access to, and the unbounded length of the executions. We show that, nevertheless, reachability is ExpTime-complete. Going beyond reachability to LTL model checking renders verification undecidable.

Compiling Existential Positive Queries to Bounded-Variable Fragments

A crucial property of bounded-variable fragments of first-order logic is that they can be evaluated in polynomial time. It is therefore a useful preprocessing step to rewrite, if possible, a first-order query to a logically equivalent one with a minimum number of variables. However, it may occur that reducing the number of variables causes an increase in formula size. We investigate this trade-off for the existential-positive fragment of first-order queries, where variable minimisation is decidable in general. In particular, we study the blow-up in the formula size when compiling existential-positive queries to the bounded-variable fragment of positive first-order logic. While the increase of the formula size is always at most exponential, we identify situations (based on the signature and the number of variables) where only a polynomial blow-up is needed. In all other cases, we show an exponential lower bound on the size of the compiled formula that matches the general upper bound. This exponential lower bound is unconditional, and is the first unconditional lower bound for formula size with respect to the studied compilation; it is proved via establishing a novel interface with circuit complexity which may be of future interest.

SESSION: PODS Session 6: Containment and Homomorphisms

Testability of Homomorphism Inadmissibility: Property Testing Meets Database Theory

In this paper, we utilize the perspective of property testing to consider the testability of relational database queries. A primary motivation is the desire to avoid reading an entire database to decide a property thereof. We focus on conjunctive queries, which are the most basic and heavily studied database queries. Each conjunctive query can be represented as a relational structure A such that deciding if the conjunctive query is satisfied by a relational structure B is equivalent to deciding if there exists a homomorphism from A to B. We phrase our results in terms of homomorphisms. Precisely, we study, for each relational structure A, the testability of homomorphism inadmissibility from A. We consider algorithms that have oracle access to an input relational structure B and that distinguish, with high probability, the case where there is no homomorphism from A to B, from the case where one needs to remove a constant fraction of tuples from B in order to suppress all such homomorphisms. We provide a complete characterization of the structures A from which one can test homomorphism inadmissibility with one-sided error by making a constant number of queries to B. Our characterization shows that homomorphism inadmissibility from A is constant-query testable with one-sided error if and only if the core of A is alpha-acyclic. We also show that the injective version of the problem is constant-query testable with one-sided error if A is alpha-acyclic; this result generalizes existing results for testing subgraph-freeness in the general graph model.
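
For reference, the underlying decision problem can be stated as a brute-force search: a homomorphism from A to B is a mapping of A's domain into B's domain that sends every tuple of A into the corresponding relation of B. The dict encoding of structures below is an illustrative choice; the exponential search highlights why one asks instead whether inadmissibility can be tested with a constant number of probes into B.

```python
# Brute force for the underlying decision problem: does a homomorphism from
# structure A to structure B exist? Structures are dicts from relation names
# to sets of tuples. Exponential in |dom(A)| -- the paper asks when the
# *inadmissibility* of such a homomorphism is testable with constantly many
# probes into B.
from itertools import product

def homomorphism_exists(A, B, dom_A, dom_B):
    for img in product(dom_B, repeat=len(dom_A)):
        h = dict(zip(dom_A, img))
        # h is a homomorphism iff every tuple of A maps into B's relation.
        if all(tuple(h[a] for a in t) in B[R] for R in A for t in A[R]):
            return True
    return False

A = {"E": {(1, 2), (2, 3), (3, 1)}}   # directed triangle
B = {"E": {(4, 5), (5, 4)}}           # directed 2-cycle
print(homomorphism_exists(A, B, [1, 2, 3], [4, 5]))  # False: odd cycle
```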

The Selfish Models Property: Bounding the Complexity of Query Containment and Entailment Problems

Query containment is the fundamental problem of deciding, given two database queries, if the result of the first query is always contained in the result of the second query. For a number of established query classes, an instance of this problem can be decided by computing a set of models of the first query, and then evaluating the second query on each of the models. We formalize this phenomenon by introducing the selfish models property; this property gives an avenue for establishing both the decidability of and complexity upper bounds for containment problems. Using this property, we show how existing results can be uniformly derived, and we present two significant novel positive results for first-order query containment problems, exhibiting complexity upper bounds for containment problems that were not previously known to be decidable.

Attacking Diophantus: Solving a Special Case of Bag Containment

Conjunctive-query containment is the problem of deciding whether the answers of a given conjunctive query on an arbitrary database instance are always contained in the answers of a second query on the same instance. This is a very relevant question in query optimization, data integration, and other data management and artificial intelligence areas. The problem has been deeply studied and understood for the so-called set semantics, i.e., when query answers and database instances are modelled as sets of tuples. In particular, it has been shown by Chandra and Merlin to be NP-complete. On the contrary, when investigated under bag semantics, a.k.a. multiset semantics, which allows for replicated tuples both in the underlying instance and in the query answers, it is not even clear whether the problem is decidable. Since this is exactly the standard interpretation for commercial relational database systems, the question turns out to be an important one. Multiple works on variations and restrictions of the bag-containment problem have been reported in the literature and, although the general problem is still open, we contribute with this article by solving a special case that has been identified as a major open problem on its own. More specifically, we study projection-free queries, i.e., queries without existentially quantified variables, and show decidability for the bag-containment problem of a projection-free conjunctive query into a generic conjunctive query. Indeed, we prove that deciding containment in this setting is in $\Pi^p_2$. Our approach relies on the solution of a special case of the Diophantine inequality problem via a reduction to the linear inequality problem and clearly exposes inherent difficulties in the analysis of the general question.

SESSION: PODS Session 7: Joins, Hypergraphs, and Aggregate Queries

On Functional Aggregate Queries with Additive Inequalities

Motivated by fundamental applications in databases and relational machine learning, we formulate and study the problem of answering functional aggregate queries (FAQ) in which some of the input factors are defined by a collection of additive inequalities between variables. We refer to these queries as FAQ-AI for short. To answer FAQ-AI in the Boolean semiring, we define relaxed tree decompositions and relaxed submodular and fractional hypertree width parameters. We show that an extension of the InsideOut algorithm using Chazelle's geometric data structure for solving the semigroup range search problem can answer Boolean FAQ-AI in time given by these new width parameters. This new algorithm achieves lower complexity than known solutions for FAQ-AI. It also recovers some known results in database query answering. Our second contribution is a relaxation of the set of polymatroids that gives rise to the counting version of the submodular width, denoted by #subw. This new width is sandwiched between the submodular and the fractional hypertree widths. Any FAQ and any FAQ-AI over a single semiring can be answered in time proportional to #subw and to the relaxed version of #subw, respectively. We present three applications of our FAQ-AI framework to relational machine learning: k-means clustering, training linear support vector machines, and training models using non-polynomial loss. These optimization problems can be solved over a database asymptotically faster than computing the join of the database relations.

Topology Dependent Bounds For FAQs

In this paper, we prove topology dependent bounds on the number of rounds needed to compute Functional Aggregate Queries (FAQs) studied by Abo Khamis et al. [PODS 2016] in a synchronous distributed network under the model considered by Chattopadhyay et al. [FOCS 2014, SODA 2017]. Unlike the recent work on computing database queries in the Massively Parallel Computation model, in the model of Chattopadhyay et al., nodes can communicate only via private point-to-point channels and we are interested in bounds that work over an arbitrary communication topology. This model, which is closer to the well-studied CONGEST model in distributed computing and generalizes Yao's two-party communication complexity model, has so far only been studied for problems that are common in the two-party communication complexity literature. This is the first work to consider more practically motivated problems in this distributed model. For the sake of exposition, we focus on two specific problems in this paper: Boolean Conjunctive Query (BCQ) and computing variable/factor marginals in Probabilistic Graphical Models (PGMs). We obtain tight bounds on the number of rounds needed to compute such queries as long as the underlying hypergraph of the query is $O(1)$-degenerate and has $O(1)$-arity. In particular, the $O(1)$-degeneracy condition covers most well-studied queries that are efficiently computable in the centralized computation model like queries with constant treewidth. These tight bounds depend on a new notion of 'width' (namely internal-node-width) for Generalized Hypertree Decompositions (GHDs) of acyclic hypergraphs, which minimizes the number of internal nodes in a sub-class of GHDs. To the best of our knowledge, this width has not been studied explicitly in the theoretical database literature. Finally, we consider the problem of computing the product of a vector with a chain of matrices and prove tight bounds on its round complexity (over a finite field of two elements) using a novel min-entropy based argument.

Instance and Output Optimal Parallel Algorithms for Acyclic Joins

Massively parallel join algorithms have received much attention in recent years, while most prior work has focused on worst-case optimal algorithms. However, the worst-case optimality of these join algorithms relies on hard instances having very large output sizes, which rarely appear in practice. A stronger notion of optimality is output-optimal, which requires an algorithm to be optimal within the class of all instances sharing the same input and output size. An even stronger optimality is instance-optimal, i.e., the algorithm is optimal on every single instance, but this may not always be achievable. In the traditional RAM model of computation, the classical Yannakakis algorithm is instance-optimal on any acyclic join. But in the massively parallel computation (MPC) model, the situation becomes much more complicated. We first show that for the class of r-hierarchical joins, instance-optimality can still be achieved in the MPC model. Then, we give a new MPC algorithm for an arbitrary acyclic join with load $O(\mathrm{IN}/p + \sqrt{\mathrm{IN} \cdot \mathrm{OUT}}/p)$, where $\mathrm{IN}$ and $\mathrm{OUT}$ are the input and output sizes of the join, and p is the number of servers in the MPC model. This improves the MPC version of the Yannakakis algorithm by an $O(\sqrt{\mathrm{OUT}/\mathrm{IN}})$ factor. Furthermore, we show that this is output-optimal when $\mathrm{OUT} = O(p \cdot \mathrm{IN})$, for every acyclic but non-r-hierarchical join. Finally, we give the first output-sensitive lower bound for the triangle join in the MPC model, showing that it is inherently more difficult than acyclic joins.

HyperBench: A Benchmark and Tool for Hypergraphs and Empirical Findings

To cope with the intractability of answering Conjunctive Queries (CQs) and solving Constraint Satisfaction Problems (CSPs), several notions of hypergraph decompositions have been proposed - giving rise to different notions of width, notably plain, generalized, and fractional hypertree width (hw, ghw, and fhw). Given the increasing interest in using such decomposition methods in practice, a publicly accessible repository of decomposition software, as well as a large set of benchmarks, and a web-accessible workbench for inserting, analysing, and retrieving hypergraphs are called for. We address this need by providing (i) concrete implementations of hypergraph decompositions (including new practical algorithms), (ii) a new, comprehensive benchmark of hypergraphs stemming from disparate CQ and CSP collections, and (iii) HyperBench, our new web-interface for accessing the benchmark and the results of our analyses. In addition, we describe a number of actual experiments we carried out with this new infrastructure.

SESSION: PODS Invited Tutorial 2

Algorithmic Fairness: Measures, Methods and Representations

What happens when we replace - or augment - human decision-making with algorithms? This is a simple question, but the answers now define a new field of study - a field that I call algorithmic fairness, and that spans issues of fairness, discrimination, accountability, transparency, interpretability and responsibility, and so much more. While some of the early work in the area came out of data mining and machine learning, the field is now truly transdisciplinary, with contributions from all across computer science, as well as from all disciplines that touch on aspects of society - whether it be economics, philosophy, sociology, political science, or communication. In this tutorial, I'll try to do three things: I'll survey the main questions and some of the key insights we've developed over the years. I'll explain the web of connections between the technical and the social disciplines that make up this area, and I'll point to exciting directions that remain to be explored in both technical and social dimensions. Along the way I hope to illustrate what I think are some interesting "collisions" between computer science and the social sciences, and call for a reimagining of core ideas in our field, including the very idea of how we think about data representation.