The phenomenon of "Big Data" is creating a need for research perspectives that blend computational thinking (with its focus on, e.g., abstractions, algorithms and scalability) with inferential thinking (with its focus on, e.g., underlying populations, sampling patterns, error bars and predictions). Database researchers and statistical machine learning researchers are centrally involved in the creation of this blend, and research that incorporates perspectives from both databases and machine learning will be of particular value in the bigger picture. This is true both for methodology and for theory. I present highlights of several research initiatives that draw jointly on database and statistical foundations, including work on concurrency control and distributed inference, subsampling, time/data tradeoffs and inference/privacy tradeoffs.
The framework of database repairs provides a principled approach to managing inconsistencies in databases. Informally, a repair of an inconsistent database is a consistent database that differs from the inconsistent one in a "minimal way." A fundamental problem in this framework is the repair-checking problem: given two instances, is the second a repair of the first? Here, all repairs are taken into account, and they are treated on a par with each other. There are situations, however, in which it is natural and desirable to prefer one repair over another; for example, one data source is regarded as more reliable than another, or timestamp information implies that a more recent fact should be preferred over an earlier one. Motivated by these considerations, Staworko, Chomicki and Marcinkowski introduced the framework of preferred repairs. The main characteristic of this framework is that it uses a priority relation between conflicting facts of an inconsistent database to define notions of preferred repairs. In this paper we focus on globally-optimal repairs, in the case where the constraints are functional dependencies. Intuitively, a globally-optimal repair is a repair that cannot be improved by exchanging facts with preferred facts. In this setting, it is known that there is a fixed schema (i.e., signature and functional dependencies) where globally-optimal repair-checking is coNP-complete.
Our main result is a dichotomy in complexity: for each fixed relational signature and each fixed set of functional dependencies, the globally-optimal repair-checking problem is either solvable in polynomial time or coNP-complete. Specifically, the problem is solvable in polynomial time if for each relation symbol in the signature, the functional dependencies are equivalent either to a single functional dependency or to a set of two key constraints; in all other cases, the globally-optimal repair-checking problem is coNP-complete. We also show that there is a polynomial-time algorithm for distinguishing between the tractable and the intractable cases. The setup of preferred repairs assumes that preferences are only between conflicting facts. In the last part of the paper, we investigate the effect of this assumption on the complexity of globally-optimal repair checking. With this assumption relaxed, we give another dichotomy theorem and another polynomial-time distinguishing algorithm. Interestingly, the two dichotomies turn out to have quite different conditions for distinguishing tractability from intractability.
A relational database is said to be uncertain if primary key constraints can possibly be violated. A repair (or possible world) of an uncertain database is obtained by selecting a maximal number of tuples without ever selecting two distinct tuples with the same primary key value. For any Boolean query q, CERTAINTY(q) is the problem that takes an uncertain database db as input, and asks whether q is true in every repair of db. The complexity of this problem has been particularly studied for q ranging over the class of self-join-free Boolean conjunctive queries. A research challenge is to determine, given q, whether CERTAINTY(q) is in FO, is in P, or is coNP-complete. In this paper, we combine existing techniques for studying the above complexity classification task. We show that for any self-join-free Boolean conjunctive query q, it can be decided whether or not CERTAINTY(q) is in FO. Further, for any self-join-free Boolean conjunctive query q, CERTAINTY(q) is either in P or coNP-complete, and the complexity dichotomy is effective. This settles a research question that has been open for ten years.
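To make the definitions concrete, here is a small, hedged Python sketch (the relation, key, and query are invented for illustration, not taken from the paper) that enumerates all repairs of a toy uncertain database and checks certainty by brute force; the paper's concern is, of course, when this can be decided without enumerating repairs:

```python
from itertools import product

# Toy uncertain relation R(key, val): the first attribute is the primary key.
# The two facts with key 1 are in conflict, so the database is uncertain.
R = [(1, "a"), (1, "b"), (2, "a")]

def repairs(facts):
    """Enumerate all repairs: keep exactly one fact per primary-key value."""
    by_key = {}
    for fact in facts:
        by_key.setdefault(fact[0], []).append(fact)
    for choice in product(*by_key.values()):
        yield set(choice)

def q(db):
    """Boolean query: is there a tuple R(x, 'b') for some x?"""
    return any(val == "b" for _, val in db)

# CERTAINTY(q): is q true in *every* repair of the uncertain database?
print(all(q(rep) for rep in repairs(R)))   # False: the repair {(1,'a'), (2,'a')} falsifies q
```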
We propose a novel foundational framework for why-not explanations, that is, explanations for why a tuple is missing from a query result. Our why-not explanations leverage concepts from an ontology to provide high-level and meaningful reasons for why a tuple is missing from the result of a query.
A key algorithmic problem in our framework is that of computing a most-general explanation for a why-not question, relative to an ontology, which can either be provided by the user or be derived automatically from the data and/or schema. We study the complexity of this problem and associated problems, and present concrete algorithms for computing why-not explanations. In the case where an external ontology is provided, we first show that the problem of deciding the existence of an explanation to a why-not question is NP-complete in general. However, the problem is solvable in polynomial time for queries of bounded arity, provided that the ontology is specified in a suitable language, such as a member of the DL-Lite family of description logics, which allows for efficient concept subsumption checking. Furthermore, we show that a most-general explanation can be computed in polynomial time in this case. In addition, we propose a method for deriving a suitable (virtual) ontology from a database and/or a schema, and we present an algorithm for computing a most-general explanation to a why-not question, relative to such ontologies. This algorithm runs in polynomial time when concepts are defined in a selection-free language, or when the underlying schema is fixed. Finally, we also study the problem of computing short most-general explanations, and we briefly discuss alternative definitions of what it means to be an explanation, and to be most general.
A dominant cost for query evaluation in modern massively distributed systems is the number of communication rounds. For this reason, there is a growing interest in single-round multiway join algorithms where data is first reshuffled over many servers and then evaluated in a parallel but communication-free way. The reshuffling itself is specified as a distribution policy. We introduce a correctness condition, called parallel-correctness, for the evaluation of queries w.r.t. a distribution policy. We study the complexity of parallel-correctness for conjunctive queries as well as transferability of parallel-correctness between queries. We also investigate the complexity of transferability for certain families of distribution policies, including, for instance, the Hypercube distribution.
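As a hedged illustration of what parallel-correctness asks (the schema, query, distribution policy, and instance below are invented, not taken from the paper): distribute the facts according to a policy, evaluate the query locally at every server without communication, and compare the union of the local answers with the centralized answer.

```python
# Toy binary relation R and the conjunctive query Q(x, z) :- R(x, y), R(y, z).
R = {(1, 2), (2, 3), (3, 1), (2, 2)}

def evaluate(facts):
    """Centralized evaluation of Q(x, z) :- R(x, y), R(y, z)."""
    return {(x, z) for (x, y1) in facts for (y2, z) in facts if y1 == y2}

def policy(fact, num_servers):
    """A distribution policy: route each fact by its first attribute."""
    return {fact[0] % num_servers}

def parallel_evaluate(facts, policy, num_servers):
    servers = [set() for _ in range(num_servers)]
    for fact in facts:
        for s in policy(fact, num_servers):   # reshuffle round
            servers[s].add(fact)
    answers = set()
    for local in servers:                     # communication-free local evaluation
        answers |= evaluate(local)
    return answers

centralized = evaluate(R)
distributed = parallel_evaluate(R, policy, num_servers=2)
# Parallel-correctness w.r.t. the policy requires equality for *every* instance;
# this single instance already shows the policy above is not parallel-correct for Q.
print(distributed == centralized)   # False
```

A policy that lets facts agreeing on the join variable meet at a common server (for instance, a Hypercube-style distribution over the query's variables) would avoid the missing answers in this toy example.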
We give an overview of LogiQL, a declarative, Datalog-based language for data management and analytics, along with techniques for efficient evaluation of LogiQL programs, emphasizing theoretical foundations when possible. These techniques include: leapfrog triejoin and its associated incremental maintenance algorithm, which we measure against appropriate optimality criteria; purely-functional data structures, which provide elegant versioning and branching capabilities that are indispensable for LogiQL; and transaction repair, a lock-free concurrency control scheme that uses LogiQL, incremental maintenance, and purely-functional data structures as essential ingredients.
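To give a flavor of the evaluation machinery, here is a minimal, hedged sketch of the leapfrog idea on a single attribute: intersect sorted, duplicate-free lists by repeatedly leaping the lagging iterators to the current maximum. The full leapfrog triejoin (and its incremental maintenance in LogiQL) is considerably more involved.

```python
from bisect import bisect_left

def leapfrog_intersect(sorted_lists):
    """Intersect sorted, duplicate-free lists by leaping iterators to the running maximum."""
    if not sorted_lists:
        return []
    k = len(sorted_lists)
    pos = [0] * k
    out = []
    while all(pos[i] < len(sorted_lists[i]) for i in range(k)):
        hi = max(sorted_lists[i][pos[i]] for i in range(k))   # largest current front value
        matched = True
        for i in range(k):
            # "Leapfrog": gallop this iterator forward to the first value >= hi.
            pos[i] = bisect_left(sorted_lists[i], hi, pos[i])
            if pos[i] == len(sorted_lists[i]):
                return out
            if sorted_lists[i][pos[i]] != hi:
                matched = False
        if matched:                        # every iterator sits on hi: output it
            out.append(hi)
            pos = [p + 1 for p in pos]
    return out

print(leapfrog_intersect([[1, 3, 4, 7, 9], [2, 3, 7, 8, 9], [3, 5, 7, 9, 11]]))   # [3, 7, 9]
```

In the multi-attribute triejoin, essentially the same routine is run at each level of a trie over the join attributes.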
Tuple-generating dependencies -- for short tgds -- have been a staple of database research throughout most of its history. Yet one of the central aspects of tgds, namely the role of existential quantifiers, has not seen much investigation so far. When studying dependencies, existential quantifiers and -- in their Skolemized form -- function symbols are often viewed as two ways to express the same concept. But in fact, tgds are quite restrictive in the way that functional terms can occur.
In this paper, we investigate the role of function symbols in dependency formalisms that go beyond tgds. Among them are the powerful class of SO tgds and the intermediate class of nested tgds. In addition, we employ Henkin quantifiers -- a well-known concept in the area of logic -- and introduce Henkin tgds to gain a more fine-grained understanding of the role of function symbols in dependencies.
For members of these families of dependency classes, we investigate their expressive power, that is, when one dependency class is equivalently representable in another class of dependencies. In addition, we analyze the computability of query answering under many of the well-known syntactical decidability criteria for tgds as well as the complexity of model checking.
The problem of query answering under the well-founded and stable model semantics for normal existential rules, that is, existential rules enriched with default negation, has recently attracted a lot of interest from the database and KR communities. In particular, it has been thoroughly studied for classes of normal existential rules that are based on restrictions that guarantee the tree-likeness of the underlying models; a prime example of such a restriction is guardedness. However, little is known about classes of existential rules that significantly deviate from the above paradigm. A prominent example of such a formalism is the class of existential rules that is based on the notion of stickiness, which enforces restrictions on the forms of joins in the rule-bodies. It is the precise aim of the current work to extend sticky existential rules with default negation, and perform an in-depth analysis of the complexity of conjunctive query answering under the well-founded and stable model semantics. We show that an effective way for bridging the gap between stickiness and the well-founded semantics exists, and we provide data and combined complexity results. However, there is no way to reconcile stickiness and the stable model semantics. The reason for this surprising negative result lies in the fact that sticky existential rules are powerful enough to express Cartesian products, a construct that forms a prime example of non-guardedness.
The chase procedure is considered one of the most fundamental algorithmic tools in database theory. It has been successfully applied to several database problems, such as data exchange and query answering and containment under constraints. One of the central problems regarding the chase procedure is all-instance termination, that is, given a set of tuple-generating dependencies (TGDs) (a.k.a. existential rules), decide whether the chase under that set terminates, for every input database. It is well-known that this problem is undecidable, no matter which version of the chase we consider. The crucial question that comes up is whether existing restricted classes of TGDs, proposed in different contexts such as ontological query answering, make the above problem decidable. In this work, we focus our attention on the oblivious and the semi-oblivious versions of the chase procedure, and we give a positive answer for classes of TGDs that are based on the notion of guardedness. To the best of our knowledge, this is the first work that establishes positive results about the (semi-)oblivious chase termination problem. In particular, we first concentrate on the class of linear TGDs, and we syntactically characterize, via rich- and weak-acyclicity, its fragments that guarantee the termination of the oblivious and the semi-oblivious chase, respectively. Those syntactic characterizations, apart from being interesting in their own right, allow us to pinpoint the complexity of the problem, which is PSPACE-complete in general, and NL-complete if we focus on predicates of bounded arity, for both the oblivious and the semi-oblivious chase. We then proceed with the more general classes of guarded and weakly-guarded TGDs. Although we do not provide syntactic characterizations of their relevant fragments, as we do for linear TGDs, we show that the problem under consideration remains decidable. In fact, we show that it is 2EXPTIME-complete in general, and EXPTIME-complete if we focus on predicates of bounded arity, for both the oblivious and the semi-oblivious chase. Finally, we investigate the expressive power of the query languages obtained from our analysis, and we show that they are equally expressive as standard database query languages. Nevertheless, we have strong indications that they are more succinct.
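The following hedged Python sketch of the oblivious chase is restricted to TGDs with a single body atom and a single head atom (a special case of linear TGDs), and it simply gives up after a fixed number of rounds; it is meant only to illustrate why termination is delicate, not to implement the paper's characterizations.

```python
from itertools import count

fresh = count()   # supply of labelled nulls

def oblivious_chase(facts, tgds, max_rounds=25):
    """Oblivious chase for TGDs with one body atom and one head atom:
    fire every trigger exactly once, inventing fresh nulls for existential
    variables, without checking whether the head is already satisfied."""
    facts = set(facts)
    fired = set()
    for _ in range(max_rounds):
        new = set()
        for t, ((bpred, bvars), (hpred, hvars)) in enumerate(tgds):
            for fact in facts:
                if fact[0] != bpred or len(fact) - 1 != len(bvars) or (t, fact) in fired:
                    continue
                sigma, consistent = {}, True
                for var, const in zip(bvars, fact[1:]):
                    if sigma.setdefault(var, const) != const:
                        consistent = False
                        break
                if not consistent:
                    continue
                fired.add((t, fact))
                head_args, nulls = [], {}
                for v in hvars:
                    if v in sigma:
                        head_args.append(sigma[v])
                    else:                                      # existential variable
                        head_args.append(nulls.setdefault(v, f"_n{next(fresh)}"))
                head = (hpred,) + tuple(head_args)
                if head not in facts:
                    new.add(head)
        if not new:
            return facts                 # the oblivious chase terminated
        facts |= new
    return None                          # gave up: the chase may run forever

# R(x, y) -> ∃z S(y, z): terminates, since S-facts never trigger the rule.
print(sorted(oblivious_chase({("R", "a", "b")}, [(("R", ("x", "y")), ("S", ("y", "z")))])))
# R(x, y) -> ∃z R(y, z): every new fact re-triggers the rule, so the chase never terminates.
print(oblivious_chase({("R", "a", "b")}, [(("R", ("x", "y")), ("R", ("y", "z")))]))   # None
```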
The inversion of data exchange mappings is one of the thorniest issues in data exchange. In this paper we study inverse data exchange from a novel perspective. Previous work has dealt with the static problem of finding a target-to-source mapping that captures the "inverse" of a source-to-target data exchange mapping. As we will show, this approach has some drawbacks when it comes to actually applying the inverse mapping in order to recover a source instance from a materialized target instance. More specifically: (1) as is well known, the inverse mappings have to be expressed in a much more powerful language than the mappings they invert; (2) there are simple cases where a source instance computed by the inverse mapping misses sound information that one may easily obtain when the particular target instance is available; (3) in some cases the inverse mapping can introduce unsound information into the recovered source instance.
To overcome these drawbacks we focus on the dynamic problem of recovering the source instance using the source-to-target mapping as well as a given target instance. Similarly to the problem of finding "good" target instances in forward data exchange, we look for "good" source instances to restore, i.e., to materialize. For this we introduce a new semantics to capture instance-based recovery. We then show that given a target instance and a source-to-target mapping expressed as a set of tuple-generating dependencies, there are chase-based algorithms to compute a representative finite set of source instances that can be used to get certain answers to any union of conjunctive source queries. We also show that the instance-based source recovery problem is, unfortunately, coNP-complete. We therefore present a polynomial time algorithm that computes a "small" set of source instances that can be used to get sound certain answers to any union of conjunctive source queries. This algorithm is then extended to extract more sound information for the case when only conjunctive source queries are allowed.
Tree pattern queries have been investigated in database theory for more than a decade. They are a fundamental and flexible query mechanism and have been considered in the context of querying tree-structured as well as graph-structured data. We revisit their containment, validity, and satisfiability problems, both with and without schema information. We present a comprehensive overview of what is known about the complexity of containment and develop new techniques which allow us to obtain tractability and hardness results for cases that have been open since the early work on tree pattern containment. For the tree pattern queries we consider in this paper, it is known that the containment problem does not depend on whether patterns are evaluated on trees or on graphs. This means that our results also shed new light on tree pattern queries on graphs.
Conjunctive queries (CQs) fail to provide an answer when the pattern described by the query does not exactly match the data. CQs might thus be too restrictive as a querying mechanism when data is semistructured or incomplete. The semantic web therefore provides a formalism - known as well-designed pattern trees (WDPTs) - that tackles this problem: WDPTs allow us to match patterns over the data, if available, but do not fail to give an answer otherwise. Here we abstract away the specifics of semantic web applications and study WDPTs over arbitrary relational schemas. Our language properly subsumes the class of CQs. Hence, WDPT evaluation is intractable. We identify structural properties of WDPTs that lead to tractability of various variants of the evaluation problem. For checking if a WDPT is equivalent to one in our tractable class, we prove 2EXPTIME-membership. As a corollary, we obtain fixed-parameter tractability of (variants of) the evaluation problem. Our techniques also allow us to develop a theory of approximations for WDPTs.
While the migration from DTD to XML Schema was driven by a need for increased expressivity and flexibility, the latter was also significantly more complex to use and understand. Whereas DTDs are characterized by their simplicity, XML Schema Definitions (XSDs) are notoriously difficult. In this paper, we introduce the XML specification language BonXai, which possesses most features of XSDs, including their expressivity, while retaining the simplicity of DTDs. In brief, the latter is achieved by sacrificing the explicit use of types in favor of simple patterns expressing contexts for elements. The goal of BonXai is by no means to replace XML Schema, but rather to provide a simpler DTD-like alternative for schema designers who do not need the explicit use of types. Therefore, BonXai can be seen as a practical front-end for XML Schema. A particularly strong point of BonXai is its solid foundation rooted in a decade of theoretical work around pattern-based schemas. We present in detail the formal model for BonXai and discuss translation algorithms to and from XML Schema.
A fundamental challenge in processing the massive quantities of information generated by modern applications is extracting suitable representations of the data that can be stored, manipulated and interrogated on a single machine. A promising approach is the design and analysis of compact summaries: data structures which capture key features of the data, and which can be created effectively over distributed data sets. Popular summary structures include count-distinct algorithms, which compactly approximate the cardinality of item sets, and sketches, which allow vector norms and products to be estimated. These are very attractive, since they can be computed in parallel and combined to yield a single, compact summary of the data. This tutorial introduces the concepts of compact summaries and illustrates them with examples.
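As one concrete, hedged example (a k-minimum-values count-distinct summary; the constants and the SHA-1-based hash are illustrative choices, not prescribed by the tutorial), note in particular the merge step, which is what lets summaries built over distributed partitions be combined into one:

```python
import hashlib

class KMV:
    """k-minimum-values sketch: keep the k smallest hash values seen; the k-th
    smallest value v yields the estimate (k - 1) / v for the number of distinct items."""

    def __init__(self, k=256):
        self.k = k
        self.values = set()

    @staticmethod
    def _hash(item):
        h = hashlib.sha1(str(item).encode()).hexdigest()
        return int(h, 16) / 16 ** 40            # map each item to a value in (0, 1)

    def add(self, item):
        self.values.add(self._hash(item))
        if len(self.values) > self.k:
            self.values.discard(max(self.values))

    def merge(self, other):
        """Combine two summaries built over different partitions of the data."""
        out = KMV(self.k)
        out.values = set(sorted(self.values | other.values)[: self.k])
        return out

    def estimate(self):
        if len(self.values) < self.k:
            return len(self.values)              # saw fewer than k distinct items
        return (self.k - 1) / max(self.values)

left, right = KMV(), KMV()
for i in range(20_000):
    left.add(i)                     # 20,000 distinct items on one site
    right.add(15_000 + i)           # 20,000 distinct items, overlapping the first site
print(round(left.merge(right).estimate()))   # roughly 35,000 distinct items overall
```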
Designing query languages for graph structured data is an active field of research. Evaluating a query on a graph results in a relation on the set of its nodes. In other words, a query is a mechanism for defining relations on a graph. Some relations may not be definable by any query in a given language. This leads to the following question: given a graph, a query language and a relation on the graph, does there exist a query in the language that defines the relation? This is called the definability problem. When the given query language is standard regular expressions, the definability problem is known to be PSPACE-complete.
The model of graphs can be extended by labeling nodes with values from an infinite domain. These labels induce a partition on the set of nodes: two nodes are equivalent if they are labeled by the same value. Query languages can also be extended to make use of this equivalence. Two such extensions are Regular Expressions with Memory (REM) and Regular Expressions with Equality (REE).
In this paper, we study the complexity of the definability problem in this extended model when the query language is either REM or REE. We show that the definability problem is EXPSPACE-complete when the query language is REM, and it is PSPACE-complete when the query language is REE. In addition, when the query language is a union of conjunctive queries based on REM or REE, we show coNP-completeness.
This paper investigates the feasibility of querying big data by accessing a bounded amount of the data. We study boundedly evaluable queries under a form of access constraints, when their evaluation cost is determined by the queries and constraints only. While it is undecidable to determine whether FO queries are boundedly evaluable, we show that for several classes of FO queries, the bounded evaluability problem is decidable. We also provide a characterization and an effective syntax for their boundedly evaluable queries.
When a query Q is not boundedly evaluable, we study two approaches to approximately answering Q under access constraints. (1) We search for upper and lower envelopes of Q that are boundedly evaluable and warrant a constant accuracy bound. (2) We instantiate a minimum set of variables (parameters) in Q such that the specialized query is boundedly evaluable. We study problems for deciding the existence of envelopes and bounded specialized queries, and establish their complexity for various classes of FO queries.
We study in this paper the computation of skyline queries - a popular tool for multicriteria data analysis - in the presence of noisy input. Motivated by crowdsourcing applications, we present the first algorithms for skyline evaluation in a computation model where the input data items can only be compared through noisy comparisons. In this model comparisons may return wrong answers with some probability, and confidence can be increased through independent repetitions of a comparison. Our goal is to minimize the number of comparisons required for computing or verifying a candidate skyline, while returning the correct answer with high probability. We design output-sensitive algorithms, namely algorithms that take advantage of the potentially small size of the skyline, and analyze the number of comparison rounds of our solutions. We also consider the problem of predicting the most likely skyline given some partial information in the form of noisy comparisons, and show that optimal prediction is computationally intractable.
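The computation model can be sketched in a few lines of Python (a toy illustration only; the error probability, vote counts, and the brute-force skyline below are placeholders, not the paper's output-sensitive algorithms):

```python
import random

ERROR_P = 0.25     # each individual comparison errs with this probability

def noisy_less(x, y):
    """Noisy oracle for 'x < y': wrong with probability ERROR_P."""
    truth = x < y
    return truth if random.random() > ERROR_P else not truth

def confident_less(x, y, repetitions=61):
    """Boost confidence by a majority vote over independent repetitions."""
    return sum(noisy_less(x, y) for _ in range(repetitions)) > repetitions // 2

def dominates(a, b, less):
    """a dominates b: a is >= b on every coordinate and > b on at least one."""
    d = len(a)
    return all(not less(a[i], b[i]) for i in range(d)) and any(less(b[i], a[i]) for i in range(d))

def skyline(points, less):
    return [p for p in points if not any(dominates(q, p, less) for q in points if q is not p)]

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(12)]
print(sorted(skyline(pts, confident_less)) == sorted(skyline(pts, lambda x, y: x < y)))
# True with overwhelming probability: repetition makes each comparison reliable.
```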
Given a set-comparison predicate P and given two lists of sets A = (A_1,...,A_m) and B = (B_1,...,B_m), with all A_i, B_j ⊆ [n], the P-set join A ⋈_P B is defined to be the set {(i, j) ∈ [m] × [m] | P(A_i, B_j)}. When P(A_i, B_j) is the condition "A_i ∩ B_j ≠ ∅" we call this the set-intersection-not-empty join (a.k.a. the composition of A and B); when P(A_i, B_j) is "A_i ∩ B_j = ∅" we call it the set-disjointness join; when P(A_i, B_j) is "A_i = B_j" we call it the set-equality join; when P(A_i, B_j) is "|A_i ∩ B_j| ≥ T" for a given threshold T, we call it the set-intersection threshold join. Assuming A and B are stored at two different sites in a distributed environment, we study the (randomized) communication complexity of computing these, and related, set-joins A ⋈_P B, as well as the (randomized) communication complexity of computing the exact and approximate value of their size k = |A ⋈_P B|. Combined, our analyses shed new insights into the quantitative differences between these different set-joins. Furthermore, given the close affinity of the natural join and the set-intersection-not-empty join, our results also yield communication complexity results for computing the natural join in a distributed environment.
Additionally, we obtain new algorithms for computing the distributed set-intersection-not-empty join when the input and/or output is sparse. For instance, when the output is k-sparse, we improve an Õ(kn) communication algorithm of (Williams and Yu, SODA 2014). Observing that the set-intersection-not-empty join is isomorphic to Boolean matrix multiplication (BMM), our results imply new algorithms for fundamental graph theoretic problems related to BMM. For example, we show how to compute the transitive closure of a directed graph in Õ(k^(3/2)) time, when the transitive closure contains at most k edges. When k = O(n), we obtain a (practical) Õ(n^(3/2)) time algorithm, improving a recent Õ(n^(1+(ω+3)/4)) time algorithm (Borassi, Crescenzi, and Habib, arXiv 2014) based on (impractical) fast matrix multiplication, where ω ≥ 2 is the exponent of matrix multiplication.
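To keep the four predicates straight, here is a hedged, single-site Python rendering of the definitions (the communication-complexity questions above concern the two-site setting, which this toy code does not model):

```python
def set_join(A, B, predicate):
    """P-set join: the index pairs (i, j) whose sets satisfy the predicate P."""
    return {(i, j) for i, Ai in enumerate(A) for j, Bj in enumerate(B) if predicate(Ai, Bj)}

A = [{1, 2}, {3}, {2, 4}]
B = [{2, 5}, {3}, {6}]
T = 2

not_empty = set_join(A, B, lambda x, y: len(x & y) > 0)    # set-intersection-not-empty join
disjoint  = set_join(A, B, lambda x, y: len(x & y) == 0)   # set-disjointness join
equality  = set_join(A, B, lambda x, y: x == y)            # set-equality join
threshold = set_join(A, B, lambda x, y: len(x & y) >= T)   # set-intersection threshold join

print(sorted(not_empty))    # [(0, 0), (1, 1), (2, 0)]
print(len(not_empty))       # the join size k = |A ⋈_P B|
```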
We present a simple geometric framework for the relational join. Using this framework, we design an algorithm that achieves the fractional hypertree-width bound, which generalizes classical and recent worst-case algorithmic results on computing joins. In addition, we use our framework and the same algorithm to show a series of what are colloquially known as beyond worst-case results. The framework allows us to prove results for data stored in Btrees, multidimensional data structures, and even multiple indices per table. A key idea in our framework is formalizing the inference one does with an index as a type of geometric resolution; transforming the algorithmic problem of computing joins to a geometric problem. Our notion of geometric resolution can be viewed as a geometric analog of logical resolution. In addition to the geometry and logic connections, our algorithm can also be thought of as backtracking search with memoization.
This paper aims to understand the I/O-complexity of maintaining a big sample set---whose size exceeds the internal memory's capacity---on a data stream. We study this topic in a new computation model, named the external memory stream (EMS) model, that naturally extends the standard external memory model to stream environments. A suite of EMS-indigenous techniques is presented to prove matching lower and upper bounds for with-replacement (WR) and without-replacement (WoR) sampling on append-only and time-based sliding window streams, respectively. Our results imply that, compared to RAM, the EMS model is perhaps a more suitable computation model for studying stream sampling, because the new model separates different problems by their hardness in ways that could not be observed in RAM.
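For orientation, here is the textbook RAM-resident routine for without-replacement sampling over an append-only stream (Algorithm R); the paper's question is what happens when the sample of size s itself does not fit in internal memory, which this sketch does not address:

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform without-replacement sample of s items over an append-only stream."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)              # fill the reservoir
        else:
            j = rng.randrange(n)             # keep the new item with probability s/n
            if j < s:
                sample[j] = item             # evict a uniformly chosen resident
    return sample

random.seed(7)
print(reservoir_sample(range(1_000_000), s=5))
```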
A growing body of work addresses the challenge of processing dynamic graph streams: a graph is defined by a sequence of edge insertions and deletions and the goal is to construct synopses and compute properties of the graph while using only limited memory. Linear sketches have proved to be a powerful technique in this model and can also be used to minimize communication in distributed graph processing.
We present the first linear sketches for estimating vertex connectivity and constructing hypergraph sparsifiers. Vertex connectivity exhibits markedly different combinatorial structure from edge connectivity and appears to be harder to estimate in the dynamic graph stream model. Our hypergraph result generalizes the work of Ahn et al. (PODS 2012) on graph sparsification and has the added benefit of significantly simplifying the previous results. One of the main ideas is related to the problem of reconstructing subgraphs that satisfy a specific sparsity property. We introduce a more general notion of graph degeneracy and extend the graph reconstruction result of Becker et al. (IPDPS 2011).
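The role of linearity can be seen in a few lines (a hedged CountSketch-style toy, not the vertex-connectivity or sparsifier sketches of the paper): because the summary is a linear function of the update vector, deletions are just negative updates, and sketches computed at different sites simply add up.

```python
import random
import numpy as np

class LinearSketch:
    """CountSketch-style linear map x -> Sx over a fixed universe: every coordinate
    is hashed to one bucket with a random sign, so sketch(x + y) = sketch(x) + sketch(y)."""

    def __init__(self, universe, buckets, seed=0):
        rng = random.Random(seed)
        self.universe, self.buckets, self.seed = universe, buckets, seed
        self.bucket = [rng.randrange(buckets) for _ in range(universe)]
        self.sign = [rng.choice((-1, 1)) for _ in range(universe)]
        self.table = np.zeros(buckets)

    def update(self, coordinate, delta):
        """Insertion (delta = +1) or deletion (delta = -1) of item/edge `coordinate`."""
        self.table[self.bucket[coordinate]] += self.sign[coordinate] * delta

    def combine(self, other):
        """Add two sketches built with the same hash functions (e.g., from two sites)."""
        merged = LinearSketch(self.universe, self.buckets, self.seed)
        merged.table = self.table + other.table
        return merged

# Edge 7 is inserted at one site and deleted at another; only edge 3 survives overall.
site1 = LinearSketch(universe=100, buckets=16)
site2 = LinearSketch(universe=100, buckets=16)
site1.update(7, +1); site1.update(3, +1)
site2.update(7, -1)
reference = LinearSketch(universe=100, buckets=16)
reference.update(3, +1)
print(np.allclose(site1.combine(site2).table, reference.table))   # True
```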
Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and near-optimal algorithms for approximating arbitrary one-dimensional data distributions by histograms.
A k-histogram is a piecewise constant function with k pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld in PODS 2012: given samples from a distribution p over {1,...,n}, compute a k-histogram that minimizes the l2-distance from p, up to an additive ε. We design an algorithm for this problem that uses the information-theoretically minimal sample size of m = O(1/ε^2), runs in sample-linear time O(m), and outputs an O(k)-histogram whose l2-distance from p is at most O(opt_k) + ε, where opt_k is the minimum l2-distance between p and any k-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size.
We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude improvement in terms of empirical running times over previous state-of-the-art algorithms.
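For reference, the underlying offline objective can be stated directly in code: a hedged, brute-force O(n^2 k) dynamic program that, given the full distribution p, computes the best k-histogram in l2 (the paper's contribution is achieving nearly this quality from O(1/ε^2) samples in sample-linear time, which this sketch does not attempt):

```python
import numpy as np

def best_k_histogram(p, k):
    """Optimal k-piece constant approximation of p under the l2-distance (exact DP)."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    pre = np.concatenate([[0.0], np.cumsum(p)])         # prefix sums of p
    pre2 = np.concatenate([[0.0], np.cumsum(p ** 2)])   # prefix sums of p^2

    def seg_cost(i, j):
        """Squared l2 error of the best constant (the interval mean) on p[i:j]."""
        s, s2, m = pre[j] - pre[i], pre2[j] - pre2[i], j - i
        return max(0.0, s2 - s * s / m)

    INF = float("inf")
    cost = [[INF] * (k + 1) for _ in range(n + 1)]      # cost[j][t]: best t-piece fit of p[:j]
    cut = [[0] * (k + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for j in range(1, n + 1):
        for t in range(1, k + 1):
            for i in range(t - 1, j):                   # last piece covers p[i:j]
                c = cost[i][t - 1] + seg_cost(i, j)
                if c < cost[j][t]:
                    cost[j][t], cut[j][t] = c, i
    pieces, j, t = [], n, k                             # recover the breakpoints
    while t > 0:
        i = cut[j][t]
        pieces.append((i, j, float((pre[j] - pre[i]) / (j - i))))
        j, t = i, t - 1
    return cost[n][k] ** 0.5, pieces[::-1]

p = [0.05] * 4 + [0.2] * 3 + [0.04] * 5                  # a distribution over {1,...,12}
err, pieces = best_k_histogram(p, k=3)
print(round(err, 6), pieces)   # error ≈ 0: p itself is already a 3-histogram
```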
Orthogonal range reporting (ORR) is a classic problem in computational geometry and databases, where the objective is to preprocess a set P of points in R^2 such that, given an axis-parallel rectangle q, all the points in P ∩ q can be reported efficiently. This paper studies a natural variant of the problem called top-k ORR, where each point p ∈ P carries a weight w(p) ∈ R. Besides q, a query also specifies an integer k ∈ [1, |P|], and needs to report the k points in q ∩ P with the largest weights. We present optimal or near-optimal structures for solving the top-k ORR problem in the pointer machine and external memory models. As a side product, our structures give new space-query tradeoffs for the orthogonal range max problem, which is a special case of top-k ORR with k = 1.
In the dynamic indexing problem, we must maintain a changing collection of text documents so that we can efficiently support insertions, deletions, and pattern matching queries. We are especially interested in developing efficient data structures that store and query the documents in compressed form. All previous compressed solutions to this problem rely on answering rank and select queries on a dynamic sequence of symbols. Because of the lower bound in [Fredman and Saks, 1989], answering rank queries presents a bottleneck in compressed dynamic indexing. In this paper we show how this lower bound can be circumvented using our new framework. We demonstrate that the gap between static and dynamic variants of the indexing problem can be almost closed. Our method is based on a novel framework for adding dynamism to static compressed data structures. Our framework also applies more generally to dynamizing other problems. We show, for example, how our framework can be applied to develop compressed representations of dynamic graphs and binary relations.
In this paper, we revisit two fundamental problems in database theory. The first one is called join dependency (JD) testing, where we are given a relation r and a JD, and need to determine whether the JD holds on r. The second problem is called JD existence testing, where we need to determine if there exists any non-trivial JD that holds on r.
We prove that JD testing is NP-hard even if the JD is defined only on binary relations (i.e., each with only two attributes). Unless P = NP, this result gives a negative answer to the question of whether it is possible to efficiently test JDs defined exclusively on small (in terms of the number of attributes) relations. The question has been open since the classic NP-hardness proof of Maier, Sagiv, and Yannakakis in JACM'81, which requires the JD to involve a relation of Ω(d) attributes, where d is the number of attributes in r.
For JD existence testing, the challenge is to minimize the computation cost because the problem is known to be solvable in polynomial time. We present a new algorithm for solving the problem I/O-efficiently in the external memory model. Our algorithm in fact settles the closely related Loomis-Whitney (LW) enumeration problem, and as a side product, achieves the optimal I/O complexity for the triangle enumeration problem, improving a recent result of Pagh and Silvestri in PODS'14.
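Spelling out the definition behind the first problem (a hedged, brute-force check whose intermediate join can blow up; nothing here reflects the reductions or the I/O-efficient algorithms of the paper): a JD ⋈[X1, ..., Xk] holds on r exactly when the natural join of the projections of r onto the Xi reproduces r.

```python
from itertools import product

def satisfies_jd(rel, attributes, components):
    """Does the join dependency ⋈[X1,...,Xk] hold on rel?  Equivalently: does the
    natural join of the projections of rel onto the components equal rel?
    (The join always contains rel, so only the other inclusion can fail.)"""
    index = {a: i for i, a in enumerate(attributes)}
    projections = [(comp, {tuple(t[index[a]] for a in comp) for t in rel})
                   for comp in components]
    joined = set()
    for combo in product(*(proj for _, proj in projections)):   # naive natural join
        assignment, consistent = {}, True
        for (comp, _), tup in zip(projections, combo):
            for a, v in zip(comp, tup):
                if assignment.setdefault(a, v) != v:
                    consistent = False
                    break
            if not consistent:
                break
        if consistent:
            joined.add(tuple(assignment[a] for a in attributes))
    return joined == set(rel)

# r over attributes (A, B, C); the JD ⋈[{A,B}, {B,C}] asks for a lossless decomposition.
r = {(1, 2, 3), (4, 2, 5)}
print(satisfies_jd(r, ("A", "B", "C"), [("A", "B"), ("B", "C")]))
# False: (1, 2, 5) is in the join of the projections but not in r.
print(satisfies_jd(r | {(1, 2, 5), (4, 2, 3)}, ("A", "B", "C"), [("A", "B"), ("B", "C")]))   # True
```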
A wide variety of fundamental data analyses in machine learning, such as linear and logistic regression, require minimizing a convex function defined by the data. Since the data may contain sensitive information about individuals, and these analyses can leak that sensitive information, it is important to be able to solve convex minimization in a privacy-preserving way.
A series of recent results show how to accurately solve a single convex minimization problem in a differentially private manner. However, the same data is often analyzed repeatedly, and little is known about solving multiple convex minimization problems with differential privacy. For simpler data analyses, such as linear queries, there are remarkable differentially private algorithms such as the private multiplicative weights mechanism (Hardt and Rothblum, FOCS 2010) that accurately answer exponentially many distinct queries. In this work, we extend these results to the case of convex minimization and show how to give accurate and differentially private solutions to exponentially many convex minimization problems on a sensitive dataset.
The FO Model Counting problem (FOMC) is the following: given a sentence Φ in FO and a number n, compute the number of models of Φ over a domain of size n; the Weighted variant (WFOMC) generalizes the problem by associating a weight to each tuple and defining the weight of a model to be the product of weights of its tuples. In this paper we study the complexity of the symmetric WFOMC, where all tuples of a given relation have the same weight. Our motivation comes from an important application, inference in Knowledge Bases with soft constraints, like Markov Logic Networks, but the problem is also of independent theoretical interest. We study both the data complexity, and the combined complexity of FOMC and WFOMC. For the data complexity we prove the existence of an FO^3 formula for which FOMC is #P_1-complete, and the existence of a Conjunctive Query for which WFOMC is #P_1-complete. We also prove that all γ-acyclic queries have polynomial time data complexity. For the combined complexity, we prove that, for every fragment FO^k, k ≥ 2, the combined complexity of FOMC (or WFOMC) is #P-complete.
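As a hedged sanity check of the definition (brute force over all interpretations; nothing like the paper's classification or algorithms), the following counts the models of the fixed two-variable sentence ∀x∀y (R(x,y) → R(y,x)) over a domain of size n and compares against the closed form 2^(n(n+1)/2):

```python
from itertools import product

def fomc_symmetric(n):
    """FOMC for the sentence ∀x∀y (R(x,y) -> R(y,x)) over the domain {0,...,n-1},
    by brute-force enumeration of all 2^(n^2) interpretations of R."""
    domain = range(n)
    pairs = list(product(domain, repeat=2))
    models = 0
    for bits in product([False, True], repeat=len(pairs)):
        R = dict(zip(pairs, bits))
        if all((not R[(x, y)]) or R[(y, x)] for x, y in pairs):
            models += 1
    return models

# A symmetric relation is a free choice on each of the n(n+1)/2 unordered pairs.
for n in range(1, 5):
    print(n, fomc_symmetric(n), 2 ** (n * (n + 1) // 2))
```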
Locality Sensitive Hashing (LSH) has emerged as the method of choice for high-dimensional similarity search, a classical problem of interest in numerous applications. LSH-based solutions require that each data point be inserted into a number A of hash tables, after which a query can be answered by performing B lookups. The original LSH solution of [IM98] showed for the first time that both A and B can be made sublinear in the number of data points. Unfortunately, the classical LSH solution does not provide any tradeoff between insert and query complexity, whereas for data-intensive (respectively, query-intensive) applications one would like to minimize insert time by choosing a smaller A (respectively, minimize query time by choosing a smaller B). A partial remedy for this is provided by Entropy LSH [Pan06], which allows one to make either inserts or queries essentially constant time at the expense of a loss in the other parameter, but no algorithm that achieves a smooth tradeoff is known.
In this paper, we present an algorithm for performing similarity search under the Euclidean metric that resolves the problem above. Our solution is inspired by Entropy LSH, but uses a very different analysis to achieve a smooth tradeoff between insert and query complexity. Our results improve upon or match, up to lower-order terms in the exponent, the best known data-oblivious algorithms for the Euclidean metric.
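For readers less familiar with the classical scheme being improved, here is a hedged, minimal sketch of multi-table LSH for the Euclidean metric (random-projection-and-quantize hash functions; every parameter below is an illustrative placeholder, and nothing here reflects the smooth tradeoff achieved in the paper): each insert touches all A tables, and each query does one lookup per table followed by exact distance checks on the candidates.

```python
import random
from collections import defaultdict

class EuclideanLSH:
    """Classical multi-table LSH for the Euclidean metric: each table hashes a point with
    several functions h(v) = floor((a.v + b) / w), where a is a random Gaussian vector
    and b is uniform in [0, w)."""

    def __init__(self, dim, num_tables=8, hashes_per_table=4, w=1.0, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.funcs = [[([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, w))
                       for _ in range(hashes_per_table)]
                      for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _key(self, t, v):
        return tuple(int((sum(ai * vi for ai, vi in zip(a, v)) + b) // self.w)
                     for a, b in self.funcs[t])

    def insert(self, label, v):
        for t, table in enumerate(self.tables):       # one insert touches all A tables
            table[self._key(t, v)].append((label, v))

    def query(self, v):
        candidates = {}
        for t, table in enumerate(self.tables):       # one lookup per table, then verify
            for label, u in table.get(self._key(t, v), []):
                candidates[label] = u
        return min(candidates,
                   key=lambda lbl: sum((x - y) ** 2 for x, y in zip(candidates[lbl], v)),
                   default=None)

random.seed(1)
dim = 10
points = {i: [random.gauss(0, 1) for _ in range(dim)] for i in range(500)}
lsh = EuclideanLSH(dim)
for label, v in points.items():
    lsh.insert(label, v)
query_point = [x + random.gauss(0, 0.05) for x in points[42]]
print(lsh.query(query_point))   # with good probability, reports label 42
```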