Provenance Management in Curated Databases
Peter Buneman, Adriane Chapman, James Cheney
Original Abstract: Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction and transfer of data from other sources. Provenance information concerning the creation, attribution, or version history of such data is crucial for assessing its integrity and scientific value. General-purpose database systems provide little support for tracking provenance, especially when data moves among databases. This paper investigates general-purpose techniques for recording provenance for data that is copied among databases. We describe an approach in which we track the user’s actions while browsing source databases and copying data into a curated database, in order to record the user’s actions in a convenient, queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management. Our experiments show that although the overhead of a naive approach is fairly high, it can be decreased to an acceptable level using simple optimizations.
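The core idea of the abstract, recording each copy action a curator performs so that provenance can later be queried, can be illustrated with a minimal sketch. This is not the paper's implementation; the class and field names (`CopyRecord`, `ProvenanceLog`, the path syntax) are hypothetical, chosen only to show what an append-only, queryable log of copy actions might look like.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CopyRecord:
    # Hypothetical record of one user copy action: who copied what, from where, to where.
    user: str
    source_db: str
    source_path: str
    target_path: str

class ProvenanceLog:
    """Append-only log of copy actions, queryable by target path."""

    def __init__(self) -> None:
        self.records: List[CopyRecord] = []

    def record_copy(self, user: str, source_db: str,
                    source_path: str, target_path: str) -> None:
        # Called each time the curator copies data into the curated database.
        self.records.append(CopyRecord(user, source_db, source_path, target_path))

    def provenance_of(self, target_path: str) -> List[CopyRecord]:
        # Return every copy action that wrote the given target node
        # or one of its ancestors (so nested copies are accounted for).
        return [r for r in self.records
                if target_path == r.target_path
                or target_path.startswith(r.target_path + "/")]

# Example: a curator copies an entry from one source database, then a
# sub-element from another, into the same curated record.
log = ProvenanceLog()
log.record_copy("curator1", "UniProt", "/entry/P12345", "/genes/g1")
log.record_copy("curator1", "GenBank", "/seq/X99999", "/genes/g1/sequence")
print([r.source_db for r in log.provenance_of("/genes/g1/sequence")])
# → ['UniProt', 'GenBank']
```

Querying `/genes/g1/sequence` returns both copy actions, since the sequence was first brought in as part of the UniProt copy and then overwritten from GenBank; this is the kind of queryable history the abstract's approach is meant to support.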
|Peter Buneman is a Professor of Computer Science in the School of Informatics at the University of Edinburgh. He is also a fellow of the Royal Society, a fellow of the ACM, a fellow of the Royal Society of Edinburgh, and has won a Royal Society Wolfson Research Merit Award. He was the Program Chair of SIGMOD (in 1993), PODS (in 2001), and VLDB (in 2008). In 2013 he was appointed Member of the Order of the British Empire (MBE) in the New Year Honours for services to data systems and computing.|
|Adriane Chapman is a specialist in information management and database technologies. Adriane received a BS in Biology and a BS in Chemistry from MIT (1998) and earned her PhD in Computer Science at the University of Michigan in 2008. She has been an active researcher in provenance since 2003, and is a leader of the Provenance Week conferences. At The MITRE Corporation, she is a task leader for government projects and runs several research projects spanning data quality, provenance, and health data exchange.|
|James Cheney is a Royal Society University Research Fellow and Reader in the Laboratory for Foundations of Computer Science, University of Edinburgh, working in the areas of databases and programming languages. James received a BS in Computer Science and Mathematics and MS in Mathematics from Carnegie Mellon University (1998) and earned his PhD in Computer Science at Cornell University in 2004. From September 2004 until October 2008 he was a postdoctoral research associate in the Edinburgh Database Group. He helped start the Workshop on Theory and Practice of Provenance series in 2009. His current and recent research on provenance is funded by AFOSR, DARPA, the UK Engineering and Physical Sciences Research Council (EPSRC), Google, the European Research Council (ERC), Microsoft Research, LogicBlox, Inc., and the Royal Society.|