2003 Digital Symposium Collection

Locating Data Sources in Large Distributed Systems

Leonidas Galanis, Yuan Wang, Shawn R. Jeffery, and David J. DeWitt

Session C6: Metadata & Sampling

Abstract

Querying large numbers of data sources is gaining importance due to increasing numbers of independent data providers. One of the key challenges is executing queries on all relevant information sources in a scalable fashion and retrieving fresh results. The key to scalability is to send queries only to the relevant servers and avoid wasting resources on data sources which will not provide any results. Thus, a catalog service, which would determine the relevant data sources given a query, is an essential component in efficiently processing queries in a distributed environment. This paper proposes a catalog framework which is distributed across the data sources themselves and does not require any central infrastructure. As new data sources become available, they automatically become part of the catalog service infrastructure, which allows scalability to large numbers of nodes. Furthermore, we propose techniques for workload adaptability. Using simulation and real-world data we show that our approach is valid and can scale to thousands of data sources.