Abstract
Scientific datasets are typically archived at mass storage systems
or data centers close to supercomputers/instruments. End users
of these datasets, however, usually perform parts of their
workflows on their local computers. In such cases, client-side
caching can offer significant gains by reducing the cost of wide-area
data movement.
Scientific data caches, however, traditionally cache entire datasets,
which may not be necessary. In this paper, we propose a novel
combination of prefix caching and collective download. Prefix
caching allows the bootstrapping of dataset downloads by caching
only a prefix of the dataset, while collective download facilitates
efficient parallel patching of the missing suffix from an external
data source. To estimate the optimal prefix size, we further present
an analytical model that considers both the initial download overhead
and the downloading speed. We implemented our proposed
approach in the FreeLoader distributed cache prototype. Experimental
results (using multiple scientific data repositories and data
transfer tools, as well as a real-world scientific dataset access
trace) demonstrate that prefix caching and collective download
can be implemented efficiently, our model can select an appropriate
prefix size, and the cache hit rate can be improved significantly
without hurting the local access rate of cached datasets.
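
To make the prefix-size trade-off concrete, the following is a minimal sketch of one way such an estimate could be computed. It is an illustrative assumption, not the analytical model presented in the paper: it supposes a fixed startup overhead before remote data begins to arrive, constant local and remote transfer rates, and a client that reads the dataset sequentially and must never stall. The function name `min_prefix_size` and all of its parameters are hypothetical.

```python
def min_prefix_size(size, local_rate, remote_rate, startup):
    """Smallest cached prefix (same unit as `size`) that lets a sequential
    reader proceed without stalling, under the simplifying assumptions above.

    size        -- total dataset size (e.g. MB)
    local_rate  -- rate at which the client reads from the cache (MB/s)
    remote_rate -- aggregate rate of the remote suffix download (MB/s)
    startup     -- initial remote download overhead (s)

    Hypothetical illustration only; not the model from the paper.
    """
    if size <= 0 or local_rate <= 0 or remote_rate <= 0 or startup < 0:
        raise ValueError("size and rates must be positive, startup non-negative")

    if remote_rate >= local_rate:
        # The remote source keeps up once it starts, so the prefix only
        # has to hide the startup overhead.
        prefix = startup * local_rate
    else:
        # The remote source is slower than the reader, so the prefix must
        # also absorb the rate gap over the whole dataset.
        prefix = size * (1 - remote_rate / local_rate) + remote_rate * startup

    # A prefix can never be larger than the dataset itself.
    return min(max(prefix, 0.0), float(size))


if __name__ == "__main__":
    # 1000 MB dataset, 100 MB/s local reads, 50 MB/s remote, 2 s startup
    print(min_prefix_size(1000, 100, 50, 2))
```

In this simplified setting, a faster remote source shrinks the required prefix to whatever hides the startup overhead, while a slower one forces the prefix to grow with the dataset size.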