The type of analysis that astronomers need to performa on the output of their simulations can be organized into three categories: ad-hoc analysis on the raw simulation data, data clustering, and analysis on the result of clustering.
In our IASDS 2009 paper, we presented five types of queries that astrophysicists frequently ask over simulation data. The paper includes detailed use-cases as well as benchmark result. We release the benchmark scripts for Pig/ Hadoop. For SQL queries, please refer page 6 of the IASDS 2009 paper.
Distributed Friends-of-Friends (dFoF)
A second important analysis task is to identify interesting structures in the simulated universe such as galaxies. For this, astronomers use a variety of data clustering algorithms. Clustering is performed on a per-simulation-snapshot basis. One of the simplest data clustering algorithma is called Friends-of-Friends.
The Friends-of-Friends algorithm (FoF and references therein) is a domain-specific clustering algorithm that is also a simplified version of the more general and commonly used DBSCAN algorithm. A concrete application that also uses FoF is kernel density estimation (KDE), which involves searching for all points whose kernel can contribute to the density at a given point. In astrophysics, KDE techniques are used for object classification in a multi-dimensional parameter space of sky survey data.
In our SSDBM 2010 paper (and accompanying technical report) we present this clustering problem and one efficient parallel implementation of friends-of-friends. The distributed Friends-of-Friends (dFoF) is an optimized implementation of FoF algorithm running in a shared-nothing computational platform such as Hadoop and Dryad. Here we release an implementation using DryadLINQ.
- dFoF source code using DryadLINQ [zip] (does not include any binary!)
- To compile and run this program, you first have to install Dryad and DryadLINQ. Please refer this site for download and license.
Analyzing clustered data
Finally, astronomers need to analyze the clustered data. This last problem has been studied by our colleagues Debabrata Dash at Carnegie Mellon University and Anastasia Ailamaki at the Ecole Polytechnique Federale de Lausanne. Below, we present a brief summary of the problem and their findings.
The most common type of query on the clustered data is a ``going back in time query'', i.e. identifying a set of progenitors for a given set of ``particle groups''. For this, astronomers identify when two clusters in two adjacent simulation timesteps are related to each other and record this fact in a table, which then needs to be recursively traversed to identify ancestors or descendants.
Implementing this traversal in a database is not straightforward, requiring the development of recursive queries. The tools currently available for automated indexing and partitioning do not work for recursive queries. We initially experimented with the obvious choice of flattening our database structures to eliminate recursive queries, but ended up with significantly increasing data sizes.
We solve the problem by not materializing the complete progenitor relationships, but materializing them in steps. In the 128g dataset, the progenitors are recorded in 256 time steps. Instead of materializing 255 different progenitor tables (one for each time step difference), we materialize only 8 progenitor tables. Each progenitor table contains the origin information at 2^i time steps away, where i = 1, 2, 3, ..., k. This allows getting the accurate progenitor information using only 7 joins instead of 255 joins. This method is attractive because of the following two properties of the data:
- Most of the queries contain the starting and the end points of their time steps. Even if some queries do not contain that information, we can assume the worst case start and end times.
- The progenitor tree does to not have a high degree. The degree of the tree is close to 1 on average (1.018 in the 128g dataset).
This selective materialization trades off the space requirement and the computation overhead intelligently. For example, the 128g dataset, this method uses 0.02% of the fully materialized table, and does 7 joins instead of one non-join query.