We analyze Hadoop workloads from three different research clusters from an application-level perspective, with two goals: (1) explore new issues in application patterns and user behavior and (2) understand key performance challenges related to I/O and load balance. Our analysis suggests that Hadoop usage is still in its adolescence. We see underuse of Hadoop features, extensions, and tools as well as significant opportunities for optimization. We see significant diversity in application styles, including some "interactive" workloads, motivating new tools in the ecosystem. We find that some conventional approaches to improving performance are not especially effective and suggest some alternatives. Overall, we find significant opportunity for simplifying the use and optimization of Hadoop, and make recommendations for future research.
As one concrete application scenario, we explore the emerging data management needs of the University of Washington’s “N-body Shop” group, which specializes in the development and utilization of large-scale simulations (specifically, “N-body tree codes”). These simulations serve to investigate the formation and evolution of large-scale structure in the universe. The UW N-body Shop is representative of the current state of the art in astrophysical cosmological simulation. In 2008, the N-body Shop was the 10th largest consumer of NSF TeraGrid time, using 7.5 million CPU hours and generating 50 terabytes of raw data, with an additional 25 terabytes of post-processing information. Each simulation ranges in size from 55 GB to a few TB. We find that the required analysis involves three types of tasks: filtering and correlating the data at different times in the simulation, clustering data within one simulated timestep, and querying the clustered data. We are in the process of developing a benchmark comprising all three types of tasks. The publication below presents the first part of the benchmark along with some preliminary results from running this benchmark on Pig/Hadoop and a commercial relational database management system. The following sub-project discusses clustering. We have not yet addressed the challenge of querying the clustered data.
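To make the first task type concrete, the sketch below shows a minimal Hadoop (Java) filtering job over particle records. The record layout assumed here (id, position, mass, density as whitespace-separated text) and the density threshold are illustrative placeholders, not the benchmark's actual schema or queries.

```java
// A minimal sketch of the "filtering" task type, assuming each input line is a particle
// record of the form: id x y z mass density. The layout and threshold are illustrative.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParticleFilter {

  public static class FilterMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    private double threshold;

    @Override
    protected void setup(Context context) {
      threshold = context.getConfiguration().getDouble("filter.density.threshold", 100.0);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().trim().split("\\s+");
      double density = Double.parseDouble(fields[5]);
      if (density >= threshold) {          // keep only particles above the threshold
        context.write(NullWritable.get(), line);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setDouble("filter.density.threshold", Double.parseDouble(args[2]));
    Job job = Job.getInstance(conf, "particle-filter");
    job.setJarByClass(ParticleFilter.class);
    job.setMapperClass(FilterMapper.class);
    job.setNumReduceTasks(0);              // map-only: filtering needs no reduce phase
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```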
The queries and data sets used in the paper are available here.
Scientists today have the ability to generate data at an unprecedented scale and rate and, as a result, they must increasingly turn to parallel data processing engines to perform their analyses. However, the simple execution model of these engines can make it difficult to implement efficient algorithms for scientific analytics. In particular, many scientific analytics require the extraction of features from data represented as either a multidimensional array or points in a multidimensional space. These applications exhibit significant computational skew, where the runtime of different partitions depends on more than just input size and can therefore vary dramatically and unpredictably. In the SkewReduce project, we explore how to alleviate this skew in large MapReduce clusters while requesting only minimal information from users about their analysis tasks.
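As a rough illustration of the kind of minimal information a user might supply, the sketch below assumes a hypothetical cost-function interface that estimates how expensive a candidate data partition will be to process. The interface name and the quadratic cost model are placeholders, not SkewReduce's actual API.

```java
// A hypothetical interface for user-supplied partition cost estimates (an assumption for
// illustration, not SkewReduce's real API).
public interface PartitionCostFunction {
  /** Estimated processing cost of a partition, given a small sample of its records
   *  and the partition's total record count. */
  double estimateCost(double[][] sampledPoints, long totalRecords);
}

// Example: a clustering-style step whose runtime grows roughly quadratically with the
// number of points packed into the same region of space, so cost depends on more than
// raw input size. The sample could be used to refine the estimate; this toy model
// uses only the record count.
class QuadraticClusteringCost implements PartitionCostFunction {
  @Override
  public double estimateCost(double[][] sampledPoints, long totalRecords) {
    // Cost dominated by pairwise comparisons within the partition.
    return (double) totalRecords * totalRecords;
  }
}
```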
We present an automatic skew-mitigation approach for user-defined MapReduce programs, along with SkewTune, a system that implements this approach as a drop-in replacement for an existing MapReduce implementation. There are three key challenges: the solution must (a) require no extra input from the user yet work for all MapReduce applications, (b) be completely transparent, and (c) impose minimal overhead if there is no skew. The SkewTune approach addresses these challenges and works as follows: when a node in the cluster becomes idle, SkewTune identifies the task with the greatest expected remaining processing time. The unprocessed input data of this straggling task is then proactively repartitioned in a way that fully utilizes the nodes in the cluster and preserves the ordering of the input data so that the original output can be reconstructed by concatenation. We implement SkewTune as an extension to Hadoop and evaluate its effectiveness using several real applications. The results show that SkewTune can significantly reduce job runtime in the presence of skew and adds little to no overhead in the absence of skew.
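The sketch below illustrates this core decision in simplified form, under assumed bookkeeping (the TaskState and RangeSplit classes are hypothetical): pick the task with the greatest expected remaining time, then carve its unprocessed input into contiguous, order-preserving ranges. The even split shown here is a simplification; the real system plans splits based on expected node availability.

```java
// A minimal sketch of straggler selection and order-preserving repartitioning,
// under assumed bookkeeping structures. Not SkewTune's actual classes.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TaskState {
  String taskId;
  long unprocessedBytes;   // input bytes the task has not yet consumed
  double bytesPerSecond;   // observed processing rate of the task so far
  long nextUnreadOffset;   // offset where the unprocessed input begins

  double expectedRemainingSeconds() {
    return unprocessedBytes / Math.max(bytesPerSecond, 1e-9);
  }
}

class RangeSplit {
  final long startOffset, length;
  RangeSplit(long startOffset, long length) { this.startOffset = startOffset; this.length = length; }
}

class SkewTuneSketch {
  /** Choose the straggler: the running task expected to take longest from now on. */
  static TaskState pickStraggler(List<TaskState> running) {
    return running.stream()
        .max(Comparator.comparingDouble(TaskState::expectedRemainingSeconds))
        .orElse(null);
  }

  /** Split the straggler's remaining input into contiguous ranges, one per idle node. */
  static List<RangeSplit> repartition(TaskState straggler, int idleNodes) {
    List<RangeSplit> splits = new ArrayList<>();
    long chunk = straggler.unprocessedBytes / idleNodes;
    long offset = straggler.nextUnreadOffset;
    for (int i = 0; i < idleNodes; i++) {
      long len = (i == idleNodes - 1)
          ? straggler.unprocessedBytes - chunk * (idleNodes - 1)  // last range absorbs the remainder
          : chunk;
      splits.add(new RangeSplit(offset, len));
      offset += len;
    }
    return splits;
  }
}
```

Because each range is contiguous and the ranges are assigned in input order, the outputs of the new tasks can simply be concatenated to reconstruct the original task's output.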
In parallel query-processing environments, accurate, time-oriented progress indicators would provide much utility to users, since queries can take a very long time to complete and both inter- and intra-query execution times can have high variance. In these systems, query times depend not only on the query plans and the amount of data being processed, but also on the amount of parallelism available, the types of operators (often user-defined) that perform the processing, and the overall system load. None of the techniques used by existing tools or available in the literature provides a non-trivial progress indicator for parallel queries. In this project, we are building Parallax, an accurate, time-oriented progress indicator for parallel queries.
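The sketch below conveys the flavor of a time-oriented estimate, under assumed bookkeeping: the query is modeled as a sequence of pipelines, each with an estimate of remaining work and an observed per-partition processing rate. The Pipeline class and the simple summation are illustrative assumptions, not Parallax's actual estimator.

```java
// A minimal sketch of a time-oriented progress estimate over a sequence of pipelines.
// The bookkeeping fields and the summation are assumptions for illustration.
import java.util.List;

class Pipeline {
  double remainingWork;        // e.g., tuples (or bytes) left to process in this pipeline
  double perPartitionRate;     // observed or estimated processing rate of one partition
  int degreeOfParallelism;     // number of partitions executing this pipeline
}

class ProgressSketch {
  /** Estimated seconds until the query completes: sum over pipelines of
   *  remaining work divided by the pipeline's aggregate processing rate. */
  static double estimateRemainingSeconds(List<Pipeline> pipelines) {
    double total = 0.0;
    for (Pipeline p : pipelines) {
      double aggregateRate = p.perPartitionRate * p.degreeOfParallelism;
      total += p.remainingWork / Math.max(aggregateRate, 1e-9);
    }
    return total;
  }
}
```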
Many intensive data analysis techniques require iterative computations, including PageRank, HITS, recursive relational queries, clustering, neural-network analysis, social network analysis, and network traffic analysis. These techniques have a common trait: data are processed iteratively until the computation satisfies a convergence or stopping condition. The MapReduce framework does not directly support these iterative data analysis applications. Instead, programmers must implement iterative programs manually by issuing multiple MapReduce jobs and orchestrating their execution. In this project, we develop HaLoop, a modified version of the Hadoop MapReduce framework designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications, but also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms.
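For concreteness, the sketch below shows the manual driver pattern that HaLoop is designed to replace: one Hadoop job is submitted per iteration, the loop-invariant input is re-read every time, and the driver itself checks the stopping condition. The mapper, reducer, and convergence test are hypothetical placeholders.

```java
// A minimal sketch of a hand-written iterative driver on plain Hadoop (the approach
// HaLoop aims to improve on). Per-iteration logic and the stopping test are placeholders.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {

  /** Hypothetical per-iteration map logic; a real job would emit (page, contribution) pairs. */
  public static class StepMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(new Text("placeholder-key"), value);
    }
  }

  /** Hypothetical per-iteration reduce logic; a real job would aggregate contributions. */
  public static class StepReducer extends Reducer<Text, Text, Text, Text> {}

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path current = new Path(args[0]);              // initial state (e.g., starting ranks + graph)
    int maxIterations = Integer.parseInt(args[1]);

    for (int i = 0; i < maxIterations; i++) {
      Path next = new Path(args[0] + "_iter" + (i + 1));

      Job job = Job.getInstance(conf, "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(StepMapper.class);
      job.setReducerClass(StepReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, current);  // loop-invariant data is re-read every iteration
      FileOutputFormat.setOutputPath(job, next);
      if (!job.waitForCompletion(true)) System.exit(1);

      if (converged(current, next)) break;         // hypothetical convergence / stopping test
      current = next;                              // this iteration's output feeds the next job
    }
  }

  /** Placeholder stopping condition; a real driver would compare successive outputs. */
  static boolean converged(Path previous, Path next) { return false; }
}
```

HaLoop's loop-aware scheduling and caching are aimed precisely at the repeated re-reading of loop-invariant data and the manual orchestration visible in this pattern.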
We address the problem of making online, parallel query plans fault-tolerant: i.e., provide intra-query fault-tolerance without blocking. We develop an approach that not only achieves this goal but does so through the use of different fault-tolerance techniques at different operators within a query plan. Enabling each operator to use a different fault-tolerance strategy leads to a space of fault-tolerance plans amenable to cost-based optimization. We develop FTOpt, a cost-based fault-tolerance optimizer that automatically selects the best strategy for each operator in a query plan in a manner that minimizes the expected processing time with failures for the entire query. We implement our approach in a prototype parallel query-processing engine. Our experiments demonstrate that (1) there is no single best fault-tolerance strategy for all query plans, (2) hybrid strategies that mix and match recovery techniques often outperform any uniform strategy, and (3) our optimizer correctly identifies winning fault-tolerance configurations.
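As a simplified illustration of such cost-based selection, the sketch below enumerates assignments of a small, assumed menu of recovery strategies to each operator and keeps the assignment with the lowest expected runtime under a user-supplied cost model. The strategy names, the CostModel interface, and the exhaustive search are illustrative assumptions, not FTOpt's actual optimizer or cost model.

```java
// A minimal sketch of searching the space of per-operator fault-tolerance strategies.
// Strategy menu, cost model, and brute-force enumeration are assumptions for illustration.
import java.util.ArrayList;
import java.util.List;

enum FtStrategy { NONE, MATERIALIZE, CHECKPOINT }   // assumed menu of recovery techniques

interface CostModel {
  /** Expected query runtime, including expected failure-recovery work,
   *  for one assignment of a strategy to each operator. */
  double expectedRuntimeWithFailures(FtStrategy[] assignment);
}

class FtOptSketch {
  /** Enumerate all strategy assignments (3^numOperators, fine only for small plans)
   *  and return the one with the smallest expected runtime under the cost model. */
  static FtStrategy[] optimize(int numOperators, CostModel model) {
    List<FtStrategy[]> candidates = new ArrayList<>();
    enumerate(new FtStrategy[numOperators], 0, candidates);
    FtStrategy[] best = null;
    double bestCost = Double.POSITIVE_INFINITY;
    for (FtStrategy[] candidate : candidates) {
      double cost = model.expectedRuntimeWithFailures(candidate);
      if (cost < bestCost) { bestCost = cost; best = candidate; }
    }
    return best;
  }

  private static void enumerate(FtStrategy[] partial, int pos, List<FtStrategy[]> out) {
    if (pos == partial.length) { out.add(partial.clone()); return; }
    for (FtStrategy s : FtStrategy.values()) {
      partial[pos] = s;
      enumerate(partial, pos + 1, out);
    }
  }
}
```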
Even with high-throughput data processing environments such as MapReduce, scientific discovery can be held back by limited processing rates as data inputs grow ever larger. It is not uncommon for a single data analysis task to take hours to execute in a MapReduce cluster. Such high latencies are especially frustrating when a scientist needs only a subset of the analysis result: e.g., suppose an astronomer runs a MapReduce job that extracts all particles in a simulation of the universe that have some specific property, but then examines only the subset of those particles that are also near a known star. Part of the computation has been wasted.
To reduce processing times, a user can be cautious and compute only the subset of the result that she definitely needs. If the user subsequently decides that additional output is needed, that output can always be computed later. Of course, such "lazy" computation comes at a cost: processing a dataset incrementally using MapReduce is slower than processing the entire dataset in one shot. So what should the user do?
In this paper, we study the benefits and overheads of lazy MapReduce processing, where the input data is partitioned and only the smallest subset of these partitions is processed to meet the user's needs at any given time. We also develop guidelines for successfully applying the lazy MapReduce computation technique to reduce the processing times of MapReduce-based analyses.
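The sketch below illustrates the lazy pattern under an assumed layout in which the input has been pre-partitioned into HDFS directories and each request names the partitions it needs; only partitions not covered by an earlier increment are added as job input. The directory naming, the map-only placeholder analysis, and the driver class are illustrative assumptions, not the system described in the paper.

```java
// A minimal sketch of lazy, incremental MapReduce processing over pre-partitioned input.
// Partition layout and the pass-through analysis are assumptions for illustration.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LazyDriver {

  /** Hypothetical analysis logic: a map-only pass-through standing in for the real extraction. */
  public static class AnalysisMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(NullWritable.get(), value);   // placeholder: real filtering/analysis goes here
    }
  }

  private final Set<String> alreadyProcessed = new HashSet<>();

  /** Run one lazy increment: process only the requested partitions not yet processed.
   *  Each increment writes to its own output directory. */
  public void runIncrement(String inputRoot, String outputDir, Set<String> requestedPartitions)
      throws Exception {
    Job job = Job.getInstance(new Configuration(), "lazy-increment");
    job.setJarByClass(LazyDriver.class);
    job.setMapperClass(AnalysisMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    boolean anyNewInput = false;
    for (String partition : requestedPartitions) {
      if (alreadyProcessed.add(partition)) {        // true only for partitions not yet processed
        FileInputFormat.addInputPath(job, new Path(inputRoot, partition));
        anyNewInput = true;
      }
    }
    if (!anyNewInput) return;                       // everything requested is already available

    FileOutputFormat.setOutputPath(job, new Path(outputDir));
    if (!job.waitForCompletion(true)) {
      throw new RuntimeException("lazy increment failed");
    }
  }
}
```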