Home

Projects and Publications

Dataset and Benchmark

Hadepot Repository

Acknowledgments

: Projects and Publications

Hadoop Workload Analysis
Astronomy Simulation Use-Case and Benchmark
Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions
SkewTune: Mitigating Skew in MapReduce Applications
The Parallax Time-based Progress Indicator for Parallel Queries
HaLoop: Efficient Iterative Data Processing on Large Clusters
A Latency and Fault-Tolerance Optimizer for Parallel Data Processing Systems
The Case for Being Lazy: How to Leverage Lazy Evaluation in MapReduce

Hadoop Workload Analysis

We analyze Hadoop workloads from three different research clusters from an application-level perspective, with two goals: (1) explore new issues in application patterns and user behavior and (2) understand key performance challenges related to IO and load balance. Our analysis suggests that Hadoop usage is still in its adolescence. We see underuse of Hadoop features, extensions, and tools as well as significant opportunities for optimization. We see significant diversity in application styles, including some "interactive" workloads, motivating new tools in the ecosystem. We find that some conventional approaches to improving performance are not especially effective and suggest some alternatives. Overall, we find significant opportunity for simplifying the use and optimization of Hadoop, and make recommendations for future research.

Hadoop's Adolescence: A Comparative Workload Analysis from Three Research Clusters
Kai Ren, YongChul Kwon, Magdalena Balazinska, and Bill Howe University of Washington Technical Report. UW-CSE-12-06-01. June 2012

Return to top.

Astronomy Simulation Use-Case and Benchmark

As one concrete application scenario, we explore the emergent data management needs of the University of Washington’s “N-body Shop” group, which specializes in the development and utilization of large-scale simulations (specifically, “N-body tree codes”). These simulations serve to investigate the formation and evolution of large scale structure in the universe. The UW N-body Shop is representative of the current state-of-the-art in astrophysical cosmological simulation. In 2008, the N-body Shop was the 10th largest consumer of NSF Teragrid time, using 7:5 million CPU hours and generated 50 terabytes of raw data, with an additional 25 terabytes of post-processing information. The total size of each simulation ranges in size from 55 GB to a few TB. We find that the required analysis involves three types of tasks: filtering and correlating the data at different times in the simulation, clustering data within one simulated timestep, and querying the clustered data. We are in the process of developing a benchmark comprising all three types of tasks. The publication below presents the first part of the benchmark along with some preliminary results from running this benchmakrk on Pig/Hadoop and a commercial relational database management system. The following sub-project discusses clustering. We did not yet address the challenge of querying clustered data.

Analyzing Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help?
Sarah Loebman, Dylan Nunley, YongChul Kwon, Bill Howe, Magdalena Balazinska, and Jeffrey P. Gardner
In the Workshop on Interfaces and Abstractions for Scientific Data (IASDS) 2009, New Orleans, LA, August 2009.

The queries and data sets used in the paper are available here.

Return to top.

SkewReduce: Skew-Resistant Parallel Processing on Large Clusters

Scientists today have the ability to generate data at an unprecedented scale and rate and, as a result, they must increasingly turn to parallel data processing engines to perform their analyses. However, the simple execution model of these engines can make it difficult to implement efficient algorithms for scientific analytics. In particular, many scientific analytics require the extraction of features from data represented as either a multidimensional array or points in a multidimensional space. These applications exhibit significant computational skew, where the runtime of different partitions depends on more than just input size and can therefore vary dramatically and unpredictably. In SkewReduce project, we explore how to alleviate such skew problem in a large MapReduce cluster by requesting users minimal information of their analysis tasks.

Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions
YongChul Kwon, Magda Balazinska, Bill Howe, Jerome Rolia
In the First ACM Symposium on Cloud Computing (SoCC), Indianapolis, IN, June 2010.
Scalable clustering algorithm for N-body simulations in a shared-nothing cluster
YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe, and Sarah Loebman
In the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, July, 2010.
A Study of Skew in MapReduce Applications (Best Student Paper)
YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia
In the 5th Open Cirrus Summit, Moscow, Russia, June, 2011.

SkewReduce source code is publicly available at http://skewreduce.googlecode.com/
The queries and data sets used in the papers are available here.

Return to top.

SkewTune: Mitigating Skew in MapReduce Applications

We present an automatic skew mitigation approach for user-defined MapReduce programs and present SkewTune, a system that implements this approach as a drop-in replacement for an existing MapReduce implementation. There are three key challenges: (a) require no extra input from the user yet work for all MapReduce applications, (b) be completely transparent, and (c) impose minimal overhead if there is no skew. The SkewTune approach addresses these challenges and works as follows: When a node in the cluster becomes idle, SkewTune identifies the task with the greatest expected remaining processing time. The unprocessed input data of this straggling task is then proactively repartitioned in a way that fully utilizes the nodes in the cluster and preserves the ordering of the input data so that the original output can be reconstructed by concatenation. We implement SkewTune as an extension to Hadoop and evaluate its effectiveness using several real applications. The results show that SkewTune can significantly reduce job runtime in the presence of skew and adds little to no overhead in the absence of skew.

SkewTune: Mitigating Skew in MapReduce Applications _[proceeding] _[correct]
YongChul Kwon, Magda Balazinska, Bill Howe, Jerome Rolia
In ACM SIGMOD International Conference on Management of Data (SIGMOD), Scottsdale, AZ, USA, May, 2012.
Errata: Citation [20] in the proceeding version should be "M. Kalyuzhnaya, D. Beck, and L. Chistoserdova. Functional metagenomics of methylotrophs. Methods in Enzymology, 495, 2011."
SkewTune: Mitigating Skew in MapReduce Applications
YongChul Kwon, Magda Balazinska, Bill Howe, Jerome Rolia
University of Washington Technical Report. UW-CSE-12-03-03. March 2012.
Extended version of SIGMOD 2012 paper
SkewTune in Action(Demo)
YongChul Kwon, Magda Balazinska, Bill Howe, Jerome Rolia
In VLDB 2012, Istanbul, Turkey, August, 2012.

SkewTune source code is now publicly available at http://skewtune.googlecode.com/

Return to top.

The Parallax Time-based Progress Indicator for Parallel Queries

In parallel query-processing environments, accurate, time-oriented progress indicators could provide much utility to users given that queries take a very long time to complete and both inter- and intra-query execution times can have high variance. In these systems, query times depend on the query plans and the amount of data being processed, but also on the amount of parallelism available, the types of operators (often user-defined) that perform the processing, and the overall system load. None of the techniques used by existing tools or available in the literature provide a non-trivial progress indicator for parallel queries. In this project, we are building Parallax, an accurate, time-oriented progress indicator for parallel queries.

Parallax

ParaTimer: A Progress Indicator for MapReduce DAGs.
Kristi Morton, Magdalena Balazinska, Dan Grossman
In ACM SIGMOD International Conference on Management of Data (SIGMOD), Indianapolis, IN, June, 2010. Experimental results fully verified by SIGMOD repeatability and workability review committee: here.

Estimating the Progress of MapReduce Pipelines.
Kristi Morton, Abe Friesen, Magdalena Balazinska, Dan Grossman
In the 26th IEEE International Conference on Data Engineering (ICDE), Long Beach, CA, March 2010.

Toward A Progress Indicator for Parallel Queries.
Kristi Morton, Abe Friesen, Magdalena Balazinska, Dan Grossman
University of Washington Technical Report. UW-CSE-09-07-01. July 2009

Return to top.

HaLoop: Efficient Iterative Data Processing on Large Clusters

Many intensive data analysis techniques require iterative computations, including PageRank, HITS, recursive relational queries, clustering, neural-network analysis, social network analysis, and network traffic analysis. These techniques have a common trait: data are processed iteratively until the computation satisfies a convergence or stopping condition. The MapReduce framework does not directly support these iterative data analysis applications. Instead, programmers must implement iterative programs manually by issuing multiple MapReduce jobs and orchestrating their execution. In this project, HaLoop, our modified version of the Hadoop MapReduce framework, is designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications, but also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms.

HaLoop: Efficient Iterative Data Processing on Large Clusters.
Yingyi Bu, Bill Howe, Magda Balazinska, Michael D. Ernst
PVLDB, Vol 3, Nb 1, (VLDB 2010)

HaLoop source code is publicly available at http://code.google.com/p/haloop

Return to top.

A Latency and Fault-Tolerance Optimizer for Online Parallel Data Processing

We address the problem of making online, parallel query plans fault-tolerant: i.e., provide intra-query fault-tolerance without blocking. We develop an approach that not only achieves this goal but does so through the use of different fault-tolerance techniques at different operators within a query plan. Enabling each operator to use a different fault-tolerance strategy leads to a space of fault-tolerance plans amenable to cost-based optimization. We develop FTOpt, a cost-based fault-tolerance optimizer that automatically selects the best strategy for each operator in a query plan in a manner that minimizes the expected processing time with failures for the entire query. We implement our approach in a prototype parallel query-processing engine. Our experiments demonstrate that (1) there is no single best fault-tolerance strategy for all query plans, (2) often hybrid strategies that mix-and-match recovery techniques outperform any uniform strategy, and (3) our optimizer correctly identiﬁes winning fault-tolerance configurations.

A Latency and Fault-Tolerance Optimizer for Online Parallel Data Processing
A Latency and Fault-Tolerance Optimizer for Online Parallel Data Processing (Extended Report)
Prasang Upadhyaya, YongChul Kwon, Magda Balazinska
In ACM SIGMOD International Conference on Management of Data (SIGMOD), Athens, Greece, June, 2011.

FTOpt source code will soon be made public.

Return to top.

The Case for Being Lazy: How to Leverage Lazy Evaluation in MapReduce

Even with high-throughput data processing environments such as MapReduce, as data inputs grow ever larger, scientific discovery can be held back by limited processing rates. It is not uncommon for a single data analysis task to take hours to execute in a MapReduce cluster. Such high latencies are especially frustrating when a scientist needs only a subset of the analysis result: e.g., suppose an astronomer runs a MapReduce job that extracts all particles in a simulation of the universe that have some specific property, but examines only a subset of those particles that are also near a known star. Part of the computation has been wasted.

To reduce processing times, a user can be cautious and compute only the subset of the result that she definitely needs. If the user subsequently decides that additional output is needed, that output can always be computed later. Of course, such "lazy" computation comes at a cost: processing a dataset incrementally using MapReduce is slower than processing the entire dataset in one shot. So what should the user do?

In this paper, we study the benefits and overheads of lazy MapReduce processing, where the input data is partitioned and only the smallest subset of these partitions are processed to meet a user's need at any time. We also develop guidelines for successfully applying the lazy MapReduce computation technique to reduce processing times of MapReduce-based analysis.

The Case for Being Lazy: How to Leverage Lazy Evaluation in MapReduce
Kristi Morton, Magdalena Balazinska, Dan Grossman, Christopher Olston
In ScienceCloud 2011: 2nd Workshop on Scientific Cloud Computing, San Jose, CA, June, 2011.

Return to top.

UW Databases

UW Computer Science and Engineering

University of Washington