Hadepot: Repository of MapReduce Applications

The goal of this repository is to give researchers easy access to existing, publicly available MapReduce applications. These applications can serve as examples for writing new applications. They can also serve as benchmark or simply sample workloads for new systems, tools, and techniques for performing large-scale, complex analytics in the cloud.

The repository is curated by:

University of Washington: YongChul Kwon, Magdalena Balazinska, Bill Howe
Duke University: Herodotos Herodotou, Rozemary Scarlat, Shivnath Babu

How to contribute?

To propose a new application for inclusion in the repository, please send as much of the following information as possible via email to hadepot@cs.washington.edu.

  1. Title and a short description of the application
  2. A link to the code
  3. A link to the dataset
  4. A job history file generated by running code on data
  5. Any performance challenges encountered with this application (e.g., exhibits skew, is long-running, is failure-sensitive)
  6. If possible, also include any relevant documentation to help get the application set up and running.

List of Applications

Details for each application are listed below.

PageRank

Description: PageRank is a link analysis algorithm that assigns a weight (rank) to each vertex in a graph by iteratively aggregating the weights of its inbound neighbors.
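
For readers new to the MapReduce formulation, the sketch below shows one rank-propagation iteration in the Hadoop Java API. It is a minimal illustration rather than the repository's code; the tab-separated adjacency-list input format, the class names, and the damping factor of 0.85 are assumptions.

```java
// Minimal sketch of one PageRank iteration (not the repository's code).
// Assumed input record: "pageId<TAB>rank<TAB>link1,link2,..."
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankIteration {

  public static class RankMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      String page = parts[0];
      double rank = Double.parseDouble(parts[1]);
      String outLinks = parts.length > 2 ? parts[2] : "";
      // Re-emit the link structure so the reducer can rebuild the record.
      ctx.write(new Text(page), new Text("LINKS\t" + outLinks));
      // Send an equal share of this page's rank to every outbound neighbor.
      if (!outLinks.isEmpty()) {
        String[] links = outLinks.split(",");
        for (String target : links) {
          ctx.write(new Text(target), new Text(Double.toString(rank / links.length)));
        }
      }
    }
  }

  public static class RankReducer extends Reducer<Text, Text, Text, Text> {
    private static final double D = 0.85;   // damping factor (an assumption)

    @Override
    protected void reduce(Text page, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0.0;
      String outLinks = "";
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("LINKS\t")) {
          outLinks = s.substring("LINKS\t".length());
        } else {
          sum += Double.parseDouble(s);      // rank mass from an inbound neighbor
        }
      }
      double newRank = (1.0 - D) + D * sum;  // simplified, per-page form
      ctx.write(page, new Text(newRank + "\t" + outLinks));
    }
  }
}
```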

CloudBurst

Description: CloudBurst is a MapReduce implementation of the RMAP algorithm for short-read gene alignment. CloudBurst aligns a set of genome sequence reads against a reference sequence. CloudBurst distributes the approximate-alignment computations across reduce tasks by partitioning the reads and references on their n-grams.
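
The partitioning idea can be pictured with the schematic mapper below. It is not CloudBurst's actual code; the input record layout, the class name, and the seed length are assumptions made only to show how reads and reference regions sharing a substring end up in the same reduce call.

```java
// Schematic of the seed-based partitioning idea (not CloudBurst's actual code).
// Assumed input record: "id<TAB>REF-or-READ<TAB>sequence"; SEED_LEN is an assumption.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SeedMapper extends Mapper<LongWritable, Text, Text, Text> {
  private static final int SEED_LEN = 12;   // length of the shared substring (an assumption)

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\t");
    String id = parts[0], origin = parts[1], seq = parts[2];
    // Every length-SEED_LEN substring becomes a key, so reads and reference
    // regions that share a seed meet in the same reduce call, where the
    // expensive approximate alignment of their flanking sequence is done.
    for (int i = 0; i + SEED_LEN <= seq.length(); i++) {
      String seed = seq.substring(i, i + SEED_LEN);
      ctx.write(new Text(seed), new Text(origin + "\t" + id + "\t" + i));
    }
  }
}
```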

  • Application Areas: bioinformatics, short-read sequence analysis
  • Author: Michael Schatz
  • Source: http://cloudburst-bio.sourceforge.net/
  • Hadoop Compatibility: 0.20 or higher. Old API.
  • Characteristics:
    For map, the runtime distribution is bi-modal, with the two modes corresponding to the two input datasets.
    [Figure: Runtime distribution of map tasks of CloudBurst]
    Reduce is computationally intensive, and the runtime distribution of reduce tasks is relatively uniform.
    [Figure: Runtime distribution of reduce tasks of CloudBurst]
  • Submitted By: yongchul
Distributed Friends-of-Friends

Description: FoF is used by astronomers to analyze the structure of the universe within a snapshot of a simulation of its evolution. For each point in the dataset, the FoF algorithm uses a spatial index to recursively look up neighboring points and find connected clusters.
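
The linking rule itself is easy to state in code. The sketch below illustrates friends-of-friends clustering within a single partition using a brute-force distance scan and union-find; the actual application uses a spatial index, and the class name and linking length are assumptions.

```java
// Minimal illustration of the friends-of-friends linking rule on one partition of
// 3D points. The repository code uses a spatial index; the brute-force scan, the
// class name, and the linking length EPS below are simplifications for exposition.
public class LocalFoF {
  static final double EPS = 0.2;   // linking length (an assumption)

  // Union-find with path halving over point indices.
  static int find(int[] parent, int x) {
    while (parent[x] != x) {
      parent[x] = parent[parent[x]];
      x = parent[x];
    }
    return x;
  }

  // Points closer than EPS are "friends"; transitively linked friends form one cluster.
  public static int[] cluster(double[][] points) {
    int n = points.length;
    int[] parent = new int[n];
    for (int i = 0; i < n; i++) parent[i] = i;
    for (int i = 0; i < n; i++) {
      for (int j = i + 1; j < n; j++) {
        double d2 = 0.0;
        for (int k = 0; k < 3; k++) {
          double diff = points[i][k] - points[j][k];
          d2 += diff * diff;
        }
        if (d2 <= EPS * EPS) {
          parent[find(parent, i)] = find(parent, j);   // merge the two clusters
        }
      }
    }
    int[] label = new int[n];
    for (int i = 0; i < n; i++) label[i] = find(parent, i);
    return label;   // cluster id per point
  }
}
```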

  • Application Areas: Astronomy, astrophysics simulations
  • Author: YongChul Kwon
  • Source:
  • License: Apache License 2.0
  • Dataset: Please contact yongchul
  • Hadoop Compatibility: 0.20 or higher. New API.
  • Characteristics:
    The runtime of the local clustering phase (map) is sensitive to the distribution of the input data in 3D space.
    [Figure: Runtime distribution of map tasks of distributed FoF]
  • Submitted By: yongchul

MR-Tandem: Parallel Peptide Search with X!Tandem and Hadoop MapReduce

Description: MR-Tandem adapts the popular X!Tandem peptide search engine to work with Hadoop MapReduce for reliable parallel execution of large searches. MR-Tandem supports local Hadoop clusters as well as Amazon Web Services for creating inexpensive on-demand Hadoop clusters, enabling search volumes that might not otherwise be feasible with the compute resources a researcher has at hand. MR-Tandem is designed to drop in wherever X!Tandem is already in use and requires no modification to existing X!Tandem parameter files or workflows. The work of moving data into the cluster and search results back onto the local file system is handled transparently.

  • Application Areas: Bioinformatics, Proteomics, Peptide Search
  • Author: Brian Pratt, Insilicos LLC (extends X!Tandem work by Beavis et al.)
  • Source:
  • License: Apache License 2.0
  • Dataset: Many search sets are available at http://www.peptideatlas.org/repository/
  • Hadoop Compatibility: 0.18 or higher. Hadoop Streaming.
  • Characteristics:
    Excellent scalability improvement over existing brittle MPI-based parallel peptide search efforts, but parallelism is still limited by the need for each search task to create the same list of theoretical spectra.
  • Submitted By: Brian Pratt (www.insilicos.com)

Hadoop MapReduce Example

Description: This package contains a subset of the Hadoop 0.20.2 example MapReduce programs adapted to use the new MapReduce API.
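
For reference, the fragment below shows what a mapper looks like in the new API (org.apache.hadoop.mapreduce): Mapper is an abstract class and output is written through a Context object instead of an OutputCollector. The word-count mapper and its class name are an illustrative example, not one of the packaged programs.

```java
// Word-count mapper in the new API (org.apache.hadoop.mapreduce): Mapper is an
// abstract class, and output is written through a Context object rather than an
// OutputCollector as in the old org.apache.hadoop.mapred API.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    StringTokenizer tok = new StringTokenizer(line.toString());
    while (tok.hasMoreTokens()) {
      word.set(tok.nextToken());
      ctx.write(word, ONE);   // old API: output.collect(word, ONE)
    }
  }
}
```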

PigMix Benchmark

Description: PigMix is a set of queries used to test Pig performance from release to release. There are queries that test latency and queries that test scalability. In addition, it includes a set of MapReduce Java programs that run equivalent MapReduce jobs directly. The original code was taken from https://issues.apache.org/jira/browse/PIG-200 and modified by Herodotos Herodotou to simplify usage and to use Hadoop's new API. See the enclosed README file for detailed instructions on how to generate the data and execute the benchmark. Further details are available at https://cwiki.apache.org/confluence/display/PIG/PigMix

  • Application Areas: General, Benchmark
  • Author: Daniel Dai, Alan Gates, Amir Youseffi, Herodotos Herodotou
  • Source: Download
  • License: Apache License 2.0
  • Dataset: The data generator is included in the source code above and can be executed using the pigmix/scripts/generate_data.sh script.
  • Hadoop Compatibility: 0.20.2 or higher. New API. Pig.
  • Submitted By: Herodotos Herodotou

Hive Performance Benchmark

Description: This benchmark was taken from https://issues.apache.org/jira/browse/HIVE-396 and was slightly modified to make it easier to use. The queries appear in the paper "A Comparison of Approaches to Large-Scale Data Analysis" by Pavlo et al. See the enclosed README file for detailed instructions on how to generate the data and execute the benchmark.

  • Application Areas: General, Benchmark
  • Author: Yuntao Jia, Zheng Shao
  • Source: Download
  • License: Apache License 2.0
  • Dataset: The data generator is included in the source code above in the datagen directory.
  • Hadoop Compatibility: 0.20.2 or higher. New API. Hive. Pig.
  • Submitted By: Herodotos Herodotou

TPC-H Benchmark

Description: TPC-H is an ad-hoc decision support benchmark (http://www.tpc.org/tpch/). The attached source code contains (i) scripts for generating TPC-H data in parallel and loading them into HDFS, (ii) the Pig-Latin implementation of all TPC-H queries, and (iii) the HiveQL implementation of all TPC-H queries (see https://issues.apache.org/jira/browse/HIVE-600).

Term Frequency-Inverse Document Frequency

Description: The TF-IDF (Term Frequency-Inverse Document Frequency) weight is often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. For details, please visit: http://en.wikipedia.org/wiki/Tf%E2%80%93idf.
See the enclosed README file for detailed instructions on how to execute the jobs. TF-IDF is implemented as a pipeline of 3 MapReduce jobs:

  1. Counts the occurrences of each word in each document
  2. Counts the total number of words in each document
  3. Counts the documents in the corpus that contain each word and computes the TF-IDF weight (a sketch of this final computation follows the list)
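
As a rough guide to what the third job produces, the fragment below assembles the TF-IDF weight from the counts gathered by the pipeline. The class and parameter names, and the use of the natural logarithm, are assumptions; consult the README for the exact definition used by the jobs.

```java
// Assembling the TF-IDF weight from the counts produced by the three jobs.
// The class name, parameter names, and the natural logarithm are assumptions.
public final class TfIdf {
  /**
   * @param wordCountInDoc occurrences of the word in this document (job 1)
   * @param wordsInDoc     total number of words in this document (job 2)
   * @param totalDocs      number of documents in the corpus (job 3)
   * @param docsWithWord   number of documents containing the word (job 3)
   */
  public static double weight(long wordCountInDoc, long wordsInDoc,
                              long totalDocs, long docsWithWord) {
    double tf  = (double) wordCountInDoc / wordsInDoc;        // term frequency
    double idf = Math.log((double) totalDocs / docsWithWord); // inverse document frequency
    return tf * idf;
  }
}
```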

Behavioral Targeting (Model Generation)

Description: This application builds a model that can be used to score ads for each user based on the user's search and click behavior. The model consists of a weight assigned to each (ad, keyword) pair that captures the correlation between the two: a larger weight signifies a greater probability that a user who has searched for that keyword will click on that ad. The job takes as input the path to a folder containing 3 log files named clickLog, impressionsLog and searchLog. The impressionsLog records when an ad (identified by adId) was shown to a user (identified by userId). The clickLog records when a user clicked on an ad. The searchLog records when a user searched for a particular keyword. Each record is assumed to be comma separated, with the fields in the order: timestamp, userId, adId (or keyword in the searchLog).
See the enclosed BT_description.pdf file for detailed instructions on how to execute the jobs.
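
To make the assumed record layout concrete, the fragment below parses one log line into its three fields. It is purely illustrative; the class and field names are hypothetical and do not appear in the application.

```java
// Illustration of the assumed log record layout; the class and field names are
// hypothetical and do not appear in the application itself.
public final class LogRecord {
  public final long timestamp;
  public final String userId;
  public final String adIdOrKeyword;  // adId in clickLog/impressionsLog, keyword in searchLog

  private LogRecord(long timestamp, String userId, String adIdOrKeyword) {
    this.timestamp = timestamp;
    this.userId = userId;
    this.adIdOrKeyword = adIdOrKeyword;
  }

  // Parses one comma-separated line: "timestamp,userId,adId" or "timestamp,userId,keyword".
  public static LogRecord parse(String line) {
    String[] f = line.split(",", 3);
    return new LogRecord(Long.parseLong(f[0].trim()), f[1].trim(), f[2].trim());
  }
}
```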

Various Pig Workflows

Description: This package includes Pig scripts and dataset generators for various Pig workflows.
See the enclosed README file for detailed instructions on how to execute the Pig scripts. The Pig-Latin scripts included are:

  1. coauthor.pig: Finds the top 20 coauthorship pairs. The input data has "paperid \t authorid" records. The script first builds the list of authorids for each paperid, then counts the occurrences of each author pair, and finally selects the pairs with the highest counts (a plain-MapReduce sketch of the pair-counting step appears after this list).
  2. pagerank.pig: Runs one iteration of PageRank. The input datasets are Rankings ("pageid \t rank") and Pages ("pageid \t comma-delimited links").
  3. pavlo.pig: The complex join task from the SIGMOD 2009 paper by Pavlo et al.
  4. tf-idf.pig: Computes TF-IDF. The input data has "doc \t word" records.
  5. tpch-17.pig: Query 17 from the TPC-H benchmark, slightly modified to use a different ordering of columns (the script assumes that the part key is the first column of both tables).
  6. common.pig: A seven-job workflow. The first job scans and performs an initial processing of the data. Two jobs read, filter, and compute the sum and maximum of the prices for each {order ID, part ID} and {order ID, supplier ID}, respectively. The results of these two jobs are further processed by separate jobs to find the overall sum and maximum prices for each {order ID}. Finally, the results are separately used to find the number of distinct aggregated prices.
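
For comparison with the Pig version, the sketch below expresses the pair-counting step of coauthor.pig directly in the MapReduce Java API. The input record layout (the output of the initial grouping step) and all class names are assumptions.

```java
// Sketch of the pair-counting step of coauthor.pig in the plain MapReduce API.
// Assumed input record (the output of the initial grouping step):
// "paperid<TAB>authorid1,authorid2,...". Class names are assumptions.
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CoauthorPairs {

  public static class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      String[] authors = parts[1].split(",");
      Arrays.sort(authors);   // canonical order so (a, b) and (b, a) count together
      for (int i = 0; i < authors.length; i++) {
        for (int j = i + 1; j < authors.length; j++) {
          ctx.write(new Text(authors[i] + "," + authors[j]), ONE);
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(pair, new IntWritable(sum));   // a final pass selects the top 20 pairs
    }
  }
}
```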
