The goal of this repository is to give researchers easy access to existing, publicly available MapReduce applications. These applications can serve as examples for writing new applications. They can also serve as benchmarks or simply as sample workloads for new systems, tools, and techniques for performing large-scale, complex analytics in the cloud.
The repository is curated by:
| University of Washington | Duke University |
| --- | --- |
| YongChul Kwon, Magdalena Balazinska, Bill Howe | Herodotos Herodotou, Rozemary Scarlat, Shivnath Babu |
To propose a new application for inclusion in the repository, please send as much of the following information as possible via email to hadepot@cs.washington.edu.
The applications currently in the repository are described below.
Description: PageRank is a link analysis algorithm that assigns weights (ranks) to each vertex in a graph by iteratively aggregating the weights of its inbound neighbors.
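For illustration only, here is a minimal sketch (not the repository's code) of how a single PageRank iteration is commonly expressed as one Hadoop MapReduce job. It assumes adjacency-list input records of the form "vertexId &lt;TAB&gt; rank &lt;TAB&gt; comma-separated outlinks"; the class names and damping factor are hypothetical.

```java
// Illustrative sketch of one PageRank iteration (not the repository's code).
// Assumed input: "vertexId<TAB>rank<TAB>comma-separated outlinks".
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankIteration {

  public static class RankMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      String vertex = parts[0];
      double rank = Double.parseDouble(parts[1]);
      String[] outlinks = parts.length > 2 ? parts[2].split(",") : new String[0];
      // Pass the graph structure through so the reducer can rebuild the adjacency list.
      context.write(new Text(vertex), new Text("LINKS\t" + (parts.length > 2 ? parts[2] : "")));
      // Distribute this vertex's rank evenly over its outbound neighbors.
      for (String neighbor : outlinks) {
        context.write(new Text(neighbor), new Text(Double.toString(rank / outlinks.length)));
      }
    }
  }

  public static class RankReducer extends Reducer<Text, Text, Text, Text> {
    private static final double DAMPING = 0.85;   // assumed damping factor

    @Override
    protected void reduce(Text vertex, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String links = "";
      double sum = 0.0;
      for (Text value : values) {
        String v = value.toString();
        if (v.startsWith("LINKS\t")) {
          links = v.substring("LINKS\t".length());
        } else {
          sum += Double.parseDouble(v);   // aggregate contributions from inbound neighbors
        }
      }
      double newRank = (1.0 - DAMPING) + DAMPING * sum;
      context.write(vertex, new Text(newRank + "\t" + links));
    }
  }
}
```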
Description: CloudBurst is a MapReduce implementation of the RMAP algorithm for short-read gene alignment. CloudBurst aligns a set of genome sequence reads with a reference sequence, and it distributes the approximate-alignment computations across reduce tasks by partitioning the reads and references on their n-grams.
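As an illustration of the n-gram partitioning idea (not CloudBurst's actual code), a mapper along the following lines could emit every fixed-length seed of a read or reference chunk as a key, so that sequences sharing a seed meet at the same reduce task. The input format, seed length, and class name here are assumptions.

```java
// Illustrative seed (n-gram) extraction for a CloudBurst-style alignment job; not the
// actual CloudBurst code. Reads and reference chunks that share a seed are routed to the
// same reduce task, where the approximate alignment would be performed.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SeedMapper extends Mapper<LongWritable, Text, Text, Text> {
  private static final int SEED_LEN = 12;   // assumed seed length

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumed input format: "id<TAB>R|Q<TAB>sequence" (R = reference chunk, Q = query read).
    String[] parts = value.toString().split("\t");
    String id = parts[0], type = parts[1], seq = parts[2];
    // Emit every overlapping n-gram as a key; the value records where the seed came from.
    for (int i = 0; i + SEED_LEN <= seq.length(); i++) {
      String seed = seq.substring(i, i + SEED_LEN);
      context.write(new Text(seed), new Text(type + "\t" + id + "\t" + i));
    }
  }
}
```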
Description: FoF (Friends-of-Friends) is used by astronomers to analyze the structure of the universe within a snapshot of a simulation of the universe's evolution. For each point in the dataset, the FoF algorithm uses a spatial index to recursively look up neighboring points and find connected clusters.
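For intuition only, the following is a minimal single-machine sketch of the friends-of-friends idea (not the repository's distributed MapReduce implementation). It assumes 2-D points and a grid-based spatial hash as the index; the linking length and class names are hypothetical.

```java
// Minimal single-machine sketch of friends-of-friends clustering (not the repository's code).
// Points within LINK_LEN of each other end up in the same cluster.
import java.util.*;

public class FriendsOfFriends {
  static final double LINK_LEN = 0.2;   // assumed linking length

  // Hash each point into a grid cell of side LINK_LEN so neighbor lookups only touch adjacent cells.
  static String cell(double[] p) {
    return (long) Math.floor(p[0] / LINK_LEN) + ":" + (long) Math.floor(p[1] / LINK_LEN);
  }

  static double dist(double[] a, double[] b) {
    return Math.sqrt((a[0] - b[0]) * (a[0] - b[0]) + (a[1] - b[1]) * (a[1] - b[1]));
  }

  public static int[] cluster(double[][] points) {
    Map<String, List<Integer>> grid = new HashMap<>();
    for (int i = 0; i < points.length; i++) {
      grid.computeIfAbsent(cell(points[i]), k -> new ArrayList<>()).add(i);
    }
    int[] label = new int[points.length];
    Arrays.fill(label, -1);
    int nextLabel = 0;
    for (int i = 0; i < points.length; i++) {
      if (label[i] != -1) continue;
      // Breadth-first expansion: recursively pull in every point within LINK_LEN of the cluster.
      Deque<Integer> frontier = new ArrayDeque<>();
      frontier.add(i);
      label[i] = nextLabel;
      while (!frontier.isEmpty()) {
        int p = frontier.poll();
        long cx = (long) Math.floor(points[p][0] / LINK_LEN);
        long cy = (long) Math.floor(points[p][1] / LINK_LEN);
        for (long dx = -1; dx <= 1; dx++) {
          for (long dy = -1; dy <= 1; dy++) {
            for (int q : grid.getOrDefault((cx + dx) + ":" + (cy + dy), Collections.emptyList())) {
              if (label[q] == -1 && dist(points[p], points[q]) <= LINK_LEN) {
                label[q] = nextLabel;
                frontier.add(q);
              }
            }
          }
        }
      }
      nextLabel++;
    }
    return label;   // label[i] is the cluster id of point i
  }
}
```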
Description: MR-Tandem adapts the popular X!Tandem peptide search engine to work with Hadoop MapReduce for reliable parallel execution of large searches. MR-Tandem supports local Hadoop clusters as well as Amazon Web Services for creating inexpensive on-demand Hadoop clusters, enabling search volumes that might not otherwise be feasible with the compute resources a researcher has at hand. MR-Tandem is designed to drop in wherever X!Tandem is already in use and requires no modification to existing X!Tandem parameter files or workflows. The work of moving data into the cluster and search results back onto the local file system is handled transparently.
Description: This package contains a subset of the Hadoop 0.20.2 example MapReduce programs adapted to use the new MapReduce API.
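For reference, a WordCount program written against the new org.apache.hadoop.mapreduce API looks roughly as follows. The package ships its own adapted programs; this sketch is only meant to show the new-API style.

```java
// WordCount written against the new org.apache.hadoop.mapreduce API (illustration only).
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // Job.getInstance() in later Hadoop versions
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```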
Description: PigMix is a set of queries used to test Pig performance from release to release. There are queries that test latency and queries that test scalability. In addition, it includes a set of Java MapReduce programs that run equivalent MapReduce jobs directly. The original code was taken from https://issues.apache.org/jira/browse/PIG-200 and modified by Herodotos Herodotou to simplify usage and to use Hadoop's new API. See the enclosed README file for detailed instructions on how to generate the data and execute the benchmark. You will find further details at https://cwiki.apache.org/confluence/display/PIG/PigMix. The data can be generated using the pigmix/scripts/generate_data.sh script.
Description: This benchmark was taken from https://issues.apache.org/jira/browse/HIVE-396 and was slightly modified to make it easier to use. The queries appear in the Pavlo et al. paper "A Comparison of Approaches to Large-Scale Data Analysis". See the enclosed README file for detailed instructions on how to generate the data and execute the benchmark. The data generators are located in the datagen directory.
Description: TPC-H is an ad-hoc decision support benchmark (http://www.tpc.org/tpch/). The attached source code contains (i) scripts for generating TPC-H data in parallel and loading them onto HDFS, (ii) the Pig-Latin implementation of all TPC-H queries, and (iii) the HiveQL implementation of all TPC-H queries (see https://issues.apache.org/jira/browse/HIVE-600).
Description: The TF-IDF weight (Term Frequency-Inverse Document Frequency) is a weight often used in information retrieval and text mining. It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. For details, please visit: http://en.wikipedia.org/wiki/Tf%E2%80%93idf.
TF-IDF is implemented here as a pipeline of three MapReduce jobs. See the enclosed README file for detailed instructions on how to execute the jobs.
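The exact job breakdown is described in the README. As a hypothetical illustration, a first stage of such a pipeline typically counts how often each word occurs in each document, along the following lines (assuming input records of the form "docId &lt;TAB&gt; word"; class names are illustrative).

```java
// Hypothetical first TF-IDF stage: count occurrences of each word per document.
// The repository's actual three-job breakdown is described in its README; this is only
// an illustration, assuming input records of the form "docId<TAB>word".
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TermFrequency {

  // Emit ("word@docId", 1) for every word occurrence.
  public static class TFMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      if (parts.length == 2) {
        context.write(new Text(parts[1] + "@" + parts[0]), ONE);
      }
    }
  }

  // Sum the occurrences to get the raw term frequency; later jobs would compute the
  // per-document totals and the document frequency needed for tf * log(N / df).
  public static class TFReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```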
Description: This application builds a model that can be used to score ads for each user based on the user's search and click behavior. The model assigns a weight to each (ad, keyword) pair that captures the correlation between the two: a larger weight signifies a greater probability that a user who has searched for that keyword will click on the ad. The job takes as input the path to a folder containing three log files named clickLog, impressionsLog, and searchLog. The impressionsLog records when an ad (identified by adId) was shown to a user (identified by userId). The clickLog records when a user clicked on an ad. The searchLog records when a user searched for a particular keyword. The values are assumed to be comma-separated, and the order of the fields is: timestamp, userId, adId (or keyword in the searchLog).
See enclosed BT_description.pdf file for detailed instructions on how to execute the jobs.
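As a hypothetical illustration only (the actual implementation is described in BT_description.pdf), a first job could perform a reduce-side join of the logs on userId and pair each clicked adId with the keywords the same user searched for. The class names below, and the choice to ignore impressionsLog in this stage, are assumptions.

```java
// One plausible sketch of a first job for this application (not the actual implementation):
// a reduce-side join of the logs on userId that pairs each clicked adId with each keyword
// the same user searched for.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class AdKeywordJoin {

  public static class LogMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Records are "timestamp,userId,adId" (or "timestamp,userId,keyword" in searchLog).
      String[] fields = value.toString().split(",");
      String log = ((FileSplit) context.getInputSplit()).getPath().getName();
      if (log.startsWith("clickLog")) {
        context.write(new Text(fields[1]), new Text("CLICK\t" + fields[2]));
      } else if (log.startsWith("searchLog")) {
        context.write(new Text(fields[1]), new Text("SEARCH\t" + fields[2]));
      }
      // impressionsLog records would be handled similarly when normalizing by impressions.
    }
  }

  public static class PairReducer extends Reducer<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void reduce(Text userId, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> clickedAds = new ArrayList<String>();
      List<String> searchedKeywords = new ArrayList<String>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t");
        if (parts[0].equals("CLICK")) clickedAds.add(parts[1]);
        else searchedKeywords.add(parts[1]);
      }
      // Each co-occurrence of (clicked ad, searched keyword) for this user is one unit of
      // evidence; a follow-up job would aggregate these counts into the (ad, keyword) weights.
      for (String ad : clickedAds) {
        for (String kw : searchedKeywords) {
          context.write(new Text(ad + "," + kw), ONE);
        }
      }
    }
  }
}
```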
Description: This package includes Pig scripts and dataset generators for various Pig workflows.
See enclosed README file for detailed instructions on how to execute the Pig scripts. The Pig-Latin scripts included are:
- coauthor.pig: Finds the top 20 coauthorship pairs. The input data has "paperid \t authorid" records. The script first creates a list of authorids for each paperid, then counts the author pairs, and finally finds the author pairs with the highest counts.
- pagerank.pig: Runs one iteration of PageRank. The input datasets are Rankings ("pageid \t rank") and Pages ("pageid \t comma-delimited links").
- pavlo.pig: The complex join task from the SIGMOD 2008 Pavlo et al. paper.
- tf-idf.pig: Computes TF-IDF. The input data has "doc \t word" records.
- tpch-17.pig: Query 17 from the TPC-H benchmark, slightly modified to have a different ordering of columns (this script assumes that the part key is the first column of both tables).
- common.pig: A seven-job workflow. The first job scans and performs initial processing of the data. Two jobs then read, filter, and find the sum and maximum of the prices for each {order ID, part ID} and {order ID, supplier ID}, respectively. The results of these two jobs are further processed by separate jobs to find the overall sum and maximum prices for each {order ID}. Finally, the results are separately used to find the number of distinct aggregated prices.