This repository was archived by the owner on Jun 7, 2024. It is now read-only.

Sample Applications

adamwg edited this page Jan 14, 2011 · 2 revisions

Elastic Phoenix comes with four sample applications, based on the original Phoenix sample applications. They are described below.

word_count

The word_count application counts the frequency of each word in a text file. The map emits a key/value pair for each word in the text, with the word as the key and 1 as the value. The reduce sums the values for each word.
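The map and reduce described above can be sketched in a few lines of Python. This is only an illustration of the data flow, not the Phoenix or Elastic Phoenix API: the function names (`wc_map`, `wc_reduce`, `run_word_count`) are hypothetical, words are assumed to be separated by whitespace, and the grouping step stands in for what the runtime does between map and reduce.

```python
from collections import defaultdict

def wc_map(chunk):
    """Emit a (word, 1) pair for every word in a chunk of text."""
    return [(word, 1) for word in chunk.split()]

def wc_reduce(word, values):
    """Sum the values emitted for one word."""
    return word, sum(values)

def run_word_count(chunks):
    # Group intermediate pairs by key, as a MapReduce runtime would
    # do between the map and reduce phases.
    groups = defaultdict(list)
    for chunk in chunks:
        for word, one in wc_map(chunk):
            groups[word].append(one)
    return dict(wc_reduce(word, values) for word, values in groups.items())

counts = run_word_count(["the cat sat", "the dog sat down"])
```

Note that the number of intermediate pairs equals the number of words in the input, not the number of distinct words.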

One drawback of word_count is that it generates a very large amount of intermediate data: one key/value pair for every word occurrence in the input. Because of this, Elastic Phoenix with 4GB of Nahanni shared memory can only handle the small 10MB input file. This problem could be alleviated by using a combiner, but combiners are not currently supported in Elastic Phoenix.
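To show why a combiner would reduce the intermediate data, here is a rough sketch comparing the output of a plain map with one that pre-sums counts locally. Elastic Phoenix does not provide a combiner, so `map_with_combiner` is purely hypothetical and the whitespace tokenization is an assumption.

```python
from collections import Counter

def map_without_combiner(chunk):
    # One (word, 1) pair per word occurrence.
    return [(word, 1) for word in chunk.split()]

def map_with_combiner(chunk):
    # Hypothetical combiner: sum counts inside the map task, so only
    # one pair per distinct word leaves the task.
    return list(Counter(chunk.split()).items())

chunk = "a b a a b c a"
plain = map_without_combiner(chunk)
combined = map_with_combiner(chunk)
```

For this chunk the plain map emits one pair per word occurrence, while the combined version emits one pair per distinct word. The final counts after reduce are the same either way.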

histogram

The histogram application generates a color histogram of a bitmap image. The map iterates over the pixels in a region and keeps track of their color values (red/green/blue), emitting one key/value pair for each color value found, with the count for that region as the value. The reduce sums the values for each color value.
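A minimal Python sketch of this map and reduce follows. It is not the Phoenix code: the function names are hypothetical, pixels are assumed to be `(r, g, b)` tuples of integers in the range 0 to 255, and the `(channel, value)` key layout is an assumption chosen for readability.

```python
from collections import Counter, defaultdict

def hist_map(pixels):
    """Count color values in one region of (r, g, b) pixels."""
    local = Counter()
    for r, g, b in pixels:
        local[("red", r)] += 1
        local[("green", g)] += 1
        local[("blue", b)] += 1
    # One pair per (channel, value) seen in the region, so a single
    # map task can emit at most 3 * 256 pairs.
    return list(local.items())

def hist_reduce(key, values):
    """Sum the per-region counts for one color value."""
    return key, sum(values)

def run_histogram(regions):
    groups = defaultdict(list)
    for region in regions:
        for key, count in hist_map(region):
            groups[key].append(count)
    return dict(hist_reduce(key, values) for key, values in groups.items())

regions = [[(255, 0, 0), (255, 0, 10)], [(0, 0, 10)]]
hist = run_histogram(regions)
```

Because the map counts locally before emitting, the intermediate data size depends on the number of distinct color values, not on the number of pixels.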

Since each map task generates at most 768 intermediate values (3 colors * 256 possible color values), histogram doesn't suffer from the size limitations encountered in word_count. We've run histogram on all three of the Phoenix-provided sample inputs: 100MB, 399MB, and 1.4GB.

string_match

The string_match application has an internal list of "secret" words, which are hashed. It goes through a file of cleartext words, hashing each one and comparing it against the hashes of the secret words to count how often each secret word appears in the text. The map hashes and compares each word, emitting a count for each secret word found in a region of the file. The reduce sums the values for each secret word.
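The hash-and-compare step can be sketched as below. Everything specific here is an assumption made for illustration: the secret words are made up, SHA-256 is a stand-in for whatever hash the real application uses, and the function names are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

def hash_word(word):
    # SHA-256 is a stand-in; the real application's hash may differ.
    return hashlib.sha256(word.encode("utf-8")).hexdigest()

# Hypothetical secret words, kept only in hashed form.
SECRET_HASHES = {hash_word(w) for w in ("alpha", "bravo", "charlie", "delta")}

def sm_map(words):
    """Hash each cleartext word in a region; count matches per secret hash."""
    local = Counter()
    for word in words:
        digest = hash_word(word)
        if digest in SECRET_HASHES:
            local[digest] += 1
    # At most one pair per secret word.
    return list(local.items())

def sm_reduce(digest, values):
    """Sum the per-region counts for one secret word."""
    return digest, sum(values)

def run_string_match(regions):
    groups = defaultdict(list)
    for region in regions:
        for digest, count in sm_map(region):
            groups[digest].append(count)
    return dict(sm_reduce(d, values) for d, values in groups.items())

matches = run_string_match([["alpha", "zulu", "bravo"], ["alpha", "echo"]])
```

The result is keyed by hash rather than by cleartext word, since the sketch never stores the secret words themselves after hashing.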

Each map task generates at most 4 intermediate values (one for each secret word), so the amount of intermediate data is small. We've run string_match on all three of the Phoenix-provided sample inputs: 55MB, 104MB, and 518MB.

linear_regression

The linear_regression application computes statistics for a set of points, then uses the statistics to perform a linear regression on the points. The map emits, for each set of points, a key/value pair for each of five statistics. The reducer sums each statistic. The linear regression itself is performed after the MapReduce job is finished, using the statistics it has generated.
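As a rough sketch of this flow, the example below assumes the five statistics are the sums of x, y, x², y², and xy, which are the usual sufficient statistics for a least-squares line fit; the text above does not name them. The key names (`SX`, `SY`, and so on), the function names, and the use of a separately known point count `n` are all assumptions.

```python
from collections import defaultdict

def lr_map(points):
    """Emit five partial sums for one set of (x, y) points."""
    return [
        ("SX", sum(x for x, _ in points)),
        ("SY", sum(y for _, y in points)),
        ("SXX", sum(x * x for x, _ in points)),
        ("SYY", sum(y * y for _, y in points)),
        ("SXY", sum(x * y for x, y in points)),
    ]

def lr_reduce(key, values):
    """Sum one statistic across all map tasks."""
    return key, sum(values)

def run_statistics(point_sets):
    groups = defaultdict(list)
    for points in point_sets:
        for key, value in lr_map(points):
            groups[key].append(value)
    return dict(lr_reduce(key, values) for key, values in groups.items())

def fit_line(stats, n):
    # Least-squares fit, done after the MapReduce job has finished.
    # SYY is not needed for slope and intercept; it would be used for
    # quantities such as the correlation coefficient.
    slope = (n * stats["SXY"] - stats["SX"] * stats["SY"]) / (
        n * stats["SXX"] - stats["SX"] ** 2
    )
    intercept = (stats["SY"] - slope * stats["SX"]) / n
    return slope, intercept

point_sets = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]  # points on y = 2x + 1
stats = run_statistics(point_sets)
slope, intercept = fit_line(stats, n=4)
```

Since every map task emits the same five keys regardless of how many points it processes, the intermediate data grows with the number of map tasks rather than with the input size.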

Each map task generates exactly five intermediate values (one for each statistic), so there is no size limitation. We've run linear_regression on each of the Phoenix-provided sample inputs: 55MB, 104MB, and 518MB.
