WebMapReduce
CS 121 (CS1)
tallies --
[HC] Simple tallies. WebMapReduce (WMR), as described in the CSinParallel module http://selkie.macalester.edu/csinparallel/modules/IntroWMR/build/html/index.html, has great potential for "big data" computations, since it computes using the open-source software Hadoop, the most common choice for computations at a truly enormous scale with data that may have no rigid regular structure. Some examples of such data include: corpora (plural of "corpus"), which could be literary or could equally well represent conversations, research data, or web pages; graphical information such as maps and their annotations; and relationship data such as "friendship" or "likes" on social media.

One of the first skills to develop in map-reduce problem solving is to perform simple tallies that describe a dataset. The following questions develop these skills using WMR. The "word count" computation discussed in the module is an example: the goal is to produce a list of the words in a corpus together with each word's frequency in that corpus. The following solution is taken from the module, with documentation added.
Mapper

    # IN keys and values:
    #   key: holds lines of text from the corpus   value: is empty
    # OUT keys and values:
    #   key: one word from that text   value: a string representing the number 1
    def mapper(key, value):
        words = key.split()
        for word in words:
            Wmr.emit(word, '1')
Reducer

    # IN keys and values:
    #   key: one word from a corpus   value: a string representing the number 1
    # NOTE: the values are packed into an iterator iter
    # OUT keys and values:
    #   key: that one word from a corpus   value: the sum of the 1's from iter
    def reducer(key, iter):
        sum = 0
        for s in iter:
            sum = sum + int(s)
        Wmr.emit(key, str(sum))
Comments

- The basic structure of map-reduce computing is to
  - break the given data up into labelled pieces, using a mapper() function, then to
  - gather those pieces according to label and combine all the pieces that have the same label, using a reducer() function.
- We have highlighted comments about IN and OUT keys and values. These comments describe how labelled pieces (technically called key-value pairs) are generated from the original data and how those pieces are combined for each label. Writing down IN and OUT not only describes the map-reduce computation, but also helps with problem solving, because it specifies the goals for the mapper and reducer functions.
- All keys and values in map-reduce computation are actually strings.
  - The Wmr.emit() method will automatically convert numerical arguments to strings, but we have used the string argument values '1' and str(sum) directly in calls of Wmr.emit() to emphasize that those arguments must be strings.
  - The conversion int(s) in reducer() is necessary, since s is a string and we want to add its numerical value to the accumulator sum.
- The arguments for a mapper() are a key and a value. This supports using the result of one map-reduce computation as the input for another one, since map-reduce computations produce key-value pairs.
  - This particular mapper assumes that the lines of the given input text are located entirely in the key argument of mapper(), with an empty string as value. This allows us to count words in ordinary text files.
  - The mapper() is called once for each line of input in WMR (whether that line has key-value structure or only a key).
- The reducer() operates on all values having the same key. Therefore the arguments of a reducer() are
  - a single copy of the key string (i.e., the "label"), and
  - an iterator iter that contains all the values for that key that were produced by all calls of the mapper().

  The iterator makes it convenient to examine all values using a for loop, as seen in this reducer(). An iterator is used instead of a list because WMR supports "big-data" computations, in which there could be billions or trillions of values for a given key -- too many values to be held in a list.
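To see how these stages fit together, here is a minimal sketch that simulates one map-reduce cycle in ordinary Python, outside WMR. The Wmr stand-in class is hypothetical (the real Wmr.emit() is supplied by the WMR system), and a dictionary plays the role of the shuffle.

    # Minimal simulation of one map-reduce cycle (plain Python, not WMR code).
    class Wmr:
        pairs = []                       # collects emitted key-value pairs
        @staticmethod
        def emit(key, value):
            Wmr.pairs.append((key, value))

    def mapper(key, value):              # the word-count mapper from above
        for word in key.split():
            Wmr.emit(word, '1')

    def reducer(key, iter):              # the word-count reducer from above
        sum = 0
        for s in iter:
            sum = sum + int(s)
        Wmr.emit(key, str(sum))

    lines = ['The cat in the hat', 'wore the hat', 'to the cat hat party.']

    for line in lines:                   # one mapper() call per input line
        mapper(line, '')

    groups = {}                          # "shuffle": gather values by key
    for key, value in Wmr.pairs:
        groups.setdefault(key, []).append(value)

    Wmr.pairs = []                       # now collect the reducers' output
    for key in sorted(groups):           # one reducer() call per key, in key order
        reducer(key, iter(groups[key]))

    for key, value in Wmr.pairs:
        print(key, value)                # The 1, cat 2, hat 3, in 1, ...

Passing iter(groups[key]) mimics WMR's iterator argument: the reducer sees a stream of values rather than a list.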
wmr --
[C] Getting started with WMR. Explore the features of WebMapReduce (WMR) with the mapper() and reducer() above, by performing the following steps.
- Access WMR by visiting the site http://www.cs.stolaf.edu/wmr in a browser.
  Note: This site is only accessible on the St. Olaf campus network. If you are off campus, you can connect your computer to the campus network using VPN.
- Use your St. Olaf username and password to log in to WMR. Your account will be recognized automatically -- there is no need to register for an account. This will present you with the WMR job submission form.
- In the Job Info section, choose a job name (e.g., word-count), and make sure the language is Python3. Leave the two task counts blank (Hadoop will choose these automatically) and leave the sorting on the default Alphabetic.
- In the Input section, choose Direct Input, and enter two or three lines of text in the text box, such as

      The cat in the hat
      wore the hat
      to the cat hat party.

- In the Mapper section, enter the mapper() code above. Include the comments (feel free to cut and paste) as well as the code (retype this short function to get a feel for WMR code input).
- Likewise, in the Reducer section, enter the reducer() code above, including the IN and OUT comments.
- Click on the Test button. This performs a test run of your job, useful for making sure your code works correctly with test input.
  - The data size is limited for Test executions, but it otherwise behaves similarly to actual Hadoop computations.
  - The Test option is faster for small data than an actual Hadoop job.
  - A successful Test computation shows the key-value pairs emitted from a mapper(), which is useful for verifying your code and for debugging if something is wrong. The actual Hadoop interface does not show these intermediate key-value pairs. Use Test first to debug your mapper() and reducer() code, since the Hadoop Submit option takes much longer to deliver less debugging information if something goes wrong.

  Note: for the "cat-hat" test data above, the intermediate key-value pairs are

      The      1
      cat      1
      in       1
      the      1
      hat      1
      wore     1
      the      1
      hat      1
      to       1
      the      1
      cat      1
      hat      1
      party.   1

  The final output from the cat-hat example input is

      cat      2
      hat      3
      in       1
      party.   1
      the      3
      The      1
      to       1
      wore     1
- If the Test job indicates any errors, you can use a Resubmit Job link or your browser's back button to return to the job submission page and correct the error. Then Test your job again, repeating as necessary until your job works correctly with the Test interface.
- Next, click on a Resubmit Job link or use your browser's back button to return to the New Job page. (If you use Resubmit Job, note that your input has been stored as a data set in the Hadoop system.) This time, use the Submit button to run the job on Hadoop instead of the test system.
  - Even though your data is small, a Hadoop run takes a long time. This is because Hadoop takes time to gear up for very large data sets. These lengthy preparations pay off when there are billions or trillions of bytes of data, but they extend the time required for a job when there are only a few dozen bytes of data.
  - Your job should run correctly without errors, since you tested it using the Test option.
- You may observe that the sort order is slightly different with Hadoop Submit than with the Test option. Hadoop produces

      The      1
      cat      2
      hat      3
      in       1
      party.   1
      the      3
      to       1
      wore     1

  The difference is with the word The, since Hadoop sorting places capitalized words before words that begin with a lower-case letter. The Test option allows you to check the correctness of your code, but this sorting difference shows that the Test system doesn't necessarily produce results that are identical to Hadoop's results.
- After your successful Hadoop (Submit) run, return to the New Job page and select the Save link at the bottom of the page. This saves the data set and code for your job under the job name you chose above (e.g., word-count). In the future, you can load this job again by choosing Saved Configurations in the left menu bar.
  In practice, you probably don't want to save all of your successful runs, since the Saved Configurations menu might become overly large. However, some configurations such as this word-count example can be good starting points for other WMR jobs.
- Finally, run your job with larger input, e.g., a large book. For example:
  - Enter the Cluster Path /shared/gutenberg/CompleteShakespeare.txt to use a copy of Project Gutenberg's digitization of the Complete Works of William Shakespeare.
  - Likewise, the Cluster Path /shared/gutenberg/WarAndPeace.txt is a copy of a translation of Tolstoy's famous novel.

  Note: Each of these works may also be accessible using the drop-down menu labelled Cluster Dataset. Use one of the choices without line numbers for this mapper() and reducer().

  Here are the first few lines of output for CompleteShakespeare.txt:

      "          241
      "'Tis      1
      "A         4
      "AS-IS".   1
      "Air,"     1
      "Alas,     1
      "Amen"     2
      "Amen"?    1
      "Amen,"    1
      "And       1
      "Aroint    1
      "B         1
      "Black     1
      "Break     1
      "Brutus"   1
      "Brutus,   2
      "C         1
      "Caesar"?  1
      ...

- Record your mapper(), your reducer(), and the first few lines of your output for your final run. Use copy and paste for all code, not a screenshot.
wc_value --
[C] Word count with line numbers. Modify the code in the previous problem to work when each text line has a line number. This means changing the IN/OUT documentation for the mapper() (only), to

    # IN keys and values:
    #   key: holds a line number   value: holds a line of text from the corpus
    # OUT keys and values:
    #   key: one word from that text   value: a string representing the number 1

The IN portion of the comment (the descriptions of the mapper's key and value) is what needs to change.

Hints:
- You can start with your saved configuration from the exercise above and modify the mapper(). Only the mapper() needs to change, since the IN/OUT specifications for the reducer() are unchanged. (A sketch of the modified mapper() follows these hints.)
- Here is a test data set:

      1   The cat in the hat
      2   wore the hat
      3   to the cat hat party.

  You can enter this data interactively in the Direct Input box in the Input portion of the New Job page. Separate the line number from the text by a TAB character in that form, because Hadoop uses the first TAB character in a line to separate the key from the value for that line.
- The Test computation should show the same intermediate key-value pairs and result key-value pairs as the original cat-hat example data produced.
- At the end, test your modified code with a large dataset that has line numbers, such as /shared/gutenberg/CompleteShakespeare.txt_n or /shared/gutenberg/WarAndPeace.txt_n (these may be available as options with line numbers in Cluster Datasets).
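A minimal sketch of the modified mapper(), assuming (per the spec above) that the line number arrives in key and the text in value; only the source of the words changes:

    # IN keys and values:
    #   key: holds a line number   value: holds a line of text from the corpus
    # OUT keys and values:
    #   key: one word from that text   value: a string representing the number 1
    def mapper(key, value):
        words = value.split()      # the text is now in value, not in key
        for word in words:
            Wmr.emit(word, '1')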
sort --
[C] Sorting with the shuffle stage. The previous two problems probably produce results that are sorted by word in dictionary order (lexicographic order). This is because Hadoop sorts all key-value pairs according to key in the "shuffle" phase, which occurs between the calls of mapper() and the calls of reducer().

    mapper() calls ---> shuffle ---> reducer() calls

The primary purpose of the shuffle phase is to move and organize the key-value pairs produced by mapper() calls in order to prepare for the calls of reducer(), each of which must receive all the values for a particular key. Hadoop uses sorting according to key in order to perform this organization, and by calling reducer() in that key-sorted order, the results of map-reduce can take advantage of that sorted order.

If your results do not appear in sorted order, then Hadoop is using more than one reducer task. A reducer task is a program that calls reducer() for keys. If Hadoop uses more than one reducer task, the keys will be dealt out evenly among the reducer tasks; each reducer task will produce results from the sorted list of keys that it receives, but the combined results from all reducer tasks will be out of order. (The sketch after the DO THIS list below simulates this effect.)

DO THIS:
- 2reducers -- Request that Hadoop use two reducer tasks by entering the number 2 in the "Reduce tasks" box, in the "Job Info" section at the top of the "New Job" page. Be sure to use "Submit" to perform a Hadoop computation (the "Test" option ignores the "Reduce tasks" choice). If you use the cat-hat text for input, this may lead to unsorted output such as the following:

      The      1
      hat      3
      in       1
      the      3
      to       1
      wore     1
      cat      2
      party.   1

  In any case, expect two sorted segments of results from the two reducer tasks (in this case, The to wore, then cat to party.). Record a screen shot of your unsorted results.
  Note: Hadoop decides the actual number of reducer tasks, and it may choose a different value than the number entered in the "Reduce tasks" box. The actual number chosen is shown on the job results page.
- 1reducer -- Request 1 reducer task (using that same text box), and verify that you get the sorted values shown above.
- default_reduce_tasks -- What results do you get if you leave the "Reduce tasks" box empty? This gives a default value for the number of reduce tasks, which may be something other than 1 on some WMR setups.
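To see why multiple reducer tasks break the overall sorted order, here is a minimal sketch (plain Python, not WMR or Hadoop code) that deals the cat-hat keys out to two simulated reducer tasks. The hash partition is an illustrative stand-in for Hadoop's default partitioner, and since Python salts its string hashes, the grouping may differ from run to run.

    # Each simulated reducer task prints its own keys in sorted order,
    # but the concatenated output is two sorted runs, not one.
    counts = {'The': 1, 'cat': 2, 'hat': 3, 'in': 1,
              'party.': 1, 'the': 3, 'to': 1, 'wore': 1}

    NUM_REDUCE_TASKS = 2
    tasks = [[] for _ in range(NUM_REDUCE_TASKS)]

    for key in counts:
        tasks[hash(key) % NUM_REDUCE_TASKS].append(key)   # deal keys out by hash

    for keys in tasks:
        for key in sorted(keys):      # sorted within each task only
            print(key, counts[key])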
2cycle --
[HC] Ordering by frequency. The previous problems produce word-frequency results that are sorted according to word, but a linguist or humanist might want to know about the most frequently occurring words instead of the frequency of every word. This requires sorting by frequency instead of sorting by word.

Since the word-frequency values cannot be known until after the reducer() calls have computed those frequencies, we will use a second map-reduce cycle to sort the word-frequency results according to frequency, using the steps below. This follow-on WMR job may produce the following results for the cat-hat data.

    1   The
    1   in
    1   party.
    1   to
    1   wore
    2   cat
    3   hat
    3   the

Note: words with the same frequency (e.g., hat and the) may appear in different orders than above, but the frequencies (e.g., 3) will appear in sorted order.
- [C] Start with the "Job Succeeded" results page from a map-reduce job computed using Hadoop (the "Submit" button), not the "Test" option. At the bottom of the "Job Succeeded" page, click on the "Use Output" link. This will proceed to the "New Job" page, but with a predefined data set consisting of the key-value pairs (sorted by word) produced by the original word-count reducer().
- [HC] Now, fill in the rest of the form for this second map-reduce cycle.
  - In the "Job Info" section, choose a new job name, such as word-count2. Also, choose one reducer, to ensure that a single sorted data set will be produced at the end of this second map-reduce cycle. Finally, for "Sort", choose "Numeric", which will sort keys according to numerical value instead of sorting lexicographically as strings. For example, in "Numeric" order the key "2" should come before the key "10", but in lexicographic (dictionary) order "10" would come before "2".
  - For mapper():

        # IN key-value pairs
        #   key: holds a word   value: holds an integer, frequency of that word
        # OUT key-value pairs
        #   key: holds an integer, frequency of a word   value: holds that word

  - For reducer():

        # IN key-value pairs
        #   key: holds an integer, frequency of a word   value: holds that word
        # NOTE: the values are packed into an iterator iter
        # OUT key-value pairs
        #   key: holds an integer, frequency of a word   value: holds that word

  - This mapper() doesn't change the key or the value. It only needs to emit those strings in the opposite roles (i.e., emit the original value as a new key, and the original key as a new value).
  - This reducer() also doesn't change the key or the value. It only emits the key and each value in the same roles, unchanged. This is called the identity reducer. (A sketch of both functions follows this exercise.)
  - The results will be sorted by frequency because of the shuffle phase.
  - Test and debug your mapper() and reducer() using the "Test" option. Then, use the "Submit" option for your final run.
  - This might be a good job configuration to "Save", since it includes an identity reducer. You might add a comment with "Identity reducer" to the code for the reducer().
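Following the IN/OUT specs and hints above, a minimal sketch of the swap mapper() and the identity reducer() might look like this:

    # Swap mapper: emit (frequency, word) so the shuffle sorts by frequency.
    def mapper(key, value):
        Wmr.emit(value, key)       # the original value (a frequency) becomes the key

    # Identity reducer: pass every (frequency, word) pair through unchanged.
    def reducer(key, iter):
        for value in iter:
            Wmr.emit(key, value)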
strip --
[HC] Removing capitals and punctuation. The original mapper() and reducer() for counting word frequencies count capitalized or punctuated forms of a word separately from that word; for example, the frequencies of The and "the and the are all tallied separately. These distinctions are undesirable in many cases of corpus analysis. For example, in the complete Shakespeare example above, we would most likely want to combine the frequencies of "Amen", "Amen"?, "Amen,", amen, etc., into a single frequency for amen.

Modify those original mapper() and reducer() functions to ignore capitalization and punctuation before or after a word. The resulting functions will satisfy these IN/OUT specs (the description of the mapper's OUT key is what changes):

Mapper

    # IN keys and values:
    #   key: holds lines of text from the corpus   value: is empty
    # OUT keys and values:
    #   key: one word from that text, after lowering letter cases
    #        and removing punctuation before and after each word
    #   value: a string representing the number 1

Reducer

    # IN keys and values:
    #   key: one word from a corpus   value: a string representing the number 1
    # NOTE: the values are packed into an iterator iter
    # OUT keys and values:
    #   key: that one word from a corpus   value: the sum of the 1's from iter
Hints
- As the changes suggest, the same reducer() can be used; only the mapper() needs to be modified.
- Apply lower() and strip() to each word before emitting it. (A sketch of the modified mapper() follows these hints.)
  Note: The string method strip() is better than replace() for this purpose, because strip() only removes characters from the beginning or end of each word, and we don't want to remove internal apostrophes from contractions (like don't).
- Use the "Test" interface first with a small data set, such as the cat-hat data (including a final period after the word party.), as always.
- Start with your best guess about punctuation marks to strip away, and check your guess using an actual corpus, such as the complete works of Shakespeare. Scroll through the results of a Hadoop ("Submit") run to see if additional punctuation marks should be removed.
  Also, remember to escape a quote character within quote characters. E.g., the strip() argument ".,?\"'" removes periods, commas, question marks, double-quote characters ", and single-quote/apostrophe characters '.
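A minimal sketch of the modified mapper(), using lower() and strip() as the hints suggest; the exact set of punctuation characters is a starting guess to refine against a real corpus:

    # OUT key: one word, lower-cased, punctuation stripped from each end
    def mapper(key, value):
        for word in key.split():
            stripped = word.lower().strip(".,?!;:\"'")   # starting guess at punctuation
            if stripped != '':              # skip tokens that were all punctuation
                Wmr.emit(stripped, '1')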
oview --
Overview of map-reduce techniques; context forwarding. The exercises above have presented several basic techniques of map-reduce computing, including the following:

- the essential elements of problem-solving with the map-reduce pattern, summarized as

  Basic idea of map-reduce computing:
  - Break data up into labelled pieces (mapper())
  - Gather and combine pieces that have the same label (reducer())
- using features of WMR, including
  - basic job configuration features provided by WMR, such as selecting a job name (used when saving a configuration), choice of programming language for mappers and reducers, requesting a certain number of reducers (use a single reducer in order to produce results that preserve the shuffler's sorting order), and choice of sorting (alphabetic vs. numeric),
  - specifying input data for a map-reduce job, which may originate as a preset cluster dataset, be specified using a cluster path, be uploaded from your own computer, or be entered directly into a text box,
  - entering definitions of the mapper() and reducer() functions that determine the map-reduce computation to perform on the input data,
  - performing a "Test" computation to try a mapper() and reducer() with small data, in order to verify and debug a map-reduce computation,
  - performing a "Submit" computation to run a map-reduce computation using Hadoop, which takes more time than "Test" but could be performed with billions, trillions, or even (given a large enough cluster) quadrillions of bytes of data, and
  - saving a configuration including input data, mapper(), and reducer() for later retrieval;
- using map-reduce computation to perform operations involving simple tallies, such as word frequency counting, finding mean ratings, or constructing an index;
- using IN/OUT specs to describe the mapper and reducer functions needed for a map-reduce computation;
- taking advantage of sorting during the shuffle phase of Hadoop computation to order the key-value pairs of a result;
- using multiple Hadoop cycles in order to perform multi-stage map-reduce computations, as required when sorting word frequencies by frequency rather than by word;
- using an identity mapper() or an identity reducer() when needed to solve a problem;
- grouping data values together under a single data value, for example, treating "Amen", "Amen"?, "Amen,", etc., all as occurrences of the word amen (we implemented this grouping by lowering letter case and stripping punctuation); and
- pre-tallying with available values within a call to a mapper().

The exercises below introduce several further techniques:

- context forwarding, in which information about a key-value pair (labelled piece of data) is included with that pair;
- structured values and structured keys, for packing multiple components of information into a key-value pair;
- multi-case mappers and multi-case reducers, which enable multiple kinds of input data or key-value pairs to be processed within a single map-reduce cycle; and
- broadcasting data values, which enables known elements of information to be distributed to many reducer() operations.
concordance --
[HC] Constructing a concordance. One of the most useful tools for humanists and linguists studying a corpus of text is a concordance or keyword-in-context (KWIC) index, which lists the words in a corpus together with the contexts in which each word appears in that corpus. For example, here is a concordance for the cat-hat test data set.

    The      The cat in the hat
    cat      The cat in the hat
    cat      to the cat hat party.
    hat      The cat in the hat
    hat      to the cat hat party.
    hat      wore the hat
    in       The cat in the hat
    party.   to the cat hat party.
    the      The cat in the hat
    the      to the cat hat party.
    the      wore the hat
    to       to the cat hat party.
    wore     wore the hat

Write mapper() and reducer() functions to create a concordance for an input corpus that is sorted by word, where the context presented for each word consists of the entire line containing that word.

Note: this is an example of the context-forwarding technique identified in the overview of map-reduce techniques above.
Hints:
- Here are IN/OUT specs:

  Mapper

      # IN keys and values:
      #   key: holds lines of text from the corpus   value: is empty
      # OUT keys and values:
      #   key: one word from that corpus   value: the entire line containing that word

  Reducer

      # IN keys and values:
      #   key: one word from a corpus   value: an entire line containing that word
      # OUT keys and values:
      #   key: that same word from that corpus   value: that same line of context

- As the IN/OUT spec indicates, the reducer() should be an identity reducer. (A sketch follows these hints.)
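A minimal sketch implementing these specs; each word forwards the entire line it came from as its context:

    def mapper(key, value):
        # key: one line of text; value: empty
        for word in key.split():
            Wmr.emit(word, key)        # forward the whole line as context

    def reducer(key, iter):
        # identity reducer: emit each context line for this word unchanged
        for line in iter:
            Wmr.emit(key, line)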
multibook --
[HC] Word frequencies per book. Whether comparing multiple works by a single author or analyzing books by different authors, being able to analyze the content of a multi-book corpus book by book is an essential skill for many research projects in the humanities and linguistics. We will start considering this kind of analysis by counting word frequencies per book for a corpus consisting of multiple books.

Implement these IN/OUT specs for a map-reduce job:

Mapper

    # IN keys and values:
    #   key: holds a one-word code for a book title   value: one line of that book
    # OUT keys and values:
    #   key: one word from a book, followed by a space and that book's one-word code
    #   value: a string '1'

Reducer

    # IN keys and values:
    #   key: one word from a book, followed by a space and that book's one-word code
    #   value: a string '1'
    # OUT keys and values:
    #   key: one word   value: a one-word code, followed by a space and the
    #        frequency of that word in that book

Example data:

    cathat   The cat in the hat
    cathat   wore the hat
    cathat   to the cat hat party.
    OwlCat   The owl and the pussy cat went to sea

Intermediate key-value pairs:

    The cathat   1
    cat cathat   1
    in cathat    1
    the cathat   1
    hat cathat   1
    ...
    The OwlCat   1
    owl OwlCat   1
    and OwlCat   1
    the OwlCat   1
    ...

Final key-value pairs emitted from the reducer:

    The   OwlCat 1
    The   cathat 1
    and   OwlCat 1
    cat   cathat 2
    cat   OwlCat 1
    in    cathat 1
    hat   cathat 3
    owl   OwlCat 1
    sea   OwlCat 1
    the   OwlCat 1
    the   cathat 3
    ...
Hints:

- Only one map-reduce cycle is needed.
- A reducer() call will have two arguments: a key containing a space, such as 'The cathat', and an iterator of values '1'. Add up the '1's just as you would for the word-count reducer. But you should emit something different than the word-count reducer emits:
  - use split() to break the two-word key into two pieces;
  - the key for your emit() call should only be the first piece from splitting;
  - the value argument for your emit() call should be the second piece from splitting (the book code), followed by a space, then the sum of the '1's.

  (A sketch follows these hints.)
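Following these hints, a minimal sketch of the mapper() and reducer(); case and punctuation handling from the earlier exercise is omitted to keep it short:

    def mapper(key, value):
        # key: one-word book code; value: one line of that book
        for word in value.split():
            Wmr.emit(word + ' ' + key, '1')     # two-part key: word and book code

    def reducer(key, iter):
        total = 0
        for s in iter:
            total = total + int(s)
        word, book = key.split()                # split the two-part key
        Wmr.emit(word, book + ' ' + str(total))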
struct --
Structured values and structured keys. In map-reduce computing, we solve problems using key-value pairs to hold data. If our data only involves one component, we can use an empty key or an empty value to represent the data. For example, for a corpus without line numbers or other labels, the text may exist entirely within the key argument of a mapper() call, and the value argument of that mapper() may always be empty.

What if the data has three or more components? In that case, we can pack multiple fields of data into a key and/or a value by using separator characters or other strategies within that key or value. The Netflix dataset provides an example of a multi-field key: commas separate the four fields (movie id, reviewer id, rating, and date of review) of the data. Using multiple fields is an example of structured keys and values. Other structuring strategies could be used besides fields.

The following exercises use structured keys and/or values.
-
2values --
[HC] Two fields within a value. For Netflix data, compute both the average rating per movie and the latest date of rating for each movie, using a single map-reduce cycle.

Mapper

    # IN keys and values:
    #   key: holds one line of Netflix data   value: empty

Reducer

    # OUT keys and values:
    #   key: a movie id
    #   value: mean of that movie's ratings to 2 dec. places, followed by a space,
    #          followed by the latest date when that movie was rated

For example, if the Netflix data lines are

    1,1596531,5,2004-01-23
    3,2318004,4,2004-02-05
    6,110641,5,2003-12-15
    8,1447639,4,2005-03-27
    8,2557899,5,2005-04-16
    6,52076,4,2004-10-05
    1,13651,3,2004-06-16
    1,1374216,2,2005-09-23

then the reducer should produce the following key-value pairs:

    1   3.33 2005-09-23
    3   4.00 2004-02-05
    6   4.50 2004-10-05
    8   4.50 2005-04-16
Hints

- The IN/OUT spec above only indicates the input for the mapper and the output from the reducer. What should the intermediate key-value pairs be?
  From each line of Netflix input, we need to extract the movie id, the movie rating, and the date of that rating. Since we want to organize the computation by movie id, the key should be the movie id for intermediate key-value pairs. The other two data elements can be packed into the value, separated by a space. Using this strategy for the example lines of data, the intermediate key-value pairs would be

      1   5 2004-01-23
      3   4 2004-02-05
      6   5 2003-12-15
      8   4 2005-03-27
      8   5 2005-04-16
      6   4 2004-10-05
      1   3 2004-06-16
      1   2 2005-09-23

  We can use this strategy to complete the key-value specs for the mapper and reducer:

  Mapper

      # IN keys and values:
      #   key: holds one line of Netflix data   value: empty
      # OUT keys and values:
      #   key: a movie id   value: a rating, then a space, then a date for that movie

  Reducer

      # IN keys and values:
      #   key: a movie id   value: a rating, then a space, then a date for that movie
      # OUT keys and values:
      #   key: a movie id
      #   value: mean of that movie's ratings to 2 dec. places, followed by a space,
      #          followed by the latest date when that movie was rated

- Use split() in the mapper() (with comma as a separator) to obtain the four fields from the input in Netflix format. Also use split() in the reducer() to separate each value's rating from its date.
- Use three accumulators in the reducer(): two for computing the mean rating (count and sum) and one for computing the latest date. (A sketch follows these hints.)
  Note: The Netflix dates are formatted so they can be compared using simple lexicographic order (< or > as str objects).
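A minimal sketch following these hints; the '%.2f' formatting is one way (an assumption, not dictated by the spec) to produce two decimal places:

    def mapper(key, value):
        # key: one line of Netflix data, "movieid,reviewerid,rating,date"
        movie, reviewer, rating, date = key.split(',')
        Wmr.emit(movie, rating + ' ' + date)

    def reducer(key, iter):
        count = 0                       # accumulator 1: number of ratings
        total = 0                       # accumulator 2: sum of ratings
        latest = ''                     # accumulator 3: latest date seen
        for value in iter:
            rating, date = value.split()
            count = count + 1
            total = total + int(rating)
            if date > latest:           # dates compare correctly as strings
                latest = date
        Wmr.emit(key, '%.2f' % (total / count) + ' ' + latest)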
manyvalues --
[HC] Many fields within a value. For Netflix data, compute the average rating per movie, the number of ratings, the highest rating, the lowest rating, the earliest date of rating, and the latest date of rating for each movie, using a single map-reduce cycle.

Mapper

    # IN keys and values:
    #   key: holds one line of Netflix data   value: empty

Reducer

    # OUT keys and values:
    #   key: a movie id
    #   value: the following fields for a movie, separated by spaces:
    #          mean rating to 2 dec. places; number of ratings;
    #          lowest rating; highest rating;
    #          earliest date rated; latest date rated

For example, if the Netflix data lines are

    1,1596531,5,2004-01-23
    3,2318004,4,2004-02-05
    6,110641,5,2003-12-15
    8,1447639,4,2005-03-27
    8,2557899,5,2005-04-16
    6,52076,4,2004-10-05
    1,13651,3,2004-06-16
    1,1374216,2,2005-09-23

then the reducer should produce the following key-value pairs:

    1   3.33 3 2 5 2004-01-23 2005-09-23
    3   4.00 1 4 4 2004-02-05 2004-02-05
    6   4.50 2 4 5 2003-12-15 2004-10-05
    8   4.50 2 4 5 2005-03-27 2005-04-16
Hints:

- Fill in the IN/OUT specs first, in order to determine the intermediate key-value pairs you will need.
- Use numerous accumulators in the reducer() to compute all the desired values.
trace --
[H] Tracing map-reduce. A trace of a WebMapReduce computation describes the computational process that a Hadoop map-reduce job performs for a given input, mapper(), and reducer() (see class notes for an example).

Note: No specs are needed for a trace of a WMR computation, just as no specs are needed for tracing a Python computation such as a recursive function call.

DO THIS: perform traces of the following WMR jobs, given input, mapper(), and reducer().
index_trace --
[H] Trace the following WMR computation (see example above), which produces a list of all line numbers containing a particular word in a corpus.

- Input:

      1   I am Sam.
      2   Sam I am, okay?

- Mapper:

      def mapper(key, value):
          lis = value.split()
          for word in lis:
              Wmr.emit(word.strip('.,!?;-'), key)

- Reducer:

      def reducer(key, iter):
          lines = ''
          for linenum in iter:
              lines = lines + linenum + ' '
          Wmr.emit(key, lines)
multicase --
Multi-case reducers and mappers. Exercises above with multi-field values show how to compute multiple values in a single map-reduce cycle by packing multiple quantities into the value of a key-value pair. Another strategy for computing multiple values during a single map-reduce cycle is to have multiple kinds of keys, and for a reducer to have multiple cases in order to perform different computations for different kinds of keys. The following exercises explore this useful technique. -
localglobal --
Per-book and overall frequencies. The exercise about word frequencies in multi-book data sets above computes word frequencies per book, but not the total word frequencies among all books. Intermediate key-value pairs are chosen to have the form

    word bookcode   1

in that problem, with a two-part key, in order to have a reducer call for each word appearing in a book. We could also compute the total frequency of a word over all books, by using intermediate key-value pairs

    word   1

If we need to know both a word's frequency per book and that word's total frequency over all books, we can compute both of those frequencies per word by including both kinds of key-value pairs. Perform this computation by implementing the following IN/OUT spec.

Mapper

    # IN keys and values:
    #   key: holds a one-word code for a book title   value: one line of that book
    # OUT keys and values (2 formats):
    #   1. key: one word from a book   value: string '1'
    #   2. key: one word from a book, a space, and a book title   value: string '1'

Reducer

    # IN keys and values (2 formats):
    #   key: one word from a book   value: string '1'
    #   key: one word from a book, a space, and a book title   value: string '1'
    # OUT keys and values (2 formats):
    #   key: one word   value: the total frequency of that word for all books
    #   key: one word   value: a book code, a space, and the frequency of that
    #        word in that book

Suggested test data set:

    cathat   The cat in the hat
    cathat   wore the hat
    cathat   to the cat hat party.
    OwlCat   The owl and the pussy cat went to sea

Expected output for that test data set:

    The   2
    The   OwlCat 1
    The   cathat 1
    and   1
    and   OwlCat 1
    cat   3
    cat   cathat 2
    cat   OwlCat 1
    in    1
    in    cathat 1
    hat   3
    hat   cathat 3
    owl   1
    owl   OwlCat 1
    sea   1
    sea   OwlCat 1
    the   4
    the   OwlCat 1
    the   cathat 3
    ...

Here, the overall count is indicated by output pairs such as "The 2" that do not mention a book code, and the other key-value pairs indicate per-book frequencies.

Hints:
- In your mapper, emit two pairs for each word: one that includes both that word and the book code in the key, and one whose key is only that word.
- In your reducer, apply split() to the key in order to produce a list of either one or two strings. When it comes time to emit a value from your reducer, use that list's length to determine whether to emit a two-field value (with book code and frequency) or a one-field value (frequency only). (A sketch follows these hints.)
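Following these hints, a minimal sketch of the two-format mapper() and reducer():

    def mapper(key, value):
        # key: one-word book code; value: one line of that book
        for word in value.split():
            Wmr.emit(word, '1')                # format 1: overall tally
            Wmr.emit(word + ' ' + key, '1')    # format 2: per-book tally

    def reducer(key, iter):
        total = 0
        for s in iter:
            total = total + int(s)
        parts = key.split()
        if len(parts) == 1:                    # one-word key: overall frequency
            Wmr.emit(key, str(total))
        else:                                  # word and book code: per-book frequency
            Wmr.emit(parts[0], parts[1] + ' ' + str(total))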
accum --
[HC] Accumulator keys. The standard word-count map-reduce computation computes the frequency of each word in a corpus, but not the total count of all words appearing in that corpus. For example, in the cathat.txt test set, the word-count computation determines that the word hat occurs three times, but does not report that there are 13 words total in that data set. Of course, knowing the total word count would be important for corpus analysis: finding three occurrences of hat among 13 total words is quite different than finding three occurrences out of 10,000 words.

We can compute the total word count in the same map-reduce cycle as the individual word frequencies by using an accumulator key, i.e., a special-purpose key for computing that total, alongside the individual words used as keys. By choosing as the accumulator key a key that can't possibly be a word, that special-purpose key can be handled specially in the reducer().

DO THIS: Modify the standard word-count mapper() and reducer() to count the total number of words as well as the frequency per word, by satisfying the following specs.

Mapper

    # IN keys and values:
    #   key: holds lines of text from the corpus   value: is empty
    # OUT keys and values (2 formats):
    #   1. key: one word from that text, after lowering case and stripping
    #           punctuation from each end of each word
    #      value: a string representing the number 1
    #   2. key: AAAA   value: a string representing the number 1

Reducer

    # IN keys and values (2 formats):
    #   key: one word from a corpus   value: a string representing the number 1
    #   key: AAAA   value: a string representing the number 1
    # OUT keys and values (2 cases):
    #   key: that one word from a corpus   value: the frequency of that word
    #   key: the string "TOTAL WORDS"   value: total number of words in the corpus

Suggested test data set:

    The cat in the hat
    wore the hat
    to the cat hat party.

Expected output for that test data set:

    TOTAL WORDS   13
    cat           2
    hat           3
    in            1
    party         1
    the           4
    to            1
    wore          1
Hints
- We chose 'AAAA' as the accumulator key because it is not a word, and because it will appear early in sorted order.
- Give your WMR job a name such as density-1 in the "Name" field of the "Job Info" section, since it represents the first phase of your solution to this problem.
- In the mapper(), lower the case of each word and strip punctuation from it, then emit each (modified) word and '1' as usual, but also emit AAAA and '1' for each word.
- In the reducer(), add up the '1' values in iter as you would for any key, including the special accumulator key AAAA. (Of course, this will require an accumulator variable inside the reducer.) Then, when you're ready to emit the frequency, emit TOTAL WORDS and the sum if the reducer key was AAAA, and emit the reducer key unchanged and the sum for all other keys. (A sketch follows these hints.)
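Following these hints, a minimal sketch of the accumulator-key mapper() and reducer(); the punctuation set is the same starting guess used in the strip exercise:

    def mapper(key, value):
        for word in key.split():
            stripped = word.lower().strip(".,?!;:\"'")
            if stripped != '':
                Wmr.emit(stripped, '1')     # format 1: one pair per word
                Wmr.emit('AAAA', '1')       # format 2: accumulator key, once per word

    def reducer(key, iter):
        total = 0
        for s in iter:
            total = total + int(s)
        if key == 'AAAA':                   # accumulator key: the total word count
            Wmr.emit('TOTAL WORDS', str(total))
        else:                               # ordinary key: one word's frequency
            Wmr.emit(key, str(total))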
tmp --
Draft trace worksheet.

Input = mapper IN:

    cat   The cat in the hat
    cat   wore the cat hat
    Owl   The owl and the cat

mapper():

    def mapper(key, value):
        lis = value.split()
        for word in lis:
            Wmr.emit(word.lower(), key)

(A) For the first input line:

    key    _____
    value  _______________
    lis    _______________
    word   _____

(B) mapper OUT

(C) Shuffle; reducer IN

reducer():

    def reducer(key, iter):
        count = {'cat': 0, 'Owl': 0}
        for val in iter:
            count[val] = count[val] + 1
        Wmr.emit(key, 'cat ' + str(count['cat']) + ', Owl ' + str(count['Owl']))

(D) For the call reducer('cat', ...):

    key  __________
    FILL IN FOR OTHER VARIABLES HERE

reducer OUT:

    and   cat 0, Owl 1
    cat   cat 2, Owl 1

(E) CONTINUE:
index --
Making an index. Index...
netflix --
Netflix computations. Netflix problems...