Hands-on 3: MapReduce

Due: 11:59pm March 6, 2019

Warmup

This assignment asks you to write a simple parallel program with the MapReduce library, using a single-machine Python implementation.

Download the following files: mapreduce.py and kjv12.txt.

If you'd prefer to use Python 3, not Python 2, download this version of mapreduce.py instead (and rename the file to mapreduce.py).
Then, run mapreduce:
  python mapreduce.py kjv12.txt
After running for a little while, the output should be as follows:
  and 12846
  i 8854
  god 4114
  israel 2574
  the 1842
  for 1743
  but 1558
  then 1374
  lord 1070
  o 1065
  david 1064
  jesus 977
  moses 847
  judah 816
  jerusalem 814
  he 754
  now 643
  so 622
  egypt 611
  behold 596

The output has two columns: the first is the lower-case form of a title-cased word that appears in the ASCII Bible, and the second is the number of times that word appears there. The output is trimmed to the top 20 results, sorted by descending word count.
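To make the output format concrete, here is a minimal single-process Python 3 sketch of the same kind of count. It is not the code in mapreduce.py: the function name top_title_words, the regular-expression tokenizer, and the use of str.istitle() to decide what counts as title-cased are all assumptions made for illustration, so its numbers on kjv12.txt may differ from those above.

```python
import re
from collections import Counter


def top_title_words(text, limit=20):
    """Return the `limit` most frequent title-cased words, lower-cased.

    Assumption: a "word" is a run of ASCII letters, and "title-cased"
    means str.istitle() is true (first letter upper-case, rest lower-case).
    """
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(word.lower() for word in words if word.istitle())
    return counts.most_common(limit)


if __name__ == '__main__':
    sample = "And God said, Let there be light: and God saw the light. And it was good."
    # Print one "word count" pair per line, most frequent first.
    for word, count in top_title_words(sample):
        print(word, count)
```

Note that the lower-case "and" in the sample is not counted: only occurrences that were title-cased in the text contribute to a word's total.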

Studying mapreduce.py

The program in mapreduce.py begins execution after the following statement:

  if __name__ == '__main__':

It creates an instance of the WordCount class using a few parameters. The last parameter comes from the command line and is the name of the file on which MapReduce will run. Immediately after initialization, the program calls run on the WordCount instance, and the Python MapReduce library runs the MapReduce algorithm using the map and reduce methods defined in the WordCount class.
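As a rough mental model of what "runs the MapReduce algorithm using the map and reduce methods" means, the sketch below walks through the three usual phases (map, group by key, reduce) in a single process, in Python 3. Everything in it is hypothetical: the names map_chunk, reduce_word, and run_mapreduce are invented for illustration, and the real WordCount class and library in mapreduce.py may split the input, pass arguments, and schedule work quite differently.

```python
import re
from collections import defaultdict


def map_chunk(chunk):
    """Map step (hypothetical): emit (word, 1) for each title-cased word in a chunk."""
    words = re.findall(r"[A-Za-z]+", chunk)
    return [(word.lower(), 1) for word in words if word.istitle()]


def reduce_word(word, values):
    """Reduce step (hypothetical): combine all values emitted for one key."""
    return (word, sum(values))


def run_mapreduce(chunks):
    """Run map, group-by-key, and reduce sequentially over a list of text chunks."""
    # Map phase: every chunk independently produces a list of (key, value) pairs.
    pairs = [pair for chunk in chunks for pair in map_chunk(chunk)]

    # Grouping phase: collect all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)

    # Reduce phase: each key is reduced independently of the others.
    results = [reduce_word(key, values) for key, values in grouped.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == '__main__':
    chunks = ["In the beginning God created", "And God said", "And the earth"]
    for word, count in run_mapreduce(chunks):
        print(word, count)
```

Because each call to map_chunk depends only on its own chunk, and each call to reduce_word only on its own key, both phases are natural candidates for running in parallel.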

You may find the Python Reference useful in answering the following questions. In particular, the sections on Multiprocessing and Process Pools may be useful.
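If you have not used process pools before, the following small Python 3 example shows the basic pattern from the standard multiprocessing module: a pool of worker processes applies one function to many inputs. The function names count_words and parallel_counts, the sample lines, and the choice of two workers are assumptions for illustration only. This is not how mapreduce.py is necessarily organized.

```python
from multiprocessing import Pool


def count_words(line):
    """Work done by one task: count whitespace-separated words in a line."""
    return len(line.split())


def parallel_counts(lines, workers=2):
    """Apply count_words to every line using a pool of worker processes."""
    # The pool is closed automatically when the with-block exits.
    with Pool(processes=workers) as pool:
        # Pool.map blocks until all tasks finish and keeps results in input order.
        return pool.map(count_words, lines)


if __name__ == '__main__':
    # The __main__ guard matters: worker processes may re-import this module,
    # and without the guard they could try to create pools of their own.
    lines = ["In the beginning", "God created the heaven and the earth"]
    print(parallel_counts(lines))
```

The worker function is defined at module level so that it can be pickled and sent to the worker processes.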

Questions

Now you're ready for this week's questions.

As before, the questions are in a read-only Google Doc. Make sure to enter your answers on the page indicated (please do not erase the question text) and upload them as a PDF to Gradescope. See more detailed instructions at the end of the first week's hands-on. If you are having Gradescope problems, please post a question on Piazza!