mincemeat.py: MapReduce on Python


Introduction

mincemeat.py is a Python implementation of the MapReduce distributed computing framework.

mincemeat.py is:

Download

Example

Let's look at the canonical MapReduce example, word counting:

example.py:
#!/usr/bin/env python 
import mincemeat

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ] 

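def mapfn(k, v):
    # k is the datasource key, v is one line of text; emit (word, 1) per word
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    # k is a word, vs holds every count emitted for it; return the total
    result = 0
    for v in vs:
        result += v
    return result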

s = mincemeat.Server() 

# The data source can be any dictionary-like object 
s.datasource = dict(enumerate(data)) 
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme") 
print(results)

Execute this script on the server:

python example.py

Run mincemeat.py as a worker on a client:

python mincemeat.py -p changeme [server address]

And the server will print out:

{'a': 2, 'on': 1, 'great': 1, 'Humpty': 3, 'again': 1, 'wall': 1, 'Dumpty': 2, 'men': 1, 
'had': 1, 'all': 1, 'together': 1, "King's": 2, 'horses': 1, 'All': 1, "Couldn't": 1,
'fall': 1, 'and': 1, 'the': 2, 'put': 1, 'sat': 1}
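
To make the output above easier to follow, here is a minimal single-process sketch of the map, group, and reduce steps applied to the same data. It does not use mincemeat.py at all, and run_locally is a hypothetical helper written for illustration only; it says nothing about how mincemeat.py actually schedules work across clients.

```python
from collections import defaultdict

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ]

def mapfn(k, v):
    # Emit (word, 1) for every word in one line of text
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    # Add up every count emitted for one word
    return sum(vs)

def run_locally(datasource, mapfn, reducefn):
    # Hypothetical helper: map every item, group values by key, then reduce
    grouped = defaultdict(list)
    for key in datasource:
        for out_key, out_value in mapfn(key, datasource[key]):
            grouped[out_key].append(out_value)
    return dict((k, reducefn(k, vs)) for k, vs in grouped.items())

results = run_locally(dict(enumerate(data)), mapfn, reducefn)
```

The word counts in results should match the dictionary printed by the server, although the key order may differ.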

This example is deliberately simple, but changing the datasource to a collection of large files and running the client on multiple machines will work just as well. In fact, mincemeat.py has been used to produce word frequency lists for more than 3 GB of text using a slightly modified version of this code.
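
As a rough sketch of the "collection of large files" idea, the class below presents a directory as a dictionary-like object whose keys are file names and whose values are file contents, read only when requested. FileDataSource is a hypothetical name, not part of mincemeat.py, and the sketch assumes the server only needs to iterate over the keys and look up the value for each key.

```python
import os


class FileDataSource(object):
    """Hypothetical dictionary-like view of a directory of text files."""

    def __init__(self, directory):
        self.directory = directory
        # Keys are the names of the regular files found in the directory
        self.names = sorted(
            name for name in os.listdir(directory)
            if os.path.isfile(os.path.join(directory, name))
        )

    def __iter__(self):
        # Iterating yields keys, as with a normal dict
        return iter(self.names)

    def __len__(self):
        return len(self.names)

    def __getitem__(self, name):
        # Read the file only when its value is requested
        if name not in self.names:
            raise KeyError(name)
        with open(os.path.join(self.directory, name)) as f:
            return f.read()
```

Under that assumption, assigning an instance to s.datasource in place of dict(enumerate(data)) should pass each file's contents to mapfn as v, with the file name as k.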

Documentation

Sorry! I don't have much documentation available at the moment because mincemeat.py is still in its early stages of development. In the meantime, feel free to contact me with any questions or suggestions at .

Roadmap

The following features will be included in mincemeat.py by version 1.0:

Contact

Get in touch with me at .

Patches are welcome, especially for the roadmapped features. It's best to contact me first to make sure that your planned work fits the goals of the project and has not already been started.