Home Blog Contact
fil RSSRSS Flux

Recent entries

Archives

MapReduce in Scheme

MapReduce is a useful algorithm to analyze massive amount of data. It is powerful because it dispatches jobs in parallel to multiple machines. When you have a cluster of server machine, you can put it to good use.

MapReduce was originally described by Google, but they never released their implementation. Later the Apache Software Foundation made available a concrete implementation named Hadoop. Part of Hadoop, HFS (Hadoop File System) is used to distribute data files across your cluster of machines.

The core of the algorithm is two function named map and reduce. The first takes one parameter and returns a pair containing a key and a value. The parameter could be any input value: it is usually a file. If you have many input files, you can run the map function in parallel on all of them. The result would be a list of key and value pairs.

The next step is to run the reduce operation. All the key/value pairs are grouped by their key, so that one reduce operation receives as input one key with all the values associated with it. The result of the MapReduce system is the aggregation of all the reduce jobs.

Here we will implement a simple MapReduce system in Scheme. It will not be distributed and made to run on a cluster of machines, instead it will be completely sequential and run on a single machine. The purpose here is to allow an easy experimentation of MapReduce algorithm. The main entry point of the MapReduce system is the mapreduce procedure. It takes the two map and reduce functions to use together with the input and returns the aggregated result. Here is its definition:

(define (fold-right init lst proc)
  (cond ((null? lst) init)
	((pair? lst)
	 (proc (car lst) (fold-right init (cdr lst) proc)))))

(define (add-entry! keys k+v)
  (let* ((entry (assq (car k+v) keys)))
    (if entry
	(begin 
	  (set-cdr! entry (cons (cdr k+v) (cdr entry)))
	  keys)
	(cons (cons (car k+v) (list (cdr k+v))) keys))))
    
(define (mapper map)
  (lambda (data keys+vals)
    (let ((key+val (map data)))
      (add-entry! keys+vals key+val))))

(define (mapreduce map* reduce* data)
  (map reduce* (fold-right '() data (mapper map*))))

That's it for the implementation. What can you compute with it ? The standard example given is word count for counting the occurence of words in documents. Instead here we will compute the intersection of sets. The map function produces the pair e,e of key value for all elements e of the sets to intersect. The reduce function produces the element e only if two elements are associated with the key. If two elements are associated with the key, it means one element comes from the first set and the second element from the second set. Otherwise nothing is produced as a result.

Here is the Scheme implementation:

  (define (intersect-map e) (cons e e))
  (define (intersect-reduce k vs)
    (let ((v (cdr k+v)))
      (if (= (length v) 2)
          (car v)
          #f)))

Let see if it works

  (mapreduce intersect-map intersect-reduce (append '(1 2 3 4 5) '( 3 4 5 6 7 8)))
  =>
  '(3 4 5)

Try to be creative now !

Comments

Add a comment

Name: required
URL: (optional)
Mail: required (will never be shown)
85+2= basic human verification