Monday, 9 September 2013

Efficient Ways to Load Large Data Sets


I am reading the MapReduce white paper from Google, and I want to know how
to pass GBs of data efficiently to a MapReduce algorithm. The paper shows
statistics for processing TBs of data in seconds. It says that to make this
work efficiently, the system reduces network calls and tries to perform
writes on local disks; only the reduce function makes remote calls and
writes the final output file. Now, if we load GBs of data into memory to
pass it to a map function, the data-loader application would certainly
run out of memory.
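
From what I understand of the paper, the trick is that no single process ever loads the whole data set: the input is cut into splits of roughly 16-64 MB, and each map worker streams only its own split from disk. Below is a minimal sketch of that idea in plain Java; the file path, split boundaries and the map function are placeholders of my own, not the paper's actual code.

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: a map worker reads only its assigned byte range (split) of a large
// file, streaming it record by record instead of loading GBs into memory.
public class SplitReader {

    // Reads the split [offset, offset + length) and feeds each line to map().
    // A real framework would also handle records that straddle split boundaries.
    static void processSplit(String path, long offset, long length) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset);
            long end = offset + length;
            String line;
            // readLine() streams one record at a time, so memory use stays constant.
            while (file.getFilePointer() < end && (line = file.readLine()) != null) {
                map(line);
            }
        }
    }

    // Placeholder map function (hypothetical): parse the record, emit key/value pairs, etc.
    static void map(String record) {
        // ... emit(key, value) ...
    }
}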
So my question is: what techniques should be used to load the data
efficiently, pass it to the scheduler applications for map and reduce
scheduling, and calculate the number of M (map) and R (reduce) pieces?
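
The paper does give a hint on the counts: M falls out of dividing the input size by the split size, and R is chosen by the user as a small multiple of the number of worker machines (the paper quotes M = 200,000 and R = 5,000 on 2,000 workers). A rough calculation with made-up numbers:

// Rough sketch of how M and R could be derived, following the paper's guidance:
// M = input size / split size (16-64 MB splits), R = small multiple of worker count.
public class PieceCalculator {
    public static void main(String[] args) {
        long inputBytes = 100L * 1024 * 1024 * 1024;  // assume 100 GB of input
        long splitBytes = 64L * 1024 * 1024;          // 64 MB per map split
        int workers = 200;                            // assumed cluster size

        long m = (inputBytes + splitBytes - 1) / splitBytes;  // number of map pieces
        long r = workers * 2L;                                // number of reduce pieces

        System.out.println("M (map splits)   = " + m);  // 1600 for these numbers
        System.out.println("R (reduce tasks) = " + r);  // 400
    }
}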
I would most probably be reading some data from an Oracle database and
updating it back into some other tables.
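
Since my data would come from Oracle, my current thinking is to stream rows out with a JDBC fetch size and push results back in batches, so the loader itself never materialises the whole table in memory (Hadoop also ships DBInputFormat/DBOutputFormat helpers for splitting a query across map tasks). The connection string, table and column names below are made up purely for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: stream rows out of Oracle in small fetch batches and write results
// back with batched updates, so the loader never holds the full data set in memory.
public class OracleBatchCopy {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection string, tables and columns.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password")) {
            conn.setAutoCommit(false);

            try (Statement select = conn.createStatement();
                 PreparedStatement insert = conn.prepareStatement(
                         "INSERT INTO results_table (id, score) VALUES (?, ?)")) {

                // The fetch size tells the driver to stream rows in chunks
                // instead of pulling the whole result set into memory at once.
                select.setFetchSize(1000);
                ResultSet rs = select.executeQuery("SELECT id, payload FROM source_table");

                int pending = 0;
                while (rs.next()) {
                    long id = rs.getLong("id");
                    String payload = rs.getString("payload");

                    insert.setLong(1, id);
                    insert.setInt(2, process(payload));  // map-style transformation
                    insert.addBatch();

                    if (++pending == 1000) {             // flush every 1000 rows
                        insert.executeBatch();
                        conn.commit();
                        pending = 0;
                    }
                }
                if (pending > 0) {
                    insert.executeBatch();
                    conn.commit();
                }
            }
        }
    }

    // Placeholder for whatever per-record work the map step would do.
    static int process(String payload) {
        return payload == null ? 0 : payload.length();
    }
}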
URL to the white paper:
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf
