Hadoop
Hadoop is a framework that helps us store and process large datasets in parallel and in a distributed fashion.
Let’s focus on the store and process parts of Hadoop.
Store
The storage part of Hadoop is handled by HDFS, i.e. the Hadoop Distributed File System. It provides high availability across a distributed ecosystem. It works like this: it breaks the incoming data into blocks, replicates them, and distributes the copies to different nodes in a cluster, giving us fault-tolerant, distributed storage.
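The chunk-and-distribute idea can be sketched in a few lines of plain Python. This is illustrative only, not the HDFS API: the block size, node names, and placement policy below are simplified assumptions (HDFS uses 128 MB blocks by default and a rack-aware placement policy, with 3 replicas per block).

```python
# Illustrative sketch (NOT the HDFS API): split data into fixed-size
# blocks and assign each block to several nodes, mimicking how HDFS
# chunks and replicates data across a cluster.

BLOCK_SIZE = 8            # bytes per block here; HDFS defaults to 128 MB
REPLICATION = 3           # HDFS keeps 3 copies of each block by default
NODES = ["node1", "node2", "node3", "node4"]   # hypothetical cluster

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Break the incoming data into fixed-size chunks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (round-robin)."""
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [nodes[(idx + r) % len(nodes)]
                          for r in range(replication)]
    return placement

data = b"hello hadoop distributed file system"
blocks = split_into_blocks(data)
placement = place_blocks(blocks)
print(len(blocks), placement[0])
```

Losing one node is survivable because every block still has copies on other nodes; that replication is what gives HDFS its high availability.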
Process
MapReduce is the heart of Hadoop processing. It performs two key tasks: map and reduce. The mappers break the job into smaller tasks, which are processed in parallel across the cluster. Once all the mappers have done their share of the work, their intermediate results are grouped together, and the Reduce phase combines them into a simpler final result. To learn more about Hadoop, you can go through our Hadoop Tutorial blog series.
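The map, group, and reduce steps can be sketched with a classic word-count example in plain Python. This is a minimal sketch of the programming model only; a real Hadoop job would implement `Mapper` and `Reducer` classes against the Hadoop MapReduce API and run across many nodes.

```python
from collections import defaultdict

def mapper(line):
    # Map: emit a (word, 1) pair for every word in an input split
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Shuffle: group all emitted values by their key before reducing
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    # Reduce: collapse each key's list of values into a single result
    return (word, sum(counts))

lines = ["Hadoop stores data", "Hadoop processes data in parallel"]
mapped = [pair for line in lines for pair in mapper(line)]
reduced = dict(reducer(w, c) for w, c in shuffle(mapped).items())
print(reduced)
```

Each line plays the role of an input split: mappers run on splits independently (hence in parallel), and the reducers only see values already grouped by key.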
If we use Hadoop as our storage layer in Data Science, it becomes difficult to process that data with RStudio, since R does not perform well in a distributed environment; hence we have SparkR.