Hadoop

Apache Hadoop is a software framework and distributed file system that enables distributed processing of petabyte-scale datasets across thousands of computers (nodes).

Computation is distributed through the 'MapReduce' programming model, originally developed at Google. The master node of the cluster subdivides the input into smaller sub-problems and distributes them across the network of worker nodes (it 'maps' the task into subtasks and deploys them). The worker nodes return their answers to the master node asynchronously, and the master combines these partial results into the answer to the original problem (it 'reduces' the results into a single solution).
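
As a concrete illustration, below is a minimal sketch of the model using Hadoop's Java MapReduce API: the canonical word-count job, in which the map step emits (word, 1) pairs and the reduce step sums the counts for each word. Class names and input/output paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: split each input line into words and emit (word, 1).
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step: sum the counts emitted for each word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation on each worker
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}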

Hadoop uses the Hadoop Distributed File System (HDFS), based on the Google File System (GFS), to allow distributed processing and storage of very large data files, up to petabytes in size. The filesystem automatically splits files into blocks and distributes them across the available nodes in the Hadoop cluster. HDFS provides data redundancy and data locality: data redundancy means blocks are replicated, so nodes can come and go from the cluster and the filesystem remains intact; data locality means a worker node's 'mapped' sub-problem can be computed against a block of the input file stored on that node, avoiding unnecessary network transfer.
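
The sketch below shows programmatic access to HDFS through Hadoop's Java FileSystem API, assuming a cluster reachable at the hypothetical address hdfs://namenode:9000; the file path is also illustrative. Writing a file lets HDFS split it into blocks and replicate each block across DataNodes; reading it back is served from whichever nodes hold replicas.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally taken from core-site.xml on the cluster.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);

    // Write a file: HDFS splits it into blocks and replicates each block
    // across the cluster's DataNodes (three copies by default).
    Path file = new Path("/user/example/hello.txt");
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("Hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back: blocks are streamed from whichever nodes hold replicas.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, conf, false);
    }

    fs.close();
  }
}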

Category: Cloud computing; Text mining
Availability: Online
Other software required: None
Difficulty: Advanced
User Community: Community support, online documentation, and introductory literature.
Active Development: Active development
Purpose: Multifunctional
Operating System: Windows, Mac, Unix
System Requirements: Unix, Java, and substantial hardware (though it can also run as a virtual cluster on a single machine).