Apache Hadoop

Apache Hadoop is a framework for distributed storage and processing of large data sets across clusters of commodity hardware. Doug Cutting and Mike Cafarella created it in 2006, inspired by Google's published papers on the Google File System and MapReduce. Hadoop made it practical to store and analyze petabytes of data without specialized hardware — commodity servers with local disks replaced expensive storage arrays.

The current release is Hadoop 3.4. The framework consists of three core components: HDFS (Hadoop Distributed File System) for fault-tolerant storage with data replicated across nodes, YARN (Yet Another Resource Negotiator) for cluster resource management and job scheduling, and MapReduce for batch processing. In modern deployments, Apache Spark typically replaces MapReduce as the processing engine while still using HDFS for storage and YARN for resource management.

The official documentation covers cluster setup, HDFS administration, and YARN configuration. The source code is on GitHub under the Apache 2.0 license.

hadoop.apache.org

Related technologies

Apache Spark

Let's talk

Every project starts with a free, no-strings-attached conversation. Tell us where you are and where you want to go.