Apache Hadoop 3.3 distributed data processing platform released

- Shero King

Aug 2, 2020


After a year and a half of development, the Apache Software Foundation has released Apache Hadoop 3.3.0, a free platform for distributed processing of large data sets using the map/reduce paradigm, in which a job is split into many smaller, independent fragments, each of which can run on a separate cluster node. Hadoop-based storage can span thousands of nodes and hold exabytes of data.
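The map/reduce idea described above can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the "nodes" are just loop iterations.

```python
from collections import defaultdict
from itertools import chain


def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into one result."""
    return key, sum(values)


def word_count(fragments):
    # Each fragment is independent of the others, so on a real cluster
    # every map_phase call could run on a different node. Here they
    # simply run one after another in the same process.
    mapped = chain.from_iterable(map_phase(f) for f in fragments)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    fragments = ["big data big cluster", "data node", "big node"]
    print(word_count(fragments))
    # prints {'big': 3, 'data': 2, 'cluster': 1, 'node': 2}
```

The point of the split is that the map step needs no shared state: only the shuffle step has to bring together results that came from different fragments.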

Hadoop includes the Hadoop Distributed File System (HDFS), which automatically keeps redundant copies of data and is optimized for MapReduce applications. To simplify access to data stored in Hadoop, the project has produced the HBase database and the SQL-like Pig language, a kind of SQL for MapReduce whose queries can be parsed and processed by several Hadoop platforms. The project is considered fully stable and ready for production use. Hadoop is actively used in large production projects, providing capabilities similar to Google's Bigtable/GFS/MapReduce platform; Google has officially granted Hadoop and other Apache projects the right to use technologies covered by its patents related to the MapReduce method.

Hadoop ranks first among Apache repositories by number of changes made and fifth by codebase size (about 4 million lines of code). Major Hadoop deployments include storage at Netflix (more than 500 billion events per day), Twitter (a 10,000-node cluster stores more than a zettabyte of data in real time and processes more than 5 billion sessions per day), and Facebook (a 4,000-node cluster stores more than 300 petabytes and grows by 4 PB per day).

Main changes in Apache Hadoop 3.3:

Added support for ARM-based platforms.
The Protobuf (Protocol Buffers) implementation used to serialize structured data has been updated to version 3.7.1 because the protobuf-2.5.0 branch has reached end of life.
The S3A connector has been enhanced: support for token-based authentication (Delegation Tokens) was added, caching of 404 responses was improved, S3Guard performance was increased, and reliability was improved.
Problems with automatic tuning in the ABFS file system have been resolved.
Added built-in support for the Tencent Cloud COS file system for accessing COS object storage.
Added full support for Java 11.
The HDFS RBF (Router-Based Federation) implementation has been stabilized, and security management tools have been added to the HDFS Router.
Added a DNS resolution service that lets a client determine servers via DNS by host name, without listing every host in the configuration.
Added support for scheduling opportunistic containers through the central ResourceManager, including the ability to distribute containers according to the load on each node.
Added a searchable YARN (Yet Another Resource Negotiator) application catalog.
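
The load-based distribution of containers mentioned in the list above can be pictured with a toy placement routine. This is only an illustrative sketch, not YARN's scheduler or its API: the function `place_containers`, the node names, and the use of "containers already running" as the load metric are all assumptions made for the example.

```python
import heapq


def place_containers(node_loads, num_containers):
    """Assign each container to the currently least-loaded node.

    node_loads maps a node name to the number of containers it already
    runs (a stand-in for whatever load metric a real scheduler uses).
    Returns a list of node names, one per placed container.
    """
    if not node_loads:
        return []  # no nodes available, so nothing can be placed

    # Min-heap of (load, node): the least-loaded node is always on top.
    heap = [(load, node) for node, load in node_loads.items()]
    heapq.heapify(heap)

    placements = []
    for _ in range(num_containers):
        load, node = heapq.heappop(heap)
        placements.append(node)
        # The chosen node now runs one more container.
        heapq.heappush(heap, (load + 1, node))
    return placements


if __name__ == "__main__":
    loads = {"node-a": 3, "node-b": 0, "node-c": 1}
    print(place_containers(loads, 4))
    # prints ['node-b', 'node-b', 'node-c', 'node-b']
```

A central component that sees the load of every node can make this kind of choice directly, which is the advantage of scheduling through a single resource manager rather than letting each node decide on its own.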