Introduction to Hadoop YARN

Apache Hadoop 2.0 introduced Hadoop YARN (Yet Another Resource Negotiator), a framework for job scheduling and cluster resource management and negotiation.

YARN is a general-purpose scheduling framework for distributed applications. It was initially aimed at improving MapReduce job management, but quickly grew to support non-MapReduce applications equally well, such as Spark on YARN.

YARN comes with two components, ResourceManager and NodeManager, that run on their own machines.

  • ResourceManager is the master daemon that communicates with YARN clients, tracks resources on the cluster (on NodeManagers), and orchestrates work by assigning tasks to NodeManagers. It coordinates the work of ApplicationMasters and NodeManagers.

  • NodeManager is a worker process that offers the resources of its host (memory and CPUs) as resource containers. It launches and tracks the processes spawned in those containers.

  • Containers run tasks, including ApplicationMasters. YARN offers container allocation.

YARN currently defines two resources: vcores and memory. A vcore (virtual core) is a usage share of a CPU core.
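As a minimal illustration, these two resources are expressed in the YARN API as a Resource record. The values below are arbitrary, and getMemorySize assumes Hadoop 2.8 or later:

```scala
import org.apache.hadoop.yarn.api.records.Resource

// A resource capability expressed in YARN's two dimensions:
// 2048 MB of memory and 2 vcores (usage shares of physical CPU cores).
val capability: Resource = Resource.newInstance(2048, 2)

println(s"memory=${capability.getMemorySize} MB, vcores=${capability.getVirtualCores}")
```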

The YARN ResourceManager keeps track of the cluster’s resources, while each NodeManager tracks its local host’s resources.

YARN can optionally work with two other components:

  • History Server for job history

  • Proxy Server for viewing application status and logs from outside the cluster.

YARN ResourceManager accepts application submissions, schedules them, and tracks their status (through ApplicationMasters). A YARN NodeManager registers with the ResourceManager and provides its local CPUs and memory for resource negotiation.

In a real YARN cluster, there is one ResourceManager (two for High Availability) and multiple NodeManagers.

YARN ResourceManager

The YARN ResourceManager manages the global assignment of compute resources (memory, CPU, disk, network, etc.) to applications.

YARN NodeManager

  • Each NodeManager tracks its own local resources and communicates its resource configuration to the ResourceManager, which keeps a running total of the cluster’s available resources (see the configuration sketch after this list).

    • By keeping track of the total, the ResourceManager knows how to allocate resources as they are requested.
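The resources a NodeManager offers come from its local YARN configuration (yarn-site.xml). The following Scala sketch, assuming the stock yarn.nodemanager.resource.* properties and their default values, shows how those settings can be read through YarnConfiguration:

```scala
import org.apache.hadoop.yarn.conf.YarnConfiguration

// YarnConfiguration picks up yarn-site.xml (and yarn-default.xml) from the classpath.
val conf = new YarnConfiguration()

// The memory (MB) and vcores this NodeManager offers to the ResourceManager.
// 8192 MB and 8 vcores are the stock defaults; real values come from yarn-site.xml.
val memoryMb = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192)
val vcores   = conf.getInt("yarn.nodemanager.resource.cpu-vcores", 8)

println(s"This NodeManager offers $memoryMb MB of memory and $vcores vcores")
```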

YARN ApplicationMaster

  • An application is a YARN client program that is made up of one or more tasks.

  • For each running application, a special piece of code called an ApplicationMaster helps coordinate tasks on the YARN cluster. The ApplicationMaster is the first process run after the application starts.

  • An application in YARN comprises three parts:

    • The application client, which is how a program is run on the cluster.

    • An ApplicationMaster, which provides YARN with the ability to perform resource allocation on behalf of the application.

    • One or more tasks that do the actual work (each running in a process) in the containers allocated by YARN.

  • An application running tasks on a YARN cluster consists of the following steps (a minimal client-side sketch follows the list):

    • The application starts and talks to the ResourceManager (running on the master) for the cluster.

    • The ResourceManager makes a single container request on behalf of the application.

    • The ApplicationMaster starts running within that container.

    • The ApplicationMaster requests subsequent containers from the ResourceManager that are allocated to run tasks for the application. Those tasks do most of the status communication with the ApplicationMaster.

    • Once all tasks are finished, the ApplicationMaster exits. The last container is de-allocated from the cluster.

    • The application client exits. (The ApplicationMaster launched in a container is more specifically called a managed AM).

  • The ResourceManager, NodeManager, and ApplicationMaster work together to manage the cluster’s resources and ensure that the tasks, as well as the corresponding application, finish cleanly.
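The client side of these steps can be sketched with the YarnClient API. This is a simplified, hedged sketch: the application name and the ApplicationMaster launch command (my-application-master) are hypothetical, and details such as local resources and the launch environment are omitted.

```scala
import org.apache.hadoop.yarn.api.records.{ApplicationSubmissionContext, ContainerLaunchContext, Priority, Resource}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.util.Records

val conf = new YarnConfiguration()

// Step 1: the application client talks to the ResourceManager.
val yarnClient = YarnClient.createYarnClient()
yarnClient.init(conf)
yarnClient.start()

// Step 2: ask the ResourceManager for a single container to run the ApplicationMaster.
val app = yarnClient.createApplication()
val appContext: ApplicationSubmissionContext = app.getApplicationSubmissionContext
appContext.setApplicationName("my-yarn-app")
appContext.setResource(Resource.newInstance(1024, 1))   // AM container: 1 GB, 1 vcore
appContext.setPriority(Priority.newInstance(0))

// The command that starts the ApplicationMaster inside its container
// (hypothetical command; environment and local resources are omitted here).
val amContainer = Records.newRecord(classOf[ContainerLaunchContext])
amContainer.setCommands(java.util.Collections.singletonList("my-application-master"))
appContext.setAMContainerSpec(amContainer)

// Submit; the ResourceManager schedules the AM container on some NodeManager.
val appId = yarnClient.submitApplication(appContext)
println(s"Submitted application $appId")
```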

YARN’s Model of Computation (aka YARN components)

ApplicationMaster is a lightweight process that coordinates the execution of tasks of an application and asks the ResourceManager for resource containers for tasks.

It monitors tasks, restarts failed ones, etc. It can run any type of task, be it a MapReduce task or a Spark task.

An ApplicationMaster is like a queen bee that starts creating worker bees (in their own containers) in the YARN cluster.
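A minimal sketch of what a managed ApplicationMaster does once it starts in its container, assuming the AMRMClient API; the resource sizes are arbitrary, and a real ApplicationMaster would call allocate() in a heartbeat loop and launch tasks via NMClient:

```scala
import org.apache.hadoop.yarn.api.records.{FinalApplicationStatus, Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import org.apache.hadoop.yarn.conf.YarnConfiguration

val conf = new YarnConfiguration()

// The ApplicationMaster registers itself with the ResourceManager...
val amrmClient = AMRMClient.createAMRMClient[ContainerRequest]()
amrmClient.init(conf)
amrmClient.start()
amrmClient.registerApplicationMaster("", 0, "")   // host, port, tracking URL omitted

// ...then asks for containers for its tasks (the "worker bees").
val taskResource = Resource.newInstance(1024, 1)
for (_ <- 1 to 3) {
  amrmClient.addContainerRequest(
    new ContainerRequest(taskResource, null, null, Priority.newInstance(1)))
}

// allocate() sends pending requests and returns newly allocated containers;
// a real ApplicationMaster calls it repeatedly as a heartbeat.
val allocated = amrmClient.allocate(0.1f).getAllocatedContainers
println(s"Got ${allocated.size} containers so far")

// When all tasks are finished, the ApplicationMaster unregisters and exits.
amrmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "")
```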

Others

  • A host is the Hadoop term for a computer (also called a node, in YARN terminology).

  • A cluster is two or more hosts connected by a high-speed local network.

    • It can technically also be a single host used for debugging and simple testing.

    • Master hosts are a small number of hosts reserved to control the rest of the cluster. Worker hosts are the non-master hosts in the cluster.

    • A master host is the communication point for a client program. A master host sends the work to the rest of the cluster, which consists of worker hosts.

  • The YARN configuration file is an XML file that contains properties. This file is placed in a well-known location on each host in the cluster and is used to configure the ResourceManager and NodeManager. By default, this file is named yarn-site.xml.

  • A container in YARN holds resources on the YARN cluster.

    • A container hold request consists of vcores and memory.

  • Once a hold has been granted on a host, the NodeManager launches a process called a task.

  • Distributed Cache for application jar files.

  • Preemption (for high-priority applications)

  • Queues and nested queues

  • User authentication via Kerberos

Hadoop YARN

  • YARN could be considered the cornerstone of a Hadoop OS (operating system) for big distributed data, with HDFS as the storage layer and YARN as the process scheduler.

  • YARN is essentially a container system and scheduler designed primarily for use with a Hadoop-based cluster.

  • The containers in YARN are capable of running various types of tasks.

  • Its key abstractions are the ResourceManager, NodeManager, container, ApplicationMaster, and jobs.

  • Hadoop has traditionally focused on data storage and offline batch analysis.

  • Hadoop is a storage and compute platform:

    • MapReduce is the computing part.

    • HDFS is the storage.

  • Hadoop also provides a resource and cluster manager (YARN).

  • Spark runs on YARN clusters, and can read from and save data to HDFS.

  • Spark needs a distributed file system, and HDFS (or Amazon S3, though slower) is a great choice.

  • HDFS allows for data locality.

  • Excellent throughput when Spark and Hadoop are both distributed and co-located on the same (YARN or Mesos) cluster nodes.

  • HDFS offers (important for initial loading of data):

    • high data locality

    • high throughput when co-located with Spark

    • low latency because of data locality

    • very reliable because of replication

  • When reading data from HDFS, each InputSplit maps to exactly one Spark partition (see the sketch after this list).

  • HDFS distributes files across DataNodes; when a file is stored on the filesystem, it is split into partitions.
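As a minimal, hedged illustration of this mapping (the HDFS path is hypothetical), reading a file from HDFS in Spark yields one RDD partition per InputSplit:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// When submitted with --master yarn, the driver negotiates executors through YARN.
val conf = new SparkConf().setAppName("hdfs-partitions")
val sc = new SparkContext(conf)

// Each InputSplit of the file (typically one HDFS block) becomes one RDD partition.
val lines = sc.textFile("hdfs:///data/events.log")   // hypothetical path
println(s"Number of partitions: ${lines.getNumPartitions}")
```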

ContainerExecutors

LinuxContainerExecutor and Docker

YARN-3611 Support Docker Containers In LinuxContainerExecutor is an umbrella JIRA issue for Hadoop YARN to support Docker natively.
