YARN is a framework for job scheduling and cluster resource management, introduced in Hadoop 2 to support generic applications. Hadoop is no longer tied to MapReduce alone for data processing: a Hadoop YARN cluster can also run MPI applications (mpich2-yarn) and graph processing (Apache Giraph, which follows Google's Pregel model).
Tip
Duration 1 hour
The Apache Software Foundation (ASF) describes YARN as the next generation of MapReduce, or MapReduce 2.0 (MRv2). The main change in MRv2 is the separation of resource management from the MapReduce programming model. MRv2 consists of a single master ResourceManager (RM), one slave NodeManager (NM) per cluster node, and an ApplicationMaster (AM) per application. Under YARN, MapReduce is just one type of application, running in YARN containers. YARN is an abbreviation for Yet Another Resource Negotiator, which was first proposed in January 2008 in https://issues.apache.org/jira/browse/MAPREDUCE-279
The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints such as capacities and queues. It decides which tasks run where and when. YARN ships with several pluggable schedulers:
Capacity - Allocates resources to pools (queues), with FIFO scheduling within each pool. The CapacityScheduler supports hierarchical queues to allow for more predictable sharing of cluster resources.
Fair - Allocates resources to weighted pools, with fair sharing within each pool.
FIFO - Allocates resources based on arrival time.
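The scheduler is chosen in yarn-site.xml. As a rough sketch (the property name is standard, but whether you want the Fair or Capacity scheduler depends on your cluster), switching to the Fair Scheduler looks like this:

<!-- yarn-site.xml: pick the scheduler implementation (FairScheduler shown as an example) -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>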
The second part of the ResourceManager, called the Applications Manager, receives job submissions and negotiates the first container for launching the Application Master. The Applications Manager handles failures of the Application Master, while the Application Master handles failures of job containers. The Application Master is thus an application-specific container charged with managing the containers that run the actual tasks of the job.
A Resource Container is an allocation of resources on a single node: memory, CPU, disk, network, etc.
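As an illustration, the resources a node offers to the scheduler, and the largest container it will grant, are set in yarn-site.xml; the values below are only placeholder examples, not recommendations:

<!-- yarn-site.xml: resources a NodeManager advertises (example values) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>
<!-- largest single container the scheduler will grant (example value) -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>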
NodeManager (NM) is the YARN service that manages resources and deployment on a cluster node. The NM replaces the TaskTracker of MRv1 and is responsible for launching containers, each of which can house a map or reduce task. The NM is YARN's per-node agent and takes care of an individual compute node in a Hadoop cluster. This includes keeping up to date with the ResourceManager (RM), overseeing the containers' life cycle, monitoring resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and running auxiliary services that may be exploited by different YARN applications (see the configuration sketch below).
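A common example of such an auxiliary service is the MapReduce shuffle handler. A typical yarn-site.xml entry for it looks like the following sketch:

<!-- yarn-site.xml: register the MapReduce shuffle as a NodeManager auxiliary service -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>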
An ApplicationMaster (AM) manages the execution of an application (for example, running shell commands) on a set of launched containers using the YARN framework. The AM is started in a container by the ResourceManager's launcher. The first thing the AM needs to do is connect and register itself with the ResourceManager (RM). Registration tells the RM which host:port the AM is listening on to provide functionality to clients, as well as a tracking URL that a client can use to follow status and job history if needed. The AM must send heartbeats to the RM at regular intervals to signal that it is up and alive.
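A minimal sketch of this registration step, using YARN's Java client API (the host name, port, and tracking URL are placeholders, error handling is omitted, and the class only does something useful when launched by the RM as an AM):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleAppMaster {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();

    // Client used by the AM to talk to the ResourceManager
    AMRMClient<AMRMClient.ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();

    // Register: tell the RM where the AM listens and where its tracking UI lives
    rmClient.registerApplicationMaster("am-host.example.com", 0, "http://am-host.example.com:8080/");

    // ... request containers with rmClient.addContainerRequest(...) and call
    // rmClient.allocate(progress) periodically; each allocate() call also
    // serves as the heartbeat to the RM ...

    // Deregister when the application finishes
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", null);
    rmClient.stop();
  }
}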
HDFS:
hdfs dfsadmin -report
YARN:
yarn node -list
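A few other yarn subcommands are useful when exploring a cluster (the application ID below is a placeholder, and yarn logs requires log aggregation to be enabled):

yarn node -list -all                      # include lost/unhealthy nodes, not just running ones
yarn application -list                    # applications currently submitted or running
yarn logs -applicationId <application_id> # aggregated logs of a finished application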
YARN ResourceManager: 8088
Try to open a web browser with a master node address (IP or hostname):
http://[node address]:8088
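The same information is also exposed through the ResourceManager REST API, for example (replace the node address as above):

curl http://[node address]:8088/ws/v1/cluster/info     # ResourceManager and cluster info as JSON
curl http://[node address]:8088/ws/v1/cluster/metrics  # aggregate cluster metrics
curl http://[node address]:8088/ws/v1/cluster/apps     # list of applications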
YARN NodeManager: 8042
This applies to every slave node:
http://[slave address]:8042
Cloudera and Hortonworks are major Hadoop vendors; both provide helpful documentation on Hadoop development.