Apache Oozie
Apache Oozie is a scheduler system for running and managing Hadoop jobs in a distributed environment. It lets you combine multiple complex jobs that run in sequence to accomplish a larger task. Within a sequence of tasks, two or more jobs can also be programmed to run in parallel.
One of the main advantages of Oozie is that it is tightly integrated with the Hadoop stack: it supports Hadoop jobs such as Hive, Pig, and Sqoop, as well as system-specific jobs such as Java programs and shell scripts.
Oozie detects completion of tasks through callback and polling. When Oozie starts a task, it provides a unique callback HTTP URL to the task, and the task notifies that URL when it completes. If the task fails to invoke the callback URL, Oozie can poll the task for completion.
The following three types of jobs are common in Oozie:
- Oozie Workflow Jobs - These are represented as Directed Acyclic Graphs (DAGs) to specify a sequence of actions to be executed.
- Oozie Coordinator Jobs - These consist of workflow jobs triggered by time and data availability.
- Oozie Bundle - A package of multiple coordinator and workflow jobs.
Oozie Workflow
A workflow always starts with a Start tag and ends with an End tag.
Oozie Components
Workflow Jobs
A workflow in Oozie is a sequence of actions arranged in a control dependency DAG (Directed Acyclic Graph).
The actions are in a control dependency because the next action can run only after the current action completes, and which action runs next depends on the outcome of the current one.
We can run multiple jobs using the same workflow by using multiple .properties files (one properties file for each job).
Suppose we want to change the jobtracker URL, the script name, or the value of a param. We can specify these in a config file (.properties) and pass it when running the workflow.
Note - The properties file should be on the local file system of the edge node (not in HDFS), whereas the workflow and Hive scripts should be in HDFS.
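As a rough sketch of such a config file: Oozie's -config option accepts either a Java .properties file or a Hadoop XML configuration file, and the XML form is shown here to match the other samples on this page. The property names are the ones used by the sample workflow on this page plus oozie.wf.application.path; every host name, port, path, and value below is a hypothetical placeholder, not taken from a real cluster.

```xml
<!-- Hypothetical job configuration; all values are placeholders -->
<configuration>
   <property>
      <name>nameNode</name>
      <value>hdfs://namenode.example.com:8020</value>
   </property>
   <property>
      <name>jobTracker</name>
      <value>resourcemanager.example.com:8032</value>
   </property>
   <!-- Values substituted into ${script_name_copy} and ${database} in workflow.xml -->
   <property>
      <name>script_name_copy</name>
      <value>copydata.hive</value>
   </property>
   <property>
      <name>database</name>
      <value>sales_db</value>
   </property>
   <!-- HDFS directory that contains workflow.xml -->
   <property>
      <name>oozie.wf.application.path</name>
      <value>hdfs://namenode.example.com:8020/apps/simple-workflow</value>
   </property>
</configuration>
```

To run a second job with the same workflow, one would keep workflow.xml unchanged and supply another config file with different values for script_name_copy or database.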
What are the different control flow nodes supported by Apache Oozie workflows that control the workflow execution path?
In addition to the start, end, and kill control nodes, Apache Oozie workflows support the following:
1. Decision Control Node - The decision control node is like a switch-case statement, which enables a workflow to make a selection on the execution path to follow.
2. Fork and Join Control Node - In scenarios where we want to run multiple jobs in parallel, we can use a fork. Fork and join work together: the join is the end node of the fork and waits until all of the fork's parallel paths have completed. For each fork there should be a join, because the join assumes that all of its incoming nodes are children of a single fork.
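The two control nodes above can be sketched in a single workflow. This is a minimal illustration, not a production workflow: the node names, the inputDir and outputDir properties, and the choice of simple fs actions are all assumptions made for the example, and outputDir is assumed to be a fully qualified HDFS path.

```xml
<workflow-app xmlns = "uri:oozie:workflow:0.4" name = "control-nodes-sketch">
   <start to = "check_input" />

   <!-- Decision node: cases are evaluated in order, the first true one is taken -->
   <decision name = "check_input">
      <switch>
         <case to = "parallel_steps">${fs:exists(inputDir)}</case>
         <default to = "end" />
      </switch>
   </decision>

   <!-- Fork: starts both paths at the same time -->
   <fork name = "parallel_steps">
      <path start = "make_dir_a" />
      <path start = "make_dir_b" />
   </fork>

   <action name = "make_dir_a">
      <fs>
         <mkdir path = "${outputDir}/a" />
      </fs>
      <ok to = "wait_for_both" />
      <error to = "kill_job" />
   </action>

   <action name = "make_dir_b">
      <fs>
         <mkdir path = "${outputDir}/b" />
      </fs>
      <ok to = "wait_for_both" />
      <error to = "kill_job" />
   </action>

   <!-- Join: continues only after every path of the fork has finished -->
   <join name = "wait_for_both" to = "end" />

   <kill name = "kill_job">
      <message>Failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
   </kill>

   <end name = "end" />
</workflow-app>
```

Both forked actions point their ok transition at the same join, which is what pairs the join with its fork.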
Different states of an Apache Oozie Workflow job
- PREP
- RUNNING
- SUSPENDED
- SUCCEEDED
- KILLED
- FAILED
Coordinators
Coordinator applications allow users to schedule complex workflows, including workflows that are scheduled regularly. The Oozie Coordinator models the workflow execution triggers in the form of time, data, or event predicates. The workflow job mentioned inside the coordinator is started only after the given conditions are satisfied.
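A data predicate can be sketched as follows. The coordinator below is a hedged example rather than a tested configuration: the dataset name, the directory layout in the URI template, and the startTime, endTime, nameNode, and workflowAppPath properties are hypothetical. Each daily action waits until that day's input directory contains a _SUCCESS flag before the workflow is started.

```xml
<coordinator-app xmlns = "uri:oozie:coordinator:0.2" name = "coord-data-trigger-sketch"
      frequency = "${coord:days(1)}" start = "${startTime}" end = "${endTime}"
      timezone = "UTC">
   <datasets>
      <!-- One dataset instance per day, located by the URI template -->
      <dataset name = "daily_logs" frequency = "${coord:days(1)}"
            initial-instance = "${startTime}" timezone = "UTC">
         <uri-template>${nameNode}/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
         <done-flag>_SUCCESS</done-flag>
      </dataset>
   </datasets>
   <input-events>
      <!-- The action depends on the dataset instance for the current day -->
      <data-in name = "input" dataset = "daily_logs">
         <instance>${coord:current(0)}</instance>
      </data-in>
   </input-events>
   <action>
      <workflow>
         <app-path>${workflowAppPath}</app-path>
         <configuration>
            <!-- Hands the resolved input directory to the workflow as inputDir -->
            <property>
               <name>inputDir</name>
               <value>${coord:dataIn('input')}</value>
            </property>
         </configuration>
      </workflow>
   </action>
</coordinator-app>
```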
Sample workflow.xml file
<workflow-app xmlns = "uri:oozie:workflow:0.4" name = "simple-Workflow">
   <start to = "Insert_into_Table" />

   <!-- Hive action: runs the script named by ${script_name_copy} -->
   <action name = "Insert_into_Table">
      <hive xmlns = "uri:oozie:hive-action:0.4">
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <script>${script_name_copy}</script>
         <!-- Params are passed as name=value pairs; this assumes the script refers to ${database} -->
         <param>database=${database}</param>
      </hive>
      <ok to = "end" />
      <error to = "kill_job" />
   </action>

   <!-- Reached only when the action fails -->
   <kill name = "kill_job">
      <message>Job failed</message>
   </kill>

   <end name = "end" />
</workflow-app>
Coordinator file
<coordinator-app xmlns = "uri:oozie:coordinator:0.2"
      name = "coord_copydata_from_external_orc" frequency = "5 * * * *"
      start = "2016-00-18T01:00Z" end = "2025-12-31T00:00Z"
      timezone = "America/Los_Angeles">
   <!-- The month "00" in the start value is not a valid month; set it to the intended month (01 to 12) before use -->
   <controls>
      <timeout>1</timeout>
      <concurrency>1</concurrency>
      <execution>FIFO</execution>
      <throttle>1</throttle>
   </controls>
   <action>
      <workflow>
         <app-path>pathof_workflow_xml/workflow.xml</app-path>
      </workflow>
   </action>
</coordinator-app>
The attributes used in the coordinator definition above are as follows:
start - The start datetime for the job.
end - The end datetime for the job, after which actions stop being materialized.
timezone - The timezone of the coordinator application.
frequency - How often actions are materialized, given either as a number of minutes or as a cron-like expression. In the sample above, "5 * * * *" materializes an action at minute 5 of every hour.
Control Information
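1. Timeout - The maximum time, in minutes, that a materialized action waits for its conditions to be satisfied before it is discarded.
2. Concurrency - The maximum number of actions of the job that can run at the same time.
3. Execution order (FIFO, LIFO, LAST_ONLY) - The order in which actions run when several are ready at once.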
Coordinator Job Status
At any time, a coordinator job is in one of the following statuses:
- PREP
- RUNNING
- PREPSUSPENDED
- SUSPENDED
- PREPPAUSED
- PAUSED
- SUCCEEDED
- DONEWITHERROR
- KILLED
- FAILED
Parameterization of a Coordinator
The workflow parameters can also be passed to a coordinator using the .properties file.
Three files are needed:
- workflow.xml
- coordinator file
- .properties file
Bundle concept: The Oozie Bundle system allows the user to define and execute a set of coordinator applications, often called a data pipeline. There is no explicit dependency among the coordinator applications in a bundle. However, a user could use the data dependency of coordinator applications to create an implicit data application pipeline.
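A bundle definition can be sketched as below. This is an illustrative outline only: the bundle and coordinator names and the kickOffTime, ingestCoordPath, and reportCoordPath properties are hypothetical. The bundle does not order the two coordinators; any ordering would come from the second coordinator declaring the first one's output as its input dataset.

```xml
<bundle-app xmlns = "uri:oozie:bundle:0.1" name = "bundle-sketch">
   <controls>
      <!-- Time at which the bundle starts submitting its coordinators -->
      <kick-off-time>${kickOffTime}</kick-off-time>
   </controls>

   <!-- Each entry points at the HDFS location of a coordinator definition -->
   <coordinator name = "coord_ingest">
      <app-path>${ingestCoordPath}</app-path>
   </coordinator>

   <coordinator name = "coord_report">
      <app-path>${reportCoordPath}</app-path>
   </coordinator>
</bundle-app>
```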
Important Keywords to Remember
Apache Oozie
Oozie workflow
Control Flow Nodes
Start Control Node, End Control Node, Kill Control Node
Decision Control Node
Fork and Join Control Node
Action Nodes
Oozie workflow states
Oozie workflow life-cycle