Apache Sqoop
Data ingestion is the beginning of your data pipeline in a data lake: taking data from various silo databases and files and putting it into Hadoop. Because Hadoop is open source, there are a variety of ways you can ingest data into it.
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases. This brief tutorial explains how to make use of Sqoop in the Hadoop ecosystem.
In a single line:
Sqoop − “SQL to Hadoop and Hadoop to SQL”
Key features of Sqoop
Sqoop has the following features:
1. JDBC-based implementation
2. Automatic generation of tedious client-side code
3. Integration with Hive
4. Extensible backend
How does the Sqoop Import work?
The import tool imports individual tables from an RDBMS into HDFS. Each row in a table is treated as a record in HDFS. All records are stored as text data in text files, or as binary data in Avro and Sequence files.
General Syntax
The following syntax is used to import data into HDFS.
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
Examples:
Importing a Table
The following command is used to import the emp table from the MySQL database server into HDFS.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp -m 1
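When no target directory is given, the import lands under the user's HDFS home directory, in a subdirectory named after the table. A quick way to check the result is with the HDFS shell (the path below is an assumption based on Sqoop's default of /user/&lt;username&gt;/&lt;table-name&gt; for the root user; adjust it for your installation):

```shell
# List the files the import produced, then inspect the records.
# Path is an assumption: Sqoop's default output directory for user
# "root" importing table "emp".
$ hdfs dfs -ls /user/root/emp
$ hdfs dfs -cat /user/root/emp/part-m-*
```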
Import into a Target Directory
The following command is used to import the emp_add table data into the '/queryresult' directory.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
-m 1 \
--target-dir /queryresult
Import Subset of Table Data
The following command is used to import a subset of the emp_add table data. The subset query retrieves the id and address of employees who live in Secunderabad city.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
-m 1 \
--where "city = 'sec-bad'" \
--target-dir /wherequery
Incremental Import
Incremental import is a technique that imports only the newly added rows of a table. The '--incremental', '--check-column', and '--last-value' options are required to perform an incremental import.
The following command is used to perform the incremental import in the emp table.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
-m 1 \
--incremental append \
--check-column id \
--last-value 1205
Import All Tables
The following command is used to import all the tables from the userdb database.
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root
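import-all-tables writes each table into its own directory under the user's HDFS home directory, with directory names mirroring the table names. A hedged sketch of verifying the result (the home-directory path is an assumption for the root user):

```shell
# Each table from userdb (e.g., emp, emp_add) gets its own
# subdirectory under the importing user's HDFS home directory.
$ hdfs dfs -ls /user/root
```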
How does the Sqoop Export work?
The export tool transfers data back from HDFS to an RDBMS. The target table must already exist in the target database. The files given as input to Sqoop contain records, which are called rows in the table. These are read and parsed into a set of records, delimited with a user-specified delimiter.
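As a concrete illustration, the input file used in the export example below might look like this when inspected with the HDFS shell (the records shown are hypothetical, chosen to match the employee schema created below; the path comes from the export command's --export-dir):

```shell
# Inspect the export input; each line becomes one row in the target
# table, fields separated by the default delimiter (comma).
# The two records shown are hypothetical sample data.
$ hdfs dfs -cat /emp/emp_data
1201,gopal,manager,50000,TP
1202,manisha,preader,50000,TP
```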
General Syntax
The following is the syntax for the export command.
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
Examples:
It is mandatory that the table to be exported is created manually and is present in the database into which it is to be exported.
The following query is used to create the table 'employee' from the MySQL command line.
$ mysql
mysql> USE db;
mysql> CREATE TABLE employee (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20),
deg VARCHAR(20),
salary INT,
dept VARCHAR(10));
The following command is used to export the table data (which is in the emp_data file on HDFS) to the employee table in the db database of the MySQL server.
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data
The following command is used to verify the table from the MySQL command line.
mysql> SELECT * FROM employee;
Sqoop Job
The Sqoop job tool creates and saves import and export commands, so that a saved job can be identified and recalled by name. Re-executing a saved job is especially useful for incremental imports, which bring the updated rows from an RDBMS table into HDFS.
Syntax
$ sqoop job (generic-args) (job-args)
Create Job (--create)
Here we are creating a job with the name myjob, which can import the table data from RDBMS table to HDFS. The following command is used to create a job that is importing data from the employee table in the db database to the HDFS file.
$ sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee -m 1
Verify Job (--list)
The '--list' argument is used to verify the saved jobs. The following command is used to list the saved Sqoop jobs.
$ sqoop job --list
It shows the list of saved jobs.
Available jobs:
myjob
Inspect Job (--show)
The '--show' argument is used to inspect or verify a particular job and its details.
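For example, to inspect the job created above (myjob is the saved-job name from the earlier --create example):

```shell
# Prints the saved job's tool and stored parameters, including the
# connect string, table name, and any saved incremental-import state.
$ sqoop job --show myjob
```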
Execute Job (--exec)
The '--exec' option is used to execute a saved job. The following command is used to execute a saved job called myjob.
$ sqoop job --exec myjob
Codegen
Codegen is a tool that generates a DAO class in Java, based on the structure of the table schema. The Java definition is instantiated as a part of the import process. The main use of this tool is to recreate the Java code if the generated source file has been lost; in that case it produces a new version of the class, using the default delimiters between fields.
Syntax
$ sqoop codegen (generic-args) (codegen-args)
Example
Let us take an example that generates Java code for the emp table in the userdb database.
The following command is used to execute the given example.
$ sqoop codegen \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp
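By default, the generated files (emp.java, emp.class, emp.jar) land in a temporary compile directory. Sqoop's code-generation arguments --outdir and --bindir can redirect them; a sketch of the same example with those flags (the /tmp paths are arbitrary choices for illustration):

```shell
# Same codegen run, but with generated sources sent to --outdir and
# compiled artifacts to --bindir (both paths are illustrative).
$ sqoop codegen \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--outdir /tmp/sqoop-src \
--bindir /tmp/sqoop-bin
```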
Sqoop ‘eval’ tool
It allows users to execute user-defined queries against their respective database servers and preview the result on the console, so the user can know in advance what table data an import would bring in. Using eval, we can evaluate any type of SQL query, whether a DDL or a DML statement.
Syntax
$ sqoop eval (generic-args) (eval-args)
Select Query Evaluation
Using the eval tool, we can evaluate any type of SQL query. Let us take an example that selects a limited number of rows from the employee table in the db database. The following command evaluates the given example using an SQL query.
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT * FROM employee LIMIT 3"
Insert Query Evaluation
The Sqoop eval tool is applicable to DML as well as DDL statements; that means we can use eval for INSERT statements too. The following command is used to insert a new row into the employee table of the db database.
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
-e "INSERT INTO employee VALUES(1207,'Raju','UI dev',15000,'TP')"
List all Databases
The following command is used to list all the databases in the MySQL database server.
$ sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root
List all Tables
The following command is used to list all the tables in the userdb database of MySQL database server.
$ sqoop list-tables \
--connect jdbc:mysql://localhost/userdb \
--username root