
Apache Sqoop


Data ingestion is the first stage of a data pipeline in a data lake: taking data from various siloed databases and files and loading it into Hadoop. Because Hadoop is open source, there are a variety of ways you can ingest data into it.

Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases. This brief tutorial explains how to use Sqoop in the Hadoop ecosystem.

In a single line:
Sqoop: "SQL to Hadoop and Hadoop to SQL"

Key features of Sqoop

It has the following features:

1. JDBC-based implementation
2. Auto-generation of tedious user-side code
3. Integration with Hive
4. Extensible backend

Sqoop Connectors

Sqoop ships with connectors for popular relational databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2, as well as a generic JDBC connector for any database that supports JDBC.


How Does the Sqoop Import Work?

The import tool imports individual tables from RDBMS to HDFS. Each row in a table is treated 
as a record in HDFS. All records are stored as text data in text files or as binary data in 
Avro and Sequence files.

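The row-to-record mapping described above can be sketched in a few lines of Python: each table row becomes one line of text, with fields joined by a delimiter (comma is Sqoop's text-file default). The sample rows are hypothetical; this is a conceptual sketch, not Sqoop's implementation.

```python
# Hypothetical rows from an RDBMS table.
rows = [
    (1201, "gopal", "manager"),
    (1202, "manisha", "proof reader"),
]

def to_text_records(rows, field_sep=","):
    """Serialize each row as one delimited text line, one record per row."""
    return "\n".join(field_sep.join(str(f) for f in row) for row in rows)

print(to_text_records(rows))
# 1201,gopal,manager
# 1202,manisha,proof reader
```

An Avro or Sequence file import stores the same records in a binary container format instead of plain delimited text.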

General Syntax
The following syntax is used to import data into HDFS.

$ sqoop import (generic-args) (import-args) 
$ sqoop-import (generic-args) (import-args)

Examples:

Importing a Table

The following command is used to import the emp table from MySQL database server to HDFS.

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp -m 1

Import into a Target Directory


The following command is used to import the emp_add table data into the '/queryresult' directory.

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
-m 1 \
--target-dir /queryresult

Import Subset of Table Data


The following command is used to import a subset of the emp_add table data. The subset query retrieves the employee id and address of employees who live in the city of Secunderabad.

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
-m 1 \
--where "city = 'sec-bad'" \
--target-dir /wherequery

Incremental Import


Incremental import is a technique that imports only the newly added rows in a table. The 'incremental', 'check-column', and 'last-value' options are required to perform an incremental import.

The following command is used to perform the incremental import in the emp table.

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
-m 1 \
--incremental append \
--check-column id \
--last-value 1205
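The logic behind --incremental append can be sketched in plain Python: only rows whose check column exceeds the saved last-value are imported, and the new last-value is remembered for the next run. This is an illustrative sketch with hypothetical table contents, not Sqoop's internal implementation.

```python
# Hypothetical rows already present in the source table.
table = [
    {"id": 1204, "name": "prasanth"},
    {"id": 1205, "name": "kranthi"},
    {"id": 1206, "name": "satish"},
]

def incremental_append(rows, check_column, last_value):
    """Return rows added since the previous import, plus the new last-value.

    Mirrors --incremental append with --check-column and --last-value.
    """
    new_rows = [r for r in rows if r[check_column] > last_value]
    new_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, new_last

new_rows, new_last = incremental_append(table, "id", 1205)
print(new_rows)   # [{'id': 1206, 'name': 'satish'}]
print(new_last)   # 1206
```

On the next run, 1206 would be supplied as the new --last-value, so already-imported rows are skipped.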

Import All Tables


The following command is used to import all the tables from the userdb database.

$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root



How Does the Sqoop Export Work?

The export tool transfers data back from HDFS to an RDBMS. The target table must already exist in the target database. The files given as input to Sqoop contain records, which become rows in the table. These are read and parsed into a set of records, delimited with a user-specified delimiter.
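The parsing step described above can be sketched in Python: each line of the HDFS file is split on the field delimiter and becomes one row for the target table. The sample records are hypothetical, and this is a conceptual sketch rather than Sqoop's actual parser.

```python
# Hypothetical contents of an HDFS export file using the default
# comma field delimiter and newline record delimiter.
hdfs_records = "1201,gopal,manager,50000,TP\n1202,manisha,preader,50000,TP"

def parse_records(data, field_sep=","):
    """Split delimited text records into row tuples, one per table row."""
    return [tuple(line.split(field_sep)) for line in data.splitlines() if line]

rows = parse_records(hdfs_records)
print(rows[0])  # ('1201', 'gopal', 'manager', '50000', 'TP')
print(len(rows))  # 2
```

Each parsed tuple would then be written to the target table as one INSERT against the RDBMS.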



General Syntax
The following is the syntax for the export command.

$ sqoop export (generic-args) (export-args) 
$ sqoop-export (generic-args) (export-args)

Examples:


It is mandatory that the table to be exported is created manually and is present in the database into which the data is to be exported.

The following query creates the table 'employee' at the MySQL command line.

$ mysql
mysql> USE db;
mysql> CREATE TABLE employee ( 
   id INT NOT NULL PRIMARY KEY, 
   name VARCHAR(20), 
   deg VARCHAR(20),
   salary INT,
   dept VARCHAR(10));


The following command is used to export the table data (in the emp_data file on HDFS) to the employee table in the db database of the MySQL server.

$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data

The following command is used to verify the table at the MySQL command line.

mysql> SELECT * FROM employee;


Sqoop Job

The Sqoop job tool creates and saves import and export commands, along with the parameters needed to identify and recall a saved job. Re-executing a saved job is useful for incremental imports, which bring only the updated rows from an RDBMS table into HDFS.

Syntax

$ sqoop job (generic-args) (job-args)


Create Job (--create)

Here we are creating a job named myjob, which imports the table data from an RDBMS table into HDFS. The following command creates a job that imports data from the employee table in the db database to HDFS.

$ sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee -m 1
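Conceptually, a saved job is just a named, persisted set of tool arguments that can later be listed and re-executed. The registry sketched below is illustrative only; it mimics the create/list/exec workflow and is not how Sqoop stores jobs internally (Sqoop uses a metastore).

```python
# Minimal in-memory stand-in for Sqoop's job metastore.
saved_jobs = {}

def create_job(name, args):
    """Persist a named set of tool arguments (like 'sqoop job --create')."""
    saved_jobs[name] = args

def list_jobs():
    """Return the names of saved jobs (like 'sqoop job --list')."""
    return sorted(saved_jobs)

def exec_job(name):
    """Recall a saved job's arguments for re-execution ('sqoop job --exec')."""
    return saved_jobs[name]

create_job("myjob", ["import", "--connect", "jdbc:mysql://localhost/db",
                     "--username", "root", "--table", "employee"])
print(list_jobs())           # ['myjob']
print(exec_job("myjob")[0])  # import
```

Because the arguments are recalled verbatim, re-running the job repeats the same import, which is what makes saved jobs convenient for scheduled incremental imports.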

Verify Job (--list)

The '--list' argument is used to verify the saved jobs. The following command lists the saved Sqoop jobs.

$ sqoop job --list

It shows the list of saved jobs:

Available jobs:
  myjob

Inspect Job (--show)

The '--show' argument is used to inspect a particular job and its details. The following command shows the details of the saved job myjob.

$ sqoop job --show myjob

Execute Job (--exec)

The '--exec' option is used to execute a saved job. The following command executes the saved job called myjob.

$ sqoop job --exec myjob

Codegen

Codegen is a tool that generates a DAO class in Java, based on the table schema. The Java class definition is instantiated as part of the import process. The main use of this tool is to check whether the generated Java code has been lost; if so, it creates a new version of the code with the default delimiter between fields.

Syntax

$ sqoop codegen (generic-args) (codegen-args)

Example

Let us take an example that generates Java code for the emp table in the userdb database. The following command is used to run it.

$ sqoop codegen \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp

Sqoop ‘eval’ tool

The eval tool allows users to execute user-defined queries against a database server and preview the result in the console, so the user knows what table data to expect from an import. With eval, we can evaluate any SQL query, whether a DDL or a DML statement.

Syntax

$ sqoop eval (generic-args) (eval-args)

Select Query Evaluation

Using the eval tool, we can evaluate any SQL query. Let us take an example of selecting a limited number of rows from the employee table of the db database. The following command evaluates the query.

$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT * FROM employee LIMIT 3"

Insert Query Evaluation

The eval tool is not limited to SELECT statements; we can use it for INSERT statements too. The following command inserts a new row into the employee table of the db database.

$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
-e "INSERT INTO employee VALUES(1207,'Raju','UI dev',15000,'TP')"

List all Databases

The following command lists all the databases on the MySQL server.

$ sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root

List all Tables

The following command lists all the tables in the userdb database of the MySQL server.

$ sqoop list-tables \
--connect jdbc:mysql://localhost/userdb \
--username root


Some Sample Commands

The commands covered above, in one place:

$ sqoop import --connect jdbc:mysql://localhost/userdb --username root --table emp -m 1
$ sqoop import-all-tables --connect jdbc:mysql://localhost/userdb --username root
$ sqoop export --connect jdbc:mysql://localhost/db --username root --table employee --export-dir /emp/emp_data
$ sqoop list-databases --connect jdbc:mysql://localhost/ --username root
$ sqoop list-tables --connect jdbc:mysql://localhost/userdb --username root

Limitations of Apache Sqoop

1. Transfers run as batch MapReduce jobs, so they are relatively slow and not suited to real-time ingestion.
2. Exports are not atomic: a failure partway through can leave the target table partially updated.
3. It relies on JDBC connectivity and works only with structured, relational sources.
