Intro to Hive

Intro to Hive

6/18/2017


Introduction

Hive is a top-level Apache project that was originally developed by Facebook. Hive aims to bring the ease of SQL for interactive data querying into the Hadoop ecosystem. If you've used HDFS for data storage before, you're probably familiar with the frustration of trying to query your data interactively or to perform QA. Before I started using Hive for this purpose, I was stuck running a bunch of hdfs dfs -cat commands on files, then using grep to search for points of interest, which, as you can imagine, becomes pretty tedious (and slow). With Hive, all of that frustration is taken care of.


For those coming from a SQL background, Hive may seem a little odd. You have to remember all of this is built upon a file system. So in Hive, when you create a database, you are creating a new directory inside of HDFS. Then you can create managed tables in that database, which will become subdirectories of the root database folder. Then, data you put inside your database will be represented as files under your table directory. Note that I specified managed tables in this example. That's because in Hive, there are two types of tables you can create: Managed and External.


Managed vs. External Tables

Like I described above, a managed table is one that will be created under the database directory inside of HDFS. By default, this will be under /usr/hive/warehouse. However, it will change if you use a new database that you create. Say you wanted to create a new database called 'school', and wanted to put it at the directory /data/school/. You would run this command to do so:

CREATE DATABASE IF NOT EXISTS school
LOCATION '/data/school/';

Now if you want to create a new table in that database:

USE school;

CREATE TABLE students(
first_name String,
last_name String,
email String,
university String
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

So when you run that command, it would create a new directory called students under the /data/school/ directory.


On the flip side, an external table can be created essentially anywhere in HDFS (assuming you have permissions); it's not restricted to being under the db directory. So if I wanted to create a external table and place it elsewhere, you could run a command like this:

USE school;

CREATE EXTERNAL TABLE teachers (
first_name String,
last_name String,
email String,
university String
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/external/teachers';

You can see this is very similar to the example above, the only difference being the EXTERNAL keyword in the first line, as well as the location being specified at the end.


So far, these two table types probably seem like they're the same thing. And they are pretty similar, however, there is a big difference when you go to drop the table. If you drop an external table (remember that the table is a directory), the directory and the underlying files will remain intact. However, if you drop a managed table, the directory and underlying data will be deleted from HDFS. For that reason alone, I prefer to use an External table.


Hopefully this has served as an intro to help get you familiar with Hive. If you liked this, go checkout my post on useful Hive commands that I use on a daily basis. Feel free to leave any comments or questions below. Thanks for reading!



Check out more posts in the Hive category!