In my last post, I gave a quick overview of Hive. In this post, I aim to show off some of my favorite commands that I use in my day-to-day work.
Show Create Table
If you ever stumble across a table in your Hive environment and wonder things such as: What data types are these columns? Is this a managed or external table? Where is the table pointed to? What file format is this data stored as? And so on, then the show create table command is going to be your new best friend.
The syntax is very simple. After typing a 'use db_name', just simply run:
SHOW CREATE TABLE table_name;
MSCK Repair Table
Hive partitions are not something that I've covered yet (maybe in a future post), but this command is pretty essential if your table is partitioned. In Hive, partitions are simply sub-directories under your root table directory. If you add partitions to your table by simply issuing a move command (hdfs dfs -mv), then you need to run this command to refresh the metadata about your partition.
MSCK REPAIR TABLE table_name;
This is opposed to adding partitions by actually using an insert command or a 'load data inpath' command. If you use those, Hive will already be aware of the new partitions and this command won't be needed. However, if you're just writing new data to your table directory, you need to run this command to refresh the hive partitions.
A word of warning: If your table is very large and has hundreds of thousands of partitions, be cautious when using this command. I've seen this run on an extremely large table and just bring down the Hive metastore entirely.
Setting Resource Queues
If your Hadoop setup has resource pools configured, it's a really good idea to specify your resource pool before running a Hive query. Not doing this can result in your query taking up the whole cluster resource-wise, leaving little memory and containers for other users. An example of this command is:
Again, you would need this set up in your cluster already, but running this command and setting a resource queue (ie: LowQueue) will ensure you aren't eating up the whole cluster and leaving other developers starved for resources.
Say your need to copy a table to a new database, or just create a copy to try out some things. You can just run the following command to quickly and easily copy that table:
CREATE TABLE new_db.table1_copy AS SELECT * FROM old_db.table1;
I use these commands on a daily basis in my work and I hope that you can find them as useful as I do. Please feel free to leave any comments or questions below. Thanks for reading!