Spark Shell Tips

Spark shell is a very useful tool for interactive analysis and querying of data stored in HDFS. Here are some tips that I've found useful when using spark shell.

Running Spark Shell

For those who are new to using spark shell, you can run the REPL shell for either Scala or Python. To use the Scala spark shell, run spark-shell. To use the Python spark shell, run pyspark. I typically use the Scala shell, so the rest of my examples will reflect that. Just realize that all of these same options are available for the Python shell; only the syntax may be a bit different.

Check the spark shell options

Run spark-shell --help and you'll see a big list of options for the spark-shell command. Of these, probably the most useful are the resource constraints. When running the shell, I'll almost always run spark-shell --num-executors 5 to ensure I don't use a lot of resources on the cluster and take away from jobs that are running.

Another tip here is to set up an alias in your .bashrc file. For example, I set one up and just called it 'spark' and gave it all of my resource constraints. So now, I can just run spark and it'll open the spark shell with all of my options by default.
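A minimal .bashrc entry for this might look like the following (the exact flags and values here are illustrative; --num-executors 5 comes from the earlier example, and the memory setting is just a placeholder to tune for your cluster):

```shell
# In ~/.bashrc: open spark-shell with modest default resource limits
# (values below are illustrative; adjust for your cluster)
alias spark='spark-shell --num-executors 5 --executor-memory 2g'
```

After adding the line, run source ~/.bashrc (or open a new terminal) and the spark command will be available.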

Running Linux/Bash commands from inside spark-shell

This is by far the most useful thing I've learned when using spark shell. It's not really specific to spark-shell but more to the Scala language in general. Once you have the REPL shell open, you'll need to import sys.process._. After that, you can run any Linux command by wrapping it in quotes and appending .! to the string. For example, if you want to see where some data is in HDFS that you're going to read in spark-shell, you can run:

scala> import sys.process._
scala> "hdfs dfs -ls /my_data/example".!

And you'll get your results printed right in the shell output. You can also use this to execute scripts that are stored on the server you're running spark-shell on, which can be pretty useful.
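As a side note, scala.sys.process also provides .!!, which captures the command's standard output as a String instead of just printing it and returning the exit code. A small sketch, using echo as a stand-in for an HDFS command so it runs anywhere:

```scala
import sys.process._

// .! runs the command, prints its output, and returns the exit code (0 = success)
val exitCode: Int = "echo hello".!

// .!! runs the command and returns its stdout as a String
// (it throws a RuntimeException if the command exits non-zero)
val output: String = "echo hello".!!
```

The captured String is handy when you want to parse the output, for example splitting an hdfs dfs -ls listing into lines inside the shell.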

Hopefully these tips help you become a little bit more efficient when using the spark shell. Thanks for reading and feel free to leave a comment or question below!

Check out more posts in the Spark category!