The Spark shell is a very useful tool for interactive analysis and querying of data stored in HDFS. Here are some tips I've found useful when working in the Spark shell.
Running Spark Shell
For those who are new to the Spark shell: you can run the REPL for either Scala or Python. To use the Scala Spark shell, run
spark-shell
To use the Python Spark shell, run
pyspark
I typically use the Scala shell, so the rest of my examples will reflect that. Just realize that all of these same options are available for the Python shell; only the syntax may be a bit different.
Check the spark shell options
Run
spark-shell --help
and you'll see a long list of options for the spark-shell command. Of these, one of the most useful is the ability to set resource constraints. When running the shell, I'll almost always run
spark-shell --num-executors 5
to ensure I don't use a lot of resources on the cluster and take them away from jobs that are already running.
Another tip here is to set up an alias in your
.bashrc file. For example, I set one up called 'spark' and gave it all of my resource constraints:
alias spark='spark-shell --num-executors 5'
So now, I can just run
spark
and it'll open the Spark shell with all of my options by default.
Running Linux/Bash commands from inside spark-shell
This is by far the most useful thing I've learned when using the Spark shell. It's not really specific to spark-shell but to the Scala language in general. Once you have the REPL open, you'll need to import
sys.process._
Then after that, you can run any Linux command by wrapping it in quotes, followed by a
.!
For example, if you want to see where some data is in HDFS that you're going to read in spark-shell, you can run:
scala> import sys.process._
scala> "hdfs dfs -ls /my_data/example".!
And you'll get your results printed right in the shell output. You can also use this to execute scripts stored on the server you're running spark-shell on, which can be pretty useful.
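Beyond just printing output, sys.process can also capture a command's stdout as a String, which is handy when you want to parse the results right in the shell. Here's a minimal sketch (the `ls -l` command is just a stand-in; point it at whatever command you need):

```scala
import sys.process._

// `.!` runs the command, streams its output to the console,
// and returns the exit code as an Int.
val exitCode = "ls -l".!

// `.!!` runs the command and returns its stdout as a String
// (it throws an exception if the command exits non-zero).
val listing = "ls -l".!!
println(listing.split("\n").length + " lines of output")
```

Once the output is captured in a String, you can split, filter, or regex it like any other Scala value instead of eyeballing the console.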
Hopefully these tips help you become a little bit more efficient when using the spark shell. Thanks for reading and feel free to leave a comment or question below!