Friday, February 26, 2021

Spark on Ubuntu (WSL): Installing PySpark


This is a quick post to go over the install process for running PySpark in an Ubuntu terminal environment on Windows via WSL (Windows Subsystem for Linux). How times have changed!

Links for both the Ubuntu Terminal and Apache Spark:

DOWNLOAD SPARK: Downloads | Apache Spark (I moved the downloaded tgz file to a local folder C:\spark)

DOWNLOAD UBUNTU TERMINAL FOR WINDOWS 10 WSL | Ubuntu

This post won't cover any instructions for installing Ubuntu; instead I'll assume you've already installed it and downloaded the tgz file from the Apache Spark download page (Step 3 in the above link).

Let's go straight into the terminal window and get going! I've put the commands in bold text (don't include the $) to make them a little easier to spot for anyone who prefers to skip my gibberish!

$ dir is a very familiar command, however running it at this point returned absolutely nothing! I was expecting at least some file or folder listings, and attempting to change directory to my Spark folder on my local C drive returned a "no such file or directory" error.

In Linux land the C:\ drive is actually mounted a little differently, so from my command prompt I had to enter the following command instead to get to the right folder:

$ cd /mnt/c/spark
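
If you want to double-check you're in the right place, a quick listing should show the tgz file you downloaded (the filename below matches the version I grabbed, so yours may differ):

$ ls /mnt/c/spark
spark-3.0.2-bin-hadoop2.7.tgz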

Now I can extract my downloaded tgz (tarball) file:

$ tar -xf spark-3.0.2-bin-hadoop2.7.tgz

HOWEVER, it straight away threw errors saying "Cannot utime: Operation not permitted", which was resolved by redoing the above command with good old sudo:

$ sudo tar -xf spark-3.0.2-bin-hadoop2.7.tgz

This will prompt for the password set when Ubuntu was first installed, and now the tarball contents are successfully extracted (you can run dir again just to be sure!). Next, cd to the folder:

$ cd spark-3.0.2-bin-hadoop2.7
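
A quick listing inside the extracted folder should show the usual Spark layout, roughly along these lines (the exact contents can vary a little between versions):

$ ls
LICENSE  NOTICE  R  README.md  bin  conf  data  examples  jars  licenses  python  sbin  yarn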

Now we need to start the shell we'll use with Spark (Python, Scala, SQL or R). In this case I opted for Python, and to do that cd to the bin folder using cd bin and run pyspark.

This promptly didn't run and instead returned the error "pyspark: command not found".

Even though with a dir command I could see pyspark, pyspark.cmd etc., it would not run, and that's because I needed to slightly change the command to ./pyspark, which worked (YES!) but gave a different error message: JAVA_HOME is not set (BOO!!!).

Rubbish, but in this case it's simply because Java is not installed, which is fixed by the following command:

$ sudo apt install default-jre
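
Once that finishes it's worth a quick sanity check that Java is actually available. If the JAVA_HOME warning still appears you can also set the variable explicitly; the OpenJDK 11 path below is only an example of what a default Ubuntu install might use, so confirm your own location with readlink first and point JAVA_HOME at the folder above bin:

$ java -version
$ readlink -f $(which java)
$ export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64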

Now that Java is installed you'd think we'd be there by now, but no, running ./pyspark gave yet another error, this time "env: 'python': No such file or directory", so we need one last command to point PySpark at the right Python interpreter:

$ export PYSPARK_PYTHON=python3 (to set just for this session)
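
If you'd rather not type that every time you open a new terminal, you can append it to your shell profile instead (this assumes the default bash shell that Ubuntu ships with):

$ echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
$ source ~/.bashrc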

And finally, we have PySpark (our Spark Shell) running locally and ready to go!
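
As a quick smoke test, the pyspark shell hands you a ready-made SparkSession called spark, so a couple of throwaway commands at the >>> prompt (purely illustrative, nothing else to install here) confirm everything is wired up:

>>> spark.version
'3.0.2'
>>> spark.range(5).count()
5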


