In recent years, the volume of data that organizations need to process has grown enormously. To meet this demand, a framework was designed to process large amounts of data in a distributed manner. This framework is known as Apache Spark.
Apache Spark is an open-source framework supporting several languages, including Java, Scala, Python, and R. PySpark is the Python API for Spark: it lets you write Spark applications in Python and run them on Spark's distributed engine. In addition, PySpark provides a PySpark shell for analyzing data interactively in a distributed environment. It supports the core Spark components, namely Spark SQL, DataFrames, Streaming, MLlib, and Spark Core.
In this blog post, we shall guide you on getting started with PySpark on Windows.
Prerequisites for PySpark
To install PySpark on your Windows system, make sure you have the following installed first:
- Java 8, 11, or 17 (the Spark 3.3.x distribution used in this post no longer supports Java 7)
- Python 3.7 or later (Python 2 is no longer supported)
You can download the JDK from the official website of Oracle as per your system requirement. Then you need to run the installer to set up the Java environment on your system.
After the Java installation completes, check whether JAVA_HOME is defined and added to the PATH variable, pointing to the Java installation location.
We can achieve that with the below steps:
1) Define JAVA_HOME
2) Add JAVA_HOME to the PATH variable
3) Restart your system after the above steps, or restart all your terminals, command prompts, and IDEs so that they pick up the new environment settings and can locate the Java installation. To validate, open a Command Prompt and check the Java version.
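The three steps above can be sketched from a Command Prompt. The JDK path below is an assumption; use the location where your installer actually placed Java:

```shell
:: Assumed JDK location -- adjust to your actual install folder.
setx JAVA_HOME "C:\Program Files\Java\jdk-11"

:: Append Java's bin folder to the user PATH. Note that setx only affects
:: newly opened terminals, so spell the path out rather than using %JAVA_HOME%.
setx PATH "%PATH%;C:\Program Files\Java\jdk-11\bin"

:: After reopening the terminal, verify the installation:
java -version
```

Setting the variables through the System Properties dialog works just as well; the commands above are simply the scriptable equivalent.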
To get started with PySpark, you first need to set up Python on your system. If you don't have Python installed already, you can follow this guide: install-python-on-windows .
You can simply type the “python” command in the Command Prompt to validate if it's installed already.
Installing Python does not install Spark; additional steps are needed to set it up on the local system. To confirm this, we can follow the steps below:
The above screenshot shows that PySpark is not available, and we need to install it.
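Instead of eyeballing the import error, the same check can be scripted. This snippet only probes whether the pyspark package is importable in the current environment:

```python
# Check whether PySpark is importable without actually importing it.
import importlib.util

spec = importlib.util.find_spec("pyspark")
if spec is None:
    print("PySpark is not installed")   # expected before `pip install pyspark`
else:
    print("PySpark is installed at", spec.origin)
```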
To install PySpark, we need to exit our Python terminal by typing exit() and then use the pip command to install it. Before doing that, we should look at our Python packages:
This is how the site-packages folders look before the PySpark installation:
To install PySpark, we can use the following command:
pip install pyspark
To validate the installation, redo the above steps: open the Python shell and try to import PySpark again; this time it should not raise an error.
The above screenshot validates that the PySpark installation has succeeded.
Python site-packages before installation:
Python site-packages after installation:
Here, we can see the new folders that support PySpark.
Let us see if we are able to run PySpark programs using the current setup.
Note that for demonstration purposes, we are going to use the same Spark examples that come with the setup.
So, let’s now try running one of the programs:
The above run failed for two reasons:
- Spark and Hadoop still need to be set up on the local system to cover all use cases.
- PySpark needs to know which Python interpreter to use, a connection that can be established by setting the environment variables below:
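Spark reads the interpreter to use from the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables. A sketch from a Command Prompt follows; the interpreter path is an assumption, so replace it with the output of `where python` on your machine:

```shell
:: Assumed Python location -- replace with the output of `where python`.
setx PYSPARK_PYTHON "C:\Python310\python.exe"
setx PYSPARK_DRIVER_PYTHON "C:\Python310\python.exe"
```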
Spark and Hadoop Setup
To install Spark, please refer to the official Spark download page.
- You can select the distribution that you want to download. I am choosing the latest version.
- After you make a choice in steps 1 and 2, step 3 gives you a link to download the distribution.
You can extract the package either by using WinRAR or the below command:
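Recent Windows versions ship a tar command in the Command Prompt that can unpack the archive. The file name below assumes the Spark 3.3.0 / Hadoop 3 distribution chosen earlier; match it to the file you actually downloaded:

```shell
# Unpack the downloaded Spark distribution into the current folder.
# The archive name is an assumption -- use the name of your downloaded file.
tar -xvzf spark-3.3.0-bin-hadoop3.tgz
```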
- This will extract the package to its named folder. You can move this folder under the C:/ directory.
- For Spark on Windows, we also need winutils, the Windows binaries for the corresponding Hadoop version. You can download them from the winutils page; choose the version matching your distribution's Hadoop version. Since the distribution above is built for Hadoop 3, I will download winutils.exe for Hadoop 3.
- Now, we want to set up environment variables for supporting Hadoop and Spark. Please refer to HADOOP_HOME and SPARK_HOME variables in the below screenshot. Please update them as per your setup location for Spark and winutils.
As per my setup, I have placed winutils.exe inside “C:\spark-3.3.0-bin-hadoop3\bin\hadoop\bin”. (Note: Please create hadoop/bin inside spark/bin)
- Once you have created the above variables, add hadoop\bin and spark\bin to the PATH environment variable, as shown in the screenshot:
After we have made the environment variable settings, we need to restart our system.
Now, we can again try and execute the same program wordcount.py to see if the setup is ready.
Before doing that, let us take a look at the two files which we are using for this demo:
```python
import sys
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession \
        .builder \
        .appName("PythonWordCount") \
        .getOrCreate()

    # Read the input file; each row becomes one line of text.
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    spark.stop()
```
We will try to run the above program as below, and this time, it worked.
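For reference, a run along these lines uses the spark-submit script that ships with Spark. The paths below are assumptions, so adjust them to your Spark install location and input file:

```shell
:: Hypothetical invocation -- adjust the example and input paths to your setup.
spark-submit --master "local[*]" C:\spark-3.3.0-bin-hadoop3\examples\src\main\python\wordcount.py C:\data\input.txt
```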
Also, you can now open the PySpark shell using the pyspark command in your terminal, as below:
As per the logs above, the Spark UI is now available at the mentioned link and will display the state of executing tasks as below:
Now, you can play around by writing simple PySpark programs and see how they behave on the Spark UI. Please see one such example:
In this example, I will try to read a file. Please note that any submitted job will be visible on Spark UI only after you perform an action command since Spark works with a lazy evaluation concept.
In the below example, the show is an action.
After the show method is invoked, you can see the running job on the Spark UI.
That’s all about the installation of PySpark on Windows. After reading this article, you should have a clear idea of how to get a PySpark setup up and running. Once you are done with it, you are all set to create Spark applications using Python.
Stay tuned for more blogs to learn the language and some interesting use cases.