Introduction
Apache Spark is an open-source, general-purpose, multi-language analytics engine for large-scale data processing. It works on both single and multiple nodes by utilizing the RAM in clusters to perform fast data queries on large amounts of data. It offers batch data processing and real-time streaming, with support of high-level APIs in languages such as Python, SQL, Scala, Java or R. The framework offers in-memory technologies that allow it to store queries and data directly in the main memory of the cluster nodes.
This article explains how to install Apache Spark on Ubuntu 20.04 server.
Prerequisites
- Deploy a fully updated Rcs Ubuntu 20.04 Server.
- Create a non-root user with sudo access.
1. Install Java
Update system packages.
$ sudo apt update
Install Java.
$ sudo apt install default-jdk -y
Verify Java installation.
$ java -version
2. Install Apache Spark
Install required packages.
$ sudo apt install curl mlocate git scala -y
Download Apache Spark. Find the latest release from the downloads page.
$ curl -O https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
Extract the Spark tarball.
$ sudo tar xvf spark-3.2.0-bin-hadoop3.2.tgz
Create an installation directory /opt/spark
.
$ sudo mkdir /opt/spark
Move the extracted files to the installation directory.
$ sudo mv spark-3.2.0-bin-hadoop3.2/* /opt/spark
Change the permission of the directory.
$ sudo chmod -R 777 /opt/spark
Edit the bashrc
configuration file to add Apache Spark installation directory to the system path.
$ sudo nano ~/.bashrc
Add the code below at the end of the file, save and exit the file:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save the changes to take effect.
$ source ~/.bashrc
Start the standalone master server.
$ start-master.sh
Find your server hostname from the dashboard by visiting http://ServerIPaddress:8080
. Under the URL value. It might look like this:
URL: spark://my-server-development:7077
Start the Apache Spark worker process. Change spark://ubuntu:7077
with your server hostname.
$ start-slave.sh spark://ubuntu:7077
3. Access Apache Spark Web Interface
Go to your browser address bar to access the web interface and type in http://ServerIPaddress:8080
to access the web install wizard. For example:
http://192.0.2.10:8080
Conclusion
You have installed Apache Spark on your server. You can now access the main dashboard begin managing your clusters.
More Information
For more information about Apache Spark, please see the official documentation.