Apache Spark

What is Spark?

Apache Spark is a general-purpose cluster computing system for large-scale data processing.
It is one of the most powerful tools for analysing Big Data.
If you want to be a Data Scientist or work with Big Data, you should learn Apache Spark.

Related Course: Spark and Python for Big Data with PySpark

Features

Apache Spark was originally developed at UC Berkeley and was later donated to the Apache Software Foundation.
In short, it has these characteristics:

  • It is a cluster computing tool
  • It is a general-purpose distributed system
  • It runs up to 100 times faster than Hadoop MapReduce
  • It is written in the Scala functional programming language
  • It provides an API in Python (see the sketch after this list)
  • It can be integrated with Hadoop
  • It can process existing HDFS data
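
As a small taste of the Python API, here is a minimal word-count sketch. It assumes PySpark is installed; the hdfs:/// path is a placeholder, and a local file path works the same way.

    from pyspark.sql import SparkSession

    # Start a Spark session; "WordCountDemo" is an arbitrary app name.
    spark = SparkSession.builder.appName("WordCountDemo").getOrCreate()

    # Read a text file into an RDD. The path below is a placeholder:
    # swap in any local file or existing HDFS data.
    lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")

    # Classic word count: split lines into words, emit (word, 1) pairs,
    # then sum the counts per word across the cluster.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))
    spark.stop()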

Why Big Data

Big Data skills are in high demand and used at the world's biggest companies. Spark is one of the most valuable tech skills to learn. The number of Data Science jobs has been rapidly increasing (source: indeed.com):

(Chart: growth in the number of Data Science job postings)

Apache Spark vs Hadoop

Apache Spark can run programs:

  • up to 100 times faster than Hadoop MapReduce in memory (see the caching sketch after this list)
  • up to 10 times faster than Hadoop MapReduce on disk
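
The in-memory speedup comes from Spark keeping intermediate data in cluster RAM instead of writing it back to disk between steps, as MapReduce does. A minimal caching sketch, assuming a placeholder numbers.csv file with a numeric "value" column:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CacheDemo").getOrCreate()

    # numbers.csv is a placeholder dataset with a numeric "value" column.
    df = spark.read.csv("numbers.csv", header=True, inferSchema=True)

    # cache() keeps the DataFrame in cluster memory after the first action,
    # so later queries are served from RAM instead of re-reading the file.
    df.cache()

    df.count()                         # first pass: reads from disk, fills the cache
    df.groupBy().avg("value").show()   # subsequent passes: served from memory

    spark.stop()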

(Figure: Logistic regression running time in Hadoop and Apache Spark; source: spark.apache.org)
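
Logistic regression is a good showcase for Spark because it is trained iteratively, passing over the same data many times, which is exactly the workload that benefits from in-memory caching. A minimal sketch using Spark's MLlib API on a tiny made-up dataset:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("LogRegDemo").getOrCreate()

    # Tiny made-up dataset: a label plus a two-dimensional feature vector.
    data = spark.createDataFrame(
        [(0.0, Vectors.dense(0.0, 1.1)),
         (1.0, Vectors.dense(2.0, 1.0)),
         (0.0, Vectors.dense(0.1, 1.2)),
         (1.0, Vectors.dense(1.9, 0.9))],
        ["label", "features"])

    # Each training iteration re-reads the same rows, so Spark keeps
    # them in memory rather than hitting disk on every pass.
    model = LogisticRegression(maxIter=10).fit(data)
    print(model.coefficients, model.intercept)

    spark.stop()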
