What is Spark?
Apache Spark is a general-purpose cluster computing system for large-scale data processing.
It is one of the most powerful tools for analysing Big Data.
If you want to be a Data Scientist or work with Big Data, you should learn Apache Spark.
Apache Spark was originally developed at UC Berkeley and was later donated to the Apache Software Foundation.
In short, these are its key characteristics:
- it is a cluster computing tool
- it is a general-purpose distributed system
- it can be up to 100 times faster than Hadoop MapReduce
- it is written in Scala, a functional programming language
- it provides an API in Python
- it can be integrated with Hadoop
- it can process existing HDFS data
Why Big Data
Big Data skills are in high demand and are used at the world's biggest companies. Spark is one of the most valuable tech skills to learn. The number of Data Science jobs has been increasing rapidly (source: indeed.com).
Apache Spark vs Hadoop
Apache Spark can run programs:
- up to 100 times faster than Hadoop MapReduce in memory
- up to 10 times faster than Hadoop MapReduce on disk
Logistic regression in Hadoop and Apache Spark (source: spark.apache.org)