Category: Big Data
Apache Spark is a general-purpose cluster computing system for large-scale data processing.
Apache Spark is one of the most powerful tools for analysing Big Data.
If you want to be a Data Scientist or work with Big Data, you should learn Apache Spark.
Apache Spark was originally developed at UC Berkeley and was later donated to the Apache Software Foundation.
In short, it has these characteristics:
- It's a cluster computing tool
- a general-purpose distributed system
- up to 100 times faster than Hadoop MapReduce (in memory)
- written in the Scala functional programming language
- provides an API in Python
- can be integrated with Hadoop
- can process existing HDFS data
Big Data skills are in high demand and used at the world's biggest companies. Spark is one of the most valuable tech skills to learn, and the number of Data Science jobs has been rapidly increasing (source: indeed.com).
Apache Spark can run programs:
- up to 100 times faster than Hadoop MapReduce in memory
- up to 10 times faster than Hadoop MapReduce on disk
Figure: Logistic regression in Hadoop and Apache Spark (source: spark.apache.org)
A CSV (Comma-Separated Values) file is a file with values separated by commas. It is often used to import and export data with databases and spreadsheets.
Values are usually separated by a comma, but sometimes another character such as a semicolon is used; the separation character is called a delimiter.
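To illustrate the delimiter idea, here is a minimal sketch using Python's built-in csv module. The data and the semicolon delimiter are made-up examples:

```python
import csv
import io

# Hypothetical semicolon-delimited data (the delimiter is ';' instead of ',')
data = "name;population\nGermany;83000000\nFrance;67000000\n"

# Tell csv.reader which delimiter the file uses
rows = list(csv.reader(io.StringIO(data), delimiter=";"))
print(rows[0])  # header row: ['name', 'population']
print(rows[1])  # first data row: ['Germany', '83000000']
```

With the wrong delimiter, each line would come back as a single unsplit string, so the delimiter must match the file.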
Pandas is a data analysis library. If you work with data a lot, pandas makes the job much easier than plain Python.
Let's say you have a CSV file containing national statistics.
We can read a CSV file with a few lines of code:
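A minimal sketch with pandas; the file contents here are invented (in practice you would pass a filename such as `pd.read_csv("nations.csv")`):

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a file on disk
csv_data = """name,area,population
China,9596961,1382300000
India,3287590,1326800000
"""

# read_csv parses the text into a DataFrame with named columns
df = pd.read_csv(io.StringIO(csv_data))
print(df)
print(df["name"])  # select a single column by name
```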
Pandas works with DataFrames, which hold all the data. That's why we can access columns like df['name'].
If you don't want to use pandas, you can use the csv module to read CSV files.
This is as simple as:
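A sketch with the standard-library csv module; the data is made up, and normally you would read from a file with `open("nations.csv")`:

```python
import csv
import io

# Hypothetical file contents, same shape as the pandas example above
csv_data = "name,area,population\nChina,9596961,1382300000\nIndia,3287590,1326800000\n"

with io.StringIO(csv_data) as f:
    reader = csv.reader(f)
    rows = list(reader)  # each row becomes a list of strings

print(rows)
```

Unlike pandas, the csv module does no type conversion: every value comes back as a string.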
We can access individual cells like so:
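With the csv module, the parsed result is a list of lists, so a cell is addressed by row index and then column index. Continuing the made-up data from above:

```python
import csv
import io

csv_data = "name,area,population\nChina,9596961,1382300000\nIndia,3287590,1326800000\n"
rows = list(csv.reader(io.StringIO(csv_data)))

# rows[row_index][column_index]; row 0 is the header
print(rows[1][0])  # → China
print(rows[2][2])  # → 1326800000
```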