Apache Spark

What is Spark?

Apache Spark is a general-purpose cluster computing system for large-scale data processing.
It is one of the most powerful tools for analysing Big Data.
If you want to be a Data Scientist or work with Big Data, you should learn Apache Spark.

Related Course:
Taming Big Data with Apache Spark and Python - Hands On!

Features

Apache Spark was originally developed at UC Berkeley and was later donated to the Apache Software Foundation.
In short, it has these features:

  • It's a cluster computing tool
  • a general-purpose distributed system
  • up to 100 times faster than MapReduce
  • written in the Scala functional programming language
  • provides an API in Python (see the example after this list)
  • can be integrated with Hadoop
  • can process existing HDFS data
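
As a small illustration of the Python API (PySpark), here is a minimal sketch. It assumes the pyspark package is installed (for example via pip install pyspark) and runs Spark locally; the app name and the numbers are just placeholders.

from pyspark.sql import SparkSession

# Start a local Spark session (assumes pyspark is installed)
spark = SparkSession.builder.master('local[*]').appName('demo').getOrCreate()

# Distribute a small list of numbers across the workers
rdd = spark.sparkContext.parallelize(range(1, 101))

# Filter and sum in parallel
even_sum = rdd.filter(lambda x: x % 2 == 0).sum()
print(even_sum)

spark.stop()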

Why Big Data

Big Data skills are in high demand and are used at the world's biggest companies. Spark is one of the most valuable tech skills to learn. The number of Data Science jobs has been increasing rapidly (source: indeed.com):

[Chart: Data Science job postings over time (source: indeed.com)]

Apache Spark vs Hadoop

Apache Spark can run programs:

  • up to 100 times faster than Hadoop MapReduce in memory
  • up to 10 times faster than Hadoop MapReduce on disk

[Chart: Logistic regression in Hadoop and Apache Spark (source: spark.apache.org)]

Read CSV

A CSV (Comma Separated Values) file is a file with values separated by commas. It is often used to import and export data with databases and spreadsheets.

Values are usually separated by commas, but sometimes another character is used, such as a semicolon. The separation character is called the delimiter.
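
For example, if a file used semicolons instead of commas, you could pass the delimiter explicitly. The sketch below uses pandas (introduced in the next section) and assumes a hypothetical file named nations_semicolon.csv:

import pandas as pd

# Hypothetical semicolon-separated file; sep tells pandas which delimiter to use
df = pd.read_csv('nations_semicolon.csv', sep=';')
print(df)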

Related Course:
Data Analysis with Pandas and Python

Read CSV with Pandas

Pandas is a data analysis library. If you work with data a lot, pandas makes it much easier.
Let's say you have a CSV file containing nation statistics:

Country, Capital, Language, Currency
United States, Washington, English, US dollar
Canada, Ottawa, English and French, Canadian dollar
Germany, Berlin, German, Euro

We can read the CSV file with the following lines:

 
import pandas as pd

df = pd.read_csv('nations.csv')
print(df)
print('\n')

for country in df['Country']:
    print(country)

Pandas works with DataFrames, which hold all of the data. That's why we can access a column by its header name, like df['Country'].
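
For instance, assuming the nations.csv file from above, a single column can be selected by its header name and a single row by its position:

import pandas as pd

df = pd.read_csv('nations.csv')

# One column, selected by the name from the header row
print(df['Country'])

# One row, selected by its position
print(df.iloc[0])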

Read CSV with the csv module

If you don’t want to use pandas, you can use the csv module to read csv files.
This is as simple as:

 
import csv

with open('nations.csv') as csvfile:
    csvReader = csv.reader(csvfile, delimiter=',')
    for row in csvReader:
        print(row)

Inside the with block, we can access individual cells by their index:

 
for row in csvReader:
    print(row[0])    # first column (Country)
    print(row[1])    # second column (Capital)
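
If you prefer to look up cells by column name instead of position, the csv module also provides csv.DictReader. A minimal sketch, assuming the same nations.csv file:

import csv

with open('nations.csv') as csvfile:
    csvReader = csv.DictReader(csvfile)
    for row in csvReader:
        # Each row is a dictionary keyed by the header names
        print(row['Country'])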

