This repository provides a set of self-study tutorials on Machine Learning for big data using Apache Spark (PySpark), from basic topics (DataFrames and SQL) to advanced ones (the Machine Learning Library, MLlib), with practical real-world projects and datasets.
Note: I have tested the code on Linux. It should also run on Windows and macOS with minor changes.
- Clone the repository and navigate to the downloaded folder.

  ```shell
  git clone https://github.com/iamirmasoud/pyspark_tutorials.git
  cd pyspark_tutorials
  ```

- Create (and activate) a new environment named `spark_env` with Python 3.7. If prompted to proceed with the install (`Proceed [y]/n`), type `y`.

  ```shell
  conda create -n spark_env python=3.7.10
  source activate spark_env
  ```

  At this point your command line should look something like: `(spark_env) <User>:pyspark_tutorials <user>$`. The `(spark_env)` indicates that your environment has been activated, and you can proceed with further package installations.

- Before you can experiment with the code, you'll have to make sure that you have all the libraries and dependencies required to support this project. You will mainly need Python 3.7+ and PySpark. You can install the dependencies using:

  ```shell
  pip install -r requirements.txt
  ```

- Navigate back to the repo. (Also, your `spark_env` environment should still be activated at this point.)

  ```shell
  cd pyspark_tutorials
  ```

- Open the directory of notebooks using the command below. You'll see all files appear in your local environment; open the first notebook and follow the instructions.

  ```shell
  jupyter notebook
  ```

- Once you open any of the project notebooks, make sure you are in the correct `spark_env` environment by clicking `Kernel > Change Kernel > spark_env`.
Before running the Python scripts and Jupyter notebooks of each section, please download the necessary datasets for that section from the list below and put them in a directory called `Datasets` next to the scripts. You can find more details about each dataset in the Jupyter notebook files.
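As a quick sanity check that PySpark is installed and that the `Datasets` directory is picked up, you can run something like the following from a section's directory (the CSV file name below is only a placeholder; use whichever file the notebook asks for):

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("pyspark_tutorials").getOrCreate()

# Placeholder file name -- substitute the dataset the notebook actually uses.
df = spark.read.csv("Datasets/example.csv", header=True, inferSchema=True)

df.printSchema()
df.show(5)
```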
Datasets:
- Daily Trending YouTube Videos Dataset
- Extracted tweets from all the representatives (latest 200 as of May 17th 2018)
Datasets:
- Dataset for aggregate ratings of restaurants in Bengaluru, India, from Zomato
- New York City Taxi Trip - Hourly Weather Data
Project - Genre classification:
Have you ever wondered what makes us, humans, able to tell apart two songs of different genres? How do we inherently know the difference between a pop song and heavy metal? This type of classification may seem easy for us, but it's a very difficult challenge for a computer. So the question is: could an automatic genre classification model be possible? For this project we will be classifying songs into a set of 23 electronic genres based on a number of audio characteristics. This technology could be used by an application like Pandora to recommend songs to users or just to create meaningful channels. Super fun!
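A minimal sketch of the kind of multi-class pipeline this project builds with PySpark MLlib (the file path and column names such as `tempo`, `energy`, and `genre` are placeholders, not the actual dataset schema):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()
# Placeholder path and columns; adapt to the actual songs dataset.
songs = spark.read.csv("Datasets/genre_data.csv", header=True, inferSchema=True)

assembler = VectorAssembler(inputCols=["tempo", "energy", "danceability"], outputCol="features")
indexer = StringIndexer(inputCol="genre", outputCol="label")  # 23 genre strings -> numeric label
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)

train, test = songs.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, indexer, rf]).fit(train)

accuracy = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(model.transform(test))
print(f"Test accuracy: {accuracy:.3f}")
```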
Datasets:
Project - Kickstarter Project Success Prediction:
Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity and merchandising. The company's stated mission is to "help bring creative projects to life". Kickstarter has reportedly received more than $1.9 billion in pledges from 9.4 million backers to fund 257,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology, and food-related projects.
People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges. This model traces its roots to the subscription model of arts patronage, where artists would go directly to their audiences to fund their work.
The goal is to predict whether or not a project will succeed in raising the money it needs from its backers.
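A rough sketch of how such a success/failure classifier could look in PySpark MLlib (the file path and column names like `state`, `goal`, and `duration_days` are assumptions, not the real schema):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.getOrCreate()
# Placeholder path and columns; adapt to the actual Kickstarter dataset.
projects = spark.read.csv("Datasets/kickstarter.csv", header=True, inferSchema=True)

label = StringIndexer(inputCol="state", outputCol="label")  # e.g. successful / failed
features = VectorAssembler(inputCols=["goal", "duration_days"], outputCol="features")
gbt = GBTClassifier(featuresCol="features", labelCol="label", maxIter=50)

train, test = projects.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[label, features, gbt]).fit(train)

auc = BinaryClassificationEvaluator().evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")
```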
Datasets:
Project - Indeed Real/Fake Job Posting Prediction:
Indeed.com has just hired you to create a system that automatically flags suspicious job postings on its website. It has recently seen an influx of fake job postings that is negatively impacting its customer experience. Because of the high volume of postings it receives every day, its employees don't have the capacity to check every posting, so Indeed would like an automated system that prioritizes which postings to review before deleting them. The final task is to use the attached dataset to create an NLP algorithm which automatically flags suspicious posts for review.
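One plausible way to set this up with PySpark's built-in text features is TF-IDF plus logistic regression; a minimal sketch, assuming (hypothetically) a `description` text column and a binary `fraudulent` label:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.getOrCreate()
# Placeholder path; adapt to the attached job-postings dataset.
jobs = spark.read.csv("Datasets/fake_job_postings.csv", header=True, inferSchema=True)
jobs = jobs.na.drop(subset=["description"])  # tokenizer cannot handle null text

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="description", outputCol="words"),
    StopWordsRemover(inputCol="words", outputCol="filtered"),
    HashingTF(inputCol="filtered", outputCol="tf", numFeatures=1 << 14),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="fraudulent"),
])

train, test = jobs.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
auc = BinaryClassificationEvaluator(labelCol="fraudulent").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")
```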
Datasets:
Project - House Price Prediction in California:
Datasets:
Project - Cement Strength Prediction based on Ingredients:
You have been hired as a consultant to a cement production company that wants to improve its customer experience in a number of areas, such as recommending optimal amounts of certain ingredients in the cement-making process to customers, and perhaps even creating an application where users can input their own values and receive a predicted cement strength!
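A small sketch of the regression side of this in PySpark MLlib (the file path and ingredient column names such as `cement`, `water`, and `age` are assumptions about the dataset):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.getOrCreate()
# Placeholder path and columns; adapt to the real ingredients dataset.
concrete = spark.read.csv("Datasets/concrete.csv", header=True, inferSchema=True)

assembler = VectorAssembler(
    inputCols=["cement", "slag", "water", "superplasticizer", "age"],
    outputCol="features",
)
gbt = GBTRegressor(featuresCol="features", labelCol="strength")

train, test = concrete.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, gbt]).fit(train)

rmse = RegressionEvaluator(labelCol="strength", metricName="rmse").evaluate(model.transform(test))
print(f"Test RMSE: {rmse:.2f}")
```

A trained model along these lines could then sit behind the user-facing application mentioned above, returning a predicted strength for user-supplied ingredient amounts.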
Datasets:
Project - Customer Segmentation:
Use customer data to target marketing efforts! We could use clustering to target similar customer segments. For example, we could do some research about the groups, discover that one consists mostly of customers with a certain socioeconomic status and purchasing frequency, and offer them a cost-savings package that could be beneficial to them. How cool would that be?!
We could also learn a bit more about our clustering by computing various aggregate statistics for each of the clusters across each of the variables in our dataframe, like this:
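For instance, assuming the fitted clustering model has added a `prediction` column and the frame has (hypothetical) `income` and `purchase_frequency` columns:

```python
from pyspark.sql import functions as F

# `clustered` is the DataFrame returned by the fitted clustering model's transform();
# the column names below are placeholders for whatever variables the data actually has.
(clustered
    .groupBy("prediction")
    .agg(
        F.count("*").alias("n_customers"),
        F.round(F.avg("income"), 2).alias("avg_income"),
        F.round(F.avg("purchase_frequency"), 2).alias("avg_purchase_frequency"),
    )
    .orderBy("prediction")
    .show())
```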
Datasets:
Project - Topic Modeling for Cooking Recipes from BBC Good Food:
We will be analyzing a collection of Christmas cooking recipes scraped from BBC Good Food. We want to try to discover some additional themes amongst these recipes, imagining that we want to create our own website that provides a more intelligent tagging system for recipes pulled from multiple data sources.
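The usual PySpark route for this kind of theme discovery is LDA over a bag-of-words representation; here is a rough sketch (the file path, the `ingredients` column name, and the number of topics are all assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.getOrCreate()
# Placeholder path and column; adapt to the scraped recipes file.
recipes = spark.read.csv("Datasets/recipes.csv", header=True, inferSchema=True)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="ingredients", outputCol="words"),
    StopWordsRemover(inputCol="words", outputCol="filtered"),
    CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=5000, minDF=2),
    LDA(k=10, maxIter=20, featuresCol="features"),
])

model = pipeline.fit(recipes)
# The last stage of the fitted pipeline is the LDA model; inspect its top terms per topic.
lda_model = model.stages[-1]
lda_model.describeTopics(maxTermsPerTopic=5).show(truncate=False)
```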
Datasets:
Project - Customer Segmentation based on sales:
In this project, we try to segment customers based on sample sales data: order information, sales figures, customer details, shipping, etc.
Datasets:
Project - University Clustering for the Greater Good:
You are a data scientist employed by the ABCDE Foundation, a non-profit organization whose mission is to increase college graduation rates for underprivileged populations. Through advocacy and targeted outreach programs, ABCDE strives to identify and alleviate barriers to educational achievement. ABCDE is driven by the belief that, with the right support, an increase in college attendance and completion rates can be achieved, thereby weakening the grip of the cycles of poverty and social immobility affecting many of our communities. ABCDE is committed to developing a more data-driven approach to decision-making. As a prelude to future analyses, ABCDE has requested that you analyze the data to identify clusters of similar colleges and universities.
Your task is to use cluster analysis to identify the groups of characteristically similar schools in the dataset.
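A minimal clustering sketch with PySpark MLlib (the file path and numeric column names are placeholders for whatever school-level variables the dataset actually provides):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.getOrCreate()
# Placeholder path and columns; adapt to the actual college dataset.
colleges = spark.read.csv("Datasets/colleges.csv", header=True, inferSchema=True)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["tuition", "grad_rate", "enrollment"], outputCol="raw_features"),
    StandardScaler(inputCol="raw_features", outputCol="features", withMean=True, withStd=True),
    KMeans(k=5, featuresCol="features", seed=1),
])

clustered = pipeline.fit(colleges).transform(colleges)

# Silhouette gives a rough sense of how well-separated the clusters are.
print("Silhouette:", ClusteringEvaluator().evaluate(clustered))
clustered.groupBy("prediction").count().show()
```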
Datasets:
Project - Analyzing Participants in Personality Test:
Datasets:
Project - Market Basket Analysis:
You own a supermarket mall and, through membership cards, you have some basic data about your customers, like customer ID, age, gender, annual income, and spending score. The spending score is something you assign to each customer based on defined parameters like customer behavior and purchasing data. You want to understand which customers can easily be grouped together so that a strategy can be provided to the marketing team to plan accordingly.
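Since the description frames this as grouping customers on a few demographic and spending features, one way to sketch it in PySpark MLlib is a K-means pipeline with the categorical `gender` column encoded first (the file path and column names are assumptions, not the real schema):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()
# Placeholder path and columns; adapt to the mall-customers file.
customers = spark.read.csv("Datasets/mall_customers.csv", header=True, inferSchema=True)

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="gender", outputCol="gender_idx"),
    OneHotEncoder(inputCol="gender_idx", outputCol="gender_vec"),
    VectorAssembler(
        inputCols=["age", "annual_income", "spending_score", "gender_vec"],
        outputCol="features",
    ),
    KMeans(k=5, featuresCol="features", seed=7),
])

segments = pipeline.fit(customers).transform(customers)
segments.groupBy("prediction").count().show()
```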
Datasets: