This project analyzes the Titanic dataset to explore factors that influenced survival rates. It involves data exploration, cleaning, preprocessing, and exploratory data analysis (EDA) to uncover patterns and trends related to survival.
- Introduction
- Features
- Usage
- Dataset Overview
- Tasks Performed
- Results and Visualizations
- Contributing
- License
The Titanic disaster is one of the most infamous shipwrecks in history. This project aims to analyze the Titanic dataset to identify trends and factors that affected passenger survival. By exploring and preprocessing the dataset, we can extract meaningful insights and prepare the data for predictive modeling.
- Data Exploration:
  - Inspect dataset structure and distributions.
  - Identify correlations between features and survival.
- Data Cleaning and Preprocessing:
  - Handle missing values in `Age`, `Cabin`, and `Embarked`.
  - Encode categorical features (`Sex` and `Embarked`).
  - Create new features like `FamilySize` from `SibSp` and `Parch`.
- Exploratory Data Analysis (EDA):
  - Visualize survival trends across various features such as gender, age, and class.
  - Uncover patterns that may predict survival.
- Visualizations:
  - Generate clear and insightful plots to enhance understanding of the dataset.
- Open the Jupyter Notebook or Python script provided in the repository.
- Run the code step-by-step to explore, preprocess, and analyze the dataset.
- Modify the code to test new hypotheses or visualize additional trends.
The Titanic dataset contains information about passengers, including demographics, ticket details, and survival status.
Key features include:
- Survival: Survival indicator (0 = No, 1 = Yes).
- Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
- Sex: Gender of the passenger.
- Age: Age of the passenger.
- SibSp: Number of siblings/spouses aboard.
- Parch: Number of parents/children aboard.
- Fare: Ticket fare.
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
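
As a rough sketch of how these columns might be inspected with pandas, the snippet below builds a tiny hand-written frame with the same column names. The rows are invented for illustration and are not taken from the real dataset. The column name `Survived` (listed above as Survival) and the file name `train.csv` in the comment are assumptions based on the common Kaggle release, not something this repository confirms.

```python
import pandas as pd

# In practice the data would be read from a CSV, for example
# pd.read_csv("train.csv") (hypothetical file name).
# Invented rows are used here so the sketch runs on its own.
sample = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0],
        "Pclass": [3, 1, 3, 2],
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, None, 35.0],
        "SibSp": [1, 1, 0, 0],
        "Parch": [0, 0, 0, 0],
        "Fare": [7.25, 71.28, 7.92, 13.0],
        "Embarked": ["S", "C", None, "S"],
    }
)

# Structure and statistical summary of the numeric columns
print(sample.dtypes)
print(sample.describe())

# Count missing values per column and show only columns that have any
missing = sample.isna().sum()
print(missing[missing > 0])
```

On the real data, the same `isna().sum()` call is one way to see which columns need imputation before any modeling.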
- Inspected dataset structure and statistical summaries.
- Identified missing values and distribution patterns in key features.
- Filled missing values in `Age` using median imputation.
- Imputed missing `Embarked` values with the most frequent port.
- Encoded `Sex` and `Embarked` features into numeric format.
- Created new feature `FamilySize` by combining `SibSp` and `Parch`.
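
The cleaning steps listed here could look roughly like the following in pandas. This is a minimal sketch, not the notebook's actual code: the rows are invented, the integer codes chosen for `Sex` and `Embarked` are arbitrary, and `FamilySize` is assumed to be the plain sum of `SibSp` and `Parch` (some analyses add 1 to count the passenger as well).

```python
import pandas as pd

# Invented rows for illustration only, not real passenger records
df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, None, 35.0],
        "SibSp": [1, 1, 0, 0],
        "Parch": [0, 2, 0, 0],
        "Embarked": ["S", "C", None, "S"],
    }
)

# Median imputation for Age
df["Age"] = df["Age"].fillna(df["Age"].median())

# Fill missing Embarked values with the most frequent port
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Numeric encoding; the specific codes are an arbitrary choice for this sketch
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# FamilySize as the sum of siblings/spouses and parents/children (assumed definition)
df["FamilySize"] = df["SibSp"] + df["Parch"]

print(df)
```

Mapping to integers keeps the example short. One-hot encoding (for instance with `pd.get_dummies`) is a common alternative when the port codes should not be treated as ordered.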
- Analyzed survival rates by:
  - Gender: Higher survival rates for females.
  - Class: First-class passengers had a higher chance of survival.
  - Age: Children had better survival rates compared to adults.
- Created visualizations to highlight these trends.
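
One way such survival rates might be computed is a grouped mean over the 0/1 survival indicator. The sketch below uses invented rows, so its numbers say nothing about the real dataset. The column name `Survived` and the age cutoff of 18 for "child" are assumptions made for illustration.

```python
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0, 1, 0],
        "Pclass": [3, 1, 3, 2, 1, 3],
        "Sex": ["male", "female", "female", "male", "female", "male"],
        "Age": [22.0, 38.0, 8.0, 35.0, 54.0, 4.0],
    }
)

# The mean of a 0/1 indicator is the fraction of the group that survived
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()

# Bucket ages into child/adult; the cutoff of 18 is an assumed threshold
df["AgeGroup"] = pd.cut(
    df["Age"], bins=[0, 18, 120], labels=["child", "adult"], right=False
)
rate_by_age = df.groupby("AgeGroup", observed=True)["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
print(rate_by_age)
```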
- Gender: Female passengers had significantly higher survival rates.
- Class: Survival rates were highest among first-class passengers.
- Age: Younger passengers (children) were more likely to survive.
- Survival rates by gender, age groups, and passenger class.
- Distribution of ticket fares across survival statuses.
- Heatmaps showcasing feature correlations.
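
For readers who want to reproduce plots of this kind, the following is a hedged sketch using pandas and matplotlib. The plotting library, the invented rows, and the column name `Survived` are all assumptions; the repository's own notebook may use different tools and styling.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0, 1, 0],
        "Pclass": [3, 1, 3, 2, 1, 3],
        "Age": [22.0, 38.0, 8.0, 35.0, 54.0, 4.0],
        "Fare": [7.25, 71.28, 7.92, 13.0, 51.86, 16.7],
    }
)

# Pairwise correlations between the numeric columns
corr = df.select_dtypes("number").corr()

fig, (ax_bar, ax_heat) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: survival rate per passenger class
df.groupby("Pclass")["Survived"].mean().plot.bar(ax=ax_bar)
ax_bar.set_ylabel("Survival rate")
ax_bar.set_title("Survival rate by class")

# Heatmap: correlation matrix drawn with imshow
im = ax_heat.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax_heat.set_xticks(range(len(corr.columns)))
ax_heat.set_xticklabels(corr.columns, rotation=45, ha="right")
ax_heat.set_yticks(range(len(corr.columns)))
ax_heat.set_yticklabels(corr.columns)
ax_heat.set_title("Feature correlations")
fig.colorbar(im, ax=ax_heat)

fig.tight_layout()
# In a notebook the figure displays inline; fig.savefig(...) can write it to disk.
```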
Sample visualizations can be found in the repository.
Contributions are welcome! If you have suggestions for improvement or new analysis techniques, feel free to open an issue or submit a pull request.
This project is licensed under the MIT License. See the `LICENSE` file for details.