Project Overview
This project cleans and explores the Titanic dataset, a well-known dataset that records details of the passengers aboard the Titanic, including whether each passenger survived the sinking. The goals are to preprocess the data for analysis and to identify survival patterns across passenger demographics.
Objectives
Data Cleaning:
Fill missing values: the column median for Age and Fare, and an 'Unknown' placeholder for Cabin.
Verify the completeness of the dataset post-cleaning to ensure data integrity.
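The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the helper name clean_titanic and the four toy rows are assumptions made up for the example, and a real run would load the full dataset instead.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; they are not taken from the real dataset.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.28, np.nan, 26.55],
    "Cabin": [None, "C85", None, "C103"],
})


def clean_titanic(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Age/Fare gaps set to the column median
    and Cabin gaps set to the 'Unknown' placeholder."""
    out = frame.copy()  # leave the caller's frame untouched
    for col in ("Age", "Fare"):
        out[col] = out[col].fillna(out[col].median())
    out["Cabin"] = out["Cabin"].fillna("Unknown")
    return out


cleaned = clean_titanic(df)

# Completeness check: count the missing values left in each column.
remaining = cleaned.isna().sum()
assert remaining.sum() == 0, f"Missing values remain:\n{remaining[remaining > 0]}"
```

The median is computed from the non-missing entries of each column, so a few extreme fares or ages shift the fill value less than they would shift a mean.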
Exploratory Data Analysis (EDA):
Generate summary statistics for both numerical and categorical features to understand the dataset's structure and characteristics.
Analyze survival rates among different passenger demographics, including age, sex, and passenger class.
Visualize key findings through histograms and bar plots to effectively communicate results.
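One possible way to compute the summary statistics and survival rates listed above is shown below. Because Survived is coded 0/1, the mean of the column within a group is that group's survival rate. The toy rows and the age-band edges (12, 18, 60) are assumptions chosen for the example; the project does not specify how age is grouped.

```python
import pandas as pd

# Toy rows for illustration only; not real passenger records.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass":   [3, 1, 3, 2, 1, 3],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [34.0, 47.0, 8.0, 62.0, 27.0, 21.0],
})

# Summary statistics: describe() for numeric columns,
# value_counts() for a categorical column.
numeric_summary = df[["Age", "Pclass"]].describe()
sex_counts = df["Sex"].value_counts()

# Survival rate per group: the mean of a 0/1 column is the share of 1s.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()

# Age is continuous, so bin it first (band edges are an assumption).
age_band = pd.cut(
    df["Age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)
rate_by_age = df.groupby(age_band, observed=True)["Survived"].mean()

print(rate_by_sex, rate_by_class, rate_by_age, sep="\n\n")
```

Passing observed=True drops age bands that contain no passengers, so empty groups do not appear as NaN rows in the result.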
Dataset Description
The dataset contains 418 rows and 12 columns:
PassengerId: Unique identifier for each passenger.
Survived: Survival status (0 = No, 1 = Yes).
Pclass: Passenger class (1, 2, or 3).
Name: Name of the passenger.
Sex: Gender of the passenger.
Age: Age of the passenger in years.
SibSp: Number of siblings/spouses aboard the Titanic.
Parch: Number of parents/children aboard the Titanic.
Ticket: Ticket number.
Fare: Fare paid for the ticket.
Cabin: Cabin number (missing where none was recorded).
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
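The column description above can be turned into a lightweight sanity check to run after loading the data. The sketch below is an assumption about how such a check might look: check_schema, EXPECTED_COLUMNS, and ALLOWED_VALUES are hypothetical names, and the single toy row uses placeholder values instead of a real passenger record.

```python
import pandas as pd

# Column names and coded values as listed in the dataset description.
EXPECTED_COLUMNS = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
]
ALLOWED_VALUES = {
    "Survived": {0, 1},
    "Pclass": {1, 2, 3},
    "Embarked": {"C", "Q", "S"},
}


def check_schema(frame: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the frame
    matches the description."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col, allowed in ALLOWED_VALUES.items():
        if col in frame.columns:
            # Missing entries are ignored here; they are a cleaning concern.
            unexpected = set(frame[col].dropna().unique()) - allowed
            if unexpected:
                problems.append(f"{col} has unexpected values: {sorted(unexpected)}")
    return problems


# Placeholder row for illustration only; not a real passenger.
toy = pd.DataFrame([{
    "PassengerId": 1, "Survived": 0, "Pclass": 3, "Name": "Doe, Mr. John",
    "Sex": "male", "Age": 30.0, "SibSp": 0, "Parch": 0,
    "Ticket": "000000", "Fare": 8.05, "Cabin": None, "Embarked": "S",
}])

print(check_schema(toy))                   # no problems expected
print(check_schema(toy.assign(Pclass=4)))  # flags the out-of-range class
```

On the full dataset, a further check that df.shape equals (418, 12) would confirm the row and column counts stated above.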
Methodology
Data Preprocessing: Missing values in the 'Age' and 'Fare' columns are replaced with each column's median, and missing values in the 'Cabin' column are filled with 'Unknown'.
Statistical Analysis: Summary statistics for numerical and categorical columns are generated to provide insights into passenger demographics.
Visual Analysis: Histograms and bar plots are used to illustrate the distributions of age and fare, as well as survival rates by sex, passenger class, and port of embarkation.
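The visual analysis step might look something like the sketch below, which draws histograms for Age and Fare and bar plots of survival rate by sex, class, and port. It uses plain Matplotlib to keep the example small; Seaborn's plotting functions could be swapped in. The function name plot_overview, the figure layout, the bin count, and the toy rows are all assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows for illustration only; not real passenger records.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass":   [3, 1, 3, 2, 1, 3],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [34.0, 47.0, 8.0, 62.0, 27.0, 21.0],
    "Fare":     [7.8, 52.0, 12.5, 13.0, 30.0, 8.1],
    "Embarked": ["Q", "S", "S", "S", "C", "Q"],
})


def plot_overview(frame: pd.DataFrame):
    """Draw two histograms (Age, Fare) and three survival-rate bar plots."""
    fig, axes = plt.subplots(1, 5, figsize=(20, 4))

    # Distributions of the continuous columns.
    for ax, col in zip(axes[:2], ["Age", "Fare"]):
        ax.hist(frame[col].dropna(), bins=10)
        ax.set_title(f"{col} distribution")
        ax.set_xlabel(col)
        ax.set_ylabel("Passengers")

    # Survival rate per category: mean of the 0/1 Survived column.
    for ax, col in zip(axes[2:], ["Sex", "Pclass", "Embarked"]):
        rates = frame.groupby(col)["Survived"].mean()
        ax.bar([str(label) for label in rates.index], rates.to_numpy())
        ax.set_ylim(0, 1)
        ax.set_title(f"Survival rate by {col}")
        ax.set_ylabel("Survival rate")

    fig.tight_layout()
    return fig


fig = plot_overview(toy)
# fig.savefig("titanic_overview.png")  # uncomment to write the figure to disk
```

Fixing the y-axis of the bar plots to the range 0 to 1 keeps the three survival-rate panels directly comparable.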
Expected Outcomes
A clean dataset ready for further analysis or modeling.
Insights into survival trends by age, sex, and passenger class that can inform later survival-prediction modeling.
Tools and Libraries
Python: Programming language used for data analysis.
Pandas: Library for data manipulation and analysis.
Matplotlib & Seaborn: Libraries for data visualization.