Project 1
Online Grocery Data Analytics
This project focuses on data cleaning, exploratory analysis, data visualization, and simple statistical insights.
Description
This project analyzes transaction data from an online grocery delivery platform to understand customer behavior, evaluate coupon effectiveness, and measure sales performance. The analysis uses two datasets: orders.csv (order details such as order time, coupon use, and customer IDs) and order_products.csv (product-level purchase details including quantity, price, and customer ID). The project was implemented in a Jupyter Notebook using Python and pandas for data cleaning, transformation, and exploratory analysis.
Objective
- Gain insights into online grocery shopping behavior.
- Assess whether coupon use influences customer purchase frequency.
- Evaluate sales distribution by day of the week.
- Calculate revenue at both product and customer levels.
- Provide customer-specific purchase histories for service inquiries.
Key Tasks
- Import datasets (orders.csv, order_products.csv).
- Detect and handle missing values in df_orders.
- Compare average order frequency between coupon users and non-users.
- Count orders by day of the week.
- Compute product-level revenue and total platform revenue.
- Generate individual customer purchase reports.
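The import and missing-value steps above can be sketched in pandas. This is a minimal illustration, not the project's actual code: the column names (`order_id`, `customer_id`, `used_coupon`) and the tiny in-memory DataFrame standing in for orders.csv are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for orders.csv; the real column names may differ.
df_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [101.0, 102.0, None, 104.0],
    "used_coupon": [True, False, True, None],
})

# Detect missing values per column.
missing_counts = df_orders.isna().sum()

# Two of the cleaning strategies used in the project:
df_orders["customer_id"] = df_orders["customer_id"].fillna(-1)  # placeholder ID
df_orders = df_orders.dropna(subset=["used_coupon"])            # drop incomplete rows
```

The choice between placeholders, imputation, and dropping rows depends on whether the column is an identifier, a numeric measure, or essential to the analysis.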
Approach
- Data Import & Inspection – Loaded the two datasets into pandas DataFrames and reviewed their structure.
- Data Cleaning – Handled missing values by replacing IDs with placeholders, imputing averages, or dropping incomplete rows.
- Exploratory Analysis – Grouped and aggregated data to examine trends (e.g., sales per weekday, coupon usage effects).
- Revenue Calculation – Created a new revenue column (quantity × unit_price) to compute total revenue.
- Customer Inquiry Simulation – Filtered records for a given customer ID to retrieve purchase history and total spending.
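The revenue-calculation step described above amounts to a derived column plus aggregation. The sketch below uses hypothetical order_products records (column names and values are assumptions, not the project's data):

```python
import pandas as pd

# Hypothetical order_products records; the real schema may differ.
df_products = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "product":     ["milk", "bread", "eggs", "milk"],
    "quantity":    [2, 1, 12, 3],
    "unit_price":  [1.50, 2.00, 0.25, 1.50],
})

# Revenue column: quantity x unit_price.
df_products["revenue"] = df_products["quantity"] * df_products["unit_price"]

total_revenue = df_products["revenue"].sum()                         # platform total
revenue_by_product = df_products.groupby("product")["revenue"].sum() # per product
```

The same groupby pattern supports the weekday analysis, e.g. grouping orders by a day-of-week column extracted from the order timestamp.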
Results
- Coupon Analysis – Customers using coupons showed different average gaps between orders compared to non-coupon users, helping assess coupon effectiveness.
- Order Trends – Sales distribution revealed peak days of the week for customer activity.
- Revenue Insights – The revenue column enabled detailed analysis of product-level and total revenue.
- Customer Service Tool – The inquiry feature successfully generated purchase history and total spending for a specific customer ID.
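The customer-service inquiry described above reduces to filtering on a customer ID and summing revenue. A minimal sketch, with a hypothetical `purchase_report` helper and made-up data (neither is from the project):

```python
import pandas as pd

# Hypothetical product-level data; the real schema may differ.
df_products = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "product":     ["milk", "bread", "eggs"],
    "quantity":    [2, 1, 12],
    "unit_price":  [1.50, 2.00, 0.25],
})
df_products["revenue"] = df_products["quantity"] * df_products["unit_price"]

def purchase_report(df, customer_id):
    """Filter one customer's records and total their spending."""
    history = df[df["customer_id"] == customer_id]
    return history, history["revenue"].sum()

history, total_spent = purchase_report(df_products, 101)
```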
This PDF was generated from a Jupyter Notebook containing both Python code and outputs. While the PDF shows a static version of the work, the interactive .ipynb file allows you to run the code, explore outputs dynamically, and make modifications for further analysis.
Note: This Jupyter Notebook is best viewed in Jupyter or JupyterLab.
Download the file from the Dropbox link and open it in your Python environment.
Project 2
Predictive Modeling: House Prices & Cancer Survival
This project focuses on statistical learning, supervised learning (regression and classification), model evaluation, cross-validation, and the bias-variance trade-off.
Description
This project applies predictive data analytics methods to two real-world scenarios: house price prediction and cancer survival analysis. Using datasets of housing features and patient records, the project implements machine learning models in Python to perform regression and classification tasks. The workflow involves data preprocessing, feature engineering, model training, evaluation, and prediction on new cases.
Objective
- Build and evaluate a linear regression model to predict house prices.
- Develop and compare logistic regression and 5-Nearest-Neighbour (5NN) classifiers for predicting cancer patient survival.
- Assess model performance using statistical metrics and interpretability of coefficients.
Key Tasks
House Prices Dataset
- Import and describe dataset.
- Encode categorical variable (Neighborhood).
- Compute correlation matrix with Price.
- Create Age feature and train Linear Regression model.
- Evaluate model using RMSE and R².
- Predict prices for three new properties.
Cancer Survival Dataset
- Import and describe dataset.
- Encode categorical variables (Smoking Status, Cancer Stage, Treatment Type).
- Train Logistic Regression classifier.
- Evaluate with confusion matrix, precision, recall, and accuracy.
- Predict survival probabilities for two patients.
- Repeat classification with 5NN and compare results.
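The house-price tasks above follow a standard sklearn workflow: encode the categorical variable, engineer an Age feature from Year Built, fit a linear model, and score it. A minimal sketch on made-up data (column names, values, and the reference year are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical housing data; real house_prices.csv columns/values may differ.
df = pd.DataFrame({
    "SquareFeet":   [1400, 1800, 2400, 1200, 2000, 1600],
    "Rooms":        [3, 4, 5, 2, 4, 3],
    "Neighborhood": ["A", "B", "A", "C", "B", "C"],
    "YearBuilt":    [1995, 2005, 2015, 1980, 2010, 2000],
    "Price":        [200000, 280000, 390000, 150000, 330000, 240000],
})

df["Age"] = 2024 - df["YearBuilt"]                 # engineered "house age" feature
df = pd.get_dummies(df, columns=["Neighborhood"])  # one-hot encode the category

X = df.drop(columns=["Price", "YearBuilt"])
y = df["Price"]

model = LinearRegression().fit(X, y)
pred = model.predict(X)
rmse = np.sqrt(mean_squared_error(y, pred))        # error in price units
r2 = r2_score(y, pred)                             # share of variance explained
```

In practice the model would be scored on a held-out test split rather than the training data shown here.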
Approach
- Data Import & Exploration – Loaded datasets (house_prices.csv, cancer_data.csv) and examined descriptive statistics.
- Data Preprocessing – Applied one-hot encoding to categorical variables, handled feature transformations (e.g., creating “house age” from Year Built).
- Feature Correlation & Selection – Checked relationships with target variables (Price for houses, Survival for patients).
- Model Building
  - Linear Regression for predicting house prices.
  - Logistic Regression and 5NN Classifier for survival prediction.
- Model Evaluation – Used RMSE, R² (for regression) and confusion matrix, precision, recall, accuracy (for classification).
- Prediction on New Data – Predicted housing prices for three sample properties and survival probabilities for two hypothetical patients.
- Model Comparison – Compared logistic regression vs. 5NN performance to assess trade-offs in interpretability vs. accuracy.
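The classifier comparison above can be sketched with sklearn: fit both models on the same encoded features and collect the same metrics for each. The patient records and field names below are hypothetical, not taken from cancer_data.csv:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical patient records; real cancer_data.csv fields may differ.
df = pd.DataFrame({
    "TumorSize": [1.2, 3.4, 2.1, 5.0, 0.8, 4.2, 2.9, 1.5],
    "Smoking":   ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No"],
    "Stage":     ["I", "III", "II", "IV", "I", "III", "II", "I"],
    "Survived":  [1, 0, 1, 0, 1, 0, 1, 1],
})
df = pd.get_dummies(df, columns=["Smoking", "Stage"])  # encode categoricals

X = df.drop(columns=["Survived"])
y = df["Survived"]

results = {}
for name, model in [
    ("LogisticRegression", LogisticRegression(max_iter=1000)),
    ("5NN", KNeighborsClassifier(n_neighbors=5)),
]:
    pred = model.fit(X, y).predict(X)
    results[name] = {
        "confusion_matrix": confusion_matrix(y, pred),
        "accuracy": accuracy_score(y, pred),
        "precision": precision_score(y, pred, zero_division=0),
        "recall": recall_score(y, pred, zero_division=0),
    }
```

Logistic regression additionally exposes `model.coef_`, which is what makes its predictions interpretable relative to 5NN.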
Results
- House Price Prediction
  - The regression model successfully identified size, number of rooms, and neighborhood as strong predictors of price.
  - Model performance (R² and RMSE) provided a reasonable level of accuracy, enabling practical price predictions for new listings.
- Cancer Survival Prediction
  - Logistic Regression offered interpretable coefficients, showing how smoking status, tumor size, and cancer stage impact survival.
  - 5NN provided competitive accuracy, but with lower interpretability compared to Logistic Regression.
  - Model comparison highlighted trade-offs: Logistic Regression for explainability vs. 5NN for flexibility in complex patterns.