Predicting Waiter Tips with AI
Imagine you're at a restaurant, and you're trying to guess how much tip the waiter might get based on the total bill, time of day, customer gender, or even smoking preference.
Sounds like a fun challenge, right? Welcome to the Waiter Tips Prediction project!
In this hands-on machine learning mini-project, we’ll dive into a real-world dataset collected from a restaurant and try to predict the amount of tip a customer is likely to leave. This isn’t just about numbers; it’s about uncovering hidden patterns in human behavior and building a model that learns from data.
We’ll use popular Python libraries like Pandas, Seaborn, and Scikit-learn to explore the data, visualize trends, and train a regression model that makes smart predictions. Whether you're a beginner exploring data science or someone looking to practice ML with fun datasets, this project is for you.
By the end of this project, you’ll understand:
How to prepare and explore a dataset
How tipping patterns change with various factors
How to build and evaluate a predictive ML model
How regression helps in solving real-life forecasting problems

So grab your digital notepad and let’s get ready to turn dinner bills into data-driven insights!
Let's get started:
1. Importing Libraries:
numpy and pandas help with numerical and tabular data processing.
seaborn and matplotlib.pyplot are used for data visualization.
%matplotlib inline is a Jupyter magic command that makes plots render directly inside the notebook (on Kaggle, inline plotting is usually enabled by default, which is why the code below omits it).
warnings.filterwarnings("ignore") hides warning messages to keep the output clean.
2. Loading the Dataset:
The famous tips dataset (also bundled with Seaborn as sns.load_dataset('tips')) contains information about restaurant bills, customer gender, smoking status, and tips. Here we load an extended Kaggle version of it from a CSV file with pd.read_csv(), which adds a few extra columns such as payer and payment details.
3. Viewing the Data:
df.head() displays the first five rows of the dataset to give a quick look at the structure and types of data you'll be working with.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('/kaggle/input/tips-dataset/tips.csv')
df.head()
Output:
The columns `Payer Name`, `CC Number`, and `Payment ID` contain personally identifiable information (PII) or transactional details that are irrelevant to the analysis and pose privacy/security risks, so we remove them from the dataset.
We also drop `price_per_person`, which is likely a derived column (roughly `total_bill / size`); keeping it would add redundancy and could introduce multicollinearity into a machine learning model.
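A quick way to confirm that redundancy (a small optional check, assuming the Kaggle CSV column names shown above) is to compare the column against the recomputed value; the difference should be close to zero apart from rounding:
# Sanity check: price_per_person should roughly equal total_bill / size
diff = (df['price_per_person'] - df['total_bill'] / df['size']).abs()
print(diff.max())  # close to 0 (up to rounding) => the column is derived/redundant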
So let's remove those columns:
The code removes four columns from the DataFrame `df`:
- `Payer Name`
- `CC Number`
- `Payment ID`
- `price_per_person`
Result
After dropping these columns, the DataFrame retains only the core features relevant for analysis:
- `total_bill`, `tip`, `sex`, `smoker`, `day`, `time`, `size`
Code:
df = df.drop(['Payer Name','CC Number','Payment ID','price_per_person'],axis=1)
df.head()
Output:
Now we have to perform label encoding on categorical columns in the DataFrame `df`, converting text-based categories into numerical values for machine learning compatibility.
Why is this done?
1. Label Encoding Purpose:
Machine learning algorithms require numerical input. Label encoding converts categorical text data (e.g., `Male/Female`, `Yes/No`) into integers.
2. Column-Specific Logic:
- `sex`: Binary encoding (`Male=1`, `Female=0`).
- `smoker`: Binary encoding (`Yes=1`, `No=0`).
- `day`: Assigns numerical values to days (e.g., `Sun=4`, `Sat=3`).
- `time`: Maps meal times to integers (`Dinner=2`, `Lunch=1`).
Code:
df.sex = df.sex.replace(['Male','Female'],[1,0])
df.smoker = df.smoker.replace(['Yes','No'],[1,0])
df.day = df.day.replace(['Sun', 'Sat', 'Thur', 'Fri'],[4,3,1,2])
df.time = df.time.replace(['Dinner','Lunch'],[2,1])
df.head()
Output:
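As an aside, the same encoding can be written with an explicit mapping dict and `map()`. This is only a sketch of an alternative; use it instead of the `replace()` calls above (not after them, since the columns are already numeric at that point):
# Alternative label encoding with map() and an explicit dict per column
encodings = {
    'sex':    {'Male': 1, 'Female': 0},
    'smoker': {'Yes': 1, 'No': 0},
    'day':    {'Thur': 1, 'Fri': 2, 'Sat': 3, 'Sun': 4},
    'time':   {'Lunch': 1, 'Dinner': 2},
}
for col, mapping in encodings.items():
    df[col] = df[col].map(mapping)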
Next, let's check distribution plots (histograms + kernel density estimates) for the `total_bill` and `tip` columns to analyze their spread and patterns.
We loop through the `total_bill` and `tip` columns and plot each distribution.
Concretely, the code below does the following:
1. Figure Setup:
- Creates a large canvas (`15x9 inches`) to display plots clearly.
- Uses a 2-row, 3-column grid for subplots (though only 2 plots are generated).
2. Loop Logic:
- Iterates over the columns `total_bill` and `tip`.
- For each column:
- Creates a subplot in position `1` and `2` of the grid.
- Uses `sns.distplot()` (deprecated in newer Seaborn versions; `sns.histplot(df[col], kde=True)` is the modern equivalent) to plot:
- Histogram: Shows frequency of values in bins.
- KDE (Kernel Density Estimate): Smooth curve estimating the probability density.
3. Output:
- Two distribution plots displayed side-by-side.
Code:
plt.subplots(figsize=(15,9))
for i, col in enumerate(['total_bill','tip']):
    plt.subplot(2,3,i+1)
    sns.distplot(df[col])  # on newer Seaborn, use: sns.histplot(df[col], kde=True)
plt.tight_layout()
plt.show()
Output:
1. `total_bill` Distribution:
- Range: Most bills are between \$10–\$40, with a few outliers up to \$60.
- Peak: Highest density around \$15–\$20.
- Shape: Right-skewed (common in financial data).
2. `tip` Distribution:
- Range: Tips mostly fall between \$1–\$5 with rare tips up to \$10.
- Peak: Most frequent tip amount is \$2–\$3
- Shape: Also right-skewed.
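If you want a number to back up the "right-skewed" description, a quick optional check is Pandas' built-in skewness:
# Positive skewness values confirm the right skew seen in the plots
print(df[['total_bill', 'tip']].skew())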
Let’s generate a correlation heatmap to visualize relationships between numerical features in the dataset.
How do we compute and visualize the correlation matrix?
1. `df.corr()`:
Computes Pearson correlation coefficients between all numerical columns. Values range from -1 (perfect inverse) to 1 (perfect positive).
2. Heatmap Visualization:
- Colors (plasma colormap) and numeric annotations show each correlation.
- Darker colors correspond to lower correlation values; brighter colors to higher values.
Code:
# Now check the correlations
corr = df.corr()
plt.figure(figsize=(15,9))
sns.heatmap(corr,annot=True,cbar=True,cmap='plasma')
plt.show()
Output:
Key Observations from the Output:
| Feature Pair | Correlation | Interpretation |
|---|---|---|
| `total_bill` vs. `size` | 0.60 | Moderate positive correlation. Larger groups tend to have higher bills. |
| `tip` vs. `total_bill` | 0.46 | Tips increase with bill amount (expected). |
| `day` vs. `time` | 0.87 | Extremely high correlation. |
| `sex` vs. `day` | 0.23 | Weak correlation. Minimal relationship. |
Actionable Insights:
1. Feature Selection:
- Prioritize `total_bill` and `size` for tip prediction (strongest correlations).
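To see these rankings at a glance, you can pull out just the correlations with the target (a small optional helper):
# Correlation of every feature with the tip column, strongest first
print(df.corr()['tip'].drop('tip').sort_values(ascending=False))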
Now we’ll explore how to visualize relationships between numerical data using scatter plots in Python. We’ll analyze a dataset comparing restaurant bills (total_bill) to tips (tip) to see if there’s a pattern. By the end, you’ll understand how to create, customize, and interpret scatter plots using matplotlib. Let’s dive in!
1. plt.scatter(df.total_bill, df.tip)
What it does: Creates a scatter plot using two columns from the DataFrame (df):
df.total_bill: X-axis values (independent variable).
df.tip: Y-axis values (dependent variable).
Why scatter plot? To visualize the relationship between two numerical variables (e.g., correlation).
2. plt.title('TOTAL BILL vs TIPS')
Adds a title to the plot for context.
3. plt.xlabel('TOTAL BILL') and plt.ylabel('TIPS')
Labels the X-axis ("TOTAL BILL") and Y-axis ("TIPS") for clarity.
4. plt.show()
Displays the plot (required in non-Jupyter environments).
Code:
plt.scatter(df.total_bill,df.tip)
plt.title('TOTAL BILL vs TIPS')
plt.xlabel('TOTAL BILL')
plt.ylabel('TIPS')
plt.show()
Output:
The scatter plot shows:
X-axis (Total Bill): Ranges from roughly \$10–\$50.
Y-axis (Tips): Ranges from roughly \$2–\$10.
Key Observations:
As the total bill increases, tips generally increase (positive correlation).
Some outliers exist (e.g., high bills with low tips or vice versa).
Common Questions
Q: Why not use a line plot?
A: Scatter plots show individual data points; line plots imply continuity.
Q: How to change point colors?
A: Add c='red' in plt.scatter().
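For example, a minimal customization sketch based on the hint above:
# Same scatter plot, with red, semi-transparent points
plt.scatter(df.total_bill, df.tip, c='red', alpha=0.6)
plt.title('TOTAL BILL vs TIPS')
plt.xlabel('TOTAL BILL')
plt.ylabel('TIPS')
plt.show()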
Now let's dive into the Train-Test Split and Model Evaluation phase.
Before building any machine learning model, it's crucial to split our data into training and testing sets. This helps us evaluate how well our model generalizes to unseen data. Today, we'll:
1. Split the dataset into features (everything except `tip`) and the target (`tip`).
2. Scale features for better model performance.
3. Train 13 regression models and compare their performance using R² scores.
Code:
#Splitting the data into train and test
x = df.drop(['tip'],axis=1)
y = df.tip
#train test split
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
#feature scaling
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
x_train_scaled = ss.fit_transform(x_train)
x_test_scaled = ss.transform(x_test)
#model selection
from sklearn.linear_model import LinearRegression,Ridge,Lasso,ElasticNet
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor,AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from catboost import CatBoostRegressor
import lightgbm as lgbm
from sklearn.gaussian_process import GaussianProcessRegressor
lr = LinearRegression()
r = Ridge()
l = Lasso()
en = ElasticNet()
rf = RandomForestRegressor()
gb = GradientBoostingRegressor()
adb = AdaBoostRegressor()
xgb = XGBRegressor()
knn = KNeighborsRegressor()
svr = SVR()
cat = CatBoostRegressor()
lgb = lgbm.LGBMRegressor()
gpr = GaussianProcessRegressor()
#Fittings
lr.fit(x_train_scaled,y_train)
r.fit(x_train_scaled,y_train)
l.fit(x_train_scaled,y_train)
en.fit(x_train_scaled,y_train)
rf.fit(x_train_scaled,y_train)
gb.fit(x_train_scaled,y_train)
adb.fit(x_train_scaled,y_train)
xgb.fit(x_train_scaled,y_train)
knn.fit(x_train_scaled,y_train)
svr.fit(x_train_scaled,y_train)
cat.fit(x_train_scaled,y_train)
lgb.fit(x_train_scaled,y_train)
gpr.fit(x_train_scaled,y_train)
#preds
lrpred = lr.predict(x_test_scaled)
rpred = r.predict(x_test_scaled)
lpred = l.predict(x_test_scaled)
enpred = en.predict(x_test_scaled)
rfpred = rf.predict(x_test_scaled)
gbpred = gb.predict(x_test_scaled)
adbpred = adb.predict(x_test_scaled)
xgbpred = xgb.predict(x_test_scaled)
knnpred = knn.predict(x_test_scaled)
svrpred = svr.predict(x_test_scaled)
catpred = cat.predict(x_test_scaled)
lgbpred = lgb.predict(x_test_scaled)
gprpred = gpr.predict(x_test_scaled)
#Evaluations
from sklearn.metrics import r2_score,mean_absolute_error
lrr2 = r2_score(y_test,lrpred)
rr2 = r2_score(y_test,rpred)
lr2 = r2_score(y_test,lpred)
enr2 = r2_score(y_test,enpred)
rfr2 = r2_score(y_test,rfpred)
gbr2 = r2_score(y_test,gbpred)
adbr2 = r2_score(y_test,adbpred)
xgbr2 = r2_score(y_test,xgbpred)
knnr2 = r2_score(y_test,knnpred)
svrr2 = r2_score(y_test,svrpred)
catr2 = r2_score(y_test,catpred)
lgbr2 = r2_score(y_test,lgbpred)
gprr2 = r2_score(y_test,gprpred)
print('LINEAR REG ',lrr2)
print('RIDGE ',rr2)
print('LASSO ',lr2)
print('ELASTICNET',enr2)
print('RANDOM FOREST ',rfr2)
print('GB',gbr2)
print('ADABOOST',adbr2)
print('XGB',xgbr2)
print('KNN',knnr2)
print('SVR',svrr2)
print('CAT',catr2)
print('LIGHTGBM',lgbr2)
print('GAUSSIAN PROCESS',gprr2)
Output:
LINEAR REG 0.44293996874898933
RIDGE 0.44391863717160485
LASSO -0.15896098636013822
ELASTICNET 0.24009610844714901
RANDOM FOREST 0.30054654155503713
GB 0.3316197984256388
ADABOOST 0.2440881620509796
XGB 0.18081183850948024
KNN 0.39596009597821025
SVR 0.39878682459820713
CAT 0.38644351719711245
LIGHTGBM 0.37694716798251826
GAUSSIAN PROCESS -592939.1876093106
Output Interpretation:
The R² scores show:
Best Performers:
- Ridge Regression (0.44): Slightly better than Linear Regression (0.44).
- SVR (0.40) & KNN (0.39): Decent for nonlinear patterns.
Worst Performers:
- Lasso (-0.16): Poor fit (likely oversimplified the data).
- Gaussian Process (-592k): Failed catastrophically (unsuitable for this small dataset).
Takeaway:
- Linear models (Ridge/Linear) work best here, suggesting a roughly linear trend between `total_bill` and `tip`.
- Tree-based models (Random Forest, XGBoost) underperformed, likely due to limited data or simple patterns.
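As a side note, the thirteen fit/predict/score blocks above can be condensed into a single loop over a dict of models; here is a sketch with a subset of them, reusing the estimators and imports already defined:
# Fit, predict, and score several models in one loop
models = {'Linear': lr, 'Ridge': r, 'Lasso': l, 'Random Forest': rf, 'SVR': svr}
for name, model in models.items():
    model.fit(x_train_scaled, y_train)
    print(name, r2_score(y_test, model.predict(x_test_scaled)))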
Discussion Questions
1. Why did scaling improve some models (e.g., SVR) but not others (e.g., Random Forest)?
- Hint: Tree-based models are scale-invariant.
2. How could we improve the R² score?
- Hint: Feature engineering, more data, hyperparameter tuning (one concrete option is sketched below).
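To make the feature-engineering hint concrete, here is one possible sketch (not a tuned solution): add degree-2 interaction and squared terms in a pipeline, then refit Ridge.
# Feature engineering sketch: polynomial/interaction features + Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

poly_ridge = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           StandardScaler(),
                           Ridge())
poly_ridge.fit(x_train, y_train)  # the pipeline handles scaling itself
print('Ridge + degree-2 features R2:', poly_ridge.score(x_test, y_test))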
Ridge Regression vs. Prediction: Explained!
Creating a scatter plot that compares the actual tips (`Ground Truth`) vs. tips predicted by the Ridge Regressor (`Prediction`). Let’s break it down:
Code:
plt.scatter(y_test,rpred)
plt.title('RIDGE REGRESSOR VS PREDICTION')
plt.xlabel('Ground Truth')
plt.ylabel('Prediction')
plt.show()
Output:
What the Plot Shows
- X-axis (`Ground Truth`): Real tip amounts from the test set (1, 2, 3, 4, 5).
- Y-axis (`Prediction`): Tips predicted by Ridge Regression (1, 2, 3, 4).
- Ideal Scenario: Points should fall on a diagonal line (perfect predictions).
Key Observations
1. Underestimation:
- For high tips (e.g., true tip=5), the model predicts lower (e.g., 4).
2. Decent Fit for Mid-Range:
- Predictions are close for tips in the \$1–\$3 range.
3. Ridge’s R² Score: 0.44 (from earlier output).
- Interpretation: about 44% of the variance in tips is explained by the model. Not great, but better than the other models we tried!
Discussion Questions
1. Why might Ridge underestimate high tips?
- Hint: Maybe high tips are rare in the data, so the model "plays it safe."
2. How could we improve predictions for extreme values?
- Hint: Collect more high-tip examples or try nonlinear models.
Quiz Time! 🎯
1. What does a point at (3, 2) mean?
- A) Model perfectly predicted a $3 tip.
- B) Model predicted $2 for a $3 tip.
- Answer: B (It underestimated by $1!)
2. If all points lie on y=x, the R² score is?
- A) 0
- B) 1
- Answer: B (Perfect predictions!)
Fun Facts & Did You Know?
- Fact 1: Ridge Regression adds a "penalty" to prevent overfitting, like a teacher gently correcting a student’s wild guesses!
- Fact 2: The word "regression" comes from Francis Galton’s study of pea sizes, where he noticed "regression to the mean." 🌱
Activity
- Plot Other Models:
Compare plots for Lasso/Random Forest. Which looks closest to y=x?
Want to visualize the perfect prediction line (y=x) on the plot? Try adding:
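One possible sketch (reusing `y_test` and `rpred` from above):
# Overlay the ideal y = x line on the Ground Truth vs. Prediction scatter
lims = [min(y_test.min(), rpred.min()), max(y_test.max(), rpred.max())]
plt.scatter(y_test, rpred)
plt.plot(lims, lims, 'r--', label='Perfect prediction (y = x)')
plt.xlabel('Ground Truth')
plt.ylabel('Prediction')
plt.legend()
plt.show()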
Is the Ridge Model Overfitting or Underfitting?
Let’s Investigate!
We will use cross-validation to check if our Ridge Regression model generalizes well or if it’s overfitting/underfitting.
Code:
#(TO CHECK IF THE MODEL HAS OVERFITTED OR UNDERFITTED)
from sklearn.model_selection import cross_val_score
cross_val = cross_val_score(estimator=r,X=x_train_scaled,y=y_train)
print('Cross Val Acc Score of RIDGE model is ---> ',cross_val)
print('\n Cross Val Mean Acc Score of RIDGE model is ---> ',cross_val.mean())
Output:
Cross Val Acc Score of RIDGE model is ---> [-0.1145362 0.46443359 0.57347271 0.55080537 0.12219202]

Cross Val Mean Acc Score of RIDGE model is ---> 0.3192735010167972
What’s Happening?
1. `cross_val_score`:
- Splits `x_train_scaled` into 5 folds (default).
- Trains Ridge on 4 folds, tests on the 5th, and repeats for all folds.
- Returns R² scores (not accuracy, a common misconception!) for each fold.
2. Output:
- Fold scores: `[-0.11, 0.46, 0.57, 0.55, 0.12]` (varies widely!).
- Mean score: 0.32 (low, but better than negative values!).
Overfitting?
- ❌ No. Overfitting would show high training scores but low test scores (not observed here).
Underfitting?
- ✅ Likely! Low mean score (0.32) suggests the model is too simple to capture patterns.
High Variance?
- Scores range from -0.11 to 0.57, indicating inconsistent performance across folds.
Key Insight:
- Ridge isn’t overfitting, but it’s underfitting (or the data is noisy).
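A quick optional cross-check of this claim: compare train and test R² for the fitted Ridge model. A large gap would point to overfitting; two similarly low scores point to underfitting.
# Train vs. test R² for the Ridge model fitted earlier
print('Train R2:', r.score(x_train_scaled, y_train))
print('Test  R2:', r.score(x_test_scaled, y_test))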
Discussion Questions
1. Why might Fold 1 score be negative?
- Hint: Negative R² means the model is worse than predicting the mean tip!
2. How could we improve the Ridge model’s consistency?
- Hint: Try adjusting `alpha` (penalty strength) or adding more features.
Quiz Time! 🎯
1. If all fold scores were 0.9, the model is:
- A) Overfitting
- B) Generalizing well
- Answer: B (High and consistent scores = good fit!)
2. Cross-validation helps avoid:
- A) Underfitting
- B) Data leakage
- Answer: B (It ensures we don’t cheat by peeking at test data!)
Fun Facts & Did You Know?
- Fact 1: Negative R² scores can happen! It’s like a weatherman predicting snowfall in a desert. ❄️🏜️
- Fact 2: Ridge’s `alpha` is like a "regularization knob": turn it up to simplify the model, down to fit quirks. 🔧
Activity Idea
- Tune `alpha`: Try `Ridge(alpha=10)` or `Ridge(alpha=0.1)` and rerun cross-validation. Does the mean score improve?
# Example: Test a stronger penalty
r_tuned = Ridge(alpha=10)
cross_val_score(r_tuned, x_train_scaled, y_train).mean()
Visualize: Plot `alpha` vs. mean R² to find the sweet spot!
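A sketch of that `alpha` vs. mean R² plot (reusing `cross_val_score`, `Ridge`, and the scaled training data from above):
# Sweep a few alpha values and plot the mean cross-validated R² for each
alphas = [0.01, 0.1, 1, 10, 100]
mean_scores = [cross_val_score(Ridge(alpha=a), x_train_scaled, y_train).mean()
               for a in alphas]
plt.plot(alphas, mean_scores, marker='o')
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('Mean CV R²')
plt.show()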
Conclusion
Despite using a relatively small dataset, our analysis uncovered meaningful insights that align with real-world observations about tipping behavior. The trends we identified, such as the correlation between total bill amount and tips, reflect patterns commonly seen in practice. However, with a larger and more diverse dataset, we could delve deeper into the nuances of this relationship, uncovering hidden patterns and potentially more complex dynamics influenced by factors like dining time, party size, or server demographics. This project highlights the power of data science to extract actionable insights even from limited data, while also emphasizing the value of robust datasets for building more accurate and generalizable models. Future work could expand the analysis to include additional features or employ advanced techniques to further refine our understanding of tipping behavior.
Key Takeaways:
Small datasets can yield valid insights but have limitations.
Larger datasets enable deeper pattern recognition and stronger conclusions.
Real-world applicability depends on data quality and model tuning.