Predicting Waiter Tips with AI
Imagine you're at a restaurant, and you're trying to guess how much tip the waiter might get based on the total bill, time of day, customer gender, or even smoking preference.
Sounds like a fun challenge, right? Welcome to the Waiter Tips Prediction project!
In this hands-on machine learning mini-project, we’ll dive into a real-world dataset collected from a restaurant and try to predict the amount of tip a customer is likely to leave. This isn’t just about numbers; it’s about uncovering hidden patterns in human behavior and building a model that learns from data.
We’ll use popular Python libraries like Pandas, Seaborn, and Scikit-learn to explore the data, visualize trends, and train a regression model that makes smart predictions. Whether you're a beginner exploring data science or someone looking to practice ML with fun datasets, this project is for you.
By the end of this project, you’ll understand:
How to prepare and explore a dataset
How tipping patterns change with various factors
How to build and evaluate a predictive ML model
How regression helps in solving real-life forecasting problems

So grab your digital notepad and let’s get ready to turn dinner bills into data-driven insights!
Let's get started:
1. Importing Libraries:
numpy and pandas help with numerical and tabular data processing.
seaborn and matplotlib.pyplot are used for data visualization.
%matplotlib inline is a Jupyter magic command that makes plots render directly inside the notebook (on Kaggle, inline plotting is usually enabled by default, which is why the code below omits it).
warnings.filterwarnings("ignore") hides warning messages to keep the output clean.
2. Loading the Dataset:
The famous tips dataset (also bundled with Seaborn as sns.load_dataset('tips')) contains information about restaurant bills, customer gender, smoking status, and tips. Here we load an extended Kaggle version of it from a CSV file with pd.read_csv(), which adds a few extra columns such as payer and payment details.
3. Viewing the Data:
df.head() displays the first five rows of the dataset to give a quick look at the structure and types of data you'll be working with.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('/kaggle/input/tips-dataset/tips.csv')
df.head()
Output:
The columns `Payer Name`, `CC Number`, and `Payment ID` contain personally identifiable information (PII) or transactional details that are irrelevant to the analysis and pose privacy/security risks, so we remove them from the dataset.
We also drop `price_per_person`, which is likely a derived column (roughly `total_bill / size`); keeping it would add redundancy and could introduce multicollinearity into a machine learning model.
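A quick way to confirm that redundancy (a small optional check, assuming the Kaggle CSV column names shown above) is to compare the column against the recomputed value; the difference should be close to zero apart from rounding:
# Sanity check: price_per_person should roughly equal total_bill / size
diff = (df['price_per_person'] - df['total_bill'] / df['size']).abs()
print(diff.max())  # close to 0 (up to rounding) => the column is derived/redundant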
So let's remove those columns:
The code removes four columns from the DataFrame `df`:
- `Payer Name`
- `CC Number`
- `Payment ID`
- `price_per_person`
Result
After dropping these columns, the DataFrame retains only the core features relevant for analysis:
- `total_bill`, `tip`, `sex`, `smoker`, `day`, `time`, `size`
Code:
df = df.drop(['Payer Name','CC Number','Payment ID','price_per_person'],axis=1)
df.head()
Output:
Now we have to perform label encoding on categorical columns in the DataFrame `df`, converting text-based categories into numerical values for machine learning compatibility.
Why is this done?
1. Label Encoding Purpose:
Machine learning algorithms require numerical input. Label encoding converts categorical text data (e.g., `Male/Female`, `Yes/No`) into integers.
2. Column-Specific Logic:
- `sex`: Binary encoding (`Male=1`, `Female=0`).
- `smoker`: Binary encoding (`Yes=1`, `No=0`).
- `day`: Assigns numerical values to days (e.g., `Sun=4`, `Sat=3`).
- `time`: Maps meal times to integers (`Dinner=2`, `Lunch=1`).
Code:
df.sex = df.sex.replace(['Male','Female'],[1,0])
df.smoker = df.smoker.replace(['Yes','No'],[1,0])
df.day = df.day.replace(['Sun', 'Sat', 'Thur', 'Fri'],[4,3,1,2])
df.time = df.time.replace(['Dinner','Lunch'],[2,1])
df.head()
Output:
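As an aside, the same encoding can be written with an explicit mapping dict and `map()`. This is only a sketch of an alternative; use it instead of the `replace()` calls above (not after them, since the columns are already numeric at that point):
# Alternative label encoding with map() and an explicit dict per column
encodings = {
    'sex':    {'Male': 1, 'Female': 0},
    'smoker': {'Yes': 1, 'No': 0},
    'day':    {'Thur': 1, 'Fri': 2, 'Sat': 3, 'Sun': 4},
    'time':   {'Lunch': 1, 'Dinner': 2},
}
for col, mapping in encodings.items():
    df[col] = df[col].map(mapping)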
Next, let's check distribution plots (histograms + kernel density estimates) for the `total_bill` and `tip` columns to analyze their spread and patterns.
We loop through the `total_bill` and `tip` columns and plot each distribution.
Concretely, the code below does the following:
1. Figure Setup:
- Creates a large canvas (`15x9 inches`) to display plots clearly.
- Uses a 2-row, 3-column grid for subplots (though only 2 plots are generated).
2. Loop Logic:
- Iterates over the columns `total_bill` and `tip`.
- For each column:
- Creates a subplot in position `1` and `2` of the grid.
- Uses `sns.distplot()` (deprecated in newer Seaborn versions; `sns.histplot(df[col], kde=True)` is the modern equivalent) to plot:
- Histogram: Shows frequency of values in bins.
- KDE (Kernel Density Estimate): Smooth curve estimating the probability density.
3. Output:
- Two distribution plots displayed side-by-side.
Code:
plt.subplots(figsize=(15,9))
for i, col in enumerate(['total_bill','tip']):
    plt.subplot(2,3,i+1)
    sns.distplot(df[col])  # on newer Seaborn, use: sns.histplot(df[col], kde=True)
plt.tight_layout()
plt.show()
Output:
1. `total_bill` Distribution:
- Range: Most bills are between \$10–\$40, with a few outliers up to \$60.
- Peak: Highest density around \$15–\$20.
- Shape: Right-skewed (common in financial data).
2. `tip` Distribution:
- Range: Tips mostly fall between \$1–\$5 with rare tips up to \$10.
- Peak: Most frequent tip amount is \$2–\$3
- Shape: Also right-skewed.
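If you want a number to back up the "right-skewed" description, a quick optional check is Pandas' built-in skewness:
# Positive skewness values confirm the right skew seen in the plots
print(df[['total_bill', 'tip']].skew())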
Let’s generate a correlation heatmap to visualize relationships between numerical features in the dataset.
How do we compute and visualize the correlation matrix?
1. `df.corr()`:
Computes Pearson correlation coefficients between all numerical columns. Values range from -1 (perfect inverse) to 1 (perfect positive).
2. Heatmap Visualization:
- Colors (plasma colormap) and numeric annotations show each correlation.
- Darker colors correspond to lower correlation values; brighter colors to higher values.
Code:
# Now check the correlations
corr = df.corr()
plt.figure(figsize=(15,9))
sns.heatmap(corr,annot=True,cbar=True,cmap='plasma')
plt.show()
Output:
Key Observations from the Output:
| Feature Pair | Correlation | Interpretation |
|---|---|---|
| `total_bill` vs. `size` | 0.60 | Moderate positive correlation. Larger groups tend to have higher bills. |
| `tip` vs. `total_bill` | 0.46 | Tips increase with bill amount (expected). |
| `day` vs. `time` | 0.87 | Extremely high correlation. |
| `sex` vs. `day` | 0.23 | Weak correlation. Minimal relationship. |
Actionable Insights:
1. Feature Selection:
- Prioritize `total_bill` and `size` for tip prediction (strongest correlations).
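To see these rankings at a glance, you can pull out just the correlations with the target (a small optional helper):
# Correlation of every feature with the tip column, strongest first
print(df.corr()['tip'].drop('tip').sort_values(ascending=False))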
Now we’ll explore how to visualize relationships between numerical data using scatter plots in Python. We’ll analyze a dataset comparing restaurant bills (total_bill) to tips (tip) to see if there’s a pattern. By the end, you’ll understand how to create, customize, and interpret scatter plots using matplotlib. Let’s dive in!
1. plt.scatter(df.total_bill, df.tip)
What it does: Creates a scatter plot using two columns from the DataFrame (df):
df.total_bill: X-axis values (independent variable).
df.tip: Y-axis values (dependent variable).
Why scatter plot? To visualize the relationship between two numerical variables (e.g., correlation).
2. plt.title('TOTAL BILL vs TIPS')
Adds a title to the plot for context.
3. plt.xlabel('TOTAL BILL') and plt.ylabel('TIPS')
Labels the X-axis ("TOTAL BILL") and Y-axis ("TIPS") for clarity.
4. plt.show()
Displays the plot (required in non-Jupyter environments).
Code:
plt.scatter(df.total_bill,df.tip)
plt.title('TOTAL BILL vs TIPS')
plt.xlabel('TOTAL BILL')
plt.ylabel('TIPS')
plt.show()
Output:
The scatter plot shows:
X-axis (Total Bill): Ranges from roughly \$10–\$50.
Y-axis (Tips): Ranges from roughly \$2–\$10.
Key Observations:
As the total bill increases, tips generally increase (positive correlation).
Some outliers exist (e.g., high bills with low tips or vice versa).
Common Questions
Q: Why not use a line plot?
A: Scatter plots show individual data points; line plots imply continuity.
Q: How to change point colors?
A: Add c='red' in plt.scatter().
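For example, a minimal customization sketch based on the hint above:
# Same scatter plot, with red, semi-transparent points
plt.scatter(df.total_bill, df.tip, c='red', alpha=0.6)
plt.title('TOTAL BILL vs TIPS')
plt.xlabel('TOTAL BILL')
plt.ylabel('TIPS')
plt.show()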
Now let's dive into the Train-Test Split and Model Evaluation phase.
Before building any machine learning model, it's crucial to split our data into training and testing sets. This helps us evaluate how well our model generalizes to unseen data. Today, we'll:
1. Split the dataset into features (everything except `tip`) and the target (`tip`).
2. Scale features for better model performance.
3. Train 13 regression models and compare their performance using R² scores.
Code:
#Splitting the data into train and test
x = df.drop(['tip'],axis=1)
y = df.tip
#train test split
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
#feature scaling
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
x_train_scaled = ss.fit_transform(x_train)
x_test_scaled = ss.transform(x_test)
#model selection
from sklearn.linear_model import LinearRegression,Ridge,Lasso,ElasticNet
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor,AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from catboost import CatBoostRegressor
import lightgbm as lgbm
from sklearn.gaussian_process import GaussianProcessRegressor
lr = LinearRegression()
r = Ridge()
l = Lasso()
en = ElasticNet()
rf = RandomForestRegressor()
gb = GradientBoostingRegressor()
adb = AdaBoostRegressor()
xgb = XGBRegressor()
knn = KNeighborsRegressor()
svr = SVR()
cat = CatBoostRegressor()
lgb = lgbm.LGBMRegressor()
gpr = GaussianProcessRegressor()
#Fittings
lr.fit(x_train_scaled,y_train)
r.fit(x_train_scaled,y_train)
l.fit(x_train_scaled,y_train)
en.fit(x_train_scaled,y_train)
rf.fit(x_train_scaled,y_train)
gb.fit(x_train_scaled,y_train)
adb.fit(x_train_scaled,y_train)
xgb.fit(x_train_scaled,y_train)
knn.fit(x_train_scaled,y_train)
svr.fit(x_train_scaled,y_train)
cat.fit(x_train_scaled,y_train)
lgb.fit(x_train_scaled,y_train)
gpr.fit(x_train_scaled,y_train)
#preds
lrpred = lr.predict(x_test_scaled)
rpred = r.predict(x_test_scaled)
lpred = l.predict(x_test_scaled)
enpred = en.predict(x_test_scaled)
rfpred = rf.predict(x_test_scaled)
gbpred = gb.predict(x_test_scaled)
adbpred = adb.predict(x_test_scaled)
xgbpred = xgb.predict(x_test_scaled)
knnpred = knn.predict(x_test_scaled)
svrpred = svr.predict(x_test_scaled)
catpred = cat.predict(x_test_scaled)
lgbpred = lgb.predict(x_test_scaled)
gprpred = gpr.predict(x_test_scaled)
#Evaluations
from sklearn.metrics import r2_score,mean_absolute_error
lrr2 = r2_score(y_test,lrpred)
rr2 = r2_score(y_test,rpred)
lr2 = r2_score(y_test,lpred)
enr2 = r2_score(y_test,enpred)
rfr2 = r2_score(y_test,rfpred)
gbr2 = r2_score(y_test,gbpred)
adbr2 = r2_score(y_test,adbpred)
xgbr2 = r2_score(y_test,xgbpred)
knnr2 = r2_score(y_test,knnpred)
svrr2 = r2_score(y_test,svrpred)
catr2 = r2_score(y_test,catpred)
lgbr2 = r2_score(y_test,lgbpred)
gprr2 = r2_score(y_test,gprpred)
print('LINEAR REG ',lrr2)
print('RIDGE ',rr2)
print('LASSO ',lr2)
print('ELASTICNET',enr2)
print('RANDOM FOREST ',rfr2)
print('GB',gbr2)
print('ADABOOST',adbr2)
print('XGB',xgbr2)
print('KNN',knnr2)
print('SVR',svrr2)
print('CAT',catr2)
print('LIGHTGBM',lgbr2)
print('GAUSSIAN PROCESS',gprr2)
Output:
LINEAR REG 0.44293996874898933
RIDGE 0.44391863717160485
LASSO -0.15896098636013822
ELASTICNET 0.24009610844714901
RANDOM FOREST 0.30054654155503713
GB 0.3316197984256388
ADABOOST 0.2440881620509796
XGB 0.18081183850948024
KNN 0.39596009597821025
SVR 0.39878682459820713
CAT 0.38644351719711245
LIGHTGBM 0.37694716798251826
GAUSSIAN PROCESS -592939.1876093106
Output Interpretation:
The R² scores show:
Best Performers:
- Ridge Regression (0.44): Slightly better than Linear Regression (0.44).
- SVR (0.40) & KNN (0.39): Decent for nonlinear patterns.
Worst Performers:
- Lasso (-0.16): Poor fit (likely oversimplified the data).
- Gaussian Process (-592k): Failed catastrophically (unsuitable for this small dataset).
Takeaway:
- Linear models (Ridge/Linear) work best here, suggesting a roughly linear trend between `total_bill` and `tip`.
- Tree-based models (Random Forest, XGBoost) underperformed, likely due to limited data or simple patterns.
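As a side note, the thirteen fit/predict/score blocks above can be condensed into a single loop over a dict of models; here is a sketch with a subset of them, reusing the estimators and imports already defined:
# Fit, predict, and score several models in one loop
models = {'Linear': lr, 'Ridge': r, 'Lasso': l, 'Random Forest': rf, 'SVR': svr}
for name, model in models.items():
    model.fit(x_train_scaled, y_train)
    print(name, r2_score(y_test, model.predict(x_test_scaled)))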
Discussion Questions
1. Why did scaling improve some models (e.g., SVR) but not others (e.g., Random Forest)?
- Hint: Tree-based models are scale-invariant.
2. How could we improve the R² score?
- Hint: Feature engineering, more data, hyperparameter tuning (one concrete option is sketched below).
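To make the feature-engineering hint concrete, here is one possible sketch (not a tuned solution): add degree-2 interaction and squared terms in a pipeline, then refit Ridge.
# Feature engineering sketch: polynomial/interaction features + Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

poly_ridge = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           StandardScaler(),
                           Ridge())
poly_ridge.fit(x_train, y_train)  # the pipeline handles scaling itself
print('Ridge + degree-2 features R2:', poly_ridge.score(x_test, y_test))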
Ridge Regression vs. Prediction: Explained!
Creating a scatter plot that compares the actual tips (`Ground Truth`) vs. tips predicted by the Ridge Regressor (`Prediction`). Let’s break it down:
Code:
plt.scatter(y_test,rpred)
plt.title('RIDGE REGRESSOR VS PREDICTION')
plt.xlabel('Ground Truth')
plt.ylabel('Prediction')
plt.show()
Output:
What the Plot Shows
- X-axis (`Ground Truth`): Real tip amounts from the test set (1, 2, 3, 4, 5).
- Y-axis (`Prediction`): Tips predicted by Ridge Regression (1, 2, 3, 4).
- Ideal Scenario: Points should fall on a diagonal line (perfect predictions).
Key Observations
1. Underestimation:
- For high tips (e.g., true tip=5), the model predicts lower (e.g., 4).
2. Decent Fit for Mid-Range:
- Predictions are close for tips in the \$1–\$3 range.
3. Ridge’s R² Score: 0.44 (from earlier output).
- Interpretation: about 44% of the variance in tips is explained by the model. Not great, but better than the other models we tried!
Discussion Questions
1. Why might Ridge underestimate high tips?
- Hint: Maybe high tips are rare in the data, so the model "plays it safe."
2. How could we improve predictions for extreme values?
- Hint: Collect more high-tip examples or try nonlinear models.
Quiz Time! 🎯
1. What does a point at (3, 2) mean?
- A) Model perfectly predicted a $3 tip.
- B) Model predicted $2 for a $3 tip.
- Answer: B (It underestimated by $1!)
2. If all points lie on y=x, the R² score is?
- A) 0
- B) 1
- Answer: B (Perfect predictions!)
Fun Facts & Did You Know?
- Fact 1: Ridge Regression adds a "penalty" to prevent overfitting, like a teacher gently correcting a student’s wild guesses!
- Fact 2: The word "regression" comes from Francis Galton’s study of pea sizes, where he noticed "regression to the mean." 🌱
Activity
- Plot Other Models:
Compare plots for Lasso/Random Forest. Which looks closest to y=x?
Want to visualize the perfect prediction line (y=x) on the plot? Try adding:
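One possible sketch (reusing `y_test` and `rpred` from above):
# Overlay the ideal y = x line on the Ground Truth vs. Prediction scatter
lims = [min(y_test.min(), rpred.min()), max(y_test.max(), rpred.max())]
plt.scatter(y_test, rpred)
plt.plot(lims, lims, 'r--', label='Perfect prediction (y = x)')
plt.xlabel('Ground Truth')
plt.ylabel('Prediction')
plt.legend()
plt.show()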
Is the Ridge Model Overfitting or Underfitting?
Let’s Investigate!
We will use cross-validation to check if our Ridge Regression model generalizes well or if it’s overfitting/underfitting.
Code:
#(TO CHECK IF THE MODEL HAS OVERFITTED OR UNDERFITTED)
from sklearn.model_selection import cross_val_score
cross_val = cross_val_score(estimator=r,X=x_train_scaled,y=y_train)
print('Cross Val Acc Score of RIDGE model is ---> ',cross_val)
print('\n Cross Val Mean Acc Score of RIDGE model is ---> ',cross_val.mean())
Output:
Cross Val Acc Score of RIDGE model is ---> [-0.1145362 0.46443359 0.57347271 0.55080537 0.12219202]

Cross Val Mean Acc Score of RIDGE model is ---> 0.3192735010167972
What’s Happening?
1. `cross_val_score`:
- Splits `x_train_scaled` into 5 folds (default).
- Trains Ridge on 4 folds, tests on the 5th, and repeats for all folds.
- Returns R² scores (not accuracy, a common misconception!) for each fold.
2. Output:
- Fold scores: `[-0.11, 0.46, 0.57, 0.55, 0.12]` (varies widely!).
- Mean score: 0.32 (low, but better than negative values!).
Overfitting?
- ❌ No. Overfitting would show high training scores but low test scores (not observed here).
Underfitting?
- ✅ Likely! Low mean score (0.32) suggests the model is too simple to capture patterns.
High Variance?
- Scores range from -0.11 to 0.57, indicating inconsistent performance across folds.
Key Insight:
- Ridge isn’t overfitting, but it’s underfitting (or the data is noisy).
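A quick optional cross-check of this claim: compare train and test R² for the fitted Ridge model. A large gap would point to overfitting; two similarly low scores point to underfitting.
# Train vs. test R² for the Ridge model fitted earlier
print('Train R2:', r.score(x_train_scaled, y_train))
print('Test  R2:', r.score(x_test_scaled, y_test))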
Discussion Questions
1. Why might Fold 1 score be negative?
- Hint: Negative R² means the model is worse than predicting the mean tip!
2. How could we improve the Ridge model’s consistency?
- Hint: Try adjusting `alpha` (penalty strength) or adding more features.
Quiz Time! 🎯
1. If all fold scores were 0.9, the model is:
- A) Overfitting
- B) Generalizing well
- Answer: B (High and consistent scores = good fit!)
2. Cross-validation helps avoid:
- A) Underfitting
- B) Data leakage
- Answer: B (It ensures we don’t cheat by peeking at test data!)
Fun Facts & Did You Know?
- Fact 1: Negative R² scores can happen! It’s like a weatherman predicting snowfall in a desert. ❄️🏜️
- Fact 2: Ridge’s `alpha` is like a "regularization knob": turn it up to simplify the model, down to fit quirks. 🔧
Activity Idea
- Tune `alpha`: Try `Ridge(alpha=10)` or `Ridge(alpha=0.1)` and rerun cross-validation. Does the mean score improve?
# Example: Test a stronger penalty
r_tuned = Ridge(alpha=10)
cross_val_score(r_tuned, x_train_scaled, y_train).mean()
Visualize: Plot `alpha` vs. mean R² to find the sweet spot!
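A sketch of that `alpha` vs. mean R² plot (reusing `cross_val_score`, `Ridge`, and the scaled training data from above):
# Sweep a few alpha values and plot the mean cross-validated R² for each
alphas = [0.01, 0.1, 1, 10, 100]
mean_scores = [cross_val_score(Ridge(alpha=a), x_train_scaled, y_train).mean()
               for a in alphas]
plt.plot(alphas, mean_scores, marker='o')
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('Mean CV R²')
plt.show()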
Conclusion
Despite using a relatively small dataset, our analysis uncovered meaningful insights that align with real-world observations about tipping behavior. The trends we identified, such as the correlation between total bill amount and tips, reflect patterns commonly seen in practice. However, with a larger and more diverse dataset, we could delve deeper into the nuances of this relationship, uncovering hidden patterns and potentially more complex dynamics influenced by factors like dining time, party size, or server demographics. This project highlights the power of data science to extract actionable insights even from limited data, while also emphasizing the value of robust datasets for building more accurate and generalizable models. Future work could expand the analysis to include additional features or employ advanced techniques to further refine our understanding of tipping behavior.
Key Takeaways:
Small datasets can yield valid insights but have limitations.
Larger datasets enable deeper pattern recognition and stronger conclusions.
Real-world applicability depends on data quality and model tuning.