Healthcare Project Ai (End-To-End)
Predicting Cancer Using Machine Learning
In today’s data-driven healthcare world, early diagnosis is not just a convenience, it's a lifesaving tool. Imagine being able to predict the likelihood of a patient having cancer using data and machine learning. This blog takes you through a complete hands-on project using real-world healthcare data, where we build a cancer prediction model using Python and scikit-learn.
Whether you're a data science beginner or someone exploring machine learning applications in healthcare, this blog will help you understand how AI can be used to create meaningful predictions and potentially save lives.
1. Real-Life Use Case
One of the most common and threatening forms of cancer is breast cancer. Early detection can significantly improve patient survival rates. Hospitals and diagnostic centers collect vast amounts of patient data, such as tumor size, cell shape, and texture. But interpreting this data manually is slow and prone to error.
Real-Life Example:
Institutions like the Mayo Clinic and IBM Watson Health use machine learning algorithms to assist doctors in diagnosing breast cancer faster and more accurately. This project replicates a simpler version of those tools using open-source data.
2. Objective of the Project
Our goal is to build a classification model that can predict whether a tumor is malignant or benign, based on features extracted from digitized images of a breast mass.
3. Dataset Overview
We’ll use the Breast Cancer Wisconsin (Diagnostic) Dataset. In this project we load the Kaggle CSV version (Cancer_Data.csv); the same data is also available in scikit-learn via sklearn.datasets.
The dataset includes 30 numeric features derived from digitized images of breast masses and a diagnosis label ('M' for malignant, 'B' for benign), which we will encode as 1 and 0 later on.
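If you prefer not to download the CSV, a minimal sketch of loading the built-in scikit-learn copy looks like this (note that its target is encoded the opposite way: 0 for malignant, 1 for benign):
from sklearn.datasets import load_breast_cancer
import pandas as pd
#Load the built-in copy as a DataFrame: 30 numeric features plus a numeric target column
data = load_breast_cancer(as_frame=True)
df_sklearn = data.frame
print(df_sklearn.shape)      #(569, 31)
print(data.target_names)     #['malignant' 'benign'] -> target 0 is malignant, 1 is benign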
4. Loading the Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import warnings
import joblib
warnings.filterwarnings('ignore')
df = pd.read_csv('/kaggle/input/cancer-data/Cancer_Data.csv')
df.head()
🧠 What's happening here?
Before we can analyze or build a machine learning model, we need to import essential Python libraries. Here's what each one does:
pandas as pd: For loading, cleaning, and analyzing data in tabular form (like Excel).
numpy as np: For working with numerical arrays and performing fast mathematical calculations.
matplotlib.pyplot as plt: For making basic graphs and plots (like bar charts, line plots, etc.).
seaborn as sns: A more beautiful plotting library built on top of matplotlib, used for statistical data visualizations.
pickle and joblib: Tools to save and load machine learning models after training so you can reuse them without retraining.
warnings.filterwarnings('ignore'): This is used to suppress warning messages, which can make your notebook cleaner and easier to read.
pd.read_csv: This reads the CSV file containing the cancer dataset. In this case, the dataset is located in the Kaggle environment's input directory.
df.head():
Displays the first 5 rows of the dataset to give us a quick preview of what the data looks like.
OUTPUT: (a table preview of the first five rows of the dataset)
Next, to see all the column names, just type: df.columns
Why Do We Use df.columns in Pandas?
In pandas, df.columns is used to view or access the names of all the columns in a DataFrame (df stands for DataFrame).
🔍 What does it return?
It returns a list-like object containing all the column names from your dataset.
df.columns
Output:
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
dtype='object')
These are the columns in the dataset. Now we will drop two unnecessary columns: 'Unnamed: 32' and 'id'.
To drop, we will use this code:
df = df.drop(['Unnamed: 32','id'],axis=1)
🔍 What does this line do?
It removes two columns 'Unnamed: 32' and 'id' from the DataFrame df.
✅ Why do we drop these columns?
'Unnamed: 32':
This is often a blank or empty column that accidentally gets added when saving or exporting CSV files (especially from Excel).
It usually has no useful data and just clutters the dataset.
'id':
This is an identifier column like a serial number or patient ID.
It's unique for each row but doesn’t help in predicting the target (like whether cancer is present or not), so it's not useful for machine learning models.
Keeping it may lead to overfitting or noise.
🧠 What does axis=1 mean?
axis=1 tells pandas to drop columns.
(If it were axis=0, it would drop rows instead.)
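For contrast, here is a small hypothetical example (not part of the project) showing a row drop instead of a column drop:
#axis=0 drops rows by their index labels; this would remove the first two rows
df_without_first_rows = df.drop([0, 1], axis=0)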
Now let’s see what df.info() gives.
A quick summary of the DataFrame, including:
The number of rows and columns
Column names and data types
The count of non-null (non-missing) values in each column
The memory usage of the DataFrame
It's very useful for checking data quality and identifying missing values or data types that may need conversion before analysis or modeling. So let's check:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 diagnosis 569 non-null object
1 radius_mean 569 non-null float64
2 texture_mean 569 non-null float64
3 perimeter_mean 569 non-null float64
4 area_mean 569 non-null float64
5 smoothness_mean 569 non-null float64
6 compactness_mean 569 non-null float64
7 concavity_mean 569 non-null float64
8 concave points_mean 569 non-null float64
9 symmetry_mean 569 non-null float64
10 fractal_dimension_mean 569 non-null float64
11 radius_se 569 non-null float64
12 texture_se 569 non-null float64
13 perimeter_se 569 non-null float64
14 area_se 569 non-null float64
15 smoothness_se 569 non-null float64
16 compactness_se 569 non-null float64
17 concavity_se 569 non-null float64
18 concave points_se 569 non-null float64
19 symmetry_se 569 non-null float64
20 fractal_dimension_se 569 non-null float64
21 radius_worst 569 non-null float64
22 texture_worst 569 non-null float64
23 perimeter_worst 569 non-null float64
24 area_worst 569 non-null float64
25 smoothness_worst 569 non-null float64
26 compactness_worst 569 non-null float64
27 concavity_worst 569 non-null float64
28 concave points_worst 569 non-null float64
29 symmetry_worst 569 non-null float64
30 fractal_dimension_worst 569 non-null float64
dtypes: float64(30), object(1)
memory usage: 137.9+ KB
So there are no missing values in this dataset, and no duplicated rows either. To check for missing values and duplicates, we use df.isnull().sum() and df.duplicated().sum(), as shown below.
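Here is what those quick checks might look like in the notebook (a minimal sketch; the printed counts themselves aren't reproduced here):
#Checking data quality
print(df.isnull().sum())        #missing values per column (all zeros for this dataset)
print(df.duplicated().sum())    #number of fully duplicated rows (0 for this dataset)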
Now let's look at the target variable (diagnosis), which takes the values 'M' and 'B'. We will convert these to 1 and 0 so the target becomes numerical, since machine learning models work with numerical values.
df.diagnosis.unique()
Output:
array(['M', 'B'], dtype=object)
So let's convert it:
df.diagnosis = df.diagnosis.replace(['M','B'],[1,0])
df.head()
Output: (the first five rows, with the diagnosis column now showing 1 and 0)
As you can see above, the target column is now numerical with 0 and 1 values. Malignant means cancer and benign does not, so 1 indicates cancer and 0 indicates no cancer.
Now let's check the correlation matrix.
What is this df.corr() ?
df.corr() calculates the correlation matrix, which shows how strongly each feature is related to the others.
plt.figure(figsize=(28,27)) sets the size of the heatmap plot.
sns.heatmap visualizes the correlation matrix using a color-coded heatmap:
annot=True: displays the actual correlation values inside the boxes.
cbar=True: shows the color scale on the side.
cmap='plasma': applies a colorful gradient for better visual contrast.
👉 Useful to identify multicollinearity and decide which features are most informative.
corr = df.corr()
plt.figure(figsize=(28,27))
sns.heatmap(corr,annot=True,cbar=True,cmap='plasma')
plt.show()
Output: (correlation heatmap of all the features)
Now let's scale the features so that machine learning models can interpret them more easily and make better predictions. The scaling code below assumes the data has already been split into training and test sets (x_train, x_test, y_train, y_test); a sketch of that split is shown next.
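The train/test split itself doesn't appear in the snippets above, so here is a minimal sketch of how it could be done, assuming an 80/20 split (which matches the 114 test predictions seen later) with stratification to preserve the class balance; the exact parameters in the original notebook may differ:
#TRAIN/TEST SPLIT (sketch)
from sklearn.model_selection import train_test_split
x = df.drop('diagnosis', axis=1)    #the 30 numeric features
y = df['diagnosis']                 #the 0/1 target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)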
What Feature Scaling does?
StandardScaler: Scales the features so they have a mean of 0 and standard deviation of 1. This helps many machine learning models perform better.
fit_transform(x_train): Learns the scaling parameters from x_train and applies the transformation.
transform(x_test): Applies the same transformation to x_test using the parameters learned from x_train.
✅ Feature scaling is important, especially for models like KNN, SVM, and Logistic Regression that are sensitive to feature magnitudes.
#FEATURE SCALING
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
x_train_scaled = ss.fit_transform(x_train)
x_test_scaled = ss.transform(x_test)
Now that the data is scaled, we will import the machine learning models.
This part is called model selection. Since this is a classification project, we will try several classification algorithms to see which model gives the best results on this dataset. So we import all of the following models.
#model selections
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
#objects
lr = LogisticRegression()
rf = RandomForestClassifier()
gb = GradientBoostingClassifier()
xgb = XGBClassifier()
svc = SVC()
knn = KNeighborsClassifier()
nb = GaussianNB()
lgb = LGBMClassifier()
cat = CatBoostClassifier()
We have also created the model instances and stored them in the variables above, so we can easily refer to each model by a short name.
Now let's fit these models on the scaled training data and then make predictions on the test set:
lr.fit(x_train_scaled,y_train)
rf.fit(x_train_scaled,y_train)
gb.fit(x_train_scaled,y_train)
xgb.fit(x_train_scaled,y_train)
svc.fit(x_train_scaled,y_train)
knn.fit(x_train_scaled,y_train)
nb.fit(x_train_scaled,y_train)
lgb.fit(x_train_scaled,y_train)
cat.fit(x_train_scaled,y_train)
lrpred = lr.predict(x_test_scaled)
rfpred = rf.predict(x_test_scaled)
gbpred = gb.predict(x_test_scaled)
xgbpred = xgb.predict(x_test_scaled)
svcpred = svc.predict(x_test_scaled)
knnpred = knn.predict(x_test_scaled)
nbpred = nb.predict(x_test_scaled)
lgbpred = lgb.predict(x_test_scaled)
catpred = cat.predict(x_test_scaled)
#Evaluations
from sklearn.metrics import accuracy_score
lracc = accuracy_score(y_test,lrpred)
rfacc = accuracy_score(y_test,rfpred)
gbacc = accuracy_score(y_test,gbpred)
xgbacc = accuracy_score(y_test,xgbpred)
svcacc = accuracy_score(y_test,svcpred)
knnacc = accuracy_score(y_test,knnpred)
nbacc = accuracy_score(y_test,nbpred)
lgbacc = accuracy_score(y_test,lgbpred)
catacc = accuracy_score(y_test,catpred)
print('LOGISTIC REG',lracc)
print('RANDOM FOREST',rfacc)
print('GB',gbacc)
print('XGB',xgbacc)
print('SVC',svcacc)
print('KNN',knnacc)
print('NB',nbacc)
print('LIGHT GBM',lgbacc)
print('CATBOOST',catacc)
LogisticRegression(), RandomForestClassifier(), etc.: Initialize the classifier objects from scikit-learn and the boosting libraries.
fit(x_train_scaled, y_train): Trains the model using the scaled training data.
predict(x_test_scaled): Uses the trained model to make predictions on the scaled test data.
✅ This is the core step where the machine learns patterns from training data and makes predictions on unseen data.
Output:
LOGISTIC REG: 0.9736842105263158 (97%)
RANDOM FOREST: 0.9649122807017544 (96%)
GB: 0.956140350877193 (95%)
XGB: 0.956140350877193 (95%)
SVC: 0.9824561403508771 (98%)
KNN: 0.9473684210526315 (94%)
NB: 0.9649122807017544 (96%)
LIGHT GBM: 0.9649122807017544 (96%)
CATBOOST: 0.9736842105263158 (97%)
This shows that nearly every model performs well on this dataset, largely because the features were properly scaled and the models had enough training data to learn from.
The best-performing model above is SVC (Support Vector Classifier), so we will use it in the final deployment phase. Before deploying, let's check the confusion matrix.
🎯 Why We Check the Confusion Matrix:
The confusion matrix gives a detailed breakdown of your classification model's performance by showing:
True Positives (TP) – Correctly predicted positives (e.g., correctly identified cancer cases)
True Negatives (TN) – Correctly predicted negatives (e.g., correctly identified non-cancer cases)
False Positives (FP) – Incorrectly predicted as positive (e.g., predicted cancer, but actually not)
False Negatives (FN) – Missed positives (e.g., predicted non-cancer, but actually cancer)
✅ Why It’s Important:
It helps identify where the model is making mistakes.
Useful in imbalanced datasets (e.g., more non-cancer than cancer cases), where accuracy alone can be misleading.
Helps calculate key metrics:
Precision = TP / (TP + FP)
Recall (Sensitivity) = TP / (TP + FN)
F1-Score = Harmonic mean of precision and recall
👉 In healthcare (like cancer prediction), a false negative can be dangerous. The confusion matrix highlights such critical errors, helping you improve the model.
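To make these formulas concrete, the same metrics can be computed directly with scikit-learn's metric functions; here is a minimal sketch using the y_test labels and the SVC predictions (svcpred) from above:
from sklearn.metrics import precision_score, recall_score, f1_score
print('Precision:', precision_score(y_test, svcpred))   #TP / (TP + FP)
print('Recall:', recall_score(y_test, svcpred))          #TP / (TP + FN)
print('F1-score:', f1_score(y_test, svcpred))            #harmonic mean of precision and recall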
#NOW CHECK THE CONFUSION MATRIX(for specific model)
from sklearn.metrics import confusion_matrix,classification_report
cm = confusion_matrix(y_test,svcpred) #Enter the model pred here
plt.title('Heatmap of Confusion matrix',fontsize=15)
sns.heatmap(cm,annot=True)
plt.show()
Output: (heatmap of the confusion matrix)
Now Let’s head to the Classification Report
classification_report is a function from sklearn.metrics.
It compares your model’s predictions (svcpred) with the actual values (y_test).
It prints four key performance metrics for each class (e.g., malignant or benign):
Precision
Recall
F1-score
Support
📊 Why Are We Checking the Classification Report?
Because it gives a more comprehensive evaluation of the classification model's performance than accuracy or a confusion matrix alone.
💡 Here’s what each metric tells you:
Precision: Out of all predicted positive cases, how many were actually positive?
Example: How many predicted “cancer” cases were actually cancer?
Recall (Sensitivity): Out of all actual positive cases, how many were correctly predicted?
Important in medical scenarios, where we want to catch as many actual cancer cases as possible.
F1-score: Harmonic mean of precision and recall. A balance between both.
Useful when the class distribution is imbalanced.
Support: Number of actual instances for each class in y_test.
✅ Summary:
The classification report tells you how well your model performs per class, which is especially important in critical applications like healthcare.
Now Let's check the Classification Report
#NOW CHECK THE CLASSIFICATION REPORT
#Printing the classification report
print(classification_report(y_test,svcpred))
Output: (classification report for the SVC predictions)
Now let's interpret the values from the output:
Class-wise Performance
🔹 Class 0: Benign Tumors
Precision = 0.97
→ 97% of the time when the model predicted "benign", it was correct.
Recall = 1.00
→ The model caught all actual benign tumors, missing none.
F1-Score = 0.99
→ Excellent overall performance for benign tumors.
🔹 Class 1: Malignant Tumors
Precision = 1.00
→ Every time the model predicted "malignant", it was correct: no false positives.
Recall = 0.95
→ The model identified 95% of actual malignant tumors. It missed 5%.
F1-Score = 0.98
→ Still a strong score, but we could aim to improve recall slightly.
📊 Overall Model Performance
✅ Accuracy = 0.98 (98%)
Out of 114 total predictions, the model got 112 correct, which is a very strong result.
🧠 Macro vs. Weighted Averages
Macro Avg:
Simple average of precision, recall, and F1 across both classes.
Treats both classes equally, regardless of how many samples they have.
Weighted Avg:
Takes class imbalance into account by giving more weight to the class with more instances (in this case, class 0).
Better for overall performance reflection when classes are imbalanced.
Both are very high (~0.98–0.99), which confirms strong, reliable performance.
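To see the difference concretely, here is a small sketch (again assuming y_test and svcpred from above) that computes both averages with scikit-learn:
from sklearn.metrics import f1_score
print('Macro F1:', f1_score(y_test, svcpred, average='macro'))         #plain mean over the two classes
print('Weighted F1:', f1_score(y_test, svcpred, average='weighted'))   #mean weighted by class support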
Now let's check the cross-validation accuracy score. This will tell us whether our model has overfitted or underfitted.
🎯 Why Do We Check Cross-Validation Score?
When we train and test a machine learning model, it's easy to get overconfident if the accuracy is high. But how do we know it will perform well on unseen data?
That’s where cross-validation comes in.
✅ What Is Cross-Validation?
Cross-validation is a technique to evaluate a model's performance more reliably.
It works by splitting your dataset into multiple parts (called folds), training the model on some folds, testing it on the remaining fold, and repeating this process several times.
🔁 Example: 5-Fold Cross Validation
Split your data into 5 parts (folds)
Use 4 parts for training and 1 part for testing
Repeat this 5 times, changing the test fold each time
Calculate the accuracy each time
Average the results to get the cross-validation score
This gives you a more balanced and general idea of how your model will perform on new, real-world data.
🧠 Why It’s Important:
Prevents overfitting: Helps ensure your model is not just memorizing training data.
More reliable metric: It gives a more honest picture of accuracy compared to just one train-test split.
Helps in model selection: You can compare multiple models using cross-validation to see which generalizes best.
📌 Teaching Analogy:
Imagine studying for an exam by only solving one type of question: you'll do well on that type, but might struggle on the actual test.
Cross-validation is like practicing with different types of questions from different chapters: it prepares your model better for the final exam (real-world predictions)!
Checking:
from sklearn.model_selection import cross_val_score
cross_val = cross_val_score(estimator=svc,X=x_train_scaled,y=y_train)
print('Cross Val Acc Score of SVC model is ---> ',cross_val)
print('\n Cross Val Mean Acc Score of SVC model is ---> ',cross_val.mean())
Output:
Cross Val Acc Score of SVC model is ---> [0.97802198 0.96703297 0.98901099 0.98901099 0.95604396]
Cross Val Mean Acc Score of SVC model is ---> 0.9758241758241759
Understanding Cross-Validation Accuracy Scores
1. What the Output Means
The output shows the performance of a Support Vector Classifier (SVC) model evaluated using 5-fold cross-validation:
Cross Val Acc Score of SVC model is ---> [0.97802198 0.96703297 0.98901099 0.98901099 0.95604396]
Cross Val Mean Acc Score of SVC model is ---> 0.9758241758241759
First Line: Accuracy scores for each of the 5 validation folds.
- Fold 1: 97.80%
- Fold 2: 96.70%
- Fold 3: 98.90%
- Fold 4: 98.90%
- Fold 5: 95.60%
Second Line: The mean accuracy across all folds (~97.58%).
2. Key Concepts to Explain
A. Why Cross-Validation?
- Avoids overfitting by testing the model on different subsets of data.
- More reliable than a single train-test split.
B. Why 5 Folds?
- The dataset was split into 5 parts (folds).
- Model trained on 4 folds, tested on the 5th, repeated 5 times.
C. Interpreting Variability
- The scores range from 95.6% to 98.9%.
Consistency Check: Small variation (all scores >95%) means the model generalizes well.
Red Flag: If one fold had 70%, we’d investigate data imbalances or outliers.
D. Mean Accuracy
- The average (97.58%) is the model’s expected performance on unseen data.
3. Why This Matters
- High Mean (~97.58%): The SVC model is very accurate.
- Low Variance: Consistent scores suggest robustness.
Analogy: Think of cross-validation like taking 5 different exams. If you score 95-99% on all, you’re consistently good, not just lucky!
4. Potential Discussion Questions
1. “Why might Fold 5’s score (95.6%) be slightly lower?”
- Maybe harder samples in that fold, or minor overfitting.
2. “How could we improve the lowest fold’s score?”
- Feature engineering or hyperparameter tuning (a sketch of the latter follows below).
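As a concrete example of that last suggestion, a hypothetical grid search over the SVC hyperparameters C and gamma (not part of the original notebook) could look like this:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.01, 0.1]}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(x_train_scaled, y_train)    #tries every C/gamma combination with 5-fold cross-validation
print('Best parameters:', grid.best_params_)
print('Best CV accuracy:', grid.best_score_)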
Now Let’s Save the model to use it later in the deployment phase.
#NOW save the model
import pickle
#Saving the model
pickle.dump(svc,open('breast_cancer_svc.pickle','wb'))
#loading the model
breast_cancer_svc_model = pickle.load(open('breast_cancer_svc.pickle','rb'))
#Predicting the output
y_pred = breast_cancer_svc_model.predict(x_test_scaled)
#confusion matrix
print('Confusion matrix of SVC model : \n', confusion_matrix(y_test,y_pred),'\n')
#showing off the accuracy score
print('Accuracy Score on testing data by svc model is ---> ',accuracy_score(y_test,y_pred))
Output:
Confusion matrix of SVC model :
[[71 0]
[ 2 41]]
Accuracy Score on testing data by svc model is ---> 0.9824561403508771
Now let me explain the above code and the output step by step
1. Saving the Model with `pickle`
import pickle
pickle.dump(svc, open('breast_cancer_svc.pickle', 'wb'))
What it does:
- `pickle` is Python’s built-in module for serializing (saving) objects like trained models.
- `pickle.dump()` saves the trained `svc` model to a file named `breast_cancer_svc.pickle`.
- `'wb'` means "write in binary mode" (required for pickle files).
Why it matters:
- Saves time, you don’t need to retrain the model every time.
- Lets you share/reuse the model in other projects or deploy it.
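Since joblib was also imported at the top of the notebook, an equivalent way to save and load the model (often preferred for models that hold large NumPy arrays) would be this hypothetical variant:
import joblib
joblib.dump(svc, 'breast_cancer_svc.joblib')                          #save the trained model
breast_cancer_svc_model = joblib.load('breast_cancer_svc.joblib')     #load it back later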
2. Loading the Saved Model
breast_cancer_svc_model = pickle.load(open('breast_cancer_svc.pickle', 'rb'))
What it does:
- `pickle.load()` reads the saved model from the file.
- `'rb'` means "read in binary mode".
- The loaded model is stored in `breast_cancer_svc_model` (now identical to the original `svc`).
Key Point:
- The loaded model is ready to make predictions, just like before!
3. Making Predictions
y_pred = breast_cancer_svc_model.predict(x_test_scaled)
What it does:
- Uses the loaded model (`breast_cancer_svc_model`) to predict outcomes for the scaled test features (`x_test_scaled`).
- Predictions are stored in `y_pred`.
Reminder:
- Always scale test data the same way you scaled training data (e.g., using `StandardScaler`).
4. Evaluating the Model
A. Confusion Matrix
print('Confusion matrix of SVC model : \n', confusion_matrix(y_test, y_pred), '\n')
What it shows:
- A table comparing actual (`y_test`) vs. predicted (`y_pred`) values.
- Example output (for binary classification):
[[TN FP]
[FN TP]]
- TN (True Negative), FP (False Positive), etc.
Why it matters:
- Reveals model errors (e.g., how many false positives/negatives occurred).
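A handy way to unpack those four counts in code is ravel(); here is a small sketch, assuming binary 0/1 labels as in this project:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()   #row order: actual 0 first, then actual 1
print(tn, fp, fn, tp)                                        #71 0 2 41 for the output above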
B. Accuracy Score
print('Accuracy Score on testing data by svc model is ---> ', accuracy_score(y_test, y_pred))
What it shows:
- The fraction of correct predictions: `(TP + TN) / Total`.
- Example: `0.98` means 98% accuracy.
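Plugging in the confusion-matrix values from the output above: (TP + TN) / Total = (41 + 71) / 114 ≈ 0.9825, which matches the printed accuracy score.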
Limitation:
- Accuracy can be misleading for imbalanced datasets (always check the confusion matrix too!).
Visualizing the Flow
1. Save Model → Load Model → Predict → Evaluate.
2. Like saving a recipe (`pickle`), reheating it (loading), and tasting it (evaluation).
Discussion Questions
1. "What happens if we forget to scale `x_test` before predicting?"
- Answer: Predictions will be wrong! Scales must match training data.
2. "Why not just use `svc` directly instead of saving/loading?"
- Answer: Saving lets you reuse the model without retraining.
Key Takeaways
- `pickle` is for saving/loading Python objects (like models).
- Always: Scale test data → Predict → Evaluate with multiple metrics.
- Confusion matrices give deeper insight than accuracy alone.