End-to-End Healthcare AI Project: Predicting Cancer Using Machine Learning

Healthcare AI Project (End-to-End)

Predicting Cancer Using Machine Learning




Introduction

In today’s data-driven healthcare world, early diagnosis is not just a convenience; it is a lifesaving tool. Imagine being able to predict the likelihood of a patient having cancer using data and machine learning. This blog takes you through a complete hands-on project using real-world healthcare data, where we build a cancer prediction model using Python and scikit-learn.

Whether you're a data science beginner or someone exploring machine learning applications in healthcare, this blog will help you understand how AI can be used to create meaningful predictions and potentially save lives.


1. Real-Life Use Case

One of the most common and threatening forms of cancer is breast cancer. Early detection can significantly improve patient survival rates. Hospitals and diagnostic centers collect vast amounts of patient data, such as tumor size, cell shape, and texture. But interpreting this data manually is slow and prone to error.

Real-Life Example:
Institutions like the Mayo Clinic and IBM Watson Health use machine learning algorithms to assist doctors in diagnosing breast cancer faster and more accurately. This project replicates a simpler version of those tools using open-source data.


2. Objective of the Project

Our goal is to build a classification model that can predict whether a tumor is malignant or benign, based on features extracted from digitized images of a breast mass.


3. Dataset Overview

We’ll use the Breast Cancer Wisconsin (Diagnostic) Dataset. It is also available directly in sklearn.datasets, but in this project we load the Kaggle CSV version of it.

This dataset includes 30 numeric features derived from digitized tumor images and a diagnosis label ('M' for malignant, 'B' for benign), which we will later encode as 1 and 0.
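For reference, here is a minimal sketch of loading the built-in scikit-learn copy of the same data (note that this version encodes the target as 0 = malignant and 1 = benign, the opposite of the mapping we apply to the Kaggle CSV below):

# Optional: load the built-in copy of the dataset instead of the Kaggle CSV
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
sk_df = pd.DataFrame(data.data, columns=data.feature_names)
sk_df['target'] = data.target      # 0 = malignant, 1 = benign in this version
print(sk_df.shape)                 # (569, 31)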


4. Loading the Required Libraries and the Data

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import pickle

import warnings

import joblib

warnings.filterwarnings('ignore')


df = pd.read_csv('/kaggle/input/cancer-data/Cancer_Data.csv')

df.head()



🧠 What's happening here?

Before we can analyze or build a machine learning model, we need to import essential Python libraries. Here's what each one does:

pandas as pd: For loading, cleaning, and analyzing data in tabular form (like Excel).

numpy as np: For working with numerical arrays, essential for numerical calculations.

matplotlib.pyplot as plt: For making basic graphs and plots (like bar charts, line plots, etc.).

seaborn as sns: A more beautiful plotting library built on top of matplotlib, used for statistical data visualizations.

pickle and joblib: Tools to save and load machine learning models after training so you can reuse them without retraining.


warnings.filterwarnings('ignore'): This is used to suppress warning messages, which can make your notebook cleaner and easier to read.

pd.read_csv: This reads the CSV file containing the cancer dataset. In this case, the dataset is located in the Kaggle environment's input directory.

df.head():
Displays the first 5 rows of the dataset to give us a quick preview of what the data looks like.


OUTPUT:



Now let's check the columns in this dataset.

Just type: df.columns

Why Do We Use df.columns in Pandas?

In pandas, df.columns is used to view or access the names of all the columns in a DataFrame (df stands for DataFrame).

🔍 What does it return?

It returns a list-like object containing all the column names from your dataset.

df.columns

Output:

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',

       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',

       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',

       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',

       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',

       'fractal_dimension_se', 'radius_worst', 'texture_worst',

       'perimeter_worst', 'area_worst', 'smoothness_worst',

       'compactness_worst', 'concavity_worst', 'concave points_worst',

       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],

      dtype='object')




So above are the columns that we have in this dataset. Now we will drop two unnecessary columns: 'Unnamed: 32' and 'id'.

To drop, we will use this code:

df = df.drop(['Unnamed: 32','id'],axis=1)

🔍 What does this line do?

It removes two columns 'Unnamed: 32' and 'id' from the DataFrame df.

✅ Why do we drop these columns?

  1. 'Unnamed: 32':

    • This is often a blank or empty column that accidentally gets added when saving or exporting CSV files (especially from Excel).

    • It usually has no useful data and just clutters the dataset.

  2. 'id':

    • This is an identifier column like a serial number or patient ID.

    • It's unique for each row but doesn’t help in predicting the target (like whether cancer is present or not), so it's not useful for machine learning models.

    • Keeping it may lead to overfitting or noise.

🧠 What does axis=1 mean?

  • axis=1 tells pandas to drop columns.

  • (If it were axis=0, it would drop rows instead.)
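A quick illustration of the difference (both calls return a modified copy and leave df unchanged; the column name is just an example from this dataset):

df.drop(['radius_mean'], axis=1)   # drops the column named 'radius_mean'
df.drop([0, 1], axis=0)            # drops the rows labelled 0 and 1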


Now let’s see what df.info() gives. 

A quick summary of the DataFrame, including:

  • The number of rows and columns

  • Column names and data types

  • The count of non-null (non-missing) values in each column

  • The memory usage of the DataFrame

It's very useful for checking data quality and identifying missing values or data types that may need conversion before analysis or modeling. So let's check:

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 569 entries, 0 to 568

Data columns (total 31 columns):

 #   Column                   Non-Null Count  Dtype  

---  ------                   --------------  -----  

 0   diagnosis                569 non-null    object 

 1   radius_mean              569 non-null    float64

 2   texture_mean             569 non-null    float64

 3   perimeter_mean           569 non-null    float64

 4   area_mean                569 non-null    float64

 5   smoothness_mean          569 non-null    float64

 6   compactness_mean         569 non-null    float64

 7   concavity_mean           569 non-null    float64

 8   concave points_mean      569 non-null    float64

 9   symmetry_mean            569 non-null    float64

 10  fractal_dimension_mean   569 non-null    float64

 11  radius_se                569 non-null    float64

 12  texture_se               569 non-null    float64

 13  perimeter_se             569 non-null    float64

 14  area_se                  569 non-null    float64

 15  smoothness_se            569 non-null    float64

 16  compactness_se           569 non-null    float64

 17  concavity_se             569 non-null    float64

 18  concave points_se        569 non-null    float64

 19  symmetry_se              569 non-null    float64

 20  fractal_dimension_se     569 non-null    float64

 21  radius_worst             569 non-null    float64

 22  texture_worst            569 non-null    float64

 23  perimeter_worst          569 non-null    float64

 24  area_worst               569 non-null    float64

 25  smoothness_worst         569 non-null    float64

 26  compactness_worst        569 non-null    float64

 27  concavity_worst          569 non-null    float64

 28  concave points_worst     569 non-null    float64

 29  symmetry_worst           569 non-null    float64

 30  fractal_dimension_worst  569 non-null    float64

dtypes: float64(30), object(1)

memory usage: 137.9+ KB


So there are no missing values in this dataset, and no duplicated rows either. To check for missing values and duplicates, we use df.isnull().sum() and df.duplicated().sum().
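A quick sketch of those checks (both counts come back as zero for this dataset, as noted above):

# Check for missing values and duplicate rows
print(df.isnull().sum())       # missing values per column (all zeros here)
print(df.duplicated().sum())   # number of fully duplicated rows (0 here)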


Now let's look at the target variable (diagnosis), which takes the values ‘M’ and ‘B’. We will convert these to 1 and 0, because machine learning models work with numerical values.

df.diagnosis.unique()

Output:

array(['M', 'B'], dtype=object)

So let's convert it:

df.diagnosis = df.diagnosis.replace(['M','B'],[1,0])

df.head()

Output:








As you can see above, the target column is now numerical, with 0 and 1 values. Malignant means cancer and benign means non-cancerous, so 1 indicates cancer and 0 does not.


Now check the correlation matrix

What is this df.corr() ?

  • df.corr() calculates the correlation matrix, which shows how strongly each feature is related to the others.

  • plt.figure(figsize=(28,27)) sets the size of the heatmap plot.

  • sns.heatmap visualizes the correlation matrix using a color-coded heatmap:

    • annot=True: displays the actual correlation values inside the boxes.

    • cbar=True: shows the color scale on the side.

    • cmap='plasma': applies a colorful gradient for better visual contrast.

👉 Useful to identify multicollinearity and decide which features are most informative.

corr = df.corr()

plt.figure(figsize=(28,27))

sns.heatmap(corr,annot=True,cbar=True,cmap='plasma')

plt.show()


Output:

(Correlation heatmap of the features)
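Before scaling, we separate the features from the target and hold out a test set. A minimal sketch, assuming the usual 80/20 split (a 20% test set of 569 rows matches the 114 test predictions we see later); the random_state and stratify settings are assumptions:

# Split features (x) and target (y), then hold out a test set
from sklearn.model_selection import train_test_split

x = df.drop('diagnosis', axis=1)   # the 30 numeric feature columns
y = df['diagnosis']                # 1 = malignant, 0 = benign

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y)   # assumed split settings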


Now let's scale the features so that machine learning models can interpret them consistently and make better predictions.

What does feature scaling do?

  • StandardScaler: Scales the features so they have a mean of 0 and standard deviation of 1. This helps many machine learning models perform better.

  • fit_transform(x_train): Learns the scaling parameters from x_train and applies the transformation.

  • transform(x_test): Applies the same transformation to x_test using the parameters learned from x_train.

✅ Feature scaling is important, especially for models like KNN, SVM, and Logistic Regression that are sensitive to feature magnitudes.


#FEATURE SCALING

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

x_train_scaled = ss.fit_transform(x_train)

x_test_scaled = ss.transform(x_test)


Now that the data is scaled, we will import the machine learning models.

This part is called model selection. Since this is a classification problem, we will try several classification algorithms to see which model gives the best results on this dataset. So we import all of the following models.

#model selections

from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier

from xgboost import XGBClassifier

from sklearn.svm import SVC

from sklearn.neighbors import KNeighborsClassifier

from sklearn.naive_bayes import GaussianNB

from lightgbm import LGBMClassifier

from catboost import CatBoostClassifier


#objects

lr = LogisticRegression()

rf = RandomForestClassifier()

gb = GradientBoostingClassifier()

xgb = XGBClassifier()

svc = SVC()

knn = KNeighborsClassifier()

nb = GaussianNB()

lgb = LGBMClassifier()

cat = CatBoostClassifier()



We have also created the model instances and stored them in the variables above, so we can refer to each model by a short name.

Now let's fit these models on our training data.
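A minimal sketch of this step, assuming each classifier is simply fit on the scaled training data:

# Fit every classifier on the scaled training data
for model in [lr, rf, gb, xgb, svc, knn, nb, lgb, cat]:
    model.fit(x_train_scaled, y_train)   # CatBoost prints its training log by default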






Our models are now trained on the training data. Next, let's test them on the test data that we held out during the train-test split.


lrpred = lr.predict(x_test_scaled)

rfpred = rf.predict(x_test_scaled)

gbpred = gb.predict(x_test_scaled)

xgbpred = xgb.predict(x_test_scaled)

svcpred = svc.predict(x_test_scaled)

knnpred = knn.predict(x_test_scaled)

nbpred = nb.predict(x_test_scaled)

lgbpred = lgb.predict(x_test_scaled)

catpred = cat.predict(x_test_scaled)


#Evaluations

from sklearn.metrics import accuracy_score

lracc = accuracy_score(y_test,lrpred)

rfacc = accuracy_score(y_test,rfpred)

gbacc = accuracy_score(y_test,gbpred)

xgbacc = accuracy_score(y_test,xgbpred)

svcacc = accuracy_score(y_test,svcpred)

knnacc = accuracy_score(y_test,knnpred)

nbacc = accuracy_score(y_test,nbpred)

lgbacc = accuracy_score(y_test,lgbpred)

catacc = accuracy_score(y_test,catpred)


print('LOGISTIC REG',lracc)

print('RANDOM FOREST',rfacc)

print('GB',gbacc)

print('XGB',xgbacc)

print('SVC',svcacc)

print('KNN',knnacc)

print('NB',nbacc)

print('LIGHT GBM',lgbacc)

print('CATBOOST',catacc)



  • LogisticRegression(), RandomForestClassifier(), SVC() and the other constructors above: Initialize each classifier with its default settings.

  • fit(x_train_scaled, y_train): Trains each model using the scaled training data.

  • predict(x_test_scaled): Uses each trained model to make predictions on the scaled test data.

✅ This is the core step where the machine learns patterns from training data and makes predictions on unseen data.



Output:

LOGISTIC REG: 0.9736842105263158 (97%)

RANDOM FOREST: 0.9649122807017544 (96%)

GB: 0.956140350877193 (95%)

XGB: 0.956140350877193 (95%)

SVC: 0.9824561403508771 (98%)

KNN: 0.9473684210526315 (94%)

NB: 0.9649122807017544 (96%)

LIGHT GBM: 0.9649122807017544 (96%)

CATBOOST: 0.9736842105263158 (97%)


This shows that nearly every model performs well on this dataset, helped by proper feature scaling and a sufficient amount of training data.




As we can see above, the best-performing model is SVC (Support Vector Classifier), so we will use this model in the final deployment phase. Before deployment, we will check the confusion matrix.

🎯 Why We Check the Confusion Matrix:

The confusion matrix gives a detailed breakdown of your classification model's performance by showing:

  • True Positives (TP) – Correctly predicted positives (e.g., correctly identified cancer cases)

  • True Negatives (TN) – Correctly predicted negatives (e.g., correctly identified non-cancer cases)

  • False Positives (FP) – Incorrectly predicted as positive (e.g., predicted cancer, but actually not)

  • False Negatives (FN) – Missed positives (e.g., predicted non-cancer, but actually cancer)

✅ Why It’s Important:

  • It helps identify where the model is making mistakes.

  • Useful in imbalanced datasets (e.g., more non-cancer than cancer cases), where accuracy alone can be misleading.

  • Helps calculate key metrics:

    • Precision = TP / (TP + FP)

    • Recall (Sensitivity) = TP / (TP + FN)

    • F1-Score = Harmonic mean of precision and recall

👉 In healthcare (like cancer prediction), a false negative can be dangerous. The confusion matrix highlights such critical errors, helping you improve the model.

#NOW CHECK THE CONFUSION MATRIX(for specific model)

from sklearn.metrics import confusion_matrix,classification_report

cm = confusion_matrix(y_test,svcpred) #Enter the model pred here

plt.title('Heatmap of Confusion matrix',fontsize=15)

sns.heatmap(cm,annot=True)

plt.show()

Output:






Now Let’s head to the Classification Report

  • classification_report is a function from sklearn.metrics.

  • It compares your model’s predictions (svcpred) with the actual values (y_test).

  • It prints four key performance metrics for each class (e.g., malignant or benign):

    • Precision

    • Recall

    • F1-score

    • Support

📊 Why Are We Checking the Classification Report?

Because it gives a more comprehensive evaluation of the classification model's performance than accuracy or a confusion matrix alone.

💡 Here’s what each metric tells you:

  • Precision: Out of all predicted positive cases, how many were actually positive?

    • Example: How many predicted “cancer” cases were actually cancer?

  • Recall (Sensitivity): Out of all actual positive cases, how many were correctly predicted?

    • Important in medical scenarios: we want to catch as many actual cancer cases as possible.

  • F1-score: Harmonic mean of precision and recall. A balance between both.

    • Useful when the class distribution is imbalanced.

  • Support: Number of actual instances for each class in y_test.


✅ Summary:

The classification report tells you how well your model performs per class, which is especially important in critical applications like healthcare.



Now Let's check the Classification Report

#NOW CHECK THE CLASSIFICATION REPORT

#Printing the classification report

print(classification_report(y_test,svcpred))

Output:




Now let's interpret the values from the output:

Class-wise Performance

🔹 Class 0: Benign Tumors

  • Precision = 0.97
    → 97% of the time when the model predicted "benign", it was correct.

  • Recall = 1.00
    → The model caught all actual benign tumors, missing none.

  • F1-Score = 0.99
    → Excellent overall performance for benign tumors.

🔹 Class 1: Malignant Tumors

  • Precision = 1.00
    → Every time the model predicted "malignant", it was correct: no false positives.

  • Recall = 0.95
    → The model identified 95% of actual malignant tumors. It missed 5%.

  • F1-Score = 0.98
    → Still a strong score, but we could aim to improve recall slightly.

📊 Overall Model Performance

✅ Accuracy = 0.98 (98%)

Out of 114 total predictions, the model got 112 right, which is very strong.

🧠 Macro vs. Weighted Averages

  • Macro Avg:

    • Simple average of precision, recall, and F1 across both classes.

    • Treats both classes equally, regardless of how many samples they have.

  • Weighted Avg:

    • Takes class imbalance into account by giving more weight to the class with more instances (in this case, class 0).

    • Better for overall performance reflection when classes are imbalanced.

Both are very high (~0.98–0.99), which confirms strong, reliable performance.



Now let’s check the cross-validation accuracy score. This will tell us whether our model has overfitted or underfitted.

🎯 Why Do We Check Cross-Validation Score?

When we train and test a machine learning model, it's easy to get overconfident when the accuracy is high. But how do we know it will perform well on unseen data?
That’s where cross-validation comes in.

✅ What Is Cross-Validation?

Cross-validation is a technique to evaluate a model's performance more reliably.
It works by splitting your dataset into multiple parts (called folds), training the model on some folds, testing it on the rest, and repeating this process several times.

🔁 Example: 5-Fold Cross Validation

  1. Split your data into 5 parts (folds)

  2. Use 4 parts for training and 1 part for testing

  3. Repeat this 5 times, changing the test fold each time

  4. Calculate the accuracy each time

  5. Average the results to get the cross-validation score

This gives you a more balanced and general idea of how your model will perform on new, real-world data.

🧠 Why It’s Important:

  • Prevents overfitting: Helps ensure your model is not just memorizing training data.

  • More reliable metric: It gives a more honest picture of accuracy compared to just one train-test split.

  • Helps in model selection: You can compare multiple models using cross-validation to see which generalizes best.
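As an illustration of the model-selection point above, here is a small sketch comparing a few of the classifiers we created earlier using 5-fold cross-validation (the choice of models shown is just an example):

from sklearn.model_selection import cross_val_score

# Compare a few classifiers with 5-fold cross-validation on the training data
for name, model in [('LOGISTIC REG', lr), ('RANDOM FOREST', rf), ('SVC', svc)]:
    scores = cross_val_score(model, x_train_scaled, y_train, cv=5)
    print(name, scores.mean())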


📌 Teaching Analogy:

Imagine studying for an exam by only solving one type of question: you'll do well on that type, but might struggle on the actual test.
Cross-validation is like practicing with different types of questions from different chapters; it prepares your model better for the final exam (real-world predictions)!


Checking:

from sklearn.model_selection import cross_val_score

cross_val = cross_val_score(estimator=svc, X=x_train_scaled, y=y_train, cv=5)  # 5-fold cross-validation

print('Cross Val Acc Score of SVC model is ---> ',cross_val)

print('\n Cross Val Mean Acc Score of SVC model is ---> ',cross_val.mean())

Output:

Cross Val Acc Score of SVC model is --->  [0.97802198 0.96703297 0.98901099 0.98901099 0.95604396]

 Cross Val Mean Acc Score of SVC model is --->  0.9758241758241759


Understanding Cross-Validation Accuracy Scores


1. What the Output Means

The output shows the performance of a Support Vector Classifier (SVC) model evaluated using 5-fold cross-validation:

Cross Val Acc Score of SVC model is --->  [0.97802198 0.96703297 0.98901099 0.98901099 0.95604396]

Cross Val Mean Acc Score of SVC model is --->  0.9758241758241759


First Line: Accuracy scores for each of the 5 validation folds.  

  - Fold 1: 97.80%  

  - Fold 2: 96.70%  

  - Fold 3: 98.90%  

  - Fold 4: 98.90%  

  - Fold 5: 95.60%  


Second Line: The mean accuracy across all folds (~97.58%).  


2. Key Concepts to Explain

A. Why Cross-Validation?

- Avoids overfitting by testing the model on different subsets of data.  

- More reliable than a single train-test split.  

B. Why 5 Folds?

- The dataset was split into 5 parts (folds).  

- Model trained on 4 folds, tested on the 5th, repeated 5 times.  

C. Interpreting Variability

- The scores range from 95.6% to 98.9%.

- Consistency Check: Small variation (all scores >95%) means the model generalizes well.

- Red Flag: If one fold had 70%, we’d investigate data imbalances or outliers.

D. Mean Accuracy  

- The average (97.58%) is the model’s expected performance on unseen data.


3. Why This Matters

- High Mean (~97.58%): The SVC model is very accurate.  

- Low Variance: Consistent scores suggest robustness.  


Analogy: Think of cross-validation like taking 5 different exams. If you score 95-99% on all, you’re consistently good, not just lucky!  


4. Potential Discussion Questions  

1. “Why might Fold 5’s score (95.6%) be slightly lower?”

   - Maybe harder samples or minor overfitting.  

2. “How could we improve the lowest fold’s score?” 

   - Feature engineering, hyperparameter tuning.  
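A minimal sketch of the hyperparameter-tuning idea, using GridSearchCV to tune the SVC's C and gamma (the parameter grid here is only an example, not a tuned recommendation):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Example grid; the values are illustrative
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.01, 0.001]}

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(x_train_scaled, y_train)

print('Best params:', grid.best_params_)
print('Best CV accuracy:', grid.best_score_)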



Now let’s save the model so we can use it later in the deployment phase.


#NOW save the model

import pickle

#Saving the model

pickle.dump(svc,open('breast_cancer_svc.pickle','wb'))


#loading the model

breast_cancer_svc_model = pickle.load(open('breast_cancer_svc.pickle','rb'))


#Predicting the output

y_pred = breast_cancer_svc_model.predict(x_test_scaled)


#confusion matrix

print('Confusion matrix of SVC model : \n', confusion_matrix(y_test,y_pred),'\n')


#showing off the accuracy score

print('Accuracy Score on testing data by svc model is ---> ',accuracy_score(y_test,y_pred))

Output:

Confusion matrix of SVC model : 

 [[71  0]

 [ 2 41]] 

Accuracy Score on testing data by svc model is --->  0.9824561403508771
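Reading this matrix with scikit-learn's layout [[TN, FP], [FN, TP]], we can recompute the headline metrics by hand:

  • TN = 71, FP = 0, FN = 2, TP = 41

  • Accuracy = (TP + TN) / Total = (41 + 71) / 114 ≈ 0.982

  • Precision (malignant) = TP / (TP + FP) = 41 / 41 = 1.00

  • Recall (malignant) = TP / (TP + FN) = 41 / 43 ≈ 0.95

These match the accuracy score printed above and the classification report we saw earlier.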




Now let me explain the above code and the output step by step

1. Saving the Model with `pickle`

import pickle

pickle.dump(svc, open('breast_cancer_svc.pickle', 'wb'))

What it does:

  - `pickle` is Python’s built-in module for serializing (saving) objects like trained models.  

  - `pickle.dump()` saves the trained `svc` model to a file named `breast_cancer_svc.pickle`.  

  - `'wb'` means "write in binary mode" (required for pickle files).  

Why it matters:  

  - Saves time: you don’t need to retrain the model every time.  

  - Lets you share/reuse the model in other projects or deploy it.  
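We imported joblib at the start but haven't used it yet; for scikit-learn models it is a common alternative to pickle. A minimal sketch (the file name is just an example):

import joblib

# Save and reload the same model with joblib instead of pickle
joblib.dump(svc, 'breast_cancer_svc.joblib')
svc_from_joblib = joblib.load('breast_cancer_svc.joblib')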


2. Loading the Saved Model

breast_cancer_svc_model = pickle.load(open('breast_cancer_svc.pickle', 'rb'))

What it does:  

  - `pickle.load()` reads the saved model from the file.  

  - `'rb'` means "read in binary mode".  

  - The loaded model is stored in `breast_cancer_svc_model` (now identical to the original `svc`).  

Key Point:  

  - The loaded model is ready to make predictions, just like before!  

3. Making Predictions

y_pred = breast_cancer_svc_model.predict(x_test_scaled)


What it does:  

  - Uses the loaded model (`breast_cancer_svc_model`) to predict outcomes for the scaled test features (`x_test_scaled`).  

  - Predictions are stored in `y_pred`.  


Reminder:  

  - Always scale test data the same way you scaled training data (e.g., using `StandardScaler`).  
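One practical way to guarantee matching scales in deployment is to bundle the fitted scaler and the model together in a scikit-learn Pipeline and save that single object. A minimal sketch (object and file names are just examples):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import pickle

# Bundle scaling + model so new data is always scaled the same way
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
pipe.fit(x_train, y_train)   # note: the raw (unscaled) features go in here

pickle.dump(pipe, open('breast_cancer_pipeline.pickle', 'wb'))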

4. Evaluating the Model

A. Confusion Matrix

print('Confusion matrix of SVC model : \n', confusion_matrix(y_test, y_pred), '\n')

What it shows:  

  - A table comparing actual (`y_test`) vs. predicted (`y_pred`) values.  

  - Example output (for binary classification):  

    [[TN FP]

     [FN TP]]

    

    - TN (True Negative), FP (False Positive), etc.  


Why it matters:  

  - Reveals model errors (e.g., how many false positives/negatives occurred).  

B. Accuracy Score

print('Accuracy Score on testing data by svc model is ---> ', accuracy_score(y_test, y_pred))

What it shows:  

  - The fraction of correct predictions: `(TP + TN) / Total`.  

  - Example: `0.98` means 98% accuracy.  


Limitation:  

  - Accuracy can be misleading for imbalanced datasets (always check the confusion matrix too!).  


Visualizing the Flow

1. Save Model → Load Model → Predict → Evaluate.  

2. Like saving a cooked dish (`pickle`), reheating it (loading), and tasting it (evaluation).  


Discussion Questions

1. "What happens if we forget to scale `x_test` before predicting?"

   - Answer: Predictions will be wrong! Scales must match training data.  

2. "Why not just use `svc` directly instead of saving/loading?"  

   - Answer: Saving lets you reuse the model without retraining.  


Key Takeaways

- `pickle` is for saving/loading Python objects (like models).  

- Always: Scale test data → Predict → Evaluate with multiple metrics.  

- Confusion matrices give deeper insight than accuracy alone.
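Finally, a small sketch of how the saved model would be used on a single new record at prediction time (here we reuse the first row of the scaled test set as a stand-in for a new patient):

# Predict for one "new" record using the loaded model
sample = x_test_scaled[[0]]                          # shape (1, 30)
prediction = breast_cancer_svc_model.predict(sample)[0]

print('Malignant' if prediction == 1 else 'Benign')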