Autism Prediction using AI (Part-3)


End-To-End Machine Learning Project Blog Part-3


Igniting New Discoveries 

Welcome Back to Part 3 of Our Autism Prediction Project!

Hello, my incredible viewers and students! I’m absolutely thrilled to welcome you back to Part 3 of our "Autism Prediction Classification Project" on this sunny Thursday morning.

After the amazing groundwork of Parts 1 and 2 (balancing the Class/ASD classes, engineering features like ageGroup and sum_score, and uncovering predictive patterns), we’re now diving into an action-packed day of feature engineering, exploratory data analysis (EDA), and more! 

Today, we’ll refine our data with label encoding, dive deep into feature distributions and correlations to sharpen our autism spectrum disorder (ASD) prediction model, and set the stage for the modeling phase. 

Whether you’re joining me from Istanbul’s vibrant streets or coding with passion from across the globe, let’s fuel our compassion with cutting-edge AI—cheers to making a difference together! 🌟🚀

Refining Our Craft: 

Feature Engineering and Encoding

Today, we’ll apply log transformation to reduce skewness in age, label encode all object-type columns to prepare them for modeling, and set the stage for deeper correlations and modeling. 

Cheers to transforming data for impact! 🌟🚀

Why Feature Engineering and Encoding Matter

Log transforming age helps normalize its distribution, while label encoding object-type columns (e.g., ethnicity, contry_of_res) converts them into numerical formats for our machine learning model. For clinicians, this refined data could enhance ASD prediction accuracy, supporting early interventions.

What to Expect in This Step

In this step, we’ll:

  • Apply a log transformation to age to reduce skewness.

  • Create a function to label encode all object-type columns.

  • Preview the updated DataFrame to see the transformed features.

Get ready to polish our dataset—our journey is about to get even more powerful!

Fun Fact: 

Log Transformation in Action!

Did you know the log transformation, long used in statistics, is a go-to method for handling skewed data in machine learning? It’s perfect for normalizing age in our autism dataset!

Real-Life Example

Imagine you’re a data scientist building an ASD screening tool. Log-transforming age and encoding contry_of_res ensures your model captures diverse age distributions and global patterns, improving predictions for local clinics!

Quiz Time!

Let’s test your feature engineering skills, students!

  1. What does log transformation do to age?
    a) Increases skewness
    b) Reduces skewness for a normal distribution
    c) Deletes the column
     

  2. Why label encode object columns?
    a) To confuse the model
    b) To convert categorical data into numbers for modeling
    c) To reduce dataset size
     

Drop your answers in the comments

Cheat Sheet: 

Feature Engineering and Encoding

  • df.column.apply(lambda x: np.log(x)): Applies log transformation to a column.

  • LabelEncoder().fit_transform(): Converts categorical values to integers.

  • Tip: Check df.dtypes to identify object columns before encoding.
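
If you want to see the effect in numbers, here’s a minimal, illustrative sketch (it builds its own tiny DataFrame rather than using our df, so the exact values are hypothetical) comparing skewness before and after the log transform:

import numpy as np
import pandas as pd

# A small, positively skewed sample of ages (illustrative values only)
sample = pd.DataFrame({'age': [4, 5, 6, 7, 8, 10, 12, 17, 25, 38, 53]})

print('Skewness before log transform:', sample['age'].skew())

# Natural log transform (valid because every age is positive)
sample['age_log'] = sample['age'].apply(lambda x: np.log(x))

print('Skewness after log transform :', sample['age_log'].skew())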

Did You Know?

Label encoding, available in scikit-learn since its first public release in 2010, is a foundational technique for handling categorical data—our project leverages it to prepare for advanced modeling!

Pro Tip:

Let’s refine our data! Log-transform age and encode categories—how will this boost our autism predictions?

What’s Happening in This Code?

Let’s break it down like we’re perfecting a vintage blend:

  • Log Transformation:

    • df.age = df.age.apply(lambda x: np.log(x)) applies a natural log transformation to the age column to reduce skewness, which is valid because all age values are positive (e.g., 6 and 53, judging from the transformed output).

  • Label Encoding Setup:

    • from sklearn.preprocessing import LabelEncoder imports the tool for encoding categorical data.

    • def encode_labels(data) defines a function to process all columns.

    • for col in data.columns: Loops through each column.

    • if data[col].dtype == 'object': Checks whether the column is of object type (e.g., strings).

    • le = LabelEncoder(); data[col] = le.fit_transform(data[col]): Encodes the column’s categories into integers (0, 1, 2, ...).

  • Apply Function: df = encode_labels(df) applies the encoding to the dataset.

  • Preview: df.head() displays the first 5 rows so we can confirm the changes.

Feature Engineering with Log Transformation and Label Encoding

Here’s the code we’re working with:

# Continuing Feature Engineering

import numpy as np  # np.log needs NumPy (likely already imported in earlier parts)

# Applying the log transformation to remove the skewness of the age data

df.age = df.age.apply(lambda x: np.log(x))


# And now label encoding for all the columns that are of 'object' dtype:

from sklearn.preprocessing import LabelEncoder


# Create a function that label encodes all the object-dtype columns

def encode_labels(data):

    for col in data.columns:

        # Check the dtype: if the column holds objects (strings), encode it

        if data[col].dtype == 'object':

            le = LabelEncoder()

            data[col] = le.fit_transform(data[col])

    return data


df = encode_labels(df)


df.head()


The Output:


Updated DataFrame with Log Transformation and Encoding

Take a look at the uploaded image! The output of df.head() shows the first 5 rows with transformed features:

  • Columns (from the image):

    • A1_Score to A10_Score: Binary values (0 or 1), unchanged.

    • austim: Encoded (e.g., 0 for ‘no’, 1 for ‘yes’—Row 1: 0, Row 2: 1).

    • contry_of_res: Encoded (e.g., 53 for United States, 39 for United Kingdom, 34 for Oman; LabelEncoder assigns codes in the alphabetical order of the category labels).

    • used_app_before: Encoded (e.g., 0 for ‘no’, 1 for ‘yes’—all 0 here).

    • result: Numerical, unchanged (e.g., 13.875569, 11.705242).

    • age_desc: Encoded (e.g., 0 for ‘18 and more’—all 0 here).

    • relation: Encoded (e.g., 5 for ‘Self’—all 5 here).

    • Class/ASD: Encoded (e.g., 1 for ASD—all 1 here due to oversampling focus).

    • ageGroup: Encoded (e.g., 3 for Teenager, 2 for Senior/OLD, 4 for Young, 1 for Kid).

    • sum_score: Total of A1-A10, unchanged (e.g., 10, 9, 8, 10, 10).

    • Pak: Label encoded from the concatenated yes/no strings created earlier (e.g., 0 for ‘nonono’, 4 for ‘yesyesno’); the codes are arbitrary, so the underlying feature logic still needs fixing.

    • age: Log-transformed (e.g., 1.791759 ≈ ln(6) and 3.970292 ≈ ln(53), the natural logs of the original ages).

Insight: The log transformation on age (e.g., ln(6) ≈ 1.79, ln(53) ≈ 3.97) reduces skewness and makes the column more suitable for modeling, though we should sanity-check the original range (an age of 6 looks low for an ‘18 and more’ dataset, so a data shift is possible). Label encoding successfully converted the object columns to integers (e.g., contry_of_res now holds codes like 34, 39, and 53, and ageGroup maps its five categories to 0-4), but Pak was encoded from the concatenated yes/no strings built earlier, so its codes carry little meaning; we’ll need to rebuild that feature’s logic before it can help. This step preps our data for correlations and modeling, with sum_score and encoded features like austim poised to be key predictors!
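
One caveat: our encode_labels function creates a fresh LabelEncoder for each column and then discards it, so the integer-to-category mapping isn’t stored anywhere. Here’s a hedged sketch of a variant we could have used instead (the encoders dictionary and the function name are my own additions, not part of the original notebook) that keeps each fitted encoder so the mapping can be inspected later:

from sklearn.preprocessing import LabelEncoder

# Hypothetical variant of encode_labels that remembers every fitted encoder
encoders = {}

def encode_labels_with_mapping(data):
    for col in data.columns:
        if data[col].dtype == 'object':
            le = LabelEncoder()
            data[col] = le.fit_transform(data[col])
            encoders[col] = le  # keep the fitted encoder for this column
    return data

# After running it, encoders['contry_of_res'].classes_ lists the countries in the
# order of their integer codes (LabelEncoder sorts the labels alphabetically).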

Next Steps

We’ve transformed our data—fantastic progress with a small fix ahead! Next, we’ll dive into EDA to check distributions and correlations with Class/ASD. 

What do you think of this encoding, viewers? Drop your thoughts in the comments, and let’s make this project a game-changer together! 🌟🚀



Unveiling the Connections: Correlation Heatmap

After refining our dataset with log transformation on age and label encoding object-type columns, we’re now diving deeper into exploratory data analysis (EDA) with a correlation heatmap. This code block calculates the correlation between all features and our target Class/ASD, revealing which factors most strongly predict autism spectrum disorder (ASD). 

Let’s raise our spirits to uncover these critical insights—cheers to data-driven breakthroughs! 🌟🚀

Why a Correlation Heatmap Matters

A correlation heatmap helps us identify which features (e.g., sum_score, ageGroup) are most linked to Class/ASD, guiding our model’s focus. For healthcare professionals, this could highlight key screening questions or demographic factors, enhancing early ASD detection.

What to Expect in This Step

In this step, we’ll:

  • Compute the correlation matrix for all features in our dataset.

  • Create a heatmap to visualize correlations, with annotations for clarity.

  • Analyze the output to pinpoint the strongest predictors for Class/ASD.

Get ready to decode the relationships in our data—our journey is revealing powerful patterns!

Fun Fact: 

Heatmaps in Data Science!

Did you know correlation heatmaps became a staple of data analysis with the rise of Python libraries like Seaborn in the mid-2010s? They’re perfect for spotting trends in our autism dataset!

Real-Life Example

Imagine you’re a researcher designing an ASD screening tool. A heatmap showing high correlation between sum_score and Class/ASD could confirm that behavioral scores are the best indicator, shaping your tool’s focus!

Quiz Time!

Let’s test your EDA skills, students!

  1. What does a correlation heatmap show?
    a) Average values of features
    b) Strength and direction of relationships between features
    c) Count of categories
     

  2. What does a value close to 1 or -1 mean?
    a) No correlation
    b) Strong correlation
    c) Weak correlation
     

Drop your answers in the comments

Cheat Sheet: 

Creating a Correlation Heatmap

  • df.corr(): Computes the Pearson correlation matrix.

  • sns.heatmap(data, annot=True, cmap=...): Visualizes the matrix with annotations and a color map (e.g., ‘plasma’).

  • Tip: Use plt.figure(figsize=(25,9)) for a wide heatmap with many features.
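
As an optional readability tweak (my own addition, not part of the original notebook), you can mask the upper triangle so each correlation appears only once, which helps when the feature count makes the full matrix crowded. A minimal sketch, assuming df is our encoded DataFrame:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Mask the upper triangle of the correlation matrix so each pair appears once
corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(25, 9))
sns.heatmap(corr, mask=mask, annot=True, cbar=True, cmap='plasma')
plt.show()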

Did You Know?

The Pearson correlation coefficient, in use since the early 1900s, measures linear relationships—our heatmap leverages it to find the strongest autism predictors!

Pro Tip

Which features predict autism best? Let’s dive into a correlation heatmap!

What’s Happening in This Code?

Let’s break it down like we’re mapping a treasure hunt:

  • Correlation Matrix: corr = df.corr() calculates the Pearson correlation coefficient between all pairs of columns, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).

  • Figure Size: plt.figure(figsize=(25,9)) sets a wide 25x9-inch plot to accommodate many features.

  • Heatmap: sns.heatmap(corr, annot=True, cbar=True, cmap='plasma') visualizes the correlation matrix:

    • annot=True adds numerical values to each cell.

    • cbar=True includes a color bar to interpret the scale.

    • cmap='plasma' uses a colorful gradient (yellow for high positive, purple for low/negative).

  • Display: plt.show() renders the plot.

Creating a Correlation Heatmap

Here’s the code we’re working with:

corr = df.corr()


plt.figure(figsize=(25,9))

sns.heatmap(corr, annot=True, cbar=True, cmap='plasma')

plt.show()

The Output:


Correlation Heatmap

Take a look at the uploaded image! The heatmap displays correlations between all features, with Class/ASD as the target:

  • X-Axis and Y-Axis: Lists all columns (e.g., A1_Score to A10_Score, austim, contry_of_res, result, age_desc, relation, Class/ASD, ageGroup, sum_score, Pak, age).

  • Color Scale: Ranges from purple (~-0.2) to yellow (~1.0), with the color bar indicating correlation strength:

    • Purple: Weak or negative correlation.

    • Yellow: Strong positive correlation.

  • Key Observations:

    • A1_Score to A10_Score vs. Class/ASD: Moderate positive correlations (e.g., A1_Score: 0.47, A10_Score: 0.51), showing individual screening questions contribute to ASD prediction.

    • sum_score vs. Class/ASD: Strongest correlation (0.97), confirming that the total of A1-A10 scores is a top predictor, aligning with our earlier EDA.

    • result vs. Class/ASD: Also high (0.97), redundant with sum_score since it’s the same metric.

    • age vs. Class/ASD: Weak correlation (~0.02), suggesting log-transformed age has little direct impact.

    • austim vs. Class/ASD: Moderate (0.28), indicating family history of autism has some influence.

    • jaundice vs. Class/ASD: Weak (~0.04), showing limited predictive power.

    • used_app_before vs. Class/ASD: Very weak (~0.01), suggesting prior app use is not a strong factor.

    • Pak vs. Class/ASD: Weak (~0.07), likely due to improper encoding as strings earlier.

    • ageGroup vs. Class/ASD: Weak (~0.03), consistent with balanced distribution across groups.

    • contry_of_res, ethnicity, etc.: Mostly weak (<0.2), indicating geographical or demographic factors have minimal direct correlation post-encoding.

Insight: The heatmap highlights sum_score (and result) as the dominant predictors for Class/ASD, with a 0.97 correlation reflecting their direct derivation from A1-A10 scores. Individual scores (A1-A10) contribute moderately, while austim (0.28) adds some value. Weak correlations for age, ageGroup, contry_of_res, and Pak suggest these need further refinement or interaction terms. The redundancy between sum_score and result means we might drop one. This analysis guides our feature selection for modeling.
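
If you’d like a quick numeric companion to the heatmap, here’s a small sketch (assuming df is the encoded DataFrame from the previous step) that ranks every feature by its correlation with the target, an easy way to confirm the ordering described above:

# Rank all features by their correlation with Class/ASD
target_corr = df.corr()['Class/ASD'].drop('Class/ASD').sort_values(ascending=False)

print(target_corr.head(10))   # strongest positive correlations (sum_score, result, A-scores, ...)
print(target_corr.tail(5))    # weakest or negative correlations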

Next Steps:

We’ve uncovered key correlations—amazing insights! 

Next, we'll dive into more EDA (e.g., feature distributions) and prepare for modeling.

What stood out in the heatmap, viewers? Drop your thoughts in the comments, and let’s make this project a game-changer together! 🌟🚀



Unleashing Predictive Power 

Model Training and Evaluation


After refining our dataset with log transformation, label encoding, and uncovering correlations via a heatmap, we’re now diving into the heart of our journey—model training, fitting, and predictions. This code block brings it all together: splitting data into training and test sets, scaling features, training nine diverse models (Logistic Regression, Random Forest, Gradient Boosting, XGBoost, SVC, KNN, Naive Bayes, LightGBM, and CatBoost), and evaluating their accuracy on predicting autism spectrum disorder (ASD). 

Let's raise our spirits to build a life-changing model—cheers to precision and impact! 🌟🚀

Why Model Training and Evaluation Matter

Training multiple models and comparing their accuracy helps us select the best tool for ASD prediction. For healthcare providers, this could mean identifying the most reliable method to flag autism early, ensuring timely support for families.


What to Expect in This Step

In this step, we’ll:

- Split and scale our data for training and testing.

- Train nine different machine learning models on the scaled data.

- Evaluate their performance using accuracy scores and compare results.


Get ready to witness the power of predictive modeling—our journey is reaching a thrilling peak!


Fun Fact: 

Model Diversity in AI!

Did you know combining models like Random Forest and XGBoost, popularized in the 2010s, often boosts accuracy in medical predictions? Our diverse lineup is tapping into this winning strategy!

Real-Life Example

Imagine you’re a clinician using our model to screen patients. A high-accuracy model like Random Forest (92.97%) could help you confidently identify ASD cases, transforming lives with early intervention!


Quiz Time!

Let’s test your modeling skills, students!

1. What does `train_test_split` do?  

   a) Combines training and test data  

   b) Splits data into training and test sets  

   c) Deletes the dataset  


2. Why scale features with `StandardScaler`?  

   a) To increase dataset size  

   b) To normalize data for better model performance  

   c) To remove correlations  

Drop your answers in the comments


Cheat Sheet: 

Model Training and Evaluation

- `train_test_split(x, y, test_size=0.2, random_state=42)`: Splits data (80% train, 20% test).

- `StandardScaler().fit_transform()`: Scales features to have mean 0 and variance 1.

- `accuracy_score(y_test, y_pred)`: Measures prediction accuracy.


Did You Know?

The Random Forest algorithm, introduced in 2001 by Leo Breiman, revolutionized ensemble learning—our project leverages it alongside modern tools like CatBoost!


Pro Tip:

Which model will win? Let’s train nine and find the best for autism prediction!

What’s Happening in This Code?

Let’s break it down like we’re assembling a winning team:

- Data Split: 

  - `x = df.drop(['ID', 'used_app_before', 'Class/ASD'], axis=1)`: Removes `ID` (unique identifier), `used_app_before` (low correlation), and `Class/ASD` (target) from features.

  - `y = df['Class/ASD']`: Sets the target variable.

- Train-Test Split: `train_test_split(x, y, test_size=0.2, random_state=42)` splits data into 80% train (e.g., ~1022 rows) and 20% test (e.g., ~256 rows) with a fixed seed.

- Feature Scaling

  - `StandardScaler().fit_transform(x_train)` and `.transform(x_test)` scales features to zero mean and unit variance, critical for models like SVM and KNN.

- Model Selection: Imports nine models: LogisticRegression, RandomForestClassifier, GradientBoostingClassifier, XGBClassifier, SVC, KNeighborsClassifier, GaussianNB, LGBMClassifier, CatBoostClassifier.

- Model Objects: Initializes each model (e.g., `lr = LogisticRegression()`).

- Fittings: Trains each model on `x_train_scaled` and `y_train` (e.g., `lr.fit(...)`), with `lgb.set_params(verbosity=-1)` and `cat.fit(..., verbose=False)` to suppress logs.

- Predictions: Generates predictions on `x_test_scaled` for each model (e.g., `lrpred = lr.predict(...)`).

- Evaluations: Uses `accuracy_score` to compare predictions with `y_test`, printing the results.


Model Training, Fitting, and Evaluation

Here’s the code we’re working with:

# Model Trainings, Fittings and Predictions all in one code


# NOW split the data into x and y

x = df.drop(['ID', 'used_app_before', 'Class/ASD'], axis=1)

y = df['Class/ASD']


# Apply the train test split

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)


# FEATURE SCALING

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

x_train_scaled = ss.fit_transform(x_train)

x_test_scaled = ss.transform(x_test)


# Model selections

from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from xgboost import XGBClassifier

from sklearn.svm import SVC

from sklearn.neighbors import KNeighborsClassifier

from sklearn.naive_bayes import GaussianNB

from lightgbm import LGBMClassifier

from catboost import CatBoostClassifier


# Objects

lr = LogisticRegression()

rf = RandomForestClassifier()

gb = GradientBoostingClassifier()

xgb = XGBClassifier()

svc = SVC()

knn = KNeighborsClassifier()

nb = GaussianNB()

lgb = LGBMClassifier()

cat = CatBoostClassifier()


# Fittings

lr.fit(x_train_scaled, y_train)

rf.fit(x_train_scaled, y_train)

gb.fit(x_train_scaled, y_train)

xgb.fit(x_train_scaled, y_train)

svc.fit(x_train_scaled, y_train)

knn.fit(x_train_scaled, y_train)

nb.fit(x_train_scaled, y_train)


lgb.set_params(verbosity=-1)  # Suppress logs globally

lgb.fit(x_train_scaled, y_train)

cat.fit(x_train_scaled, y_train, verbose=False)


# Now predictions

lrpred = lr.predict(x_test_scaled)

rfpred = rf.predict(x_test_scaled)

gbpred = gb.predict(x_test_scaled)

xgbpred = xgb.predict(x_test_scaled)

svcpred = svc.predict(x_test_scaled)

knnpred = knn.predict(x_test_scaled)

nbpred = nb.predict(x_test_scaled)

lgbpred = lgb.predict(x_test_scaled)

catpred = cat.predict(x_test_scaled)


# Evaluations

from sklearn.metrics import accuracy_score

lracc = accuracy_score(y_test, lrpred)

rfacc = accuracy_score(y_test, rfpred)

gbacc = accuracy_score(y_test, gbpred)

xgbacc = accuracy_score(y_test, xgbpred)

svcacc = accuracy_score(y_test, svcpred)

knnacc = accuracy_score(y_test, knnpred)

nbacc = accuracy_score(y_test, nbpred)

lgbacc = accuracy_score(y_test, lgbpred)

catacc = accuracy_score(y_test, catpred)


print('LOGISTIC REG', lracc)

print('RANDOM FOREST', rfacc)

print('GB', gbacc)

print('XGB', xgbacc)

print('SVC', svcacc)

print('KNN', knnacc)

print('NB', nbacc)

print('LIGHT GBM', lgbacc)

print('CATO', catacc)


Model Accuracy Scores

Here’s the output:

LOGISTIC REG 0.7890625

RANDOM FOREST 0.9296875

GB 0.91015625

XGB 0.921875

SVC 0.8984375

KNN 0.8671875

NB 0.8203125

LIGHT GBM 0.92578125

CATO 0.92578125


Explanation of the Results:

- Accuracy Scores: Percentage of correct predictions on the test set (256 samples, assuming 1278 total rows × 0.2).

  - Logistic Regression (0.789): 78.91% accuracy, a solid baseline.

  - Random Forest (0.930): 92.97% accuracy, the top performer.

  - Gradient Boosting (0.910): 91.02% accuracy, close behind.

  - XGBoost (0.922): 92.19% accuracy, very competitive.

  - SVC (0.898): 89.84% accuracy, good but not the best.

  - KNN (0.867): 86.72% accuracy, moderate performance.

  - Naive Bayes (0.820): 82.03% accuracy, decent baseline.

  - LightGBM (0.926): 92.58% accuracy, tied for second.

  - CatBoost (0.926): 92.58% accuracy, tied for second.


Insight: Random Forest leads with 92.97% accuracy, followed closely by LightGBM and CatBoost (92.58%) and XGBoost (92.19%), showcasing the strength of ensemble methods on our balanced dataset. Logistic Regression and Naive Bayes lag, likely due to linear assumptions on non-linear data. The high accuracies reflect our balanced `Class/ASD` and strong predictor `sum_score` (0.97 correlation). Next, we’ll optimize the top models and explore other metrics like precision and recall to ensure fairness for ASD detection!
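
By the way, if you ever find nine nearly identical fit/predict/score blocks hard to maintain, here’s a hedged sketch of a more compact pattern (the models dictionary and results names are my own; it assumes the model objects and the scaled splits from the code above already exist):

from sklearn.metrics import accuracy_score

# Train and score every model in one loop (CatBoost and LightGBM will print their
# own training logs here unless silenced, e.g. CatBoostClassifier(verbose=0))
models = {
    'LOGISTIC REG': lr, 'RANDOM FOREST': rf, 'GB': gb, 'XGB': xgb,
    'SVC': svc, 'KNN': knn, 'NB': nb, 'LIGHT GBM': lgb, 'CATBOOST': cat,
}

results = {}
for name, model in models.items():
    model.fit(x_train_scaled, y_train)
    results[name] = accuracy_score(y_test, model.predict(x_test_scaled))

# Print the scores from best to worst
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name:<13} {acc:.4f}')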


Next Steps:

We’ve trained some stellar models—fantastic results! Next, we’ll optimize the top performers (e.g., Random Forest, LightGBM), evaluate additional metrics (e.g., precision, recall), and interpret their predictions for ASD. 



Decoding Predictions

Confusion Matrix for Random Forest 


After training nine models and finding Random Forest as our top performer with 92.97% accuracy, we’re now diving deeper into its performance using a confusion matrix. This code block visualizes the confusion matrix for Random Forest predictions (`rfpred`), helping us understand how well our model distinguishes between ASD (1) and no ASD (0). 

Let’s raise our spirits to evaluate our model’s impact—cheers to precision in autism prediction! 🌟🚀


Why a Confusion Matrix Matters

A confusion matrix breaks down true positives, true negatives, false positives, and false negatives, showing where our Random Forest model excels or stumbles in predicting ASD. For clinicians, this ensures we’re not missing critical ASD cases, balancing accuracy with sensitivity.


What to Expect in This Step

In this step, we’ll:

- Compute the confusion matrix for Random Forest predictions (`rfpred`).

- Visualize it as a heatmap with annotations for clarity.

- Analyze the results to assess model performance beyond accuracy.


Get ready to dive into the nitty-gritty of our predictions—our journey is getting even more revealing!


Fun Fact: 

Confusion Matrix Origins!

Did you know the confusion matrix concept dates back to the 1950s in signal detection theory? It’s now a cornerstone for evaluating classification models like ours in autism prediction!


Real-Life Example

Imagine you’re a pediatrician using our Random Forest model to screen children. A confusion matrix showing few false negatives ensures you’re catching most ASD cases, giving families the support they need early!


Quiz Time!

Let’s test your evaluation skills, students!

1. What does the diagonal of a confusion matrix show?  

   a) Incorrect predictions  

   b) Correct predictions (true positives and true negatives)  

   c) Total predictions  

   


2. Why is a false negative critical in ASD prediction?  

   a) It means missing an ASD case  

   b) It means over-diagnosing ASD  

   c) It improves accuracy  

   


Drop your answers in the comments


Cheat Sheet: 

Confusion Matrix Visualization

- `confusion_matrix(y_test, y_pred)`: Computes the matrix comparing true and predicted labels.

- `sns.heatmap(data, annot=True)`: Visualizes the matrix with numbers in each cell.

- Tip: Add `plt.xlabel('Predicted')` and `plt.ylabel('Actual')` for better clarity.


Did You Know?

Heatmaps for confusion matrices became popular with Python’s Seaborn library in the mid-2010s—our visualization leverages this modern standard for clarity!


Pro Tip

How good is our Random Forest at spotting ASD? Let’s break it down with a confusion matrix!


What’s Happening in This Code?

Let’s break it down like we’re solving a puzzle:

- Imports: `from sklearn.metrics import confusion_matrix, classification_report` brings in tools for evaluation.

- Confusion Matrix: `cm = confusion_matrix(y_test, rfpred)` computes the matrix comparing actual (`y_test`) and predicted (`rfpred`) labels for Random Forest.

- Visualization

  - `plt.title('Heatmap of Confusion Matrix', fontsize=15)` sets the title.

  - `sns.heatmap(cm, annot=True)` creates a heatmap with annotated values in each cell.

- Display: `plt.show()` renders the plot.


Visualizing the Confusion Matrix for Random Forest


Here’s the code we’re working with:


# NOW CHECK THE CONFUSION MATRIX (for specific model)

from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, rfpred)  # Enter the model pred here

plt.title('Heatmap of Confusion Matrix', fontsize=15)

sns.heatmap(cm, annot=True)

plt.show()




The Output:


Confusion Matrix Heatmap

The heatmap shows the confusion matrix for Random Forest predictions:


- Axes:

  - X-Axis (Predicted Labels): 0 (no ASD), 1 (ASD).

  - Y-Axis (Actual Labels): 0 (no ASD), 1 (ASD).

- Matrix Values (2x2 for binary classification):

  - True Negatives (0,0): 122 (actual 0, predicted 0—correctly identified no ASD).

  - False Positives (0,1): 6 (actual 0, predicted 1—incorrectly predicted ASD).

  - False Negatives (1,0): 12 (actual 1, predicted 0—missed ASD cases).

  - True Positives (1,1): 116 (actual 1, predicted 1—correctly identified ASD).

- Color Scale: Lighter shades (e.g., white for 122) indicate higher counts; darker shades (e.g., dark blue for 6) indicate lower counts.


Insight: Random Forest’s accuracy of 92.97% (from earlier: 0.9296875) aligns with the matrix: (122 + 116) / (122 + 6 + 12 + 116) = 238 / 256 ≈ 0.9297. The model excels at identifying both classes, with 122 true negatives (no ASD) and 116 true positives (ASD). However, it has 12 false negatives (missed ASD cases), which is critical in a medical context—we’ll aim to reduce this with optimization. The 6 false positives (over-diagnosing ASD) are less concerning but still worth minimizing. This matrix gives us a solid starting point to improve sensitivity for ASD detection!
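
Since classification_report is already imported above, here’s a quick, hedged sketch of how we could pull precision, recall, and F1 for the same Random Forest predictions; recall on the ASD class is the number to watch if we want fewer missed cases (the target_names labels are my own choice):

from sklearn.metrics import classification_report, recall_score

# Per-class precision, recall, and F1 for the Random Forest predictions
print(classification_report(y_test, rfpred, target_names=['No ASD (0)', 'ASD (1)']))

# Sensitivity for the ASD class alone: with 116 true positives and 12 false
# negatives this works out to 116 / (116 + 12) ≈ 0.906
print('ASD recall:', recall_score(y_test, rfpred))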


Next Steps:

We’ve dissected Random Forest’s performance, great insights! Next, we’ll dive into the classification report for more metrics (e.g., precision, recall, F1-score), optimize our model to reduce false negatives, and interpret its predictions. 



Evaluating Model Fit: Cross-Validation


After analyzing Random Forest’s performance with a confusion matrix, we’re now focusing on another model—Support Vector Classifier (SVC)—to check for overfitting or underfitting using cross-validation. This code block performs 5-fold cross-validation on SVC, calculating accuracy scores for each fold and their mean, ensuring our model generalizes well for autism spectrum disorder (ASD) prediction. 

Let’s raise our spirits to fine-tune our models—cheers to robust predictions! 🌟🚀


Why Cross-Validation Matters

Cross-validation helps us assess if our SVC model overfits (performs too well on training data but poorly on unseen data) or underfits (fails to learn patterns). For healthcare providers, a well-generalized model ensures reliable ASD predictions across diverse patients.


What to Expect in This Step

In this step, we’ll:

- Perform 5-fold cross-validation on the SVC model using the training data.

- Print the accuracy scores for each fold and their mean.

- Analyze the results to determine if SVC overfits, underfits, or generalizes well.


Get ready to ensure our model’s reliability—our journey is getting even more refined!


Fun Fact: 

Cross-Validation in AI!

Did you know cross-validation, popularized in the 1970s, is a gold standard for model evaluation? It’s perfect for ensuring our SVC model is ready for real-world autism screening!


Real-Life Example

Imagine you’re a data scientist in 2025, validating an ASD prediction tool. Cross-validation showing consistent scores across folds ensures your SVC model won’t fail when deployed in local clinics, giving families confidence in the results!


Quiz Time!

Let’s test your evaluation skills, students!

1. What does cross-validation do?  

   a) Trains the model on test data  

   b) Splits training data into folds to evaluate generalization  

   c) Deletes the dataset  

   


2. What does a high variance in cross-validation scores suggest?  

   a) Model generalizes well  

   b) Model might be overfitting  

   c) Model is perfect  

   


Drop your answers in the comments


Cheat Sheet: Cross-Validation with `cross_val_score`

- `cross_val_score(estimator, X, y)`: Performs k-fold cross-validation (default k=5).

- `cross_val.mean()`: Calculates the mean of the fold scores.

- Tip: Check the variance of scores (e.g., max - min) to assess consistency.
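
For reference, here’s a tiny sketch with the key parameters spelled out explicitly (cv=5 and accuracy scoring match scikit-learn’s defaults for a classifier, so it should mirror the call we use below), plus the standard deviation as a consistency check:

from sklearn.model_selection import cross_val_score

# Explicit 5-fold, accuracy-scored cross-validation on the scaled training set
scores = cross_val_score(estimator=svc, X=x_train_scaled, y=y_train, cv=5, scoring='accuracy')

print('Fold scores:', scores)
print('Mean accuracy:', scores.mean())
print('Std of fold scores:', scores.std())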


Did You Know?

The Support Vector Machine (SVM) algorithm, introduced in 1995, excels in high-dimensional spaces—our cross-validation ensures SVC’s robustness for autism prediction!


Pro Tip:

Is our SVC model overfitting? Let’s check with cross-validation!


What’s Happening in This Code?

Let’s break it down like we’re fine-tuning a recipe:

- Imports: `from sklearn.model_selection import cross_val_score` brings in the cross-validation tool.

- Cross-Validation: `cross_val_score(estimator=svc, X=x_train_scaled, y=y_train)` performs 5-fold cross-validation:

  - `estimator=svc`: Uses the trained SVC model.

  - `X=x_train_scaled`: Uses scaled training features.

  - `y=y_train`: Uses training labels.

  - Default `cv=5` splits the training data into 5 folds (~204 samples per fold, given ~1022 training rows).


- Results

  - `print(cross_val)`: Displays accuracy for each fold.

  - `print(cross_val.mean())`: Shows the mean accuracy across folds.


Cross-Validation for SVC Model


Here’s the code we’re working with:



# (TO CHECK IF THE MODEL HAS OVERFITTED OR UNDERFITTED)

from sklearn.model_selection import cross_val_score

cross_val = cross_val_score(estimator=svc, X=x_train_scaled, y=y_train)

print('Cross Val Acc Score of SVC model is ---> ', cross_val)

print('\n Cross Val Mean Acc Score of SVC model is ---> ', cross_val.mean())



Output:
Cross Val Acc Score of SVC model is --->  [0.92156863 0.87745098 0.83823529 0.87745098 0.8872549 ]
Cross Val Mean Acc Score of SVC model is --->  0.8803921568627452

Explanation of the Results:

- Fold Scores: Accuracy for each of the 5 folds:

  - Fold 1: 0.9216 (92.16%)

  - Fold 2: 0.8775 (87.75%)

  - Fold 3: 0.8382 (83.82%)

  - Fold 4: 0.8775 (87.75%)

  - Fold 5: 0.8873 (88.73%)

- Mean Score: Average accuracy across folds: 0.8804 (88.04%).

- Spread: Range of scores (max - min) = 0.9216 - 0.8382 = 0.0834 (about 8.3 percentage points).


Insight: The SVC model’s mean cross-validation accuracy of 88.04% sits just below its test accuracy of 89.84% (from earlier: 0.8984375), suggesting it generalizes well with no significant overfitting. The 8.3-point spread across folds indicates some inconsistency, possibly due to the relatively small fold sizes or uneven class mixes in particular splits. The model isn’t underfitting either, as the gap between test and cross-validation accuracy (89.84% - 88.04% = 1.8 points) is small. However, the lowest fold score (83.82%) hints at some sensitivity to how the data is split—we might improve this with hyperparameter tuning (e.g., adjusting SVC’s `C` or `kernel`). This solid performance makes SVC a reliable choice, though Random Forest (92.97%) remains our top model!
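
As a small preview of that tuning, here’s a hedged sketch of a grid search over SVC’s C and kernel (the parameter values are illustrative only, not necessarily what we’ll use in Part 4):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid only; a real search might cover more values
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['rbf', 'linear'],
}

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(x_train_scaled, y_train)

print('Best params:', grid.best_params_)
print('Best CV accuracy:', grid.best_score_)
print('Test accuracy:', grid.score(x_test_scaled, y_test))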


Next Steps

We’ve validated SVC’s fit—great consistency! Next, we’ll optimize our top models (e.g., Random Forest, SVC) with hyperparameter tuning, explore additional metrics like precision and recall, and interpret feature importance for ASD prediction.



A Milestone of Precision

 Wrapping Up Part 3 of Our Autism Prediction Project!


What an exhilarating journey we’ve shared, my incredible viewers and students! We’ve just wrapped up Part 3 of our "Autism Prediction Classification Project", and I’m overflowing with pride at our achievements. 

We kicked off with advanced feature engineering—log-transforming `age` and label encoding categorical columns—then dove into EDA with a correlation heatmap, confirming `sum_score` as a top predictor (0.97 correlation with `Class/ASD`). We trained nine models, with Random Forest leading at 92.97% accuracy, analyzed its confusion matrix (122 true negatives, 116 true positives), and validated SVC’s generalization with cross-validation (88.04% mean score). Every step has brought us closer to a reliable autism spectrum disorder (ASD) prediction tool, ready to support early diagnosis for families. 

Whether you’re joining me from the vibrant streets of Stockholm, Sweden, or coding with heart from across the globe, your passion has made this journey transformative—let’s raise a toast to a phenomenal Part 3! 🌟🚀


 The Future Is Bright: Get Ready for Part 4!


Hold onto your excitement, because Part 4 is set to elevate our project to extraordinary heights! On our website, www.theprogrammarkid004.online, we’ll:

- Choose the Best Model: Select our top performer (Random Forest?) for ASD prediction.

- Hyperparameter Tuning: Optimize its performance for even higher accuracy.

- SHAP Analysis: Interpret feature importance to understand what drives ASD predictions.

- Advanced Model Evaluation: Dive into precision, recall, and ROC curves for a holistic view.

- Optimal Threshold: Fine-tune prediction thresholds to minimize false negatives.

- Model Deployment and Serialization: Prepare our model for real-world use and save it for future applications.


Make sure to subscribe at www.youtube.com/@cognitutorai, hit that notification bell, and join our community of compassionate coders. 

Let’s keep this meaningful journey soaring. What was your favorite moment—Random Forest’s 92.97% accuracy or SVC’s cross-validation? 

Drop it in the comments, and tell me what you’re most excited for in Part 4.

I can’t wait to make this project a life-changer with you! 🌟🚀