Autism Prediction using AI (Part-1)


End-To-End Machine Learning Project Blog Part-1

Unlocking Insights with AI




Welcome to Our Autism Prediction Classification Project!

Hello, my brilliant viewers and students! I’m beyond excited to welcome you to our brand-new machine learning adventure, our "Autism Prediction Classification Project", kicking off on this sunny Wednesday morning.

Today, we’re diving into a meaningful journey to harness the power of AI in predicting autism spectrum disorder (ASD) using classification techniques. This project isn’t just about coding—it’s about making a real-world impact by identifying patterns that could support early diagnosis, helping families, educators, and healthcare professionals better understand and assist individuals with autism. 

Whether you’re joining me from Miami’s bustling streets or tuning in with a curious mind from across the globe, let’s blend data science with compassion to create something truly transformative. Grab your notebooks, fire up your coding spirits, and let’s embark on this heartfelt mission together—cheers to using AI for a brighter future! 🌟🚀

Diving Into the Data: Exploring the Autism Dataset in Our Classification Project!

We’re kicking things off by loading our autism dataset and taking a first peek at its features, setting the foundation for predicting autism spectrum disorder (ASD) with machine learning. This code block imports essential libraries, loads the dataset, and displays its first few rows—giving us a taste of the data we’ll use to make a meaningful impact. Let’s dive into this heartfelt project with enthusiasm—cheers to using AI for good! 🌟🚀

Why Exploring the Autism Dataset Matters

Understanding our dataset is the first step to building a reliable model for autism prediction. By examining features like behavioral scores and demographic details, we can identify patterns that might help clinicians in London or beyond flag ASD early, supporting timely interventions for children and families.

What to Expect in This Step

In this opening act, we’ll:

  • Load the autism dataset and peek at its first 5 rows.

  • Explain each column to understand what data we’re working with.

  • Set the stage for preprocessing and modeling in the next steps.

Get ready to uncover the building blocks of our prediction model—our journey is off to a compassionate start!

Fun Fact: 

AI in Autism Research!

Did you know AI has been used since the early 2010s to assist in autism diagnosis? Studies have shown machine learning can detect ASD patterns in behavioral data with over 90% accuracy—our project aims to contribute to that impactful legacy!

Real-Life Example

Imagine you’re a pediatrician in Hamburg, Germany on this Wednesday morning, evaluating a child for developmental concerns. By analyzing features like behavioral scores from our dataset, our model could help you identify potential ASD traits early, guiding families toward the right support and resources!

Quiz Time!

Let’s test your data skills, students!

  1. Why might behavioral scores be important for autism prediction?
    a) They measure physical health
    b) They reflect traits often associated with ASD
    c) They predict academic success
     

  2. What does df.head() do?
    a) Deletes the first rows
    b) Shows the first 5 rows of the dataset
    c) Changes column names
     

Drop your answers in the comments—I’m excited to hear your thoughts!

Cheat Sheet: Getting Started with the Dataset

  • Libraries: pandas for data handling, numpy for numerical ops, matplotlib and seaborn for visualization, sklearn for machine learning, imblearn for handling imbalance, and warnings to suppress noise.

  • df = pd.read_csv(...): Loads the dataset into a DataFrame.

  • df.head(): Displays the first 5 rows to preview the data.

Did You Know?

Autism datasets often include behavioral screening questions based on tools like the Autism Spectrum Quotient (AQ), developed in 2001 by Simon Baron-Cohen—our dataset likely draws from similar methodologies to predict ASD!

Pro Tip:

Our autism prediction journey begins with a first look at the data—what secrets will we uncover? Let’s dive in!

What’s Happening in This Code?

Let’s break it down like we’re opening a new chapter:

  • Imports: Loads libraries for data manipulation (pandas, numpy), visualization (matplotlib, seaborn), machine learning (sklearn), handling class imbalance (imblearn), and silencing warnings (warnings.filterwarnings('ignore')).

  • Loading the Dataset: df = pd.read_csv('/kaggle/input/autismprediction/train.csv') reads the autism dataset into a DataFrame.

Loading and Previewing the Autism Dataset

Here’s the code we’re working with:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import sklearn

import imblearn

import warnings


from imblearn.over_sampling import RandomOverSampler


warnings.filterwarnings('ignore')


df = pd.read_csv('/kaggle/input/autismprediction/train.csv')

df.head()

The Output:




First 5 Rows of the Dataset

Take a look at the uploaded images (split across two images due to column count)! The output of df.head() shows the first 5 rows of our dataset, with 22 columns. Let’s explain each column to understand what we’re working with:

  1. ID: Unique identifier for each individual (e.g., 1, 2, 3, 4, 5). This is likely just for tracking and won’t be used as a feature.

  2. A1_Score: Binary score (0 or 1) for the first question in an autism screening test, likely based on a behavioral trait (e.g., “I often notice small sounds when others do not”). 1 might indicate a trait associated with ASD.

    • Example: 1, 1, 1, 1, 1 (all 5 individuals scored 1).

  3. A2_Score: Binary score for the second question (e.g., “I tend to focus on details rather than the overall picture”).

    • Example: 1, 1, 0, 1, 0 (mixed responses).

  4. A3_Score: Binary score for the third question (e.g., “I find it easy to do more than one thing at once”).

    • Example: 0, 0, 0, 0, 0 (all scored 0).

  5. A4_Score: Binary score for the fourth question.

    • Example: 0, 1, 0, 1, 1.

  6. A5_Score: Binary score for the fifth question.

    • Example: 1, 1, 1, 1, 0.

  7. A6_Score: Binary score for the sixth question.

    • Example: 1, 1, 0, 1, 0.

  8. A7_Score: Binary score for the seventh question.

    • Example: 1, 1, 0, 1, 0.

  9. A8_Score: Binary score for the eighth question.

    • Example: 1, 0, 1, 0, 1.

  10. A9_Score: Binary score for the ninth question.

    • Example: 0, 1, 0, 1, 0.

  11. A10_Score: Binary score for the tenth question.

    • Example: 1, 1, 1, 1, 1 (all scored 1).

  12. age: Age of the individual in years (numerical).

    • Example: 27.0, 24.0, 27.0, 35.0, 36.0 (range of ages in their 20s and 30s).

  13. gender: Gender of the individual (categorical: ‘f’ for female, ‘m’ for male).

    • Example: m, f, m, f, m (mixed genders).

  14. ethnicity: Ethnicity of the individual (categorical).

    • Example: White-European, South Asian, White-European, White-European, Middle Eastern.

  15. jaundice: Whether the individual was born with jaundice (binary: ‘yes’ or ‘no’).

    • Example: no, no, yes, no, no.

  16. austim (the dataset’s misspelling of ‘autism’): Whether an immediate family member has been diagnosed with autism (binary: ‘yes’ or ‘no’).

    • Example: no, no, yes, yes, yes (family history of autism in some cases).

  17. contry_of_res (note the dataset’s spelling): Country of residence (categorical).

    • Example: United States, India, United States, New Zealand, Jordan.

  18. used_app_before: Whether the individual has used the screening app before (binary: ‘yes’ or ‘no’).

    • Example: no, no, no, no, no (none used the app before).

  19. result: Total screening score (numerical); for the rows shown it equals the sum of the A1 to A10 scores.

    • Example: 7.0, 8.0, 4.0, 8.0, 4.0 (scores range from 4 to 8).

  20. age_desc: Description of age group (categorical, likely redundant with age).

    • Example: '18 and more' (all rows show this, suggesting all individuals are 18+).

  21. relation: Who completed the screening test (categorical).

    • Example: Self, Self, Self, Self, Self (all self-reported).

  22. Class/ASD: Target variable indicating autism diagnosis (binary: 1 for ASD, 0 for no ASD).

    • Example: 1, 1, 0, 1, 0 (mixed outcomes).

Insight: The dataset captures behavioral traits (A1-A10 scores), demographic info (age, gender, ethnicity), medical history (jaundice, family autism), and other context (country, app usage). The target Class/ASD is what we’ll predict, using features like the screening scores (result is their sum), age, and family history. Most features are categorical or binary, so we’ll need encoding, and result might correlate strongly with Class/ASD since it’s derived from the screening questions. Let’s explore further in the next step!
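Quick optional check: before any preprocessing, it helps to confirm each column’s data type and how many distinct values it holds. Here is a minimal sketch (assuming the same df loaded above; nothing here changes the data):

# Optional sanity checks on the loaded DataFrame (assumes the df from above)
print(df.shape)      # number of rows and columns
print(df.dtypes)     # 'object' columns will need encoding later
print(df.nunique())  # distinct values per column (binary vs. multi-category)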

Next Steps:

We’ve taken our first sip of the data—rich with potential! Next, we’ll preprocess by encoding categorical variables, check for correlations, and handle any imbalances in Class/ASD. Let’s keep this meaningful journey flowing. What column intrigued you most, viewers? Drop your thoughts 💭




Polishing the Data: Cleaning and Preprocessing Our Autism Dataset!

After taking our first peek at the autism dataset, we’re now diving into data cleaning and preprocessing to ensure our model has the best foundation for predicting autism spectrum disorder (ASD). The next code block visualizes the distribution of ethnicities in our dataset using a bar plot, revealing insights we’ll use to handle missing or unusual values—like the mysterious ‘?’ we’ll encode as ‘others’. 

Let’s raise our spirits to refine this dataset for a transformative impact—cheers to clean data! 🌟🚀

Why Data Cleaning and Preprocessing Matter

A clean dataset is the key to a reliable autism prediction model. By understanding ethnicity distribution and handling anomalies like ‘?’, we ensure our model captures true patterns, helping clinicians in the NHS or beyond make accurate early diagnoses to support individuals with ASD.

What to Expect in This Step

In this step, we’ll:

  • Visualize the count of each ethnicity using a bar plot.

  • Explain the features in the output, including the ‘?’ value we’ll encode as ‘others’.

  • Set the stage for further preprocessing, like encoding and handling imbalances.

Get ready to polish our data into a shining tool for prediction—our journey is gaining momentum!

Fun Fact

Ethnicity in Autism Studies!

Did you know reported autism prevalence varies across ethnic groups, often reflecting differences in access to screening and diagnosis? That makes our ethnicity analysis a vital step in building a fair model!

Real-Life Example

Imagine you’re a researcher in Chicago, analyzing autism trends. A bar plot showing ethnicity distribution helps you adjust your model to account for diverse backgrounds, ensuring accurate predictions across all communities!

Quiz Time!

Let’s test your preprocessing skills, students!

  1. Why might we encode ‘?’ as ‘others’?
    a) To delete the data
    b) To treat missing or unknown ethnicity as a separate category
    c) To increase model accuracy
     

  2. What does a bar plot show here?
    a) Correlation between features
    b) Count of each ethnicity category
    c) Average age
     

Drop your answers in the comments—I’m excited to hear your thoughts!

Cheat Sheet: 

Data Visualization and Cleaning

  • df.ethnicity.value_counts(): Counts occurrences of each unique value in the ethnicity column.

  • .plot(kind='bar'): Creates a bar plot to visualize the counts.

  • Tip: Use plt.title() and plt.xlabel() to add context to your plot for clarity (see the quick sketch right after this list).
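Here is that tip in action, a minimal sketch (optional styling only, assuming the df loaded earlier) of the same ethnicity bar plot we build below, just with a title and axis labels added:

import matplotlib.pyplot as plt

# Same ethnicity count bar plot, with optional title and axis labels
df['ethnicity'].value_counts().plot(kind='bar')
plt.title('Ethnicity Distribution in the Autism Dataset')
plt.xlabel('Ethnicity')
plt.ylabel('Count')
plt.tight_layout()
plt.show()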

Did You Know?

The concept of encoding missing data as a category (like ‘others’) became popular in the 2000s with the rise of big data, ensuring models handle real-world imperfections—perfect for our autism dataset!

Pro Tip:

Our autism data is full of stories—let’s clean it up, starting with ethnicity. What does the bar plot reveal?

What’s Happening in This Code?

Let’s break it down like we’re sorting a diverse wine collection:

  • Value Counts: df.ethnicity.value_counts() calculates the number of occurrences for each unique value in the ethnicity column.

  • Bar Plot: .plot(kind='bar') generates a bar chart, where the height of each bar represents the count of individuals in that ethnicity category.

Visualizing Ethnicity Distribution

Here’s the code we’re working with:


df.ethnicity.value_counts().plot(kind='bar')


The Output: 


Ethnicity Distribution Bar Plot

 The bar plot shows the count of each ethnicity category in our dataset, with the following features:

  • X-Axis (Ethnicity Categories): Lists the unique ethnicities present in the dataset. Based on the plot (and confirmed by the value counts we compute in the next step), these include:

    • White-European: The tallest bar, with a count around 257, indicating the largest group.

    • ?: The second-tallest bar, with a count well over 200, representing missing or unknown ethnicity values.

    • Middle Eastern: A significant bar, with a count around 97.

    • Asian: A moderate bar, with a count around 67.

    • Black: A smaller bar, with a count around 47.

    • South Asian: A smaller bar, with a count around 34.

    • Pasifika: A minor bar, with a count around 32.

    • Latino: A small bar, with a count around 17.

    • Hispanic: A very small bar, with a count around 9.

    • Turkish: A minimal bar, with a count around 5.

    • others: A tiny bar, with a count around 3, likely a lowercase variant from manual entry.

  • Y-Axis (Count): The height of each bar, ranging from 0 to approximately 257, showing the number of individuals in each ethnicity category.

  • Key Feature - ‘?’: The presence of ‘?’ indicates missing or unspecified ethnicity data. This bar is substantial (second only to White-European), showing that a large share of records lack clear ethnicity information, which we’ll address by encoding it as ‘Others’ to maintain data integrity.

Insight: The dataset is heavily skewed toward ‘White-European’, with a large block of missing (‘?’) values and smaller representations of the remaining groups. The ‘?’ category highlights missing data, which we’ll handle by grouping it under ‘Others’ to avoid losing information. Case variants like the lowercase ‘others’ will need standardization later, but this plot gives us a clear starting point for preprocessing. The diversity suggests we’ll need to ensure our model generalizes across ethnicities, possibly using encoding techniques next.

Next Steps:

We’ve taken a flavorful look at ethnicity distribution—time to clean it up! Next, we’ll encode the ‘?’ as ‘Others’, handle case inconsistencies, and proceed with encoding other categorical variables like gender and contry_of_res, while checking for imbalances in Class/ASD. Let’s keep this compassionate journey flowing. What did you notice in this ethnicity plot, viewers? Drop your thoughts in the comments, and let’s make this project a game-changer together! 🌟🚀



Refining Our Blend: Handling Missing Ethnicity Data in Our Autism Project!

After visualizing the ethnicity distribution and spotting those mysterious ‘?’ values, we’re now cleaning our dataset by encoding them as ‘Others’. This code block replaces the ‘?’ with ‘Others’, ensuring no data is lost, and then checks the updated ethnicity counts—perfecting our foundation for predicting autism spectrum disorder (ASD). Let’s raise our spirits to a cleaner dataset—cheers to precision and compassion! 🌟🚀

Why Handling Missing Data Matters

Replacing ‘?’ with ‘Others’ ensures our model doesn’t miss out on valuable information, especially for autism prediction where diverse backgrounds matter. For a healthcare provider in Bangkok, Thailand, this clean data could mean more accurate early diagnoses, supporting families with timely interventions.

What to Expect in This Step

In this step, we’ll:

  • Replace the ‘?’ values in the ethnicity column with ‘others’.

  • Verify the updated counts to confirm the change.

  • Prepare for further preprocessing, like standardizing categories and encoding.

Get ready to polish our data even further—our journey is gaining clarity!

Fun Fact: 

Missing Data in Research!

Did you know missing data is one of the most common problems in medical studies? Encoding it thoughtfully, as we’re doing, is a standard practice to maintain model integrity—our project is following best practices for autism prediction!

Real-Life Example

Imagine you’re a data analyst, preparing an autism screening tool. By replacing ‘?’ with ‘others’, you ensure the model includes all patients, improving its reliability for diverse communities during screenings!

Quiz Time!

Let’s test your cleaning skills, students!

  1. Why do we replace ‘?’ with ‘others’?
    a) To delete the rows
    b) To handle missing data as a distinct category
    c) To increase the dataset size
     

  2. What does value_counts() show after replacement?
    a) The average of each category
    b) The count of each unique ethnicity
    c) The correlation between features
     

Drop your answers in the comments—I’m excited to hear your thoughts!

Cheat Sheet: Handling Missing Data

  • df['column'].replace({'old': 'new'}): Replaces specific values in a column.

  • df.ethnicity.value_counts(): Displays the count of each unique value post-replacement.

  • Tip: Use .isna().sum() later to check for other missing values if needed (a quick sketch follows right after this list).
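As promised in the tip above, here is a tiny sketch (assuming the same df) that counts both true NaNs and ‘?’ placeholders in every column:

# Count true missing values and '?' placeholders per column (assumes the df from above)
print(df.isna().sum())    # NaN counts per column
print((df == '?').sum())  # '?' placeholder counts per column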

Did You Know?

The practice of encoding missing data as a category dates back to early statistical modeling in the 1980s—now it’s a cornerstone of machine learning, ensuring our autism model is robust!

Pro Tip

Those ‘?’ values won’t stop us! Let’s encode them as ‘others’ and see the new ethnicity counts—clean data ahead!

What’s Happening in This Code?

Let’s break it down like we’re refining a vintage:

  • Replacement: df['ethnicity'].replace({'?': 'Others'}) updates the ethnicity column, replacing all instances of ‘?’ with ‘Others’. The result is assigned back to df.ethnicity to overwrite the original column.

  • Value Counts: df.ethnicity.value_counts() displays the updated count of each unique ethnicity category.

Handling Missing Ethnicity Data

Here’s the code we’re working with:

df.ethnicity = df['ethnicity'].replace({'?':'Others'})

df.ethnicity.value_counts()


The Output: 

ethnicity

White-European     257

Others             232

Middle Eastern      97

Asian               67

Black               47

South Asian         34

Pasifika            32

Latino              17

Hispanic             9

Turkish              5

others               3

Name: count, dtype: int64


Updated Ethnicity Counts

Explanation of Features:

  • White-European: 257 individuals, the largest group, consistent with the earlier bar plot’s tallest bar.

  • Others: 232 individuals, now containing all of the records that previously showed ‘?’, making it the second-largest group.

  • Middle Eastern: 97 individuals, a significant but smaller group.

  • Asian: 67 individuals, showing moderate representation.

  • Black: 47 individuals, a smaller but notable category.

  • South Asian: 34 individuals, reflecting regional diversity (relevant to Lahore’s context!).

  • Pasifika: 32 individuals, a minor group (Pasifika refers to Pacific Islander peoples).

  • Latino: 17 individuals, a small representation.

  • Hispanic: 9 individuals, an even smaller group.

  • Turkish: 5 individuals, a minimal category.

  • others: 3 individuals, possibly a lowercase variant or typo from manual entry, indicating a need for standardization.

Insight: Encoding ‘?’ as ‘Others’ has successfully captured the missing or unspecified ethnicities as a single category, bringing the ‘Others’ count to 232. The total (257 + 232 + 97 + 67 + 47 + 34 + 32 + 17 + 9 + 5 + 3 = 800) suggests our dataset has around 800 rows, aligning with the bar plot’s scale. The presence of ‘others’ (lowercase) alongside ‘Others’ (uppercase) highlights a case sensitivity issue—we’ll standardize this later. The distribution remains skewed toward ‘White-European’ and ‘Others’, something to keep in mind as we encode features and, separately, handle the imbalance in Class/ASD.
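Since the insight above flags the lowercase ‘others’ vs. ‘Others’ issue, here is a minimal sketch of the standardization we plan to apply later, simply mapping the lowercase variant onto the capitalized one (assuming the same df):

# Merge the lowercase 'others' variant into 'Others' (planned standardization step)
df['ethnicity'] = df['ethnicity'].replace({'others': 'Others'})
print(df['ethnicity'].value_counts())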

Next Steps:

We’ve cleaned up those ‘?’ values—smooth progress! Next, we’ll standardize categories (e.g., ‘others’ to ‘Others’), encode all categorical variables, and check for imbalances in Class/ASD to prepare for modeling. Let's keep this compassionate journey flowing. 

What do you think of this ethnicity update, viewers? Drop your thoughts in the comments.



Balancing the Scales: Addressing Imbalance in Our Autism Dataset!

After cleaning up the ‘?’ values in our ethnicity column, we’re now turning our attention to the heart of our prediction task—checking the balance of our target column, Class/ASD. This code block reveals the distribution of ASD diagnoses (0 for no ASD, 1 for ASD) and highlights an imbalance we’ll tackle with oversampling. Let's raise our spirits to create a fair and effective model—cheers to equitable predictions! 🌟🚀

Why Balancing the Target Column Matters

An imbalanced target like Class/ASD can bias our model toward the majority class (no ASD), missing critical ASD cases. For healthcare professionals in Uzbekistan, a balanced model ensures early detection for all, supporting families with timely care and resources.

What to Expect in This Step

In this step, we’ll:

  • Check the count of each class in Class/ASD to assess imbalance.

  • Confirm the need for balancing due to the skewed distribution.

  • Prepare to use oversampling to even out the classes for better modeling.

Get ready to balance our dataset—our journey is about to get even more impactful!

Fun Fact:

Imbalance in Medical Data!

Did you know that imbalanced datasets are common in medical research, with rare conditions like ASD often underrepresented? Oversampling techniques, pioneered in the 1990s, help us address this—our project is following that trailblazing path!

Real-Life Example

Imagine you’re a clinician in Rio de Janeiro using our model to screen children. An imbalanced model might overlook ASD cases (the minority class), but balancing it ensures you catch those 161 critical diagnoses, changing lives with early support!

Quiz Time!

Let’s test your balancing skills, students!

  1. What does an imbalanced target mean?
    a) The dataset has equal classes
    b) One class has far more samples than the other
    c) All features are balanced
     

  2. Why use oversampling for imbalance?
    a) To delete the majority class
    b) To increase the minority class samples
    c) To reduce dataset size
     

Drop your answers in the comments—I’m excited to hear your thoughts!

Cheat Sheet:

Checking and Balancing Imbalance

  • df['column'].value_counts(): Counts occurrences of each unique value in the target column.

  • Imbalance: Look for a significant difference in class counts (e.g., 639 vs. 161).

  • Oversampling: Use resample from sklearn.utils (the route we take in the next section) or RandomOverSampler from imblearn to duplicate minority class samples (see the sketch right after this list).
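For reference, here is what the imblearn route from the cheat sheet could look like. This is a hedged sketch only, since in this project we balance the data with sklearn’s resample instead (shown in the next section); the names X and y are illustrative stand-ins for the feature columns and the Class/ASD target:

from imblearn.over_sampling import RandomOverSampler

# Illustrative only: random oversampling of the minority class with imblearn
X = df.drop(columns=['Class/ASD'])   # assumed feature frame for this sketch
y = df['Class/ASD']                  # target column

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

print(y_resampled.value_counts())    # both classes end up with equal counts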

Did You Know?

The concept of handling class imbalance with oversampling was advanced in the early 2000s with the rise of imbalanced learning—now it’s a key tool for fair autism prediction models!

Pro Tip

Is our autism prediction fair? Let’s check the Class/ASD balance and fix it with oversampling!

What’s Happening in This Code?

Let’s break it down like we’re balancing a recipe:

  • Value Counts: df['Class/ASD'].value_counts() calculates the number of occurrences for each unique value in the Class/ASD column, where 0 indicates no ASD and 1 indicates ASD.

Checking Target Column Balance

Here’s the code we’re working with:

# Checking the target column whether it is properly balanced or not

df['Class/ASD'].value_counts()

The Output

Class/ASD

0    639

1    161

Name: count, dtype: int64


Target Column Distribution

Explanation of Features:

  • 0 (No ASD): 639 individuals, the majority class, representing approximately 79.9% of the dataset (639 / 800 ≈ 0.799).

  • 1 (ASD): 161 individuals, the minority class, representing about 20.1% of the dataset (161 / 800 ≈ 0.201).

Insight: The target column Class/ASD is indeed imbalanced, with a ratio of approximately 4:1 (639:161). This skew means our model might over-predict the majority class (no ASD) unless we balance it. The total count (639 + 161 = 800) matches our earlier ethnicity sum, confirming dataset consistency. To ensure fair prediction of ASD cases, we’ll use oversampling to increase the minority class (1) samples, making both classes equal for training.
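A quick optional check of those percentages (same df as above):

# Express the class balance as proportions instead of raw counts
print(df['Class/ASD'].value_counts(normalize=True))  # roughly 0.80 vs 0.20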

Next Steps:

We’ve spotted the imbalance—time to fix it! Next, we’ll use sklearn’s resample utility to oversample the ASD class, balancing our dataset for better modeling. Let’s keep this compassionate journey flowing. What do you think of this imbalance, viewers? Drop your thoughts in the comments, and let’s make this project a game-changer together! 🌟🚀


Leveling the Playing Field: Oversampling Our Autism Dataset!

After identifying the imbalance in our target column Class/ASD (639 no ASD vs. 161 ASD), we’re now taking action to ensure fairness by oversampling the minority class. This code block uses resample to upsample the ASD class to match the no ASD class, creating a balanced dataset for more accurate autism predictions. Whether you’re joining me from Buenos Aires’s vibrant streets or coding with purpose from afar, let’s raise our spirits to build an equitable model—cheers to inclusive AI! 🌟🚀

Why Oversampling Matters for Autism Prediction

Balancing Class/ASD ensures our model doesn’t overlook the minority ASD class, which is critical for early detection. For healthcare providers in Lahore, this balanced dataset could mean identifying more ASD cases, offering timely support to families and improving outcomes.

What to Expect in This Step

In this step, we’ll:

  • Split the dataset into majority (no ASD) and minority (ASD) classes.

  • Upsample the minority class to match the majority class size using resample.

  • Combine the upsampled data and verify the new balance.

Get ready to level up our dataset—our journey is about to get even more impactful!

Fun Fact: 

Oversampling’s Origins!

Did you know oversampling techniques were developed in the 1990s to handle imbalanced datasets in fraud detection? Now, we’re adapting it for autism prediction—proof of AI’s versatility!

Real-Life Example

Imagine you’re a pediatric specialist, using our model to screen children. Oversampling ensures the model doesn’t miss ASD cases, helping you provide equal attention to the 161 originally underrepresented individuals—changing lives with every prediction!

Quiz Time!

Let’s test your sampling skills, students!

  1. What does oversampling do?
    a) Deletes the majority class
    b) Increases the minority class samples
    c) Reduces the dataset size
     

  2. Why match the minority to the majority sample size?
    a) To confuse the model
    b) To ensure fair representation for both classes
    c) To delete data
     

Drop your answers in the comments

Cheat Sheet: 

Oversampling with resample

  • df[condition]: Filters rows based on a condition (e.g., Class/ASD == 0).

  • resample(df, replace=True, n_samples=..., random_state=...): Upsamples the minority class with replacement to a specified size.

  • pd.concat([df1, df2]): Combines DataFrames vertically.

Did You Know?

The resample function from sklearn.utils builds on random sampling techniques from statistics, refined for machine learning in the 2000s—perfect for our autism balancing act!

Pro Tip:

Our ASD predictions need balance—let’s oversample the minority class and check the new counts!

What’s Happening in This Code?

Let’s break it down like we’re balancing a fine wine blend:

  • Imports: from sklearn.utils import resample brings in the resampling tool.

  • Split DataFrames:

    • df_majority = df[(df['Class/ASD'] == 0)]: Creates a DataFrame with 639 rows where Class/ASD is 0 (no ASD).

    • df_minority = df[(df['Class/ASD'] == 1)]: Creates a DataFrame with 161 rows where Class/ASD is 1 (ASD).

  • Upsampling:

    • resample(df_minority, replace=True, n_samples=639, random_state=42): Upsamples the minority class by duplicating samples with replacement to match the majority class size (639), using a fixed random_state for reproducibility.

  • Combine: pd.concat([df_minority_upsampled, df_majority]) merges the upsampled minority (639 rows) with the original majority (639 rows) into a new df.

  • Verify Balance: df['Class/ASD'].value_counts() checks the updated distribution.

Oversampling the Target Column

Here’s the code we’re working with:


#the data in target column is imbalanced. So we will now oversample it

#Just apply oversampling in target column


from sklearn.utils import resample

#create two different dataframe of majority and minority class 

df_majority = df[(df['Class/ASD']==0)] 

df_minority = df[(df['Class/ASD']==1)] 

# upsample minority class

df_minority_upsampled = resample(df_minority,
                                 replace=True,     # sample with replacement
                                 n_samples=639,    # to match majority class
                                 random_state=42)  # reproducible results

# Combine majority class with upsampled minority class
df = pd.concat([df_minority_upsampled, df_majority])


df['Class/ASD'].value_counts()

The Output: Balanced Target Column

Here’s the output:

Class/ASD

1    639

0    639

Name: count, dtype: int64

Explanation of Features:

  • 0 (No ASD): 639 individuals, the original majority class, unchanged.

  • 1 (ASD): 639 individuals, now upsampled from 161 to match the majority class through duplication.

Insight: The target column Class/ASD is now perfectly balanced, with both classes (0 and 1) having 639 samples each, totaling 1278 rows (639 + 639). This 1:1 ratio eliminates the previous 4:1 imbalance, ensuring our model will give equal weight to ASD and no ASD predictions. The oversampling with replace=True means some original minority samples were duplicated, which is a common technique to avoid bias. Next, we’ll ensure this balance holds across all preprocessing steps as we move toward modeling!
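One optional follow-up worth sketching (not part of the original code): after the concat, all upsampled ASD rows sit on top of the no-ASD rows, so it is common to shuffle the DataFrame and reset the index before modeling:

# Optional: shuffle the balanced DataFrame so the two classes are interleaved
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(df['Class/ASD'].value_counts())  # still 639 vs 639, just shuffled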

Next Steps:

We’ve balanced our target—perfect harmony! Next, we’ll encode categorical variables (e.g., gender, ethnicity) and prepare our features for modeling, ensuring a fair and accurate autism prediction system.
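To make that next step concrete, here is a minimal, hedged preview of one way it could be done, one-hot encoding the object-dtype columns with pandas get_dummies (the exact encoder we use in Part 2 may differ):

# Illustrative preview: one-hot encode the object-dtype (categorical) columns
categorical_cols = df.select_dtypes(include='object').columns
df_encoded = pd.get_dummies(df, columns=list(categorical_cols), drop_first=True)
print(df_encoded.shape)  # same number of rows, more (numeric) columns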



A Heartfelt Milestone: Wrapping Up Part 1 of Our Autism Prediction Project!

What an incredible start we’ve made together, my amazing viewers and students! We’ve just wrapped up Part 1 of our "Autism Prediction Classification Project" and I’m filled with pride over our progress. 

We kicked off with a passionate dive into our autism dataset, exploring its 22 features—from behavioral scores (A1-A10) to demographics like age, gender, and ethnicity. We visualized the ethnicity distribution, cleaned up those pesky ‘?’ values by encoding them as ‘others’, and tackled the imbalance in our target column Class/ASD, balancing it perfectly from 639:161 to 639:639 using oversampling. Every step has brought us closer to building a model that can support early autism detection, making a real difference for families and clinicians.

The Best Is Yet to Come: 

Get Ready for Part 2!

Hold onto your excitement because Part 2 is about to take our project to the next level! On our website, www.theprogrammarkid004.online, we’ll:

  • Encode and Prepare: Transform categorical features like gender and ethnicity for modeling.

  • Train Models: Build and compare classifiers like Logistic Regression, Random Forest, and XGBoost to predict ASD.

  • Evaluate Impact: Assess our predictions with metrics like accuracy, precision, and recall to ensure fairness and effectiveness (a tiny preview sketch follows below).
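And to whet your appetite, here is a rough, illustrative sketch of the kind of pipeline those bullets describe. Nothing here is the final Part 2 code, and the quick get_dummies encoding is an assumption just to keep the example self-contained:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Quick illustrative encoding (Part 2 will handle this step properly)
X = pd.get_dummies(df.drop(columns=['Class/ASD', 'ID']), drop_first=True)
y = df['Class/ASD']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Baseline classifier and the metrics we care about
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print('Accuracy :', accuracy_score(y_test, pred))
print('Precision:', precision_score(y_test, pred))
print('Recall   :', recall_score(y_test, pred))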

Make sure to subscribe at www.youtube.com/@cognitutorai, hit that notification bell, and join our community of compassionate coders.

Wherever in the world you’re dreaming of making a difference, let’s keep this meaningful journey going. What was your favorite moment—balancing the dataset or cleaning the ethnicity column? Drop it in the comments, and tell me what you’re most excited for in Part 2—I can’t wait to build this impactful model with you! 🌟🚀