Rainfall Prediction Using AI (Part 1)
Full Machine Learning (End-To-End) Project Blog-1
Hello, my incredible viewers and students! I’m beyond thrilled to welcome you to a brand-new project on our blog:
"Rainfall Prediction Using AI"
Whether you’re joining me from the vibrant streets of London, Toronto, Los Angeles, Paris, or anywhere else, get ready for an exciting journey into the world of AI and weather forecasting.
In Part 1 of this series, we’ll dive into the fascinating challenge of predicting rainfall using machine learning, blending data science magic with real-world impact. From exploring weather data to building our first predictive models, this project will spark your curiosity and inspire you to harness AI for environmental good.
Let’s make it rain… predictions, that is! ☔🚀
Why Does Rainfall Prediction Matter?
Rainfall prediction isn’t just about deciding whether to carry an umbrella; it’s a game-changer for farmers, city planners, and disaster management teams. Accurate forecasts can help farmers optimize irrigation, save water, and boost crop yields, while cities can prepare for floods or droughts. By using AI, we’re not only learning cutting-edge tech but also contributing to solutions that impact lives globally. How cool is that?
What to Expect in Part 1
In this first part, we’ll kick things off by:
Loading and examining a weather dataset to find trends in precipitation.
Visualizing key trends with stunning plots to see what drives rain.
Preparing our data for machine learning, setting the stage for predictive models.
Expect hands-on coding, insightful discoveries, and plenty of interactive elements to keep you engaged. By the end of Part 1, you’ll be ready to dive into model-building in Part 2—trust me, it’s going to be a blast!
Fun Fact: AI and Weather Go Way Back!
Did you know that AI has been used in weather forecasting since the 1990s? Early neural networks helped predict thunderstorms, and today, AI models like those from DeepMind can forecast rain up to 2 hours ahead with incredible accuracy. We’re joining a legacy of innovation with this project!
Real-Life Example
Imagine you’re a farmer in a rural village near Houston, relying on seasonal rains to grow your crops. With an AI model predicting rainfall, you’d have a much better idea of when to plant or irrigate, saving resources and improving your odds of a good harvest. That’s the kind of impact we’re aiming for with this project: empowering real people with data-driven decisions.
Quiz Time!
Let’s get your brain buzzing, students!
Why is rainfall prediction important for farmers?
a) To plan vacations
b) To optimize irrigation and crop planning
c) To predict stock prices
What’s one real-world benefit of AI in weather forecasting?
a) It makes the weather sunny
b) It helps predict floods and droughts
c) It controls the rain
Drop your answers in the comments—I can’t wait to see how you do!
Cheat Sheet:
What We’ll Need to Get Started
Python Libraries: pandas for data handling, matplotlib and seaborn for visualization, scikit-learn for machine learning.
Dataset: A weather dataset with features like temperature, humidity, pressure, and rainfall (we’ll assume a dataset like the “Australia Rainfall” dataset from Kaggle).
Mindset: Curiosity and a love for learning—check, you’ve got that already!
Did You Know?
The monsoon season in Thailand, which typically runs from July to September, accounts for about 60% of the country’s annual rainfall! Predicting these rains accurately can make a huge difference for agriculture and flood preparedness. Let’s channel that energy into our project!
Let’s Get Started!
We’re about to embark on a data science journey that combines AI with the power of weather prediction. I’m so excited to explore this with you, whether you’re coding along in London or tuning in from across the globe. Stay tuned for our first code block, where we’ll load and explore our dataset—there’s so much to discover! What do you think we’ll find in the data? Share your predictions in the comments, and let’s make this project a splash hit together! 🌧️🚀
Kicking Off
Loading and First Look at Our Rainfall Data!
So we’re starting with our very first code block. This step is all about loading our dataset and taking a sneak peek at what we’re working with. Let’s break down the code, explore the output, and sprinkle in some fun facts, quizzes, and more to keep the learning engaging as we embark on this rainy adventure! ☔🚀
What’s Happening in This Code? Let's analyze it as if we were unpacking a treasure chest.
Importing Essential Libraries:
import pandas as pd: Brings in pandas, our go-to library for handling datasets as DataFrames—think of it as a super-powered Excel sheet.
import numpy as np: Imports numpy for numerical operations, like calculating averages or scaling data.
import matplotlib.pyplot as plt: Our plotting library for creating charts and visualizations.
import seaborn as sns: A stylish plotting library that works on top of matplotlib to make our visuals pop.
import warnings: Allows us to manage warning messages from Python.
Suppressing Warnings:
warnings.filterwarnings('ignore'): This tells Python to ignore non-critical warnings (like deprecated functions). It keeps our output clean, but use it wisely—always check what you’re ignoring in a real project!
Loading the Dataset:
df = pd.read_csv('/kaggle/input/rainfall/Rainfall.csv'): Loads our rainfall dataset from a CSV file into a DataFrame called df. The path points to Rainfall.csv in a Kaggle notebook’s input folder; adjust it if your copy of the file lives somewhere else.
df.head(): Displays the first 5 rows of the DataFrame to give us a quick glimpse of the data’s structure and content.
Why Are We Doing This?
This step is our first interaction with the data; it’s like opening a book to read the first page. We’re setting up our tools (libraries) and loading the dataset to understand what features (columns) we have, what the data looks like, and whether there are any immediate issues (like missing values). This sets the foundation for all the exciting AI work to come!
Here’s the code we’re working with to start our project:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('/kaggle/input/rainfall/Rainfall.csv')
df.head()
The Output
A First Look at Our Rainfall Dataset
The df.head() output shows the first five rows of our data. Here’s what we see:
Columns: The dataset has several columns, including:
Date: The date of the observation (e.g., “2007-01-01”).
Location: The place where the data was recorded (e.g., “Cobar”).
MinTemp: Minimum temperature in Celsius (e.g., 17.9).
MaxTemp: Maximum temperature in Celsius (e.g., 35.2).
Rainfall: Amount of rainfall in mm (e.g., 0.0).
Evaporation, Sunshine, WindGustDir, WindGustSpeed, WindDir9am, WindDir3pm, WindSpeed9am, WindSpeed3pm, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm, Temp9am, Temp3pm: Various weather metrics like humidity, wind speed, and pressure at different times of day.
RainToday: Whether it rained today (e.g., “No”).
RainTomorrow: Whether it will rain tomorrow (e.g., “No”)—this is likely our target variable for prediction!
Rows: The first 5 rows show weather data for different days and locations, with a mix of numerical (e.g., temperatures) and categorical (e.g., RainToday) values.
Insight: This dataset is packed with features that could help us predict rainfall. The target variable RainTomorrow suggests we’re dealing with a binary classification problem (rain or no rain). Some columns might have missing values (we’ll check later), and categorical columns like WindDir9am will need encoding for our AI models. We’re off to a great start!
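The two checks mentioned in the insight above (missing values and categorical columns) can be sketched in a few lines. This is a minimal illustration on a tiny made-up DataFrame, not the real Rainfall.csv; the column names are borrowed from the list above and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV; values are made up for illustration.
sample = pd.DataFrame({
    "MinTemp": [17.9, 18.4, np.nan, 19.1],
    "Humidity9am": [48.0, 37.0, 69.0, np.nan],
    "WindDir9am": ["ENE", "SSE", None, "NNE"],
    "RainTomorrow": ["No", "No", "Yes", "No"],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Non-numeric (categorical) columns, which need encoding before modeling
categorical_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(categorical_columns)
```

On the real dataset you would run the same two calls on df instead of sample.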
Fun Fact: Rainfall Data Saves Lives!
Did you know that rainfall data is crucial for flood prediction? In 2022, Pakistan faced devastating floods during the monsoon season, affecting millions. Accurate rainfall predictions using AI could have helped authorities prepare better—exactly the kind of impact we’re aiming for with this project!
Real-Life Example
Imagine you’re a meteorologist in Virginia working with the National Weather Service. Using a dataset like ours, you could predict rainfall for the upcoming season, helping farmers plan their sowing and cities prepare for potential flooding. That’s the power of the data we’re working with!
Quiz Time!
Let’s test your observation skills, students!
What type of problem are we solving with this dataset, based on the RainTomorrow column?
a) Regression
b) Classification
c) Clustering
Which feature might be most directly related to rainfall?
a) Sunshine
b) Location
c) WindGustSpeed
Please leave your responses in the comments section; I'd like to read what you have to say!
Cheat Sheet: Getting Started with Pandas
pd.read_csv('file.csv'): Loads a CSV file into a DataFrame.
df.head(): Shows the first five rows of the DataFrame (use df.head(n) for n rows).
df.columns: Lists all column names.
df.shape: Shows the dimensions of the DataFrame (rows, columns).
warnings.filterwarnings('ignore'): Suppresses warnings (use cautiously).
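Since the cheat sheet says to suppress warnings cautiously, here is one possible way to be more selective: ignore only one warning category and let everything else through. This is a sketch using Python's standard warnings module; old_helper is a hypothetical function invented for the demo.

```python
import warnings

def old_helper():
    # Hypothetical function that emits a deprecation warning
    warnings.warn("old_helper is deprecated", DeprecationWarning)
    return 1

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start by recording every warning
    # Silence only DeprecationWarning; other categories still get through
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    old_helper()
    warnings.warn("check your data", UserWarning)

messages = [str(w.message) for w in caught]
print(messages)
```

Only the UserWarning is recorded, so a genuine data problem would still be visible while the deprecation noise is hidden.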
Did You Know?
The dataset we’re using might be inspired by the “WeatherAUS” dataset, which contains over 145,000 daily weather observations from Australia, collected from 2007 to 2017! It’s a goldmine for weather prediction projects, covering diverse climates from tropical to arid regions.
Pro Tip:
Our rainfall dataset is loaded with features like temperature, humidity, and wind speed, which is just what we need for predicting whether it’ll rain tomorrow. We’ll explore the data further in the next steps to spot patterns and prepare it for modeling.
Next Steps:
We’ve successfully loaded our dataset and taken a first look. Nice work! Next, we’ll dive deeper into exploratory data analysis (EDA) to uncover patterns, handle missing values, and visualize trends in rainfall. What do you think of our dataset so far, viewers? Excited to predict rainfall with AI? Drop your thoughts in the comments, and let’s make it rain insights together. 🌧️🚀
Cleaning Up Our Dataset
Dropping Missing Values!
After loading our dataset and taking a first look, we’re now moving into data cleaning—a crucial step to ensure our AI models can work their magic. In this code block, we’re handling missing values and checking the updated size of our dataset.
What’s Happening in This Code?
Let’s break it down like we’re tidying up a room before a big event:
Dropping Missing Values:
df = df.dropna(): This pandas method removes any rows in our DataFrame df that have missing values (NaN). By default, dropna() drops a row if it has a missing value in any column. We’re assigning the result back to df, so our DataFrame is updated with the cleaned version.
Checking the Dataset Size:
df.shape: This pandas attribute returns a tuple showing the dimensions of the DataFrame: (number of rows, number of columns). It’s a quick way to see how many rows and columns we have after cleaning.
Why Are We Doing This?
Missing values can cause errors in machine learning models or skew our predictions. For example, if a row is missing the Rainfall value, our model wouldn’t know what to predict for that day. By dropping these rows, we ensure our dataset is complete and ready for analysis. Checking the shape afterward tells us how much data we’re left with—important for understanding if we’ve lost too many rows during cleaning.
Here’s the code we’re working with:
df = df.dropna()
df.shape
The Output:
(365, 12)
Updated Dataset Dimensions
Observations:
Rows: We now have 365 rows, which suggests we’re left with one year’s worth of daily data (since 365 days = 1 year, assuming it’s not a leap year).
Columns: We have 12 columns. By default, dropna() removes rows rather than columns, so no features were dropped in this step.
Insight: Before this step, our dataset likely had more than 365 rows, but some contained missing values, so dropna() removed them. The fact that we’re left with exactly 365 rows is interesting: it might mean we’re working with a subset of the original data (like one year or one location). Twelve columns is also fewer than the list of names in our first look, so treat that list as provisional. The 12 could include features like Date, MinTemp, MaxTemp, Rainfall, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, WindSpeed9am, WindSpeed3pm, RainToday, and RainTomorrow, and running df.columns will confirm the exact names. For now, our data is clean and ready for deeper analysis!
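To see how much data a dropna() call actually costs, it helps to count missing values and compare row counts before and after. The sketch below does that on a small made-up DataFrame; the column names and values are hypothetical and only stand in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two incomplete rows
toy = pd.DataFrame({
    "humidity": [72.0, np.nan, 80.0, 65.0, 90.0],
    "pressure": [1012.0, 1009.5, np.nan, 1015.2, 1008.1],
    "rainfall": [1, 0, 1, 1, 0],
})

rows_before = toy.shape[0]
missing_per_column = toy.isna().sum()   # where the gaps are

cleaned = toy.dropna()                  # drop rows with any missing value
rows_dropped = rows_before - cleaned.shape[0]

print(missing_per_column)
print(f"Dropped {rows_dropped} of {rows_before} rows; new shape: {cleaned.shape}")
```

Running the same comparison on df before the cleaning step would show how many rows the real dataset lost.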
Fun Fact: Missing Data in Weather Records
Did you know that missing data is a common issue in weather datasets? Sensors can fail, stations might not record every metric, or data might get lost during collection. For example, historical weather records from remote areas often have gaps due to equipment issues—making data cleaning a critical step in projects like ours!
Real-Life Example
Imagine you’re a weather analyst, preparing a rainfall prediction model for the upcoming monsoon season. If your dataset has missing humidity or pressure values, your predictions might be off, leading to poor planning for flood prevention. By dropping rows with missing data, you ensure your model is trained on complete, reliable data—just like we’re doing here!
Quiz Time!
Let’s test your data cleaning skills, students!
What does df.dropna() do?
a) Adds missing values to the dataset
b) Removes rows with any missing values
c) Fills missing values with zeros
If our dataset originally had 500 rows and now has 365 after dropna(), how many rows were dropped?
a) 135
b) 365
c) 500
Drop your answers in the comments—I’d love to see how you’re doing!
Cheat Sheet: Data Cleaning Basics with Pandas
df.dropna(): Removes rows with any missing values (use df.dropna(subset=['column']) to drop rows based on specific columns).
df.shape: Returns the dimensions of the DataFrame as (rows, columns).
df.isna().sum(): Shows the number of missing values in each column (try this to see what we dropped!).
df.fillna(value): Fills missing values with a specified value (e.g., df.fillna(0)), an alternative to dropping rows.
Did You Know?
Dropping rows isn’t always the best way to handle missing data! In real-world projects, professionals often impute missing values (e.g., using the mean or median) to preserve more data. For example, if Humidity9am is missing, you might fill it with the average humidity for that location. We’ll explore imputation in future parts if needed—stay tuned!
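As a rough illustration of the imputation idea above, the sketch below fills a missing humidity reading with the average humidity for the same location. The location and humidity columns and all values are hypothetical; the real dataset may not have a location column at all.

```python
import numpy as np
import pandas as pd

# Hypothetical readings from two locations, each with one gap
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "humidity": [70.0, np.nan, 80.0, 40.0, np.nan],
})

# Mean humidity per location, aligned row by row with the original data
location_mean = toy.groupby("location")["humidity"].transform("mean")

# Fill each gap with the mean of its own location
toy["humidity_filled"] = toy["humidity"].fillna(location_mean)

print(toy)
```

Unlike dropna(), this keeps all five rows, at the cost of replacing real gaps with estimates.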
Pro Tip:
Dropping rows worked here, but we might explore other methods like imputation in future steps to keep more of our data.
Next Steps:
Our dataset is now clean with 365 rows—perfect for a year’s worth of daily predictions! Next, we’ll dive into exploratory data analysis (EDA) to uncover patterns, visualize rainfall trends, and prepare our features for modeling.
Exploring Our Target
Visualizing Rainfall Distribution!
After cleaning our dataset by dropping missing values, we’re now diving deeper into understanding our target variable: rainfall. This code block helps us visualize how often each value of the rainfall column occurs, giving us a crucial insight into our data’s balance.
What’s Happening in This Code?
Let’s break it down like we’re counting raindrops in a storm:
Counting Rainfall Values:
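df.rainfall.value_counts(): This pandas method counts how many times each unique value appears in the rainfall column of our DataFrame df. It returns a Series where the index holds the unique values and the values are their counts. The oversampling code later in this post filters on rainfall == 0 and rainfall == 1, so the column is treated as a 0/1 label rather than an amount in mm.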
Creating a Bar Plot:
.plot(kind='bar'): Takes the result of value_counts() and generates a bar plot using matplotlib. Each bar represents a unique rainfall value, with the height showing how often it occurs in the dataset.
Why Are We Doing This?
Visualizing the distribution of rainfall helps us understand our target variable’s behavior. Since we’re likely predicting whether it rains (a binary classification problem based on RainTomorrow), this plot can show if the rainfall data is balanced or skewed—crucial for choosing the right model and handling class imbalance later. It’s our first step toward spotting patterns in the weather data!
Here’s the code we’re working with:
df.rainfall.value_counts().plot(kind='bar')
The Output
Rainfall Distribution Bar Plot
Take a look at the output! The bar plot shows the distribution of rainfall values in our dataset. Here’s what we see:
X-Axis: The unique values in the rainfall column. The oversampling step that follows treats these as 0 (no rain) and 1 (rain).
Y-Axis: The count of days with each value.
Bars: The heights differ noticeably, with one bar much taller than the other.
Insight: The target column appears imbalanced: one class has far more days than the other. Judging by the oversampling code that follows, rainy days (1) are the majority class in this dataset and dry days (0) the minority. This imbalance could affect our model’s performance by pulling predictions toward the majority class. We’ll need to address this later, perhaps with techniques like oversampling or class weighting, but for now it’s a key observation to guide our next steps!
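A bar plot shows imbalance at a glance, but it can also help to put numbers on it. Here is a minimal sketch that reports class counts, class shares, and a simple majority-to-minority ratio. The Series below is made up for illustration and is not the real rainfall column.

```python
import pandas as pd

# Hypothetical 0/1 target: six days in one class, two in the other
target = pd.Series([1, 1, 1, 0, 1, 0, 1, 1], name="rainfall")

counts = target.value_counts()                 # absolute counts per class
shares = target.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()  # majority size / minority size

print(counts)
print(shares)
print(f"Majority class is {imbalance_ratio:.1f}x the minority class")
```

On the real data, df['rainfall'].value_counts(normalize=True) would give the same kind of summary.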
Fun Fact: Rainfall Imbalance is Common!
Did you know that weather datasets often have imbalanced rainfall distributions? In many regions, including Australia, dry days outnumber rainy ones, especially in arid areas. This natural skew is why AI models need special handling to predict rare rain events accurately!
Real-Life Example
Imagine you’re a city planner in Ibiza, preparing for the rainy season. If your rainfall data is dominated by one kind of day, you might misjudge flood risks unless your AI model accounts for the imbalance. This visualization helps you spot that challenge early and adjust your strategy!
Quiz Time!
Let’s test your data insight skills, students!
What does an imbalanced target column mean?
a) All rainfall values are the same
b) One rainfall value (e.g., 0.0 mm) occurs much more often than others
c) The dataset has no missing values
Why might an imbalanced dataset affect our model?
a) It makes the model faster
b) It might bias the model toward the majority class (e.g., no rain)
c) It improves accuracy
Please leave your responses in the comments section; I'd like to read what you have to say!
Cheat Sheet: Visualizing Data with Pandas
df.column.value_counts(): Counts unique values in a column.
.plot(kind='bar'): Creates a bar plot (try kind='hist' for histograms or kind='pie' for pies!).
plt.title('Title'): Adds a title to the plot (add this next time for clarity).
plt.xlabel('X Label') and plt.ylabel('Y Label'): Labels the axes.
plt.show(): Displays the plot (optional if using Jupyter).
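Putting the cheat sheet together, a labeled version of the bar plot might look like the sketch below. The data is a made-up 0/1 Series standing in for the rainfall column, and the title and axis labels are only suggestions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; remove this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical 0/1 target standing in for df['rainfall']
target = pd.Series([1, 1, 1, 0, 1, 0], name="rainfall")

# One bar per class, ordered by class label
ax = target.value_counts().sort_index().plot(kind="bar")

plt.title("Days per rainfall class")
plt.xlabel("rainfall")
plt.ylabel("Number of days")
plt.tight_layout()
plt.show()  # optional in Jupyter
```

With a title and axis labels in place, the plot can be read without the surrounding text.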
Did You Know?
In Australia, the average annual rainfall varies wildly—from over 4,000 mm in tropical Queensland to less than 250 mm in the outback! This diversity explains why our plot might show a heavy skew toward low rainfall values, reflecting drier conditions in some regions.
Pro Tip:
Our bar plot reveals an imbalanced rainfall distribution, with one class dominating. That sets the stage for tackling class imbalance in our AI model!
Balancing the Scales
Oversampling Our Rainfall Target!
After spotting an imbalanced rainfall distribution in our dataset, we’re now taking action to level the playing field. This code block introduces oversampling to balance our target column, so that our AI model gets a fair look at both classes.
What’s Happening in This Code?
Let’s break it down like we’re balancing a seesaw:
Importing the Resampling Tool:
from sklearn.utils import resample: Imports the resample function from sklearn, which allows us to oversample (or undersample) data to balance classes.
Splitting into Majority and Minority Classes:
df_majority = df[(df['rainfall'] == 1)]: Creates a DataFrame df_majority with rows where rainfall is 1 (rain). In this dataset that is the larger class, with 248 rows.
df_minority = df[(df['rainfall'] == 0)]: Creates a DataFrame df_minority with rows where rainfall is 0 (no rain), the smaller class that we’ll upsample.
Oversampling the Minority Class:
df_minority_upsampled = resample(df_minority, replace=True, n_samples=248, random_state=42):
replace=True: Allows sampling with replacement, meaning we can reuse rows to create more samples.
n_samples=248: Sets the number of samples to match the majority class size (248, based on the output).
random_state=42: Ensures reproducibility by fixing the random seed.
Combining the DataFrames:
df = pd.concat([df_minority_upsampled, df_majority]): Combines the upsampled minority class with the majority class into a new df, balancing the dataset.
Checking the Balance:
df['rainfall'].value_counts(): Counts the occurrences of each unique value in the rainfall column to confirm the balance.
Why Are We Doing This?
Our earlier bar plot revealed an imbalanced rainfall distribution. In the code below, rainy days (rainfall == 1) form the majority class with 248 rows, which leaves dry days (rainfall == 0) as the minority. An imbalanced dataset could bias our model toward predicting the majority class too often. Oversampling the minority class creates a balanced dataset, giving our model a fair chance to learn from both classes. This is a widely used technique for helping a model learn the less common class!
Here’s the code we’re working with:
# The data in the target column is imbalanced, so we oversample the minority class
from sklearn.utils import resample
# Create two separate DataFrames for the majority and minority classes
df_majority = df[(df['rainfall'] == 1)]
df_minority = df[(df['rainfall'] == 0)]
# Upsample the minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,     # sample with replacement
                                 n_samples=248,    # to match majority class
                                 random_state=42)  # reproducible results
# Combine the majority class with the upsampled minority class
df = pd.concat([df_minority_upsampled, df_majority])
df['rainfall'].value_counts()
The Output
rainfall
0 248
1 248
Balanced Rainfall Distribution
Observations:
Value Counts: We now have 248 rows where rainfall is 0 and 248 rows where rainfall is 1.
Balance: The dataset is perfectly balanced, with an equal number of “no rain” (0) and “rain” (1) instances.
Insight: The oversampling worked! The minority class (dry days, rainfall == 0) was upsampled with replacement until it matched the majority class (rainy days, rainfall == 1) at 248 rows. Because the cleaned dataset had 365 rows, the minority class presumably started with 365 - 248 = 117 rows, assuming the column holds only 0 and 1. This balance helps keep our model from simply favoring the majority class. Note that we treated rainfall, encoded as 0 and 1, as the target here; if RainTomorrow is our actual target, we might adjust this step later to balance that column instead.
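The hard-coded n_samples=248 works for this dataset, but it has to be updated by hand if the data changes. One possible variation, sketched below on made-up data, works out the majority and minority labels from value_counts() and sizes the upsample from the majority class. The toy DataFrame and its humidity column are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: six rows of class 1, two rows of class 0
toy = pd.DataFrame({
    "humidity": [80.0, 85.0, 78.0, 90.0, 88.0, 82.0, 40.0, 45.0],
    "rainfall": [1, 1, 1, 1, 1, 1, 0, 0],
})

# Let the data decide which class is the majority and which is the minority
counts = toy["rainfall"].value_counts()
majority_label = counts.idxmax()
minority_label = counts.idxmin()

df_majority = toy[toy["rainfall"] == majority_label]
df_minority = toy[toy["rainfall"] == minority_label]

# Upsample the minority class to the majority size, with replacement
df_minority_upsampled = resample(
    df_minority,
    replace=True,
    n_samples=len(df_majority),
    random_state=42,
)

balanced = pd.concat([df_majority, df_minority_upsampled])
print(balanced["rainfall"].value_counts())
```

The result has the same number of rows in each class, whichever class happens to be the larger one.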
Fun Fact: Oversampling in Action!
Did you know oversampling is used in medical diagnostics too? For rare diseases like certain cancers, where positive cases are scarce, AI models oversample the minority class to improve detection rates—similar to how we’re balancing rainy days here!
Real-Life Example
Imagine you’re a disaster preparedness officer in Johannesburg using this model to predict heavy rains for flood warnings. If your original data had few rainy days, your model might miss floods. Oversampling helps it learn from rainy scenarios, potentially saving lives during the rainy season!
Quiz Time!
Let’s test your balancing skills, students!
What does oversampling do?
a) Removes data from the majority class
b) Increases the number of samples in the minority class
c) Deletes the dataset
Why is balancing important for rainfall prediction?
a) It makes the plot look better
b) It prevents the model from being biased toward no rain
c) It speeds up training
Drop your answers in the comments—I’d love to hear your insights!
Cheat Sheet: Oversampling with resample
resample(df, replace=True, n_samples=n, random_state=42): Oversamples a DataFrame.
replace=True: Allows reusing rows.
n_samples: Sets the desired number of samples.
random_state: Ensures reproducibility.
pd.concat([df1, df2]): Combines DataFrames vertically.
df['column'].value_counts(): Checks the count of unique values in a column.
Did You Know?
Oversampling can sometimes lead to overfitting if not handled carefully! To avoid this, professionals often use techniques like SMOTE (Synthetic Minority Oversampling Technique) to create synthetic samples instead of duplicating rows. We might explore SMOTE in a future part—stay tuned!
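A related precaution, beyond what this post does, is to split off a test set before oversampling, so that duplicated rows never appear in both training and test data. The sketch below shows that ordering on made-up data; it uses plain resample rather than SMOTE, and the DataFrame, split size, and labels are all hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 14 rows of class 1, 6 rows of class 0
toy = pd.DataFrame({
    "humidity": [float(h) for h in range(60, 80)],
    "rainfall": [1] * 14 + [0] * 6,
})

# 1) Split first, keeping the class proportions similar in both parts
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["rainfall"], random_state=42
)

# 2) Oversample only the training rows
train_majority = train[train["rainfall"] == 1]
train_minority = train[train["rainfall"] == 0]
train_minority_up = resample(
    train_minority,
    replace=True,
    n_samples=len(train_majority),
    random_state=42,
)
train_balanced = pd.concat([train_majority, train_minority_up])

print(train_balanced["rainfall"].value_counts())
print(f"Test rows left untouched: {len(test)}")
```

The test set keeps its original, imbalanced mix, which gives a more honest picture of how the model behaves on new data.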
Pro Tip:
By oversampling, we’ve balanced our rainfall data to 248 rainy and 248 non-rainy rows, setting the stage for a fair AI model! Also, note that we assumed rainfall as the target; if it’s RainTomorrow, we’ll adjust in the next steps.
Next Steps:
Our dataset is now balanced. Amazing progress! Next, we’ll check correlations and prepare our features for modeling.
Wrapping Up Part 1
A Rainy Success Story!
Wow, what an incredible ride we’ve had, my amazing viewers and students! Part 1 of our "Rainfall Prediction Using AI" blog series has been a splash hit.
We kicked off by loading our weather dataset, peeked at its juicy details, cleaned it up by dropping missing values, uncovered an imbalanced rainfall distribution, and masterfully balanced it with oversampling—leaving us with a perfectly poised 248 rainy and 248 non-rainy days! This journey has set a solid foundation, blending data science with real-world impact, and I’m so proud of how far we’ve come together. Your enthusiasm has made every step a delight!
What’s Next? Part 2 Promises Even Bigger Storms!
Hold onto your umbrellas because Part 2 is going to blow you away! We’ll dive deeper into the action with:
Exploratory Data Analysis (EDA): Uncovering hidden patterns with stunning visualizations—think heatmaps and correlation plots that reveal what drives rain!
Building Our First Model: We’ll train a machine learning model (maybe XGBoost or Random Forest) to predict rainfall, putting our balanced data to work.
Interactive Fun: Live coding sessions and quizzes to test your skills, plus tips to make your predictions shine.
Subscribe to www.youtube.com/@cognitutorai, hit that notification bell, and join our growing community of AI adventurers. Whether you’re coding along in Singapore or exploring from afar, let’s keep the momentum going.
What was your favorite moment from Part 1? Drop it in the comments, and tell me what you’re hyped for in Part 2—I can’t wait to hear from you! Get ready to make it rain predictions in our next chapter! 🌧️🚀