Exploring Microsoft Stock Data (Part 2)
A Beginner's Guide to Financial Analysis (Part 2)
End-to-end machine learning project Blog
Welcome Back to Part 2
Unleashing the Power of Stock Predictions!
Hey there amazing viewers and students!
Welcome back to the thrilling continuation of our "Microsoft Stock Price Analysis & Predictions" blog series! If you joined us in Part 1, you’ve already witnessed the magic: we explored Microsoft’s stock data from 1986 to 2024, crafted stunning visualizations, decoded feature correlations with a heatmap, and crowned Linear Regression and Random Forest as our prediction champs with near-perfect R² scores.
What a ride that was! For those just tuning in, you’re jumping into the action at the perfect time. Grab a cup of tea (or your favorite drink!), and let’s dive back into the world of data science and stock market excitement.
We’re picking up right where we left off, and Part 2 is where the real fun begins! We’re about to harness our top-performing models to predict future Microsoft stock prices, explore advanced techniques like LSTM, and visualize our predictions in epic style. So, buckle up, because we’re about to turn our insights into action and aim for stock prediction glory. Ready to join me on this next chapter?
Let’s get coding and start building the future together!
Kicking Off Part 2: Visualizing Our Best Model’s Predictions!
We’re diving into the first code block of Part 2 of our "Microsoft Stock Price Analysis & Predictions" blog series.
In Part 1, we crowned Linear Regression as one of our top-performing models with an incredible R² score of 0.99994. Now, it’s time to see how well it predicts Microsoft’s stock Close prices by visualizing its performance on the test data. Let’s break down this code, explore the output from the uploaded image, and sprinkle in some fun facts, quizzes, and more to keep the learning engaging!
What’s Happening in This Code?
Let’s break it down like we’re sketching a masterpiece:
Creating a Scatter Plot:
plt.scatter(y_test, lrpred): This line uses matplotlib to create a scatter plot. y_test contains the actual Close prices from the test set (the “ground truth”), and lrpred holds the predictions made by our Linear Regression model (from Part 1). Each point on the plot will show how a predicted price compares to the actual price.
Adding Labels and Title:
plt.title('Linear Regressor VS PREDICTION'): Sets the title of the plot to “Linear Regressor VS PREDICTION.”
plt.xlabel('Ground Truth'): Labels the x-axis as “Ground Truth” (the actual Close prices).
plt.ylabel('Prediction'): Labels the y-axis as “Prediction” (the prices predicted by the model).
Displaying the Plot:
plt.show(): Renders the plot (though this is optional in some environments like Jupyter).
Why Are We Doing This?
This scatter plot helps us visually assess how well our Linear Regression model performed. If the predictions are perfect, all points will lie on a straight line (y=x), meaning the predicted price equals the actual price. Any deviation from this line shows where the model’s predictions differ from reality. It’s a great way to confirm the high R² score we saw in Part 1!
Here’s the code we’re working with to start Part 2:
Code:
# NOW Visualize the best performing model with testing data
plt.scatter(y_test, lrpred)
plt.title('Linear Regressor VS PREDICTION')
plt.xlabel('Ground Truth')
plt.ylabel('Prediction')
plt.show()
Output:
A Scatter Plot of Predictions vs. Reality
Take a look at the output plot, titled "Linear Regressor VS PREDICTION": it shows how our model’s predictions stack up against the actual Close prices. Here’s what we see:
X-Axis (Ground Truth): The actual Close prices from the test set, ranging from 0 to around 350.
Y-Axis (Prediction): The predicted Close prices, also ranging from 0 to around 350.
Scatter Points: Each blue dot represents a pair of values: the actual price (x-axis) and the predicted price (y-axis).
Trend: The points form a nearly perfect diagonal line (y=x), meaning the predictions are incredibly close to the actual prices. There’s barely any scatter—our model is nailing it!
Insight: This tight alignment confirms our Linear Regression model’s high R² score of 0.99994 from Part 1. The predictions are almost identical to the actual Close prices, which makes sense given the strong correlations we saw between Close and features like Open and High in the heatmap.
Fun Fact: The Power of a Good Fit
Did you know a scatter plot like this is a classic way to evaluate regression models? When the points hug the diagonal line (y=x), it’s a sign your model is a superstar at predicting—like our Linear Regression here! In real-world stock prediction, such a tight fit is rare due to market volatility, but our features made this task a bit easier.
Real-Life Example
Imagine you’re a financial advisor using this model to predict Microsoft stock prices for a client. This scatter plot would give you confidence to say, “Our model’s predictions are spot-on!” You could use these predictions to guide investment decisions, knowing they closely match reality—at least for the test data we’ve seen.
Quiz Time
Let’s test your visualization skills, students!
What does a perfect diagonal line (y=x) in this scatter plot mean?
a) The predictions are random
b) The predictions match the actual values exactly
c) The model failed
Why might the points be so tightly clustered along the line?
a) The test data is fake
b) The features (like Open, High) are highly correlated with Close
c) The model is overfitting
Drop your answers in the comments, I’d love to see your thoughts!
Cheat Sheet: Scatter Plots with Matplotlib
plt.scatter(x, y): Creates a scatter plot with x (actual values) and y (predicted values).
plt.title('text'): Adds a title to the plot.
plt.xlabel('text') and plt.ylabel('text'): Labels the axes.
plt.show(): Displays the plot (optional in some environments).
Did You Know?
You can enhance this plot by adding a diagonal reference line! Try adding plt.plot([0, 350], [0, 350], color='red', linestyle='--') before plt.show() to draw a red dashed line (y=x). It’ll make the comparison even clearer—give it a shot in your next project!
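To make that concrete, here’s a quick sketch of the upgraded plot, assuming y_test and lrpred from Part 1 are still in memory:
# A quick sketch: the same scatter plot plus a y = x reference line
# (assumes y_test and lrpred from Part 1 are still in memory)
import matplotlib.pyplot as plt
plt.scatter(y_test, lrpred, alpha=0.5, label='Predictions')
plt.plot([0, 350], [0, 350], color='red', linestyle='--', label='Perfect fit (y = x)')
plt.title('Linear Regressor VS PREDICTION')
plt.xlabel('Ground Truth')
plt.ylabel('Prediction')
plt.legend()
plt.show()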
We’ve confirmed our model’s awesomeness with this scatter plot! Next, we can use it to predict future prices, explore time series models like LSTM, or compare predicted vs. actual prices over time.
What do you think of this plot, viewers? Are you impressed by our model’s performance? Drop your thoughts in the comments, and let’s keep exploring together!
Checking Our Model’s Fit with Cross-Validation!
We just saw how well our Linear Regression model performed with a stunning scatter plot, showing near-perfect predictions for Microsoft’s stock Close prices. But now, it’s time to dig deeper and make sure our model isn’t tricking us—did it overfit, underfit, or is it just right? In this code block, we’re using cross-validation to test the model’s reliability. Let’s break it down.
What’s Happening in This Code?
Let’s break it down like we’re solving a detective mystery:
Importing the Cross-Validation Tool:
from sklearn.model_selection import cross_val_score: This brings in cross_val_score, a handy function from sklearn that helps us evaluate our model’s performance more robustly.
Running Cross-Validation:
cross_val = cross_val_score(estimator=lr, X=x_train_scaled, y=y_train): This line performs 5-fold cross-validation (the default in sklearn) on our Linear Regression model (lr). It uses the scaled training data (x_train_scaled, y_train) from Part 1. Here’s how it works:
The training data is split into 5 equal parts (folds).
The model is trained on 4 folds and tested on the 5th fold, repeating this process 5 times (each fold gets a turn as the test set).
For each fold, it calculates an R² score, giving us 5 scores to assess consistency.
Printing the Results:
print('Cross Val Acc Score of LINEAR REGRESSION model is ---> ', cross_val): Prints the R² scores for each of the 5 folds.
print('\n Cross Val Mean Acc Score of LR model is ---> ', cross_val.mean()): Prints the average R² score across all folds, giving us a single number to summarize the model’s performance.
Why Are We Doing This?
Cross-validation helps us check if our model is overfitting (performing great on the training data but poorly on new data) or underfitting (failing to learn the patterns well). If the cross-validation scores are high and consistent with our test R² score (0.99994 from Part 1), our model is likely well-balanced. If the scores vary a lot or are much lower, we might have a problem!
Code:
# (TO CHECK IF THE MODEL HAS OVERFITTED OR UNDERFITTED)
from sklearn.model_selection import cross_val_score
cross_val = cross_val_score(estimator=lr, X=x_train_scaled, y=y_train)
print('Cross Val Acc Score of LINEAR REGRESSION model is ---> ', cross_val)
print('\n Cross Val Mean Acc Score of LR model is ---> ', cross_val.mean())
The Output: Cross-Validation Scores for Linear Regression
Cross Val Acc Score of LINEAR REGRESSION model is ---> [0.9999498 0.99993975 0.99994065 0.9999354 0.99988651]
Cross Val Mean Acc Score of LR model is ---> 0.9999304222327898
Observations:
Individual Scores: The R² scores for the 5 folds are [0.9999498, 0.99993975, 0.99994065, 0.9999354, 0.99988651]. These are all extremely close to 1, showing our model performs consistently well across different subsets of the training data.
Mean Score: The average R² score is 0.9999304222327898—incredibly high and very close to our test R² score of 0.99994 from Part 1.
Insight: The scores are tight (ranging from 0.99988651 to 0.9999498), and the mean matches our test performance. This means our Linear Regression model isn’t overfitting or underfitting—it’s generalizing beautifully! The model is reliable and ready to predict Microsoft stock prices.
Fun Fact: The Magic of Cross-Validation
Did you know cross-validation was inspired by the idea of “testing your recipe”? Just like you’d taste-test a dish multiple times to ensure it’s consistently delicious, cross-validation tests your model on different data splits to ensure it’s consistently accurate!
Real-Life Example
Imagine you’re a data scientist at a tech firm in Silicon Valley, tasked with predicting Microsoft stock prices for a client. Using cross-validation, you can confidently tell your client, “Our model isn’t just a one-hit wonder—it performs consistently well across different data splits!” This reliability could help your client trust your predictions for their investment decisions.
Quiz Time!
Let’s test your cross-validation knowledge, students!
What does a high cross-validation score (close to 1) indicate?
a) The model is overfitting
b) The model is performing consistently well
c) The model is underfitting
If the cross-validation scores were much lower than the test R² score, what might that suggest?
a) The model is underfitting
b) The model might be overfitting to the test set
c) The data is perfect
Drop your explanation to the answers in the comments—I’m excited to see your progress!
Cheat Sheet: Cross-Validation with Scikit-Learn
cross_val_score(estimator, X, y): Performs cross-validation (default is 5 folds).
estimator: The model to evaluate (e.g., lr for Linear Regression).
X and y: The features and target (e.g., x_train_scaled, y_train).
cross_val.mean(): Calculates the average score across folds.
Did You Know?
You can change the number of folds in cross_val_score! Try cross_val_score(estimator=lr, X=x_train_scaled, y=y_train, cv=10) to use 10 folds instead of 5 for an even more thorough evaluation. More folds can give you a better sense of consistency but take longer to compute.
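Here’s a tiny sketch of that 10-fold version, assuming lr, x_train_scaled, and y_train from Part 1 are available:
# A quick sketch of 10-fold cross-validation (assumes lr, x_train_scaled,
# and y_train from Part 1)
from sklearn.model_selection import cross_val_score
cross_val_10 = cross_val_score(estimator=lr, X=x_train_scaled, y=y_train, cv=10)
print('10-fold R² scores ---> ', cross_val_10)
print('Mean 10-fold R² ---> ', cross_val_10.mean())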
So we’ve confirmed our Linear Regression model is reliable with cross-validation—nice work! Next, we can use this model to predict future Microsoft stock prices, explore time series techniques like LSTM, or plot predictions over time.
Evaluating Our Model’s Performance with Key Metrics!
We’ve already confirmed that our Linear Regression model is a star performer with a scatter plot and cross-validation, showing consistent R² scores above 0.9999. Now, it’s time to dive deeper into its performance by calculating a variety of evaluation metrics. This will give us a fuller picture of how well our model predicts Microsoft’s stock Close prices.
What’s Happening in This Code?
Let’s break it down like we’re baking a cake—step by step:
Importing the Tools:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score: These functions from sklearn calculate common regression metrics.
import numpy as np: We’ll use numpy for some math operations.
Generating Predictions:
y_pred = lr.predict(x_test_scaled): Uses our trained Linear Regression model (lr) to predict Close prices on the scaled test data (x_test_scaled). These are the same predictions as lrpred from earlier; here we store them as y_pred for clarity.
Calculating Metrics:
A dictionary metrics is created to store four key evaluation metrics:
"MAE": mean_absolute_error(y_test, y_pred) computes the Mean Absolute Error—the average absolute difference between actual (y_test) and predicted (y_pred) prices.
"RMSE": np.sqrt(mean_squared_error(y_test, y_pred)) calculates the Root Mean Squared Error—the square root of the average squared differences between actual and predicted prices. It penalizes larger errors more heavily.
"R²": r2_score(y_test, y_pred) gives us the R² score, which we’ve seen before (it measures how well predictions match the actual data, with 1 being perfect).
"MAPE": np.mean(np.abs((y_test - y_pred) / y_test)) * 100 computes the Mean Absolute Percentage Error—the average percentage error between actual and predicted prices, expressed as a percentage.
Printing the Results:
print("=== Final Evaluation Metrics ==="): A header for our output.
for k, v in metrics.items(): print(f"{k}: {v:.4f}"): Loops through the metrics dictionary and prints each metric with 4 decimal places for clarity.
Why Are We Doing This?
While the R² score gave us a great overview of our model’s performance, these additional metrics provide a more detailed picture:
MAE tells us the average error in dollars.
RMSE highlights larger errors, which is crucial for stock prediction.
MAPE shows the error as a percentage, making it easier to interpret in context.
Together, they help us understand our model’s strengths and weaknesses beyond just R².
Here’s the code we’re working with:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Generate predictions
y_pred = lr.predict(x_test_scaled)
# Calculate key metrics
metrics = {
"MAE": mean_absolute_error(y_test, y_pred),
"RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
"R²": r2_score(y_test, y_pred),
"MAPE": np.mean(np.abs((y_test - y_pred) / y_test)) * 100 # Percentage errors
}
print("=== Final Evaluation Metrics ===")
for k, v in metrics.items():
    print(f"{k}: {v:.4f}")
The Output
=== Final Evaluation Metrics ===
MAE: 0.2172
RMSE: 0.4721
R²: 0.9999
MAPE: 4.0527
Observations:
MAE (0.2172): On average, our model’s predictions are off by about $0.22 per share. That’s tiny considering Microsoft’s stock price in recent years (around $300 in 2024)!
RMSE (0.4721): The root mean squared error is $0.47, meaning even when larger errors are penalized, our model’s predictions are still very close to the actual prices.
R² (0.9999): This matches our earlier R² score (0.99994), confirming our model explains nearly all the variance in the Close prices.
MAPE (4.0527): The mean absolute percentage error is 4.05%, meaning our predictions are, on average, off by just 4.05%. For stock prices ranging from a few dollars to $350, this is impressively low!
Insight: These metrics confirm our model is performing exceptionally well. The low MAE and RMSE show small errors in absolute terms, the high R² shows a great fit, and the MAPE of 4.05% means our predictions are highly accurate in percentage terms. However, the MAPE might be higher for older data (e.g., when prices were $0.09 in 1986), as small absolute errors (like $0.22) are a larger percentage of smaller prices.
Fun Fact: Why MAPE Matters in Stocks
Did you know MAPE is especially popular in financial forecasting because it’s easy to interpret? A MAPE of 4.05% means if Microsoft’s stock is $300, our prediction might be off by about $12 on average—not bad for a day’s trading!
Real-Life Example
Imagine you’re a stock trader in the New York Stock Exchange using this model to predict Microsoft’s stock price. With an MAE of $0.22 and a MAPE of 4.05%, you’d feel confident making trades based on these predictions—knowing your model’s errors are small enough to keep your risks low while aiming for profits.
Quiz Time!
Let’s test your metrics knowledge, students!
What does a low MAE (like 0.2172) tell us about our model?
a) It has large errors
b) Its predictions are very close to the actual values
c) It’s underfitting
Why might MAPE be higher for older stock prices (e.g., $0.09 in 1986)?
a) The model is worse for older data
b) Small absolute errors are a larger percentage of smaller prices
c) The data is missing
Drop your answers in the comments—I’d love to hear from you!
Cheat Sheet: Regression Metrics
mean_absolute_error(y_true, y_pred): Average absolute error in the same units as the target.
mean_squared_error(y_true, y_pred): Average squared error; use np.sqrt() for RMSE.
r2_score(y_true, y_pred): Measures how well predictions fit the data (0 to 1).
MAPE: (np.mean(np.abs((y_true - y_pred) / y_true))) * 100—average percentage error.
Did You Know?
You can visualize errors too! Try plotting the residuals (errors: y_test - y_pred) with a histogram using plt.hist(y_test - y_pred, bins=50) to see the distribution of your model’s errors. It’s a great way to spot if errors are skewed or if there are outliers.
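Here’s a minimal sketch of that residual histogram, assuming y_test and y_pred from the code above:
# A quick sketch of the residual histogram (assumes y_test and y_pred from above)
import matplotlib.pyplot as plt
errors = y_test - y_pred  # prediction errors in dollars
plt.hist(errors, bins=50)
plt.title('Distribution of Prediction Errors')
plt.xlabel('Error ($)')
plt.ylabel('Count')
plt.show()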
A Professional Evaluation of Our Linear Regression Model!
Now, we’re stepping it up with a professional-grade evaluation report, diving even deeper into our model’s performance on the test set. This code block gives us a polished summary of our model’s accuracy, perfect for sharing with stakeholders.
What’s Happening in This Code?
Let’s break it down like we’re preparing a professional presentation:
Importing the Evaluation Tools:
from sklearn.metrics import (mean_absolute_error, mean_squared_error, mean_absolute_percentage_error, r2_score): These sklearn functions calculate key regression metrics. We’ve seen most of them before, but mean_absolute_percentage_error is a new addition for a cleaner MAPE calculation.
Generating Predictions:
y_pred = lr.predict(x_test_scaled): Uses our Linear Regression model (lr) to predict Close prices on the scaled test data (x_test_scaled). This is the same step we did in the previous block.
Calculating Key Metrics:
mae = mean_absolute_error(y_test, y_pred): Computes the Mean Absolute Error—the average absolute difference between actual (y_test) and predicted (y_pred) prices.
mse = mean_squared_error(y_test, y_pred) and rmse = np.sqrt(mse): Calculates the Mean Squared Error and its square root (Root Mean Squared Error), which penalizes larger errors more.
mape = mean_absolute_percentage_error(y_test, y_pred): Computes the Mean Absolute Percentage Error using sklearn’s built-in function, giving the average percentage error.
r2 = r2_score(y_test, y_pred): Calculates the R² score, measuring how well our predictions fit the actual data.
Adjusted R-squared: This is computed with the formula 1 - (1 - r2) * (len(y_test) - 1)/(len(y_test) - x_test_scaled.shape[1] - 1). It adjusts the R² score to account for the number of features, preventing over-optimism when using many predictors.
Printing a Professional Report:
The print statements create a formatted report with a header and footer (= lines), showing each metric with proper formatting:
MAE and RMSE are shown in dollars with 4 decimal places.
MAPE is formatted as a percentage with 4 decimal places.
R² and Adjusted R² are shown with 6 decimal places for precision.
Why Are We Doing This?
This evaluation gives us a polished, professional summary of our model’s performance, which is perfect for sharing with an audience. It also includes Adjusted R², which ensures our high R² isn’t just due to having many features. These metrics help us confirm our model’s reliability for predicting Microsoft stock prices and prepare us for real-world applications.
Here’s the code we’re working with:
# STEP 1: COMPREHENSIVE TEST SET EVALUATION
# =============================================
from sklearn.metrics import (mean_absolute_error,
mean_squared_error,
mean_absolute_percentage_error,
r2_score)
# Predict on test set
y_pred = lr.predict(x_test_scaled)
# Calculate key regression metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('\n' + '='*50)
print("PROFESSIONAL MODEL EVALUATION REPORT")
print('='*50)
print(f"Mean Absolute Error (MAE): ${mae:.4f}")
print(f"Root Mean Squared Error (RMSE): ${rmse:.4f}")
print(f"Mean Absolute Percentage Error (MAPE): {mape:.4%}")
print(f"R-squared (R²): {r2:.6f}")
print(f"Adjusted R-squared: {1 - (1 - r2) * (len(y_test) - 1)/(len(y_test) - x_test_scaled.shape[1] - 1):.6f}")
print('='*50 + '\n')
The Output: A Professional Model Evaluation Report
==================================================
PROFESSIONAL MODEL EVALUATION REPORT
==================================================
Mean Absolute Error (MAE): $0.2172
Root Mean Squared Error (RMSE): $0.4721
Mean Absolute Percentage Error (MAPE): 4.0527%
R-squared (R²): 0.999943
Adjusted R-squared: 0.999943
==================================================
Observations:
MAE ($0.2172): On average, our predictions are off by just $0.22 per share—super small for stock prices that range up to $350!
RMSE ($0.4721): The root mean squared error is $0.47, showing that even when larger errors are penalized, our model remains accurate.
MAPE (4.0527%): Our predictions are off by 4.05% on average, which is fantastic for stock prediction across a wide price range.
R² (0.999943): Matches our earlier R² score, confirming our model explains nearly all the variance in Close prices.
Adjusted R² (0.999943): Almost identical to the R² score, which makes sense since we only have 5 features (Open, High, Low, Adj Close, Volume). The adjustment for the number of features barely changes the score, reinforcing our model’s quality.
Insight: These metrics are consistent with our previous evaluations, showing our Linear Regression model is incredibly accurate. The Adjusted R² matching the R² score tells us our model isn’t overfitting due to too many features—it’s genuinely a great fit for this data.
Fun Fact: Adjusted R² Saves the Day!
Did you know Adjusted R² is like a reality check for your model? While R² always increases as you add more features, Adjusted R² penalizes unnecessary complexity, ensuring your model isn’t just “cheating” by using too many predictors!
Real-Life Example
Imagine you’re presenting this report to a group of investors. With an MAE of $0.22 and a MAPE of 4.05%, you’d confidently tell them, “Our model can predict Microsoft stock prices with high precision—perfect for guiding your investment decisions!” This professional report would make you look like a data science rockstar.
Quiz Time!
Let’s test your evaluation skills, students!
What does a MAPE of 4.0527% mean in practical terms?
a) Predictions are off by 4.05% on average
b) The model fails 4.05% of the time
c) The stock price changes by 4.05% daily
Why is the Adjusted R² almost the same as the R² in this case?
a) The model is overfitting
b) We have a small number of features (5)
c) The data is fake
Drop your answers in the comments—I’d love to hear from you!
Cheat Sheet: Professional Model Evaluation
mean_absolute_error(y_true, y_pred): Average absolute error in dollars.
mean_squared_error(y_true, y_pred): Average squared error; use np.sqrt() for RMSE.
mean_absolute_percentage_error(y_true, y_pred): Average percentage error.
r2_score(y_true, y_pred): Measures overall fit (0 to 1).
Adjusted R²: 1 - (1 - r2) * (n - 1)/(n - p - 1) where n is the number of samples, p is the number of features.
Did You Know?
You can make this report even more professional by adding a metric like Explained Variance Score! Use explained_variance_score(y_test, y_pred) from sklearn.metrics to see how much of the variance in Close prices your model explains—it’ll likely be close to your R² score here.
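If you want to try that, here’s a small sketch that wraps Adjusted R² in a reusable helper and adds the Explained Variance Score, assuming y_test, y_pred, and x_test_scaled from the report above:
# A small sketch: Adjusted R² helper plus Explained Variance Score
# (assumes y_test, y_pred, and x_test_scaled from the report above)
from sklearn.metrics import explained_variance_score, r2_score

def adjusted_r2(y_true, y_hat, n_features):
    n = len(y_true)  # number of samples
    r2 = r2_score(y_true, y_hat)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

print(f"Adjusted R²: {adjusted_r2(y_test, y_pred, x_test_scaled.shape[1]):.6f}")
print(f"Explained Variance: {explained_variance_score(y_test, y_pred):.6f}")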
Diving into Residual Analysis to Perfect Our Model!
We’ve already celebrated our Linear Regression model’s stellar performance with metrics like an MAE of $0.2172 and an R² of 0.999943. But now, it’s time to put on our detective hats and check the model’s behavior under the microscope with residual analysis. This code block creates diagnostic plots to ensure our predictions are as reliable as they seem.
What’s Happening in This Code?
Let’s break it down like we’re exploring a treasure map:
Importing the Tools:
import matplotlib.pyplot as plt and import seaborn as sns: Our plotting superheroes for creating diagnostic visuals.
from scipy import stats: Provides statistical tools like the Q-Q plot for checking normality.
Calculating Residuals:
residuals = y_test - y_pred: Computes the residuals—the differences between the actual Close prices (y_test) and the predicted prices (y_pred). Residuals tell us how far off our predictions are from the truth.
Creating the Residual Diagnostics Plot:
plt.figure(figsize=(16, 12)) and plt.suptitle('RESIDUAL DIAGNOSTICS', fontsize=16): Sets up a large figure (16x12 inches) with a main title for four subplots.
Subplot 221 (Residuals vs Predicted Values):
sns.scatterplot(x=y_pred, y=residuals): Plots residuals against predicted values to check for patterns.
plt.axhline(y=0, color='r', linestyle='--'): Adds a red dashed line at y=0 to show where residuals should ideally cluster.
Labels the axes and adds a title.
Subplot 222 (Distribution of Residuals):
sns.histplot(residuals, kde=True): Creates a histogram of residuals with a Kernel Density Estimate (KDE) curve to visualize their distribution.
Labels the axes and adds a title.
Subplot 223 (Q-Q Plot of Residuals):
stats.probplot(residuals, dist="norm", plot=plt): Generates a Quantile-Quantile plot to check if residuals follow a normal distribution (a key assumption for linear regression).
Adds a title.
Subplot 224 (Residuals Over Time):
plt.plot(y_test.index, residuals, 'o-'): Plots residuals over time using the index of y_test (assumed to be dates if converted earlier).
plt.axhline(y=0, color='r', linestyle='--'): Adds the zero line again.
Includes a try-except block to handle cases where y_test.index isn’t a datetime index.
plt.tight_layout() and plt.show(): Adjusts spacing and displays the plot.
Why Are We Doing This?
Residual analysis helps us validate our model’s assumptions and spot issues:
Residuals vs Predicted Values: Checks for patterns (e.g., increasing errors as prices rise), indicating potential nonlinearity or heteroscedasticity (uneven variance).
Distribution of Residuals: A bell-shaped curve suggests residuals are normally distributed, supporting linear regression’s assumptions.
Q-Q Plot: Points aligning with the red line indicate normality.
Residuals Over Time: Shows if errors change over time, which could hint at trends or seasonality the model missed.
Here’s the code we’re working with:
# =============================================
# STEP 2: RESIDUAL ANALYSIS & DIAGNOSTICS
# =============================================
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Calculate residuals
residuals = y_test - y_pred
# Residual diagnostics plot
plt.figure(figsize=(16, 12))
plt.suptitle('RESIDUAL DIAGNOSTICS', fontsize=16)
# Residuals vs Predicted values
plt.subplot(221)
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residuals vs Predicted Values')
plt.xlabel('Predicted Stock Price')
plt.ylabel('Residuals')
# Residual distribution
plt.subplot(222)
sns.histplot(residuals, kde=True)
plt.title('Distribution of Residuals')
plt.xlabel('Residuals')
# Q-Q plot for normality check
plt.subplot(223)
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot of Residuals')
# Residuals over time (if you have datetime index)
try:
    plt.subplot(224)
    plt.plot(y_test.index, residuals, 'o-')
    plt.axhline(y=0, color='r', linestyle='--')
    plt.title('Residuals Over Time')
    plt.xlabel('Date')
    plt.ylabel('Residuals')
except:
    print("No datetime index found for time-based residual plot")
plt.tight_layout()
plt.show()
The Output:
Residual Diagnostics Plots
The figure titled "RESIDUAL DIAGNOSTICS" contains four plots. Here’s what we see:
Residuals vs Predicted Values (Top Left):
X-axis: Predicted stock prices (0 to 350).
Y-axis: Residuals (-4 to 4).
Blue dots are scattered around the red dashed line (y=0), with no clear pattern. The spread increases slightly at higher prices, but it’s mostly random, suggesting no strong heteroscedasticity.
Distribution of Residuals (Top Right):
X-axis: Residuals (-4 to 4).
Y-axis: Count.
The histogram with a KDE curve shows a roughly bell-shaped distribution, centered near zero, indicating residuals are approximately normally distributed.
Q-Q Plot of Residuals (Bottom Left):
X-axis: Theoretical quantiles.
Y-axis: Ordered residuals.
The blue points mostly follow the red diagonal line, especially in the middle, though they deviate slightly at the tails. This suggests residuals are mostly normal, with minor deviations.
Residuals Over Time (Bottom Right):
X-axis: Date (0 to 8000, likely representing days since 1986).
Y-axis: Residuals (-4 to 4).
The residuals fluctuate around the zero line with a cone-like spread, showing no obvious trend over time but increasing variance as time progresses.
Insight: The diagnostics look good! Random scatter in the residuals vs. predicted plot, a near-normal distribution, and a decent Q-Q fit suggest our Linear Regression model meets key assumptions. The residuals over time show increasing variance, which might indicate the model struggles more with recent, higher-priced data—but overall, it’s performing well.
Fun Fact: Residuals Are the Model’s “Oops Moments”
Did you know residuals are like the model’s confession of its mistakes? A random scatter around zero means it’s doing its job, while patterns could mean it’s missing something—like a hidden trend or outlier!
Real-Life Example
Imagine you’re a data analyst, presenting these diagnostics to a finance team. You’d say, “Our residuals are mostly random and normally distributed—our Linear Regression model is reliable for predicting Microsoft stock prices!” This could convince them to use your model for real trades.
Quiz Time!
Let’s test your residual skills, students!
What does a random scatter around the zero line in the residuals vs. predicted plot indicate?
a) The model is overfitting
b) The model’s errors are random, which is good
c) The model is underfitting
Why might the residuals over time show increasing variance?
a) The model is perfect
b) Higher stock prices in recent years amplify errors
c) The data is corrupted
Drop your answers in the comments—I’d love to hear from you!
Cheat Sheet: Residual Diagnostics
residuals = y_test - y_pred: Calculate residuals.
sns.scatterplot(x, y): Plot residuals vs. predicted values.
sns.histplot(data, kde=True): Visualize residual distribution.
stats.probplot(residuals, dist="norm", plot=plt): Create a Q-Q plot.
plt.plot(x, y, 'o-'): Plot residuals over time.
Did You Know?
You can test for heteroscedasticity statistically! Use a Breusch-Pagan test from statsmodels (import statsmodels.api as sm; sm.stats.diagnostic.het_breuschpagan) to confirm if the increasing variance in residuals over time is significant. It’s a pro move for your next analysis!
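Here’s a minimal sketch of that test, assuming residuals and x_test_scaled from the steps above (and that statsmodels is installed):
# A minimal sketch of the Breusch-Pagan test for heteroscedasticity
# (assumes residuals and x_test_scaled from the steps above)
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

exog = sm.add_constant(x_test_scaled)  # the test needs an intercept column
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, exog)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")  # p < 0.05 suggests heteroscedasticity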
So, our model’s residuals look promising—great work! Next, we can predict future stock prices, dive into time series forecasting with LSTM, or refine our model based on these diagnostics.
Benchmarking Our Model Against the Competition!
We’ve already confirmed our Linear Regression model’s stellar performance with metrics, cross-validation, and residual diagnostics. But how does it stack up against other popular models?
In this code block, we’re benchmarking our champion against Random Forest, Gradient Boosting, and Support Vector Regression to see who reigns supreme in predicting Microsoft’s stock Close prices. Let’s break down the code and explore the output.
What’s Happening in This Code?
Let’s break it down like we’re hosting a model showdown:
Importing the Contenders:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor: Imports Random Forest and Gradient Boosting, two powerful ensemble methods.
from sklearn.svm import SVR: Imports Support Vector Regression (SVR), a model that uses support vectors to predict continuous values.
Setting Up the Models:
A dictionary models is created with four models:
"Linear Regression": lr: Our champion from earlier, already trained.
"Random Forest": RandomForestRegressor(n_estimators=100, random_state=42): A Random Forest with 100 trees, ensuring reproducibility with random_state=42.
"Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42): A Gradient Boosting model with 100 trees, also reproducible.
"Support Vector": SVR(kernel='rbf'): An SVR model with a radial basis function (RBF) kernel, a common choice for regression tasks.
Benchmarking Loop:
for name, model in models.items(): Loops through each model.
if name != "Linear Regression": model.fit(x_train_scaled, y_train): Trains each model on the scaled training data, except for Linear Regression (already trained).
preds = model.predict(x_test_scaled): Makes predictions on the test set. We use a local name (preds) so we don’t overwrite y_pred, the Linear Regression predictions that later steps still rely on.
rmse = np.sqrt(mean_squared_error(y_test, preds)): Calculates the Root Mean Squared Error.
r2 = r2_score(y_test, preds): Calculates the R² score.
print(f"{name:<20} | RMSE: ${rmse:.4f} | R²: {r2:.6f}"): Prints the model name, RMSE, and R² in a formatted way.
Why Are We Doing This?
Benchmarking compares our Linear Regression model against other algorithms to see if we can do better. Random Forest and Gradient Boosting are ensemble methods that might capture complex patterns, while SVR could handle non-linear relationships. If another model outperforms Linear Regression, we might switch to it for future predictions!
Here’s the code we’re working with:
# =============================================
# STEP 3: MODEL PERFORMANCE BENCHMARKING
# =============================================
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
models = {
"Linear Regression": lr,
"Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
"Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
"Support Vector": SVR(kernel='rbf')
}
print("\nMODEL COMPARISON BENCHMARK")
print("-"*50)
for name, model in models.items():
    if name != "Linear Regression":
        model.fit(x_train_scaled, y_train)
    # Use a local name so we don't overwrite the Linear Regression y_pred used in later steps
    preds = model.predict(x_test_scaled)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    r2 = r2_score(y_test, preds)
    print(f"{name:<20} | RMSE: ${rmse:.4f} | R²: {r2:.6f}")
print("-"*50 + "\n")
The Output:
MODEL COMPARISON BENCHMARK
--------------------------------------------------
Linear Regression    | RMSE: $0.4721 | R²: 0.999943
Random Forest        | RMSE: $0.2816 | R²: 0.999980
Gradient Boosting    | RMSE: $0.5171 | R²: 0.999932
Support Vector       | RMSE: $10.5028 | R²: 0.971746
--------------------------------------------------
Observations:
Random Forest: Steals the show with the lowest RMSE ($0.2816) and highest R² (0.999980)—even better than Linear Regression! It’s incredibly accurate.
Linear Regression: Still fantastic with an RMSE of $0.4721 and R² of 0.999943, very close to Random Forest.
Gradient Boosting: Also performs well with an RMSE of $0.5171 and R² of 0.999932, but slightly behind Linear Regression.
Support Vector (SVR): Lags behind with a much higher RMSE ($10.5028) and lower R² (0.971746). SVR struggles here, likely because the default RBF kernel isn’t tuned for this data.
Insight: Random Forest takes the crown with the best performance, though Linear Regression is a close second. SVR’s poor performance suggests it’s not the right fit for this dataset without further tuning. This benchmark confirms we might want to consider Random Forest for future predictions!
Fun Fact: Random Forest’s Superpower
Did you know Random Forest is like a team of decision-makers? It builds 100 trees (in this case) and averages their predictions, reducing overfitting and capturing complex patterns—making it a favorite for tasks like stock prediction!
Real-Life Example
Imagine you’re a data scientist, advising a hedge fund. You’d tell them, “Random Forest outperforms our Linear Regression with an RMSE of just $0.28—let’s use it for our next stock predictions!” This benchmark could guide real investment strategies.
Quiz Time!
Let’s test your benchmarking skills, students!
Which model performed the best in this benchmark?
a) Linear Regression
b) Random Forest
c) Support Vector
Why might SVR have performed so poorly compared to the others?
a) It’s always a bad model
b) Its default parameters aren’t tuned for this data
c) The data is missing
Drop your answers in the comments—I’d love to hear from you!
Cheat Sheet: Model Benchmarking
RandomForestRegressor(n_estimators=100): Builds a forest with 100 trees.
GradientBoostingRegressor(n_estimators=100): Boosts 100 trees sequentially.
SVR(kernel='rbf'): Uses a radial basis function kernel for regression.
np.sqrt(mean_squared_error()): Calculates RMSE.
r2_score(): Measures model fit (0 to 1).
Did You Know?
You can tune SVR to improve its performance! Try using GridSearchCV from sklearn to find the best parameters (e.g., C and gamma) for the RBF kernel—it might bring SVR closer to the others in future benchmarks.
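Here’s a small sketch of what that tuning could look like, assuming x_train_scaled and y_train from Part 1 (the parameter grid is only an illustrative starting point):
# A small sketch of tuning SVR with GridSearchCV (assumes x_train_scaled and
# y_train; the grid below is only an illustrative starting point)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {'C': [1, 10, 100], 'gamma': ['scale', 0.01, 0.1]}
grid = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=5, scoring='r2', n_jobs=-1)
grid.fit(x_train_scaled, y_train)
print('Best parameters ---> ', grid.best_params_)
print('Best cross-validated R² ---> ', grid.best_score_)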
Unlocking the Secrets of Feature Importance!
Now, it’s time to peel back the curtain and discover which features are the real MVPs in predicting Microsoft’s stock Close prices. This code block analyzes feature importance using Linear Regression coefficients, giving us a peek into what drives our predictions.
What’s Happening in This Code?
Let’s break it down like we’re investigating a lineup of suspects:
Checking Model Type:
if hasattr(lr, 'coef_'): Checks if our Linear Regression model (lr) has coefficients (coef_), which are available for linear models. Since Linear Regression is a linear model, this block will execute.
elif hasattr(lr, 'feature_importances_'): This would apply to tree-based models like Random Forest, but it’s skipped here since lr is Linear Regression.
Extracting Feature Importance for Linear Regression:
feature_importance = pd.DataFrame({'Feature': x_train.columns, 'Coefficient': lr.coef_}): Creates a DataFrame with feature names (from x_train.columns) and their coefficients (lr.coef_), which show how much each feature influences the prediction.
.sort_values('Coefficient', key=abs, ascending=False): Sorts the DataFrame by the absolute value of coefficients to highlight the most impactful features.
Creating the Visualization:
plt.figure(figsize=(12, 8)): Sets up a figure 12 inches wide and 8 inches tall.
sns.barplot(x='Coefficient', y='Feature', data=feature_importance): Plots a bar chart with coefficients on the x-axis and feature names on the y-axis.
plt.title('Feature Importance (Linear Regression Coefficients)'): Adds a title.
plt.axvline(x=0, color='k', linestyle='--'): Draws a vertical dashed line at zero to separate positive (increasing Close) and negative (decreasing Close) coefficients.
plt.tight_layout() and plt.show(): Adjusts spacing and displays the plot.
Why Are We Doing This?
Feature importance helps us understand which stock features (Open, High, Low, Adj Close, Volume) most affect our Close price predictions. For Linear Regression, coefficients indicate the strength and direction of each feature’s impact. This insight can guide us in refining our model or selecting features for future predictions!
Here’s the code we’re working with:
# STEP 4: FEATURE IMPORTANCE ANALYSIS
# For linear models
if hasattr(lr, 'coef_'):
    feature_importance = pd.DataFrame({
        'Feature': x_train.columns,
        'Coefficient': lr.coef_
    }).sort_values('Coefficient', key=abs, ascending=False)
    plt.figure(figsize=(12, 8))
    sns.barplot(x='Coefficient', y='Feature', data=feature_importance)
    plt.title('Feature Importance (Linear Regression Coefficients)')
    plt.axvline(x=0, color='k', linestyle='--')
    plt.tight_layout()
    plt.show()

# For tree-based models
elif hasattr(lr, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'Feature': x_train.columns,
        'Importance': lr.feature_importances_
    }).sort_values('Importance', ascending=False)
    plt.figure(figsize=(12, 8))
    sns.barplot(x='Importance', y='Feature', data=feature_importance)
    plt.title('Feature Importance')
    plt.tight_layout()
    plt.show()
The Output:
Feature Importance Bar Chart
Take a look at the output plot, titled "Feature Importance (Linear Regression Coefficients)": it shows the coefficients for each feature. Here’s what we see:
X-Axis: Coefficient values, ranging from -20 to 40.
Y-Axis: Feature names (Open, Adj Close, Low, High, Volume).
Bars:
High: A long blue bar extending to around 35, indicating a strong positive effect on Close prices.
Low: An orange bar around 30, also a strong positive effect.
Open: A green bar around -15, suggesting a negative effect on Close.
Adj Close: A red bar near 0 (slightly negative), with a minimal impact.
Volume: A tiny bar near 0, indicating almost no effect.
Zero Line: The black dashed line at 0 separates positive and negative coefficients.
Insight: High and Low are the most influential features, with large positive coefficients, meaning higher daily highs and lows strongly predict higher closing prices. Open has a notable negative coefficient, possibly due to multicollinearity with other price features. Adj Close and Volume have negligible effects, suggesting they add little unique value in this model.
Fun Fact: Coefficients Tell a Story
Did you know in Linear Regression, a positive coefficient (like for High) means that as the feature increases, the predicted Close price increases too? A negative coefficient (like for Open) suggests the opposite—fascinating how these relationships play out in stock data!
Real-Life Example
Imagine you’re a financial analyst, advising a client. You’d say, “Our analysis shows the daily high and low prices are key drivers of Microsoft’s closing price—focus your strategy there!” This insight could guide their trading decisions.
Quiz Time!
Let’s test your feature importance skills, students!
What does a positive coefficient (like for High) indicate?
a) The feature decreases the Close price
b) The feature increases the Close price
c) The feature has no effect
Why might Volume have a tiny coefficient?
a) It’s not important for stock prediction
b) It’s highly correlated with other features
c) The data is missing
Drop your answers in the comments—I’d love to hear from you!
Cheat Sheet: Feature Importance for Linear Regression
lr.coef_: Accesses the coefficients of a Linear Regression model.
pd.DataFrame({'Feature': cols, 'Coefficient': coef}): Creates a DataFrame for analysis.
sns.barplot(x='Coefficient', y='Feature', data=df): Plots feature importance.
plt.axvline(x=0): Adds a zero reference line.
Did You Know?
For tree-based models like Random Forest, you can use feature_importances_ instead of coef_. It measures how much each feature reduces impurity across trees—try it with our Random Forest champion next time!
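Here’s a quick sketch of that idea, assuming the Random Forest fitted during the benchmark step is still available in the models dictionary:
# A quick sketch of Random Forest feature importance (assumes the models dict
# from the benchmark step, where "Random Forest" was fitted)
import pandas as pd

rf = models["Random Forest"]
rf_importance = pd.DataFrame({
    'Feature': x_train.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)
print(rf_importance)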
What do you think of this feature importance analysis, viewers? Surprised by High and Low’s dominance? Drop your thoughts in the comments, and let’s keep exploring together!
Visualizing Our Predictions: Actual vs. Predicted Prices!
We’ve already benchmarked models, analyzed feature importance, and confirmed our Linear Regression model’s reliability. Now, it’s time to bring our predictions to life with a stunning visualization that compares actual Microsoft stock Close prices to our model’s predictions, complete with a confidence interval. Let’s break down the code and explore the output.
What’s Happening in This Code?
Let’s break it down like we’re painting a masterpiece:
Setting Up the Plot:
plt.figure(figsize=(16, 8)): Creates a large figure (16 inches wide, 8 inches tall) for a clear visualization.
Plotting Actual and Predicted Prices:
plt.plot(y_test.index, y_test, 'b-', label='Actual Price', alpha=0.7): Plots the actual Close prices (y_test) against their index (assumed to be dates), using a solid blue line with slight transparency (alpha=0.7).
plt.plot(y_test.index, y_pred, 'r--', label='Predicted Price', alpha=0.9): Plots the predicted prices (y_pred from our Linear Regression model) as a red dashed line with higher visibility (alpha=0.9).
Adding a Confidence Interval:
residual_std = np.std(residuals): Calculates the standard deviation of the residuals (from earlier: residuals = y_test - y_pred), which measures prediction error variability.
plt.fill_between(y_test.index, y_pred - 1.96*residual_std, y_pred + 1.96*residual_std, color='r', alpha=0.1, label='95% Confidence Interval'): Adds a shaded 95% confidence interval around the predictions. The 1.96 corresponds to the z-score for a 95% confidence level in a normal distribution, meaning 95% of predictions should fall within this band if errors are normally distributed.
Customizing the Plot:
plt.title('Microsoft Stock Price: Actual vs Predicted'): Sets the title.
plt.xlabel('Date') and plt.ylabel('Stock Price ($)'): Labels the axes.
plt.legend(): Adds a legend to distinguish actual prices, predicted prices, and the confidence interval.
plt.grid(True, linestyle='--', alpha=0.7): Adds a dashed grid for readability.
plt.tight_layout() and plt.show(): Adjusts spacing and displays the plot.
Why Are We Doing This?
This visualization lets us compare our model’s predictions to the actual stock prices over time, showing how close we are and where we might miss. The confidence interval adds a layer of realism, showing the range where we expect most predictions to fall, which is crucial for real-world applications like trading or investment planning.
Here’s the code we’re working with:
# =============================================
# STEP 5: FORECAST VISUALIZATION
# =============================================
# Create time series plot comparing actual vs predicted
plt.figure(figsize=(16, 8))
plt.plot(y_test.index, y_test, 'b-', label='Actual Price', alpha=0.7)
plt.plot(y_test.index, y_pred, 'r--', label='Predicted Price', alpha=0.9)
# Add confidence interval (using standard deviation of residuals)
residual_std = np.std(residuals)
plt.fill_between(y_test.index,
y_pred - 1.96*residual_std,
y_pred + 1.96*residual_std,
color='r', alpha=0.1, label='95% Confidence Interval')
plt.title('Microsoft Stock Price: Actual vs Predicted')
plt.xlabel('Date')
plt.ylabel('Stock Price ($)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
The Output:
Actual vs. Predicted Prices Plot
Take a look at the output plot, titled "Microsoft Stock Price: Actual vs Predicted": it shows the comparison over time. Here’s what we see:
X-Axis: Date (represented as indices from 0 to around 8000, likely days since 1986).
Y-Axis: Stock Price ($) (0 to 350).
Blue Line (Actual Price): The actual Close prices, showing Microsoft’s stock price trajectory with dips, rises, and a sharp increase after index 6000 (around 2018).
Red Dashed Line (Predicted Price): Our Linear Regression model’s predictions, almost perfectly overlapping the blue line—so close you can barely tell them apart!
Red Shaded Area (95% Confidence Interval): A faint red band around the predicted line, showing where 95% of predictions should fall. It’s very narrow, reflecting the low variability in our residuals.
Legend and Grid: Clearly labels the lines and includes a grid for readability.
Insight: The actual and predicted lines are nearly indistinguishable, confirming our model’s high accuracy (R² of 0.999943). The narrow confidence interval (thanks to a small residual_std) shows our predictions are highly reliable, with minimal uncertainty. This plot is a testament to how well our model captures Microsoft’s stock price patterns!
Fun Fact: Confidence Intervals in the Real World
Did you know confidence intervals are used in weather forecasts too? Just like we predict stock prices with a range of uncertainty, meteorologists predict temperatures (e.g., “72°F ± 3°F”) to account for variability—pretty similar to our stock price confidence band!
Real-Life Example
Imagine you’re a stock trader, planning your next move. This plot would give you confidence to trust the model’s predictions, knowing the actual prices fall within the tight 95% confidence interval. You might use this to predict Microsoft’s price tomorrow and make a smart trade!
Quiz Time!
Let’s test your visualization skills, students!
What does the narrow 95% confidence interval tell us?
a) Our predictions have high uncertainty
b) Our predictions are very reliable with low uncertainty
c) The model is overfitting
Why are the actual and predicted lines so close together?
a) The data is fake
b) The model’s high R² (0.999943) means it’s very accurate
c) The plot is zoomed in too much
Drop your answers in the comments—I’d love to hear from you!
Cheat Sheet: Time Series Visualization
plt.plot(x, y, 'b-', label=''): Plots a solid line (e.g., actual prices).
plt.plot(x, y, 'r--', label=''): Plots a dashed line (e.g., predicted prices).
plt.fill_between(x, lower, upper, color='r', alpha=0.1): Adds a shaded confidence interval.
np.std(residuals): Calculates the standard deviation of residuals.
plt.legend(): Adds a legend to the plot.
Did You Know?
You can make this plot even more informative by adding a rolling average! Try plt.plot(y_test.index, y_test.rolling(window=30).mean(), 'g-', label='30-Day Avg') to show the 30-day moving average of actual prices—it’ll highlight long-term trends.
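Here’s a small sketch of that overlay, assuming y_test is a pandas Series; we sort it by index first so the rolling window follows time order:
# A small sketch of a 30-day rolling average overlay (assumes y_test is a
# pandas Series; sorting by index keeps the rolling window in time order)
import matplotlib.pyplot as plt

y_sorted = y_test.sort_index()
plt.figure(figsize=(16, 8))
plt.plot(y_sorted.index, y_sorted, 'b-', label='Actual Price', alpha=0.5)
plt.plot(y_sorted.index, y_sorted.rolling(window=30).mean(), 'g-', label='30-Day Avg')
plt.legend()
plt.show()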
Pro Tip:
This plot is a showstopper. Our Linear Regression model nails Microsoft stock price predictions, with actual and predicted prices nearly identical and a tight 95% confidence interval.
Also, note that while this looks amazing, real-world forecasting might need time series models like LSTM to capture trends better.
Professional Checks
Economic Significance and Walk-Forward Validation!
We’ve already visualized our predictions, analyzed feature importance, and benchmarked our models, with Random Forest and Linear Regression leading the pack. Now, we’re diving into some advanced professional checks to ensure our Linear Regression model isn’t just statistically sound but also economically meaningful and robust over time. Let’s break down the code and explore the output.
What’s Happening in This Code?
Let’s break it down like we’re conducting a financial audit:
Economic Significance Test (Section 7.1):
benchmark_return = np.mean(np.diff(y_test)): Calculates the average daily return of the actual Close prices by taking the difference between consecutive prices (np.diff) and averaging them. This is our baseline.
model_return = np.mean(np.diff(y_pred)): Calculates the average daily return of the predicted prices in the same way.
The print statements display:
The benchmark return.
The model’s predicted return.
The improvement (or decline) in return, both in dollars and as a percentage.
Walk-Forward Validation (Section 7.2):
from sklearn.model_selection import TimeSeriesSplit: Imports TimeSeriesSplit, a cross-validation method designed for time series data.
tscv = TimeSeriesSplit(n_splits=5): Sets up a 5-fold walk-forward validation, where each fold trains on past data and tests on future data, respecting the temporal order.
for train_index, test_index in tscv.split(x_train_scaled): Loops through the 5 splits.
X_train, X_test = x_train_scaled[train_index], x_train_scaled[test_index]: Splits the features.
y_train_fold, y_test_fold = y_train.iloc[train_index], y_train.iloc[test_index]: Splits the target variable, ensuring alignment with the feature splits.
lr.fit(X_train, y_train_fold): Trains the Linear Regression model on the training fold.
score = lr.score(X_test, y_test_fold): Calculates the R² score on the test fold.
walk_forward_scores.append(score): Stores the score.
print statements display the individual scores and their mean.
Why Are We Doing This?
Economic Significance Test: Ensures our model’s predictions are not just statistically accurate but also economically meaningful. If the predicted daily returns align with actual returns, our model could be useful for trading strategies.
Walk-Forward Validation: Tests the model’s robustness over time, mimicking how we’d use it in real-world forecasting by training on past data and predicting future data. This is crucial for time series data like stock prices, where random splits can lead to overfitting.
Here’s the code we’re working with
# =============================================
# STEP 7: ADDITIONAL PROFESSIONAL CHECKS
# =============================================
# 7.1 Economic significance test
benchmark_return = np.mean(np.diff(y_test)) # Average daily return
model_return = np.mean(np.diff(y_pred))
print("\nECONOMIC SIGNIFICANCE ANALYSIS")
print("-"*50)
print(f"Benchmark Average Daily Return: ${benchmark_return:.4f}")
print(f"Model Predicted Average Daily Return: ${model_return:.4f}")
print(f"Prediction Improvement: ${model_return - benchmark_return:.4f} "
f"({(model_return - benchmark_return)/benchmark_return:.2%})")
print("-"*50)
# 7.2 Walk-forward validation
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
walk_forward_scores = []
for train_index, test_index in tscv.split(x_train_scaled):
    X_train, X_test = x_train_scaled[train_index], x_train_scaled[test_index]
    y_train_fold, y_test_fold = y_train.iloc[train_index], y_train.iloc[test_index]
    lr.fit(X_train, y_train_fold)
    score = lr.score(X_test, y_test_fold)
    walk_forward_scores.append(score)
print("\nWALK-FORWARD VALIDATION SCORES:")
print(walk_forward_scores)
print(f"Mean R²: {np.mean(walk_forward_scores):.6f}")
The Output: Economic Significance and Walk-Forward Results
ECONOMIC SIGNIFICANCE ANALYSIS
--------------------------------------------------
Benchmark Average Daily Return: $-0.0113
Model Predicted Average Daily Return: $-0.0114
Prediction Improvement: $-0.0001 (1.26%)
--------------------------------------------------
WALK-FORWARD VALIDATION SCORES:
[0.9999291258594004, 0.9999543013872372, 0.9999308648546309, 0.9999384730281284, 0.9998654302223667]
Mean R²: 0.999924
Observations:
Economic Significance:
Benchmark Average Daily Return: $-0.0113, meaning the actual stock price dropped by $0.0113 on average per day in the test set.
Model Predicted Average Daily Return: $-0.0114, very close to the actual return.
Prediction Improvement: $-0.0001, reported as 1.26% (the percentage comes out positive because both returns are negative). The model predicts a slightly more negative daily return than the actual, but the difference is tiny.
Walk-Forward Validation:
Scores: [0.999929, 0.999954, 0.999931, 0.999938, 0.999865]. Each fold’s R² score is extremely high and consistent.
Mean R²: 0.999924, very close to our earlier test R² (0.999943) and cross-validation mean (0.999930), showing the model performs well in a time series context.
Insight: The economic significance test shows our model captures daily returns very closely, with a negligible difference ($0.0001). The walk-forward validation confirms robustness, as the high and consistent R² scores indicate the model generalizes well across time periods. Our Linear Regression model is ready for real-world forecasting!
Fun Fact: Walk-Forward Validation in Trading
Did you know walk-forward validation is a favorite in algorithmic trading? It mimics how traders use historical data to predict future prices, ensuring models don’t “cheat” by looking at future data during training—keeping predictions realistic!
Real-Life Example
Imagine you’re a financial analyst advising a hedge fund. You’d say, “Our model predicts daily returns within $0.0001 of the actual, and walk-forward validation gives a mean R² of 0.999924—it’s ready for live trading!” This could convince them to deploy your model in their strategy.
Quiz Time!
Let’s test your professional skills, students!
What does the walk-forward validation’s high mean R² (0.999924) indicate?
a) The model is overfitting
b) The model generalizes well across time
c) The model is underfitting
Why is the economic significance test important?
a) It checks if predictions are economically useful
b) It calculates the stock price
c) It plots the data
Drop your answers in the comments—I’d love to hear from you!
Cheat Sheet: Professional Checks
np.diff(data): Calculates differences between consecutive values (e.g., daily returns).
np.mean(data): Computes the average.
TimeSeriesSplit(n_splits=5): Sets up walk-forward validation for time series.
tscv.split(X): Splits data into train/test folds, respecting time order.
model.score(X, y): Calculates the R² score for a fold.
Did You Know?
You can extend the economic significance test by calculating annualized returns! Multiply the daily return by 252 (the number of trading days in a year) to estimate yearly performance—try it to see how your model performs on a larger scale.
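Here’s a tiny sketch of that back-of-the-envelope calculation, assuming benchmark_return and model_return from the code above (these are average daily dollar changes, so this is a rough estimate, not a compounded return):
# A tiny sketch of annualizing the average daily changes (assumes
# benchmark_return and model_return from above; 252 trading days per year)
annual_benchmark = benchmark_return * 252
annual_model = model_return * 252
print(f"Annualized actual change:    ${annual_benchmark:.2f} per share")
print(f"Annualized predicted change: ${annual_model:.2f} per share")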
Note:
Our model captures daily returns within $0.0001 of the actual, and walk-forward validation confirms its robustness with a mean R² of 0.999924!
Also, note that the slight underestimation in returns might be worth exploring with more features like market sentiment data.
Conclusion
Wrapping Up A Stock Prediction Masterpiece!
Wow, what an incredible journey we’ve shared, my amazing viewers and students! Part 2 of our "Microsoft Stock Price Analysis & Predictions" blog series has taken us to new heights in data science. We started by visualizing our Linear Regression model’s near-perfect predictions, with actual and predicted Microsoft stock prices overlapping like best friends. We dug deep with cross-validation, residual diagnostics, and professional metrics, confirming our model’s reliability with an MAE of just $0.2172 and a mean R² of 0.999924. We even benchmarked against heavyweights like Random Forest (our new champion with an RMSE of $0.2816!), analyzed feature importance (shoutout to High and Low!), and validated our model’s economic significance and robustness over time. It’s been a thrilling ride.
I’m so proud of how far we’ve come together—turning raw stock data into actionable insights with the power of machine learning. Your enthusiasm and curiosity have made this adventure truly special, and I can’t wait to see where we go next!
What’s Next? More Data Science Magic Awaits!
Hold onto your excitement because the best is yet to come! In our upcoming blogs and videos on our YouTube channel, www.youtube.com/@cognitutorai, we’ll dive even deeper into the world of AI and data science. Expect to see:
New projects on topics like sentiment analysis, image recognition, and even building AI chatbots—perfect for all you aspiring tech wizards!
Interactive tutorials and live coding sessions on YouTube, where we’ll tackle real-world problems together and answer your burning questions.
Make sure to subscribe to www.youtube.com/@cognitutorai, hit that notification bell, and join our growing community of learners. Whether you’re tuning in from London, Paris, New York or dreaming of a Caribbean adventure, let’s keep exploring the limitless possibilities of AI together. Drop your favorite moment from Part 2 in the comments, and tell me what you’re most excited for next—I can’t wait to hear from you! See you in the next chapter of our data science saga!