Laptop Price Predictor: An End-To-End AI/ML/Data Science Project




Introduction

Welcome to this hands-on project tutorial! If you're a beginner in Machine Learning, Artificial Intelligence (AI), or Data Science, this is the perfect place to start. 

In this blog post, we'll build a Laptop Price Predictor from scratch: an end-to-end project that covers everything from data collection & cleaning to model training & deployment. 


Why This Project?

Buying a laptop can be confusing: prices vary based on brand, processor, RAM, storage, and other specs. Wouldn’t it be great if we could predict the price of a laptop based on its features? That’s exactly what we’ll do!  


What You’ll Learn


Data Collection & Cleaning:

How to prepare real-world data for analysis.  

Exploratory Data Analysis (EDA):

Understanding trends and patterns in laptop pricing.  

Feature Engineering:

Transforming raw data into meaningful features.  

Machine Learning Modeling:

Training models to predict prices accurately.  

Model Deployment:

Making your model accessible via a web app (optional).  


By the end, you’ll have a fully functional ML project that you can showcase in your portfolio.  


Who Is This For? 

  • Absolute beginners in AI/ML/Data Science.

  • Students looking for a practical, real-world project.  

  • Aspiring data scientists who want to understand the end-to-end workflow.  


No prior experience? No problem! I’ll guide you step by step.  


Ready to dive in? Let’s get started! 🚀  



Start by importing libraries:



import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import warnings

warnings.filterwarnings('ignore')


df = pd.read_csv('/kaggle/input/laptop-data/laptop_data.csv')

df.head()

Output:

Let's break down this code step by step so you understand what each line does:  


1. Importing Libraries  




pandas as pd → This library helps us work with data tables (like Excel sheets).

  - Example: Reading, cleaning, and analyzing data.  

numpy as np → Used for numerical calculations (math operations on data).  

matplotlib.pyplot as plt → Helps in plotting graphs & charts (visualizing data).  

seaborn as sns → Makes statistical graphs more attractive and easier to create.  

warnings.filterwarnings('ignore') → Stops Python from showing unnecessary warning messages (keeps the output clean).  


💡 Think of these like "tools in a toolbox", each has a special function for data science!  


2. Loading the Dataset

  • `pd.read_csv()` → Reads a CSV file (a spreadsheet-like file) and turns it into a Pandas DataFrame (`df`).

  • `/kaggle/input/laptop-data/laptop_data.csv` → This is the file path where our laptop dataset is stored.  


📌 Note: If you're running this locally, replace the path with your file location (e.g., `'laptop_data.csv'`).  


`df.head()` → Shows the first 5 rows of the dataset (helps us peek at the data structure).  


🔍 Why is this useful?  

> - Checks if data is loaded correctly.  

> - Gives a quick look at columns (features) like Brand, RAM, Storage, Price, etc. 


Now let’s check whether the values in this dataset are categorical or numerical. To confirm this, we will use the `df.info()` function:


Code:

df.info()

Output:


<class 'pandas.core.frame.DataFrame'>

RangeIndex: 1303 entries, 0 to 1302

Data columns (total 11 columns):

 #   Column            Non-Null Count  Dtype  

---  ------            --------------  -----  

 0   Company           1303 non-null   object 

 1   TypeName          1303 non-null   object 

 2   Inches            1303 non-null   float64

 3   ScreenResolution  1303 non-null   object 

 4   Cpu               1303 non-null   object 

 5   Ram               1303 non-null   object 

 6   Memory            1303 non-null   object 

 7   Gpu               1303 non-null   object 

 8   OpSys             1303 non-null   object 

 9   Weight            1303 non-null   object 

 10  Price             1303 non-null   float64

dtypes: float64(2), object(9)

memory usage: 112.1+ KB

Understanding `df.info()` & Why We Convert Data Types


Let’s break down what `df.info()` tells us and why we need to convert some columns (like `Ram`, `Weight`) from text (`object`) to numbers for machine learning.



1. What Does `df.info()` Do?

This function gives a quick overview of the dataset:  

  • Total rows and columns  

  • Column names  

  • Number of non-null (non-empty) values  

  • Data types (`object`, `float64`, etc.)  

  • Memory usage  


Key Observations

  • 9 columns are `object` (text) → need cleaning/conversion.  

  • 2 columns are `float64` (numbers) → already numeric (`Inches`, `Price`).  



2. Why Convert Text (`object`) to Numbers?

Machine learning models only understand numbers, not text. For example:  

- ❌ `Ram = "8GB"` (Text) → Model can’t use this directly.  

- ✅ `Ram = 8` (Number) → Model can process this.  


Columns That Need Conversion

1. `Ram` → Extract numbers (e.g., `"8GB"` → `8`).  

2. `Weight` → Remove "kg" (e.g., `"2.5kg"` → `2.5`).  

3. `Company`, `OpSys`, etc. → Convert categories to numbers (e.g., `"Dell"=0`, `"HP"=1`).  


3. How Will We Convert Them? 

Later, we’ll use techniques like:  

  • `.str.replace()` → Remove units (`GB`, `kg`).  

  • `pd.to_numeric()` → Convert text numbers to actual numbers.  

  • One-Hot Encoding → For categories (e.g., `Company`, `TypeName`).  
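To make these techniques concrete before we apply them to the real data, here is a tiny sketch on a hypothetical mini-frame (the values are made up; the actual cleaning happens below):

```python
import pandas as pd

# Tiny stand-in for the real dataset (values are hypothetical)
sample = pd.DataFrame({
    'Ram': ['8GB', '16GB', '4GB'],
    'Weight': ['2.5kg', '1.8kg', '2.1kg'],
    'Company': ['Dell', 'HP', 'Dell'],
})

# .str.replace() strips the unit text, pd.to_numeric() converts what's left
sample['Ram'] = pd.to_numeric(sample['Ram'].str.replace('GB', ''))
sample['Weight'] = pd.to_numeric(sample['Weight'].str.replace('kg', ''))

# One-hot encoding gives each category its own 0/1 column
sample = sample.join(pd.get_dummies(sample['Company']).astype(int))

print(sample[['Ram', 'Weight', 'Dell', 'HP']])
```

The same three moves, applied to the full dataset, are exactly what the rest of this section does.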




Now let’s start converting these text columns into numbers, beginning with `Ram` and `Weight`:


df.Ram = df.Ram.str.replace('GB','')

df.Ram = df.Ram.astype('int32')

df.Weight = df.Weight.str.replace('kg','')

df.Weight = df.Weight.astype('float32')

df.info()


Output:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 1303 entries, 0 to 1302

Data columns (total 11 columns):

 #   Column            Non-Null Count  Dtype  

---  ------            --------------  -----  

 0   Company           1303 non-null   object 

 1   TypeName          1303 non-null   object 

 2   Inches            1303 non-null   float64

 3   ScreenResolution  1303 non-null   object 

 4   Cpu               1303 non-null   object 

 5   Ram               1303 non-null   int32  

 6   Memory            1303 non-null   object 

 7   Gpu               1303 non-null   object 

 8   OpSys             1303 non-null   object 

 9   Weight            1303 non-null   float32

 10  Price             1303 non-null   float64

dtypes: float32(1), float64(2), int32(1), object(7)

memory usage: 101.9+ KB

Cleaning Text Data for Machine Learning

Converting "8GB" to 8 and "2.5kg" to 2.5  


Let's break down these four lines of code to understand how we're converting text-based columns (`Ram` and `Weight`) into numerical values that our machine learning model can use.



1. Cleaning the `Ram` Column

Line 1: Removing the 'GB' Unit


df.Ram = df.Ram.str.replace('GB', '')

What it does:

  - The `Ram` column contains entries like `"8GB"`, `"16GB"`, etc.  

  - `.str.replace('GB', '')` removes the `"GB"` text, leaving only the number (e.g., `"8GB"` → `"8"`).  

Why?

Machine learning models need pure numbers, not text.  


Line 2: Converting to Integer


df.Ram = df.Ram.astype('int32')


What it does:

  - Converts the cleaned text (e.g., `"8"`) into an integer (whole number).  

  - `int32` is a memory-efficient data type for storing integers.  

Why? 

  - Ensures the `Ram` column is now numerical (e.g., `8`, `16`) instead of text (`"8GB"`).  


2. Cleaning the `Weight` Column

Removing the 'kg' Unit

df.Weight = df.Weight.str.replace('kg', '')

What it does:  

  - The `Weight` column has entries like `"2.5kg"`, `"1.8kg"`, etc.  

  - `.str.replace('kg', '')` removes `"kg"`, leaving just the number (e.g., `"2.5kg"` → `"2.5"`).  


Converting to Float (Decimal Number)


df.Weight = df.Weight.astype('float32')


What it does:

  - Converts the cleaned text (e.g., `"2.5"`) into a floating-point number (decimal).  

  - `float32` is a memory-efficient way to store decimals.  

Why?

  - Ensures `Weight` is now a numerical column (e.g., `2.5`, `1.8`) instead of text (`"2.5kg"`).  


Visual Example (Before & After)

| Column | Before Cleaning | After Cleaning |

|--------|----------------|----------------|

| Ram | `"8GB"` | `8` (int) |

| Weight | `"2.5kg"` | `2.5` (float) |



Next Steps

Now that `Ram` and `Weight` are numerical, we’ll clean other columns like `Memory`, `Cpu`, and `Gpu` to prepare them for the model! 🚀  




But before that, let’s perform Exploratory Data Analysis


Code:

sns.displot(df.Price)

Output:


This simple line does 3 important things:

  1. Uses Seaborn (sns) to create a distribution plot

  2. Takes our laptop price data (df.Price)

  3. Shows how prices are spread across different ranges

What the Graph Tells Us

1. The X-Axis (Horizontal)

  • Shows price ranges from 0 to 300,000

  • Each segment represents a price bracket (about 50,000 per segment)

2. The Y-Axis (Vertical)

  • Shows how many laptops fall into each price range

  • The tallest bars mean most laptops are in that price range

3. Key Observations

  • 🏆 Most common price range: Around 50,000 (where the bar is tallest)

  • 📉 Few expensive laptops: Very few laptops above 200,000

  • 📊 Distribution shape: Most data is clustered on the left (lower prices) with a long tail to the right

Why This Matters for Our Project

  1. Identifies normal price ranges - Helps us spot what's "average" vs "expensive"

  2. Reveals outliers - Those few super-expensive laptops might need special handling

  3. Guides model development - Our model should be most accurate in the 20,000-100,000 range where most data lives



Now we'll check how many laptops each brand contributes to the dataset.

Code:

df.Company.value_counts().plot(kind='bar')


Output:




Explanation:

`df.Company.value_counts()`:  

This counts how many laptops belong to each company in the dataset (e.g., Dell: 300 laptops, Lenovo: 250 laptops).  

  - Example: If Dell appears 300 times in the `Company` column, it means there are 300 Dell laptops in the dataset.  

`.plot(kind='bar')`:

  - Creates a bar chart to visualize these counts.  

The bar chart shows "Number of Laptops per Company":  

X-axis: Laptop brands (e.g., Dell, Lenovo, HP).  

Y-axis: Count of laptops (e.g., Dell has the highest count at ~300).  

Key Insight:  

  - Dell, Lenovo, and HP dominate the dataset (most entries).  

  - Brands like Google, Fujitsu, and LG have very few laptops listed.  




Now let's check the relationship of company features with price.

sns.barplot(x=df.Company,y=df.Price)

plt.xticks(rotation='vertical')

plt.show()

Output :


Explanation  

`sns.barplot(x=df.Company, y=df.Price)`:  

  - Uses Seaborn to create a bar plot showing average laptop price for each company.  

  - Each bar represents the mean price of laptops from that brand.  

- `plt.xticks(rotation='vertical')`:  

  - Rotates company names vertically to avoid overlapping text on the x-axis.  


What does the Output say?

- The bar chart shows "Average Price of Laptops by Company":  

  - X-axis: Laptop brands.  

  - Y-axis: Price (in local currency, e.g., £ or $).  

Key Insights:  

  - Apple has the highest average price (bar at ~250,000), reflecting its premium positioning.  

  - Brands like Acer, Asus, and Lenovo have lower average prices (budget-friendly options).  

  - Anomaly: Huawei appears twice in the x-axis labels; this might indicate a typo or duplicate entry in the data.  



Code:

df.TypeName.value_counts().plot(kind='bar')


Output:



Count of Laptops by Type  

Visual

Bar chart titled "Count of Laptops by Type" (from `df.TypeName.value_counts().plot`).  

Key Observations:  

- X-axis: Laptop types (`TypeName`).  

- Y-axis: Number of laptops in the dataset.  

- Tallest Bars:  

  - Notebook: ~700 laptops (most common type).  

  - Gaming: ~600 laptops.  

  - Ultrabook: ~500 laptops.  

- Shorter Bars:  

  - 2 in 1 Convertible: ~300 laptops.  

  - Workstation: ~200 laptops.  

  - Netbook: Very few (barely visible).  


Takeaways:  

- Notebooks dominate the dataset, likely because they cater to general-purpose use (students, professionals).  

- Gaming laptops are also prevalent, reflecting demand for high-performance devices.  

- Ultrabooks (premium lightweight laptops) are popular but less common than standard notebooks.  


So what's the average price by laptop type?

Key Observations:  

- X-axis: Laptop types (`TypeName`).  

- Y-axis: Average price (currency not specified, likely £ or $).  

- Most Expensive Types:  

  - Ultrabook: Highest average price (~140,000).  

  - Gaming: Second highest (~120,000).  

  - 2 in 1 Convertible: ~100,000.  

- Budget-Friendly Types:  

  - Notebook: ~40,000 (cheapest).  

  - Netbook: Slightly higher than notebooks.  


Takeaways

- Ultrabooks are premium: Their lightweight design and high-end specs (e.g., SSD, slim builds) justify the cost.  

- Gaming laptops are pricey: Due to specialized hardware (GPUs, cooling systems).  

- Notebooks are affordable: Designed for everyday tasks, not high performance.  

- Anomaly: The label "Gaming in 1 Convertible" seems incorrect; it should likely be "2 in 1 Convertible" (a data entry error?).  

Why This Matters:  

- For buyers: Helps identify cost-effective vs. luxury options.  

- For manufacturers: Guides pricing strategies based on product type.  

- For data analysts: Highlights potential data inconsistencies (e.g., mislabeled types).  
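The "Average Price by Laptop Type" chart above comes from the same barplot pattern we used for companies. A minimal sketch with toy numbers (on the full dataset the call would be `sns.barplot(x=df.TypeName, y=df.Price)`):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs outside a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data standing in for the real dataset (prices are hypothetical)
df = pd.DataFrame({
    'TypeName': ['Notebook', 'Notebook', 'Ultrabook', 'Gaming'],
    'Price': [40000, 44000, 140000, 120000],
})

# The bar heights sns.barplot draws are simply the mean Price per TypeName
print(df.groupby('TypeName')['Price'].mean())

sns.barplot(x=df.TypeName, y=df.Price)
plt.xticks(rotation='vertical')
plt.tight_layout()
plt.savefig('avg_price_by_type.png')
```

The `groupby` line makes the chart's numbers explicit: each bar is just a group mean.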

  


Now let's extract Touchscreen from the screen resolution column.

Purpose:

Creates a new binary column Touchscreen to indicate whether a laptop has a touchscreen.  

How It Works: 

Checks each entry in the `ScreenResolution` column for the keyword `Touchscreen`.  

  - If found: Assigns `1` (True).  

  - If not found: Assigns `0` (False).  


Code:

df['Touchscreen'] = df.ScreenResolution.apply(lambda x:1 if 'Touchscreen' in x else 0)

df.head()

Output:


Output Analysis 

Touchscreen  

0    1111  

1     192  

Interpretation:  

  - `1111` laptops do NOT have a touchscreen (`0`).  

  - 192 laptops DO have a touchscreen (`1`).  

Key Insights  

  1. Class Imbalance:  

   - Touchscreen laptops are a minority (~14% of the dataset), which is typical since touchscreens are often premium features.  

2. Why This Matters:  

   - Touchscreens add to manufacturing costs, so laptops with this feature are likely pricier (confirmed by the subsequent `sns.barplot` showing higher average prices for touchscreen laptops). 
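The price comparison referred to above can be sketched like this with toy prices (on the real data you'd call `sns.barplot(x=df.Touchscreen, y=df.Price)`):

```python
import pandas as pd

# Hypothetical sample rows mimicking the ScreenResolution column
df = pd.DataFrame({
    'ScreenResolution': ['IPS Panel Touchscreen 2560x1600', '1920x1080',
                         'Touchscreen 1366x768', 'IPS Panel Full HD 1920x1080'],
    'Price': [150000, 40000, 90000, 60000],
})

# Same flag-extraction logic as in the code above
df['Touchscreen'] = df.ScreenResolution.apply(lambda x: 1 if 'Touchscreen' in x else 0)

# Mean price per group — the numbers the barplot would draw
print(df.groupby('Touchscreen')['Price'].mean())
```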



Now let's Create a binary column `IPS` to identify laptops with IPS (In-Plane Switching) panels, a premium display technology.  

How It Works:  

  - Scans the `ScreenResolution` column for the keyword `'IPS'`.  

  - If found: Assigns 1 (IPS panel present).  

  - If not found: Assigns `0` (no IPS panel).  

Code:

df['IPS'] = df.ScreenResolution.apply(lambda x:1 if 'IPS' in x else 0)

df.head()


Output :


IPS  

0    938  

1    365  

Name: count, dtype: int64  

So it means:

  - `938` laptops lack IPS panels (`0`).  

  - `365` laptops have IPS panels (`1`).  


Key Insights 

1. Class Imbalance:  

   - Only ~28% of laptops have IPS displays, reflecting their premium status (better color accuracy/viewing angles).  

2. Impact on Price:  

   - The subsequent `sns.barplot(x=df.IPS, y=df.Price)` likely shows:  

     - IPS=1: Higher average price (IPS panels are costlier).  

     - IPS=0: Lower average price.  

Why Does This Feature Matter?  

IPS is a strong price predictor: consumers pay more for better displays.  

Comparison with Touchscreen 

- Touchscreen: 192 laptops (14%).  

- IPS: 365 laptops (28%).  

Takeaway

IPS panels are more common than touchscreens in this dataset, but both are premium features.  



Now let's Split the `ScreenResolution` string at the `'x'` character (which separates width × height pixels, like `1920x1080`).  

We also use Parameters:  

   - `n=1`: Only splits at the first occurrence of `'x'`.  

   - `expand=True`: Returns a DataFrame with separate columns for width (`screen_x[0]`) and height (`screen_x[1]`).


Code:

screen_x = df.ScreenResolution.str.split('x',n=1,expand=True)

df['x_res'] = screen_x[0]

df['y_res'] = screen_x[1]

df.head()


Output:



Purpose?

Extracts screen width and height (e.g., `2560x1600` → `2560` and `1600`) to later calculate PPI (Pixels Per Inch).  

Why Does It Matter? 

- PPI (calculated later) is a critical feature for price prediction: higher-resolution screens cost more.  

- This is part of feature engineering to extract meaningful numerical data from text.  



Now let's clean the `x_res` column (screen width values) by:

   - First removing commas (e.g., "1,920" → "1920")

   - Then extracting just the numeric portion using regex to find:

     - `\d+` (one or more digits)

     - `\.?` (optional decimal point)

   - Finally taking the first match (index [0]) from the extracted numbers

Purpose to do this operation?

Prepares clean numeric resolution values for:

- Converting to integers

- Calculating PPI (pixels per inch)

- Using in machine learning model

Code:

df.x_res = df.x_res.str.replace(',','').str.findall(r'(\d+\.?\d+)').apply(lambda x: x[0])

df.head()


Output:



Let's create a new `ppi` column, which will be very useful in this use case since it directly affects the target column.

Why?

✓ PPI is a key display quality metric that affects price

✓ Higher PPI = sharper display = typically more expensive

✓ Creates a standardized measure of screen quality for the model

So to create this column we will

1. Calculate PPI (Pixels Per Inch) using the Pythagorean theorem:

   - Formula: `√(x_res² + y_res²) / Inches`

   - This gives the diagonal pixel density of each display

2. Converts result to float and stores in new `ppi` column

3. Displays first 5 rows to verify calculation


Code:

#Now we will create a new column named ppi using the Inches, x_res and y_res columns

#x_res and y_res are still strings after the regex extraction, so convert them to numbers first

df.x_res = df.x_res.astype('int')

df.y_res = df.y_res.astype('int')

df['ppi'] = (((df.x_res**2) + (df.y_res**2))**0.5/df.Inches).astype('float')

df.head()


Output:



Okay, now the screen resolution, x_res, y_res and Inches columns can be dropped. We now shift our focus towards the CPU column. 
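The drop itself isn't shown in the notebook output; a minimal sketch of what it would look like, on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame with the intermediate display columns (values are hypothetical)
df = pd.DataFrame({
    'ScreenResolution': ['IPS Panel Retina Display 2560x1600'],
    'x_res': [2560],
    'y_res': [1600],
    'Inches': [13.3],
    'ppi': [226.98],
    'Price': [135000.0],
})

# ppi now captures the display information, so the raw columns are redundant
df = df.drop(['ScreenResolution', 'x_res', 'y_res', 'Inches'], axis=1)
print(df.columns.tolist())  # → ['ppi', 'Price']
```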

We will:

1. Create a new column `Cpu_name` by extracting:

   - The first 3 words from the `Cpu` column

   - Example: "Intel Core i5 2.3GHz" → "Intel Core i5"

Why are we doing this?

1. Simplifies CPU information to just the processor family

2. Prepares for further categorization in next steps (like grouping into Intel Core i3/i5/i7)

3. Reduces noise from clock speeds and model numbers

4. Creates cleaner categorical features for the ML model


Code Block:

df['Cpu_name'] = df.Cpu.apply(lambda x:' '.join(x.split()[0:3]))

df.head()


Output:




Now we create a function `processor()` that categorizes CPUs into 5 groups:

   - Intel Core i3/i5/i7 (keeps as-is)

   - Other Intel processors (e.g., Pentium, Xeon)

   - All AMD processors

We will then Apply this function to create a new `Cpu_brand` column

Why?

Because it:

1. Standardizes CPU types for better ML modeling

2. Reduces 100+ unique CPU models down to 5 meaningful categories

3. Maintains important performance tiers (i3/i5/i7) while grouping less common models

4. Helps the model learn price patterns based on processor class rather than specific models

Code Block:

def processor(text):

    if text == 'Intel Core i7' or text == 'Intel Core i5' or text == 'Intel Core i3':

        return text

    else:

        if text.split()[0] == 'Intel':

            return 'Other Intel Processor'

        else:

            return 'AMD Processor'

        

df['Cpu_brand'] = df.Cpu_name.apply(processor)

df.head()



Output:




Okay, so now we can drop the columns `Cpu` and `Cpu_name`. Our focus now shifts towards the memory column, where we perform an in-depth feature engineering operation. 

What it does:

1. Standardizes units:

   - Removes ".0" decimals (e.g., "1.0TB" → "1TB")

   - Converts "GB" to empty string (e.g., "256GB" → "256")

   - Replaces "TB" with "000" (1TB = 1000GB)


2. Splits combined storage:

   - Separates entries with "+" into two parts (e.g., "128SSD+1TB HDD" → ["128SSD", "1000HDD"])

   - `n=1` splits only at the first "+" encountered

   - `expand=True` creates separate columns for each part


Why does this matter?:

1. Prepares for numerical conversion by:

   - Removing unit text

   - Standardizing all values to GB equivalents

2. Enables separating storage types (SSD/HDD) which will be processed in the next steps

Code:

df.Memory = df.Memory.astype(str).replace(r'\.0','',regex=True)

df.Memory = df.Memory.str.replace('GB','')

df.Memory = df.Memory.str.replace('TB','000')

memo = df.Memory.str.split('+',n=1,expand=True)

#Now lets Create a new column first to store the primary memory type:

df['first'] = memo[0]

df['first'] =df['first'].str.strip()

#Create a new column second to store the secondary storage type:

df['second'] = memo[1]

#Create indicator variables (1 or 0) for different storage types in first:

df['Layer1HDD'] = df['first'].apply(lambda x:1 if 'HDD' in x else 0)

df['Layer1SSD'] = df['first'].apply(lambda x: 1 if 'SSD' in x else 0)

df['Layer1Hybrid'] = df['first'].apply(lambda x: 1 if 'Hybrid' in x else 0)

df['Layer1Flash_Storage'] = df['first'].apply(lambda x: 1 if 'Flash Storage' in x else 0)

#Remove all non-numeric characters from first:

df['first'] = df['first'].str.replace(r'\D', '', regex=True)

#Replace missing values in second with "0":

df['second'] = df['second'].fillna('0')

#If there was no secondary storage, it becomes "0".

#Create indicator variables (1 or 0) for different storage types in second:

df['Layer2HDD'] = df.second.apply(lambda x: 1 if 'HDD' in x else 0)

df['Layer2SSD'] = df.second.apply(lambda x: 1 if 'SSD' in x else 0)

df['Layer2Hybrid'] = df.second.apply(lambda x: 1 if 'Hybrid' in x else 0)

df['Layer2Flash_Storage'] = df.second.apply(lambda x: 1 if 'Flash Storage' in x else 0)

#Same logic as Layer1 but applied to second.

#Remove all non-numeric characters from second:


df.second = df.second.str.replace(r'\D','',regex=True)

df.head()

Output:




Now let’s convert the columns `first` and `second` from object to int64. (This means the values are converted from a text data type to a numerical data type.)

Code:

df['first'] = df['first'].str.replace('HDD', '')

df['first'] = df['first'].str.replace('SSD', '')

df['first'] = df['first'].str.replace('Hybrid', '')

df['first'] = df['first'].str.replace('Flash Storage', '')

df['first'] = df['first'].str.replace(' ', '')

df['first'] = df['first'].astype(int)

df['second'] = df['second'].str.replace('HDD', '')

df['second'] = df['second'].str.replace('SSD', '')

df['second'] = df['second'].str.replace('Hybrid', '')

df['second'] = df['second'].str.replace(' ', '')

df['second'] = df['second'].astype(int)

df.info()

#Now check the data types

Output:




Now let’s calculate the total storage capacity for different types of storage (HDD, SSD, Hybrid, and Flash Storage). It does this by multiplying the storage size (first and second) by the corresponding binary indicators (Layer1HDD, Layer1SSD, etc.).


Code:


#Computing total HDD storage:

df['HDD'] = (df['first']*df.Layer1HDD+df.second*df.Layer2HDD)

#Computing total SSD storage:

df['SSD'] = (df['first']*df.Layer1SSD+df.second*df.Layer2SSD)

#Computing total Hybrid storage:

df['Hybrid'] = (df['first']*df.Layer1Hybrid+df.second*df.Layer2Hybrid)

#Computing total Flash Storage (the original notebook mistakenly reassigned df['HDD'] here):

df['Flash_Storage'] = (df['first']*df.Layer1Flash_Storage+df.second*df.Layer2Flash_Storage)


df.head()

Output:




Let's drop these columns because they represent intermediate steps in processing the ‘Memory’ feature of the dataset. Initially, the Memory column contained complex storage configurations, like combinations of SSDs, HDDs, and Flash Storage. To make it easier to analyze and use in machine learning models, the data was split into layers (Layer1HDD, Layer2SSD, etc.), but these were only temporary.


Once the relevant numerical values for HDD, SSD, and other storage types were extracted and assigned to new columns, the intermediate columns (first, second, Layer1HDD, Layer1SSD, etc.) became redundant. Removing them cleans up the dataset, ensuring only useful, final features remain for model training.


This step helps keep the dataset streamlined and avoids unnecessary complexity in the machine learning project.


Code:

df = df.drop(['first','second','Layer1HDD','Layer1SSD','Layer1Hybrid','Layer1Flash_Storage','Layer2HDD','Layer2SSD','Layer2Hybrid','Layer2Flash_Storage'],axis=1)

df.head()


Output:



Now we extract the brand name from the Gpu column by splitting the text at spaces and keeping the first word (e.g., "Intel", "Nvidia", "AMD"). 

Then, `value_counts()` is used to display the frequency of each brand in the dataset. This helps categorize GPUs efficiently for analysis and modeling. 🚀


Code:

#Extracting the brand name

df['Gpu_brand'] = df.Gpu.apply(lambda x: x.split()[0])

df.Gpu_brand.value_counts()


Output:

Gpu_brand

Intel     722

Nvidia    400

AMD       180

ARM         1

Name: count, dtype: int64




Let's filter out rows where the GPU brand is 'ARM', keeping only laptops with Intel, AMD, or Nvidia GPUs. ARM GPUs are rare in laptops and may not be relevant for price prediction, so they're excluded to maintain consistency in the dataset. 🚀


Code:

df = df[df.Gpu_brand != 'ARM']

df.Gpu_brand.value_counts()


Output:

Gpu_brand

Intel     722

Nvidia    400

AMD       180

Name: count, dtype: int64



Okay, now let's shift our focus towards the Operating System column:


1. By using `df.OpSys.value_counts()`, we check the number of laptops for each operating system type (Windows, macOS, Linux, etc.).

2. `sns.barplot(x=df.OpSys, y=df.Price)`

Creates a bar chart showing how laptop prices vary across different operating systems.

3. `plt.xticks(rotation=45)`

Rotates the x-axis labels for better readability.

4. `plt.show()`

Displays the plot.


This helps analyze how operating systems impact laptop pricing: whether macOS devices tend to be more expensive, or Linux-based systems are priced lower.


Code:

sns.barplot(x=df.OpSys,y=df.Price)

plt.xticks(rotation=45)

plt.show()



Output:




Now we simplify the OpSys column by categorizing operating systems into broader groups:


1. Define a function (`os`) that classifies the operating systems:

   - Windows versions → "Windows"

   - macOS variants → "Mac"

   - Linux, Chrome OS, No OS → "Others/No OS/Linux"

2. Apply this function to the `OpSys` column, creating a new column `OS` with simplified categories.


But Why?

Grouping operating systems reduces complexity, preventing excessive fragmentation that could hurt model performance. It ensures the machine learning model can better identify trends based on major OS groups.


Code:

def os(inp):

    if inp == 'Windows 10' or inp == 'Windows 7' or inp == 'Windows 10 S':

        return 'Windows'

    elif inp == 'macOS' or inp == 'Mac OS X':

        return 'Mac'

    else:

        return 'Others/No OS/Linux'

    

df['OS'] = df.OpSys.apply(os)

df.head()


Output:



Now We Use One-Hot Encoding for the Categorical Columns

Why?

  1. Categorical Data Problem:

    • Company names (Apple, Dell, etc.) are text labels, but ML models only understand numbers

    • We can't just assign numbers like Apple=1, Dell=2 because that would imply "Apple < Dell" mathematically (which is nonsense!)

  2. One-Hot Encoding Solves This:

    • Creates separate binary (1/0) columns for each company

    • Preserves the fact that companies are distinct categories with no numerical relationship



Why It Matters:

  1. ML Readiness: Converts text-based company names into numerical format that models can process

  2. Preserves Information: Each company's impact on price can be individually analyzed

Avoids Ordinality: Prevents the model from assuming artificial order/ranking between brands

What It Does:

  1. One-Hot Encoding:

    • pd.get_dummies(df.Company) converts the categorical Company column (e.g., "Apple", "Dell") into binary columns

    • Each company becomes a new column with 1/0 values indicating presence

  2. Type Conversion:

    • .astype(int) ensures the dummy variables are integers (not booleans)

  3. Joining to DataFrame:

    • df.join() merges these new binary columns back into the original DataFrame

Code:

df = df.join(pd.get_dummies(df.Company).astype(int))

df = df.join(pd.get_dummies(df.TypeName).astype(int))

df = df.join(pd.get_dummies(df.Cpu_brand).astype(int))

df = df.join(pd.get_dummies(df.OS).astype(int))

df = df.join(pd.get_dummies(df.Gpu_brand).astype(int))

#Dropping the columns since they are now redundant

df = df.drop(['Gpu_brand'],axis=1)

df = df.drop(['OS'],axis=1)

df = df.drop(['Cpu_brand'],axis=1)

df = df.drop(['Company','TypeName'],axis=1)

df = df.drop(['OpSys'],axis=1) #the raw OpSys column is redundant now (Gpu_brand was already dropped above)



df.head()

Output:




Correlation Heatmap Check

Why We Do This?:

  1. Feature Selection:

    • Helps identify which features strongly correlate with price (our target variable)

    • Reveals redundant features that can be removed (if two features are highly correlated with each other)

  2. Business Insights:

    • Shows real-world relationships like:

      • Higher RAM → Higher Price (positive correlation)

      • HDD storage → Lower Price (negative correlation)

      • SSD storage → Higher Price

  3. Model Health Check:

    • Prevents multicollinearity issues (when predictors are too correlated)

    • Helps avoid "double-counting" the same information

  4. Color Decoding:

    • Yellow = Strong positive correlation (near +1)

    • Purple = Strong negative correlation (near -1)

    • Dark colors = Weak correlation (near 0)

Student Takeaways:

  1. Always check correlations before modeling

  2. Look for:

    • Strong correlations with your target variable (good)

    • Strong correlations between features (potentially problematic)

  3. This is why we engineered features like PPI - to create meaningful numerical relationships

Pro Tip: The large size (figsize=(30,25)) ensures all feature names remain readable in the visualization!

What This Code Does:

  1. df.corr() - Calculates pairwise Pearson correlation coefficients between all numerical features

  2. Creates a large visualization (30x25 inches) to display all relationships clearly

  3. sns.heatmap() with parameters:

    • annot=True: Shows correlation values in each cell

    • cbar=True: Displays the color scale legend

    • cmap='plasma': Uses a purple-to-yellow color gradient

Code:

corr = df.corr(numeric_only=True) #skip any remaining text columns

plt.figure(figsize=(30,25))

sns.heatmap(corr,annot=True,cbar=True,cmap='plasma')

plt.show()

Output:



Finally, all the data preprocessing and feature engineering is done; we can now move on to splitting the data into train and test sets. 

  1. Train-Test Split:

    • Separates data into training (80%) and testing (20%) sets

    • random_state=42 ensures reproducible results

    • Never scale before splitting to avoid data leakage

  2. Feature Scaling:

    • Standardization (Z-score normalization) is crucial for:

      • Distance-based algorithms (KNN, SVR)

      • Regularized models (Ridge, Lasso)

      • Gradient-based methods (Neural Networks)

    • Fit only on training data to prevent information bleed

  3. Model Selection:

    • Tests diverse algorithms:

      • Linear models (baseline)

      • Tree-based models (handle non-linearity)

      • Ensemble methods (boost performance)

      • Gradient boosting frameworks (LightGBM/XGBoost)

    • CatBoost handles categorical features automatically

  4. Evaluation:

    • Uses R² score (coefficient of determination):

      • 1 = Perfect prediction

      • 0 = Baseline (mean prediction)

      • Negative = Worse than baseline

    • MAE could also be used for dollar-error interpretation

  5. Key Findings:

    • Tree-based models (CatBoost, XGBoost) perform best

    • Linear models underfit (non-linear relationships exist)

    • SVR performs poorly without extensive tuning

Why This Matters:

  1. Methodical Approach: Systematically compares algorithms

  2. Scalability: Pipeline can test 100+ models with minimal code changes

  3. Practical Insight: Shows that data prep matters more than model choice

  4. Production Readiness: Identifies CatBoost as best model for deployment


Code:

#Now split the data for ML model

x = df.drop(['Price'],axis=1)

y = df.Price


#train test split

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)


#feature scaling

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

x_train_scaled = ss.fit_transform(x_train)

x_test_scaled = ss.transform(x_test)


#model selection

from sklearn.linear_model import LinearRegression,Ridge,Lasso,ElasticNet

from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor,AdaBoostRegressor

from xgboost import XGBRegressor

from sklearn.neighbors import KNeighborsRegressor

from sklearn.svm import SVR

from catboost import CatBoostRegressor

import lightgbm as lgbm

from sklearn.gaussian_process import GaussianProcessRegressor



lr = LinearRegression()

r = Ridge()

l = Lasso()

en = ElasticNet()

rf = RandomForestRegressor()

gb = GradientBoostingRegressor()

adb = AdaBoostRegressor()

xgb = XGBRegressor()

knn = KNeighborsRegressor()

svr = SVR()

cat = CatBoostRegressor()

lgb =lgbm.LGBMRegressor()

gpr = GaussianProcessRegressor()


#Fittings - every model must be fitted before it can predict

lr.fit(x_train_scaled,y_train)

r.fit(x_train_scaled,y_train)

l.fit(x_train_scaled,y_train)

en.fit(x_train_scaled,y_train)

rf.fit(x_train_scaled,y_train)

gb.fit(x_train_scaled,y_train)

adb.fit(x_train_scaled,y_train)

xgb.fit(x_train_scaled,y_train)

knn.fit(x_train_scaled,y_train)

svr.fit(x_train_scaled,y_train)

cat.fit(x_train_scaled,y_train)

lgb.fit(x_train_scaled,y_train)

gpr.fit(x_train_scaled,y_train)

Output:



What This Code Does:

  1. Makes Predictions:

    • Each trained model (lr, rf, xgb, etc.) predicts prices on the scaled test set

    • Creates 13 sets of predictions (one per algorithm)

  2. Evaluates Performance:

    • Calculates R² scores comparing predictions (_pred) to true prices (y_test)

    • R² measures how well models explain price variations (0 = baseline, 1 = perfect)

  3. Prints Results:

    • Shows each model's R² score for direct comparison


Going into the Prediction Phase:

  1. lrpred = lr.predict(x_test_scaled)  # Linear Regression's predictions

    • Every model uses the same test features (x_test_scaled)

    • Output: Array of predicted prices for each laptop in test set

  2. R² Score Interpretation:

    • R² = 0.85 → Model explains 85% of price variations

    • Negative R² → Worse than just predicting average price

  3. Why Multiple Metrics?:

    • Here we only use R² for simplicity, but:

      • mean_absolute_error() (imported but unused) would show the average absolute price error

      • Different metrics answer different questions

Why This Matters:

  1. Model Selection: Identifies CatBoost as best for deployment

  2. Baseline Comparison: Linear Regression's 0.75 R² sets a benchmark

  3. Algorithm Insights: Shows ensemble methods (XGBoost, CatBoost) outperform classics

Student Takeaways:

  1. Always evaluate models on unseen data (test set)

  2. R² is useful, but combine it with other metrics (MAE/RMSE) for business context

  3. No algorithm is always best – test multiple approaches like we did here!

Code:

#preds

lrpred = lr.predict(x_test_scaled)

rpred = r.predict(x_test_scaled)

lpred = l.predict(x_test_scaled)

enpred = en.predict(x_test_scaled)

rfpred = rf.predict(x_test_scaled)

gbpred = gb.predict(x_test_scaled)

adbpred = adb.predict(x_test_scaled)

xgbpred = xgb.predict(x_test_scaled)

knnpred = knn.predict(x_test_scaled)

svrpred = svr.predict(x_test_scaled)

catpred = cat.predict(x_test_scaled)

lgbpred = lgb.predict(x_test_scaled)

gprpred = gpr.predict(x_test_scaled)


#Evaluations

from sklearn.metrics import r2_score,mean_absolute_error

lrr2 = r2_score(y_test,lrpred)

rr2 = r2_score(y_test,rpred)

lr2 = r2_score(y_test,lpred)

enr2 = r2_score(y_test,enpred)

rfr2 = r2_score(y_test,rfpred)

gbr2 = r2_score(y_test,gbpred)

adbr2 = r2_score(y_test,adbpred)

xgbr2 = r2_score(y_test,xgbpred)

knnr2 = r2_score(y_test,knnpred)

svrr2 = r2_score(y_test,svrpred)

catr2 = r2_score(y_test,catpred)

lgbr2 = r2_score(y_test,lgbpred)

gprr2 = r2_score(y_test,gprpred)


print('LINEAR REG ',lrr2)

print('RIDGE ',rr2)

print('LASSO ',lr2)

print('ELASTICNET',enr2)

print('RANDOM FOREST ',rfr2)

print('GB',gbr2)

print('ADABOOST',adbr2)

print('XGB',xgbr2)

print('KNN',knnr2)

print('SVR',svrr2)

print('CAT',catr2)

print('LIGHTGBM',lgbr2)

print('GAUSSIAN PROCESS',gprr2)

Output:

LINEAR REG  0.7518519231498904

RIDGE  0.7526016268459503

LASSO  0.7525656909375112

ELASTICNET 0.7374981483103606

RANDOM FOREST  0.8230675771039606

GB 0.8205891546798048

ADABOOST 0.5789903696720302

XGB 0.8392580531741592

KNN 0.691001565311663

SVR -0.02486099423502175

CAT 0.8538872462911021

LIGHTGBM 0.791475902516366

GAUSSIAN PROCESS -56901.537446931245
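As a side note, the thirteen repetitive predict/score pairs above can be collapsed into a loop over a dict of models, which also makes it easy to report MAE alongside R². A hedged sketch using two lightweight stand-in models and synthetic data — the same pattern works for the full list in the project:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the project this would be the scaled laptop features
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'LINEAR REG': LinearRegression(),
    'RIDGE': Ridge(),
}

results = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    results[name] = (r2_score(y_test, pred), mean_absolute_error(y_test, pred))
    print(f'{name}: R2={results[name][0]:.3f}, MAE={results[name][1]:.2f}')

# Pick the best model by R2 score
best = max(results, key=lambda k: results[k][0])
```

Adding a new algorithm then only requires one new dict entry instead of three new lines of predict/score/print code.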


The results identify CatBoost as the best model for deployment (highest R², ≈0.854).

The trained model can now be deployed as a Streamlit web app.
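Before building the Streamlit app, the fitted model and scaler need to be saved so the app can load them without retraining. A minimal sketch with pickle, using a RandomForest and synthetic data as stand-ins for the project's CatBoost model and laptop features — the same calls apply to the real objects:

```python
import pickle
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Stand-ins for the project's trained scaler and best model
X, y = make_regression(n_samples=100, n_features=5, random_state=42)
ss = StandardScaler().fit(X)
model = RandomForestRegressor(random_state=42).fit(ss.transform(X), y)

# Save both artifacts; the Streamlit app will load them at startup
with open('scaler.pkl', 'wb') as f:
    pickle.dump(ss, f)
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# In the app: load the model and predict on user-supplied specs
with open('model.pkl', 'rb') as f:
    loaded = pickle.load(f)
preds = loaded.predict(ss.transform(X[:3]))
```

The key point is that the scaler must be saved alongside the model: user input in the app has to be transformed with the same scaler that was fitted on the training data.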