Sunday, December 21, 2025

 

Data Cleaning, Preparation & Data Wrangling (Using Python)

UNIT-3

1. Data Cleaning

Definition

Data Cleaning is the process of detecting and correcting (or removing) incorrect, incomplete, duplicate, or irrelevant data to improve data quality before analysis.


Why Data Cleaning is Important

  • Real-world data is messy

  • Improves accuracy of models

  • Reduces bias and errors

  • Saves time during analysis


Common Data Quality Issues

  • Missing values

  • Duplicate records

  • Incorrect data types

  • Outliers

  • Inconsistent formats


Keywords

Missing Values, Duplicates, Outliers, Noise, Null, NaN, Data Quality


Exam Diagram (Draw This)

Raw Data
   ↓
[Missing Values] [Duplicates] [Incorrect Data]
   ↓
Clean Data

Python Examples

1. Handling Missing Values

import pandas as pd

df = pd.DataFrame({
    'Age': [25, None, 30],
    'Salary': [40000, 50000, None]
})
df.fillna(df.mean(), inplace=True)
print(df)

2. Removing Duplicates

df.drop_duplicates(inplace=True)

3. Correcting Data Types

df['Age'] = df['Age'].astype(int)  # valid only after missing values are handled

MOST IMPORTANT EXAM QUESTION

What is Data Cleaning? Explain its importance and techniques.


2. Data Preparation

Definition

Data Preparation is the process of transforming cleaned data into a suitable format for analysis or machine learning models.


Objectives

  • Make data model-ready

  • Improve feature quality

  • Convert data into numerical form


Key Steps in Data Preparation

  • Feature selection

  • Feature scaling

  • Encoding categorical data

  • Splitting data (train/test)


Keywords

Feature Engineering, Normalization, Encoding, Scaling, Train-Test Split


Exam Diagram

Clean Data
   ↓
Feature Selection
   ↓
Encoding & Scaling
   ↓
Prepared Data

Python Examples

1. Encoding Categorical Data

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})
df_encoded = pd.get_dummies(df)
print(df_encoded)

2. Feature Scaling

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = scaler.fit_transform([[100], [200], [300]])
print(df_scaled)

3. Train-Test Split

from sklearn.model_selection import train_test_split

X = df_encoded
y = [1, 0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

MOST IMPORTANT EXAM QUESTION

Explain Data Preparation steps with suitable examples.


3. Data Wrangling

Definition

Data Wrangling is the process of collecting, transforming, structuring, and enriching raw data into a usable format for analysis.

📌 Data Wrangling = Cleaning + Preparation + Transformation


Tasks in Data Wrangling

  • Merging datasets

  • Reshaping data

  • Aggregation

  • Filtering


Keywords

Merge, Join, GroupBy, Reshape, Pivot, Transform


Exam Diagram (Very Important)

Multiple Data Sources
   ↓
Data Wrangling (Clean + Transform)
   ↓
Analysis-Ready Data

Python Examples

1. Merging Data

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']})
df2 = pd.DataFrame({'ID': [1, 2], 'Marks': [80, 90]})
merged = pd.merge(df1, df2, on='ID')
print(merged)

2. Grouping Data

df = pd.DataFrame({
    'Department': ['CS', 'CS', 'IT'],
    'Salary': [50000, 60000, 55000]
})
print(df.groupby('Department').mean())

3. Reshaping Data

df.pivot_table(values='Salary', index='Department')

MOST IMPORTANT EXAM QUESTION

What is Data Wrangling? How is it different from Data Cleaning?


4. Difference Table (High Probability Exam Question)

Aspect   | Data Cleaning | Data Preparation      | Data Wrangling
Purpose  | Fix errors    | Make data model-ready | Transform & structure
Scope    | Narrow        | Medium                | Broad
Examples | Remove nulls  | Encode features       | Merge datasets
Tools    | Pandas        | Pandas, Sklearn       | Pandas

5. One-Line Memory Tips (For Exams)

  • Data Cleaning → Fix bad data

  • Data Preparation → Make data usable for ML

  • Data Wrangling → Convert raw data into structured data


6. Final Exam Flowchart (Draw This)

Raw Data
   ↓
Data Cleaning
   ↓
Data Preparation
   ↓
Data Wrangling
   ↓
Data Analysis / ML Model


Handling Missing Data in Data Science (Using Python)


1. Definition

Missing data refers to the absence of values in a dataset where information should be present.
In Python, missing values are usually represented as NaN (Not a Number) or None.

📌 Handling missing data is the process of detecting, analyzing, and treating missing values to ensure accurate data analysis and model performance.


2. Why Handling Missing Data is Important

  • Prevents biased results

  • Improves model accuracy

  • Avoids errors in algorithms

  • Ensures data integrity


3. Types of Missing Data (Very Important for Exams)

1. MCAR – Missing Completely at Random

  • Missing values occur randomly

  • No relationship with other variables

2. MAR – Missing at Random

  • Missingness depends on other variables

3. MNAR – Missing Not at Random

  • Missingness depends on the missing value itself
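These three types are easiest to see by simulating them. A minimal sketch (the 'age'/'income' columns and the thresholds are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'age': rng.integers(20, 60, size=10),
    'income': rng.integers(20, 100, size=10) * 1000.0
})

# MCAR: values vanish completely at random
mcar = df.copy()
mcar.loc[rng.random(len(df)) < 0.3, 'income'] = np.nan

# MAR: missingness in 'income' depends on another observed column ('age')
mar = df.copy()
mar.loc[mar['age'] > 40, 'income'] = np.nan

# MNAR: missingness depends on the missing value itself (high incomes hidden)
mnar = df.copy()
mnar.loc[mnar['income'] > 60000, 'income'] = np.nan

print(mcar['income'].isnull().sum(),
      mar['income'].isnull().sum(),
      mnar['income'].isnull().sum())
```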


Exam Diagram (Draw This)

Missing Data Types
 ├── MCAR
 ├── MAR
 └── MNAR

4. Keywords (Important)

NaN, Null, None, Missing Values, Imputation,
Deletion, Mean, Median, Mode, Forward Fill, Backward Fill


5. Detecting Missing Data in Python

Python Example

import pandas as pd

df = pd.DataFrame({
    'Age': [20, None, 25],
    'Salary': [30000, 40000, None]
})
print(df.isnull())
print(df.isnull().sum())

6. Techniques for Handling Missing Data


1. Removing Missing Data (Deletion Method)

Definition

Rows or columns containing missing values are removed.

When to Use

  • Missing values are very few

  • Dataset is large

Python Example

df.dropna(inplace=True)

Exam Diagram

Data with Missing Values
   ↓
Drop Rows
   ↓
Clean Data

2. Filling Missing Data (Imputation Method)

Definition

Missing values are replaced with statistical values.


a) Mean Imputation

df['Age'] = df['Age'].fillna(df['Age'].mean())

b) Median Imputation (Best for Outliers)

df['Salary'] = df['Salary'].fillna(df['Salary'].median())

c) Mode Imputation (Categorical Data)

df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

Exam Diagram

Missing Values
   ↓
Statistical Method (Mean / Median / Mode)
   ↓
Filled Data

3. Forward Fill & Backward Fill

Forward Fill (ffill)

Uses previous value

df.ffill(inplace=True)  # fillna(method='ffill') is deprecated in newer pandas

Backward Fill (bfill)

Uses next value

df.bfill(inplace=True)  # fillna(method='bfill') is deprecated in newer pandas

4. Replacing Missing Values with Constant

df.fillna(0, inplace=True)

5. Advanced Imputation (Mention in Exams)

  • KNN Imputation

  • Regression Imputation

  • Machine Learning models

📌 (Mentioning these fetches extra marks)
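Of these, KNN imputation is simple to demonstrate with scikit-learn's KNNImputer. A minimal sketch (the sample values are invented): each missing value is replaced using the values of the k most similar rows.

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age': [25, None, 30, 28],
    'Salary': [40000, 50000, None, 45000]
})

# Each missing value is filled from the k nearest rows,
# where distance is computed on the observed features.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```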


7. Choosing the Right Method (Important Table)

Scenario                | Best Method
Few missing values      | Drop rows
Numerical data          | Mean / Median
Categorical data        | Mode
Time series data        | Forward / Backward fill
Outliers present        | Median

8. Common Mistakes (Write in Theory Answers)

  • Blindly deleting large data

  • Using mean with skewed data

  • Ignoring missing data patterns


9. Final Exam Flowchart (Very Important)

Raw Dataset
   ↓
Detect Missing Values
   ↓
Analyze Type (MCAR / MAR / MNAR)
   ↓
Choose Method
   ↓
Handle Missing Data
   ↓
Clean Dataset

10. One-Line Memory Points (For Exams)

  • NaN represents missing data in Python

  • Imputation replaces missing values

  • Median is robust to outliers

  • Handling missing data improves model accuracy


MOST IMPORTANT EXAM QUESTIONS

  1. Define missing data and explain its types.

  2. Explain techniques for handling missing data with Python examples.

  3. Compare deletion and imputation methods.


Data Transformation in Data Science (Using Python)


1. Definition

Data Transformation is the process of converting raw data into a suitable format for analysis or machine learning by applying operations such as scaling, normalization, aggregation, encoding, or feature engineering.

📌 Purpose: To improve data quality, model performance, and analytical insights.


2. Importance of Data Transformation

  • Makes data model-ready

  • Handles skewed distributions

  • Improves accuracy of algorithms

  • Enables comparison between features

  • Facilitates feature extraction and engineering


3. Types of Data Transformation

Type                     | Description                      | Example
Normalization            | Rescale values to a 0–1 range    | MinMaxScaler
Standardization          | Rescale to mean = 0, std = 1     | StandardScaler
Log Transformation       | Reduce skewness                  | log(x)
Binning / Discretization | Convert continuous → categorical | Age → young, adult, senior
Encoding                 | Convert categorical → numerical  | One-hot encoding
Aggregation              | Summarize data                   | sum, mean, max by group
Feature Engineering      | Create new features              | BMI = weight / height²

4. Keywords (High Probability in Exams)

Normalization, Standardization, Scaling, Encoding, Log Transform, Binning, Feature Engineering, Aggregation, MinMaxScaler, StandardScaler, One-Hot Encoding


5. Exam Diagram (Draw This)

Raw Data
   ↓
Transformation
 ├── Normalization / Standardization
 ├── Log / Power Transform
 ├── Binning
 ├── Encoding
 └── Feature Engineering
   ↓
Transformed Data (Model Ready)

6. Python Examples

1. Normalization (Min-Max Scaling)

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

df = pd.DataFrame({'Age': [20, 30, 40]})
scaler = MinMaxScaler()
df['Age_scaled'] = scaler.fit_transform(df[['Age']])
print(df)

2. Standardization (Z-score Scaling)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df['Age_std'] = scaler.fit_transform(df[['Age']])
print(df)

3. Log Transformation

import numpy as np

# Assumes df has a numeric 'Salary' column
df['Salary_log'] = np.log(df['Salary'])

4. Binning / Discretization

bins = [0, 18, 35, 60]
labels = ['Child', 'Adult', 'Senior']
df['Age_group'] = pd.cut(df['Age'], bins=bins, labels=labels)

5. One-Hot Encoding (Categorical to Numerical)

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})
df_encoded = pd.get_dummies(df)
print(df_encoded)

6. Feature Engineering

# Assumes df has 'Weight' (kg) and 'Height' (cm) columns
df['BMI'] = df['Weight'] / (df['Height'] / 100)**2

7. Flow of Data Transformation (Important for Exams)

Raw Dataset
   ↓
Identify Transformations Needed
   ↓
Apply Scaling / Encoding / Log / Binning
   ↓
Feature Engineering
   ↓
Transformed Dataset (Ready for ML)

8. Difference Table: Transformation vs Cleaning vs Preparation

Aspect     | Cleaning      | Preparation           | Transformation
Goal       | Remove errors | Make data model-ready | Convert/modify data for analysis
Scope      | Narrow        | Medium                | Broad
Techniques | Drop / Fill   | Encoding / Scaling    | Normalization, Log, Feature Eng.
Example    | Fill NaN      | One-Hot Encode        | Scale Age, create BMI

9. One-Line Memory Points (For Exams)

  • Transformation = Change the form of data

  • Normalization = 0–1 scale

  • Standardization = mean=0, std=1

  • Log Transform = reduce skew

  • Feature Engineering = create new meaningful features


10. MOST IMPORTANT EXAM QUESTIONS

  1. Define Data Transformation and explain its importance.

  2. Explain types of data transformations with Python examples.

  3. Difference between data cleaning, preparation, and transformation.

  4. Write a Python program to normalize and encode a dataset.


String Manipulation in Data Science (Using Python)


1. Definition

String Manipulation is the process of modifying, analyzing, and extracting information from text data in Python.
In Python, strings are objects of the str class, and they come with built-in methods for operations such as formatting, searching, splitting, and joining.

📌 Importance in Data Science:

  • Text data is common (names, emails, logs, tweets)

  • Preprocessing textual data for analysis or ML requires string operations

  • Helps in cleaning, parsing, and feature extraction


2. String Object

  • Strings in Python are immutable, meaning they cannot be changed after creation.

  • Strings can be created using single quotes, double quotes, or triple quotes.

s1 = 'Hello'
s2 = "Data Science"
s3 = """Python is fun"""

3. Keywords (High Probability in Exams)

String, Immutable, Concatenation, Slicing, Indexing, split(), join(), strip(), replace(), lower(), upper(), find(), format(), f-string, startswith(), endswith(), isdigit(), isalpha()


4. Common String Operations

Operation     | Description                | Python Example
Concatenation | Join strings               | 'Hello' + ' World'
Repetition    | Repeat string              | 'Hi' * 3 → 'HiHiHi'
Indexing      | Access specific characters | s[0] → 'H'
Slicing       | Extract substrings         | s[0:5] → 'Hello'
Length        | Get string length          | len(s)

5. String Methods (Most Important)

Method                    | Description                    | Example
lower()                   | Convert to lowercase           | 'HELLO'.lower() → 'hello'
upper()                   | Convert to uppercase           | 'hello'.upper() → 'HELLO'
title()                   | Capitalize each word           | 'data science'.title() → 'Data Science'
strip()                   | Remove leading/trailing spaces | ' hello '.strip() → 'hello'
replace()                 | Replace substring              | 'hello'.replace('h','H') → 'Hello'
split()                   | Split string into list         | 'a,b,c'.split(',') → ['a','b','c']
join()                    | Join list into string          | ','.join(['a','b']) → 'a,b'
find()                    | Find substring index           | 'hello'.find('e') → 1
startswith() / endswith() | Check prefix/suffix            | 'hello'.startswith('h') → True
isdigit() / isalpha()     | Check string type              | '123'.isdigit() → True

6. Python Examples

1. Concatenation & Repetition

s1 = "Data"
s2 = "Science"
print(s1 + " " + s2)  # Output: Data Science
print("Hi! " * 3)     # Output: Hi! Hi! Hi!

2. Indexing & Slicing

s = "Python"
print(s[0])    # Output: P
print(s[0:4])  # Output: Pyth

3. Using String Methods

s = " Data Science "
print(s.strip())                      # 'Data Science'
print(s.upper())                      # ' DATA SCIENCE '
print(s.lower())                      # ' data science '
print(s.replace("Data", "Big Data"))  # ' Big Data Science '
print(s.split())                      # ['Data', 'Science']

4. Joining a List of Strings

lst = ['Data', 'Science', 'Python']
s = " ".join(lst)
print(s)  # Output: Data Science Python

7. Flow Diagram for String Manipulation (Draw in Exam)

Raw String
   ↓
Operations
 ├── Indexing & Slicing
 ├── Concatenation / Repetition
 ├── Case Conversion (upper/lower/title)
 ├── Cleaning (strip/replace)
 ├── Splitting / Joining
 └── Searching (find/startswith/endswith)
   ↓
Processed String (Ready for Analysis)

8. Real-World Data Science Applications

  • Cleaning text data (removing extra spaces, converting to lowercase)

  • Parsing logs (split IP addresses, timestamps)

  • Feature extraction (count words, detect patterns)

  • Preparing input for NLP models


9. One-Line Memory Tips (For Exams)

  • Strings are immutable

  • split() → string → list

  • join() → list → string

  • strip() → remove whitespace

  • replace() → modify text

  • upper()/lower() → case conversion


10. Most Important Exam Questions

  1. Define string manipulation in Python and explain its importance in Data Science.

  2. Explain commonly used string methods in Python with examples.

  3. Write a Python program to clean and preprocess a string dataset.

  4. Explain difference between split() and join() with examples.

Regular Expressions in Data Science (Using Python)


    1. Definition

    A Regular Expression (Regex) is a sequence of characters that defines a search pattern used for matching, searching, extracting, replacing, and validating text data.

    📌 In Data Science, regex is widely used for:

    • Text cleaning

    • Pattern extraction

    • Data validation

    • Preprocessing unstructured text

    Python provides the re module for working with regular expressions.


    2. Why Regex is Important in Data Science

    • Handles unstructured text data

    • Automates data cleaning

    • Extracts useful features from text

    • Validates emails, phone numbers, IDs

    • Improves text preprocessing for NLP


    3. Keywords (High Probability in Exams)

    Regular Expression, Pattern, re module, match(), search(),
    findall(), sub(), split(), Metacharacters, Quantifiers,
    Character Classes, Anchors, Groups


    4. Python re Module Functions (Very Important)

Function     | Description
re.match()   | Matches pattern at start of string
re.search()  | Searches pattern anywhere
re.findall() | Returns all matches as list
re.sub()     | Replaces pattern
re.split()   | Splits string using pattern

    5. Basic Regex Syntax

    1. Character Classes

Pattern | Meaning
\d      | Digit (0–9)
\D      | Non-digit
\w      | Alphanumeric
\W      | Non-alphanumeric
\s      | Whitespace
\S      | Non-whitespace

    2. Quantifiers

Symbol | Meaning
*      | 0 or more
+      | 1 or more
?      | 0 or 1
{n}    | Exactly n times
{n,m}  | Between n and m

    3. Anchors

Symbol | Meaning
^      | Start of string
$      | End of string

    6. Exam Diagram (Draw This)

Text Data
   ↓
Regex Pattern
   ↓
Matching / Searching
   ↓
Extract / Replace / Validate
   ↓
Clean Text Data

    7. Python Examples


    1. Matching a Pattern

import re

text = "Data Science"
result = re.match("Data", text)
print(result.group())

    2. Searching a Pattern

text = "I love Data Science"
result = re.search("Science", text)
print(result.group())

    3. Find All Matches

text = "Marks: 80, 90, 85"
numbers = re.findall(r'\d+', text)
print(numbers)

    4. Replacing Text

text = "Contact: 9876543210"
masked = re.sub(r'\d', 'X', text)
print(masked)

    5. Splitting Using Regex

text = "Data,Science;Python"
words = re.split(r'[;,]', text)
print(words)

    8. Common Regex Patterns (Important for Exams)

Task                      | Regex Pattern
Email validation          | [a-zA-Z0-9_.]+@[a-zA-Z]+\.[a-zA-Z]+
Phone number              | \d{10}
Extract numbers           | \d+
Extract words             | \w+
Remove special characters | [^a-zA-Z0-9 ]
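The patterns from the table can be tried directly with re. A quick sketch (the sample sentence is made up):

```python
import re

text = "Reach me at abc@gmail.com or call 9876543210!"

emails = re.findall(r'[a-zA-Z0-9_.]+@[a-zA-Z]+\.[a-zA-Z]+', text)
phones = re.findall(r'\d{10}', text)
clean = re.sub(r'[^a-zA-Z0-9 ]', '', text)  # remove special characters

print(emails)  # ['abc@gmail.com']
print(phones)  # ['9876543210']
print(clean)
```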

    9. Real-World Data Science Use Cases

    • Cleaning social media text

    • Extracting dates, numbers, emails

    • Log file analysis

    • NLP preprocessing

    • Data validation rules


    10. Flowchart for Regex Usage (Exam-Friendly)

Raw Text
   ↓
Define Regex Pattern
   ↓
Apply re functions
   ↓
Matched / Cleaned Text
   ↓
Analysis or Model Input

    11. One-Line Memory Tips (For Exams)

    • Regex = pattern matching tool

    • match() → start of string

    • search() → anywhere

    • findall() → list of matches

    • sub() → replace text

    • Regex simplifies text preprocessing


    12. MOST IMPORTANT EXAM QUESTIONS

    1. Define regular expressions and explain their role in data science.

    2. Explain Python re module functions with examples.

    3. Write regex patterns for email and phone number validation.

    4. Explain character classes and quantifiers in regex.

Vectorized String Functions in Pandas (Data Science – Python)


      1. Definition

      Vectorized string functions in Pandas are specialized string operations applied element-wise to an entire Series or Index at once using the .str accessor.

      📌 These functions allow fast, efficient, and readable text processing without using explicit loops.

      👉 Internally optimized → better performance than Python loops.


      2. Why Vectorized String Functions are Important

      • Text data is common (names, emails, reviews, logs)

      • Apply string operations to thousands/millions of rows

      • Essential for data cleaning and preprocessing

      • Works well with missing values (NaN)


      3. Keywords (High Probability in Exams)

      Vectorization, Pandas Series, .str accessor, Element-wise operation,
      Text preprocessing, NaN handling, split(), replace(),
      contains(), extract(), len(), lower(), upper()


      4. What Does “Vectorized” Mean? (Important Concept)

      Vectorized operation =
      👉 Apply the same operation to all elements at once (no loops).


      Exam Diagram (Very Important)

Text Column (Series)
┌───────────────┐
| "Data Science"|
| "python"      |
| "PANDAS"      |
└───────────────┘
   ↓ .str methods
Processed Text Column

      5. .str Accessor in Pandas

      • Pandas provides .str to access vectorized string methods

      • Works on:

        • Series

        • Index

      df['column'].str.method()

      6. Common Vectorized String Functions

      1. Case Conversion

import pandas as pd

df = pd.DataFrame({
    'Text': ['Data Science', 'python', 'PANDAS']
})
df['lower'] = df['Text'].str.lower()
df['upper'] = df['Text'].str.upper()
print(df)

      2. Length of Strings

      df['length'] = df['Text'].str.len()

      3. Removing Whitespaces

      df['clean'] = df['Text'].str.strip()

      4. Splitting Strings

      df['words'] = df['Text'].str.split()

      5. Replacing Substrings

      df['replaced'] = df['Text'].str.replace('Data', 'Big Data')

      6. Checking Pattern Existence

      df['contains_py'] = df['Text'].str.contains('py', case=False)

      7. Extracting Using Regular Expressions

df = pd.DataFrame({
    'Email': ['abc@gmail.com', 'xyz@yahoo.com']
})
df['domain'] = df['Email'].str.extract(r'@(.*)')
print(df)

      8. Startswith & Endswith

df['starts'] = df['Text'].str.startswith('Data')
df['ends'] = df['Text'].str.endswith('S')

      7. Handling Missing Values (Important Advantage)

df = pd.DataFrame({'Text': ['Data', None, 'Science']})
df['lower'] = df['Text'].str.lower()

      📌 .str methods safely ignore NaN values.


      8. Flowchart of Vectorized String Processing (Exam-Friendly)

Text Data in Pandas Series
   ↓
Use .str Accessor
   ↓
Vectorized String Methods (lower, split, replace, extract)
   ↓
Clean & Processed Text Data

      9. Difference: Normal String Methods vs Vectorized String Functions

Feature      | Python String Methods | Pandas Vectorized
Applies to   | Single string         | Entire Series
Speed        | Slow (loops)          | Fast (vectorized)
Syntax       | "text".lower()        | series.str.lower()
NaN handling | Error                 | Safe
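The NaN-handling difference between the two approaches can be checked directly. A small sketch:

```python
import pandas as pd

s = pd.Series(['Data', None, 'Science'])

# Vectorized: NaN passes through safely
print(s.str.lower())

# Plain Python loop: None has no .lower(), so this raises AttributeError
try:
    [x.lower() for x in s]
except AttributeError as e:
    print('loop failed:', e)
```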

      10. Real-World Data Science Use Cases

      • Cleaning text columns

      • Email domain extraction

      • Keyword filtering

      • NLP preprocessing

      • Log and social media analysis


      11. One-Line Memory Points (For Exams)

      • .str → vectorized string operations

      • No loops required

      • Works on Series and Index

      • Handles NaN safely

      • Faster than Python string methods


      12. MOST IMPORTANT EXAM QUESTIONS

      1. Define vectorized string functions in Pandas.

      2. Explain the .str accessor with examples.

      3. Compare Python string methods and Pandas vectorized string functions.

      4. Write a Pandas program to clean and preprocess text data.

Data Wrangling in Data Science (Using Python)


        1. Definition

        Data Wrangling (also called Data Munging) is the process of collecting, cleaning, transforming, restructuring, and enriching raw data into a usable, analysis-ready format.

        📌 In simple words:
        Data Wrangling = Cleaning + Transformation + Structuring


        2. Why Data Wrangling is Important

        • Real-world data is messy and unstructured

        • Required before data analysis and machine learning

        • Improves data quality and consistency

        • Enables integration of multiple data sources


        3. Keywords (High Probability for Exams)

        Data Wrangling, Data Munging, Cleaning, Transformation,
        Reshaping, Filtering, Aggregation, Merge, Join,
        GroupBy, Pivot, Melt, Pandas, NumPy


        4. Main Steps in Data Wrangling

        1. Data Collection

        2. Data Cleaning

        3. Data Transformation

        4. Data Structuring

        5. Data Enrichment


        Exam Diagram (Very Important – Draw This)

Raw Data
   ↓
Data Cleaning
   ↓
Data Transformation
   ↓
Data Structuring
   ↓
Analysis-Ready Data

        5. Tools Used in Python for Data Wrangling

        • Pandas – primary library

        • NumPy – numerical operations

        • Regex – text processing


        6. Common Data Wrangling Operations (With Examples)


1. Handling Missing Data

import pandas as pd

df = pd.DataFrame({
    'Age': [20, None, 25],
    'Salary': [30000, 40000, None]
})
df.fillna(df.mean(), inplace=True)
print(df)

        2. Removing Duplicates

        df.drop_duplicates(inplace=True)

        3. Filtering Data

        df[df['Age'] > 21]

        4. Transforming Data

        df['Salary_in_Lakhs'] = df['Salary'] / 100000

        5. Merging Datasets

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']})
df2 = pd.DataFrame({'ID': [1, 2], 'Marks': [80, 90]})
merged = pd.merge(df1, df2, on='ID')
print(merged)

        6. Grouping and Aggregation

        df.groupby('Department')['Salary'].mean()

        7. Reshaping Data

        Pivot

        df.pivot_table(values='Salary', index='Department')

        Melt

        pd.melt(df, id_vars=['ID'])

        7. Data Wrangling Flowchart (Exam-Friendly)

Multiple Data Sources
   ↓
Data Wrangling
 ├─ Cleaning
 ├─ Transformation
 ├─ Reshaping
 └─ Integration
   ↓
Structured Dataset

        8. Difference Table (Very Important)

Aspect   | Data Cleaning  | Data Wrangling
Scope    | Narrow         | Broad
Focus    | Fix errors     | Prepare & structure data
Includes | Missing values | Cleaning + transformation
Example  | Fill NaN       | Merge & reshape data

        9. Real-World Data Science Applications

        • Combining data from databases & APIs

        • Preparing datasets for machine learning

        • Cleaning social media & text data

        • Business and scientific data analysis


        10. One-Line Memory Points (For Exams)

        • Data wrangling prepares raw data for analysis

        • It includes cleaning, transforming, reshaping

        • Pandas is the main tool

        • Essential before EDA and ML models


        11. MOST IMPORTANT EXAM QUESTIONS

        1. Define Data Wrangling and explain its importance.

        2. Explain steps involved in data wrangling with examples.

        3. Differentiate between data cleaning and data wrangling.

        4. Write a Python program to perform data wrangling operations.

Data Wrangling – JOIN Operations (Using Python)


1. Definition

In Data Wrangling, a JOIN is the process of combining two or more datasets (tables/DataFrames) based on a common key or column.

📌 JOIN operations are used to integrate data from multiple sources into a single, structured dataset for analysis.

In Python, JOIN operations are mainly performed using:

  • Pandas (merge(), join())


2. Why JOIN is Important in Data Science

  • Real-world data is stored in multiple tables

  • Required for data integration

  • Helps in feature enrichment

  • Essential before analysis and machine learning


3. Keywords (High Probability for Exams)

Join, Merge, Key, Primary Key, Foreign Key,
Inner Join, Left Join, Right Join, Outer Join,
Pandas, merge(), join()


4. Types of JOINs (Very Important)

1. Inner Join

Returns only matching rows from both tables.

2. Left Join

Returns all rows from left table and matching rows from right table.

3. Right Join

Returns all rows from right table and matching rows from left table.

4. Outer (Full) Join

Returns all rows from both tables, filling missing values with NaN.


5. Exam Diagrams (Must Draw)

INNER JOIN

Table A   Table B
  ○────────○
(Only common rows)


LEFT JOIN

Table A   Table B
  ○────────○
  ○────────
(All rows from A)


RIGHT JOIN

Table A   Table B
  ○────────○
   ────────○
(All rows from B)


OUTER JOIN

Table A   Table B
  ○────────○
  ○        ○
(All rows from A and B)


6. Sample Data (Use in Exams)

import pandas as pd

df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
    'ID': [2, 3, 4],
    'Marks': [80, 90, 85]
})


7. JOIN Using merge() in Pandas


1. Inner Join

pd.merge(df1, df2, on='ID', how='inner')

Output: Only IDs 2 and 3


2. Left Join

pd.merge(df1, df2, on='ID', how='left')

Output: All IDs from df1


3. Right Join

pd.merge(df1, df2, on='ID', how='right')

Output: All IDs from df2


4. Outer Join

pd.merge(df1, df2, on='ID', how='outer')

Output: IDs 1,2,3,4 (NaN where no match)


8. JOIN Using join() Method

df1.set_index('ID').join(df2.set_index('ID'), how='inner')

📌 join() works mainly on index, while merge() works on columns.


9. JOIN Flowchart (Exam-Friendly)

DataFrame 1   DataFrame 2
     ↓             ↓
  Identify Common Key
          ↓
      Apply JOIN
          ↓
  Combined DataFrame


10. Difference: merge() vs join()

Aspect      | merge()     | join()
Join key    | Column      | Index
Flexibility | High        | Medium
SQL-like    | Yes         | Limited
Usage       | Most common | Simple joins


11. Common Mistakes (Mention in Theory)

  • Joining on wrong key

  • Ignoring duplicate keys

  • Not handling NaN after join
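One way to catch the last two mistakes is merge()'s indicator=True option, which adds a _merge column showing where each row came from. A quick sketch using the sample data from earlier:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Marks': [80, 90, 85]})

# _merge is 'left_only', 'right_only', or 'both' for each row,
# making unmatched keys (and the NaNs they bring) easy to spot
out = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(out)
print(out['_merge'].value_counts())
```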


12. One-Line Memory Points (For Exams)

  • JOIN combines datasets using a common key

  • merge() is most used

  • Inner join → common rows only

  • Left join → all left table rows

  • Outer join → all rows from both tables


13. MOST IMPORTANT EXAM QUESTIONS

  1. Define JOIN operation in data wrangling.

  2. Explain types of JOINs with diagrams.

  3. Write a Python program for INNER and OUTER JOIN using Pandas.

  4. Differentiate between merge() and join().

Hierarchical Indexing in Data Science (Using Python)


1. Definition

Hierarchical Indexing, also called MultiIndex, is a Pandas feature that allows a DataFrame or Series to have multiple levels of indexing on rows or columns.

📌 It enables representation and analysis of higher-dimensional data in lower-dimensional structures.


2. Why Hierarchical Indexing is Important

  • Represents multi-dimensional data

  • Enables group-wise analysis

  • Simplifies data reshaping

  • Useful for aggregation and pivot operations


3. Keywords (High Probability in Exams)

Hierarchical Indexing, MultiIndex, Levels, Labels,
Index, Sub-index, Unstack, Stack, xs(),
swaplevel(), sort_index()


4. Structure of Hierarchical Index

Exam Diagram (Must Draw)

Index Level 1   Index Level 2   Value
------------------------------------
India           2023            100
India           2024            120
USA             2023             90
USA             2024            110

📌 Each row is identified by (Country, Year).


5. Creating Hierarchical Index


1. From Tuples

import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('India', 2023), ('India', 2024),
    ('USA', 2023), ('USA', 2024)
])
data = [100, 120, 90, 110]
s = pd.Series(data, index=index)
print(s)

2. From Lists

arrays = [
    ['India', 'India', 'USA', 'USA'],
    [2023, 2024, 2023, 2024]
]
index = pd.MultiIndex.from_arrays(arrays)

6. Selecting Data (Very Important)


1. Selecting Outer Level

s['India']

2. Selecting Inner Level using xs()

s.xs(2023, level=1)

3. Selecting a Single Value (Full Key)

s.loc[('India', 2024)]

7. Reshaping with Stack & Unstack


Unstack

Converts inner index → columns

df = s.unstack()
print(df)

Stack

Converts columns → index

df.stack()

Exam Diagram

MultiIndex Series
   ↓ unstack
Wide DataFrame
   ↓ stack
MultiIndex Series

8. Swapping and Sorting Levels


Swap Levels

s.swaplevel()

Sort Index

s.sort_index()

9. Hierarchical Columns (Mention in Exams)

data = {
    ('Sales', 2023): [100, 120],
    ('Sales', 2024): [130, 150]
}
df = pd.DataFrame(data, index=['India', 'USA'])
print(df)

10. Difference: Normal Index vs Hierarchical Index

Aspect         | Normal Index | Hierarchical Index
Index levels   | One          | Multiple
Dimensionality | 1D           | Multi-dimensional
Complexity     | Simple       | Advanced
Use case       | Simple data  | Complex grouped data

11. Flowchart for Hierarchical Indexing (Exam-Friendly)

Multi-Dimensional Data
   ↓
Apply MultiIndex
   ↓
Hierarchical Indexing
   ↓
Efficient Analysis & Reshaping

12. Real-World Data Science Use Cases

  • Time-series data (Year, Month)

  • Geographic data (Country, State)

  • Financial data (Company, Quarter)

  • Sales & performance analysis


13. One-Line Memory Points (For Exams)

  • Hierarchical Index = MultiIndex

  • Allows multiple index levels

  • Reduces need for redundant columns

  • Works well with groupby, pivot, stack, unstack
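The groupby connection is easy to see: grouping by two keys yields a MultiIndexed result, which unstack() then spreads into columns. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA'],
    'Year': [2023, 2024, 2023, 2024],
    'Sales': [100, 120, 90, 110]
})

# Grouping by two keys produces a Series with a (Country, Year) MultiIndex
s = df.groupby(['Country', 'Year'])['Sales'].sum()
print(s.index.nlevels)  # 2
print(s.unstack())      # Years become columns
```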


14. MOST IMPORTANT EXAM QUESTIONS

  1. Define Hierarchical Indexing in Pandas.

  2. Explain creation and selection in MultiIndex with examples.

  3. Differentiate between normal index and hierarchical index.

  4. Explain stack and unstack operations.

Combining and Merging Datasets in Data Science (Using Python)


1. Definition

Combining Datasets

Combining refers to bringing together multiple datasets into a single dataset using methods like concatenation and joining.

Merging Datasets

Merging is a specific type of combining where datasets are joined based on a common key or column, similar to SQL JOIN operations.

📌 These operations are core parts of Data Wrangling.


2. Why Combining & Merging are Important

  • Real-world data comes from multiple sources

  • Required for data integration

  • Helps in feature enrichment

  • Essential before data analysis and machine learning


3. Keywords (High Probability for Exams)

Combine, Merge, Concatenate, Join,
Inner Join, Left Join, Right Join, Outer Join,
Key, Primary Key, Foreign Key,
Pandas, concat(), merge(), join()


4. Methods to Combine and Merge in Pandas

Method | Purpose
concat() | Stack datasets
merge() | Join using key column
join() | Join using index

5. Combining Datasets using concat()


Definition

concat() combines DataFrames row-wise or column-wise.


Exam Diagram

DF1        DF2
A  B       A  B
1  2       3  4
     ↓ concat
A  B
1  2
3  4

Python Example

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
combined = pd.concat([df1, df2])
print(combined)
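Since concat() works both row-wise and column-wise, a short sketch of the two directions (using the same df1/df2 as above):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Row-wise (default, axis=0); ignore_index renumbers rows 0..3
rows = pd.concat([df1, df2], ignore_index=True)
print(rows)   # 4 rows, 2 columns

# Column-wise (axis=1): place the frames side by side
cols = pd.concat([df1, df2], axis=1)
print(cols)   # 2 rows, 4 columns (A, B, A, B)
```

Without ignore_index=True, the row-wise result keeps the original labels 0, 1, 0, 1, which is a common source of duplicate-index bugs after concatenation.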

6. Merging Datasets using merge()


Definition

merge() combines datasets based on a common key column.


Types of Merge (Very Important)


1. Inner Merge

pd.merge(df1, df2, on='ID', how='inner')

2. Left Merge

pd.merge(df1, df2, on='ID', how='left')

3. Right Merge

pd.merge(df1, df2, on='ID', how='right')

4. Outer Merge

pd.merge(df1, df2, on='ID', how='outer')

Exam Diagrams (Draw These)

INNER MERGE

A ∩ B (common rows only)

LEFT MERGE

All rows from A + Matching rows from B

RIGHT MERGE

All rows from B + Matching rows from A

OUTER MERGE

All rows from A and B

7. Sample Dataset for Merge

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Marks': [80, 90, 85]})
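Running the four merge types on this sample dataset makes the differences concrete (df1 has IDs 1–3, df2 has IDs 2–4, so only 2 and 3 match):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Marks': [80, 90, 85]})

inner = pd.merge(df1, df2, on='ID', how='inner')  # IDs 2, 3
left  = pd.merge(df1, df2, on='ID', how='left')   # IDs 1, 2, 3 (Marks NaN for 1)
right = pd.merge(df1, df2, on='ID', how='right')  # IDs 2, 3, 4 (Name NaN for 4)
outer = pd.merge(df1, df2, on='ID', how='outer')  # IDs 1, 2, 3, 4

print(inner)
print(left)
print(right)
print(outer)
```

The unmatched side of a left, right, or outer merge is filled with NaN, which is why merged columns often need fillna() afterwards.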

8. Joining Datasets using join()

df1.set_index('ID').join(df2.set_index('ID'), how='inner')

📌 join() is index-based, while merge() is column-based.
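The index-based vs column-based distinction can be shown side by side. A sketch (reusing the sample df1/df2 from section 7) producing the same inner join both ways:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Marks': [80, 90, 85]})

# join(): the key must live in the index of both frames
joined = df1.set_index('ID').join(df2.set_index('ID'), how='inner')
print(joined)   # index: ID; columns: Name, Marks

# merge(): the key stays a regular column
merged = pd.merge(df1, df2, on='ID', how='inner')
print(merged)   # columns: ID, Name, Marks
```

Both carry the same data; the only difference is whether ID ends up as the index or as an ordinary column.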


9. Difference: concat() vs merge() vs join()

Feature | concat() | merge() | join()
Combines | Stacking | Key-based | Index-based
Similar to SQL | No | Yes | Partially
Flexibility | Medium | High | Medium
Use case | Same columns | Related tables | Indexed data

10. Flowchart (Exam-Friendly)

Multiple Datasets ↓ Decide Method (concat / merge / join) ↓ Combine on Rows or Keys ↓ Single Integrated Dataset

11. Real-World Data Science Use Cases

  • Merging customer & transaction data

  • Combining monthly sales files

  • Integrating datasets from APIs and databases

  • Preparing data for ML models


12. One-Line Memory Points (For Exams)

  • Combining = bring datasets together

  • Merging = key-based combining

  • concat() → stack datasets

  • merge() → SQL-style join

  • join() → index-based join


13. MOST IMPORTANT EXAM QUESTIONS

  1. Define combining and merging datasets.

  2. Explain concat() and merge() with examples.

  3. Differentiate between merge() and join().

  4. Write Python code for inner and outer merge.

Reshaping and Pivoting in Data Science (Using Python)


1. Definition

Reshaping

Reshaping is the process of changing the structure or layout of a dataset (rows ↔ columns) without changing the actual data values.

Pivoting

Pivoting is a specific reshaping operation where unique values from one column become new columns, creating a wide-format table.

📌 Both are key steps in Data Wrangling.


2. Why Reshaping & Pivoting are Important

  • Converts data into analysis-ready (tidy) format

  • Makes data suitable for visualization and ML

  • Simplifies group-wise comparisons

  • Required for aggregation and reporting


3. Keywords (High Probability in Exams)

Reshaping, Pivot, Pivot Table, Melt,
Wide Format, Long Format, Stack, Unstack,
Aggregation, Index, Columns, Values


4. Data Formats (Important Concept)

Long Format

Year  Dept  Sales
2023  CS    100
2023  IT    120

Wide Format

Year  CS   IT
2023  100  120

5. Exam Diagram (Must Draw)

Long Format Data ↓ Reshaping (Pivot / Unstack) ↓ Wide Format Data

6. Pivoting in Pandas


1. pivot()

Definition

Transforms long → wide format using index, columns, and values.


Python Example

import pandas as pd

df = pd.DataFrame({
    'Year': [2023, 2023, 2024, 2024],
    'Dept': ['CS', 'IT', 'CS', 'IT'],
    'Sales': [100, 120, 110, 130]
})
pivot_df = df.pivot(index='Year', columns='Dept', values='Sales')
print(pivot_df)

2. pivot_table() (Very Important)

Definition

Similar to pivot() but supports aggregation and handles duplicates.


Python Example

pivot_table = df.pivot_table(
    index='Year',
    columns='Dept',
    values='Sales',
    aggfunc='mean'
)
print(pivot_table)
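The "handles duplicates" point is the key practical difference. A sketch (with assumed sample data containing two CS rows for 2023) showing pivot() failing where pivot_table() aggregates:

```python
import pandas as pd

# Duplicate (Year, Dept) pair: two CS rows for 2023
df_dup = pd.DataFrame({
    'Year': [2023, 2023, 2023],
    'Dept': ['CS', 'CS', 'IT'],
    'Sales': [100, 110, 120]
})

# pivot() raises ValueError on duplicate index/column pairs
try:
    df_dup.pivot(index='Year', columns='Dept', values='Sales')
except ValueError as e:
    print('pivot failed:', e)

# pivot_table() aggregates the duplicates instead (mean of 100 and 110)
pt = df_dup.pivot_table(index='Year', columns='Dept',
                        values='Sales', aggfunc='mean')
print(pt)
```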

7. Melting (Reverse of Pivot)


melt()

Definition

Converts wide → long format.


Exam Diagram

Wide Data ↓ melt Long Data

Python Example

pd.melt(pivot_df.reset_index(), id_vars='Year')
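A fuller melt() sketch (with an assumed wide-format table matching section 4) using var_name and value_name to control the names of the two new columns:

```python
import pandas as pd

wide = pd.DataFrame({'Year': [2023, 2024],
                     'CS': [100, 110],
                     'IT': [120, 130]})

# Wide -> long: each (Year, Dept) pair becomes one row
long_df = pd.melt(wide, id_vars='Year',
                  var_name='Dept', value_name='Sales')
print(long_df)   # 4 rows: Year, Dept, Sales
```

Without var_name/value_name, the melted columns default to the generic names "variable" and "value".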

8. Stack and Unstack


Stack

Columns → row index

pivot_df.stack()

Unstack

Row index → columns

df.set_index(['Year','Dept'])['Sales'].unstack()
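Stack and unstack are inverses, which a round trip makes visible (using the same long-format Year/Dept/Sales data as in section 6):

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2023, 2023, 2024, 2024],
    'Dept': ['CS', 'IT', 'CS', 'IT'],
    'Sales': [100, 120, 110, 130]
})

# Long -> wide: the inner index level (Dept) becomes columns
wide = df.set_index(['Year', 'Dept'])['Sales'].unstack()
print(wide)

# Wide -> long again: the columns go back into the row index
long_again = wide.stack()
print(long_again)   # Series with MultiIndex (Year, Dept)
```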

9. Difference Table (High Probability Exam Question)

Operation | Purpose | Format Change
pivot() | Reshape | Long → Wide
pivot_table() | Reshape + aggregate | Long → Wide
melt() | Reverse reshape | Wide → Long
stack() | Columns → rows | Wide → Long
unstack() | Rows → columns | Long → Wide

10. Flowchart for Reshaping & Pivoting

Raw Dataset ↓ Identify Format (Long / Wide) ↓ Apply Reshaping Method (pivot / melt / stack) ↓ Structured Dataset

11. Real-World Data Science Applications

  • Sales and financial reports

  • Survey data analysis

  • Time-series analysis

  • Preparing data for dashboards


12. One-Line Memory Points (For Exams)

  • Reshaping changes structure, not data

  • Pivot creates new columns

  • pivot_table() supports aggregation

  • melt() is reverse of pivot

  • Stack/unstack work with MultiIndex


13. MOST IMPORTANT EXAM QUESTIONS

  1. Define reshaping and pivoting.

  2. Explain pivot and pivot_table with examples.

  3. Differentiate between pivot and melt.

  4. Write Python code to reshape data using stack and unstack.

