Data Cleaning, Preparation & Data Wrangling (Using Python)
1. Data Cleaning
Definition
Data Cleaning is the process of detecting and correcting (or removing) incorrect, incomplete, duplicate, or irrelevant data to improve data quality before analysis.
Why Data Cleaning is Important
- Real-world data is messy
- Improves accuracy of models
- Reduces bias and errors
- Saves time during analysis
Common Data Quality Issues
- Missing values
- Duplicate records
- Incorrect data types
- Outliers
- Inconsistent formats
Keywords
Missing Values, Duplicates, Outliers, Noise, Null, NaN, Data Quality
Exam Diagram (Draw This)
Raw Data
↓
[Missing Values]
[Duplicates]
[Incorrect Data]
↓
Clean Data
Python Examples
1. Handling Missing Values
import pandas as pd
df = pd.DataFrame({
'Age': [25, None, 30],
'Salary': [40000, 50000, None]
})
df.fillna(df.mean(), inplace=True)
print(df)
2. Removing Duplicates
df.drop_duplicates(inplace=True)
3. Correcting Data Types
df['Age'] = df['Age'].astype(int)
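4. Fixing Inconsistent Formats
Inconsistent formats (listed among the common data quality issues above) can also be repaired with vectorized string methods. A minimal sketch, using hypothetical 'City' and 'Phone' columns with mixed casing and punctuation:

```python
import pandas as pd

# Hypothetical columns with inconsistent formats
df = pd.DataFrame({
    'City': ['  Pune ', 'pune', 'MUMBAI'],
    'Phone': ['98-7654-3210', '(987) 654 3210', '9876543210']
})

# Standardize text: strip spaces, unify casing
df['City'] = df['City'].str.strip().str.title()

# Keep only digits so every phone number has one format
df['Phone'] = df['Phone'].str.replace(r'\D', '', regex=True)
print(df)
```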
MOST IMPORTANT EXAM QUESTION
What is Data Cleaning? Explain its importance and techniques.
2. Data Preparation
Definition
Data Preparation is the process of transforming cleaned data into a suitable format for analysis or machine learning models.
Objectives
- Make data model-ready
- Improve feature quality
- Convert data into numerical form
Key Steps in Data Preparation
- Feature selection
- Feature scaling
- Encoding categorical data
- Splitting data (train/test)
Keywords
Feature Engineering, Normalization, Encoding, Scaling, Train-Test Split
Exam Diagram
Clean Data
↓
Feature Selection
↓
Encoding & Scaling
↓
Prepared Data
Python Examples
1. Encoding Categorical Data
df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})
df_encoded = pd.get_dummies(df)
print(df_encoded)
2. Feature Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform([[100], [200], [300]])
print(df_scaled)
3. Train-Test Split
from sklearn.model_selection import train_test_split
X = df_encoded
y = [1, 0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
MOST IMPORTANT EXAM QUESTION
Explain Data Preparation steps with suitable examples.
3. Data Wrangling
Definition
Data Wrangling is the process of collecting, transforming, structuring, and enriching raw data into a usable format for analysis.
📌 Data Wrangling = Cleaning + Preparation + Transformation
Tasks in Data Wrangling
- Merging datasets
- Reshaping data
- Aggregation
- Filtering
Keywords
Merge, Join, GroupBy, Reshape, Pivot, Transform
Exam Diagram (Very Important)
Multiple Data Sources
↓
Data Wrangling
(Clean + Transform)
↓
Analysis-Ready Data
Python Examples
1. Merging Data
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']})
df2 = pd.DataFrame({'ID': [1, 2], 'Marks': [80, 90]})
merged = pd.merge(df1, df2, on='ID')
print(merged)
2. Grouping Data
df = pd.DataFrame({
'Department': ['CS', 'CS', 'IT'],
'Salary': [50000, 60000, 55000]
})
print(df.groupby('Department').mean())
3. Reshaping Data
df.pivot_table(values='Salary', index='Department')
MOST IMPORTANT EXAM QUESTION
What is Data Wrangling? How is it different from Data Cleaning?
4. Difference Table (High Probability Exam Question)
| Aspect | Data Cleaning | Data Preparation | Data Wrangling |
|---|---|---|---|
| Purpose | Fix errors | Make data model-ready | Transform & structure |
| Scope | Narrow | Medium | Broad |
| Examples | Remove nulls | Encode features | Merge datasets |
| Tools | Pandas | Pandas, Sklearn | Pandas |
5. One-Line Memory Tips (For Exams)
- Data Cleaning → Fix bad data
- Data Preparation → Make data usable for ML
- Data Wrangling → Convert raw data into structured data
6. Final Exam Flowchart (Draw This)
Raw Data
↓
Data Cleaning
↓
Data Preparation
↓
Data Wrangling
↓
Data Analysis / ML Model
Handling Missing Data in Data Science (Using Python)
1. Definition
Missing data refers to the absence of values in a dataset where information should be present.
In Python, missing values are usually represented as NaN (Not a Number) or None.
📌 Handling missing data is the process of detecting, analyzing, and treating missing values to ensure accurate data analysis and model performance.
2. Why Handling Missing Data is Important
Prevents biased results
Improves model accuracy
Avoids errors in algorithms
Ensures data integrity
3. Types of Missing Data (Very Important for Exams)
1. MCAR – Missing Completely at Random
Missing values occur randomly
No relationship with other variables
2. MAR – Missing at Random
Missingness depends on other variables
3. MNAR – Missing Not at Random
Missingness depends on the missing value itself
Exam Diagram (Draw This)
Missing Data Types
├── MCAR
├── MAR
└── MNAR
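The MCAR/MAR distinction can be made concrete with a small simulated sketch. The dataset and missingness rates below are made up for illustration: income is missing completely at random in one column, and missing more often for younger respondents (MAR) in the other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 65, size=n)
income = rng.normal(50000, 10000, size=n)

# MCAR: every value is equally likely to be missing
mcar_mask = rng.random(n) < 0.1

# MAR: missingness depends on another observed variable (age)
mar_mask = rng.random(n) < np.where(age < 30, 0.3, 0.05)

df = pd.DataFrame({'Age': age, 'Income': income})
df['Income_MCAR'] = df['Income'].mask(mcar_mask)
df['Income_MAR'] = df['Income'].mask(mar_mask)

# Under MAR, younger rows show a visibly higher missing rate
print(df.groupby(df['Age'] < 30)['Income_MAR'].apply(lambda s: s.isna().mean()))
```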
4. Keywords (Important)
NaN, Null, None, Missing Values, Imputation,
Deletion, Mean, Median, Mode, Forward Fill, Backward Fill
5. Detecting Missing Data in Python
Python Example
import pandas as pd
df = pd.DataFrame({
'Age': [20, None, 25],
'Salary': [30000, 40000, None]
})
print(df.isnull())
print(df.isnull().sum())
6. Techniques for Handling Missing Data
1. Removing Missing Data (Deletion Method)
Definition
Rows or columns containing missing values are removed.
When to Use
Missing values are very few
Dataset is large
Python Example
df.dropna(inplace=True)
Exam Diagram
Data with Missing Values
↓
Drop Rows
↓
Clean Data
2. Filling Missing Data (Imputation Method)
Definition
Missing values are replaced with statistical values.
(Assign the result back to the column; column-level inplace=True is deprecated in recent pandas.)
a) Mean Imputation
df['Age'] = df['Age'].fillna(df['Age'].mean())
b) Median Imputation (Best for Outliers)
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
c) Mode Imputation (Categorical Data)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
Exam Diagram
Missing Values
↓
Statistical Method
(Mean / Median / Mode)
↓
Filled Data
3. Forward Fill & Backward Fill
Forward Fill (ffill)
Uses the previous value
df.ffill(inplace=True)   # fillna(method='ffill') is deprecated in recent pandas
Backward Fill (bfill)
Uses the next value
df.bfill(inplace=True)   # fillna(method='bfill') is deprecated in recent pandas
4. Replacing Missing Values with Constant
df.fillna(0, inplace=True)
5. Advanced Imputation (Mention in Exams)
KNN Imputation
Regression Imputation
Machine Learning models
📌 (Mentioning these fetches extra marks)
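KNN imputation can be shown in a few lines with scikit-learn's KNNImputer. A minimal sketch with made-up data: each missing value is filled with the mean of that feature over the k most similar rows.

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age':    [25, None, 30, 28],
    'Salary': [40000, 50000, None, 45000]
})

# Fill each missing value using the 2 nearest rows (by Euclidean distance)
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```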
7. Choosing the Right Method (Important Table)
| Scenario | Best Method |
|---|---|
| Few missing values | Drop rows |
| Numerical data | Mean / Median |
| Categorical data | Mode |
| Time series data | Forward / Backward fill |
| Outliers present | Median |
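The "outliers present → median" rule is easy to verify with a toy column: one extreme salary drags the mean far above typical values, while the median stays representative.

```python
import pandas as pd

# One extreme salary (an outlier) plus a missing value
s = pd.Series([30000, 32000, 31000, None, 1000000])

# Mean imputation is distorted by the outlier
print(s.fillna(s.mean()))    # fills the gap with 273250.0

# Median imputation is robust to the outlier
print(s.fillna(s.median()))  # fills the gap with 31500.0
```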
8. Common Mistakes (Write in Theory Answers)
Blindly deleting large data
Using mean with skewed data
Ignoring missing data patterns
9. Final Exam Flowchart (Very Important)
Raw Dataset
↓
Detect Missing Values
↓
Analyze Type (MCAR / MAR / MNAR)
↓
Choose Method
↓
Handle Missing Data
↓
Clean Dataset
10. One-Line Memory Points (For Exams)
NaN represents missing data in Python
Imputation replaces missing values
Median is robust to outliers
Handling missing data improves model accuracy
MOST IMPORTANT EXAM QUESTIONS
Define missing data and explain its types.
Explain techniques for handling missing data with Python examples.
Compare deletion and imputation methods.
Data Transformation in Data Science (Using Python)
1. Definition
Data Transformation is the process of converting raw data into a suitable format for analysis or
machine learning by applying operations such as scaling, normalization, aggregation, encoding, or
feature engineering.
📌 Purpose: To improve data quality, model performance, and analytical insights.
2. Importance of Data Transformation
Makes data model-ready
Handles skewed distributions
Improves accuracy of algorithms
Enables comparison between features
Facilitates feature extraction and engineering
3. Types of Data Transformation
| Type | Description | Example |
|---|---|---|
| Normalization | Rescale values to a 0–1 range | MinMaxScaler |
| Standardization | Rescale to mean = 0, std = 1 | StandardScaler |
| Log Transformation | Reduce skewness | log(x) |
| Binning / Discretization | Convert continuous → categorical | Age → young, adult, senior |
| Encoding | Convert categorical → numerical | One-hot encoding |
| Aggregation | Summarize data | sum, mean, max by group |
| Feature Engineering | Create new features | BMI = weight / height² |
4. Keywords (High Probability in Exams)
Normalization, Standardization, Scaling, Encoding, Log Transform, Binning, Feature Engineering, Aggregation, MinMaxScaler, StandardScaler, One-Hot Encoding
5. Exam Diagram (Draw This)
Raw Data
↓
Transformation
├── Normalization / Standardization
├── Log / Power Transform
├── Binning
├── Encoding
└── Feature Engineering
↓
Transformed Data (Model Ready)
6. Python Examples
1. Normalization (Min-Max Scaling)
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
df = pd.DataFrame({'Age':[20, 30, 40]})
scaler = MinMaxScaler()
df['Age_scaled'] = scaler.fit_transform(df[['Age']])
print(df)
2. Standardization (Z-score Scaling)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['Age_std'] = scaler.fit_transform(df[['Age']])
print(df)
3. Log Transformation
import numpy as np
df['Salary'] = [30000, 40000, 50000]  # example column added for illustration
df['Salary_log'] = np.log(df['Salary'])
4. Binning / Discretization
bins = [0, 18, 35, 60]
labels = ['Child', 'Adult', 'Senior']
df['Age_group'] = pd.cut(df['Age'], bins=bins, labels=labels)
5. One-Hot Encoding (Categorical to Numerical)
df = pd.DataFrame({'Gender':['Male','Female','Male']})
df_encoded = pd.get_dummies(df)
print(df_encoded)
6. Feature Engineering
df['Weight'] = [60, 55, 70]    # example columns added for illustration
df['Height'] = [170, 160, 175]
df['BMI'] = df['Weight'] / (df['Height']/100)**2
7. Flow of Data Transformation (Important for Exams)
Raw Dataset
↓
Identify Transformations Needed
↓
Apply Scaling / Encoding / Log / Binning
↓
Feature Engineering
↓
Transformed Dataset (Ready for ML)
8. Difference Table: Transformation vs Cleaning vs Preparation
| Aspect | Cleaning | Preparation | Transformation |
|---|---|---|---|
| Goal | Remove errors | Make data model-ready | Convert/modify data for analysis |
| Scope | Narrow | Medium | Broad |
| Techniques | Drop / Fill | Encoding / Scaling | Normalization, Log, Feature Eng. |
| Example | Fill NaN | One-Hot Encode | Scale Age, create BMI |
9. One-Line Memory Points (For Exams)
Transformation = Change the form of data
Normalization = 0–1 scale
Standardization = mean=0, std=1
Log Transform = reduce skew
Feature Engineering = create new meaningful features
10. MOST IMPORTANT EXAM QUESTIONS
Define Data Transformation and explain its importance.
Explain types of data transformations with Python examples.
Difference between data cleaning, preparation, and transformation.
Write a Python program to normalize and encode a dataset.
String Manipulation in Data Science (Using Python)
1. Definition
String Manipulation is the process of modifying, analyzing, and extracting information from text data in Python.
In Python, strings are objects of the str class, and they come with built-in methods for operations such as formatting, searching, splitting, and joining.
📌 Importance in Data Science:
Text data is common (names, emails, logs, tweets)
Preprocessing textual data for analysis or ML requires string operations
Helps in cleaning, parsing, and feature extraction
2. String Object
Strings in Python are immutable, meaning they cannot be changed after creation.
Strings can be created using single quotes, double quotes, or triple quotes.
s1 = 'Hello'
s2 = "Data Science"
s3 = """Python is fun"""
3. Keywords (High Probability in Exams)
String, Immutable, Concatenation, Slicing, Indexing, split(), join(), strip(), replace(), lower(), upper(), find(), format(), f-string, startswith(), endswith(), isdigit(), isalpha()
4. Common String Operations
| Operation | Description | Python Example |
|---|---|---|
| Concatenation | Join strings | 'Hello' + ' World' |
| Repetition | Repeat string | 'Hi' * 3 → 'HiHiHi' |
| Indexing | Access specific character | s[0] → 'H' |
| Slicing | Extract substring | s[0:5] → 'Hello' |
| Length | Get string length | len(s) |
5. String Methods (Most Important)
| Method | Description | Example |
|---|---|---|
| lower() | Convert to lowercase | 'HELLO'.lower() → 'hello' |
| upper() | Convert to uppercase | 'hello'.upper() → 'HELLO' |
| title() | Capitalize each word | 'data science'.title() → 'Data Science' |
| strip() | Remove leading/trailing spaces | ' hello '.strip() → 'hello' |
| replace() | Replace substring | 'hello'.replace('h','H') → 'Hello' |
| split() | Split string into list | 'a,b,c'.split(',') → ['a','b','c'] |
| join() | Join list into string | ','.join(['a','b']) → 'a,b' |
| find() | Find substring index | 'hello'.find('e') → 1 |
| startswith() / endswith() | Check prefix/suffix | 'hello'.startswith('h') → True |
| isdigit() / isalpha() | Check string type | '123'.isdigit() → True |
6. Python Examples
1. Concatenation & Repetition
s1 = "Data"
s2 = "Science"
print(s1 + " " + s2) # Output: Data Science
print("Hi! " * 3) # Output: Hi! Hi! Hi!
2. Indexing & Slicing
s = "Python"
print(s[0]) # Output: P
print(s[0:4]) # Output: Pyth
3. Using String Methods
s = " Data Science "
print(s.strip()) # 'Data Science'
print(s.upper()) # ' DATA SCIENCE '
print(s.lower()) # ' data science '
print(s.replace("Data","Big Data")) # ' Big Data Science '
print(s.split()) # ['Data', 'Science']
4. Joining a List of Strings
lst = ['Data', 'Science', 'Python']
s = " ".join(lst)
print(s) # Output: Data Science Python
7. Flow Diagram for String Manipulation (Draw in Exam)
Raw String
↓
Operations
├── Indexing & Slicing
├── Concatenation / Repetition
├── Case Conversion (upper/lower/title)
├── Cleaning (strip/replace)
├── Splitting / Joining
└── Searching (find/startswith/endswith)
↓
Processed String (Ready for Analysis)
8. Real-World Data Science Applications
Cleaning text data (removing extra spaces, converting to lowercase)
Parsing logs (split IP addresses, timestamps)
Feature extraction (count words, detect patterns)
Preparing input for NLP models
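The log-parsing use case can be sketched with plain string methods. The log line format below is hypothetical, chosen so that split(), find(), and slicing each do one job:

```python
# Parsing a hypothetical web-server log line with plain string methods
line = "192.168.1.10 - [2024-05-01 10:30:00] GET /index.html 200"

parts = line.split()                                   # split on whitespace
ip = parts[0]                                          # first token is the IP
timestamp = line[line.find('[') + 1 : line.find(']')]  # slice between brackets
status = parts[-1]                                     # last token is the status

print(ip)         # 192.168.1.10
print(timestamp)  # 2024-05-01 10:30:00
print(status)     # 200
```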
9. One-Line Memory Tips (For Exams)
Strings are immutable
split() → string → list
join() → list → string
strip() → remove whitespace
replace() → modify text
upper()/lower() → case conversion
10. Most Important Exam Questions
Define string manipulation in Python and explain its importance in Data Science.
Explain commonly used string methods in Python with examples.
Write a Python program to clean and preprocess a string dataset.
Explain difference between split() and join() with examples.
Regular Expressions in Data Science (Using Python)
1. Definition
A Regular Expression (Regex) is a sequence of characters that defines a search pattern used for matching, searching, extracting, replacing, and validating text data.
📌 In Data Science, regex is widely used for:
Text cleaning
Pattern extraction
Data validation
Preprocessing unstructured text
Python provides the re module for working with regular expressions.
2. Why Regex is Important in Data Science
Handles unstructured text data
Automates data cleaning
Extracts useful features from text
Validates emails, phone numbers, IDs
Improves text preprocessing for NLP
3. Keywords (High Probability in Exams)
Regular Expression, Pattern, re module, match(), search(),
findall(), sub(), split(), Metacharacters, Quantifiers,
Character Classes, Anchors, Groups
4. Python re Module Functions (Very Important)
| Function | Description |
|---|---|
| re.match() | Matches pattern at start of string |
| re.search() | Searches pattern anywhere |
| re.findall() | Returns all matches as list |
| re.sub() | Replaces pattern |
| re.split() | Splits string using pattern |
5. Basic Regex Syntax
1. Character Classes
| Pattern | Meaning |
|---|---|
| \d | Digit (0–9) |
| \D | Non-digit |
| \w | Alphanumeric |
| \W | Non-alphanumeric |
| \s | Whitespace |
| \S | Non-whitespace |
2. Quantifiers
| Symbol | Meaning |
|---|---|
| * | 0 or more |
| + | 1 or more |
| ? | 0 or 1 |
| {n} | Exactly n times |
| {n,m} | Between n and m |
3. Anchors
| Symbol | Meaning |
|---|---|
| ^ | Start of string |
| $ | End of string |
6. Exam Diagram (Draw This)
Text Data
↓
Regex Pattern
↓
Matching / Searching
↓
Extract / Replace / Validate
↓
Clean Text Data
7. Python Examples
1. Matching a Pattern
import re
text = "Data Science"
result = re.match("Data", text)
print(result.group())
2. Searching a Pattern
text = "I love Data Science"
result = re.search("Science", text)
print(result.group())
3. Find All Matches
text = "Marks: 80, 90, 85"
numbers = re.findall(r'\d+', text)
print(numbers)
4. Replacing Text
text = "Contact: 9876543210"
masked = re.sub(r'\d', 'X', text)
print(masked)
5. Splitting Using Regex
text = "Data,Science;Python"
words = re.split(r'[;,]', text)
print(words)
8. Common Regex Patterns (Important for Exams)
| Task | Regex Pattern |
|---|---|
| Email validation | [a-zA-Z0-9_.]+@[a-zA-Z]+\.[a-zA-Z]+ |
| Phone number | \d{10} |
| Extract numbers | \d+ |
| Extract words | \w+ |
| Remove special characters | [^a-zA-Z0-9 ] |
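These validation patterns are applied with re.fullmatch(), which succeeds only if the whole string matches. A sketch using the simplified (not RFC-complete) email and phone patterns above:

```python
import re

# Simplified patterns for illustration, not production-grade validation
email_pat = r'[a-zA-Z0-9_.]+@[a-zA-Z]+\.[a-zA-Z]+'
phone_pat = r'\d{10}'

# fullmatch() requires the entire string to match the pattern
print(bool(re.fullmatch(email_pat, 'abc@gmail.com')))  # True
print(bool(re.fullmatch(email_pat, 'not-an-email')))   # False
print(bool(re.fullmatch(phone_pat, '9876543210')))     # True
print(bool(re.fullmatch(phone_pat, '98765')))          # False
```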
9. Real-World Data Science Use Cases
Cleaning social media text
Extracting dates, numbers, emails
Log file analysis
NLP preprocessing
Data validation rules
10. Flowchart for Regex Usage (Exam-Friendly)
Raw Text
↓
Define Regex Pattern
↓
Apply re functions
↓
Matched / Cleaned Text
↓
Analysis or Model Input
11. One-Line Memory Tips (For Exams)
Regex = pattern matching tool
match() → start of string
search() → anywhere
findall() → list of matches
sub() → replace text
Regex simplifies text preprocessing
12. MOST IMPORTANT EXAM QUESTIONS
Define regular expressions and explain their role in data science.
Explain Python re module functions with examples.
Write regex patterns for email and phone number validation.
Explain character classes and quantifiers in regex.
Vectorized String Functions in Pandas (Data Science – Python)
1. Definition
Vectorized string functions in Pandas are specialized string operations applied element-wise to an entire Series or Index at once using the .str accessor.
📌 These functions allow fast, efficient, and readable text processing without using explicit loops.
👉 Internally optimized → better performance than Python loops.
2. Why Vectorized String Functions are Important
Text data is common (names, emails, reviews, logs)
Apply string operations to thousands/millions of rows
Essential for data cleaning and preprocessing
Works well with missing values (NaN)
3. Keywords (High Probability in Exams)
Vectorization, Pandas Series, .str accessor, Element-wise operation,
Text preprocessing, NaN handling, split(), replace(),
contains(), extract(), len(), lower(), upper()
4. What Does “Vectorized” Mean? (Important Concept)
Vectorized operation =
👉 Apply the same operation to all elements at once (no loops).
Exam Diagram (Very Important)
Text Column (Series)
┌───────────────┐
| "Data Science"|
| "python" |
| "PANDAS" |
└───────────────┘
↓
.str methods
↓
Processed Text Column
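The diagram above can be shown in code: a Python loop and the vectorized .str method produce the same result, but the vectorized form needs no explicit loop and is implemented in optimized library code.

```python
import pandas as pd

s = pd.Series(['Data Science', 'python', 'PANDAS'])

# Loop version: one Python-level call per element
looped = pd.Series([x.lower() for x in s])

# Vectorized version: one call for the whole Series
vectorized = s.str.lower()

print(looped.equals(vectorized))  # True — same result, no explicit loop
```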
5. .str Accessor in Pandas
Pandas provides .str to access vectorized string methods
Works on:
Series
Index
df['column'].str.method()
6. Common Vectorized String Functions
1. Case Conversion
import pandas as pd
df = pd.DataFrame({
'Text': ['Data Science', 'python', 'PANDAS']
})
df['lower'] = df['Text'].str.lower()
df['upper'] = df['Text'].str.upper()
print(df)
2. Length of Strings
df['length'] = df['Text'].str.len()
3. Removing Whitespaces
df['clean'] = df['Text'].str.strip()
4. Splitting Strings
df['words'] = df['Text'].str.split()
5. Replacing Substrings
df['replaced'] = df['Text'].str.replace('Data', 'Big Data')
6. Checking Pattern Existence
df['contains_py'] = df['Text'].str.contains('py', case=False)
7. Extracting Using Regular Expressions
df = pd.DataFrame({
'Email': ['abc@gmail.com', 'xyz@yahoo.com']
})
df['domain'] = df['Email'].str.extract(r'@(.*)')
print(df)
8. Startswith & Endswith
df['starts'] = df['Text'].str.startswith('Data')
df['ends'] = df['Text'].str.endswith('S')
7. Handling Missing Values (Important Advantage)
df = pd.DataFrame({'Text': ['Data', None, 'Science']})
df['lower'] = df['Text'].str.lower()
📌 .str methods safely ignore NaN values.
8. Flowchart of Vectorized String Processing (Exam-Friendly)
Text Data in Pandas Series
↓
Use .str Accessor
↓
Vectorized String Methods
(lower, split, replace, extract)
↓
Clean & Processed Text Data
9. Difference: Normal String Methods vs Vectorized String Functions
| Feature | Python String Methods | Pandas Vectorized |
|---|---|---|
| Applies to | Single string | Entire Series |
| Speed | Slow (loops) | Fast (vectorized) |
| Syntax | "text".lower() | series.str.lower() |
| NaN handling | Error | Safe |
10. Real-World Data Science Use Cases
Cleaning text columns
Email domain extraction
Keyword filtering
NLP preprocessing
Log and social media analysis
11. One-Line Memory Points (For Exams)
.str → vectorized string operations
No loops required
Works on Series and Index
Handles NaN safely
Faster than Python string methods
12. MOST IMPORTANT EXAM QUESTIONS
Define vectorized string functions in Pandas.
Explain the .str accessor with examples.
Compare Python string methods and Pandas vectorized string functions.
Write a Pandas program to clean and preprocess text data.
Data Wrangling in Data Science (Using Python)
1. Definition
Data Wrangling (also called Data Munging) is the process of collecting, cleaning, transforming, restructuring, and enriching raw data into a usable, analysis-ready format.
📌 In simple words:
Data Wrangling = Cleaning + Transformation + Structuring
2. Why Data Wrangling is Important
Real-world data is messy and unstructured
Required before data analysis and machine learning
Improves data quality and consistency
Enables integration of multiple data sources
3. Keywords (High Probability for Exams)
Data Wrangling, Data Munging, Cleaning, Transformation,
Reshaping, Filtering, Aggregation, Merge, Join,
GroupBy, Pivot, Melt, Pandas, NumPy
4. Main Steps in Data Wrangling
Data Collection
Data Cleaning
Data Transformation
Data Structuring
Data Enrichment
Exam Diagram (Very Important – Draw This)
Raw Data
↓
Data Cleaning
↓
Data Transformation
↓
Data Structuring
↓
Analysis-Ready Data
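The steps above can be sketched as one small pipeline. The two in-memory DataFrames below stand in for real data sources:

```python
import pandas as pd

# 1. Collection (here: two in-memory sources)
employees = pd.DataFrame({'ID': [1, 2, 2, 3],
                          'Dept': ['CS', 'IT', 'IT', None]})
salaries = pd.DataFrame({'ID': [1, 2, 3],
                         'Salary': [50000, 60000, 55000]})

# 2. Cleaning: drop duplicates, fill missing values
employees = employees.drop_duplicates()
employees['Dept'] = employees['Dept'].fillna('Unknown')

# 3. Structuring: integrate the two sources on the common key
df = pd.merge(employees, salaries, on='ID')

# 4-5. Transformation / Enrichment: derive a new feature
df['Salary_in_Lakhs'] = df['Salary'] / 100000

print(df)
```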
5. Tools Used in Python for Data Wrangling
Pandas – primary library
NumPy – numerical operations
Regex – text processing
6. Common Data Wrangling Operations (With Examples)
1. Handling Missing Data
import pandas as pd
df = pd.DataFrame({
'Age': [20, None, 25],
'Salary': [30000, 40000, None]
})
df.fillna(df.mean(), inplace=True)
print(df)
2. Removing Duplicates
df.drop_duplicates(inplace=True)
3. Filtering Data
df[df['Age'] > 21]
4. Transforming Data
df['Salary_in_Lakhs'] = df['Salary'] / 100000
5. Merging Datasets
df1 = pd.DataFrame({'ID':[1,2], 'Name':['A','B']})
df2 = pd.DataFrame({'ID':[1,2], 'Marks':[80,90]})
merged = pd.merge(df1, df2, on='ID')
print(merged)
6. Grouping and Aggregation
df.groupby('Department')['Salary'].mean()  # assumes 'Department' and 'Salary' columns
7. Reshaping Data
Pivot
df.pivot_table(values='Salary', index='Department')  # assumes 'Department' and 'Salary' columns
Melt
pd.melt(df, id_vars=['ID'])  # assumes an 'ID' column
7. Data Wrangling Flowchart (Exam-Friendly)
Multiple Data Sources
↓
Data Wrangling
├─ Cleaning
├─ Transformation
├─ Reshaping
└─ Integration
↓
Structured Dataset
8. Difference Table (Very Important)
| Aspect | Data Cleaning | Data Wrangling |
|---|---|---|
| Scope | Narrow | Broad |
| Focus | Fix errors | Prepare & structure data |
| Includes | Missing values | Cleaning + transformation |
| Example | Fill NaN | Merge & reshape data |
9. Real-World Data Science Applications
Combining data from databases & APIs
Preparing datasets for machine learning
Cleaning social media & text data
Business and scientific data analysis
10. One-Line Memory Points (For Exams)
Data wrangling prepares raw data for analysis
It includes cleaning, transforming, reshaping
Pandas is the main tool
Essential before EDA and ML models
11. MOST IMPORTANT EXAM QUESTIONS
Define Data Wrangling and explain its importance.
Explain steps involved in data wrangling with examples.
Differentiate between data cleaning and data wrangling.
Write a Python program to perform data wrangling operations.
Data Wrangling – JOIN Operations (Using Python)
1. Definition
In Data Wrangling, a JOIN is the process of combining two or more datasets (tables/DataFrames) based on a common key or column.
📌 JOIN operations are used to integrate data from multiple sources into a single, structured dataset for analysis.
In Python, JOIN operations are mainly performed using:
Pandas (merge(), join())
2. Why JOIN is Important in Data Science
Real-world data is stored in multiple tables
Required for data integration
Helps in feature enrichment
Essential before analysis and machine learning
3. Keywords (High Probability for Exams)
Join, Merge, Key, Primary Key, Foreign Key,
Inner Join, Left Join, Right Join, Outer Join,
Pandas, merge(), join()
4. Types of JOINs (Very Important)
1. Inner Join
Returns only matching rows from both tables.
2. Left Join
Returns all rows from left table and matching rows from right table.
3. Right Join
Returns all rows from right table and matching rows from left table.
4. Outer (Full) Join
Returns all rows from both tables, filling missing values with NaN.
5. Exam Diagrams (Must Draw)
INNER JOIN
Table A Table B
○────────○
(Only common rows)
LEFT JOIN
Table A Table B
○────────○
○────────
(All rows from A)
RIGHT JOIN
Table A Table B
○────────○
○────────
(All rows from B)
OUTER JOIN
Table A Table B
○────────○
○ ○
(All rows from A and B)
6. Sample Data (Use in Exams)
import pandas as pd
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
'ID': [2, 3, 4],
'Marks': [80, 90, 85]
})
7. JOIN Using merge() in Pandas
1. Inner Join
pd.merge(df1, df2, on='ID', how='inner')
Output: Only IDs 2 and 3
2. Left Join
pd.merge(df1, df2, on='ID', how='left')
Output: All IDs from df1
3. Right Join
pd.merge(df1, df2, on='ID', how='right')
Output: All IDs from df2
4. Outer Join
pd.merge(df1, df2, on='ID', how='outer')
Output: IDs 1,2,3,4 (NaN where no match)
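When checking an outer join, merge() also accepts indicator=True, which adds a _merge column showing which table each row came from. A sketch with the same sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Marks': [80, 90, 85]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
out = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(out)
```

This makes it easy to spot unmatched keys before analysis (e.g. filter `out[out['_merge'] != 'both']`).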
8. JOIN Using join() Method
df1.set_index('ID').join(df2.set_index('ID'), how='inner')
📌 join() works mainly on index, while merge() works on columns.
9. JOIN Flowchart (Exam-Friendly)
DataFrame 1 DataFrame 2
↓ ↓
Identify Common Key
↓
Apply JOIN
↓
Combined DataFrame
10. Difference: merge() vs join()
| Aspect | merge() | join() |
|---|---|---|
| Join key | Column | Index |
| Flexibility | High | Medium |
| SQL-like | Yes | Limited |
| Usage | Most common | Simple joins |
11. Common Mistakes (Mention in Theory)
Joining on wrong key
Ignoring duplicate keys
Not handling NaN after join
12. One-Line Memory Points (For Exams)
JOIN combines datasets using a common key
merge() is most used
Inner join → common rows only
Left join → all left table rows
Outer join → all rows from both tables
13. MOST IMPORTANT EXAM QUESTIONS
Define JOIN operation in data wrangling.
Explain types of JOINs with diagrams.
Write a Python program for INNER and OUTER JOIN using Pandas.
Differentiate between merge() and join().
Hierarchical Indexing in Data Science (Using Python)
1. Definition
Hierarchical Indexing, also called MultiIndex, is a Pandas feature that allows a DataFrame or Series to have multiple levels of indexing on rows or columns.
📌 It enables representation and analysis of higher-dimensional data in lower-dimensional structures.
2. Why Hierarchical Indexing is Important
Represents multi-dimensional data
Enables group-wise analysis
Simplifies data reshaping
Useful for aggregation and pivot operations
3. Keywords (High Probability in Exams)
Hierarchical Indexing, MultiIndex, Levels, Labels,
Index, Sub-index, Unstack, Stack, xs(),
swaplevel(), sort_index()
4. Structure of Hierarchical Index
Exam Diagram (Must Draw)
Index Level 1 Index Level 2 Value
------------------------------------
India 2023 100
India 2024 120
USA 2023 90
USA 2024 110
📌 Each row is identified by (Country, Year).
5. Creating Hierarchical Index
1. From Tuples
import pandas as pd
index = pd.MultiIndex.from_tuples([
('India', 2023),
('India', 2024),
('USA', 2023),
('USA', 2024)
])
data = [100, 120, 90, 110]
s = pd.Series(data, index=index)
print(s)
2. From Lists
arrays = [
['India', 'India', 'USA', 'USA'],
[2023, 2024, 2023, 2024]
]
index = pd.MultiIndex.from_arrays(arrays)
6. Selecting Data (Very Important)
1. Selecting Outer Level
s['India']
2. Selecting Inner Level using xs()
s.xs(2023, level=1)
3. Selecting a Single Element (Full Key)
s.loc[('India', 2024)]
7. Reshaping with Stack & Unstack
Unstack
Converts inner index → columns
df = s.unstack()
print(df)
Stack
Converts columns → index
df.stack()
Exam Diagram
MultiIndex Series
↓ unstack
Wide DataFrame
↓ stack
MultiIndex Series
8. Swapping and Sorting Levels
Swap Levels
s.swaplevel()
Sort Index
s.sort_index()
9. Hierarchical Columns (Mention in Exams)
data = {
('Sales', 2023): [100, 120],
('Sales', 2024): [130, 150]
}
df = pd.DataFrame(data, index=['India', 'USA'])
print(df)
10. Difference: Normal Index vs Hierarchical Index
| Aspect | Normal Index | Hierarchical Index |
|---|---|---|
| Index levels | One | Multiple |
| Dimensionality | 1D | Multi-dimensional |
| Complexity | Simple | Advanced |
| Use case | Simple data | Complex grouped data |
11. Flowchart for Hierarchical Indexing (Exam-Friendly)
Multi-Dimensional Data
↓
Apply MultiIndex
↓
Hierarchical Indexing
↓
Efficient Analysis & Reshaping
12. Real-World Data Science Use Cases
Time-series data (Year, Month)
Geographic data (Country, State)
Financial data (Company, Quarter)
Sales & performance analysis
13. One-Line Memory Points (For Exams)
Hierarchical Index = MultiIndex
Allows multiple index levels
Reduces need for redundant columns
Works well with groupby, pivot, stack, unstack
14. MOST IMPORTANT EXAM QUESTIONS
Define Hierarchical Indexing in Pandas.
Explain creation and selection in MultiIndex with examples.
Differentiate between normal index and hierarchical index.
Explain stack and unstack operations.
Combining and Merging Datasets in Data Science (Using Python)
1. Definition
Combining Datasets
Combining refers to bringing together multiple datasets into a single dataset using methods like concatenation and joining.
Merging Datasets
Merging is a specific type of combining where datasets are joined based on a common key or column, similar to SQL JOIN operations.
📌 These operations are core parts of Data Wrangling.
2. Why Combining & Merging are Important
Real-world data comes from multiple sources
Required for data integration
Helps in feature enrichment
Essential before data analysis and machine learning
3. Keywords (High Probability for Exams)
Combine, Merge, Concatenate, Join,
Inner Join, Left Join, Right Join, Outer Join,
Key, Primary Key, Foreign Key,
Pandas, concat(), merge(), join()
4. Methods to Combine and Merge in Pandas
| Method | Purpose |
|---|---|
| concat() | Stack datasets |
| merge() | Join using key column |
| join() | Join using index |
5. Combining Datasets using concat()
Definition
concat() combines DataFrames row-wise or column-wise.
Exam Diagram
DF1 DF2
A B A B
1 2 3 4
↓ concat
A B
1 2
3 4
Python Example
import pandas as pd
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]})
df2 = pd.DataFrame({'A':[5,6], 'B':[7,8]})
combined = pd.concat([df1, df2])
print(combined)
6. Merging Datasets using merge()
Definition
merge() combines datasets based on a common key column.
Types of Merge (Very Important)
1. Inner Merge
pd.merge(df1, df2, on='ID', how='inner')
2. Left Merge
pd.merge(df1, df2, on='ID', how='left')
3. Right Merge
pd.merge(df1, df2, on='ID', how='right')
4. Outer Merge
pd.merge(df1, df2, on='ID', how='outer')
Exam Diagrams (Draw These)
INNER MERGE
A ∩ B (Common Rows)
LEFT MERGE
All rows from A
+ Matching rows from B
OUTER MERGE
All rows from A and B
7. Sample Dataset for Merge
df1 = pd.DataFrame({
'ID':[1,2,3],
'Name':['A','B','C']
})
df2 = pd.DataFrame({
'ID':[2,3,4],
'Marks':[80,90,85]
})
8. Joining Datasets using join()
df1.set_index('ID').join(df2.set_index('ID'), how='inner')
📌 join() is index-based, while merge() is column-based.
9. Difference: concat() vs merge() vs join()
| Feature | concat() | merge() | join() |
|---|---|---|---|
| Combines | Stacking | Key-based | Index-based |
| Similar to SQL | No | Yes | Partially |
| Flexibility | Medium | High | Medium |
| Use case | Same columns | Related tables | Indexed data |
10. Flowchart (Exam-Friendly)
Multiple Datasets
↓
Decide Method
(concat / merge / join)
↓
Combine on Rows or Keys
↓
Single Integrated Dataset
11. Real-World Data Science Use Cases
Merging customer & transaction data
Combining monthly sales files
Integrating datasets from APIs and databases
Preparing data for ML models
12. One-Line Memory Points (For Exams)
Combining = bring datasets together
Merging = key-based combining
concat() → stack datasets
merge() → SQL-style join
join() → index-based join
13. MOST IMPORTANT EXAM QUESTIONS
Define combining and merging datasets.
Explain concat() and merge() with examples.
Differentiate between merge() and join().
Write Python code for inner and outer merge.
Reshaping and Pivoting in Data Science (Using Python)
1. Definition
Reshaping
Reshaping is the process of changing the structure or layout of a dataset (rows ↔ columns) without changing the actual data values.
Pivoting
Pivoting is a specific reshaping operation where unique values from one column become new columns, creating a wide-format table.
📌 Both are key steps in Data Wrangling.
2. Why Reshaping & Pivoting are Important
Converts data into analysis-ready (tidy) format
Makes data suitable for visualization and ML
Simplifies group-wise comparisons
Required for aggregation and reporting
3. Keywords (High Probability in Exams)
Reshaping, Pivot, Pivot Table, Melt,
Wide Format, Long Format, Stack, Unstack,
Aggregation, Index, Columns, Values
4. Data Formats (Important Concept)
Long Format
Year Dept Sales
2023 CS 100
2023 IT 120
Wide Format
Year CS IT
2023 100 120
5. Exam Diagram (Must Draw)
Long Format Data
↓
Reshaping
(Pivot / Unstack)
↓
Wide Format Data
6. Pivoting in Pandas
1. pivot()
Definition
Transforms long → wide format using index, columns, and values.
Python Example
import pandas as pd
df = pd.DataFrame({
'Year':[2023, 2023, 2024, 2024],
'Dept':['CS', 'IT', 'CS', 'IT'],
'Sales':[100, 120, 110, 130]
})
pivot_df = df.pivot(index='Year', columns='Dept', values='Sales')
print(pivot_df)
2. pivot_table() (Very Important)
Definition
Similar to pivot() but supports aggregation and handles duplicates.
Python Example
pivot_table = df.pivot_table(
index='Year',
columns='Dept',
values='Sales',
aggfunc='mean'
)
print(pivot_table)
7. Melting (Reverse of Pivot)
melt()
Definition
Converts wide → long format.
Exam Diagram
Wide Data
↓ melt
Long Data
Python Example
pd.melt(pivot_df.reset_index(), id_vars='Year')
8. Stack and Unstack
Stack
Columns → row index
pivot_df.stack()
Unstack
Row index → columns
df.set_index(['Year','Dept'])['Sales'].unstack()
9. Difference Table (High Probability Exam Question)
| Operation | Purpose | Format Change |
|---|---|---|
| pivot() | Reshape | Long → Wide |
| pivot_table() | Reshape + aggregate | Long → Wide |
| melt() | Reverse reshape | Wide → Long |
| stack() | Columns → rows | Wide → Long |
| unstack() | Rows → columns | Long → Wide |
10. Flowchart for Reshaping & Pivoting
Raw Dataset
↓
Identify Format (Long / Wide)
↓
Apply Reshaping Method
(pivot / melt / stack)
↓
Structured Dataset
11. Real-World Data Science Applications
Sales and financial reports
Survey data analysis
Time-series analysis
Preparing data for dashboards
12. One-Line Memory Points (For Exams)
Reshaping changes structure, not data
Pivot creates new columns
pivot_table() supports aggregation
melt() is reverse of pivot
Stack/unstack work with MultiIndex
13. MOST IMPORTANT EXAM QUESTIONS
Define reshaping and pivoting.
Explain pivot and pivot_table with examples.
Differentiate between pivot and melt.
Write Python code to reshape data using stack and unstack.