Sunday, December 21, 2025

 

Data Cleaning, Preparation & Data Wrangling (Using Python)

UNIT-3

1. Data Cleaning

Definition

Data Cleaning is the process of detecting and correcting (or removing) incorrect, incomplete, duplicate, or irrelevant data to improve data quality before analysis.


Why Data Cleaning is Important

  • Real-world data is messy

  • Improves accuracy of models

  • Reduces bias and errors

  • Saves time during analysis


Common Data Quality Issues

  • Missing values

  • Duplicate records

  • Incorrect data types

  • Outliers

  • Inconsistent formats


Keywords

Missing Values, Duplicates, Outliers, Noise, Null, NaN, Data Quality


Exam Diagram (Draw This)

Raw Data
   ↓
[Missing Values] [Duplicates] [Incorrect Data]
   ↓
Clean Data

Python Examples

1. Handling Missing Values

import pandas as pd

df = pd.DataFrame({
    'Age': [25, None, 30],
    'Salary': [40000, 50000, None]
})
df.fillna(df.mean(), inplace=True)
print(df)

2. Removing Duplicates

df.drop_duplicates(inplace=True)

3. Correcting Data Types

df['Age'] = df['Age'].astype(int)  # valid only after missing values are handled

MOST IMPORTANT EXAM QUESTION

What is Data Cleaning? Explain its importance and techniques.


2. Data Preparation

Definition

Data Preparation is the process of transforming cleaned data into a suitable format for analysis or machine learning models.


Objectives

  • Make data model-ready

  • Improve feature quality

  • Convert data into numerical form


Key Steps in Data Preparation

  • Feature selection

  • Feature scaling

  • Encoding categorical data

  • Splitting data (train/test)


Keywords

Feature Engineering, Normalization, Encoding, Scaling, Train-Test Split


Exam Diagram

Clean Data
   ↓
Feature Selection
   ↓
Encoding & Scaling
   ↓
Prepared Data

Python Examples

1. Encoding Categorical Data

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})
df_encoded = pd.get_dummies(df)
print(df_encoded)

2. Feature Scaling

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = scaler.fit_transform([[100], [200], [300]])
print(df_scaled)

3. Train-Test Split

from sklearn.model_selection import train_test_split

X = df_encoded
y = [1, 0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

MOST IMPORTANT EXAM QUESTION

Explain Data Preparation steps with suitable examples.


3. Data Wrangling

Definition

Data Wrangling is the process of collecting, transforming, structuring, and enriching raw data into a usable format for analysis.

📌 Data Wrangling = Cleaning + Preparation + Transformation


Tasks in Data Wrangling

  • Merging datasets

  • Reshaping data

  • Aggregation

  • Filtering


Keywords

Merge, Join, GroupBy, Reshape, Pivot, Transform


Exam Diagram (Very Important)

Multiple Data Sources
   ↓
Data Wrangling (Clean + Transform)
   ↓
Analysis-Ready Data

Python Examples

1. Merging Data

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']})
df2 = pd.DataFrame({'ID': [1, 2], 'Marks': [80, 90]})
merged = pd.merge(df1, df2, on='ID')
print(merged)

2. Grouping Data

df = pd.DataFrame({
    'Department': ['CS', 'CS', 'IT'],
    'Salary': [50000, 60000, 55000]
})
print(df.groupby('Department').mean())

3. Reshaping Data

df.pivot_table(values='Salary', index='Department')

MOST IMPORTANT EXAM QUESTION

What is Data Wrangling? How is it different from Data Cleaning?


4. Difference Table (High Probability Exam Question)

Aspect   | Data Cleaning | Data Preparation      | Data Wrangling
Purpose  | Fix errors    | Make data model-ready | Transform & structure
Scope    | Narrow        | Medium                | Broad
Examples | Remove nulls  | Encode features       | Merge datasets
Tools    | Pandas        | Pandas, Sklearn       | Pandas

5. One-Line Memory Tips (For Exams)

  • Data Cleaning → Fix bad data

  • Data Preparation → Make data usable for ML

  • Data Wrangling → Convert raw data into structured data


6. Final Exam Flowchart (Draw This)

Raw Data
   ↓
Data Cleaning
   ↓
Data Preparation
   ↓
Data Wrangling
   ↓
Data Analysis / ML Model


Handling Missing Data in Data Science (Using Python)


1. Definition

Missing data refers to the absence of values in a dataset where information should be present.
In Python, missing values are usually represented as NaN (Not a Number) or None.

📌 Handling missing data is the process of detecting, analyzing, and treating missing values to ensure accurate data analysis and model performance.


2. Why Handling Missing Data is Important

  • Prevents biased results

  • Improves model accuracy

  • Avoids errors in algorithms

  • Ensures data integrity


3. Types of Missing Data (Very Important for Exams)

1. MCAR – Missing Completely at Random

  • Missing values occur randomly

  • No relationship with other variables

2. MAR – Missing at Random

  • Missingness depends on other variables

3. MNAR – Missing Not at Random

  • Missingness depends on the missing value itself
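These three types are easiest to see by simulating them. A minimal sketch (the 'age'/'income' columns and the thresholds are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'age': rng.integers(20, 60, size=10),
    'income': rng.integers(20, 100, size=10) * 1000.0
})

# MCAR: values vanish completely at random
mcar = df.copy()
mcar.loc[rng.random(len(df)) < 0.3, 'income'] = np.nan

# MAR: missingness in 'income' depends on another observed column ('age')
mar = df.copy()
mar.loc[mar['age'] > 40, 'income'] = np.nan

# MNAR: missingness depends on the missing value itself (high incomes hidden)
mnar = df.copy()
mnar.loc[mnar['income'] > 60000, 'income'] = np.nan

print(mcar['income'].isnull().sum(),
      mar['income'].isnull().sum(),
      mnar['income'].isnull().sum())
```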


Exam Diagram (Draw This)

Missing Data Types
 ├── MCAR
 ├── MAR
 └── MNAR

4. Keywords (Important)

NaN, Null, None, Missing Values, Imputation,
Deletion, Mean, Median, Mode, Forward Fill, Backward Fill


5. Detecting Missing Data in Python

Python Example

import pandas as pd

df = pd.DataFrame({
    'Age': [20, None, 25],
    'Salary': [30000, 40000, None]
})
print(df.isnull())
print(df.isnull().sum())

6. Techniques for Handling Missing Data


1. Removing Missing Data (Deletion Method)

Definition

Rows or columns containing missing values are removed.

When to Use

  • Missing values are very few

  • Dataset is large

Python Example

df.dropna(inplace=True)

Exam Diagram

Data with Missing Values
   ↓
Drop Rows
   ↓
Clean Data

2. Filling Missing Data (Imputation Method)

Definition

Missing values are replaced with statistical values.


a) Mean Imputation

df['Age'] = df['Age'].fillna(df['Age'].mean())

b) Median Imputation (Best for Outliers)

df['Salary'] = df['Salary'].fillna(df['Salary'].median())

c) Mode Imputation (Categorical Data)

df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

Exam Diagram

Missing Values
   ↓
Statistical Method (Mean / Median / Mode)
   ↓
Filled Data

3. Forward Fill & Backward Fill

Forward Fill (ffill)

Uses previous value

df.ffill(inplace=True)  # fillna(method='ffill') is deprecated in newer pandas

Backward Fill (bfill)

Uses next value

df.bfill(inplace=True)  # fillna(method='bfill') is deprecated in newer pandas

4. Replacing Missing Values with Constant

df.fillna(0, inplace=True)

5. Advanced Imputation (Mention in Exams)

  • KNN Imputation

  • Regression Imputation

  • Machine Learning models

📌 (Mentioning these fetches extra marks)
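Of these, KNN imputation is simple to demonstrate with scikit-learn's KNNImputer. A minimal sketch (the sample values are invented): each missing value is replaced using the values of the k most similar rows.

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age': [25, None, 30, 28],
    'Salary': [40000, 50000, None, 45000]
})

# Each missing value is filled from the k nearest rows,
# where distance is computed on the observed features.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```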


7. Choosing the Right Method (Important Table)

Scenario                | Best Method
Few missing values      | Drop rows
Numerical data          | Mean / Median
Categorical data        | Mode
Time series data        | Forward / Backward fill
Outliers present        | Median

8. Common Mistakes (Write in Theory Answers)

  • Blindly deleting large data

  • Using mean with skewed data

  • Ignoring missing data patterns


9. Final Exam Flowchart (Very Important)

Raw Dataset
   ↓
Detect Missing Values
   ↓
Analyze Type (MCAR / MAR / MNAR)
   ↓
Choose Method
   ↓
Handle Missing Data
   ↓
Clean Dataset

10. One-Line Memory Points (For Exams)

  • NaN represents missing data in Python

  • Imputation replaces missing values

  • Median is robust to outliers

  • Handling missing data improves model accuracy


MOST IMPORTANT EXAM QUESTIONS

  1. Define missing data and explain its types.

  2. Explain techniques for handling missing data with Python examples.

  3. Compare deletion and imputation methods.


Data Transformation in Data Science (Using Python)


1. Definition

Data Transformation is the process of converting raw data into a suitable format for analysis or machine learning by applying operations such as scaling, normalization, aggregation, encoding, or feature engineering.

📌 Purpose: To improve data quality, model performance, and analytical insights.


2. Importance of Data Transformation

  • Makes data model-ready

  • Handles skewed distributions

  • Improves accuracy of algorithms

  • Enables comparison between features

  • Facilitates feature extraction and engineering


3. Types of Data Transformation

Type                     | Description                      | Example
Normalization            | Rescale values to a 0–1 range    | MinMaxScaler
Standardization          | Rescale to mean = 0, std = 1     | StandardScaler
Log Transformation       | Reduce skewness                  | log(x)
Binning / Discretization | Convert continuous → categorical | Age → young, adult, senior
Encoding                 | Convert categorical → numerical  | One-hot encoding
Aggregation              | Summarize data                   | sum, mean, max by group
Feature Engineering      | Create new features              | BMI = weight / height²

4. Keywords (High Probability in Exams)

Normalization, Standardization, Scaling, Encoding, Log Transform, Binning, Feature Engineering, Aggregation, MinMaxScaler, StandardScaler, One-Hot Encoding


5. Exam Diagram (Draw This)

Raw Data
   ↓
Transformation
 ├── Normalization / Standardization
 ├── Log / Power Transform
 ├── Binning
 ├── Encoding
 └── Feature Engineering
   ↓
Transformed Data (Model Ready)

6. Python Examples

1. Normalization (Min-Max Scaling)

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

df = pd.DataFrame({'Age': [20, 30, 40]})
scaler = MinMaxScaler()
df['Age_scaled'] = scaler.fit_transform(df[['Age']])
print(df)

2. Standardization (Z-score Scaling)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df['Age_std'] = scaler.fit_transform(df[['Age']])
print(df)

3. Log Transformation

import numpy as np

# Assumes df has a numeric 'Salary' column
df['Salary_log'] = np.log(df['Salary'])

4. Binning / Discretization

bins = [0, 18, 35, 60]
labels = ['Child', 'Adult', 'Senior']
df['Age_group'] = pd.cut(df['Age'], bins=bins, labels=labels)

5. One-Hot Encoding (Categorical to Numerical)

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})
df_encoded = pd.get_dummies(df)
print(df_encoded)

6. Feature Engineering

# Assumes df has 'Weight' (kg) and 'Height' (cm) columns
df['BMI'] = df['Weight'] / (df['Height'] / 100)**2

7. Flow of Data Transformation (Important for Exams)

Raw Dataset
   ↓
Identify Transformations Needed
   ↓
Apply Scaling / Encoding / Log / Binning
   ↓
Feature Engineering
   ↓
Transformed Dataset (Ready for ML)

8. Difference Table: Transformation vs Cleaning vs Preparation

Aspect     | Cleaning      | Preparation           | Transformation
Goal       | Remove errors | Make data model-ready | Convert/modify data for analysis
Scope      | Narrow        | Medium                | Broad
Techniques | Drop / Fill   | Encoding / Scaling    | Normalization, Log, Feature Eng.
Example    | Fill NaN      | One-Hot Encode        | Scale Age, create BMI

9. One-Line Memory Points (For Exams)

  • Transformation = Change the form of data

  • Normalization = 0–1 scale

  • Standardization = mean=0, std=1

  • Log Transform = reduce skew

  • Feature Engineering = create new meaningful features


10. MOST IMPORTANT EXAM QUESTIONS

  1. Define Data Transformation and explain its importance.

  2. Explain types of data transformations with Python examples.

  3. Difference between data cleaning, preparation, and transformation.

  4. Write a Python program to normalize and encode a dataset.


String Manipulation in Data Science (Using Python)


1. Definition

String Manipulation is the process of modifying, analyzing, and extracting information from text data in Python.
In Python, strings are objects of the str class, and they come with built-in methods for operations such as formatting, searching, splitting, and joining.

📌 Importance in Data Science:

  • Text data is common (names, emails, logs, tweets)

  • Preprocessing textual data for analysis or ML requires string operations

  • Helps in cleaning, parsing, and feature extraction


2. String Object

  • Strings in Python are immutable, meaning they cannot be changed after creation.

  • Strings can be created using single quotes, double quotes, or triple quotes.

s1 = 'Hello'
s2 = "Data Science"
s3 = """Python is fun"""

3. Keywords (High Probability in Exams)

String, Immutable, Concatenation, Slicing, Indexing, split(), join(), strip(), replace(), lower(), upper(), find(), format(), f-string, startswith(), endswith(), isdigit(), isalpha()


4. Common String Operations

Operation     | Description                | Python Example
Concatenation | Join strings               | 'Hello' + ' World'
Repetition    | Repeat string              | 'Hi' * 3 → 'HiHiHi'
Indexing      | Access specific characters | s[0] → 'H'
Slicing       | Extract substrings         | s[0:5] → 'Hello'
Length        | Get string length          | len(s)

5. String Methods (Most Important)

Method                    | Description                    | Example
lower()                   | Convert to lowercase           | 'HELLO'.lower() → 'hello'
upper()                   | Convert to uppercase           | 'hello'.upper() → 'HELLO'
title()                   | Capitalize each word           | 'data science'.title() → 'Data Science'
strip()                   | Remove leading/trailing spaces | ' hello '.strip() → 'hello'
replace()                 | Replace substring              | 'hello'.replace('h','H') → 'Hello'
split()                   | Split string into list         | 'a,b,c'.split(',') → ['a','b','c']
join()                    | Join list into string          | ','.join(['a','b']) → 'a,b'
find()                    | Find substring index           | 'hello'.find('e') → 1
startswith() / endswith() | Check prefix/suffix            | 'hello'.startswith('h') → True
isdigit() / isalpha()     | Check string type              | '123'.isdigit() → True

6. Python Examples

1. Concatenation & Repetition

s1 = "Data"
s2 = "Science"
print(s1 + " " + s2)  # Output: Data Science
print("Hi! " * 3)     # Output: Hi! Hi! Hi!

2. Indexing & Slicing

s = "Python"
print(s[0])    # Output: P
print(s[0:4])  # Output: Pyth

3. Using String Methods

s = " Data Science "
print(s.strip())                      # 'Data Science'
print(s.upper())                      # ' DATA SCIENCE '
print(s.lower())                      # ' data science '
print(s.replace("Data", "Big Data"))  # ' Big Data Science '
print(s.split())                      # ['Data', 'Science']

4. Joining a List of Strings

lst = ['Data', 'Science', 'Python']
s = " ".join(lst)
print(s)  # Output: Data Science Python

7. Flow Diagram for String Manipulation (Draw in Exam)

Raw String
   ↓
Operations
 ├── Indexing & Slicing
 ├── Concatenation / Repetition
 ├── Case Conversion (upper/lower/title)
 ├── Cleaning (strip/replace)
 ├── Splitting / Joining
 └── Searching (find/startswith/endswith)
   ↓
Processed String (Ready for Analysis)

8. Real-World Data Science Applications

  • Cleaning text data (removing extra spaces, converting to lowercase)

  • Parsing logs (split IP addresses, timestamps)

  • Feature extraction (count words, detect patterns)

  • Preparing input for NLP models


9. One-Line Memory Tips (For Exams)

  • Strings are immutable

  • split() → string → list

  • join() → list → string

  • strip() → remove whitespace

  • replace() → modify text

  • upper()/lower() → case conversion


10. Most Important Exam Questions

  1. Define string manipulation in Python and explain its importance in Data Science.

  2. Explain commonly used string methods in Python with examples.

  3. Write a Python program to clean and preprocess a string dataset.

  4. Explain difference between split() and join() with examples.

Regular Expressions in Data Science (Using Python)


    1. Definition

    A Regular Expression (Regex) is a sequence of characters that defines a search pattern used for matching, searching, extracting, replacing, and validating text data.

    📌 In Data Science, regex is widely used for:

    • Text cleaning

    • Pattern extraction

    • Data validation

    • Preprocessing unstructured text

    Python provides the re module for working with regular expressions.


    2. Why Regex is Important in Data Science

    • Handles unstructured text data

    • Automates data cleaning

    • Extracts useful features from text

    • Validates emails, phone numbers, IDs

    • Improves text preprocessing for NLP


    3. Keywords (High Probability in Exams)

    Regular Expression, Pattern, re module, match(), search(),
    findall(), sub(), split(), Metacharacters, Quantifiers,
    Character Classes, Anchors, Groups


    4. Python re Module Functions (Very Important)

Function     | Description
re.match()   | Matches pattern at start of string
re.search()  | Searches pattern anywhere
re.findall() | Returns all matches as list
re.sub()     | Replaces pattern
re.split()   | Splits string using pattern

    5. Basic Regex Syntax

    1. Character Classes

Pattern | Meaning
\d      | Digit (0–9)
\D      | Non-digit
\w      | Alphanumeric
\W      | Non-alphanumeric
\s      | Whitespace
\S      | Non-whitespace

    2. Quantifiers

Symbol | Meaning
*      | 0 or more
+      | 1 or more
?      | 0 or 1
{n}    | Exactly n times
{n,m}  | Between n and m

    3. Anchors

Symbol | Meaning
^      | Start of string
$      | End of string

    6. Exam Diagram (Draw This)

Text Data
   ↓
Regex Pattern
   ↓
Matching / Searching
   ↓
Extract / Replace / Validate
   ↓
Clean Text Data

    7. Python Examples


    1. Matching a Pattern

import re

text = "Data Science"
result = re.match("Data", text)
print(result.group())

    2. Searching a Pattern

text = "I love Data Science"
result = re.search("Science", text)
print(result.group())

    3. Find All Matches

text = "Marks: 80, 90, 85"
numbers = re.findall(r'\d+', text)
print(numbers)

    4. Replacing Text

text = "Contact: 9876543210"
masked = re.sub(r'\d', 'X', text)
print(masked)

    5. Splitting Using Regex

text = "Data,Science;Python"
words = re.split(r'[;,]', text)
print(words)

    8. Common Regex Patterns (Important for Exams)

Task                      | Regex Pattern
Email validation          | [a-zA-Z0-9_.]+@[a-zA-Z]+\.[a-zA-Z]+
Phone number              | \d{10}
Extract numbers           | \d+
Extract words             | \w+
Remove special characters | [^a-zA-Z0-9 ]
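The patterns from the table can be tried directly with re. A quick sketch (the sample sentence is made up):

```python
import re

text = "Reach me at abc@gmail.com or call 9876543210!"

emails = re.findall(r'[a-zA-Z0-9_.]+@[a-zA-Z]+\.[a-zA-Z]+', text)
phones = re.findall(r'\d{10}', text)
clean = re.sub(r'[^a-zA-Z0-9 ]', '', text)  # remove special characters

print(emails)  # ['abc@gmail.com']
print(phones)  # ['9876543210']
print(clean)
```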

    9. Real-World Data Science Use Cases

    • Cleaning social media text

    • Extracting dates, numbers, emails

    • Log file analysis

    • NLP preprocessing

    • Data validation rules


    10. Flowchart for Regex Usage (Exam-Friendly)

Raw Text
   ↓
Define Regex Pattern
   ↓
Apply re functions
   ↓
Matched / Cleaned Text
   ↓
Analysis or Model Input

    11. One-Line Memory Tips (For Exams)

    • Regex = pattern matching tool

    • match() → start of string

    • search() → anywhere

    • findall() → list of matches

    • sub() → replace text

    • Regex simplifies text preprocessing


    12. MOST IMPORTANT EXAM QUESTIONS

    1. Define regular expressions and explain their role in data science.

    2. Explain Python re module functions with examples.

    3. Write regex patterns for email and phone number validation.

    4. Explain character classes and quantifiers in regex.

Vectorized String Functions in Pandas (Data Science – Python)


      1. Definition

      Vectorized string functions in Pandas are specialized string operations applied element-wise to an entire Series or Index at once using the .str accessor.

      📌 These functions allow fast, efficient, and readable text processing without using explicit loops.

      👉 Internally optimized → better performance than Python loops.


      2. Why Vectorized String Functions are Important

      • Text data is common (names, emails, reviews, logs)

      • Apply string operations to thousands/millions of rows

      • Essential for data cleaning and preprocessing

      • Works well with missing values (NaN)


      3. Keywords (High Probability in Exams)

      Vectorization, Pandas Series, .str accessor, Element-wise operation,
      Text preprocessing, NaN handling, split(), replace(),
      contains(), extract(), len(), lower(), upper()


      4. What Does “Vectorized” Mean? (Important Concept)

      Vectorized operation =
      👉 Apply the same operation to all elements at once (no loops).


      Exam Diagram (Very Important)

Text Column (Series)
┌───────────────┐
| "Data Science"|
| "python"      |
| "PANDAS"      |
└───────────────┘
   ↓ .str methods
Processed Text Column

      5. .str Accessor in Pandas

      • Pandas provides .str to access vectorized string methods

      • Works on:

        • Series

        • Index

      df['column'].str.method()

      6. Common Vectorized String Functions

      1. Case Conversion

import pandas as pd

df = pd.DataFrame({
    'Text': ['Data Science', 'python', 'PANDAS']
})
df['lower'] = df['Text'].str.lower()
df['upper'] = df['Text'].str.upper()
print(df)

      2. Length of Strings

      df['length'] = df['Text'].str.len()

      3. Removing Whitespaces

      df['clean'] = df['Text'].str.strip()

      4. Splitting Strings

      df['words'] = df['Text'].str.split()

      5. Replacing Substrings

      df['replaced'] = df['Text'].str.replace('Data', 'Big Data')

      6. Checking Pattern Existence

      df['contains_py'] = df['Text'].str.contains('py', case=False)

      7. Extracting Using Regular Expressions

df = pd.DataFrame({
    'Email': ['abc@gmail.com', 'xyz@yahoo.com']
})
df['domain'] = df['Email'].str.extract(r'@(.*)')
print(df)

      8. Startswith & Endswith

df['starts'] = df['Text'].str.startswith('Data')
df['ends'] = df['Text'].str.endswith('S')

      7. Handling Missing Values (Important Advantage)

df = pd.DataFrame({'Text': ['Data', None, 'Science']})
df['lower'] = df['Text'].str.lower()

      📌 .str methods safely ignore NaN values.


      8. Flowchart of Vectorized String Processing (Exam-Friendly)

Text Data in Pandas Series
   ↓
Use .str Accessor
   ↓
Vectorized String Methods (lower, split, replace, extract)
   ↓
Clean & Processed Text Data

      9. Difference: Normal String Methods vs Vectorized String Functions

Feature      | Python String Methods | Pandas Vectorized
Applies to   | Single string         | Entire Series
Speed        | Slow (loops)          | Fast (vectorized)
Syntax       | "text".lower()        | series.str.lower()
NaN handling | Error                 | Safe
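The NaN-handling difference between the two approaches can be checked directly. A small sketch:

```python
import pandas as pd

s = pd.Series(['Data', None, 'Science'])

# Vectorized: NaN passes through safely
print(s.str.lower())

# Plain Python loop: None has no .lower(), so this raises AttributeError
try:
    [x.lower() for x in s]
except AttributeError as e:
    print('loop failed:', e)
```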

      10. Real-World Data Science Use Cases

      • Cleaning text columns

      • Email domain extraction

      • Keyword filtering

      • NLP preprocessing

      • Log and social media analysis


      11. One-Line Memory Points (For Exams)

      • .str → vectorized string operations

      • No loops required

      • Works on Series and Index

      • Handles NaN safely

      • Faster than Python string methods


      12. MOST IMPORTANT EXAM QUESTIONS

      1. Define vectorized string functions in Pandas.

      2. Explain the .str accessor with examples.

      3. Compare Python string methods and Pandas vectorized string functions.

      4. Write a Pandas program to clean and preprocess text data.

Data Wrangling in Data Science (Using Python)


        1. Definition

        Data Wrangling (also called Data Munging) is the process of collecting, cleaning, transforming, restructuring, and enriching raw data into a usable, analysis-ready format.

        📌 In simple words:
        Data Wrangling = Cleaning + Transformation + Structuring


        2. Why Data Wrangling is Important

        • Real-world data is messy and unstructured

        • Required before data analysis and machine learning

        • Improves data quality and consistency

        • Enables integration of multiple data sources


        3. Keywords (High Probability for Exams)

        Data Wrangling, Data Munging, Cleaning, Transformation,
        Reshaping, Filtering, Aggregation, Merge, Join,
        GroupBy, Pivot, Melt, Pandas, NumPy


        4. Main Steps in Data Wrangling

        1. Data Collection

        2. Data Cleaning

        3. Data Transformation

        4. Data Structuring

        5. Data Enrichment


        Exam Diagram (Very Important – Draw This)

Raw Data
   ↓
Data Cleaning
   ↓
Data Transformation
   ↓
Data Structuring
   ↓
Analysis-Ready Data

        5. Tools Used in Python for Data Wrangling

        • Pandas – primary library

        • NumPy – numerical operations

        • Regex – text processing


        6. Common Data Wrangling Operations (With Examples)


1. Handling Missing Data

import pandas as pd

df = pd.DataFrame({
    'Age': [20, None, 25],
    'Salary': [30000, 40000, None]
})
df.fillna(df.mean(), inplace=True)
print(df)

        2. Removing Duplicates

        df.drop_duplicates(inplace=True)

        3. Filtering Data

        df[df['Age'] > 21]

        4. Transforming Data

        df['Salary_in_Lakhs'] = df['Salary'] / 100000

        5. Merging Datasets

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']})
df2 = pd.DataFrame({'ID': [1, 2], 'Marks': [80, 90]})
merged = pd.merge(df1, df2, on='ID')
print(merged)

        6. Grouping and Aggregation

        df.groupby('Department')['Salary'].mean()

        7. Reshaping Data

        Pivot

        df.pivot_table(values='Salary', index='Department')

        Melt

        pd.melt(df, id_vars=['ID'])

        7. Data Wrangling Flowchart (Exam-Friendly)

Multiple Data Sources
   ↓
Data Wrangling
 ├─ Cleaning
 ├─ Transformation
 ├─ Reshaping
 └─ Integration
   ↓
Structured Dataset

        8. Difference Table (Very Important)

Aspect   | Data Cleaning  | Data Wrangling
Scope    | Narrow         | Broad
Focus    | Fix errors     | Prepare & structure data
Includes | Missing values | Cleaning + transformation
Example  | Fill NaN       | Merge & reshape data

        9. Real-World Data Science Applications

        • Combining data from databases & APIs

        • Preparing datasets for machine learning

        • Cleaning social media & text data

        • Business and scientific data analysis


        10. One-Line Memory Points (For Exams)

        • Data wrangling prepares raw data for analysis

        • It includes cleaning, transforming, reshaping

        • Pandas is the main tool

        • Essential before EDA and ML models


        11. MOST IMPORTANT EXAM QUESTIONS

        1. Define Data Wrangling and explain its importance.

        2. Explain steps involved in data wrangling with examples.

        3. Differentiate between data cleaning and data wrangling.

        4. Write a Python program to perform data wrangling operations.

Data Wrangling – JOIN Operations (Using Python)


1. Definition

In Data Wrangling, a JOIN is the process of combining two or more datasets (tables/DataFrames) based on a common key or column.

📌 JOIN operations are used to integrate data from multiple sources into a single, structured dataset for analysis.

In Python, JOIN operations are mainly performed using:

  • Pandas (merge(), join())


2. Why JOIN is Important in Data Science

  • Real-world data is stored in multiple tables

  • Required for data integration

  • Helps in feature enrichment

  • Essential before analysis and machine learning


3. Keywords (High Probability for Exams)

Join, Merge, Key, Primary Key, Foreign Key,
Inner Join, Left Join, Right Join, Outer Join,
Pandas, merge(), join()


4. Types of JOINs (Very Important)

1. Inner Join

Returns only matching rows from both tables.

2. Left Join

Returns all rows from left table and matching rows from right table.

3. Right Join

Returns all rows from right table and matching rows from left table.

4. Outer (Full) Join

Returns all rows from both tables, filling missing values with NaN.


5. Exam Diagrams (Must Draw)

INNER JOIN

Table A   Table B
  ○────────○
(Only common rows)


LEFT JOIN

Table A   Table B
  ○────────○
  ○────────
(All rows from A)


RIGHT JOIN

Table A   Table B
  ○────────○
   ────────○
(All rows from B)


OUTER JOIN

Table A   Table B
  ○────────○
  ○        ○
(All rows from A and B)


6. Sample Data (Use in Exams)

import pandas as pd

df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
    'ID': [2, 3, 4],
    'Marks': [80, 90, 85]
})


7. JOIN Using merge() in Pandas


1. Inner Join

pd.merge(df1, df2, on='ID', how='inner')

Output: Only IDs 2 and 3


2. Left Join

pd.merge(df1, df2, on='ID', how='left')

Output: All IDs from df1


3. Right Join

pd.merge(df1, df2, on='ID', how='right')

Output: All IDs from df2


4. Outer Join

pd.merge(df1, df2, on='ID', how='outer')

Output: IDs 1,2,3,4 (NaN where no match)


8. JOIN Using join() Method

df1.set_index('ID').join(df2.set_index('ID'), how='inner')

📌 join() works mainly on index, while merge() works on columns.


9. JOIN Flowchart (Exam-Friendly)

DataFrame 1   DataFrame 2
     ↓             ↓
  Identify Common Key
          ↓
      Apply JOIN
          ↓
  Combined DataFrame


10. Difference: merge() vs join()

Aspect      | merge()     | join()
Join key    | Column      | Index
Flexibility | High        | Medium
SQL-like    | Yes         | Limited
Usage       | Most common | Simple joins


11. Common Mistakes (Mention in Theory)

  • Joining on wrong key

  • Ignoring duplicate keys

  • Not handling NaN after join
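One way to catch the last two mistakes is merge()'s indicator=True option, which adds a _merge column showing where each row came from. A quick sketch using the sample data from earlier:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Marks': [80, 90, 85]})

# _merge is 'left_only', 'right_only', or 'both' for each row,
# making unmatched keys (and the NaNs they bring) easy to spot
out = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(out)
print(out['_merge'].value_counts())
```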


12. One-Line Memory Points (For Exams)

  • JOIN combines datasets using a common key

  • merge() is most used

  • Inner join → common rows only

  • Left join → all left table rows

  • Outer join → all rows from both tables


13. MOST IMPORTANT EXAM QUESTIONS

  1. Define JOIN operation in data wrangling.

  2. Explain types of JOINs with diagrams.

  3. Write a Python program for INNER and OUTER JOIN using Pandas.

  4. Differentiate between merge() and join().

Hierarchical Indexing in Data Science (Using Python)


1. Definition

Hierarchical Indexing, also called MultiIndex, is a Pandas feature that allows a DataFrame or Series to have multiple levels of indexing on rows or columns.

📌 It enables representation and analysis of higher-dimensional data in lower-dimensional structures.


2. Why Hierarchical Indexing is Important

  • Represents multi-dimensional data

  • Enables group-wise analysis

  • Simplifies data reshaping

  • Useful for aggregation and pivot operations


3. Keywords (High Probability in Exams)

Hierarchical Indexing, MultiIndex, Levels, Labels,
Index, Sub-index, Unstack, Stack, xs(),
swaplevel(), sort_index()


4. Structure of Hierarchical Index

Exam Diagram (Must Draw)

Index Level 1   Index Level 2   Value
------------------------------------
India           2023            100
India           2024            120
USA             2023             90
USA             2024            110

📌 Each row is identified by (Country, Year).


5. Creating Hierarchical Index


1. From Tuples

import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('India', 2023), ('India', 2024),
    ('USA', 2023), ('USA', 2024)
])
data = [100, 120, 90, 110]
s = pd.Series(data, index=index)
print(s)

2. From Lists

arrays = [
    ['India', 'India', 'USA', 'USA'],
    [2023, 2024, 2023, 2024]
]
index = pd.MultiIndex.from_arrays(arrays)

6. Selecting Data (Very Important)


1. Selecting Outer Level

s['India']

2. Selecting Inner Level using xs()

s.xs(2023, level=1)

3. Selecting a Single Value (Full Key)

s.loc[('India', 2024)]

7. Reshaping with Stack & Unstack


Unstack

Converts inner index → columns

df = s.unstack()
print(df)

Stack

Converts columns → index

df.stack()

Exam Diagram

MultiIndex Series
   ↓ unstack
Wide DataFrame
   ↓ stack
MultiIndex Series

8. Swapping and Sorting Levels


Swap Levels

s.swaplevel()

Sort Index

s.sort_index()

9. Hierarchical Columns (Mention in Exams)

data = {
    ('Sales', 2023): [100, 120],
    ('Sales', 2024): [130, 150]
}
df = pd.DataFrame(data, index=['India', 'USA'])
print(df)

10. Difference: Normal Index vs Hierarchical Index

Aspect         | Normal Index | Hierarchical Index
Index levels   | One          | Multiple
Dimensionality | 1D           | Multi-dimensional
Complexity     | Simple       | Advanced
Use case       | Simple data  | Complex grouped data

11. Flowchart for Hierarchical Indexing (Exam-Friendly)

Multi-Dimensional Data
   ↓
Apply MultiIndex
   ↓
Hierarchical Indexing
   ↓
Efficient Analysis & Reshaping

12. Real-World Data Science Use Cases

  • Time-series data (Year, Month)

  • Geographic data (Country, State)

  • Financial data (Company, Quarter)

  • Sales & performance analysis


13. One-Line Memory Points (For Exams)

  • Hierarchical Index = MultiIndex

  • Allows multiple index levels

  • Reduces need for redundant columns

  • Works well with groupby, pivot, stack, unstack
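The groupby connection is easy to see: grouping by two keys yields a MultiIndexed result, which unstack() then spreads into columns. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA'],
    'Year': [2023, 2024, 2023, 2024],
    'Sales': [100, 120, 90, 110]
})

# Grouping by two keys produces a Series with a (Country, Year) MultiIndex
s = df.groupby(['Country', 'Year'])['Sales'].sum()
print(s.index.nlevels)  # 2
print(s.unstack())      # Years become columns
```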


14. MOST IMPORTANT EXAM QUESTIONS

  1. Define Hierarchical Indexing in Pandas.

  2. Explain creation and selection in MultiIndex with examples.

  3. Differentiate between normal index and hierarchical index.

  4. Explain stack and unstack operations.

Combining and Merging Datasets in Data Science (Using Python)


1. Definition

Combining Datasets

Combining refers to bringing together multiple datasets into a single dataset using methods like concatenation and joining.

Merging Datasets

Merging is a specific type of combining where datasets are joined based on a common key or column, similar to SQL JOIN operations.

📌 These operations are core parts of Data Wrangling.


2. Why Combining & Merging are Important

  • Real-world data comes from multiple sources

  • Required for data integration

  • Helps in feature enrichment

  • Essential before data analysis and machine learning


3. Keywords (High Probability for Exams)

Combine, Merge, Concatenate, Join,
Inner Join, Left Join, Right Join, Outer Join,
Key, Primary Key, Foreign Key,
Pandas, concat(), merge(), join()


4. Methods to Combine and Merge in Pandas

Method | Purpose
concat() | Stack datasets
merge() | Join using key column
join() | Join using index

5. Combining Datasets using concat()


Definition

concat() combines DataFrames row-wise or column-wise.


Exam Diagram

DF1        DF2
A  B       A  B
1  2       3  4
     ↓ concat
A  B
1  2
3  4

Python Example

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
combined = pd.concat([df1, df2])
print(combined)
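Since concat() works both row-wise and column-wise, a short sketch of the two directions (using the same df1/df2 as above):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Row-wise (default, axis=0); ignore_index renumbers rows 0..3
rows = pd.concat([df1, df2], ignore_index=True)
print(rows)   # 4 rows, 2 columns

# Column-wise (axis=1): place the frames side by side
cols = pd.concat([df1, df2], axis=1)
print(cols)   # 2 rows, 4 columns (A, B, A, B)
```

Without ignore_index=True, the row-wise result keeps the original labels 0, 1, 0, 1, which is a common source of duplicate-index bugs after concatenation.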

6. Merging Datasets using merge()


Definition

merge() combines datasets based on a common key column.


Types of Merge (Very Important)


1. Inner Merge

pd.merge(df1, df2, on='ID', how='inner')

2. Left Merge

pd.merge(df1, df2, on='ID', how='left')

3. Right Merge

pd.merge(df1, df2, on='ID', how='right')

4. Outer Merge

pd.merge(df1, df2, on='ID', how='outer')

Exam Diagrams (Draw These)

INNER MERGE

A ∩ B (common rows only)

LEFT MERGE

All rows from A + Matching rows from B

RIGHT MERGE

All rows from B + Matching rows from A

OUTER MERGE

All rows from A and B

7. Sample Dataset for Merge

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Marks': [80, 90, 85]})
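Running the four merge types on this sample dataset makes the differences concrete (df1 has IDs 1–3, df2 has IDs 2–4, so only 2 and 3 match):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Marks': [80, 90, 85]})

inner = pd.merge(df1, df2, on='ID', how='inner')  # IDs 2, 3
left  = pd.merge(df1, df2, on='ID', how='left')   # IDs 1, 2, 3 (Marks NaN for 1)
right = pd.merge(df1, df2, on='ID', how='right')  # IDs 2, 3, 4 (Name NaN for 4)
outer = pd.merge(df1, df2, on='ID', how='outer')  # IDs 1, 2, 3, 4

print(inner)
print(left)
print(right)
print(outer)
```

The unmatched side of a left, right, or outer merge is filled with NaN, which is why merged columns often need fillna() afterwards.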

8. Joining Datasets using join()

df1.set_index('ID').join(df2.set_index('ID'), how='inner')

📌 join() is index-based, while merge() is column-based.
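The index-based vs column-based distinction can be shown side by side. A sketch (reusing the sample df1/df2 from section 7) producing the same inner join both ways:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Marks': [80, 90, 85]})

# join(): the key must live in the index of both frames
joined = df1.set_index('ID').join(df2.set_index('ID'), how='inner')
print(joined)   # index: ID; columns: Name, Marks

# merge(): the key stays a regular column
merged = pd.merge(df1, df2, on='ID', how='inner')
print(merged)   # columns: ID, Name, Marks
```

Both carry the same data; the only difference is whether ID ends up as the index or as an ordinary column.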


9. Difference: concat() vs merge() vs join()

Feature | concat() | merge() | join()
Combines | Stacking | Key-based | Index-based
Similar to SQL | No | Yes | Partially
Flexibility | Medium | High | Medium
Use case | Same columns | Related tables | Indexed data

10. Flowchart (Exam-Friendly)

Multiple Datasets ↓ Decide Method (concat / merge / join) ↓ Combine on Rows or Keys ↓ Single Integrated Dataset

11. Real-World Data Science Use Cases

  • Merging customer & transaction data

  • Combining monthly sales files

  • Integrating datasets from APIs and databases

  • Preparing data for ML models


12. One-Line Memory Points (For Exams)

  • Combining = bring datasets together

  • Merging = key-based combining

  • concat() → stack datasets

  • merge() → SQL-style join

  • join() → index-based join


13. MOST IMPORTANT EXAM QUESTIONS

  1. Define combining and merging datasets.

  2. Explain concat() and merge() with examples.

  3. Differentiate between merge() and join().

  4. Write Python code for inner and outer merge.

Reshaping and Pivoting in Data Science (Using Python)


1. Definition

Reshaping

Reshaping is the process of changing the structure or layout of a dataset (rows ↔ columns) without changing the actual data values.

Pivoting

Pivoting is a specific reshaping operation where unique values from one column become new columns, creating a wide-format table.

📌 Both are key steps in Data Wrangling.


2. Why Reshaping & Pivoting are Important

  • Converts data into analysis-ready (tidy) format

  • Makes data suitable for visualization and ML

  • Simplifies group-wise comparisons

  • Required for aggregation and reporting


3. Keywords (High Probability in Exams)

Reshaping, Pivot, Pivot Table, Melt,
Wide Format, Long Format, Stack, Unstack,
Aggregation, Index, Columns, Values


4. Data Formats (Important Concept)

Long Format

Year  Dept  Sales
2023  CS    100
2023  IT    120

Wide Format

Year  CS   IT
2023  100  120

5. Exam Diagram (Must Draw)

Long Format Data ↓ Reshaping (Pivot / Unstack) ↓ Wide Format Data

6. Pivoting in Pandas


1. pivot()

Definition

Transforms long → wide format using index, columns, and values.


Python Example

import pandas as pd

df = pd.DataFrame({
    'Year': [2023, 2023, 2024, 2024],
    'Dept': ['CS', 'IT', 'CS', 'IT'],
    'Sales': [100, 120, 110, 130]
})
pivot_df = df.pivot(index='Year', columns='Dept', values='Sales')
print(pivot_df)

2. pivot_table() (Very Important)

Definition

Similar to pivot() but supports aggregation and handles duplicates.


Python Example

pivot_table = df.pivot_table(
    index='Year',
    columns='Dept',
    values='Sales',
    aggfunc='mean'
)
print(pivot_table)
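The "handles duplicates" point is the key practical difference. A sketch (with assumed sample data containing two CS rows for 2023) showing pivot() failing where pivot_table() aggregates:

```python
import pandas as pd

# Duplicate (Year, Dept) pair: two CS rows for 2023
df_dup = pd.DataFrame({
    'Year': [2023, 2023, 2023],
    'Dept': ['CS', 'CS', 'IT'],
    'Sales': [100, 110, 120]
})

# pivot() raises ValueError on duplicate index/column pairs
try:
    df_dup.pivot(index='Year', columns='Dept', values='Sales')
except ValueError as e:
    print('pivot failed:', e)

# pivot_table() aggregates the duplicates instead (mean of 100 and 110)
pt = df_dup.pivot_table(index='Year', columns='Dept',
                        values='Sales', aggfunc='mean')
print(pt)
```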

7. Melting (Reverse of Pivot)


melt()

Definition

Converts wide → long format.


Exam Diagram

Wide Data ↓ melt Long Data

Python Example

pd.melt(pivot_df.reset_index(), id_vars='Year')
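A fuller melt() sketch (with an assumed wide-format table matching section 4) using var_name and value_name to control the names of the two new columns:

```python
import pandas as pd

wide = pd.DataFrame({'Year': [2023, 2024],
                     'CS': [100, 110],
                     'IT': [120, 130]})

# Wide -> long: each (Year, Dept) pair becomes one row
long_df = pd.melt(wide, id_vars='Year',
                  var_name='Dept', value_name='Sales')
print(long_df)   # 4 rows: Year, Dept, Sales
```

Without var_name/value_name, the melted columns default to the generic names "variable" and "value".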

8. Stack and Unstack


Stack

Columns → row index

pivot_df.stack()

Unstack

Row index → columns

df.set_index(['Year','Dept'])['Sales'].unstack()
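Stack and unstack are inverses, which a round trip makes visible (using the same long-format Year/Dept/Sales data as in section 6):

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2023, 2023, 2024, 2024],
    'Dept': ['CS', 'IT', 'CS', 'IT'],
    'Sales': [100, 120, 110, 130]
})

# Long -> wide: the inner index level (Dept) becomes columns
wide = df.set_index(['Year', 'Dept'])['Sales'].unstack()
print(wide)

# Wide -> long again: the columns go back into the row index
long_again = wide.stack()
print(long_again)   # Series with MultiIndex (Year, Dept)
```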

9. Difference Table (High Probability Exam Question)

Operation | Purpose | Format Change
pivot() | Reshape | Long → Wide
pivot_table() | Reshape + aggregate | Long → Wide
melt() | Reverse reshape | Wide → Long
stack() | Columns → rows | Wide → Long
unstack() | Rows → columns | Long → Wide

10. Flowchart for Reshaping & Pivoting

Raw Dataset ↓ Identify Format (Long / Wide) ↓ Apply Reshaping Method (pivot / melt / stack) ↓ Structured Dataset

11. Real-World Data Science Applications

  • Sales and financial reports

  • Survey data analysis

  • Time-series analysis

  • Preparing data for dashboards


12. One-Line Memory Points (For Exams)

  • Reshaping changes structure, not data

  • Pivot creates new columns

  • pivot_table() supports aggregation

  • melt() is reverse of pivot

  • Stack/unstack work with MultiIndex


13. MOST IMPORTANT EXAM QUESTIONS

  1. Define reshaping and pivoting.

  2. Explain pivot and pivot_table with examples.

  3. Differentiate between pivot and melt.

  4. Write Python code to reshape data using stack and unstack.

