Data Cleaning and Preparation
Questions With Answers
1. What are the common techniques for handling missing data in a dataset?
Answer:
Common techniques include filtering out missing data (dropping rows/columns with nulls) and filling in missing data using methods like constant values, mean/median, forward fill, or backward fill.
2. How does filtering out missing data differ from filling in missing data?
Answer:
Filtering removes rows or columns containing missing values, while filling replaces missing values with appropriate substitutes (e.g., computed statistics or propagated values).
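The contrast can be seen in a minimal sketch. The DataFrame and its column names below are made up for illustration; dropna() and fillna() are the pandas methods involved.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing value per column
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
})

# Filtering: discard anything that contains a missing value
dropped_rows = df.dropna()                # keeps only fully populated rows
dropped_cols = df.dropna(axis="columns")  # keeps only fully populated columns

# Filling: keep the shape, substitute a value for each NaN
filled_const = df.fillna(0)               # constant substitute
filled_mean = df.fillna(df.mean())        # per-column mean as substitute
```

Filtering shrinks the data (here only one row survives), while filling keeps every row at the cost of introducing estimated values.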
3. What are some common strategies for filling in missing values in pandas?
Answer:
Strategies include using:
- fillna() with constants
- Statistical values like mean, median, or mode
- Forward fill (ffill)
- Backward fill (bfill)
- Interpolation methods
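As a rough illustration of these strategies, the sketch below applies each one to a small made-up Series. The values are arbitrary, and interpolate() is used with its default linear method.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-element gap
s = pd.Series([1.0, np.nan, np.nan, 4.0])

constant = s.fillna(0)                 # -> 1, 0, 0, 4
median_filled = s.fillna(s.median())   # median of [1, 4] is 2.5
forward = s.ffill()                    # -> 1, 1, 1, 4 (carry last valid value forward)
backward = s.bfill()                   # -> 1, 4, 4, 4 (pull next valid value backward)
linear = s.interpolate()               # -> 1, 2, 3, 4 (linear interpolation)
```

Forward and backward fill tend to suit ordered data such as time series, while a statistic like the median is a common choice when row order carries no meaning.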
4. Why is removing duplicates an important step in data transformation?
Answer:
Duplicates can distort statistical analysis, cause incorrect aggregations, and lead to misleading insights. Removing them ensures data quality and accurate modeling.
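A short sketch of how duplicates might be found and removed in pandas. The column names k1 and k2 are illustrative only.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly
df = pd.DataFrame({
    "k1": ["one", "two", "one", "two"],
    "k2": [1, 1, 1, 2],
})

dup_mask = df.duplicated()        # True where a row repeats an earlier row
deduped = df.drop_duplicates()    # drops the repeated row, keeps first occurrences

# Treat rows as duplicates based on k1 alone, keeping the last occurrence
by_k1 = df.drop_duplicates(subset=["k1"], keep="last")
```

Checking dup_mask.sum() before dropping is a cheap way to see how many rows would be removed.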
5. How can a function or mapping be used to transform data in pandas?
Answer:
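Series.map() accepts a function or a dictionary and transforms each value; apply() runs a function along the rows or columns of a DataFrame (or over each value of a Series); and DataFrame.map() (named applymap() before pandas 2.1) applies a function element-wise. A dictionary mapping is convenient for recoding values (e.g., mapping categories to numerical codes).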
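The following sketch shows one way this could look. The food data and the meat_to_animal dictionary are invented for the example.

```python
import pandas as pd

# Hypothetical data with inconsistent capitalization
df = pd.DataFrame({
    "food": ["Bacon", "pastrami", "honey ham"],
    "ounces": [4.0, 6.0, 5.0],
})

# Dictionary mapping: normalize case first so the keys match
meat_to_animal = {"bacon": "pig", "pastrami": "cow", "honey ham": "pig"}
df["animal"] = df["food"].str.lower().map(meat_to_animal)

# Function-based transformation: convert ounces to grams
df["grams"] = df["ounces"].apply(lambda oz: oz * 28.35)
```

Values that have no key in the dictionary come back as NaN from map(), which makes unmapped categories easy to spot.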
6. What is the purpose of replacing values during data cleaning?
Answer:
Replacing values helps standardize data, fix incorrect entries, normalize categories, and convert placeholders (like “N/A” or “-999”) into proper nulls or valid values.
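A minimal sketch of both uses, assuming sentinel values of -999 and -1000 and a few inconsistent spellings of one label (all made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data where -999 and -1000 stand in for "missing"
readings = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])
cleaned = readings.replace([-999.0, -1000.0], np.nan)

# Hypothetical labels with inconsistent spellings of the same category
labels = pd.Series(["NY", "N.Y.", "New York", "CA"])
normalized = labels.replace({"N.Y.": "NY", "New York": "NY"})
```

Once sentinels are real nulls, methods such as mean() skip them instead of silently including -999 in the result.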
7. When and why would you rename axis indexes in pandas?
Answer:
Renaming is used to make data more readable and consistent, especially when working with messy datasets, merging data, or preparing data for analysis or visualization.
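For illustration, rename() accepts either functions or dictionaries for the index and the columns. The labels below are arbitrary.

```python
import pandas as pd

# Hypothetical frame with lowercase labels
df = pd.DataFrame(
    {"one": [0, 4], "two": [1, 5]},
    index=["ohio", "colorado"],
)

# Apply a function to every label
titled = df.rename(index=str.title, columns=str.upper)

# Rename only selected labels with dictionaries
partial = df.rename(index={"ohio": "INDIANA"}, columns={"two": "second"})
```

rename() returns a new object and leaves df unchanged, so the original labels remain available if needed.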
8. What is discretization or binning, and in what situations is it useful?
Answer:
Discretization converts continuous values into categorical bins (e.g., age ranges). It is useful for simplifying models, handling skewed distributions, and improving interpretability.
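A small sketch of binning ages with pd.cut() and pd.qcut(). The ages, bin edges, and labels are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical ages and bin edges
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
labels = ["youth", "young adult", "middle aged", "senior"]

# Fixed edges; intervals are right-closed by default: (18, 25], (25, 35], ...
age_groups = pd.cut(ages, bins=bins, labels=labels)
counts = pd.Series(age_groups).value_counts()

# Quantile-based bins: edges are chosen so each bin holds roughly equal counts
quartiles = pd.qcut(ages, 4)
```

pd.cut() is the natural choice when the edges have domain meaning, whereas pd.qcut() is useful when roughly equal-sized groups matter more than the exact boundaries.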
9. How can outliers be detected and filtered in a dataset?
Answer:
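Common methods include:
- Using the IQR method (flagging values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR)
- Using z-scores (flagging values more than a chosen number of standard deviations from the mean)
- Visualizations like boxplots and scatterplots
- Custom domain-specific rules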
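The IQR and z-score checks can be sketched as follows. The data is invented, and the z-score cutoff of 2 is an assumption made for this small sample (3 is another common choice).

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])

# IQR method: fences at 1.5 * IQR beyond the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = s[(s < lower) | (s > upper)]

# z-score method: distance from the mean in standard deviations
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 2]   # cutoff of 2 is an assumption for this example

# Filtering: keep only values inside the IQR fences
filtered = s[s.between(lower, upper)]
```

In small samples a single extreme value inflates the standard deviation, so the z-score method can miss outliers that the IQR method catches.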
10. What is permutation and random sampling used for in data processing?
Answer:
Permutation shuffles data for creating randomized datasets; random sampling selects a subset of data, useful for testing models, cross-validation, or reducing dataset size.
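A minimal sketch of both operations, using NumPy's random generator for the permutation and DataFrame.sample() for sampling. The seed values and the frame itself are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with ten rows
df = pd.DataFrame({"x": range(10)})

# Permutation: a random ordering of the row positions
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(df))
shuffled = df.take(order)            # same rows, shuffled order

# Random sampling without replacement: a subset of distinct rows
subset = df.sample(n=3, random_state=0)

# Sampling with replacement can draw more rows than the frame holds
resampled = df.sample(n=15, replace=True, random_state=0)
```

Fixing the seed or random_state makes the shuffle and the samples reproducible between runs.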
11. What are indicator or dummy variables, and why are they used in data analysis?
Answer:
Dummy variables convert categorical values into binary columns (0/1). They are essential for machine learning models that require numeric inputs, such as linear regression.
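One way to build dummy columns is pd.get_dummies(), sketched below on made-up data. The prefix "key" and the integer dtype are choices made for the example.

```python
import pandas as pd

# Hypothetical frame with one categorical column
df = pd.DataFrame({"key": ["b", "b", "a", "c"], "value": [1, 2, 3, 4]})

# One 0/1 column per distinct category
dummies = pd.get_dummies(df["key"], prefix="key", dtype=int)

# Combine the indicator columns with the remaining numeric data
encoded = df[["value"]].join(dummies)
```

For linear models, passing drop_first=True to get_dummies() drops one category per variable, which avoids perfectly collinear columns.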
12. What are extension data types in pandas, and what advantages do they offer?
Answer:
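Extension types (such as Int64, string, boolean, and category) extend pandas beyond NumPy's built-in dtypes. The nullable types represent missing values with pd.NA, so an integer or boolean column can hold nulls without being cast to float or object, and types such as category can store repetitive data more compactly.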
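The effect on missing values can be seen in a small sketch. The values are arbitrary; the dtype strings are the pandas names for the nullable types.

```python
import pandas as pd

# Default behaviour: a missing value forces integers to float64 with NaN
floats = pd.Series([1, 2, None])

# Nullable extension types keep their logical type and use <NA>
ints = pd.Series([1, 2, None], dtype="Int64")
text = pd.Series(["a", None, "c"], dtype="string")
flags = pd.Series([True, None, False], dtype="boolean")

total = ints.sum()   # missing values are skipped by default
```

Note the capital "I" in "Int64": the lowercase "int64" is the NumPy dtype, which cannot hold missing values.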
13. How do Python's built-in string methods assist with string manipulation?
Answer:
They provide tools for cleaning and transforming text, such as:
lower(), upper(), strip(), split(), replace(), startswith(), endswith(), etc.
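A brief sketch of these methods on a made-up string, followed by the vectorized equivalents that pandas exposes through the .str accessor:

```python
import pandas as pd

# Hypothetical messy input
raw = "  Alice , BOB,carol  "

# Split on commas, then trim whitespace and normalize case
names = [part.strip().lower() for part in raw.split(",")]
# -> ["alice", "bob", "carol"]

joined = "::".join(names)                 # -> "alice::bob::carol"
starts_with_al = names[0].startswith("al")
fixed = joined.replace("::", ", ")        # -> "alice, bob, carol"

# The same cleanup applied to a whole column via the .str accessor
col = pd.Series(["  Alice ", " BOB", "carol  "])
clean_col = col.str.strip().str.lower()
```

The .str methods skip missing values rather than raising, which is convenient for columns that contain nulls.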
14. What role do regular expressions play in text cleaning and processing?
Answer:
Regular expressions allow pattern-based searching, matching, and extraction of text. They are powerful for tasks like cleaning inconsistent strings or validating formats.
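As an illustration, the sketch below uses Python's re module to split on irregular whitespace and to pull email addresses out of free text. The email pattern is deliberately simplified and is an assumption for the example, not a complete validator.

```python
import re

# Splitting on any run of whitespace (spaces or tabs)
text = "foo    bar\t baz  \tqux"
tokens = re.split(r"\s+", text)   # -> ['foo', 'bar', 'baz', 'qux']

# Simplified email pattern (assumption: good enough for illustration only)
email_pattern = re.compile(
    r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}",
    flags=re.IGNORECASE,
)

contacts = "Dave dave@google.com, Steve steve@gmail.com, Rob rob@gmail.com"
emails = email_pattern.findall(contacts)          # extraction

# Validation: fullmatch() requires the whole string to fit the pattern
is_valid = email_pattern.fullmatch("dave@google.com") is not None
is_invalid = email_pattern.fullmatch("not an email") is None
```

Compiling the pattern once with re.compile() is worthwhile when the same expression is applied to many strings.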
15. What are categorical data in pandas, and how do they improve data processing efficiency?
Answer:
Categorical data represent variables with a limited number of values. They reduce memory usage and speed up operations such as sorting, comparisons, and group-by operations.
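The memory effect can be checked with a rough sketch. The fruit values and the repetition count are arbitrary, and exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical column with only two distinct values repeated many times
values = pd.Series(["apple", "orange", "apple", "apple"] * 1000)
cat = values.astype("category")

# Internally: small integer codes plus one copy of each distinct value
codes = cat.cat.codes
categories = cat.cat.categories

# Compare memory footprints (deep=True counts the string contents too)
plain_bytes = values.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
```

The saving comes from storing each distinct string once; a column in which nearly every value is unique would gain little from the conversion.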