UNIT – 1 : INTRODUCTION TO DATA SCIENCE
Q1. Define Data Science.
Answer: Data Science is an interdisciplinary field that uses scientific methods, statistics, algorithms, and computing techniques to extract meaningful insights and knowledge from structured and unstructured data for decision making.
Q2. What are the basic terminologies of Data Science?
Answer:
Data: Raw facts and figures
Dataset: Collection of related data
Feature: Individual measurable property of data
Label: Output variable in supervised learning
Model: Mathematical representation of a real-world process
Training & Testing Data: Data used to build and evaluate models
Q3. Explain the types of data.
Answer:
Structured Data – Organized in rows and columns (e.g., databases)
Unstructured Data – Text, images, videos
Semi-structured Data – JSON, XML
Qualitative Data – Categorical data
Quantitative Data – Numerical data (Discrete & Continuous)
Q4. Explain the five steps of Data Science.
Answer:
Problem Definition
Data Collection
Data Cleaning & Preparation
Data Analysis & Modeling
Visualization & Decision Making
Q5. Explain NumPy ndarray.
Answer: The NumPy ndarray is a multidimensional array object that stores elements of the same data type and enables fast numerical computation using vectorization.
Features:
Fixed size
Homogeneous elements
Supports broadcasting
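Example: a minimal sketch of the features listed above, assuming NumPy is installed. The array values and the names arr, row, and shifted are illustrative only.

```python
import numpy as np

# A 2-D ndarray: every element shares one dtype (homogeneous).
arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(arr.shape)  # (2, 3) -> 2 rows, 3 columns
print(arr.ndim)   # 2 dimensions
print(arr.dtype)  # int64

# Broadcasting: the 1-D row is stretched across both rows of arr.
row = np.array([10, 20, 30])
shifted = arr + row  # [[11, 22, 33], [14, 25, 36]]
```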
Q6. What are Universal Functions (ufuncs)?
Answer: Universal functions are vectorized functions in NumPy that perform element-wise operations on arrays efficiently.
Examples: add(), subtract(), multiply(), sqrt()
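Example: the four ufuncs named above applied to two small arrays. This is only a sketch; the input values are made up for illustration.

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])
b = np.array([1.0, 2.0, 3.0])

# Each ufunc works element by element, with no explicit Python loop.
total = np.add(a, b)       # [2.0, 6.0, 12.0]
diff = np.subtract(a, b)   # [0.0, 2.0, 6.0]
prod = np.multiply(a, b)   # [1.0, 8.0, 27.0]
roots = np.sqrt(a)         # [1.0, 2.0, 3.0]
```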
Q7. Explain Array-Oriented Programming.
Answer: Array-oriented programming avoids explicit loops and uses vectorized operations on entire arrays, improving performance and readability.
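Example: the same calculation written once with a loop and once in array-oriented style. A rough sketch with illustrative values; np.where is used here as one common way to replace an if/else inside a loop.

```python
import numpy as np

values = np.arange(1, 6)  # [1, 2, 3, 4, 5]

# Loop style: one element at a time.
squares_loop = []
for v in values:
    squares_loop.append(int(v) ** 2)

# Array-oriented style: one expression on the whole array.
squares_vec = values ** 2  # [1, 4, 9, 16, 25]

# Vectorized condition: cap every value above 3 at 3.
capped = np.where(values > 3, 3, values)  # [1, 2, 3, 3, 3]
```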
Q8. Explain File I/O with NumPy.
Answer: NumPy provides functions like save(), load(), savetxt(), and loadtxt() to store and retrieve array data from files.
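Example: a small round trip through the four functions named above. The file names data.npy and data.txt are placeholders, and a temporary directory is assumed so nothing is left on disk.

```python
import os
import tempfile

import numpy as np

data = np.array([[1.5, 2.5], [3.5, 4.5]])

with tempfile.TemporaryDirectory() as tmp:
    # Binary .npy format: keeps dtype and shape exactly.
    npy_path = os.path.join(tmp, "data.npy")
    np.save(npy_path, data)
    from_npy = np.load(npy_path)

    # Plain-text format: readable by people and other tools.
    txt_path = os.path.join(tmp, "data.txt")
    np.savetxt(txt_path, data, delimiter=",")
    from_txt = np.loadtxt(txt_path, delimiter=",")
```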
Q9. Explain Linear Algebra operations in NumPy.
Answer: NumPy supports matrix multiplication, transpose, inverse, determinant, and eigenvalues using the numpy.linalg module.
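Example: the operations listed above on two small matrices. This is a sketch with made-up matrices A and B; A is chosen diagonal so the results are easy to check by hand.

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 3.0]])
B = np.array([[1.0, 2.0], [3.0, 4.0]])

product = A @ B                  # matrix multiplication: [[2, 4], [9, 12]]
transposed = B.T                 # transpose: [[1, 3], [2, 4]]
inverse = np.linalg.inv(A)       # inverse of A
det = np.linalg.det(A)           # determinant: 2 * 3 = 6
eigenvalues, eigenvectors = np.linalg.eig(A)  # eigenvalues of A are 2 and 3
```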
Q10. Explain pseudorandom number generation.
Answer: NumPy generates pseudorandom numbers using deterministic algorithms. The values look random, but the sequence is fully controlled by a seed value, so the same seed always reproduces the same sequence.
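Example: two generators created with the same seed produce the same numbers. A minimal sketch using np.random.default_rng; the seed 42 and the range 0 to 9 are arbitrary choices.

```python
import numpy as np

# Same seed -> same sequence, which makes experiments repeatable.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

draws_a = rng_a.integers(0, 10, size=5)  # five integers in [0, 10)
draws_b = rng_b.integers(0, 10, size=5)

same = np.array_equal(draws_a, draws_b)  # True
```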
UNIT – 2 : DATA EXPLORATION WITH PANDAS
Q11. What is Data Exploration?
Answer: Data exploration is the process of understanding data characteristics using summary statistics and visualizations before modeling.
Q12. Explain Pandas data structures.
Answer:
Series – One-dimensional labeled array
DataFrame – Two-dimensional labeled table
Index – Immutable array for labeling
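Example: the three structures above in a few lines. A sketch with made-up names and scores; the try/except block only demonstrates that an Index rejects item assignment.

```python
import pandas as pd

# Series: one-dimensional values with labels.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# DataFrame: a table whose columns are Series sharing one index.
df = pd.DataFrame({"name": ["Asha", "Ben"], "score": [82, 91]})

# Index: the row labels; it cannot be modified in place.
idx = df.index
try:
    idx[0] = 5
    index_is_immutable = False
except TypeError:
    index_is_immutable = True
```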
Q13. Explain descriptive statistics in Pandas.
Answer: Pandas provides functions like mean(), median(), mode(), std(), min(), max() to summarize data.
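Example: the summary functions above applied to a small Series of marks. The marks are illustrative; note that std() in Pandas is the sample standard deviation by default.

```python
import pandas as pd

marks = pd.Series([50, 60, 60, 70, 80])

mean = marks.mean()      # 64.0
median = marks.median()  # 60.0
mode = marks.mode()      # Series holding the most frequent value: 60
std = marks.std()        # sample standard deviation, about 11.40
lowest = marks.min()     # 50
highest = marks.max()    # 80
```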
Q14. What are correlation and covariance?
Answer:
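Correlation measures the strength and direction of the linear relationship between two variables on a unit-free scale from -1 to +1.
Covariance indicates how two variables change together; its sign gives the direction, but its magnitude depends on the units of the variables.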
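Example: covariance and correlation for two columns with a perfect linear relationship. A sketch with made-up data in which score is exactly twice hours.

```python
import pandas as pd

df = pd.DataFrame({"hours": [1, 2, 3, 4, 5],
                   "score": [2, 4, 6, 8, 10]})

# Sample covariance: positive, so the columns rise together.
cov = df["hours"].cov(df["score"])    # 5.0

# Pearson correlation: +1 for a perfect increasing linear relationship.
corr = df["hours"].corr(df["score"])  # 1.0
```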
Q15. Explain unique values and value counts.
Answer:
unique() returns distinct values
value_counts() returns frequency of values
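Example: both methods on a short Series of colors (illustrative data).

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Distinct values, in order of first appearance.
distinct = colors.unique()      # red, blue, green

# Frequency of each value, most frequent first.
counts = colors.value_counts()  # red 3, blue 2, green 1
```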
Q16. Explain data loading and storage methods in Pandas.
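Answer: Pandas reads and writes data in formats such as CSV, Excel, JSON, and SQL database tables, using functions like read_csv(), read_excel(), read_json(), and read_sql() with the matching to_csv(), to_excel(), to_json(), and to_sql() methods. Data fetched from web APIs can also be loaded into a DataFrame.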
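Example: a CSV and JSON round trip. To keep the sketch self-contained it assumes in-memory text buffers (io.StringIO) instead of real files; the table contents are made up.

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "city": ["Pune", "Delhi"]})

# CSV: write to text, then read it back.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: one record (dictionary) per row.
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text), orient="records")
```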
UNIT – 3 : DATA CLEANING, PREPARATION AND WRANGLING
Q17. What is data cleaning?
Answer: Data cleaning is the process of detecting and correcting inaccurate, incomplete, or inconsistent data.
Q18. Explain methods for handling missing data.
Answer:
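dropna() – Removes rows or columns that contain missing values
fillna() – Replaces missing values with a specified value
Forward/Backward filling – ffill() and bfill() copy the previous or next valid value into the gap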
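Example: the methods above on a Series with two gaps. A sketch with illustrative values; filling with 0.0 is just one possible choice.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

dropped = s.dropna()    # [1.0, 3.0]
filled = s.fillna(0.0)  # [1.0, 0.0, 3.0, 0.0]
forward = s.ffill()     # [1.0, 1.0, 3.0, 3.0]
backward = s.bfill()    # [1.0, 3.0, 3.0, NaN] (nothing after the last gap)
```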
Q19. Explain data transformation.
Answer: Data transformation converts data into a suitable format using normalization, scaling, encoding, and aggregation.
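Example: two of the transformations named above. This sketch assumes min-max scaling as the normalization method and one-hot encoding via pd.get_dummies as the encoding method; other choices are equally valid.

```python
import pandas as pd

df = pd.DataFrame({"height": [150.0, 160.0, 170.0],
                   "grade": ["A", "B", "A"]})

# Min-max scaling: map the column onto the range 0 to 1.
h = df["height"]
df["height_scaled"] = (h - h.min()) / (h.max() - h.min())  # 0.0, 0.5, 1.0

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(df["grade"], prefix="grade")  # grade_A, grade_B
```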
Q20. Explain string manipulation in Pandas.
Answer: Pandas provides vectorized string methods through the .str accessor for operations like lower(), split(), and replace().
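Example: the three string methods above applied to every element of a Series at once (made-up names).

```python
import pandas as pd

names = pd.Series(["Alice SMITH", "Bob JONES"])

lower = names.str.lower()  # "alice smith", "bob jones"
parts = names.str.split(" ")  # ["Alice", "SMITH"], ["Bob", "JONES"]
replaced = names.str.replace("SMITH", "Smith", regex=False)  # "Alice Smith", "Bob JONES"
```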
Q21. What is data wrangling?
Answer: Data wrangling is the process of cleaning, structuring, and enriching raw data into a usable format.
Q22. Explain combining and merging datasets.
Answer: Pandas supports merge(), join(), and concat() to combine datasets using keys or indexes.
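Example: the three functions above on two small tables. A sketch with illustrative student data; the key column id is an assumption.

```python
import pandas as pd

students = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
marks = pd.DataFrame({"id": [1, 2], "score": [82, 91]})

# merge(): combine on a key column; inner keeps only matching ids (2 rows).
merged = pd.merge(students, marks, on="id", how="inner")

# join(): combine on the index; default is a left join (3 rows, NaN for id 3).
joined = students.set_index("id").join(marks.set_index("id"))

# concat(): stack tables one after another (4 rows).
stacked = pd.concat([marks, marks], ignore_index=True)
```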
Q23. Explain reshaping and pivoting.
Answer: Reshaping changes the structure of data using pivot(), stack(), unstack(), and melt().
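Example: moving the same data between long and wide layouts with the four functions above. A sketch with made-up scores.

```python
import pandas as pd

long = pd.DataFrame({"student": ["Asha", "Asha", "Ben", "Ben"],
                     "subject": ["math", "sci", "math", "sci"],
                     "score": [80, 85, 70, 75]})

# pivot(): long -> wide, one row per student, one column per subject.
wide = long.pivot(index="student", columns="subject", values="score")

# stack(): columns -> inner index level; unstack() reverses it.
stacked = wide.stack()
unstacked = stacked.unstack()

# melt(): wide -> long again.
melted = wide.reset_index().melt(id_vars="student",
                                 var_name="subject",
                                 value_name="score")
```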
UNIT – 4 : DATA VISUALIZATION & GROUP OPERATIONS
Q24. Explain Matplotlib API.
Answer: Matplotlib is a Python library used for creating static, animated, and interactive plots.
Q25. Explain plotting with Pandas and Seaborn.
Answer: Pandas plotting methods such as plot() use Matplotlib as the default backend, while Seaborn, which is also built on Matplotlib, provides high-level statistical visualizations.
Q26. Explain GroupBy mechanics.
Answer: GroupBy splits data into groups, applies functions, and combines results.
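Example: split, apply, and combine on a small sales table (illustrative data).

```python
import pandas as pd

sales = pd.DataFrame({"region": ["N", "S", "N", "S"],
                      "amount": [100, 200, 300, 400]})

# Split by region, apply sum to each group, combine into one Series.
totals = sales.groupby("region")["amount"].sum()  # N 400, S 600

# Several aggregations at once give one column per function.
summary = sales.groupby("region")["amount"].agg(["mean", "max"])
```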
Q27. What are pivot tables and cross-tabulation?
Answer: Pivot tables summarize data using aggregation, while cross-tabulation computes frequency tables.
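Example: a pivot table and a cross-tabulation built from the same table. A sketch with made-up department data; mean is chosen as the aggregation.

```python
import pandas as pd

df = pd.DataFrame({"dept": ["IT", "IT", "HR", "HR"],
                   "gender": ["F", "M", "F", "F"],
                   "salary": [50, 60, 40, 44]})

# Pivot table: aggregate salary (mean) per department.
pivot = pd.pivot_table(df, values="salary", index="dept", aggfunc="mean")
# HR 42.0, IT 55.0

# Cross-tabulation: count rows for each dept/gender combination.
freq = pd.crosstab(df["dept"], df["gender"])
# HR: F 2, M 0; IT: F 1, M 1
```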
UNIT – 5 : STATISTICAL THINKING & TIME SERIES ANALYSIS
Q28. Explain statistical distributions.
Answer: Distributions describe how values are spread and can be visualized using histograms.
Q29. What are outliers?
Answer: Outliers are extreme values that deviate significantly from other observations.
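Example: one common way to flag outliers is the 1.5 × IQR rule. The answer above does not name a detection method, so this rule is an assumption made for illustration, and the data are made up.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])

# Interquartile range: spread of the middle 50% of the data.
q1, q3 = np.percentile(data, [25, 75])  # 11.0 and 12.5
iqr = q3 - q1                           # 1.5

# Values far outside the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr  # 8.75
upper = q3 + 1.5 * iqr  # 14.75
outliers = data[(data < lower) | (data > upper)]  # [95]
```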
Q30. Explain PMF and CDF.
Answer:
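PMF (Probability Mass Function) gives the probability of each value of a discrete variable
CDF (Cumulative Distribution Function) gives the cumulative probability that the variable is less than or equal to a value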
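Example: an empirical PMF and CDF computed from a small sample. A sketch with made-up observations; the PMF is estimated as relative frequency and the CDF as its running total.

```python
import pandas as pd

rolls = pd.Series([1, 2, 2, 3, 3, 3])

# PMF: share of observations at each distinct value.
pmf = rolls.value_counts(normalize=True).sort_index()  # 1: 1/6, 2: 2/6, 3: 3/6

# CDF: running sum of the PMF; the last value is always 1.
cdf = pmf.cumsum()  # 1: 1/6, 2: 3/6, 3: 1.0
```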
Q31. Explain percentile-based statistics.
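Answer: Percentiles divide ordered data into 100 equal parts. The p-th percentile is the value below which about p% of the observations fall, which helps compare the relative standing of a value.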
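Example: percentiles and percentile rank for the scores 1 to 100. A sketch with illustrative data; np.percentile uses linear interpolation by default, so results fall between data points.

```python
import numpy as np

scores = np.arange(1, 101)  # 1, 2, ..., 100

# Values at the 25th, 50th (median), and 90th percentiles.
p25, p50, p90 = np.percentile(scores, [25, 50, 90])  # 25.75, 50.5, 90.1

# Relative standing: percentage of scores at or below 80.
rank_of_80 = (scores <= 80).mean() * 100  # 80.0
```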
Q32. Explain Time Series Analysis.
Answer: Time series analysis studies data points indexed in time order to identify patterns and trends.
Q33. Explain date and time handling in Pandas.
Answer: Pandas provides datetime objects, date ranges, frequency conversion, and time zone handling for time-based data.
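Example: the tools above on a four-day series. A sketch with made-up dates and values; the time zone Asia/Kolkata is an arbitrary choice and assumes time zone data is available on the system.

```python
import pandas as pd

# Date range: four consecutive days used as the index.
days = pd.date_range("2024-01-01", periods=4, freq="D")
ts = pd.Series([1, 2, 3, 4], index=days)

# Frequency conversion: daily values summed into 2-day bins.
two_day = ts.resample("2D").sum()  # [3, 7]

# Parsing a date string into a Timestamp.
parsed = pd.to_datetime("2024-03-15")

# Time zones: attach UTC, then convert to another zone.
utc = ts.tz_localize("UTC")
kolkata = utc.tz_convert("Asia/Kolkata")  # midnight UTC becomes 05:30
```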
⭐ MOST IMPORTANT QUESTIONS (HIGH PROBABILITY)
🔥 UNIT–1 (Very Important)
Define Data Science and explain the five steps of Data Science ⭐⭐⭐
Explain NumPy ndarray and its features ⭐⭐
Explain Universal Functions and Array-Oriented Programming ⭐⭐
Explain Linear Algebra operations in NumPy ⭐⭐
🔥 UNIT–2 (Very Important)
Explain Pandas Data Structures with examples ⭐⭐⭐
Explain Descriptive Statistics, Correlation and Covariance ⭐⭐
Explain Data loading and storage methods in Pandas ⭐⭐
🔥 UNIT–3 (Very Important)
Explain Data Cleaning and handling missing data ⭐⭐⭐
Explain Data Wrangling – combining and merging datasets ⭐⭐
Explain Reshaping and Pivoting operations ⭐⭐
🔥 UNIT–4 (Very Important)
Explain Matplotlib architecture and API ⭐⭐⭐
Explain GroupBy mechanics and Split–Apply–Combine strategy ⭐⭐⭐
Explain Pivot tables and Cross-tabulation ⭐⭐
🔥 UNIT–5 (Very Important)
Explain Statistical Distributions and Outliers ⭐⭐⭐
Explain PMF and CDF with examples ⭐⭐⭐
Explain Time Series Analysis and Pandas time tools ⭐⭐⭐
✏️ IMPORTANT DIAGRAMS & FLOWCHARTS (WHAT TO DRAW IN EXAM)
📌 1. Data Science Life Cycle (UNIT–1) ⭐⭐⭐
Draw a flowchart:
Problem Definition → Data Collection → Data Cleaning → Data Analysis → Visualization & Decision Making
👉 Label each step clearly.
📌 2. NumPy Array Structure (UNIT–1) ⭐⭐
Draw:
2D matrix
Show rows, columns, shape
👉 Mention: homogeneous data, fast computation.
📌 3. Pandas Data Structures (UNIT–2) ⭐⭐⭐
Draw block diagram:
Series → DataFrame → Index
👉 Show Series as single column, DataFrame as table.
📌 4. Data Cleaning Process (UNIT–3) ⭐⭐⭐
Flowchart:
Raw Data → Missing Value Handling → Transformation → Clean Data
📌 5. Data Wrangling Operations (UNIT–3) ⭐⭐
Diagram:
Merge / Join / Concat → Reshape → Final Dataset
📌 6. Split–Apply–Combine Strategy (UNIT–4) ⭐⭐⭐
Very Important Diagram:
Data → Split (GroupBy) → Apply (Aggregation) → Combine (Result)
📌 7. Histogram & Distribution (UNIT–5) ⭐⭐⭐
Draw:
X-axis: Values
Y-axis: Frequency
👉 Mention: shape, spread, outliers.
📌 8. PMF vs CDF Graph (UNIT–5) ⭐⭐⭐
Draw two graphs:
PMF: discrete bars
CDF: increasing curve
📌 9. Time Series Plot (UNIT–5) ⭐⭐⭐
Draw:
X-axis: Time
Y-axis: Value
Mark trend/seasonality
🏆 FINAL EXAM WRITING STRATEGY
✔ Start answer with definition
✔ Add diagram/flowchart wherever possible
✔ Use keywords from syllabus
✔ End with applications / advantages
👉 This is a complete scoring package for your Data Science exam.