Lesson 6.3: Data Mini-Projects – More Complex Datasets

Today’s Goals

  • Dive into larger datasets
  • Learn how to clean and process real-world data
  • Explore more complex aggregations and visualizations

Warm-Up Question

  • Have you ever dealt with a messy dataset? What problems did you encounter?

Cleaning Data

  • Real-world datasets are often incomplete or contain errors
  • Techniques for cleaning:
    • Removing missing or invalid data
    • Converting data types
    • Handling duplicates

Example: Cleaning a Dataset

import pandas as pd

# Sample messy dataset
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'score': [90, None, 85, 95, 88],
    'age': [25, 30, None, 40, 35]
}

df = pd.DataFrame(data)

# Remove every row that has at least one missing value (only Alice and David remain)
df_cleaned = df.dropna()
print(df_cleaned)
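
Dropping every incomplete row is not the only option, and it does nothing about values that are present but invalid. The sketch below is one possible approach, assuming scores must fall between 0 and 100 and that filling a missing age with the median is acceptable for the analysis. Both rules and all sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: one missing score, one out-of-range score, one missing age
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'score': [90, None, 150, 95, 80],
    'age': [25, 30, 35, None, 31],
})

# Drop rows only when 'score' is missing, rather than when any column is missing
df_scored = df.dropna(subset=['score'])

# Keep only scores inside the assumed valid range 0-100 (inclusive)
df_valid = df_scored[df_scored['score'].between(0, 100)].copy()

# Fill the remaining missing ages with the median age of the valid rows
df_valid['age'] = df_valid['age'].fillna(df_valid['age'].median())

print(df_valid)
```

Bob is removed for the missing score and Charlie for the impossible one, while David is kept and receives the median age instead of being discarded.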

Data Type Conversion

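  • Convert columns to appropriate data types (e.g., int, float, str)
  • astype(int) raises an error while a column still contains missing values, so drop or fill them first, or use the nullable 'Int64' type
# 'age' is stored as float because of the missing value; 'Int64' keeps the gap as <NA>
df['age'] = df['age'].astype('Int64')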
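
Real-world files often store numbers as text, sometimes with entries that are not numbers at all. A minimal sketch of one way to handle this with pd.to_numeric, assuming unparseable entries should become missing values. The column contents are made up.

```python
import pandas as pd

# Hypothetical column where numbers arrived as text, with one bad entry
raw = pd.DataFrame({'age': ['25', '30', 'unknown', '40']})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
raw['age'] = pd.to_numeric(raw['age'], errors='coerce')

# Nullable integer type: whole numbers, with the gap kept as <NA>
raw['age'] = raw['age'].astype('Int64')

print(raw['age'].dtype)
print(raw)
```

Coercing first and converting second keeps the bad entry visible as a missing value, so it can be dropped or filled deliberately later.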

Handling Duplicates

  • Remove duplicate rows based on specific columns (the first occurrence is kept by default)
df_no_duplicates = df.drop_duplicates(subset=['name'])
print(df_no_duplicates)
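
The sample dataset above has no repeated names, so the effect is easier to see on data that does. A small sketch with made-up rows, showing how duplicated() flags repeats and how the keep parameter chooses which row survives:

```python
import pandas as pd

# Hypothetical data: Alice appears twice with different scores
df_dup = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'score': [90, 75, 92, 85],
})

# Flag rows whose name already appeared earlier in the table
is_repeat = df_dup.duplicated(subset=['name'])
print(is_repeat)

# keep='first' (the default) keeps Alice's 90; keep='last' keeps her 92
first_kept = df_dup.drop_duplicates(subset=['name'])
last_kept = df_dup.drop_duplicates(subset=['name'], keep='last')

print(first_kept)
print(last_kept)
```

Which row to keep depends on the data: if later rows are corrections, keep='last' is usually the safer choice.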

More Aggregations

  • Find the average, median, and standard deviation of the scores (pandas skips missing values by default)
avg_score = df['score'].mean()
median_score = df['score'].median()
std_score = df['score'].std()

print("Average:", avg_score)
print("Median:", median_score)
print("Standard Deviation:", std_score)
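
Aggregations become more useful when computed per category. A minimal sketch using groupby and agg, assuming the data has a grouping column. The 'group' column and all values here are hypothetical.

```python
import pandas as pd

# Hypothetical data with a made-up 'group' column
scores = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B'],
    'score': [90, 80, 85, 95, 85],
})

# One row per group, one column per statistic
summary = scores.groupby('group')['score'].agg(['mean', 'median', 'std', 'count'])
print(summary)

# mode() returns a Series because several values can tie for most frequent
most_common = scores['score'].mode()
print(most_common.tolist())
```

Note that std() in pandas is the sample standard deviation (it divides by n - 1), which differs slightly from the population version on small datasets.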

Class Practice

  • Clean a dataset with missing or invalid data
  • Perform aggregations and visualize results

Student Challenge

  • Use a real-world dataset (e.g., sales, weather, or movie data)
  • Clean the data and perform aggregations (mean, median, mode)
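
One possible shape for the workflow, as a starting point rather than a solution. The CSV text below is made up and stands in for a downloaded file; with a real dataset, pass the file path to pd.read_csv and adapt the column names.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a downloaded sales file
csv_text = """region,units,price
North,10,2.5
South,,3.0
North,10,2.5
East,7,abc
West,4,5.0
"""

sales = pd.read_csv(io.StringIO(csv_text))

# Clean: coerce bad prices to NaN, drop exact duplicate rows, then drop incomplete rows
sales['price'] = pd.to_numeric(sales['price'], errors='coerce')
sales = sales.drop_duplicates().dropna()

# Aggregate: mean, median, and mode of units sold
mean_units = sales['units'].mean()
median_units = sales['units'].median()
mode_units = sales['units'].mode().tolist()

print("Mean:", mean_units)
print("Median:", median_units)
print("Mode:", mode_units)
```

Only two rows survive the cleaning here, so the mode is a tie between both values. Checking how many rows remain after cleaning is a good habit before trusting any statistic.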

Wrap-Up

  • Data cleaning is crucial for accurate analysis
  • Next time: We'll explore advanced visualizations and export data to new formats