Lesson 6.2: Data Mini-Projects – Advanced Techniques

Today’s Goals

  • Learn how to manipulate complex datasets
  • Practice filtering, sorting, and aggregating data
  • Work with data visualization techniques (using matplotlib)

Warm-Up Question

  • Have you ever used a dataset with multiple categories or columns? What did you analyze?

Understanding Complex Datasets

  • Real-world datasets often have multiple fields (e.g., name, age, score, date)
  • We'll use pandas, a Python library, for advanced data manipulation
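
Before manipulating a dataset, it helps to check which fields it contains and what type each one has. The sketch below assumes a small made-up table with the four fields named above. The `records` name and all values, including the dates, are illustrative only.

```python
import pandas as pd

# Hypothetical sample records with the fields mentioned above
records = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'score': [90, 80, 85],
    'date': pd.to_datetime(['2024-01-15', '2024-02-01', '2024-02-20']),
})

print(records.shape)   # (number of rows, number of columns)
print(records.dtypes)  # data type of each column
print(records.head())  # first few rows
```

Checking `dtypes` first is a cheap way to catch columns that loaded as text when numbers or dates were expected.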

Filtering Data

import pandas as pd

# Sample data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'score': [90, 80, 85, 95]
}
# Create DataFrame
df = pd.DataFrame(data)

# Filter data: Get students with score > 85
filtered_df = df[df['score'] > 85]
print(filtered_df)
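
Filters can also combine several conditions. The sketch below rebuilds the same sample data so it runs on its own. The thresholds are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'score': [90, 80, 85, 95],
})

# Combine conditions with & (and) or | (or);
# each condition needs its own parentheses
strong_under_40 = df[(df['score'] >= 85) & (df['age'] < 40)]
print(strong_under_40)

# query() expresses the same filter as a string
same_result = df.query('score >= 85 and age < 40')
print(same_result)
```

Python's `and` / `or` keywords do not work on whole columns, which is why the bracket form uses `&` and `|`.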

Sorting Data

# Sort by score, highest first
sorted_df = df.sort_values(by='score', ascending=False)
print(sorted_df)
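
`sort_values` returns a new DataFrame and leaves the original untouched. The sorted rows keep their old row labels. The sketch below, which rebuilds the sample data so it runs on its own, shows this and one way to renumber the rows. The `ranked` name is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'score': [90, 80, 85, 95],
})

sorted_df = df.sort_values(by='score', ascending=False)

# Row labels travel with their rows, so they are no longer in order
print(list(sorted_df.index))

# reset_index(drop=True) renumbers the rows starting from 0
ranked = sorted_df.reset_index(drop=True)
print(ranked)

# The original DataFrame still has its original order
print(df)
```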

Aggregating Data

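# Group by age and calculate the average score for each group
# (every age is unique in this sample, so each group holds one student)
grouped_df = df.groupby('age')['score'].mean()  # result is a Series indexed by age
print(grouped_df)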
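
Because each age appears only once in the sample, grouping by exact age gives one student per group. One possible way to get more meaningful groups is to bin ages into ranges with `pd.cut` before grouping. In the sketch below, the bin edges, the labels, and the `age_group` column name are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'score': [90, 80, 85, 95],
})

# Bin ages into ranges; right=False makes each bin include its lower edge,
# so 30 falls in '30-39' and 40 falls in '40-49'
df['age_group'] = pd.cut(
    df['age'],
    bins=[20, 30, 40, 50],
    right=False,
    labels=['20-29', '30-39', '40-49'],
)

# Average score per age range (observed=True keeps only bins that have data)
avg_by_group = df.groupby('age_group', observed=True)['score'].mean()
print(avg_by_group)
```

Here Bob and Charlie share the '30-39' range, so that group's mean is a true average of two scores.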

Class Activity

  • Try filtering the dataset to get students over 30 years old
  • Sort the data by age, then by score
  • Aggregate the data to find average scores per age group

Introduction to Data Visualization

import matplotlib.pyplot as plt

# Simple bar chart for scores
plt.bar(df['name'], df['score'])
plt.title('Student Scores')
plt.xlabel('Student')
plt.ylabel('Score')
plt.show()
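
The same chart can be built with matplotlib's object-oriented interface, which keeps the figure and axes in variables and is easier to extend later. The sketch below is a minimal version. The `plot_scores` helper is a hypothetical name, and the 0-100 y-axis range assumes scores are on that scale.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'score': [90, 80, 85, 95],
})


def plot_scores(frame):
    """Draw a bar chart of scores, highest first, and return the Axes."""
    ordered = frame.sort_values(by='score', ascending=False)
    fig, ax = plt.subplots()
    ax.bar(ordered['name'], ordered['score'])
    ax.set_title('Student Scores')
    ax.set_xlabel('Student')
    ax.set_ylabel('Score')
    ax.set_ylim(0, 100)  # assumption: scores are on a 0-100 scale
    return ax


ax = plot_scores(df)
# Call plt.show() to display the chart,
# or ax.figure.savefig('scores.png') to save it to a file
```

Sorting before plotting puts the bars in rank order, which usually makes comparisons easier to read.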

Class Challenge

  • Create a line chart showing how the average score changes across age groups
  • Use matplotlib to visualize your findings from the dataset

Wrap-Up

  • Filtering, sorting, and aggregating are the core steps for extracting useful insights from a dataset
  • Next time: We'll dive deeper into working with larger, real-world datasets