Pandas for Beginners: A Practical Introduction

Yash
Published in Level Up Coding
Aug 31, 2023

Pandas is one of the most popular and powerful data analysis libraries for Python. It provides easy-to-use data structures and tools for working with structured data. In this post, we'll go through a practical introduction to using Pandas for data analysis.

Importing Pandas:

To start using Pandas, we first need to import it:

import pandas as pd

The convention is to import Pandas using `pd` as the shorthand name.

Creating a Pandas DataFrame:

A Pandas DataFrame is a 2-dimensional labeled data structure whose columns can each hold a different data type (strings, numbers, booleans, etc.). It is similar to a spreadsheet or SQL table.

Let's create a simple DataFrame from a dictionary:

data = {'Name': ['John', 'Mary', 'Peter', 'Jeff', 'Bill'],
        'Age': [28, 32, 47, 19, 55],
        'Gender': ['Male', 'Female', 'Male', 'Male', 'Male']}
df = pd.DataFrame(data)
print(df)
    Name  Age  Gender
0   John   28    Male
1   Mary   32  Female
2  Peter   47    Male
3   Jeff   19    Male
4   Bill   55    Male

The dictionary keys become the column names and the values become the data in columns.
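
To check how the dictionary was mapped, a quick sketch like the one below inspects the resulting DataFrame. It rebuilds the same `data` dictionary so it can run on its own; the variable names `column_names`, `shape`, and `age_dtype` are just illustrative.

```python
import pandas as pd

# Same dictionary as in the example above
data = {
    'Name': ['John', 'Mary', 'Peter', 'Jeff', 'Bill'],
    'Age': [28, 32, 47, 19, 55],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Male'],
}
df = pd.DataFrame(data)

# Column labels come from the dictionary keys, in order
column_names = list(df.columns)

# shape is a (rows, columns) tuple
shape = df.shape

# Each column carries its own data type
age_dtype = df['Age'].dtype

print(column_names)
print(shape)
print(age_dtype)
```

Checking `df.shape` and the column dtypes right after building or loading a DataFrame is a cheap way to catch surprises early.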

Selecting Columns:

We can select a column in Pandas using the column name like a dictionary key:

ages = df['Age']
print(ages)
0    28
1    32
2    47
3    19
4    55
Name: Age, dtype: int64

This returns a Pandas Series containing just the 'Age' column data.
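
Passing a list of column names instead of a single name returns a DataFrame rather than a Series. A minimal sketch, reusing the example data from earlier (the name `subset` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Mary', 'Peter', 'Jeff', 'Bill'],
    'Age': [28, 32, 47, 19, 55],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Male'],
})

# A single column name gives back a Series
ages = df['Age']

# A list of column names gives back a DataFrame with just those columns
subset = df[['Name', 'Age']]

print(type(ages).__name__)
print(type(subset).__name__)
```

Note the double brackets in `df[['Name', 'Age']]`: the outer pair is the selection, the inner pair is the Python list of names.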

Selecting Rows:

We can select rows by position using a slice, or by a condition using boolean indexing. Let’s get the first 3 rows:

print(df[0:3])
    Name  Age  Gender
0   John   28    Male
1   Mary   32  Female
2  Peter   47    Male

And rows where 'Age' is greater than 30:

print(df[df['Age'] > 30])
    Name  Age  Gender
1   Mary   32  Female
2  Peter   47    Male
4   Bill   55    Male
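
It can help to see what boolean indexing does under the hood: the comparison first builds a Series of True/False values, and that mask then picks the rows. The sketch below, which reuses the example data, also shows one way to combine two conditions and the explicit positional form `iloc`. The names `mask`, `older_men`, and `first_three` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Mary', 'Peter', 'Jeff', 'Bill'],
    'Age': [28, 32, 47, 19, 55],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Male'],
})

# The comparison produces a boolean Series, one value per row
mask = df['Age'] > 30

# Combine conditions with & (and) or | (or); each condition needs parentheses
older_men = df[(df['Age'] > 30) & (df['Gender'] == 'Male')]

# iloc selects by integer position; here it matches df[0:3]
first_three = df.iloc[0:3]

print(mask.tolist())
print(older_men)
print(first_three)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why the `&` and `|` operators are used here.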

Loading Data from CSV:

We can easily load data into a DataFrame from a CSV file using `read_csv()`:

df = pd.read_csv('data.csv')

This will load the 'data.csv' file into a Pandas DataFrame.

`read_csv()` accepts many additional options, such as `parse_dates` for converting columns to dates and `na_values` for declaring which strings should count as missing.
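
As a rough illustration of those options, the sketch below reads a small CSV from an in-memory string instead of a file, so it runs without a 'data.csv' on disk. The CSV content, the `Joined` column, and the 'unknown' marker are all made up for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """Name,Age,Joined
John,28,2021-03-15
Mary,unknown,2020-11-02
Peter,47,2019-07-30
"""

df = pd.read_csv(
    io.StringIO(csv_text),     # a file path such as 'data.csv' works here too
    parse_dates=['Joined'],    # convert this column to datetime values
    na_values=['unknown'],     # treat this string as a missing value
)

print(df)
print(df.dtypes)
```

Because one age is now missing, Pandas stores the `Age` column as floating point numbers with `NaN` in the gap.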

Basic Data Cleaning:

Pandas makes it easy to get rid of missing data and tidy up messy data:

# Drop rows with missing values (returns a new DataFrame)
df = df.dropna()

# Fill missing values with a value of your choice, e.g. 0
df = df.fillna(0)

# Change column names
df = df.rename(columns={'old_name': 'new_name'})
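
One detail that often trips up beginners is that these methods return a new DataFrame by default and leave the original untouched. A small sketch on made-up data (the `raw` DataFrame and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age
raw = pd.DataFrame({
    'name': ['John', 'Mary', 'Peter'],
    'age': [28, np.nan, 47],
})

# Remove every row that has at least one missing value
dropped = raw.dropna()

# Replace the missing age with the mean of the known ages
filled = raw.fillna({'age': raw['age'].mean()})

# Rename both columns in one call
renamed = raw.rename(columns={'name': 'Name', 'age': 'Age'})

print(dropped)
print(filled)
print(renamed)

# raw still contains its missing value
print(raw['age'].isna().sum())
```

Whether to drop or fill depends on the data; filling with the mean is only one possible choice and is used here purely as an example.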

Useful Operations:

Pandas includes a lot of vectorized functions that make data munging fast:

# Calculate sum of Age column
df['Age'].sum()

# Get mean of Age
df['Age'].mean()

# Get max value of Age
df['Age'].max()

# Sort by Age column
df.sort_values('Age')

There are many more functions for aggregations, slicing, transforming, combining, and visualizing data.
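
To give a taste of those aggregations, here is a short sketch on the example data that groups by a column, computes several statistics at once, and sorts in descending order. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Mary', 'Peter', 'Jeff', 'Bill'],
    'Age': [28, 32, 47, 19, 55],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Male'],
})

# Mean age within each gender group
mean_age_by_gender = df.groupby('Gender')['Age'].mean()

# Several summary statistics for one column in a single call
age_summary = df['Age'].agg(['min', 'max', 'mean'])

# Sort from oldest to youngest
oldest_first = df.sort_values('Age', ascending=False)

print(mean_age_by_gender)
print(age_summary)
print(oldest_first)
```

`groupby` follows a split, apply, combine pattern: rows are split by the `Gender` value, the mean is applied to each group, and the results are combined into one Series.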

Conclusion:

This covers some of the basics of using Pandas for practical data analysis in Python. Key takeaways:

- DataFrame for storing tabular data
- Reading data from CSV files with `read_csv()`
- Column selection, row slicing, boolean indexing
- Built-in methods for cleaning, munging and transforming
- Vectorized operations for fast data analysis

Pandas combines ease of use with performance, making it indispensable for data science workflows. Happy Learning!

Part 2: Mastering Pandas: Advanced Techniques for Data Manipulation Excellence

