You work with data. Often, this data comes in tables, like in Excel spreadsheets or databases. Managing this kind of data in Python can be tricky with just basic lists or dictionaries. This is where Pandas helps.
Pandas is a powerful Python library built for data manipulation and analysis. It gives you easy-to-use data structures and tools to work with tabular data. This means you can clean, transform, analyze, and visualize large datasets without writing a lot of complex code.
It helps you quickly handle common data tasks. This saves you a lot of time when preparing data for analysis or machine learning.
What is Pandas?
Pandas is an open-source Python library. It provides data structures like Series and DataFrames that simplify working with “relational” or “labeled” data. You can think of it like a spreadsheet for your Python code.
It builds on top of NumPy, another Python library for numerical operations. Pandas makes data operations fast and efficient, even with large datasets.
Getting Started: Installation and Import
Before you use Pandas, you need to install it. If you have Python installed, use pip.
To install Pandas, open your terminal or command prompt and type:
pip install pandas
After installing, you need to import the library into your Python script or Jupyter Notebook. The standard way is to import it with the alias pd.
import pandas as pd
This line makes all Pandas functions available through pd.
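If you want to verify the setup, a quick check is to print the library's version number (the exact version you see depends on what pip installed):
import pandas as pd

# Print the installed Pandas version to confirm the import works
print(pd.__version__)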
Pandas Data Structures
Pandas mainly uses two core data structures:
- Series: A one-dimensional array. It can hold any data type (numbers, strings, objects). Think of it as a single column of data from a spreadsheet.
- DataFrame: A two-dimensional table. It stores data in rows and columns. This is like a complete Excel spreadsheet or a database table.
Let’s look at how to create them.
1. Creating a Series
You can create a Series from a list or a NumPy array.
import pandas as pd
# Create a Series from a list
data_list = [10, 20, 30, 40, 50]
my_series = pd.Series(data_list)
print("Series from a list:")
print(my_series)
# Create a Series with custom labels (index)
data_dict = {'a': 100, 'b': 200, 'c': 300}
my_series_labeled = pd.Series(data_dict)
print("\nSeries with custom labels:")
print(my_series_labeled)
This code outputs:
Series from a list:
0    10
1    20
2    30
3    40
4    50
dtype: int64

Series with custom labels:
a    100
b    200
c    300
dtype: int64
Notice the numbers 0, 1, 2, 3, 4 on the left. These are the default index (labels) of the Series. When you use a dictionary, the keys become the index.
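You can also supply custom labels directly through the index parameter when building a Series from a list. A small illustrative sketch (the labels 'x', 'y', 'z' are just example names):
import pandas as pd

# Pass explicit labels via the index parameter instead of using a dictionary
scores = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
print(scores)
Here the labels x, y, z replace the default 0, 1, 2.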
2. Creating a DataFrame
You can create a DataFrame from dictionaries, lists of lists, or by reading data from files.
From a Dictionary
This is a common way to create a DataFrame. Each key in the dictionary becomes a column name.
import pandas as pd
# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo']
}
df = pd.DataFrame(data)
print("DataFrame from a dictionary:")
print(df)
This code outputs:
DataFrame from a dictionary:
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
3    David   28     Tokyo
This looks like a table, with column names and rows. The numbers 0, 1, 2, 3 are the default row index.
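As mentioned above, a DataFrame can also come from a list of lists: each inner list becomes one row, and the column names are passed separately. A minimal sketch reusing the same example data:
import pandas as pd

# Each inner list is one row; column names are given explicitly
rows = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'London'],
    ['Charlie', 35, 'Paris'],
    ['David', 28, 'Tokyo']
]
df_rows = pd.DataFrame(rows, columns=['Name', 'Age', 'City'])
print(df_rows)
This produces the same table as the dictionary version.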
From a CSV File
Often, your data comes from files like CSV (Comma Separated Values). Pandas has functions to read these files directly into a DataFrame.
First, create a sample data.csv file:
Name,Age,City
Alice,25,New York
Bob,30,London
Charlie,35,Paris
David,28,Tokyo
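You can create this file by hand in any text editor, or generate it from Python itself. One way, assuming the file should live in the current working directory, is to build a DataFrame and write it out with to_csv:
import pandas as pd

# Write the sample table to data.csv; index=False leaves out the row numbers
sample = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo']
})
sample.to_csv('data.csv', index=False)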
Now, read it into a DataFrame:
import pandas as pd
# Read a CSV file into a DataFrame
# Make sure 'data.csv' is in the same directory as your script
df_csv = pd.read_csv('data.csv')
print("\nDataFrame from a CSV file:")
print(df_csv)
This code will produce the same DataFrame as the dictionary example if data.csv is correctly set up.
Basic DataFrame Operations
Once you have a DataFrame, you can do many things with it.
1. Viewing Data
You often want to quickly see what your data looks like.
- df.head(): Shows the first 5 rows.
- df.tail(): Shows the last 5 rows.
- df.info(): Gives a summary of the DataFrame, including data types and non-null values.
- df.describe(): Provides a statistical summary for numerical columns (mean, min, max, etc.).
- df.shape: Returns a tuple with the number of rows and columns.
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 22, 40],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Rome', 'Berlin'],
    'Salary': [50000, 60000, 75000, 55000, 48000, 80000]
}
df = pd.DataFrame(data)
print("First 3 rows:")
print(df.head(3))
print("\nLast 2 rows:")
print(df.tail(2))
print("\nDataFrame Info:")
df.info()
print("\nStatistical Description:")
print(df.describe())
print("\nDataFrame Shape (rows, columns):")
print(df.shape)
2. Selecting Data
You can select specific columns or rows from your DataFrame.
- Select a single column: Use single square brackets, df['ColumnName']. This returns a Series.
- Select multiple columns: Use double square brackets, df[['Col1', 'Col2']]. This returns a DataFrame.
- Select rows by label (.loc): Use .loc[row_label, column_label].
- Select rows by position (.iloc): Use .iloc[row_position, column_position].
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 22, 40],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Rome', 'Berlin'],
    'Salary': [50000, 60000, 75000, 55000, 48000, 80000]
}
df = pd.DataFrame(data)
print("Select 'Name' column:")
print(df['Name'])
print("\nSelect 'Name' and 'Age' columns:")
print(df[['Name', 'Age']])
print("\nSelect row with index 1 (Bob) using .loc:")
print(df.loc[1]) # Selects row by its label (index)
print("\nSelect row at position 1 (second row) using .iloc:")
print(df.iloc[1]) # Selects row by its integer position
print("\nSelect 'City' for row with index 2 (Charlie) using .loc:")
print(df.loc[2, 'City'])
print("\nSelect 'Age' for row at position 0 (Alice) using .iloc:")
print(df.iloc[0, 1]) # 0 for row position, 1 for column position (Age)
3. Filtering Data
You can filter rows based on conditions. This helps you select subsets of your data.
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 22, 40],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Rome', 'Berlin'],
    'Salary': [50000, 60000, 75000, 55000, 48000, 80000]
}
df = pd.DataFrame(data)
# Filter for people older than 30
filtered_df = df[df['Age'] > 30]
print("People older than 30:")
print(filtered_df)
# Filter for people from London
london_df = df[df['City'] == 'London']
print("\nPeople from London:")
print(london_df)
# Filter with multiple conditions (Age > 25 AND Salary > 50000)
# Use & for AND, | for OR
multi_cond_df = df[(df['Age'] > 25) & (df['Salary'] > 50000)]
print("\nPeople older than 25 AND earning over 50000:")
print(multi_cond_df)
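Besides comparison operators, you can also filter against a list of allowed values. One option, not shown in the example above, is the .isin() method; here is a short self-contained sketch:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Rome', 'Berlin']
}
df = pd.DataFrame(data)

# Keep only the rows whose City appears in the given list
europe_df = df[df['City'].isin(['London', 'Paris'])]
print("People from London or Paris:")
print(europe_df)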
4. Adding New Columns
You can create new columns based on existing ones.
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 75000, 55000]
}
df = pd.DataFrame(data)
# Add a new column 'Bonus'
df['Bonus'] = df['Salary'] * 0.10 # 10% bonus
print("DataFrame with 'Bonus' column:")
print(df)
# Add a 'Salary_Category' column based on conditions
df['Salary_Category'] = ['High' if x > 60000 else 'Low' for x in df['Salary']]
print("\nDataFrame with 'Salary_Category' column:")
print(df)
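A common alternative to the list comprehension above is NumPy's np.where, which picks one of two values depending on a condition. A short sketch using the same Salary column:
import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [50000, 60000, 75000, 55000]
}
df = pd.DataFrame(data)

# np.where returns 'High' where the condition holds and 'Low' elsewhere
df['Salary_Category'] = np.where(df['Salary'] > 60000, 'High', 'Low')
print(df)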
Cleaning Data (Missing Values)
Real-world data often has missing values. Pandas represents these as NaN (Not a Number). You can handle them in a few ways.
- df.isnull(): Returns a DataFrame of booleans, True where values are missing.
- df.dropna(): Removes rows or columns with missing values.
- df.fillna(value): Fills missing values with a specified value.
import pandas as pd
import numpy as np # Used for NaN
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, np.nan, 28, 22],
    'City': ['New York', 'London', 'Paris', np.nan, 'Rome'],
    'Salary': [50000, 60000, 75000, 55000, np.nan]
}
df_missing = pd.DataFrame(data)
print("Original DataFrame with missing values:")
print(df_missing)
print("\nCheck for null values:")
print(df_missing.isnull())
# Drop rows with any missing values
df_cleaned_drop = df_missing.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_cleaned_drop)
# Fill missing 'Age' with the mean age
mean_age = df_missing['Age'].mean()
df_filled_age = df_missing.copy() # Make a copy to avoid changing original
df_filled_age['Age'] = df_filled_age['Age'].fillna(mean_age)
print(f"\nDataFrame after filling missing Age with mean ({mean_age:.2f}):")
print(df_filled_age)
# Fill missing 'City' with 'Unknown'
df_filled_city = df_missing.copy()
df_filled_city['City'] = df_filled_city['City'].fillna('Unknown')
print("\nDataFrame after filling missing City with 'Unknown':")
print(df_filled_city)
Grouping and Aggregating Data
You often need to group data by a category and then calculate sums, averages, or counts for each group. The Pandas groupby() method is perfect for this.
import pandas as pd
data = {
    'Department': ['HR', 'Sales', 'HR', 'IT', 'Sales', 'IT'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [50000, 60000, 75000, 80000, 55000, 90000]
}
df_employees = pd.DataFrame(data)
print("Original Employee Data:")
print(df_employees)
# Group by 'Department' and calculate the mean salary
avg_salary_by_dept = df_employees.groupby('Department')['Salary'].mean()
print("\nAverage Salary by Department:")
print(avg_salary_by_dept)
# Group by 'Department' and count employees
employee_count_by_dept = df_employees.groupby('Department')['Employee'].count()
print("\nEmployee Count by Department:")
print(employee_count_by_dept)
# Group by 'Department' and get multiple aggregations
agg_results = df_employees.groupby('Department').agg(
    Total_Salary=('Salary', 'sum'),
    Avg_Salary=('Salary', 'mean'),
    Num_Employees=('Employee', 'count')
)
print("\nAggregated Results by Department:")
print(agg_results)
Conclusion
Pandas is an essential library for anyone working with data in Python. It provides flexible Series and DataFrames to handle tabular data.
You learned how to:
- Install Pandas and import it.
- Create Series and DataFrames from lists, dictionaries, and CSV files.
- View, select, and filter data using methods like head(), loc[], and boolean indexing.
- Add new columns and handle missing values.
- Group and aggregate data using groupby().