Introduction to Pandas
Contents
Introduction to Pandas#
In this notebook, we will explore how to use the Pandas library to manipulate and analyze data. Pandas is an essential tool in any data science project. You will learn how to load data, explore it, and manipulate it effectively using DataFrame
.
Objectives#
Learn how to load data into a
DataFrame
Understand the difference between indices and indexes
Manipulate
DataFrame
(selection, adding/removing columns, filtering)Apply simple operations on data (averages, groupings)
Save results in different formats
1. Import Pandas and Load Data#
import pandas as pd
# Load a sample dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
data = pd.read_csv(url)
# Show the first few rows of the DataFrame
data.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
We have loaded a dataset with pd.read_csv()
from a URL. This function can also be used to load local files (e.g., .csv
, .txt
).
2. Manipulating DataFrame
#
2.1 Structure of a DataFrame
#
A DataFrame
is a table of data with rows and columns. Each column has a name, and each row is identified by an index.
# Display information about the DataFrame
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null object
3 smoker 244 non-null object
4 day 244 non-null object
5 time 244 non-null object
6 size 244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB
2.2 Selecting Columns and Rows#
You can access specific columns or rows by using column names or indexes.
# Select a column
data['total_bill'].head()
# Select multiple columns
data[['total_bill', 'tip']].head()
# Select a row (by index)
data.loc[0]
# Select multiple rows and columns
data.loc[0:5, ['total_bill', 'tip']]
total_bill | tip | |
---|---|---|
0 | 16.99 | 1.01 |
1 | 10.34 | 1.66 |
2 | 21.01 | 3.50 |
3 | 23.68 | 3.31 |
4 | 24.59 | 3.61 |
5 | 25.29 | 4.71 |
2.3 Index vs Indices#
The index in a DataFrame
is the set of labels identifying each row, and it does not necessarily correspond to the row indices (which are just integers from 0 to N).
# Display the current index
data.index
# Change the DataFrame index
data.set_index('day', inplace=True)
# Check the new index
data.head()
# Reset the index to revert to default indices
data.reset_index(inplace=True)
3. Data Operations#
3.1 Filtering and Conditions#
We can filter rows based on specific conditions.
# Filter rows where the total bill is greater than 20
high_total_bill = data[data['total_bill'] > 20]
high_total_bill.head()
day | total_bill | tip | sex | smoker | time | size | |
---|---|---|---|---|---|---|---|
2 | Sun | 21.01 | 3.50 | Male | No | Dinner | 3 |
3 | Sun | 23.68 | 3.31 | Male | No | Dinner | 2 |
4 | Sun | 24.59 | 3.61 | Female | No | Dinner | 4 |
5 | Sun | 25.29 | 4.71 | Male | No | Dinner | 4 |
7 | Sun | 26.88 | 3.12 | Male | No | Dinner | 4 |
3.2 Adding or Removing Columns#
You can easily add new columns derived from existing columns or remove unnecessary columns.
# Add a calculated column
data['tip_percent'] = data['tip'] / data['total_bill'] * 100
# Remove a column
data.drop(columns=['sex'], inplace=True)
data.head()
day | total_bill | tip | smoker | time | size | tip_percent | |
---|---|---|---|---|---|---|---|
0 | Sun | 16.99 | 1.01 | No | Dinner | 2 | 5.944673 |
1 | Sun | 10.34 | 1.66 | No | Dinner | 3 | 16.054159 |
2 | Sun | 21.01 | 3.50 | No | Dinner | 3 | 16.658734 |
3 | Sun | 23.68 | 3.31 | No | Dinner | 2 | 13.978041 |
4 | Sun | 24.59 | 3.61 | No | Dinner | 4 | 14.680765 |
4. Descriptive Analysis and Grouping#
4.1 Descriptive Statistics#
Pandas makes it easy to get descriptive statistics on your data.
# Get descriptive statistics
data.describe()
total_bill | tip | size | tip_percent | |
---|---|---|---|---|
count | 244.000000 | 244.000000 | 244.000000 | 244.000000 |
mean | 19.785943 | 2.998279 | 2.569672 | 16.080258 |
std | 8.902412 | 1.383638 | 0.951100 | 6.107220 |
min | 3.070000 | 1.000000 | 1.000000 | 3.563814 |
25% | 13.347500 | 2.000000 | 2.000000 | 12.912736 |
50% | 17.795000 | 2.900000 | 2.000000 | 15.476977 |
75% | 24.127500 | 3.562500 | 3.000000 | 19.147549 |
max | 50.810000 | 10.000000 | 6.000000 | 71.034483 |
4.2 Data Grouping#
It is often useful to group data by one or more columns and apply computation functions.
# Compute the average tips by day
data.groupby('day')['tip'].mean()
# Count occurrences by day
data['day'].value_counts()
Sat 87
Sun 76
Thur 62
Fri 19
Name: day, dtype: int64
5. Saving Results#
Once your data is manipulated, you can easily save it in different formats (CSV, Excel, etc.).
# Save to CSV
data.to_csv('modified_data.csv', index=False)
# Save to Excel
data.to_excel('modified_data.xlsx', index=False)
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Input In [9], in <cell line: 5>()
2 data.to_csv('modified_data.csv', index=False)
4 # Save to Excel
----> 5 data.to_excel('modified_data.xlsx', index=False)
File /home/zebulon/miniconda3/envs/m1psl/lib/python3.10/site-packages/pandas/core/generic.py:2345, in NDFrame.to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes, storage_options)
2332 from pandas.io.formats.excel import ExcelFormatter
2334 formatter = ExcelFormatter(
2335 df,
2336 na_rep=na_rep,
(...)
2343 inf_rep=inf_rep,
2344 )
-> 2345 formatter.write(
2346 excel_writer,
2347 sheet_name=sheet_name,
2348 startrow=startrow,
2349 startcol=startcol,
2350 freeze_panes=freeze_panes,
2351 engine=engine,
2352 storage_options=storage_options,
2353 )
File /home/zebulon/miniconda3/envs/m1psl/lib/python3.10/site-packages/pandas/io/formats/excel.py:888, in ExcelFormatter.write(self, writer, sheet_name, startrow, startcol, freeze_panes, engine, storage_options)
884 need_save = False
885 else:
886 # error: Cannot instantiate abstract class 'ExcelWriter' with abstract
887 # attributes 'engine', 'save', 'supported_extensions' and 'write_cells'
--> 888 writer = ExcelWriter( # type: ignore[abstract]
889 writer, engine=engine, storage_options=storage_options
890 )
891 need_save = True
893 try:
File /home/zebulon/miniconda3/envs/m1psl/lib/python3.10/site-packages/pandas/io/excel/_openpyxl.py:49, in OpenpyxlWriter.__init__(self, path, engine, date_format, datetime_format, mode, storage_options, if_sheet_exists, engine_kwargs, **kwargs)
36 def __init__(
37 self,
38 path,
(...)
47 ):
48 # Use the openpyxl module as the Excel writer.
---> 49 from openpyxl.workbook import Workbook
51 engine_kwargs = combine_kwargs(engine_kwargs, kwargs)
53 super().__init__(
54 path,
55 mode=mode,
(...)
58 engine_kwargs=engine_kwargs,
59 )
ModuleNotFoundError: No module named 'openpyxl'
Conclusion#
This notebook has introduced you to the basics of Pandas. You have learned how to:
Load data into a
DataFrame
Manipulate data (selection, adding columns)
Perform simple analyses (grouping, calculations)
Save results
Pandas is a powerful tool that will allow you to explore and analyze your data efficiently.