Introduction to Pandas#

In this notebook, we will explore how to use the Pandas library to manipulate and analyze data. Pandas is an essential tool in any data science project. You will learn how to load data, explore it, and manipulate it effectively using DataFrame.

Objectives#

  • Learn how to load data into a DataFrame

  • Understand the difference between indices and indexes

  • Manipulate DataFrame (selection, adding/removing columns, filtering)

  • Apply simple operations on data (averages, groupings)

  • Save results in different formats

1. Import Pandas and Load Data#

import pandas as pd

# Load a sample dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
data = pd.read_csv(url)

# Show the first few rows of the DataFrame
data.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

We have loaded a dataset with pd.read_csv() from a URL. This function can also be used to load local files (e.g., .csv, .txt).

2. Manipulating DataFrame#

2.1 Structure of a DataFrame#

A DataFrame is a table of data with rows and columns. Each column has a name, and each row is identified by an index.

# Display information about the DataFrame
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB

2.2 Selecting Columns and Rows#

You can access specific columns or rows by using column names or indexes.

# Select a column
data['total_bill'].head()

# Select multiple columns
data[['total_bill', 'tip']].head()

# Select a row (by index)
data.loc[0]

# Select multiple rows and columns
data.loc[0:5, ['total_bill', 'tip']]
total_bill tip
0 16.99 1.01
1 10.34 1.66
2 21.01 3.50
3 23.68 3.31
4 24.59 3.61
5 25.29 4.71

2.3 Index vs Indices#

The index in a DataFrame is the set of labels identifying each row, and it does not necessarily correspond to the row indices (which are just integers from 0 to N).

# Display the current index
data.index

# Change the DataFrame index
data.set_index('day', inplace=True)

# Check the new index
data.head()

# Reset the index to revert to default indices
data.reset_index(inplace=True)

3. Data Operations#

3.1 Filtering and Conditions#

We can filter rows based on specific conditions.

# Filter rows where the total bill is greater than 20
high_total_bill = data[data['total_bill'] > 20]
high_total_bill.head()
day total_bill tip sex smoker time size
2 Sun 21.01 3.50 Male No Dinner 3
3 Sun 23.68 3.31 Male No Dinner 2
4 Sun 24.59 3.61 Female No Dinner 4
5 Sun 25.29 4.71 Male No Dinner 4
7 Sun 26.88 3.12 Male No Dinner 4

3.2 Adding or Removing Columns#

You can easily add new columns derived from existing columns or remove unnecessary columns.

# Add a calculated column
data['tip_percent'] = data['tip'] / data['total_bill'] * 100

# Remove a column
data.drop(columns=['sex'], inplace=True)

data.head()
day total_bill tip smoker time size tip_percent
0 Sun 16.99 1.01 No Dinner 2 5.944673
1 Sun 10.34 1.66 No Dinner 3 16.054159
2 Sun 21.01 3.50 No Dinner 3 16.658734
3 Sun 23.68 3.31 No Dinner 2 13.978041
4 Sun 24.59 3.61 No Dinner 4 14.680765

4. Descriptive Analysis and Grouping#

4.1 Descriptive Statistics#

Pandas makes it easy to get descriptive statistics on your data.

# Get descriptive statistics
data.describe()
total_bill tip size tip_percent
count 244.000000 244.000000 244.000000 244.000000
mean 19.785943 2.998279 2.569672 16.080258
std 8.902412 1.383638 0.951100 6.107220
min 3.070000 1.000000 1.000000 3.563814
25% 13.347500 2.000000 2.000000 12.912736
50% 17.795000 2.900000 2.000000 15.476977
75% 24.127500 3.562500 3.000000 19.147549
max 50.810000 10.000000 6.000000 71.034483

4.2 Data Grouping#

It is often useful to group data by one or more columns and apply computation functions.

# Compute the average tips by day
data.groupby('day')['tip'].mean()

# Count occurrences by day
data['day'].value_counts()
Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64

5. Saving Results#

Once your data is manipulated, you can easily save it in different formats (CSV, Excel, etc.).

# Save to CSV
data.to_csv('modified_data.csv', index=False)

# Save to Excel
data.to_excel('modified_data.xlsx', index=False)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [9], in <cell line: 5>()
      2 data.to_csv('modified_data.csv', index=False)
      4 # Save to Excel
----> 5 data.to_excel('modified_data.xlsx', index=False)

File /home/zebulon/miniconda3/envs/m1psl/lib/python3.10/site-packages/pandas/core/generic.py:2345, in NDFrame.to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes, storage_options)
   2332 from pandas.io.formats.excel import ExcelFormatter
   2334 formatter = ExcelFormatter(
   2335     df,
   2336     na_rep=na_rep,
   (...)
   2343     inf_rep=inf_rep,
   2344 )
-> 2345 formatter.write(
   2346     excel_writer,
   2347     sheet_name=sheet_name,
   2348     startrow=startrow,
   2349     startcol=startcol,
   2350     freeze_panes=freeze_panes,
   2351     engine=engine,
   2352     storage_options=storage_options,
   2353 )

File /home/zebulon/miniconda3/envs/m1psl/lib/python3.10/site-packages/pandas/io/formats/excel.py:888, in ExcelFormatter.write(self, writer, sheet_name, startrow, startcol, freeze_panes, engine, storage_options)
    884     need_save = False
    885 else:
    886     # error: Cannot instantiate abstract class 'ExcelWriter' with abstract
    887     # attributes 'engine', 'save', 'supported_extensions' and 'write_cells'
--> 888     writer = ExcelWriter(  # type: ignore[abstract]
    889         writer, engine=engine, storage_options=storage_options
    890     )
    891     need_save = True
    893 try:

File /home/zebulon/miniconda3/envs/m1psl/lib/python3.10/site-packages/pandas/io/excel/_openpyxl.py:49, in OpenpyxlWriter.__init__(self, path, engine, date_format, datetime_format, mode, storage_options, if_sheet_exists, engine_kwargs, **kwargs)
     36 def __init__(
     37     self,
     38     path,
   (...)
     47 ):
     48     # Use the openpyxl module as the Excel writer.
---> 49     from openpyxl.workbook import Workbook
     51     engine_kwargs = combine_kwargs(engine_kwargs, kwargs)
     53     super().__init__(
     54         path,
     55         mode=mode,
   (...)
     58         engine_kwargs=engine_kwargs,
     59     )

ModuleNotFoundError: No module named 'openpyxl'

Conclusion#

This notebook has introduced you to the basics of Pandas. You have learned how to:

  • Load data into a DataFrame

  • Manipulate data (selection, adding columns)

  • Perform simple analyses (grouping, calculations)

  • Save results

Pandas is a powerful tool that will allow you to explore and analyze your data efficiently.