Introduction to Pandas#

In this notebook, we will explore how to use the Pandas library to manipulate and analyze data. Pandas is an essential tool in any data science project. You will learn how to load data, explore it, and manipulate it effectively using DataFrame.

Objectives#

Learn how to load data into a DataFrame
Understand the difference between indices and indexes
Manipulate DataFrame (selection, adding/removing columns, filtering)
Apply simple operations on data (averages, groupings)
Save results in different formats

1. Import Pandas and Load Data#

import pandas as pd

# Load a sample dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
data = pd.read_csv(url)

# Show the first few rows of the DataFrame
data.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

We have loaded a dataset with pd.read_csv() from a URL. This function can also be used to load local files (e.g., .csv, .txt).

2. Manipulating `DataFrame`#

2.1 Structure of a `DataFrame`#

A DataFrame is a table of data with rows and columns. Each column has a name, and each row is identified by an index.

# Display information about the DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB

2.2 Selecting Columns and Rows#

You can access specific columns or rows by using column names or indexes.

# Select a column
data['total_bill'].head()

# Select multiple columns
data[['total_bill', 'tip']].head()

# Select a row (by index)
data.loc[0]

# Select multiple rows and columns
data.loc[0:5, ['total_bill', 'tip']]

	total_bill	tip
0	16.99	1.01
1	10.34	1.66
2	21.01	3.50
3	23.68	3.31
4	24.59	3.61
5	25.29	4.71

2.3 Index vs Indices#

The index in a DataFrame is the set of labels identifying each row, and it does not necessarily correspond to the row indices (which are just integers from 0 to N).

# Display the current index
data.index

# Change the DataFrame index
data.set_index('day', inplace=True)

# Check the new index
data.head()

# Reset the index to revert to default indices
data.reset_index(inplace=True)

3. Data Operations#

3.1 Filtering and Conditions#

We can filter rows based on specific conditions.

# Filter rows where the total bill is greater than 20
high_total_bill = data[data['total_bill'] > 20]
high_total_bill.head()

	day	total_bill	tip	sex	smoker	time	size
2	Sun	21.01	3.50	Male	No	Dinner	3
3	Sun	23.68	3.31	Male	No	Dinner	2
4	Sun	24.59	3.61	Female	No	Dinner	4
5	Sun	25.29	4.71	Male	No	Dinner	4
7	Sun	26.88	3.12	Male	No	Dinner	4

3.2 Adding or Removing Columns#

You can easily add new columns derived from existing columns or remove unnecessary columns.

# Add a calculated column
data['tip_percent'] = data['tip'] / data['total_bill'] * 100

# Remove a column
data.drop(columns=['sex'], inplace=True)

data.head()

	day	total_bill	tip	smoker	time	size	tip_percent
0	Sun	16.99	1.01	No	Dinner	2	5.944673
1	Sun	10.34	1.66	No	Dinner	3	16.054159
2	Sun	21.01	3.50	No	Dinner	3	16.658734
3	Sun	23.68	3.31	No	Dinner	2	13.978041
4	Sun	24.59	3.61	No	Dinner	4	14.680765

4. Descriptive Analysis and Grouping#

4.1 Descriptive Statistics#

Pandas makes it easy to get descriptive statistics on your data.

# Get descriptive statistics
data.describe()

	total_bill	tip	size	tip_percent
count	244.000000	244.000000	244.000000	244.000000
mean	19.785943	2.998279	2.569672	16.080258
std	8.902412	1.383638	0.951100	6.107220
min	3.070000	1.000000	1.000000	3.563814
25%	13.347500	2.000000	2.000000	12.912736
50%	17.795000	2.900000	2.000000	15.476977
75%	24.127500	3.562500	3.000000	19.147549
max	50.810000	10.000000	6.000000	71.034483

4.2 Data Grouping#

It is often useful to group data by one or more columns and apply computation functions.

# Compute the average tips by day
data.groupby('day')['tip'].mean()

# Count occurrences by day
data['day'].value_counts()

Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64

5. Saving Results#

Once your data is manipulated, you can easily save it in different formats (CSV, Excel, etc.).

# Save to CSV
data.to_csv('modified_data.csv', index=False)

# Save to Excel
data.to_excel('modified_data.xlsx', index=False)

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [9], in <cell line: 5>()
data.to_csv('modified_data.csv', index=False)
# Save to Excel
----> 5 data.to_excel('modified_data.xlsx', index=False)

File /home/zebulon/miniconda3/envs/m1psl/lib/python3.10/site-packages/pandas/core/generic.py:2345, in NDFrame.to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes, storage_options)
from pandas.io.formats.excel import ExcelFormatter
formatter = ExcelFormatter(
   df,
   na_rep=na_rep,
   (...)
   inf_rep=inf_rep,
)
-> 2345 formatter.write(
   excel_writer,
   sheet_name=sheet_name,
   startrow=startrow,
   startcol=startcol,
   freeze_panes=freeze_panes,
   engine=engine,
   storage_options=storage_options,
)

File /home/zebulon/miniconda3/envs/m1psl/lib/python3.10/site-packages/pandas/io/formats/excel.py:888, in ExcelFormatter.write(self, writer, sheet_name, startrow, startcol, freeze_panes, engine, storage_options)
   need_save = False
else:
   # error: Cannot instantiate abstract class 'ExcelWriter' with abstract
   # attributes 'engine', 'save', 'supported_extensions' and 'write_cells'
--> 888     writer = ExcelWriter(  # type: ignore[abstract]
       writer, engine=engine, storage_options=storage_options
   )
   need_save = True
try:

File /home/zebulon/miniconda3/envs/m1psl/lib/python3.10/site-packages/pandas/io/excel/_openpyxl.py:49, in OpenpyxlWriter.__init__(self, path, engine, date_format, datetime_format, mode, storage_options, if_sheet_exists, engine_kwargs, **kwargs)
def __init__(
   self,
   path,
   (...)
):
   # Use the openpyxl module as the Excel writer.
---> 49     from openpyxl.workbook import Workbook
   engine_kwargs = combine_kwargs(engine_kwargs, kwargs)
   super().__init__(
       path,
       mode=mode,
   (...)
       engine_kwargs=engine_kwargs,
   )

ModuleNotFoundError: No module named 'openpyxl'

Conclusion#

This notebook has introduced you to the basics of Pandas. You have learned how to:

Load data into a DataFrame
Manipulate data (selection, adding columns)
Perform simple analyses (grouping, calculations)
Save results

Pandas is a powerful tool that will allow you to explore and analyze your data efficiently.

Python for Engineers

Introduction to Pandas

Contents

Introduction to Pandas#

Objectives#

1. Import Pandas and Load Data#

2. Manipulating `DataFrame`#

2.1 Structure of a `DataFrame`#

2.2 Selecting Columns and Rows#

2.3 Index vs Indices#

3. Data Operations#

3.1 Filtering and Conditions#

3.2 Adding or Removing Columns#

4. Descriptive Analysis and Grouping#

4.1 Descriptive Statistics#

4.2 Data Grouping#

5. Saving Results#

Conclusion#

Python for Engineers

Introduction to Pandas

Contents

Introduction to Pandas#

Objectives#

1. Import Pandas and Load Data#

2. Manipulating DataFrame#

2.1 Structure of a DataFrame#

2.2 Selecting Columns and Rows#

2.3 Index vs Indices#

3. Data Operations#

3.1 Filtering and Conditions#

3.2 Adding or Removing Columns#

4. Descriptive Analysis and Grouping#

4.1 Descriptive Statistics#

4.2 Data Grouping#

5. Saving Results#

Conclusion#

2. Manipulating `DataFrame`#

2.1 Structure of a `DataFrame`#