{ "cells": [ { "cell_type": "markdown", "id": "3e0b7922", "metadata": {}, "source": [ "# Introduction to Pandas\n", "\n", "In this notebook, we will explore how to use the Pandas library to manipulate and analyze data. Pandas is an essential tool in any data science project. You will learn how to load data, explore it, and manipulate it effectively using `DataFrame`.\n", "\n", "## Objectives\n", "\n", "- Learn how to load data into a `DataFrame`\n", "- Understand the difference between indices and indexes\n", "- Manipulate `DataFrame` (selection, adding/removing columns, filtering)\n", "- Apply simple operations on data (averages, groupings)\n", "- Save results in different formats\n", "\n", "## 1. Import Pandas and Load Data" ] }, { "cell_type": "code", "execution_count": 1, "id": "3cf2e829", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "1 10.34 1.66 Male No Sun Dinner 3\n", "2 21.01 3.50 Male No Sun Dinner 3\n", "3 23.68 3.31 Male No Sun Dinner 2\n", "4 24.59 3.61 Female No Sun Dinner 4" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# Load a sample dataset\n", "url = \"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv\"\n", "data = pd.read_csv(url)\n", "\n", "# Show the first few rows of the DataFrame\n", "data.head()" ] }, { "cell_type": "markdown", "id": "aa7e38a3", "metadata": {}, "source": [ "We have loaded a dataset with `pd.read_csv()` from a URL. This function can also be used to load local files (e.g., `.csv`, `.txt`).\n", "\n", "## 2. Manipulating `DataFrame`\n", "\n", "### 2.1 Structure of a `DataFrame`\n", "A `DataFrame` is a table of data with rows and columns. Each column has a name, and each row is identified by an index." ] }, { "cell_type": "code", "execution_count": 2, "id": "104f68c4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 244 entries, 0 to 243\n", "Data columns (total 7 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 total_bill 244 non-null float64\n", " 1 tip 244 non-null float64\n", " 2 sex 244 non-null object \n", " 3 smoker 244 non-null object \n", " 4 day 244 non-null object \n", " 5 time 244 non-null object \n", " 6 size 244 non-null int64 \n", "dtypes: float64(2), int64(1), object(4)\n", "memory usage: 13.5+ KB\n" ] } ], "source": [ "# Display information about the DataFrame\n", "data.info()" ] }, { "cell_type": "markdown", "id": "0b8bd7c0", "metadata": {}, "source": [ "### 2.2 Selecting Columns and Rows\n", "\n", "You can access specific columns or rows by using column names or indexes." ] }, { "cell_type": "code", "execution_count": 3, "id": "5df5cfa4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtip
016.991.01
110.341.66
221.013.50
323.683.31
424.593.61
525.294.71
\n", "
" ], "text/plain": [ " total_bill tip\n", "0 16.99 1.01\n", "1 10.34 1.66\n", "2 21.01 3.50\n", "3 23.68 3.31\n", "4 24.59 3.61\n", "5 25.29 4.71" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select a column\n", "data['total_bill'].head()\n", "\n", "# Select multiple columns\n", "data[['total_bill', 'tip']].head()\n", "\n", "# Select a row (by index)\n", "data.loc[0]\n", "\n", "# Select multiple rows and columns\n", "data.loc[0:5, ['total_bill', 'tip']]" ] }, { "cell_type": "markdown", "id": "0ea3e91c", "metadata": {}, "source": [ "### 2.3 Index vs Indices\n", "\n", "The index in a `DataFrame` is the set of labels identifying each row, and it does not necessarily correspond to the row indices (which are just integers from 0 to N)." ] }, { "cell_type": "code", "execution_count": 4, "id": "13fd38d9", "metadata": {}, "outputs": [], "source": [ "# Display the current index\n", "data.index\n", "\n", "# Change the DataFrame index\n", "data.set_index('day', inplace=True)\n", "\n", "# Check the new index\n", "data.head()\n", "\n", "# Reset the index to revert to default indices\n", "data.reset_index(inplace=True)" ] }, { "cell_type": "markdown", "id": "d77aa4a7", "metadata": {}, "source": [ "## 3. Data Operations\n", "\n", "### 3.1 Filtering and Conditions\n", "\n", "We can filter rows based on specific conditions." ] }, { "cell_type": "code", "execution_count": 5, "id": "e1581209", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
daytotal_billtipsexsmokertimesize
2Sun21.013.50MaleNoDinner3
3Sun23.683.31MaleNoDinner2
4Sun24.593.61FemaleNoDinner4
5Sun25.294.71MaleNoDinner4
7Sun26.883.12MaleNoDinner4
\n", "
" ], "text/plain": [ " day total_bill tip sex smoker time size\n", "2 Sun 21.01 3.50 Male No Dinner 3\n", "3 Sun 23.68 3.31 Male No Dinner 2\n", "4 Sun 24.59 3.61 Female No Dinner 4\n", "5 Sun 25.29 4.71 Male No Dinner 4\n", "7 Sun 26.88 3.12 Male No Dinner 4" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter rows where the total bill is greater than 20\n", "high_total_bill = data[data['total_bill'] > 20]\n", "high_total_bill.head()" ] }, { "cell_type": "markdown", "id": "412cb22c", "metadata": {}, "source": [ "### 3.2 Adding or Removing Columns\n", "\n", "You can easily add new columns derived from existing columns or remove unnecessary columns." ] }, { "cell_type": "code", "execution_count": 6, "id": "a1f7ba25", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
daytotal_billtipsmokertimesizetip_percent
0Sun16.991.01NoDinner25.944673
1Sun10.341.66NoDinner316.054159
2Sun21.013.50NoDinner316.658734
3Sun23.683.31NoDinner213.978041
4Sun24.593.61NoDinner414.680765
\n", "
" ], "text/plain": [ " day total_bill tip smoker time size tip_percent\n", "0 Sun 16.99 1.01 No Dinner 2 5.944673\n", "1 Sun 10.34 1.66 No Dinner 3 16.054159\n", "2 Sun 21.01 3.50 No Dinner 3 16.658734\n", "3 Sun 23.68 3.31 No Dinner 2 13.978041\n", "4 Sun 24.59 3.61 No Dinner 4 14.680765" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add a calculated column\n", "data['tip_percent'] = data['tip'] / data['total_bill'] * 100\n", "\n", "# Remove a column\n", "data.drop(columns=['sex'], inplace=True)\n", "\n", "data.head()" ] }, { "cell_type": "markdown", "id": "1808f48a", "metadata": {}, "source": [ "## 4. Descriptive Analysis and Grouping\n", "\n", "### 4.1 Descriptive Statistics\n", "\n", "Pandas makes it easy to get descriptive statistics on your data." ] }, { "cell_type": "code", "execution_count": 7, "id": "fe75fd30", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsizetip_percent
count244.000000244.000000244.000000244.000000
mean19.7859432.9982792.56967216.080258
std8.9024121.3836380.9511006.107220
min3.0700001.0000001.0000003.563814
25%13.3475002.0000002.00000012.912736
50%17.7950002.9000002.00000015.476977
75%24.1275003.5625003.00000019.147549
max50.81000010.0000006.00000071.034483
\n", "
" ], "text/plain": [ " total_bill tip size tip_percent\n", "count 244.000000 244.000000 244.000000 244.000000\n", "mean 19.785943 2.998279 2.569672 16.080258\n", "std 8.902412 1.383638 0.951100 6.107220\n", "min 3.070000 1.000000 1.000000 3.563814\n", "25% 13.347500 2.000000 2.000000 12.912736\n", "50% 17.795000 2.900000 2.000000 15.476977\n", "75% 24.127500 3.562500 3.000000 19.147549\n", "max 50.810000 10.000000 6.000000 71.034483" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get descriptive statistics\n", "data.describe()" ] }, { "cell_type": "markdown", "id": "bd0f3a9e", "metadata": {}, "source": [ "### 4.2 Data Grouping\n", "\n", "It is often useful to group data by one or more columns and apply computation functions." ] }, { "cell_type": "code", "execution_count": 8, "id": "a068d720", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Sat 87\n", "Sun 76\n", "Thur 62\n", "Fri 19\n", "Name: day, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compute the average tips by day\n", "data.groupby('day')['tip'].mean()\n", "\n", "# Count occurrences by day\n", "data['day'].value_counts()" ] }, { "cell_type": "markdown", "id": "5b872230", "metadata": {}, "source": [ "## 5. Saving Results\n", "\n", "Once your data is manipulated, you can easily save it in different formats (CSV, Excel, etc.)." ] }, { "cell_type": "code", "execution_count": 9, "id": "b6aaf085", "metadata": {}, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'openpyxl'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "Input \u001b[0;32mIn [9]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 2\u001b[0m data\u001b[38;5;241m.\u001b[39mto_csv(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mmodified_data.csv\u001b[39m\u001b[38;5;124m'\u001b[39m, index\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[1;32m 4\u001b[0m \u001b[38;5;66;03m# Save to Excel\u001b[39;00m\n\u001b[0;32m----> 5\u001b[0m \u001b[43mdata\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mto_excel\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mmodified_data.xlsx\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m/home/zebulon/miniconda3/envs/m1psl/lib/python3.10/site-packages/pandas/core/generic.py:2345\u001b[0m, in \u001b[0;36mNDFrame.to_excel\u001b[0;34m(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes, storage_options)\u001b[0m\n\u001b[1;32m 2332\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mpandas\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mio\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mformats\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mexcel\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m ExcelFormatter\n\u001b[1;32m 2334\u001b[0m formatter \u001b[38;5;241m=\u001b[39m ExcelFormatter(\n\u001b[1;32m 2335\u001b[0m df,\n\u001b[1;32m 2336\u001b[0m na_rep\u001b[38;5;241m=\u001b[39mna_rep,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 2343\u001b[0m inf_rep\u001b[38;5;241m=\u001b[39minf_rep,\n\u001b[1;32m 2344\u001b[0m )\n\u001b[0;32m-> 2345\u001b[0m \u001b[43mformatter\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mwrite\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 2346\u001b[0m \u001b[43m \u001b[49m\u001b[43mexcel_writer\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2347\u001b[0m \u001b[43m \u001b[49m\u001b[43msheet_name\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43msheet_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2348\u001b[0m \u001b[43m \u001b[49m\u001b[43mstartrow\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mstartrow\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2349\u001b[0m \u001b[43m \u001b[49m\u001b[43mstartcol\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mstartcol\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2350\u001b[0m \u001b[43m \u001b[49m\u001b[43mfreeze_panes\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mfreeze_panes\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2351\u001b[0m \u001b[43m \u001b[49m\u001b[43mengine\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mengine\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2352\u001b[0m \u001b[43m \u001b[49m\u001b[43mstorage_options\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mstorage_options\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2353\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m/home/zebulon/miniconda3/envs/m1psl/lib/python3.10/site-packages/pandas/io/formats/excel.py:888\u001b[0m, in \u001b[0;36mExcelFormatter.write\u001b[0;34m(self, writer, sheet_name, startrow, startcol, freeze_panes, engine, storage_options)\u001b[0m\n\u001b[1;32m 884\u001b[0m need_save \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[1;32m 885\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 886\u001b[0m \u001b[38;5;66;03m# error: Cannot instantiate abstract class 'ExcelWriter' with abstract\u001b[39;00m\n\u001b[1;32m 887\u001b[0m \u001b[38;5;66;03m# attributes 'engine', 'save', 'supported_extensions' and 'write_cells'\u001b[39;00m\n\u001b[0;32m--> 888\u001b[0m writer \u001b[38;5;241m=\u001b[39m \u001b[43mExcelWriter\u001b[49m\u001b[43m(\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# type: ignore[abstract]\u001b[39;49;00m\n\u001b[1;32m 889\u001b[0m \u001b[43m \u001b[49m\u001b[43mwriter\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mengine\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mengine\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mstorage_options\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mstorage_options\u001b[49m\n\u001b[1;32m 890\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 891\u001b[0m need_save \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[1;32m 893\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n", "File \u001b[0;32m/home/zebulon/miniconda3/envs/m1psl/lib/python3.10/site-packages/pandas/io/excel/_openpyxl.py:49\u001b[0m, in \u001b[0;36mOpenpyxlWriter.__init__\u001b[0;34m(self, path, engine, date_format, datetime_format, mode, storage_options, if_sheet_exists, engine_kwargs, **kwargs)\u001b[0m\n\u001b[1;32m 36\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m__init__\u001b[39m(\n\u001b[1;32m 37\u001b[0m \u001b[38;5;28mself\u001b[39m,\n\u001b[1;32m 38\u001b[0m path,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 47\u001b[0m ):\n\u001b[1;32m 48\u001b[0m \u001b[38;5;66;03m# Use the openpyxl module as the Excel writer.\u001b[39;00m\n\u001b[0;32m---> 49\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mopenpyxl\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mworkbook\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m Workbook\n\u001b[1;32m 51\u001b[0m engine_kwargs \u001b[38;5;241m=\u001b[39m combine_kwargs(engine_kwargs, kwargs)\n\u001b[1;32m 53\u001b[0m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39m\u001b[38;5;21m__init__\u001b[39m(\n\u001b[1;32m 54\u001b[0m path,\n\u001b[1;32m 55\u001b[0m mode\u001b[38;5;241m=\u001b[39mmode,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 58\u001b[0m engine_kwargs\u001b[38;5;241m=\u001b[39mengine_kwargs,\n\u001b[1;32m 59\u001b[0m )\n", "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'openpyxl'" ] } ], "source": [ "# Save to CSV\n", "data.to_csv('modified_data.csv', index=False)\n", "\n", "# Save to Excel\n", "data.to_excel('modified_data.xlsx', index=False)" ] }, { "cell_type": "markdown", "id": "540482f9", "metadata": {}, "source": [ "## Conclusion\n", "\n", "This notebook has introduced you to the basics of Pandas. You have learned how to:\n", "- Load data into a `DataFrame`\n", "- Manipulate data (selection, adding columns)\n", "- Perform simple analyses (grouping, calculations)\n", "- Save results\n", "\n", "Pandas is a powerful tool that will allow you to explore and analyze your data efficiently." ] } ], "metadata": { "jupytext": { "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.13, "jupytext_version": "1.14.1" } }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" }, "source_map": [ 12, 28, 37, 46, 49, 55, 67, 73, 85, 93, 97, 103, 111, 119, 122, 128, 134, 140, 146 ] }, "nbformat": 4, "nbformat_minor": 5 }