Getting Started with Pandas

Share This Post

Introduction

Pandas is a python package for reading, cleaning, exploring and analyzing tabular data such as data from spreadsheets or databases. Pandas provides flexible and expressive data structures that facilitate working with labelled and relational database

Supported Data Format

 

Pandas is ideal for working with any table-like data that can be stored in common format such as excel. Dataframe is the equivalent table representation in Pandas. You can create a dataframe manually or import from excel, txt, csv. To create data manually, call the “DataFrame” method and insert array of data to create. Just as we have column and row labels in common program for handling 2-D table data format, pandas also has column headers and row labels. It is crucial to understand how pandas create dataframe from different python data structure. If the data is a python dictionary, the key would be the column headers and the values in each list would be column of the dataframe while the row labels would be integers by default except specified otherwise. The column and row lables are important for indexing. If the data is a list of list, the values in each nested list would be the rows. If column name and row headers are not specified in the case of list of list, pandas would use default integer values as column headers and row labels.

In [ ]:
# import package
import pandas as pd

#Create a dataframe from a dictionary
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df1
Out[ ]:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
In [ ]:
print(type(df1))
<class 'pandas.core.frame.DataFrame'>
In [ ]:
#create a dataframe from a list of lists
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2
Out[ ]:
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
In [ ]:
#create a dataframe from a list of lists and specify column names
df3 = pd.DataFrame([[16, 2, 53], [44, 35, 68], [73, 28, 95]], columns=['A', 'B', 'C'])
df3
Out[ ]:
A B C
0 16 2 53
1 44 35 68
2 73 28 95

To select a colum in a dataframe, use column label in between square bracket. Each column in a dataframe is a series.

In [ ]:
#select column C in df1
df1['C']
Out[ ]:
0    7
1    8
2    9
Name: C, dtype: int64
In [ ]:
print(type(df1['C']))
<class 'pandas.core.series.Series'>

Dataframe can store data of different format including integers, float, string, boolean etc. Let’s create a more heterogenous data for firs 5 passengers in the titanic dataset shown in the image below.

In [ ]:
df_4 = pd.DataFrame({
            'PassengerID':[1,2,3,4,5], 
            'Survived': [0,1,1,1,0], 
            'Name': ['Braud', 'Cumings', 'Heikkinen', 'Frode', 'Kallio'], 
            'Sex': ['male', 'female', 'female', 'female', 'male'], 
            'Age': [22, 38, 26, 35, 35], 
            'Ticket': [2457, 1135, 3421, 3, 1419], 
            'Fare': [7.25, 15.0, 26.0, 8.05, 9.0]
            })
df_4
Out[ ]:
PassengerID Survived Name Sex Age Ticket Fare
0 1 0 Braud male 22 2457 7.25
1 2 1 Cumings female 38 1135 15.00
2 3 1 Heikkinen female 26 3421 26.00
3 4 1 Frode female 35 3 8.05
4 5 0 Kallio male 35 1419 9.00

We can perform some artithmetric operation on series on dataframe. But before that two vital methods for getting statistical info about the dataframe are info() and describe()

In [ ]:
df_4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerID  5 non-null      int64  
 1   Survived     5 non-null      int64  
 2   Name         5 non-null      object 
 3   Sex          5 non-null      object 
 4   Age          5 non-null      int64  
 5   Ticket       5 non-null      int64  
 6   Fare         5 non-null      float64
dtypes: float64(1), int64(4), object(2)
memory usage: 408.0+ bytes
In [ ]:
df_4.describe()
Out[ ]:
PassengerID Survived Age Ticket Fare
count 5.000000 5.000000 5.00000 5.000000 5.000000
mean 3.000000 0.600000 31.20000 1687.000000 13.060000
std 1.581139 0.547723 6.83374 1304.994253 7.850908
min 1.000000 0.000000 22.00000 3.000000 7.250000
25% 2.000000 0.000000 26.00000 1135.000000 8.050000
50% 3.000000 1.000000 35.00000 1419.000000 9.000000
75% 4.000000 1.000000 35.00000 2457.000000 15.000000
max 5.000000 1.000000 38.00000 3421.000000 26.000000

So far, we have created dataframes manually. What if we are to read data from other programs which is most commonly the case. Pandas provide the read_ method. The is the name/extension of the source. For example, to read from a csv, use read_csv()

In [ ]:
#import from csv
titanic_df = pd.read_csv('train.csv')
titanic_df
Out[ ]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen “Carrie” female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

The data that we just read has 891 rows and 12 columns but just the top 5 rows and last 5 are shown by default

Subsetting

 

We at times need only some parts of the dataset to work. Pandas, provide a easy way to select specific columns and/or row values. To select a column, call the dataframe and the name of the column in a square brackect. This syntax works in similar approach as dictionary indexing using keys. To select multiple colums, use a list of column names

In [ ]:
#Select the age column
Age = titanic_df['Age']
print(type(Age)) #this is a series since it is one column
Age
<class 'pandas.core.series.Series'>
Out[ ]:
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

Filtering a dataframe is also very easy with dataframe. The syntax involves calling the dataframe with the filtering condition in a square bracket; df[‘condition’]. Let’s look at the example of how to select all teanagers in our titanic dataset. The condition is ‘df[‘Age’] < 19′. So we put this in a selection bracket of the dataframe. Then we have ‘df[df[‘Age’] < 19]’

In [ ]:
teens_df = titanic_df[titanic_df['Age'] < 19]
teens_df
Out[ ]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
852 853 0 3 Boulos, Miss. Nourelain female 9.0 1 1 2678 15.2458 NaN C
853 854 1 1 Lines, Miss. Mary Conover female 16.0 0 1 PC 17592 39.4000 D28 S
855 856 1 3 Aks, Mrs. Sam (Leah Rosen) female 18.0 0 1 392091 9.3500 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
875 876 1 3 Najib, Miss. Adele Kiamie “Jane” female 15.0 0 0 2667 7.2250 NaN C

139 rows × 12 columns

We can also combine conditions. Let see an example of selecting teenagers i.e df[‘Age’] < 19 that survived i.e df[‘Survived’] == 1

In [ ]:
survteens_df = titanic_df[(titanic_df['Age'] < 19) & (titanic_df['Survived'] == 1)]
survteens_df
Out[ ]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.00 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.00 1 1 PP 9549 16.7000 G6 S
22 23 1 3 McGowan, Miss. Anna “Annie” female 15.00 0 0 330923 8.0292 NaN Q
39 40 1 3 Nicola-Yarred, Miss. Jamila female 14.00 1 0 2651 11.2417 NaN C
43 44 1 2 Laroche, Miss. Simonne Marie Anne Andree female 3.00 1 2 SC/Paris 2123 41.5792 NaN C
831 832 1 2 Richards, Master. George Sibley male 0.83 1 1 29106 18.7500 NaN S
853 854 1 1 Lines, Miss. Mary Conover female 16.00 0 1 PC 17592 39.4000 D28 S
855 856 1 3 Aks, Mrs. Sam (Leah Rosen) female 18.00 0 1 392091 9.3500 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4.00 1 1 347742 11.1333 NaN S
875 876 1 3 Najib, Miss. Adele Kiamie “Jane” female 15.00 0 0 2667 7.2250 NaN C

70 rows × 12 columns

At times, when we filter our dataset, we want to selct only one column. We just use select bracket on the column of interest. However, we can also use loc operator. This works in similar way as the indexing above except that we have to specify the column we want to return in addition to the filtering condition

In [ ]:
survteens_df['Name']
Out[ ]:
9           Nasser, Mrs. Nicholas (Adele Achem)
10              Sandstrom, Miss. Marguerite Rut
22                  McGowan, Miss. Anna "Annie"
39                  Nicola-Yarred, Miss. Jamila
43     Laroche, Miss. Simonne Marie Anne Andree
                         ...                   
831             Richards, Master. George Sibley
853                   Lines, Miss. Mary Conover
855                  Aks, Mrs. Sam (Leah Rosen)
869             Johnson, Master. Harold Theodor
875            Najib, Miss. Adele Kiamie "Jane"
Name: Name, Length: 70, dtype: object
In [ ]:
titanic_df.loc[((titanic_df['Age'] < 19) & (titanic_df['Survived'] == 1), 'Name')]
Out[ ]:
9           Nasser, Mrs. Nicholas (Adele Achem)
10              Sandstrom, Miss. Marguerite Rut
22                  McGowan, Miss. Anna "Annie"
39                  Nicola-Yarred, Miss. Jamila
43     Laroche, Miss. Simonne Marie Anne Andree
                         ...                   
831             Richards, Master. George Sibley
853                   Lines, Miss. Mary Conover
855                  Aks, Mrs. Sam (Leah Rosen)
869             Johnson, Master. Harold Theodor
875            Najib, Miss. Adele Kiamie "Jane"
Name: Name, Length: 70, dtype: object

Syntax of loc or iloc operator looks complex but they are simple. The idea is the they allow row and column indexing at a go. Call the operator with the row condition and columns of interest.

df.loc[row_condition, column_of_interest]. When using row and column names

df.iloc[row_condition, column_of_interest]. When specifically interested in certain rows and columns based on position indexing

Let’s use iloc operator to select the first 30 rows of the last three columns in our titanic dataframe

In [ ]:
titanic_df.iloc[:30,-3:]
Out[ ]:
Fare Cabin Embarked
0 7.2500 NaN S
1 71.2833 C85 C
2 7.9250 NaN S
3 53.1000 C123 S
4 8.0500 NaN S
5 8.4583 NaN Q
6 51.8625 E46 S
7 21.0750 NaN S
8 11.1333 NaN S
9 30.0708 NaN C
10 16.7000 G6 S
11 26.5500 C103 S
12 8.0500 NaN S
13 31.2750 NaN S
14 7.8542 NaN S
15 16.0000 NaN S
16 29.1250 NaN Q
17 13.0000 NaN S
18 18.0000 NaN S
19 7.2250 NaN C
20 26.0000 NaN S
21 13.0000 D56 S
22 8.0292 NaN Q
23 35.5000 A6 S
24 21.0750 NaN S
25 31.3875 NaN S
26 7.2250 NaN C
27 263.0000 C23 C25 C27 S
28 7.8792 NaN Q
29 7.8958 NaN S

Creating Plots in Pandas

Pandas plot() method eanables easy visualization of data. To demonstrate its use, we will be using Snow data downloaded from snotel sites.

Use read_csv, tell pandas the column to use for indexing, and parse_dates = True

In [ ]:
SWE =pd.read_csv('Monthly_Mean_SWE.csv', index_col='Months', parse_dates=True)
SWE
Out[ ]:
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
Months
January 40.312258 28.513548 48.571355 36.920129 55.486710 28.185806 62.107097 40.533484 40.623613 39.951742 20.705097 28.030129 41.762516 50.742645 45.752774 22.007871 29.816323 33.429677 39.624000 44.884258
February 57.491586 34.181143 63.500000 46.645286 70.366759 31.532286 85.398429 47.643143 65.356828 45.411571 27.803929 50.645786 50.019857 66.950897 74.403857 34.480500 55.526214 54.758897 55.535286 48.250929
March 73.938581 38.616194 76.216387 61.992387 79.100516 36.125355 94.586323 61.820323 75.593677 61.254968 34.896323 73.045484 46.129677 77.560129 94.414258 50.414903 88.588645 61.123871 69.538645 52.197000
April 65.311867 41.164933 75.793600 68.072000 70.459600 45.017267 104.495600 52.874333 85.301667 70.366467 33.020000 80.484133 39.539333 63.965667 81.948867 46.007867 83.083400 60.790667 51.011667 NaN
May 18.853355 8.504903 33.929484 43.302903 15.715226 14.010968 50.111742 9.479935 49.882323 36.125355 3.269226 40.476129 2.179484 10.831871 23.654774 2.548194 20.393742 7.792065 3.449484 NaN
June 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.033867 0.000000 0.550333 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN
July 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN
August 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN
September 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN
October 0.991419 0.122903 0.000000 0.000000 1.229032 0.000000 0.000000 0.286774 0.073742 1.483032 0.000000 0.024581 0.024581 0.106516 1.302774 0.000000 0.778387 0.000000 0.000000 NaN
November 5.317067 2.726267 6.172200 6.468533 6.383867 12.352867 5.935133 1.049867 3.454400 4.055533 5.240867 5.579533 4.597400 0.474133 5.850467 1.845733 1.735667 9.050867 3.445933 NaN
December 18.066774 32.020387 14.945032 26.530710 19.426903 29.013355 23.138581 12.986774 13.593097 14.281355 17.591548 22.966516 28.841290 18.353548 10.512323 14.437032 11.053097 21.082000 16.215032 NaN

12 rows × 23 columns

In [ ]:
SWE.plot()
Out[ ]:
<AxesSubplot:xlabel='Months'>

With a dataframe, pandas plot() method will give one line plot for each columns with numeric data as shown by the result of the code cell above. To plot a specific column, use the selection method described above before the plot method.

In [ ]:
SWE[['2000', '2005', '2010', '2015', '2020']].plot(figsize=(10,7))
Out[ ]:
<AxesSubplot:xlabel='Months'>

Besides the default line plot, data can be visualized with other types of plots such as scatter, boxplots etc

In [ ]:
#datetime index
SWE.plot.scatter(x='2000', y='2005')
Out[ ]:
<AxesSubplot:xlabel='2000', ylabel='2005'>
In [ ]:
SWE[['2000', '2005', '2010', '2015', '2020']].plot.box()
Out[ ]:
<AxesSubplot:>
In [ ]:
#seperate plots
SWE[['2000', '2005', '2010', '2015', '2020']].plot(figsize=(10,7), subplots=True)
Out[ ]:
array([<AxesSubplot:xlabel='Months'>, <AxesSubplot:xlabel='Months'>,
       <AxesSubplot:xlabel='Months'>, <AxesSubplot:xlabel='Months'>,
       <AxesSubplot:xlabel='Months'>], dtype=object)

Groupby

Finding statistics for each category in a datset is a common task in data analysis. For example we might be interested in knowing the mean of the ages for each gender in the titanic dataset.

The syntax is:
df.groupby(catgerogy_to_groupby).statistics()

In this example, the category is sex and the statistics of interest is mean.

In [ ]:
titanic_df[['Age', 'Sex']].groupby('Sex').mean()
Out[ ]:
Age
Sex
female 27.915709
male 30.726645

Note that the groupby().mean() syntax was applied to a subselection of the two columns of interest (‘Sex’ and ‘Age’ in this case), otherwise the statitical operation will be performed on all columns containing numerical values. This is demonstrated by the code cell below. However, some columns such as Pclass (passenger class) with numerical values encode categorical information or other data types thus not ideal to find the mean or other statistics.

In [ ]:
titanic_df.groupby('Sex').mean()
Out[ ]:
PassengerId Survived Pclass Age SibSp Parch Fare
Sex
female 431.028662 0.742038 2.159236 27.915709 0.694268 0.649682 44.479818
male 454.147314 0.188908 2.389948 30.726645 0.429809 0.235702 25.523893

Alternatively, we can use a selection bracket [] after the groupby() but before the statistics to specifically select the column of interest

In [ ]:
titanic_df.groupby('Sex')['Age'].mean()
Out[ ]:
Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

We can also find the statistics of multiple category combination. For example, we might be interested in knowing the average ticket fare for each sex of each Passenger class. These can be achieved using the two code cells below

In [ ]:
titanic_df[['Fare', 'Sex', 'Pclass']].groupby(['Sex', 'Pclass']).mean()
Out[ ]:
Fare
Sex Pclass
female 1 106.125798
2 21.970121
3 16.118810
male 1 67.226127
2 19.741782
3 12.661633
In [ ]:
titanic_df.groupby(['Sex', 'Pclass'])['Fare'].mean()
Out[ ]:
Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64

It is crucial to note that the first example returned a dataframe while the second returned a series. Another information to derive from groupby is the number of occurence of each category in a function. This can be achieved using the count() method

In [ ]:
titanic_df.groupby('Sex')['Sex'].count()
Out[ ]:
Sex
female    314
male      577
Name: Sex, dtype: int64

This count operation can be achieved using just value_counts() method on the dataframe

In [ ]:
titanic_df['Sex'].value_counts()
Out[ ]:
male      577
female    314
Name: Sex, dtype: int64

 

 

More To Explore