Introduction¶
Pandas is a python package for reading, cleaning, exploring and analyzing tabular data such as data from spreadsheets or databases. Pandas provides flexible and expressive data structures that facilitate working with labelled and relational database
Supported Data Format¶
Pandas is ideal for working with any table-like data that can be stored in common format such as excel. Dataframe is the equivalent table representation in Pandas. You can create a dataframe manually or import from excel, txt, csv. To create data manually, call the “DataFrame” method and insert array of data to create. Just as we have column and row labels in common program for handling 2-D table data format, pandas also has column headers and row labels. It is crucial to understand how pandas create dataframe from different python data structure. If the data is a python dictionary, the key would be the column headers and the values in each list would be column of the dataframe while the row labels would be integers by default except specified otherwise. The column and row lables are important for indexing. If the data is a list of list, the values in each nested list would be the rows. If column name and row headers are not specified in the case of list of list, pandas would use default integer values as column headers and row labels.
# import package
import pandas as pd
#Create a dataframe from a dictionary
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df1
A | B | C | |
---|---|---|---|
0 | 1 | 4 | 7 |
1 | 2 | 5 | 8 |
2 | 3 | 6 | 9 |
print(type(df1))
<class 'pandas.core.frame.DataFrame'>
#create a dataframe from a list of lists
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2
0 | 1 | 2 | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
#create a dataframe from a list of lists and specify column names
df3 = pd.DataFrame([[16, 2, 53], [44, 35, 68], [73, 28, 95]], columns=['A', 'B', 'C'])
df3
A | B | C | |
---|---|---|---|
0 | 16 | 2 | 53 |
1 | 44 | 35 | 68 |
2 | 73 | 28 | 95 |
To select a colum in a dataframe, use column label in between square bracket. Each column in a dataframe is a series.
#select column C in df1
df1['C']
0 7 1 8 2 9 Name: C, dtype: int64
print(type(df1['C']))
<class 'pandas.core.series.Series'>
Dataframe can store data of different format including integers, float, string, boolean etc. Let’s create a more heterogenous data for firs 5 passengers in the titanic dataset shown in the image below.
df_4 = pd.DataFrame({
'PassengerID':[1,2,3,4,5],
'Survived': [0,1,1,1,0],
'Name': ['Braud', 'Cumings', 'Heikkinen', 'Frode', 'Kallio'],
'Sex': ['male', 'female', 'female', 'female', 'male'],
'Age': [22, 38, 26, 35, 35],
'Ticket': [2457, 1135, 3421, 3, 1419],
'Fare': [7.25, 15.0, 26.0, 8.05, 9.0]
})
df_4
PassengerID | Survived | Name | Sex | Age | Ticket | Fare | |
---|---|---|---|---|---|---|---|
0 | 1 | 0 | Braud | male | 22 | 2457 | 7.25 |
1 | 2 | 1 | Cumings | female | 38 | 1135 | 15.00 |
2 | 3 | 1 | Heikkinen | female | 26 | 3421 | 26.00 |
3 | 4 | 1 | Frode | female | 35 | 3 | 8.05 |
4 | 5 | 0 | Kallio | male | 35 | 1419 | 9.00 |
We can perform some artithmetric operation on series on dataframe. But before that two vital methods for getting statistical info about the dataframe are info() and describe()
df_4.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5 entries, 0 to 4 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerID 5 non-null int64 1 Survived 5 non-null int64 2 Name 5 non-null object 3 Sex 5 non-null object 4 Age 5 non-null int64 5 Ticket 5 non-null int64 6 Fare 5 non-null float64 dtypes: float64(1), int64(4), object(2) memory usage: 408.0+ bytes
df_4.describe()
PassengerID | Survived | Age | Ticket | Fare | |
---|---|---|---|---|---|
count | 5.000000 | 5.000000 | 5.00000 | 5.000000 | 5.000000 |
mean | 3.000000 | 0.600000 | 31.20000 | 1687.000000 | 13.060000 |
std | 1.581139 | 0.547723 | 6.83374 | 1304.994253 | 7.850908 |
min | 1.000000 | 0.000000 | 22.00000 | 3.000000 | 7.250000 |
25% | 2.000000 | 0.000000 | 26.00000 | 1135.000000 | 8.050000 |
50% | 3.000000 | 1.000000 | 35.00000 | 1419.000000 | 9.000000 |
75% | 4.000000 | 1.000000 | 35.00000 | 2457.000000 | 15.000000 |
max | 5.000000 | 1.000000 | 38.00000 | 3421.000000 | 26.000000 |
So far, we have created dataframes manually. What if we are to read data from other programs which is most commonly the case. Pandas provide the read_ method. The is the name/extension of the source. For example, to read from a csv, use read_csv()
#import from csv
titanic_df = pd.read_csv('train.csv')
titanic_df
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
… | … | … | … | … | … | … | … | … | … | … | … | … |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen “Carrie” | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
The data that we just read has 891 rows and 12 columns but just the top 5 rows and last 5 are shown by default
Subsetting¶
We at times need only some parts of the dataset to work. Pandas, provide a easy way to select specific columns and/or row values. To select a column, call the dataframe and the name of the column in a square brackect. This syntax works in similar approach as dictionary indexing using keys. To select multiple colums, use a list of column names
#Select the age column
Age = titanic_df['Age']
print(type(Age)) #this is a series since it is one column
Age
<class 'pandas.core.series.Series'>
0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 ... 886 27.0 887 19.0 888 NaN 889 26.0 890 32.0 Name: Age, Length: 891, dtype: float64
Filtering a dataframe is also very easy with dataframe. The syntax involves calling the dataframe with the filtering condition in a square bracket; df[‘condition’]. Let’s look at the example of how to select all teanagers in our titanic dataset. The condition is ‘df[‘Age’] < 19′. So we put this in a selection bracket of the dataframe. Then we have ‘df[df[‘Age’] < 19]’
teens_df = titanic_df[titanic_df['Age'] < 19]
teens_df
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S |
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
… | … | … | … | … | … | … | … | … | … | … | … | … |
852 | 853 | 0 | 3 | Boulos, Miss. Nourelain | female | 9.0 | 1 | 1 | 2678 | 15.2458 | NaN | C |
853 | 854 | 1 | 1 | Lines, Miss. Mary Conover | female | 16.0 | 0 | 1 | PC 17592 | 39.4000 | D28 | S |
855 | 856 | 1 | 3 | Aks, Mrs. Sam (Leah Rosen) | female | 18.0 | 0 | 1 | 392091 | 9.3500 | NaN | S |
869 | 870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4.0 | 1 | 1 | 347742 | 11.1333 | NaN | S |
875 | 876 | 1 | 3 | Najib, Miss. Adele Kiamie “Jane” | female | 15.0 | 0 | 0 | 2667 | 7.2250 | NaN | C |
139 rows × 12 columns
We can also combine conditions. Let see an example of selecting teenagers i.e df[‘Age’] < 19 that survived i.e df[‘Survived’] == 1
survteens_df = titanic_df[(titanic_df['Age'] < 19) & (titanic_df['Survived'] == 1)]
survteens_df
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.00 | 1 | 0 | 237736 | 30.0708 | NaN | C |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.00 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
22 | 23 | 1 | 3 | McGowan, Miss. Anna “Annie” | female | 15.00 | 0 | 0 | 330923 | 8.0292 | NaN | Q |
39 | 40 | 1 | 3 | Nicola-Yarred, Miss. Jamila | female | 14.00 | 1 | 0 | 2651 | 11.2417 | NaN | C |
43 | 44 | 1 | 2 | Laroche, Miss. Simonne Marie Anne Andree | female | 3.00 | 1 | 2 | SC/Paris 2123 | 41.5792 | NaN | C |
… | … | … | … | … | … | … | … | … | … | … | … | … |
831 | 832 | 1 | 2 | Richards, Master. George Sibley | male | 0.83 | 1 | 1 | 29106 | 18.7500 | NaN | S |
853 | 854 | 1 | 1 | Lines, Miss. Mary Conover | female | 16.00 | 0 | 1 | PC 17592 | 39.4000 | D28 | S |
855 | 856 | 1 | 3 | Aks, Mrs. Sam (Leah Rosen) | female | 18.00 | 0 | 1 | 392091 | 9.3500 | NaN | S |
869 | 870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4.00 | 1 | 1 | 347742 | 11.1333 | NaN | S |
875 | 876 | 1 | 3 | Najib, Miss. Adele Kiamie “Jane” | female | 15.00 | 0 | 0 | 2667 | 7.2250 | NaN | C |
70 rows × 12 columns
At times, when we filter our dataset, we want to selct only one column. We just use select bracket on the column of interest. However, we can also use loc operator. This works in similar way as the indexing above except that we have to specify the column we want to return in addition to the filtering condition
survteens_df['Name']
9 Nasser, Mrs. Nicholas (Adele Achem) 10 Sandstrom, Miss. Marguerite Rut 22 McGowan, Miss. Anna "Annie" 39 Nicola-Yarred, Miss. Jamila 43 Laroche, Miss. Simonne Marie Anne Andree ... 831 Richards, Master. George Sibley 853 Lines, Miss. Mary Conover 855 Aks, Mrs. Sam (Leah Rosen) 869 Johnson, Master. Harold Theodor 875 Najib, Miss. Adele Kiamie "Jane" Name: Name, Length: 70, dtype: object
titanic_df.loc[((titanic_df['Age'] < 19) & (titanic_df['Survived'] == 1), 'Name')]
9 Nasser, Mrs. Nicholas (Adele Achem) 10 Sandstrom, Miss. Marguerite Rut 22 McGowan, Miss. Anna "Annie" 39 Nicola-Yarred, Miss. Jamila 43 Laroche, Miss. Simonne Marie Anne Andree ... 831 Richards, Master. George Sibley 853 Lines, Miss. Mary Conover 855 Aks, Mrs. Sam (Leah Rosen) 869 Johnson, Master. Harold Theodor 875 Najib, Miss. Adele Kiamie "Jane" Name: Name, Length: 70, dtype: object
Syntax of loc or iloc operator looks complex but they are simple. The idea is the they allow row and column indexing at a go. Call the operator with the row condition and columns of interest.
df.loc[row_condition, column_of_interest]
. When using row and column names
df.iloc[row_condition, column_of_interest]
. When specifically interested in certain rows and columns based on position indexing
Let’s use iloc operator to select the first 30 rows of the last three columns in our titanic dataframe
titanic_df.iloc[:30,-3:]
Fare | Cabin | Embarked | |
---|---|---|---|
0 | 7.2500 | NaN | S |
1 | 71.2833 | C85 | C |
2 | 7.9250 | NaN | S |
3 | 53.1000 | C123 | S |
4 | 8.0500 | NaN | S |
5 | 8.4583 | NaN | Q |
6 | 51.8625 | E46 | S |
7 | 21.0750 | NaN | S |
8 | 11.1333 | NaN | S |
9 | 30.0708 | NaN | C |
10 | 16.7000 | G6 | S |
11 | 26.5500 | C103 | S |
12 | 8.0500 | NaN | S |
13 | 31.2750 | NaN | S |
14 | 7.8542 | NaN | S |
15 | 16.0000 | NaN | S |
16 | 29.1250 | NaN | Q |
17 | 13.0000 | NaN | S |
18 | 18.0000 | NaN | S |
19 | 7.2250 | NaN | C |
20 | 26.0000 | NaN | S |
21 | 13.0000 | D56 | S |
22 | 8.0292 | NaN | Q |
23 | 35.5000 | A6 | S |
24 | 21.0750 | NaN | S |
25 | 31.3875 | NaN | S |
26 | 7.2250 | NaN | C |
27 | 263.0000 | C23 C25 C27 | S |
28 | 7.8792 | NaN | Q |
29 | 7.8958 | NaN | S |
Creating Plots in Pandas¶
Pandas plot() method eanables easy visualization of data. To demonstrate its use, we will be using Snow data downloaded from snotel sites.
Use read_csv, tell pandas the column to use for indexing, and parse_dates = True
SWE =pd.read_csv('Monthly_Mean_SWE.csv', index_col='Months', parse_dates=True)
SWE
2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | … | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Months | |||||||||||||||||||||
January | 40.312258 | 28.513548 | 48.571355 | 36.920129 | 55.486710 | 28.185806 | 62.107097 | 40.533484 | 40.623613 | 39.951742 | … | 20.705097 | 28.030129 | 41.762516 | 50.742645 | 45.752774 | 22.007871 | 29.816323 | 33.429677 | 39.624000 | 44.884258 |
February | 57.491586 | 34.181143 | 63.500000 | 46.645286 | 70.366759 | 31.532286 | 85.398429 | 47.643143 | 65.356828 | 45.411571 | … | 27.803929 | 50.645786 | 50.019857 | 66.950897 | 74.403857 | 34.480500 | 55.526214 | 54.758897 | 55.535286 | 48.250929 |
March | 73.938581 | 38.616194 | 76.216387 | 61.992387 | 79.100516 | 36.125355 | 94.586323 | 61.820323 | 75.593677 | 61.254968 | … | 34.896323 | 73.045484 | 46.129677 | 77.560129 | 94.414258 | 50.414903 | 88.588645 | 61.123871 | 69.538645 | 52.197000 |
April | 65.311867 | 41.164933 | 75.793600 | 68.072000 | 70.459600 | 45.017267 | 104.495600 | 52.874333 | 85.301667 | 70.366467 | … | 33.020000 | 80.484133 | 39.539333 | 63.965667 | 81.948867 | 46.007867 | 83.083400 | 60.790667 | 51.011667 | NaN |
May | 18.853355 | 8.504903 | 33.929484 | 43.302903 | 15.715226 | 14.010968 | 50.111742 | 9.479935 | 49.882323 | 36.125355 | … | 3.269226 | 40.476129 | 2.179484 | 10.831871 | 23.654774 | 2.548194 | 20.393742 | 7.792065 | 3.449484 | NaN |
June | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.033867 | 0.000000 | 0.550333 | 0.000000 | … | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN |
July | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | … | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN |
August | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | … | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN |
September | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | … | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN |
October | 0.991419 | 0.122903 | 0.000000 | 0.000000 | 1.229032 | 0.000000 | 0.000000 | 0.286774 | 0.073742 | 1.483032 | … | 0.000000 | 0.024581 | 0.024581 | 0.106516 | 1.302774 | 0.000000 | 0.778387 | 0.000000 | 0.000000 | NaN |
November | 5.317067 | 2.726267 | 6.172200 | 6.468533 | 6.383867 | 12.352867 | 5.935133 | 1.049867 | 3.454400 | 4.055533 | … | 5.240867 | 5.579533 | 4.597400 | 0.474133 | 5.850467 | 1.845733 | 1.735667 | 9.050867 | 3.445933 | NaN |
December | 18.066774 | 32.020387 | 14.945032 | 26.530710 | 19.426903 | 29.013355 | 23.138581 | 12.986774 | 13.593097 | 14.281355 | … | 17.591548 | 22.966516 | 28.841290 | 18.353548 | 10.512323 | 14.437032 | 11.053097 | 21.082000 | 16.215032 | NaN |
12 rows × 23 columns
SWE.plot()
<AxesSubplot:xlabel='Months'>
With a dataframe, pandas plot() method will give one line plot for each columns with numeric data as shown by the result of the code cell above. To plot a specific column, use the selection method described above before the plot method.
SWE[['2000', '2005', '2010', '2015', '2020']].plot(figsize=(10,7))
<AxesSubplot:xlabel='Months'>
Besides the default line plot, data can be visualized with other types of plots such as scatter, boxplots etc
#datetime index
SWE.plot.scatter(x='2000', y='2005')
<AxesSubplot:xlabel='2000', ylabel='2005'>
SWE[['2000', '2005', '2010', '2015', '2020']].plot.box()
<AxesSubplot:>
#seperate plots
SWE[['2000', '2005', '2010', '2015', '2020']].plot(figsize=(10,7), subplots=True)
array([<AxesSubplot:xlabel='Months'>, <AxesSubplot:xlabel='Months'>, <AxesSubplot:xlabel='Months'>, <AxesSubplot:xlabel='Months'>, <AxesSubplot:xlabel='Months'>], dtype=object)
Groupby¶
Finding statistics for each category in a datset is a common task in data analysis. For example we might be interested in knowing the mean of the ages for each gender in the titanic dataset.
The syntax is:
df.groupby(catgerogy_to_groupby).statistics()
In this example, the category is sex and the statistics of interest is mean.
titanic_df[['Age', 'Sex']].groupby('Sex').mean()
Age | |
---|---|
Sex | |
female | 27.915709 |
male | 30.726645 |
Note that the groupby().mean()
syntax was applied to a subselection of the two columns of interest (‘Sex’ and ‘Age’ in this case), otherwise the statitical operation will be performed on all columns containing numerical values. This is demonstrated by the code cell below. However, some columns such as Pclass (passenger class) with numerical values encode categorical information or other data types thus not ideal to find the mean or other statistics.
titanic_df.groupby('Sex').mean()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
Sex | |||||||
female | 431.028662 | 0.742038 | 2.159236 | 27.915709 | 0.694268 | 0.649682 | 44.479818 |
male | 454.147314 | 0.188908 | 2.389948 | 30.726645 | 0.429809 | 0.235702 | 25.523893 |
Alternatively, we can use a selection bracket []
after the groupby()
but before the statistics to specifically select the column of interest
titanic_df.groupby('Sex')['Age'].mean()
Sex female 27.915709 male 30.726645 Name: Age, dtype: float64
We can also find the statistics of multiple category combination. For example, we might be interested in knowing the average ticket fare for each sex of each Passenger class. These can be achieved using the two code cells below
titanic_df[['Fare', 'Sex', 'Pclass']].groupby(['Sex', 'Pclass']).mean()
Fare | ||
---|---|---|
Sex | Pclass | |
female | 1 | 106.125798 |
2 | 21.970121 | |
3 | 16.118810 | |
male | 1 | 67.226127 |
2 | 19.741782 | |
3 | 12.661633 |
titanic_df.groupby(['Sex', 'Pclass'])['Fare'].mean()
Sex Pclass female 1 106.125798 2 21.970121 3 16.118810 male 1 67.226127 2 19.741782 3 12.661633 Name: Fare, dtype: float64
It is crucial to note that the first example returned a dataframe while the second returned a series. Another information to derive from groupby is the number of occurence of each category in a function. This can be achieved using the count()
method
titanic_df.groupby('Sex')['Sex'].count()
Sex female 314 male 577 Name: Sex, dtype: int64
This count operation can be achieved using just value_counts()
method on the dataframe
titanic_df['Sex'].value_counts()
male 577 female 314 Name: Sex, dtype: int64