Pandas is a python package for reading, cleaning, exploring and analyzing tabular data such as data from spreadsheets or databases. Pandas provides flexible and expressive data structures that facilitate working with labelled and relational database

Supported Data Format¶

Pandas is ideal for working with any table-like data that can be stored in common format such as excel. Dataframe is the equivalent table representation in Pandas. You can create a dataframe manually or import from excel, txt, csv. To create data manually, call the “DataFrame” method and insert array of data to create. Just as we have column and row labels in common program for handling 2-D table data format, pandas also has column headers and row labels. It is crucial to understand how pandas create dataframe from different python data structure. If the data is a python dictionary, the key would be the column headers and the values in each list would be column of the dataframe while the row labels would be integers by default except specified otherwise. The column and row lables are important for indexing. If the data is a list of list, the values in each nested list would be the rows. If column name and row headers are not specified in the case of list of list, pandas would use default integer values as column headers and row labels.

# import package
import pandas as pd

#Create a dataframe from a dictionary
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df1

print(type(df1))

<class 'pandas.core.frame.DataFrame'>

#create a dataframe from a list of lists
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2

#create a dataframe from a list of lists and specify column names
df3 = pd.DataFrame([[16, 2, 53], [44, 35, 68], [73, 28, 95]], columns=['A', 'B', 'C'])
df3

To select a colum in a dataframe, use column label in between square bracket. Each column in a dataframe is a series.

#select column C in df1
df1['C']

0    7
1    8
2    9
Name: C, dtype: int64

print(type(df1['C']))

<class 'pandas.core.series.Series'>

Dataframe can store data of different format including integers, float, string, boolean etc. Let’s create a more heterogenous data for firs 5 passengers in the titanic dataset shown in the image below.

df_4 = pd.DataFrame({
            'PassengerID':[1,2,3,4,5], 
            'Survived': [0,1,1,1,0], 
            'Name': ['Braud', 'Cumings', 'Heikkinen', 'Frode', 'Kallio'], 
            'Sex': ['male', 'female', 'female', 'female', 'male'], 
            'Age': [22, 38, 26, 35, 35], 
            'Ticket': [2457, 1135, 3421, 3, 1419], 
            'Fare': [7.25, 15.0, 26.0, 8.05, 9.0]
            })
df_4

We can perform some artithmetric operation on series on dataframe. But before that two vital methods for getting statistical info about the dataframe are info() and describe()

df_4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerID  5 non-null      int64  
 1   Survived     5 non-null      int64  
 2   Name         5 non-null      object 
 3   Sex          5 non-null      object 
 4   Age          5 non-null      int64  
 5   Ticket       5 non-null      int64  
 6   Fare         5 non-null      float64
dtypes: float64(1), int64(4), object(2)
memory usage: 408.0+ bytes

df_4.describe()

So far, we have created dataframes manually. What if we are to read data from other programs which is most commonly the case. Pandas provide the read_ method. The is the name/extension of the source. For example, to read from a csv, use read_csv()

#import from csv
titanic_df = pd.read_csv('train.csv')
titanic_df

The data that we just read has 891 rows and 12 columns but just the top 5 rows and last 5 are shown by default

Subsetting¶

We at times need only some parts of the dataset to work. Pandas, provide a easy way to select specific columns and/or row values. To select a column, call the dataframe and the name of the column in a square brackect. This syntax works in similar approach as dictionary indexing using keys. To select multiple colums, use a list of column names

#Select the age column
Age = titanic_df['Age']
print(type(Age)) #this is a series since it is one column
Age

<class 'pandas.core.series.Series'>

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

Filtering a dataframe is also very easy with dataframe. The syntax involves calling the dataframe with the filtering condition in a square bracket; df[‘condition’]. Let’s look at the example of how to select all teanagers in our titanic dataset. The condition is ‘df[‘Age’] < 19′. So we put this in a selection bracket of the dataframe. Then we have ‘df[df[‘Age’] < 19]’

teens_df = titanic_df[titanic_df['Age'] < 19]
teens_df

We can also combine conditions. Let see an example of selecting teenagers i.e df[‘Age’] < 19 that survived i.e df[‘Survived’] == 1

survteens_df = titanic_df[(titanic_df['Age'] < 19) & (titanic_df['Survived'] == 1)]
survteens_df

At times, when we filter our dataset, we want to selct only one column. We just use select bracket on the column of interest. However, we can also use loc operator. This works in similar way as the indexing above except that we have to specify the column we want to return in addition to the filtering condition

survteens_df['Name']

9           Nasser, Mrs. Nicholas (Adele Achem)
10              Sandstrom, Miss. Marguerite Rut
22                  McGowan, Miss. Anna "Annie"
39                  Nicola-Yarred, Miss. Jamila
43     Laroche, Miss. Simonne Marie Anne Andree
                         ...                   
831             Richards, Master. George Sibley
853                   Lines, Miss. Mary Conover
855                  Aks, Mrs. Sam (Leah Rosen)
869             Johnson, Master. Harold Theodor
875            Najib, Miss. Adele Kiamie "Jane"
Name: Name, Length: 70, dtype: object

titanic_df.loc[((titanic_df['Age'] < 19) & (titanic_df['Survived'] == 1), 'Name')]

9           Nasser, Mrs. Nicholas (Adele Achem)
10              Sandstrom, Miss. Marguerite Rut
22                  McGowan, Miss. Anna "Annie"
39                  Nicola-Yarred, Miss. Jamila
43     Laroche, Miss. Simonne Marie Anne Andree
                         ...                   
831             Richards, Master. George Sibley
853                   Lines, Miss. Mary Conover
855                  Aks, Mrs. Sam (Leah Rosen)
869             Johnson, Master. Harold Theodor
875            Najib, Miss. Adele Kiamie "Jane"
Name: Name, Length: 70, dtype: object

Syntax of loc or iloc operator looks complex but they are simple. The idea is the they allow row and column indexing at a go. Call the operator with the row condition and columns of interest.

df.loc[row_condition, column_of_interest]. When using row and column names

df.iloc[row_condition, column_of_interest]. When specifically interested in certain rows and columns based on position indexing

Let’s use iloc operator to select the first 30 rows of the last three columns in our titanic dataframe

titanic_df.iloc[:30,-3:]

Creating Plots in Pandas¶

Pandas plot() method eanables easy visualization of data. To demonstrate its use, we will be using Snow data downloaded from snotel sites.

Use read_csv, tell pandas the column to use for indexing, and parse_dates = True

SWE =pd.read_csv('Monthly_Mean_SWE.csv', index_col='Months', parse_dates=True)
SWE

SWE.plot()

<AxesSubplot:xlabel='Months'>

With a dataframe, pandas plot() method will give one line plot for each columns with numeric data as shown by the result of the code cell above. To plot a specific column, use the selection method described above before the plot method.

SWE[['2000', '2005', '2010', '2015', '2020']].plot(figsize=(10,7))

<AxesSubplot:xlabel='Months'>

Besides the default line plot, data can be visualized with other types of plots such as scatter, boxplots etc

#datetime index
SWE.plot.scatter(x='2000', y='2005')

<AxesSubplot:xlabel='2000', ylabel='2005'>

SWE[['2000', '2005', '2010', '2015', '2020']].plot.box()

<AxesSubplot:>

#seperate plots
SWE[['2000', '2005', '2010', '2015', '2020']].plot(figsize=(10,7), subplots=True)

array([<AxesSubplot:xlabel='Months'>, <AxesSubplot:xlabel='Months'>,
       <AxesSubplot:xlabel='Months'>, <AxesSubplot:xlabel='Months'>,
       <AxesSubplot:xlabel='Months'>], dtype=object)

Groupby¶

Finding statistics for each category in a datset is a common task in data analysis. For example we might be interested in knowing the mean of the ages for each gender in the titanic dataset.

The syntax is:
df.groupby(catgerogy_to_groupby).statistics()

In this example, the category is sex and the statistics of interest is mean.

titanic_df[['Age', 'Sex']].groupby('Sex').mean()

Note that the groupby().mean() syntax was applied to a subselection of the two columns of interest (‘Sex’ and ‘Age’ in this case), otherwise the statitical operation will be performed on all columns containing numerical values. This is demonstrated by the code cell below. However, some columns such as Pclass (passenger class) with numerical values encode categorical information or other data types thus not ideal to find the mean or other statistics.

titanic_df.groupby('Sex').mean()

Alternatively, we can use a selection bracket [] after the groupby() but before the statistics to specifically select the column of interest

titanic_df.groupby('Sex')['Age'].mean()

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

We can also find the statistics of multiple category combination. For example, we might be interested in knowing the average ticket fare for each sex of each Passenger class. These can be achieved using the two code cells below

titanic_df[['Fare', 'Sex', 'Pclass']].groupby(['Sex', 'Pclass']).mean()

titanic_df.groupby(['Sex', 'Pclass'])['Fare'].mean()

Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64

It is crucial to note that the first example returned a dataframe while the second returned a series. Another information to derive from groupby is the number of occurence of each category in a function. This can be achieved using the count() method

titanic_df.groupby('Sex')['Sex'].count()

Sex
female    314
male      577
Name: Sex, dtype: int64

This count operation can be achieved using just value_counts() method on the dataframe

titanic_df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

	PassengerID	Survived	Age	Ticket	Fare
count	5.000000	5.000000	5.00000	5.000000	5.000000
mean	3.000000	0.600000	31.20000	1687.000000	13.060000
std	1.581139	0.547723	6.83374	1304.994253	7.850908
min	1.000000	0.000000	22.00000	3.000000	7.250000
25%	2.000000	0.000000	26.00000	1135.000000	8.050000
50%	3.000000	1.000000	35.00000	1419.000000	9.000000
75%	4.000000	1.000000	35.00000	2457.000000	15.000000
max	5.000000	1.000000	38.00000	3421.000000	26.000000

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th…	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
…	…	…	…	…	…	…	…	…	…	…	…	…
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen “Carrie”	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
14	15	0	3	Vestrom, Miss. Hulda Amanda Adolfina	female	14.0	0	0	350406	7.8542	NaN	S
16	17	0	3	Rice, Master. Eugene	male	2.0	4	1	382652	29.1250	NaN	Q
…	…	…	…	…	…	…	…	…	…	…	…	…
852	853	0	3	Boulos, Miss. Nourelain	female	9.0	1	1	2678	15.2458	NaN	C
853	854	1	1	Lines, Miss. Mary Conover	female	16.0	0	1	PC 17592	39.4000	D28	S
855	856	1	3	Aks, Mrs. Sam (Leah Rosen)	female	18.0	0	1	392091	9.3500	NaN	S
869	870	1	3	Johnson, Master. Harold Theodor	male	4.0	1	1	347742	11.1333	NaN	S
875	876	1	3	Najib, Miss. Adele Kiamie “Jane”	female	15.0	0	0	2667	7.2250	NaN	C

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.00	1	0	237736	30.0708	NaN	C
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.00	1	1	PP 9549	16.7000	G6	S
22	23	1	3	McGowan, Miss. Anna “Annie”	female	15.00	0	0	330923	8.0292	NaN	Q
39	40	1	3	Nicola-Yarred, Miss. Jamila	female	14.00	1	0	2651	11.2417	NaN	C
43	44	1	2	Laroche, Miss. Simonne Marie Anne Andree	female	3.00	1	2	SC/Paris 2123	41.5792	NaN	C
…	…	…	…	…	…	…	…	…	…	…	…	…
831	832	1	2	Richards, Master. George Sibley	male	0.83	1	1	29106	18.7500	NaN	S
853	854	1	1	Lines, Miss. Mary Conover	female	16.00	0	1	PC 17592	39.4000	D28	S
855	856	1	3	Aks, Mrs. Sam (Leah Rosen)	female	18.00	0	1	392091	9.3500	NaN	S
869	870	1	3	Johnson, Master. Harold Theodor	male	4.00	1	1	347742	11.1333	NaN	S
875	876	1	3	Najib, Miss. Adele Kiamie “Jane”	female	15.00	0	0	2667	7.2250	NaN	C

	Fare	Cabin	Embarked
0	7.2500	NaN	S
1	71.2833	C85	C
2	7.9250	NaN	S
3	53.1000	C123	S
4	8.0500	NaN	S
5	8.4583	NaN	Q
6	51.8625	E46	S
7	21.0750	NaN	S
8	11.1333	NaN	S
9	30.0708	NaN	C
10	16.7000	G6	S
11	26.5500	C103	S
12	8.0500	NaN	S
13	31.2750	NaN	S
14	7.8542	NaN	S
15	16.0000	NaN	S
16	29.1250	NaN	Q
17	13.0000	NaN	S
18	18.0000	NaN	S
19	7.2250	NaN	C
20	26.0000	NaN	S
21	13.0000	D56	S
22	8.0292	NaN	Q
23	35.5000	A6	S
24	21.0750	NaN	S
25	31.3875	NaN	S
26	7.2250	NaN	C
27	263.0000	C23 C25 C27	S
28	7.8792	NaN	Q
29	7.8958	NaN	S

Getting Started with Pandas

Share This Post

Supported Data Format¶

Subsetting¶

Creating Plots in Pandas¶

Groupby¶

More To Explore

STEM at Your Fingertips: Public Libraries as Hubs of Innovation

Structure from motion using ArcGIS orthomapping Python Api

	PassengerID	Survived	Name	Sex	Age	Ticket	Fare
0	1	0	Braud	male	22	2457	7.25
1	2	1	Cumings	female	38	1135	15.00
2	3	1	Heikkinen	female	26	3421	26.00
3	4	1	Frode	female	35	3	8.05
4	5	0	Kallio	male	35	1419	9.00

	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	…	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022
Months
January	40.312258	28.513548	48.571355	36.920129	55.486710	28.185806	62.107097	40.533484	40.623613	39.951742	…	20.705097	28.030129	41.762516	50.742645	45.752774	22.007871	29.816323	33.429677	39.624000	44.884258
February	57.491586	34.181143	63.500000	46.645286	70.366759	31.532286	85.398429	47.643143	65.356828	45.411571	…	27.803929	50.645786	50.019857	66.950897	74.403857	34.480500	55.526214	54.758897	55.535286	48.250929
March	73.938581	38.616194	76.216387	61.992387	79.100516	36.125355	94.586323	61.820323	75.593677	61.254968	…	34.896323	73.045484	46.129677	77.560129	94.414258	50.414903	88.588645	61.123871	69.538645	52.197000
April	65.311867	41.164933	75.793600	68.072000	70.459600	45.017267	104.495600	52.874333	85.301667	70.366467	…	33.020000	80.484133	39.539333	63.965667	81.948867	46.007867	83.083400	60.790667	51.011667	NaN
May	18.853355	8.504903	33.929484	43.302903	15.715226	14.010968	50.111742	9.479935	49.882323	36.125355	…	3.269226	40.476129	2.179484	10.831871	23.654774	2.548194	20.393742	7.792065	3.449484	NaN
June	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.033867	0.000000	0.550333	0.000000	…	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	NaN
July	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	…	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	NaN
August	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	…	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	NaN
September	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	…	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	NaN
October	0.991419	0.122903	0.000000	0.000000	1.229032	0.000000	0.000000	0.286774	0.073742	1.483032	…	0.000000	0.024581	0.024581	0.106516	1.302774	0.000000	0.778387	0.000000	0.000000	NaN
November	5.317067	2.726267	6.172200	6.468533	6.383867	12.352867	5.935133	1.049867	3.454400	4.055533	…	5.240867	5.579533	4.597400	0.474133	5.850467	1.845733	1.735667	9.050867	3.445933	NaN
December	18.066774	32.020387	14.945032	26.530710	19.426903	29.013355	23.138581	12.986774	13.593097	14.281355	…	17.591548	22.966516	28.841290	18.353548	10.512323	14.437032	11.053097	21.082000	16.215032	NaN

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
Sex
female	431.028662	0.742038	2.159236	27.915709	0.694268	0.649682	44.479818
male	454.147314	0.188908	2.389948	30.726645	0.429809	0.235702	25.523893