importing python libraries
Load a data file into pandas
Examine the size and data types of a dataframe
Understand the relationship between DataFrame, Series and Index
Compute aggreate values from a pandas series
Compute grouped aggregrate values from a pandas dataframe
Order a data frame
Join two pandas dataframes for additional context
Provide numerical descriptions of a distribution
Visualize the distribution of a variable

Dataset¶

For this notebook, I used the MovieLens 25M Dataset. The Zip file is 250MB, and the files take about 1.2GB uncompressed. A quick documentation for the dataset is available here

Python modules¶

Most useful python functions are in modules. In other to use them, they need to be first imported

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Reading CSV¶

movies = pd.read_csv('../Data/ml-25m/movies.csv')
movies

The DataFrame¶

movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB

RangeIndex: indexes 0 to n-1
3 columns
- 1 int64
- 2 object – these store strings
62423 rows
Kind of data
- integer moviedID
- string title
- string genres
The data is about movies

-Each column is a series. It can be accessed like a disctionary

Series¶

An array with an index
All columns of the dataFrame share the same index

movies['title']

0                          Toy Story (1995)
1                            Jumanji (1995)
2                   Grumpier Old Men (1995)
3                  Waiting to Exhale (1995)
4        Father of the Bride Part II (1995)
                        ...                
62418                             We (2018)
62419             Window of the Soul (2001)
62420                      Bad Poems (2018)
62421                   A Girl Thing (2001)
62422        Women of Devil's Island (1962)
Name: title, Length: 62423, dtype: object

# How big is the series?
from turtle import title


print(movies['title'].size)
print(len(movies['title']))
print(movies['title'].shape)

#count count values not including missing values
movies['title'].count()

62423
62423
(62423,)

62423

.shape returns the array shape as a tuple — the array is one-dimensional with length 25M
.size returns the series size (length). len(…) is identical.
.count() counts values, not including missing value

Another DataFrame¶

Let’s load the ratings

ratings = pd.read_csv('../Data/ml-25m/ratings.csv')
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB

Data has about 25M instances
Data contains
- userId(int)
- movieId(int)
- rating(float)
- timestamnp(int)

ratings

Aggregrate Functions¶

An aggregate function combines a series (or array) into a single value:

ratings['rating'].mean()

3.533854451353085

ratings['rating'].sum()

88346697.0

Alternate form – function from numpy

np.sum(ratings['rating'])

88346697.0

Quantiles¶

Let’s see the quantile function

ratings['rating'].quantile(0.25)

3.0

ratings['rating'].quantile(0.75)

4.0

Interesting. 25% of the ratings is 4 or greater

Grouped Aggregrates¶

We can group by a column and compute aggreagrates within the group.
How many ratings per movie?

ratings.groupby('movieId')['rating'].count()

movieId
1         57309
2         24228
3         11804
4          2523
5         11714
          ...  
209157        1
209159        1
209163        1
209169        1
209171        1
Name: rating, Length: 59047, dtype: int64

We can compute multiple aggregates at the same time:

movie_stats = ratings.groupby('movieId')['rating'].agg(['mean', 'count'])
movie_stats

Finding Largest¶

We can get the 10 movies with the most ratings and 10 most rated movie

movie_stats.nlargest(10, 'count')

movie_stats.nlargest(10, 'mean')

Linking Data¶

Data can refer to other data

Ratings are instances themselves
But each connects a user to a movie

We can merge the two.

We want to combine our stats with movie info.

on=’movieId’ says to use the movieId column of the first frame instead of its index.
Matches values to index in other frame

movie_info = movies.join(movie_stats, on='movieId')
movie_info

movie_info.nlargest(10, 'count')

Average Movie Rating¶

let’s look at the distribution of average movie rating:

movie_info['mean'].describe()

count    59047.000000
mean         3.071374
std          0.739840
min          0.500000
25%          2.687500
50%          3.150000
75%          3.500000
max          5.000000
Name: mean, dtype: float64

Let’s make a histogram:

plt.hist(movie_info['mean'])
plt.show()

And with more bins:

plt.hist(movie_info['mean'], bins=50)
plt.show()

Wrapping Up¶

A dataframe consists of columns
Each column is a series: array with index
We can call info() method to quickly see
- how many rows (instances)
- what columns
- data types
Aggregrates combine a series or array into a single value
- Can compute over a whole series or over groups
Join combines frames
data description can be presented numerically and visually

	userId	movieId	rating	timestamp
0	1	296	5.0	1147880044
1	1	306	3.5	1147868817
2	1	307	5.0	1147868828
3	1	665	5.0	1147878820
4	1	899	3.5	1147868510
…	…	…	…	…
25000090	162541	50872	4.5	1240953372
25000091	162541	55768	2.5	1240951998
25000092	162541	56176	2.0	1240950697
25000093	162541	58559	4.0	1240953434
25000094	162541	63876	5.0	1240952515

	mean	count
movieId
356	4.048011	81491
318	4.413576	81482
296	4.188912	79672
593	4.151342	74127
2571	4.154099	72674
260	4.120189	68717
480	3.679175	64144
527	4.247579	60411
110	4.002273	59184
2959	4.228311	58773

	movieId	title	genres	mean	count
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	3.893708	57309.0
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy	3.251527	24228.0
2	3	Grumpier Old Men (1995)	Comedy\|Romance	3.142028	11804.0
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance	2.853547	2523.0
4	5	Father of the Bride Part II (1995)	Comedy	3.058434	11714.0
…	…	…	…	…	…
62418	209157	We (2018)	Drama	1.500000	1.0
62419	209159	Window of the Soul (2001)	Documentary	3.000000	1.0
62420	209163	Bad Poems (2018)	Comedy\|Drama	4.500000	1.0
62421	209169	A Girl Thing (2001)	(no genres listed)	3.000000	1.0
62422	209171	Women of Devil’s Island (1962)	Action\|Adventure\|Drama	3.000000	1.0

	movieId	title	genres	mean	count
351	356	Forrest Gump (1994)	Comedy\|Drama\|Romance\|War	4.048011	81491.0
314	318	Shawshank Redemption, The (1994)	Crime\|Drama	4.413576	81482.0
292	296	Pulp Fiction (1994)	Comedy\|Crime\|Drama\|Thriller	4.188912	79672.0
585	593	Silence of the Lambs, The (1991)	Crime\|Horror\|Thriller	4.151342	74127.0
2480	2571	Matrix, The (1999)	Action\|Sci-Fi\|Thriller	4.154099	72674.0
257	260	Star Wars: Episode IV – A New Hope (1977)	Action\|Adventure\|Sci-Fi	4.120189	68717.0
475	480	Jurassic Park (1993)	Action\|Adventure\|Sci-Fi\|Thriller	3.679175	64144.0
522	527	Schindler’s List (1993)	Drama\|War	4.247579	60411.0
108	110	Braveheart (1995)	Action\|Drama\|War	4.002273	59184.0
2867	2959	Fight Club (1999)	Action\|Crime\|Drama\|Thriller	4.228311	58773.0

Exploring and describing data using Pandas

Share This Post