Introduction to Python: Playing with the Dataset (EDA)

Sourav Nandi, 27/07/2020

Credit: Data School by Kevin Markham (PyCon 2019)

1. Introduction to the TED Talks dataset

https://www.kaggle.com/rounakbanik/ted-talks

In [1]:
import pandas as pd
pd.__version__
Out[1]:
'0.25.1'
In [2]:
import matplotlib.pyplot as plt
%matplotlib inline
In [5]:
ted = pd.read_csv('ted.csv')
In [6]:
# each row represents a single talk
ted.head()
Out[6]:
comments description duration event film_date languages main_speaker name num_speaker published_date ratings related_talks speaker_occupation tags title url views
0 4553 Sir Ken Robinson makes an entertaining and pro... 1164 TED2006 1140825600 60 Ken Robinson Ken Robinson: Do schools kill creativity? 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... Author/educator ['children', 'creativity', 'culture', 'dance',... Do schools kill creativity? https://www.ted.com/talks/ken_robinson_says_sc... 47227110
1 265 With the same humor and humanity he exuded in ... 977 TED2006 1140825600 43 Al Gore Al Gore: Averting the climate crisis 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i... [{'id': 243, 'hero': 'https://pe.tedcdn.com/im... Climate advocate ['alternative energy', 'cars', 'climate change... Averting the climate crisis https://www.ted.com/talks/al_gore_on_averting_... 3200520
2 124 New York Times columnist David Pogue takes aim... 1286 TED2006 1140739200 26 David Pogue David Pogue: Simplicity sells 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 964}, {'i... [{'id': 1725, 'hero': 'https://pe.tedcdn.com/i... Technology columnist ['computers', 'entertainment', 'interface desi... Simplicity sells https://www.ted.com/talks/david_pogue_says_sim... 1636292
3 200 In an emotionally charged talk, MacArthur-winn... 1116 TED2006 1140912000 35 Majora Carter Majora Carter: Greening the ghetto 1 1151367060 [{'id': 3, 'name': 'Courageous', 'count': 760}... [{'id': 1041, 'hero': 'https://pe.tedcdn.com/i... Activist for environmental justice ['MacArthur grant', 'activism', 'business', 'c... Greening the ghetto https://www.ted.com/talks/majora_carter_s_tale... 1697550
4 593 You've never seen data presented like this. Wi... 1190 TED2006 1140566400 48 Hans Rosling Hans Rosling: The best stats you've ever seen 1 1151440680 [{'id': 9, 'name': 'Ingenious', 'count': 3202}... [{'id': 2056, 'hero': 'https://pe.tedcdn.com/i... Global health expert; data visionary ['Africa', 'Asia', 'Google', 'demo', 'economic... The best stats you've ever seen https://www.ted.com/talks/hans_rosling_shows_t... 12005869
In [10]:
ted.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
comments              2550 non-null int64
description           2550 non-null object
duration              2550 non-null int64
event                 2550 non-null object
film_date             2550 non-null int64
languages             2550 non-null int64
main_speaker          2550 non-null object
name                  2550 non-null object
num_speaker           2550 non-null int64
published_date        2550 non-null int64
ratings               2550 non-null object
related_talks         2550 non-null object
speaker_occupation    2544 non-null object
tags                  2550 non-null object
title                 2550 non-null object
url                   2550 non-null object
views                 2550 non-null int64
dtypes: int64(7), object(10)
memory usage: 338.8+ KB
In [7]:
# rows, columns
ted.shape
Out[7]:
(2550, 17)
In [8]:
# object columns are usually strings, but can also be arbitrary Python objects (lists, dictionaries)
ted.dtypes
Out[8]:
comments               int64
description           object
duration               int64
event                 object
film_date              int64
languages              int64
main_speaker          object
name                  object
num_speaker            int64
published_date         int64
ratings               object
related_talks         object
speaker_occupation    object
tags                  object
title                 object
url                   object
views                  int64
dtype: object
In [9]:
# count the number of missing values in each column
ted.isna().sum()
Out[9]:
comments              0
description           0
duration              0
event                 0
film_date             0
languages             0
main_speaker          0
name                  0
num_speaker           0
published_date        0
ratings               0
related_talks         0
speaker_occupation    6
tags                  0
title                 0
url                   0
views                 0
dtype: int64
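
Once the count shows that only speaker_occupation has gaps, it can help to look at the affected rows themselves. The sketch below uses a small hypothetical frame (`toy`, with made-up speakers) in place of `ted`; only `isna()` and boolean indexing are taken from pandas itself.

```python
import pandas as pd

# hypothetical miniature frame standing in for `ted`
toy = pd.DataFrame({
    'main_speaker': ['A', 'B', 'C'],
    'speaker_occupation': ['Author', None, 'Comedian'],
})

# boolean mask: True where the occupation is missing
missing_mask = toy.speaker_occupation.isna()

# keep only the rows flagged by the mask
missing_rows = toy[missing_mask]
print(missing_rows)
```

On the real data, the same pattern would be `ted[ted.speaker_occupation.isna()]`.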

2. Which talks provoke the most online discussion?

In [11]:
# sort by the number of first-level comments, though this is biased in favor of older talks
ted.sort_values('comments').tail()
Out[11]:
comments description duration event film_date languages main_speaker name num_speaker published_date ratings related_talks speaker_occupation tags title url views
1787 2673 Our consciousness is a fundamental aspect of o... 1117 TED2014 1395100800 33 David Chalmers David Chalmers: How do you explain consciousness? 1 1405350484 [{'id': 25, 'name': 'OK', 'count': 280}, {'id'... [{'id': 1308, 'hero': 'https://pe.tedcdn.com/i... Philosopher ['brain', 'consciousness', 'neuroscience', 'ph... How do you explain consciousness? https://www.ted.com/talks/david_chalmers_how_d... 2162764
201 2877 Jill Bolte Taylor got a research opportunity f... 1099 TED2008 1204070400 49 Jill Bolte Taylor Jill Bolte Taylor: My stroke of insight 1 1205284200 [{'id': 22, 'name': 'Fascinating', 'count': 14... [{'id': 184, 'hero': 'https://pe.tedcdn.com/im... Neuroanatomist ['biology', 'brain', 'consciousness', 'global ... My stroke of insight https://www.ted.com/talks/jill_bolte_taylor_s_... 21190883
644 3356 Questions of good and evil, right and wrong ar... 1386 TED2010 1265846400 39 Sam Harris Sam Harris: Science can answer moral questions 1 1269249180 [{'id': 8, 'name': 'Informative', 'count': 923... [{'id': 666, 'hero': 'https://pe.tedcdn.com/im... Neuroscientist, philosopher ['culture', 'evolutionary psychology', 'global... Science can answer moral questions https://www.ted.com/talks/sam_harris_science_c... 3433437
0 4553 Sir Ken Robinson makes an entertaining and pro... 1164 TED2006 1140825600 60 Ken Robinson Ken Robinson: Do schools kill creativity? 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... Author/educator ['children', 'creativity', 'culture', 'dance',... Do schools kill creativity? https://www.ted.com/talks/ken_robinson_says_sc... 47227110
96 6404 Richard Dawkins urges all atheists to openly s... 1750 TED2002 1012608000 42 Richard Dawkins Richard Dawkins: Militant atheism 1 1176689220 [{'id': 3, 'name': 'Courageous', 'count': 3236... [{'id': 86, 'hero': 'https://pe.tedcdn.com/ima... Evolutionary biologist ['God', 'atheism', 'culture', 'religion', 'sci... Militant atheism https://www.ted.com/talks/richard_dawkins_on_m... 4374792
In [12]:
# correct for this bias by calculating the number of comments per view
ted['comments_per_view'] = ted.comments / ted.views
In [13]:
# interpretation: for every view of the same-sex marriage talk, there are 0.002 comments
ted.sort_values('comments_per_view').tail()
Out[13]:
comments description duration event film_date languages main_speaker name num_speaker published_date ratings related_talks speaker_occupation tags title url views comments_per_view
954 2492 Janet Echelman found her true voice as an arti... 566 TED2011 1299110400 35 Janet Echelman Janet Echelman: Taking imagination seriously 1 1307489760 [{'id': 23, 'name': 'Jaw-dropping', 'count': 3... [{'id': 453, 'hero': 'https://pe.tedcdn.com/im... Artist ['art', 'cities', 'culture', 'data', 'design',... Taking imagination seriously https://www.ted.com/talks/janet_echelman\n 1832930 0.001360
694 1502 Filmmaker Sharmeen Obaid-Chinoy takes on a ter... 489 TED2010 1265760000 32 Sharmeen Obaid-Chinoy Sharmeen Obaid-Chinoy: Inside a school for sui... 1 1274865960 [{'id': 23, 'name': 'Jaw-dropping', 'count': 3... [{'id': 171, 'hero': 'https://pe.tedcdn.com/im... Filmmaker ['TED Fellows', 'children', 'culture', 'film',... Inside a school for suicide bombers https://www.ted.com/talks/sharmeen_obaid_chino... 1057238 0.001421
96 6404 Richard Dawkins urges all atheists to openly s... 1750 TED2002 1012608000 42 Richard Dawkins Richard Dawkins: Militant atheism 1 1176689220 [{'id': 3, 'name': 'Courageous', 'count': 3236... [{'id': 86, 'hero': 'https://pe.tedcdn.com/ima... Evolutionary biologist ['God', 'atheism', 'culture', 'religion', 'sci... Militant atheism https://www.ted.com/talks/richard_dawkins_on_m... 4374792 0.001464
803 834 David Bismark demos a new system for voting th... 422 TEDGlobal 2010 1279065600 36 David Bismark David Bismark: E-voting without fraud 1 1288685640 [{'id': 25, 'name': 'OK', 'count': 111}, {'id'... [{'id': 803, 'hero': 'https://pe.tedcdn.com/im... Voting system designer ['culture', 'democracy', 'design', 'global iss... E-voting without fraud https://www.ted.com/talks/david_bismark_e_voti... 543551 0.001534
744 649 Hours before New York lawmakers rejected a key... 453 New York State Senate 1259712000 0 Diane J. Savino Diane J. Savino: The case for same-sex marriage 1 1282062180 [{'id': 25, 'name': 'OK', 'count': 100}, {'id'... [{'id': 217, 'hero': 'https://pe.tedcdn.com/im... Senator ['God', 'LGBT', 'culture', 'government', 'law'... The case for same-sex marriage https://www.ted.com/talks/diane_j_savino_the_c... 292395 0.002220
In [14]:
# make this more interpretable by inverting the calculation
ted['views_per_comment'] = ted.views / ted.comments
In [15]:
# interpretation: roughly 1 comment for every 450 views (a view is not necessarily a unique person)
ted.sort_values('views_per_comment').head()
Out[15]:
comments description duration event film_date languages main_speaker name num_speaker published_date ratings related_talks speaker_occupation tags title url views comments_per_view views_per_comment
744 649 Hours before New York lawmakers rejected a key... 453 New York State Senate 1259712000 0 Diane J. Savino Diane J. Savino: The case for same-sex marriage 1 1282062180 [{'id': 25, 'name': 'OK', 'count': 100}, {'id'... [{'id': 217, 'hero': 'https://pe.tedcdn.com/im... Senator ['God', 'LGBT', 'culture', 'government', 'law'... The case for same-sex marriage https://www.ted.com/talks/diane_j_savino_the_c... 292395 0.002220 450.531587
803 834 David Bismark demos a new system for voting th... 422 TEDGlobal 2010 1279065600 36 David Bismark David Bismark: E-voting without fraud 1 1288685640 [{'id': 25, 'name': 'OK', 'count': 111}, {'id'... [{'id': 803, 'hero': 'https://pe.tedcdn.com/im... Voting system designer ['culture', 'democracy', 'design', 'global iss... E-voting without fraud https://www.ted.com/talks/david_bismark_e_voti... 543551 0.001534 651.739808
96 6404 Richard Dawkins urges all atheists to openly s... 1750 TED2002 1012608000 42 Richard Dawkins Richard Dawkins: Militant atheism 1 1176689220 [{'id': 3, 'name': 'Courageous', 'count': 3236... [{'id': 86, 'hero': 'https://pe.tedcdn.com/ima... Evolutionary biologist ['God', 'atheism', 'culture', 'religion', 'sci... Militant atheism https://www.ted.com/talks/richard_dawkins_on_m... 4374792 0.001464 683.134291
694 1502 Filmmaker Sharmeen Obaid-Chinoy takes on a ter... 489 TED2010 1265760000 32 Sharmeen Obaid-Chinoy Sharmeen Obaid-Chinoy: Inside a school for sui... 1 1274865960 [{'id': 23, 'name': 'Jaw-dropping', 'count': 3... [{'id': 171, 'hero': 'https://pe.tedcdn.com/im... Filmmaker ['TED Fellows', 'children', 'culture', 'film',... Inside a school for suicide bombers https://www.ted.com/talks/sharmeen_obaid_chino... 1057238 0.001421 703.886818
954 2492 Janet Echelman found her true voice as an arti... 566 TED2011 1299110400 35 Janet Echelman Janet Echelman: Taking imagination seriously 1 1307489760 [{'id': 23, 'name': 'Jaw-dropping', 'count': 3... [{'id': 453, 'hero': 'https://pe.tedcdn.com/im... Artist ['art', 'cities', 'culture', 'data', 'design',... Taking imagination seriously https://www.ted.com/talks/janet_echelman\n 1832930 0.001360 735.525682

Lessons:

  1. Consider the limitations and biases of your data when analyzing it
  2. Make your results understandable
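
As a quick arithmetic check of the two ratios, the figures for the same-sex marriage talk (649 comments, 292395 views, both taken from the output above) can be plugged in by hand. This is plain Python and only illustrates why the inverted ratio is easier to read.

```python
comments = 649    # from the row for talk 744 above
views = 292395    # from the same row

comments_per_view = comments / views   # a tiny fraction, hard to interpret
views_per_comment = views / comments   # "one comment per N views"

print(f'{comments_per_view:.6f}')
print(f'{views_per_comment:.2f}')
```

The two numbers carry the same information; the second simply reads more naturally.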

3. Visualize the distribution of comments

In [16]:
# line plot is not appropriate here (use it to measure something over time)
ted.comments.plot()
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c220d84e0>
In [17]:
# histogram shows the frequency distribution of a single numeric variable
ted.comments.plot(kind='hist')
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c22427940>
In [18]:
# modify the plot to be more informative
ted[ted.comments < 1000].comments.plot(kind='hist')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c22532860>
In [19]:
# check how many observations we removed from the plot
ted[ted.comments >= 1000].shape
Out[19]:
(32, 19)
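
An alternative way to count the removed observations is to sum the boolean mask directly, and its mean gives the share of rows removed. The sketch below uses a hypothetical Series of comment counts rather than the real column.

```python
import pandas as pd

# hypothetical comment counts standing in for ted.comments
comments = pd.Series([120, 4553, 265, 999, 1000, 6404])

mask = comments >= 1000

n_removed = mask.sum()       # True counts as 1, so this is the number of rows removed
share_removed = mask.mean()  # fraction of rows removed

print(n_removed, share_removed)
```

In the notebook, 32 of 2550 talks are removed, which is about 1.3% of the data.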
In [20]:
# can also write this using the query method
ted.query('comments < 1000').comments.plot(kind='hist')
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c225a66d8>
In [21]:
# can also write this using the loc accessor
ted.loc[ted.comments < 1000, 'comments'].plot(kind='hist')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c2267a390>
In [22]:
# increase the number of bins to see more detail
ted.loc[ted.comments < 1000, 'comments'].plot(kind='hist', bins=20)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c226e5b70>
In [23]:
# boxplot can also show distributions, but it's far less useful for concentrated distributions because of outliers
ted.loc[ted.comments < 1000, 'comments'].plot(kind='box')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c22772898>

Lessons:

  1. Choose your plot type based on the question you are answering and the data type(s) you are working with
  2. Use pandas one-liners to iterate through plots quickly
  3. Try modifying the plot defaults
  4. Creating plots involves decision-making

4. Plot the number of talks that took place each year

Bonus exercise: calculate the average delay between filming and publishing

In [24]:
# event column does not always include the year
ted.event.sample(10)
Out[24]:
2438                TED2017
765     Mission Blue Voyage
1617             TEDCity2.0
2252              TEDSummit
1996          TEDWomen 2015
38                  TED2005
1194                TED2012
2163                TED2016
2386                TED@IBM
2171                TED2016
Name: event, dtype: object
In [25]:
# dataset documentation for film_date says "Unix timestamp of the filming"
ted.film_date.head()
Out[25]:
0    1140825600
1    1140825600
2    1140739200
3    1140912000
4    1140566400
Name: film_date, dtype: int64
In [26]:
# results don't look right
pd.to_datetime(ted.film_date).head()
Out[26]:
0   1970-01-01 00:00:01.140825600
1   1970-01-01 00:00:01.140825600
2   1970-01-01 00:00:01.140739200
3   1970-01-01 00:00:01.140912000
4   1970-01-01 00:00:01.140566400
Name: film_date, dtype: datetime64[ns]
In [27]:
# now the results look right
pd.to_datetime(ted.film_date, unit='s').head()
Out[27]:
0   2006-02-25
1   2006-02-25
2   2006-02-24
3   2006-02-26
4   2006-02-22
Name: film_date, dtype: datetime64[ns]
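
The first attempt went wrong because pd.to_datetime treats plain integers as nanoseconds since 1970-01-01 unless a unit is given. The standard-library sketch below reproduces both readings for the first film_date value, without pandas, to show where "1970-01-01 00:00:01.14" comes from.

```python
from datetime import datetime, timezone

film_date = 1140825600  # first value of ted.film_date shown above

# read as seconds since the Unix epoch (what the dataset documentation says)
as_seconds = datetime.fromtimestamp(film_date, tz=timezone.utc)

# read as nanoseconds, the same integer is only about 1.14 seconds after the epoch
seconds_if_nanoseconds = film_date / 1_000_000_000

print(as_seconds.date())
print(seconds_if_nanoseconds)
```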
In [28]:
ted['film_datetime'] = pd.to_datetime(ted.film_date, unit='s')
In [29]:
# verify that event name matches film_datetime for a random sample
ted[['event', 'film_datetime']].sample(5)
Out[29]:
                            event film_datetime
1446                      TED2013    2013-02-26
1711                      TED2014    2014-03-19
1402    TEDSalon London Fall 2012    2012-11-07
789   Business Innovation Factory    2009-10-09
2050              TEDGlobalLondon    2015-06-16
In [30]:
# new column uses the datetime data type (this was an automatic conversion)
ted.dtypes
Out[30]:
comments                       int64
description                   object
duration                       int64
event                         object
film_date                      int64
languages                      int64
main_speaker                  object
name                          object
num_speaker                    int64
published_date                 int64
ratings                       object
related_talks                 object
speaker_occupation            object
tags                          object
title                         object
url                           object
views                          int64
comments_per_view            float64
views_per_comment            float64
film_datetime         datetime64[ns]
dtype: object
In [31]:
# datetime columns have convenient attributes under the dt namespace
ted.film_datetime.dt.year.head()
Out[31]:
0    2006
1    2006
2    2006
3    2006
4    2006
Name: film_datetime, dtype: int64
In [32]:
# similar to string methods under the str namespace
ted.event.str.lower().head()
Out[32]:
0    ted2006
1    ted2006
2    ted2006
3    ted2006
4    ted2006
Name: event, dtype: object
In [33]:
# count the number of talks each year using value_counts()
ted.film_datetime.dt.year.value_counts()
Out[33]:
2013    270
2011    270
2010    267
2012    267
2016    246
2015    239
2014    237
2009    232
2007    114
2017     98
2008     84
2005     66
2006     50
2003     33
2004     33
2002     27
1998      6
2001      5
1983      1
1991      1
1994      1
1990      1
1984      1
1972      1
Name: film_datetime, dtype: int64
In [34]:
# points are plotted and connected in the order you give them to pandas
ted.film_datetime.dt.year.value_counts().plot()
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c22820cc0>
In [35]:
# need to sort the index before plotting
ted.film_datetime.dt.year.value_counts().sort_index().plot()
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x24c22866f28>
In [36]:
# we only have partial data for 2017
ted.film_datetime.max()
Out[36]:
Timestamp('2017-08-27 00:00:00')
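
Since 2017 is incomplete, one option is to drop it before plotting so the line does not suggest a sudden decline. A minimal sketch, assuming a hypothetical Series of years in place of ted.film_datetime.dt.year:

```python
import pandas as pd

# hypothetical years standing in for ted.film_datetime.dt.year
years = pd.Series([2015, 2015, 2016, 2016, 2016, 2017])

# count talks per year and put the years in chronological order
talks_per_year = years.value_counts().sort_index()

# keep only years for which the data is complete
complete_years = talks_per_year[talks_per_year.index < 2017]
print(complete_years)
```

Appending `.plot()` to `complete_years` would draw the same line plot as above without the partial year.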

Lessons:

  1. Read the documentation
  2. Use the datetime data type for dates and times
  3. Check your work as you go
  4. Consider excluding data if it might not be relevant
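
The bonus exercise from the start of this section (average delay between filming and publishing) could be approached as sketched below. The frame `toy` is a hypothetical stand-in holding the film_date and published_date values of the first five talks shown earlier; on the real data the same steps would be applied to `ted`.

```python
import pandas as pd

# film_date and published_date for the first five talks (copied from ted.head())
toy = pd.DataFrame({
    'film_date': [1140825600, 1140825600, 1140739200, 1140912000, 1140566400],
    'published_date': [1151367060, 1151367060, 1151367060, 1151367060, 1151440680],
})

# both columns are Unix timestamps in seconds
film = pd.to_datetime(toy.film_date, unit='s')
published = pd.to_datetime(toy.published_date, unit='s')

# subtracting two datetime Series gives a timedelta Series
delay = published - film
mean_delay = delay.mean()

print(delay.dt.days.tolist())
print(mean_delay)
```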

5. What were the "best" events in TED history to attend?

In [37]:
# count the number of talks (great if you value variety, but they may not be great talks)
ted.event.value_counts().head()
Out[37]:
TED2014    84
TED2009    83
TED2016    77
TED2013    77
TED2015    75
Name: event, dtype: int64
In [38]:
# use views as a proxy for "quality of talk"
ted.groupby('event').views.mean().head()
Out[38]:
event
AORN Congress                  149818.0
Arbejdsglaede Live             971594.0
BBC TV                         521974.0
Bowery Poetry Club             676741.0
Business Innovation Factory    304086.0
Name: views, dtype: float64
In [39]:
# find the largest values, but we don't know how many talks are being averaged
ted.groupby('event').views.mean().sort_values().tail()
Out[39]:
event
TEDxNorrkoping        6569493.0
TEDxCreativeCoast     8444981.0
TEDxBloomington       9484259.5
TEDxHouston          16140250.5
TEDxPuget Sound      34309432.0
Name: views, dtype: float64
In [40]:
# show the number of talks along with the mean (events with the highest means had only 1 or 2 talks)
ted.groupby('event').views.agg(['count', 'mean']).sort_values('mean').tail()
Out[40]:
                   count        mean
event
TEDxNorrkoping         1   6569493.0
TEDxCreativeCoast      1   8444981.0
TEDxBloomington        2   9484259.5
TEDxHouston            2  16140250.5
TEDxPuget Sound        1  34309432.0
In [41]:
# calculate the total views per event
ted.groupby('event').views.agg(['count', 'mean', 'sum']).sort_values('sum').tail()
Out[41]:
                count          mean        sum
event
TED2006            45  3.274345e+06  147345533
TED2015            75  2.011017e+06  150826305
TEDGlobal 2013     66  2.584163e+06  170554736
TED2014            84  2.072874e+06  174121423
TED2013            77  2.302700e+06  177307937

Lessons:

  1. Think creatively about how you can use the data you have to answer your question
  2. Watch out for small sample sizes
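
One way to act on the second lesson is to require a minimum number of talks per event before ranking by mean views. The sketch below is an illustration only: the frame `toy`, the event name 'TEDxSolo', and the threshold `min_talks` are all hypothetical.

```python
import pandas as pd

# hypothetical data: one event with a single viral talk, two events with three talks each
toy = pd.DataFrame({
    'event': ['TEDxSolo', 'TED2013', 'TED2013', 'TED2013',
              'TED2014', 'TED2014', 'TED2014'],
    'views': [30_000_000, 2_000_000, 3_000_000, 1_000_000,
              1_500_000, 2_500_000, 500_000],
})

stats = toy.groupby('event').views.agg(['count', 'mean'])

min_talks = 3  # arbitrary threshold, chosen for this example

# bracket notation is needed because stats.count is a DataFrame method
ranked = stats[stats['count'] >= min_talks].sort_values('mean', ascending=False)
print(ranked)
```

The right threshold depends on the question being asked; there is no single correct value.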

6. Unpack the ratings data

In [42]:
# previously, users could tag talks on the TED website (funny, inspiring, confusing, etc.)
ted.ratings.head()
Out[42]:
0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1    [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2    [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3    [{'id': 3, 'name': 'Courageous', 'count': 760}...
4    [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings, dtype: object
In [43]:
# two ways to examine the ratings data for the first talk
ted.loc[0, 'ratings']
ted.ratings[0]
Out[43]:
"[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}, {'id': 9, 'name': 'Ingenious', 'count': 6073}, {'id': 3, 'name': 'Courageous', 'count': 3253}, {'id': 11, 'name': 'Longwinded', 'count': 387}, {'id': 2, 'name': 'Confusing', 'count': 242}, {'id': 8, 'name': 'Informative', 'count': 7346}, {'id': 22, 'name': 'Fascinating', 'count': 10581}, {'id': 21, 'name': 'Unconvincing', 'count': 300}, {'id': 24, 'name': 'Persuasive', 'count': 10704}, {'id': 23, 'name': 'Jaw-dropping', 'count': 4439}, {'id': 25, 'name': 'OK', 'count': 1174}, {'id': 26, 'name': 'Obnoxious', 'count': 209}, {'id': 10, 'name': 'Inspiring', 'count': 24924}]"
In [44]:
# this is a string, not a list
type(ted.ratings[0])
Out[44]:
str
In [45]:
# convert this into something useful using Python's ast module (Abstract Syntax Tree)
import ast
In [46]:
# literal_eval() allows you to evaluate a string containing a Python literal or container
ast.literal_eval('[1, 2, 3]')
Out[46]:
[1, 2, 3]
In [47]:
# if you have a string representation of something, you can retrieve what it actually represents
type(ast.literal_eval('[1, 2, 3]'))
Out[47]:
list
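
A reason to prefer ast.literal_eval over the built-in eval for data like this is that it only accepts Python literals and containers, and refuses anything else, such as a function call. A short sketch (the second input string is a made-up example of something that should be rejected):

```python
import ast

# a literal (list of dicts) is parsed into real Python objects
parsed = ast.literal_eval("[{'id': 7, 'name': 'Funny', 'count': 19645}]")

# a string containing a function call is not a literal, so it is rejected
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True

print(parsed[0]['count'], rejected)
```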
In [48]:
# unpack the ratings data for the first talk
ast.literal_eval(ted.ratings[0])
Out[48]:
[{'id': 7, 'name': 'Funny', 'count': 19645},
 {'id': 1, 'name': 'Beautiful', 'count': 4573},
 {'id': 9, 'name': 'Ingenious', 'count': 6073},
 {'id': 3, 'name': 'Courageous', 'count': 3253},
 {'id': 11, 'name': 'Longwinded', 'count': 387},
 {'id': 2, 'name': 'Confusing', 'count': 242},
 {'id': 8, 'name': 'Informative', 'count': 7346},
 {'id': 22, 'name': 'Fascinating', 'count': 10581},
 {'id': 21, 'name': 'Unconvincing', 'count': 300},
 {'id': 24, 'name': 'Persuasive', 'count': 10704},
 {'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
 {'id': 25, 'name': 'OK', 'count': 1174},
 {'id': 26, 'name': 'Obnoxious', 'count': 209},
 {'id': 10, 'name': 'Inspiring', 'count': 24924}]
In [49]:
# now we have a list (of dictionaries)
type(ast.literal_eval(ted.ratings[0]))
Out[49]:
list
In [50]:
# define a function to convert an element in the ratings Series from string to list
def str_to_list(ratings_str):
    return ast.literal_eval(ratings_str)
In [51]:
# test the function
str_to_list(ted.ratings[0])
Out[51]:
[{'id': 7, 'name': 'Funny', 'count': 19645},
 {'id': 1, 'name': 'Beautiful', 'count': 4573},
 {'id': 9, 'name': 'Ingenious', 'count': 6073},
 {'id': 3, 'name': 'Courageous', 'count': 3253},
 {'id': 11, 'name': 'Longwinded', 'count': 387},
 {'id': 2, 'name': 'Confusing', 'count': 242},
 {'id': 8, 'name': 'Informative', 'count': 7346},
 {'id': 22, 'name': 'Fascinating', 'count': 10581},
 {'id': 21, 'name': 'Unconvincing', 'count': 300},
 {'id': 24, 'name': 'Persuasive', 'count': 10704},
 {'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
 {'id': 25, 'name': 'OK', 'count': 1174},
 {'id': 26, 'name': 'Obnoxious', 'count': 209},
 {'id': 10, 'name': 'Inspiring', 'count': 24924}]
In [52]:
# Series apply method applies a function to every element in a Series and returns a Series
ted.ratings.apply(str_to_list).head()
Out[52]:
0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1    [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2    [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3    [{'id': 3, 'name': 'Courageous', 'count': 760}...
4    [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings, dtype: object
In [53]:
# lambda is a shorter alternative
ted.ratings.apply(lambda x: ast.literal_eval(x)).head()
Out[53]:
0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1    [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2    [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3    [{'id': 3, 'name': 'Courageous', 'count': 760}...
4    [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings, dtype: object
In [54]:
# an even shorter alternative is to apply the function directly (without lambda)
ted.ratings.apply(ast.literal_eval).head()
Out[54]:
0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1    [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2    [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3    [{'id': 3, 'name': 'Courageous', 'count': 760}...
4    [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings, dtype: object
In [55]:
ted['ratings_list'] = ted.ratings.apply(ast.literal_eval)
In [56]:
# check that the new Series looks as expected
ted.ratings_list[0]
Out[56]:
[{'id': 7, 'name': 'Funny', 'count': 19645},
 {'id': 1, 'name': 'Beautiful', 'count': 4573},
 {'id': 9, 'name': 'Ingenious', 'count': 6073},
 {'id': 3, 'name': 'Courageous', 'count': 3253},
 {'id': 11, 'name': 'Longwinded', 'count': 387},
 {'id': 2, 'name': 'Confusing', 'count': 242},
 {'id': 8, 'name': 'Informative', 'count': 7346},
 {'id': 22, 'name': 'Fascinating', 'count': 10581},
 {'id': 21, 'name': 'Unconvincing', 'count': 300},
 {'id': 24, 'name': 'Persuasive', 'count': 10704},
 {'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
 {'id': 25, 'name': 'OK', 'count': 1174},
 {'id': 26, 'name': 'Obnoxious', 'count': 209},
 {'id': 10, 'name': 'Inspiring', 'count': 24924}]
In [57]:
# each element in the Series is a list
type(ted.ratings_list[0])
Out[57]:
list
In [58]:
# data type of the new Series is object
ted.ratings_list.dtype
Out[58]:
dtype('O')
In [59]:
# object is not just for strings
ted.dtypes
Out[59]:
comments                       int64
description                   object
duration                       int64
event                         object
film_date                      int64
languages                      int64
main_speaker                  object
name                          object
num_speaker                    int64
published_date                 int64
ratings                       object
related_talks                 object
speaker_occupation            object
tags                          object
title                         object
url                           object
views                          int64
comments_per_view            float64
views_per_comment            float64
film_datetime         datetime64[ns]
ratings_list                  object
dtype: object

Lessons:

  1. Pay attention to data types in pandas
  2. Use apply whenever no built-in pandas method does the job

7. Count the total number of ratings received by each talk

Bonus exercises:

  • for each talk, calculate the percentage of ratings that were negative
  • for each talk, calculate the average number of ratings it received per day since it was published
In [60]:
# expected result (for each talk) is sum of count
ted.ratings_list[0]
Out[60]:
[{'id': 7, 'name': 'Funny', 'count': 19645},
 {'id': 1, 'name': 'Beautiful', 'count': 4573},
 {'id': 9, 'name': 'Ingenious', 'count': 6073},
 {'id': 3, 'name': 'Courageous', 'count': 3253},
 {'id': 11, 'name': 'Longwinded', 'count': 387},
 {'id': 2, 'name': 'Confusing', 'count': 242},
 {'id': 8, 'name': 'Informative', 'count': 7346},
 {'id': 22, 'name': 'Fascinating', 'count': 10581},
 {'id': 21, 'name': 'Unconvincing', 'count': 300},
 {'id': 24, 'name': 'Persuasive', 'count': 10704},
 {'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
 {'id': 25, 'name': 'OK', 'count': 1174},
 {'id': 26, 'name': 'Obnoxious', 'count': 209},
 {'id': 10, 'name': 'Inspiring', 'count': 24924}]
In [61]:
# start by building a simple function
def get_num_ratings(list_of_dicts):
    return list_of_dicts[0]
In [62]:
# pass it a list, and it returns the first element in the list, which is a dictionary
get_num_ratings(ted.ratings_list[0])
Out[62]:
{'id': 7, 'name': 'Funny', 'count': 19645}
In [63]:
# modify the function to return the vote count
def get_num_ratings(list_of_dicts):
    return list_of_dicts[0]['count']
In [64]:
# pass it a list, and it returns a value from the first dictionary in the list
get_num_ratings(ted.ratings_list[0])
Out[64]:
19645
In [65]:
# modify the function to get the sum of count
def get_num_ratings(list_of_dicts):
    num = 0
    for d in list_of_dicts:
        num = num + d['count']
    return num
In [66]:
# looks about right
get_num_ratings(ted.ratings_list[0])
Out[66]:
93850
In [67]:
# check with another record
ted.ratings_list[1]
Out[67]:
[{'id': 7, 'name': 'Funny', 'count': 544},
 {'id': 3, 'name': 'Courageous', 'count': 139},
 {'id': 2, 'name': 'Confusing', 'count': 62},
 {'id': 1, 'name': 'Beautiful', 'count': 58},
 {'id': 21, 'name': 'Unconvincing', 'count': 258},
 {'id': 11, 'name': 'Longwinded', 'count': 113},
 {'id': 8, 'name': 'Informative', 'count': 443},
 {'id': 10, 'name': 'Inspiring', 'count': 413},
 {'id': 22, 'name': 'Fascinating', 'count': 132},
 {'id': 9, 'name': 'Ingenious', 'count': 56},
 {'id': 24, 'name': 'Persuasive', 'count': 268},
 {'id': 23, 'name': 'Jaw-dropping', 'count': 116},
 {'id': 26, 'name': 'Obnoxious', 'count': 131},
 {'id': 25, 'name': 'OK', 'count': 203}]
In [68]:
# looks about right
get_num_ratings(ted.ratings_list[1])
Out[68]:
2936
In [69]:
# apply it to every element in the Series
ted.ratings_list.apply(get_num_ratings).head()
Out[69]:
0    93850
1     2936
2     2824
3     3728
4    25620
Name: ratings_list, dtype: int64
In [70]:
# another alternative is to use a generator expression
sum(d['count'] for d in ted.ratings_list[0])
Out[70]:
93850
In [71]:
# use lambda to apply this method
ted.ratings_list.apply(lambda x: sum(d['count'] for d in x)).head()
Out[71]:
0    93850
1     2936
2     2824
3     3728
4    25620
Name: ratings_list, dtype: int64
In [72]:
# another alternative is to use pd.DataFrame()
pd.DataFrame(ted.ratings_list[0])['count'].sum()
Out[72]:
93850
In [73]:
# use lambda to apply this method
ted.ratings_list.apply(lambda x: pd.DataFrame(x)['count'].sum()).head()
Out[73]:
0    93850
1     2936
2     2824
3     3728
4    25620
Name: ratings_list, dtype: int64
In [74]:
ted['num_ratings'] = ted.ratings_list.apply(get_num_ratings)
In [75]:
# do one more check
ted.num_ratings.describe()
Out[75]:
count     2550.000000
mean      2436.408235
std       4226.795631
min         68.000000
25%        870.750000
50%       1452.500000
75%       2506.750000
max      93850.000000
Name: num_ratings, dtype: float64

Lessons:

  1. Write your code in small chunks, and check your work as you go
  2. Lambda is best for simple functions
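
The first bonus exercise of this section (percentage of negative ratings) could be sketched along the same lines as get_num_ratings. Which rating names count as negative is an assumption made here, not something the dataset defines; the function name `pct_negative` is likewise hypothetical.

```python
# ratings for the first talk, copied from the output above
ratings = [
    {'id': 7, 'name': 'Funny', 'count': 19645},
    {'id': 1, 'name': 'Beautiful', 'count': 4573},
    {'id': 9, 'name': 'Ingenious', 'count': 6073},
    {'id': 3, 'name': 'Courageous', 'count': 3253},
    {'id': 11, 'name': 'Longwinded', 'count': 387},
    {'id': 2, 'name': 'Confusing', 'count': 242},
    {'id': 8, 'name': 'Informative', 'count': 7346},
    {'id': 22, 'name': 'Fascinating', 'count': 10581},
    {'id': 21, 'name': 'Unconvincing', 'count': 300},
    {'id': 24, 'name': 'Persuasive', 'count': 10704},
    {'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
    {'id': 25, 'name': 'OK', 'count': 1174},
    {'id': 26, 'name': 'Obnoxious', 'count': 209},
    {'id': 10, 'name': 'Inspiring', 'count': 24924},
]

# assumption: these four names are treated as negative, 'OK' as neutral
NEGATIVE = {'Longwinded', 'Confusing', 'Unconvincing', 'Obnoxious'}

def pct_negative(list_of_dicts, negative_names=NEGATIVE):
    """Return the percentage of rating votes whose name is in negative_names."""
    total = sum(d['count'] for d in list_of_dicts)
    negative = sum(d['count'] for d in list_of_dicts if d['name'] in negative_names)
    return 100 * negative / total if total else 0.0

print(round(pct_negative(ratings), 2))
```

On the full data this would be applied as `ted.ratings_list.apply(pct_negative)`.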

8. Which occupations deliver the funniest TED talks on average?

Bonus exercises:

  • for each talk, calculate the most frequent rating
  • for each talk, clean the occupation data so that there's only one occupation per talk
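
Both bonus exercises could be approached with small helper functions, sketched below. The function names, the abbreviated ratings list, and the choice of delimiters (comma, semicolon, slash, and the word "and") are assumptions for illustration; real occupation strings may need more careful handling.

```python
import re

# abbreviated ratings for the first talk (three of the fourteen entries)
ratings = [
    {'id': 7, 'name': 'Funny', 'count': 19645},
    {'id': 24, 'name': 'Persuasive', 'count': 10704},
    {'id': 10, 'name': 'Inspiring', 'count': 24924},
]

def most_frequent_rating(list_of_dicts):
    """Return the name of the rating with the highest count."""
    return max(list_of_dicts, key=lambda d: d['count'])['name']

def first_occupation(occupation):
    """Keep only the first occupation; pass missing values through unchanged."""
    if not isinstance(occupation, str):
        return occupation
    # split on the first comma, semicolon, slash, or standalone "and"
    return re.split(r'\s*(?:[,;/]|\band\b)\s*', occupation)[0]

print(most_frequent_rating(ratings))
print(first_occupation('Global health expert; data visionary'))
```

These would be applied with `ted.ratings_list.apply(most_frequent_rating)` and `ted.speaker_occupation.apply(first_occupation)`.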

Step 1: Count the number of funny ratings

In [76]:
# "Funny" is not always the first dictionary in the list
ted.ratings_list.head()
Out[76]:
0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1    [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2    [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3    [{'id': 3, 'name': 'Courageous', 'count': 760}...
4    [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings_list, dtype: object
In [77]:
# check ratings (not ratings_list) to see if "Funny" is always a rating type
ted.ratings.str.contains('Funny').value_counts()
Out[77]:
True    2550
Name: ratings, dtype: int64
In [78]:
# write a custom function
def get_funny_ratings(list_of_dicts):
    for d in list_of_dicts:
        if d['name'] == 'Funny':
            return d['count']
In [79]:
# examine a record in which "Funny" is not the first dictionary
ted.ratings_list[3]
Out[79]:
[{'id': 3, 'name': 'Courageous', 'count': 760},
 {'id': 1, 'name': 'Beautiful', 'count': 291},
 {'id': 2, 'name': 'Confusing', 'count': 32},
 {'id': 7, 'name': 'Funny', 'count': 59},
 {'id': 9, 'name': 'Ingenious', 'count': 105},
 {'id': 21, 'name': 'Unconvincing', 'count': 36},
 {'id': 11, 'name': 'Longwinded', 'count': 53},
 {'id': 8, 'name': 'Informative', 'count': 380},
 {'id': 10, 'name': 'Inspiring', 'count': 1070},
 {'id': 22, 'name': 'Fascinating', 'count': 132},
 {'id': 24, 'name': 'Persuasive', 'count': 460},
 {'id': 23, 'name': 'Jaw-dropping', 'count': 230},
 {'id': 26, 'name': 'Obnoxious', 'count': 35},
 {'id': 25, 'name': 'OK', 'count': 85}]
In [80]:
# check that the function works
get_funny_ratings(ted.ratings_list[3])
Out[80]:
59
In [81]:
# apply it to every element in the Series
ted['funny_ratings'] = ted.ratings_list.apply(get_funny_ratings)
ted.funny_ratings.head()
Out[81]:
0    19645
1      544
2      964
3       59
4     1390
Name: funny_ratings, dtype: int64
In [82]:
# check for missing values
ted.funny_ratings.isna().sum()
Out[82]:
0
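
The missing-value check above matters because a function like this returns nothing when no matching rating is found. The sketch below generalizes the idea to any rating name and makes that fallback explicit. The helper name `get_rating_count` and the toy data (including a talk with no "Funny" entry) are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for ted.ratings_list; the last talk is hypothetical
# and deliberately has no "Funny" entry
ratings_list = pd.Series([
    [{'id': 7, 'name': 'Funny', 'count': 19645},
     {'id': 1, 'name': 'Beautiful', 'count': 4573}],
    [{'id': 3, 'name': 'Courageous', 'count': 760},
     {'id': 7, 'name': 'Funny', 'count': 59}],
    [{'id': 1, 'name': 'Beautiful', 'count': 12}],
])

def get_rating_count(list_of_dicts, rating_name):
    """Hypothetical helper: count for one rating name, or None if absent."""
    for d in list_of_dicts:
        if d['name'] == rating_name:
            return d['count']
    return None  # explicit fallback, shows up as a missing value

# Extra keyword arguments to Series.apply are passed on to the function
funny = ratings_list.apply(get_rating_count, rating_name='Funny')

# Same check as in the notebook: how many talks lack this rating?
n_missing = funny.isna().sum()
print(n_missing)  # 1
```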

Step 2: Calculate the proportion of ratings that are funny

In [83]:
ted['funny_rate'] = ted.funny_ratings / ted.num_ratings
In [84]:
# "gut check" that this calculation makes sense by examining the occupations of the funniest talks
ted.sort_values('funny_rate').speaker_occupation.tail(20)
Out[84]:
1849                       Science humorist
337                                Comedian
124     Performance poet, multimedia artist
315                                  Expert
1168             Social energy entrepreneur
1468                          Ornithologist
595                  Comedian, voice artist
1534                         Cartoon editor
97                                 Satirist
2297                          Actor, writer
568                                Comedian
675                          Data scientist
21                     Humorist, web artist
194                                Jugglers
2273                    Comedian and writer
2114                    Comedian and writer
173                                Investor
747                                Comedian
1398                               Comedian
685             Actor, comedian, playwright
Name: speaker_occupation, dtype: object
In [85]:
# examine the occupations of the least funny talks
ted.sort_values('funny_rate').speaker_occupation.head(20)
Out[85]:
2549               Game designer
1612                   Biologist
612                     Sculptor
998               Penguin expert
593                     Engineer
284               Space activist
1041         Biomedical engineer
1618      Spinal cord researcher
2132    Computational geneticist
442                     Sculptor
426              Author, thinker
458                     Educator
2437      Environmental engineer
1491             Photojournalist
1893     Forensic anthropologist
783             Marine biologist
195                    Kenyan MP
772             HIV/AIDS fighter
788            Building activist
936                Neuroengineer
Name: speaker_occupation, dtype: object
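
The same gut check can be sketched on a tiny made-up table, which also makes it easy to confirm that the rate is a fraction between 0 and 1 rather than a percentage. The DataFrame `talks` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up data with the same column names as the notebook
talks = pd.DataFrame({
    'speaker_occupation': ['Comedian', 'Surgeon', 'Writer'],
    'funny_ratings': [50, 1, 20],
    'num_ratings': [100, 200, 80],
})

# Element-wise division of two columns gives one rate per talk
talks['funny_rate'] = talks['funny_ratings'] / talks['num_ratings']

# Sanity check: a share of ratings must lie between 0 and 1
rates_in_range = talks['funny_rate'].between(0, 1).all()

# Sorting in descending order puts the funniest talks first
ranked = talks.sort_values('funny_rate', ascending=False)
print(ranked['speaker_occupation'].tolist())  # ['Comedian', 'Writer', 'Surgeon']
```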

Step 3: Analyze the funny rate by occupation

In [86]:
# calculate the mean funny rate for each occupation
ted.groupby('speaker_occupation').funny_rate.mean().sort_values().tail()
Out[86]:
speaker_occupation
Comedian                       0.512457
Actor, writer                  0.515152
Actor, comedian, playwright    0.558107
Jugglers                       0.566828
Comedian and writer            0.602085
Name: funny_rate, dtype: float64
In [87]:
# however, most of the occupations have a sample size of 1
# (note the 1458 unique values among only 2544 non-missing entries)
ted.speaker_occupation.describe()
Out[87]:
count       2544
unique      1458
top       Writer
freq          45
Name: speaker_occupation, dtype: object
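
One way to see the sample-size problem directly is to compute the mean and the group size side by side. This is a minimal sketch on made-up data, not the notebook's own approach; the DataFrame `talks` is an assumption.

```python
import pandas as pd

# Made-up data: "Jugglers" appears once, so its mean rests on one talk
talks = pd.DataFrame({
    'speaker_occupation': ['Comedian', 'Comedian', 'Jugglers',
                           'Writer', 'Writer', 'Writer'],
    'funny_rate': [0.6, 0.4, 0.9, 0.1, 0.0, 0.2],
})

# agg with a list of function names returns one column per function
summary = (talks.groupby('speaker_occupation')['funny_rate']
                .agg(['mean', 'count'])
                .sort_values('mean'))
print(summary)
```

With the count column next to the mean, a top-ranked occupation that is backed by a single talk is easy to spot.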

Step 4: Focus on occupations that are well-represented in the data

In [88]:
# count how many times each occupation appears
ted.speaker_occupation.value_counts()
Out[88]:
Writer                        45
Designer                      34
Artist                        34
Journalist                    33
Entrepreneur                  31
                              ..
High school principal          1
Religious leader               1
Prime Minister of Bhutan       1
Marketing expert               1
Developmental psychologist     1
Name: speaker_occupation, Length: 1458, dtype: int64
In [89]:
# value_counts() returns a pandas Series, so we can use Series methods on the result
occupation_counts = ted.speaker_occupation.value_counts()
type(occupation_counts)
Out[89]:
pandas.core.series.Series
In [90]:
# show occupations that appear at least 5 times
occupation_counts[occupation_counts >= 5]
Out[90]:
Writer                   45
Designer                 34
Artist                   34
Journalist               33
Entrepreneur             31
                         ..
Surgeon                   5
Social Media Theorist     5
Science writer            5
Researcher                5
Data scientist            5
Name: speaker_occupation, Length: 68, dtype: int64
In [91]:
# save the index of this Series
top_occupations = occupation_counts[occupation_counts >= 5].index
top_occupations
Out[91]:
Index(['Writer', 'Designer', 'Artist', 'Journalist', 'Entrepreneur',
       'Architect', 'Inventor', 'Psychologist', 'Photographer', 'Filmmaker',
       'Author', 'Educator', 'Neuroscientist', 'Economist', 'Roboticist',
       'Philosopher', 'Biologist', 'Physicist', 'Musician', 'Marine biologist',
       'Activist', 'Technologist', 'Global health expert; data visionary',
       'Historian', 'Graphic designer', 'Philanthropist', 'Poet',
       'Behavioral economist', 'Singer/songwriter', 'Astronomer',
       'Oceanographer', 'Computer scientist', 'Engineer', 'Novelist',
       'Social psychologist', 'Futurist', 'Astrophysicist', 'Mathematician',
       'Writer, activist', 'Performance poet, multimedia artist',
       'Social entrepreneur', 'Evolutionary biologist', 'Singer-songwriter',
       'Techno-illusionist', 'Comedian', 'Climate advocate', 'Legal activist',
       'Photojournalist', 'Reporter', 'Cartoonist', 'Physician',
       'Investor and advocate for moral leadership',
       'Environmentalist, futurist', 'Game designer', 'Musician, activist',
       'Producer', 'Sound consultant', 'Paleontologist', 'Chemist', 'Sculptor',
       'Violinist', 'Chef', 'Tech visionary', 'Surgeon',
       'Social Media Theorist', 'Science writer', 'Researcher',
       'Data scientist'],
      dtype='object')
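
The steps in this section can be sketched end to end on a small made-up Series. The threshold name `min_talks` and the data are assumptions for illustration; the notebook itself uses a cutoff of 5.

```python
import pandas as pd

# Made-up occupations: two common ones and two that appear only once
occupations = pd.Series(['Writer'] * 5 + ['Artist'] * 3 + ['Jugglers', 'Satirist'])

# value_counts() returns a Series: index = occupation, values = frequency
counts = occupations.value_counts()

# A boolean mask on the counts keeps the well-represented occupations;
# .index then gives just their names
min_talks = 3  # hypothetical cutoff for this toy example
frequent = counts[counts >= min_talks].index

# Share of distinct occupations that appear exactly once
share_single = (counts == 1).mean()
print(sorted(frequent), share_single)  # ['Artist', 'Writer'] 0.5
```

The last line is a quick way to put a number on how many occupations have a sample size of 1.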

Step 5: Re-analyze the funny rate by occupation (for top occupations only)

In [92]:
# filter DataFrame to include only those occupations
ted_top_occupations = ted[ted.speaker_occupation.isin(top_occupations)]
ted_top_occupations.shape
Out[92]:
(786, 24)
In [93]:
# redo the previous groupby
ted_top_occupations.groupby('speaker_occupation').funny_rate.mean().sort_values()
Out[93]:
speaker_occupation
Surgeon                                       0.002465
Physician                                     0.004515
Photojournalist                               0.004908
Investor and advocate for moral leadership    0.005198
Photographer                                  0.007152
                                                ...   
Data scientist                                0.184076
Producer                                      0.202531
Singer/songwriter                             0.252205
Performance poet, multimedia artist           0.306468
Comedian                                      0.512457
Name: funny_rate, Length: 68, dtype: float64
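
As an alternative to building an index and filtering with `isin`, the group size can be attached to every row with `groupby(...).transform('count')`. This is a swapped-in technique rather than the method used above, sketched on made-up data with a hypothetical cutoff of 2.

```python
import pandas as pd

# Made-up data: "Jugglers" has a single talk and should be filtered out
talks = pd.DataFrame({
    'speaker_occupation': ['Comedian', 'Comedian', 'Jugglers',
                           'Writer', 'Writer', 'Writer'],
    'funny_rate': [0.6, 0.4, 0.9, 0.1, 0.0, 0.2],
})

# transform('count') returns one value per row: the number of non-missing
# funny_rate values in that row's occupation group
group_size = talks.groupby('speaker_occupation')['funny_rate'].transform('count')

# Keep only rows whose occupation appears at least twice (toy cutoff)
well_represented = talks[group_size >= 2]

# Redo the groupby on the filtered rows
mean_rate = (well_represented.groupby('speaker_occupation')['funny_rate']
                             .mean()
                             .sort_values())
print(mean_rate.index.tolist())  # ['Writer', 'Comedian']
```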

Lessons:

  1. Check your assumptions about your data
  2. Check whether your results are reasonable
  3. Take advantage of the fact that pandas operations often output a DataFrame or a Series
  4. Watch out for small sample sizes
  5. Consider the impact of missing data
  6. Data scientists are hilarious
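
Lesson 5 applies here because `speaker_occupation` has 2544 non-missing values in a dataset of 2550 talks, and `groupby` leaves out rows whose key is missing. The sketch below shows that behaviour on made-up data; the DataFrame `talks` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: the second talk has no occupation recorded
talks = pd.DataFrame({
    'speaker_occupation': ['Writer', np.nan, 'Writer', 'Comedian'],
    'funny_rate': [0.1, 0.8, 0.3, 0.5],
})

# Count missing occupations before grouping
n_missing = talks['speaker_occupation'].isna().sum()

# By default, rows with a missing group key are excluded from the result
grouped = talks.groupby('speaker_occupation')['funny_rate'].mean()

print(n_missing)               # 1
print(grouped.index.tolist())  # ['Comedian', 'Writer']
```

The talk with the highest funny rate in this toy table contributes to no group at all, which is why it is worth counting missing keys before interpreting grouped results.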