Remember the Building Recommender System using Python project?
In the previous part, we saw how a recommender system can be built using only average ratings: by sorting a score that contains an average-rating component in descending order, we can estimate which films audiences find the most appealing.
This time, we will build a recommender system that uses the content/features of each film/entity, then computes the similarity between them, so that when we point at one film, we get several other films similar to it. This is commonly called a content-based recommender system.
By comparing the available plots and genres, when an audience member prefers the film Narnia, this content-based recommender system will also recommend films such as Harry Potter or The Lord of the Rings, which have similar genres.
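As a toy illustration of this idea (the genre lists below are made up for the example, not taken from the dataset), films that share more genres get a higher similarity score:

```python
# Jaccard similarity over genre sets: |intersection| / |union|
def genre_overlap(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical genre lists, for illustration only
narnia = ['Adventure', 'Family', 'Fantasy']
lotr = ['Adventure', 'Drama', 'Fantasy']
waterfront = ['Drama', 'Romance']

print(genre_overlap(narnia, lotr))        # 0.5 -> similar, gets recommended
print(genre_overlap(narnia, waterfront))  # 0.0 -> no shared genres
```

The system we build below uses a richer similarity (over cast, genres, directors, and writers), but the intuition is the same.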
The first step is to import the libraries needed for this project and to read the dataset.
Notes:
- The libraries we will use are pandas (as pd) and numpy (as np)
- The dataset to be used is movie_rating_df.csv
Dataset access:
- movie_rating_df.csv = https://storage.googleapis.com/dqlab-dataset/movie_rating_df.csv
#import the required libraries
import pandas as pd
import numpy as np
#read the dataset
movie_rating_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/movie_rating_df.csv')
#display the first five rows of movie_rating_df
movie_rating_df.head()
Info: the material was updated on 2 September 2021; make sure the code you have written matches the Lesson section.
Now that we have stored the dataset in the movie_rating_df variable, the next thing we will do is display the top five rows of the dataset and show information on the data type and the number of non-null values of each column.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
movie_rating_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/movie_rating_df.csv')
print(movie_rating_df.head())
print(movie_rating_df.info())
Output:
tconst titleType primaryTitle originalTitle \
0 tt0000001 short Carmencita Carmencita
1 tt0000002 short Le clown et ses chiens Le clown et ses chiens
2 tt0000003 short Pauvre Pierrot Pauvre Pierrot
3 tt0000004 short Un bon bock Un bon bock
4 tt0000005 short Blacksmith Scene Blacksmith Scene
isAdult startYear endYear runtimeMinutes genres \
0 0 1894.0 NaN 1.0 Documentary,Short
1 0 1892.0 NaN 5.0 Animation,Short
2 0 1892.0 NaN 4.0 Animation,Comedy,Romance
3 0 1892.0 NaN 12.0 Animation,Short
4 0 1893.0 NaN 1.0 Comedy,Short
averageRating numVotes
0 5.6 1608
1 6.0 197
2 6.5 1285
3 6.1 121
4 6.1 2050
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 751614 entries, 0 to 751613
Data columns (total 11 columns):
tconst 751614 non-null object
titleType 751614 non-null object
primaryTitle 751614 non-null object
originalTitle 751614 non-null object
isAdult 751614 non-null int64
startYear 751614 non-null float64
endYear 16072 non-null float64
runtimeMinutes 751614 non-null float64
genres 486766 non-null object
averageRating 751614 non-null float64
numVotes 751614 non-null int64
dtypes: float64(4), int64(2), object(5)
memory usage: 63.1+ MB
None
From the output above, we obtain a list of films with some metadata, such as isAdult, runtimeMinutes, and genres.
Next, we will add other metadata, such as the actors/actresses who played in each film; we will use another dataframe and join it with the movie_rating_df dataframe.
The dataset to be used is actor_name.csv
Dataset access: https://storage.googleapis.com/dqlab-dataset/actor_name.csv
The next dataframe to be added contains the directors and writers of each film.
The dataset to be used is directors_writers.csv
Dataset access: https://storage.googleapis.com/dqlab-dataset/directors_writers.csv
After displaying information about the director_writers dataframe, we can see that the dataset contains no NULL values. The next thing we will do is convert director_name and writer_name from strings into lists.
After that, display the top 5 rows of the director_writers dataframe.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
director_writers = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/directors_writers.csv')
director_writers['director_name'] = director_writers['director_name'].apply(lambda row: row.split(','))
director_writers['writer_name'] = director_writers['writer_name'].apply(lambda row: row.split(','))
print(director_writers.head())
Output:
tconst director_name \
0 tt0011414 [David Kirkland]
1 tt0011890 [Roy William Neill]
2 tt0014341 [Buster Keaton, John G. Blystone]
3 tt0018054 [Cecil B. DeMille]
4 tt0024151 [James Cruze]
writer_name
0 [John Emerson, Anita Loos]
1 [Arthur F. Goodrich, Burns Mantle, Mary Murillo]
2 [Jean C. Havez, Clyde Bruckman, Joseph A. Mitc...
3 [Jeanie Macpherson]
4 [Max Miller, Wells Root, Jack Jevne]
We will only need the nconst, primaryName, and knownForTitles columns of name_df to match these actors/actresses with the available films.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
name_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/actor_name.csv')
name_df = name_df[['nconst','primaryName','knownForTitles']]
print(name_df.head())
Output:
nconst primaryName knownForTitles
0 nm1774132 Nathan McLaughlin tt0417686,tt1713976,tt1891860,tt0454839
1 nm10683464 Bridge Andrew tt7718088
2 nm1021485 Brandon Fransvaag tt0168790
3 nm6940929 Erwin van der Lely tt4232168
4 nm5764974 Svetlana Shypitsyna tt3014168
The next thing we want to know is the variation in the number of films an actor can star in.
An actor can of course star in more than 1 film, right? So we will need to build a table with a 1-to-1 relation to each of those movie titles. We will unnest that table.
The next tasks we have to do are:
- Check the variation in the number of films starred in by the actors.
- Convert the 'knownForTitles' column into a list of lists.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
name_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/actor_name.csv')
name_df = name_df[['nconst','primaryName','knownForTitles']]
print(name_df['knownForTitles'].apply(lambda x: len(x.split(','))).unique())
name_df['knownForTitles'] = name_df['knownForTitles'].apply(lambda x: x.split(','))
print(name_df.head())
Output:
[4 1 2 3]
nconst primaryName \
0 nm1774132 Nathan McLaughlin
1 nm10683464 Bridge Andrew
2 nm1021485 Brandon Fransvaag
3 nm6940929 Erwin van der Lely
4 nm5764974 Svetlana Shypitsyna
knownForTitles
0 [tt0417686, tt1713976, tt1891860, tt0454839]
1 [tt7718088]
2 [tt0168790]
3 [tt4232168]
4 [tt3014168]
Since the previous data shows that an actor can star in 1 to 4 films, we need to build a table with a 1-to-1 relation from the actor to each of those movie titles.
Example of a table without 1-to-1 correspondence
Example of a table with 1-to-1 correspondence
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
name_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/actor_name.csv')
name_df = name_df[['nconst','primaryName','knownForTitles']]
name_df['knownForTitles'] = name_df['knownForTitles'].apply(lambda x: x.split(','))
#prepare a bucket for the dataframes
df_uni = []
for x in ['knownForTitles']:
    #repeat each row index once per element of knownForTitles
    idx = name_df.index.repeat(name_df['knownForTitles'].str.len())
    #flatten the list values of every row and combine them into one dataframe
    df1 = pd.DataFrame({
        x: np.concatenate(name_df[x].values)
    })
    #replace that dataframe's index with the idx we defined above
    df1.index = idx
    #append every dataframe that is formed to the bucket
    df_uni.append(df1)
#combine all dataframes into one
df_concat = pd.concat(df_uni, axis=1)
#left join with the values of the original dataframe
unnested_df = df_concat.join(name_df.drop(['knownForTitles'], axis=1), how='left')
#select the columns to match the original dataframe
unnested_df = unnested_df[name_df.columns.tolist()]
print(unnested_df)
Output:
nconst primaryName knownForTitles
0 nm1774132 Nathan McLaughlin tt0417686
0 nm1774132 Nathan McLaughlin tt1713976
0 nm1774132 Nathan McLaughlin tt1891860
0 nm1774132 Nathan McLaughlin tt0454839
1 nm10683464 Bridge Andrew tt7718088
2 nm1021485 Brandon Fransvaag tt0168790
3 nm6940929 Erwin van der Lely tt4232168
4 nm5764974 Svetlana Shypitsyna tt3014168
5 nm8621807 Utku Arslan tt5493404
5 nm8621807 Utku Arslan tt7661932
5 nm8621807 Utku Arslan tt0845088
5 nm8621807 Utku Arslan tt9278408
6 nm0415875 Mihály Jakab tt0242160
7 nm7082726 Ernesto Ballén tt8696516
7 nm7082726 Ernesto Ballén tt6407610
7 nm7082726 Ernesto Ballén tt2707408
7 nm7082726 Ernesto Ballén tt4392976
8 nm5903442 Jen Brown tt3145724
9 nm1744604 Tatyana Kuzmichyova tt0420982
9 nm1744604 Tatyana Kuzmichyova tt1186366
9 nm1744604 Tatyana Kuzmichyova tt0417552
9 nm1744604 Tatyana Kuzmichyova tt0847880
10 nm8581827 Joel Capps tt6245544
11 nm4404596 Roger Caadan tt1483429
12 nm1669765 Jim Boutet tt0414476
12 nm1669765 Jim Boutet tt0106067
13 nm3211206 Mi Young Jo tt0101178
14 nm3123793 Maksim Romanov tt8777220
14 nm3123793 Maksim Romanov tt1075829
15 nm8651439 Aleksandr Tretyakov tt5528188
.. ... ... ...
983 nm6771564 Do-Kyeong Lee tt2797106
983 nm6771564 Do-Kyeong Lee tt8750956
984 nm4462830 Christopher Bryant Tucker tt1934293
985 nm2753117 Alexander Lorenz tt1074607
986 nm2573036 David Jason Pressman tt0108894
987 nm2279830 Judith Allard tt0145529
988 nm2264406 Jared R. Morris tt0384766
988 nm2264406 Jared R. Morris tt4851552
988 nm2264406 Jared R. Morris tt2229123
988 nm2264406 Jared R. Morris tt0387199
989 nm7677943 Carlos Denis tt5140670
990 nm7390826 Darlene Huynh tt4771886
991 nm5251983 David Hague tt2239078
992 nm1987981 Henry Mercedes Vales tt8979132
992 nm1987981 Henry Mercedes Vales tt0801017
992 nm1987981 Henry Mercedes Vales tt8257760
992 nm1987981 Henry Mercedes Vales tt2238964
993 nm8270190 Lilin Lace tt5523166
994 nm7383079 Francois Landry tt4762718
995 nm7596674 Paul Whitrow tt4118352
995 nm7596674 Paul Whitrow tt9104322
995 nm7596674 Paul Whitrow tt4447090
995 nm7596674 Paul Whitrow tt4892804
996 nm5938546 Wendy Ponce tt2125666
997 nm2101810 Ans Brugmans tt0488280
998 nm5245804 Eliza Jenkins tt1464058
999 nm0948460 Greg Yolen tt0436869
999 nm0948460 Greg Yolen tt0476663
999 nm0948460 Greg Yolen tt0109723
999 nm0948460 Greg Yolen tt0364484
[1918 rows x 3 columns]
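For reference, recent pandas versions (0.25 and later) ship DataFrame.explode, which performs the same unnesting in a single call. A minimal sketch on a made-up two-row frame mirroring name_df after the split step:

```python
import pandas as pd

# Tiny made-up frame shaped like name_df after knownForTitles was split into lists
name_df = pd.DataFrame({
    'nconst': ['nm0000001', 'nm0000002'],
    'primaryName': ['Actor One', 'Actor Two'],
    'knownForTitles': [['tt0000001', 'tt0000002'], ['tt0000003']],
})

# One row per (actor, title) pair; the original index repeats, just like above
unnested_df = name_df.explode('knownForTitles')
print(unnested_df)
```

The repeat/concatenate/join construction above predates explode and shows what happens under the hood, but explode is the idiomatic one-liner today.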
Now we will perform:
- a join between the movie table and the cast table (on the knownForTitles and tconst fields)
- a join between base_df and the director_writer table (on the tconst fields)
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
movie_rating_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/movie_rating_df.csv')
director_writers = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/directors_writers.csv')
director_writers['director_name'] = director_writers['director_name'].apply(lambda row: row.split(','))
director_writers['writer_name'] = director_writers['writer_name'].apply(lambda row: row.split(','))
name_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/actor_name.csv')
name_df = name_df[['nconst','primaryName','knownForTitles']]
name_df['knownForTitles'] = name_df['knownForTitles'].apply(lambda x: x.split(','))
df_uni = []
for x in ['knownForTitles']:
    idx = name_df.index.repeat(name_df['knownForTitles'].str.len())
    df1 = pd.DataFrame({
        x: np.concatenate(name_df[x].values)
    })
    df1.index = idx
    df_uni.append(df1)
df_concat = pd.concat(df_uni, axis=1)
unnested_df = df_concat.join(name_df.drop(['knownForTitles'], axis=1), how='left')
unnested_df = unnested_df[name_df.columns.tolist()]
unnested_drop = unnested_df.drop(['nconst'], axis=1)
df_uni = []
for col in ['primaryName']:
    dfi = unnested_drop.groupby(['knownForTitles'])[col].apply(list)
    df_uni.append(dfi)
df_grouped = pd.concat(df_uni, axis=1).reset_index()
df_grouped.columns = ['knownForTitles','cast_name']
#join the movie table with the cast table
base_df = pd.merge(df_grouped, movie_rating_df, left_on='knownForTitles', right_on='tconst', how='inner')
#join base_df with the director_writer table
base_df = pd.merge(base_df, director_writers, left_on='tconst', right_on='tconst', how='left')
print(base_df.head())
Output:
knownForTitles cast_name tconst titleType \
0 tt0011414 [Natalie Talmadge] tt0011414 movie
1 tt0011890 [Natalie Talmadge] tt0011890 movie
2 tt0014341 [Natalie Talmadge] tt0014341 movie
3 tt0018054 [Reeka Roberts] tt0018054 movie
4 tt0024151 [James Hackett] tt0024151 movie
primaryTitle originalTitle isAdult startYear \
0 The Love Expert The Love Expert 0 1920.0
1 Yes or No Yes or No 0 1920.0
2 Our Hospitality Our Hospitality 0 1923.0
3 The King of Kings The King of Kings 0 1927.0
4 I Cover the Waterfront I Cover the Waterfront 0 1933.0
endYear runtimeMinutes genres averageRating numVotes \
0 NaN 60.0 Comedy,Romance 4.9 136
1 NaN 72.0 NaN 6.3 7
2 NaN 65.0 Comedy,Romance,Thriller 7.8 9621
3 NaN 155.0 Biography,Drama,History 7.3 1826
4 NaN 80.0 Drama,Romance 6.3 455
director_name \
0 [David Kirkland]
1 [Roy William Neill]
2 [Buster Keaton, John G. Blystone]
3 [Cecil B. DeMille]
4 [James Cruze]
writer_name
0 [John Emerson, Anita Loos]
1 [Arthur F. Goodrich, Burns Mantle, Mary Murillo]
2 [Jean C. Havez, Clyde Bruckman, Joseph A. Mitc...
3 [Jeanie Macpherson]
4 [Max Miller, Wells Root, Jack Jevne]
After the table joins above, the next thing we will do is clean the resulting data.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
movie_rating_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/movie_rating_df.csv')
director_writers = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/directors_writers.csv')
director_writers['director_name'] = director_writers['director_name'].apply(lambda row: row.split(','))
director_writers['writer_name'] = director_writers['writer_name'].apply(lambda row: row.split(','))
name_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/actor_name.csv')
name_df = name_df[['nconst','primaryName','knownForTitles']]
name_df['knownForTitles'] = name_df['knownForTitles'].apply(lambda x: x.split(','))
df_uni = []
for x in ['knownForTitles']:
    idx = name_df.index.repeat(name_df['knownForTitles'].str.len())
    df1 = pd.DataFrame({
        x: np.concatenate(name_df[x].values)
    })
    df1.index = idx
    df_uni.append(df1)
df_concat = pd.concat(df_uni, axis=1)
unnested_df = df_concat.join(name_df.drop(['knownForTitles'], axis=1), how='left')
unnested_df = unnested_df[name_df.columns.tolist()]
unnested_drop = unnested_df.drop(['nconst'], axis=1)
df_uni = []
for col in ['primaryName']:
    dfi = unnested_drop.groupby(['knownForTitles'])[col].apply(list)
    df_uni.append(dfi)
df_grouped = pd.concat(df_uni, axis=1).reset_index()
df_grouped.columns = ['knownForTitles','cast_name']
base_df = pd.merge(df_grouped, movie_rating_df, left_on='knownForTitles', right_on='tconst', how='inner')
base_df = pd.merge(base_df, director_writers, left_on='tconst', right_on='tconst', how='left')
#drop the knownForTitles column
base_drop = base_df.drop(['knownForTitles'], axis=1)
print(base_drop.info())
#replace NULL values in director_name and writer_name with 'unknown'
base_drop[['director_name','writer_name']] = base_drop[['director_name','writer_name']].fillna('unknown')
#count the number of NULL values in each column
print(base_drop.isnull().sum())
#replace NULL values in the genres column with 'Unknown'
base_drop['genres'] = base_drop['genres'].fillna('Unknown')
#since genres holds multiple values, wrap each entry into a list (making the column a list of lists)
base_drop['genres'] = base_drop['genres'].apply(lambda x: x.split(','))
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1060 entries, 0 to 1059
Data columns (total 14 columns):
cast_name 1060 non-null object
tconst 1060 non-null object
titleType 1060 non-null object
primaryTitle 1060 non-null object
originalTitle 1060 non-null object
isAdult 1060 non-null int64
startYear 1060 non-null float64
endYear 110 non-null float64
runtimeMinutes 1060 non-null float64
genres 745 non-null object
averageRating 1060 non-null float64
numVotes 1060 non-null int64
director_name 986 non-null object
writer_name 986 non-null object
dtypes: float64(4), int64(2), object(8)
memory usage: 124.2+ KB
None
cast_name 0
tconst 0
titleType 0
primaryTitle 0
originalTitle 0
isAdult 0
startYear 0
endYear 950
runtimeMinutes 0
genres 315
averageRating 0
numVotes 0
director_name 0
writer_name 0
dtype: int64
Next, we will reformat the base_df table, from which several columns have already been dropped.
Hints:
Rename the following columns:
- primaryTitle -> title
- titleType -> type
- startYear -> start
- runtimeMinutes -> duration
- averageRating -> rating
- numVotes -> votes
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
movie_rating_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/movie_rating_df.csv')
director_writers = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/directors_writers.csv')
director_writers['director_name'] = director_writers['director_name'].apply(lambda row: row.split(','))
director_writers['writer_name'] = director_writers['writer_name'].apply(lambda row: row.split(','))
name_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/actor_name.csv')
name_df = name_df[['nconst','primaryName','knownForTitles']]
name_df['knownForTitles'] = name_df['knownForTitles'].apply(lambda x: x.split(','))
df_uni = []
for x in ['knownForTitles']:
    idx = name_df.index.repeat(name_df['knownForTitles'].str.len())
    df1 = pd.DataFrame({
        x: np.concatenate(name_df[x].values)
    })
    df1.index = idx
    df_uni.append(df1)
df_concat = pd.concat(df_uni, axis=1)
unnested_df = df_concat.join(name_df.drop(['knownForTitles'], axis=1), how='left')
unnested_df = unnested_df[name_df.columns.tolist()]
unnested_drop = unnested_df.drop(['nconst'], axis=1)
df_uni = []
for col in ['primaryName']:
    dfi = unnested_drop.groupby(['knownForTitles'])[col].apply(list)
    df_uni.append(dfi)
df_grouped = pd.concat(df_uni, axis=1).reset_index()
df_grouped.columns = ['knownForTitles','cast_name']
base_df = pd.merge(df_grouped, movie_rating_df, left_on='knownForTitles', right_on='tconst', how='inner')
base_df = pd.merge(base_df, director_writers, left_on='tconst', right_on='tconst', how='left')
base_drop = base_df.drop(['knownForTitles'], axis=1)
base_drop['genres'] = base_drop['genres'].fillna('Unknown')
base_drop[['director_name','writer_name']] = base_drop[['director_name','writer_name']].fillna('unknown')
base_drop['genres'] = base_drop['genres'].apply(lambda x: x.split(','))
#drop the tconst, isAdult, endYear, and originalTitle columns
base_drop2 = base_drop.drop(['tconst','isAdult','endYear','originalTitle'], axis=1)
base_drop2 = base_drop2[['primaryTitle','titleType','startYear','runtimeMinutes','genres','averageRating','numVotes','cast_name','director_name','writer_name']]
#use the hints!
base_drop2.columns = ['title','type','start','duration','genres','rating','votes','cast_name','director_name','writer_name']
print(base_drop2.head())
Output:
title type start duration \
0 The Love Expert movie 1920.0 60.0
1 Yes or No movie 1920.0 72.0
2 Our Hospitality movie 1923.0 65.0
3 The King of Kings movie 1927.0 155.0
4 I Cover the Waterfront movie 1933.0 80.0
genres rating votes cast_name \
0 [Comedy, Romance] 4.9 136 [Natalie Talmadge]
1 [Unknown] 6.3 7 [Natalie Talmadge]
2 [Comedy, Romance, Thriller] 7.8 9621 [Natalie Talmadge]
3 [Biography, Drama, History] 7.3 1826 [Reeka Roberts]
4 [Drama, Romance] 6.3 455 [James Hackett]
director_name \
0 [David Kirkland]
1 [Roy William Neill]
2 [Buster Keaton, John G. Blystone]
3 [Cecil B. DeMille]
4 [James Cruze]
writer_name
0 [John Emerson, Anita Loos]
1 [Arthur F. Goodrich, Burns Mantle, Mary Murillo]
2 [Jean C. Havez, Clyde Bruckman, Joseph A. Mitc...
3 [Jeanie Macpherson]
4 [Max Miller, Wells Root, Jack Jevne]
We will classify based on the metadata genres, primaryName (cast name), director_name, and writer_name.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
movie_rating_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/movie_rating_df.csv')
director_writers = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/directors_writers.csv')
director_writers['director_name'] = director_writers['director_name'].apply(lambda row: row.split(','))
director_writers['writer_name'] = director_writers['writer_name'].apply(lambda row: row.split(','))
name_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/actor_name.csv')
name_df = name_df[['nconst','primaryName','knownForTitles']]
name_df['knownForTitles'] = name_df['knownForTitles'].apply(lambda x: x.split(','))
df_uni = []
for x in ['knownForTitles']:
    idx = name_df.index.repeat(name_df['knownForTitles'].str.len())
    df1 = pd.DataFrame({
        x: np.concatenate(name_df[x].values)
    })
    df1.index = idx
    df_uni.append(df1)
df_concat = pd.concat(df_uni, axis=1)
unnested_df = df_concat.join(name_df.drop(['knownForTitles'], axis=1), how='left')
unnested_df = unnested_df[name_df.columns.tolist()]
unnested_drop = unnested_df.drop(['nconst'], axis=1)
df_uni = []
for col in ['primaryName']:
    dfi = unnested_drop.groupby(['knownForTitles'])[col].apply(list)
    df_uni.append(dfi)
df_grouped = pd.concat(df_uni, axis=1).reset_index()
df_grouped.columns = ['knownForTitles','cast_name']
base_df = pd.merge(df_grouped, movie_rating_df, left_on='knownForTitles', right_on='tconst', how='inner')
base_df = pd.merge(base_df, director_writers, left_on='tconst', right_on='tconst', how='left')
base_drop = base_df.drop(['knownForTitles'], axis=1)
base_drop['genres'] = base_drop['genres'].fillna('Unknown')
base_drop[['director_name','writer_name']] = base_drop[['director_name','writer_name']].fillna('unknown')
base_drop['genres'] = base_drop['genres'].apply(lambda x: x.split(','))
base_drop2 = base_drop.drop(['tconst','isAdult','endYear','originalTitle'], axis=1)
base_drop2 = base_drop2[['primaryTitle','titleType','startYear','runtimeMinutes','genres','averageRating','numVotes','cast_name','director_name','writer_name']]
base_drop2.columns = ['title','type','start','duration','genres','rating','votes','cast_name','director_name','writer_name']
#classify based on title, cast_name, genres, director_name, and writer_name
feature_df = base_drop2[['title','cast_name','genres','director_name','writer_name']]
#display the top five rows
print(feature_df.head())
Output:
title cast_name genres \
0 The Love Expert [Natalie Talmadge] [Comedy, Romance]
1 Yes or No [Natalie Talmadge] [Unknown]
2 Our Hospitality [Natalie Talmadge] [Comedy, Romance, Thriller]
3 The King of Kings [Reeka Roberts] [Biography, Drama, History]
4 I Cover the Waterfront [James Hackett] [Drama, Romance]
director_name \
0 [David Kirkland]
1 [Roy William Neill]
2 [Buster Keaton, John G. Blystone]
3 [Cecil B. DeMille]
4 [James Cruze]
writer_name
0 [John Emerson, Anita Loos]
1 [Arthur F. Goodrich, Burns Mantle, Mary Murillo]
2 [Jean C. Havez, Clyde Bruckman, Joseph A. Mitc...
3 [Jeanie Macpherson]
4 [Max Miller, Wells Root, Jack Jevne]
Complete the sanitize function, which is used to strip the spaces from every row and every element.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
movie_rating_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/movie_rating_df.csv')
director_writers = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/directors_writers.csv')
director_writers['director_name'] = director_writers['director_name'].apply(lambda row: row.split(','))
director_writers['writer_name'] = director_writers['writer_name'].apply(lambda row: row.split(','))
name_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/actor_name.csv')
name_df = name_df[['nconst','primaryName','knownForTitles']]
name_df['knownForTitles'] = name_df['knownForTitles'].apply(lambda x: x.split(','))
df_uni = []
for x in ['knownForTitles']:
    idx = name_df.index.repeat(name_df['knownForTitles'].str.len())
    df1 = pd.DataFrame({
        x: np.concatenate(name_df[x].values)
    })
    df1.index = idx
    df_uni.append(df1)
df_concat = pd.concat(df_uni, axis=1)
unnested_df = df_concat.join(name_df.drop(['knownForTitles'], axis=1), how='left')
unnested_df = unnested_df[name_df.columns.tolist()]
unnested_drop = unnested_df.drop(['nconst'], axis=1)
df_uni = []
for col in ['primaryName']:
    dfi = unnested_drop.groupby(['knownForTitles'])[col].apply(list)
    df_uni.append(dfi)
df_grouped = pd.concat(df_uni, axis=1).reset_index()
df_grouped.columns = ['knownForTitles','cast_name']
base_df = pd.merge(df_grouped, movie_rating_df, left_on='knownForTitles', right_on='tconst', how='inner')
base_df = pd.merge(base_df, director_writers, left_on='tconst', right_on='tconst', how='left')
base_drop = base_df.drop(['knownForTitles'], axis=1)
base_drop['genres'] = base_drop['genres'].fillna('Unknown')
base_drop[['director_name','writer_name']] = base_drop[['director_name','writer_name']].fillna('unknown')
base_drop['genres'] = base_drop['genres'].apply(lambda x: x.split(','))
base_drop2 = base_drop.drop(['tconst','isAdult','endYear','originalTitle'], axis=1)
base_drop2 = base_drop2[['primaryTitle','titleType','startYear','runtimeMinutes','genres','averageRating','numVotes','cast_name','director_name','writer_name']]
base_drop2.columns = ['title','type','start','duration','genres','rating','votes','cast_name','director_name','writer_name']
#.copy() avoids pandas' SettingWithCopyWarning when assigning to the columns below
feature_df = base_drop2[['title','cast_name','genres','director_name','writer_name']].copy()
def sanitize(x):
    try:
        #if the cell contains a list
        if isinstance(x, list):
            return [i.replace(' ','').lower() for i in x]
        #if the cell contains a string
        else:
            return [x.replace(' ','').lower()]
    except:
        print(x)
feature_cols = ['cast_name','genres','writer_name','director_name']
for col in feature_cols:
    feature_df[col] = feature_df[col].apply(sanitize)
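To see what sanitize does, here is a self-contained restatement of the function above, applied to sample values (the names are taken from the earlier output; the bare string is a made-up case):

```python
def sanitize(x):
    # strip spaces and lowercase, whether the cell holds a list or a single string
    if isinstance(x, list):
        return [i.replace(' ', '').lower() for i in x]
    return [x.replace(' ', '').lower()]

print(sanitize(['Buster Keaton', 'John G. Blystone']))  # ['busterkeaton', 'johng.blystone']
print(sanitize('Comedy'))                               # ['comedy']
```

Collapsing "Buster Keaton" into "busterkeaton" matters later: the vectorizer treats each space-separated token as a word, so a full name must become a single token to count as one feature.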
Complete the soup_feature function, which is used to combine all the features into one string.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
movie_rating_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/movie_rating_df.csv')
director_writers = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/directors_writers.csv')
director_writers['director_name'] = director_writers['director_name'].apply(lambda row: row.split(','))
director_writers['writer_name'] = director_writers['writer_name'].apply(lambda row: row.split(','))
name_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/actor_name.csv')
name_df = name_df[['nconst','primaryName','knownForTitles']]
name_df['knownForTitles'] = name_df['knownForTitles'].apply(lambda x: x.split(','))
df_uni = []
for x in ['knownForTitles']:
    idx = name_df.index.repeat(name_df['knownForTitles'].str.len())
    df1 = pd.DataFrame({
        x: np.concatenate(name_df[x].values)
    })
    df1.index = idx
    df_uni.append(df1)
df_concat = pd.concat(df_uni, axis=1)
unnested_df = df_concat.join(name_df.drop(['knownForTitles'], axis=1), how='left')
unnested_df = unnested_df[name_df.columns.tolist()]
unnested_drop = unnested_df.drop(['nconst'], axis=1)
df_uni = []
for col in ['primaryName']:
    dfi = unnested_drop.groupby(['knownForTitles'])[col].apply(list)
    df_uni.append(dfi)
df_grouped = pd.concat(df_uni, axis=1).reset_index()
df_grouped.columns = ['knownForTitles','cast_name']
base_df = pd.merge(df_grouped, movie_rating_df, left_on='knownForTitles', right_on='tconst', how='inner')
base_df = pd.merge(base_df, director_writers, left_on='tconst', right_on='tconst', how='left')
base_drop = base_df.drop(['knownForTitles'], axis=1)
base_drop['genres'] = base_drop['genres'].fillna('Unknown')
base_drop[['director_name','writer_name']] = base_drop[['director_name','writer_name']].fillna('unknown')
base_drop['genres'] = base_drop['genres'].apply(lambda x: x.split(','))
base_drop2 = base_drop.drop(['tconst','isAdult','endYear','originalTitle'], axis=1)
base_drop2 = base_drop2[['primaryTitle','titleType','startYear','runtimeMinutes','genres','averageRating','numVotes','cast_name','director_name','writer_name']]
base_drop2.columns = ['title','type','start','duration','genres','rating','votes','cast_name','director_name','writer_name']
feature_df = base_drop2[['title','cast_name','genres','director_name','writer_name']].copy()
def sanitize(x):
    try:
        if isinstance(x, list):
            return [i.replace(' ','').lower() for i in x]
        else:
            return [x.replace(' ','').lower()]
    except:
        print(x)
feature_cols = ['cast_name','genres','writer_name','director_name']
for col in feature_cols:
    feature_df[col] = feature_df[col].apply(sanitize)
def soup_feature(x):
    #join every feature list into one space-separated string
    return ' '.join(x['cast_name']) + ' ' + ' '.join(x['genres']) + ' ' + ' '.join(x['director_name']) + ' ' + ' '.join(x['writer_name'])
#store the soup in its own column
feature_df['soup'] = feature_df.apply(soup_feature, axis=1)
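To make the shape of the soup concrete, here is the same join applied to one hypothetical row whose values have already been through sanitize (the values are illustrative, not from the dataset):

```python
def soup_feature(x):
    # concatenate all feature lists into a single space-separated string
    return ' '.join(x['cast_name']) + ' ' + ' '.join(x['genres']) + ' ' \
        + ' '.join(x['director_name']) + ' ' + ' '.join(x['writer_name'])

# Hypothetical sanitized row, for illustration only
row = {
    'cast_name': ['busterkeaton'],
    'genres': ['comedy', 'romance'],
    'director_name': ['busterkeaton', 'johng.blystone'],
    'writer_name': ['jeanc.havez'],
}
print(soup_feature(row))
# busterkeaton comedy romance busterkeaton johng.blystone jeanc.havez
```

Each film is thus reduced to one flat "document" of tokens, which is exactly the input format a vectorizer expects.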
CountVectorizer is the simplest type of vectorizer. It is easiest to explain with the example below:
imagine three texts A, B, and C, whose contents are
- A: The Sun is a star
- B: My Love is like a red, red rose
- C: Mary had a little lamb
Now we must convert these texts into vector form using CountVectorizer. The steps are: first, compute the size of the vocabulary. The vocabulary is the set of unique words that occur across the texts.
Therefore, the vocabulary of this set of three texts is: the, sun, is, a, star, my, love, like, red, rose, mary, had, little, lamb. In total, the vocabulary size is 14.
However, we usually do not include (English) stop words such as as, is, a, the, and so on, because they are far too common.
After eliminating stop words, our clean vocabulary, sorted alphabetically in ascending order, is: lamb, like, little, love, mary, red, rose, star, sun (9 words).
So, using CountVectorizer, the result we get is:
A: (0,0,0,0,0,0,0,1,1), consisting of star:1, sun:1
B: (0,1,0,1,0,2,1,0,0), consisting of like:1, love:1, red:2, rose:1
C: (1,0,1,0,1,0,0,0,0), consisting of lamb:1, little:1, mary:1
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
movie_rating_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/movie_rating_df.csv')
director_writers = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/directors_writers.csv')
director_writers['director_name'] = director_writers['director_name'].apply(lambda row: row.split(','))
director_writers['writer_name'] = director_writers['writer_name'].apply(lambda row: row.split(','))
name_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/actor_name.csv')
name_df = name_df[['nconst','primaryName','knownForTitles']]
name_df['knownForTitles'] = name_df['knownForTitles'].apply(lambda x: x.split(','))
df_uni = []
for x in ['knownForTitles']:
idx = name_df.index.repeat(name_df['knownForTitles'].str.len())
df1 = pd.DataFrame({
x: np.concatenate(name_df[x].values)
})
df1.index = idx
df_uni.append(df1)
df_concat = pd.concat(df_uni, axis=1)
unnested_df = df_concat.join(name_df.drop(['knownForTitles'], axis=1), how='left')
unnested_df = unnested_df[name_df.columns.tolist()]
unnested_drop = unnested_df.drop(['nconst'], axis=1)
df_uni = []
for col in ['primaryName']:
dfi = unnested_drop.groupby(['knownForTitles'])[col].apply(list)
df_uni.append(dfi)
df_grouped = pd.concat(df_uni, axis=1).reset_index()
df_grouped.columns = ['knownForTitles','cast_name']
base_df = pd.merge(df_grouped, movie_rating_df, left_on='knownForTitles', right_on='tconst', how='inner')
base_df = pd.merge(base_df, director_writers, left_on='tconst', right_on='tconst', how='left')
base_drop = base_df.drop(['knownForTitles'], axis=1)
base_drop['genres'] = base_drop['genres'].fillna('Unknown')
base_drop[['director_name','writer_name']] = base_drop[['director_name','writer_name']].fillna('unknown')
base_drop['genres'] = base_drop['genres'].apply(lambda x: x.split(','))
base_drop2 = base_drop.drop(['tconst','isAdult','endYear','originalTitle'], axis=1)
base_drop2 = base_drop2[['primaryTitle','titleType','startYear','runtimeMinutes','genres','averageRating','numVotes','cast_name','director_name','writer_name']]
base_drop2.columns = ['title','type','start','duration','genres','rating','votes','cast_name','director_name','writer_name']
feature_df = base_drop2[['title','cast_name','genres','director_name','writer_name']]
def sanitize(x):
try:
if isinstance(x, list):
return [i.replace(' ','').lower() for i in x]
else:
return [x.replace(' ','').lower()]
except:
print(x)
feature_cols = ['cast_name','genres','writer_name','director_name']
for col in feature_cols:
feature_df[col] = feature_df[col].apply(sanitize)
def soup_feature(x):
return ' '.join(x['cast_name']) + ' ' + ' '.join(x['genres']) + ' ' + ' '.join(x['director_name']) + ' ' + ' '.join(x['writer_name'])
feature_df['soup'] = feature_df.apply(soup_feature, axis=1)
from sklearn.feature_extraction.text import CountVectorizer
#define the CountVectorizer and transform the soup column into vector form
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(feature_df['soup'])
print(count)
print(count_matrix.shape)
Output:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words='english',
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
(1060, 10026)
In this step, we will compute the cosine similarity score of every pair of titles (over all possible pair combinations). In other words, we will build a 1060 x 1060 matrix, where the cell at row i and column j holds the similarity score between title i and title j. It is easy to see that this matrix is symmetric and that every element on its diagonal is 1, because that is the similarity score of each title with itself.
Cosine Similarity
In this part, we use the cosine similarity formula to build the model. The cosine score is very useful and easy to compute.
The output lies in the range -1 to 1. A score close to 1 means the two entities are very similar, while a score close to -1 means they are very different. (Because our count vectors contain no negative values, the scores in this project will in fact lie between 0 and 1.)
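As a minimal sketch of the formula itself (the vectors below are made-up toy counts, not rows from our count matrix):

```python
import numpy as np

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1, 1, 0, 0])  # toy word counts for "document" a
b = np.array([1, 0, 1, 1])  # toy word counts for "document" b

print(cosine(a, a))  # ~1.0 (up to floating-point rounding): identical vectors
print(cosine(a, b))  # 1/sqrt(6), about 0.408: one shared word out of several
```

`sklearn.metrics.pairwise.cosine_similarity` computes exactly this quantity for every pair of rows at once, which is what produces the square matrix in the next step.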
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
movie_rating_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/movie_rating_df.csv')
director_writers = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/directors_writers.csv')
director_writers['director_name'] = director_writers['director_name'].apply(lambda row: row.split(','))
director_writers['writer_name'] = director_writers['writer_name'].apply(lambda row: row.split(','))
name_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/actor_name.csv')
name_df = name_df[['nconst','primaryName','knownForTitles']]
name_df['knownForTitles'] = name_df['knownForTitles'].apply(lambda x: x.split(','))
df_uni = []
for x in ['knownForTitles']:
idx = name_df.index.repeat(name_df['knownForTitles'].str.len())
df1 = pd.DataFrame({
x: np.concatenate(name_df[x].values)
})
df1.index = idx
df_uni.append(df1)
df_concat = pd.concat(df_uni, axis=1)
unnested_df = df_concat.join(name_df.drop(['knownForTitles'], axis=1), how='left')
unnested_df = unnested_df[name_df.columns.tolist()]
unnested_drop = unnested_df.drop(['nconst'], axis=1)
df_uni = []
for col in ['primaryName']:
dfi = unnested_drop.groupby(['knownForTitles'])[col].apply(list)
df_uni.append(dfi)
df_grouped = pd.concat(df_uni, axis=1).reset_index()
df_grouped.columns = ['knownForTitles','cast_name']
base_df = pd.merge(df_grouped, movie_rating_df, left_on='knownForTitles', right_on='tconst', how='inner')
base_df = pd.merge(base_df, director_writers, left_on='tconst', right_on='tconst', how='left')
base_drop = base_df.drop(['knownForTitles'], axis=1)
base_drop['genres'] = base_drop['genres'].fillna('Unknown')
base_drop[['director_name','writer_name']] = base_drop[['director_name','writer_name']].fillna('unknown')
base_drop['genres'] = base_drop['genres'].apply(lambda x: x.split(','))
base_drop2 = base_drop.drop(['tconst','isAdult','endYear','originalTitle'], axis=1)
base_drop2 = base_drop2[['primaryTitle','titleType','startYear','runtimeMinutes','genres','averageRating','numVotes','cast_name','director_name','writer_name']]
base_drop2.columns = ['title','type','start','duration','genres','rating','votes','cast_name','director_name','writer_name']
feature_df = base_drop2[['title','cast_name','genres','director_name','writer_name']]
def sanitize(x):
try:
if isinstance(x, list):
return [i.replace(' ','').lower() for i in x]
else:
return [x.replace(' ','').lower()]
except:
print(x)
feature_cols = ['cast_name','genres','writer_name','director_name']
for col in feature_cols:
feature_df[col] = feature_df[col].apply(sanitize)
def soup_feature(x):
return ' '.join(x['cast_name']) + ' ' + ' '.join(x['genres']) + ' ' + ' '.join(x['director_name']) + ' ' + ' '.join(x['writer_name'])
feature_df['soup'] = feature_df.apply(soup_feature, axis=1)
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(feature_df['soup'])
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(count_matrix, count_matrix)
print(cosine_sim)
Output:
[[1. 0.15430335 0.35355339 ... 0. 0. 0.13608276]
[0.15430335 1. 0.10910895 ... 0. 0. 0. ]
[0.35355339 0.10910895 1. ... 0. 0.08703883 0.09622504]
...
[0. 0. 0. ... 1. 0. 0. ]
[0. 0. 0.08703883 ... 0. 1. 0.10050378]
[0.13608276 0. 0.09622504 ... 0. 0.10050378 1. ]]
The next task is to build a reverse mapping with the movie title as its index, so that we can look up a title's row position in the cosine similarity matrix.
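The idea of the reverse mapping can be shown on a tiny hypothetical table (titles below are placeholders):

```python
import pandas as pd

# mini table standing in for feature_df: row positions 0..2, one title per row
df = pd.DataFrame({'title': ['Movie A', 'Movie B', 'Movie C']})

# reverse mapping: title -> row position, matching the rows/columns of cosine_sim
indices = pd.Series(df.index, index=df['title']).drop_duplicates()

print(indices['Movie B'])  # 1
```

With this Series in hand, `indices[title]` gives the integer position we need to index into `cosine_sim`.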
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
movie_rating_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/movie_rating_df.csv')
director_writers = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/directors_writers.csv')
director_writers['director_name'] = director_writers['director_name'].apply(lambda row: row.split(','))
director_writers['writer_name'] = director_writers['writer_name'].apply(lambda row: row.split(','))
name_df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/actor_name.csv')
name_df = name_df[['nconst','primaryName','knownForTitles']]
name_df['knownForTitles'] = name_df['knownForTitles'].apply(lambda x: x.split(','))
df_uni = []
for x in ['knownForTitles']:
idx = name_df.index.repeat(name_df['knownForTitles'].str.len())
df1 = pd.DataFrame({
x: np.concatenate(name_df[x].values)
})
df1.index = idx
df_uni.append(df1)
df_concat = pd.concat(df_uni, axis=1)
unnested_df = df_concat.join(name_df.drop(['knownForTitles'], axis=1), how='left')
unnested_df = unnested_df[name_df.columns.tolist()]
unnested_drop = unnested_df.drop(['nconst'], axis=1)
df_uni = []
for col in ['primaryName']:
dfi = unnested_drop.groupby(['knownForTitles'])[col].apply(list)
df_uni.append(dfi)
df_grouped = pd.concat(df_uni, axis=1).reset_index()
df_grouped.columns = ['knownForTitles','cast_name']
base_df = pd.merge(df_grouped, movie_rating_df, left_on='knownForTitles', right_on='tconst', how='inner')
base_df = pd.merge(base_df, director_writers, left_on='tconst', right_on='tconst', how='left')
base_drop = base_df.drop(['knownForTitles'], axis=1)
base_drop['genres'] = base_drop['genres'].fillna('Unknown')
base_drop[['director_name','writer_name']] = base_drop[['director_name','writer_name']].fillna('unknown')
base_drop['genres'] = base_drop['genres'].apply(lambda x: x.split(','))
base_drop2 = base_drop.drop(['tconst','isAdult','endYear','originalTitle'], axis=1)
base_drop2 = base_drop2[['primaryTitle','titleType','startYear','runtimeMinutes','genres','averageRating','numVotes','cast_name','director_name','writer_name']]
base_drop2.columns = ['title','type','start','duration','genres','rating','votes','cast_name','director_name','writer_name']
feature_df = base_drop2[['title','cast_name','genres','director_name','writer_name']]
def sanitize(x):
try:
if isinstance(x, list):
return [i.replace(' ','').lower() for i in x]
else:
return [x.replace(' ','').lower()]
except:
print(x)
feature_cols = ['cast_name','genres','writer_name','director_name']
for col in feature_cols:
feature_df[col] = feature_df[col].apply(sanitize)
def soup_feature(x):
return ' '.join(x['cast_name']) + ' ' + ' '.join(x['genres']) + ' ' + ' '.join(x['director_name']) + ' ' + ' '.join(x['writer_name'])
feature_df['soup'] = feature_df.apply(soup_feature, axis=1)
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(feature_df['soup'])
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(count_matrix, count_matrix)
indices = pd.Series(feature_df.index, index=feature_df['title']).drop_duplicates()
def content_recommender(title):
    #get the index of the movie with the given title
    idx = indices[title]
    #turn that row of the cosine similarity array into a list of (position, score) pairs
    #hint: cosine_sim[idx]
    sim_scores = list(enumerate(cosine_sim[idx]))
    #sort the movies from highest to lowest similarity
    #hint: sorted(iter, key, reverse)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    #take items 2 through 11 (skip the first, which is always the movie itself)
    sim_scores = sim_scores[1:11]
    #get the indices of the titles appearing in sim_scores
    movie_indices = [i[0] for i in sim_scores]
    #with iloc, we can retrieve the rows at the positions in movie_indices
    return base_df.iloc[movie_indices]
print(content_recommender('The Lion King'))
Output:
knownForTitles cast_name tconst titleType \
848 tt3040964 [Cristina Carrión Márquez] tt3040964 movie
383 tt0286336 [Francisco Bretas] tt0286336 tvSeries
1002 tt7222086 [Hiroki Matsukawa] tt7222086 tvSeries
73 tt0075147 [Joaquín Parra] tt0075147 movie
232 tt0119051 [Chris Kosloski] tt0119051 movie
556 tt10068158 [Hiroki Matsukawa] tt10068158 movie
9 tt0028657 [Bernard Loftus] tt0028657 movie
191 tt0107875 [Simon Mayal] tt0107875 movie
803 tt2356464 [Sina Müller] tt2356464 movie
983 tt6270328 [Jo Boag] tt6270328 tvSeries
primaryTitle \
848 The Jungle Book
383 The Animals of Farthing Wood
1002 Made in Abyss
73 Robin and Marian
232 The Edge
556 Made in Abyss: Journey's Dawn
9 Boss of Lonely Valley
191 The Princess and the Goblin
803 Ostwind
983 The Skinner Boys: Guardians of the Lost Secrets
originalTitle isAdult startYear \
848 The Jungle Book 0 2016.0
383 The Animals of Farthing Wood 0 1993.0
1002 Made in Abyss 0 2017.0
73 Robin and Marian 0 1976.0
232 The Edge 0 1997.0
556 Made in Abyss: Tabidachi no Yoake 0 2019.0
9 Boss of Lonely Valley 0 1937.0
191 The Princess and the Goblin 0 1991.0
803 Ostwind 0 2013.0
983 The Skinner Boys: Guardians of the Lost Secrets 0 2014.0
endYear runtimeMinutes genres averageRating \
848 NaN 106.0 Adventure,Drama,Family 7.4
383 1995.0 25.0 Adventure,Animation,Drama 8.3
1002 NaN 325.0 Adventure,Animation,Drama 8.4
73 NaN 106.0 Adventure,Drama,Romance 6.5
232 NaN 117.0 Action,Adventure,Drama 6.9
556 NaN 139.0 Adventure,Animation,Fantasy 7.4
9 NaN 60.0 Action,Adventure,Drama 6.2
191 NaN 82.0 Adventure,Animation,Comedy 6.8
803 NaN 101.0 Adventure,Drama,Family 6.8
983 NaN 23.0 Adventure,Animation,Drama 7.8
numVotes director_name \
848 250994 [Jon Favreau]
383 3057 [Elphin Lloyd-Jones, Philippe Leclerc]
1002 4577 [Masayuki Kojima, Hitoshi Haga, Shinya Iino, T...
73 10830 [Richard Lester]
232 65673 [Lee Tamahori]
556 81 [Masayuki Kojima]
9 41 [Ray Taylor]
191 2350 [József Gémes]
803 1350 [Katja von Garnier]
983 12 [Pablo De La Torre, Eugene Linkov, Jo Boag]
writer_name
848 [Justin Marks, Rudyard Kipling]
383 [Valerie Georgeson, Colin Dann, Jenny McDade, ...
1002 [Akihito Tsukushi, Keigo Koyanagi, Hideyuki Ku...
73 [James Goldman]
232 [David Mamet]
556 [Akihito Tsukushi]
9 [Frances Guihan, Forrest Brown]
191 [Robin Lyons, George MacDonald]
803 [Kristina Magdalena Henn, Lea Schmidbauer]
983 [David Witt, John Derevlany, David Evans, Pete...