Use Python 3.12.
- First run `data_preparation.ipynb` to prepare the dataset (clean and process it). Note that it will take around 20 minutes due to the large dataset.
- Then run `recommender_system.ipynb`. Inside it you will find the recommender system explained in class plus some extra recommender systems with graphs; a rough sketch of the content-based idea is shown below.
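As a hedged illustration of the kind of content-based approach such a notebook might use (the notebooks contain the actual implementations; `cleaned_data.csv` plus the `title`/`desc` column names are assumptions here, and "The Hobbit" is just an example query):

```python
# Minimal content-based recommender sketch (illustrative only; the notebooks
# contain the real implementations). "cleaned_data.csv" and the "title"/"desc"
# column names are assumptions, not confirmed by this repository.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

books = pd.read_csv("cleaned_data.csv")

# Turn each book description into a TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english", max_features=20_000)
tfidf = vectorizer.fit_transform(books["desc"].fillna(""))

def similar_books(title: str, top_n: int = 5) -> pd.DataFrame:
    """Return the top_n books whose descriptions are most similar to `title`."""
    idx = books.index[books["title"] == title][0]
    scores = cosine_similarity(tfidf[idx], tfidf).ravel()
    best = scores.argsort()[::-1][1 : top_n + 1]  # skip the queried book itself
    return books.iloc[best][["title"]].assign(similarity=scores[best])

print(similar_books("The Hobbit"))  # example title, assuming it exists in the data
```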
The repository contains the following data files:

- `GoodReads_100k_books.csv`: the original dataset.
- `goodreads_with_languages.csv`: the original dataset with the addition of the language used in each book (processed by `utils/lang_detect.py`).
- `cleaned_data.csv`: the dataset after being cleaned and processed by `data_preparation.ipynb`.
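If you want a quick look at these files outside the notebooks, a small pandas snippet is enough (the `language` column name in `goodreads_with_languages.csv` is an assumption, not a confirmed header):

```python
# Quick inspection of the provided CSVs outside the notebooks.
import pandas as pd

raw = pd.read_csv("GoodReads_100k_books.csv")
with_lang = pd.read_csv("goodreads_with_languages.csv")

print(raw.shape, with_lang.shape)  # the second file should carry one extra column
print(with_lang["language"].value_counts().head(10))  # assumed column name: "language"
```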
Inside the `utils` folder you will find some helpful scripts used to prepare the dataset:

- `lang_detect.py`: looks at the title and description to detect the language used in every row, processes the original dataset, and saves the result in `goodreads_with_languages.csv`. It also reports some stats about the languages found (you don't have to run it separately; everything is called from the Jupyter notebook). A hedged sketch of this kind of detection follows the stats below.
Total unique language categories: 37
Language Detection Breakdown:
Total books: 100000
Books with content: 99713
Missing content: 1
Too short content: 272
Language detection failures: 14
Unexpected errors: 0
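The actual logic lives in `utils/lang_detect.py`; the following is only a hedged sketch of how per-row detection like this can be done, assuming the `langdetect` package and `title`/`desc` column names (neither is confirmed by this README):

```python
# Illustrative sketch only -- the real logic is in utils/lang_detect.py.
# Assumes the langdetect package and "title"/"desc" column names.
import pandas as pd
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

MIN_CHARS = 20  # assumed threshold for "too short content"

def detect_language(row: pd.Series) -> str:
    """Return a language code, or a category explaining why detection was skipped."""
    parts = [v for v in (row.get("title"), row.get("desc")) if isinstance(v, str)]
    text = " ".join(parts).strip()
    if not text:
        return "missing_content"
    if len(text) < MIN_CHARS:
        return "too_short"
    try:
        return detect(text)  # e.g. "en", "fr", "de"
    except LangDetectException:
        return "detection_failed"

books = pd.read_csv("GoodReads_100k_books.csv")
books["language"] = books.apply(detect_language, axis=1)
books.to_csv("goodreads_with_languages.csv", index=False)
print(books["language"].value_counts())
```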
- `nan.py`: prints out why the `lang_detect` script produced certain results for some books (e.g. missing or too-short content).
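As a hedged sketch of what such a diagnostic might look like (the real `nan.py` may work differently; the category labels and column names below are the same assumptions used in the sketch above):

```python
# Illustrative sketch only; the real diagnostics are in nan.py.
import pandas as pd

df = pd.read_csv("goodreads_with_languages.csv")

# Show the books that fell into a problem category and a hint about why.
problem_labels = ["missing_content", "too_short", "detection_failed"]
problems = df[df["language"].isin(problem_labels)]
for _, row in problems.iterrows():
    desc = row["desc"] if isinstance(row["desc"], str) else ""
    print(f"{row['title']!r}: {row['language']} (description length {len(desc)})")
```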
License: MIT