Merge pull request #1142 from SatyadevNtv/satyadevntv/geoimc
Include support for GeoIMC algorithm
Showing 11 changed files with 1,145 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,342 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Geometry Aware Inductive Matrix Completion (GeoIMC)\n", | ||
"\n", | ||
"GeoIMC is an inductive matrix completion algorithm based on the work of Jawanpuria et al. (2019) [1].\n", | ||
"\n", | ||
"Consider MovieLens-100K (ML100K). Let $X \\in R^{m \\times d_1}$ and $Z \\in R^{n \\times d_2}$ be the user and movie feature matrices, respectively, and let $M \\in R^{m \\times n}$ be the partially observed ratings matrix. GeoIMC models this matrix as $M = XUBV^TZ^T$, where $U \\in R^{d_1 \\times k}$ and $V \\in R^{d_2 \\times k}$ are orthogonal matrices (that is, they have orthonormal columns) and $B \\in R^{k \\times k}$ is a symmetric positive-definite matrix. The resulting optimization problem is solved using Pymanopt.\n", | ||
"\n", | ||
"\n", | ||
"This notebook shows how to use and evaluate the GeoIMC implementation in **reco_utils**.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import sys\n", | ||
"import tempfile\n", | ||
"import zipfile\n", | ||
"import pandas as pd\n", | ||
"import numpy as np\n", | ||
"import papermill as pm\n", | ||
"sys.path.append(\"../../\")\n", | ||
"sys.path.append(\"../../reco_utils/recommender/geoimc/\")\n", | ||
"\n", | ||
"from reco_utils.dataset import movielens\n", | ||
"from reco_utils.recommender.geoimc.geoimc_data import ML_100K\n", | ||
"from reco_utils.recommender.geoimc.geoimc_algorithm import IMCProblem\n", | ||
"from reco_utils.recommender.geoimc.geoimc_predict import Inferer\n", | ||
"from reco_utils.recommender.geoimc.geoimc_utils import download_geoimc_features\n", | ||
"from reco_utils.evaluation.python_evaluation import (\n", | ||
"    rmse, mae\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Choose the MovieLens dataset\n", | ||
"MOVIELENS_DATA_SIZE = '100k'\n", | ||
"# Normalize user, item features\n", | ||
"normalize = True\n", | ||
"# Rank (k) of the model\n", | ||
"rank = 300\n", | ||
"# Regularization parameter\n", | ||
"regularizer = 1e-3\n", | ||
"\n", | ||
"# URL to download geoimc ML100K features\n", | ||
"FEATURES_URL = \"https://recodatasets.blob.core.windows.net/geoimc/ml100k-features\"\n", | ||
"\n", | ||
"# Parameters for algorithm convergence\n", | ||
"max_iters = 150000\n", | ||
"max_time = 1000\n", | ||
"verbosity = 1" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 1. Download ML100K dataset and features" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"100%|██████████| 4.81k/4.81k [00:09<00:00, 519KB/s]\n", | ||
"100%|██████████| 82.0/82.0 [00:01<00:00, 53.7KB/s]\n", | ||
"100%|██████████| 31.0/31.0 [00:00<00:00, 59.7KB/s]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Create a directory to download ML100K\n", | ||
"dp = tempfile.mkdtemp(suffix='-geoimc')\n", | ||
"movielens.download_movielens(MOVIELENS_DATA_SIZE, f\"{dp}/ml-100k.zip\")\n", | ||
"with zipfile.ZipFile(f\"{dp}/ml-100k.zip\", 'r') as z:\n", | ||
"    z.extractall(dp)\n", | ||
"\n", | ||
"download_geoimc_features(FEATURES_URL, [\"item-features.smat\", \"user-features.smat\"], f\"{dp}/ml-100k/\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 2. Load the dataset using the example features provided in helpers\n", | ||
"\n", | ||
"The features were generated using the same method as in the work of Xin Dong et al. (2017) [2]." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"dataset = ML_100K(\n", | ||
"    normalize=normalize,\n", | ||
"    target_transform='binarize'\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"dataset.load_data(\n", | ||
"    f\"{dp}/ml-100k/\",\n", | ||
"    f\"{dp}/ml-100k/user-features.smat\",\n", | ||
"    f\"{dp}/ml-100k/item-features.smat\"\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Characteristics:\n", | ||
"\n", | ||
" target: (943, 1682)\n", | ||
" entities: (943, 1822), (1682, 1923)\n", | ||
"\n", | ||
" training: (80000,)\n", | ||
" training_entities: (943, 1822), (1682, 1923)\n", | ||
"\n", | ||
" testing: (20000,)\n", | ||
" test_entities: (943, 1822), (1682, 1923)\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print(f\"\"\"Characteristics:\n", | ||
"\n", | ||
" target: {dataset.training_data.data.shape}\n", | ||
" entities: {dataset.entities[0].shape}, {dataset.entities[1].shape}\n", | ||
"\n", | ||
" training: {dataset.training_data.get_data().data.shape}\n", | ||
" training_entities: {dataset.training_data.get_entity(\"row\").shape}, {dataset.training_data.get_entity(\"col\").shape}\n", | ||
"\n", | ||
" testing: {dataset.test_data.get_data().data.shape}\n", | ||
" test_entities: {dataset.test_data.get_entity(\"row\").shape}, {dataset.test_data.get_entity(\"col\").shape}\n", | ||
"\"\"\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 3. Initialize the IMC problem" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"np.random.seed(10)\n", | ||
"prblm = IMCProblem(\n", | ||
"    dataset.training_data,\n", | ||
"    lambda1=regularizer,\n", | ||
"    rank=rank\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Optimizing...\n", | ||
"Terminated - max time reached after 1844 iterations.\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Solve the optimization problem\n", | ||
"prblm.solve(\n", | ||
"    max_time,\n", | ||
"    max_iters,\n", | ||
"    verbosity\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Initialize an inferer\n", | ||
"inferer = Inferer(\n", | ||
"    method='dot'\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Predict using the parametrized matrices\n", | ||
"predictions = inferer.infer(\n", | ||
"    dataset.test_data,\n", | ||
"    prblm.W\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 11, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Prepare the test and prediction dataframes\n", | ||
"user_ids = dataset.test_data.get_data().tocoo().row\n", | ||
"item_ids = dataset.test_data.get_data().tocoo().col\n", | ||
"test_df = pd.DataFrame(\n", | ||
"    data={\n", | ||
"        \"userID\": user_ids,\n", | ||
"        \"itemID\": item_ids,\n", | ||
"        \"rating\": dataset.test_data.get_data().data\n", | ||
"    }\n", | ||
")\n", | ||
"predictions_df = pd.DataFrame(\n", | ||
"    data={\n", | ||
"        \"userID\": user_ids,\n", | ||
"        \"itemID\": item_ids,\n", | ||
"        \"prediction\": [predictions[uid, iid] for uid, iid in zip(user_ids, item_ids)]\n", | ||
"    }\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 12, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"\n", | ||
"RMSE: 0.49632302257817473\n", | ||
"MAE: 0.47524377750493757\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Calculate RMSE\n", | ||
"RMSE = rmse(\n", | ||
"    test_df,\n", | ||
"    predictions_df\n", | ||
")\n", | ||
"# Calculate MAE\n", | ||
"MAE = mae(\n", | ||
"    test_df,\n", | ||
"    predictions_df\n", | ||
")\n", | ||
"print(f\"\"\"\n", | ||
"RMSE: {RMSE}\n", | ||
"MAE: {MAE}\n", | ||
"\"\"\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"pm.record(\"rmse\", RMSE)\n", | ||
"pm.record(\"mae\", MAE)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## References\n", | ||
"\n", | ||
"[1] Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, Bamdev Mishra. _[Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach](https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00257)_. Transactions of the Association for Computational Linguistics (TACL), Volume 7, p.107-120, 2019.\n", | ||
"\n", | ||
"[2] Xin Dong, Lei Yu, Zhonghuo Wu, Yuxia Sun, Lingfeng Yuan, Fangxi Zhang. _[A Hybrid Collaborative Filtering Model with Deep Structure for Recommender Systems](https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14676/13916)_.\n", | ||
"Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), p.1309-1315, 2017." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"celltoolbar": "Tags", | ||
"kernelspec": { | ||
"display_name": "Python (reco)", | ||
"language": "python", | ||
"name": "reco_base" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.10" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
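
The factorization described in the notebook's introduction, $M = XUBV^TZ^T$, can be illustrated with a small NumPy sketch. This is not the `reco_utils` or Pymanopt code: the toy sizes, the random feature matrices, and the helper name `random_geoimc_factors` are assumptions made for illustration, and the factors are sampled at random rather than learned.

```python
import numpy as np


def random_geoimc_factors(d1, d2, k, seed=0):
    """Sample U, B, V with the structure the notebook describes.

    Hypothetical helper: U (d1 x k) and V (d2 x k) get orthonormal
    columns, B (k x k) is symmetric positive-definite. Nothing is learned.
    """
    rng = np.random.default_rng(seed)
    # The reduced QR factor of a Gaussian matrix has orthonormal columns.
    U, _ = np.linalg.qr(rng.standard_normal((d1, k)))
    V, _ = np.linalg.qr(rng.standard_normal((d2, k)))
    # A A^T + I is symmetric with eigenvalues >= 1, hence positive-definite.
    A = rng.standard_normal((k, k))
    B = A @ A.T + np.eye(k)
    return U, B, V


# Toy sizes (assumed): 6 users, 5 items, 4 user features, 3 item features, rank 2.
m, n, d1, d2, k = 6, 5, 4, 3, 2
rng = np.random.default_rng(1)
X = rng.standard_normal((m, d1))  # user features
Z = rng.standard_normal((n, d2))  # item features

U, B, V = random_geoimc_factors(d1, d2, k)
M_hat = X @ U @ B @ V.T @ Z.T  # modelled ratings matrix, shape (m, n)
print(M_hat.shape)
```

Because the product passes through the $k \times k$ matrix $B$, the modelled matrix has rank at most $k$ regardless of how many features the users and items have.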
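
The last code cells of the notebook report RMSE and MAE between the `rating` and `prediction` columns. As a rough illustration of the two metrics only (this is not the `reco_utils.evaluation.python_evaluation` code), the sketch below computes them for arrays that are assumed to be already aligned pair by pair; the function names and the toy values are made up.

```python
import numpy as np


def rmse_aligned(y_true, y_pred):
    """Root mean squared error of two equally long, aligned sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae_aligned(y_true, y_pred):
    """Mean absolute error of two equally long, aligned sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


# Toy values (assumed), not taken from the notebook's data.
ratings = [1.0, 0.0, 1.0, 1.0]
predicted = [0.5, 0.5, 0.5, 0.5]
print(rmse_aligned(ratings, predicted), mae_aligned(ratings, predicted))
```

RMSE penalizes large individual errors more heavily than MAE, so the two values coincide only when every absolute error is the same, as in this toy input.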