Merge pull request #1142 from SatyadevNtv/satyadevntv/geoimc
Include support for GeoIMC algorithm
miguelgfierro authored Jul 28, 2020
2 parents 6f09f0f + 6a3a5aa commit 221feb5
Showing 11 changed files with 1,145 additions and 0 deletions.
342 changes: 342 additions & 0 deletions examples/00_quick_start/geoimc_movielens.ipynb
@@ -0,0 +1,342 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Geometry Aware Inductive Matrix Completion (GeoIMC)\n",
"\n",
"GeoIMC is an inductive matrix completion algorithm based on the works by Jawanpuria et al. (2019)\n",
"\n",
"Consider the case of MovieLens-100K (ML100K), Let $X \\in R^{m \\times d_1}, Z \\in R^{n \\times d_2} $ be the features of users and movies respectively. Let $M \\in R^{m \\times n}$, be the partially observed ratings matrix. GeoIMC models this matrix as $M = XUBV^TZ^T$, where $U \\in R^{d_1 \\times k}, V \\in R^{d_2 \\times k}, B \\in R^{k \\times k}$ are Orthogonal, Orthogonal, Symmetric Positive-Definite matrices respectively. This Optimization problem is solved by using Pymanopt.\n",
"\n",
"\n",
"This notebook provides an example of how to utilize and evaluate GeoIMC implementation in **reco_utils**\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import tempfile\n",
"import zipfile\n",
"import pandas as pd\n",
"import numpy as np\n",
"import papermill as pm\n",
"sys.path.append(\"../../\")\n",
"sys.path.append(\"../../reco_utils/recommender/geoimc/\")\n",
"\n",
"from reco_utils.dataset import movielens\n",
"from reco_utils.recommender.geoimc.geoimc_data import ML_100K\n",
"from reco_utils.recommender.geoimc.geoimc_algorithm import IMCProblem\n",
"from reco_utils.recommender.geoimc.geoimc_predict import Inferer\n",
"from reco_utils.recommender.geoimc.geoimc_utils import download_geoimc_features\n",
"from reco_utils.evaluation.python_evaluation import (\n",
" rmse, mae\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Choose the MovieLens dataset\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"# Normalize user, item features\n",
"normalize = True\n",
"# Rank (k) of the model\n",
"rank = 300\n",
"# Regularization parameter\n",
"regularizer = 1e-3\n",
"\n",
"# URL to download geoimc ML100K features\n",
"FEATURES_URL = \"https://recodatasets.blob.core.windows.net/geoimc/ml100k-features\"\n",
"\n",
"# Parameters for algorithm convergence\n",
"max_iters = 150000\n",
"max_time = 1000\n",
"verbosity = 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Download ML100K dataset and features"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 4.81k/4.81k [00:09<00:00, 519KB/s]\n",
"100%|██████████| 82.0/82.0 [00:01<00:00, 53.7KB/s]\n",
"100%|██████████| 31.0/31.0 [00:00<00:00, 59.7KB/s]\n"
]
}
],
"source": [
"# Create a directory to download ML100K\n",
"dp = tempfile.mkdtemp(suffix='-geoimc')\n",
"movielens.download_movielens(MOVIELENS_DATA_SIZE, f\"{dp}/ml-100k.zip\")\n",
"with zipfile.ZipFile(f\"{dp}/ml-100k.zip\", 'r') as z:\n",
" z.extractall(dp)\n",
"\n",
"download_geoimc_features(FEATURES_URL, [\"item-features.smat\", \"user-features.smat\"], f\"{dp}/ml-100k/\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Load the dataset using the example features provided in helpers\n",
"\n",
"The features were generated using the same method as the work by Xin Dong et al. (2017)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"dataset = ML_100K(\n",
" normalize=normalize,\n",
" target_transform='binarize'\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"dataset.load_data(\n",
" f\"{dp}/ml-100k/\", \n",
" f\"{dp}/ml-100k/user-features.smat\",\n",
" f\"{dp}/ml-100k/item-features.smat\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Characteristics:\n",
"\n",
" target: (943, 1682)\n",
" entities: (943, 1822), (1682, 1923)\n",
"\n",
" training: (80000,)\n",
" training_entities: (943, 1822), (1682, 1923)\n",
"\n",
" testing: (20000,)\n",
" test_entities: (943, 1822), (1682, 1923)\n",
"\n"
]
}
],
"source": [
"print(f\"\"\"Characteristics:\n",
"\n",
" target: {dataset.training_data.data.shape}\n",
" entities: {dataset.entities[0].shape}, {dataset.entities[1].shape}\n",
"\n",
" training: {dataset.training_data.get_data().data.shape}\n",
" training_entities: {dataset.training_data.get_entity(\"row\").shape}, {dataset.training_data.get_entity(\"col\").shape}\n",
"\n",
" testing: {dataset.test_data.get_data().data.shape}\n",
" test_entities: {dataset.test_data.get_entity(\"row\").shape}, {dataset.test_data.get_entity(\"col\").shape}\n",
"\"\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Initialize the IMC problem"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(10)\n",
"prblm = IMCProblem(\n",
" dataset.training_data,\n",
" lambda1=regularizer,\n",
" rank=rank\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimizing...\n",
"Terminated - max time reached after 1844 iterations.\n",
"\n"
]
}
],
"source": [
"# Solve the Optimization problem\n",
"prblm.solve(\n",
" max_time,\n",
" max_iters,\n",
" verbosity\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Initialize an inferer\n",
"inferer = Inferer(\n",
" method='dot'\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Predict using the parametrized matrices\n",
"predictions = inferer.infer(\n",
" dataset.test_data,\n",
" prblm.W\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Prepare the test, predicted dataframes\n",
"user_ids = dataset.test_data.get_data().tocoo().row\n",
"item_ids = dataset.test_data.get_data().tocoo().col\n",
"test_df = pd.DataFrame(\n",
" data={\n",
" \"userID\": user_ids,\n",
" \"itemID\": item_ids,\n",
" \"rating\": dataset.test_data.get_data().data\n",
" }\n",
")\n",
"predictions_df = pd.DataFrame(\n",
" data={\n",
" \"userID\": user_ids,\n",
" \"itemID\": item_ids,\n",
" \"prediction\": [predictions[uid, iid] for uid, iid in list(zip(user_ids, item_ids))]\n",
" }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RMSE: 0.49632302257817473\n",
"MAE: 0.47524377750493757\n",
"\n"
]
}
],
"source": [
"# Calculate RMSE\n",
"RMSE = rmse(\n",
" test_df,\n",
" predictions_df\n",
")\n",
"# Calculate MAE\n",
"MAE = mae(\n",
" test_df,\n",
" predictions_df\n",
")\n",
"print(f\"\"\"\n",
"RMSE: {RMSE}\n",
"MAE: {MAE}\n",
"\"\"\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pm.record(\"rmse\", RMSE)\n",
"pm.record(\"mae\", MAE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"[1] Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, Bamdev Mishra. _[Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach](https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00257)_. Transaction of the Association for Computational Linguistics (TACL), Volume 7, p.107-120, 2019.\n",
"\n",
"[2] Xin Dong, Lei Yu, Zhonghuo Wu, Yuxia Sun, Lingfeng Yuan, Fangxi Zhang. [A Hybrid Collaborative Filtering Model withDeep Structure for Recommender Systems](https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14676/13916).\n",
"Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), p.1309-1315, 2017."
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python (reco)",
"language": "python",
"name": "reco_base"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
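The factorization described in the notebook's introduction, $M = XUBV^TZ^T$, can be sketched with plain NumPy. This is only an illustration under stated assumptions: the helper names (`random_orthonormal`, `random_spd`, `geoimc_scores`) are hypothetical and are not part of reco_utils, the factors are random rather than learned, and "orthogonal" is read here as "having orthonormal columns", since $U$ and $V$ are not square.

```python
import numpy as np


def random_orthonormal(rng, d, k):
    """Return a d x k matrix with orthonormal columns (hypothetical helper)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q


def random_spd(rng, k):
    """Return a k x k symmetric positive-definite matrix (hypothetical helper)."""
    a = rng.standard_normal((k, k))
    # a @ a.T is positive semi-definite; adding k * I makes it positive-definite.
    return a @ a.T + k * np.eye(k)


def geoimc_scores(X, Z, U, B, V):
    """Compute the dense score matrix M = X U B V^T Z^T."""
    return (X @ U) @ B @ (Z @ V).T


rng = np.random.default_rng(0)
m, n, d1, d2, k = 6, 5, 4, 3, 2  # toy sizes, chosen only for illustration
X = rng.standard_normal((m, d1))  # user features
Z = rng.standard_normal((n, d2))  # item features
U = random_orthonormal(rng, d1, k)
V = random_orthonormal(rng, d2, k)
B = random_spd(rng, k)
M = geoimc_scores(X, Z, U, B, V)
print(M.shape)
```

Because every score passes through the $k \times k$ matrix $B$, the rank of $M$ is at most $k$, which is what the `rank` parameter in the notebook controls.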
15 changes: 15 additions & 0 deletions reco_utils/common/python_utils.py
@@ -101,3 +101,18 @@ def get_top_k_scored_items(scores, top_k, sort_top_k=False):
top_scores = top_scores[test_user_idx, sort_ind]

return np.array(top_items), np.array(top_scores)


def binarize(a, threshold):
    """Binarize the values.

    Args:
        a (np.ndarray): Input array that needs to be binarized.
        threshold (float): Values strictly greater than the threshold are set to 1, all others to 0.

    Returns:
        np.ndarray: Binarized array with the same shape as `a`.
    """
    return np.where(
        a > threshold,
        1.0,
        0.0
    )
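
A quick way to check the boundary behavior of `binarize`: the comparison is strict, so a value equal to the threshold maps to 0. The sketch below copies the function from this diff so that it runs standalone; the threshold of 3 and the sample ratings are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np


def binarize(a, threshold):
    # Local copy of the function added in this commit, so the example runs standalone.
    return np.where(a > threshold, 1.0, 0.0)


ratings = np.array([1, 2, 3, 4, 5])  # illustrative MovieLens-style ratings
labels = binarize(ratings, 3)  # threshold chosen only for illustration
print(labels)  # [0. 0. 0. 1. 1.]
```

The output is always a float array, regardless of the input dtype, because both replacement values are floats.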
