Merge pull request #1142 from SatyadevNtv/satyadevntv/geoimc

Include support for GeoIMC algorithm
recommenders-team · Jul 28, 2020 · 221feb5
2 parents 6f09f0f + 6a3a5aa
commit 221feb5
Showing 11 changed files with 1,145 additions and 0 deletions.
diff --git a/examples/00_quick_start/geoimc_movielens.ipynb b/examples/00_quick_start/geoimc_movielens.ipynb
@@ -0,0 +1,342 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Geometry Aware Inductive Matrix Completion (GeoIMC)\n",
+    "\n",
+    "GeoIMC is an inductive matrix completion algorithm based on the work of Jawanpuria et al. (2019) [1].\n",
+    "\n",
+    "Consider MovieLens-100K (ML100K). Let $X \\in \\mathbb{R}^{m \\times d_1}$ and $Z \\in \\mathbb{R}^{n \\times d_2}$ be the user and movie features, respectively, and let $M \\in \\mathbb{R}^{m \\times n}$ be the partially observed ratings matrix. GeoIMC models this matrix as $M = XUBV^TZ^T$, where $U \\in \\mathbb{R}^{d_1 \\times k}$ and $V \\in \\mathbb{R}^{d_2 \\times k}$ are orthogonal (their columns are orthonormal) and $B \\in \\mathbb{R}^{k \\times k}$ is symmetric positive-definite. The resulting optimization problem is solved with Pymanopt.\n",
+    "\n",
+    "\n",
+    "This notebook shows how to use and evaluate the GeoIMC implementation in **reco_utils**.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "import tempfile\n",
+    "import zipfile\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import papermill as pm\n",
+    "sys.path.append(\"../../\")\n",
+    "sys.path.append(\"../../reco_utils/recommender/geoimc/\")\n",
+    "\n",
+    "from reco_utils.dataset import movielens\n",
+    "from reco_utils.recommender.geoimc.geoimc_data import ML_100K\n",
+    "from reco_utils.recommender.geoimc.geoimc_algorithm import IMCProblem\n",
+    "from reco_utils.recommender.geoimc.geoimc_predict import Inferer\n",
+    "from reco_utils.recommender.geoimc.geoimc_utils import download_geoimc_features\n",
+    "from reco_utils.evaluation.python_evaluation import (\n",
+    "    rmse, mae\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Choose the MovieLens dataset\n",
+    "MOVIELENS_DATA_SIZE = '100k'\n",
+    "# Normalize user, item features\n",
+    "normalize = True\n",
+    "# Rank (k) of the model\n",
+    "rank = 300\n",
+    "# Regularization parameter\n",
+    "regularizer = 1e-3\n",
+    "\n",
+    "# URL to download geoimc ML100K features\n",
+    "FEATURES_URL = \"https://recodatasets.blob.core.windows.net/geoimc/ml100k-features\"\n",
+    "\n",
+    "# Parameters for algorithm convergence\n",
+    "max_iters = 150000\n",
+    "max_time = 1000\n",
+    "verbosity = 1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Download ML100K dataset and features"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 4.81k/4.81k [00:09<00:00, 519KB/s]\n",
+      "100%|██████████| 82.0/82.0 [00:01<00:00, 53.7KB/s]\n",
+      "100%|██████████| 31.0/31.0 [00:00<00:00, 59.7KB/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Create a directory to download ML100K\n",
+    "dp = tempfile.mkdtemp(suffix='-geoimc')\n",
+    "movielens.download_movielens(MOVIELENS_DATA_SIZE, f\"{dp}/ml-100k.zip\")\n",
+    "with zipfile.ZipFile(f\"{dp}/ml-100k.zip\", 'r') as z:\n",
+    "    z.extractall(dp)\n",
+    "\n",
+    "download_geoimc_features(FEATURES_URL, [\"item-features.smat\", \"user-features.smat\"], f\"{dp}/ml-100k/\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Load the dataset using the example features provided in helpers\n",
+    "\n",
+    "The features were generated using the same method as Xin Dong et al. (2017) [2]."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = ML_100K(\n",
+    "    normalize=normalize,\n",
+    "    target_transform='binarize'\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset.load_data(\n",
+    "    f\"{dp}/ml-100k/\", \n",
+    "    f\"{dp}/ml-100k/user-features.smat\",\n",
+    "    f\"{dp}/ml-100k/item-features.smat\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Characteristics:\n",
+      "\n",
+      "              target: (943, 1682)\n",
+      "              entities: (943, 1822), (1682, 1923)\n",
+      "\n",
+      "              training: (80000,)\n",
+      "              training_entities: (943, 1822), (1682, 1923)\n",
+      "\n",
+      "              testing: (20000,)\n",
+      "              test_entities: (943, 1822), (1682, 1923)\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(f\"\"\"Characteristics:\n",
+    "\n",
+    "              target: {dataset.training_data.data.shape}\n",
+    "              entities: {dataset.entities[0].shape}, {dataset.entities[1].shape}\n",
+    "\n",
+    "              training: {dataset.training_data.get_data().data.shape}\n",
+    "              training_entities: {dataset.training_data.get_entity(\"row\").shape}, {dataset.training_data.get_entity(\"col\").shape}\n",
+    "\n",
+    "              testing: {dataset.test_data.get_data().data.shape}\n",
+    "              test_entities: {dataset.test_data.get_entity(\"row\").shape}, {dataset.test_data.get_entity(\"col\").shape}\n",
+    "\"\"\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Initialize the IMC problem"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "np.random.seed(10)\n",
+    "prblm = IMCProblem(\n",
+    "    dataset.training_data,\n",
+    "    lambda1=regularizer,\n",
+    "    rank=rank\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Optimizing...\n",
+      "Terminated - max time reached after 1844 iterations.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Solve the optimization problem\n",
+    "prblm.solve(\n",
+    "    max_time,\n",
+    "    max_iters,\n",
+    "    verbosity\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Initialize an inferer\n",
+    "inferer = Inferer(\n",
+    "    method='dot'\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Predict using the parametrized matrices\n",
+    "predictions = inferer.infer(\n",
+    "    dataset.test_data,\n",
+    "    prblm.W\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Prepare the test, predicted dataframes\n",
+    "user_ids = dataset.test_data.get_data().tocoo().row\n",
+    "item_ids = dataset.test_data.get_data().tocoo().col\n",
+    "test_df = pd.DataFrame(\n",
+    "    data={\n",
+    "        \"userID\": user_ids,\n",
+    "        \"itemID\": item_ids,\n",
+    "        \"rating\": dataset.test_data.get_data().data\n",
+    "    }\n",
+    ")\n",
+    "predictions_df = pd.DataFrame(\n",
+    "    data={\n",
+    "        \"userID\": user_ids,\n",
+    "        \"itemID\": item_ids,\n",
+    "        \"prediction\": [predictions[uid, iid] for uid, iid in zip(user_ids, item_ids)]\n",
+    "    }\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "RMSE: 0.49632302257817473\n",
+      "MAE: 0.47524377750493757\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Calculate RMSE\n",
+    "RMSE = rmse(\n",
+    "    test_df,\n",
+    "    predictions_df\n",
+    ")\n",
+    "# Calculate MAE\n",
+    "MAE = mae(\n",
+    "    test_df,\n",
+    "    predictions_df\n",
+    ")\n",
+    "print(f\"\"\"\n",
+    "RMSE: {RMSE}\n",
+    "MAE: {MAE}\n",
+    "\"\"\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pm.record(\"rmse\", RMSE)\n",
+    "pm.record(\"mae\", MAE)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## References\n",
+    "\n",
+    "[1] Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, Bamdev Mishra. _[Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach](https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00257)_. Transactions of the Association for Computational Linguistics (TACL), Volume 7, p.107-120, 2019.\n",
+    "\n",
+    "[2] Xin Dong, Lei Yu, Zhonghuo Wu, Yuxia Sun, Lingfeng Yuan, Fangxi Zhang. _[A Hybrid Collaborative Filtering Model with Deep Structure for Recommender Systems](https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14676/13916)_.\n",
+    "Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), p.1309-1315, 2017."
+   ]
+  }
+ ],
+ "metadata": {
+  "celltoolbar": "Tags",
+  "kernelspec": {
+   "display_name": "Python (reco)",
+   "language": "python",
+   "name": "reco_base"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/reco_utils/common/python_utils.py b/reco_utils/common/python_utils.py
@@ -101,3 +101,18 @@ def get_top_k_scored_items(scores, top_k, sort_top_k=False):
         top_scores = top_scores[test_user_idx, sort_ind]
 
     return np.array(top_items), np.array(top_scores)
+
+
+def binarize(a, threshold):
+    """Binarize the values.
+
+    Args:
+        a (np.ndarray): Input array that needs to be binarized.
+        threshold (float): Values strictly greater than this are set to 1.0; all others are set to 0.0.
+    """
+    return np.where(
+        a > threshold,
+        1.0,
+        0.0
+    )
+
diff --git a/reco_utils/recommender/geoimc/__init__.py b/reco_utils/recommender/geoimc/__init__.py