pichljan edited this page Apr 24, 2016 · 113 revisions

This page contains a variety of benchmark results of various features on various domains. To keep the size of this page in check, we remove obsolete measurements (either abandoned

Each dataset has train and test splits. Our primary comparison measures are Answer Production Recall (APR) and Mean Reciprocal Rank (MRR) on the test split (see below). (When talking to outsiders, accuracy-at-one is the easiest measure to use, but it is much noisier than MRR, so it is brittle for day-to-day evaluation.) Unless otherwise specified, our models are always retrained on the train split before measuring accuracy on the test split.

We are benchmarking on several datasets from two basic families. The TREC-based questions are general factoid questions of wide variety and character, answerable primarily from Wikipedia:

  • curated (https://github.com/brmson/dataset-factoid-curated) is a cleaned up version of the TREC dataset with some IRC-based questions from early user testing also added in. This is our primary "general QA" benchmark.
  • large2180 (in dev branch of https://github.com/brmson/dataset-factoid-curated) is a larger version of the TREC dataset that mixes some noisy questions into the curated dataset in the interest of having more data to train on. We use this dataset to test how our machine learning scales up.
  • trecnew-raw (in dev branch of https://github.com/brmson/dataset-factoid-curated) has just a test split and contains factoid questions like in curated, but not their cleaned up and filtered version, allowing a more realistic comparison with the "old" programs from the TREC challenges. We rarely use this dataset, just for some final benchmarks when writing papers.

Raw measurements for various historical commits are available from http://pasky.or.cz/dev/brmson/yodaqa-eval/ ...

The other family of datasets is originally based on WebQuestions, exhibiting more monotonous questions modelled around the Freebase knowledge base and always asking for entities (not for example for numbers):

Raw measurements for various historical commits are available from http://pasky.or.cz/dev/brmson/yodaqa-movies-eval/ ...

Typically, our tests are done using data/eval/train-and-eval.sh. When we need to benchmark without retraining, we use something like:

data/eval/_multistage-traineval.sh . trecnew-raw-test 0 0

Each split+commit combination results in three lines of data/eval/tsvout-stats.sh output. The important line for us is the one with the commit prefixed by u, as that's the initial pipeline stage (follow-up stages create user-friendly output, but typically overfit senselessly).
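As a rough illustration of picking out the initial-stage rows, the sketch below keeps only lines whose commit field starts with u. The function name `initial_stage_lines` is hypothetical and not part of the YodaQA scripts; it assumes whitespace-separated fields with the commit in second position, as in the rows on this page.

```python
def initial_stage_lines(lines):
    """Keep only stat rows whose commit field carries the 'u' prefix."""
    selected = []
    for line in lines:
        fields = line.split()
        # fields[0] is the dataset split, fields[1] the (possibly prefixed) commit.
        # Commit hashes are hexadecimal, so a leading 'u' can only be the stage prefix.
        if len(fields) >= 2 and fields[1].startswith("u"):
            selected.append(line)
    return selected


rows = [
    "large2470-te  36fcfa0 2016-02-28 data/eval/large2470*... 317/545/766 41.4%/71.1% mrr 0.498 avgtime 5670.130",
    "large2470-te u36fcfa0 2016-02-28 data/eval/large2470*... 296/636/766 38.6%/83.0% mrr 0.473 avgtime 4522.831",
    "large2470-te v36fcfa0 2016-02-28 data/eval/large2470*... 305/545/766 39.8%/71.1% mrr 0.488 avgtime 4833.464",
]
picked = initial_stage_lines(rows)
```

Rows that carry no commit hash at all (such as the trecnew-raw rows further down) are not handled by this sketch.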

The format of each line is:

dataset-split commit  commitdate Commit message          ans/irr/tot ACC1%/ APR% mrr 0.mrr avgtime xyz
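A minimal sketch of reading one such line into named fields, assuming the layout shown above. The regular expression, the function `parse_stats_line` and the key names are illustrative choices and are not taken from data/eval/tsvout-stats.sh.

```python
import re

# Key names such as "split" and "message" are labels chosen for this sketch.
STATS_LINE = re.compile(
    r"^(?P<split>\S+)\s+(?P<commit>\S+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<message>.*?)\s+"
    r"(?P<ans>\d+)/(?P<irr>\d+)/(?P<tot>\d+)\s+"
    r"(?P<acc1>[\d.]+)%/\s*(?P<apr>[\d.]+)%\s+"
    r"mrr\s+(?P<mrr>[\d.]+)\s+avgtime\s+(?P<avgtime>[\d.]+)"
)


def parse_stats_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    m = STATS_LINE.match(line)
    if m is None:
        return None
    row = m.groupdict()
    for key in ("ans", "irr", "tot"):
        row[key] = int(row[key])
    for key in ("acc1", "apr", "mrr", "avgtime"):
        row[key] = float(row[key])
    return row


example = ("large2470-te u36fcfa0 2016-02-28 data/eval/large2470*... "
           "296/636/766 38.6%/83.0% mrr 0.473 avgtime 4522.831")
row = parse_stats_line(example)
```

Any trailing note after the avgtime value, such as "(trained on large2470)", is ignored by this pattern.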

ans and ACC1 are the accuracy-at-one: the number (and percentage) of questions where the top answer is correct (since we attempt to answer all questions, this would be precision@100 in DeepQA parlance). APR is Answer Production Recall, i.e. the percentage of questions where the correct answer is generated as a hypothesis (irr is the corresponding question count, tot the total number of questions). MRR is the mean of reciprocal rank over all questions; a question with the top answer correct will have RR=1, a question with the second answer correct will have RR=0.5, etc. Ignore avgtime; unfortunately, it's currently garbage.
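To make these definitions concrete, here is a small sketch that computes the three measures from per-question ranks. It is not the project's evaluation code: the function `summarize` and its input format are hypothetical, and it assumes that a question whose correct answer was never generated contributes RR=0 to the mean. Reading irr as the count behind APR is inferred from the rows on this page, where APR% matches irr/tot.

```python
def summarize(ranks):
    """Compute accuracy-at-one, APR and MRR.

    ranks holds one entry per question: the 1-based rank of the first
    correct answer, or None if no correct answer was generated at all.
    """
    total = len(ranks)
    ans = sum(1 for r in ranks if r == 1)          # top answer correct
    irr = sum(1 for r in ranks if r is not None)   # correct answer generated
    # Assumption: questions without a correct answer add 0 to the sum.
    mrr = sum(1.0 / r for r in ranks if r is not None) / total
    return {
        "ans": ans,
        "irr": irr,
        "tot": total,
        "acc1": ans / total,
        "apr": irr / total,
        "mrr": mrr,
    }


# Five toy questions: two answered at rank 1, one at rank 2, one at rank 4,
# and one where the correct answer was never produced.
stats = summarize([1, 2, None, 1, 4])
```

For the toy input, the reciprocal ranks are 1, 0.5, 0, 1 and 0.25, so the mean is 2.75 / 5 = 0.55.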

Baselines

TREC-based baseline

We use the master branch to measure TREC-based QA.

v1.6 --- curated v2(!) APR 81.2%, MRR 0.452; large2470 APR 83.0%, MRR 0.473; large2180 APR 75.4%, MRR 0.395:

large2470-te  36fcfa0 2016-02-28 data/eval/large2470*... 317/545/766 41.4%/71.1% mrr 0.498 avgtime 5670.130
large2470-te u36fcfa0 2016-02-28 data/eval/large2470*... 296/636/766 38.6%/83.0% mrr 0.473 avgtime 4522.831
large2470-te v36fcfa0 2016-02-28 data/eval/large2470*... 305/545/766 39.8%/71.1% mrr 0.488 avgtime 4833.464
large2470-tr  36fcfa0 2016-02-28 data/eval/large2470*... 963/1228/1704 56.5%/72.1% mrr 0.627 avgtime 16620.185
large2470-tr u36fcfa0 2016-02-28 data/eval/large2470*... 668/1383/1704 39.2%/81.2% mrr 0.480 avgtime 13713.557
large2470-tr v36fcfa0 2016-02-28 data/eval/large2470*... 826/1228/1704 48.5%/72.1% mrr 0.568 avgtime 14780.889
large2180-te  f7d72fa 2016-02-27 AnswerScoreDecisionF... 238/434/694 34.3%/62.5% mrr 0.412 avgtime 5494.607 (trained on large2470)
large2180-te uf7d72fa 2016-02-27 AnswerScoreDecisionF... 217/523/694 31.3%/75.4% mrr 0.395 avgtime 4423.663
large2180-te vf7d72fa 2016-02-27 AnswerScoreDecisionF... 225/434/694 32.4%/62.5% mrr 0.401 avgtime 4725.367
curated-test  45e6eed 2016-02-27 data/eval/curated: A... 178/299/430 41.4%/69.5% mrr 0.487 avgtime 3568.611 (trained on large2470)
curated-test u45e6eed 2016-02-27 data/eval/curated: A... 157/349/430 36.5%/81.2% mrr 0.452 avgtime 2830.443
curated-test v45e6eed 2016-02-27 data/eval/curated: A... 171/299/430 39.8%/69.5% mrr 0.481 avgtime 3035.325

curated-test  5ffddf9 2016-04-01 AnswerFV +originPsgB... 158/296/430 36.7%/68.8% mrr 0.456 avgtime 8127.067
curated-test u5ffddf9 2016-04-01 AnswerFV +originPsgB... 156/348/430 36.3%/80.9% mrr 0.449 avgtime 5806.669
curated-test v5ffddf9 2016-04-01 AnswerFV +originPsgB... 157/296/430 36.5%/68.8% mrr 0.455 avgtime 5918.403
curated-trai  5ffddf9 2016-04-01 AnswerFV +originPsgB... 321/328/430 74.7%/76.3% mrr 0.754 avgtime 14838.503
curated-trai u5ffddf9 2016-04-01 AnswerFV +originPsgB... 186/360/430 43.3%/83.7% mrr 0.529 avgtime 11158.011
curated-trai v5ffddf9 2016-04-01 AnswerFV +originPsgB... 274/328/430 63.7%/76.3% mrr 0.692 avgtime 11364.936

v1.5 --- curated (v1 up to now!) APR 77.0%, MRR 0.408; large2180 APR 74.9%, MRR 0.383:

curated-test  ba7c567 2015-12-11 FindReqParse: Fix St... 138/279/430 32.1%/64.9% mrr 0.408 avgtime 4973.198
curated-test uba7c567 2015-12-11 FindReqParse: Fix St... 139/331/430 32.3%/77.0% mrr 0.408 avgtime 4719.761
curated-test vba7c567 2015-12-11 FindReqParse: Fix St... 136/279/430 31.6%/64.9% mrr 0.408 avgtime 4906.282
curated-trai  ba7c567 2015-12-11 FindReqParse: Fix St... 296/307/430 68.8%/71.4% mrr 0.698 avgtime 5625.474
curated-trai uba7c567 2015-12-11 FindReqParse: Fix St... 163/341/430 37.9%/79.3% mrr 0.471 avgtime 5247.320
curated-trai vba7c567 2015-12-11 FindReqParse: Fix St... 253/307/430 58.8%/71.4% mrr 0.642 avgtime 5526.486

large2180-te  69ecf84 2015-12-11 data/eval/large2180*... 213/426/694 30.7%/61.4% mrr 0.389 avgtime 7298.222
large2180-te u69ecf84 2015-12-11 data/eval/large2180*... 212/520/694 30.5%/74.9% mrr 0.383 avgtime 6918.574
large2180-te v69ecf84 2015-12-11 data/eval/large2180*... 211/426/694 30.4%/61.4% mrr 0.386 avgtime 7201.032
large2180-tr  69ecf84 2015-12-11 data/eval/large2180*... 715/917/1479 48.3%/62.0% mrr 0.540 avgtime 14670.285
large2180-tr u69ecf84 2015-12-11 data/eval/large2180*... 449/1074/1479 30.4%/72.6% mrr 0.386 avgtime 13554.080
large2180-tr v69ecf84 2015-12-11 data/eval/large2180*... 592/917/1479 40.0%/62.0% mrr 0.478 avgtime 14405.263

v1.4 --- curated APR 77.7%, MRR 0.405; large2180 APR 75.5%, MRR 0.379:

curated-test  2b85c94 2015-11-10 AnswerScoreDecisionF... 131/282/430 30.5%/65.6% mrr 0.401 avgtime 3596.696
curated-test u2b85c94 2015-11-10 AnswerScoreDecisionF... 136/334/430 31.6%/77.7% mrr 0.405 avgtime 3320.982
curated-test v2b85c94 2015-11-10 AnswerScoreDecisionF... 135/282/430 31.4%/65.6% mrr 0.409 avgtime 3524.656
curated-trai  2b85c94 2015-11-10 AnswerScoreDecisionF... 295/308/430 68.6%/71.6% mrr 0.699 avgtime 4083.608
curated-trai u2b85c94 2015-11-10 AnswerScoreDecisionF... 164/340/430 38.1%/79.1% mrr 0.480 avgtime 3639.840
curated-trai v2b85c94 2015-11-10 AnswerScoreDecisionF... 257/308/430 59.8%/71.6% mrr 0.649 avgtime 3968.469
large2180-te  2b85c94 2015-11-10 AnswerScoreDecisionF... 209/432/694 30.1%/62.2% mrr 0.387 avgtime 5137.313
large2180-te u2b85c94 2015-11-10 AnswerScoreDecisionF... 205/524/694 29.5%/75.5% mrr 0.379 avgtime 4828.143
large2180-te v2b85c94 2015-11-10 AnswerScoreDecisionF... 213/432/694 30.7%/62.2% mrr 0.389 avgtime 5076.575
large2180-tr  c92760c 2015-11-04 HIGHLEVEL.md constra... 726/926/1479 49.1%/62.6% mrr 0.543 avgtime 11820.064
large2180-tr uc92760c 2015-11-04 HIGHLEVEL.md constra... 453/1075/1479 30.6%/72.7% mrr 0.389 avgtime 10848.609
large2180-tr vc92760c 2015-11-04 HIGHLEVEL.md constra... 602/926/1479 40.7%/62.6% mrr 0.483 avgtime 11638.433

v1.3 --- curated APR 77.9%, MRR 0.413; large2180 APR 75.5%, MRR 0.390:

curated-test  88f39c2 2015-10-19 Mbprop.txt: Retrain ... 138/279/430 32.1%/64.9% mrr 0.407 avgtime 2947.311
curated-test u88f39c2 2015-10-19 Mbprop.txt: Retrain ... 144/335/430 33.5%/77.9% mrr 0.413 avgtime 2681.062
curated-test v88f39c2 2015-10-19 Mbprop.txt: Retrain ... 144/279/430 33.5%/64.9% mrr 0.418 avgtime 2874.982
curated-trai  88f39c2 2015-10-19 Mbprop.txt: Retrain ... 290/306/430 67.4%/71.2% mrr 0.691 avgtime 3725.780
curated-trai u88f39c2 2015-10-19 Mbprop.txt: Retrain ... 169/335/430 39.3%/77.9% mrr 0.479 avgtime 3295.355
curated-trai v88f39c2 2015-10-19 Mbprop.txt: Retrain ... 260/306/430 60.5%/71.2% mrr 0.649 avgtime 3611.334
large2180-te  88f39c2 2015-10-19 Mbprop.txt: Retrain ... 218/435/694 31.4%/62.7% mrr 0.392 avgtime 4509.625
large2180-te u88f39c2 2015-10-19 Mbprop.txt: Retrain ... 217/524/694 31.3%/75.5% mrr 0.390 avgtime 4223.223
large2180-te v88f39c2 2015-10-19 Mbprop.txt: Retrain ... 207/435/694 29.8%/62.7% mrr 0.382 avgtime 4450.021
large2180-tr  88f39c2 2015-10-19 Mbprop.txt: Retrain ... 729/916/1479 49.3%/61.9% mrr 0.544 avgtime 12199.161
large2180-tr u88f39c2 2015-10-19 Mbprop.txt: Retrain ... 468/1063/1479 31.6%/71.9% mrr 0.398 avgtime 11243.693
large2180-tr v88f39c2 2015-10-19 Mbprop.txt: Retrain ... 598/916/1479 40.4%/61.9% mrr 0.480 avgtime 12019.682

v1.2 --- curated APR 77.2%, MRR 0.439; large2180 APR 74.8%, MRR 0.411:

curated-test  0296763 2015-08-30 data/ml/biocrf/model... 146/287/430 34.0%/66.7% mrr 0.431 avgtime 2392.096
curated-test u0296763 2015-08-30 data/ml/biocrf/model... 152/332/430 35.3%/77.2% mrr 0.439 avgtime 2157.916
curated-test v0296763 2015-08-30 data/ml/biocrf/model... 151/287/430 35.1%/66.7% mrr 0.440 avgtime 2343.056
curated-trai  0296763 2015-08-30 data/ml/biocrf/model... 290/303/430 67.4%/70.5% mrr 0.689 avgtime 3887.648
curated-trai u0296763 2015-08-30 data/ml/biocrf/model... 181/332/430 42.1%/77.2% mrr 0.503 avgtime 3595.703
curated-trai v0296763 2015-08-30 data/ml/biocrf/model... 257/303/430 59.8%/70.5% mrr 0.644 avgtime 3816.893
large2180-te  0296763 2015-08-30 data/ml/biocrf/model... 224/439/694 32.3%/63.3% mrr 0.402 avgtime 3326.777
large2180-te u0296763 2015-08-30 data/ml/biocrf/model... 233/519/694 33.6%/74.8% mrr 0.411 avgtime 2994.481
large2180-te v0296763 2015-08-30 data/ml/biocrf/model... 221/439/694 31.8%/63.3% mrr 0.399 avgtime 3260.786
large2180-tr  0296763 2015-08-30 data/ml/biocrf/model... 735/925/1479 49.7%/62.5% mrr 0.551 avgtime 7906.924
large2180-tr u0296763 2015-08-30 data/ml/biocrf/model... 485/1052/1479 32.8%/71.1% mrr 0.406 avgtime 7057.941
large2180-tr v0296763 2015-08-30 data/ml/biocrf/model... 586/925/1479 39.6%/62.5% mrr 0.477 avgtime 7726.841

v1.1 --- curated APR 77.2%, MRR 0.409; large2180 APR 74.8%, MRR 0.398:

curated-test  76cc1af 2015-08-26 Merge branch 'master... 134/284/430 31.2%/66.0% mrr 0.405 avgtime 3460.146
curated-test u76cc1af 2015-08-26 Merge branch 'master... 135/332/430 31.4%/77.2% mrr 0.409 avgtime 3231.877
curated-test v76cc1af 2015-08-26 Merge branch 'master... 127/284/430 29.5%/66.0% mrr 0.397 avgtime 3411.869
curated-trai  76cc1af 2015-08-26 Merge branch 'master... 301/306/430 70.0%/71.2% mrr 0.705 avgtime 5815.394
curated-trai u76cc1af 2015-08-26 Merge branch 'master... 199/333/430 46.3%/77.4% mrr 0.538 avgtime 5533.997
curated-trai v76cc1af 2015-08-26 Merge branch 'master... 281/306/430 65.3%/71.2% mrr 0.677 avgtime 5747.069
large2180-te  76cc1af 2015-08-26 Merge branch 'master... 222/443/694 32.0%/63.8% mrr 0.408 avgtime 3622.175
large2180-te u76cc1af 2015-08-26 Merge branch 'master... 218/519/694 31.4%/74.8% mrr 0.398 avgtime 3285.847
large2180-te v76cc1af 2015-08-26 Merge branch 'master... 235/443/694 33.9%/63.8% mrr 0.416 avgtime 3556.244
large2180-tr  76cc1af 2015-08-26 Merge branch 'master... 752/927/1479 50.8%/62.7% mrr 0.558 avgtime 8455.257
large2180-tr u76cc1af 2015-08-26 Merge branch 'master... 498/1051/1479 33.7%/71.1% mrr 0.412 avgtime 7622.098
large2180-tr v76cc1af 2015-08-26 Merge branch 'master... 616/927/1479 41.6%/62.7% mrr 0.491 avgtime 8287.513
trecnew-raw-      ovt 2015-08-29 Merge branch 'master... 121/233/447 27.1%/52.1% mrr 0.346 avgtime 3756.961
trecnew-raw-      ovt 2015-08-29 Merge branch 'master... 118/272/447 26.4%/60.9% mrr 0.325 avgtime 3496.736
trecnew-raw-      ovt 2015-08-29 Merge branch 'master... 123/233/447 27.5%/52.1% mrr 0.345 avgtime 3681.780

v1.0 (the first YodaQA paper) --- curated APR 79.3%, MRR 0.420:

curated-test  0ae3b79 2015-04-14 Merge branch 'master... 137/292/430 31.9%/67.9% mrr 0.413 avgtime 6767.419
curated-test u0ae3b79 2015-04-14 Merge branch 'master... 139/341/430 32.3%/79.3% mrr 0.420 avgtime 6549.246
curated-test v0ae3b79 2015-04-14 Merge branch 'master... 138/292/430 32.1%/67.9% mrr 0.418 avgtime 6687.020
curated-trai  0ae3b79 2015-04-14 Merge branch 'master... 152/283/430 35.3%/65.8% mrr 0.454 avgtime 6566.500
curated-trai u0ae3b79 2015-04-14 Merge branch 'master... 131/329/430 30.5%/76.5% mrr 0.392 avgtime 6358.768
curated-trai v0ae3b79 2015-04-14 Merge branch 'master... 155/283/430 36.0%/65.8% mrr 0.456 avgtime 6492.669
trecnew-raw-      ovt 2015-04-14 Merge branch 'master... 118/237/447 26.4%/53.0% mrr 0.333 avgtime 6213.230
trecnew-raw-      ovt 2015-04-14 Merge branch 'master... 112/278/447 25.1%/62.2% mrr 0.323 avgtime 6056.471
trecnew-raw-      ovt 2015-04-14 Merge branch 'master... 112/237/447 25.1%/53.0% mrr 0.326 avgtime 6159.455

d/live baseline

We don't do day-to-day development on this baseline, but this section records performance evolution on the Bing-enabled version running at http://live.ailao.eu/.

Current version (v1.6):

large2470-te  37309c8 2016-03-01 Merge branch 'master... 359/599/766 46.9%/78.2% mrr 0.558 avgtime 15024.982
large2470-te u37309c8 2016-03-01 Merge branch 'master... 329/673/766 43.0%/87.9% mrr 0.522 avgtime 11167.556
large2470-te v37309c8 2016-03-01 Merge branch 'master... 349/599/766 45.6%/78.2% mrr 0.546 avgtime 11389.205
large2470-tr  37309c8 2016-03-01 Merge branch 'master... 1051/1317/1704 61.7%/77.3% mrr 0.680 avgtime 26346.896
large2470-tr u37309c8 2016-03-01 Merge branch 'master... 703/1466/1704 41.3%/86.0% mrr 0.510 avgtime 17176.143
large2470-tr v37309c8 2016-03-01 Merge branch 'master... 913/1317/1704 53.6%/77.3% mrr 0.621 avgtime 17987.875

Older versions:

large2180-te  c5d1968 2015-12-12 Merge branch 'master... 258/468/694 37.2%/67.4% mrr 0.451 avgtime 7940.420
large2180-te uc5d1968 2015-12-12 Merge branch 'master... 247/552/694 35.6%/79.5% mrr 0.439 avgtime 7550.901
large2180-te vc5d1968 2015-12-12 Merge branch 'master... 251/468/694 36.2%/67.4% mrr 0.444 avgtime 7844.306
large2180-tr  c5d1968 2015-12-12 Merge branch 'master... 761/967/1479 51.5%/65.4% mrr 0.573 avgtime 16064.049
large2180-tr uc5d1968 2015-12-12 Merge branch 'master... 479/1123/1479 32.4%/75.9% mrr 0.407 avgtime 14909.452
large2180-tr vc5d1968 2015-12-12 Merge branch 'master... 633/967/1479 42.8%/65.4% mrr 0.510 avgtime 15795.383

v1.4:

large2180-te  6a040cb 2015-11-10 Merge remote-trackin... 255/470/694 36.7%/67.7% mrr 0.447 avgtime 8150.822
large2180-te u6a040cb 2015-11-10 Merge remote-trackin... 242/553/694 34.9%/79.7% mrr 0.439 avgtime 7758.923
large2180-te v6a040cb 2015-11-10 Merge remote-trackin... 245/470/694 35.3%/67.7% mrr 0.439 avgtime 8055.959
large2180-tr  6a040cb 2015-11-10 Merge remote-trackin... 766/981/1479 51.8%/66.3% mrr 0.579 avgtime 16448.392
large2180-tr u6a040cb 2015-11-10 Merge remote-trackin... 467/1131/1479 31.6%/76.5% mrr 0.409 avgtime 15269.455
large2180-tr v6a040cb 2015-11-10 Merge remote-trackin... 635/981/1479 42.9%/66.3% mrr 0.514 avgtime 16181.728

large2180-te  35a4484 2015-10-16 Merge branch 'master... 260/469/694 37.5%/67.6% mrr 0.454 avgtime 11034.758
large2180-te u35a4484 2015-10-16 Merge branch 'master... 227/558/694 32.7%/80.4% mrr 0.422 avgtime 10687.774
large2180-te v35a4484 2015-10-16 Merge branch 'master... 261/469/694 37.6%/67.6% mrr 0.452 avgtime 10955.408
large2180-tr  35a4484 2015-10-16 Merge branch 'master... 759/996/1479 51.3%/67.3% mrr 0.581 avgtime 15775.905
large2180-tr u35a4484 2015-10-16 Merge branch 'master... 483/1131/1479 32.7%/76.5% mrr 0.418 avgtime 14665.273
large2180-tr v35a4484 2015-10-16 Merge branch 'master... 640/996/1479 43.3%/67.3% mrr 0.515 avgtime 15518.062
large2180-te  e5ed8a5 2015-09-10 Added one minute tim... 253/492/694 36.5%/70.9% mrr 0.456 avgtime 6951.470
large2180-te ue5ed8a5 2015-09-10 Added one minute tim... 235/557/694 33.9%/80.3% mrr 0.433 avgtime 6608.611
large2180-te ve5ed8a5 2015-09-10 Added one minute tim... 253/492/694 36.5%/70.9% mrr 0.455 avgtime 6857.989
large2180-tr  e5ed8a5 2015-09-10 Added one minute tim... 813/1013/1479 55.0%/68.5% mrr 0.605 avgtime 21314.917
large2180-tr ue5ed8a5 2015-09-10 Added one minute tim... 531/1152/1479 35.9%/77.9% mrr 0.445 avgtime 20472.813
large2180-tr ve5ed8a5 2015-09-10 Added one minute tim... 667/1013/1479 45.1%/68.5% mrr 0.535 avgtime 21075.419

Version running up to 2015-09-18:

large2180-te  f04cce6 2015-07-21 Merge branch 'master... 264/520/694 38.0%/74.9% mrr 0.477 avgtime 6248.368
large2180-te uf04cce6 2015-07-21 Merge branch 'master... 230/587/694 33.1%/84.6% mrr 0.430 avgtime 5976.657
large2180-te vf04cce6 2015-07-21 Merge branch 'master... 259/520/694 37.3%/74.9% mrr 0.474 avgtime 6166.965
large2180-tr  f04cce6 2015-07-21 Merge branch 'master... 599/1052/1479 40.5%/71.1% mrr 0.498 avgtime 12523.736
large2180-tr uf04cce6 2015-07-21 Merge branch 'master... 510/1191/1479 34.5%/80.5% mrr 0.437 avgtime 11852.452
large2180-tr vf04cce6 2015-07-21 Merge branch 'master... 585/1052/1479 39.6%/71.1% mrr 0.490 avgtime 12329.911

v1.2 with Bing search (live since 2015-09-18):

curated-test  e5ed8a5 2015-09-10 Added one minute tim... 178/319/430 41.4%/74.2% mrr 0.500 avgtime 5827.692
curated-test ue5ed8a5 2015-09-10 Added one minute tim... 167/360/430 38.8%/83.7% mrr 0.481 avgtime 5635.870
curated-test ve5ed8a5 2015-09-10 Added one minute tim... 177/319/430 41.2%/74.2% mrr 0.502 avgtime 5779.753
curated-trai  e5ed8a5 2015-09-10 Added one minute tim... 328/336/430 76.3%/78.1% mrr 0.772 avgtime 7043.856
curated-trai ue5ed8a5 2015-09-10 Added one minute tim... 196/364/430 45.6%/84.7% mrr 0.549 avgtime 6767.992
curated-trai ve5ed8a5 2015-09-10 Added one minute tim... 289/336/430 67.2%/78.1% mrr 0.720 avgtime 6963.149
large2180-te  e5ed8a5 2015-09-10 Added one minute tim... 253/492/694 36.5%/70.9% mrr 0.456 avgtime 6951.470
large2180-te ue5ed8a5 2015-09-10 Added one minute tim... 235/557/694 33.9%/80.3% mrr 0.433 avgtime 6608.611
large2180-te ve5ed8a5 2015-09-10 Added one minute tim... 253/492/694 36.5%/70.9% mrr 0.455 avgtime 6857.989
large2180-tr  e5ed8a5 2015-09-10 Added one minute tim... 813/1013/1479 55.0%/68.5% mrr 0.605 avgtime 21314.917
large2180-tr ue5ed8a5 2015-09-10 Added one minute tim... 531/1152/1479 35.9%/77.9% mrr 0.445 avgtime 20472.813
large2180-tr ve5ed8a5 2015-09-10 Added one minute tim... 667/1013/1479 45.1%/68.5% mrr 0.535 avgtime 21075.419

WebQuestions-based baseline

We primarily use the d/movies branch for WebQuestions-style questions. This branch has enwiki disabled as a data source, since our primary motivation in the movies-based questions is QA just on structured knowledge bases.

Also note that the pipeline phase1 (v-prefixed commits) actually seems non-overfitted here. We haven't factored that into our reports or benchmark instructions yet, for the sake of keeping a common approach for both the TREC-based and WQ-based scenarios. We'll probably drop this soon, though.
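One rough way to look at this is the gap between train and test accuracy-at-one per pipeline stage. The sketch below is only an illustration: the gap is a crude proxy chosen here, not a measure used by the project, and the helper `train_test_gap` is hypothetical. The ACC1 percentages are copied from the curated 5ffddf9 and moviesF 88bb307 rows on this page.

```python
# ACC1 percentages as (test, train), copied from rows on this page.
# "unprefixed" means the row whose commit has neither the u nor the v prefix.
ACC1 = {
    "curated 5ffddf9": {"u": (36.3, 43.3), "v": (36.5, 63.7), "unprefixed": (36.7, 74.7)},
    "moviesF 88bb307": {"u": (60.9, 67.3), "v": (62.3, 72.3), "unprefixed": (64.6, 75.6)},
}


def train_test_gap(test_acc, train_acc):
    """Train minus test accuracy, in percentage points."""
    return round(train_acc - test_acc, 1)


gaps = {
    name: {stage: train_test_gap(*pair) for stage, pair in stages.items()}
    for name, stages in ACC1.items()
}

for name, per_stage in gaps.items():
    print(name, per_stage)
```

On these rows the v-stage gap is 27.2 points on curated but 10.0 points on moviesF, which fits the observation that the later stages look much less overfitted on the movies data.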

v1.6 --- moviesF APR 86.0%, MRR 0.678:

moviesF-test  88bb307 2016-01-12 StructuredPrimarySea... 281/361/435 64.6%/83.0% mrr 0.700 avgtime 1266.148
moviesF-test u88bb307 2016-01-12 StructuredPrimarySea... 265/374/435 60.9%/86.0% mrr 0.678 avgtime 1060.979
moviesF-test v88bb307 2016-01-12 StructuredPrimarySea... 271/361/435 62.3%/83.0% mrr 0.690 avgtime 1187.615
moviesF-trai  88bb307 2016-01-12 StructuredPrimarySea... 869/994/1150 75.6%/86.4% mrr 0.799 avgtime 5482.466
moviesF-trai u88bb307 2016-01-12 StructuredPrimarySea... 774/1023/1150 67.3%/89.0% mrr 0.739 avgtime 4764.607
moviesF-trai v88bb307 2016-01-12 StructuredPrimarySea... 832/994/1150 72.3%/86.4% mrr 0.777 avgtime 5221.205

v1.5 --- moviesF APR 86.4%, MRR 0.685; moviesD APR 83.8%, MRR 0.603:

moviesD-test  7e6767d 2015-12-10 AnswerScoreDecisionF... 134/202/260 51.5%/77.7% mrr 0.595 avgtime 694.315
moviesD-test u7e6767d 2015-12-10 AnswerScoreDecisionF... 135/218/260 51.9%/83.8% mrr 0.603 avgtime 582.854
moviesD-test v7e6767d 2015-12-10 AnswerScoreDecisionF... 135/202/260 51.9%/77.7% mrr 0.599 avgtime 652.902

moviesE-test  66417f1 2015-12-10 Mbprop.txt: Retrain ... 271/354/431 62.9%/82.1% mrr 0.690 avgtime 1247.231
moviesE-test u66417f1 2015-12-10 Mbprop.txt: Retrain ... 269/373/431 62.4%/86.5% mrr 0.690 avgtime 1026.623
moviesE-test v66417f1 2015-12-10 Mbprop.txt: Retrain ... 270/354/431 62.6%/82.1% mrr 0.690 avgtime 1171.421
moviesE-trai  66417f1 2015-12-10 Mbprop.txt: Retrain ... 856/992/1140 75.1%/87.0% mrr 0.800 avgtime 3827.673
moviesE-trai u66417f1 2015-12-10 Mbprop.txt: Retrain ... 770/1021/1140 67.5%/89.6% mrr 0.741 avgtime 3102.732
moviesE-trai v66417f1 2015-12-10 Mbprop.txt: Retrain ... 821/992/1140 72.0%/87.0% mrr 0.777 avgtime 3573.315

moviesF-test  7e6767d 2015-12-10 AnswerScoreDecisionF... 272/357/435 62.5%/82.1% mrr 0.687 avgtime 1114.250
moviesF-test u7e6767d 2015-12-10 AnswerScoreDecisionF... 269/376/435 61.8%/86.4% mrr 0.685 avgtime 902.714
moviesF-test v7e6767d 2015-12-10 AnswerScoreDecisionF... 272/357/435 62.5%/82.1% mrr 0.688 avgtime 1038.073

wq-test-ovt-  36515f6 2015-12-13 AnswerScoreDecisionF... 851/1392/2032 41.9%/68.5% mrr 0.496 avgtime 6156.863
wq-test-ovt- u36515f6 2015-12-13 AnswerScoreDecisionF... 790/1553/2032 38.9%/76.4% mrr 0.474 avgtime 4900.655
wq-test-ovt- v36515f6 2015-12-13 AnswerScoreDecisionF... 844/1392/2032 41.5%/68.5% mrr 0.494 avgtime 5689.644
wq-train-ovt  7e6767d 2015-12-10 AnswerScoreDecisionF... 1952/2805/3778 51.7%/74.2% mrr 0.594 avgtime 14614.412
wq-train-ovt u7e6767d 2015-12-10 AnswerScoreDecisionF... 1668/2990/3778 44.2%/79.1% mrr 0.532 avgtime 12335.248
wq-train-ovt v7e6767d 2015-12-10 AnswerScoreDecisionF... 1858/2805/3778 49.2%/74.2% mrr 0.574 avgtime 13939.751

v1.4 --- moviesD APR 81.9%, MRR 0.590:

moviesD-test  e10cf37 2015-11-03 Mbprop.txt: Retrain ... 138/206/260 53.1%/79.2% mrr 0.609 avgtime 1571.417
moviesD-test ue10cf37 2015-11-03 Mbprop.txt: Retrain ... 130/213/260 50.0%/81.9% mrr 0.590 avgtime 1419.293
moviesD-test ve10cf37 2015-11-03 Mbprop.txt: Retrain ... 137/206/260 52.7%/79.2% mrr 0.609 avgtime 1512.312
moviesD-trai  e10cf37 2015-11-03 Mbprop.txt: Retrain ... 455/512/624 72.9%/82.1% mrr 0.766 avgtime 17632.270
moviesD-trai ue10cf37 2015-11-03 Mbprop.txt: Retrain ... 362/525/624 58.0%/84.1% mrr 0.658 avgtime 17203.644
moviesD-trai ve10cf37 2015-11-03 Mbprop.txt: Retrain ... 406/512/624 65.1%/82.1% mrr 0.715 avgtime 17474.994

v1.3 --- moviesC APR 79.0%, MRR 0.573; moviesD APR 76.5%, MRR 0.531; wq APR 75.7%, MRR 0.476:

moviesC-test  6eadf12 2015-10-18 Mbprop.txt: Retrain ... 118/173/233 50.6%/74.2% mrr 0.577 avgtime 876.665
moviesC-test u6eadf12 2015-10-18 Mbprop.txt: Retrain ... 119/184/233 51.1%/79.0% mrr 0.573 avgtime 739.296
moviesC-test v6eadf12 2015-10-18 Mbprop.txt: Retrain ... 121/173/233 51.9%/74.2% mrr 0.585 avgtime 819.947
moviesC-trai  6eadf12 2015-10-18 Mbprop.txt: Retrain ... 379/438/542 69.9%/80.8% mrr 0.742 avgtime 1829.013
moviesC-trai u6eadf12 2015-10-18 Mbprop.txt: Retrain ... 290/444/542 53.5%/81.9% mrr 0.619 avgtime 1466.149
moviesC-trai v6eadf12 2015-10-18 Mbprop.txt: Retrain ... 347/438/542 64.0%/80.8% mrr 0.700 avgtime 1689.706

moviesD-test  6c13b62 2015-10-19 +moviesD dataset... 127/190/260 48.8%/73.1% mrr 0.551 avgtime 630.581
moviesD-test u6c13b62 2015-10-19 +moviesD dataset... 117/199/260 45.0%/76.5% mrr 0.531 avgtime 482.417
moviesD-test v6c13b62 2015-10-19 +moviesD dataset... 124/190/260 47.7%/73.1% mrr 0.547 avgtime 571.359
moviesD-trai  6c13b62 2015-10-19 +moviesD dataset... 425/485/624 68.1%/77.7% mrr 0.719 avgtime 2140.000
moviesD-trai u6c13b62 2015-10-19 +moviesD dataset... 322/492/624 51.6%/78.8% mrr 0.595 avgtime 1735.939
moviesD-trai v6c13b62 2015-10-19 +moviesD dataset... 364/485/624 58.3%/77.7% mrr 0.658 avgtime 1984.069

wq-test-ovt-  6eadf12 2015-10-18 Mbprop.txt: Retrain ... 863/1393/2032 42.5%/68.6% mrr 0.502 avgtime 5812.585
wq-test-ovt- u6eadf12 2015-10-18 Mbprop.txt: Retrain ... 795/1538/2032 39.1%/75.7% mrr 0.476 avgtime 5122.649
wq-test-ovt- v6eadf12 2015-10-18 Mbprop.txt: Retrain ... 857/1393/2032 42.2%/68.6% mrr 0.499 avgtime 5606.749
wq-train-ovt  6eadf12 2015-10-18 Mbprop.txt: Retrain ... 1906/2773/3778 50.4%/73.4% mrr 0.582 avgtime 17218.725
wq-train-ovt u6eadf12 2015-10-18 Mbprop.txt: Retrain ... 1689/2968/3778 44.7%/78.6% mrr 0.531 avgtime 15051.915
wq-train-ovt v6eadf12 2015-10-18 Mbprop.txt: Retrain ... 1839/2773/3778 48.7%/73.4% mrr 0.566 avgtime 16566.801

v1.2, v1.1 (both same results) --- moviesC APR 75.5%, MRR 0.494; wq APR 67.3%, MRR 0.425:

moviesC-test  a770e5f 2015-08-21 Mark: label-lookup 1... 102/168/233 43.8%/72.1% mrr 0.509 avgtime 585.312
moviesC-test ua770e5f 2015-08-21 Mark: label-lookup 1... 95/176/233 40.8%/75.5% mrr 0.494 avgtime 447.181
moviesC-test va770e5f 2015-08-21 Mark: label-lookup 1... 104/168/233 44.6%/72.1% mrr 0.517 avgtime 530.785
moviesC-trai  a770e5f 2015-08-21 Mark: label-lookup 1... 313/388/542 57.7%/71.6% mrr 0.629 avgtime 1463.521
moviesC-trai ua770e5f 2015-08-21 Mark: label-lookup 1... 240/399/542 44.3%/73.6% mrr 0.522 avgtime 1176.910
moviesC-trai va770e5f 2015-08-21 Mark: label-lookup 1... 287/388/542 53.0%/71.6% mrr 0.596 avgtime 1351.434
wq-test-ovt-  8795cd0 2015-08-27 Merge remote-trackin... 757/1257/2032 37.3%/61.9% mrr 0.445 avgtime 5117.716
wq-test-ovt- u8795cd0 2015-08-27 Merge remote-trackin... 699/1368/2032 34.4%/67.3% mrr 0.425 avgtime 4516.366
wq-test-ovt- v8795cd0 2015-08-27 Merge remote-trackin... 749/1257/2032 36.9%/61.9% mrr 0.443 avgtime 4922.379
wq-train-ovt  8795cd0 2015-08-27 Merge remote-trackin... 1702/2486/3778 45.1%/65.8% mrr 0.522 avgtime 22590.390
wq-train-ovt u8795cd0 2015-08-27 Merge remote-trackin... 1519/2658/3778 40.2%/70.4% mrr 0.477 avgtime 21017.841
wq-train-ovt v8795cd0 2015-08-27 Merge remote-trackin... 1673/2486/3778 44.3%/65.8% mrr 0.510 avgtime 22058.533

Feature Experiments

This section will probably be quite fluid.

Explorative FBpath (Glove-based)

Baseline:

moviesD-test  7bbda27 2015-12-02 FocusGenerator addFo... 141/205/260 54.2%/78.8% mrr 0.614 avgtime 801.693
moviesD-test u7bbda27 2015-12-02 FocusGenerator addFo... 135/215/260 51.9%/82.7% mrr 0.604 avgtime 635.000
moviesD-test v7bbda27 2015-12-02 FocusGenerator addFo... 138/205/260 53.1%/78.8% mrr 0.613 avgtime 742.282
moviesD-trai  7bbda27 2015-12-02 FocusGenerator addFo... 454/513/624 72.8%/82.2% mrr 0.765 avgtime 2061.984
moviesD-trai u7bbda27 2015-12-02 FocusGenerator addFo... 356/527/624 57.1%/84.5% mrr 0.653 avgtime 1612.969
moviesD-trai v7bbda27 2015-12-02 FocusGenerator addFo... 413/513/624 66.2%/82.2% mrr 0.722 avgtime 1898.797

Explorative instead of a priori (logistic regression labelling):

moviesD-test  e462e45 2015-12-04 Merge remote-trackin... 90/160/260 34.6%/61.5% mrr 0.421 avgtime 947.432
moviesD-test ue462e45 2015-12-04 Merge remote-trackin... 91/176/260 35.0%/67.7% mrr 0.425 avgtime 791.527
moviesD-test ve462e45 2015-12-04 Merge remote-trackin... 88/160/260 33.8%/61.5% mrr 0.415 avgtime 887.226
moviesD-trai  e462e45 2015-12-04 Merge remote-trackin... 353/418/624 56.6%/67.0% mrr 0.607 avgtime 2504.473
moviesD-trai ue462e45 2015-12-04 Merge remote-trackin... 252/427/624 40.4%/68.4% mrr 0.484 avgtime 2082.617
moviesD-trai ve462e45 2015-12-04 Merge remote-trackin... 299/418/624 47.9%/67.0% mrr 0.552 avgtime 2358.309

Explorative instead of generic (fetch all) (new baseline):

moviesD-test  c5805b9 2015-12-04 Merge remote-trackin... 137/200/260 52.7%/76.9% mrr 0.598 avgtime 840.640
moviesD-test uc5805b9 2015-12-04 Merge remote-trackin... 133/210/260 51.2%/80.8% mrr 0.588 avgtime 681.419
moviesD-test vc5805b9 2015-12-04 Merge remote-trackin... 136/200/260 52.3%/76.9% mrr 0.596 avgtime 781.449
moviesD-trai  c5805b9 2015-12-04 Merge remote-trackin... 439/512/624 70.4%/82.1% mrr 0.750 avgtime 2343.264
moviesD-trai uc5805b9 2015-12-04 Merge remote-trackin... 344/528/624 55.1%/84.6% mrr 0.635 avgtime 1883.171
moviesD-trai vc5805b9 2015-12-04 Merge remote-trackin... 392/512/624 62.8%/82.1% mrr 0.698 avgtime 2187.892

Fixed score-based ordering, mean score for 2-property paths:

moviesD-test  45af8db 2015-12-05 Merge remote-trackin... 138/204/260 53.1%/78.5% mrr 0.594 avgtime 1214.348
moviesD-test u45af8db 2015-12-05 Merge remote-trackin... 132/220/260 50.8%/84.6% mrr 0.584 avgtime 1043.612
moviesD-test v45af8db 2015-12-05 Merge remote-trackin... 134/204/260 51.5%/78.5% mrr 0.591 avgtime 1155.532
moviesD-trai  45af8db 2015-12-05 Merge remote-trackin... 446/513/624 71.5%/82.2% mrr 0.759 avgtime 3343.911
moviesD-trai u45af8db 2015-12-05 Merge remote-trackin... 332/537/624 53.2%/86.1% mrr 0.627 avgtime 2851.963
moviesD-trai v45af8db 2015-12-05 Merge remote-trackin... 396/513/624 63.5%/82.2% mrr 0.706 avgtime 3184.205

Limit also the number of 2-property paths, not just 1-prop paths (new baseline):

moviesD-test  c7418cc 2015-12-05 FBPathGloVeScoring: ... 136/207/260 52.3%/79.6% mrr 0.599 avgtime 1071.796
moviesD-test uc7418cc 2015-12-05 FBPathGloVeScoring: ... 137/220/260 52.7%/84.6% mrr 0.606 avgtime 902.943
moviesD-test vc7418cc 2015-12-05 FBPathGloVeScoring: ... 142/207/260 54.6%/79.6% mrr 0.611 avgtime 1010.326
moviesD-trai  c7418cc 2015-12-05 FBPathGloVeScoring: ... 447/509/624 71.6%/81.6% mrr 0.758 avgtime 2786.857
moviesD-trai uc7418cc 2015-12-05 FBPathGloVeScoring: ... 347/530/624 55.6%/84.9% mrr 0.643 avgtime 2326.904
moviesD-trai vc7418cc 2015-12-05 FBPathGloVeScoring: ... 402/509/624 64.4%/81.6% mrr 0.710 avgtime 2633.049

Try changing limit 15 -> 5:

moviesD-test  8c9b29e 2015-12-05 exploringPaths topPa... 128/204/260 49.2%/78.5% mrr 0.574 avgtime 810.969
moviesD-test u8c9b29e 2015-12-05 exploringPaths topPa... 124/217/260 47.7%/83.5% mrr 0.570 avgtime 655.753
moviesD-test v8c9b29e 2015-12-05 exploringPaths topPa... 129/204/260 49.6%/78.5% mrr 0.581 avgtime 751.032
moviesD-trai  8c9b29e 2015-12-05 exploringPaths topPa... 443/507/624 71.0%/81.2% mrr 0.752 avgtime 2121.804
moviesD-trai u8c9b29e 2015-12-05 exploringPaths topPa... 358/520/624 57.4%/83.3% mrr 0.648 avgtime 1696.961
moviesD-trai v8c9b29e 2015-12-05 exploringPaths topPa... 395/507/624 63.3%/81.2% mrr 0.700 avgtime 1966.618

Try disabling a priori fbpath question labelling:

moviesD-test  a8f31c3 2015-12-05 Try disabling a prio... 91/187/260 35.0%/71.9% mrr 0.440 avgtime 896.918
moviesD-test ua8f31c3 2015-12-05 Try disabling a prio... 80/206/260 30.8%/79.2% mrr 0.414 avgtime 735.018
moviesD-test va8f31c3 2015-12-05 Try disabling a prio... 84/187/260 32.3%/71.9% mrr 0.427 avgtime 834.787
moviesD-trai  a8f31c3 2015-12-05 Try disabling a prio... 358/458/624 57.4%/73.4% mrr 0.637 avgtime 2310.752
moviesD-trai ua8f31c3 2015-12-05 Try disabling a prio... 248/488/624 39.7%/78.2% mrr 0.503 avgtime 1885.626
moviesD-trai va8f31c3 2015-12-05 Try disabling a prio... 296/458/624 47.4%/73.4% mrr 0.568 avgtime 2160.346

Retrain explorative (GloVe) classifier using moviesD, include non-link relations (new baseline):

moviesD-test  a24f2f7 2015-12-06 Merge branch 'fbpath... 134/209/260 51.5%/80.4% mrr 0.598 avgtime 1685.146
moviesD-test ua24f2f7 2015-12-06 Merge branch 'fbpath... 132/218/260 50.8%/83.8% mrr 0.592 avgtime 1519.137
moviesD-test va24f2f7 2015-12-06 Merge branch 'fbpath... 135/209/260 51.9%/80.4% mrr 0.602 avgtime 1625.133
moviesD-trai  a24f2f7 2015-12-06 Merge branch 'fbpath... 442/511/624 70.8%/81.9% mrr 0.754 avgtime 4611.489
moviesD-trai ua24f2f7 2015-12-06 Merge branch 'fbpath... 356/530/624 57.1%/84.9% mrr 0.649 avgtime 4140.708
moviesD-trai va24f2f7 2015-12-06 Merge branch 'fbpath... 405/511/624 64.9%/81.9% mrr 0.712 avgtime 4455.004

Try disabling a priori fbpath question labelling:

moviesD-test  4d753b0 2015-12-05 Try disabling a prio... 98/184/260 37.7%/70.8% mrr 0.465 avgtime 783.514
moviesD-test u4d753b0 2015-12-05 Try disabling a prio... 95/209/260 36.5%/80.4% mrr 0.456 avgtime 609.502
moviesD-test v4d753b0 2015-12-05 Try disabling a prio... 88/184/260 33.8%/70.8% mrr 0.445 avgtime 712.502
moviesD-trai  4d753b0 2015-12-05 Try disabling a prio... 383/472/624 61.4%/75.6% mrr 0.670 avgtime 2193.474
moviesD-trai u4d753b0 2015-12-05 Try disabling a prio... 266/502/624 42.6%/80.4% mrr 0.519 avgtime 1765.960
moviesD-trai v4d753b0 2015-12-05 Try disabling a prio... 325/472/624 52.1%/75.6% mrr 0.602 avgtime 2041.186

Building witness-based relations:

moviesD-test  ee63449 2015-12-07 Merge branch 'fbpath... 122/206/260 46.9%/79.2% mrr 0.558 avgtime 1035.734
moviesD-test uee63449 2015-12-07 Merge branch 'fbpath... 116/219/260 44.6%/84.2% mrr 0.542 avgtime 919.843
moviesD-test vee63449 2015-12-07 Merge branch 'fbpath... 122/206/260 46.9%/79.2% mrr 0.562 avgtime 999.078
moviesD-trai  ee63449 2015-12-07 Merge branch 'fbpath... 436/508/624 69.9%/81.4% mrr 0.748 avgtime 3081.712
moviesD-trai uee63449 2015-12-07 Merge branch 'fbpath... 320/530/624 51.3%/84.9% mrr 0.614 avgtime 2694.448
moviesD-trai vee63449 2015-12-07 Merge branch 'fbpath... 390/508/624 62.5%/81.4% mrr 0.698 avgtime 2959.652

[Building witness-based relations] Try disabling a priori fbpath question labelling:

moviesD-test  05176c1 2015-12-05 Try disabling a prio... 93/184/260 35.8%/70.8% mrr 0.455 avgtime 906.356
moviesD-test u05176c1 2015-12-05 Try disabling a prio... 90/208/260 34.6%/80.0% mrr 0.442 avgtime 730.845
moviesD-test v05176c1 2015-12-05 Try disabling a prio... 94/184/260 36.2%/70.8% mrr 0.447 avgtime 845.308
moviesD-trai  05176c1 2015-12-05 Try disabling a prio... 371/457/624 59.5%/73.2% mrr 0.648 avgtime 2295.579
moviesD-trai u05176c1 2015-12-05 Try disabling a prio... 263/487/624 42.1%/78.0% mrr 0.519 avgtime 1871.413
moviesD-trai v05176c1 2015-12-05 Try disabling a prio... 331/457/624 53.0%/73.2% mrr 0.605 avgtime 2143.996

[Building witness-based relations] Improved question focus in "who did play X Y in Z":

moviesD-test  6ed5826 2015-12-07 question FocusGenera... 127/204/260 48.8%/78.5% mrr 0.573 avgtime 1072.878
moviesD-test u6ed5826 2015-12-07 question FocusGenera... 116/219/260 44.6%/84.2% mrr 0.543 avgtime 903.636
moviesD-test v6ed5826 2015-12-07 question FocusGenera... 126/204/260 48.5%/78.5% mrr 0.570 avgtime 1012.997
moviesD-trai  6ed5826 2015-12-07 question FocusGenera... 432/509/624 69.2%/81.6% mrr 0.742 avgtime 2833.613
moviesD-trai u6ed5826 2015-12-07 question FocusGenera... 316/531/624 50.6%/85.1% mrr 0.606 avgtime 2368.257
moviesD-trai v6ed5826 2015-12-07 question FocusGenera... 389/509/624 62.3%/81.6% mrr 0.697 avgtime 2676.699

[Building witness-based relations] Further improved question focus:

moviesD-test  b60fc43 2015-12-08 .... 129/204/260 49.6%/78.5% mrr 0.578 avgtime 13134.762
moviesD-test ub60fc43 2015-12-08 .... 116/219/260 44.6%/84.2% mrr 0.543 avgtime 12973.969
moviesD-test vb60fc43 2015-12-08 .... 124/204/260 47.7%/78.5% mrr 0.570 avgtime 13074.393
moviesD-trai  b60fc43 2015-12-08 .... 441/509/624 70.7%/81.6% mrr 0.751 avgtime 18169.388
moviesD-trai ub60fc43 2015-12-08 .... 328/531/624 52.6%/85.1% mrr 0.618 avgtime 17716.420
moviesD-trai vb60fc43 2015-12-08 .... 392/509/624 62.8%/81.6% mrr 0.698 avgtime 18011.466

[a24f2f7, i.e. disabling witnesses again] Check again with non-link relations and question fixes:

moviesD-test  99168ac 2015-12-08 Merge branch 'fbpath... 130/206/260 50.0%/79.2% mrr 0.587 avgtime 1020.503
moviesD-test u99168ac 2015-12-08 Merge branch 'fbpath... 135/218/260 51.9%/83.8% mrr 0.601 avgtime 855.166
moviesD-test v99168ac 2015-12-08 Merge branch 'fbpath... 131/206/260 50.4%/79.2% mrr 0.594 avgtime 960.256
moviesD-trai  99168ac 2015-12-08 Merge branch 'fbpath... 440/508/624 70.5%/81.4% mrr 0.752 avgtime 2726.236
moviesD-trai u99168ac 2015-12-08 Merge branch 'fbpath... 348/530/624 55.8%/84.9% mrr 0.641 avgtime 2254.485
moviesD-trai v99168ac 2015-12-08 Merge branch 'fbpath... 391/508/624 62.7%/81.4% mrr 0.700 avgtime 2570.305

[99168ac] New witness selection mechanism - using a third matrix for determining CVT-witness relations, averaging total score across the whole triplet:

moviesD-test  9d78f56 2015-12-10 Merge branch 'fbpath... 115/203/260 44.2%/78.1% mrr 0.542 avgtime 1246.365
moviesD-test u9d78f56 2015-12-10 Merge branch 'fbpath... 114/214/260 43.8%/82.3% mrr 0.534 avgtime 1102.542
moviesD-test v9d78f56 2015-12-10 Merge branch 'fbpath... 116/203/260 44.6%/78.1% mrr 0.544 avgtime 1192.046
moviesD-trai  9d78f56 2015-12-10 Merge branch 'fbpath... 431/494/624 69.1%/79.2% mrr 0.734 avgtime 2638.022
moviesD-trai u9d78f56 2015-12-10 Merge branch 'fbpath... 321/514/624 51.4%/82.4% mrr 0.607 avgtime 2235.877
moviesD-trai v9d78f56 2015-12-10 Merge branch 'fbpath... 386/494/624 61.9%/79.2% mrr 0.685 avgtime 2493.626

Retrained based on v1.5 master and fixed locale issues:

moviesD-test  c2969e9 2015-12-17 Merge branch 'fbpath... 121/203/260 46.5%/78.1% mrr 0.558 avgtime 1284.375
moviesD-test uc2969e9 2015-12-17 Merge branch 'fbpath... 114/217/260 43.8%/83.5% mrr 0.548 avgtime 1145.270
moviesD-test vc2969e9 2015-12-17 Merge branch 'fbpath... 122/203/260 46.9%/78.1% mrr 0.563 avgtime 1232.504
moviesD-trai  c2969e9 2015-12-17 Merge branch 'fbpath... 429/510/624 68.8%/81.7% mrr 0.744 avgtime 2586.251
moviesD-trai uc2969e9 2015-12-17 Merge branch 'fbpath... 326/525/624 52.2%/84.1% mrr 0.617 avgtime 2193.123
moviesD-trai vc2969e9 2015-12-17 Merge branch 'fbpath... 388/510/624 62.2%/81.7% mrr 0.696 avgtime 2441.263

Explorative FBpath - RNN scoring

Scoring Freebase paths using a recurrent neural network. The full path is scored at once, instead of scoring each property separately. Path representation: property labels joined using the '#' symbol.
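As a rough illustration of the path representation described above, here is a minimal sketch. The function names and the example labels are hypothetical (they are not taken from the YodaQA code or from Freebase), and whether the real implementation pads the '#' with whitespace is not stated here, so the bare separator is an assumption.

```python
def path_to_text(property_labels, sep="#"):
    """Join the labels of all properties on one path into a single string."""
    return sep.join(label.strip() for label in property_labels)


def scoring_pairs(question, candidate_paths):
    """Build one (question, path text) pair per whole path.

    This mirrors the idea of scoring the full path at once rather than
    scoring each property on its own.
    """
    return [(question, path_to_text(path)) for path in candidate_paths]


# Hypothetical example labels, for illustration only.
pairs = scoring_pairs(
    "who played X in Y",
    [["film", "starring", "actor"], ["film", "directed by"]],
)
for question, path_text in pairs:
    print(question, "|", path_text)
```

Each pair would then be handed to the scorer as one unit, so a two-property path and a three-property path both yield exactly one score.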

Model: anssel-rnn-4ed6120cf26d1a50, without apriori:

moviesD-test  8873002 2016-04-04 Send questions and p... 138/208/260 53.1%/80.0% mrr 0.599 avgtime 2252.302
moviesD-test u8873002 2016-04-04 Send questions and p... 125/219/260 48.1%/84.2% mrr 0.575 avgtime 2140.293
moviesD-test v8873002 2016-04-04 Send questions and p... 132/208/260 50.8%/80.0% mrr 0.588 avgtime 2210.604
moviesD-trai  8873002 2016-04-04 Send questions and p... 445/519/624 71.3%/83.2% mrr 0.764 avgtime 6037.148
moviesD-trai u8873002 2016-04-04 Send questions and p... 336/533/624 53.8%/85.4% mrr 0.636 avgtime 5679.596
moviesD-trai v8873002 2016-04-04 Send questions and p... 390/519/624 62.5%/83.2% mrr 0.704 avgtime 5910.183

wq-test-ovt-  19e6b71 2016-04-04 Send questions and p... 736/1382/2032 36.2%/68.0% mrr 0.449 avgtime 12566.977
wq-test-ovt- u19e6b71 2016-04-04 Send questions and p... 649/1500/2032 31.9%/73.8% mrr 0.414 avgtime 12121.319
wq-test-ovt- v19e6b71 2016-04-04 Send questions and p... 701/1382/2032 34.5%/68.0% mrr 0.436 avgtime 12444.624
wq-train-ovt  19e6b71 2016-04-04 Send questions and p... 1637/2680/3778 43.3%/70.9% mrr 0.524 avgtime 24967.739
wq-train-ovt u19e6b71 2016-04-04 Send questions and p... 1346/2878/3778 35.6%/76.2% mrr 0.453 avgtime 23614.558
wq-train-ovt v19e6b71 2016-04-04 Send questions and p... 1530/2680/3778 40.5%/70.9% mrr 0.497 avgtime 24604.205

moviesF-test  c14d0b9 2016-04-04 Dummy commit for con... 250/362/435 57.5%/83.2% mrr 0.655 avgtime 2208.916
moviesF-test uc14d0b9 2016-04-04 Dummy commit for con... 244/386/435 56.1%/88.7% mrr 0.638 avgtime 2090.540
moviesF-test vc14d0b9 2016-04-04 Dummy commit for con... 251/362/435 57.7%/83.2% mrr 0.655 avgtime 2170.652
moviesF-trai  c14d0b9 2016-04-04 Dummy commit for con... 798/981/1150 69.4%/85.3% mrr 0.755 avgtime 6037.362
moviesF-trai uc14d0b9 2016-04-04 Dummy commit for con... 679/1033/1150 59.0%/89.8% mrr 0.672 avgtime 5595.132
moviesF-trai vc14d0b9 2016-04-04 Dummy commit for con... 758/981/1150 65.9%/85.3% mrr 0.726 avgtime 5905.740

Model: anssel-rnn-4ed6120cf26d1a50, including apriori:

moviesD-test  30b6f3f 2016-04-07 Added apriori paths... 136/209/260 52.3%/80.4% mrr 0.610 avgtime 2347.566
moviesD-test u30b6f3f 2016-04-07 Added apriori paths... 129/221/260 49.6%/85.0% mrr 0.594 avgtime 2225.090
moviesD-test v30b6f3f 2016-04-07 Added apriori paths... 136/209/260 52.3%/80.4% mrr 0.613 avgtime 2303.449
moviesD-trai  30b6f3f 2016-04-07 Added apriori paths... 454/528/624 72.8%/84.6% mrr 0.779 avgtime 4722.102
moviesD-trai u30b6f3f 2016-04-07 Added apriori paths... 349/543/624 55.9%/87.0% mrr 0.652 avgtime 4361.742
moviesD-trai v30b6f3f 2016-04-07 Added apriori paths... 402/528/624 64.4%/84.6% mrr 0.723 avgtime 4598.365

moviesF-test  5fac8e3 2016-04-07 Added a priori fbpat... 272/364/435 62.5%/83.7% mrr 0.693 avgtime 2268.621
moviesF-test u5fac8e3 2016-04-07 Added a priori fbpat... 263/390/435 60.5%/89.7% mrr 0.677 avgtime 2118.615
moviesF-test v5fac8e3 2016-04-07 Added a priori fbpat... 264/364/435 60.7%/83.7% mrr 0.683 avgtime 2224.328
moviesF-trai  5fac8e3 2016-04-07 Added a priori fbpat... 856/1011/1150 74.4%/87.9% mrr 0.796 avgtime 6209.810
moviesF-trai u5fac8e3 2016-04-07 Added a priori fbpat... 721/1050/1150 62.7%/91.3% mrr 0.712 avgtime 5744.865
moviesF-trai v5fac8e3 2016-04-07 Added a priori fbpat... 811/1011/1150 70.5%/87.9% mrr 0.768 avgtime 6078.353

wq-test-ovt-  5fac8e3 2016-04-07 Added a priori fbpat... 773/1461/2032 38.0%/71.9% mrr 0.479 avgtime 11418.613
wq-test-ovt- u5fac8e3 2016-04-07 Added a priori fbpat... 689/1619/2032 33.9%/79.7% mrr 0.444 avgtime 10613.653
wq-test-ovt- v5fac8e3 2016-04-07 Added a priori fbpat... 782/1461/2032 38.5%/71.9% mrr 0.478 avgtime 11141.899
wq-train-ovt  5fac8e3 2016-04-07 Added a priori fbpat... 1738/2863/3778 46.0%/75.8% mrr 0.558 avgtime 33960.838
wq-train-ovt u5fac8e3 2016-04-07 Added a priori fbpat... 1400/3129/3778 37.1%/82.8% mrr 0.481 avgtime 31675.068
wq-train-ovt v5fac8e3 2016-04-07 Added a priori fbpat... 1655/2863/3778 43.8%/75.8% mrr 0.538 avgtime 33288.027

Model: anssel-rnn-4ed6120cf26d1a50, including apriori, neighbourhood exploring using the Freebase API (new baseline):

moviesD-test  01e0961 2016-04-23 Merge branch 'fbpath... 141/211/260 54.2%/81.2% mrr 0.621 avgtime 2016.305
moviesD-test u01e0961 2016-04-23 Merge branch 'fbpath... 134/223/260 51.5%/85.8% mrr 0.608 avgtime 1132.468
moviesD-test v01e0961 2016-04-23 Merge branch 'fbpath... 140/211/260 53.8%/81.2% mrr 0.621 avgtime 1195.288
moviesD-trai  01e0961 2016-04-23 Merge branch 'fbpath... 468/523/624 75.0%/83.8% mrr 0.790 avgtime 4514.283
moviesD-trai u01e0961 2016-04-23 Merge branch 'fbpath... 355/541/624 56.9%/86.7% mrr 0.656 avgtime 2210.902
moviesD-trai v01e0961 2016-04-23 Merge branch 'fbpath... 407/523/624 65.2%/83.8% mrr 0.723 avgtime 2396.282

Migrating Freebase from Fuseki to Virtuoso

Baseline:

moviesD-test  e10cf37 2015-11-03 Mbprop.txt: Retrain ... 138/206/260 53.1%/79.2% mrr 0.609 avgtime 1571.417
moviesD-test ue10cf37 2015-11-03 Mbprop.txt: Retrain ... 130/213/260 50.0%/81.9% mrr 0.590 avgtime 1419.293
moviesD-test ve10cf37 2015-11-03 Mbprop.txt: Retrain ... 137/206/260 52.7%/79.2% mrr 0.609 avgtime 1512.312
moviesD-trai  e10cf37 2015-11-03 Mbprop.txt: Retrain ... 455/512/624 72.9%/82.1% mrr 0.766 avgtime 17632.270
moviesD-trai ue10cf37 2015-11-03 Mbprop.txt: Retrain ... 362/525/624 58.0%/84.1% mrr 0.658 avgtime 17203.644
moviesD-trai ve10cf37 2015-11-03 Mbprop.txt: Retrain ... 406/512/624 65.1%/82.1% mrr 0.715 avgtime 17474.994

Migrated:

moviesD-test  ee93719 2015-11-08 Migrate Freebase fro... 138/199/260 53.1%/76.5% mrr 0.604 avgtime 1687.771
moviesD-test uee93719 2015-11-08 Migrate Freebase fro... 135/208/260 51.9%/80.0% mrr 0.589 avgtime 1540.271
moviesD-test vee93719 2015-11-08 Migrate Freebase fro... 134/199/260 51.5%/76.5% mrr 0.594 avgtime 1627.686
moviesD-trai  ee93719 2015-11-08 Migrate Freebase fro... 447/511/624 71.6%/81.9% mrr 0.758 avgtime 4067.582
moviesD-trai uee93719 2015-11-08 Migrate Freebase fro... 359/519/624 57.5%/83.2% mrr 0.656 avgtime 3648.616
moviesD-trai vee93719 2015-11-08 Migrate Freebase fro... 398/511/624 63.8%/81.9% mrr 0.707 avgtime 3901.511

This also involves (i) updating to BaseKB Gold (Freebase snapshot from April rather than January) and (ii) reducing topLinkedConcepts from 5 to 4 (as some of our queries were too large for Virtuoso when we had too many parallel concepts).

(Work in progress: so far this is actually a slowdown, although the goal was a performance speedup.)

Answer Sentence Selection using STS

Based on the attn1511 model.

Baseline:

curated-test  5ffddf9 2016-04-01 AnswerFV +originPsgB... 158/296/430 36.7%/68.8% mrr 0.456 avgtime 8127.067
curated-test u5ffddf9 2016-04-01 AnswerFV +originPsgB... 156/348/430 36.3%/80.9% mrr 0.449 avgtime 5806.669
curated-test v5ffddf9 2016-04-01 AnswerFV +originPsgB... 157/296/430 36.5%/68.8% mrr 0.455 avgtime 5918.403
curated-trai  5ffddf9 2016-04-01 AnswerFV +originPsgB... 321/328/430 74.7%/76.3% mrr 0.754 avgtime 14838.503
curated-trai u5ffddf9 2016-04-01 AnswerFV +originPsgB... 186/360/430 43.3%/83.7% mrr 0.529 avgtime 11158.011
curated-trai v5ffddf9 2016-04-01 AnswerFV +originPsgB... 274/328/430 63.7%/76.3% mrr 0.692 avgtime 11364.936

Replacing Simple, raw scores:

curated-test  31cbd7f 2016-04-01 CandidateGenerator: ... 163/287/430 37.9%/66.7% mrr 0.462 avgtime 6386.322
curated-test u31cbd7f 2016-04-01 CandidateGenerator: ... 160/339/430 37.2%/78.8% mrr 0.447 avgtime 5726.030
curated-test v31cbd7f 2016-04-01 CandidateGenerator: ... 160/287/430 37.2%/66.7% mrr 0.455 avgtime 5924.471
curated-trai  31cbd7f 2016-04-01 CandidateGenerator: ... 314/318/430 73.0%/74.0% mrr 0.734 avgtime 8401.669
curated-trai u31cbd7f 2016-04-01 CandidateGenerator: ... 184/349/430 42.8%/81.2% mrr 0.518 avgtime 7591.174
curated-trai v31cbd7f 2016-04-01 CandidateGenerator: ... 275/318/430 64.0%/74.0% mrr 0.680 avgtime 7908.613

...and sigmoid scores (default from now on):

curated-test  d511db5 2016-04-02 +AF PassageRR... 150/281/430 34.9%/65.3% mrr 0.434 avgtime 5966.079
curated-test ud511db5 2016-04-02 +AF PassageRR... 157/339/430 36.5%/78.8% mrr 0.437 avgtime 5278.153
curated-test vd511db5 2016-04-02 +AF PassageRR... 145/281/430 33.7%/65.3% mrr 0.426 avgtime 5481.323
curated-trai  d511db5 2016-04-02 +AF PassageRR... 314/316/430 73.0%/73.5% mrr 0.732 avgtime 8287.556
curated-trai ud511db5 2016-04-02 +AF PassageRR... 182/349/430 42.3%/81.2% mrr 0.510 avgtime 7423.897
curated-trai vd511db5 2016-04-02 +AF PassageRR... 277/316/430 64.4%/73.5% mrr 0.683 avgtime 7754.667
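For readers unfamiliar with the distinction, the "raw" versus "sigmoid" variants above presumably differ in whether the STS model output is squashed into the (0, 1) range before it is used as a feature. The sketch below shows only the standard logistic function; how exactly the scores enter the answer feature vector is not described on this page, so the `to_feature` helper is a hypothetical stand-in.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function, mapping any real score to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def to_feature(raw_score, use_sigmoid=True):
    """Hypothetical helper: pick the raw or the squashed score as the feature."""
    return sigmoid(raw_score) if use_sigmoid else raw_score
```

The squashed variant keeps the ordering of passages unchanged (the function is monotonic) but bounds the feature values, which can matter to the downstream classifier.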

MRR +0.1 (transfer learning based model) in sts:

curated-test  acbee3d 2016-04-04 data/sts/: Switch fr... 144/267/430 33.5%/62.1% mrr 0.416 avgtime 5216.086
curated-test uacbee3d 2016-04-04 data/sts/: Switch fr... 141/326/430 32.8%/75.8% mrr 0.409 avgtime 4400.990
curated-test vacbee3d 2016-04-04 data/sts/: Switch fr... 143/267/430 33.3%/62.1% mrr 0.415 avgtime 4600.891
curated-trai  acbee3d 2016-04-04 data/sts/: Switch fr... 310/312/430 72.1%/72.6% mrr 0.723 avgtime 6712.507
curated-trai uacbee3d 2016-04-04 data/sts/: Switch fr... 176/343/430 40.9%/79.8% mrr 0.499 avgtime 5365.896
curated-trai vacbee3d 2016-04-04 data/sts/: Switch fr... 274/312/430 63.7%/72.6% mrr 0.674 avgtime 5688.107

BM25 in sts:

curated-test  803d377 2016-04-08 data/sts: Switch to ... 164/285/430 38.1%/66.3% mrr 0.466 avgtime 4270.254
curated-test u803d377 2016-04-08 data/sts: Switch to ... 159/336/430 37.0%/78.1% mrr 0.447 avgtime 3537.001
curated-test v803d377 2016-04-08 data/sts: Switch to ... 159/285/430 37.0%/66.3% mrr 0.458 avgtime 3739.071
curated-trai  803d377 2016-04-08 data/sts: Switch to ... 315/317/430 73.3%/73.7% mrr 0.734 avgtime 4745.415
curated-trai u803d377 2016-04-08 data/sts: Switch to ... 181/352/430 42.1%/81.9% mrr 0.521 avgtime 3752.468
curated-trai v803d377 2016-04-08 data/sts: Switch to ... 275/317/430 64.0%/73.7% mrr 0.682 avgtime 4069.240

Merged passages from all sources (top 36):

curated-test  6b3baee 2016-04-03 Merge branch 'f/pass... 157/299/430 36.5%/69.5% mrr 0.452 avgtime 6703.583
curated-test u6b3baee 2016-04-03 Merge branch 'f/pass... 150/351/430 34.9%/81.6% mrr 0.434 avgtime 5851.819
curated-test v6b3baee 2016-04-03 Merge branch 'f/pass... 153/299/430 35.6%/69.5% mrr 0.449 avgtime 6113.069
curated-trai  6b3baee 2016-04-03 Merge branch 'f/pass... 321/326/430 74.7%/75.8% mrr 0.751 avgtime 7812.992
curated-trai u6b3baee 2016-04-03 Merge branch 'f/pass... 185/360/430 43.0%/83.7% mrr 0.526 avgtime 6828.653
curated-trai v6b3baee 2016-04-03 Merge branch 'f/pass... 274/326/430 63.7%/75.8% mrr 0.686 avgtime 7218.447

Merged passages, MRR +0.1 (transfer learning based model) in sts:

curated-test  6589ed0 2016-04-04 data/sts/: Switch fr... 165/276/430 38.4%/64.2% mrr 0.455 avgtime 4998.423
curated-test u6589ed0 2016-04-04 data/sts/: Switch fr... 155/337/430 36.0%/78.4% mrr 0.436 avgtime 4190.877
curated-test v6589ed0 2016-04-04 data/sts/: Switch fr... 159/276/430 37.0%/64.2% mrr 0.452 avgtime 4433.142
curated-trai  6589ed0 2016-04-04 data/sts/: Switch fr... 314/318/430 73.0%/74.0% mrr 0.735 avgtime 6167.510
curated-trai u6589ed0 2016-04-04 data/sts/: Switch fr... 198/349/430 46.0%/81.2% mrr 0.533 avgtime 5209.310
curated-trai v6589ed0 2016-04-04 data/sts/: Switch fr... 274/318/430 63.7%/74.0% mrr 0.681 avgtime 5578.838

Hold-out Experiments

v1.1 TREC Hold-out Experiments

Note that the label-lookup and dectrees changes introduced before v1.1 did not improve performance on curated, but did improve performance on movies, webquestions and large2180.

v1.1 with 12 instead of 6 search results per IR query --- curated APR 80.0%, MRR 0.440 (but ~12s -> 20s per question):

curated-test  5768167 2015-08-29 AnswerScoreDecisionF... 138/290/430 32.1%/67.4% mrr 0.425 avgtime 5754.981
curated-test u5768167 2015-08-29 AnswerScoreDecisionF... 152/344/430 35.3%/80.0% mrr 0.440 avgtime 5465.723  
curated-test v5768167 2015-08-29 AnswerScoreDecisionF... 139/290/430 32.3%/67.4% mrr 0.427 avgtime 5705.498  
curated-trai  597b437 2015-08-28 SolrFullPrimarySearc... 300/308/430 69.8%/71.6% mrr 0.706 avgtime 4601.686
curated-trai u597b437 2015-08-28 SolrFullPrimarySearc... 194/344/430 45.1%/80.0% mrr 0.532 avgtime 4255.334  
curated-trai v597b437 2015-08-28 SolrFullPrimarySearc... 284/308/430 66.0%/71.6% mrr 0.685 avgtime 4530.166  

v1.1 without IR from enwiki --- curated APR 42.1%, MRR 0.253 (but ~2.5s per question):

curated-test  8795cd0 2015-08-27 Merge remote-trackin... 91/156/430 21.2%/36.3% mrr 0.254 avgtime 1085.359
curated-test u8795cd0 2015-08-27 Merge remote-trackin... 85/181/430 19.8%/42.1% mrr 0.253 avgtime 939.621
curated-test v8795cd0 2015-08-27 Merge remote-trackin... 88/156/430 20.5%/36.3% mrr 0.253 avgtime 1037.196
curated-trai  8795cd0 2015-08-27 Merge remote-trackin... 165/184/430 38.4%/42.8% mrr 0.398 avgtime 863.772
curated-trai u8795cd0 2015-08-27 Merge remote-trackin... 129/188/430 30.0%/43.7% mrr 0.339 avgtime 671.766
curated-trai v8795cd0 2015-08-27 Merge remote-trackin... 154/184/430 35.8%/42.8% mrr 0.382 avgtime 795.719

v1.1 without IR from structured knowledge bases (DBpedia, Freebase) --- curated APR 70.7%, MRR 0.378:

curated-test  a9bf875 2015-08-29 YodaQA: -structured ... 103/273/430 24.0%/63.5% mrr 0.336 avgtime 2271.578
curated-test ua9bf875 2015-08-29 YodaQA: -structured ... 124/304/430 28.8%/70.7% mrr 0.378 avgtime 2073.512
curated-test va9bf875 2015-08-29 YodaQA: -structured ... 100/273/430 23.3%/63.5% mrr 0.337 avgtime 2201.423
curated-trai  a9bf875 2015-08-29 YodaQA: -structured ... 291/292/430 67.7%/67.9% mrr 0.678 avgtime 2735.233
curated-trai ua9bf875 2015-08-29 YodaQA: -structured ... 198/314/430 46.0%/73.0% mrr 0.522 avgtime 2475.768
curated-trai va9bf875 2015-08-29 YodaQA: -structured ... 262/292/430 60.9%/67.9% mrr 0.641 avgtime 2638.769

v1.1 without answer typing using external resources (WordNet, DBpedia) --- curated APR 77.2%, MRR 0.394:

curated-test  e36e53c 2015-08-29 AnswerAnalysis: Disa... 116/279/430 27.0%/64.9% mrr 0.373 avgtime 1768.499
curated-test ue36e53c 2015-08-29 AnswerAnalysis: Disa... 132/332/430 30.7%/77.2% mrr 0.394 avgtime 1564.041 
curated-test ve36e53c 2015-08-29 AnswerAnalysis: Disa... 118/279/430 27.4%/64.9% mrr 0.380 avgtime 1723.894
curated-trai  e36e53c 2015-08-29 AnswerAnalysis: Disa... 298/303/430 69.3%/70.5% mrr 0.698 avgtime 2563.871  
curated-trai ue36e53c 2015-08-29 AnswerAnalysis: Disa... 196/333/430 45.6%/77.4% mrr 0.530 avgtime 2302.198
curated-trai ve36e53c 2015-08-29 AnswerAnalysis: Disa... 267/303/430 62.1%/70.5% mrr 0.657 avgtime 2496.949  

v1.1 without entity linking --- curated APR 68.1%, MRR 0.318:

curated-test  ecb30e3 2015-08-29 QuestionAnalysis: -C... 90/261/430 20.9%/60.7% mrr 0.298 avgtime 1624.336
curated-test uecb30e3 2015-08-29 QuestionAnalysis: -C... 96/293/430 22.3%/68.1% mrr 0.318 avgtime 1451.379
curated-test vecb30e3 2015-08-29 QuestionAnalysis: -C... 91/261/430 21.2%/60.7% mrr 0.307 avgtime 1577.783  
curated-trai  ecb30e3 2015-08-29 QuestionAnalysis: -C... 277/280/430 64.4%/65.1% mrr 0.648 avgtime 2008.781
curated-trai uecb30e3 2015-08-29 QuestionAnalysis: -C... 187/299/430 43.5%/69.5% mrr 0.496 avgtime 1788.493
curated-trai vecb30e3 2015-08-29 QuestionAnalysis: -C... 262/280/430 60.9%/65.1% mrr 0.626 avgtime 1942.117

v1.1 without decision forest and label-lookup --- curated APR 79.3%, MRR 0.436; large2180 APR 76.5%, MRR 0.399:

curated-test  20ab096 2015-07-28 Merge commit '0e52a1... 124/286/430 28.8%/66.5% mrr 0.386 avgtime 5522.054
curated-test u20ab096 2015-07-28 Merge commit '0e52a1... 150/341/430 34.9%/79.3% mrr 0.436 avgtime 5242.850
curated-test v20ab096 2015-07-28 Merge commit '0e52a1... 121/286/430 28.1%/66.5% mrr 0.382 avgtime 5428.800
curated-trai  20ab096 2015-07-28 Merge commit '0e52a1... 198/298/430 46.0%/69.3% mrr 0.546 avgtime 4790.583
curated-trai u20ab096 2015-07-28 Merge commit '0e52a1... 154/332/430 35.8%/77.2% mrr 0.458 avgtime 4522.161
curated-trai v20ab096 2015-07-28 Merge commit '0e52a1... 188/298/430 43.7%/69.3% mrr 0.531 avgtime 4697.530
large2180-te  20ab096 2015-07-28 Merge commit '0e52a1... 187/438/694 26.9%/63.1% mrr 0.357 avgtime 3614.539
large2180-te u20ab096 2015-07-28 Merge commit '0e52a1... 218/531/694 31.4%/76.5% mrr 0.399 avgtime 3338.304
large2180-te v20ab096 2015-07-28 Merge commit '0e52a1... 181/438/694 26.1%/63.1% mrr 0.351 avgtime 3526.321
large2180-tr  20ab096 2015-07-28 Merge commit '0e52a1... 425/905/1479 28.7%/61.2% mrr 0.373 avgtime 12576.337
large2180-tr u20ab096 2015-07-28 Merge commit '0e52a1... 415/1058/1479 28.1%/71.5% mrr 0.366 avgtime 11938.729
large2180-tr v20ab096 2015-07-28 Merge commit '0e52a1... 408/905/1479 27.6%/61.2% mrr 0.367 avgtime 12385.858

v1.1 without decision forest, with label-lookup --- curated APR 77.2%, MRR 0.413; large2180 APR 74.8%, MRR 0.399:

curated-test  a6ee873 2015-08-21 Mark: label-lookup 1... 119/281/430 27.7%/65.3% mrr 0.372 avgtime 2388.535
curated-test ua6ee873 2015-08-21 Mark: label-lookup 1... 140/332/430 32.6%/77.2% mrr 0.413 avgtime 2170.687
curated-test va6ee873 2015-08-21 Mark: label-lookup 1... 114/281/430 26.5%/65.3% mrr 0.367 avgtime 2321.839
curated-trai  a6ee873 2015-08-21 Mark: label-lookup 1... 183/296/430 42.6%/68.8% mrr 0.521 avgtime 3267.536
curated-trai ua6ee873 2015-08-21 Mark: label-lookup 1... 165/333/430 38.4%/77.4% mrr 0.464 avgtime 2986.020
curated-trai va6ee873 2015-08-21 Mark: label-lookup 1... 184/296/430 42.8%/68.8% mrr 0.520 avgtime 3175.556
large2180-te  a6ee873 2015-08-21 Mark: label-lookup 1... 216/430/694 31.1%/62.0% mrr 0.386 avgtime 29212.673
large2180-te ua6ee873 2015-08-21 Mark: label-lookup 1... 221/519/694 31.8%/74.8% mrr 0.399 avgtime 28906.655
large2180-te va6ee873 2015-08-21 Mark: label-lookup 1... 208/430/694 30.0%/62.0% mrr 0.382 avgtime 29153.467
large2180-tr  a6ee873 2015-08-21 Mark: label-lookup 1... 465/895/1479 31.4%/60.5% mrr 0.404 avgtime 40675.033
large2180-tr ua6ee873 2015-08-21 Mark: label-lookup 1... 454/1051/1479 30.7%/71.1% mrr 0.381 avgtime 39922.785
large2180-tr va6ee873 2015-08-21 Mark: label-lookup 1... 476/895/1479 32.2%/60.5% mrr 0.407 avgtime 40524.531

v1.1 without a CRF-based passage answer producer --- curated APR 77.2%, MRR 0.433; large2180 APR 74.8%, MRR 0.399:

curated-test  3fd576a 2015-08-29 PassageAnalysis: -BI... 145/286/430 33.7%/66.5% mrr 0.431 avgtime 2982.463
curated-test u3fd576a 2015-08-29 PassageAnalysis: -BI... 150/332/430 34.9%/77.2% mrr 0.433 avgtime 2742.708
curated-test v3fd576a 2015-08-29 PassageAnalysis: -BI... 153/286/430 35.6%/66.5% mrr 0.445 avgtime 2911.970
curated-trai  3fd576a 2015-08-29 PassageAnalysis: -BI... 297/303/430 69.1%/70.5% mrr 0.697 avgtime 2634.163
curated-trai u3fd576a 2015-08-29 PassageAnalysis: -BI... 176/332/430 40.9%/77.2% mrr 0.491 avgtime 2315.214
curated-trai v3fd576a 2015-08-29 PassageAnalysis: -BI... 258/303/430 60.0%/70.5% mrr 0.645 avgtime 2531.022
large2180-te  3fd576a 2015-08-29 PassageAnalysis: -BI... 217/446/694 31.3%/64.3% mrr 0.408 avgtime 3381.604
large2180-te u3fd576a 2015-08-29 PassageAnalysis: -BI... 215/519/694 31.0%/74.8% mrr 0.399 avgtime 3048.320
large2180-te v3fd576a 2015-08-29 PassageAnalysis: -BI... 217/446/694 31.3%/64.3% mrr 0.407 avgtime 3290.635
large2180-tr  3fd576a 2015-08-29 PassageAnalysis: -BI... 723/910/1479 48.9%/61.5% mrr 0.541 avgtime 8509.359
large2180-tr u3fd576a 2015-08-29 PassageAnalysis: -BI... 474/1050/1479 32.0%/71.0% mrr 0.399 avgtime 7668.941
large2180-tr v3fd576a 2015-08-29 PassageAnalysis: -BI... 605/910/1479 40.9%/61.5% mrr 0.478 avgtime 8273.441

Let's explore the impact of the CRF a little further, comparing v1.1 with the NP-based answer hypothesis generator disabled (7d7b24d) against a version that additionally has the CRF disabled (5a7ae5e). Now we can finally see a small MRR and APR drop, showing that the CRF contributes something:

curated-test  7d7b24d 2015-08-30 PassageAnalysis: -Ca... 117/253/430 27.2%/58.8% mrr 0.359 avgtime 1985.050
curated-test u7d7b24d 2015-08-30 PassageAnalysis: -Ca... 125/279/430 29.1%/64.9% mrr 0.375 avgtime 1801.975
curated-test v7d7b24d 2015-08-30 PassageAnalysis: -Ca... 121/253/430 28.1%/58.8% mrr 0.369 avgtime 1919.153
curated-trai  7d7b24d 2015-08-30 PassageAnalysis: -Ca... 305/308/430 70.9%/71.6% mrr 0.712 avgtime 2452.001
curated-trai u7d7b24d 2015-08-30 PassageAnalysis: -Ca... 211/319/430 49.1%/74.2% mrr 0.564 avgtime 2211.858
curated-trai v7d7b24d 2015-08-30 PassageAnalysis: -Ca... 274/308/430 63.7%/71.6% mrr 0.673 avgtime 2360.485

curated-test  5a7ae5e 2015-08-30 PassageAnalysis: als... 132/248/430 30.7%/57.7% mrr 0.377 avgtime 1774.094
curated-test u5a7ae5e 2015-08-30 PassageAnalysis: als... 128/273/430 29.8%/63.5% mrr 0.371 avgtime 1586.492
curated-test v5a7ae5e 2015-08-30 PassageAnalysis: als... 136/248/430 31.6%/57.7% mrr 0.386 avgtime 1705.106
curated-trai  5a7ae5e 2015-08-30 PassageAnalysis: als... 266/276/430 61.9%/64.2% mrr 0.627 avgtime 1903.655
curated-trai u5a7ae5e 2015-08-30 PassageAnalysis: als... 165/288/430 38.4%/67.0% mrr 0.462 avgtime 1667.754
curated-trai v5a7ae5e 2015-08-30 PassageAnalysis: als... 229/276/430 53.3%/64.2% mrr 0.578 avgtime 1813.072
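To make the size of that drop concrete, here is a small worked example using the `u`-prefixed curated-test rows of the two tables above (the headline figures on this page appear to be quoted from the `u` rows; that reading is an assumption based on the numbers matching). The values are copied from those rows, and the helper name is hypothetical.

```python
# Values copied from the u-prefixed curated-test rows above.
with_crf = {"apr": 64.9, "mrr": 0.375}      # 7d7b24d: NP generator off, CRF on
without_crf = {"apr": 63.5, "mrr": 0.371}   # 5a7ae5e: NP generator off, CRF off


def delta(before, after, digits=3):
    """Per-metric change when going from `before` to `after`."""
    return {key: round(after[key] - before[key], digits) for key in before}


change = delta(with_crf, without_crf)
print(change)
```

So disabling the CRF on top of the NP generator costs about 1.4 points of APR and 0.004 of MRR on this split.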

So, could it be that CRF is useless with the other generators mixed in? That is curious, let's try v1.1 with a retrained CRF model --- oh, curated APR 77.2%, MRR 0.439; large2180 APR 74.8%, MRR 0.411; oops:

curated-test  0296763 2015-08-30 data/ml/biocrf/model... 146/287/430 34.0%/66.7% mrr 0.431 avgtime 2392.096
curated-test u0296763 2015-08-30 data/ml/biocrf/model... 152/332/430 35.3%/77.2% mrr 0.439 avgtime 2157.916
curated-test v0296763 2015-08-30 data/ml/biocrf/model... 151/287/430 35.1%/66.7% mrr 0.440 avgtime 2343.056
curated-trai  0296763 2015-08-30 data/ml/biocrf/model... 290/303/430 67.4%/70.5% mrr 0.689 avgtime 3887.648
curated-trai u0296763 2015-08-30 data/ml/biocrf/model... 181/332/430 42.1%/77.2% mrr 0.503 avgtime 3595.703
curated-trai v0296763 2015-08-30 data/ml/biocrf/model... 257/303/430 59.8%/70.5% mrr 0.644 avgtime 3816.893
large2180-te  0296763 2015-08-30 data/ml/biocrf/model... 224/439/694 32.3%/63.3% mrr 0.402 avgtime 3326.777
large2180-te u0296763 2015-08-30 data/ml/biocrf/model... 233/519/694 33.6%/74.8% mrr 0.411 avgtime 2994.481
large2180-te v0296763 2015-08-30 data/ml/biocrf/model... 221/439/694 31.8%/63.3% mrr 0.399 avgtime 3260.786
large2180-tr  0296763 2015-08-30 data/ml/biocrf/model... 735/925/1479 49.7%/62.5% mrr 0.551 avgtime 7906.924
large2180-tr u0296763 2015-08-30 data/ml/biocrf/model... 485/1052/1479 32.8%/71.1% mrr 0.406 avgtime 7057.941
large2180-tr v0296763 2015-08-30 data/ml/biocrf/model... 586/925/1479 39.6%/62.5% mrr 0.477 avgtime 7726.841

So the whole issue is that at some point, we had to retrain this and forgot. It is too late to fix this for v1.1, so we will tag the retrained version as v1.2 right after that.
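Since the summary figures in the captions are copied by hand from the result rows, it can help to re-derive them mechanically. The following is a minimal sketch of a row parser, assuming the row layout inferred from this page: dataset, an optional `u`/`v` prefix fused to a 7-character commit hash, date, truncated commit message, then `correct-at-one/recalled/total`, the two percentages, `mrr` and `avgtime`. The field names and the rounding tolerance are my own choices, not part of the evaluation tooling.

```python
import re

# Assumed row layout, inferred from the rows on this page.
ROW_RE = re.compile(
    r"^(?P<dataset>\S+)\s+(?P<variant>[uv]?)(?P<commit>[0-9a-f]{7})\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+.*?\s"
    r"(?P<top1>\d+)/(?P<recalled>\d+)/(?P<total>\d+)\s+"
    r"(?P<acc>[\d.]+)%/(?P<apr>[\d.]+)%\s+"
    r"mrr\s+(?P<mrr>[\d.]+)\s+avgtime\s+(?P<avgtime>[\d.]+)\s*$"
)


def parse_row(line):
    """Parse one result row into a dict with typed numeric fields."""
    match = ROW_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised result row: %r" % line)
    row = match.groupdict()
    for key in ("top1", "recalled", "total"):
        row[key] = int(row[key])
    for key in ("acc", "apr", "mrr", "avgtime"):
        row[key] = float(row[key])
    return row


def percentages_consistent(row, tol=0.06):
    """Check that both percentages agree with the raw counts (one decimal)."""
    acc = 100.0 * row["top1"] / row["total"]
    apr = 100.0 * row["recalled"] / row["total"]
    return abs(acc - row["acc"]) <= tol and abs(apr - row["apr"]) <= tol


row = parse_row(
    "curated-test u0296763 2015-08-30 data/ml/biocrf/model... "
    "152/332/430 35.3%/77.2% mrr 0.439 avgtime 2157.916"
)
print(row["variant"], row["apr"], row["mrr"], percentages_consistent(row))
```

Running every row of a table through `percentages_consistent`, and comparing the `u` test row against its caption, is a cheap way to catch copy errors in the captions.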

v1.1 WebQuestions Hold-out Experiments

v1.2 without answer typing using external resources (WordNet, DBpedia) --- wq MRR 0.422 (so, this kind of typing is not very important when we already know the originating property):

wq-test-ovt-  4acbefc 2015-09-07 AnswerAnalysis: Disa... 732/1242/2032 36.0%/61.1% mrr 0.433 avgtime 3195.309
wq-test-ovt- u4acbefc 2015-09-07 AnswerAnalysis: Disa... 705/1368/2032 34.7%/67.3% mrr 0.422 avgtime 2743.912
wq-test-ovt- v4acbefc 2015-09-07 AnswerAnalysis: Disa... 747/1242/2032 36.8%/61.1% mrr 0.438 avgtime 3042.177
wq-train-ovt  4acbefc 2015-09-07 AnswerAnalysis: Disa... 1655/2479/3778 43.8%/65.6% mrr 0.511 avgtime 8228.916
wq-train-ovt u4acbefc 2015-09-07 AnswerAnalysis: Disa... 1501/2658/3778 39.7%/70.4% mrr 0.472 avgtime 6979.765
wq-train-ovt v4acbefc 2015-09-07 AnswerAnalysis: Disa... 1635/2479/3778 43.3%/65.6% mrr 0.502 avgtime 7784.849

v1.1 without decision forest and label-lookup --- moviesC APR 72.1%, MRR 0.449:

moviesC-test  fb80dc3 2015-08-20 data/eval/moviesC-*:... 92/157/233 39.5%/67.4% mrr 0.483 avgtime 842.395
moviesC-test ufb80dc3 2015-08-20 data/eval/moviesC-*:... 81/168/233 34.8%/72.1% mrr 0.449 avgtime 710.244
moviesC-test vfb80dc3 2015-08-20 data/eval/moviesC-*:... 93/157/233 39.9%/67.4% mrr 0.483 avgtime 789.272
moviesC-trai  fb80dc3 2015-08-20 data/eval/moviesC-*:... 205/350/542 37.8%/64.6% mrr 0.462 avgtime 1686.444
moviesC-trai ufb80dc3 2015-08-20 data/eval/moviesC-*:... 185/379/542 34.1%/69.9% mrr 0.429 avgtime 1432.278
moviesC-trai vfb80dc3 2015-08-20 data/eval/moviesC-*:... 207/350/542 38.2%/64.6% mrr 0.466 avgtime 1588.147

v1.1 without decision forest, with label-lookup --- moviesC APR 75.5%, MRR 0.468; wq APR 67.3%, MRR 0.408:

moviesC-test  0d660b4 2015-08-27 Merge remote-trackin... 94/161/233 40.3%/69.1% mrr 0.490 avgtime 788.321
moviesC-test u0d660b4 2015-08-27 Merge remote-trackin... 86/176/233 36.9%/75.5% mrr 0.468 avgtime 656.824
moviesC-test v0d660b4 2015-08-27 Merge remote-trackin... 94/161/233 40.3%/69.1% mrr 0.497 avgtime 735.070
moviesC-trai  0d660b4 2015-08-27 Merge remote-trackin... 217/365/542 40.0%/67.3% mrr 0.487 avgtime 1417.650
moviesC-trai u0d660b4 2015-08-27 Merge remote-trackin... 185/399/542 34.1%/73.6% mrr 0.438 avgtime 1148.684
moviesC-trai v0d660b4 2015-08-27 Merge remote-trackin... 215/365/542 39.7%/67.3% mrr 0.482 avgtime 1315.276
wq-test-ovt-  0d660b4 2015-08-27 Merge remote-trackin... 730/1232/2032 35.9%/60.6% mrr 0.433 avgtime 3639.533
wq-test-ovt- u0d660b4 2015-08-27 Merge remote-trackin... 665/1368/2032 32.7%/67.3% mrr 0.408 avgtime 3095.558
wq-test-ovt- v0d660b4 2015-08-27 Merge remote-trackin... 728/1232/2032 35.8%/60.6% mrr 0.431 avgtime 3462.939
wq-train-ovt  0d660b4 2015-08-27 Merge remote-trackin... 1525/2441/3778 40.4%/64.6% mrr 0.478 avgtime 11511.939
wq-train-ovt u0d660b4 2015-08-27 Merge remote-trackin... 1416/2658/3778 37.5%/70.4% mrr 0.456 avgtime 10022.556
wq-train-ovt v0d660b4 2015-08-27 Merge remote-trackin... 1498/2441/3778 39.7%/64.6% mrr 0.474 avgtime 11056.607

v1.1+enwiki with decision forest and label-lookup (just as a curious experiment) --- moviesC APR 84.5%, MRR 0.506; wq APR 78.3%, MRR 0.431:

moviesC-test  52cdd6c 2015-08-28 AnswerScoreDecisionF... 112/177/233 48.1%/76.0% mrr 0.565 avgtime 1738.979
moviesC-test u52cdd6c 2015-08-28 AnswerScoreDecisionF... 94/197/233 40.3%/84.5% mrr 0.506 avgtime 1581.404
moviesC-test v52cdd6c 2015-08-28 AnswerScoreDecisionF... 112/177/233 48.1%/76.0% mrr 0.568 avgtime 1703.425
moviesC-trai  52cdd6c 2015-08-28 AnswerScoreDecisionF... 388/431/542 71.6%/79.5% mrr 0.749 avgtime 4379.111
moviesC-trai u52cdd6c 2015-08-28 AnswerScoreDecisionF... 246/470/542 45.4%/86.7% mrr 0.553 avgtime 4003.825
moviesC-trai v52cdd6c 2015-08-28 AnswerScoreDecisionF... 352/431/542 64.9%/79.5% mrr 0.704 avgtime 4288.703
wq-test-ovt-  94ba475 2015-08-26 Merge branch 'f/labe... 792/1339/2032 39.0%/65.9% mrr 0.466 avgtime 10818.454
wq-test-ovt- u94ba475 2015-08-26 Merge branch 'f/labe... 696/1591/2032 34.3%/78.3% mrr 0.431 avgtime 10039.444
wq-test-ovt- v94ba475 2015-08-26 Merge branch 'f/labe... 778/1339/2032 38.3%/65.9% mrr 0.464 avgtime 10634.258
wq-train-ovt  94ba475 2015-08-26 Merge branch 'f/labe... 1622/2664/3778 42.9%/70.5% mrr 0.512 avgtime 54641.405
wq-train-ovt u94ba475 2015-08-26 Merge branch 'f/labe... 1451/3057/3778 38.4%/80.9% mrr 0.473 avgtime 52529.836
wq-train-ovt v94ba475 2015-08-26 Merge branch 'f/labe... 1637/2664/3778 43.3%/70.5% mrr 0.515 avgtime 54082.333