Petr Baudis edited this page Dec 7, 2015 · 113 revisions

This page contains a variety of benchmark results of various features on various domains. To keep the size of this page in check, we remove obsolete measurements (either abandoned

Each dataset has train and test splits. Our primary comparison factors are Answer Production Recall (APR) and Mean Reciprocal Rank (MRR) on the test split (see below). (When talking to outsiders, accuracy-at-one is the easiest measure to use, but it is much noisier than MRR, so it's brittle for day-to-day evaluation.) Unless otherwise specified, our models are always retrained on the train split before measuring accuracy on the test split.

We are benchmarking on several datasets from two basic families. The TREC-based questions are general factoid questions of wide variety and character, answerable primarily from Wikipedia:

  • curated (https://github.com/brmson/dataset-factoid-curated) is a cleaned up version of the TREC dataset with some IRC-based questions from early user testing also added in. This is our primary "general QA" benchmark.
  • large2180 (in dev branch of https://github.com/brmson/dataset-factoid-curated) is a larger version of the TREC dataset that mixes some noisy questions into the curated dataset in the interest of having more data to train on. We use this dataset to test how our machine learning scales up.
  • trecnew-raw (in dev branch of https://github.com/brmson/dataset-factoid-curated) has just a test split and contains factoid questions like those in curated, but in their raw form rather than cleaned up and filtered, allowing a more realistic comparison with the "old" programs from the TREC challenges. We rarely use this dataset, just for some final benchmarks when writing papers.

Raw measurements for various historical commits are available from http://pasky.or.cz/dev/brmson/yodaqa-eval/ ...

The other family of datasets is originally based on WebQuestions, exhibiting more monotonous questions modelled around the Freebase knowledge base and always asking for entities (not, for example, for numbers):

Raw measurements for various historical commits are available from http://pasky.or.cz/dev/brmson/yodaqa-movies-eval/ ...

Typically, our tests are done using data/eval/train-and-eval.sh. When we need to benchmark without retraining, we use something like:

data/eval/_multistage-traineval.sh . trecnew-raw-test 0 0

Each split+commit combination results in three lines of data/eval/tsvout-stats.sh output. The important line for us is the one with the commit prefixed by u, as that's the initial pipeline stage (followup stages create user-friendly output, but typically senselessly overfit).
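As a rough illustration of picking out the u-prefixed line, here is a minimal Python sketch. The helper name primary_lines is hypothetical and not part of the YodaQA scripts; it assumes whitespace-separated columns with the commit in the second column, as in the result rows on this page.

```python
from typing import List


def primary_lines(stats_output: str) -> List[str]:
    """Keep only the rows whose commit field (second column) starts with 'u'."""
    kept = []
    for line in stats_output.splitlines():
        fields = line.split()
        # Commit ids are hexadecimal, so a leading 'u' can only be the stage prefix.
        if len(fields) >= 2 and fields[1].startswith("u"):
            kept.append(line)
    return kept


# Three rows for one split+commit combination, copied from the v1.4 baseline.
sample = """\
curated-test  2b85c94 2015-11-10 AnswerScoreDecisionF... 131/282/430 30.5%/65.6% mrr 0.401 avgtime 3596.696
curated-test u2b85c94 2015-11-10 AnswerScoreDecisionF... 136/334/430 31.6%/77.7% mrr 0.405 avgtime 3320.982
curated-test v2b85c94 2015-11-10 AnswerScoreDecisionF... 135/282/430 31.4%/65.6% mrr 0.409 avgtime 3524.656
"""

selected = primary_lines(sample)
for line in selected:
    print(line)
```

Rows without a stage prefix in the second column are simply skipped by this filter.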

The format of each line is:

dataset-split commit  commitdate Commit message          ans/irr/tot ACC1%/ APR% mrr 0.mrr avgtime xyz
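The row layout above can be read mechanically. The following is a sketch only: the regular expression, the StatsLine record and parse_stats_line are our own illustrative names, not part of data/eval/tsvout-stats.sh. The field names ans/irr/tot follow the format line; judging by the rows on this page, the second count is the one behind the APR percentage (for example 334/430 = 77.7%).

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern derived from the rows on this page; the (possibly
# truncated) commit message may contain spaces, so it is matched lazily.
LINE_RE = re.compile(
    r"^(?P<split>\S+)\s+(?P<commit>\S+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<message>.*?)\s+"
    r"(?P<ans>\d+)/(?P<irr>\d+)/(?P<tot>\d+)\s+"
    r"(?P<acc1>[\d.]+)%/\s*(?P<apr>[\d.]+)%\s+"
    r"mrr\s+(?P<mrr>[\d.]+)\s+avgtime\s+\S+\s*$"
)


class StatsLine(NamedTuple):
    split: str
    commit: str
    date: str
    message: str
    ans: int
    irr: int
    tot: int
    acc1: float
    apr: float
    mrr: float


def parse_stats_line(line: str) -> Optional[StatsLine]:
    """Return the parsed row, or None for rows that carry no measurements."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return StatsLine(
        split=m["split"], commit=m["commit"], date=m["date"], message=m["message"],
        ans=int(m["ans"]), irr=int(m["irr"]), tot=int(m["tot"]),
        acc1=float(m["acc1"]), apr=float(m["apr"]), mrr=float(m["mrr"]),
    )


row = parse_stats_line(
    "curated-test u2b85c94 2015-11-10 AnswerScoreDecisionF... "
    "136/334/430 31.6%/77.7% mrr 0.405 avgtime 3320.982"
)
print(row.ans, row.irr, row.tot, row.mrr)
# The percentages can be cross-checked against the raw counts.
print(round(100 * row.ans / row.tot, 1), round(100 * row.irr / row.tot, 1))
```

Rows such as "no answers generated" do not match the pattern and come back as None, so callers can skip them explicitly.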

ans and ACC1 are the accuracy-at-one: the number (and percentage) of questions where the top answer is correct (since we attempt to answer all questions, this would be precision@100 in DeepQA parlance). APR is Answer Production Recall, i.e. the number of questions where the correct answer is generated as a hypothesis. MRR is the mean of the reciprocal rank over all questions; a question with top answer correct will have RR=1, a question with second answer correct will have RR=0.5, etc. Ignore avgtime; unfortunately, it's currently garbage.
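To make the three measures concrete, here is a small Python sketch that computes them from per-question ranks. It is an illustration, not the actual evaluation code. It assumes each question is described by the 1-based rank of its first correct answer, or None when no correct answer was generated, and that such questions contribute RR=0 to the mean. The function name summarize and the toy ranks are made up for this example.

```python
from typing import Dict, Optional, Sequence


def summarize(ranks: Sequence[Optional[int]]) -> Dict[str, float]:
    """Compute accuracy-at-one, APR and MRR over all questions."""
    total = len(ranks)
    if total == 0:
        raise ValueError("need at least one question")
    # Accuracy-at-one: the top answer is the correct one.
    top1 = sum(1 for r in ranks if r == 1)
    # Answer Production Recall: a correct answer appears anywhere in the list.
    produced = sum(1 for r in ranks if r is not None)
    # Assumption: questions without a correct answer add 0 to the sum.
    mrr = sum(1.0 / r for r in ranks if r is not None) / total
    return {
        "ans": top1,
        "produced": produced,
        "tot": total,
        "acc1": top1 / total,
        "apr": produced / total,
        "mrr": mrr,
    }


# Toy data: five questions, one of them with no correct answer generated.
stats = summarize([1, 2, None, 1, 4])
print(stats)
```

For the toy ranks, two of five questions are answered at rank one, four of five have the correct answer somewhere among the hypotheses, and the reciprocal ranks 1, 0.5, 0, 1 and 0.25 average to 0.55.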

Baselines

TREC-based baseline

We use the master branch to measure TREC-based QA.

v1.4 --- curated APR 77.7%, MRR 0.405; large2180 APR 75.5%, MRR 0.379:

curated-test  2b85c94 2015-11-10 AnswerScoreDecisionF... 131/282/430 30.5%/65.6% mrr 0.401 avgtime 3596.696
curated-test u2b85c94 2015-11-10 AnswerScoreDecisionF... 136/334/430 31.6%/77.7% mrr 0.405 avgtime 3320.982
curated-test v2b85c94 2015-11-10 AnswerScoreDecisionF... 135/282/430 31.4%/65.6% mrr 0.409 avgtime 3524.656
curated-trai  2b85c94 2015-11-10 AnswerScoreDecisionF... 295/308/430 68.6%/71.6% mrr 0.699 avgtime 4083.608
curated-trai u2b85c94 2015-11-10 AnswerScoreDecisionF... 164/340/430 38.1%/79.1% mrr 0.480 avgtime 3639.840
curated-trai v2b85c94 2015-11-10 AnswerScoreDecisionF... 257/308/430 59.8%/71.6% mrr 0.649 avgtime 3968.469
large2180-te  2b85c94 2015-11-10 AnswerScoreDecisionF... 209/432/694 30.1%/62.2% mrr 0.387 avgtime 5137.313
large2180-te u2b85c94 2015-11-10 AnswerScoreDecisionF... 205/524/694 29.5%/75.5% mrr 0.379 avgtime 4828.143
large2180-te v2b85c94 2015-11-10 AnswerScoreDecisionF... 213/432/694 30.7%/62.2% mrr 0.389 avgtime 5076.575
large2180-tr  c92760c 2015-11-04 HIGHLEVEL.md constra... 726/926/1479 49.1%/62.6% mrr 0.543 avgtime 11820.064
large2180-tr uc92760c 2015-11-04 HIGHLEVEL.md constra... 453/1075/1479 30.6%/72.7% mrr 0.389 avgtime 10848.609
large2180-tr vc92760c 2015-11-04 HIGHLEVEL.md constra... 602/926/1479 40.7%/62.6% mrr 0.483 avgtime 11638.433

v1.3 --- curated APR 77.9%, MRR 0.413; large2180 APR 75.5%, MRR 0.390:

curated-test  88f39c2 2015-10-19 Mbprop.txt: Retrain ... 138/279/430 32.1%/64.9% mrr 0.407 avgtime 2947.311
curated-test u88f39c2 2015-10-19 Mbprop.txt: Retrain ... 144/335/430 33.5%/77.9% mrr 0.413 avgtime 2681.062
curated-test v88f39c2 2015-10-19 Mbprop.txt: Retrain ... 144/279/430 33.5%/64.9% mrr 0.418 avgtime 2874.982
curated-trai  88f39c2 2015-10-19 Mbprop.txt: Retrain ... 290/306/430 67.4%/71.2% mrr 0.691 avgtime 3725.780
curated-trai u88f39c2 2015-10-19 Mbprop.txt: Retrain ... 169/335/430 39.3%/77.9% mrr 0.479 avgtime 3295.355
curated-trai v88f39c2 2015-10-19 Mbprop.txt: Retrain ... 260/306/430 60.5%/71.2% mrr 0.649 avgtime 3611.334
large2180-te  88f39c2 2015-10-19 Mbprop.txt: Retrain ... 218/435/694 31.4%/62.7% mrr 0.392 avgtime 4509.625
large2180-te u88f39c2 2015-10-19 Mbprop.txt: Retrain ... 217/524/694 31.3%/75.5% mrr 0.390 avgtime 4223.223
large2180-te v88f39c2 2015-10-19 Mbprop.txt: Retrain ... 207/435/694 29.8%/62.7% mrr 0.382 avgtime 4450.021
large2180-tr  88f39c2 2015-10-19 Mbprop.txt: Retrain ... 729/916/1479 49.3%/61.9% mrr 0.544 avgtime 12199.161
large2180-tr u88f39c2 2015-10-19 Mbprop.txt: Retrain ... 468/1063/1479 31.6%/71.9% mrr 0.398 avgtime 11243.693
large2180-tr v88f39c2 2015-10-19 Mbprop.txt: Retrain ... 598/916/1479 40.4%/61.9% mrr 0.480 avgtime 12019.682

v1.2 --- curated APR 77.2%, MRR 0.439; large2180 APR 74.8%, MRR 0.411:

curated-test  0296763 2015-08-30 data/ml/biocrf/model... 146/287/430 34.0%/66.7% mrr 0.431 avgtime 2392.096
curated-test u0296763 2015-08-30 data/ml/biocrf/model... 152/332/430 35.3%/77.2% mrr 0.439 avgtime 2157.916
curated-test v0296763 2015-08-30 data/ml/biocrf/model... 151/287/430 35.1%/66.7% mrr 0.440 avgtime 2343.056
curated-trai  0296763 2015-08-30 data/ml/biocrf/model... 290/303/430 67.4%/70.5% mrr 0.689 avgtime 3887.648
curated-trai u0296763 2015-08-30 data/ml/biocrf/model... 181/332/430 42.1%/77.2% mrr 0.503 avgtime 3595.703
curated-trai v0296763 2015-08-30 data/ml/biocrf/model... 257/303/430 59.8%/70.5% mrr 0.644 avgtime 3816.893
large2180-te  0296763 2015-08-30 data/ml/biocrf/model... 224/439/694 32.3%/63.3% mrr 0.402 avgtime 3326.777
large2180-te u0296763 2015-08-30 data/ml/biocrf/model... 233/519/694 33.6%/74.8% mrr 0.411 avgtime 2994.481
large2180-te v0296763 2015-08-30 data/ml/biocrf/model... 221/439/694 31.8%/63.3% mrr 0.399 avgtime 3260.786
large2180-tr  0296763 2015-08-30 data/ml/biocrf/model... 735/925/1479 49.7%/62.5% mrr 0.551 avgtime 7906.924
large2180-tr u0296763 2015-08-30 data/ml/biocrf/model... 485/1052/1479 32.8%/71.1% mrr 0.406 avgtime 7057.941
large2180-tr v0296763 2015-08-30 data/ml/biocrf/model... 586/925/1479 39.6%/62.5% mrr 0.477 avgtime 7726.841

v1.1 --- curated APR 77.2%, MRR 0.409; large2180 APR 74.8%, MRR 0.398:

curated-test  76cc1af 2015-08-26 Merge branch 'master... 134/284/430 31.2%/66.0% mrr 0.405 avgtime 3460.146
curated-test u76cc1af 2015-08-26 Merge branch 'master... 135/332/430 31.4%/77.2% mrr 0.409 avgtime 3231.877
curated-test v76cc1af 2015-08-26 Merge branch 'master... 127/284/430 29.5%/66.0% mrr 0.397 avgtime 3411.869
curated-trai  76cc1af 2015-08-26 Merge branch 'master... 301/306/430 70.0%/71.2% mrr 0.705 avgtime 5815.394
curated-trai u76cc1af 2015-08-26 Merge branch 'master... 199/333/430 46.3%/77.4% mrr 0.538 avgtime 5533.997
curated-trai v76cc1af 2015-08-26 Merge branch 'master... 281/306/430 65.3%/71.2% mrr 0.677 avgtime 5747.069
large2180-te  76cc1af 2015-08-26 Merge branch 'master... 222/443/694 32.0%/63.8% mrr 0.408 avgtime 3622.175
large2180-te u76cc1af 2015-08-26 Merge branch 'master... 218/519/694 31.4%/74.8% mrr 0.398 avgtime 3285.847
large2180-te v76cc1af 2015-08-26 Merge branch 'master... 235/443/694 33.9%/63.8% mrr 0.416 avgtime 3556.244
large2180-tr  76cc1af 2015-08-26 Merge branch 'master... 752/927/1479 50.8%/62.7% mrr 0.558 avgtime 8455.257
large2180-tr u76cc1af 2015-08-26 Merge branch 'master... 498/1051/1479 33.7%/71.1% mrr 0.412 avgtime 7622.098
large2180-tr v76cc1af 2015-08-26 Merge branch 'master... 616/927/1479 41.6%/62.7% mrr 0.491 avgtime 8287.513
trecnew-raw-      ovt 2015-08-29 Merge branch 'master... 121/233/447 27.1%/52.1% mrr 0.346 avgtime 3756.961
trecnew-raw-      ovt 2015-08-29 Merge branch 'master... 118/272/447 26.4%/60.9% mrr 0.325 avgtime 3496.736
trecnew-raw-      ovt 2015-08-29 Merge branch 'master... 123/233/447 27.5%/52.1% mrr 0.345 avgtime 3681.780

v1.0 (the first YodaQA paper) --- curated APR 79.3%, MRR 0.420:

curated-test  0ae3b79 2015-04-14 Merge branch 'master... 137/292/430 31.9%/67.9% mrr 0.413 avgtime 6767.419
curated-test u0ae3b79 2015-04-14 Merge branch 'master... 139/341/430 32.3%/79.3% mrr 0.420 avgtime 6549.246
curated-test v0ae3b79 2015-04-14 Merge branch 'master... 138/292/430 32.1%/67.9% mrr 0.418 avgtime 6687.020
curated-trai  0ae3b79 2015-04-14 Merge branch 'master... 152/283/430 35.3%/65.8% mrr 0.454 avgtime 6566.500
curated-trai u0ae3b79 2015-04-14 Merge branch 'master... 131/329/430 30.5%/76.5% mrr 0.392 avgtime 6358.768
curated-trai v0ae3b79 2015-04-14 Merge branch 'master... 155/283/430 36.0%/65.8% mrr 0.456 avgtime 6492.669
trecnew-raw-      ovt 2015-04-14 Merge branch 'master... 118/237/447 26.4%/53.0% mrr 0.333 avgtime 6213.230
trecnew-raw-      ovt 2015-04-14 Merge branch 'master... 112/278/447 25.1%/62.2% mrr 0.323 avgtime 6056.471
trecnew-raw-      ovt 2015-04-14 Merge branch 'master... 112/237/447 25.1%/53.0% mrr 0.326 avgtime 6159.455

d/live baseline

We don't do day-to-day development on this baseline, but this section records performance evolution on the Bing-enabled version running at http://live.ailao.eu/.

Current version (v1.4):

large2180-te  6a040cb 2015-11-10 Merge remote-trackin... 255/470/694 36.7%/67.7% mrr 0.447 avgtime 8150.822
large2180-te u6a040cb 2015-11-10 Merge remote-trackin... 242/553/694 34.9%/79.7% mrr 0.439 avgtime 7758.923
large2180-te v6a040cb 2015-11-10 Merge remote-trackin... 245/470/694 35.3%/67.7% mrr 0.439 avgtime 8055.959
large2180-tr  6a040cb 2015-11-10 Merge remote-trackin... 766/981/1479 51.8%/66.3% mrr 0.579 avgtime 16448.392
large2180-tr u6a040cb 2015-11-10 Merge remote-trackin... 467/1131/1479 31.6%/76.5% mrr 0.409 avgtime 15269.455
large2180-tr v6a040cb 2015-11-10 Merge remote-trackin... 635/981/1479 42.9%/66.3% mrr 0.514 avgtime 16181.728

A bit later version:

large2180-te  35a4484 2015-10-16 Merge branch 'master... 260/469/694 37.5%/67.6% mrr 0.454 avgtime 11034.758
large2180-te u35a4484 2015-10-16 Merge branch 'master... 227/558/694 32.7%/80.4% mrr 0.422 avgtime 10687.774
large2180-te v35a4484 2015-10-16 Merge branch 'master... 261/469/694 37.6%/67.6% mrr 0.452 avgtime 10955.408
large2180-tr  35a4484 2015-10-16 Merge branch 'master... 759/996/1479 51.3%/67.3% mrr 0.581 avgtime 15775.905
large2180-tr u35a4484 2015-10-16 Merge branch 'master... 483/1131/1479 32.7%/76.5% mrr 0.418 avgtime 14665.273
large2180-tr v35a4484 2015-10-16 Merge branch 'master... 640/996/1479 43.3%/67.3% mrr 0.515 avgtime 15518.062
large2180-te  e5ed8a5 2015-09-10 Added one minute tim... 253/492/694 36.5%/70.9% mrr 0.456 avgtime 6951.470
large2180-te ue5ed8a5 2015-09-10 Added one minute tim... 235/557/694 33.9%/80.3% mrr 0.433 avgtime 6608.611
large2180-te ve5ed8a5 2015-09-10 Added one minute tim... 253/492/694 36.5%/70.9% mrr 0.455 avgtime 6857.989
large2180-tr  e5ed8a5 2015-09-10 Added one minute tim... 813/1013/1479 55.0%/68.5% mrr 0.605 avgtime 21314.917
large2180-tr ue5ed8a5 2015-09-10 Added one minute tim... 531/1152/1479 35.9%/77.9% mrr 0.445 avgtime 20472.813
large2180-tr ve5ed8a5 2015-09-10 Added one minute tim... 667/1013/1479 45.1%/68.5% mrr 0.535 avgtime 21075.419

Version running up to 2015-09-18:

large2180-te  f04cce6 2015-07-21 Merge branch 'master... 264/520/694 38.0%/74.9% mrr 0.477 avgtime 6248.368
large2180-te uf04cce6 2015-07-21 Merge branch 'master... 230/587/694 33.1%/84.6% mrr 0.430 avgtime 5976.657
large2180-te vf04cce6 2015-07-21 Merge branch 'master... 259/520/694 37.3%/74.9% mrr 0.474 avgtime 6166.965
large2180-tr  f04cce6 2015-07-21 Merge branch 'master... 599/1052/1479 40.5%/71.1% mrr 0.498 avgtime 12523.736
large2180-tr uf04cce6 2015-07-21 Merge branch 'master... 510/1191/1479 34.5%/80.5% mrr 0.437 avgtime 11852.452
large2180-tr vf04cce6 2015-07-21 Merge branch 'master... 585/1052/1479 39.6%/71.1% mrr 0.490 avgtime 12329.911

v1.2 with Bing search (live since 2015-09-18):

curated-test  e5ed8a5 2015-09-10 Added one minute tim... 178/319/430 41.4%/74.2% mrr 0.500 avgtime 5827.692
curated-test ue5ed8a5 2015-09-10 Added one minute tim... 167/360/430 38.8%/83.7% mrr 0.481 avgtime 5635.870
curated-test ve5ed8a5 2015-09-10 Added one minute tim... 177/319/430 41.2%/74.2% mrr 0.502 avgtime 5779.753
curated-trai  e5ed8a5 2015-09-10 Added one minute tim... 328/336/430 76.3%/78.1% mrr 0.772 avgtime 7043.856
curated-trai ue5ed8a5 2015-09-10 Added one minute tim... 196/364/430 45.6%/84.7% mrr 0.549 avgtime 6767.992
curated-trai ve5ed8a5 2015-09-10 Added one minute tim... 289/336/430 67.2%/78.1% mrr 0.720 avgtime 6963.149
large2180-te  e5ed8a5 2015-09-10 Added one minute tim... 253/492/694 36.5%/70.9% mrr 0.456 avgtime 6951.470
large2180-te ue5ed8a5 2015-09-10 Added one minute tim... 235/557/694 33.9%/80.3% mrr 0.433 avgtime 6608.611
large2180-te ve5ed8a5 2015-09-10 Added one minute tim... 253/492/694 36.5%/70.9% mrr 0.455 avgtime 6857.989
large2180-tr  e5ed8a5 2015-09-10 Added one minute tim... 813/1013/1479 55.0%/68.5% mrr 0.605 avgtime 21314.917
large2180-tr ue5ed8a5 2015-09-10 Added one minute tim... 531/1152/1479 35.9%/77.9% mrr 0.445 avgtime 20472.813
large2180-tr ve5ed8a5 2015-09-10 Added one minute tim... 667/1013/1479 45.1%/68.5% mrr 0.535 avgtime 21075.419

WebQuestions-based baseline

We primarily use the d/movies branch for WebQuestions style questions - this branch has disabled enwiki as a data source since our primary motivation in the movies-based questions is QA just on structured knowledge bases.

Also note that pipeline phase1 (v-prefixed commits) actually seems non-overfitted here. We haven't factored that into our reports or benchmark instructions yet, for simplicity, to keep a common approach for both the TREC-based and WQ-based scenarios. We'll probably drop this soon, though.

Master:

moviesD-test  7bbda27 2015-12-02 FocusGenerator addFo... 141/205/260 54.2%/78.8% mrr 0.614 avgtime 801.693
moviesD-test u7bbda27 2015-12-02 FocusGenerator addFo... 135/215/260 51.9%/82.7% mrr 0.604 avgtime 635.000
moviesD-test v7bbda27 2015-12-02 FocusGenerator addFo... 138/205/260 53.1%/78.8% mrr 0.613 avgtime 742.282
moviesD-trai  7bbda27 2015-12-02 FocusGenerator addFo... 454/513/624 72.8%/82.2% mrr 0.765 avgtime 2061.984
moviesD-trai u7bbda27 2015-12-02 FocusGenerator addFo... 356/527/624 57.1%/84.5% mrr 0.653 avgtime 1612.969
moviesD-trai v7bbda27 2015-12-02 FocusGenerator addFo... 413/513/624 66.2%/82.2% mrr 0.722 avgtime 1898.797

v1.4 --- moviesD APR 81.9%, MRR 0.590:

moviesD-test  e10cf37 2015-11-03 Mbprop.txt: Retrain ... 138/206/260 53.1%/79.2% mrr 0.609 avgtime 1571.417
moviesD-test ue10cf37 2015-11-03 Mbprop.txt: Retrain ... 130/213/260 50.0%/81.9% mrr 0.590 avgtime 1419.293
moviesD-test ve10cf37 2015-11-03 Mbprop.txt: Retrain ... 137/206/260 52.7%/79.2% mrr 0.609 avgtime 1512.312
moviesD-trai  e10cf37 2015-11-03 Mbprop.txt: Retrain ... 455/512/624 72.9%/82.1% mrr 0.766 avgtime 17632.270
moviesD-trai ue10cf37 2015-11-03 Mbprop.txt: Retrain ... 362/525/624 58.0%/84.1% mrr 0.658 avgtime 17203.644
moviesD-trai ve10cf37 2015-11-03 Mbprop.txt: Retrain ... 406/512/624 65.1%/82.1% mrr 0.715 avgtime 17474.994

v1.3 --- moviesC APR 79.0%, MRR 0.573; moviesD APR 76.5%, MRR 0.531; wq APR 75.7%, MRR 0.476:

moviesC-test  6eadf12 2015-10-18 Mbprop.txt: Retrain ... 118/173/233 50.6%/74.2% mrr 0.577 avgtime 876.665
moviesC-test u6eadf12 2015-10-18 Mbprop.txt: Retrain ... 119/184/233 51.1%/79.0% mrr 0.573 avgtime 739.296
moviesC-test v6eadf12 2015-10-18 Mbprop.txt: Retrain ... 121/173/233 51.9%/74.2% mrr 0.585 avgtime 819.947
moviesC-trai  6eadf12 2015-10-18 Mbprop.txt: Retrain ... 379/438/542 69.9%/80.8% mrr 0.742 avgtime 1829.013
moviesC-trai u6eadf12 2015-10-18 Mbprop.txt: Retrain ... 290/444/542 53.5%/81.9% mrr 0.619 avgtime 1466.149
moviesC-trai v6eadf12 2015-10-18 Mbprop.txt: Retrain ... 347/438/542 64.0%/80.8% mrr 0.700 avgtime 1689.706

moviesD-test  6c13b62 2015-10-19 +moviesD dataset... 127/190/260 48.8%/73.1% mrr 0.551 avgtime 630.581
moviesD-test u6c13b62 2015-10-19 +moviesD dataset... 117/199/260 45.0%/76.5% mrr 0.531 avgtime 482.417
moviesD-test v6c13b62 2015-10-19 +moviesD dataset... 124/190/260 47.7%/73.1% mrr 0.547 avgtime 571.359
moviesD-trai  6c13b62 2015-10-19 +moviesD dataset... 425/485/624 68.1%/77.7% mrr 0.719 avgtime 2140.000
moviesD-trai u6c13b62 2015-10-19 +moviesD dataset... 322/492/624 51.6%/78.8% mrr 0.595 avgtime 1735.939
moviesD-trai v6c13b62 2015-10-19 +moviesD dataset... 364/485/624 58.3%/77.7% mrr 0.658 avgtime 1984.069

wq-test-ovt-  6eadf12 2015-10-18 Mbprop.txt: Retrain ... 863/1393/2032 42.5%/68.6% mrr 0.502 avgtime 5812.585
wq-test-ovt- u6eadf12 2015-10-18 Mbprop.txt: Retrain ... 795/1538/2032 39.1%/75.7% mrr 0.476 avgtime 5122.649
wq-test-ovt- v6eadf12 2015-10-18 Mbprop.txt: Retrain ... 857/1393/2032 42.2%/68.6% mrr 0.499 avgtime 5606.749
wq-train-ovt  6eadf12 2015-10-18 Mbprop.txt: Retrain ... 1906/2773/3778 50.4%/73.4% mrr 0.582 avgtime 17218.725
wq-train-ovt u6eadf12 2015-10-18 Mbprop.txt: Retrain ... 1689/2968/3778 44.7%/78.6% mrr 0.531 avgtime 15051.915
wq-train-ovt v6eadf12 2015-10-18 Mbprop.txt: Retrain ... 1839/2773/3778 48.7%/73.4% mrr 0.566 avgtime 16566.801

v1.2, v1.1 (both same results) --- moviesC APR 75.5%, MRR 0.494; wq APR 67.3%, MRR 0.425:

moviesC-test  a770e5f 2015-08-21 Mark: label-lookup 1... 102/168/233 43.8%/72.1% mrr 0.509 avgtime 585.312
moviesC-test ua770e5f 2015-08-21 Mark: label-lookup 1... 95/176/233 40.8%/75.5% mrr 0.494 avgtime 447.181
moviesC-test va770e5f 2015-08-21 Mark: label-lookup 1... 104/168/233 44.6%/72.1% mrr 0.517 avgtime 530.785
moviesC-trai  a770e5f 2015-08-21 Mark: label-lookup 1... 313/388/542 57.7%/71.6% mrr 0.629 avgtime 1463.521
moviesC-trai ua770e5f 2015-08-21 Mark: label-lookup 1... 240/399/542 44.3%/73.6% mrr 0.522 avgtime 1176.910
moviesC-trai va770e5f 2015-08-21 Mark: label-lookup 1... 287/388/542 53.0%/71.6% mrr 0.596 avgtime 1351.434
wq-test-ovt-  8795cd0 2015-08-27 Merge remote-trackin... 757/1257/2032 37.3%/61.9% mrr 0.445 avgtime 5117.716
wq-test-ovt- u8795cd0 2015-08-27 Merge remote-trackin... 699/1368/2032 34.4%/67.3% mrr 0.425 avgtime 4516.366
wq-test-ovt- v8795cd0 2015-08-27 Merge remote-trackin... 749/1257/2032 36.9%/61.9% mrr 0.443 avgtime 4922.379
wq-train-ovt  8795cd0 2015-08-27 Merge remote-trackin... 1702/2486/3778 45.1%/65.8% mrr 0.522 avgtime 22590.390
wq-train-ovt u8795cd0 2015-08-27 Merge remote-trackin... 1519/2658/3778 40.2%/70.4% mrr 0.477 avgtime 21017.841
wq-train-ovt v8795cd0 2015-08-27 Merge remote-trackin... 1673/2486/3778 44.3%/65.8% mrr 0.510 avgtime 22058.533

Feature Experiments

This section will probably be quite fluid.

LAT by SV

Baseline:

moviesD-test  e10cf37 2015-11-03 Mbprop.txt: Retrain ... 138/206/260 53.1%/79.2% mrr 0.609 avgtime 1571.417
moviesD-test ue10cf37 2015-11-03 Mbprop.txt: Retrain ... 130/213/260 50.0%/81.9% mrr 0.590 avgtime 1419.293
moviesD-test ve10cf37 2015-11-03 Mbprop.txt: Retrain ... 137/206/260 52.7%/79.2% mrr 0.609 avgtime 1512.312
moviesD-trai  e10cf37 2015-11-03 Mbprop.txt: Retrain ... 455/512/624 72.9%/82.1% mrr 0.766 avgtime 17632.270
moviesD-trai ue10cf37 2015-11-03 Mbprop.txt: Retrain ... 362/525/624 58.0%/84.1% mrr 0.658 avgtime 17203.644
moviesD-trai ve10cf37 2015-11-03 Mbprop.txt: Retrain ... 406/512/624 65.1%/82.1% mrr 0.715 avgtime 17474.994
large2180-te  2b85c94 2015-11-10 AnswerScoreDecisionF... 209/432/694 30.1%/62.2% mrr 0.387 avgtime 5137.313
large2180-te u2b85c94 2015-11-10 AnswerScoreDecisionF... 205/524/694 29.5%/75.5% mrr 0.379 avgtime 4828.143
large2180-te v2b85c94 2015-11-10 AnswerScoreDecisionF... 213/432/694 30.7%/62.2% mrr 0.389 avgtime 5076.575
large2180-tr  c92760c 2015-11-04 HIGHLEVEL.md constra... 726/926/1479 49.1%/62.6% mrr 0.543 avgtime 11820.064
large2180-tr uc92760c 2015-11-04 HIGHLEVEL.md constra... 453/1075/1479 30.6%/72.7% mrr 0.389 avgtime 10848.609
large2180-tr vc92760c 2015-11-04 HIGHLEVEL.md constra... 602/926/1479 40.7%/62.6% mrr 0.483 avgtime 11638.433

LAT by SV nominalization in case of NSUBJ:

moviesD-test  43b438d 2015-11-12 AnswerScoreDecisionF... 127/203/260 48.8%/78.1% mrr 0.580 avgtime 1274.038
moviesD-test u43b438d 2015-11-12 AnswerScoreDecisionF... 130/213/260 50.0%/81.9% mrr 0.592 avgtime 1112.648
moviesD-test v43b438d 2015-11-12 AnswerScoreDecisionF... 131/203/260 50.4%/78.1% mrr 0.590 avgtime 1212.274
moviesD-test ud5233fa 2015-11-12 Merge remote-trackin... no answers generated
moviesD-trai  d5233fa 2015-11-12 Merge remote-trackin... 449/514/624 72.0%/82.4% mrr 0.761 avgtime 3496.390
moviesD-trai ud5233fa 2015-11-12 Merge remote-trackin... 355/525/624 56.9%/84.1% mrr 0.653 avgtime 3047.916
moviesD-trai vd5233fa 2015-11-12 Merge remote-trackin... 406/514/624 65.1%/82.4% mrr 0.714 avgtime 3338.435
large2180-te  d738df8 2015-11-12 LATBySV: Fix crash o... 206/434/694 29.7%/62.5% mrr 0.380 avgtime 4617.641
large2180-te ud738df8 2015-11-12 LATBySV: Fix crash o... 208/524/694 30.0%/75.5% mrr 0.382 avgtime 4313.823
large2180-te vd738df8 2015-11-12 LATBySV: Fix crash o... 206/434/694 29.7%/62.5% mrr 0.381 avgtime 4558.018
large2180-tr  d738df8 2015-11-12 LATBySV: Fix crash o... 718/921/1479 48.5%/62.3% mrr 0.544 avgtime 11262.440
large2180-tr ud738df8 2015-11-12 LATBySV: Fix crash o... 436/1075/1479 29.5%/72.7% mrr 0.383 avgtime 10310.996
large2180-tr vd738df8 2015-11-12 LATBySV: Fix crash o... 591/921/1479 40.0%/62.3% mrr 0.477 avgtime 11082.713

Merged.

Concept Selection Semantic Enrichment

Baseline:

moviesD-test  43b438d 2015-11-12 AnswerScoreDecisionF... 127/203/260 48.8%/78.1% mrr 0.580 avgtime 1274.038
moviesD-test u43b438d 2015-11-12 AnswerScoreDecisionF... 130/213/260 50.0%/81.9% mrr 0.592 avgtime 1112.648
moviesD-test v43b438d 2015-11-12 AnswerScoreDecisionF... 131/203/260 50.4%/78.1% mrr 0.590 avgtime 1212.274
moviesD-trai  d5233fa 2015-11-12 Merge remote-trackin... 449/514/624 72.0%/82.4% mrr 0.761 avgtime 3496.390
moviesD-trai ud5233fa 2015-11-12 Merge remote-trackin... 355/525/624 56.9%/84.1% mrr 0.653 avgtime 3047.916
moviesD-trai vd5233fa 2015-11-12 Merge remote-trackin... 406/514/624 65.1%/82.4% mrr 0.714 avgtime 3338.435

Including question/description relatedness score in the concept classifier:

moviesD-test  e9f8721 2015-11-15 ConceptClassifier: R... 137/206/260 52.7%/79.2% mrr 0.605 avgtime 1251.465
moviesD-test ue9f8721 2015-11-15 ConceptClassifier: R... 132/215/260 50.8%/82.7% mrr 0.594 avgtime 1095.514
moviesD-test ve9f8721 2015-11-15 ConceptClassifier: R... 136/206/260 52.3%/79.2% mrr 0.609 avgtime 1192.673
moviesD-trai  e9f8721 2015-11-15 ConceptClassifier: R... 457/514/624 73.2%/82.4% mrr 0.769 avgtime 3616.423
moviesD-trai ue9f8721 2015-11-15 ConceptClassifier: R... 367/527/624 58.8%/84.5% mrr 0.664 avgtime 3182.397
moviesD-trai ve9f8721 2015-11-15 ConceptClassifier: R... 414/514/624 66.3%/82.4% mrr 0.724 avgtime 3459.919

Merged.

Minor Tweaks

Baseline:

moviesD-test  ea71748 2015-12-01 Merge branch 'master... 139/207/260 53.5%/79.6% mrr 0.612 avgtime 1240.304
moviesD-test uea71748 2015-12-01 Merge branch 'master... 136/215/260 52.3%/82.7% mrr 0.603 avgtime 1132.611
moviesD-test vea71748 2015-12-01 Merge branch 'master... 134/207/260 51.5%/79.6% mrr 0.602 avgtime 1202.870
moviesD-trai  ea71748 2015-12-01 Merge branch 'master... 447/514/624 71.6%/82.4% mrr 0.761 avgtime 3977.617
moviesD-trai uea71748 2015-12-01 Merge branch 'master... 358/527/624 57.4%/84.5% mrr 0.656 avgtime 3621.058
moviesD-trai vea71748 2015-12-01 Merge branch 'master... 401/514/624 64.3%/82.4% mrr 0.713 avgtime 3855.856

Fix witness language multiplication of some branched properties:

moviesD-test  6b117b2 2015-12-01 Merge remote-trackin... 136/206/260 52.3%/79.2% mrr 0.601 avgtime 748.599
moviesD-test u6b117b2 2015-12-01 Merge remote-trackin... 133/215/260 51.2%/82.7% mrr 0.598 avgtime 588.144
moviesD-test v6b117b2 2015-12-01 Merge remote-trackin... 137/206/260 52.7%/79.2% mrr 0.606 avgtime 688.220
moviesD-trai  6b117b2 2015-12-01 Merge remote-trackin... 453/514/624 72.6%/82.4% mrr 0.766 avgtime 2215.680
moviesD-trai u6b117b2 2015-12-01 Merge remote-trackin... 355/527/624 56.9%/84.5% mrr 0.655 avgtime 1769.992
moviesD-trai v6b117b2 2015-12-01 Merge remote-trackin... 407/514/624 65.2%/82.4% mrr 0.720 avgtime 2056.199

Adding features on nature of focus in answers:

moviesD-test  7bbda27 2015-12-02 FocusGenerator addFo... 141/205/260 54.2%/78.8% mrr 0.614 avgtime 801.693
moviesD-test u7bbda27 2015-12-02 FocusGenerator addFo... 135/215/260 51.9%/82.7% mrr 0.604 avgtime 635.000
moviesD-test v7bbda27 2015-12-02 FocusGenerator addFo... 138/205/260 53.1%/78.8% mrr 0.613 avgtime 742.282
moviesD-trai  7bbda27 2015-12-02 FocusGenerator addFo... 454/513/624 72.8%/82.2% mrr 0.765 avgtime 2061.984
moviesD-trai u7bbda27 2015-12-02 FocusGenerator addFo... 356/527/624 57.1%/84.5% mrr 0.653 avgtime 1612.969
moviesD-trai v7bbda27 2015-12-02 FocusGenerator addFo... 413/513/624 66.2%/82.2% mrr 0.722 avgtime 1898.797

Merged.

Explorative FBpath (Glove-based)

Baseline:

moviesD-test  7bbda27 2015-12-02 FocusGenerator addFo... 141/205/260 54.2%/78.8% mrr 0.614 avgtime 801.693
moviesD-test u7bbda27 2015-12-02 FocusGenerator addFo... 135/215/260 51.9%/82.7% mrr 0.604 avgtime 635.000
moviesD-test v7bbda27 2015-12-02 FocusGenerator addFo... 138/205/260 53.1%/78.8% mrr 0.613 avgtime 742.282
moviesD-trai  7bbda27 2015-12-02 FocusGenerator addFo... 454/513/624 72.8%/82.2% mrr 0.765 avgtime 2061.984
moviesD-trai u7bbda27 2015-12-02 FocusGenerator addFo... 356/527/624 57.1%/84.5% mrr 0.653 avgtime 1612.969
moviesD-trai v7bbda27 2015-12-02 FocusGenerator addFo... 413/513/624 66.2%/82.2% mrr 0.722 avgtime 1898.797

Explorative instead of a priori (logistic regression labelling):

moviesD-test  e462e45 2015-12-04 Merge remote-trackin... 90/160/260 34.6%/61.5% mrr 0.421 avgtime 947.432
moviesD-test ue462e45 2015-12-04 Merge remote-trackin... 91/176/260 35.0%/67.7% mrr 0.425 avgtime 791.527
moviesD-test ve462e45 2015-12-04 Merge remote-trackin... 88/160/260 33.8%/61.5% mrr 0.415 avgtime 887.226
moviesD-trai  e462e45 2015-12-04 Merge remote-trackin... 353/418/624 56.6%/67.0% mrr 0.607 avgtime 2504.473
moviesD-trai ue462e45 2015-12-04 Merge remote-trackin... 252/427/624 40.4%/68.4% mrr 0.484 avgtime 2082.617
moviesD-trai ve462e45 2015-12-04 Merge remote-trackin... 299/418/624 47.9%/67.0% mrr 0.552 avgtime 2358.309

Explorative instead of generic (fetch all) (new baseline):

moviesD-test  c5805b9 2015-12-04 Merge remote-trackin... 137/200/260 52.7%/76.9% mrr 0.598 avgtime 840.640
moviesD-test uc5805b9 2015-12-04 Merge remote-trackin... 133/210/260 51.2%/80.8% mrr 0.588 avgtime 681.419
moviesD-test vc5805b9 2015-12-04 Merge remote-trackin... 136/200/260 52.3%/76.9% mrr 0.596 avgtime 781.449
moviesD-trai  c5805b9 2015-12-04 Merge remote-trackin... 439/512/624 70.4%/82.1% mrr 0.750 avgtime 2343.264
moviesD-trai uc5805b9 2015-12-04 Merge remote-trackin... 344/528/624 55.1%/84.6% mrr 0.635 avgtime 1883.171
moviesD-trai vc5805b9 2015-12-04 Merge remote-trackin... 392/512/624 62.8%/82.1% mrr 0.698 avgtime 2187.892

Fixed score-based ordering, mean score for 2-property paths:

moviesD-test  45af8db 2015-12-05 Merge remote-trackin... 138/204/260 53.1%/78.5% mrr 0.594 avgtime 1214.348
moviesD-test u45af8db 2015-12-05 Merge remote-trackin... 132/220/260 50.8%/84.6% mrr 0.584 avgtime 1043.612
moviesD-test v45af8db 2015-12-05 Merge remote-trackin... 134/204/260 51.5%/78.5% mrr 0.591 avgtime 1155.532
moviesD-trai  45af8db 2015-12-05 Merge remote-trackin... 446/513/624 71.5%/82.2% mrr 0.759 avgtime 3343.911
moviesD-trai u45af8db 2015-12-05 Merge remote-trackin... 332/537/624 53.2%/86.1% mrr 0.627 avgtime 2851.963
moviesD-trai v45af8db 2015-12-05 Merge remote-trackin... 396/513/624 63.5%/82.2% mrr 0.706 avgtime 3184.205

Limit also the number of 2-property paths, not just 1-prop paths (new baseline):

moviesD-test  c7418cc 2015-12-05 FBPathGloVeScoring: ... 136/207/260 52.3%/79.6% mrr 0.599 avgtime 1071.796
moviesD-test uc7418cc 2015-12-05 FBPathGloVeScoring: ... 137/220/260 52.7%/84.6% mrr 0.606 avgtime 902.943
moviesD-test vc7418cc 2015-12-05 FBPathGloVeScoring: ... 142/207/260 54.6%/79.6% mrr 0.611 avgtime 1010.326
moviesD-trai  c7418cc 2015-12-05 FBPathGloVeScoring: ... 447/509/624 71.6%/81.6% mrr 0.758 avgtime 2786.857
moviesD-trai uc7418cc 2015-12-05 FBPathGloVeScoring: ... 347/530/624 55.6%/84.9% mrr 0.643 avgtime 2326.904
moviesD-trai vc7418cc 2015-12-05 FBPathGloVeScoring: ... 402/509/624 64.4%/81.6% mrr 0.710 avgtime 2633.049

Try changing limit 15 -> 5:

moviesD-test  8c9b29e 2015-12-05 exploringPaths topPa... 128/204/260 49.2%/78.5% mrr 0.574 avgtime 810.969
moviesD-test u8c9b29e 2015-12-05 exploringPaths topPa... 124/217/260 47.7%/83.5% mrr 0.570 avgtime 655.753
moviesD-test v8c9b29e 2015-12-05 exploringPaths topPa... 129/204/260 49.6%/78.5% mrr 0.581 avgtime 751.032
moviesD-trai  8c9b29e 2015-12-05 exploringPaths topPa... 443/507/624 71.0%/81.2% mrr 0.752 avgtime 2121.804
moviesD-trai u8c9b29e 2015-12-05 exploringPaths topPa... 358/520/624 57.4%/83.3% mrr 0.648 avgtime 1696.961
moviesD-trai v8c9b29e 2015-12-05 exploringPaths topPa... 395/507/624 63.3%/81.2% mrr 0.700 avgtime 1966.618

Try disabling a priori fbpath question labelling:

moviesD-test  a8f31c3 2015-12-05 Try disabling a prio... 91/187/260 35.0%/71.9% mrr 0.440 avgtime 896.918
moviesD-test ua8f31c3 2015-12-05 Try disabling a prio... 80/206/260 30.8%/79.2% mrr 0.414 avgtime 735.018
moviesD-test va8f31c3 2015-12-05 Try disabling a prio... 84/187/260 32.3%/71.9% mrr 0.427 avgtime 834.787
moviesD-trai  a8f31c3 2015-12-05 Try disabling a prio... 358/458/624 57.4%/73.4% mrr 0.637 avgtime 2310.752
moviesD-trai ua8f31c3 2015-12-05 Try disabling a prio... 248/488/624 39.7%/78.2% mrr 0.503 avgtime 1885.626
moviesD-trai va8f31c3 2015-12-05 Try disabling a prio... 296/458/624 47.4%/73.4% mrr 0.568 avgtime 2160.346

Retrain explorative (GloVe) classifier using moviesD, include non-link relations (new baseline):

moviesD-test  a24f2f7 2015-12-06 Merge branch 'fbpath... 134/209/260 51.5%/80.4% mrr 0.598 avgtime 1685.146
moviesD-test ua24f2f7 2015-12-06 Merge branch 'fbpath... 132/218/260 50.8%/83.8% mrr 0.592 avgtime 1519.137
moviesD-test va24f2f7 2015-12-06 Merge branch 'fbpath... 135/209/260 51.9%/80.4% mrr 0.602 avgtime 1625.133
moviesD-trai  a24f2f7 2015-12-06 Merge branch 'fbpath... 442/511/624 70.8%/81.9% mrr 0.754 avgtime 4611.489
moviesD-trai ua24f2f7 2015-12-06 Merge branch 'fbpath... 356/530/624 57.1%/84.9% mrr 0.649 avgtime 4140.708
moviesD-trai va24f2f7 2015-12-06 Merge branch 'fbpath... 405/511/624 64.9%/81.9% mrr 0.712 avgtime 4455.004

Try disabling a priori fbpath question labelling:

moviesD-test  4d753b0 2015-12-05 Try disabling a prio... 98/184/260 37.7%/70.8% mrr 0.465 avgtime 783.514
moviesD-test u4d753b0 2015-12-05 Try disabling a prio... 95/209/260 36.5%/80.4% mrr 0.456 avgtime 609.502
moviesD-test v4d753b0 2015-12-05 Try disabling a prio... 88/184/260 33.8%/70.8% mrr 0.445 avgtime 712.502
moviesD-trai  4d753b0 2015-12-05 Try disabling a prio... 383/472/624 61.4%/75.6% mrr 0.670 avgtime 2193.474
moviesD-trai u4d753b0 2015-12-05 Try disabling a prio... 266/502/624 42.6%/80.4% mrr 0.519 avgtime 1765.960
moviesD-trai v4d753b0 2015-12-05 Try disabling a prio... 325/472/624 52.1%/75.6% mrr 0.602 avgtime 2041.186

Building witness-based relations:

moviesD-test  ee63449 2015-12-07 Merge branch 'fbpath... 122/206/260 46.9%/79.2% mrr 0.558 avgtime 1035.734
moviesD-test uee63449 2015-12-07 Merge branch 'fbpath... 116/219/260 44.6%/84.2% mrr 0.542 avgtime 919.843
moviesD-test vee63449 2015-12-07 Merge branch 'fbpath... 122/206/260 46.9%/79.2% mrr 0.562 avgtime 999.078
moviesD-trai  ee63449 2015-12-07 Merge branch 'fbpath... 436/508/624 69.9%/81.4% mrr 0.748 avgtime 3081.712
moviesD-trai uee63449 2015-12-07 Merge branch 'fbpath... 320/530/624 51.3%/84.9% mrr 0.614 avgtime 2694.448
moviesD-trai vee63449 2015-12-07 Merge branch 'fbpath... 390/508/624 62.5%/81.4% mrr 0.698 avgtime 2959.652

[Building witness-based relations] Try disabling a priori fbpath question labelling:

moviesD-test  05176c1 2015-12-05 Try disabling a prio... 93/184/260 35.8%/70.8% mrr 0.455 avgtime 906.356
moviesD-test u05176c1 2015-12-05 Try disabling a prio... 90/208/260 34.6%/80.0% mrr 0.442 avgtime 730.845
moviesD-test v05176c1 2015-12-05 Try disabling a prio... 94/184/260 36.2%/70.8% mrr 0.447 avgtime 845.308
moviesD-trai  05176c1 2015-12-05 Try disabling a prio... 371/457/624 59.5%/73.2% mrr 0.648 avgtime 2295.579
moviesD-trai u05176c1 2015-12-05 Try disabling a prio... 263/487/624 42.1%/78.0% mrr 0.519 avgtime 1871.413
moviesD-trai v05176c1 2015-12-05 Try disabling a prio... 331/457/624 53.0%/73.2% mrr 0.605 avgtime 2143.996

[Building witness-based relations] Improved question focus in "who did play X Y in Z":

moviesD-test  6ed5826 2015-12-07 question FocusGenera... 127/204/260 48.8%/78.5% mrr 0.573 avgtime 1072.878
moviesD-test u6ed5826 2015-12-07 question FocusGenera... 116/219/260 44.6%/84.2% mrr 0.543 avgtime 903.636
moviesD-test v6ed5826 2015-12-07 question FocusGenera... 126/204/260 48.5%/78.5% mrr 0.570 avgtime 1012.997
moviesD-trai  6ed5826 2015-12-07 question FocusGenera... 432/509/624 69.2%/81.6% mrr 0.742 avgtime 2833.613
moviesD-trai u6ed5826 2015-12-07 question FocusGenera... 316/531/624 50.6%/85.1% mrr 0.606 avgtime 2368.257
moviesD-trai v6ed5826 2015-12-07 question FocusGenera... 389/509/624 62.3%/81.6% mrr 0.697 avgtime 2676.699

Migrating Freebase from Fuseki to Virtuoso

Baseline:

moviesD-test  e10cf37 2015-11-03 Mbprop.txt: Retrain ... 138/206/260 53.1%/79.2% mrr 0.609 avgtime 1571.417
moviesD-test ue10cf37 2015-11-03 Mbprop.txt: Retrain ... 130/213/260 50.0%/81.9% mrr 0.590 avgtime 1419.293
moviesD-test ve10cf37 2015-11-03 Mbprop.txt: Retrain ... 137/206/260 52.7%/79.2% mrr 0.609 avgtime 1512.312
moviesD-trai  e10cf37 2015-11-03 Mbprop.txt: Retrain ... 455/512/624 72.9%/82.1% mrr 0.766 avgtime 17632.270
moviesD-trai ue10cf37 2015-11-03 Mbprop.txt: Retrain ... 362/525/624 58.0%/84.1% mrr 0.658 avgtime 17203.644
moviesD-trai ve10cf37 2015-11-03 Mbprop.txt: Retrain ... 406/512/624 65.1%/82.1% mrr 0.715 avgtime 17474.994

Migrated:

moviesD-test  ee93719 2015-11-08 Migrate Freebase fro... 138/199/260 53.1%/76.5% mrr 0.604 avgtime 1687.771
moviesD-test uee93719 2015-11-08 Migrate Freebase fro... 135/208/260 51.9%/80.0% mrr 0.589 avgtime 1540.271
moviesD-test vee93719 2015-11-08 Migrate Freebase fro... 134/199/260 51.5%/76.5% mrr 0.594 avgtime 1627.686
moviesD-trai  ee93719 2015-11-08 Migrate Freebase fro... 447/511/624 71.6%/81.9% mrr 0.758 avgtime 4067.582
moviesD-trai uee93719 2015-11-08 Migrate Freebase fro... 359/519/624 57.5%/83.2% mrr 0.656 avgtime 3648.616
moviesD-trai vee93719 2015-11-08 Migrate Freebase fro... 398/511/624 63.8%/81.9% mrr 0.707 avgtime 3901.511

This also involves (i) updating to BaseKB Gold (Freebase snapshot from April rather than January) and (ii) reducing topLinkedConcepts from 5 to 4 (as some of our queries were too large for Virtuoso when we had too many parallel concepts).

(work in progress - on the test split this is actually a slowdown, while the goal was a performance speedup)
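The avgtime columns of the Baseline and Migrated rows above can be compared directly. The following is a minimal Python sketch that only restates the logged numbers as relative changes; the helper name `relative_change` and the dictionary layout are illustrative, and the unit of avgtime is not assumed here.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return round(100.0 * (after - before) / before, 1)


# avgtime values copied from the unprefixed Baseline (e10cf37)
# and Migrated (ee93719) rows above.
baseline = {"moviesD-test": 1571.417, "moviesD-trai": 17632.270}
migrated = {"moviesD-test": 1687.771, "moviesD-trai": 4067.582}

changes = {name: relative_change(baseline[name], migrated[name]) for name in baseline}

for name, pct in changes.items():
    # A positive value means the migrated run logged a larger avgtime.
    print(f"{name}: {pct:+.1f}%")
```

On these rows the test split gets about 7% slower while the logged train avgtime drops sharply; the sketch does not say why the two splits differ.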

Hold-out Experiments

v1.1 TREC Hold-out Experiments

Note that the label-lookup and dectrees changes introduced before v1.1 did not improve performance on curated, but did improve it on movies, webquestions and large2180.

v1.1 with 12 instead of 6 search results per IR query --- curated APR 80.0%, MRR 0.440 (but ~12s -> 20s per question):

curated-test  5768167 2015-08-29 AnswerScoreDecisionF... 138/290/430 32.1%/67.4% mrr 0.425 avgtime 5754.981
curated-test u5768167 2015-08-29 AnswerScoreDecisionF... 152/344/430 35.3%/80.0% mrr 0.440 avgtime 5465.723  
curated-test v5768167 2015-08-29 AnswerScoreDecisionF... 139/290/430 32.3%/67.4% mrr 0.427 avgtime 5705.498  
curated-trai  597b437 2015-08-28 SolrFullPrimarySearc... 300/308/430 69.8%/71.6% mrr 0.706 avgtime 4601.686
curated-trai u597b437 2015-08-28 SolrFullPrimarySearc... 194/344/430 45.1%/80.0% mrr 0.532 avgtime 4255.334  
curated-trai v597b437 2015-08-28 SolrFullPrimarySearc... 284/308/430 66.0%/71.6% mrr 0.685 avgtime 4530.166  
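The result rows in this section all appear to follow one layout: dataset, commit id (optionally prefixed with u or v), date, truncated commit subject, three counts, two percentages, MRR and avgtime. Below is a minimal Python sketch for reading such a row. It assumes that the three counts are correct / recalled / total and that the two percentages are the first two counts divided by the third, which matches the arithmetic of the rows above. The field names are illustrative labels, not taken from the evaluation scripts.

```python
import re
from typing import NamedTuple, Tuple


class ResultLine(NamedTuple):
    dataset: str
    variant: str   # "", "u" or "v": the prefix in front of the commit id
    commit: str
    date: str
    correct: int   # assumed meaning: questions answered correctly at rank one
    recalled: int  # assumed meaning: questions with a correct answer anywhere
    total: int
    mrr: float
    avgtime: float


LINE_RE = re.compile(
    r"^(?P<dataset>\S+)\s+(?P<variant>[uv]?)(?P<commit>[0-9a-f]{7})\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+.*?\s"
    r"(?P<correct>\d+)/(?P<recalled>\d+)/(?P<total>\d+)\s+"
    r"[\d.]+%/[\d.]+%\s+mrr\s+(?P<mrr>[\d.]+)\s+avgtime\s+(?P<avgtime>[\d.]+)\s*$"
)


def parse_result_line(line: str) -> ResultLine:
    """Parse one benchmark row; raise ValueError if it does not fit the layout."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized result line: {line!r}")
    return ResultLine(
        dataset=m["dataset"], variant=m["variant"], commit=m["commit"],
        date=m["date"], correct=int(m["correct"]), recalled=int(m["recalled"]),
        total=int(m["total"]), mrr=float(m["mrr"]), avgtime=float(m["avgtime"]),
    )


def percentages(row: ResultLine) -> Tuple[float, float]:
    """Recompute the two percentage columns from the counts."""
    return (round(100.0 * row.correct / row.total, 1),
            round(100.0 * row.recalled / row.total, 1))


row = parse_result_line(
    "curated-test u5768167 2015-08-29 AnswerScoreDecisionF... "
    "152/344/430 35.3%/80.0% mrr 0.440 avgtime 5465.723"
)
print(row.variant, row.commit, percentages(row), row.mrr)
```

Recomputing the percentages from the counts is a cheap sanity check when copying rows by hand. The headline APR and MRR figures quoted in the captions of this section match the u-prefixed rows.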

v1.1 without IR from enwiki --- curated APR 42.1%, MRR 0.253 (but ~2.5s per question):

curated-test  8795cd0 2015-08-27 Merge remote-trackin... 91/156/430 21.2%/36.3% mrr 0.254 avgtime 1085.359
curated-test u8795cd0 2015-08-27 Merge remote-trackin... 85/181/430 19.8%/42.1% mrr 0.253 avgtime 939.621
curated-test v8795cd0 2015-08-27 Merge remote-trackin... 88/156/430 20.5%/36.3% mrr 0.253 avgtime 1037.196
curated-trai  8795cd0 2015-08-27 Merge remote-trackin... 165/184/430 38.4%/42.8% mrr 0.398 avgtime 863.772
curated-trai u8795cd0 2015-08-27 Merge remote-trackin... 129/188/430 30.0%/43.7% mrr 0.339 avgtime 671.766
curated-trai v8795cd0 2015-08-27 Merge remote-trackin... 154/184/430 35.8%/42.8% mrr 0.382 avgtime 795.719

v1.1 without IR from structured knowledge bases (DBpedia, Freebase) --- curated APR 70.7%, MRR 0.378:

curated-test  a9bf875 2015-08-29 YodaQA: -structured ... 103/273/430 24.0%/63.5% mrr 0.336 avgtime 2271.578
curated-test ua9bf875 2015-08-29 YodaQA: -structured ... 124/304/430 28.8%/70.7% mrr 0.378 avgtime 2073.512
curated-test va9bf875 2015-08-29 YodaQA: -structured ... 100/273/430 23.3%/63.5% mrr 0.337 avgtime 2201.423
curated-trai  a9bf875 2015-08-29 YodaQA: -structured ... 291/292/430 67.7%/67.9% mrr 0.678 avgtime 2735.233
curated-trai ua9bf875 2015-08-29 YodaQA: -structured ... 198/314/430 46.0%/73.0% mrr 0.522 avgtime 2475.768
curated-trai va9bf875 2015-08-29 YodaQA: -structured ... 262/292/430 60.9%/67.9% mrr 0.641 avgtime 2638.769

v1.1 without answer typing using external resources (WordNet, DBpedia) --- curated APR 77.2%, MRR 0.394:

curated-test  e36e53c 2015-08-29 AnswerAnalysis: Disa... 116/279/430 27.0%/64.9% mrr 0.373 avgtime 1768.499
curated-test ue36e53c 2015-08-29 AnswerAnalysis: Disa... 132/332/430 30.7%/77.2% mrr 0.394 avgtime 1564.041 
curated-test ve36e53c 2015-08-29 AnswerAnalysis: Disa... 118/279/430 27.4%/64.9% mrr 0.380 avgtime 1723.894
curated-trai  e36e53c 2015-08-29 AnswerAnalysis: Disa... 298/303/430 69.3%/70.5% mrr 0.698 avgtime 2563.871  
curated-trai ue36e53c 2015-08-29 AnswerAnalysis: Disa... 196/333/430 45.6%/77.4% mrr 0.530 avgtime 2302.198
curated-trai ve36e53c 2015-08-29 AnswerAnalysis: Disa... 267/303/430 62.1%/70.5% mrr 0.657 avgtime 2496.949  

v1.1 without entity linking --- curated APR 68.1%, MRR 0.318:

curated-test  ecb30e3 2015-08-29 QuestionAnalysis: -C... 90/261/430 20.9%/60.7% mrr 0.298 avgtime 1624.336
curated-test uecb30e3 2015-08-29 QuestionAnalysis: -C... 96/293/430 22.3%/68.1% mrr 0.318 avgtime 1451.379
curated-test vecb30e3 2015-08-29 QuestionAnalysis: -C... 91/261/430 21.2%/60.7% mrr 0.307 avgtime 1577.783  
curated-trai  ecb30e3 2015-08-29 QuestionAnalysis: -C... 277/280/430 64.4%/65.1% mrr 0.648 avgtime 2008.781
curated-trai uecb30e3 2015-08-29 QuestionAnalysis: -C... 187/299/430 43.5%/69.5% mrr 0.496 avgtime 1788.493
curated-trai vecb30e3 2015-08-29 QuestionAnalysis: -C... 262/280/430 60.9%/65.1% mrr 0.626 avgtime 1942.117

v1.1 without decision forest and label-lookup --- curated APR 79.3%, MRR 0.436; large2180 APR 76.5%, MRR 0.399:

curated-test  20ab096 2015-07-28 Merge commit '0e52a1... 124/286/430 28.8%/66.5% mrr 0.386 avgtime 5522.054
curated-test u20ab096 2015-07-28 Merge commit '0e52a1... 150/341/430 34.9%/79.3% mrr 0.436 avgtime 5242.850
curated-test v20ab096 2015-07-28 Merge commit '0e52a1... 121/286/430 28.1%/66.5% mrr 0.382 avgtime 5428.800
curated-trai  20ab096 2015-07-28 Merge commit '0e52a1... 198/298/430 46.0%/69.3% mrr 0.546 avgtime 4790.583
curated-trai u20ab096 2015-07-28 Merge commit '0e52a1... 154/332/430 35.8%/77.2% mrr 0.458 avgtime 4522.161
curated-trai v20ab096 2015-07-28 Merge commit '0e52a1... 188/298/430 43.7%/69.3% mrr 0.531 avgtime 4697.530
large2180-te  20ab096 2015-07-28 Merge commit '0e52a1... 187/438/694 26.9%/63.1% mrr 0.357 avgtime 3614.539
large2180-te u20ab096 2015-07-28 Merge commit '0e52a1... 218/531/694 31.4%/76.5% mrr 0.399 avgtime 3338.304
large2180-te v20ab096 2015-07-28 Merge commit '0e52a1... 181/438/694 26.1%/63.1% mrr 0.351 avgtime 3526.321
large2180-tr  20ab096 2015-07-28 Merge commit '0e52a1... 425/905/1479 28.7%/61.2% mrr 0.373 avgtime 12576.337
large2180-tr u20ab096 2015-07-28 Merge commit '0e52a1... 415/1058/1479 28.1%/71.5% mrr 0.366 avgtime 11938.729
large2180-tr v20ab096 2015-07-28 Merge commit '0e52a1... 408/905/1479 27.6%/61.2% mrr 0.367 avgtime 12385.858

v1.1 without decision forest, with label-lookup --- curated APR 77.2%, MRR 0.413; large2180 APR 74.8%, MRR 0.399:

curated-test  a6ee873 2015-08-21 Mark: label-lookup 1... 119/281/430 27.7%/65.3% mrr 0.372 avgtime 2388.535
curated-test ua6ee873 2015-08-21 Mark: label-lookup 1... 140/332/430 32.6%/77.2% mrr 0.413 avgtime 2170.687
curated-test va6ee873 2015-08-21 Mark: label-lookup 1... 114/281/430 26.5%/65.3% mrr 0.367 avgtime 2321.839
curated-trai  a6ee873 2015-08-21 Mark: label-lookup 1... 183/296/430 42.6%/68.8% mrr 0.521 avgtime 3267.536
curated-trai ua6ee873 2015-08-21 Mark: label-lookup 1... 165/333/430 38.4%/77.4% mrr 0.464 avgtime 2986.020
curated-trai va6ee873 2015-08-21 Mark: label-lookup 1... 184/296/430 42.8%/68.8% mrr 0.520 avgtime 3175.556
large2180-te  a6ee873 2015-08-21 Mark: label-lookup 1... 216/430/694 31.1%/62.0% mrr 0.386 avgtime 29212.673
large2180-te ua6ee873 2015-08-21 Mark: label-lookup 1... 221/519/694 31.8%/74.8% mrr 0.399 avgtime 28906.655
large2180-te va6ee873 2015-08-21 Mark: label-lookup 1... 208/430/694 30.0%/62.0% mrr 0.382 avgtime 29153.467
large2180-tr  a6ee873 2015-08-21 Mark: label-lookup 1... 465/895/1479 31.4%/60.5% mrr 0.404 avgtime 40675.033
large2180-tr ua6ee873 2015-08-21 Mark: label-lookup 1... 454/1051/1479 30.7%/71.1% mrr 0.381 avgtime 39922.785
large2180-tr va6ee873 2015-08-21 Mark: label-lookup 1... 476/895/1479 32.2%/60.5% mrr 0.407 avgtime 40524.531

v1.1 without a CRF-based passage answer producer --- curated APR 77.2%, MRR 0.433; large2180 APR 74.8%, MRR 0.399:

curated-test  3fd576a 2015-08-29 PassageAnalysis: -BI... 145/286/430 33.7%/66.5% mrr 0.431 avgtime 2982.463
curated-test u3fd576a 2015-08-29 PassageAnalysis: -BI... 150/332/430 34.9%/77.2% mrr 0.433 avgtime 2742.708
curated-test v3fd576a 2015-08-29 PassageAnalysis: -BI... 153/286/430 35.6%/66.5% mrr 0.445 avgtime 2911.970
curated-trai  3fd576a 2015-08-29 PassageAnalysis: -BI... 297/303/430 69.1%/70.5% mrr 0.697 avgtime 2634.163
curated-trai u3fd576a 2015-08-29 PassageAnalysis: -BI... 176/332/430 40.9%/77.2% mrr 0.491 avgtime 2315.214
curated-trai v3fd576a 2015-08-29 PassageAnalysis: -BI... 258/303/430 60.0%/70.5% mrr 0.645 avgtime 2531.022
large2180-te  3fd576a 2015-08-29 PassageAnalysis: -BI... 217/446/694 31.3%/64.3% mrr 0.408 avgtime 3381.604
large2180-te u3fd576a 2015-08-29 PassageAnalysis: -BI... 215/519/694 31.0%/74.8% mrr 0.399 avgtime 3048.320
large2180-te v3fd576a 2015-08-29 PassageAnalysis: -BI... 217/446/694 31.3%/64.3% mrr 0.407 avgtime 3290.635
large2180-tr  3fd576a 2015-08-29 PassageAnalysis: -BI... 723/910/1479 48.9%/61.5% mrr 0.541 avgtime 8509.359
large2180-tr u3fd576a 2015-08-29 PassageAnalysis: -BI... 474/1050/1479 32.0%/71.0% mrr 0.399 avgtime 7668.941
large2180-tr v3fd576a 2015-08-29 PassageAnalysis: -BI... 605/910/1479 40.9%/61.5% mrr 0.478 avgtime 8273.441

Let's explore the impact of the CRF a little further, comparing v1.1 with the NP-based answer hypothesis generator disabled (7d7b24d) against the same setup with the CRF disabled as well (5a7ae5e) --- then we can finally see a small MRR and APR drop in the u-prefixed rows (MRR 0.375 to 0.371, APR 64.9% to 63.5%), showing that the CRF contributes something:

curated-test  7d7b24d 2015-08-30 PassageAnalysis: -Ca... 117/253/430 27.2%/58.8% mrr 0.359 avgtime 1985.050
curated-test u7d7b24d 2015-08-30 PassageAnalysis: -Ca... 125/279/430 29.1%/64.9% mrr 0.375 avgtime 1801.975
curated-test v7d7b24d 2015-08-30 PassageAnalysis: -Ca... 121/253/430 28.1%/58.8% mrr 0.369 avgtime 1919.153
curated-trai  7d7b24d 2015-08-30 PassageAnalysis: -Ca... 305/308/430 70.9%/71.6% mrr 0.712 avgtime 2452.001
curated-trai u7d7b24d 2015-08-30 PassageAnalysis: -Ca... 211/319/430 49.1%/74.2% mrr 0.564 avgtime 2211.858
curated-trai v7d7b24d 2015-08-30 PassageAnalysis: -Ca... 274/308/430 63.7%/71.6% mrr 0.673 avgtime 2360.485

curated-test  5a7ae5e 2015-08-30 PassageAnalysis: als... 132/248/430 30.7%/57.7% mrr 0.377 avgtime 1774.094
curated-test u5a7ae5e 2015-08-30 PassageAnalysis: als... 128/273/430 29.8%/63.5% mrr 0.371 avgtime 1586.492
curated-test v5a7ae5e 2015-08-30 PassageAnalysis: als... 136/248/430 31.6%/57.7% mrr 0.386 avgtime 1705.106
curated-trai  5a7ae5e 2015-08-30 PassageAnalysis: als... 266/276/430 61.9%/64.2% mrr 0.627 avgtime 1903.655
curated-trai u5a7ae5e 2015-08-30 PassageAnalysis: als... 165/288/430 38.4%/67.0% mrr 0.462 avgtime 1667.754
curated-trai v5a7ae5e 2015-08-30 PassageAnalysis: als... 229/276/430 53.3%/64.2% mrr 0.578 avgtime 1813.072
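The size of that drop can be worked out from the two curated-test u-prefixed rows above. This is a small Python sketch; the `Row` class and `ablation_delta` helper are illustrative names, and APR is assumed to be the recalled count divided by the total, as the percentages in the rows suggest.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Row:
    recalled: int  # second count in the row (assumed: questions with a correct answer)
    total: int     # third count in the row
    mrr: float

    @property
    def apr(self) -> float:
        """Recall as a percentage of all questions."""
        return 100.0 * self.recalled / self.total


def ablation_delta(base: Row, ablated: Row) -> Tuple[float, float]:
    """Return (APR change in percentage points, MRR change); negative means a drop."""
    return (round(ablated.apr - base.apr, 1), round(ablated.mrr - base.mrr, 3))


# curated-test u7d7b24d: NP-based generator disabled, CRF still enabled
with_crf = Row(recalled=279, total=430, mrr=0.375)
# curated-test u5a7ae5e: NP-based generator and CRF both disabled
without_crf = Row(recalled=273, total=430, mrr=0.371)

delta = ablation_delta(with_crf, without_crf)
print(delta)  # (-1.4, -0.004)
```

That is six questions out of 430, so the effect is small relative to the noise one would expect on a test split of this size.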

So, could it be that the CRF is useless with the other generators mixed in? That is curious, let's try v1.1 with a retrained CRF model --- oh, curated APR 77.2%, MRR 0.439; large2180 APR 74.8%, MRR 0.411; oops:

curated-test  0296763 2015-08-30 data/ml/biocrf/model... 146/287/430 34.0%/66.7% mrr 0.431 avgtime 2392.096
curated-test u0296763 2015-08-30 data/ml/biocrf/model... 152/332/430 35.3%/77.2% mrr 0.439 avgtime 2157.916
curated-test v0296763 2015-08-30 data/ml/biocrf/model... 151/287/430 35.1%/66.7% mrr 0.440 avgtime 2343.056
curated-trai  0296763 2015-08-30 data/ml/biocrf/model... 290/303/430 67.4%/70.5% mrr 0.689 avgtime 3887.648
curated-trai u0296763 2015-08-30 data/ml/biocrf/model... 181/332/430 42.1%/77.2% mrr 0.503 avgtime 3595.703
curated-trai v0296763 2015-08-30 data/ml/biocrf/model... 257/303/430 59.8%/70.5% mrr 0.644 avgtime 3816.893
large2180-te  0296763 2015-08-30 data/ml/biocrf/model... 224/439/694 32.3%/63.3% mrr 0.402 avgtime 3326.777
large2180-te u0296763 2015-08-30 data/ml/biocrf/model... 233/519/694 33.6%/74.8% mrr 0.411 avgtime 2994.481
large2180-te v0296763 2015-08-30 data/ml/biocrf/model... 221/439/694 31.8%/63.3% mrr 0.399 avgtime 3260.786
large2180-tr  0296763 2015-08-30 data/ml/biocrf/model... 735/925/1479 49.7%/62.5% mrr 0.551 avgtime 7906.924
large2180-tr u0296763 2015-08-30 data/ml/biocrf/model... 485/1052/1479 32.8%/71.1% mrr 0.406 avgtime 7057.941
large2180-tr v0296763 2015-08-30 data/ml/biocrf/model... 586/925/1479 39.6%/62.5% mrr 0.477 avgtime 7726.841

So the whole issue is that at some point we needed to retrain the CRF model and forgot to. It is too late to fix this for v1.1, so we will tag the retrained version as v1.2 right after it.

v1.1 WebQuestions Hold-out Experiments

v1.2 without answer typing using external resources (WordNet, DBpedia) --- wq MRR 0.422 (so, this kind of typing is not very important when we already know the originating property):

wq-test-ovt-  4acbefc 2015-09-07 AnswerAnalysis: Disa... 732/1242/2032 36.0%/61.1% mrr 0.433 avgtime 3195.309
wq-test-ovt- u4acbefc 2015-09-07 AnswerAnalysis: Disa... 705/1368/2032 34.7%/67.3% mrr 0.422 avgtime 2743.912
wq-test-ovt- v4acbefc 2015-09-07 AnswerAnalysis: Disa... 747/1242/2032 36.8%/61.1% mrr 0.438 avgtime 3042.177
wq-train-ovt  4acbefc 2015-09-07 AnswerAnalysis: Disa... 1655/2479/3778 43.8%/65.6% mrr 0.511 avgtime 8228.916
wq-train-ovt u4acbefc 2015-09-07 AnswerAnalysis: Disa... 1501/2658/3778 39.7%/70.4% mrr 0.472 avgtime 6979.765
wq-train-ovt v4acbefc 2015-09-07 AnswerAnalysis: Disa... 1635/2479/3778 43.3%/65.6% mrr 0.502 avgtime 7784.849

v1.1 without decision forest and label-lookup --- moviesC APR 72.1%, MRR 0.449:

moviesC-test  fb80dc3 2015-08-20 data/eval/moviesC-*:... 92/157/233 39.5%/67.4% mrr 0.483 avgtime 842.395
moviesC-test ufb80dc3 2015-08-20 data/eval/moviesC-*:... 81/168/233 34.8%/72.1% mrr 0.449 avgtime 710.244
moviesC-test vfb80dc3 2015-08-20 data/eval/moviesC-*:... 93/157/233 39.9%/67.4% mrr 0.483 avgtime 789.272
moviesC-trai  fb80dc3 2015-08-20 data/eval/moviesC-*:... 205/350/542 37.8%/64.6% mrr 0.462 avgtime 1686.444
moviesC-trai ufb80dc3 2015-08-20 data/eval/moviesC-*:... 185/379/542 34.1%/69.9% mrr 0.429 avgtime 1432.278
moviesC-trai vfb80dc3 2015-08-20 data/eval/moviesC-*:... 207/350/542 38.2%/64.6% mrr 0.466 avgtime 1588.147

v1.1 without decision forest, with label-lookup --- moviesC APR 75.5%, MRR 0.468; wq APR 67.3%, MRR 0.408:

moviesC-test  0d660b4 2015-08-27 Merge remote-trackin... 94/161/233 40.3%/69.1% mrr 0.490 avgtime 788.321
moviesC-test u0d660b4 2015-08-27 Merge remote-trackin... 86/176/233 36.9%/75.5% mrr 0.468 avgtime 656.824
moviesC-test v0d660b4 2015-08-27 Merge remote-trackin... 94/161/233 40.3%/69.1% mrr 0.497 avgtime 735.070
moviesC-trai  0d660b4 2015-08-27 Merge remote-trackin... 217/365/542 40.0%/67.3% mrr 0.487 avgtime 1417.650
moviesC-trai u0d660b4 2015-08-27 Merge remote-trackin... 185/399/542 34.1%/73.6% mrr 0.438 avgtime 1148.684
moviesC-trai v0d660b4 2015-08-27 Merge remote-trackin... 215/365/542 39.7%/67.3% mrr 0.482 avgtime 1315.276
wq-test-ovt-  0d660b4 2015-08-27 Merge remote-trackin... 730/1232/2032 35.9%/60.6% mrr 0.433 avgtime 3639.533
wq-test-ovt- u0d660b4 2015-08-27 Merge remote-trackin... 665/1368/2032 32.7%/67.3% mrr 0.408 avgtime 3095.558
wq-test-ovt- v0d660b4 2015-08-27 Merge remote-trackin... 728/1232/2032 35.8%/60.6% mrr 0.431 avgtime 3462.939
wq-train-ovt  0d660b4 2015-08-27 Merge remote-trackin... 1525/2441/3778 40.4%/64.6% mrr 0.478 avgtime 11511.939
wq-train-ovt u0d660b4 2015-08-27 Merge remote-trackin... 1416/2658/3778 37.5%/70.4% mrr 0.456 avgtime 10022.556
wq-train-ovt v0d660b4 2015-08-27 Merge remote-trackin... 1498/2441/3778 39.7%/64.6% mrr 0.474 avgtime 11056.607

v1.1+enwiki with decision forest and label-lookup (just as a curious experiment) --- moviesC APR 84.5%, MRR 0.506; wq APR 78.3%, MRR 0.431:

moviesC-test  52cdd6c 2015-08-28 AnswerScoreDecisionF... 112/177/233 48.1%/76.0% mrr 0.565 avgtime 1738.979
moviesC-test u52cdd6c 2015-08-28 AnswerScoreDecisionF... 94/197/233 40.3%/84.5% mrr 0.506 avgtime 1581.404
moviesC-test v52cdd6c 2015-08-28 AnswerScoreDecisionF... 112/177/233 48.1%/76.0% mrr 0.568 avgtime 1703.425
moviesC-trai  52cdd6c 2015-08-28 AnswerScoreDecisionF... 388/431/542 71.6%/79.5% mrr 0.749 avgtime 4379.111
moviesC-trai u52cdd6c 2015-08-28 AnswerScoreDecisionF... 246/470/542 45.4%/86.7% mrr 0.553 avgtime 4003.825
moviesC-trai v52cdd6c 2015-08-28 AnswerScoreDecisionF... 352/431/542 64.9%/79.5% mrr 0.704 avgtime 4288.703
wq-test-ovt-  94ba475 2015-08-26 Merge branch 'f/labe... 792/1339/2032 39.0%/65.9% mrr 0.466 avgtime 10818.454
wq-test-ovt- u94ba475 2015-08-26 Merge branch 'f/labe... 696/1591/2032 34.3%/78.3% mrr 0.431 avgtime 10039.444
wq-test-ovt- v94ba475 2015-08-26 Merge branch 'f/labe... 778/1339/2032 38.3%/65.9% mrr 0.464 avgtime 10634.258
wq-train-ovt  94ba475 2015-08-26 Merge branch 'f/labe... 1622/2664/3778 42.9%/70.5% mrr 0.512 avgtime 54641.405
wq-train-ovt u94ba475 2015-08-26 Merge branch 'f/labe... 1451/3057/3778 38.4%/80.9% mrr 0.473 avgtime 52529.836
wq-train-ovt v94ba475 2015-08-26 Merge branch 'f/labe... 1637/2664/3778 43.3%/70.5% mrr 0.515 avgtime 54082.333