We promised to publish an analysis of the solutions that learn from the inputs given to them. We are terribly sorry for the delay.
This post also announces the winners of the special prizes.
See also the Russian version of this page.
We found 9 submissions to exhibit self-learning behavior. That is, their correctness rate improves as they are being continuously tested.
The idea is quite simple. Words belong to a limited dictionary, while nonwords are generated randomly. For this reason, any string that the program being tested sees repeatedly is more likely to be a word than a nonword. If the testing continues long enough, most words in the dictionary will have been seen repeatedly. Nonwords can repeat, too, but because the space from which they are drawn is much bigger, this happens far more rarely.
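As a rough illustration (a hypothetical sketch, not code from any actual submission), the core of the trick fits in a few lines:

```javascript
// Hypothetical sketch of the self-learning idea: remember every string
// seen so far; a repeat is strong evidence of a dictionary word, while
// unseen strings fall back to some base heuristic.
const seen = new Map();

function classify(str, baseHeuristic) {
  const count = seen.get(str) || 0;
  seen.set(str, count + 1);
  if (count > 0) return true; // seen before: almost certainly a word
  return baseHeuristic(str);  // first sighting: guess as usual
}

// Example with a trivial base heuristic that guesses "word" for short strings:
const guess = s => s.length <= 8;
console.log(classify('xylophones', guess)); // false: unseen, heuristic rejects
console.log(classify('xylophones', guess)); // true: repeat overrides heuristic
```

The base heuristic here stands in for whatever non-learning classifier a submission would otherwise use.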
The nine learning submissions that we received deserved a separate comparison, which is now presented here.
First, we decided to see how the learning solutions would perform if they weren't allowed to learn at all. We ran them in a special mode where the solution is restarted every time a new block contains a word already seen.
In the table below, the baseline correctness and error rates were measured on 10,000 blocks (the same amount as we used in the main section). The baseline rank is the overall rank a solution would have had if it hadn't been allowed to learn (that is, if we ran all solutions in restart mode).
The table is ordered by the baseline correctness rate. As you can see, some of these solutions show strong results even without the learning effects.
Baseline columns: 10,000 blocks, with restarts. Main columns: 10,000 blocks, without restarts.

# | ID | Baseline rank | Baseline correct | Baseline F- | Baseline F+ | Main rank | Main correct | Main F- | Main F+
---|---|---|---|---|---|---|---|---|---
1 | Balzac | 14 | 80.72% | 3.50% | 35.11% | 6 | 81.62% | 9.52% | 27.26% |
2 | Konstantin Akimov | 20 | 80.13% | 5.50% | 34.28% | 21 | 80.13% | 5.50% | 34.28% |
3 | AndSDev | 29 | 79.12% | 1.58% | 40.23% | 27 | 79.50% | 3.05% | 37.99% |
4 | Zavtramen | 30 | 79.06% | 13.55% | 28.35% | 15 | 80.71% | 9.92% | 28.67% |
5 | Anton Bulychev | 64 | 75.20% | 5.81% | 43.83% | 53 | 76.53% | 11.84% | 35.13% |
6 | rd.nvkz | 100 | 71.27% | 9.27% | 48.24% | 92 | 71.51% | 8.31% | 48.72% |
7 | Vladimir Prihojenko | 117 | 69.42% | 14.23% | 46.97% | 103 | 70.83% | 10.39% | 48.01% |
8 | Ihor Brazhnichenko | 122 | 67.99% | 14.93% | 49.13% | 76 | 73.86% | 24.68% | 27.61% |
9 | Aleksandr Plesovskikh | 235 | 49.93% | 100% | 0% | 168 | 63.03% | 68.30% | 5.49% |
The raw results of the run on 10,000 blocks with restarts are available as a JSON file.
It takes way more than 10,000 blocks for most dictionary words to appear at least twice each. To explore the learning potential fully, we used a much larger set: 1,000,000 blocks. (Arguably, we should have used it to test all submissions in the first place, but because many of the submissions weren't optimized for performance, it would have taken us months.) This larger dataset is an extension of the official testcase sequence, so the first 10,000 blocks are the same as in the main set.
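A back-of-the-envelope calculation supports this. Assuming, purely for illustration, a dictionary of 300,000 words, 100 strings per block with half of them being words, and uniform sampling (none of which is guaranteed by the actual test generator), a Poisson approximation gives:

```javascript
// Illustrative estimate only: D, wordsPerBlock, and uniform sampling
// are assumptions, not properties of the real test generator.
function fractionSeenTwice(blocks, D = 300000, wordsPerBlock = 50) {
  const lambda = (blocks * wordsPerBlock) / D; // expected sightings per word
  // Poisson approximation: P(a given word is sighted at least twice)
  return 1 - Math.exp(-lambda) * (1 + lambda);
}

console.log(fractionSeenTwice(10000));   // ~0.50: only about half the words
console.log(fractionSeenTwice(1000000)); // ~1.00: essentially every word
```

Under these assumptions, 10,000 blocks leave roughly half the dictionary unseen or seen only once, while 1,000,000 blocks cover it almost completely.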
When tested on such a large set, 4 of the 9 solutions ran out of memory and crashed. The table below, ordered by the last achieved correctness rate, summarizes each solution's final achievement; for those that crashed, the last achievement before the crash is shown. For performance reasons, our test system was configured to dump intermediate results only every 10,000 blocks, so the actual achievements of the crashing solutions are probably slightly better.
We discovered that Node.js behaves nondeterministically even on identical programs with identical inputs, which causes the same solution to crash at different points across repeated runs of the full test set. Below are the results of one such experiment; your results might differ slightly. Note that we had to run the testing harness with the --unsafe option; without it, the Node.js VM seemed to cause out-of-memory crashes much earlier.
# | ID | Blocks | Correct | F- | F+ |
---|---|---|---|---|---|
1 | rd.nvkz | 1,000,000 | 93.99% | 4.12% | 7.90% |
2 | AndSDev | 1,000,000 | 93.65% | 4.52% | 8.19% |
3 | Konstantin Akimov | 260,000 | 93.20% | 8.28% | 5.32% |
4 | Balzac | 1,000,000 | 88.94% | 1.80% | 20.31% |
5 | Zavtramen | 1,000,000 | 86.48% | 1.95% | 25.09% |
6 | Aleksandr Plesovskikh | 420,000 | 86.39% | 3.00% | 24.23% |
7 | Anton Bulychev | 1,000,000 | 83.84% | 6.03% | 26.28% |
8 | Ihor Brazhnichenko | 360,000 | 81.82% | 1.21% | 35.17% |
9 | Vladimir Prihojenko | 300,000 | 73.28% | 2.37% | 51.08% |
The following graph shows how the correctness rate of the solutions improves with the number of blocks processed. Note that the horizontal scale is logarithmic.
For the following graph, we calculated the correctness rate on the latest 10,000 blocks at each point. This highlights the effects of learning even better, showing how well the solution performs after it has learned, without the early learning period dragging down the average.
The following two graphs show how the false negative and false positive rates changed over time. Here, the rates are also marginal (that is, calculated for the latest 10,000 blocks at each point).
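The per-window computation behind these marginal graphs can be sketched as follows (a hypothetical sketch; the field names and window handling are our assumptions, not the actual test harness):

```javascript
// Compute marginal rates: for each point, correctness, false-negative and
// false-positive rates over only the most recent window of answers.
// Each result is assumed to look like { guess: bool, isWord: bool }.
function marginalRates(results, windowSize) {
  const points = [];
  for (let end = windowSize; end <= results.length; end += windowSize) {
    const win = results.slice(end - windowSize, end);
    const correct = win.filter(r => r.guess === r.isWord).length;
    const fn = win.filter(r => r.isWord && !r.guess).length;  // missed words
    const fp = win.filter(r => !r.isWord && r.guess).length;  // accepted nonwords
    const words = win.filter(r => r.isWord).length;
    const nonwords = win.length - words;
    points.push({
      correct: correct / win.length,
      falseNegative: words ? fn / words : 0,
      falsePositive: nonwords ? fp / nonwords : 0,
    });
  }
  return points;
}
```

Each point depends only on its own window, so early mistakes never dilute later measurements.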
Interestingly, some of the solutions get worse after an initial period of improvement. The graph of false positives suggests a likely explanation: because nonwords are generated randomly, they, too, can appear repeatedly (especially if they are short), although this happens much less frequently than with words. This leads to the accumulation of certain nonwords that the solution mistakes for words. Depending on the details of the learning algorithm, some solutions are more prone to this effect than others, and some seem to be completely immune to it.
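The birthday approximation makes the role of string length concrete (the uniform-generation assumption and all numbers here are illustrative only):

```javascript
// Expected number of colliding pairs among n uniform draws from a space
// of size S is roughly n*(n-1)/(2*S). For n comparable to S this
// overcounts, but the orders of magnitude still make the point.
function expectedCollisions(n, length, alphabet = 26) {
  const space = Math.pow(alphabet, length);
  return (n * (n - 1)) / (2 * space);
}

// Suppose a million random nonwords are generated over a long run:
console.log(expectedCollisions(1e6, 4));  // enormous: short strings repeat constantly
console.log(expectedCollisions(1e6, 12)); // negligible: long strings almost never do
```

So only short nonwords repeat often enough to poison a repetition-based learner, which matches the shape of the false-positive curves.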
The raw results of the run on 1,000,000 blocks are available as a JSON file.
We had declared that we would award two special prizes, and we decided to give them to the authors of the best learning solutions. There wasn't an official criterion for what counts as “best”, so here is our rationale.
One 400 USD special prize goes to rd.nvkz, whose solution achieved the highest correctness rate we have seen, 93.99%. The learning potential of this solution is impressive: it kept improving long after the marginal correctness rates of most other solutions stabilized or even started to degrade.
Another 400 USD special prize goes to AndSDev for the second-highest correctness rate, 93.65%. At 1,000,000 blocks, these two leading solutions were almost indistinguishable, so we think they both deserve the special prizes. Although this solution doesn't continue improving at 1,000,000 blocks, it does not degrade, either, keeping its high correctness rate stable.
Congratulations to the winners of the special awards!