
Bugfixes in inference from Numpy dtypes #826

Merged: 8 commits into HypothesisWorks:master on Oct 8, 2017

Conversation

@Zac-HD Zac-HD commented Aug 29, 2017

Two of these predate #472, but the datetime bug was entirely me and would probably break a lot of Pandas things. Whoops!

  • Inspect dtype strategies to check they can generate all representable values
  • Write a test that any value we get from a dtype-derived strategy can be round-tripped in and out of an array of that dtype. (it's a good test, we just shouldn't do it at runtime IMO)
  • Investigate round-tripping of bytestrings ending in the zero byte

Zac-HD commented Aug 30, 2017

```
Found 4 benchmarks with significant difference at false discovery rate 0.0001
Different means for arrays10-valid=always-interesting=array_average: 2371.94 -> 1418.63. p=0.00000
Different means for arraysvar-valid=always-interesting=array_average: 1835.96 -> 1162.24. p=0.00000
Different means for arrays10-valid=usually-interesting=array_average: 2548.95 -> 1658.15. p=0.00000
Different means for arraysvar-valid=usually-interesting=array_average: 1958.75 -> 1039.24. p=0.00000
```

Of course this is good, but now I have to update the benchmarks too.

@DRMacIver (Member) commented:

> Of course this is good, but now I have to update the benchmarks too.

I just use `make update-all-benchmark-data` (I know you can't run `make`, but you can get the command from there).

Warning: It takes quite a while to run.

@DRMacIver (Member) commented:

I'm not totally sure why the benchmarks have improved. Were we previously drawing two bytes for int8?

Zac-HD commented Aug 31, 2017

> I'm not totally sure why the benchmarks have improved. Were we previously drawing two bytes for int8?

It must be drawing fewer values for some reason, because the size (in bits) of each integer has increased. I confess this is also surprising to me - I would expect this kind of boost for fixed-size text arrays, but I thought integers would be better at finding bugs while taking no less time.

Update: the changed benchmarks were all for the array_average interestingness condition (mean of array above random threshold). Now that we can generate the full range of representable integers, this condition is true of many more examples.

@DRMacIver (Member) left a review:

As well as the specific commentary, I think I'd like this to be a lot better tested than it currently is.

I had a bit of a think about how, and I think the thing we actually want to do is sneak some internal assertions into the from_dtype strategy (possibly enabled by some sort of debug flag?), so that rather than just doing a straight-up conversion to the resulting dtype, we then convert the result back to the original type and assert that it is equal. This will give us some nice explicit failures when this sort of silent truncation is happening.
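
A minimal sketch of that idea, assuming a hypothetical flag name and helper that do not exist in Hypothesis as written:

```python
import os

import numpy as np

# Hypothetical flag name for illustration only; a real implementation
# would pick its own spelling.
_CHECK_ROUNDTRIP = os.environ.get('HYPOTHESIS_DEBUG_DTYPE_ROUNDTRIP') == 'true'


def _assert_roundtrips(value, dtype):
    # Store the drawn value in a one-element array of the target dtype and
    # read it back; silent truncation shows up as an assertion failure.
    # (NaN would need special-casing, since it is unequal to itself.)
    arr = np.zeros(shape=1, dtype=dtype)
    arr[0] = value
    assert arr[0] == value, (value, arr[0], dtype)
    return value


def checked(strategy, dtype):
    # Wrap a dtype-inferred strategy so that every example is round-tripped
    # when the flag is on, and the strategy is returned untouched otherwise.
    if not _CHECK_ROUNDTRIP:
        return strategy
    return strategy.map(lambda v: _assert_roundtrips(v, dtype))
```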

```diff
@@ -53,19 +53,24 @@ def from_dtype(dtype):
     elif dtype.kind == u'c':
         result = st.complex_numbers()
     elif dtype.kind in (u'S', u'a'):
-        result = st.binary()
+        result = st.binary(max_size=dtype.itemsize or None)
```
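
For context on the `or None` fallback (my illustration, not part of the diff): an unsized bytes dtype reports an itemsize of zero, which is falsy, so it maps to an unbounded `binary()` strategy.

```python
import numpy as np

np.dtype('S5').itemsize  # 5 -- elements are fixed five-byte strings
np.dtype('S').itemsize   # 0 -- length not yet determined, so `or None` kicks in
```
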
@DRMacIver (Member) commented:

Does this have an off-by-one error? If numpy can store variable size byte strings then it must either be length prefixing or zero terminating them. I'd assume the latter. But in that case itemsize must be counting the zero byte at the end. There's also a question of what happens if we have internal zeros here.

@Zac-HD (Member, Author) replied:

The dtype for a variable-length string (unicode or bytestring) actually creates an array of fixed-length strings, but the limiting length is that of the longest string present when the array is initialised.

So the array is not variable size, but it's safe to generate any length string because the array doesn't exist yet!
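
A quick illustration of that behaviour (mine, not from the PR):

```python
import numpy as np

a = np.array(['aaaa', 'bb'])  # element length inferred from the longest input
a.dtype                       # dtype('<U4')
a[1]                          # 'bb' -- shorter entries come back unchanged
```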

@DRMacIver (Member) commented:

So what happens to the shorter strings? Do they get zero padded to the same length?

@Zac-HD (Member, Author) replied:

Internally, I assume so, but the Python API gives you `np.array(['aaaa', 'bb'])[1] == 'bb'` - the same strings you put in. However, assigning longer strings gets them truncated: `np.array(['aaaa', 'bb'], 'U2')[0] == 'aa'`; that's the shrinking issue I mentioned.
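
Expanding those one-liners into a runnable snippet; the trailing-null case is my addition, showing why such values can never be round-tripped:

```python
import numpy as np

# Assigning a longer string into an existing 'U2' array silently truncates it.
a = np.array(['aaaa', 'bb'], 'U2')
assert a[0] == 'aa'

# Trailing nulls are treated as padding and stripped on read-back, which is
# why the inferred strategies stop generating null-terminated strings.
b = np.array(['a\x00'], 'U2')
assert b[0] == 'a'
```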

```diff
     elif dtype.kind == u'U':
-        result = st.text()
+        if dtype.itemsize == 0:  # pointer to arbitrary-length string
```
@DRMacIver (Member) commented:

Is this really represented as an itemsize of zero? That seems very odd. I'd expect it to be a pointer size.

@Zac-HD (Member, Author) replied:

Misleading comment, sorry - this dtype is not actually a pointer; it has a fixed but as-yet-undetermined length. I'll update the comment to match.

Fun fact: you can also create dtypes for negative-length strings, which raise a MemoryError when you try to instantiate an array of them. (I do not propose to check for this.)

```diff
+        if dtype.itemsize == 0:  # pointer to arbitrary-length string
+            result = st.text()
+        else:  # unicode encoded as UTF-32
+            result = st.text(max_size=dtype.itemsize // 4)
```
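
The `// 4` comes from NumPy storing unicode text as four bytes per code point; a quick check (my illustration):

```python
import numpy as np

dt = np.dtype('U5')
dt.itemsize       # 20 -- four bytes per code point (UCS-4 / UTF-32)
dt.itemsize // 4  # 5  -- the maximum string length used for max_size
```
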
@DRMacIver (Member) commented:

Same comment as with binary. Is this not zero terminated?

@DRMacIver (Member) commented:

BTW as predicted I have now started running into this in #785

Zac-HD commented Sep 2, 2017

I don't actually care that much about the truncation thing - it's strictly a performance issue and doesn't affect example quality. On the other hand, the check would be fairly expensive at runtime (because there may often be lots of array elements) and wouldn't catch values we fail to generate anyway. Much like the debug trace in #815, this code predates our review process; I think review is sufficient to catch this kind of thing now, so I wouldn't write explicit tests - just review this patch and area of code carefully.

@DRMacIver (Member) commented:

> On the other hand, the check would be fairly expensive at runtime (because there may often be lots of array elements)

This is why I suggested putting it behind a debug flag. Something like the `HYPOTHESIS_INTERNAL_BRANCHCHECK` trick, which swaps in an expensive implementation when turned on and is otherwise free.

But I accept the argument that this doesn't need to be done in this pull request.

Zac-HD commented Sep 3, 2017

Short summary: all Numpy arrays of strings are of fixed length; unicode is four bytes per codepoint (I assume UTF-32) to make this work. Shorter strings are stored null-padded, so we no longer allow strings ending in null from our inferred strategies (to ensure values can be round-tripped).

String dtypes may or may not specify a length; if they don't, the length is fixed when the array is created, to fit the longest string present at that point. So strings are only "variable-length" in the sense that shorter strings are padded as described above.

@DRMacIver (Member) commented:

@Zac-HD I'm going to take the datetime part of this off your hands if you don't mind. It's causing a bunch of annoyance for the pandas work, so I'd like to get that fixed. Will leave the rest.

Zac-HD commented Sep 12, 2017

Fine by me! I'd appreciate it if you also cherry-picked the integers bug, because it's pretty nasty and will also impact Pandas.

@DRMacIver (Member) commented:

> I'd appreciate it if you also cherry-picked the integers bug, because it's pretty nasty and will also impact Pandas.

Happy to, but I'd like to do it in a separate PR. The datetime bug is actually blocking the release of pandas because it causes tests that should pass to fail; the integers bug just means we're not testing as much as we'd like.

DRMacIver added a commit that referenced this pull request Sep 12, 2017
…-resolution

Extract datetime resolution fix from #826
@Zac-HD Zac-HD force-pushed the np-dtype-fixes branch 3 times, most recently from dde052a to ddae4fe on September 14, 2017 at 05:45

Zac-HD commented Sep 26, 2017

After some more investigation, I've decided that further changes to datetime dtype strategies aren't worth it:

  • Shrinking to the 1970 epoch is more natural than 2000 (as the `datetimes()` strategy does), because of the underlying representation as an integer. This also obviates the special-casing for higher resolutions (picosecond, etc.) that can't represent the Y2K epoch; see the sketch after this list.
  • I think my other problem is actually a repr bug, as I couldn't trigger it with any value-related tests. I'll try to track it down sometime, but if it exists at all it's not in Hypothesis.
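
A small check of the epoch point above (my illustration): `np.datetime64` stores an integer offset from the Unix epoch, so zero is 1970-01-01 at every resolution.

```python
import numpy as np

assert np.datetime64(0, 's') == np.datetime64('1970-01-01T00:00:00')
# Even at picosecond resolution, where an int64 only covers a few months
# around the epoch, zero is representable and is the natural minimal value.
assert np.datetime64(0, 'ps') == np.datetime64('1970-01-01', 'ps')
```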

Which means this is ready to merge!

Zac-HD commented Sep 27, 2017

Naturally, this revealed some bugs in our tests: invalid sizes leading to an overflow in RangeIndex construction, and the fact that uint64 is broken in Pandas before version 0.20. Hopefully that will be all.

Excludes NaN, because it's unequal to itself and nan-checks that handle
timestamps etc are nontrivial.
Excludes compound, record, and nested dtypes too.

@DRMacIver (Member) left a review:

Bunch of small nitpicks (mostly questions) but generally LGTM. Sorry for the long review delay!

```diff
@@ -1681,7 +1681,7 @@ def try_convert(typ, value, name):
                 name, value, type(value).__name__, typ.__name__
             )
         )
-    except ValueError:
+    except (OverflowError, ValueError):
```
@DRMacIver (Member) commented:

What caused this to become necessary?

@Zac-HD (Member, Author) replied:

Somewhere in Pandas index handling, from vague memory in uint64 or datetime64, casting Python integers to the specific type raises OverflowError. This seemed like the best place to handle it, and already has the right error message.
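
For instance (my illustration of the failure mode, not the exact Pandas call site), converting an out-of-range Python int to a fixed-width NumPy integer raises OverflowError rather than ValueError:

```python
import numpy as np

try:
    np.int64(2 ** 63)  # one past the largest representable int64
except OverflowError as err:
    print(type(err).__name__, err)
```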

```python
# values are safe, not known type coercion.
arr = np.zeros(shape=1, dtype=dtype)
ex = data.draw(nps.from_dtype(arr.dtype))
assume(ex == ex)  # NaN is uninteresting for this test
```
@DRMacIver (Member) commented:

Maybe worth noting in the comment that we're using this test because we might have non-numeric types where np.isnan would error.

```diff
@@ -47,10 +47,11 @@ def test_unique_indexes_of_small_values(ix):
     assert len(set(ix)) == len(ix)


 int64s = npst.from_dtype(np.dtype(int))
 # Sizes that fit into an int64 without overflow
 range_sizes = st.integers(0, 2 ** 63 - 100)
```

@DRMacIver (Member) commented:

Why the -100 rather than -1?

RELEASE.rst (outdated diff)

```rst
This patch improves the quality of strategies inferred from Numpy dtypes:

* Signed integer dtypes generated examples with some upper bits set to zero
```
@DRMacIver (Member) commented:

Just to check, did this mean we were never producing negative values? If so, ouch. That's pretty bad. Or was it just the high non-sign bits? Might be worth making that clearer here.

```rst
  length, which should improve example and shrink quality.
* Numpy arrays can only store fixed-size strings internally, and allow shorter
  strings by right-padding them with null bytes. Inferred string strategies
  no longer generate such values, as they can never be retrieved from an array.
```
@DRMacIver (Member) commented:

Unclear what impact this has to the user. It's basically just a performance improvement, right?

@Zac-HD (Member, Author) replied:

Yep - better shrinking because we know not to try null-terminated strings. I've added a line to that effect below.

Zac-HD added 2 commits October 7, 2017 23:23
This is actually possible to hit because RangeIndex can be lazy, and
therefore the maximum possible value is a fairly obvious thing to feed
it. Now that no longer breaks things.
@Zac-HD Zac-HD merged commit e2a87bc into HypothesisWorks:master Oct 8, 2017
@Zac-HD Zac-HD deleted the np-dtype-fixes branch August 17, 2018 23:43