Bugfixes in inference from Numpy dtypes #826
Conversation
Of course this is good, but now I have to update the benchmarks too.

I just use `…`. Warning: it takes quite a while to run.

I'm not totally sure why the benchmarks have improved. Were we previously drawing two bytes for int8?

It must be drawing fewer values for some reason, because the size (in bits) of each integer has increased. I confess this is also surprising to me - I would expect this kind of boost for fixed-size text arrays, but thought integers would be better at finding bugs while taking no less time. Update: the changed benchmarks were all for the `…`.
As well as the specific comments, I think I'd like this to be a lot better tested than it is. I had a bit of a think about how, and I think what we actually want to do is sneak some internal assertions into the `from_dtype` strategy (possibly enabled by some sort of debug flag?), so that rather than just doing a straight-up conversion to the resulting dtype, we then convert the result back to the original type and assert that it is equal. This will give us some nice explicit failures when this sort of silent truncation is happening.
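As an illustration of that idea, here is a minimal sketch of such a round-trip assertion; the `from_dtype_with_check` name, the `debug` flag, and the integer-only wiring are hypothetical, not Hypothesis's actual API:

```python
import numpy as np
import hypothesis.strategies as st

def from_dtype_with_check(dtype, debug=False):
    # iinfo bounds keep generated ints representable in the target dtype
    info = np.iinfo(dtype)
    strategy = st.integers(info.min, info.max)

    def convert(value):
        converted = dtype.type(value)
        if debug:
            # Silent truncation would make the round-tripped value differ
            assert int(converted) == value, (value, converted)
        return converted

    return strategy.map(convert)
```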
src/hypothesis/extra/numpy.py (outdated)

```diff
@@ -53,19 +53,24 @@ def from_dtype(dtype):
     elif dtype.kind == u'c':
         result = st.complex_numbers()
     elif dtype.kind in (u'S', u'a'):
-        result = st.binary()
+        result = st.binary(max_size=dtype.itemsize or None)
```
Does this have an off-by-one error? If numpy can store variable-size byte strings then it must be either length-prefixing or zero-terminating them; I'd assume the latter. But in that case itemsize must be counting the zero byte at the end. There's also a question of what happens if we have internal zeros here.
The dtype for a variable-length string (unicode or bytestring) actually creates an array of fixed-length strings, but the limiting length is that of the longest string present when the array is initialised.
So the array is not variable size, but it's safe to generate any length string because the array doesn't exist yet!
So what happens to the shorter strings? Do they get zero-padded to the same length?
Internally, I assume so, but the Python API gives you `np.array(['aaaa', 'bb'])[1] == 'bb'` - the same strings you put in. However, assigning longer strings gets them truncated: `np.array(['aaaa', 'bb'], 'U2')[0] == 'aa'`; that's the shrinking issue I mentioned.
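For concreteness, the behaviour described above can be reproduced like this (a sketch against NumPy 1.x semantics, not part of the patch):

```python
import numpy as np

a = np.array(['aaaa', 'bb'])        # dtype is sized to fit the longest input
assert a.dtype == np.dtype('<U4')
assert a[1] == 'bb'                 # shorter strings come back unchanged

b = np.array(['aaaa', 'bb'], 'U2')  # an explicit length truncates silently
assert b[0] == 'aa'
```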
src/hypothesis/extra/numpy.py (outdated)

```diff
     elif dtype.kind == u'U':
-        result = st.text()
+        if dtype.itemsize == 0:  # pointer to arbitrary-length string
```
Is this really represented as an itemsize of zero? That seems very odd. I'd expect it to be a pointer size.
Misleading comment, sorry - this dtype is not actually a pointer; it's a fixed but as-yet-undetermined length. I'll update the comment to match.
Fun fact: you can also create dtypes for negative-length strings, which raise a MemoryError when you try to instantiate an array of them. (I do not propose to check for this)
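To illustrate the itemsize behaviour (checked against NumPy 1.x; not part of the patch):

```python
import numpy as np

# 'U' with no length is a flexible dtype: the length is only fixed later,
# when an array is actually created.
assert np.dtype('U').itemsize == 0
# 'U5' stores five UCS-4 code points, i.e. 4 bytes each:
assert np.dtype('U5').itemsize == 20
assert np.dtype('S5').itemsize == 5   # bytestrings are one byte per character
```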
src/hypothesis/extra/numpy.py (outdated)

```diff
+        if dtype.itemsize == 0:  # pointer to arbitrary-length string
+            result = st.text()
+        else:  # unicode encoded as UTF-32
+            result = st.text(max_size=dtype.itemsize // 4)
```
Same comment as with binary. Is this not zero terminated?
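For what it's worth, a quick check suggests there is no terminator byte - trailing nulls are padding that gets stripped on read (illustrative, NumPy 1.x semantics):

```python
import numpy as np

arr = np.zeros(1, dtype='U2')
arr[0] = 'xy'
assert arr[0] == 'xy'                # no byte is reserved for a terminator
assert np.dtype('U2').itemsize == 8  # 2 code points * 4 bytes, no +1
```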
BTW, as predicted I have now started running into this in #785.

I don't actually care that much about the truncation thing - it's strictly a performance issue and doesn't affect example quality. On the other hand, the check would be fairly expensive at runtime (because there may often be lots of array elements) and wouldn't catch values that were never generated anyway. Much like the debug trace in #815, this predates our review process; I think that's sufficient to catch it now, and I wouldn't write explicit tests - just review this patch and this area of code carefully.

This is why I suggested putting it behind a debug flag. Something like the `…`. But I accept the argument that this doesn't need to be done in this pull request.
(force-pushed from 0355bc2 to b6c76ec)
Short summary: all Numpy arrays of strings are of fixed length; unicode is stored as four bytes per codepoint (I assume UTF-32) to make this work. Shorter strings are stored null-padded, so our inferred strategies no longer generate strings ending in a null (to ensure values can be round-tripped). String dtypes may or may not specify a length; if not, the length is fixed when the array is created, to fit the longest string then present. Strings can only appear to be variable-length because shorter strings are padded as above.
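A minimal demonstration of the round-trip constraint mentioned above (illustrative, NumPy 1.x semantics):

```python
import numpy as np

# Strings ending in a null byte cannot round-trip through an array,
# because the trailing null is indistinguishable from padding:
arr = np.array(['ab\x00'], dtype='U3')
assert arr[0] == 'ab'        # the trailing null has been stripped
assert arr[0] != 'ab\x00'
```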
@Zac-HD I'm going to take the datetime part of this off your hands if you don't mind. It's causing a bunch of annoyance for the pandas work, so I'd like to get that fixed. I'll leave the rest.

Fine by me! I'd appreciate it if you also cherry-picked the integers bug, because it's pretty nasty and will also impact Pandas.

Happy to, but I'd like to do it in a separate PR. The datetime bug is actually blocking the pandas release because it causes tests that should pass to fail; the integers bug just means we're not testing as much as we'd like.
Extract datetime resolution fix from #826 (…-resolution)
(force-pushed from dde052a to ddae4fe)
After some more investigation, I've decided that further changes to datetime dtype strategies aren't worth it. Which means this is ready to merge!
(force-pushed from ddae4fe to 0464cfa)
Naturally, this revealed some bugs in our tests: invalid sizes leading to an overflow in RangeIndex construction, and that `…`.
Excludes NaN, because it's unequal to itself and NaN-checks that handle timestamps etc. are nontrivial. Also excludes compound, record, and nested dtypes.
(force-pushed from 041e90e to 1c9d8ad)
(force-pushed from 1c9d8ad to 28bec79)
A bunch of small nitpicks (mostly questions), but generally LGTM. Sorry for the long review delay!
```diff
@@ -1681,7 +1681,7 @@ def try_convert(typ, value, name):
                 name, value, type(value).__name__, typ.__name__
             )
         )
-    except ValueError:
+    except (OverflowError, ValueError):
```
What caused this to become necessary?
Somewhere in Pandas index handling (from vague memory, in uint64 or datetime64), casting Python integers to the specific type raises OverflowError. This seemed like the best place to handle it, and it already has the right error message.
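A hedged illustration of the failure mode (the exact pandas call site is elided above; this just shows the cast behaviour):

```python
import numpy as np

# An out-of-range Python int raises OverflowError, not ValueError, when
# cast to a fixed-width NumPy type, so the except clause needs both:
try:
    np.int64(2 ** 64)
except OverflowError:
    print("too large for int64")  # message varies across NumPy versions
```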
tests/numpy/test_gen_data.py (outdated)

```diff
     # values are safe, not known type coercion.
     arr = np.zeros(shape=1, dtype=dtype)
     ex = data.draw(nps.from_dtype(arr.dtype))
+    assume(ex == ex)  # NaN is uninteresting for this test
```
Maybe worth noting in the comment that we're using this test because we might have non-numeric types where `np.isnan` would error.
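For context, a small illustration of why `ex == ex` works where `np.isnan` does not (illustrative only):

```python
import numpy as np

# `x == x` is False for both NaN and NaT, while np.isnan rejects
# non-numeric dtypes outright:
assert not (np.float64('nan') == np.float64('nan'))
assert not (np.datetime64('NaT') == np.datetime64('NaT'))
# np.isnan(np.datetime64('NaT'))  -> raises TypeError on many NumPy versions
```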
tests/pandas/test_indexes.py (outdated)

```diff
@@ -47,10 +47,11 @@ def test_unique_indexes_of_small_values(ix):
     assert len(set(ix)) == len(ix)


 int64s = npst.from_dtype(np.dtype(int))
+# Sizes that fit into an int64 without overflow
+range_sizes = st.integers(0, 2 ** 63 - 100)
```
Why the -100 rather than -1?
RELEASE.rst (outdated)

```
This patch improves the quality of strategies inferred from Numpy dtypes:

* Signed integer dtypes generated examples with some upper bits set to zero
```
Just to check, did this mean we were never producing negative values? If so, ouch. That's pretty bad. Or was it just the high non-sign bits? Might be worth making that clearer here.
```
  length, which should improve example and shrink quality.
* Numpy arrays can only store fixed-size strings internally, and allow shorter
  strings by right-padding them with null bytes. Inferred string strategies
  no longer generate such values, as they can never be retrieved from an array.
```
Unclear what impact this has to the user. It's basically just a performance improvement, right?
Yep - better shrinking because we know not to try null-terminated strings. I've added a line to that effect below.
This is actually possible to hit because RangeIndex can be lazy, and therefore the maximum possible value is a fairly obvious thing to feed it. Now that doesn't break things.
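A sketch of why this is so cheap to hit (illustrative; not from the patch):

```python
import pandas as pd

# RangeIndex stores just start/stop/step, so an index of nearly 2**63
# elements is cheap to build lazily; values are computed on access.
ix = pd.RangeIndex(2 ** 63 - 100)
assert ix[10] == 10  # no giant array is ever materialised
```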
Two of these predate #472, but the datetime bug was entirely me and would probably break a lot of Pandas things. Whoops!