
Bugfixes in inference from Numpy dtypes #826

Merged: 8 commits into HypothesisWorks:master on Oct 8, 2017

Conversation

@Zac-HD Zac-HD commented Aug 29, 2017

Two of these predate #472, but the datetime bug was entirely me and would probably break a lot of Pandas things. Whoops!

  • Inspect dtype strategies to check they can generate all representable values
  • Write a test that any value we get from a dtype-derived strategy can be round-tripped in and out of an array of that dtype. (it's a good test, we just shouldn't do it at runtime IMO)
  • Investigate round-tripping of bytestrings ending in the zero byte

Zac-HD commented Aug 30, 2017

```
Found 4 benchmarks with significant difference at false discovery rate 0.0001
Different means for arrays10-valid=always-interesting=array_average: 2371.94 -> 1418.63. p=0.00000
Different means for arraysvar-valid=always-interesting=array_average: 1835.96 -> 1162.24. p=0.00000
Different means for arrays10-valid=usually-interesting=array_average: 2548.95 -> 1658.15. p=0.00000
Different means for arraysvar-valid=usually-interesting=array_average: 1958.75 -> 1039.24. p=0.00000
```

Of course this is good, but now I have to update the benchmarks too.

@DRMacIver (Member) commented:

> Of course this is good, but now I have to update the benchmarks too.

I just use `make update-all-benchmark-data` (I know you can't run `make`, but you can get the command from there).

Warning: It takes quite a while to run.

@DRMacIver (Member) commented:

I'm not totally sure why the benchmarks have improved. Were we previously drawing two bytes for int8?

Zac-HD commented Aug 31, 2017

> I'm not totally sure why the benchmarks have improved. Were we previously drawing two bytes for int8?

It must be drawing fewer values for some reason, because the size (in bits) of each integer has increased. I confess this is also surprising to me - I would expect this kind of boost for fixed-size text arrays, but I thought integers would be better at finding bugs while taking no less time.

Update: the changed benchmarks were all for the array_average interestingness condition (mean of array above random threshold). Now that we can generate the full range of representable integers, this condition is true of many more examples.

@DRMacIver (Member) left a review:

As well as the specific commentary, I think I'd like this to be a lot better tested than it currently is.

I had a bit of a think about how, and I think the thing we actually want to do is sneak some internal assertions into the from_dtype strategy (possibly enabled by some sort of debug flag?), so that rather than just doing a straight-up conversion to the resulting dtype, we then convert the result back to the original type and assert that it is equal. This will give us some nice explicit failures when this sort of silent truncation is happening.
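
A minimal sketch of that idea, assuming a hypothetical flag name and helper that do not exist in Hypothesis as written:

```python
import os

import numpy as np

# Hypothetical flag name for illustration only; a real implementation
# would pick its own spelling.
_CHECK_ROUNDTRIP = os.environ.get('HYPOTHESIS_DEBUG_DTYPE_ROUNDTRIP') == 'true'


def _assert_roundtrips(value, dtype):
    # Store the drawn value in a one-element array of the target dtype and
    # read it back; silent truncation shows up as an assertion failure.
    # (NaN would need special-casing, since it is unequal to itself.)
    arr = np.zeros(shape=1, dtype=dtype)
    arr[0] = value
    assert arr[0] == value, (value, arr[0], dtype)
    return value


def checked(strategy, dtype):
    # Wrap a dtype-inferred strategy so that every example is round-tripped
    # when the flag is on, and the strategy is returned untouched otherwise.
    if not _CHECK_ROUNDTRIP:
        return strategy
    return strategy.map(lambda v: _assert_roundtrips(v, dtype))
```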

```diff
@@ -53,19 +53,24 @@ def from_dtype(dtype):
     elif dtype.kind == u'c':
         result = st.complex_numbers()
     elif dtype.kind in (u'S', u'a'):
-        result = st.binary()
+        result = st.binary(max_size=dtype.itemsize or None)
```
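
For context on the `or None` fallback (my illustration, not part of the diff): an unsized bytes dtype reports an itemsize of zero, which is falsy, so it maps to an unbounded `binary()` strategy.

```python
import numpy as np

np.dtype('S5').itemsize  # 5 -- elements are fixed five-byte strings
np.dtype('S').itemsize   # 0 -- length not yet determined, so `or None` kicks in
```
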
@DRMacIver (Member) commented:

Does this have an off-by-one error? If numpy can store variable size byte strings then it must either be length prefixing or zero terminating them. I'd assume the latter. But in that case itemsize must be counting the zero byte at the end. There's also a question of what happens if we have internal zeros here.

@Zac-HD (Member, Author) replied:

The dtype for a variable-length string (unicode or bytestring) actually creates an array of fixed-length strings, but the limiting length is that of the longest string present when the array is initialised.

So the array is not variable size, but it's safe to generate any length string because the array doesn't exist yet!
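
A quick illustration of that behaviour (mine, not from the PR):

```python
import numpy as np

a = np.array(['aaaa', 'bb'])  # element length inferred from the longest input
a.dtype                       # dtype('<U4')
a[1]                          # 'bb' -- shorter entries come back unchanged
```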

@DRMacIver (Member) commented:

So what happens to the shorter strings? Do they get zero padded to the same length?

@Zac-HD (Member, Author) replied:

Internally, I assume so, but the Python API gives you `np.array(['aaaa', 'bb'])[1] == 'bb'` - the same strings you put in. However, assigning longer strings gets them truncated: `np.array(['aaaa', 'bb'], 'U2')[0] == 'aa'`; that's the shrinking issue I mentioned.
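
Expanding those one-liners into a runnable snippet; the trailing-null case is my addition, showing why such values can never be round-tripped:

```python
import numpy as np

# Assigning a longer string into an existing 'U2' array silently truncates it.
a = np.array(['aaaa', 'bb'], 'U2')
assert a[0] == 'aa'

# Trailing nulls are treated as padding and stripped on read-back, which is
# why the inferred strategies stop generating null-terminated strings.
b = np.array(['a\x00'], 'U2')
assert b[0] == 'a'
```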

```diff
     elif dtype.kind == u'U':
-        result = st.text()
+        if dtype.itemsize == 0:  # pointer to arbitrary-length string
```
@DRMacIver (Member) commented:

Is this really represented as an itemsize of zero? That seems very odd. I'd expect it to be a pointer size.

@Zac-HD (Member, Author) replied:

Misleading comment, sorry - this dtype is not actually a pointer; it has a fixed but as-yet-undetermined length. I'll update the comment to match.

Fun fact: you can also create dtypes for negative-length strings, which raise a MemoryError when you try to instantiate an array of them. (I do not propose to check for this.)

```diff
+        if dtype.itemsize == 0:  # pointer to arbitrary-length string
+            result = st.text()
+        else:  # unicode encoded as UTF-32
+            result = st.text(max_size=dtype.itemsize // 4)
```
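
The `// 4` comes from NumPy storing unicode text as four bytes per code point; a quick check (my illustration):

```python
import numpy as np

dt = np.dtype('U5')
dt.itemsize       # 20 -- four bytes per code point (UCS-4 / UTF-32)
dt.itemsize // 4  # 5  -- the maximum string length used for max_size
```
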
@DRMacIver (Member) commented:

Same comment as with binary. Is this not zero terminated?

@DRMacIver (Member) commented:

BTW as predicted I have now started running into this in #785

Zac-HD commented Sep 2, 2017

I don't actually care that much about the truncation thing - it's strictly a performance issue and doesn't affect example quality. On the other hand, the check would be fairly expensive at runtime (because there may often be lots of array elements) and wouldn't catch values we fail to generate anyway. Much like the debug trace in #815, this code predates our review process; I think review is sufficient to catch this kind of thing now, so I wouldn't write explicit tests - just review this patch and area of code carefully.

@DRMacIver (Member) commented:

> On the other hand, the check would be fairly expensive at runtime (because there may often be lots of array elements)

This is why I suggested putting it behind a debug flag. Something like the `HYPOTHESIS_INTERNAL_BRANCHCHECK` trick, which swaps in an expensive implementation when turned on and is otherwise free.

But I accept the argument that this doesn't need to be done in this pull request.

Zac-HD commented Sep 3, 2017

Short summary: all Numpy arrays of strings are of fixed length; unicode is four bytes per codepoint (I assume UTF-32) to make this work. Shorter strings are stored null-padded, so we no longer allow strings ending in null from our inferred strategies (to ensure values can be round-tripped).

String dtypes may or may not specify a length; if they don't, the length is fixed when the array is created, to fit the longest string present at that point. So strings are only "variable-length" in the sense that shorter strings are padded as described above.

@DRMacIver (Member) commented:

@Zac-HD I'm going to take the datetime part of this off your hands if you don't mind. It's causing a bunch of annoyance for the pandas work, so I'd like to get that fixed. Will leave the rest.

Zac-HD commented Sep 12, 2017

Fine by me! I'd appreciate it if you also cherry-picked the integers bug, because it's pretty nasty and will also impact Pandas.

@DRMacIver (Member) commented:

> I'd appreciate it if you also cherry-picked the integers bug, because it's pretty nasty and will also impact Pandas.

Happy to, but I'd like to do it in a separate PR. The datetime bug is actually blocking the release of pandas because it causes tests that should pass to fail; the integers bug just means we're not testing as much as we'd like.

DRMacIver added a commit that referenced this pull request Sep 12, 2017
…-resolution

Extract datetime resolution fix from #826
@Zac-HD Zac-HD force-pushed the np-dtype-fixes branch 3 times, most recently from dde052a to ddae4fe on September 14, 2017 at 05:45

Zac-HD commented Sep 26, 2017

After some more investigation, I've decided that further changes to datetime dtype strategies aren't worth it:

  • Shrinking to the 1970 epoch is more natural than 2000 (as the `datetimes()` strategy does), because of the underlying representation as an integer. This also obviates the special-casing for higher resolutions (picosecond, etc.) that can't represent the Y2K epoch; see the sketch after this list.
  • I think my other problem is actually a repr bug, as I couldn't trigger it with any value-related tests. I'll try to track it down sometime, but if it exists at all it's not in Hypothesis.
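
A small check of the epoch point above (my illustration): `np.datetime64` stores an integer offset from the Unix epoch, so zero is 1970-01-01 at every resolution.

```python
import numpy as np

assert np.datetime64(0, 's') == np.datetime64('1970-01-01T00:00:00')
# Even at picosecond resolution, where an int64 only covers a few months
# around the epoch, zero is representable and is the natural minimal value.
assert np.datetime64(0, 'ps') == np.datetime64('1970-01-01', 'ps')
```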

Which means this is ready to merge!

Zac-HD commented Sep 27, 2017

Naturally, this revealed some bugs in our tests: invalid sizes leading to an overflow in RangeIndex construction, and the fact that uint64 is broken in Pandas before version 0.20. Hopefully that will be all.

Excludes NaN, because it's unequal to itself and nan-checks that handle
timestamps etc are nontrivial.
Excludes compound, record, and nested dtypes too.

@DRMacIver (Member) left a review:

Bunch of small nitpicks (mostly questions) but generally LGTM. Sorry for the long review delay!

```diff
@@ -1681,7 +1681,7 @@ def try_convert(typ, value, name):
                 name, value, type(value).__name__, typ.__name__
             )
         )
-    except ValueError:
+    except (OverflowError, ValueError):
```
@DRMacIver (Member) commented:

What caused this to become necessary?

@Zac-HD (Member, Author) replied:

Somewhere in Pandas index handling, from vague memory in uint64 or datetime64, casting Python integers to the specific type raises OverflowError. This seemed like the best place to handle it, and already has the right error message.
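
For instance (my illustration of the failure mode, not the exact Pandas call site), converting an out-of-range Python int to a fixed-width NumPy integer raises OverflowError rather than ValueError:

```python
import numpy as np

try:
    np.int64(2 ** 63)  # one past the largest representable int64
except OverflowError as err:
    print(type(err).__name__, err)
```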

```python
# values are safe, not known type coercion.
arr = np.zeros(shape=1, dtype=dtype)
ex = data.draw(nps.from_dtype(arr.dtype))
assume(ex == ex)  # NaN is uninteresting for this test
```
@DRMacIver (Member) commented:

Maybe worth noting in the comment that we're using this test because we might have non-numeric types where np.isnan would error.

```diff
@@ -47,10 +47,11 @@ def test_unique_indexes_of_small_values(ix):
     assert len(set(ix)) == len(ix)


 int64s = npst.from_dtype(np.dtype(int))
 # Sizes that fit into an int64 without overflow
 range_sizes = st.integers(0, 2 ** 63 - 100)
```

@DRMacIver (Member) commented:

Why the -100 rather than -1?

RELEASE.rst (outdated diff)

```rst
This patch improves the quality of strategies inferred from Numpy dtypes:

* Signed integer dtypes generated examples with some upper bits set to zero
```
@DRMacIver (Member) commented:

Just to check, did this mean we were never producing negative values? If so, ouch. That's pretty bad. Or was it just the high non-sign bits? Might be worth making that clearer here.

```rst
  length, which should improve example and shrink quality.
* Numpy arrays can only store fixed-size strings internally, and allow shorter
  strings by right-padding them with null bytes. Inferred string strategies
  no longer generate such values, as they can never be retrieved from an array.
```
@DRMacIver (Member) commented:

Unclear what impact this has to the user. It's basically just a performance improvement, right?

@Zac-HD (Member, Author) replied:

Yep - better shrinking because we know not to try null-terminated strings. I've added a line to that effect below.

Zac-HD added 2 commits October 7, 2017 23:23
This is actually possible to hit because RangeIndex can be lazy, and
therefore the maximum possible value is a fairly obvious thing to feed
it. Now that no longer breaks things.
@Zac-HD Zac-HD merged commit e2a87bc into HypothesisWorks:master Oct 8, 2017
@Zac-HD Zac-HD deleted the np-dtype-fixes branch August 17, 2018 23:43