Super Serial- automatically save and load TFRecords from Tensorflow datasets #1280
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here and the bot will re-check. ℹ️ Googlers: Go here for more info.
@googlebot I signed it!
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info.
@googlebot I fixed it.
I have no idea why Lint is failing- PyCharm's linter indicates that my code conforms to PEP 8. I tried with 2-space indents instead of 4 (as in the base Tensorflow contributor guidelines) but that didn't help, so I switched it back to 4. Can anyone advise me on what the issue is?
@markemus You can run the lint check locally to see the exact errors.
@yongtang I ran lint myself from the command line and apparently it's failing on the f-strings (line 60 of super_serial, and when that's "fixed" it fails on line 69, also an f-string). Does that make sense? Should I go with old-school format strings instead? I will try bazel after work tonight.
@markemus Which python version are you using? Wondering if the issue is tied to the python version (we support 3.6 or higher).
@yongtang I ran it again in Python 3.8.3 (there was a conda env active last time), and this time I got a bunch of warnings. The f-string issue seems to have disappeared, so that must have had something to do with the Python version in conda. The warnings are things like line-too-long, and the code rating came out to 6.74/10. Okay, point taken. Is it necessary to fix all of these (i.e. I like to use "x" when looping and "y" if looping again), or are only some necessary? I'd like to keep the TODOs if possible- they're not bugs but they are features I'd like to add in future, and PyCharm tracks TODO comments.
@markemus Are those the same errors? What is the output when you run the lint locally?
@yongtang are you able to see the checks on the pull request? The failures are listed there.
@markemus you can run the same tools yourself. We use black and pyupgrade in our CI lint setup.
@yongtang ok I appreciate it- I will set up bazel tonight and hopefully that will get things working. Thanks!
@markemus You can install bazel, which will not require installing black or pyupgrade explicitly, or you can install black and pyupgrade through pip or conda and run them locally without bazel.
@yongtang thank you, I ran bazel with the suggested setup.
@markemus I think that was an intermittent error caused by the connection from GitHub Actions to nasm. The issue should be gone by now. Can you try bumping the commit and pushing again? An empty commit will re-trigger the GitHub Actions run.
@markemus I will take a detailed review pass over the weekend. I think some mechanical changes to conform to tensorflow-io's layout will be needed.
@markemus Can you move the test to the tests directory, e.g., following the naming of the other test files there?
@markemus thanks for the contribution. I wonder why a file of metadata to reconstruct the dataset is needed? TFRecord doesn't need a schema to read, right?
@burgerkingeater as far as I know it is necessary to define a loading function as well- see the tutorial here or this separate Keras tutorial here. If there is already an automated way to do this I would love to know about it! BTW are the issues with the Bazel checks failing resolved? @yongtang I will hopefully fix the test tomorrow- thank you for taking a look at this and for all your help this week.
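For reference, this is roughly the hand-written parsing code that loading a plain TFRecord normally requires; the feature names and dtypes below are made up for illustration, and the .header metadata is what would let Super Serial produce this spec instead of the user:

    import tensorflow as tf

    # Hypothetical feature spec -- without saved metadata, the user has to write this by hand.
    feature_description = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }

    def parse_fn(serialized):
        # Decode one serialized tf.train.Example back into a dict of tensors.
        return tf.io.parse_single_example(serialized, feature_description)

    dataset = tf.data.TFRecordDataset("data.tfrecord").map(parse_fn)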
@yongtang I moved the test as you requested- can you confirm that it's what you had in mind? All checks are passing now.
Thanks @markemus. I think this feature is fine to be put into the experimental package first. There are some mechanical changes that need to be done so that the API is correctly exposed in a way the docs on tensorflow.org can handle. After that, the imports can be switched over to use the new location. Additionally, there are some minor changes worth considering, e.g., maybe it makes sense to rename the API.
def dataset_to_examples(ds):
    """Converts a dataset to a dataset of tf.train.Example strings. Each Example is a single observation.

    WARNING: Only compatible with "dictionary-style" datasets {key: val, key2: val2, ..., keyN: valN}.
    WARNING: Must run in eager mode!"""
    # TODO handle tuples and flat datasets as well.
    for x in ds:
        # Each individual tensor is converted to a known serializable type.
        features = {key: np_value_to_feature(value.numpy()) for key, value in x.items()}
        # All features are then packaged into a single Example object.
        example = tf.train.Example(features=tf.train.Features(feature=features))

        yield example.SerializeToString()
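np_value_to_feature is not shown in this excerpt; a minimal sketch of what such a helper typically looks like (an assumed implementation for scalar values, not necessarily the one in this PR):

    import numpy as np
    import tensorflow as tf

    def np_value_to_feature(value):
        """Wraps a bytes or scalar numpy value in the matching tf.train.Feature (sketch, scalars only)."""
        if isinstance(value, bytes):
            return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
        if np.issubdtype(type(value), np.integer):
            return tf.train.Feature(int64_list=tf.train.Int64List(value=[int(value)]))
        if np.issubdtype(type(value), np.floating):
            return tf.train.Feature(float_list=tf.train.FloatList(value=[float(value)]))
        raise TypeError(f"Unsupported value type: {type(value)}")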
If only eager mode is supported, then maybe you can check whether tf is executing in eager mode and raise an exception if it isn't.
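For example, a check like this at the top of the save path would fail fast; the function name, signature, and exception type here are illustrative, not necessarily what the PR uses:

    import tensorflow as tf

    def save(dataset, tfrecord_path, header_path):  # illustrative signature
        # Serialization relies on tensor.numpy(), which only exists when executing eagerly.
        if not tf.executing_eagerly():
            raise ValueError("Dataset serialization requires eager execution.")
        # ... proceed to write tf.train.Example records and the .header file ...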
Hmm. It will raise an exception already if tf is not in eager mode, and if no exceptions are raised then it will run fine. In general I'm not in favor of explicitly adding exceptions like that because I've seen it cause unnecessary exceptions in other packages, but if that's the practice here I'm happy to add it.
Can we ensure this in a different test case? The test would switch to graph mode and save the dataset and assert that the respective exception is raised.
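A sketch of what such a test could look like, assuming pytest and a stand-in for the save entry point (names are illustrative, not the ones merged in this PR):

    import pytest
    import tensorflow as tf

    def _save_stub(dataset):
        # Stand-in for the real save path: it should fail fast outside eager mode.
        if not tf.executing_eagerly():
            raise ValueError("Dataset serialization requires eager execution.")

    def test_ensure_graph_fail():
        """Saving inside an explicit Graph context (eager execution off) should raise."""
        with tf.Graph().as_default():
            ds = tf.data.Dataset.from_tensor_slices({"x": [1, 2, 3]})
            with pytest.raises(ValueError):
                _save_stub(ds)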
Sure, I'll add that.
@kvignesh1420 I added the test you requested as test_ensure_graph_fail() in tests/test_serial_ops.py.
@yongtang it looks like some of the checks are failing because it can't import yaml, which makes sense- I never added pyyaml as a dependency anywhere. How do I do that? I didn't see anything in setup.py and there doesn't seem to be a requirements.txt anywhere.
@markemus I think it can be imported lazily, so that module will only be loaded as needed.
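A lazy import along these lines would keep pyyaml out of the hard dependencies; this is a sketch under the assumption that the header is written from the save path, and the helper name is made up:

    def _write_header(header_path, metadata):
        # Import inside the function so pyyaml is only required when this feature is actually used.
        import yaml
        with open(header_path, "w") as f:
            yaml.safe_dump(metadata, f)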
@markemus Also, it looks like you are using an email that is not registered with your GitHub handle: the author of the commits is not linked to your GitHub account, so when the PR is merged your name may not appear on the contributors list correctly. You can either change the git config to use the email you registered with GitHub, or add the email from the commits to your GitHub account.
@yongtang I can do that, but the tests require yaml in order to run as well.
prevented flaky behavior in graph mode for serial_ops.py by preemptively raising an exception if graph mode is detected.
@kvignesh1420 @yongtang rebase is done- tests aren't fully complete yet but look ok. Are we ready to merge? EDIT: Haha, tests failed of course- I deserve that. Working on a fix.
@kvignesh1420 @yongtang all tests pass now! Sorry for the false alarm yesterday.
LGTM. @yongtang can you please take a look.
@markemus LGTM. Thanks for the contribution! 👍 As a side note, have you seen my comment in #1280 (comment)? If the email is not linked, then after the merge I think your author information will not show up in the contributors list. If you are ok with that we can go ahead and merge. However, I would suggest you fix this, as we respect contributors' hard work and want to make sure your contribution is correctly recognized in the repo.
@yongtang thank you for the heads up! I checked the commit log and added both versions of that email address to my account. Ready to merge when you are.
@markemus Merged. Thanks for your contribution! 👍
@yongtang Awesome! Thank you and @kvignesh1420 for all your help :)
Super Serial- automatically save and load TFRecords from Tensorflow datasets (tensorflow#1280)
* super_serial automatically creates TFRecords files from dictionary-style Tensorflow datasets.
* pep8 fixes
* more pep8 (undoing tensorflow 2 space tabs)
* bazel changes
* small change so github checks will run again
* moved super_serial test to tests/
* bazel changes
* moved super_serial to experimental
* refactored super_serial test to work for serial_ops
* bazel fixes
* refactored test to load from tfio instead of full import path
* licenses
* bazel fixes
* fixed license dates for new files
* small change so tests rerun
* small change so tests rerun
* cleanup and bazel fix
* added test to ensure proper crash occurs when trying to save in graph mode
* bazel fixes
* fixed imports for test
* fixed imports for test
* fixed yaml imports for serial_ops
* fixed error path for new tf version
* prevented flaky behavior in graph mode for serial_ops.py by preemptively raising an exception if graph mode is detected.
* sanity check for graph execution in graph_save_fail()
* it should be impossible for serial_ops not to raise an exception now outside of eager mode. Impossible.
* moved eager execution check in serial_ops
TFRecords are notoriously difficult to work with, but no longer! Super Serial provides a simple, easy-to-use way to save (dictionary-style) datasets as TFRecords and load them back into memory. The saver will create a .header file that stores, in human-readable YAML, the metadata needed to reconstruct the dataset.
Users only need to deal with 2 functions:
save(dataset)
dataset = load(tfrecord_path, header_path)
Yes, it's really that easy! If you've worked with TFRecords before you'll know this is a big improvement.
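An end-to-end usage sketch of the workflow described above; the import path and the exact save() arguments are assumptions (the description only shows save(dataset)), so verify them against the merged layout:

    import tensorflow as tf
    # Hypothetical import path -- the module was moved under the experimental package during review.
    from tensorflow_io.python.experimental.serial_ops import save, load

    ds = tf.data.Dataset.from_tensor_slices({"x": [1.0, 2.0, 3.0], "label": [0, 1, 0]})

    tfrecord_path = "example.tfrecord"
    header_path = "example.header"

    # Write the records plus the human-readable YAML header describing how to rebuild them.
    save(ds, tfrecord_path, header_path)  # paths assumed to be extra arguments

    # Reconstruct an equivalent dataset from the two files, possibly in another process.
    restored = load(tfrecord_path, header_path)
    for item in restored:
        print(item)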
I added super_serial.py to the "api" subfolder- not sure if that's the best place for it or not. Please let me know if there's anything more I need to do to make this acceptable. It also should not be too difficult to extend this to tuple-style datasets if you guys want me to.