Super Serial: automatically save and load TFRecords from TensorFlow datasets #1280

Merged: 27 commits into tensorflow:master on Mar 18, 2021

Conversation

@markemus (Contributor)

TFRecords are notoriously difficult to work with, but no longer! Super Serial provides a simple, easy-to-use way to save (dictionary-style) datasets as TFRecords and load them back into memory. The saver will create a .header file that stores, in a human-readable YAML format, the metadata needed to reconstruct the dataset.

Users only need to deal with two functions:
save(dataset)
dataset = load(tfrecord_path, header_path)

Yes, it's really that easy! If you've worked with TFRecords before you'll know this is a big improvement.
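For illustration, a full round trip might look roughly like this (a sketch only: the paths are placeholders, I'm assuming save also accepts the output paths, and the import path is simplified):

import tensorflow as tf
import super_serial  # import path simplified for illustration

# A small dictionary-style dataset.
ds = tf.data.Dataset.from_tensor_slices({"x": [1.0, 2.0, 3.0], "y": [0, 1, 0]})

# Writes data.tfrecord plus a human-readable data.header (YAML) describing
# how to reconstruct the dataset. Paths are placeholders.
super_serial.save(ds, tfrecord_path="data.tfrecord", header_path="data.header")

# Load it back without writing any TFRecord parsing code.
ds_restored = super_serial.load(tfrecord_path="data.tfrecord", header_path="data.header")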

I added super_serial.py to the "api" subfolder; I'm not sure whether that's the best place for it. Please let me know if there's anything more I need to do to make this acceptable. It also shouldn't be too difficult to extend this to tuple-style datasets if you'd like me to.

google-cla bot commented Jan 21, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@markemus (Contributor Author)

@googlebot I signed it!

google-cla bot commented Jan 21, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@markemus (Contributor Author)

@googlebot I fixed it.

google-cla bot commented Jan 21, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@markemus (Contributor Author)

@googlebot I fixed it.

@markemus (Contributor Author)

I have no idea why Lint is failing; PyCharm's linter indicates that my code is perfectly conformant to PEP 8. I tried 2-space indents instead of 4 (as in the base TensorFlow contributor guidelines), but that didn't help, so I switched it back to 4. Can anyone advise me on what the issue is?

@yongtang (Member)

@markemus You can run bazel run //tools/lint:lint and that will fix most of the lint issues automatically.

@markemus (Contributor Author) commented Jan 26, 2021

@yongtang I ran lint myself from the command line, and apparently it's failing on the f-strings (line 60 of super_serial, and when that's "fixed" it fails on line 69, also an f-string). Does that make sense? Should I go with old-school format strings instead?

I will try bazel after work tonight.

@yongtang (Member)

@markemus Which python version are you using? Wondering if the issue is tied to the python version (we support 3.6 or higher).

@markemus (Contributor Author)

@yongtang I ran it again in Python 3.8.3 (there was a conda env active last time), and this time I got a bunch of warnings. The f-string issue seems to have disappeared, so that must have had something to do with the Python version in conda.

The warnings are:

line-too-long
fix-me
import-error (for tensorflow- that's a weird one, prob safe to ignore)
invalid-name
too-many-function-args

code rating: 6.74/10

Okay, point taken. Is it necessary to fix all of these (e.g., I like to use "x" when looping and "y" if looping again), or are only some necessary? I'd like to keep the TODOs if possible; they're not bugs, but they are features I'd like to add in the future, and PyCharm tracks TODO comments.

@yongtang (Member)

@markemus Are those errors pylint-related? The tensorflow/io repo does not require a pylint check. We only run black and pyupgrade checks instead.

What is the output when you run:

bazel run //tools/lint:lint -- black pyupgrade

locally?

@markemus (Contributor Author)

@yongtang are you able to see the checks on the pull request? It lists GitHub CI / Lint (pull_request) as a failing check. Bazel is listed as successful for macOS, Linux, and Windows, but I don't have bazel installed on my local system to test it out.

@yongtang (Member)

@markemus you can run black and pyupgrade locally to re-format the .py file.

We use:

black==19.10b0
pyupgrade==2.1.0

in our CI lint setup.

@markemus (Contributor Author)

@yongtang OK, I appreciate it; I will set up bazel tonight and hopefully that will get things working. Thanks!

@yongtang (Member)

@markemus You can install bazel, which will not require installing black or pyupgrade explicitly, or install black and pyupgrade through pip or conda and run them locally without bazel.

@markemus (Contributor Author)

@yongtang thank you, I ran bazel with bazel run //tools/lint:lint -- black pyupgrade and that fixed the lint issues. All the other checks broke though! From the log it looks as though that's a problem with the Google servers; is that correct, or do I need to fix something else?

@yongtang (Member)

@markemus I think that was an intermittent error caused by the connection from GitHub Actions to nasm. The issue should be gone by now. Can you try bumping the commit and pushing again? An empty commit will re-trigger the GitHub Actions runs.

@yongtang (Member)

@markemus I will take a detailed review over the weekend. I think some mechanical changes to conform to tensorflow-io's layout will be needed.

@yongtang (Member)

@markemus Can you move the test to the tests directory, e.g., with the name tests/test_xxx.py? That will trigger the GitHub CI to run the test automatically. You can use tests/test_s3_eager.py as an example.

@burgerkingeater (Member)

@markemus thanks for the contribution. I wonder why a metadata file is needed to reconstruct the dataset? TFRecord doesn't need a schema to read, right?

@markemus (Contributor Author)

@burgerkingeater as far as I know it is necessary to define a loading function as well; see the tutorial here or this separate Keras tutorial here. If there is already an automated way to do this I would love to know about it!

BTW are the issues with the Bazel checks failing resolved?

@yongtang I will hopefully fix the test tomorrow- thank you for taking a look at this and for all your help this week.

@markemus (Contributor Author) commented Feb 9, 2021

@yongtang I moved the test as you requested; can you confirm that it's what you had in mind? All checks are passing now.

@yongtang (Member)

Thanks @markemus. I think this feature is fine to put into the experimental package first. There are some mechanical changes that may need to be done:

  1. Add License headers to super_serial.py and test_super_serial.py
  2. Move super_serial.py to tensorflow_io/core/python/experimental/serial_ops.py (let's remove the super_)
  3. Import the exposed functions (save and load) to tensorflow_io/core/python/api/experimental/serialization.py

The above will expose the API in a way that the docs on tensorflow.org can handle.

After that, in test_super_serial.py the import should be changed from

import tensorflow_io.core.python.api.v0.super_serial as ser

to

import tensorflow_io as tfio

and use tfio.experimental.serialization.save/tfio.experimental.serialization.load directly.

Additionally, there might be some minor changes that are worth considering, e.g., maybe it makes sense to rename the APIs to save_dataset and load_dataset?
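For illustration, the serialization.py shim could simply re-export the two functions (a sketch only, assuming the save_dataset/load_dataset names suggested above; the exact layout may differ):

# tensorflow_io/core/python/api/experimental/serialization.py (sketch)
from tensorflow_io.core.python.experimental.serial_ops import (
    save_dataset,
    load_dataset,
)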

Comment on lines +153 to +169
def dataset_to_examples(ds):
    """Converts a dataset to a dataset of tf.train.Example strings. Each Example is a single observation.
    WARNING: Only compatible with "dictionary-style" datasets {key1: val1, key2: val2, ..., keyN: valN}.
    WARNING: Must run in eager mode!"""
    # TODO handle tuples and flat datasets as well.
    for x in ds:
        # Each individual tensor is converted to a known serializable type.
        features = {key: np_value_to_feature(value.numpy()) for key, value in x.items()}
        # All features are then packaged into a single Example object.
        example = tf.train.Example(features=tf.train.Features(feature=features))

        yield example.SerializeToString()
(Member)

If only eager mode is supported, then maybe you can check whether tf is executing in eager mode and raise an exception if it isn't.
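For example, a minimal sketch of such a guard (the helper name, message, and exception type are just illustrations, not the code in this PR):

import tensorflow as tf

def _ensure_eager():
    """Fail fast if TensorFlow is not executing eagerly."""
    if not tf.executing_eagerly():
        raise ValueError("super_serial only supports eager execution.")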

(Contributor Author)

Hmm. It will already raise an exception if tf is not in eager mode, and if no exceptions are raised then it will run fine. In general I'm not in favor of explicitly adding exceptions like that, because I've seen it cause unnecessary exceptions in other packages, but if that's the practice here I'm happy to add it.

@kvignesh1420 (Member) commented Feb 18, 2021

Can we ensure this in a different test case? The test would switch to graph mode, save the dataset, and assert that the respective exception is raised.
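Roughly something like this (just a sketch; the module path, function signature, and exact exception type are assumptions):

import pytest
import tensorflow as tf
import tensorflow_io as tfio

def test_save_raises_in_graph_mode(tmp_path):
    """Saving should fail fast when TensorFlow is not executing eagerly."""
    dataset = tf.data.Dataset.from_tensor_slices({"x": [1, 2, 3]})
    with tf.Graph().as_default():  # executing_eagerly() is False inside this block
        with pytest.raises(ValueError):  # exception type assumed
            tfio.experimental.serialization.save_dataset(
                dataset,
                tfrecord_path=str(tmp_path / "data.tfrecord"),
                header_path=str(tmp_path / "data.header"),
            )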

(Contributor Author)

Sure, I'll add that.

(Contributor Author)

@kvignesh1420 I added the test you requested as test_ensure_graph_fail() in tests/test_serial_ops.py.

@markemus (Contributor Author) commented Mar 1, 2021

@yongtang it looks like some of the checks are failing because it can't import yaml, which makes sense; I never added pyyaml as a dependency anywhere. How do I do that? I didn't see anything in setup.py and there doesn't seem to be a requirements.txt anywhere.

@yongtang (Member) commented Mar 5, 2021

@markemus I think yaml should not be a hard requirement for users. In tensorflow-io we purposely limit the requirements to only tensorflow itself, with no other dependencies. You can place import yaml inside the function, as in:
https://github.com/tensorflow/io/blob/master/tensorflow_io/core/python/ops/audio_ops.py#L195-L201

so that the module is only loaded when needed.
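For example, a sketch of the pattern (not the actual serial_ops code; the function name and body are illustrative):

import tensorflow as tf

def load_dataset(tfrecord_path, header_path):
    """Reconstruct a dataset from a TFRecord file plus its YAML header (sketch)."""
    import yaml  # deferred import: pyyaml is only needed when this API is actually used

    with tf.io.gfile.GFile(header_path, "r") as f:
        header = yaml.safe_load(f)
    # ... build and return the tf.data.Dataset using the header metadata ...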

@yongtang (Member) commented Mar 5, 2021

@markemus Also, it looks like you are using an email that is not registered with your GitHub handle ( <[email protected]>). This is not a requirement, but it causes the commits not to be correctly linked by GitHub:
https://github.com/tensorflow/io/pull/1280/commits

Note that the author of the commits is not linked to your GitHub handle. When the PR is merged, you may not see your name appear on the contributors list correctly.

You can either change your git config to use the email you registered with GitHub, or add the email ( <[email protected]>) to your GitHub account to allow correct linking of the author/commit.

@markemus (Contributor Author) commented Mar 5, 2021

@yongtang I can do that, but the tests require yaml in order to run as well.

@markemus (Contributor Author) commented Mar 15, 2021

@kvignesh1420 @yongtang the rebase is done; the tests aren't fully complete yet but look OK. Are we ready to merge?

EDIT: Haha, the tests failed of course; I deserve that. Working on a fix.

@markemus (Contributor Author)

@kvignesh1420 @yongtang all tests pass now! Sorry for the false alarm yesterday.

@kvignesh1420 (Member) left a comment

LGTM. @yongtang can you please take a look.

@yongtang (Member)

@markemus LGTM. Thanks for the contribution! 👍

As a side note, have you seen my comment in #1280 (comment)? If the email is not linked, then after the merge I think your author information will not show up in the contributors list.

If you are OK with that we can go ahead and merge. However, I would suggest you fix this, as we respect contributors' hard work and want to make sure your contribution is correctly recognized in the repo.

@markemus (Contributor Author)

@yongtang thank you for the heads up! I checked the commit log and added both versions of that email address to my account. Ready to merge when you are.

@yongtang (Member)

@markemus Merged. Thanks for your contribution! 👍

@yongtang merged commit d65cc66 into tensorflow:master Mar 18, 2021
@markemus (Contributor Author)

@yongtang Awesome! Thank you and @kvignesh1420 for all your help :)

michaelbanfield pushed a commit to michaelbanfield/io that referenced this pull request Mar 30, 2021
…atasets (tensorflow#1280)

* super_serial automatically creates TFRecords files from dictionary-style Tensorflow datasets.

* pep8 fixes

* more pep8 (undoing tensorflow 2 space tabs)

* bazel changes

* small change so github checks will run again

* moved super_serial test to tests/

* bazel changes

* moved super_serial to experimental

* refactored super_serial test to work for serial_ops

* bazel fixes

* refactored test to load from tfio instead of full import path

* licenses

* bazel fixes

* fixed license dates for new files

* small change so tests rerun

* small change so tests rerun

* cleanup and bazel fix

* added test to ensure proper crash occurs when trying to save in graph mode

* bazel fixes

* fixed imports for test

* fixed imports for test

* fixed yaml imports for serial_ops

* fixed error path for new tf version

* prevented flaky behavior in graph mode for serial_ops.py by preemptively raising an exception if graph mode is detected.

* sanity check for graph execution in graph_save_fail()

* it should be impossible for serial_ops not to raise an exception now outside of eager mode. Impossible.

* moved eager execution check in serial_ops