Super Serial: automatically save and load TFRecords from TensorFlow datasets #1280

Merged: 27 commits into tensorflow:master on Mar 18, 2021

Conversation

@markemus (Contributor)

TFRecords are notoriously difficult to work with, but no longer! Super Serial provides a simple, easy-to-use way to save (dictionary-style) datasets as TFRecords and load them back into memory. The saver will create a .header file that stores, in a human-readable YAML format, the metadata needed to reconstruct the dataset.

Users only need to deal with two functions:
save(dataset)
dataset = load(tfrecord_path, header_path)

Yes, it's really that easy! If you've worked with TFRecords before you'll know this is a big improvement.
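For illustration, a full round trip might look roughly like this (a sketch only: the paths are placeholders, I'm assuming save also accepts the output paths, and the import path is simplified):

import tensorflow as tf
import super_serial  # import path simplified for illustration

# A small dictionary-style dataset.
ds = tf.data.Dataset.from_tensor_slices({"x": [1.0, 2.0, 3.0], "y": [0, 1, 0]})

# Writes data.tfrecord plus a human-readable data.header (YAML) describing
# how to reconstruct the dataset. Paths are placeholders.
super_serial.save(ds, tfrecord_path="data.tfrecord", header_path="data.header")

# Load it back without writing any TFRecord parsing code.
ds_restored = super_serial.load(tfrecord_path="data.tfrecord", header_path="data.header")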

I added super_serial.py to the "api" subfolder; I'm not sure whether that's the best place for it. Please let me know if there's anything more I need to do to make this acceptable. It also shouldn't be too difficult to extend this to tuple-style datasets if you'd like me to.

google-cla bot commented Jan 21, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@markemus (Contributor Author)

@googlebot I signed it!

google-cla bot commented Jan 21, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@markemus (Contributor Author)

@googlebot I fixed it.

google-cla bot commented Jan 21, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@markemus (Contributor Author)

@googlebot I fixed it.

@markemus (Contributor Author)

I have no idea why Lint is failing; PyCharm's linter indicates that my code is perfectly conformant to PEP 8. I tried 2-space indents instead of 4 (as in the base TensorFlow contributor guidelines), but that didn't help, so I switched it back to 4. Can anyone advise me on what the issue is?

@yongtang (Member)

@markemus You can run bazel run //tools/lint:lint and that will fix most of the lint issues automatically.

@markemus (Contributor Author) commented Jan 26, 2021

@yongtang I ran lint myself from the command line, and apparently it's failing on the f-strings (line 60 of super_serial, and when that's "fixed" it fails on line 69, also an f-string). Does that make sense? Should I go with old-school format strings instead?

I will try bazel after work tonight.

@yongtang (Member)

@markemus Which python version are you using? Wondering if the issue is tied to the python version (we support 3.6 or higher).

@markemus (Contributor Author)

@yongtang I ran it again in Python 3.8.3 (there was a conda env active last time), and this time I got a bunch of warnings. The f-string issue seems to have disappeared, so that must have had something to do with the Python version in conda.

The warnings are:

line-too-long
fix-me
import-error (for tensorflow- that's a weird one, prob safe to ignore)
invalid-name
too-many-function-args

code rating: 6.74/10

Okay, point taken. Is it necessary to fix all of these (e.g., I like to use "x" when looping and "y" if looping again), or are only some necessary? I'd like to keep the TODOs if possible; they're not bugs, but they are features I'd like to add in the future, and PyCharm tracks TODO comments.

@yongtang (Member)

@markemus Are those errors pylint-related? The tensorflow/io repo does not require a pylint check. We only run black and pyupgrade checks instead.

What is the output when you run:

bazel run //tools/lint:lint -- black pyupgrade

locally?

@markemus (Contributor Author)

@yongtang are you able to see the checks on the pull request? It lists GitHub CI / Lint (pull_request) as a failing check. Bazel is listed as successful for macOS, Linux, and Windows, but I don't have bazel installed on my local system to test it out.

@yongtang (Member)

@markemus you can run black and pyupgrade locally to re-format the .py file.

We use:

black==19.10b0
pyupgrade==2.1.0

in our CI lint setup.

@markemus (Contributor Author)

@yongtang OK, I appreciate it; I will set up bazel tonight and hopefully that will get things working. Thanks!

@yongtang (Member)

@markemus You can install bazel, which will not require installing black or pyupgrade explicitly, or install black and pyupgrade through pip or conda and run them locally without bazel.

@markemus (Contributor Author)

@yongtang thank you, I ran bazel with bazel run //tools/lint:lint -- black pyupgrade and that fixed the lint issues. All the other checks broke though! From the log it looks as though that's a problem with the Google servers; is that correct, or do I need to fix something else?

@yongtang (Member)

@markemus I think that was an intermittent error caused by the connection from GitHub Actions to nasm. The issue should be gone by now. Can you try bumping the commit and pushing again? An empty commit will re-trigger the GitHub Actions runs.

@yongtang (Member)

@markemus I will take a detailed review over the weekend. I think some mechanical changes to conform to tensorflow-io's layout will be needed.

@yongtang (Member)

@markemus Can you move the test to the tests directory, e.g., with the name tests/test_xxx.py? That will trigger the GitHub CI to run the test automatically. You can use tests/test_s3_eager.py as an example.

@burgerkingeater (Member)

@markemus thanks for the contribution. I wonder why a metadata file is needed to reconstruct the dataset? TFRecord doesn't need a schema to read, right?

@markemus (Contributor Author)

@burgerkingeater as far as I know it is necessary to define a loading function as well; see the tutorial here or this separate Keras tutorial here. If there is already an automated way to do this I would love to know about it!

BTW are the issues with the Bazel checks failing resolved?

@yongtang I will hopefully fix the test tomorrow- thank you for taking a look at this and for all your help this week.

@markemus (Contributor Author) commented Feb 9, 2021

@yongtang I moved the test as you requested; can you confirm that it's what you had in mind? All checks are passing now.

@yongtang (Member)

Thanks @markemus. I think this feature is fine to put into the experimental package first. There are some mechanical changes that may need to be done:

  1. Add License headers to super_serial.py and test_super_serial.py
  2. Move super_serial.py to tensorflow_io/core/python/experimental/serial_ops.py (let's remove the super_)
  3. Import the exposed functions (save and load) to tensorflow_io/core/python/api/experimental/serialization.py

The above will expose the API in a way that the docs on tensorflow.org can handle.

After that, in test_super_serial.py the import should be changed from

import tensorflow_io.core.python.api.v0.super_serial as ser

to

import tensorflow_io as tfio

and use tfio.experimental.serialization.save/tfio.experimental.serialization.load directly.

Additionally, there might be some minor changes that are worth considering, e.g., maybe it makes sense to rename the APIs to save_dataset and load_dataset?
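For illustration, the serialization.py shim could simply re-export the two functions (a sketch only, assuming the save_dataset/load_dataset names suggested above; the exact layout may differ):

# tensorflow_io/core/python/api/experimental/serialization.py (sketch)
from tensorflow_io.core.python.experimental.serial_ops import (
    save_dataset,
    load_dataset,
)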

Comment on lines +153 to +169
def dataset_to_examples(ds):
    """Converts a dataset to a dataset of tf.train.Example strings. Each Example is a single observation.
    WARNING: Only compatible with "dictionary-style" datasets {key1: val1, key2: val2, ..., keyN: valN}.
    WARNING: Must run in eager mode!"""
    # TODO handle tuples and flat datasets as well.
    for x in ds:
        # Each individual tensor is converted to a known serializable type.
        features = {key: np_value_to_feature(value.numpy()) for key, value in x.items()}
        # All features are then packaged into a single Example object.
        example = tf.train.Example(features=tf.train.Features(feature=features))

        yield example.SerializeToString()
(Member)

If only eager mode is supported, then maybe you can check whether tf is executing in eager mode and raise an exception if it isn't.
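For example, a minimal sketch of such a guard (the helper name, message, and exception type are just illustrations, not the code in this PR):

import tensorflow as tf

def _ensure_eager():
    """Fail fast if TensorFlow is not executing eagerly."""
    if not tf.executing_eagerly():
        raise ValueError("super_serial only supports eager execution.")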

(Contributor Author)

Hmm. It will already raise an exception if tf is not in eager mode, and if no exceptions are raised then it will run fine. In general I'm not in favor of explicitly adding exceptions like that, because I've seen it cause unnecessary exceptions in other packages, but if that's the practice here I'm happy to add it.

@kvignesh1420 (Member) commented Feb 18, 2021

Can we ensure this in a different test case? The test would switch to graph mode, save the dataset, and assert that the respective exception is raised.
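Roughly something like this (just a sketch; the module path, function signature, and exact exception type are assumptions):

import pytest
import tensorflow as tf
import tensorflow_io as tfio

def test_save_raises_in_graph_mode(tmp_path):
    """Saving should fail fast when TensorFlow is not executing eagerly."""
    dataset = tf.data.Dataset.from_tensor_slices({"x": [1, 2, 3]})
    with tf.Graph().as_default():  # executing_eagerly() is False inside this block
        with pytest.raises(ValueError):  # exception type assumed
            tfio.experimental.serialization.save_dataset(
                dataset,
                tfrecord_path=str(tmp_path / "data.tfrecord"),
                header_path=str(tmp_path / "data.header"),
            )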

(Contributor Author)

Sure, I'll add that.

(Contributor Author)

@kvignesh1420 I added the test you requested as test_ensure_graph_fail() in tests/test_serial_ops.py.

@markemus (Contributor Author) commented Mar 1, 2021

@yongtang it looks like some of the checks are failing because it can't import yaml, which makes sense; I never added pyyaml as a dependency anywhere. How do I do that? I didn't see anything in setup.py and there doesn't seem to be a requirements.txt anywhere.

@yongtang (Member) commented Mar 5, 2021

@markemus I think yaml should not be a hard requirement for users. In tensorflow-io we purposely limit the requirements to only tensorflow itself, with no other dependencies. You can place import yaml inside the function, as in:
https://github.com/tensorflow/io/blob/master/tensorflow_io/core/python/ops/audio_ops.py#L195-L201

so that the module is only loaded when needed.
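For example, a sketch of the pattern (not the actual serial_ops code; the function name and body are illustrative):

import tensorflow as tf

def load_dataset(tfrecord_path, header_path):
    """Reconstruct a dataset from a TFRecord file plus its YAML header (sketch)."""
    import yaml  # deferred import: pyyaml is only needed when this API is actually used

    with tf.io.gfile.GFile(header_path, "r") as f:
        header = yaml.safe_load(f)
    # ... build and return the tf.data.Dataset using the header metadata ...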

@yongtang (Member) commented Mar 5, 2021

@markemus Also, it looks like you are using an email that is not registered with your GitHub handle ( <[email protected]>). This is not a requirement, but it causes the commits not to be correctly linked by GitHub:
https://github.com/tensorflow/io/pull/1280/commits

Note that the author of the commits is not linked to your GitHub handle. When the PR is merged, you may not see your name appear on the contributors list correctly.

You can either change your git config to use the email you registered with GitHub, or add the email ( <[email protected]>) to your GitHub account to allow correct linking of the author/commit.

@markemus (Contributor Author) commented Mar 5, 2021

@yongtang I can do that, but the tests require yaml in order to run as well.

@markemus (Contributor Author) commented Mar 15, 2021

@kvignesh1420 @yongtang the rebase is done; the tests aren't fully complete yet but look OK. Are we ready to merge?

EDIT: Haha, the tests failed of course; I deserve that. Working on a fix.

@markemus (Contributor Author)

@kvignesh1420 @yongtang all tests pass now! Sorry for the false alarm yesterday.

@kvignesh1420 (Member) left a comment

LGTM. @yongtang can you please take a look.

@yongtang (Member)

@markemus LGTM. Thanks for the contribution! 👍

As a side note, have you seen my comment in #1280 (comment)? If the email is not linked, then after the merge I think your author information will not show up in the contributors list.

If you are OK with that we can go ahead and merge. However, I would suggest you fix this, as we respect contributors' hard work and want to make sure your contribution is correctly recognized in the repo.

@markemus (Contributor Author)

@yongtang thank you for the heads up! I checked the commit log and added both versions of that email address to my account. Ready to merge when you are.

@yongtang (Member)

@markemus Merged. Thanks for your contribution! 👍

@yongtang merged commit d65cc66 into tensorflow:master Mar 18, 2021
@markemus (Contributor Author)

@yongtang Awesome! Thank you and @kvignesh1420 for all your help :)

michaelbanfield pushed a commit to michaelbanfield/io that referenced this pull request Mar 30, 2021
…atasets (tensorflow#1280)

* super_serial automatically creates TFRecords files from dictionary-style Tensorflow datasets.

* pep8 fixes

* more pep8 (undoing tensorflow 2 space tabs)

* bazel changes

* small change so github checks will run again

* moved super_serial test to tests/

* bazel changes

* moved super_serial to experimental

* refactored super_serial test to work for serial_ops

* bazel fixes

* refactored test to load from tfio instead of full import path

* licenses

* bazel fixes

* fixed license dates for new files

* small change so tests rerun

* small change so tests rerun

* cleanup and bazel fix

* added test to ensure proper crash occurs when trying to save in graph mode

* bazel fixes

* fixed imports for test

* fixed imports for test

* fixed yaml imports for serial_ops

* fixed error path for new tf version

* prevented flaky behavior in graph mode for serial_ops.py by preemptively raising an exception if graph mode is detected.

* sanity check for graph execution in graph_save_fail()

* it should be impossible for serial_ops not to raise an exception now outside of eager mode. Impossible.

* moved eager execution check in serial_ops