super_serial: automate saving and restoring tfrecords #1918

Closed
wants to merge 1 commit into from

@markemus markemus commented Jun 8, 2020

This module saves Datasets as TFRecord files alongside a .header file containing the metadata for reconstructing the Dataset. Users only need to call save(tfrecordpath, headerpath) and load(tfrecordpath, headerpath). It really is that easy.

Currently, using TFRecords in TensorFlow has a steep learning curve: it can be difficult to write code for complex datasets, and you have to keep the code needed to read the TFRecord back into memory. Super_serial eliminates these headaches entirely.

Includes a test to demonstrate how it works.

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@bhack
Contributor

bhack commented Jun 8, 2020

Is this related to tensorflow/tensorflow#38483?

@markemus
Author

markemus commented Jun 8, 2020

Yes, this is a solution to that problem. Specifically it automates the process in this comment: tensorflow/tensorflow#38483 (comment) starting from the same tutorial code. The "feature description" steps are handled internally using the header file to determine, store, and retrieve the feature descriptions automatically.
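
To make the header idea concrete, here is a minimal sketch in plain Python. It is not the super_serial code and does not use TensorFlow or the TFRecord format: the record layout, the JSON header, and the names `infer_header`, `save`, and `load` are all assumptions chosen for illustration. It only shows the general pattern of inferring a feature description from the first example, storing it next to the data, and reading it back so the caller never has to write the description by hand.

```python
import json
import struct
import tempfile
from pathlib import Path

# Hypothetical stand-ins: these are NOT the super_serial functions.
# struct codes for a 64-bit integer and a 64-bit float.
_FORMATS = {"int": "q", "float": "d"}


def infer_header(example):
    """Build a feature description (name -> dtype and length) from one example."""
    header = {}
    for name, values in example.items():
        kind = "float" if any(isinstance(v, float) for v in values) else "int"
        header[name] = {"dtype": kind, "length": len(values)}
    return header


def _struct_format(spec):
    """Little-endian, unpadded struct format for one feature."""
    return "<%d%s" % (spec["length"], _FORMATS[spec["dtype"]])


def save(examples, record_path, header_path):
    """Write fixed-size binary records plus a JSON header that describes them."""
    header = infer_header(examples[0])
    Path(header_path).write_text(json.dumps(header))
    with open(record_path, "wb") as f:
        for example in examples:
            for name, spec in header.items():
                f.write(struct.pack(_struct_format(spec), *example[name]))


def load(record_path, header_path):
    """Rebuild the examples using only the two files written by save()."""
    header = json.loads(Path(header_path).read_text())
    formats = {name: _struct_format(spec) for name, spec in header.items()}
    record_size = sum(struct.calcsize(fmt) for fmt in formats.values())
    data = Path(record_path).read_bytes()
    examples = []
    for start in range(0, len(data), record_size):
        offset = start
        example = {}
        for name, fmt in formats.items():
            example[name] = list(struct.unpack_from(fmt, data, offset))
            offset += struct.calcsize(fmt)
        examples.append(example)
    return examples


# Usage: the reader needs no hand-written feature description.
examples = [
    {"label": [1], "pixels": [0.0, 0.5, 1.0]},
    {"label": [0], "pixels": [0.25, 0.75, 0.125]},
]
with tempfile.TemporaryDirectory() as tmp:
    record_path = Path(tmp) / "data.records"
    header_path = Path(tmp) / "data.header"
    save(examples, record_path, header_path)
    restored = load(record_path, header_path)
```

Because the header lives in a separate file, the record file itself is unchanged, so a reader that already knows the layout can still decode it without the header.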

@bhack
Contributor

bhack commented Jun 8, 2020

My impression is that this could be a better fit for TensorFlow IO. /cc @yongtang @jsimsa

@seanpmorgan
Member

> My impression is that this could have a better fit in Tensorflow IO /cc @yongtang @jsimsa

Agreed. Thanks for the PR, @markemus, but this seems better scoped to tf/io. You may want to open an issue there to see whether they'd like to add it to their repository!

@seanpmorgan seanpmorgan closed this Jun 8, 2020
@markemus
Author

markemus commented Jun 8, 2020

Will do, thanks for the quick feedback everyone.

@jsimsa

jsimsa commented Jun 8, 2020

I am actually working on a PR that will provide support for save and load of datasets. It should be available later this month (and before TF 2.3).

@markemus
Author

markemus commented Jun 8, 2020

Good to hear! This functionality really should have been built into the TFRecord API to begin with. They're a nightmare to work with directly atm.

@bhack
Contributor

bhack commented Jun 8, 2020

@markemus The origin of this is quite old; see tensorflow/tensorflow#16926.
Thanks, @jsimsa, for disclosing your roadmap. We care about third-party contributors' time; it is so precious 😉 (@martinwicke)

@markemus
Author

Two years later, there's still no good way to save and restore TFRecords in TensorFlow. This PR was accepted and has been included in tf-io for over a year now (and also improved, with tests and additional support added).

tensorflow/io#1280

Any chance you folks have changed your mind about adding it to tensorflow proper? I have used this in multiple real projects and so have some of my coworkers. It's easy and it works well, and the header is stored separately so old reading approaches will still work. I'm also happy to add more features if required.
