Add tutorial for avro dataset API #1250

Can we replace the images by adding cells in the notebooks which output the schema and a sample mnist record?
Maybe we can pick up a few scenarios from the test cases and put it in this tutorial?

burgerkingeater · 2021-01-04T03:27:37Z

@kvignesh1420 thanks for the feedback,

Can we replace the images by adding cells in the notebooks which output the schema and a sample mnist record?
updated
Maybe we can pick up a few scenarios from the test cases and put it in this tutorial?
I did consider adding more cases from the unit tests to the tutorial, but the examples in those tests require largely heterogeneous test datasets, which are better added iteratively. This PR could be treated as a starting point, and I'm planning to enhance it step by step. However, I just added one more example and more explanation to this tutorial.

Let me know if you have additional feedback. Thanks.
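For reference, a notebook cell that prints a schema instead of showing an image can be quite small. The sketch below is only an illustration: the inline schema is a hypothetical stand-in built from the three feature names that appear later in this thread (features, label, dataType), and the tutorial's actual .avsc file may differ. The helper name describe_schema is also made up for this example.

```python
import json

# Hypothetical schema for illustration only; not the tutorial's actual .avsc file.
SCHEMA_TEXT = """
{
  "type": "record",
  "name": "ExampleRecord",
  "fields": [
    {"name": "features", "type": {"type": "array", "items": "int"}},
    {"name": "label", "type": "int"},
    {"name": "dataType", "type": "string"}
  ]
}
"""


def describe_schema(schema_text):
    """Return (field name, field type) pairs from an Avro record schema given as JSON text."""
    schema = json.loads(schema_text)
    return [(field["name"], field["type"]) for field in schema["fields"]]


# Pretty-print the whole schema, then one line per field.
print(json.dumps(json.loads(SCHEMA_TEXT), indent=2))
for name, avro_type in describe_schema(SCHEMA_TEXT):
    print(name, "->", avro_type)
```

An Avro schema file is plain JSON, so the standard json module is enough for the schema part. Printing a sample record would need an Avro reader, which is left out here.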

yongtang · 2021-01-04T16:26:28Z

@burgerkingeater I think you may want to expand the Overview section to give readers some background information about the Avro format and its usage (e.g., where Avro is used, its integration with Kafka, etc). Not every reader in the machine learning/TensorFlow community is familiar with the Avro format, so background information may be needed.

You may also expand the Usage section with more details about the data files you used and how they were generated (the "Download a sample Avro file:" part and the "Download the schema file corresponding for the sample Avro data:" part).

If you are reusing the mnist data for training, I assume you are following some other similar tutorial but replacing the tfrecord format with the avro format? If that is the case, it might be better to provide a link to the other tutorial so that readers can easily compare and follow the different routes (tfrecord format vs avro format).

burgerkingeater · 2021-01-05T03:10:55Z

@yongtang replied inline:
@burgerkingeater I think you may want to expand the Overview section to give readers some background information about the Avro format and its usage (e.g., where Avro is used, its integration with Kafka, etc). Not every reader in the machine learning/TensorFlow community is familiar with the Avro format, so background information may be needed.
- updated

You may also expand the Usage section with more details about the data files you used and how they were generated (the "Download a sample Avro file:" part and the "Download the schema file corresponding for the sample Avro data:" part).
- updated

If you are reusing the mnist data for training, I assume you are following some other similar tutorial but replacing the tfrecord format with the avro format? If that is the case, it might be better to provide a link to the other tutorial so that readers can easily compare and follow the different routes (tfrecord format vs avro format).
- We are not following another tutorial. We wrote a script to convert mnist from tfrecord to avro, and additionally trimmed the data for simplicity. As you can see, the mnist avro data we are using here has only 4 records, and some fields are trimmed as well.

burgerkingeater · 2021-01-05T03:12:19Z

@yongtang @kvignesh1420 , updated the tutorial per feedback. thanks.

yongtang · 2021-01-05T17:33:30Z

@burgerkingeater Some additional background about the mnist format data you are using would be good. For example, the motivation (e.g., mnist is widely used as a beginner's guide in ML), how the avro mnist data file is generated, the schema of the avro mnist data, how it compares to normal mnist, etc.

burgerkingeater · 2021-01-06T04:31:39Z

@yongtang updated. Note that although the avro file originated from mnist, it has been trimmed and customized, so it's significantly different from mnist. I therefore renamed mnist.avro to test.avro to avoid any confusion.

burgerkingeater · 2021-01-06T18:37:48Z

@yongtang @kvignesh1420 updated per feedback, added more background info. Let me know if you have any additional feedback. Thanks.

burgerkingeater · 2021-01-07T06:46:33Z

@yongtang @kvignesh1420 kindly ping..

docs/tutorials/avro.ipynb

burgerkingeater · 2021-01-10T08:53:47Z

@kvignesh1420 updated the tutorial per feedback. Please let me know how it looks. Thanks.

docs/tutorials/avro.ipynb

kvignesh1420

Please check the comments for minor edits.

burgerkingeater · 2021-01-10T10:44:00Z

@kvignesh1420 thanks for your quick response. Updated.

kvignesh1420 · 2021-01-10T14:33:24Z

LGTM. @yongtang Please take a look.

burgerkingeater · 2021-01-11T03:47:49Z

@kvignesh1420 thanks, can we resolve the change request? @yongtang please let me know how it looks. thanks.

yongtang · 2021-01-11T16:31:42Z

@burgerkingeater Can you move testing data (test.avro, test.avsc, training.avro, training.avsc) to another PR? Once the other PR is merged, it will be possible to live test this PR (with URLs of testing data already in place).

burgerkingeater · 2021-01-19T00:32:00Z

@yongtang I just checked: the two files are shown in this PR as deleted files, as they are no longer needed in this tutorial.

kvignesh1420 · 2021-01-19T16:47:51Z

@burgerkingeater so you are using only test.avsc and test.avro in this tutorial? If that is the case, it might be confusing for readers to see that they are training the models on test.avro. I think it would be better to rename the files to train.avro and train.avsc (since there is no testing/inference phase in the tutorial).

Also, any idea why there were intermittent kernel crashes?

burgerkingeater · 2021-01-19T18:23:13Z

@kvignesh1420 thanks, i created another PR: #1278 for the data file renaming change.

kvignesh1420 · 2021-01-19T18:38:32Z

@burgerkingeater #1278 has been merged. Please rebase so that only the notebook shows up as a changed file in this PR.

burgerkingeater · 2021-01-19T18:40:09Z

@kvignesh1420 Still not sure about the intermittent kernel crashes. I'm unable to reproduce them on my local machine, and there isn't yet enough output in the failed kernel's log.

burgerkingeater · 2021-01-19T18:42:15Z

@kvignesh1420 done, thanks.

MarkDaoust · 2021-01-19T18:51:57Z

One way you may get more detailed logs from the crash is to run the notebook in a subprocess.

!curl -O https://raw.githubusercontent.com/tensorflow/io/0747f5d99d479c4c3e07268c06f70dc5ad76c953/docs/tutorials/avro.ipynb
!jupyter nbconvert --execute avro.ipynb

That way when the subprocess dies the Colab kernel is still up and you can see all the logs.
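Since the crash is intermittent, the same subprocess idea can be scripted so that several runs are attempted and only the failing ones are reported. This is a minimal sketch, not something used in this PR: the helper name run_repeatedly is made up, and the demo uses a stand-in command that always fails so the block runs anywhere. For the notebook, the command would be the nbconvert invocation above, passed as a list of arguments.

```python
import subprocess
import sys


def run_repeatedly(cmd, attempts):
    """Run cmd `attempts` times; return (attempt, returncode, stderr tail) for each failed run."""
    failures = []
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Keep only the end of stderr, where the error message usually appears.
            failures.append((attempt, result.returncode, result.stderr[-2000:]))
    return failures


# Stand-in command for the demo: a child Python process that exits with code 3.
# For the notebook, cmd would be something like:
#   ["jupyter", "nbconvert", "--execute", "avro.ipynb"]
demo_cmd = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
for attempt, code, tail in run_repeatedly(demo_cmd, attempts=2):
    print(f"run {attempt} failed with exit code {code}: {tail}")
```

Each run starts in a fresh process, so a failure in one attempt does not affect the next.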

kvignesh1420 · 2021-01-19T18:55:47Z

@MarkDaoust can we go ahead and merge this tutorial then? Let me know.

MarkDaoust · 2021-01-19T19:11:08Z

The current version did fail again for me in Colab just now, after 4 or 5 rounds of "restart and run all".

It would be good to get it fixed before publishing.

burgerkingeater · 2021-01-19T19:35:08Z

@MarkDaoust can you shed more light? This is the output, and I didn't see any process running afterwards.

[NbConvertApp] Converting notebook avro.ipynb to html
[NbConvertApp] Executing notebook with kernel: python3
2021-01-19 11:25:32.664165: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-19 11:25:32.665549: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-01-19 11:25:33.313405: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA
2021-01-19 11:25:34.491664: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
[NbConvertApp] Writing 322486 bytes to avro.html

kvignesh1420 · 2021-01-19T19:37:32Z

@MarkDaoust as per your suggestion, I ran the following in colab:

!curl -O https://raw.githubusercontent.com/tensorflow/io/0747f5d99d479c4c3e07268c06f70dc5ad76c953/docs/tutorials/avro.ipynb
!jupyter nbconvert --execute avro.ipynb

The output was as follows:

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20850  100 20850    0     0   108k      0 --:--:-- --:--:-- --:--:--  108k
[NbConvertApp] Converting notebook avro.ipynb to html
[NbConvertApp] Executing notebook with kernel: python3
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
2021-01-19 19:33:25.670952: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
[NbConvertApp] ERROR | unhandled iopub msg: colab_request
2021-01-19 19:33:30.669364: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-19 19:33:30.670455: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-19 19:33:30.681147: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-01-19 19:33:30.681201: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (bec4304dada5): /proc/driver/nvidia/version does not exist
2021-01-19 19:33:30.681789: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-19 19:33:30.787381: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA
2021-01-19 19:33:32.256142: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-19 19:33:32.256615: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2299995000 Hz
[NbConvertApp] Writing 322930 bytes to avro.html

I did a "restart and run" over 8 times and checked the avro.html file. Surprisingly, I did not see any session crash messages.
Am I missing something?
cc: @burgerkingeater

MarkDaoust · 2021-01-19T19:42:59Z

That's an example of the process completing successfully.

Below I've attached one where it fails (with the debug flag), but that doesn't seem to give any additional information. Just which cell it failed in, and we already knew that.

Anyone have other ideas?

!jupyter nbconvert --execute --debug avro.ipynb

[NbConvertApp] Executing cell:
features = {
    'features[*]': tfio.experimental.columnar.VarLenFeatureWithRank(dtype=tf.int32),
    'label': tf.io.FixedLenFeature(shape=[], dtype=tf.int32, default_value=-100),
    'dataType': tf.io.FixedLenFeature(shape=[], dtype=tf.string)
}

schema = tf.io.gfile.GFile('train.avsc').read()

dataset = tfio.experimental.columnar.make_avro_record_dataset(file_pattern=['train.avro'],
                                                              reader_schema=schema,
                                                              features=features,
                                                              shuffle=False,
                                                              batch_size=3,
                                                              num_epochs=1)

for record in dataset:
    print(record['features[*]'])
    print(record['label'])
    print(record['dataType'])
    print("--------------------")

2021-01-19 19:35:37.754502: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set

... [deleted lines]...

2021-01-19 19:35:39.878253: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-19 19:35:39.878756: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2000144999 Hz
[NbConvertApp] ERROR | Kernel died while waiting for execute reply.
Traceback (most recent call last):
  File "/usr/local/bin/jupyter-nbconvert", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/jupyter_core/application.py", line 267, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/nbconvertapp.py", line 338, in start
    self.convert_notebooks()

... [deleted lines] ...

  File "/usr/local/lib/python2.7/dist-packages/nbconvert/preprocessors/execute.py", line 473, in _wait_for_reply
    self._check_alive()
  File "/usr/local/lib/python2.7/dist-packages/nbconvert/preprocessors/execute.py", line 456, in _check_alive
    raise DeadKernelError("Kernel died")
nbconvert.preprocessors.execute.DeadKernelError: Kernel died

burgerkingeater · 2021-01-19T19:56:11Z

@MarkDaoust thanks. Is there any way to see why the kernel crashes? I run it locally and it never fails for me.

kvignesh1420 · 2021-01-19T20:01:14Z

Related discussion: jupyter/nbconvert#1035
Resource constraints seem to affect nbconvert and give out this vague message.

MarkDaoust · 2021-01-19T20:08:56Z

No, I'm running out of ideas here. I was expecting to see a segfault, a core dump, or something more detailed in those subprocess logs.

Are there any debugging techniques you would have run locally that you could apply in Colab?

Colab does compile TensorFlow using tcmalloc, and I've seen that difference trip up extensions a few times, but I have no way of knowing if that's what's going wrong here.

Tip: instead of doing a "restart and run all" you can copy/paste the !jupyter nbconvert --execute --debug avro.ipynb a number of times and do a single "run all" but that isn't getting us any more information.

If we're really at a dead end we could just publish it; a 1/10 intermittent failure on a non-core tutorial isn't a huge deal.
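On the question of debugging techniques that carry over to Colab, one candidate is Python's standard faulthandler module, which writes the Python traceback to a file when the process receives a fatal signal such as SIGSEGV or SIGABRT. The sketch below is an assumption about what might help, not something tried in this thread. The helper name enable_crash_log and the log file name are made up, and it is unknown whether the kernel death here is caused by a signal that faulthandler would catch.

```python
import faulthandler


def enable_crash_log(path):
    """Enable faulthandler and send its tracebacks to the file at `path`.

    The returned file object must stay open for as long as the handler is
    enabled, because faulthandler writes to its file descriptor directly.
    """
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Intended for the first notebook cell, before importing tensorflow_io.
crash_log = enable_crash_log("faulthandler.log")
print("faulthandler enabled:", faulthandler.is_enabled())
```

If the kernel then dies from a native crash, faulthandler.log would show which Python line was executing at the time, which narrows the search even without a core dump.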

burgerkingeater · 2021-01-19T20:23:38Z

@MarkDaoust @kvignesh1420 it shouldn't be related to a memory constraint: the training dataset is very small and Colab has 8 GB of memory. I'm running out of ideas too. I'm OK with publishing the tutorial despite the chance of a crash.
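The memory assumption could be checked rather than inferred by printing the process's peak resident memory after each notebook cell. This is a rough sketch using the standard resource module, which is available on Unix-like systems only. The helper name peak_rss_mib is made up for this example.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


print(f"peak RSS so far: {peak_rss_mib():.1f} MiB")
```

If the last value printed before a crash is far below the machine's memory, that supports ruling out memory pressure from the Python process itself. It would not account for memory used by other processes on the same host.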

burgerkingeater · 2021-01-20T00:24:13Z

@MarkDaoust @yongtang @kvignesh1420 thanks!

…he parsing time (#1283) * Exposes num_parallel_reads and num_parallel_calls -Exposes `num_parallel_reads` and `num_parallel_calls` in AvroRecordDataset and `make_avro_record_dataset` -Adds parameter constraints -Fixes lint issues -Adds test method for _require() function -This update adds a test to check if ValueErrors are raised when given an invalid input for num_parallel_calls * Bump Apache Arrow to 2.0.0 (#1231) * Bump Apache Arrow to 2.0.0 Also bumps Apache Thrift to 0.13.0 Signed-off-by: Yong Tang <[email protected]> * Update code to match Arrow Signed-off-by: Yong Tang <[email protected]> * Bump pyarrow to 2.0.0 Signed-off-by: Yong Tang <[email protected]> * Stay with version=1 for write_feather to pass tests Signed-off-by: Yong Tang <[email protected]> * Bump flatbuffers to 1.12.0 Signed-off-by: Yong Tang <[email protected]> * Fix Windows issue Signed-off-by: Yong Tang <[email protected]> * Fix tests Signed-off-by: Yong Tang <[email protected]> * Fix Windows Signed-off-by: Yong Tang <[email protected]> * Remove -std=c++11 and leave default -std=c++14 for arrow build Signed-off-by: Yong Tang <[email protected]> * Update sha256 of libapr1 As the hash changed by the repo. Signed-off-by: Yong Tang <[email protected]> * Add emulator for gcs (#1234) * Bump com_github_googleapis_google_cloud_cpp to `1.21.0` * Add gcs testbench * Bump `libcurl` to `7.69.1` * Remove the CI build for CentOS 8 (#1237) Building shared libraries on CentOS 8 is pretty much the same as on Ubuntu 20.04 except `apt` should be changed to `yum`. For that our CentOS 8 CI test is not adding a lot of value. Furthermore with the upcoming CentOS 8 change: https://www.phoronix.com/scan.php?page=news_item&px=CentOS-8-Ending-For-Stream CentOS 8 is effectively EOLed at 2021. For that we may want to drop the CentOS 8 build (only leave a comment in README.md) Note we keep CentOS 7 build for now as there are still many users using CentOS 7 and CentOS 7 will only be EOLed at 2024. 
We might drop CentOS 7 build in the future as well if there is similiar changes to CentOS 7 like CentOS 8. Signed-off-by: Yong Tang <[email protected]> * add tf-c-header rule (#1244) * Skip tf-nightly:tensorflow-io==0.17.0 on API compatibility test (#1247) Signed-off-by: Yong Tang <[email protected]> * [s3] add support for testing on macOS (#1253) * [s3] add support for testing on macOS * modify docker-compose cmd * add notebook formatting instruction in README (#1256) * [docs] Restructure README.md content (#1257) * Refactor README.md content * bump to run ci jobs * Update libtiff/libgeotiff dependency (#1258) This PR updates libtiff/libgeotiff to the latest version. Signed-off-by: Yong Tang <[email protected]> * remove unstable elasticsearch test setup on macOS (#1263) * Exposes num_parallel_reads and num_parallel_calls (#1232) -Exposes `num_parallel_reads` and `num_parallel_calls` in AvroRecordDataset and `make_avro_record_dataset` -Adds parameter constraints -Fixes lint issues - Adds test method for _require() function -This update adds a test to check if ValueErrors are raised when given an invalid input for num_parallel_calls Co-authored-by: Abin Shahab <[email protected]> * Added AVRO_PARSER_NUM_MINIBATCH to override num_minibatches Added AVRO_PARSER_NUM_MINIBATCH to override num_minibatches. This is recommended to be set equal to the vcore request. 
* Exposes num_parallel_reads and num_parallel_calls (#1232) * Exposes num_parallel_reads and num_parallel_calls -Exposes `num_parallel_reads` and `num_parallel_calls` in AvroRecordDataset and `make_avro_record_dataset` -Adds parameter constraints -Fixes lint issues * Exposes num_parallel_reads and num_parallel_calls -Exposes `num_parallel_reads` and `num_parallel_calls` in AvroRecordDataset and `make_avro_record_dataset` -Adds parameter constraints -Fixes lint issues * Exposes num_parallel_reads and num_parallel_calls -Exposes `num_parallel_reads` and `num_parallel_calls` in AvroRecordDataset and `make_avro_record_dataset` -Adds parameter constraints -Fixes lint issues * Fixes Lint Issues * Removes Optional typing for method parameter - * Adds test method for _require() function -This update adds a test to check if ValueErrors are raised when given an invalid input for num_parallel_calls * Uncomments skip for macOS pytests * Fixes Lint issues Co-authored-by: Abin Shahab <[email protected]> * add avro tutorial testing data (#1267) Co-authored-by: Cheng Ren <[email protected]> * Update Kafka tutorial to work with Apache Kafka (#1266) * Update Kafka tutorial to work with Apache Kafka Minor update to the Kafka tutorial to remove the dependency on Confluent's distribution of Kafka, and instead work with vanilla Apache Kafka. 
Signed-off-by: Dale Lane <[email protected]> * Address review comments Remove redundant pip install commands Signed-off-by: Dale Lane <[email protected]> * add github workflow for performance benchmarking (#1269) * add github workflow for performance benchmarking * add github-action-benchmark step * handle missing dependencies while benchmarking (#1271) * handle missing dependencies while benchmarking * setup test_sql * job name change * set auto-push to true * remove auto-push * add personal access token * use alternate method to push to gh-pages * add name to the action * use different id * modify creds * use github_token * change repo name * set auto-push * set origin and push results * set env * use PERSONAL_GITHUB_TOKEN * use push changes action * use github.head_ref to push the changes * try using fetch-depth * modify branch name * use alternative push approach * git switch - * test by merging with forked master * Disable s3 macOS for now as docker is not working on GitHub Actions for macOS (#1277) * Revert "[s3] add support for testing on macOS (#1253)" This reverts commit 81789bd. 
Signed-off-by: Yong Tang <[email protected]> * Update Signed-off-by: Yong Tang <[email protected]> * rename testing data files (#1278) * Add tutorial for avro dataset API (#1250) * remove docker based mongodb tests in macos (#1279) * trigger benchmarks workflow only on commits (#1282) * Bump Apache Arrow to 3.0.0 (#1285) Signed-off-by: Yong Tang <[email protected]> * Add bazel cache (#1287) Signed-off-by: Yong Tang <[email protected]> * Add initial bigtable stub test (#1286) * Add initial bigtable stub test Signed-off-by: Yong Tang <[email protected]> * Fix kokoro test Signed-off-by: Yong Tang <[email protected]> * Add reference to github-pages benchmarks in README (#1289) * add reference to github-pages benchmarks * minor grammar change * Update README.md Co-authored-by: Yuan Tang <[email protected]> Co-authored-by: Yuan Tang <[email protected]> * Clear outputs (#1292) * fix kafka online-learning section in tutorial notebook (#1274) * kafka notebook fix for colab env * change timeout from 30 to 20 seconds * reduce stream_timeout * Only enable bazel caching writes for tensorflow/io github actions (#1293) This PR updates so that only GitHub actions run on tensorflow/io repo will be enabled with bazel cache writes. Without the updates, a focked repo actions will cause error. Note once bazel cache read-permissions are enabled from gcs forked repo will be able to access bazel cache (read-only). Signed-off-by: Yong Tang <[email protected]> * Enable ready-only bazel cache (#1294) This PR enables read-only bazel cache Signed-off-by: Yong Tang <[email protected]> * Rename tests (#1297) * Combine Ubuntu 20.04 and CentOS 7 tests into one GitHub jobs (#1299) When GitHub Actions runs it looks like there is an implicit concurrent jobs limit. As such the CentOS 7 test normally is scheduled later after other jobs completes. However, many times CentOS 7 test hangs (e.g., https://github.com/tensorflow/io/runs/1825943449). 
This is likely due to the CentOS 7 test is on the GitHub Actions queue for too long. This PR moves CentOS 7 to run after Ubuntu 20.04 test complete, to try to avoid hangs. Signed-off-by: Yong Tang <[email protected]> * Update names of api tests (#1300) We renamed the tests to remove "_eager" parts. This PR updates the api test for correct filenames Signed-off-by: Yong Tang <[email protected]> * Fix wrong benchmark tests names (#1301) Fixes wrong benchmark tests names caused by last commit Signed-off-by: Yong Tang <[email protected]> * Patch arrow to temporarily resolve the ARROW-11518 issue (#1304) This PR patchs arrow to temporarily resolve the ARROW-11518 issue. See 1281 for details Credit to diggerk. We will update arrow after the upstream PR is merged. Signed-off-by: Yong Tang <[email protected]> * Remove AWS headers from tensorflow, and use headers from third_party … (#1241) * Remove external headers from tensorflow, and use third_party headers instead This PR removes external headers from tensorflow, and use third_party headers instead. Signed-off-by: Yong Tang <[email protected]> * Address review comment Signed-off-by: Yong Tang <[email protected]> * Switch to use github to download libgeotiff (#1307) Signed-off-by: Yong Tang <[email protected]> * Add @com_google_absl//absl/strings:cord (#1308) Fix read/STDIN_FILENO Signed-off-by: Yong Tang <[email protected]> * Switch to modular file system for hdfs (#1309) * Switch to modular file system for hdfs This PR is part of the effort to switch to modular file system for hdfs. When TF_ENABLE_LEGACY_FILESYSTEM=1 is provided, old behavior will be preserved. Signed-off-by: Yong Tang <[email protected]> * Build against tf-nightly Signed-off-by: Yong Tang <[email protected]> * Update tests Signed-off-by: Yong Tang <[email protected]> * Adjust the if else logic, follow review comment Signed-off-by: Yong Tang <[email protected]> * Disable test_write_kafka test for now. 
(#1310) With tensorflow upgrade to tf-nightly, the test_write_kafka test is failing and that is block the plan to modular file system migration. This PR disables the test temporarily so that CI can continue to push tensorflow-io-nightly image (needed for modular file system migration) Signed-off-by: Yong Tang <[email protected]> * Switch to modular file system for s3 (#1312) This PR is part of the effort to switch to modular file system for s3. When TF_ENABLE_LEGACY_FILESYSTEM=1 is provided, old behavior will be preserved. Signed-off-by: Yong Tang <[email protected]> * Add python 3.9 on Windows (#1316) * Updates the PR to use attribute instead of Env Variable -Originally AVRO_PARSER_NUM_MINIBATCH was set as an environmental variable. Because tensorflow-io rarely uses env vars to fine tune kernal ops this was changed to an attribute. See comment here: #1283 (comment) * Added AVRO_PARSER_NUM_MINIBATCH to override num_minibatches Added AVRO_PARSER_NUM_MINIBATCH to override num_minibatches. This is recommended to be set equal to the vcore request. * Updates the PR to use attribute instead of Env Variable -Originally AVRO_PARSER_NUM_MINIBATCH was set as an environmental variable. Because tensorflow-io rarely uses env vars to fine tune kernal ops this was changed to an attribute. See comment here: #1283 (comment) * Adds addtional comments in source code for understandability Co-authored-by: Abin Shahab <[email protected]> Co-authored-by: Yong Tang <[email protected]> Co-authored-by: Vo Van Nghia <[email protected]> Co-authored-by: Vignesh Kothapalli <[email protected]> Co-authored-by: Cheng Ren <[email protected]> Co-authored-by: Cheng Ren <[email protected]> Co-authored-by: Dale Lane <[email protected]> Co-authored-by: Yuan Tang <[email protected]> Co-authored-by: Mark Daoust <[email protected]>

Signed-off-by: Yong Tang <[email protected]> * Build against tf-nightly Signed-off-by: Yong Tang <[email protected]> * Update tests Signed-off-by: Yong Tang <[email protected]> * Adjust the if else logic, follow review comment Signed-off-by: Yong Tang <[email protected]> * Disable test_write_kafka test for now. (tensorflow#1310) With tensorflow upgrade to tf-nightly, the test_write_kafka test is failing and that is block the plan to modular file system migration. This PR disables the test temporarily so that CI can continue to push tensorflow-io-nightly image (needed for modular file system migration) Signed-off-by: Yong Tang <[email protected]> * Switch to modular file system for s3 (tensorflow#1312) This PR is part of the effort to switch to modular file system for s3. When TF_ENABLE_LEGACY_FILESYSTEM=1 is provided, old behavior will be preserved. Signed-off-by: Yong Tang <[email protected]> * Add python 3.9 on Windows (tensorflow#1316) * Updates the PR to use attribute instead of Env Variable -Originally AVRO_PARSER_NUM_MINIBATCH was set as an environmental variable. Because tensorflow-io rarely uses env vars to fine tune kernal ops this was changed to an attribute. See comment here: tensorflow#1283 (comment) * Added AVRO_PARSER_NUM_MINIBATCH to override num_minibatches Added AVRO_PARSER_NUM_MINIBATCH to override num_minibatches. This is recommended to be set equal to the vcore request. * Updates the PR to use attribute instead of Env Variable -Originally AVRO_PARSER_NUM_MINIBATCH was set as an environmental variable. Because tensorflow-io rarely uses env vars to fine tune kernal ops this was changed to an attribute. 
See comment here: tensorflow#1283 (comment) * Adds addtional comments in source code for understandability Co-authored-by: Abin Shahab <[email protected]> Co-authored-by: Yong Tang <[email protected]> Co-authored-by: Vo Van Nghia <[email protected]> Co-authored-by: Vignesh Kothapalli <[email protected]> Co-authored-by: Cheng Ren <[email protected]> Co-authored-by: Cheng Ren <[email protected]> Co-authored-by: Dale Lane <[email protected]> Co-authored-by: Yuan Tang <[email protected]> Co-authored-by: Mark Daoust <[email protected]>

burgerkingeater force-pushed the tutorial branch from 897df36 to 1cbc657 on December 25, 2020 05:08

kvignesh1420 reviewed Dec 29, 2020

burgerkingeater force-pushed the tutorial branch 2 times, most recently from 8413bf7 to 7f5a3ba on December 30, 2020 00:22

kvignesh1420 reviewed Dec 31, 2020

burgerkingeater force-pushed the tutorial branch from 3a5a361 to 411673e on January 4, 2021 07:38

kvignesh1420 reviewed Jan 7, 2021

docs/tutorials/avro.ipynb: 3 resolved review threads

burgerkingeater force-pushed the tutorial branch 2 times, most recently from 4061683 to 1f22557 on January 10, 2021 08:48

kvignesh1420 reviewed Jan 10, 2021

docs/tutorials/avro.ipynb: 1 resolved review thread (outdated)

kvignesh1420 requested changes Jan 10, 2021

burgerkingeater force-pushed the tutorial branch from 6d2ebdb to f843272 on January 10, 2021 10:43

burgerkingeater mentioned this pull request Jan 19, 2021

renaming testing data files used in Avro tutorial #1278

Merged

burgerkingeater force-pushed the tutorial branch from 853eece to f723704 on January 19, 2021 18:21

Add tutorial for avro dataset API

0747f5d

burgerkingeater force-pushed the tutorial branch from f723704 to 0747f5d on January 19, 2021 18:41

MarkDaoust approved these changes Jan 19, 2021

MarkDaoust merged commit 33ad81f into tensorflow:master Jan 19, 2021

burgerkingeater deleted the tutorial branch January 20, 2021 00:08

i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021

Add tutorial for avro dataset API (tensorflow#1250)

65bdfab

i-ony pushed a commit to i-ony/io that referenced this pull request Mar 8, 2021

Add tutorial for avro dataset API (tensorflow#1250)

5805a2c

i-ony pushed a commit to i-ony/io that referenced this pull request Mar 15, 2021

Add tutorial for avro dataset API (tensorflow#1250)

2ef3c2b

michaelbanfield pushed a commit to michaelbanfield/io that referenced this pull request Mar 30, 2021

Add tutorial for avro dataset API (tensorflow#1250)

337ef96

Conversation

burgerkingeater commented Dec 25, 2020

review-notebook-app bot commented Dec 25, 2020

burgerkingeater commented Dec 27, 2020

burgerkingeater commented Dec 29, 2020

kvignesh1420 left a comment

burgerkingeater commented Dec 29, 2020

burgerkingeater commented Dec 30, 2020

kvignesh1420 left a comment

burgerkingeater commented Jan 4, 2021

yongtang commented Jan 4, 2021

burgerkingeater commented Jan 5, 2021

burgerkingeater commented Jan 5, 2021

yongtang commented Jan 5, 2021

burgerkingeater commented Jan 6, 2021

burgerkingeater commented Jan 6, 2021

burgerkingeater commented Jan 7, 2021

burgerkingeater commented Jan 10, 2021

kvignesh1420 left a comment

burgerkingeater commented Jan 10, 2021

kvignesh1420 commented Jan 10, 2021

burgerkingeater commented Jan 11, 2021

yongtang commented Jan 11, 2021

burgerkingeater commented Jan 19, 2021

kvignesh1420 commented Jan 19, 2021

burgerkingeater commented Jan 19, 2021

kvignesh1420 commented Jan 19, 2021

burgerkingeater commented Jan 19, 2021

burgerkingeater commented Jan 19, 2021

MarkDaoust commented Jan 19, 2021

kvignesh1420 commented Jan 19, 2021

MarkDaoust commented Jan 19, 2021

burgerkingeater commented Jan 19, 2021

kvignesh1420 commented Jan 19, 2021

MarkDaoust commented Jan 19, 2021

burgerkingeater commented Jan 19, 2021

kvignesh1420 commented Jan 19, 2021

MarkDaoust commented Jan 19, 2021

burgerkingeater commented Jan 19, 2021

burgerkingeater commented Jan 20, 2021
