Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(dataset): adding data from external files without copy #974

Merged
merged 6 commits into from
Mar 18, 2020

Conversation

mohammad-alisafaee
Copy link
Contributor

@mohammad-alisafaee mohammad-alisafaee commented Jan 31, 2020

Description

To add an external file to a dataset pass --external option to the dataset add command. To check for modification in external files run renku dataset update --external.

Fixes #815

@mohammad-alisafaee mohammad-alisafaee requested a review from a team as a code owner January 31, 2020 07:54
@mohammad-alisafaee mohammad-alisafaee changed the title [WIP] feat: adding external files to datasets without copying data wip: adding external files to datasets without copying data Jan 31, 2020
@mohammad-alisafaee mohammad-alisafaee changed the title wip: adding external files to datasets without copying data feat [WIP]: adding external files to datasets without copying data Jan 31, 2020
@mohammad-alisafaee mohammad-alisafaee changed the title feat [WIP]: adding external files to datasets without copying data feat(WIP): adding external files to datasets without copying data Jan 31, 2020
@rokroskar
Copy link
Member

This is starting to look nice! I just tried it and one thing I notice is that in the resulting graph there is no link between the pointer file and the file in the dataset (see below).

noname gv

Another thought I had was that if we are adding a pointer reference in .renku maybe it should be generic? That way I could do something like renku run --external-input <path-outside-of-repo> myscript.sh

@emmjab
Copy link
Contributor

emmjab commented Jan 31, 2020

I tried it out in the following way:

  1. created a simple text file in a temporary location
  2. added to the dataset with the new "external" flag
  3. ran a renku run a script to take that file as an input; write the contents into a new file in the repo
  4. pushed the repo to dev.renku.ch

I was able to view the lineage from the output file (but not the data file or the pointer file) [1]
The pointer files were effectively not there in the repo, but the metadata was preserved.

renku init
renku dataset create test
renku dataset add test /tmp/testing_renku/data.txt --external
renku run python scripts/modify_test.py data/test/data.txt out.txt

[1] https://dev.renku.ch/projects/e.jablonski/test_external/files/lineage/out.txt
[2] https://dev.renku.ch/projects/e.jablonski/test_external/files/blob/.renku/datasets/f6826b9c-f9a3-49ad-92ed-5464780874be/pointers/a1e20fa1-8ace-4fd8-8797-472113f6785a-e8e09898d91cc3a412ba36751fdba4ac931acb5c

Seems good to me! I didn't try changing data in the input file or code and running a renku update yet

@rokroskar
Copy link
Member

rokroskar commented Jan 31, 2020

Awesome, thanks for making the project @emmjab - This looks like a good starting point!

@mohammad-alisafaee
Copy link
Contributor Author

@rokroskar I'd say the pointer file should not show up in the graph or lineage at all. It's just an implementation detail that users shouldn't care about. I'm not sure though if it is possible to exclude it.

@emmjab
Copy link
Contributor

emmjab commented Jan 31, 2020

Also just tested the renku dataset update --external and called renku update; also works as I would expect & generates a linage: https://dev.renku.ch/projects/e.jablonski/test_external/files/lineage/out.txt

@rokroskar
Copy link
Member

I'd say the pointer file should not show up in the graph or lineage at all. It's just an implementation detail that users shouldn't care about. I'm not sure though if it is possible to exclude it.

I see - I suppose at the moment there isn't a way to exclude it. But if it's there, maybe a link between the two entities would make sense?

Two other points that come to mind at the moment:

  1. naming the external file nodes: should we have some stable naming convention? As it stands, I could use the same data from two different projects but this would not be obvious from the KG (apart from maybe the last part of the path being the same). I guess eventually they could be linked via their sha256 or similar. Have you thought about that?

  2. Could there be a mechanism to easily update the link in case the location of the external data changes?

Actually, I have a third :)

  1. Could this be extended to allow something like:
pwd
  <a renku project root>
ln -s ../../../data/path/file.csv
renku external add file.csv

So now in my renku project, I have a symbolic link file.csv pointing to the external data and the command renku external add registered this file and created the a file in .renku/external (?) that holds the metadata (checksum, etc.) Alternatively, the registration could be implicit, i.e. only when the link is used in a renku command. So I could create the symbolic link and commit it, but only when I did

renku run --input file.csv script.py 

we would see that it's a symbolic link and create the requisite metadata. This is very similar to the dataset functionality but it allows me to do it without having to explicitly create a dataset.

@ciyer
Copy link

ciyer commented Feb 6, 2020

This is very nice, and works pretty well!

I tested a case where I have code in one repo and a file that is processed just in a regular directory. The process and results are documented here:

https://dev.renku.ch/projects/renku-external-data-one-repo/renku-external-data-code/

I found the same thing that Emma found:

  • The pointer file is shown as an upstream input to the result, but it does not have lineage itself, which is probably a bug

Question
Is there a reason the pointer files are stored in git-lfs? I would think they can/should be stored in plain git.

@mohammad-alisafaee mohammad-alisafaee force-pushed the 815-poc-file-pointer branch 6 times, most recently from 9a5ed60 to 75e835a Compare March 11, 2020 10:25
@mohammad-alisafaee mohammad-alisafaee changed the title feat(WIP): adding external files to datasets without copying data feat(dataset): adding data from external files without copy Mar 11, 2020
@mohammad-alisafaee
Copy link
Contributor Author

Thanks all for your comments!
The issue with lineage of external file is solved and the lineage is available now.

  1. naming the external file nodes: should we have some stable naming convention? As it stands, I could use the same data from two different projects but this would not be obvious from the KG (apart from maybe the last part of the path being the same). I guess eventually they could be linked via their sha256 or similar. Have you thought about that?

As we discussed this will be implemented later.

  1. Could there be a mechanism to easily update the link in case the location of the external data changes?

You can re-add the files with --force flag from the new location.

rokroskar
rokroskar previously approved these changes Mar 17, 2020
Copy link
Member

@rokroskar rokroskar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works really well! I only have a few minor wording suggestions for the documentation.

One thing I realized is that it won't work for directories. So if I want to add a directory of files and then want renku to respond to any new data added to the directory I basically have to re-add the data. Maybe we can have a follow-up issue for that.

Comment on lines 134 to 136
actual files to your repository. This is useful for example when external data
is too large to store locally. The external data must exist (i.e. mounted) on
your filesystem. Renku create a symbolic to your data and you can use this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
actual files to your repository. This is useful for example when external data
is too large to store locally. The external data must exist (i.e. mounted) on
your filesystem. Renku create a symbolic to your data and you can use this
actual files to your repository. This is useful for example when external data
is too large to store locally. The external data must exist (i.e. be mounted) on
your filesystem. Renku creates a symbolic to your data and you can use this

'-e',
'--external',
is_flag=True,
help='Update only external storage files (symlinks).'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
help='Update only external storage files (symlinks).'
help='Update only external data.'

Maybe we can keep this more generic?

if client.has_external_files():
click.echo(
'Changes in external files are not detected automatically. To '
'update external files run "renku dataset update -e".'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


problems = (
'\n' + WARNING + 'There are missing external files.\n'
' (make sure that external volume is mounted and accessible)' +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works really nicely! Just a minor wording suggestion - in most cases it might not be a mounted volume so the user might wonder what that's all about.

Suggested change
' (make sure that external volume is mounted and accessible)' +
' (make sure that the external path is accessible)' +

@mohammad-alisafaee mohammad-alisafaee merged commit 6a17512 into master Mar 18, 2020
@mohammad-alisafaee mohammad-alisafaee deleted the 815-poc-file-pointer branch March 18, 2020 10:37
Panaetius pushed a commit that referenced this pull request Apr 14, 2020
* feat: support for external file

* fix: make renku work with symlinks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

create a PoC using a pointer file to link to arbitrary external data
5 participants