feat(dataset): adding data from external files without copy #974
This is starting to look nice! I just tried it and one thing I notice is that in the resulting graph there is no link between the pointer file and the file in the dataset (see below). Another thought I had was that if we are adding a pointer reference in |
I tried it out in the following way:
I was able to view the lineage from the output file (but not the data file or the pointer file) [1].
[1] https://dev.renku.ch/projects/e.jablonski/test_external/files/lineage/out.txt
Seems good to me! I didn't try changing data in the input file or code and running a renku update yet.
Awesome, thanks for making the project @emmjab - This looks like a good starting point!
@rokroskar I'd say the pointer file should not show up in the graph or lineage at all. It's just an implementation detail that users shouldn't care about. I'm not sure though if it is possible to exclude it.
Also just tested the |
I see - I suppose at the moment there isn't a way to exclude it. But if it's there, maybe a link between the two entities would make sense? Two other points that come to mind at the moment:
Actually, I have a third :)
So now in my renku project, I have a symbolic link
we would see that it's a symbolic link and create the requisite metadata. This is very similar to the dataset functionality but it allows me to do it without having to explicitly create a dataset.
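The idea of recognising a symbolic link and treating it as external data can be sketched roughly as follows. This is only an illustration, not Renku's implementation: the function name `is_external_symlink` and the rule "a link is external when its target resolves outside the repository root" are assumptions made for this example.

```python
from pathlib import Path


def is_external_symlink(path, repo_root):
    """Return True if `path` is a symlink whose target lies outside `repo_root`.

    Hypothetical helper for illustration; Renku's real check may differ.
    """
    path = Path(path)
    if not path.is_symlink():
        return False
    root = Path(repo_root).resolve()
    target = path.resolve()  # follows the link to its final target
    # The target is external when the repository root is not one of its parents.
    return target != root and root not in target.parents
```

A link that points at another file inside the repository is deliberately not counted as external here, since it would already be tracked in the usual way.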
This is very nice, and works pretty well! I tested a case where I have code in one repo and a file that is processed just in a regular directory. The process and results are documented here: https://dev.renku.ch/projects/renku-external-data-one-repo/renku-external-data-code/ I found the same thing that Emma found:
Question |
d81943d to 0f9adc9
9a5ed60 to 75e835a
75e835a to 6c07126
Thanks all for your comments!
As we discussed this will be implemented later.
You can re-add the files with |
This works really well! I only have a few minor wording suggestions for the documentation.
One thing I realized is that it won't work for directories. So if I want to add a directory of files and then want renku to respond to any new data added to the directory I basically have to re-add the data. Maybe we can have a follow-up issue for that.
renku/cli/dataset.py
actual files to your repository. This is useful for example when external data
is too large to store locally. The external data must exist (i.e. mounted) on
your filesystem. Renku create a symbolic to your data and you can use this
- actual files to your repository. This is useful for example when external data
- is too large to store locally. The external data must exist (i.e. mounted) on
- your filesystem. Renku create a symbolic to your data and you can use this
+ actual files to your repository. This is useful for example when external data
+ is too large to store locally. The external data must exist (i.e. be mounted) on
+ your filesystem. Renku creates a symbolic to your data and you can use this
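To make the behaviour described in this docstring concrete, here is a minimal sketch of adding a file by symlink instead of copying it. It assumes nothing about Renku's internals: `link_external_file` and the `dataset_dir` layout are hypothetical names chosen for the example.

```python
import os
from pathlib import Path


def link_external_file(external_path, dataset_dir):
    """Create a symlink in `dataset_dir` that points at `external_path`.

    Hypothetical helper: no data is copied, only a link is created.
    """
    source = Path(external_path).resolve()
    if not source.is_file():
        # Mirrors the requirement that the external data must exist locally.
        raise FileNotFoundError(f"External file not found: {source}")
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    link = dataset_dir / source.name
    os.symlink(source, link)  # raises FileExistsError if the link already exists
    return link
```

Reading through the returned link gives the content of the external file, so downstream code can use it like any other dataset file as long as the target stays accessible.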
renku/cli/dataset.py
'-e',
'--external',
is_flag=True,
help='Update only external storage files (symlinks).'
- help='Update only external storage files (symlinks).'
+ help='Update only external data.'
Maybe we can keep this more generic?
if client.has_external_files():
click.echo(
'Changes in external files are not detected automatically. To '
'update external files run "renku dataset update -e".'
👍
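Because changes in external files are not picked up automatically, an explicit update step has to compare the current content against something recorded earlier. The sketch below shows one way this could work, using SHA-256 checksums. The `recorded` mapping and both function names are assumptions for illustration; the PR excerpt does not show how Renku stores or compares this information.

```python
import hashlib


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_external_files(recorded):
    """Return the paths whose current checksum differs from the recorded one.

    `recorded` is a hypothetical mapping of path -> checksum taken when the
    file was added.
    """
    return [
        path for path, checksum in recorded.items()
        if file_checksum(path) != checksum
    ]
```

Reading in chunks keeps memory use flat, which matters when the external data is too large to store locally in the first place.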
problems = (
'\n' + WARNING + 'There are missing external files.\n'
' (make sure that external volume is mounted and accessible)' +
This works really nicely! Just a minor wording suggestion - in most cases it might not be a mounted volume so the user might wonder what that's all about.
- ' (make sure that external volume is mounted and accessible)' +
+ ' (make sure that the external path is accessible)' +
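For context, a check for missing external files can be as simple as looking for symlinks whose targets no longer exist. The following is a sketch under that assumption; `find_missing_external_files` is a hypothetical name and not the function used in this PR.

```python
import os


def find_missing_external_files(root):
    """Return sorted paths of symlinks under `root` whose target is missing."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # A dangling symlink is still a link, but its target does not exist.
            if os.path.islink(path) and not os.path.exists(path):
                missing.append(path)
    return sorted(missing)
```

The check only asks whether the target path is reachable, which matches the suggested wording: the data may be on an unmounted volume, or it may simply have been moved or deleted.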
85814f0 to 7e552ba
* feat: support for external file
* fix: make renku work with symlinks
Description
To add an external file to a dataset, pass the --external option to the dataset add command. To check for modifications in external files, run renku dataset update --external.
Fixes #815