
feat: avoid double fetches on ibis.duckdb.connect().read_csv("https://slow_url").cache() #10845

Open · 1 task done
NickCrews opened this issue Feb 14, 2025 · 2 comments
Labels: feature (Features or general enhancements)

@NickCrews (Contributor)

Is your feature request related to a problem?

When you call .read_csv() on the duckdb backend, duckdb actually goes and fetches [some] of the data in order to sniff the schema. Then, when you call .cache() on the created view, it goes and fetches the full data.

This is related to #9931.
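The two-pass behavior described above can be mimicked in plain Python: one pass reads only a small prefix of the data to infer the schema (what read_csv does), and a second pass reads everything (what .cache() does). A minimal stdlib sketch with hypothetical inline CSV content standing in for the remote file:

```python
import csv
import io

# Hypothetical CSV content; in the real issue this sits behind a slow URL.
data = "a,b,c\n1,2,3\n4,5,6\n"

# Pass 1: sniff the schema from a small prefix only.
prefix = data[:8]
header = next(csv.reader(io.StringIO(prefix)))
print(header)  # ['a', 'b', 'c']

# Pass 2: fetch and materialize the full data.
rows = list(csv.reader(io.StringIO(data)))[1:]
print(rows)  # [['1', '2', '3'], ['4', '5', '6']]
```

Over HTTP, each pass is a separate request; on a slow connection the prefix request alone can take seconds, which is the cost the issue wants to pay only once.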

What is the motivation behind your request?

I am working on relatively large tables on a slow internet connection. Each fetch takes about 30 seconds. I would like to avoid this double fetch.

Describe the solution you'd like

Since the result of .read_csv() needs to be a Table with a known schema, some data will have to be fetched during that function call. So I think we need to add an optional argument to the function, or create an entirely new function. I would vote for adding a parameter if we can come up with something sane. Maybe cache: bool?
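Until such a parameter exists, one workaround is to download the file exactly once to a local path and point read_csv at the local copy, so both the schema sniff and the materialization hit the disk rather than the network. A self-contained sketch using only the stdlib, with a local file:// URL standing in for the slow https:// URL:

```python
import csv
import pathlib
import tempfile
import urllib.request

# Stand-in for the remote file; in the real scenario this is a slow https:// URL.
src = pathlib.Path(tempfile.mkdtemp()) / "remote.csv"
src.write_text("a,b\n1,2\n3,4\n")
url = src.as_uri()

# Fetch exactly once to a local path...
local_path, _ = urllib.request.urlretrieve(url)

# ...then all subsequent reads hit the local copy. With ibis this would be
# ibis.duckdb.connect().read_csv(local_path).cache(): no second network fetch.
with open(local_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)  # [['a', 'b'], ['1', '2'], ['3', '4']]
```

This trades a bit of local disk for a single network round-trip, which is close to what a `cache: bool` parameter would do internally.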

What version of ibis are you running?

main

What backend(s) are you using, if any?

duckdb

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the feature Features or general enhancements label Feb 14, 2025
@NickCrews NickCrews changed the title feat: avoid double fetches on ibis.duckdb.connect().read_csv("https://....").cache() feat: avoid double fetches on ibis.duckdb.connect().read_csv("https://slow_url").cache() Feb 14, 2025
@cpcloud (Member) commented Feb 14, 2025

Can you break down the timing of:

CREATE VIEW v AS FROM read_csv('slow_url')

and

CREATE TABLE t AS FROM read_csv('slow_url')

?

It's not clear that there's a "double fetch" (as in two equally slow fetches) so much as a small fetch followed by a large fetch.
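One way to get those timings is the duckdb CLI's built-in timer; a sketch, assuming the duckdb CLI and the slow URL from the title:

```sql
.timer on
-- Expected small fetch: reads only enough data to sniff the schema.
CREATE VIEW v AS FROM read_csv('https://slow_url');
-- Expected large fetch: materializes the entire file.
CREATE TABLE t AS FROM read_csv('https://slow_url');
```

If the view creation is much faster than the table creation, the cost is dominated by one large fetch rather than two equivalent ones.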

@NickCrews (Contributor, Author)

Yeah, good idea. I'll do that and report back.

Projects
Status: backlog
Development

No branches or pull requests

2 participants