
snowflake: improve performance on write #760

Merged: 13 commits into r-dbi:main, Mar 15, 2024
Conversation

detule (Collaborator) commented Feb 14, 2024

Snowflake: use the current database/schema, if none are provided, in odbcConnectionColumns and when checking for table existence in dbWriteTable.

Fixes #759.

meztez (Contributor) commented Feb 19, 2024

Do you think this should also be done for the odbcConnectionTables method?

dbExistsTable ends up with the same problem: it lists tables from every DATABASE / SCHEMA, instead of only the one that dbWriteTable will use when checking whether the table already exists in the target database.

    found <- dbExistsTable(conn, name)

    setMethod("dbExistsTable", c("OdbcConnection", "character"),

    setMethod("odbcConnectionTables", c("OdbcConnection", "character"),

    connection_sql_tables <- function(p, catalog_name = NULL, schema_name = NULL,
                                      table_name = NULL, table_type = NULL) {

detule (Collaborator, Author) commented Feb 19, 2024

Thanks @meztez

I thought about this, but I think we need to be a bit careful. We should probably interpret dbExistsTable(conn, "tblname") to mean "does tblname exist anywhere", not just in the current db/schema.

The method has an ellipsis; one approach is to add an odbc-specific argument, something like useCurrentSchemaIfUnspecified, defaulting to FALSE. We would set it to TRUE in dbWriteTable. What do you think?
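A rough sketch of that idea (the argument name comes from the suggestion above; `getCurrentCatalogSchema()` is a hypothetical helper and the `odbcConnectionTables()` argument names are approximate, not the PR's actual implementation):

```r
setMethod("dbExistsTable", c("OdbcConnection", "character"),
  function(conn, name, ..., useCurrentSchemaIfUnspecified = FALSE) {
    if (useCurrentSchemaIfUnspecified) {
      # Scope the lookup to the connection's current catalog/schema
      # (hypothetical helper, not part of the odbc package API).
      current <- getCurrentCatalogSchema(conn)
      res <- odbcConnectionTables(conn, name,
        catalog_name = current$catalog, schema_name = current$schema)
    } else {
      # Current behavior: search every catalog/schema.
      res <- odbcConnectionTables(conn, name)
    }
    NROW(res) > 0
  }
)
```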

meztez (Contributor) commented Feb 19, 2024

@detule Totally makes sense. The issue is only the found check in the dbWriteTable method; in any other use case, your response is on point. I'm not sure of a faster way to do the found check.

hadley (Member) commented Feb 20, 2024

@detule hmmmm, my gut feeling would have been that it only searches the current catalog/schema. We should try to figure out the underlying principle here.

detule (Collaborator, Author) commented Feb 20, 2024

Hey @hadley - copy. Though as currently implemented it searches everywhere. Given the ambiguity in how to interpret it, I wonder if it makes sense to side with "current precedent" so as not to break any code folks might have out in the wild.

This is further, I think, muddled by some anecdotal evidence - for example with SQL Server when you create a temporary table it is not in your current catalog. Rather, it is housed in tempdb.dbo. So - if trying to write to a temp table and only checking for existence in the current catalog, you might never find it.

With this said, I wonder if it makes sense to merge the optimization in this PR since according to #759 it goes a way towards improving performance. Happy to tackle dbExistsTable as a follow up.

@detule detule requested review from hadley and simonpcouch February 28, 2024 09:31
detule (Collaborator, Author) commented Feb 28, 2024

@hadley

Let me know if I missed the mark on this one.

It also allows us, down the line and if needed, to change the default for dbExistsTableForWrite to something like what you suggested: checking the output of SELECT *.
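As an illustration, the "existence via SELECT *" idea could be sketched with generic DBI calls against a live connection (this is not the PR's implementation; error handling is simplified):

```r
# A table "exists for write" if selecting zero rows from it succeeds; this
# naturally respects the current catalog/schema and also finds temporary
# tables wherever the backend houses them (e.g. tempdb.dbo on SQL Server).
exists_via_select <- function(conn, name) {
  qry <- paste0("SELECT * FROM ", DBI::dbQuoteIdentifier(conn, name),
                " WHERE 1 = 0")
  tryCatch({
    DBI::dbGetQuery(conn, qry)
    TRUE
  }, error = function(e) FALSE)
}
```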

@@ -1,5 +1,7 @@
# odbc (development version)

* Snowflake: Improved performance on write (#760).
Collaborator commented on the diff:

By quite a bit! 🎉

Collaborator (Author) replied:

Thanks Simon!

hadley (Member) left a comment:

With regards to the SELECT *, I was thinking of that more as a general principle to guide whether or not a table exists, rather than an implementation idea.

It really feels like, particularly for Snowflake and Databricks, that searching every catalog/schema makes everything really slow, and I'm not convinced that makes sense as a default, since it doesn't correspond to the mental model you build up writing SQL queries. Does that make sense?


#' @rdname Snowflake
setMethod("dbExistsTableForWrite", c("Snowflake", "character"),
function(conn, name, ...) {
Member commented on the diff:
I think schema and catalog should be the explicit arguments here since you're modifying them, and I'm pretty sure that should allow you to drop the do.call.

Collaborator (Author) replied:

Hey - that's also a good thought!

But unless you feel strongly about this, I would err on the side of leaving the signature of dbExistsTableForWrite the same as dbExistsTable - at least for now, with the current default.

Member replied:

I think we can leave the signature of the generic the same, but it's ok for the signature of the method to be different?
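For illustration, S4 does let a method declare extra named arguments after the generic's `...` while the generic's signature stays unchanged. A toy, self-contained sketch (class and function names are made up, not the odbc code):

```r
library(methods)

setClass("Conn", representation(name = "character"))

# The generic exposes only conn, name, and dots.
setGeneric("existsForWrite", function(conn, name, ...) {
  standardGeneric("existsForWrite")
})

# The method adds catalog_name/schema_name even though the generic does not.
setMethod("existsForWrite", "Conn",
  function(conn, name, ..., catalog_name = NULL, schema_name = NULL) {
    is.null(catalog_name) && is.null(schema_name)
  }
)

existsForWrite(new("Conn", name = "x"), "tbl")                      # TRUE
existsForWrite(new("Conn", name = "x"), "tbl", catalog_name = "db") # FALSE
```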

Collaborator (Author) replied:

Done - let me know if that's not what you had in mind.

detule (Collaborator, Author) commented Mar 2, 2024

> With regards to the SELECT *, I was thinking of that more as a general principle to guide whether or not a table exists, rather than an implementation idea.
>
> It really feels like, particularly for Snowflake and Databricks, that searching every catalog/schema makes everything really slow, and I'm not convinced that makes sense as a default, since it doesn't correspond to the mental model you build up writing SQL queries. Does that make sense?

I think I am with you, in particular when writing. We should assume that the identifier passed is within the scope of the CURRENT catalog/schema, or is a temporary table. There is some nuance with temporary tables: I think we are safe with what we have done here for Snowflake, since its temporary tables are always created in the current catalog/schema, but for other back-ends this may not be the case.

I plan on following up for Databricks - I am not even sure we are creating temporary tables correctly there when using dbWriteTable(..., temporary = TRUE).

hadley (Member) commented Mar 5, 2024

Ah yeah, I keep forgetting about temporary tables and how they live in different places in different databases. I think that's a nice place where the SELECT * idea works (since presumably if you CREATE TEMPORARY TABLE foo then SELECT * FROM foo will work), but it may be hard to translate that idea into code that uses the odbc interface.


#' @rdname Snowflake
setMethod("dbExistsTableForWrite", c("Snowflake", "character"),
function(conn, name, ...,
Member replied:
Yeah, this looks good!

@detule detule merged commit 3bca21d into r-dbi:main Mar 15, 2024
16 checks passed
4 participants