feat: Delta lake data source (initial implementation) #1119

Merged
merged 7 commits into from
Jun 14, 2023

@scsmithr scsmithr commented Jun 10, 2023

Adds delta tables/deltalake as a data source using delta-rs.

What this looks like:

CREATE EXTERNAL DATABASE testing FROM delta OPTIONS
     (catalog_type = 'unity',
      access_token = '<access-token>',
      workspace_url = '<workspace-url>',
      catalog_id = '<catalog-id>',
      region = '<bucket-region>',
      access_key_id = '<aws-access-key-id>',
      secret_access_key = '<aws-secret-key>'
     );

And querying works just like the other data sources:

select * from testing.default.userdata_1 limit 5;
  registration_dttm  | id | first_name | last_name |          email           | gender |   ip_address   |        cc        |   country    | birthdate |  salary   |         title          | comments 
---------------------+----+------------+-----------+--------------------------+--------+----------------+------------------+--------------+-----------+-----------+------------------------+----------
 2016-02-03 07:55:29 |  1 | Amanda     | Jordan    | ajordan0@com.com         | Female | 1.197.201.2    | 6759521864920116 | Indonesia    | 3/8/1971  |  49756.53 | Internal Auditor       | 1E+02
 2016-02-03 17:04:03 |  2 | Albert     | Freeman   | afreeman1@is.gd          | Male   | 218.111.175.34 |                  | Canada       | 1/16/1968 | 150280.17 | Accountant IV          | 
 2016-02-03 01:09:31 |  3 | Evelyn     | Morgan    | emorgan2@altervista.org  | Female | 7.161.136.94   | 6767119071901597 | Russia       | 2/1/1960  | 144972.51 | Structural Engineer    | 
 2016-02-03 00:36:21 |  4 | Denise     | Riley     | driley3@gmpg.org         | Female | 140.35.109.83  | 3576031598965625 | China        | 4/8/1997  |  90263.05 | Senior Cost Accountant | 
 2016-02-03 05:05:31 |  5 | Carlos     | Burns     | cburns4@miitbeian.gov.cn |        | 169.113.235.40 | 5602256255204850 | South Africa |           |           |                        | 
(5 rows)

Current status

  • Currently AWS only. There's a TODO to support other providers, but I wanted to get something working first.
  • Currently only supports Databricks' Unity Catalog. There's potential here to plug in other catalogs as well (including our own). A local catalog implementation would be useful for testing.
  • Missing support for querying delta tables without a catalog.
  • Missing listing functions.
  • Writes not supported (we'll want to review how we want to support writes here and in other data sources)

Suffice it to say there's a lot missing, and things will change. I want to get a general framework in place to build on.

@scsmithr scsmithr changed the title from "feat: Delta lake data source" to "feat: Delta lake data source (initial implementation)" Jun 13, 2023
@scsmithr scsmithr requested a review from vrongmeal June 13, 2023 16:03
@scsmithr scsmithr marked this pull request as ready for review June 13, 2023 16:03

@vrongmeal vrongmeal left a comment

This looks good, just a little comment.

Should add the EXTERNAL TABLE support as well.

You also mentioned connecting without "catalog". How does that work? A catalog is just a database, right?

Comment on lines +43 to +46
let _resp = client
    .get(format!("{}/api/2.1/unity-catalog/catalogs", workspace_url))
    .send()
    .await?;

Should check the response status code?
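
To make the suggestion concrete, here is a minimal, standard-library-only sketch of what a status check could look like. `CatalogHttpError` and `check_status` are hypothetical names used for illustration and are not part of this PR. With reqwest, `Response::error_for_status` should cover the 4xx/5xx cases without a hand-written check.

```rust
use std::fmt;

/// Hypothetical error type for a failed catalog request (not from the PR).
#[derive(Debug, PartialEq)]
struct CatalogHttpError {
    status: u16,
}

impl fmt::Display for CatalogHttpError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "catalog request failed with HTTP status {}", self.status)
    }
}

impl std::error::Error for CatalogHttpError {}

/// Hypothetical helper: treat any status outside 200..=299 as an error,
/// so a bad access token or workspace URL fails early instead of being ignored.
fn check_status(status: u16) -> Result<(), CatalogHttpError> {
    if (200..300).contains(&status) {
        Ok(())
    } else {
        Err(CatalogHttpError { status })
    }
}
```

Called right after `.send().await?`, a check like this would turn a 401 from an invalid `access_token` into an error when the database is created rather than at first query.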

@scsmithr

> This looks good, just a little comment.
>
> Should add the EXTERNAL TABLE support as well.

I opted not to do that in this PR since I'm not sure how we want to handle this. Delta files can be accessed without a catalog (since they're just files in object storage). If we added EXTERNAL TABLE support, what would we do? Require that a catalog is provided, or just have the user specify the path to the file?

It's not clear to me what the best solution is here. I figured we'll learn more while we continue to flesh this out.

> You also mentioned connecting without "catalog". How does that work? A catalog is just a database, right?

The catalog in the case of deltalake just provides us the location of objects in some object store. The catalog isn't needed if the location of an object is known ahead of time, since all that's needed to read/modify a delta file is self-contained.

For example, the Databricks deployment I have set up on AWS stores delta files in S3. Making the GET request for one of the tables will just return the table location in S3. I then use the provided credentials when actually accessing those objects in S3.
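
As a rough illustration of the flow described above: once the catalog has returned a table location, the remaining work is plain object-store access. The sketch below splits an `s3://` location into a bucket and key prefix. `parse_s3_location` is a hypothetical helper, and the assumption that the location comes back as an `s3://bucket/prefix` URI is mine, not something this PR states.

```rust
/// Hypothetical helper (not from the PR): split a table location such as
/// "s3://bucket/path/to/table" into (bucket, key prefix).
/// Returns None if the location is not an s3:// URI or has no bucket name.
fn parse_s3_location(location: &str) -> Option<(&str, &str)> {
    // Only s3:// locations are handled, matching the current AWS-only status.
    let rest = location.strip_prefix("s3://")?;

    // Everything before the first '/' is the bucket; the remainder is the prefix.
    let (bucket, prefix) = match rest.split_once('/') {
        Some((bucket, prefix)) => (bucket, prefix),
        None => (rest, ""),
    };

    if bucket.is_empty() {
        return None;
    }

    // Drop trailing slashes so the prefix can be joined with object names.
    Some((bucket, prefix.trim_end_matches('/')))
}
```

The same function would apply to a catalog-less setup, where the user supplies the location directly instead of it being looked up.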

@scsmithr scsmithr enabled auto-merge (squash) June 14, 2023 17:55
@scsmithr scsmithr merged commit 5a2d5b1 into main Jun 14, 2023
@scsmithr scsmithr deleted the sean/delta branch June 14, 2023 18:02