Syntax comparison to QUEL #69

Closed
dumblob opened this issue Feb 9, 2022 · 5 comments

@dumblob

dumblob commented Feb 9, 2022

Before prql gets implemented, I'd like to see some comparison of the proposed syntax with QUEL.

https://en.wikipedia.org/wiki/QUEL_query_languages

QUEL is a more readable but fully composable alternative to SQL. It was created by a mathematician and fully implemented in POSTGRES 4.2 (yeah, POSTGRES later had that frontend thrown away and replaced with SQL due to market pressure).

Btw. I'd strongly recommend that everyone read the paper "What Goes Around Comes Around" by M. Stonebraker and co.

It's a summary of 35 years of data model proposals (and thus what query languages are designed around), grouped into 9 different eras. The outcome of the paper is a list of lessons learned:

Lesson 1: Physical and logical data independence are highly desirable
Lesson 2: Tree structured data models are very restrictive
Lesson 3: It is a challenge to provide sophisticated logical reorganizations of tree structured data
Lesson 4: A record-at-a-time user interface forces the programmer to do manual query optimization, and this is often hard.
Lesson 5: Directed graphs are more flexible than hierarchies but more complex
Lesson 6: Loading and recovering directed graphs is more complex than hierarchies
Lesson 7: Set-a-time languages are good, regardless of the data model, since they offer much improved physical data independence.
Lesson 8: Logical data independence is easier with a simple data model than with a complex one.
Lesson 9: Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology.
Lesson 10: Query optimizers can beat all but the best record-at-a-time DBMS application programmers.
Lesson 11: Functional dependencies are too difficult for mere mortals to understand. Another reason for KISS (Keep It Simple Stupid).
Lesson 12: Unless there is a big performance or functionality advantage, new constructs will go nowhere.
Lesson 13: Packages will not sell to users unless they are in “major pain”
Lesson 14: Persistent languages will go nowhere without the support of the programming language community.
(yes, there is a numbering mistake in the paper)
Lesson 14: The major benefits of OR is two-fold: putting code in the data base (and thereby blurring the distinction between code and data) and a general purpose extension mechanism that allows OR DBMSs to quickly respond to market requirements.
Lesson 15: Widespread adoption of new technology requires either standards and/or an elephant pushing hard.
Lesson 16: Schema-later is probably a niche market
Lesson 17: XQuery is pretty much OR SQL with a different syntax
Lesson 18: XML will not solve the semantic heterogeneity either inside or outside the enterprise.

@max-sixty
Member

Thanks @dumblob — this was interesting reading. And they were correct about the future of XML!

Please continue adding things like this that you think people would find interesting; and any direct implications on PRQL are welcome too.

@dumblob
Author

dumblob commented Feb 12, 2022

> Thanks @dumblob, this was interesting reading.

You're welcome!

> And they were correct about the future of XML!

Yep. Totally!

Btw. I'd still be interested in some comparison of the features between prql and QUEL. I kept this to myself, but now I'll say it: QUEL seems only slightly less understandable than prql ("holy grail") while being (much) more lightweight and pure (mathematically, implementation-wise, etc.).

So the question is whether prql should be made closer to QUEL (identical, even, with extensions on top). If not, why reinvent the wheel (something the cited paper criticized)?

@max-sixty
Member

> Btw. I'd still be interested in some comparison of the features between prql and QUEL.

I'd welcome a more detailed comparison — I'm not the best person to be doing this given my lack of familiarity with it — but on initial viewing:

  • QUEL has a similar take on pipelines, which is a foundational principle for PRQL
  • But QUEL's pipelines seem stateful with the DB — e.g. replace s (age=s.age+1) executes an update student set age=age+1 query
  • There are a bunch of syntax differences

We need to send the full query given the performance impact — particularly for analytical queries, which are our main target.
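To make the stateful point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `student` table and its rows are invented for illustration; only the `update student set age=age+1` statement comes from the comparison above. It contrasts a read-only query, which derives `age + 1` without touching storage, with the update that QUEL's `replace` maps to.

```python
import sqlite3

# Hypothetical table and rows, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO student VALUES (?, ?)", [("Ann", 20), ("Bob", 22)])

# Read-only query: age + 1 appears in the result, stored rows are untouched.
derived = conn.execute(
    "SELECT name, age + 1 FROM student ORDER BY name"
).fetchall()
after_select = conn.execute("SELECT age FROM student ORDER BY name").fetchall()

# SQL counterpart of QUEL's `replace s (age=s.age+1)`: mutates the stored rows.
conn.execute("UPDATE student SET age = age + 1")
after_update = conn.execute("SELECT age FROM student ORDER BY name").fetchall()

print(derived, after_select, after_update)
```

Only the second statement changes what later queries see, which is the kind of statefulness a pure query language would avoid.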

Arguably we could restrict ourselves to the select syntax; e.g. from Wikipedia:

range of E is EMPLOYEE
retrieve into W
(COMP = E.Salary / (E.Age - 18))
where E.Name = "Jones"

...but this seems more verbose, and goes against some of the feedback we've incorporated into PRQL — e.g. every transformation starting with a function name.
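For readers less familiar with QUEL, the following is a rough sketch of what that `retrieve` does, expressed as SQL and run through Python's built-in sqlite3 module. The table contents are invented, and mapping `retrieve into W` to `CREATE TABLE ... AS SELECT` is an assumption about the closest SQL counterpart, not a statement of how any QUEL implementation worked.

```python
import sqlite3

# Hypothetical EMPLOYEE rows; column names follow the QUEL example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Name TEXT, Salary REAL, Age INTEGER)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [("Jones", 50000.0, 43), ("Smith", 40000.0, 38)],
)

# QUEL:  range of E is EMPLOYEE
#        retrieve into W (COMP = E.Salary / (E.Age - 18)) where E.Name = "Jones"
# Assumed SQL counterpart: materialise the result as a new table W.
conn.execute(
    """
    CREATE TABLE W AS
    SELECT E.Salary / (E.Age - 18) AS COMP
    FROM EMPLOYEE AS E
    WHERE E.Name = 'Jones'
    """
)

rows = conn.execute("SELECT COMP FROM W").fetchall()
print(rows)
```

The range variable `E` plays the same role as the SQL table alias, and the computed column gets its name inside the target list rather than through `AS`.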

What's your take? What would you take from QUEL and put into PRQL?

@wtkhan

wtkhan commented Nov 3, 2022

@max-sixty, I know this issue's been closed, but wanted to add this example of QUEL's powerful nested aggregations feature to the discussion. From Wikipedia:

retrieve (
  a = count(y.i by y.d where y.str = "ii*" or y.str = "foo"),
  b = max(count(y.i by y.d))
)

Is this something PRQL could support?
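As a reading aid, here is a small Python sketch of what the two aggregates compute. The rows of `y` are invented, and treating `*` in `"ii*"` as a wildcard (modelled with `fnmatchcase`) is an assumption about QUEL's pattern matching. The helper `count_by_d` is hypothetical and only serves this illustration.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical rows of relation y with the attributes used in the example.
y = [
    {"i": 1, "d": "a", "str": "iix"},
    {"i": 2, "d": "a", "str": "foo"},
    {"i": 3, "d": "a", "str": "bar"},
    {"i": 4, "d": "b", "str": "ii"},
    {"i": 5, "d": "b", "str": "baz"},
]

def count_by_d(rows, keep=lambda r: True):
    """Count rows per value of d, considering only rows that pass `keep`."""
    return Counter(r["d"] for r in rows if keep(r))

# a = count(y.i by y.d where y.str = "ii*" or y.str = "foo")
# Assumption: "*" is a wildcard, approximated here with fnmatchcase.
a_per_group = count_by_d(
    y, lambda r: fnmatchcase(r["str"], "ii*") or r["str"] == "foo"
)

# b = max(count(y.i by y.d)): the largest of the unfiltered per-group counts.
b = max(count_by_d(y).values())

print(dict(a_per_group), b)
```

The point of interest is that the filter and the grouping live inside the aggregate call, and that one aggregate can be fed directly into another.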

@max-sixty
Member

Yes, I also find the inline transformations great. There are a couple here:

  • a filter / where, e.g. count(y where y.str = "ii*"); ref Inline filters #82
  • The groupby; i.e. max(count(y.i by y.d)) is also nice

I'm not confident on the best way of having this in PRQL. In full verbosity a=count(y.i by y.d where y.str = "ii*") would be something like:

derive a = (| filter y.str = "ii*" | group [y.d] (aggregate [count y.i]))

...which is not exactly pretty, with lots of syntax. It's also not clear where the l-value a= should go: in the derive, or as a = count y.i?

PRQL does benefit from clearly specifying the resulting type. A downside of count(y where y.str = "ii*") is that it doesn't specify whether it's an aggregate or not: rank(y where y.str = "ii*") (or some function like regex_match) returns a column rather than a value, but looks the same unless you know the type of count / rank / regex_match. I wrote more on this at "What’s going on with this aggregate function?" in the FAQ.
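A minimal Python sketch of that ambiguity, with hypothetical `count` and `rank` helpers (these are not PRQL or QUEL functions, just stand-ins with type annotations): the two calls look identical at the call site, and only the return type tells you whether the result is a single value or a column.

```python
from typing import List

def count(column: List[int]) -> int:
    """Aggregate: collapses a column into a single value."""
    return len(column)

def rank(column: List[int]) -> List[int]:
    """Column-returning function: one output per input row.

    Ranks are 1-based by ascending value; ties share the lowest rank.
    """
    ordered = sorted(column)
    return [ordered.index(v) + 1 for v in column]

ages = [31, 25, 40, 25]
total = count(ages)   # a single value
ranks = rank(ages)    # a column with the same length as the input
print(total, ranks)
```

Stating up front whether a step aggregates, as PRQL does, removes the need to look up each function's signature to know the shape of the result.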

Lmk if you have any thoughts!
