query language #9

wkalt · 2024-04-02T14:48:19Z

Background

Today dp3 executes only one kind of query, which you could consider as an unrestricted as-of join on timestamp, or a timestamp-sorted union.

We will want to expand this to more flexibly defined as-of joins on timestamp, and more generally provide a query language interface with some "psql-like" functionality in the CLI to the user to give them a more interactive and databasey experience.

SQL is a poor language for this because supporting any SQL sets up the expectation that we will support all of it -- which we are not prepared to do efficiently -- and even if we did there are no SQL implementations that express as-of joins in a concise and user-friendly way, that SQL non-users will be excited to pick up. People who like SQL will be annoyed that we only support a subset and people that hate it will be annoyed that our query language resembles it.

Finally SQL is going to limit our eventual prospects for flexible autocompletion support in the client, due to the way it lists columns prior to tables.

So, we are not going to implement SQL.

The concern about only implementing part of SQL also applies to the other languages people are trying to standardize like kusto or prql. We are at best going to get inspired by them - we aren't going to implement them or advertise support for them.

We do need a few similar features to what all of these provide though:

Filtering i.e where clauses with the usual binary operators
N-way as-of join on timestamp only, with support for either "succeeds" or "precedes" relationships since the user may want to think either way.
A modifier on the as-of join like "by less than X/by more than X", allowing you to specify you only want hits within a particular amount of time.
Composable
Limit/offset
Filtering on timestamp, but elevated to a part of the syntax.
Reverse the order of timestamp iteration, to enable reverse search as typically presented in log search interfaces.
Restriction on producer elevated to part of the syntax. Initially we will require it but not forever.
Complex array support. What we need here actually goes beyond SQL’s capabilities.
UDF support - a bonus but something we should plan for/think about, as users may wish to do some custom computations like linear algebra or whatever that we will never be fast enough to generalize in the product.

The intent of the language will be to support easy interactive searching, primarily on small numbers of topics, and to enable the expression of multi-topic conditions like "show me 50 times in the last month when we braked hard while it was raining" (considering in that example separate topics for "hard brake" and "is raining"). These conditions can get complex and incorporate several topics. I would expect that as-of joins bigger than 12-way will be rare and that most will be (well) under 6-way.

We likely never support other kinds of joins, or sorting on fields other than log time, or heavy analytics (at least in this engine). We need to play to the strengths of our system and the query engine that supports this will only be single-node. A closer analog for what we are targeting would be the query languages of sumologic, stackdriver, or any of the cloud log search tools, except with a lot more focus on as-of joins. For heavy analytics work users can use spark.

For purposes of autocomplete, we can assume that we can get both a fast listing of available tables (i.e topics), and a fast schema listing for a particular table. Both of those are true, we just don't have APIs for them. So if a query leads with a table and follows with the column restrictions, we will be able to autocomplete it (probably glossing over some nuances about the grammar).

Proposal

Get single topic:

/topic

Get two topics joined

Current behavior of topics param - comma operator

/topic1, /topic2

Restrict a single topic with scalar subfield

Conventional comparison operators supported i.e =, <, <=, ~, *~, <>, etc.

/topic where field.subfield.subfield2 = 'foo'

Restrict on a fixed-size array element.

/topic where field.subfield.subfield2[1] = 'foo'

Restrict on variable-sized array element.

We probably do need the ability to address a variable-sized element by index, but other queries on variable-sized arrays are going to require some kind of "any", "all", or "none" semantics I think. For access by index we will use the same syntax as above for fixed-size array. For the others we need something special

;; access by index
/topic where field.subfield.subfield2[1] = 'foo'

;; any/all/none - element alias. This can be anything the user wants but conventionally “.” will be used unless
;; disambiguation is required (e.g nested arrays). Note that we should be able to autocomplete child fields.

;; scalar arrays
/topic where field.subfield.subfield2 has any . > 10
/topic where field.subfield has no . > 10
/topic where field.subfield has all . > 10

;; complex element with scalar subfield
/topic where field.subfield.subfield2 has any .x.y > 10

;; complex element with scalar array-valued subfield – note use of alias “element” - could be anything.
/topic where field.subfield has any (.x.y has any element > 10)

;; complex element with complex array-valued subfield
/topic where field.subfield has any (.x.y has any element.x > 10)

;; double nested
/topic where field.subfield has any (.x.y has any (element.x.y has no child.x > 10))

It’s possible we won’t need the parentheses, but if we don’t it would be good to support them anyway since they make the structure clearer. In all cases above the argument to “any” could have been parenthesized.

Join two restrictions on timestamp

(/topic where ...), (/topic2 where ...)

Precedes/succeeds/neighbors operators.

Supports keywords nanoseconds, microseconds, milliseconds, seconds, minutes. We will probably cap at 1 or 5 minutes for now since something is going to buffer that data until/unless we spill queries to disk, which we can defer until requested.

(/topic where …) precedes (/topic2 where …) by less than 5 seconds
(/topic where …) succeeds (/topic2 where …) by less than or equal to 5 minutes
(/topic where …) neighbors (/topic2 where …) by less than 5 seconds

;; unrestricted
/topic precedes /topic2 by less than 5 seconds

;; Composition
((/topic where …) precedes (/topic2 where …) by less than 5 seconds) neighbors (/topic3 where…) by less than 5 seconds

As-of semijoin

^ AKA suppression operator

;; unrestricted semijoin of /topic and /topic2. Valid but pointless.
/topic, ^/topic2
;; useful example of semijoin, for instance if /topic is the only one required.
(/topic where …) precedes ^(/topic2 where …) by less than 5 seconds

Timestamp restriction

From/to keywords

;; dates
/topic from '2020-01-01' to '2020-01-02'

;; nanoseconds
/topic from 123 to 345

;; on the result of a join
/topic, /topic2 from '2020-01-01' to '2020-01-02'
(/topic where ...) precedes (/topic2 where ...) by less than 5 seconds from '2020-01-01' to '2020-01-02'

;; applied to subquery - mostly pointless but valid
(/topic from '2020-01-01' to '2020-01-02') precedes (/topic2 from '2020-01-02' to '2020-01-02')

Descending keyword

Reverse the sort order. This should be used at the end of queries (before limit/offset) but is also valid in subqueries. Most likely when used in subqueries the effect on precedes/succeeds behavior will become confusing.

/topic descending

/topic where ... descending

/topic where ... from a to b descending

;; legal but don't do this
(/topic where ... descending) precedes (/topic2 where ...) by less than 5 seconds

;; better
(/topic where ...) precedes (/topic2 where ...) by less than 5 seconds descending

Limit/offset

Valid at the end of any query/subquery only.

;; get 5 records, e.g to inspect structure
/topic limit 5

;; equivalent
/topic limit 5 offset 5
/topic offset 5 limit 5

;; on a join
(/topic where ...) precedes (/topic2 where ...) by less than 5 seconds descending limit 5 offset 5

The text was updated successfully, but these errors were encountered:

wkalt · 2024-04-02T15:06:29Z

ping @jainilajmera @mullr @jhubberts for thoughts/insights

wkalt · 2024-04-07T02:23:49Z

I am now targeting something that looks more like this:

a, b

a precedes b

a, b precedes c

a precedes b precedes c
where a.foo = 10 and b.bar = 20 and c. baz = 30
limit 10 offset 20

a preceeds b succeeds c
where a.foo = 10 and b.bar = 20 and c.baz = 30

wkalt · 2024-04-09T13:35:52Z

from discussion with Russ about https://docs.auxon.io/speqtr/syntax.html#queries

we need a way to control the behavior of "asof" to indicate whether all RHS messages should be emitted or just the one succeeding the LHS. This is their concept of "crossing any".
they have a "not followed by"/"not preceded by"
AS keyword looks nice too. Repeating topic names in where clauses will be inconvenient for long topics.

wkalt · 2024-04-09T14:16:15Z

from device between '2020-01-01' and '2021-01-01'
/is_raining as raining precedes immediate /events as events
where raining.raining = true and events.type = "hard-brake"

wkalt · 2024-04-10T03:56:27Z

big chunk of query execution stuff is now merged. Leaving this open as we still do not have variable length array support.

wkalt · 2024-04-11T17:10:14Z

will actually close this and track new query language features in the monster list - that won't be the only one

wkalt mentioned this issue Apr 2, 2024

monster issue list #2

Open

wkalt closed this as completed Apr 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

query language #9

query language #9

wkalt commented Apr 2, 2024 •

edited

Loading

wkalt commented Apr 2, 2024

wkalt commented Apr 7, 2024

wkalt commented Apr 9, 2024

wkalt commented Apr 9, 2024

wkalt commented Apr 10, 2024

wkalt commented Apr 11, 2024

query language #9

query language #9

Comments

wkalt commented Apr 2, 2024 • edited Loading

Background

Proposal

Get single topic:

Get two topics joined

Restrict a single topic with scalar subfield

Restrict on a fixed-size array element.

Restrict on variable-sized array element.

Join two restrictions on timestamp

Precedes/succeeds/neighbors operators.

As-of semijoin

Timestamp restriction

Descending keyword

Limit/offset

wkalt commented Apr 2, 2024

wkalt commented Apr 7, 2024

wkalt commented Apr 9, 2024

wkalt commented Apr 9, 2024

wkalt commented Apr 10, 2024

wkalt commented Apr 11, 2024

wkalt commented Apr 2, 2024 •

edited

Loading