Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support UTF8 cast to Timestamp with timezone #3664

Closed
comphead opened this issue Feb 5, 2023 · 7 comments · Fixed by #3673
Closed

Support UTF8 cast to Timestamp with timezone #3664

comphead opened this issue Feb 5, 2023 · 7 comments · Fixed by #3673
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@comphead
Copy link
Contributor

comphead commented Feb 5, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Support cast UTF8 to Timestamp with time zone

Describe the solution you'd like

Support UTF8 cast to TimestampTZ in formats like PG https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-DATETIME-TIME-TABLE

Describe alternatives you've considered

Not doing this

Additional context

Original ticket apache/datafusion#5164

@comphead comphead added the enhancement Any new improvement worthy of a entry in the changelog label Feb 5, 2023
@comphead
Copy link
Contributor Author

comphead commented Feb 5, 2023

@viirya @waitingkuo @alamb please assign this issue to me

@comphead
Copy link
Contributor Author

comphead commented Feb 5, 2023

        let valid = StringArray::from(vec![
            "2023-01-01 04:05:06.789-8",
            "2023-01-01 04:05:06.789-7",
        ]);

        let array = Arc::new(valid) as ArrayRef;
        let b = cast(&array, &DataType::Timestamp(TimeUnit::Nanosecond,?????????)).unwrap();

As I mentioned apache/datafusion#5164 (comment) currently the cast function supports one target datatype which includes only 1 timezone info, however input array can have mixed timezones, like in example above

@viirya @waitingkuo please let me know your thoughts. Probably cast functions should accept array type TimestampNanosecondArray instead of primitive type DataType::Timestamp

@tustvold
Copy link
Contributor

tustvold commented Feb 6, 2023

however input array can have mixed timezones,

My naive expectation in such a case is that it would cast the timestamps to the provided output timezone. So in your example

let valid = StringArray::from(vec![
    "2023-01-01 04:05:06.789-8",
    "2023-01-01 04:05:06.789-7",
]);
let array = Arc::new(valid) as ArrayRef;
let b = cast(&array, &DataType::Timestamp(TimeUnit::Nanosecond, Some("+02:00".to_string()))).unwrap();

Would result in

"2023-01-01 14:05:06.789+2"
"2023-01-01 13:05:06.789+2"

Or to put it differently, each row would be parsed with any FixedOffset specific to that row, and it would then be converted to the output timezone. This would preserve the notion of the same point in time, but represented in different timezones.

This is also consistent with how timezone casts for timezone casts work, where the conversion is a metadata only operation (as TimestampArray always stores time since UTC epoch)

Probably cast functions should accept array type TimestampNanosecondArray instead of primitive type DataType::Timestamp

I don't follow what you mean by this

@comphead
Copy link
Contributor Author

comphead commented Feb 6, 2023

@tustvold sorry probably I have expressed not very clear. The entire idea is to cast user provided Utf8 value to Timestamp with timezone like

select "2023-01-01 04:05:06.789-8"::timestamptz union all
select "2023-01-01 04:05:06.789-7"::timestamptz

In Datafusion such transformation happens through arrow-rs cast function which has single target datatype , in this specific case I believe it would be DataType::Timestamp(TimeUnit::Nanosecond, Something) where is Something is not very clear for me. Presumably we can use +00:00 or None, I'm struggling if we can lose precision in this case.

@tustvold
Copy link
Contributor

tustvold commented Feb 6, 2023

we can use +00:00

I think using +00:00 makes sense to me, DataFusion is already smart enough to then coerce this where necessary to a different timezone (which is a purely metadata operation).

I'm struggling if we can lose precision in this case

I think this is fine, timestamps are always stored with respect to UTC Unix epoch, and so this should have no impact on overflow behaviour.

@comphead
Copy link
Contributor Author

comphead commented Feb 7, 2023

Thanks @tustvold

Should we support all supported PG formats https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-DATETIME-TIME-TABLE

I'm doing some test

        let valid = StringArray::from(vec![
            "2023-01-01 04:05:06.789000-08:00",
            "2023-01-01 04:05:06.789000-07:00",
            "2023-01-01 04:05:06.789 PST",
            "2023-01-01 04:05:06.789-8",
            "2023-01-01 04:05:06.789-800",
            "2023-01-01 04:05:06.789-8:00",
            "2023-01-01 04:05:06.789+07:30:00",
            "2023-01-01 040506+07:30:00",
            "2023-01-01 040506+0730",
            "2023-01-01 040506 America/Los_Angeles",
            "2023-01-01 04:05:06",
            "2023-01-01",

        ]);

All those valid for PG, but there is potential performance problem as we need to iterate through multiple formats until we find the match

@waitingkuo
Copy link
Contributor

@tustvold sorry probably I have expressed not very clear. The entire idea is to cast user provided Utf8 value to Timestamp with timezone like

select "2023-01-01 04:05:06.789-8"::timestamptz union all
select "2023-01-01 04:05:06.789-7"::timestamptz

In Datafusion such transformation happens through arrow-rs cast function which has single target datatype , in this specific case I believe it would be DataType::Timestamp(TimeUnit::Nanosecond, Something) where is Something is not very clear for me. Presumably we can use +00:00 or None, I'm struggling if we can lose precision in this case.

I think casting to timestamptz should return the same time zone (the one shown in SHOW TIME ZONE) in the same query.

i'm in UTC+8

in postgresql, select '2023-01-01 04:05:06.789-8'::timestamptz returns timestamptz in UTC+8

willy=# select '2023-01-01T04:05:06.789-8'::timestamptz;
        timestamptz         
----------------------------
 2023-01-01 20:05:06.789+08
(1 row)
willy=# select '2023-01-01 04:05:06.789-7'::timestamptz;
        timestamptz         
----------------------------
 2023-01-01 19:05:06.789+08
(1 row)

so in your query, the union return +8 time zone (my current time zone) for me in postgresql.

I think we could consider either fix it to +00:00 or current timezone

@tustvold tustvold added the arrow Changes to the arrow crate label Feb 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants