Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the exported clock API instead of localtime.c #322

Merged
merged 18 commits into from
Apr 13, 2021
Merged

Use the exported clock API instead of localtime.c #322

merged 18 commits into from
Apr 13, 2021

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Apr 9, 2021

Closes #273
Closes #321

Alternative to #321, paired with r-lib/clock#213

@jimhester what if I provided you a callable API from clock to use? This would get you:

  • Error handling in zone_name_load() to ensure that an R error (through cpp11) is thrown if a bad zone name is passed, rather than a runtime error
  • No need for tz.cpp
  • No need to handle anything from tzdb

If you could assume that tz_ was constant, you could also call zone_name_load() once for the entire column, which would be a little faster, but I'm not sure you can make that assumption in vroom, given that a user might have different time zones in the same vector of date-time strings?

Regarding performance, right now clock uses the text version of the IANA time zone database. The underlying date library can also use the binary version of the database, which is much faster, but it currently has a few bugs. Even with the text version, it seems a little faster than the localtime version in the limited test below. I'll eventually switch to the binary version, and vroom would automatically inherit the benefits of that. (You can switch to using the binary version by flipping this define to 1 on a Mac https://github.com/r-lib/clock/blob/700ba688513b2bfa15c5cc7dbb81272a52357b3f/src/Makevars#L34)

Note: Still need to remove localtime.c/h from this PR. We can also talk about how this handles nonexistent times (DST gaps) and ambiguous times (DST fallbacks). Done

library(vroom)

set.seed(123)

# 10,000,000 random date-times
x <- .POSIXct(sample(-1e9:1e9, 1e7, T), tz = "America/New_York")
x <- as.character(x)
df <- data.frame(x = x)

tf <- tempfile(fileext = ".csv")
vroom_write(df, tf, delim = "\t")

bench::mark(
  vroom = {
    vroom(
      tf, 
      delim = "\t", 
      col_types = list(x = col_datetime()), 
      locale = locale(tz = "America/New_York"),
      altrep = FALSE
    )
  },
  iterations = 5
)

# master - with localtime
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vroom          8.6s    8.82s     0.113    77.8MB   0.0282

# this PR - with text time zone database
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vroom         3.03s    3.25s     0.310    77.8MB        0

# this PR - with binary time zone database
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vroom         1.24s    1.33s     0.751    77.8MB        0

unlink(tf)

@jimhester
Copy link
Collaborator

LGTM, thanks

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Apr 9, 2021

Okay @jimhester, I'm finished. If it looks good to you, I can merge the clock PR and update this one to only have a Remote on dev clock, then we can merge this.

A few notes:

I've tweaked check_tz() to materialize the system timezone if the user passes in "". Normally clock can handle "", but it involves calling back to Sys.timezone() from C++. Since the code where that happens will be running in parallel through vroom, we can't do that for fear of crashing R.

I've removed all mentions of localtime.c, including the copyright and the fact that R Core was listed as an author.

I've removed consumeInteger1() since date doesn't use 0-indexed month and day values (unlike mktime). This has the small side effect of making amPm_ 1-index based (because of a change in consumeString()), but that doesn't seem too bad.

I've added a few tests regarding tz = "".

I've added a few tests making assumptions about ambiguous/nonexistent times explicit. For ambiguous times, we default to always choosing the earliest of the two times. For nonexistent times, we return NA.

@DavisVaughan DavisVaughan requested a review from jimhester April 9, 2021 18:12
@jimhester
Copy link
Collaborator

yeah this looks good to me, thank you for working on it! I may not merge this until Monday, I want to hunt down why the windows tests are failing in the current vroom master and get the CI green before merging, but it shouldn't require any changes to what you have done.

@DavisVaughan
Copy link
Member Author

Sounds good! Just merged the clock PR and updated the Remote here, so it should be good to go for you on Monday

@DavisVaughan
Copy link
Member Author

FYI clock 0.2.0 is on CRAN so I have adjusted the required version and removed the Remote

@jimhester
Copy link
Collaborator

Thank you!

jimhester added a commit to tidyverse/readr that referenced this pull request Apr 13, 2021
jimhester added a commit to tidyverse/readr that referenced this pull request Apr 13, 2021
* Use the clock package to convert date times

A port of tidyverse/vroom#322 to readr

* Add withr to suggests
@DavisVaughan DavisVaughan deleted the feature/clock branch April 16, 2021 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider cctz instead of R derived localtime
2 participants