Validate: improve performance and simplify error report format #172

Merged: 2 commits into dathere:master on Feb 22, 2022


@mhuang74 mhuang74 commented Feb 22, 2022

Validate: improve performance (#164)

  • Significantly reduce serde_json serialization by retrieving validation output only for invalid rows, and by reading the needed fields directly instead of traversing the JSON tree.
  • Change the validation error output format from JSONL to TSV. This avoids manipulating JSON just to insert the row number, with the added bonus of making the output dramatically easier to read.
  • Performance improved by ~2.6X (from ~36s to ~14s):
(base) ➜  qsv git:(validate_performance) ✗ time ./target/release/qsvlite validate tmp/NYC_311_SR_2010-2020-sample-1M.csv tmp/NYC-short.csv.schema.json
[00:00:13] [==================== 100% validated 1,000,000 records.] (126,103/sec)
6,494 out of 1,000,000 records invalid.
./target/release/qsvlite validate tmp/NYC_311_SR_2010-2020-sample-1M.csv   25.45s user 0.84s system 189% cpu 13.841 total
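
The two changes in the list above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code from this PR: `Rule`, `lookup`, `is_valid`, `collect_errors`, and `validate_to_tsv` are hypothetical names, the "allowed values" rules are a stand-in for a real JSON Schema validator, and 1-based row numbering is an assumption. The point is the shape: a cheap pass/fail check runs on every row, and the detailed error output is built only for rows that fail, then written straight out as tab-separated lines.

```rust
/// Hypothetical stand-in for one schema constraint: a field and its allowed values.
struct Rule {
    field: &'static str,
    allowed: &'static [&'static str],
}

/// Find the value of `field` in a row of (name, value) pairs.
fn lookup<'a>(row: &[(&'a str, &'a str)], field: &str) -> Option<&'a str> {
    row.iter().find(|(name, _)| *name == field).map(|(_, v)| *v)
}

/// Cheap pass/fail check: builds no error strings at all.
fn is_valid(rules: &[Rule], row: &[(&str, &str)]) -> bool {
    rules
        .iter()
        .all(|r| matches!(lookup(row, r.field), Some(v) if r.allowed.contains(&v)))
}

/// Expensive path: build (field, message) pairs. Called only for invalid rows.
fn collect_errors(rules: &[Rule], row: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut errors = Vec::new();
    for r in rules {
        match lookup(row, r.field) {
            Some(v) if r.allowed.contains(&v) => {}
            Some(v) => errors.push((
                r.field.to_string(),
                format!("\"{}\" does not match \"({})\"", v, r.allowed.join("|")),
            )),
            None => errors.push((r.field.to_string(), "value is missing".to_string())),
        }
    }
    errors
}

/// Validate all rows and return a TSV report (row numbers assumed 1-based).
fn validate_to_tsv(rules: &[Rule], rows: &[Vec<(&str, &str)>]) -> String {
    let mut out = String::from("row_number\tfield\terror\n");
    for (idx, row) in rows.iter().enumerate() {
        if is_valid(rules, row) {
            continue; // fast path: valid rows produce no error output
        }
        for (field, error) in collect_errors(rules, row) {
            // The row number is just another column; no JSON has to be rewritten.
            out.push_str(&format!("{}\t{}\t{}\n", idx + 1, field, error));
        }
    }
    out
}

fn main() {
    let rules = [Rule { field: "Category", allowed: &["Female", "Male", "Unisex"] }];
    let rows = vec![
        vec![("Category", "Male")],
        vec![("Category", "Mens")],
    ];
    print!("{}", validate_to_tsv(&rules, &rows));
}
```

With mostly valid input (6,494 invalid rows out of 1,000,000 in the run above), almost every row takes the fast path, which is where a saving of this kind would come from.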

Validation error file for Adur public toilets:

row_number      field   error
1       ExtractDate     null is not of type "string"
1       OrganisationLabel       null is not of type "string"
3       CoordinateReferenceSystem       "OSGB3" does not match "(WGS84|OSGB36)"
3       Category        "Mens" does not match "(Female|Male|Female and Male|Unisex|Male urinal|Children only|None)"
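
Because the report is plain TSV, downstream tooling can consume it with simple string splitting. The sketch below is illustrative only: `errors_per_field` is a hypothetical helper, not part of qsv. It assumes the three tab-separated columns shown above with a single header line, and counts how many errors each field produced.

```rust
use std::collections::BTreeMap;

/// Count errors per field in a report with a `row_number<TAB>field<TAB>error` header.
fn errors_per_field(report: &str) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    // Skip the header line, then split each record into at most three columns
    // so that any tabs inside the error message stay part of the message.
    for line in report.lines().skip(1) {
        let mut cols = line.splitn(3, '\t');
        if let (Some(_row), Some(field), Some(_error)) = (cols.next(), cols.next(), cols.next()) {
            *counts.entry(field.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    // Sample rows copied from the report above.
    let report = "row_number\tfield\terror\n\
                  1\tExtractDate\tnull is not of type \"string\"\n\
                  1\tOrganisationLabel\tnull is not of type \"string\"\n\
                  3\tCategory\t\"Mens\" does not match \"(Female|Male)\"\n";
    for (field, count) in errors_per_field(report) {
        println!("{}\t{}", field, count);
    }
}
```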

The flamegraph looks much cleaner.

[image: flamegraph]

@jqnatividad jqnatividad merged commit f61c67a into dathere:master Feb 22, 2022
@jqnatividad

@mhuang74 reducing serde serialization was spot on and changing the report format was inspired - increasing readability AND increasing performance at the same time!
