Record hashing perf #781

Merged
merged 31 commits into from
Dec 15, 2021
02557d0
todo
johnkerl Dec 13, 2021
bfefc8d
Merge branch 'main' of git+ssh://github.com/johnkerl/miller
johnkerl Dec 13, 2021
03cd9e0
Rename inputChannel,outputChannel to readerChannel,writerChannel
johnkerl Dec 7, 2021
8abd334
Rename inputChannel,outputChannel to readerChannel,writerChannel (#772)
johnkerl Dec 7, 2021
2f8d2b1
Start batched-reader API mods
johnkerl Dec 7, 2021
f2879ae
Singleton-list step for reader-batching at input
johnkerl Dec 8, 2021
5c02275
CLI options for records-per-batch and hash-records
johnkerl Dec 8, 2021
aaf0c27
Push channelized-reader logic into DKVP reader
johnkerl Dec 8, 2021
9bdc53d
Push batching logic into chain-transformer, transformers, and channel…
johnkerl Dec 8, 2021
0335c4e
foo
johnkerl Dec 9, 2021
7f1aced
cmd/mprof and cmd/mprof2
johnkerl Dec 9, 2021
6ad475b
cmd/mprof3 and cmd/mprof4
johnkerl Dec 9, 2021
b6ab5d9
narrowed in on regexp-splitting on IFS/IPS as perf-hit
johnkerl Dec 9, 2021
3faf66a
neaten
johnkerl Dec 9, 2021
5666b59
channelize nidx
johnkerl Dec 9, 2021
79bc7fd
cmd/mprof5
johnkerl Dec 9, 2021
d183d11
channelize CSV reader
johnkerl Dec 9, 2021
2618d63
channelize NIDX reader
johnkerl Dec 11, 2021
4df477f
Dedupe DKVP-reader and NIDX-reader source files
johnkerl Dec 11, 2021
be3286f
channelize CSV-lite reader
johnkerl Dec 11, 2021
8403549
channelize XTAB reader
johnkerl Dec 11, 2021
f47b2dd
batchify JSON reader
johnkerl Dec 11, 2021
e24a36d
channelize GEN pseudo-reader
johnkerl Dec 11, 2021
34627e7
scripts for perf-testing on larger files
johnkerl Dec 12, 2021
4aab213
merge with main for #776
johnkerl Dec 13, 2021
056b8e9
Fix record-batching for join and repl
johnkerl Dec 13, 2021
8cea6ae
Fix comment-handling in channelized XTAB reader
johnkerl Dec 13, 2021
fe6d6cf
Fix bug found in positional-rename
johnkerl Dec 13, 2021
641ecbd
merge
johnkerl Dec 13, 2021
ba6286c
merge
johnkerl Dec 13, 2021
447f139
Use --no-hash-records by default
johnkerl Dec 15, 2021
14 changes: 13 additions & 1 deletion docs/src/manpage.md
Expand Up @@ -478,6 +478,14 @@ MISCELLANEOUS FLAGS
rather than after. May be used more than once.
Example: `mlr --from a.dat --from b.dat cat` is the
same as `mlr cat a.dat b.dat`.
--hash-records This is an internal parameter which normally does not
need to be modified. It controls the mechanism by
which Miller accesses fields within records. In
general --no-hash-records is faster, and is the
default. For specific use-cases involving data having
many fields, and many of them being processed during
a given processing run, --hash-records might offer a
slight performance benefit.
--infer-int-as-float or -A
Cast all integers in data files to floats.
--infer-no-octal or -O Treat numbers like 0123 in data files as string
Expand Down Expand Up @@ -508,12 +516,16 @@ MISCELLANEOUS FLAGS
unlikely to be a noticeable performance improvement,
since direct-to-screen output for large files has its
own overhead.
--no-hash-records See --hash-records.
--nr-progress-mod {m} With m a positive integer: print filename and record
count to os.Stderr every m input records.
--ofmt {format} E.g. `%.18f`, `%.0f`, `%9.6e`. Please use
sprintf-style codes for floating-point numbers. If
not specified, default formatting is used. See also
the `fmtnum` function and the `format-values` verb.
--records-per-batch {n} This is an internal parameter for the maximum number
of records in a batch. Normally this does not need
to be modified.
--seed {n} with `n` of the form `12345678` or `0xcafefeed`. For
`put`/`filter` `urand`, `urandint`, and `urand32`.
--tz {timezone} Specify timezone, overriding `$TZ` environment
Expand Down Expand Up @@ -2994,5 +3006,5 @@ SEE ALSO



2021-12-07 MILLER(1)
2021-12-15 MILLER(1)
</pre>
14 changes: 13 additions & 1 deletion docs/src/manpage.txt
Expand Up @@ -457,6 +457,14 @@ MISCELLANEOUS FLAGS
rather than after. May be used more than once.
Example: `mlr --from a.dat --from b.dat cat` is the
same as `mlr cat a.dat b.dat`.
--hash-records This is an internal parameter which normally does not
need to be modified. It controls the mechanism by
which Miller accesses fields within records. In
general --no-hash-records is faster, and is the
default. For specific use-cases involving data having
many fields, and many of them being processed during
a given processing run, --hash-records might offer a
slight performance benefit.
--infer-int-as-float or -A
Cast all integers in data files to floats.
--infer-no-octal or -O Treat numbers like 0123 in data files as string
Expand Down Expand Up @@ -487,12 +495,16 @@ MISCELLANEOUS FLAGS
unlikely to be a noticeable performance improvement,
since direct-to-screen output for large files has its
own overhead.
--no-hash-records See --hash-records.
--nr-progress-mod {m} With m a positive integer: print filename and record
count to os.Stderr every m input records.
--ofmt {format} E.g. `%.18f`, `%.0f`, `%9.6e`. Please use
sprintf-style codes for floating-point numbers. If
not specified, default formatting is used. See also
the `fmtnum` function and the `format-values` verb.
--records-per-batch {n} This is an internal parameter for the maximum number
of records in a batch. Normally this does not need
to be modified.
--seed {n} with `n` of the form `12345678` or `0xcafefeed`. For
`put`/`filter` `urand`, `urandint`, and `urand32`.
--tz {timezone} Specify timezone, overriding `$TZ` environment
Expand Down Expand Up @@ -2973,4 +2985,4 @@ SEE ALSO



2021-12-07 MILLER(1)
2021-12-15 MILLER(1)
6 changes: 6 additions & 0 deletions docs/src/reference-main-flag-list.md
Expand Up @@ -341,6 +341,8 @@ These are flags which don't fit into any other category.
`: Force buffered output to be written after every output record. The default is flush output after every record if the output is to the terminal, or less often if the output is to a file or a pipe. The default is a significant performance optimization for large files. Use this flag to force frequent updates even when output is to a pipe or file, at a performance cost.
* `--from {filename}
`: Use this to specify an input file before the verb(s), rather than after. May be used more than once. Example: `mlr --from a.dat --from b.dat cat` is the same as `mlr cat a.dat b.dat`.
* `--hash-records
`: This is an internal parameter which normally does not need to be modified. It controls the mechanism by which Miller accesses fields within records. In general --no-hash-records is faster, and is the default. For specific use-cases involving data having many fields, and many of them being processed during a given processing run, --hash-records might offer a slight performance benefit.
* `--infer-int-as-float or -A
`: Cast all integers in data files to floats.
* `--infer-no-octal or -O
Expand All @@ -355,10 +357,14 @@ These are flags which don't fit into any other category.
`: Like `--load` but works with more than one filename, e.g. `--mload *.mlr --`.
* `--no-fflush
`: Let buffered output not be written after every output record. The default is flush output after every record if the output is to the terminal, or less often if the output is to a file or a pipe. The default is a significant performance optimization for large files. Use this flag to allow less-frequent updates when output is to the terminal. This is unlikely to be a noticeable performance improvement, since direct-to-screen output for large files has its own overhead.
* `--no-hash-records
`: See --hash-records.
* `--nr-progress-mod {m}
`: With m a positive integer: print filename and record count to os.Stderr every m input records.
* `--ofmt {format}
`: E.g. `%.18f`, `%.0f`, `%9.6e`. Please use sprintf-style codes for floating-point numbers. If not specified, default formatting is used. See also the `fmtnum` function and the `format-values` verb.
* `--records-per-batch {n}
`: This is an internal parameter for the maximum number of records in a batch. Normally this does not need to be modified.
* `--seed {n}
`: with `n` of the form `12345678` or `0xcafefeed`. For `put`/`filter` `urand`, `urandint`, and `urand32`.
* `--tz {timezone}
Expand Down
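The `--records-per-batch` flag documented above, together with the "batched-reader" and "channelize" commit titles, suggests that readers hand records downstream in batches over a channel. The following is only a minimal sketch of that idea, not Miller's reader code: `readBatched` and its string "records" are hypothetical, and the assumption that batching exists to amortize per-send channel cost is mine rather than something the diff states.

```go
package main

import "fmt"

// readBatched is a hypothetical helper, not part of Miller's API.
// It groups input lines into slices of at most recordsPerBatch entries
// and sends each slice over the returned channel, so the consumer
// synchronizes on the channel once per batch instead of once per record.
func readBatched(lines []string, recordsPerBatch int) <-chan []string {
	if recordsPerBatch < 1 {
		recordsPerBatch = 1 // guard against a non-positive batch size
	}
	out := make(chan []string)
	go func() {
		defer close(out)
		batch := make([]string, 0, recordsPerBatch)
		for _, line := range lines {
			batch = append(batch, line)
			if len(batch) == recordsPerBatch {
				out <- batch
				// Start a fresh slice; the consumer now owns the old one.
				batch = make([]string, 0, recordsPerBatch)
			}
		}
		if len(batch) > 0 {
			out <- batch // final, possibly short, batch
		}
	}()
	return out
}

func main() {
	lines := []string{"a=1", "a=2", "a=3", "a=4", "a=5"}
	for batch := range readBatched(lines, 2) {
		fmt.Println(len(batch), batch)
	}
}
```

With five input lines and a batch size of two, the consumer receives three batches, the last holding a single record.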
8 changes: 6 additions & 2 deletions internal/pkg/cli/option_parse.go
Expand Up @@ -2578,7 +2578,11 @@ this does not need to be modified.`,

{
name: "--hash-records",
help: `This is an internal parameter which normally does not need to be modified.`,
help: `This is an internal parameter which normally does not need to be modified.
It controls the mechanism by which Miller accesses fields within records.
In general --no-hash-records is faster, and is the default. For specific use-cases involving
data having many fields, and many of them being processed during a given processing run,
--hash-records might offer a slight performance benefit.`,
parser: func(args []string, argc int, pargi *int, options *TOptions) {
types.HashRecords(true)
*pargi += 1
Expand All @@ -2587,7 +2591,7 @@ this does not need to be modified.`,

{
name: "--no-hash-records",
help: `This is an internal parameter which normally does not need to be modified.`,
help: `See --hash-records.`,
parser: func(args []string, argc int, pargi *int, options *TOptions) {
types.HashRecords(false)
*pargi += 1
Expand Down
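The two entries in this hunk follow a table-driven pattern: each flag has a `name`, a `help` string, and a `parser` callback that applies the setting and advances the argument index through `pargi`. As a rough sketch of how such a table could be consumed, under stated assumptions: `flagEntry`, `flagTable`, and `parseFlags` are hypothetical names, `TOptions` is an empty stand-in for the real struct, and a local `hashRecords` variable replaces the call to `types.HashRecords`.

```go
package main

import "fmt"

// TOptions is an empty stand-in; the real struct's fields are not shown in the diff.
type TOptions struct{}

// hashRecords stands in for the toggle set via types.HashRecords in the diff.
var hashRecords = false

// flagEntry is a hypothetical name for one row of the flag table.
type flagEntry struct {
	name   string
	help   string
	parser func(args []string, argc int, pargi *int, options *TOptions)
}

// flagTable holds illustrative entries mirroring the two flags in the hunk.
var flagTable = []flagEntry{
	{
		name: "--hash-records",
		help: "Use hashed field lookup within records.",
		parser: func(args []string, argc int, pargi *int, options *TOptions) {
			hashRecords = true
			*pargi += 1 // consume the flag itself; it takes no value
		},
	},
	{
		name: "--no-hash-records",
		help: "See --hash-records.",
		parser: func(args []string, argc int, pargi *int, options *TOptions) {
			hashRecords = false
			*pargi += 1
		},
	},
}

// parseFlags dispatches each leading recognized flag to its parser and
// returns the index of the first argument that is not in the table.
func parseFlags(args []string, options *TOptions) int {
	argc := len(args)
	argi := 0
	for argi < argc {
		matched := false
		for _, entry := range flagTable {
			if entry.name == args[argi] {
				entry.parser(args, argc, &argi, options)
				matched = true
				break
			}
		}
		if !matched {
			break
		}
	}
	return argi
}

func main() {
	args := []string{"--hash-records", "cat", "a.dat"}
	next := parseFlags(args, &TOptions{})
	fmt.Println(hashRecords, args[next:])
}
```

Because each parser advances `*pargi` itself, a flag that takes a value (such as `--records-per-batch {n}`) could advance by two in the same scheme.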
2 changes: 1 addition & 1 deletion internal/pkg/types/mlrmap.go
Expand Up @@ -58,7 +58,7 @@ package types
// Both these figures are for just doing mlr cat. At the moment I'm leaving this
// default-on pending more profiling on more complex record-processing operations
// such as mlr sort.
var hashRecords = true
var hashRecords = false

func HashRecords(onOff bool) {
hashRecords = onOff
Expand Down
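To illustrate what a toggle like `hashRecords` might select between, here is a simplified sketch of an insertion-ordered record that looks fields up either by linear scan or through a key-to-position hash index. Everything except the `hashRecords` name is hypothetical; in particular the slice-backed `record` type is a simplification for illustration and is not Miller's actual map structure.

```go
package main

import "fmt"

// hashRecords mirrors the toggle in the diff; the rest is a hypothetical sketch.
var hashRecords = false

type field struct {
	key   string
	value string
}

// record keeps fields in insertion order. When hashing is enabled it also
// maintains an index from key to position in the fields slice.
type record struct {
	fields []field
	index  map[string]int // nil when hashRecords was false at creation
}

func newRecord() *record {
	r := &record{}
	if hashRecords {
		r.index = make(map[string]int)
	}
	return r
}

// find returns the position of key, or -1 if absent. It uses the hash
// index when present and otherwise scans the fields in order.
func (r *record) find(key string) int {
	if r.index != nil {
		if i, ok := r.index[key]; ok {
			return i
		}
		return -1
	}
	for i := range r.fields {
		if r.fields[i].key == key {
			return i
		}
	}
	return -1
}

// put overwrites an existing field in place or appends a new one.
func (r *record) put(key, value string) {
	if i := r.find(key); i >= 0 {
		r.fields[i].value = value
		return
	}
	r.fields = append(r.fields, field{key: key, value: value})
	if r.index != nil {
		r.index[key] = len(r.fields) - 1
	}
}

// get returns the value for key and whether it was present.
func (r *record) get(key string) (string, bool) {
	if i := r.find(key); i >= 0 {
		return r.fields[i].value, true
	}
	return "", false
}

func main() {
	for _, on := range []bool{false, true} {
		hashRecords = on
		r := newRecord()
		r.put("a", "1")
		r.put("b", "2")
		r.put("a", "3")
		v, ok := r.get("a")
		fmt.Println(on, v, ok, len(r.fields))
	}
}
```

Both modes return the same results; only the cost profile differs. In this sketch the hashed mode pays to allocate and fill a map for every record, which plausibly outweighs the lookup savings unless records have many fields and many of them are accessed, consistent with the help text's description of when `--hash-records` might help.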
14 changes: 13 additions & 1 deletion man/manpage.txt
Expand Up @@ -457,6 +457,14 @@ MISCELLANEOUS FLAGS
rather than after. May be used more than once.
Example: `mlr --from a.dat --from b.dat cat` is the
same as `mlr cat a.dat b.dat`.
--hash-records This is an internal parameter which normally does not
need to be modified. It controls the mechanism by
which Miller accesses fields within records. In
general --no-hash-records is faster, and is the
default. For specific use-cases involving data having
many fields, and many of them being processed during
a given processing run, --hash-records might offer a
slight performance benefit.
--infer-int-as-float or -A
Cast all integers in data files to floats.
--infer-no-octal or -O Treat numbers like 0123 in data files as string
Expand Down Expand Up @@ -487,12 +495,16 @@ MISCELLANEOUS FLAGS
unlikely to be a noticeable performance improvement,
since direct-to-screen output for large files has its
own overhead.
--no-hash-records See --hash-records.
--nr-progress-mod {m} With m a positive integer: print filename and record
count to os.Stderr every m input records.
--ofmt {format} E.g. `%.18f`, `%.0f`, `%9.6e`. Please use
sprintf-style codes for floating-point numbers. If
not specified, default formatting is used. See also
the `fmtnum` function and the `format-values` verb.
--records-per-batch {n} This is an internal parameter for the maximum number
of records in a batch. Normally this does not need
to be modified.
--seed {n} with `n` of the form `12345678` or `0xcafefeed`. For
`put`/`filter` `urand`, `urandint`, and `urand32`.
--tz {timezone} Specify timezone, overriding `$TZ` environment
Expand Down Expand Up @@ -2973,4 +2985,4 @@ SEE ALSO



2021-12-07 MILLER(1)
2021-12-15 MILLER(1)
16 changes: 14 additions & 2 deletions man/mlr.1
Expand Up @@ -2,12 +2,12 @@
.\" Title: mlr
.\" Author: [see the "AUTHOR" section]
.\" Generator: ./mkman.rb
.\" Date: 2021-12-07
.\" Date: 2021-12-15
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "MILLER" "1" "2021-12-07" "\ \&" "\ \&"
.TH "MILLER" "1" "2021-12-15" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Portability definitions
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expand Down Expand Up @@ -576,6 +576,14 @@ These are flags which don't fit into any other category.
rather than after. May be used more than once.
Example: `mlr --from a.dat --from b.dat cat` is the
same as `mlr cat a.dat b.dat`.
--hash-records This is an internal parameter which normally does not
need to be modified. It controls the mechanism by
which Miller accesses fields within records. In
general --no-hash-records is faster, and is the
default. For specific use-cases involving data having
many fields, and many of them being processed during
a given processing run, --hash-records might offer a
slight performance benefit.
--infer-int-as-float or -A
Cast all integers in data files to floats.
--infer-no-octal or -O Treat numbers like 0123 in data files as string
Expand Down Expand Up @@ -606,12 +614,16 @@ These are flags which don't fit into any other category.
unlikely to be a noticeable performance improvement,
since direct-to-screen output for large files has its
own overhead.
--no-hash-records See --hash-records.
--nr-progress-mod {m} With m a positive integer: print filename and record
count to os.Stderr every m input records.
--ofmt {format} E.g. `%.18f`, `%.0f`, `%9.6e`. Please use
sprintf-style codes for floating-point numbers. If
not specified, default formatting is used. See also
the `fmtnum` function and the `format-values` verb.
--records-per-batch {n} This is an internal parameter for the maximum number
of records in a batch. Normally this does not need
to be modified.
--seed {n} with `n` of the form `12345678` or `0xcafefeed`. For
`put`/`filter` `urand`, `urandint`, and `urand32`.
--tz {timezone} Specify timezone, overriding `$TZ` environment
Expand Down
1 change: 1 addition & 0 deletions todo.txt
@@ -1,6 +1,7 @@
================================================================
PUNCHDOWN LIST

* --ifs-regex & --ips-regex -- guessing is not safe, as evidenced by '.' and '|'

* big-picture item @ Rmd; also webdoc intro page

Expand Down