db_bench crashes with segmentation fault #345
Comments
@stephenpmorgan this is pretty bad if it is a real issue. Would you be able to reproduce the problem?
Sure, what would you like me to do? Simply re-run it? It takes a few hours.
I'll also try reproducing
I got "put error: IO error: 006936.sst: No space left on device" after a billion writes :)
@stephenpmorgan would you mind trying this out with RocksDB 2.4?
I reran it with RocksDB version 3.6 (at least that's what it says in the output trace). It hung after 2.484 billion keys: no progress, no segmentation fault, no further output, just hanging there... I would be happy to try out RocksDB 2.4 if I knew how to do so. How do I do that? I don't really know how to use GitHub to extract 2.4.
I tried reproducing 3 times, no luck. :( Sorry, I meant RocksDB 3.4. Here is the 3.4 release: https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.4 Did you check the stack trace while the process was hanging? You can do that by attaching to the process with gdb and then issuing:
(gdb) thread all apply bt
Sorry, that should be: (gdb) thread apply all bt
(gdb) thread apply all bt
Thread 9 (Thread 0x7fbd0c4ce700 (LWP 3405)):
Thread 8 (Thread 0x7fbd0bccd700 (LWP 3406)):
Thread 7 (Thread 0x7fbd0b4cc700 (LWP 3407)):
Thread 6 (Thread 0x7fbd0accb700 (LWP 3408)):
Thread 5 (Thread 0x7fbd0a4ca700 (LWP 3409)):
Thread 4 (Thread 0x7fbd09cc9700 (LWP 3410)):
Thread 3 (Thread 0x7fbd094c8700 (LWP 3411)):
Thread 2 (Thread 0x7fbd08ac7700 (LWP 3412)):
Thread 1 (Thread 0x7fbd0df14800 (LWP 3404)):
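For reference, the same thread dump can be captured non-interactively. This is a minimal sketch, assuming db_bench is the only matching process and gdb is installed:
# Attach to the running db_bench, dump all thread backtraces, and detach.
gdb --batch -ex "thread apply all bt" -p "$(pgrep -x db_bench)" > db_bench_backtrace.txt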
Huh. This does not look good. Wonder why I can't repro. Can you also share your LOG as well?
Are you sure? The log is enormous. How would you like me to share it?
Here are the last 50 lines of the log:
Cumulative writes: 2482000000 writes, 2482000000 batches, 1.0 writes per batch, 1923.20 GB user ingest
2014/10/09-22:22:24 ... thread 0: (1000000,2483000000) ops and (135188.1,143703.6) ops/second in (7.397102,17278.626286) seconds
** Compaction Stats [default] **
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) RecordIn RecordDrop
L0 5/1 608 1.0 0.0 0.0 0.0 1904.3 1904.3 0.0 0.0 0.0 41.5 0 0 0 0 46975 16043 2.928 0.05 51680 0.00 0 0
** DB Stats **
2014/10/09-22:22:31 ... thread 0: (1000000,2484000000) ops and (143864.0,143703.6) ops/second in (6.951008,17285.577294) seconds
** Compaction Stats [default] **
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) RecordIn RecordDrop
L0 6/1 729 1.2 0.0 0.0 0.0 1905.0 1905.0 0.0 0.0 0.0 41.5 0 0 0 0 46994 16049 2.928 0.05 51698 0.00 0 0
** DB Stats **
@stephenpmorgan Can you zip it and upload it somewhere?
The config has write_buffer_size = 128MB, target_file_size = 64MB, compression=zlib, ratio=0.5, but the LOG seems to show that average file size is ~128MB. I may misunderstand something, but it seems fishy to me.
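(As a side note, one rough way to check the average on-disk SST file size against target_file_size_base is a one-liner like the following; the path is the --db directory from the command in the issue description, and the size field assumes standard ls -l output:)
ls -l /media/stephen/f4098456-97de-4569-852d-d3eceb1e7a82/database/*.sst \
  | awk '{ bytes += $5; files++ } END { if (files) printf "%.1f MB average over %d files\n", bytes / files / 1048576, files }'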
@igorcanadi I can zip the file (after compression, the size is 351877 bytes) but I don't have anywhere to put it and I can't even email it (security rules). If you tell me what you might be interested in, I can excerpt sections of the file for you here. Sorry, but that's the best I can do.
@ljinfb So what should I do if it seems fishy? Did I do something dumb in the configuration? (I'd be happy to find out I did something dumb!)
@stephenpmorgan I am purely guessing based on the information presented in this thread. Do you want to try a different compression type, or even no compression? Could it be caused by the integration with zlib? I could be wrong, but that is something you can look at if we don't have any other clues yet.
@ljinfb OK, how do I try another compression type?
@stephenpmorgan I'm mostly interested in the log lines that contain "Stopping writes".
There's no string "Stop" or "stop" anywhere in the output. "writes" shows up repeatedly in the following context but nowhere with "Stopping writes":
Cumulative writes: 6000000 writes, 6000000 batches, 1.0 writes per batch, 4.65 GB user ingest
Another suggestion?
Uhm, this is very strange.
Ah, @stephenpmorgan, we're talking about different logs. You gave us the output of db_bench, right? We're more interested in the RocksDB internal log, which you can find in the LOG file in the db directory.
@stephenpmorgan --compression_type=none would disable it
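A minimal sketch of that change, assuming the same shell variables as the original script; only the compression flag differs, and every other option should be kept exactly as in the original command:
./db_bench --benchmarks=fillseq --num=$r --threads=$t --value_size=$vs \
  --compression_type=none \
  --db=/media/stephen/f4098456-97de-4569-852d-d3eceb1e7a82/database
# ...plus the remaining flags from the original invocation, unchanged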
I'm very interested in the last 1000 lines of LOG in the db directory.
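One way to pull that out, assuming the LOG file lives in the --db directory used by db_bench:
DB_DIR=/media/stephen/f4098456-97de-4569-852d-d3eceb1e7a82/database
tail -n 1000 "$DB_DIR/LOG"                 # last 1000 lines of the internal log
grep -c "Stopping writes" "$DB_DIR/LOG"    # how many write stalls were logged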
You asked for it:
2014/10/09-22:15:56.225283 7fbd0bccd700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
Oh, wait, that's not what you asked for. This is what you wanted:
2014/10/09-22:21:41.834245 7fbd0b4cc700 [default] Level-0 flush table #16011: 127435086 bytes OK
2014/10/09-22:21:42.497524 7fbd0bccd700 Table was constructed:
2014/10/09-22:21:42.586220 7fbd0a4ca700 Table was constructed:
2014/10/09-22:21:42.632725 7fbd0b4cc700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:21:46.127758 7fbd0c4ce700 Table was constructed:
2014/10/09-22:21:46.231659 7fbd09cc9700 Table was constructed:
2014/10/09-22:21:46.244383 7fbd0b4cc700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:21:49.494366 7fbd09cc9700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:21:49.683953 7fbd0b4cc700 Table was constructed:
2014/10/09-22:21:52.311868 7fbd0c4ce700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:21:52.794485 7fbd0c4ce700 (Original Log Time 2014/10/09-22:21:52.794470) [default] Moved #16019 to level-1 127435089 bytes OK: files[5 1 0 1 8 16001]
2014/10/09-22:21:53.034652 7fbd0a4ca700 Table was constructed:
2014/10/09-22:21:55.662039 7fbd0bccd700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:21:56.134753 7fbd09cc9700 (Original Log Time 2014/10/09-22:21:56.134741) [default] Moved #16021 to level-1 127435084 bytes OK: files[5 1 1 0 9 16003]
2014/10/09-22:21:56.386504 7fbd0a4ca700 Table was constructed:
2014/10/09-22:21:59.032325 7fbd0accb700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:21:59.575389 7fbd0bccd700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:21:59.742856 7fbd09cc9700 Table was constructed:
2014/10/09-22:22:02.553294 7fbd0c4ce700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:03.148204 7fbd08ac7700 Table was constructed:
2014/10/09-22:22:03.247106 7fbd0bccd700 Table was constructed:
2014/10/09-22:22:03.256553 7fbd0a4ca700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:06.661887 7fbd0bccd700 Table was constructed:
2014/10/09-22:22:06.755264 7fbd0a4ca700 Table was constructed:
2014/10/09-22:22:09.505053 7fbd0b4cc700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:10.208124 7fbd08ac7700 Table was constructed:
2014/10/09-22:22:10.312555 7fbd0accb700 Table was constructed:
2014/10/09-22:22:10.339020 7fbd0a4ca700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:13.830520 7fbd094c8700 [default] New memtable created with log file: #3
2014/10/09-22:22:14.000077 7fbd0a4ca700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:16.897956 7fbd09cc9700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:17.596482 7fbd08ac7700 Table was constructed:
2014/10/09-22:22:17.674891 7fbd09cc9700 Table was constructed:
2014/10/09-22:22:20.528815 7fbd0accb700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:21.286339 7fbd08ac7700 Table was constructed:
2014/10/09-22:22:21.378886 7fbd0b4cc700 Table was constructed:
2014/10/09-22:22:23.176860 7fbd0a4ca700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:24.844194 7fbd0b4cc700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:25.067555 7fbd0accb700 Table was constructed:
2014/10/09-22:22:27.791907 7fbd0a4ca700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:28.304584 7fbd0bccd700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:28.509348 7fbd0a4ca700 Table was constructed:
2014/10/09-22:22:31.212426 7fbd0c4ce700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:31.740636 7fbd094c8700 [default] New memtable created with log file: #3
2014/10/09-22:22:31.865527 7fbd0b4cc700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:34.732938 7fbd09cc9700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:35.395874 7fbd094c8700 [default] New memtable created with log file: #3
2014/10/09-22:22:35.576567 7fbd0a4ca700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
2014/10/09-22:22:38.464045 7fbd0bccd700 [default] Stopping writes because we have 3 immutable memtables (waiting for flush)
Yup, this was the bug: https://reviews.facebook.net/D24807 |
BTW it seems to me you might have some trouble with your machine's memory. Did you check your dmesg?
@igorcanadi So what should I do? Also, dmesg didn't show any messages about memory or any other system problems.
Summary: This might have caused #345. If we're stopping writes and bg_error comes along, we will never unblock the write.
Test Plan: compiles
Reviewers: ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D24807
Try running master now.
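Rough steps for picking up the fix from master and rebuilding the benchmark; this assumes git, a C++ toolchain, and that make db_bench is still the target that builds the tool:
git clone https://github.com/facebook/rocksdb.git
cd rocksdb
make db_bench -j"$(nproc)"
# then re-run the original benchmark script against a fresh --db directory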
It completed writing 4.5 billion keys. 8-) Thanks!
Aaah. Awesome! |
Running db_bench with the fillseq benchmark and a large number of keys (4.5 billion) caused a segmentation fault after more than 1.959 billion keys were generated, but well before the device was filled. Using code as of yesterday afternoon.
Command was:
bpl=1048576;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; stop=12; wbn=3; mbc=6; mb=67108864;wbs=134217728; dds=0; sync=0; r=4500000000; t=1; vs=800; bs=65536; cs=1048576; of=500000; si=1000000; ./db_bench --benchmarks=fillseq --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=$r --threads=$t --value_size=$vs --block_size=$bs --cache_size=$cs --bloom_bits=10 --cache_numshardbits=4 --open_files=$of --verify_checksum=1 --db=/media/stephen/f4098456-97de-4569-852d-d3eceb1e7a82/database --sync=$sync --disable_wal=1 --compression_type=zlib --stats_interval=$si --compression_ratio=0.50 --disable_data_sync=$dds --write_buffer_size=$wbs --target_file_size_base=$mb --max_write_buffer_number=$wbn --max_background_compactions=$mbc --level0_file_num_compaction_trigger=$ctrig --level0_slowdown_writes_trigger=$delay --level0_stop_writes_trigger=$stop --num_levels=$levels --delete_obsolete_files_period_micros=$del --min_level_to_compress=$mcz --max_grandparent_overlap_factor=$overlap --stats_per_interval=1 --max_bytes_for_level_base=$bpl --use_existing_db=0
Crash resulted in:
Received signal 11 (Segmentation fault)
#0 ./db_bench() [0x47b039] GetVarint32Ptr /home/stephen/Downloads/rocksdb/rocksdb-master/./util/coding.h:109
#1 ./db_bench() [0x41e7e9] rocksdb::BuildTable(std::string const&, rocksdb::Env*, rocksdb::ImmutableCFOptions const&, rocksdb::EnvOptions const&, rocksdb::TableCache*, rocksdb::Iterator*, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, unsigned long, unsigned long, rocksdb::CompressionType, rocksdb::CompressionOptions const&, rocksdb::Env::IOPriority) /home/stephen/Downloads/rocksdb/rocksdb-master/db/builder.cc:100
#2 ./db_bench() [0x444637] rocksdb::Status::operator=(rocksdb::Status const&) /home/stephen/Downloads/rocksdb/rocksdb-master/./include/rocksdb/status.h:145 (discriminator 1)
#3 ./db_bench() [0x444bed] rocksdb::Status::ok() const /home/stephen/Downloads/rocksdb/rocksdb-master/db/db_impl.cc:4899
#4 ./db_bench() [0x44576e] rocksdb::Status::operator=(rocksdb::Status const&) /home/stephen/Downloads/rocksdb/rocksdb-master/./include/rocksdb/status.h:145 (discriminator 1)
#5 ./db_bench() [0x4532fb] rocksdb::DBImpl::BackgroundCallFlush() /home/stephen/Downloads/rocksdb/rocksdb-master/db/db_impl.cc:2101 (discriminator 1)
#6 ./db_bench() [0x4e2024] BGThread /home/stephen/Downloads/rocksdb/rocksdb-master/util/env_posix.cc:1616
#7 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f9b6b48ee9a] ?? ??:0
#8 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f9b69f543fd] ?? ??:0
./Test2: line 2: 6723 Segmentation fault (core dumped) ./db_bench --benchmarks=fillseq --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=$r --threads=$t --value_size=$vs --block_size=$bs --cache_size=$cs --bloom_bits=10 --cache_numshardbits=4 --open_files=$of --verify_checksum=1 --db=/media/stephen/f4098456-97de-4569-852d-d3eceb1e7a82/database --sync=$sync --disable_wal=1 --compression_type=zlib --stats_interval=$si --compression_ratio=0.50 --disable_data_sync=$dds --write_buffer_size=$wbs --target_file_size_base=$mb --max_write_buffer_number=$wbn --max_background_compactions=$mbc --level0_file_num_compaction_trigger=$ctrig --level0_slowdown_writes_trigger=$delay --level0_stop_writes_trigger=$stop --num_levels=$levels --delete_obsolete_files_period_micros=$del --min_level_to_compress=$mcz --max_grandparent_overlap_factor=$overlap --stats_per_interval=1 --max_bytes_for_level_base=$bpl --use_existing_db=0
Command exited with non-zero status 139
The last few lines of the trace file were:
** Compaction Stats [default] **
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) RecordIn RecordDrop
L0 5/1 608 1.0 0.0 0.0 0.0 1502.4 1502.4 0.0 0.0 0.0 38.5 0 0 0 0 39992 12657 3.160 0.04 37297 0.00 0 0
L2 1/1 122 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0.000 0.00 0 0.00 0 0
L4 8/0 972 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0.000 0.00 0 0.00 0 0
L5 12643/0 1536767 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0.000 0.00 0 0.00 0 0
Sum 12657/2 1538468 0.0 0.0 0.0 0.0 1502.4 1502.4 1.0 1.0 0.0 38.5 0 0 0 0 39992 12657 3.160 0.04 37297 0.00 0 0
Int 0/0 0 0.0 0.0 0.0 0.0 0.7 0.7 1.0 1.0 0.0 33.5 0 0 0 0 22 6 3.632 0.00 13 0.00 0 0
Flush(GB): accumulative 1502.410, interval 0.712
Stalls(secs): 0.000 level0_slowdown, 0.000 level0_numfiles, 0.037 memtable_compaction, 0.000 leveln_slowdown_soft, 0.000 leveln_slowdown_hard
Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 37297 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard
** DB Stats **
Uptime(secs): 14513.4 total, 7.9 interval
Cumulative writes: 1959000000 writes, 1959000000 batches, 1.0 writes per batch, 1517.95 GB user ingest
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, 0.00 GB written
Interval writes: 1000000 writes, 1000000 batches, 1.0 writes per batch, 793.5 MB user ingest
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, 0.00 MB written
df output showed:
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 476770320 250691264 201860484 56% /
udev 8062516 4 8062512 1% /dev
tmpfs 3251452 840 3250612 1% /run
none 5120 0 5120 0% /run/lock
none 8128624 156 8128468 1% /run/shm
none 102400 44 102356 1% /run/user
cgroup 8128624 0 8128624 0% /sys/fs/cgroup
/dev/sdb 3845710688 1576027268 2074332492 44% /media/stephen/f4098456-97de-4569-852d-d3eceb1e7a82