
Failures using gocryptfs-on-NFS as ElasticSearch snapshot store #156

Closed
charles-dyfis-net opened this issue Nov 9, 2017 · 18 comments

@charles-dyfis-net (Contributor)

This is distinct from #39 because it proposes using advisory locks to simulate Windows single-writer filesystem semantics (which are safer when a concurrent writer can invalidate in-progress reads), as opposed to simply letting applications make their own use of advisory locks.


I'm using gocryptfs to encrypt an ElasticSearch snapshot store.

This works perfectly when the ElasticSearch cluster consists of only one node. With multiple nodes, however, there are various race conditions and timing errors that take place as one node tries to read content that another has not fully written, or when a node attempts to delete a presently-invalid/partial file (resulting in the unlink() attempt returning ENOENT despite the file actually existing).

Since ElasticSearch's snapshot support is designed to work with Microsoft's filesystem semantics -- where only a single writer to a given file is allowed and no concurrent reads are permitted during writes -- it should be possible to avoid these errors by using POSIX advisory locks to prevent any node from reading or writing to a file which any other node is actively writing to, or to prevent a file from being opened for write while any other node has it opened for read.
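To make the idea concrete, here is a minimal sketch of the kind of whole-file advisory lock I have in mind, written directly against the Go syscall package (the helper names and the lock-around-every-open policy are purely illustrative, not anything gocryptfs does today):

```go
package main

import (
	"log"
	"os"
	"syscall"
)

// lockWholeFile takes a blocking POSIX advisory lock (fcntl F_SETLKW) over
// the entire file: exclusive for the single writer, shared for readers.
func lockWholeFile(f *os.File, exclusive bool) error {
	lk := syscall.Flock_t{
		Type:   syscall.F_RDLCK, // shared lock for readers
		Whence: 0,               // offsets relative to the start of the file
		Start:  0,
		Len:    0, // 0 means "to end of file", i.e. the whole file
	}
	if exclusive {
		lk.Type = syscall.F_WRLCK // exclusive lock for the writer
	}
	return syscall.FcntlFlock(f.Fd(), syscall.F_SETLKW, &lk)
}

// unlockWholeFile drops the advisory lock again.
func unlockWholeFile(f *os.File) error {
	lk := syscall.Flock_t{Type: syscall.F_UNLCK}
	return syscall.FcntlFlock(f.Fd(), syscall.F_SETLK, &lk)
}

func main() {
	// Placeholder path, standing in for a snapshot blob on the shared mount.
	f, err := os.OpenFile("/tmp/example-blob", os.O_CREATE|os.O_RDWR, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := lockWholeFile(f, true); err != nil {
		log.Fatal(err)
	}
	// ... write the blob; no other locker may read or write it meanwhile ...
	if err := unlockWholeFile(f); err != nil {
		log.Fatal(err)
	}
}
```

On NFSv4, fcntl() byte-range locks are propagated to the server, which is what would let separate gocryptfs instances on different machines see each other's locks.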

Obviously this would not be appropriate as default behavior, but would be useful to provide as an available option.

Thoughts on feasibility? Guidance on where to start with an implementation?

@rfjakob (Owner) commented Nov 9, 2017

Should be possible without too much work; you can reuse the openfiletable infrastructure that already prevents concurrent writes to the same inode number.
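The core idea of that table, very roughly (made-up names below, not the actual openfiletable API): one lock keyed by inode number, shared by all open handles on that inode inside a single gocryptfs process.

```go
package main

import (
	"fmt"
	"sync"
)

// inodeLocks hands out one RWMutex per inode number, so every open handle
// on the same inode within this process serializes its writes.
type inodeLocks struct {
	mu    sync.Mutex
	locks map[uint64]*sync.RWMutex
}

func (t *inodeLocks) get(ino uint64) *sync.RWMutex {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.locks == nil {
		t.locks = make(map[uint64]*sync.RWMutex)
	}
	l, ok := t.locks[ino]
	if !ok {
		l = &sync.RWMutex{}
		t.locks[ino] = l
	}
	return l
}

func main() {
	var table inodeLocks

	const ino = 1234567 // inode number of the file being written
	l := table.get(ino)

	l.Lock() // writer: exclusive
	fmt.Println("writing to inode", ino)
	l.Unlock()

	l.RLock() // reader: shared
	fmt.Println("reading inode", ino)
	l.RUnlock()
}
```

The catch for your setup is that such a table only serializes access within one process; two gocryptfs instances on different machines never see each other's table, so for cross-node coordination the lock would have to be an fcntl() lock that reaches the NFS server.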

However, can you explain your stack a bit more? I don't understand where the races come from. Does it look like this?

Ext4 | gocryptfs | elasticsearch

Where are the other nodes?

@charles-dyfis-net (Contributor, Author) commented Nov 9, 2017

(customer-maintained / black-box unknown infrastructure) -> NFSv4 -> gocryptfs -> elasticsearch

...so multiple nodes have the same NFSv4 store mounted, with separate gocryptfs daemons on each machine.

@rfjakob (Owner) commented Nov 9, 2017

So with multiple nodes it is

NFS -- gocryptfs -- elasticsearch
    \- gocryptfs -- elasticsearch

So both gocryptfs instances would have to coordinate somehow?

@rfjakob (Owner) commented Nov 9, 2017

(Does it work without gocryptfs, just nfs and multiple nodes?)

@charles-dyfis-net (Contributor, Author)

Correct explanation of architecture. Yes, it works with gocryptfs taken out of the loop.

@charles-dyfis-net (Contributor, Author)

...I'm positing NFS support for advisory locking as the means of coordinating between instances here. (Note that we have filename encryption turned off, so that's one layer of indirection/complexity avoided).

@rfjakob (Owner) commented Nov 9, 2017

Hmm. This sounds like a caching problem to me. There is no CLI option for it, but you could try setting the three cache timeouts to zero at

https://github.com/rfjakob/gocryptfs/blob/master/mount.go#L263
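Roughly, the patch amounts to zeroing the three go-fuse cache timeouts that mount.go passes to the kernel; sketched here with a made-up helper (simplified, not the literal gocryptfs source, and assuming the go-fuse v1 import path):

```go
package main

import (
	"fmt"
	"time"

	"github.com/hanwen/go-fuse/fuse/nodefs"
)

// cacheOptions returns the kernel cache timeouts to mount with: zero when
// the backing storage is shared with other machines, the usual one second
// otherwise. (Hypothetical helper, not part of gocryptfs.)
func cacheOptions(sharedStorage bool) *nodefs.Options {
	if sharedStorage {
		// Another node can change the backing files at any moment, so the
		// kernel must not cache attributes, directory entries, or negative
		// (file-does-not-exist) lookups.
		return &nodefs.Options{
			EntryTimeout:    0,
			AttrTimeout:     0,
			NegativeTimeout: 0,
		}
	}
	// Single-node case: one second of caching is a big performance win.
	return &nodefs.Options{
		EntryTimeout:    time.Second,
		AttrTimeout:     time.Second,
		NegativeTimeout: time.Second,
	}
}

func main() {
	fmt.Printf("shared-storage FUSE options: %+v\n", cacheOptions(true))
}
```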

@charles-dyfis-net (Contributor, Author)

Even running a patched gocryptfs with zeroed timeouts, I'm still seeing errors -- in particular, ENOENTs for files which are expected to exist (and do exist after the backup process is complete):

IndexShardSnapshotFailedException[failed to list blobs]; nested: NoSuchFileException[/mnt/crypt/elasticsearch/snapshots/indices/G6_LEmjcRjW7D2QQ-8f6ig/2/index-62]

I'll run some traces to figure out exactly what's going on here.

@charles-dyfis-net (Contributor, Author)

Curiouser and curiouser. I'm starting to wonder whether this really is a caching issue and the change made (setting NegativeTimeout, AttrTimeout and EntryTimeout from time.Second to 0) was somehow insufficient. On two of the three nodes, listing the file given in the error message succeeds:

-rw-r--r-- 1 elasticsearch elasticsearch 1683 Nov  9 10:41 index-62

...but on the third, that same file shows up without any metadata and can't be opened (though everything else in the same directory seems fine):

-????????? ? ?             ?               ?            ? index-62

@charles-dyfis-net changed the title from "RFC: (Non-propagated) locks to avoid concurrent writes by multiple nodes" to "Failures using gocryptfs-on-NFS as ElasticSearch snapshot store" on Nov 9, 2017
@rfjakob (Owner) commented Nov 10, 2017

When you get a ????????? file again, hold on to it. Run strace -f -p PID against gocryptfs and stat the file. That should give us more info on what is going on.

The kernel talks to FUSE using file numbers (nodeids), and the go-fuse library has a translation table that converts a nodeid back to a path. When the backing storage changes behind our back, this translation table can get out of sync with the actual directory structure (NFS faces the same problem, but seems to have mechanisms to cope with it).

Can you try disabling ClientInodes in https://github.com/rfjakob/gocryptfs/blob/master/mount.go#L232? This takes hard link tracking (via inode numbers) out of the equation. I have seen inode number reuse cause problems for reverse mode (8c1b363).
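For context, a simplified sketch of what that option controls in the go-fuse v1 pathfs API (a loopback filesystem and placeholder paths stand in for gocryptfs' own implementation; this is not the actual mount.go code):

```go
package main

import (
	"log"

	"github.com/hanwen/go-fuse/fuse/nodefs"
	"github.com/hanwen/go-fuse/fuse/pathfs"
)

func main() {
	// Placeholder backing directory on the shared NFS mount.
	backing := pathfs.NewLoopbackFileSystem("/mnt/nfs/backing")

	pfs := pathfs.NewPathNodeFs(backing, &pathfs.PathNodeFsOptions{
		// ClientInodes: true tracks hard links through the backing inode
		// numbers. On shared storage those numbers can be reused behind our
		// back, so turn the tracking off.
		ClientInodes: false,
	})

	// Mount with default options; placeholder mountpoint.
	srv, _, err := nodefs.MountRoot("/mnt/plain", pfs.Root(), nil)
	if err != nil {
		log.Fatal(err)
	}
	srv.Serve()
}
```

With ClientInodes disabled, nodes are identified purely by path, so an inode number recycled by the NFS server can no longer tie two different paths to the same node.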

@charles-dyfis-net (Contributor, Author)

Interesting -- it's trying to stat a completely different file.

2957 19:11:50.260014830 0 gocryptfs (10051) < lstat res=-2(ENOENT) path=/mnt/nfs/kmquBM2LMGyUeGvGtKKwLkiiveJ4irlhjh10Cw4sMxQ/crypt-elasticsearch/snapshots/tests-1LHvAnFpSYiDUkdYjbhGqA/master.dat
2958 19:11:50.260039234 0 gocryptfs (10051) > getpid
2959 19:11:50.260040464 0 gocryptfs (10051) < getpid
2960 19:11:50.260046209 0 gocryptfs (10051) > write fd=7(<u>) size=84
2961 19:11:50.260051767 0 gocryptfs (10051) < write res=84 data=<15>Nov 10 19:11:50 gocryptfs[3826]: FS.GetAttr failed: 2=no such file or directory.
2962 19:11:50.260056065 0 gocryptfs (10051) > write fd=8(<f>/dev/fuse) size=16
2963 19:11:50.260062983 0 gocryptfs (10051) < write res=16 data=................
2964 19:11:50.260066556 0 gocryptfs (10051) > read fd=8(<f>/dev/fuse) size=135168
2965 19:11:50.260071181 0 gocryptfs (10051) > switch next=0 pgft_maj=0 pgft_min=20062 vm_size=2965096 vm_rss=58636 vm_swap=0
2966 19:11:50.260078894 24 stat (16134) < lstat res=-2(ENOENT) path=/mnt/crypt/elasticsearch/snapshots/meta-LUgtE_B6QNe8nFwwEDfErw.dat

I'll let you know if the proposed patch fixes this.

@rfjakob (Owner) commented Nov 10, 2017

Ok good, matches what I have seen with the reverse mode problem.

@charles-dyfis-net (Contributor, Author)

After leaving our test cluster running for an hour (with a backup every five minutes), there has not been a sign of any issue, whereas previously a failure happened every two or three snapshots. So this looks to have been resolved.

What are your thoughts on what to do here? A command-line argument to disable caching (leaving it on by default for the common case)?

@rfjakob (Owner) commented Nov 11, 2017

Yes, the minimal solution would be a command-line switch like -cluster.

I'd like to check how libfuse filesystems handle this. Maybe this can be handled automatically.

@rfjakob (Owner) commented Nov 12, 2017

Looks like the situation is worse in libfuse: vgough/encfs#452

rfjakob added a commit that referenced this issue on Nov 12, 2017:
At the moment, it does two things:

1. Disable stat() caching so changes to the backing storage show up
   immediately.
2. Disable hard link tracking, as the inode numbers on the backing
   storage are not stable when files are deleted and re-created behind
   our back. This would otherwise produce strange "file does not exist"
   and other errors.

Mitigates #156
@rfjakob (Owner) commented Nov 12, 2017

I have added the -sharedstorage command-line switch (see the man page description).

Automatically handling this seems pretty difficult, so this will stay a manual switch for now.
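
Every node would then mount with something like `gocryptfs -sharedstorage CIPHERDIR MOUNTPOINT` (placeholder paths), trading the one-second kernel attribute/entry cache and hard-link tracking for correct behaviour when other machines modify the backing files.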

@charles-dyfis-net (Contributor, Author)

If you're inclined to close this ticket, I consider the above an adequate fix.

@rfjakob (Owner) commented Nov 15, 2017

Ok let's close this for now. Thanks again for the report!

@rfjakob closed this as completed Nov 15, 2017