
Failures using gocryptfs-on-NFS as ElasticSearch snapshot store #156

Closed
charles-dyfis-net opened this issue Nov 9, 2017 · 18 comments

@charles-dyfis-net (Contributor)

This is distinct from #39 because it proposes using advisory locks to simulate Windows single-writer filesystem semantics (which are safer when a concurrent writer can invalidate in-progress reads), as opposed to simply letting applications make their own use of advisory locks.


I'm using gocryptfs to encrypt an ElasticSearch snapshot store.

This works perfectly when the ElasticSearch cluster consists of only one node. With multiple nodes, however, there are various race conditions and timing errors that take place as one node tries to read content that another has not fully written, or when a node attempts to delete a presently-invalid/partial file (resulting in the unlink() attempt returning ENOENT despite the file actually existing).

Since ElasticSearch's snapshot support is designed to work with Microsoft's filesystem semantics -- where only a single writer to a given file is allowed and no concurrent reads are permitted during writes -- it should be possible to avoid these errors by using POSIX advisory locks to prevent any node from reading or writing to a file which any other node is actively writing to, or to prevent a file from being opened for write while any other node has it opened for read.
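To make the idea concrete, here is a minimal sketch of the kind of whole-file advisory lock I have in mind, written directly against the Go syscall package (the helper names and the lock-around-every-open policy are purely illustrative, not anything gocryptfs does today):

```go
package main

import (
	"log"
	"os"
	"syscall"
)

// lockWholeFile takes a blocking POSIX advisory lock (fcntl F_SETLKW) over
// the entire file: exclusive for the single writer, shared for readers.
func lockWholeFile(f *os.File, exclusive bool) error {
	lk := syscall.Flock_t{
		Type:   syscall.F_RDLCK, // shared lock for readers
		Whence: 0,               // offsets relative to the start of the file
		Start:  0,
		Len:    0, // 0 means "to end of file", i.e. the whole file
	}
	if exclusive {
		lk.Type = syscall.F_WRLCK // exclusive lock for the writer
	}
	return syscall.FcntlFlock(f.Fd(), syscall.F_SETLKW, &lk)
}

// unlockWholeFile drops the advisory lock again.
func unlockWholeFile(f *os.File) error {
	lk := syscall.Flock_t{Type: syscall.F_UNLCK}
	return syscall.FcntlFlock(f.Fd(), syscall.F_SETLK, &lk)
}

func main() {
	// Placeholder path, standing in for a snapshot blob on the shared mount.
	f, err := os.OpenFile("/tmp/example-blob", os.O_CREATE|os.O_RDWR, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := lockWholeFile(f, true); err != nil {
		log.Fatal(err)
	}
	// ... write the blob; no other locker may read or write it meanwhile ...
	if err := unlockWholeFile(f); err != nil {
		log.Fatal(err)
	}
}
```

On NFSv4, fcntl() byte-range locks are propagated to the server, which is what would let separate gocryptfs instances on different machines see each other's locks.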

Obviously this would not be appropriate as default behavior, but would be useful to provide as an available option.

Thoughts on feasibility? Guidance on where to start with an implementation?

@rfjakob (Owner) commented Nov 9, 2017

Should be possible without too much work; you can reuse the openfiletable infrastructure that already prevents concurrent writes to the same inode number.
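The core idea of that table, very roughly (made-up names below, not the actual openfiletable API): one lock keyed by inode number, shared by all open handles on that inode inside a single gocryptfs process.

```go
package main

import (
	"fmt"
	"sync"
)

// inodeLocks hands out one RWMutex per inode number, so every open handle
// on the same inode within this process serializes its writes.
type inodeLocks struct {
	mu    sync.Mutex
	locks map[uint64]*sync.RWMutex
}

func (t *inodeLocks) get(ino uint64) *sync.RWMutex {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.locks == nil {
		t.locks = make(map[uint64]*sync.RWMutex)
	}
	l, ok := t.locks[ino]
	if !ok {
		l = &sync.RWMutex{}
		t.locks[ino] = l
	}
	return l
}

func main() {
	var table inodeLocks

	const ino = 1234567 // inode number of the file being written
	l := table.get(ino)

	l.Lock() // writer: exclusive
	fmt.Println("writing to inode", ino)
	l.Unlock()

	l.RLock() // reader: shared
	fmt.Println("reading inode", ino)
	l.RUnlock()
}
```

The catch for your setup is that such a table only serializes access within one process; two gocryptfs instances on different machines never see each other's table, so for cross-node coordination the lock would have to be an fcntl() lock that reaches the NFS server.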

However, can you explain your stack a bit more? I don't understand where the races come from. Does it look like this?

Ext4 | gocryptfs | elasticsearch

Where are the other nodes?

@charles-dyfis-net (Contributor, Author) commented Nov 9, 2017

(customer-maintained / black-box unknown infrastructure) -> NFSv4 -> gocryptfs -> elasticsearch

...so multiple nodes have the same NFSv4 store mounted, with separate gocryptfs daemons on each machine.

@rfjakob (Owner) commented Nov 9, 2017

So with multiple nodes it is

NFS -- gocryptfs -- elasticsearch
    \- gocryptfs -- elasticsearch

So both gocryptfs instances would have to coordinate somehow?

@rfjakob (Owner) commented Nov 9, 2017

(Does it work without gocryptfs, just nfs and multiple nodes?)

@charles-dyfis-net (Contributor, Author)

Correct explanation of architecture. Yes, it works with gocryptfs taken out of the loop.

@charles-dyfis-net (Contributor, Author)

...I'm positing NFS support for advisory locking as the means of coordinating between instances here. (Note that we have filename encryption turned off, so that's one layer of indirection/complexity avoided).

@rfjakob (Owner) commented Nov 9, 2017

Hmm. This sounds like a caching problem to me. There is no CLI option for it, but you could try setting the three cache timeouts to zero at

https://github.com/rfjakob/gocryptfs/blob/master/mount.go#L263
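Roughly, the patch amounts to zeroing the three go-fuse cache timeouts that mount.go passes to the kernel; sketched here with a made-up helper (simplified, not the literal gocryptfs source, and assuming the go-fuse v1 import path):

```go
package main

import (
	"fmt"
	"time"

	"github.com/hanwen/go-fuse/fuse/nodefs"
)

// cacheOptions returns the kernel cache timeouts to mount with: zero when
// the backing storage is shared with other machines, the usual one second
// otherwise. (Hypothetical helper, not part of gocryptfs.)
func cacheOptions(sharedStorage bool) *nodefs.Options {
	if sharedStorage {
		// Another node can change the backing files at any moment, so the
		// kernel must not cache attributes, directory entries, or negative
		// (file-does-not-exist) lookups.
		return &nodefs.Options{
			EntryTimeout:    0,
			AttrTimeout:     0,
			NegativeTimeout: 0,
		}
	}
	// Single-node case: one second of caching is a big performance win.
	return &nodefs.Options{
		EntryTimeout:    time.Second,
		AttrTimeout:     time.Second,
		NegativeTimeout: time.Second,
	}
}

func main() {
	fmt.Printf("shared-storage FUSE options: %+v\n", cacheOptions(true))
}
```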

@charles-dyfis-net (Contributor, Author)

Even running a patched gocryptfs with zeroed timeouts, I'm still seeing errors -- in particular, ENOENTs for files which are expected to exist (and do exist after the backup process is complete):

IndexShardSnapshotFailedException[failed to list blobs]; nested: NoSuchFileException[/mnt/crypt/elasticsearch/snapshots/indices/G6_LEmjcRjW7D2QQ-8f6ig/2/index-62]

I'll run some traces to figure out exactly what's going on here.

@charles-dyfis-net (Contributor, Author)

Curiouser and curiouser. I'm starting to wonder whether this really is a caching issue and the change made (setting NegativeTimeout, AttrTimeout and EntryTimeout from time.Second to 0) was somehow insufficient. On two of the three nodes, listing the file given in the error message succeeds:

-rw-r--r-- 1 elasticsearch elasticsearch 1683 Nov  9 10:41 index-62

...but on the third, that same file shows up without any metadata and can't be opened (though everything else in the same directory seems fine):

-????????? ? ?             ?               ?            ? index-62

@charles-dyfis-net changed the title from "RFC: (Non-propagated) locks to avoid concurrent writes by multiple nodes" to "Failures using gocryptfs-on-NFS as ElasticSearch snapshot store" on Nov 9, 2017
@rfjakob (Owner) commented Nov 10, 2017

When you get a ????????? file again, hold on to it. Run strace -f -p PID against gocryptfs and stat the file. That should give us more info on what is going on.

The kernel talks to FUSE using file numbers (nodeids), and the go-fuse library has a translation table that converts a nodeid back to a path. When the backing storage changes behind our back, this translation table can get out of sync with the actual directory structure (NFS faces the same problem, but seems to have mechanisms to cope with it).

Can you try disabling ClientInodes in https://github.com/rfjakob/gocryptfs/blob/master/mount.go#L232? This takes hard link tracking (via inode numbers) out of the equation. I have seen inode number reuse cause problems for reverse mode (8c1b363).
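For context, a simplified sketch of what that option controls in the go-fuse v1 pathfs API (a loopback filesystem and placeholder paths stand in for gocryptfs' own implementation; this is not the actual mount.go code):

```go
package main

import (
	"log"

	"github.com/hanwen/go-fuse/fuse/nodefs"
	"github.com/hanwen/go-fuse/fuse/pathfs"
)

func main() {
	// Placeholder backing directory on the shared NFS mount.
	backing := pathfs.NewLoopbackFileSystem("/mnt/nfs/backing")

	pfs := pathfs.NewPathNodeFs(backing, &pathfs.PathNodeFsOptions{
		// ClientInodes: true tracks hard links through the backing inode
		// numbers. On shared storage those numbers can be reused behind our
		// back, so turn the tracking off.
		ClientInodes: false,
	})

	// Mount with default options; placeholder mountpoint.
	srv, _, err := nodefs.MountRoot("/mnt/plain", pfs.Root(), nil)
	if err != nil {
		log.Fatal(err)
	}
	srv.Serve()
}
```

With ClientInodes disabled, nodes are identified purely by path, so an inode number recycled by the NFS server can no longer tie two different paths to the same node.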

@charles-dyfis-net (Contributor, Author)

Interesting -- it's trying to stat a completely different file.

2957 19:11:50.260014830 0 gocryptfs (10051) < lstat res=-2(ENOENT) path=/mnt/nfs/kmquBM2LMGyUeGvGtKKwLkiiveJ4irlhjh10Cw4sMxQ/crypt-elasticsearch/snapshots/tests-1LHvAnFpSYiDUkdYjbhGqA/master.dat
2958 19:11:50.260039234 0 gocryptfs (10051) > getpid
2959 19:11:50.260040464 0 gocryptfs (10051) < getpid
2960 19:11:50.260046209 0 gocryptfs (10051) > write fd=7(<u>) size=84
2961 19:11:50.260051767 0 gocryptfs (10051) < write res=84 data=<15>Nov 10 19:11:50 gocryptfs[3826]: FS.GetAttr failed: 2=no such file or directory.
2962 19:11:50.260056065 0 gocryptfs (10051) > write fd=8(<f>/dev/fuse) size=16
2963 19:11:50.260062983 0 gocryptfs (10051) < write res=16 data=................
2964 19:11:50.260066556 0 gocryptfs (10051) > read fd=8(<f>/dev/fuse) size=135168
2965 19:11:50.260071181 0 gocryptfs (10051) > switch next=0 pgft_maj=0 pgft_min=20062 vm_size=2965096 vm_rss=58636 vm_swap=0
2966 19:11:50.260078894 24 stat (16134) < lstat res=-2(ENOENT) path=/mnt/crypt/elasticsearch/snapshots/meta-LUgtE_B6QNe8nFwwEDfErw.dat

I'll let you know if the proposed patch fixes this.

@rfjakob (Owner) commented Nov 10, 2017

Ok good, matches what I have seen with the reverse mode problem.

@charles-dyfis-net (Contributor, Author)

After leaving our test cluster running for an hour (with a backup every five minutes), there has not been a sign of any issue, whereas previously a failure happened every two or three snapshots. So this looks to have been resolved.

What are your thoughts on what to do here? A command-line argument to disable caching (leaving it on by default for the common case)?

@rfjakob (Owner) commented Nov 11, 2017

Yes, the minimal solution would be a command-line switch like -cluster.

I'd like to check how libfuse filesystems handle this. Maybe this can be handled automatically.

@rfjakob (Owner) commented Nov 12, 2017

Looks like the situation is worse in libfuse: vgough/encfs#452

rfjakob added a commit that referenced this issue on Nov 12, 2017:
At the moment, it does two things:

1. Disable stat() caching so changes to the backing storage show up
   immediately.
2. Disable hard link tracking, as the inode numbers on the backing
   storage are not stable when files are deleted and re-created behind
   our back. This would otherwise produce strange "file does not exist"
   and other errors.

Mitigates #156
@rfjakob (Owner) commented Nov 12, 2017

I have added the -sharedstorage command-line switch (see the man page description).

Automatically handling this seems pretty difficult, so this will stay a manual switch for now.
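
Every node would then mount with something like `gocryptfs -sharedstorage CIPHERDIR MOUNTPOINT` (placeholder paths), trading the one-second kernel attribute/entry cache and hard-link tracking for correct behaviour when other machines modify the backing files.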

@charles-dyfis-net (Contributor, Author)

If you're inclined to close this ticket, I consider the above an adequate fix.

@rfjakob (Owner) commented Nov 15, 2017

Ok let's close this for now. Thanks again for the report!

@rfjakob closed this as completed Nov 15, 2017