
Question: Thread safety requirements & pitfalls for storr backends with caching = "worker" #1094

Closed
strazto opened this issue Dec 6, 2019 · 10 comments

@strazto
Contributor

strazto commented Dec 6, 2019

Prework

  • Read and abide by drake's code of conduct.
  • Search for duplicates among the existing issues, both open and closed.
  • If you think your question has a quick and definite answer, consider posting to Stack Overflow under the drake-r-package tag. (If you anticipate extended follow-up and discussion, you are already in the right place!)

Context

As mentioned in #1025, my workplace's HPC array uses LustreFS, which is said to perform poorly under intensive read/write operations on small files (under 20 MB, I think?).

This makes the RDS backend for storr a poor choice.

This forces me to use the SQLite & caching = "master" strategy, but I notice that I only seem to be able to keep ~10 or so jobs queued/running on the PBS cluster (the nominal limit of our system is 200 jobs queued/running), and I wonder if that's because the master process is a bottleneck (the CPUs on individual compute nodes of our HPC are not particularly fast; there are just lots of them).

(The other bottleneck is that the queue time for a transient worker often exceeds the actual compute time it uses by orders of magnitude. I plan to change from future to clustermq when I have the chance, because I expect persistent workers to execute much faster.)

Question

I understand why the SQLite backend might not be threadsafe: it's stored as a monolithic db file, which probably wouldn't hold up to concurrent modification/access of its resources, and that mandates the "master" caching strategy.

My question, however, is:

  • How robust does the thread-safety of a storr backend intended for use by drake actually need to be?
    I ask for the following reasons:
    • Thread-safety is a requirement when we anticipate that multiple workers need access to a shared resource, and that modification of the resource may occur at the same time as access to it.

    • Unless thread-safety is ensured, this could lead to undefined behaviour, as in the classic race condition below:
      Scenario

       j1 and j2 are both given the task of incrementing some variable
       the whole subroutine each job runs is as follows:
      sub:
        local = data
        local = local + 1
        data = local
      
      Race condition:

       data = 0
       j1:
         local = data (= 0)
      
       j2:
         local = data (= 0)
       
       j1: 
         local += 1 (= 1)
         data = local (=1)
       j2: 
        local += 1 (=1)
        data = local (=1) 
       
       data = 1 when it should = 2
      
    • In the execution of a drake workflow, a given resource (i.e., a target) is only written to once;

    • after this, workers building dependent targets will read from this target.
      Multiple workers may be doing so at once, but this is still threadsafe as long as none are attempting to modify the parent target.

I'm interested in implementing a very simple storr backend that (perhaps) simply writes targets to disk (each as a single .rds) and makes no guarantees of thread-safety for concurrent IO of a given target, but can assure thread-safety in the context of drake::make execution, since targets are only written once and only read after they've been written.

I'm interested to hear what anyone makes of this idea, whether there's a reason it could damage performance, and whether there's some concurrent file IO risk (intrinsic to filesystem access) I'm neglecting.
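
To make that concrete, here is a bare-bones illustration of the idea (not a real storr driver; the directory name and helper functions are placeholders):

    # Toy sketch: one .rds file per target, written once and only read afterwards.
    # Nothing here protects against concurrent writes to the *same* file; the
    # assumption is that drake::make() never writes a given target more than once.
    cache_dir <- "one-rds-per-target"
    dir.create(cache_dir, showWarnings = FALSE)

    save_target <- function(name, value) {
      saveRDS(value, file.path(cache_dir, paste0(name, ".rds")))
    }

    load_target <- function(name) {
      readRDS(file.path(cache_dir, paste0(name, ".rds")))
    }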

@wlandau
Member

wlandau commented Dec 6, 2019

First of all, I suggest verifying possible sources of slowness using concrete profiling data. When @billdenney and @adamkski sent me their profiling data for #1089 and #1086, we actually pinpointed and eliminated the bottlenecks. As I describe at https://github.com/ropensci/drake/blob/master/.github/ISSUE_TEMPLATE/bottleneck.md#benchmarks, it would be great if you would post zip files of profiling data to this issue thread. The code that goes along with it would be super helpful as well.
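
For reference, one minimal way to capture that data with base R (the output file name is arbitrary, and plan stands in for your own plan object):

    # Minimal sketch: record a profile of the whole make() call with base R's Rprof().
    Rprof("make-profile.out", line.profiling = TRUE)
    drake::make(plan)  # plan is a placeholder for your own drake plan
    Rprof(NULL)        # stop profiling
    # Zip make-profile.out together with the code that produced it and attach it here.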

How robust does the threadsafety of a storr backend intended for use by drake actually need to be?

For worker caching, the cache should fully expect concurrent writes of different targets. When I tested an RSQLite-powered storr_dbi() with a parallel make() and caching = "worker", I ran into obvious race conditions immediately. caching = "master" solves the race conditions, but as you said, it could be a bottleneck if the file system is slow.
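
For reference, the kind of setup being discussed looks roughly like this (a sketch only; the database file, table names, parallelism backend, and jobs count are placeholders):

    # Sketch: SQLite-backed storr cache with all writes funneled through the master process.
    library(drake)
    con <- DBI::dbConnect(RSQLite::SQLite(), "drake-cache.sqlite")
    cache <- storr::storr_dbi(tbl_data = "data", tbl_keys = "keys", con = con)
    make(
      plan,                 # placeholder for your own plan
      cache = cache,
      caching = "master",   # only the master process touches the SQLite file
      parallelism = "future",
      jobs = 10
    )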

Related: storr_rds() itself takes measures to avoid race conditions for concurrent writes of the same data. drake used to rely on this kind of thread safety, but since I refactored how the "progress" storr namespace works, I suspect this assumption is no longer necessary in drake. (I need to test this on multiple platforms using richfitz/storr#117.)

There may be a use case for a monolithic database-style storr that is also threadsafe. But before you do the hard work to implement it, there are a couple issues worth thinking about.

  • I often find caching = "worker" is slower than caching = "master" because of the lag of synchronizing storage over a network file system. Network lag is very noticeable on the SGE cluster I use.
  • For your system, it might be worth thinking about aggregating work into a smaller number of targets. It may require writing more custom functions, but it could pay off. I have some general recommendations at https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets. What I often do is create functions for targets that call clustermq. clustermq::Q() within a single target is super powerful for a large number of tasks, and it does not require any interaction with the file system. If you go that route, I recommend disabling HPC through drake, either by setting parallelism = "loop" or by disabling HPC for select targets individually using target(hpc = FALSE); see the sketch below.
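
A rough sketch of that pattern, where analyze_partition() and partitions are hypothetical stand-ins for your own function and data:

    # Sketch: one drake target fans out many tasks through clustermq::Q(),
    # so drake itself does not need HPC parallelism for that target.
    library(drake)

    analyze_all <- function(partitions) {
      clustermq::Q(
        analyze_partition,  # hypothetical worker function, defined elsewhere
        part = partitions,  # one clustermq task per partition
        n_jobs = 50
      )
    }

    plan <- drake_plan(
      results = target(
        analyze_all(partitions),
        hpc = FALSE         # keep this target on the master; Q() handles the cluster
      )
    )

    make(plan, parallelism = "loop")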

@wlandau
Member

wlandau commented Dec 6, 2019

Also, I recommend checking with @richfitz about the current state of database storrs.

@strazto
Contributor Author

strazto commented Dec 6, 2019

First of all, I suggest verifying possible sources of slowness using concrete profiling data. When @billdenney and @adamkski sent me their profiling data for #1089 and #1086, we actually pinpointed and eliminated the bottlenecks. As I describe at https://github.com/ropensci/drake/blob/master/.github/ISSUE_TEMPLATE/bottleneck.md#benchmarks, it would be great if you would post zip files of profiling data to this issue thread. The code that goes along with it would be super helpful as well.

My other question was going to be about the best way to measure this bottleneck in my workflow, thank you for this!

There may be a use case for a monolithic database-style storr that is also threadsafe. But before you do the hard work to implement it, there are a couple issues worth thinking about.

I'm sure such a use case exists, although my proposed implementation would probably just be a storr API backed by saving whole targets to the native file system as .rds files.

  • I often find caching = "worker" is slower than caching = "master" because of the lag of synchronizing storage over a network file system. Network lag is very noticeable on the SGE cluster I use.

That's an interesting comment; I wonder how it relates to my workplace's file system.
I guess once I get profiling set up I can see for myself.

  • For your system, it might be worth thinking about aggregating work into a smaller number of targets.

It's actually by design that my system has as many targets as it does. Targets are split into n indexed partitions with homogeneous keyspaces on the highest-level foreign key, which is intended to manage the memory requirements of the largest targets while preserving downstream linking, and allows combine(giant_freetext_df, small_high_level_df, .id = index, .by = index, .tag_in = cluster_id) to ensure that keyspaces are aligned and split targets can be joined.

Presently each input df is split into 100 partitions, but as the workflow grows I'm finding that the cost of submitting this on HPC is significant, so I think I need to revise this approach. The problem is that I'm working with a test extract of the data and don't know what size to expect from the full extract, but it will be significantly more massive (but I digress).
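
For concreteness, a stripped-down sketch of that partitioning pattern (read_partition() and raw_extract are hypothetical placeholders for my actual reader and data source):

    # Sketch: 100 indexed partitions built with map(), then stitched back together
    # with combine(), mirroring the design described above.
    library(drake)

    plan <- drake_plan(
      part = target(
        read_partition(raw_extract, index),  # hypothetical: load one partition
        transform = map(index = !!seq_len(100))
      ),
      combined = target(
        dplyr::bind_rows(part),
        transform = combine(part)
      )
    )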

@wlandau
Member

wlandau commented Dec 6, 2019

I'm sure such a use case exists, although my proposed implementation would probably just be a storr API backed by saving whole targets to the native file system as .rds files.

Yeah, maybe a storr where all the namespaces for a key condense down to a single file. We would need (1) a way for the existing namespace API to still work, and (2) atomicity within keys. But different keys could write concurrently. Because of multiple namespaces, we may need something more nuanced than an RDS file though, maybe a database file.
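
To make that concrete, a toy illustration (not the storr driver API) of condensing all the namespaces for one key into a single file, with write-to-temp-then-rename providing per-key atomicity; the path, key, and helpers are hypothetical:

    # Toy sketch: all namespaces for a key live in one .rds file holding a named list.
    write_key <- function(path, key, values_by_namespace) {
      file <- file.path(path, paste0(key, ".rds"))
      tmp <- tempfile(tmpdir = path)
      saveRDS(values_by_namespace, tmp)  # e.g. list(objects = <value>, meta = <metadata>)
      file.rename(tmp, file)             # rename is atomic on most local file systems
    }

    read_key <- function(path, key, namespace) {
      readRDS(file.path(path, paste0(key, ".rds")))[[namespace]]
    }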

It's actually by design that my system has as many targets as it does. Targets are split into n indexed partitions with homogeneous keyspaces on the highest-level foreign key, which is intended to manage the memory requirements of the largest targets while preserving downstream linking, and allows combine(giant_freetext_df, small_high_level_df, .id = index, .by = index, .tag_in = cluster_id) to ensure that keyspaces are aligned and split targets can be joined.

The conceptual layout of tasks does not necessarily need to match the collection of targets. A single target can run a bunch of tasks and return a list of results.

@wlandau
Member

wlandau commented Dec 7, 2019

FYI: I expect #1095 to speed up things for you.

@wlandau
Member

wlandau commented Dec 9, 2019

@mstr3336, did I answer your questions? Do you have more?

@strazto
Contributor Author

strazto commented Dec 10, 2019

@mstr3336, did I answer your questions? Do you have more?

You did; I would start diverging off-topic if I got into them in this thread.

I have questions about the following:

FYI: I expect #1095 to speed up things for you.

Also, I did install the latest dev version (7.7.0.9001ish -> 7.8.9000) and test drive it, and I noticed that the preprocessing/analysis/import/skip phase sped up by about 200% on the cluster (from about 50-60 minutes until the first target is actually submitted down to 20 or so minutes), which was impressive. However, the error message changed from the clustermq-migration-related errors I've been troubleshooting to the following:

Error: too many SQL variables

I haven't filed it as its own issue because I can't say for sure that the drake upgrade is the culprit, given there are other factors I'm troubleshooting, and I don't have a reprex or session info to provide, though my instincts tell me this is more likely related to the new cache optimizations than to the clustermq troubles.

For the time being I rolled drake back to 7.7.9001ish so I know what I'm dealing with while I try to fix things.

@wlandau
Member

wlandau commented Dec 10, 2019

Hmm... sorry you're getting an error. The SQL stuff doesn't sound like a drake issue though.

Glad you saw a speedup. I do not think drake should take 20 minutes to start targets, certainly not 50-60 minutes. Whenever you're ready, it would be great if you would upload profiling data or results from specific workflows so we can have a further look.

FYI, I just released a profiling package that should make it easier to collaborate on performance issues: https://github.com/wlandau/proffer.
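
A rough sketch of the intended usage (assuming the pprof() entry point; profile_me() is a placeholder for the slow code under investigation):

    # Sketch: profile an expression with proffer and open the pprof visualization.
    library(proffer)
    px <- pprof(profile_me())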

@wlandau
Member

wlandau commented Dec 10, 2019

I am closing this general issue. Let's follow up on more specific issues as they arise.

@wlandau wlandau closed this as completed Dec 10, 2019
@strazto
Contributor Author

strazto commented Dec 10, 2019

Glad you saw a speedup. I do not think drake should take 20 minutes to start targets, certainly not 50-60 minutes. Whenever you're ready, it would be great if you would upload profiling data or results from specific workflows so we can have a further look.

Sure - I do note that I have a large amount of raw data, and I used static branching to produce a very big plan (it might be on the order of thousands of targets). You have mentioned that that design mightn't be optimal.

Also, our HPC nodes aren't particularly fast.

FYI, I just released a profiling package that should make it easier to collaborate on performance issues: https://github.com/wlandau/proffer.

Thanks! That should be very helpful!
