-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Fix MGLMapSnapshotter concurrency bugs (issue #11827). #11831
Conversation
@julianrex One thing you mentioned in #11801 is that we may need to make sure the completion callback is called even if the background snapshot operation is abandoned (either by an explicit "cancel" or by destruction of the |
EDIT: I was wrong, ignore the above. |
Current state here is that we're able to frequently produce a deadlock somewhere in GL drawing code when we're running six snapshotters at once in the ios test app on the iOS simulator. The deadlock can happen either in native CoreGraphics rendering (applying attribution to a rendered map), or in the background gl-native calls into the EAGLBackend. The trigger seems to be running the attribution-stamping code in a GCD worker queue -- although we haven't identified a specific unsafe operation and it's possible we're just shifting timing around when we move it to the foreground. We haven't seen the deadlock running on a physical iPhone or on macOS. I would really like to get to the bottom of the cause of the deadlock so we can be confident we're not causing a (potentially rare) deadlock bug in the wild. However, our immediate fix may be to just stop doing attribution on a GCD worker queue. It should be a cheap operation, and fine to do on the foreground. I talked to @ivovandongen today about why we originally did it on the background, and his memory was that we were mainly just copying an Android design pattern of doing as little non-UI work on the foreground as possible. It's worth keeping in mind, though, that our API allows the completion callback to run on arbitrary threads, so if there's a gotcha on sharing cc @frederoni (Ivo tells me you helped him with the original implementation) |
🤔 Seems like attribution is much more expensive than I would have guessed, here's timing from six runs (in microseconds):
The first run loads logo images from disk, and I thought that might be the expensive part so I cached the logos in memory for subsequent runs, but it still seems to take around half a second to run. That is... way longer than I expected... and probably a good reason to keep doing attribution in the background? |
Sounds like it. If we temporarily remove the attribution code - is "everything OK"? |
4f2970b
to
3ed3c30
Compare
3ed3c30
to
98727ca
Compare
@julianrex pending your 🍏 , I think we should merge these changes and include them in the next patch release. I'm unhappy that we've been unable to satisfactorily explain the deadlock we see in the simulator, but I don't think it's a good idea to include the various "fixes" we've tried (e.g. wrapping EAGLBackend usage with a global lock), since we don't have a theoretical justification for them and we're probably just altering the timing of the reproduction case. Current state:
|
@ChrisLoer will review! Do you think this could go into the next minor release instead (4.1.0 on iOS)?
Agreed. We could wrap any locks/unlocks with a simulator /cc @lilykaiser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor tweaks - also needs an entry in the ios & macos change logs.
_snapshotCallback = std::make_unique<mbgl::Actor<mbgl::MapSnapshotter::Callback>>(*mbgl::Scheduler::GetCurrent(), [=](std::exception_ptr mbglError, mbgl::PremultipliedImage image, mbgl::MapSnapshotter::Attributions attributions, mbgl::MapSnapshotter::PointForFn pointForFn) { | ||
__typeof__(self) strongSelf = weakSelf; | ||
// If self had died, _snapshotCallback would have been destroyed and this block would not be executed | ||
assert(strongSelf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We tend to prefer to use the NSAssert
family of assert macros. I think in this case we might need to use NSCAssert
since the NSAssert
macro uses self.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's important that we call the completion block in the case where _snapshotCallback
is destroyed. Is that straightforward?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a change with NSCAssert.
I thinking calling the iOS completion block on cancellation is outside the scope of the core code -- that is, I don't think you'd want the callback destructor to block waiting for a forced run of the completion callback (which might not be on the same thread the destructor is running from). To handle this in MGLMapSnapshotter
, I guess you'd want to explicitly call the completion callback with an error value from cancel
and also from...what is it, dealloc
? I guess the short answer is that it's not terribly difficult but also not totally straightforward. I think we can call it outside the scope of this fix, although I agree that it's a better interface if the callback always either succeeds or errors. FWIW, the Android interface also allows you to cancel without receiving a callback.
|
||
+ (void)drawAttributedSnapshotWorker:(mbgl::MapSnapshotter::Attributions)attributions snapshotImage:(MGLImage *)mglImage pointForFn:(mbgl::MapSnapshotter::PointForFn)pointForFn queue:(dispatch_queue_t)queue scale:(CGFloat)scale size:(CGSize)size completionHandler:(MGLMapSnapshotCompletionHandler)completion { | ||
|
||
NS_ARRAY_OF(MGLAttributionInfo *)* attributionInfo = [MGLMapSnapshotter generateAttributionInfos:attributions]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an FYI for when it comes to merge to master: NS_ARRAY_OF
and friends have been removed (but are still in release-boba
)
[self waitForExpectations:@[expectation] timeout:10.0]; | ||
} | ||
|
||
- (void)testMultipleSnapshotters { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth temporarily disabling the test so that we don't get false positives during CI. Let's just rename the failing methods something like - (void)disabled_test....
When we merge to master, we can mark this as a pending test (which uses a environment variable to decide whether to run or not)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the details of how this runs on CI, but I agree we should disable tests that may fail intermittently. Could you disable the tests that you think need disabling?
I think the changelog entry should be:
I'm not sure where it's supposed to go in the changelog since I'm not sure which release this will go in. |
- Biggest change: when we apply the watermark on a background thread, don't capture self (turn most of the related instance methods into class methods) - Don't call mbglMapSnapshotter->snapshot from a user-provided queue, since it's an asynchronous call anyway and starting it on the user's queue requires capturing self.
… without active run loop.
8 simultaneous mapsnapshotter test periodically deadlocks in simulator. Also, increase timeouts to decrease chance of spurious test failure.
3dfc244
to
4f1360c
Compare
Rebased on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
I took a stab at making the changes @julianrex and I have been discussing for fixing issue #11827.
The design goal here is that we should not be passing shared pointers/references across thread boundaries. Because
startWithQueue
allows users to provide their own queue for handling the completion, we have to assume that anything on that queue can run on another thread (even though the more commonstartWithCompletion
will run on the main/UI thread). When we capture a block for execution on a (potentially) different thread, we need to copy everything it needs by value. Any further changes toMGLMapSnapshotter
should only affect the result of futurestart
calls.It's now possible to remove the "can't restart after cancel" constraint we had (just need to use the last-set "options" to re-initialize the
mbglMapSnapshotter
), but I haven't done that yet. I'm still getting up to speed on all the context for how this interface is/should be used (see issue #11825).I'm a bit out of my element in Objective-C so careful review from someone on the iOS team is definitely a must. I'm also not sure of the best way to test beyond what CI gives us -- I've done just the basic loading of snapshots in the macos and ios test apps.
/cc @julianrex @friedbunny @1ec5