Catastrophic Join issue with timeouts #3668
@ks0m1c it looks like you're doing something on the server that is taking too long to complete, which causes LiveView to refresh.

This would align with the LiveView taking too long to respond: it does not crash (which would log an exception), but is instead killed when the client disconnects (which can always happen and is not an error). The consecutive-reloads logic does not assume a large number of open tabs from the same browser, so the numbers there are probably off and aren't that indicative of a client problem. It's hard to tell what's going on without something to reproduce, but I'd suggest looking into everything that's happening in your mount logic, including any hooks, and maybe even doing some IO.puts debugging to see if/when your mount finishes.
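The IO.puts-style timing suggested above could be sketched like this (a minimal sketch; `MountTiming` is a hypothetical helper, only `:timer.tc/1` is stock Erlang):

```elixir
defmodule MountTiming do
  # Hypothetical helper: wrap a suspect mount step in :timer.tc/1,
  # which returns {elapsed_microseconds, result}, and print the timing.
  def timed(label, fun) do
    {micros, result} = :timer.tc(fun)
    IO.puts("#{label} took #{div(micros, 1000)}ms")
    result
  end
end
```

Inside a LiveView's `mount/3` you could then wrap each suspect call, e.g. `MountTiming.timed("list_users", fn -> list_users(topic) end)`, to see which step is eating the join budget.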
Could it be related to this error message? I had been ignoring it because the hook at slider-thumb has actually been working, even though the console raises an error.
I am currently moving every assign in my handle_params into an async_assign process; let me get back to you with that test soon.
What LiveView version are you using? There was a recent issue (#3496) causing such an error, but it should not cause a redirect loop, as it's a client-side error.
My locked version is 1.0.4.
I have moved every function in mount into a start_async/handle_async pair, but the issue with consecutive reloads still persists. The tabs that do load also remain unresponsive to bound events, though a bunch of phx-update events stream through after a bit of a delay.
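For reference, the start_async/handle_async pair mentioned here typically looks like the sketch below. `start_async/3` and `handle_async/3` are the real Phoenix.LiveView APIs; the `:load_users` name and the `list_users/1` call are hypothetical placeholders:

```elixir
def mount(_params, _session, socket) do
  {:ok,
   socket
   |> assign(:users, [])
   |> start_async(:load_users, fn -> list_users("class:demo") end)}
end

def handle_async(:load_users, {:ok, users}, socket) do
  {:noreply, assign(socket, :users, users)}
end

def handle_async(:load_users, {:exit, reason}, socket) do
  # The async task crashed or timed out; keep the LiveView alive
  # instead of letting the whole process go down.
  {:noreply, put_flash(socket, :error, "could not load users: #{inspect(reason)}")}
end
```

The key property is that `mount/3` returns immediately, so the join completes before the slow work finishes.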
I was able to replicate this issue on a fresh new Phoenix project, generated using:

```js
sessionStorage.setItem("phx:fallback:ge", true);
sessionStorage.setItem("phx:fallback:LongPoll", true);

const numberOfTabs = 30;       // number of tabs to open
const delayBetweenTabs = 1000; // delay in milliseconds (1 second)

for (let i = 0; i < numberOfTabs; i++) {
  setTimeout(() => {
    const sessionData = {
      "active": true,
      "id": (Math.random() + 1).toString(36).substring(7),
      "name": (Math.random() + 1).toString(36).substring(7)
    };
    localStorage.setItem('session', JSON.stringify(sessionData));
    window.open("http://localhost:4000/users/new", '_blank'); // open in new tab
  }, i * delayBetweenTabs);
}
```
Then it will refresh the page, and the infinite refresh loop continues.
This doesn't necessarily show an issue, as the browser may simply be backgrounding/limiting network connectivity across all the tabs you are launching (as they won't all be in focus/active states). In those cases they would time out if they aren't given network access, or if the browser limits the number of simultaneous connections to a given host, so I don't think this reproduction is actually the issue at hand. As @SteffenDE said, it sounds like the prod issue you had was timeout related, which could be from any code path in your stack, from endpoint plugs to the LV mount lifecycle. A single overloaded DB call or process call could potentially trigger denial of service, as users would all be blocking on mount, and the client gives up. You had the timeout in your logs, so let's start there. I don't think the dozens-of-tabs approach is a repro of what you saw in prod, but we lack all the code paths and logs, so it's really impossible for us to say. If you have more info we could take a closer look. Thanks!
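One common mitigation for the blocking-mount scenario described above is to skip the expensive work during the dead render and only run it once the socket is connected. A sketch under assumptions: `load_class/0` is a hypothetical stand-in for the slow DB/process call, while `connected?/1` is part of Phoenix.LiveView:

```elixir
def mount(_params, _session, socket) do
  # On the initial HTTP (dead) render, connected?/1 is false, so the
  # response stays fast; the heavy call only runs on the live mount.
  data = if connected?(socket), do: load_class(), else: nil
  {:ok, assign(socket, :class, data)}
end
```

This keeps the first paint cheap, though the connected mount can still time out if `load_class/0` blocks too long, which is where start_async helps.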
Hey Chris! I understand that the browser may simply be backgrounding/limiting network connectivity across all the tabs, but why does the actively focused tab not gracefully recover towards a stable state, even after closing all the other tabs? Does the behaviour arise from browser-specific details rather than something library-specific? It also doesn't fully explain why phx-update events, when they do reach the client (as observed on the client), fail to render changes on the DOM.

I do have trouble replicating the conditions for why all the active websocket connections on that day were degraded to long polling by the timeout fallback. The base local-only client functionality works well after the mount splash, so having the ability for the user to re-initiate the connection by a user-triggered action could also help as a fallback, instead of the loop of doom.

Here is the source code; the project is a bit of Seymour Papert meets Joe Armstrong, for kids in a classroom to go wild together in a shared concurrent mathematical world -> https://github.com/paperlands/dojo/blob/master/lib/dojo_web/live/shell_live.ex

I have since moved all function calls at mount and handle_params into the async pattern of assigning yesterday. There are two potential hot paths which could have led to the behaviour. One was this list_users for the classroom:

```elixir
def list(topic, timeout \\ 5000) do
  __MODULE__
  |> Phoenix.Tracker.Shard.name_for_topic(topic, pool_size())
  |> GenServer.call({:list, topic}, timeout)
  |> Phoenix.Tracker.State.get_by_topic(topic)
end

def list_users(topic),
  do: Enum.map(list(topic), fn {_k, meta} -> Map.put(meta, :topic, topic) end)
```

And the other was joining a table for last state and joining the rest of the classroom in the presence:

```elixir
def join(pid, book, disciple) do
  spec = %{id: Table, start: {Table, :start_link, [%{topic: topic(book), disciple: disciple}]}}

  {:ok, class} =
    DynamicSupervisor.start_child(
      {:via, PartitionSupervisor, {__MODULE__, self()}},
      spec
    )

  Dojo.Gate.track(pid, topic(book), %{disciple | node: class})
  {:ok, class}
end
```
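Since `list/2` above blocks the caller for up to 5 seconds in `GenServer.call/3`, and a timeout there exits the calling LiveView process, one defensive option is to catch the timeout exit and degrade to a default value. This is a sketch, not the project's code; `SafeCall` and `SlowServer` are hypothetical names:

```elixir
defmodule SafeCall do
  # GenServer.call/3 exits the caller on timeout. Catching the exit
  # lets a slow shard degrade to a default (e.g. an empty presence
  # list) instead of taking the caller down. Note: a late reply may
  # still arrive in the caller's mailbox afterwards.
  def call_with_default(server, request, timeout, default) do
    GenServer.call(server, request, timeout)
  catch
    :exit, {:timeout, _} -> default
  end
end

defmodule SlowServer do
  # Deliberately slow server, for demonstration only.
  use GenServer
  def init(state), do: {:ok, state}

  def handle_call(:slow, _from, state) do
    Process.sleep(1_000)
    {:reply, :late, state}
  end
end
```

Whether an empty list is an acceptable fallback for a classroom roster is a product decision, but it at least keeps the LiveView from being killed mid-join.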
I just remembered something that was already a problem with server-sent events in the past, but also applies to regular HTTP/1 requests: there is a maximum number of concurrent requests to the same server in the same browser. See https://stackoverflow.com/questions/18584525/server-sent-events-and-browser-limits. This completely explains the problem you see locally. As soon as this limit is reached, tabs will time out. You don't see it with websockets, because they are not affected by the limit, but due to the nature of long polling, the effect there is extreme. Thank you for providing the code! I'll try to have a look to see if I find anything that might be problematic.
Hey @SteffenDE, in the meantime I am exploring a path to fall back to offline with limited functionality through service workers, and attempt reconnection to websocket or long polling in the background. Is there any prior art w.r.t. this?
Not that I know of. And to be honest, I don't think it's worth the effort.
### Environment

### Actual behavior
I was trying to replicate a terrible production bug that results in the browser entering a refresh/retry loop that doesn't end, across all devices connected during a live session. There were zero error logs on the backend, which was really surprising, but the one thing that stood out to me was that all the connections were defaulting to long polling even though websockets were initially running on the network.
Also, user interaction with any bound HTML element (like phx-click) takes a very long time to respond, and then everything fires all at once.
So I went down the rabbit hole of trying to replicate this issue locally and stumbled on an approach: I set these session storage params to force long polling and opened multiple tabs.
There is again no indication of errors on the server side, such as rejected connections or memory issues, but indefinite join cycles are forced. This does not stop even if I close all the other tabs.
(screenshot: another such load)

(screenshot: first splash immediately going back to the error state)
The closest issue I found to this was #1488. But my mount is quite clean; if I am doing anything special, it's relying on localStorage to init user data.
### Expected behavior
No infinite breaking loop. If there is any connection issue, don't crash the running local client behaviour just to refresh the socket connection and crash again. Also, better error handling on the server end.