Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catastrophic Join issue with timeouts #3668

Open
ks0m1c opened this issue Feb 7, 2025 · 11 comments
Open

Catastrophic Join issue with timeouts #3668

ks0m1c opened this issue Feb 7, 2025 · 11 comments

Comments

@ks0m1c
Copy link

ks0m1c commented Feb 7, 2025

Environment

  • Elixir version (elixir -v):Elixir 1.17.3 (compiled with Erlang/OTP 27)
  • Phoenix version (mix deps): 1.7.18
  • Operating system: 6.12.7-arch1-1/ alpine on server

Actual behavior

I was trying to replicate a terrible production bug that results in the browser entering a refresh/retry loop that doesnt end across all devices connected during a live session. There was 0 error logs on the backend which was really surprising but the only thing that stood out to me was all the connections were defaulting to long polling even though websockets were initially running on the network

Also user behavior with any binded html element like phx-click will take a very long time to respond and all at once.

So i went down the rabbithole of trying to replicate this issue locally and stumbled on an approach, I set these session storage params to force livepolling and open multiple tabs.

sessionStorage.setItem("phx:fallback:ge", true);
sessionStorage.setItem("phx:fallback:LongPoll", true);

const numberOfTabs = 20; // Define the number of tabs to open
const delayBetweenTabs = 4000; // Delay in milliseconds (1 second)

for (let i = 0; i < numberOfTabs; i++) {
    setTimeout(() => {
        const sessionData = {
            "active": true,
            "id": (Math.random() + 1).toString(36).substring(7),
            "name": (Math.random() + 1).toString(36).substring(7)
        };

        localStorage.setItem('session', JSON.stringify(sessionData));
        window.open("http://localhost:4000/shell", '_blank'); // Open in new tab
    }, i * delayBetweenTabs);
}

There is again no indication of errors on the server side such as rejection of connection or memory issues etc but there is indefinite join cycles forced. This does not stop even if i close all other tabs

phx-GCGLlc21mqiq4uFi join: encountered 10 consecutive reloads -  undefined utils.js:28:12
phx-GCGLlc21mqiq4uFi join: exceeded 10 consecutive reloads. Entering failsafe mode -  undefined utils.js:28:12
phx-GCGLlc21mqiq4uFi destroyed: the child has been removed from the parent -  undefined utils.js:28:12
phx-GCGLlc21mqiq4uFi join: encountered 11 consecutive reloads -  undefined utils.js:28:12
phx-GCGLlc21mqiq4uFi join: exceeded 10 consecutive reloads. Entering failsafe mode -  undefined

another such load

phx-GCGNg6JKHiB6P_tk join: encountered 29 consecutive reloads -  undefined [app.js:2413:15](http://localhost:4000/assets/app.js)
phx-GCGNg6JKHiB6P_tk join: exceeded 10 consecutive reloads. Entering failsafe mode -  undefined [app.js:2413:15](http://localhost:4000/assets/app.js)

example of first splash immediately back to error state

phx-GCGNp8N2ioqGrjcB join: encountered 0 consecutive reloads -  undefined [app.js:2413:15](http://localhost:4000/assets/app.js)
phx-GCGNp8N2ioqGrjcB error: view crashed -  undefined [app.js:2413:15](http://localhost:4000/assets/app.js)
phx-GCGNuTObX3XMHSni error: unable to join -  
Object { reason: "timeout" }
[utils.js:28:12](http://localhost:4000/deps/phoenix_live_view/assets/js/phoenix_live_view/utils.js)
phx-GCGNuTObX3XMHSni error: unable to join -  
Object { reason: "timeout" }

Closest issue I found to this was from this particular issue > #1488

But my mount is quite clean

def mount(_params, _session, socket) do
    {:ok,
     socket
     |> assign(
       label: nil,
       clan: nil,
       outershell: nil,
       sensei: false,
       myfunctions: [],
       outerfunctions: [],
       class: nil,
       disciples: %{},
       deck: true,
       pane: true
     )
     |> assign(focused_phx_ref: "")}
  end

if I am doing anything special, its to rely on localstorage to init user data

defmodule DojoWeb.Session do
  import Phoenix.Component, only: [assign: 2]
  import Phoenix.LiveView, only: [get_connect_params: 1]

  @derive Jason.Encoder
  defstruct name: nil, id: nil, active: false, last_opened: DateTime.now!("Etc/UTC")

  @default_locale "en"
  @timezone "UTC"
  @timezone_offset 0

  def on_mount(:anon, params, _sessions, socket) do
    # get_connect_params returns nil on the first (static rendering) mount call, and has the added connect params from the js LiveSocket creation on the subsequent (LiveSocket) call
    #
    {:cont,
     socket
     |> assign(
     locale: get_connect_params(socket)["locale"] || @default_locale,
     tz: %{timezone: get_connect_params(socket)["timezone"] || @timezone,
           timezone_offset: get_connect_params(socket)["timezone_offset"] || @timezone_offset},
     session: get_connect_params(socket)["session"] |> mutate_session(params)
     )}
  end

  # careful of client and server state race. id here is not SOT
  defp mutate_session(%{"active" => true} = sess, _) do
    atomised_sess =
      for {key, val} <- sess, reduce: %{} do
        acc -> hydrate_session(acc, key, val)
      end

    struct(%__MODULE__{}, atomised_sess)
  end
  # false first load
  defp mutate_session(_, _), do: %__MODULE__{}

  defp hydrate_session(acc, key, val) when key in ["name", "id", "active", "last_opened"] do
    put_in(acc, [String.to_existing_atom(key)], val)
  end

  defp hydrate_session(acc, _key, _val), do: acc
 end

Expected behavior

No infinite breaking loop, if any connection issue also dont crash the running local client behaviour just to refresh socket connection to crash again. Also better error handling on server end

@SteffenDE SteffenDE transferred this issue from phoenixframework/phoenix Feb 7, 2025
@SteffenDE
Copy link
Collaborator

@ks0m1c it looks like you're doing something on the server that is taking too long to complete

phx-GCGNuTObX3XMHSni error: unable to join -  
Object { reason: "timeout" }

which causes LiveView to refresh.

There was 0 error logs on the backend which was really surprising

This would align with the LiveView taking too long to respond. So it does not crash which would log an exception, but instead is killed when the client disconnects (which can always happen and is not an error).

The consecutive reloads logic does not assume a large number of open tabs from the same browser, so the numbers there are probably off and aren't that indicative of a client problem.

It's hard to tell what's going on without something to reproduce. But I'd suggest to look into everything that's happening in your mount logic, including any hooks and maybe even try to do some IO.puts debugging to see if/when your mount finishes.

@ks0m1c
Copy link
Author

ks0m1c commented Feb 7, 2025

Could it be related to this error message, I had been ignoring it because the hook at slider-thumb has actually been working even though the console raises an error

utils.js:7 no hook found for custom element: slider-thumb undefined
logError @ utils.js:7
addHook @ view.js:709
maybeAddNewHook @ view.js:460
(anonymous) @ view.js:409
all @ dom.js:40
execNewMounted @ view.js:407
applyJoinPatch @ view.js:425
onJoinComplete @ view.js:382
(anonymous) @ view.js:342
maybeRecoverForms @ view.js:574
(anonymous) @ view.js:341
applyDiff @ view.js:304
onJoin @ view.js:334
(anonymous) @ view.js:820
after @ live_socket.js:1002
requestDOMUpdate @ live_socket.js:300
ok @ view.js:820
(anonymous) @ view.js:905
withLatency @ view.js:901
(anonymous) @ view.js:905
(anonymous) @ push.js:76
matchReceive @ push.js:76
(anonymous) @ push.js:107
trigger @ channel.js:278
(anonymous) @ channel.js:70
trigger @ channel.js:278
(anonymous) @ socket.js:631
decode @ serializer.js:25
onConnMessage @ socket.js:618
conn.onmessage @ socket.js:353
view.js:711 Uncaught TypeError: Cannot read properties of undefined (reading '__attachView')
    at _View.addHook (view.js:711:12)
    at _View.maybeAddNewHook (view.js:460:24)
    at view.js:409:14
    at Array.forEach (<anonymous>)
    at Object.all (dom.js:40:29)
    at _View.execNewMounted (view.js:407:9)
    at _View.applyJoinPatch (view.js:425:10)
    at _View.onJoinComplete (view.js:382:14)
    at view.js:342:14
    at _View.maybeRecoverForms (view.js:574:14)

I am currently moving every assign on my handle_params into an async_assign process let me get back to you with that test soon

@ks0m1c ks0m1c changed the title Catastrophic Join issue with multiple users Catastrophic Join issue with timeouts Feb 7, 2025
@SteffenDE
Copy link
Collaborator

What LiveView version are you using? There was a recent issue (#3496) causing such an error, but it should not cause a redirect loop as it's a clientside error.

@ks0m1c
Copy link
Author

ks0m1c commented Feb 7, 2025

my locked version is 1.0.4

Config: {:phoenix_live_view, "~> 1.0"}
Locked version: 1.0.4
Releases: 1.0.4, 1.0.3, 1.0.2, 1.0.1, 1.0.0, 1.0.0-rc.9, 1.0.0-rc.8, 1.0.0-rc.7, ...
not sure if its related in anyway to a view crash that happens on init, this is probably cause I am forcing lonpolling for testing currently
phx-GCH46GvevWOi4Q0i error: view crashed -  undefined [app.js:2423:15](http://localhost:4000/assets/app.js)

@ks0m1c
Copy link
Author

ks0m1c commented Feb 7, 2025

I have moved every function on mount into a start_async/handle_async pair, but the issue with consecutive reloads still persists. And also the tabs that do load remain unresponsive on binding events though a bunch phx-update events stream through after a bit of a delay

@ks0m1c
Copy link
Author

ks0m1c commented Feb 8, 2025

I was able to replicate this issue on a fresh new phoenix projected generated using phx.gen.new with the generated user liveview mix phx.gen.live Accounts User users name:string age:integer, I used the same multiples tabs opened in intervals of 1000s js command

sessionStorage.setItem("phx:fallback:ge", true);
sessionStorage.setItem("phx:fallback:LongPoll", true);

const numberOfTabs = 30; // Define the number of tabs to open
const delayBetweenTabs = 1000; // Delay in milliseconds (1 second)

for (let i = 0; i < numberOfTabs; i++) {
    setTimeout(() => {
        const sessionData = {
            "active": true,
            "id": (Math.random() + 1).toString(36).substring(7),
            "name": (Math.random() + 1).toString(36).substring(7)
        };

        localStorage.setItem('session', JSON.stringify(sessionData));
        window.open("http://localhost:4000/users/new", '_blank'); // Open in new tab
    }, i * delayBetweenTabs);
}
phx-GCIeOwdSuaapThsC error: view crashed -  undefined utils.js:39:12
phx-GCIeOwdSuaapThsC mount:  -  
Object { 0: ' src="/images/logo.svg"', 1: "1.7.19", 2: {…}, 3: {…}, c: {…}, s: (5) […], t: "New User" }
utils.js:39:12
phx-GCIeOwdSuaapThsC error: view crashed -  undefined utils.js:39:12

then it will refresh the page

phx-GCIeOwdSuaapThsC join: encountered 3 consecutive reloads -  undefined app.js:1812:15
Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content. node.js:417:1

infinite refresh loop continues

@chrismccord
Copy link
Member

This doesn't necessarily show an issue as the browser may simply be background/limiting network connectivity across all the tabs you are launching (as they won't all be in focus/active states). In those cases they would timeout if they aren't given network access or the browser is limited the same # of simultaneous connections to a given host, so I don't think this reproduction is actually the issue at hand. As @SteffenDE said it sounds like the prod issue you had was timeout related which could be from any code path in your stack from endpoint plugs to LV mount lifecyce. A single overloaded DB call or process call could potentially trigger denial of service as users would all be blocking on mount, and the client gives up. You had the timeout in your logs, so let's start there. I don't think the dozens of tabs approach is a repro of what you saw in prod, but we lack all the code paths and logs so it's really impossible for us to say. If you have more info we could take a closer look. Thanks!

@ks0m1c
Copy link
Author

ks0m1c commented Feb 8, 2025

Hey Chris!

I understand that browser may simply be background/limiting network connectivity across all the tabs, but why does the actively focused tab not gracefully recover towards a stable state, even after closing all the other tabs?

Does the behaviour arise from browser specific details rather than something library specific?

But it also doesnt fully explain why phx-updates when they do reach back the client (as observed on the client) fail to render changes on the DOM.

I do have trouble replicating the conditions for why all the active websocket connections on that day led to all being degraded to longpolling due to the timeout fallback.

The base local-only client functionality works well after mount splash so having an ability to allow the user to re-initiate connection by user triggered action could also help as a fallback instead of the loop of doom

Here is source code: the project is a bit of Seymour Papert meets Joe Armstrong for kids in a classroom to go wild together in a shared concurrent mathematical world -> https://github.com/paperlands/dojo/blob/master/lib/dojo_web/live/shell_live.ex

I have since moved all function calls at mount and handle params into the async pattern of assigning yesterday, there are two potential hotpaths which could have led to the behaviour one was this list_users for the classroom.

def list(topic, timeout \\ 5000) do
    __MODULE__
    |> Phoenix.Tracker.Shard.name_for_topic(topic, pool_size())
    |> GenServer.call({:list, topic}, timeout)
    |> Phoenix.Tracker.State.get_by_topic(topic)
  end

  def list_users(topic),
    do: Enum.map(list(topic), fn {_k, meta} -> Map.put(meta, :topic, topic) end)

And the other was a joining a table for last state and joining the rest of classroom in the presence end

def join(pid, book, disciple) do

    spec = %{id: Table, start: {Table, :start_link, [%{topic: topic(book), disciple: disciple}]}}

    {:ok, class} = DynamicSupervisor.start_child(
      {:via, PartitionSupervisor, {__MODULE__, self()}},
      spec
    )

    Dojo.Gate.track(pid, topic(book), %{disciple | node: class})

    {:ok, class}
  end

@SteffenDE
Copy link
Collaborator

I just remembered something that already was a problem with server sent events in the past, but also applies to regular HTTP/1 requests: there is a maximum number of concurrent requests to the same server in the same browser. See https://stackoverflow.com/questions/18584525/server-sent-events-and-browser-limits. This completely explains the problem you see locally. As soon as this limit is reached, tabs will time out. You don’t see it with websockets, because they are not affected by the limit, but due to the nature of long polling, the effect there is extreme.

Thank you for providing the code! I’ll try to have a look to see if I find anything that might be problematic.

@ks0m1c
Copy link
Author

ks0m1c commented Feb 13, 2025

Hey @SteffenDE

In the meantime I am exploring a path to fallback to offline with limited functionality through service workers and attempt reconnection to websocket or longpolling in the background, is there any prior art wrt this.

@SteffenDE
Copy link
Collaborator

is there any prior art wrt this.

Not that I know of. And to be honest I don't think it's worth the effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants