-
Notifications
You must be signed in to change notification settings - Fork 309
investigate reports of instability since redesign #626
Comments
I can confirm this. It was working well for me yesterday during the day, but started giving problems at night (i.e. ~12 hours before this post). I couldn't login, refreshing consistently showed an "Application error" instead of the website during some periods. Today it's been behaving properly so far. |
|
There are a few different instances of this appearing separately in Sentry. I'm going through each to compare them to make sure they're all the same thing. |
Am I correct that "EOF detected" here means a lost network connection to the database? |
@bruceadams Seems likely. |
There are 11 instances of this in Sentry. They hit various simplates. They all have the same SQL query, |
From https://devcenter.heroku.com/articles/postgres-logs-errors#pgerror-ssl-syscall-error-eof-detected:
I don't know enough about Aspen to make any educated guesses whether it could be the first reason they list, but the second reason while broad seems more than plausible. Now the question remains, where do we even begin to debug this? |
I'm going to check Google Analytics to get a sense of how loaded up we are. |
Load is normal. |
Currently working just fine in terms of auth. |
I said no in IRC, but I was wrong. First Seen: 3 days ago |
I ignored it at the time because I see various db connection dropping errors in sentry, usually in spurts overnight, which I assume are heroku housecleaning and moving the app around or whatever. I flagged this as that and figured if it wasn't we'd find out. Which we did. :-) |
Okay, let's explore Heroku's option 1. |
What constitutes as normal?
Was this the same code that was causing problems, or different?
How is Aspen set up to fork process/thread in production? |
Any thoughts on what might this translate to in terms of concurrent users on the site? Google analytics has a real-time reporting view that shows you activity/people on the site under "Standard Reports" in "Reporting" |
As mentioned we have 11 instances in Sentry, though those are sampled and aggregated (and I'm not 100% clear on how that works). One of the aggregates has 4,200 instances, the next-most-common has 81. Let me collect the code paths in question ... |
well one thing is I'd note that Ive never seen that error before, we do detect errors that mean "disconnection", and that is nothing like them. Are we using psycopg2 / libpq with heroku here ? is there some middleware / pgbouncer type of thing in use ? |
It's interesting that no matter the code path, it's always the same SQL query in the failure. |
can you paste out the full SQL text of that query and also what does EXPLAIN give you for it ? |
@zzzeek Yes.
I don't know. |
What did we roll out on Feb 3? |
The paste above is everything I have to go on from Sentry. Are you suggesting there's another place I can pull that query text from? |
Heroku keeps log of activity for the app, check it out under https://dashboard.heroku.com/apps/ You can also have Heroku track the hashes by linking the code to the repository on Github:
|
6.1.19 on Feb 1 is the last deploy prior to Feb 4. |
just so we're on the same page, the paste above ends the query as "participants.session_expires AS participants_s", no closing quote, seems to be truncated. SQLAlchemy's exception messages include the full query. OperationalError is a DBAPI exception raised by psycopg2, but we wrap it in an agnostic wrapper and this seems to be wrapped. So the error reporting here by heroku would appear to be truncating the message. |
@zzzeek Gotcha. It could also be Sentry doing the truncation. I'm going to dig in Heroku logs to see if the full message is in there. |
I'm currently focused on comprehending SQLAlchemy. |
Do you think it would be worthwhile to open up a support ticket with Heroku to ask them whether there's anything on their end? I've not used their offering so I'm not familiar with it, do you have access to any kind of database logs or such? |
Good idea, @joonas. Check out what I'm finding in the ticket queue: |
They all say the same thing:
Then:
Finally:
|
The latest of those support tickets is still open, so I've commented on it. |
On Feb 8, 2013, at 4:23 PM, Chad Whitacre [email protected] wrote:
right but if the query itself is very slow / heavy, it could perhaps be why timeout/memory issues are triggered. or it might just be the most common query the application does on a regular basis. Usually when I write an app, I spend a lot of time just running it with SQL echoing on so I get to know it's SQL behavior fairly intimately. |
You mean locally in development or in production and you review the logs? |
On Feb 8, 2013, at 5:12 PM, Chad Whitacre [email protected] wrote:
|
On Feb 8, 2013, at 5:14 PM, Michael Bayer [email protected] wrote:
like in this case, you said the same query is coming out via like seven different code paths. That seems a little weird, from a cyclomatic POV. |
They're all places where we instantiate a user/participant object, which is central to the app. The query is a straight |
On Feb 8, 2013, at 5:23 PM, Chad Whitacre [email protected] wrote:
OK, it's not a big deal. more curious what this weird disconnect is all about, and it seems like something environment specific. |
I did tail the logs for a bit (per IRC w/ @joonas). It appears that |
For completeness, here's the Heroku ticket: https://help.heroku.com/tickets/75571 [authorization required] |
I believe only you'll have access to this, btw (unless you add people as collaborators to the app) |
Since this only affecting a portion of users, I'm going to take a break for supper. My current thinking is still centered on these two options:
Thanks for the input on this, @zzzeek and @joonas. bbiab |
Is the query in question the first one performed for any given request? What does SQLAlchemy's unit of work marshalling look like and how does that play into this? |
On Feb 8, 2013, at 5:57 PM, Chad Whitacre [email protected] wrote:
the unit of work isn't really involved in why you're getting a disconnection error. The UOW's interaction with the database is that you might see a bunch of INSERT/UPDATE/DELETE statements emitted on the connection if changes are pending, before a query. It doesn't affect the lifespan or scope of the connection, which always remains local to a single Session object. |
I have left a few comments on your Heroku ticket. I would very much like to get to the bottom of this. |
Thanks @gregburek, much appreciated. For the record here we are looking at a possible hardware issue and are going to migrate to new hardware. |
From the support ticket:
|
|
I've taken a local backup of the database and plan to cut over. |
Okay, cutover is complete. I'm going to close this ticket. If it crops up again we can reopen/reticket. Thanks for all your help @zzzeek @joonas @gregburek. !m |
Post-mortem:
|
I've followed up with the original reporters on Twitter. |
E.g.:
https://twitter.com/freakboy3742/status/299658501507129344
https://twitter.com/wong_jim/status/299789632328040450
https://twitter.com/jamey_sharp/status/299812754435502080
https://twitter.com/jamey_sharp/status/299813665216032769
https://twitter.com/KULKING/status/299816183044120576
https://twitter.com/llanga/status/299828878552227841
The text was updated successfully, but these errors were encountered: