This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

investigate reports of instability since redesign #626

Closed
chadwhitacre opened this issue Feb 8, 2013 · 89 comments
@ghost ghost assigned chadwhitacre Feb 8, 2013
@dmitshur
Contributor

dmitshur commented Feb 8, 2013

I can confirm this. It was working well for me yesterday during the day, but started giving problems at night (i.e. ~12 hours before this post). I couldn't log in; refreshing consistently showed an "Application error" instead of the website during some periods.

Today it's been behaving properly so far.

@chadwhitacre
Contributor Author

OperationalError: (OperationalError) SSL SYSCALL error: EOF detected
 'SELECT participants.id AS participants_id, participants.statement AS participants_statement, participants.stripe_customer_id AS participants_stripe_customer_id, participants.last_bill_result AS participants_last_bill_result, participants.session_token AS participants_session_token, participants.session_expires AS participants_s

Stacktrace (most recent call last):

  File "aspen/website.py", line 66, in handle_safely
    response = self.handle(request)
  File "aspen/website.py", line 99, in handle
    response = request.resource.respond(request)
  File "aspen/resources/dynamic_resource.py", line 47, in respond
    exec self.pages[1] in context
  File "/app/www/on/github/associate", line 38, in <module>
    user = account.opt_in(login)  # setting 'user' gives them a session
  File "gittip/elsewhere/__init__.py", line 71, in opt_in
    user = User.from_id(self.participant_id)  # give them a session
  File "gittip/models/user.py", line 27, in from_id
    user = User.query.filter_by(id=user_id).first()
  File "sqlalchemy/orm/query.py", line 2181, in first
    ret = list(self[0:1])
  File "sqlalchemy/orm/query.py", line 2048, in __getitem__
    return list(res)
  File "sqlalchemy/orm/query.py", line 2252, in __iter__
    return self._execute_and_instances(context)
  File "sqlalchemy/orm/query.py", line 2267, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "sqlalchemy/engine/base.py", line 664, in execute
    params)
  File "sqlalchemy/engine/base.py", line 764, in _execute_clauseelement
    compiled_sql, distilled_params
  File "sqlalchemy/engine/base.py", line 878, in _execute_context
    context)
  File "sqlalchemy/engine/base.py", line 871, in _execute_context
    context)
  File "sqlalchemy/engine/default.py", line 320, in do_execute
    cursor.execute(statement, parameters)

@chadwhitacre
Contributor Author

There are a few different instances of this appearing separately in Sentry. I'm going through each to compare them to make sure they're all the same thing.

@bruceadams
Contributor

Am I correct that "EOF detected" here means a lost network connection to the database?

@chadwhitacre
Contributor Author

@bruceadams Seems likely.

@chadwhitacre
Contributor Author

There are 11 instances of this in Sentry. They hit various simplates. They all have the same SQL query, which is from the auth code that we reach via various code paths.

@joonas
Contributor

joonas commented Feb 8, 2013

From https://devcenter.heroku.com/articles/postgres-logs-errors#pgerror-ssl-syscall-error-eof-detected:

PG::Error: SSL SYSCALL error: EOF detected

This error is typically Postgres saying “there was a client side violation of the wire protocol”. This happens for one of two reasons in our experience:

  • Sharing a Postgres connection from more than one process or thread. Typical offenders are Resque workers or Unicorn. The fix there is to reestablish ActiveRecord connections on the after_fork hooks of either library.
  • Abrupt client (application side) disconnections. This can happen for many reasons, from your app crashing, to transient network availability. When your app tries to issue a query again against Postgres, the connection is just gone, leading to a crash. When Heroku detects a crash, we kill that dyno and start a new one, which re-establishes the connection.

I don't know enough about Aspen to make any educated guesses about whether it could be the first reason they list, but the second reason, while broad, seems more than plausible.

Now the question remains: where do we even begin to debug this?
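
For context, here is a minimal, hedged sketch of the kind of defensive engine setup Heroku's first bullet points at: recycle pooled connections so long-idle ones aren't reused after they've been dropped, ping connections on checkout so a dead one is replaced instead of failing the request, and dispose of the pool after any fork. The names and the ping recipe follow the standard SQLAlchemy pessimistic-disconnect pattern; none of this is Gittip's actual configuration.

    import os
    from sqlalchemy import create_engine, event, exc
    from sqlalchemy.pool import Pool

    # Recycle connections periodically so long-idle ones aren't reused after the
    # server (or something in between) has silently dropped them.
    engine = create_engine(os.environ["DATABASE_URL"], pool_recycle=300)

    @event.listens_for(Pool, "checkout")
    def ping_connection(dbapi_conn, conn_record, conn_proxy):
        # Pessimistic ping: if the connection is dead, tell the pool to throw it
        # away and retry with a fresh one instead of failing the request.
        cursor = dbapi_conn.cursor()
        try:
            cursor.execute("SELECT 1")
        except Exception:
            raise exc.DisconnectionError()
        finally:
            cursor.close()

    # If the server forks worker processes, each child should call engine.dispose()
    # right after the fork so it never shares the parent's sockets (Heroku's
    # "after_fork" advice, translated to SQLAlchemy).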

@chadwhitacre
Contributor Author

I'm going to check Google Analytics to get a sense of how loaded up we are.

@chadwhitacre
Contributor Author

Load is normal.

@ambv

ambv commented Feb 8, 2013

Currently working just fine in terms of auth.

@chadwhitacre
Contributor Author

joonas: have you seen this before the redesign?
joonas: I mean, is there anything in the logs before that?

I said no in IRC, but I was wrong.

First Seen: 3 days ago

@chadwhitacre
Contributor Author

I ignored it at the time because I see various db-connection-dropping errors in Sentry, usually in spurts overnight, which I assume are Heroku doing housekeeping and moving the app around or whatever. I chalked this up to that and figured that if it wasn't, we'd find out. Which we did. :-)

@chadwhitacre
Contributor Author

I just invited @zzzeek to join us here.

@chadwhitacre
Contributor Author

Okay, let's explore Heroku's option 1.

@joonas
Contributor

joonas commented Feb 8, 2013

Load is normal.

What constitutes normal?

I said no in IRC, but I was wrong.

First Seen: 3 days ago

Was this the same code that was causing problems, or different?

Okay, let's explore Heroku's option 1.

How is Aspen set up to fork processes/threads in production?

@chadwhitacre
Contributor Author

Load (note that the final day is partial):

[image: load graph]

@joonas
Contributor

joonas commented Feb 8, 2013

Load (note that the final day is partial):

Any thoughts on what this might translate to in terms of concurrent users on the site?

Google Analytics has a real-time reporting view that shows activity/people on the site, under "Standard Reports" in "Reporting".

@chadwhitacre
Contributor Author

Was this the same code that was causing problems, or different?

As mentioned, we have 11 instances in Sentry, though those are sampled and aggregated (and I'm not 100% clear on how that works). One of the aggregates has 4,200 instances; the next-most-common has 81. Let me collect the code paths in question ...

@chadwhitacre
Contributor Author

However, this is nearly all for /%participant/widget.html.

[image: concurrent visitors screenshot]

@zzzeek

zzzeek commented Feb 8, 2013

Well, one thing I'd note is that I've never seen that error before; we do detect errors that mean "disconnection", and that is nothing like them. Are we using psycopg2 / libpq with Heroku here? Is there some middleware / pgbouncer type of thing in use?

@chadwhitacre
Contributor Author

It's interesting that no matter the code path, it's always the same SQL query in the failure.

@zzzeek

zzzeek commented Feb 8, 2013

Can you paste the full SQL text of that query, and also what does EXPLAIN give you for it?

@chadwhitacre
Contributor Author

Are we using psycopg2 / libpq with heroku here ?

@zzzeek Yes.

is there some middleware / pgbouncer type of thing in use ?

I don't know.
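
(For what it's worth, the driver half of that question is easy to confirm from a one-off dyno, e.g. with heroku run python; this only assumes psycopg2 is importable there.)

    import psycopg2
    # The version string includes build flags, e.g. "2.4.5 (dt dec pq3 ext)";
    # "pq3" means the driver was built against the v3 libpq protocol.
    print(psycopg2.__version__)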

@chadwhitacre
Contributor Author

What did we roll out on Feb 3?

@chadwhitacre
Contributor Author

can you paste out the full SQL text of that query [...] ?

The paste above is everything I have to go on from Sentry. Are you suggesting there's another place I can pull that query text from?

@joonas
Contributor

joonas commented Feb 8, 2013

What did we roll out on Feb 3?

Heroku keeps a log of activity for the app; check it out under https://dashboard.heroku.com/apps/

You can also have Heroku track the hashes by linking the code to the repository on GitHub:

To link your deploy hashes to GitHub, add a repository on the settings page (of the Heroku app).

@chadwhitacre
Contributor Author

6.1.19 on Feb 1 is the last deploy prior to Feb 4.

@zzzeek

zzzeek commented Feb 8, 2013

Just so we're on the same page: the paste above ends the query at "participants.session_expires AS participants_s", with no closing quote, so it seems to be truncated. SQLAlchemy's exception messages include the full query. OperationalError is a DBAPI exception raised by psycopg2, but we wrap it in an agnostic wrapper, and this one seems to be wrapped. So the error reporting here by Heroku would appear to be truncating the message.
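
As a hedged aside, the untruncated text doesn't have to come from a log at all: SQLAlchemy attaches it to the exception object itself. This wrapper is illustrative, not Gittip's actual error handling.

    import logging
    from sqlalchemy.exc import DBAPIError

    log = logging.getLogger("gittip.sql")

    def run(query):
        try:
            return query.all()
        except DBAPIError as exc:
            # exc.statement and exc.params carry the full SQL and bind values;
            # exc.orig is the underlying psycopg2 OperationalError.
            log.error("query failed: %s params=%r orig=%r",
                      exc.statement, exc.params, exc.orig)
            raise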

@chadwhitacre
Contributor Author

@zzzeek Gotcha. It could also be Sentry doing the truncation. I'm going to dig in Heroku logs to see if the full message is in there.

@chadwhitacre
Contributor Author

I'm currently focused on comprehending SQLAlchemy.

@joonas
Contributor

joonas commented Feb 8, 2013

Do you think it would be worthwhile to open up a support ticket with Heroku to ask them whether there's anything on their end?

I've not used their offering so I'm not familiar with it; do you have access to any kind of database logs or such?

@chadwhitacre
Contributor Author

Good idea, @joonas. Check out what I'm finding in the ticket queue:

[image: Heroku support ticket queue]

@chadwhitacre
Contributor Author

They all say the same thing:

Our automated systems have detected a database issue.

Then:

Your database suffered a sudden loss of availability at 2013-02-08 05:46:20 UTC.
Automated processes are attempting to restore availability.
If they are unable to do so an engineer will be paged.
When availability is restored this ticket will be updated.

Finally:

Service has been restored at 2013-02-08 05:57:11 UTC.
We expect operations to continue normally, but feel free to notify us on this ticket if there are any outstanding issues.

@chadwhitacre
Contributor Author

The latest of those support tickets is still open, so I've commented on it.

@zzzeek

zzzeek commented Feb 8, 2013

On Feb 8, 2013, at 4:23 PM, Chad Whitacre [email protected] wrote:

something is specifically up with that query (need the full SQL, run EXPLAIN, etc.)

I don't think it's this, because people are experiencing problems intermittently. I would expect nearly every request to the site to fail if the problem were with the query.

Right, but if the query itself is very slow / heavy, it could perhaps be why timeout/memory issues are triggered. Or it might just be the most common query the application does on a regular basis. Usually when I write an app, I spend a lot of time just running it with SQL echoing on so I get to know its SQL behavior fairly intimately.
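
(For anyone following along: "SQL echoing" in SQLAlchemy is just a logging switch. A minimal sketch, with the database URL as a placeholder:)

    import logging
    from sqlalchemy import create_engine

    # Either ask for echoing when the engine is created ...
    engine = create_engine("postgresql://localhost/gittip", echo=True)

    # ... or flip the same logger on and off at runtime without touching the engine.
    logging.basicConfig()
    logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)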

@chadwhitacre
Contributor Author

Usually when I write an app, I spend a lot of time just running it with SQL echoing on so I get to know its SQL behavior fairly intimately.

You mean locally in development, or in production where you review the logs?

@zzzeek

zzzeek commented Feb 8, 2013

On Feb 8, 2013, at 5:12 PM, Chad Whitacre [email protected] wrote:

Usually when I write an app, I spend a lot of time just running it with SQL echoing on so I get to know its SQL behavior fairly intimately.

You mean locally in development, or in production where you review the logs?

Well, both. Usually in development I'm watching the logs a lot to make sure I'm not hitting the database in an unreasonable way. Then in production I'm usually logging SQL for some period of time, though it's less appropriate on an app that's getting hit a lot, as it slows things down and generates way too much output.

@zzzeek

zzzeek commented Feb 8, 2013

like in this case, you said the same query is coming out via like seven different code paths. That seems a little weird, from a cyclomatic POV.

@chadwhitacre
Contributor Author

like in this case, you said the same query is coming out via like seven different code paths. That seems a little weird, from a cyclomatic POV.

They're all places where we instantiate a user/participant object, which is central to the app. The query is a straight select * from participants where foo=bar.

@zzzeek

zzzeek commented Feb 8, 2013

On Feb 8, 2013, at 5:23 PM, Chad Whitacre [email protected] wrote:

like in this case, you said the same query is coming out via like seven different code paths. That seems a little weird, from a cyclomatic POV.

They're all places where we instantiate a user/participant object, which is central to the app. The query is a straight select * from participants where foo=bar.

OK, it's not a big deal. I'm more curious what this weird disconnect is all about, and it seems like something environment-specific.

@chadwhitacre
Contributor Author

I did tail the logs for a bit (per IRC w/ @joonas). It appears that log_min_duration_statement is set to 50ms or so, because I am seeing queries in the 90ms range and up, but certainly not all the queries in the app. I'm not seeing any queries matching what we're looking for with this ticket, suggesting that a heavy query isn't the problem here.
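
(A quick way to confirm that threshold from the app's own connection, rather than inferring it from which queries show up, assuming the usual DATABASE_URL environment variable that Heroku sets:)

    import os
    from sqlalchemy import create_engine, text

    engine = create_engine(os.environ["DATABASE_URL"])
    with engine.connect() as conn:
        # Returns the server-side slow-query threshold, e.g. "50ms"; statements
        # faster than this never make it into the Postgres log at all.
        print(conn.execute(text("SHOW log_min_duration_statement")).scalar())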

@chadwhitacre
Contributor Author

For completeness, here's the Heroku ticket: https://help.heroku.com/tickets/75571 [authorization required]

@joonas
Contributor

joonas commented Feb 8, 2013

For completeness, here's the Heroku ticket: https://help.heroku.com/tickets/75571 [authorization required]

I believe only you'll have access to this, btw (unless you add people as collaborators to the app)

@chadwhitacre
Contributor Author

Since this is only affecting a portion of users, I'm going to take a break for supper. My current thinking is still centered on these two options:

  • Gittip is mis-using SQLAlchemy. The way forward for me here is to comprehend SQLAlchemy and our use of it.
  • Heroku is letting us down. Let's see if they respond here or on the support ticket.

Thanks for the input on this, @zzzeek and @joonas.

bbiab

@chadwhitacre
Contributor Author

Is the query in question the first one performed for any given request? What does SQLAlchemy's unit of work marshalling look like and how does that play into this?

@zzzeek

zzzeek commented Feb 8, 2013

On Feb 8, 2013, at 5:57 PM, Chad Whitacre [email protected] wrote:

Is the query in question the first one performed for any given request? What does SQLAlchemy's unit of work marshalling look like and how does that play into this?

The unit of work isn't really involved in why you're getting a disconnection error. The UOW's interaction with the database is that you might see a bunch of INSERT/UPDATE/DELETE statements emitted on the connection if changes are pending, before a query. It doesn't affect the lifespan or scope of the connection, which always remains local to a single Session object.
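
To make that concrete, the usual way to keep a connection local to one request in a threaded server is a scoped_session that is removed when the request ends; this is a generic sketch, not Aspen's or Gittip's actual wiring.

    import os
    from sqlalchemy import create_engine, text
    from sqlalchemy.orm import scoped_session, sessionmaker

    engine = create_engine(os.environ["DATABASE_URL"])
    Session = scoped_session(sessionmaker(bind=engine))

    def do_work(session):
        # Stand-in for whatever ORM work the request actually does.
        return session.execute(text("SELECT 1")).scalar()

    def handle_request(request):
        session = Session()    # thread-local session, with its own connection checkout
        try:
            return do_work(session)
        finally:
            Session.remove()   # close the session; the connection goes back to the pool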

@gregburek

I have left a few comments on your Heroku ticket. I would very much like to get to the bottom of this.

@chadwhitacre
Contributor Author

Thanks @gregburek, much appreciated. For the record here: we are looking at a possible hardware issue and are going to migrate to new hardware.

@chadwhitacre
Contributor Author

From the support ticket:

At the moment, I do not see any errors on your app, but I did some digging and am seeing that your db's hardware has a long list of incidents and issues. I am now spinning up a follower for you to migrate to in order to rule out hardware problems for your application.

Once the follower is caught up, you could migrate over yourself by following the directions found here (https://devcenter.heroku.com/articles/heroku-postgres-follower-databases), or I could do it for you tomorrow. Let me know.

@chadwhitacre
Contributor Author

https://devcenter.heroku.com/articles/heroku-postgres-follower-databases#database-upgrades-and-migrations

[F]ollowers can also be used to change database plans with minimal downtime. At a high level, a follower is created in order to move your data from one database to another (which can be the same or different production tier database plans). Once it has received the majority of the data and is closely following your main database, you will prevent new data from being written (usually by enabling maintenance mode on your app). The follower will then fully catch up to the main database. The follower is then promoted to be the primary database for the application.

@chadwhitacre
Contributor Author

I've taken a local backup of the database and plan to cut over.

@chadwhitacre
Contributor Author

Okay, cutover is complete. I'm going to close this ticket. If it crops up again we can reopen/reticket.

Thanks for all your help @zzzeek @joonas @gregburek.

!m

@chadwhitacre
Contributor Author

Post-mortem:

  • I should've contacted Heroku sooner when the klaxons were going off re: db failures.
  • I've leveled up on SQLAlchemy internals.
  • Both SQLAlchemy and Heroku Postgres are well-supported. :-)
  • Gittip has great people looking after it. :-)

@chadwhitacre
Contributor Author

I've followed up with the original reporters on Twitter.
