This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

investigate reports of instability since redesign #626

Closed
chadwhitacre opened this issue Feb 8, 2013 · 89 comments
@ghost ghost assigned chadwhitacre Feb 8, 2013
@dmitshur
Contributor

dmitshur commented Feb 8, 2013

I can confirm this. It was working well for me yesterday during the day, but started giving problems at night (i.e. ~12 hours before this post). I couldn't log in; refreshing consistently showed an "Application error" instead of the website during some periods.

Today it's been behaving properly so far.

@chadwhitacre
Contributor Author

OperationalError: (OperationalError) SSL SYSCALL error: EOF detected
 'SELECT participants.id AS participants_id, participants.statement AS participants_statement, participants.stripe_customer_id AS participants_stripe_customer_id, participants.last_bill_result AS participants_last_bill_result, participants.session_token AS participants_session_token, participants.session_expires AS participants_s

Stacktrace (most recent call last):

  File "aspen/website.py", line 66, in handle_safely
    response = self.handle(request)
  File "aspen/website.py", line 99, in handle
    response = request.resource.respond(request)
  File "aspen/resources/dynamic_resource.py", line 47, in respond
    exec self.pages[1] in context
  File "/app/www/on/github/associate", line 38, in <module>
    user = account.opt_in(login)  # setting 'user' gives them a session
  File "gittip/elsewhere/__init__.py", line 71, in opt_in
    user = User.from_id(self.participant_id)  # give them a session
  File "gittip/models/user.py", line 27, in from_id
    user = User.query.filter_by(id=user_id).first()
  File "sqlalchemy/orm/query.py", line 2181, in first
    ret = list(self[0:1])
  File "sqlalchemy/orm/query.py", line 2048, in __getitem__
    return list(res)
  File "sqlalchemy/orm/query.py", line 2252, in __iter__
    return self._execute_and_instances(context)
  File "sqlalchemy/orm/query.py", line 2267, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "sqlalchemy/engine/base.py", line 664, in execute
    params)
  File "sqlalchemy/engine/base.py", line 764, in _execute_clauseelement
    compiled_sql, distilled_params
  File "sqlalchemy/engine/base.py", line 878, in _execute_context
    context)
  File "sqlalchemy/engine/base.py", line 871, in _execute_context
    context)
  File "sqlalchemy/engine/default.py", line 320, in do_execute
    cursor.execute(statement, parameters)

@chadwhitacre
Contributor Author

There are a few different instances of this appearing separately in Sentry. I'm going through each to compare them to make sure they're all the same thing.

@bruceadams
Contributor

Am I correct that "EOF detected" here means a lost network connection to the database?

@chadwhitacre
Contributor Author

@bruceadams Seems likely.

@chadwhitacre
Contributor Author

There are 11 instances of this in Sentry. They hit various simplates. They all have the same SQL query, which is from the auth code that we reach via various code paths.

@joonas
Contributor

joonas commented Feb 8, 2013

From https://devcenter.heroku.com/articles/postgres-logs-errors#pgerror-ssl-syscall-error-eof-detected:

PG::Error: SSL SYSCALL error: EOF detected

This error is typically Postgres saying “there was a client side violation of the wire protocol”. This happens for one of two reasons in our experience:

  • Sharing a Postgres connection from more than one process or thread. Typical offenders are Resque workers or Unicorn. The fix there is to reestablish ActiveRecord connections on the after_fork hooks of either library.
  • Abrupt client (application side) disconnections. This can happen for many reasons, from your app crashing, to transient network availability. When your app tries to issue a query again against Postgres, the connection is just gone, leading to a crash. When Heroku detects a crash, we kill that dyno and start a new one, which re-establishes the connection.

I don't know enough about Aspen to make any educated guesses about whether it could be the first reason they list, but the second reason, while broad, seems more than plausible.

Now the question remains: where do we even begin to debug this?
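
For context, here is a minimal, hedged sketch of the kind of defensive engine setup Heroku's first bullet points at: recycle pooled connections so long-idle ones aren't reused after they've been dropped, ping connections on checkout so a dead one is replaced instead of failing the request, and dispose of the pool after any fork. The names and the ping recipe follow the standard SQLAlchemy pessimistic-disconnect pattern; none of this is Gittip's actual configuration.

    import os
    from sqlalchemy import create_engine, event, exc
    from sqlalchemy.pool import Pool

    # Recycle connections periodically so long-idle ones aren't reused after the
    # server (or something in between) has silently dropped them.
    engine = create_engine(os.environ["DATABASE_URL"], pool_recycle=300)

    @event.listens_for(Pool, "checkout")
    def ping_connection(dbapi_conn, conn_record, conn_proxy):
        # Pessimistic ping: if the connection is dead, tell the pool to throw it
        # away and retry with a fresh one instead of failing the request.
        cursor = dbapi_conn.cursor()
        try:
            cursor.execute("SELECT 1")
        except Exception:
            raise exc.DisconnectionError()
        finally:
            cursor.close()

    # If the server forks worker processes, each child should call engine.dispose()
    # right after the fork so it never shares the parent's sockets (Heroku's
    # "after_fork" advice, translated to SQLAlchemy).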

@chadwhitacre
Contributor Author

I'm going to check Google Analytics to get a sense of how loaded up we are.

@chadwhitacre
Contributor Author

Load is normal.

@ambv

ambv commented Feb 8, 2013

Currently working just fine in terms of auth.

@chadwhitacre
Contributor Author

joonas: have you seen this before the redesign?
joonas: I mean, is there anything in the logs before that?

I said no in IRC, but I was wrong.

First Seen: 3 days ago

@chadwhitacre
Contributor Author

I ignored it at the time because I see various db-connection-dropping errors in Sentry, usually in spurts overnight, which I assume are Heroku doing housekeeping and moving the app around or whatever. I chalked this up to that and figured that if it wasn't, we'd find out. Which we did. :-)

@chadwhitacre
Contributor Author

I just invited @zzzeek to join us here.

@chadwhitacre
Contributor Author

Okay, let's explore Heroku's option 1.

@joonas
Contributor

joonas commented Feb 8, 2013

Load is normal.

What constitutes normal?

I said no in IRC, but I was wrong.

First Seen: 3 days ago

Was this the same code that was causing problems, or different?

Okay, let's explore Heroku's option 1.

How is Aspen set up to fork processes/threads in production?

@chadwhitacre
Contributor Author

Load (note that the final day is partial):

[image: load graph]

@joonas
Contributor

joonas commented Feb 8, 2013

Load (note that the final day is partial):

Any thoughts on what this might translate to in terms of concurrent users on the site?

Google Analytics has a real-time reporting view that shows activity/people on the site, under "Standard Reports" in "Reporting".

@chadwhitacre
Contributor Author

Was this the same code that was causing problems, or different?

As mentioned, we have 11 instances in Sentry, though those are sampled and aggregated (and I'm not 100% clear on how that works). One of the aggregates has 4,200 instances; the next-most-common has 81. Let me collect the code paths in question ...

@chadwhitacre
Contributor Author

However, this is nearly all for /%participant/widget.html.

[image: concurrent visitors screenshot]

@zzzeek

zzzeek commented Feb 8, 2013

Well, one thing I'd note is that I've never seen that error before; we do detect errors that mean "disconnection", and that is nothing like them. Are we using psycopg2 / libpq with Heroku here? Is there some middleware / pgbouncer type of thing in use?

@chadwhitacre
Contributor Author

It's interesting that no matter the code path, it's always the same SQL query in the failure.

@zzzeek

zzzeek commented Feb 8, 2013

Can you paste the full SQL text of that query, and also what does EXPLAIN give you for it?

@chadwhitacre
Contributor Author

Are we using psycopg2 / libpq with heroku here ?

@zzzeek Yes.

is there some middleware / pgbouncer type of thing in use ?

I don't know.
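
(For what it's worth, the driver half of that question is easy to confirm from a one-off dyno, e.g. with heroku run python; this only assumes psycopg2 is importable there.)

    import psycopg2
    # The version string includes build flags, e.g. "2.4.5 (dt dec pq3 ext)";
    # "pq3" means the driver was built against the v3 libpq protocol.
    print(psycopg2.__version__)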

@chadwhitacre
Contributor Author

What did we roll out on Feb 3?

@chadwhitacre
Contributor Author

can you paste out the full SQL text of that query [...] ?

The paste above is everything I have to go on from Sentry. Are you suggesting there's another place I can pull that query text from?

@joonas
Contributor

joonas commented Feb 8, 2013

What did we roll out on Feb 3?

Heroku keeps a log of activity for the app; check it out under https://dashboard.heroku.com/apps/

You can also have Heroku track the hashes by linking the code to the repository on GitHub:

To link your deploy hashes to GitHub, add a repository on the settings page (of the Heroku app).

@chadwhitacre
Contributor Author

6.1.19 on Feb 1 is the last deploy prior to Feb 4.

@zzzeek

zzzeek commented Feb 8, 2013

Just so we're on the same page: the paste above ends the query at "participants.session_expires AS participants_s", with no closing quote, so it seems to be truncated. SQLAlchemy's exception messages include the full query. OperationalError is a DBAPI exception raised by psycopg2, but we wrap it in an agnostic wrapper, and this one seems to be wrapped. So the error reporting here by Heroku would appear to be truncating the message.
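
As a hedged aside, the untruncated text doesn't have to come from a log at all: SQLAlchemy attaches it to the exception object itself. This wrapper is illustrative, not Gittip's actual error handling.

    import logging
    from sqlalchemy.exc import DBAPIError

    log = logging.getLogger("gittip.sql")

    def run(query):
        try:
            return query.all()
        except DBAPIError as exc:
            # exc.statement and exc.params carry the full SQL and bind values;
            # exc.orig is the underlying psycopg2 OperationalError.
            log.error("query failed: %s params=%r orig=%r",
                      exc.statement, exc.params, exc.orig)
            raise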

@chadwhitacre
Contributor Author

@zzzeek Gotcha. It could also be Sentry doing the truncation. I'm going to dig in Heroku logs to see if the full message is in there.

@chadwhitacre
Contributor Author

I'm currently focused on comprehending SQLAlchemy.

@joonas
Contributor

joonas commented Feb 8, 2013

Do you think it would be worthwhile to open up a support ticket with Heroku to ask them whether there's anything on their end?

I've not used their offering so I'm not familiar with it; do you have access to any kind of database logs or such?

@chadwhitacre
Contributor Author

Good idea, @joonas. Check out what I'm finding in the ticket queue:

[image: Heroku support ticket queue]

@chadwhitacre
Contributor Author

They all say the same thing:

Our automated systems have detected a database issue.

Then:

Your database suffered a sudden loss of availability at 2013-02-08 05:46:20 UTC.
Automated processes are attempting to restore availability.
If they are unable to do so an engineer will be paged.
When availability is restored this ticket will be updated.

Finally:

Service has been restored at 2013-02-08 05:57:11 UTC.
We expect operations to continue normally, but feel free to notify us on this ticket if there are any outstanding issues.

@chadwhitacre
Contributor Author

The latest of those support tickets is still open, so I've commented on it.

@zzzeek

zzzeek commented Feb 8, 2013

On Feb 8, 2013, at 4:23 PM, Chad Whitacre [email protected] wrote:

something is specifically up with that query (need the full SQL, run EXPLAIN, etc.)

I don't think it's this, because people are experiencing problems intermittently. I would expect nearly every request to the site to fail if the problem were with the query.

Right, but if the query itself is very slow / heavy, it could perhaps be why timeout/memory issues are triggered. Or it might just be the most common query the application does on a regular basis. Usually when I write an app, I spend a lot of time just running it with SQL echoing on so I get to know its SQL behavior fairly intimately.
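
(For anyone following along: "SQL echoing" in SQLAlchemy is just a logging switch. A minimal sketch, with the database URL as a placeholder:)

    import logging
    from sqlalchemy import create_engine

    # Either ask for echoing when the engine is created ...
    engine = create_engine("postgresql://localhost/gittip", echo=True)

    # ... or flip the same logger on and off at runtime without touching the engine.
    logging.basicConfig()
    logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)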

@chadwhitacre
Contributor Author

Usually when I write an app, I spend a lot of time just running it with SQL echoing on so I get to know its SQL behavior fairly intimately.

You mean locally in development, or in production where you review the logs?

@zzzeek

zzzeek commented Feb 8, 2013

On Feb 8, 2013, at 5:12 PM, Chad Whitacre [email protected] wrote:

Usually when I write an app, I spend a lot of time just running it with SQL echoing on so I get to know its SQL behavior fairly intimately.

You mean locally in development, or in production where you review the logs?

Well, both. Usually in development I'm watching the logs a lot to make sure I'm not hitting the database in an unreasonable way. Then in production I'm usually logging SQL for some period of time, though it's less appropriate on an app that's getting hit a lot, as it slows things down and generates way too much output.

@zzzeek

zzzeek commented Feb 8, 2013

like in this case, you said the same query is coming out via like seven different code paths. That seems a little weird, from a cyclomatic POV.

@chadwhitacre
Contributor Author

like in this case, you said the same query is coming out via like seven different code paths. That seems a little weird, from a cyclomatic POV.

They're all places where we instantiate a user/participant object, which is central to the app. The query is a straight select * from participants where foo=bar.

@zzzeek

zzzeek commented Feb 8, 2013

On Feb 8, 2013, at 5:23 PM, Chad Whitacre [email protected] wrote:

like in this case, you said the same query is coming out via like seven different code paths. That seems a little weird, from a cyclomatic POV.

They're all places where we instantiate a user/participant object, which is central to the app. The query is a straight select * from participants where foo=bar.

OK, it's not a big deal. I'm more curious what this weird disconnect is all about, and it seems like something environment-specific.

@chadwhitacre
Contributor Author

I did tail the logs for a bit (per IRC w/ @joonas). It appears that log_min_duration_statement is set to 50ms or so, because I am seeing queries in the 90ms range and up, but certainly not all the queries in the app. I'm not seeing any queries matching what we're looking for with this ticket, suggesting that a heavy query isn't the problem here.
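
(A quick way to confirm that threshold from the app's own connection, rather than inferring it from which queries show up, assuming the usual DATABASE_URL environment variable that Heroku sets:)

    import os
    from sqlalchemy import create_engine, text

    engine = create_engine(os.environ["DATABASE_URL"])
    with engine.connect() as conn:
        # Returns the server-side slow-query threshold, e.g. "50ms"; statements
        # faster than this never make it into the Postgres log at all.
        print(conn.execute(text("SHOW log_min_duration_statement")).scalar())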

@chadwhitacre
Contributor Author

For completeness, here's the Heroku ticket: https://help.heroku.com/tickets/75571 [authorization required]

@joonas
Contributor

joonas commented Feb 8, 2013

For completeness, here's the Heroku ticket: https://help.heroku.com/tickets/75571 [authorization required]

I believe only you'll have access to this, btw (unless you add people as collaborators to the app)

@chadwhitacre
Contributor Author

Since this is only affecting a portion of users, I'm going to take a break for supper. My current thinking is still centered on these two options:

  • Gittip is mis-using SQLAlchemy. The way forward for me here is to comprehend SQLAlchemy and our use of it.
  • Heroku is letting us down. Let's see if they respond here or on the support ticket.

Thanks for the input on this, @zzzeek and @joonas.

bbiab

@chadwhitacre
Contributor Author

Is the query in question the first one performed for any given request? What does SQLAlchemy's unit of work marshalling look like and how does that play into this?

@zzzeek

zzzeek commented Feb 8, 2013

On Feb 8, 2013, at 5:57 PM, Chad Whitacre [email protected] wrote:

Is the query in question the first one performed for any given request? What does SQLAlchemy's unit of work marshalling look like and how does that play into this?

The unit of work isn't really involved in why you're getting a disconnection error. The UOW's interaction with the database is that you might see a bunch of INSERT/UPDATE/DELETE statements emitted on the connection if changes are pending, before a query. It doesn't affect the lifespan or scope of the connection, which always remains local to a single Session object.
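
To make that concrete, the usual way to keep a connection local to one request in a threaded server is a scoped_session that is removed when the request ends; this is a generic sketch, not Aspen's or Gittip's actual wiring.

    import os
    from sqlalchemy import create_engine, text
    from sqlalchemy.orm import scoped_session, sessionmaker

    engine = create_engine(os.environ["DATABASE_URL"])
    Session = scoped_session(sessionmaker(bind=engine))

    def do_work(session):
        # Stand-in for whatever ORM work the request actually does.
        return session.execute(text("SELECT 1")).scalar()

    def handle_request(request):
        session = Session()    # thread-local session, with its own connection checkout
        try:
            return do_work(session)
        finally:
            Session.remove()   # close the session; the connection goes back to the pool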

@gregburek

I have left a few comments on your Heroku ticket. I would very much like to get to the bottom of this.

@chadwhitacre
Contributor Author

Thanks @gregburek, much appreciated. For the record here: we are looking at a possible hardware issue and are going to migrate to new hardware.

@chadwhitacre
Contributor Author

From the support ticket:

At the moment, I do not see any errors on your app, but I did some digging and am seeing that your db's hardware has a long list of incidents and issues. I am now spinning up a follower for you to migrate to in order to rule out hardware problems for your application.

Once the follower is caught up, you could migrate over yourself by following the directions found here (https://devcenter.heroku.com/articles/heroku-postgres-follower-databases), or I could do it for you tomorrow. Let me know.

@chadwhitacre
Contributor Author

https://devcenter.heroku.com/articles/heroku-postgres-follower-databases#database-upgrades-and-migrations

[F]ollowers can also be used to change database plans with minimal downtime. At a high level, a follower is created in order to move your data from one database to another (which can be the same or different production tier database plans). Once it has received the majority of the data and is closely following your main database, you will prevent new data from being written (usually by enabling maintenance mode on your app). The follower will then fully catch up to the main database. The follower is then promoted to be the primary database for the application.

@chadwhitacre
Contributor Author

I've taken a local backup of the database and plan to cut over.

@chadwhitacre
Contributor Author

Okay, cutover is complete. I'm going to close this ticket. If it crops up again we can reopen/reticket.

Thanks for all your help @zzzeek @joonas @gregburek.

!m

@chadwhitacre
Contributor Author

Post-mortem:

  • I should've contacted Heroku sooner when the klaxons were going off re: db failures.
  • I've leveled up on SQLAlchemy internals.
  • Both SQLAlchemy and Heroku Postgres are well-supported. :-)
  • Gittip has great people looking after it. :-)

@chadwhitacre
Contributor Author

I've followed up with the original reporters on Twitter.
