
"DiskFull" error when refreshing db views for service requests #1403

Closed
nichhk opened this issue Oct 13, 2022 · 6 comments
Assignees: nichhk
Labels: Complexity: Missing (this ticket needs a complexity: good first issue, small, medium, or large) · Feature: Missing · Milestone: Missing · Role: Backend (related to API or other server-side work) · size: 3pt (can be done in 13-18 hours)

Comments

@nichhk
Member

nichhk commented Oct 13, 2022

Overview

Our nightly Prefect runs have been failing with this message:

    [2022-10-10 08:09:01+0000] ERROR - prefect.TaskRunner | Unexpected error: DiskFull('could not write to file "base/pgsql_tmp/pgsql_tmp30707.2": No space left on device\nCONTEXT: SQL statement "SELECT newdata.*::pg_temp_4.pg_temp_27736 FROM pg_temp_4.pg_temp_27736 newdata WHERE newdata.* IS NOT NULL AND EXISTS (SELECT 1 FROM pg_temp_4.pg_temp_27736 newdata2 WHERE newdata2.* IS NOT NULL AND newdata2.* OPERATOR(pg_catalog.*=) newdata.* AND newdata2.ctid OPERATOR(pg_catalog.<>) newdata.ctid)"\n')
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/prefect/engine/runner.py", line 48, in inner
        new_state = method(self, state, *args, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/prefect/engine/task_runner.py", line 869, in get_task_run_state
        logger=self.logger,
      File "/usr/local/lib/python3.7/site-packages/prefect/utilities/executors.py", line 299, in run_task_with_timeout
        return task.run(*args, **kwargs)  # type: ignore
      File "/app/tasks/postgres.py", line 216, in complete_load
        cursor.execute(f"REFRESH MATERIALIZED VIEW CONCURRENTLY {view};")
    psycopg2.errors.DiskFull: could not write to file "base/pgsql_tmp/pgsql_tmp30707.2": No space left on device


It looks like the command "REFRESH MATERIALIZED VIEW CONCURRENTLY {view};" is quite expensive because it builds an entire temporary copy of the materialized view's data so that reads of the view can continue while the refresh runs.
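For context, a rough way to see how much temporary-file space these refreshes burn through might be to check pg_stat_database before and after a nightly run. This is just a sketch; the connection details below are placeholders, not our actual config:

    import psycopg2

    # Placeholder connection details -- point these at the RDS instance.
    conn = psycopg2.connect(host="<rds-host>", dbname="<db>", user="<user>", password="<password>")
    with conn.cursor() as cur:
        # temp_files / temp_bytes are cumulative per-database counters, so a large
        # jump across the nightly refresh shows how much scratch space
        # REFRESH MATERIALIZED VIEW CONCURRENTLY is actually writing.
        cur.execute("""
            SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_written
            FROM pg_stat_database
            WHERE datname = current_database();
        """)
        print(cur.fetchone())
    conn.close()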

This post suggests that the query we are using to create this view is too inefficient, so we'll have to look into fixing it. The current query certainly looks slow: it joins 5 different tables. 4 of them are relatively small (~100 rows or fewer), but the last one, the requests table, is quite large--possibly millions of rows.

Action Items

  • Make the query used to build the materialized view more efficient (see the EXPLAIN sketch below)
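As a rough starting point for that item, the view's defining query can be pulled out of pg_matviews and run under EXPLAIN to see which join dominates. The view name and connection details below are placeholders, and note that ANALYZE actually executes the query:

    import psycopg2

    VIEW = "service_requests_view"  # placeholder name for the materialized view

    conn = psycopg2.connect(host="<rds-host>", dbname="<db>", user="<user>", password="<password>")
    with conn.cursor() as cur:
        # Fetch the SELECT statement that defines the materialized view.
        cur.execute("SELECT definition FROM pg_matviews WHERE matviewname = %s;", (VIEW,))
        definition = cur.fetchone()[0].strip().rstrip(";")
        # EXPLAIN (ANALYZE, BUFFERS) runs the query for real, so expect it to be slow;
        # the resulting plan should show which of the five joined tables hurts most.
        cur.execute(f"EXPLAIN (ANALYZE, BUFFERS) {definition}")
        for (line,) in cur.fetchall():
            print(line)
    conn.close()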

Resources/Instructions

@edwinjue
Member

Hi @nichhk, just trying to help you brainstorm and troubleshoot. I'm sure you've done a lot of investigating, but I was wondering whether something in particular led you to isolate this query as the root cause of the error. Do we know the date the errors began and what may have changed at that point in time?

Have we ruled out other causes, such as running out of space on AWS or in the Docker container itself?

Somebody had a similar issue that turned out to be the amount of space allocated to their Docker container. There are a lot of possible reasons this might happen, so perhaps it would be good to start with a process of elimination.

Also, is there a way for us to run the query ad hoc from a console to see the error happen in real time? I'd like to see how we can optimize the query if it comes down to that, but I'm not sure how to do so. Hope to somehow help you get to the bottom of this. Thanks!
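Maybe something roughly like this would be a starting point for an ad hoc run? (The connection details and view name in this sketch are placeholders, not the real ones.)

    import psycopg2

    conn = psycopg2.connect(host="<rds-host>", dbname="<db>", user="<user>", password="<password>")
    # A CONCURRENTLY refresh generally can't run inside an explicit transaction block,
    # so autocommit avoids psycopg2's implicit transaction.
    conn.autocommit = True
    with conn.cursor() as cur:
        # Placeholder view name; if temp space runs out, this should reproduce
        # the DiskFull error in real time.
        cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY service_requests_view;")
    conn.close()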

@edwinjue
Member

This may have started failing on October 5th, because Oct 4 - Oct 17 (the current date) is the last date range that returns data on the map:

Oct 4 - Oct 17 (data exists):
[screenshot]

Oct 5 - Oct 17 (data does not exist):
[screenshot]

Perhaps we can inspect what happened differently during the nightly batch on Oct 4th.

@nichhk
Member Author

nichhk commented Oct 21, 2022

Thanks for taking a look, Edwin. I also noticed that the requests disappeared around that date. Unfortunately, the logs are only persisted for one week, so I wasn't able to look at the logs for 10/4.

AFAICT, nothing has changed w.r.t. the data loading pipeline or the related infra in a long while. So the hypothesis that the query is causing the issue makes sense--the amount of data the query needs to scan has been growing steadily each day, and it has now outgrown the available disk space.

The other suggestions that you brought up are helpful! I don't think Docker is the issue, since all of the postgres temp files should be stored on AWS RDS (see "Short Description" from the AWS RDS link you shared), not some Docker container.

I followed the suggestion from the AWS RDS link to look at FreeStorageSpace. We can see that this value is indeed declining, and that it drops sharply every day around 8am, which is when the daily data pipeline runs.
[screenshot: rds-free-storage-space]
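For anyone who wants to pull the same metric without the console, a boto3 call might look roughly like this (the region and instance identifier are placeholders):

    from datetime import datetime, timedelta

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="<region>")
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="FreeStorageSpace",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "<db-instance-id>"}],
        StartTime=datetime.utcnow() - timedelta(days=7),
        EndTime=datetime.utcnow(),
        Period=3600,             # hourly datapoints
        Statistics=["Minimum"],  # worst-case free space within each hour
    )
    # FreeStorageSpace is reported in bytes.
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Minimum"] / 1e9, 2), "GB free")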

The article offers an easier solution than optimizing the query: enable storage autoscaling for your instance. I'll try to enable that and hopefully that fixes the issue.

@nichhk
Member Author

nichhk commented Oct 21, 2022

I enabled autoscaling and set the max storage threshold to 100GB; the instance is currently at 20GB.

[screenshot: db-increase]

We can check tomorrow morning whether the Prefect run succeeds to determine whether this fixed the issue.
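For reference, the equivalent change via boto3 would look roughly like this (region and instance identifier are placeholders):

    import boto3

    rds = boto3.client("rds", region_name="<region>")
    # Setting MaxAllocatedStorage enables storage autoscaling up to that ceiling (in GiB).
    rds.modify_db_instance(
        DBInstanceIdentifier="<db-instance-id>",
        MaxAllocatedStorage=100,
        ApplyImmediately=True,
    )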

@nichhk nichhk self-assigned this Oct 21, 2022
@nichhk nichhk added Role: Backend Related to API or other server-side work size: 3pt Can be done in 13-18 hours labels Oct 21, 2022
@nichhk
Member Author

nichhk commented Oct 24, 2022

The auto-scaling didn't work--it only triggers when some specific conditions are met. So I just increased the allocated storage explicitly to 50GB, and now the Prefect run succeeded!

[screenshot: map-has-data]
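For the record, bumping the allocated storage directly can be done with the same boto3 call (region and instance identifier are placeholders):

    import boto3

    rds = boto3.client("rds", region_name="<region>")
    # Explicitly grow the instance's allocated storage to 50 GiB.
    rds.modify_db_instance(
        DBInstanceIdentifier="<db-instance-id>",
        AllocatedStorage=50,
        ApplyImmediately=True,
    )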

@nichhk nichhk closed this as completed Oct 24, 2022
@edwinjue
Member

Nice!

@ExperimentsInHonesty ExperimentsInHonesty moved this to Done (without merge) in P: 311: Project Board Jun 7, 2024
@cottonchristopher cottonchristopher added the Complexity: Missing This ticket needs a complexity (good first issue, small, medium, or large) label Feb 9, 2025