
"DiskFull" error when refreshing db views for service requests #1403

Closed
nichhk opened this issue Oct 13, 2022 · 6 comments
Assignees: nichhk
Labels: Complexity: Missing (this ticket needs a complexity: good first issue, small, medium, or large) · Feature: Missing · Milestone: Missing · Role: Backend (related to API or other server-side work) · size: 3pt (can be done in 13-18 hours)

Comments

@nichhk
Member

nichhk commented Oct 13, 2022

Overview

Our nightly Prefect runs have been failing with this message:

    [2022-10-10 08:09:01+0000] ERROR - prefect.TaskRunner | Unexpected error: DiskFull('could not write to file "base/pgsql_tmp/pgsql_tmp30707.2": No space left on device\nCONTEXT: SQL statement "SELECT newdata.*::pg_temp_4.pg_temp_27736 FROM pg_temp_4.pg_temp_27736 newdata WHERE newdata.* IS NOT NULL AND EXISTS (SELECT 1 FROM pg_temp_4.pg_temp_27736 newdata2 WHERE newdata2.* IS NOT NULL AND newdata2.* OPERATOR(pg_catalog.*=) newdata.* AND newdata2.ctid OPERATOR(pg_catalog.<>) newdata.ctid)"\n')
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/prefect/engine/runner.py", line 48, in inner
        new_state = method(self, state, *args, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/prefect/engine/task_runner.py", line 869, in get_task_run_state
        logger=self.logger,
      File "/usr/local/lib/python3.7/site-packages/prefect/utilities/executors.py", line 299, in run_task_with_timeout
        return task.run(*args, **kwargs)  # type: ignore
      File "/app/tasks/postgres.py", line 216, in complete_load
        cursor.execute(f"REFRESH MATERIALIZED VIEW CONCURRENTLY {view};")
    psycopg2.errors.DiskFull: could not write to file "base/pgsql_tmp/pgsql_tmp30707.2": No space left on device


It looks like the command "REFRESH MATERIALIZED VIEW CONCURRENTLY {view};" is quite expensive because it builds an entire temporary copy of the materialized view's data so that reads of the view can continue while the refresh runs.
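For context, a rough way to see how much temporary-file space these refreshes burn through might be to check pg_stat_database before and after a nightly run. This is just a sketch; the connection details below are placeholders, not our actual config:

    import psycopg2

    # Placeholder connection details -- point these at the RDS instance.
    conn = psycopg2.connect(host="<rds-host>", dbname="<db>", user="<user>", password="<password>")
    with conn.cursor() as cur:
        # temp_files / temp_bytes are cumulative per-database counters, so a large
        # jump across the nightly refresh shows how much scratch space
        # REFRESH MATERIALIZED VIEW CONCURRENTLY is actually writing.
        cur.execute("""
            SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_written
            FROM pg_stat_database
            WHERE datname = current_database();
        """)
        print(cur.fetchone())
    conn.close()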

This post suggests that the query we are using to create this view is too inefficient, so we'll have to look into fixing it. The current query certainly looks slow: it joins 5 different tables. 4 of them are relatively small (~100 rows or fewer), but the last one, the requests table, is quite large--possibly millions of rows.

Action Items

  • Make the query used to build the materialized view more efficient (see the EXPLAIN sketch below)
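As a rough starting point for that item, the view's defining query can be pulled out of pg_matviews and run under EXPLAIN to see which join dominates. The view name and connection details below are placeholders, and note that ANALYZE actually executes the query:

    import psycopg2

    VIEW = "service_requests_view"  # placeholder name for the materialized view

    conn = psycopg2.connect(host="<rds-host>", dbname="<db>", user="<user>", password="<password>")
    with conn.cursor() as cur:
        # Fetch the SELECT statement that defines the materialized view.
        cur.execute("SELECT definition FROM pg_matviews WHERE matviewname = %s;", (VIEW,))
        definition = cur.fetchone()[0].strip().rstrip(";")
        # EXPLAIN (ANALYZE, BUFFERS) runs the query for real, so expect it to be slow;
        # the resulting plan should show which of the five joined tables hurts most.
        cur.execute(f"EXPLAIN (ANALYZE, BUFFERS) {definition}")
        for (line,) in cur.fetchall():
            print(line)
    conn.close()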

Resources/Instructions

@edwinjue
Member

Hi @nichhk, just trying to help you brainstorm and troubleshoot. I'm sure you've done a lot of investigating, but I was wondering whether something in particular led you to isolate this query as the root cause of the error. Do we know the date the errors began and what may have changed at that point in time?

Have we ruled out other causes, such as running out of space on AWS or in the Docker container itself?

Somebody had a similar issue that turned out to be the amount of space allocated to their Docker container. There are a lot of possible reasons this might happen, so perhaps it would be good to start with a process of elimination.

Also, is there a way for us to run the query ad hoc from a console to see the error happen in real time? I'd like to see how we can optimize the query if it comes down to that, but I'm not sure how to do so. Hope to somehow help you get to the bottom of this. Thanks!
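Maybe something roughly like this would be a starting point for an ad hoc run? (The connection details and view name in this sketch are placeholders, not the real ones.)

    import psycopg2

    conn = psycopg2.connect(host="<rds-host>", dbname="<db>", user="<user>", password="<password>")
    # A CONCURRENTLY refresh generally can't run inside an explicit transaction block,
    # so autocommit avoids psycopg2's implicit transaction.
    conn.autocommit = True
    with conn.cursor() as cur:
        # Placeholder view name; if temp space runs out, this should reproduce
        # the DiskFull error in real time.
        cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY service_requests_view;")
    conn.close()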

@edwinjue
Member

This may have started failing on October 5th, because Oct 4 - Oct 17 (the current date) is the last date range that returns data on the map:

Oct 4 - Oct 17 (data exists):
[screenshot]

Oct 5 - Oct 17 (data does not exist):
[screenshot]

Perhaps we can inspect what happened differently during the nightly batch on Oct 4th.

@nichhk
Member Author

nichhk commented Oct 21, 2022

Thanks for taking a look, Edwin. I also noticed that the requests disappeared around that date. Unfortunately, the logs are only persisted for one week, so I wasn't able to look at the logs for 10/4.

AFAICT, nothing has changed w.r.t. the data loading pipeline or the related infra in a long while. So the hypothesis that the query is causing the issue makes sense--the amount of data the query needs to scan has been growing steadily each day, and it has now outgrown the available disk space.

The other suggestions that you brought up are helpful! I don't think Docker is the issue, since all of the postgres temp files should be stored on AWS RDS (see "Short Description" from the AWS RDS link you shared), not some Docker container.

I followed the suggestion from the AWS RDS link to look at FreeStorageSpace. We can see that this value is indeed declining, and that it drops sharply every day around 8am, which is when the daily data pipeline runs.
[screenshot: rds-free-storage-space]
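For anyone who wants to pull the same metric without the console, a boto3 call might look roughly like this (the region and instance identifier are placeholders):

    from datetime import datetime, timedelta

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="<region>")
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="FreeStorageSpace",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "<db-instance-id>"}],
        StartTime=datetime.utcnow() - timedelta(days=7),
        EndTime=datetime.utcnow(),
        Period=3600,             # hourly datapoints
        Statistics=["Minimum"],  # worst-case free space within each hour
    )
    # FreeStorageSpace is reported in bytes.
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Minimum"] / 1e9, 2), "GB free")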

The article offers an easier solution than optimizing the query: enable storage autoscaling for your instance. I'll try to enable that and hopefully that fixes the issue.

@nichhk
Member Author

nichhk commented Oct 21, 2022

I enabled autoscaling and set the max storage threshold to 100GB; the instance is currently at 20GB.

[screenshot: db-increase]

We can check tomorrow morning whether the Prefect run succeeds to determine whether this fixed the issue.
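For reference, the equivalent change via boto3 would look roughly like this (region and instance identifier are placeholders):

    import boto3

    rds = boto3.client("rds", region_name="<region>")
    # Setting MaxAllocatedStorage enables storage autoscaling up to that ceiling (in GiB).
    rds.modify_db_instance(
        DBInstanceIdentifier="<db-instance-id>",
        MaxAllocatedStorage=100,
        ApplyImmediately=True,
    )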

@nichhk nichhk self-assigned this Oct 21, 2022
@nichhk nichhk added Role: Backend Related to API or other server-side work size: 3pt Can be done in 13-18 hours labels Oct 21, 2022
@nichhk
Member Author

nichhk commented Oct 24, 2022

The auto-scaling didn't work--it only triggers when some specific conditions are met. So I just increased the allocated storage explicitly to 50GB, and now the Prefect run succeeded!

[screenshot: map-has-data]
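For the record, bumping the allocated storage directly can be done with the same boto3 call (region and instance identifier are placeholders):

    import boto3

    rds = boto3.client("rds", region_name="<region>")
    # Explicitly grow the instance's allocated storage to 50 GiB.
    rds.modify_db_instance(
        DBInstanceIdentifier="<db-instance-id>",
        AllocatedStorage=50,
        ApplyImmediately=True,
    )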

@nichhk nichhk closed this as completed Oct 24, 2022
@edwinjue
Member

Nice!

@ExperimentsInHonesty ExperimentsInHonesty moved this to Done (without merge) in P: 311: Project Board Jun 7, 2024
@cottonchristopher cottonchristopher added the Complexity: Missing This ticket needs a complexity (good first issue, small, medium, or large) label Feb 9, 2025