Baseline metrics - part 1: set up read replicas on stage #3618
Comments
@pkfec Did we talk about group testing on this one - each developer taking care of one endpoint? Or does each developer come up with some test cases for each endpoint, and I put things together and automate the metrics collection process?
Since most of the slow queries are from schedule_a and schedule_b, I'll make some test cases for those endpoints first.
@qqss88 The idea was to test each API endpoint after lifting the 2-year restrictions. We talked about testing the endpoints by distributing them among the developers. If your PR is reviewed and ready to be merged, we can assign the endpoints in the same PR, ping the developers, and ask them to test.
Top 10 slow queries reported last September:
Here is a link from a previous Locust test for reference:
Added sched_a_multi_cycle testing to the Locust file, fixed some data issues with the existing Locust file, and tested it against prod and stage. Will summarize the results later on.
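For context, here is a minimal sketch of what a multi-cycle schedule_a task can look like in a Locust file. It assumes the Locust >= 1.0 HttpUser API; the host, API-key handling, cycle values, and query parameters are illustrative assumptions, not the repo's actual locustfile:

```python
# Minimal sketch of a multi-cycle schedule_a Locust task (Locust >= 1.0 HttpUser
# API). Host, API-key handling, cycles, and parameter values are assumptions
# for illustration, not the repo's actual locustfile contents.
import os

from locust import HttpUser, task, between

CYCLES = [2014, 2016, 2018]  # assumed "most recent 3 cycles" under test


class ScheduleAUser(HttpUser):
    wait_time = between(1, 5)

    @task
    def sched_a_multi_cycle(self):
        # Hit /schedules/schedule_a/ across several two-year periods at once.
        params = {
            "api_key": os.environ.get("FEC_API_KEY", "DEMO_KEY"),
            "two_year_transaction_period": CYCLES,
            "per_page": 100,
            "sort": "-contribution_receipt_date",
        }
        self.client.get("/v1/schedules/schedule_a/", params=params,
                        name="schedule_a multi-cycle")
```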
Moving to sprint 8.6 to do additional performance baselines in stage after API release cut today. |
@patphongs @qqss88 are you still doing performance baseline testing in stage?
@qqss88 I'd still like to see a concise summary of baseline and testing results. |
For sched_a and sched_b queries, because of the amount of data we have, the factors affecting query performance are mainly (assuming server resource usage and DB caching are not factors here):
Considering those factors, data skewness for a certain argument value plays an important role when the query is submitted. I did some data skewness analysis on the sched_a and sched_b arguments involved in the API; here is a quick summary. Top skewed argument values in sched_a:
As a result, most queries with a single top argument value plus the most recent 3 cycles load OK (from seconds to around 1 min). But these 3 queries do not load within the timeout window (502 error):
sched_b: general performance is much better (I think this is mainly due to the smaller amount of data compared to sched_a). Here is a quick summary of data skewness:
Performance is generally good; a top single argument value with the 14-16-18 cycles comes back within seconds. But one combination filter failed:
This test is database-oriented; it is not a thorough test that identifies all slow queries. Adding some server-side resource consumption metrics would make it better.
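As a rough illustration of the skewness analysis described above (not the exact query that was run), a per-value row count like the following is one way to see which argument values dominate; the table and column names are assumptions:

```python
# Rough sketch of a data-skewness check: count rows per argument value and see
# which values dominate. Table and column names (disclosure.fec_fitem_sched_a,
# contbr_st) are assumptions for illustration.
import psycopg2

QUERY = """
    SELECT contbr_st AS argument_value, COUNT(*) AS n_rows
    FROM disclosure.fec_fitem_sched_a
    GROUP BY contbr_st
    ORDER BY n_rows DESC
    LIMIT 10;
"""


def top_skewed_values(dsn):
    """Return the 10 most frequent values for one schedule_a argument."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            return cur.fetchall()


if __name__ == "__main__":
    for value, count in top_skewed_values("postgresql://localhost/fec"):
        print(f"{value}\t{count}")
```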
Thanks @qqss88
What we're after: We would like to start tracking our performance through the use of metrics. But before we start, we need to take a baseline measurement.
Completion criteria:
Test against one read DB replica of stage. Increase replicas based on schedule_a and schedule_b endpoint performance testing.
Take a performance baseline of all the endpoints and document it (see the timing sketch after the endpoint lists below)
Test slower query endpoints: schedule_a and schedule_b
Receipts - 16
/committee/{committee_id}/schedules/schedule_a/by_employer/
/committee/{committee_id}/schedules/schedule_a/by_occupation/
/committee/{committee_id}/schedules/schedule_a/by_size/
/committee/{committee_id}/schedules/schedule_a/by_state/
/committee/{committee_id}/schedules/schedule_a/by_zip/
/schedules/schedule_a/
/schedules/schedule_a/by_employer/
/schedules/schedule_a/by_occupation/
/schedules/schedule_a/by_size/
/schedules/schedule_a/by_size/by_candidate/
/schedules/schedule_a/by_state/
/schedules/schedule_a/by_state/by_candidate/
/schedules/schedule_a/by_state/totals/
/schedules/schedule_a/by_zip/
/schedules/schedule_a/efile/
/schedules/schedule_a/{sub_id}/
Disbursements - 9
/committee/{committee_id}/schedules/schedule_b/by_purpose/
/committee/{committee_id}/schedules/schedule_b/by_recipient/
/committee/{committee_id}/schedules/schedule_b/by_recipient_id/
/schedules/schedule_b/
/schedules/schedule_b/by_purpose/
/schedules/schedule_b/by_recipient/
/schedules/schedule_b/by_recipient_id/
/schedules/schedule_b/efile/
/schedules/schedule_b/{sub_id}/
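Below is a minimal sketch of how the endpoints listed above could be walked and timed to record a baseline. The stage host, API-key handling, and default parameters are assumptions; the repo's own locustfile and endpoints_walk.py remain the source of truth:

```python
# Minimal sketch: walk the listed endpoints once and record wall-clock latency.
# Host, API-key handling, and default parameters are assumptions for illustration.
import os
import time

import requests

STAGE_HOST = "https://fec-stage-api.app.cloud.gov"  # assumed stage host
API_KEY = os.environ.get("FEC_API_KEY", "DEMO_KEY")

ENDPOINTS = [
    "/v1/schedules/schedule_a/",
    "/v1/schedules/schedule_a/by_state/",
    "/v1/schedules/schedule_b/",
    "/v1/schedules/schedule_b/by_purpose/",
    # ...remaining receipt/disbursement endpoints from the lists above
]


def time_endpoint(path):
    """Issue one GET and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    resp = requests.get(STAGE_HOST + path,
                        params={"api_key": API_KEY, "per_page": 20},
                        timeout=120)
    elapsed = time.monotonic() - start
    print(f"{path}\t{resp.status_code}\t{elapsed:.2f}s")
    return elapsed


if __name__ == "__main__":
    for endpoint in ENDPOINTS:
        time_endpoint(endpoint)
```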
Updated the Locust file and did the first round of testing.
Shut down replicas when we're done measuring the baseline
Will run another two rounds of Locust testing today and tomorrow and put things together. Will generate a quick report based on that.
Locust testing results are uploaded here:
https://docs.google.com/spreadsheets/d/1SCrvrFEbCV57rtrXpClWrFg-lAeqyfjU4WDr6dBtW6Y/edit#gid=2097394488
For schedule_a and schedule_b testing/benchmarking, we need to re-collect the data after the 2-year restriction removal is deployed to stg/prod.
For regular checking, I think the slow query analysis will do most of the important work on this.
Other work done:
Prepared a PR to fix the Locust file testing data issue; also added two endpoints - schedule_c/sub_id and schedule_d/sub_id - to endpoints_walk.py:
Feature/update locust and endpoints #3698
Python script for batch query and query profiling (will demo); a rough sketch of the profiling approach follows this list.
Script for batch API walks and timing (will demo).
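As a rough illustration of the query-profiling approach mentioned above (not the actual script), one way to capture Postgres execution plans in batch is EXPLAIN (ANALYZE, BUFFERS); the DSN, table, and sample query below are assumptions:

```python
# Rough sketch of batch query profiling with EXPLAIN (ANALYZE, BUFFERS).
# The DSN, table, and sample query are assumptions for illustration; the real
# profiling script may look quite different.
import psycopg2

QUERIES = {
    "sched_a_by_state": """
        EXPLAIN (ANALYZE, BUFFERS)
        SELECT COUNT(*)
        FROM disclosure.fec_fitem_sched_a
        WHERE contbr_st = 'CA';
    """,
}


def profile(dsn):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            for name, sql in QUERIES.items():
                cur.execute(sql)  # EXPLAIN ANALYZE actually runs the query
                plan = [row[0] for row in cur.fetchall()]
                print(f"--- {name} ---")
                print("\n".join(plan))


if __name__ == "__main__":
    profile("postgresql://localhost/fec")
```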
Part 2: fecgov/fec-cms#2743 (we closed that issue).