Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spurious premature shutdown of services #5100

Closed
1 task done
Tracked by #951
GitHK opened this issue Nov 28, 2023 · 9 comments
Closed
1 task done
Tracked by #951

Spurious premature shutdown of services #5100

GitHK opened this issue Nov 28, 2023 · 9 comments
Assignees
Labels
a:director-v2 issue related with the director-v2 service bug buggy, it does not work as expected High Priority a totally crucial bug/feature to be fixed asap

Comments

@GitHK
Copy link
Contributor

GitHK commented Nov 28, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Which deploy/s?

production aws (e.g. osparc.io), staging aws (e.g. staging.osparc.io), production on-premise (dalco), stating on-premise (dalco), development (master)

Current Behavior

Some users experience their service shutting down as soon as it opens.
director-v2 nor the dynamic-sidecar provide any hint or log message to help with debugging

Expected Behavior

Service should not automatically shut down without an issue. The issue should be at least logged at some log level before triggering a shutdown.

Steps To Reproduce

No response

Anything else?

No response

@GitHK GitHK added the bug buggy, it does not work as expected label Nov 28, 2023
@GitHK GitHK assigned pcrespov and GitHK and unassigned pcrespov Nov 28, 2023
@GitHK GitHK added the a:director-v2 issue related with the director-v2 service label Nov 28, 2023
@GitHK GitHK added this to the Kobayashi Maru milestone Nov 28, 2023
@GitHK GitHK added the High Priority a totally crucial bug/feature to be fixed asap label Nov 28, 2023
@GitHK
Copy link
Contributor Author

GitHK commented Nov 28, 2023

I am currently not sure how to reproduce it.

I have added extra logs and now I'm trying to reproduce the issue, so far master is set to debug logs, but no luck.

@matusdrobuliak66
Copy link
Contributor

matusdrobuliak66 commented Feb 13, 2024

Update: obsolete

12.2.2024 Manuel service 05ed4290-4a94-44ae-9458-3c8413ecabc9
search in Graylog (log_service:/.*director-v2.*/ OR log_service:/.*schdlr.*/) AND "05ed4290-4a94-44ae-9458-3c8413ecabc9"

dynamic-schdlr

Starting rPC call run_dynamic_service(<fastapi.applications.FastAPI object at 0x7fcc2057b550>, rpc_dynamic_service_create=key='simcore/services/dynamic/sim4life-8-0-0-dy' version='3.2.19' user_id=8 project_id=UUID('e97230a6-c9bd-11ee-beb7-02420a0b0047') node_uuid=UUID('05ed4290-4a94-44ae-9458-3c8413ecabc9') basepath=None service_resources={'s4l-core': ImageResources(image='simcore/services/dynamic/s4l-core-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=12.6, reservation=12.6), 'RAM': ResourceValue(limit=55405078118, reservation=55405078118), 'VRAM': ResourceValue(limit=1, reservation=1)}, boot_modes=[<BootMode.GPU: 'GPU'>]), 'rt-web-dy': ImageResources(image='simcore/services/dynamic/sim4life-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=1.0, reservation=1.0), 'RAM': ResourceValue(limit=2147483648, reservation=2147483648)}, boot_modes=[<BootMode.CPU: 'CPU'>]), 's4l-stream': ImageResources(image='simcore/services/dynamic/s4l-stream-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=0.5, reservation=0.5), 'RAM': ResourceValue(limit=1073741824, reservation=1073741824)}, boot_modes=[<BootMode.CPU: 'CPU'>]), 'sym-server': ImageResources(image='simcore/services/dynamic/sym-server-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=0.5, reservation=0.5), 'RAM': ResourceValue(limit=2147483648, reservation=2147483648)}, boot_modes=[<BootMode.CPU: 'CPU'>])} product_name='s4l' can_save=True wallet_info=WalletInfo(wallet_id=37, wallet_name='Prof. Niels Kuster Credits') pricing_info=PricingInfo(pricing_plan_id=5, pricing_unit_id=11, pricing_unit_cost_id=20) hardware_info=HardwareInfo(aws_ec2_instances=['g4dn.4xlarge']) request_dns='sim4life.io' request_scheme='https' simcore_user_agent='undefined')
  • getting a lot of times status OK
HTTP Request: GET http://staging_director-v2:8000/v2/dynamic_services/05ed4290-4a94-44ae-9458-3c8413ecabc9 "HTTP/1.1 200 OK"
  • ⚠️ at some point getting redirected:
HTTP Request: GET http://staging_director-v2:8000/v2/dynamic_services/05ed4290-4a94-44ae-9458-3c8413ecabc9 "HTTP/1.1 307 Temporary Redirect"
  • then getting 404 from director 0:
HTTP Request: GET http://staging_director:8080/v0/running_interactive_services/05ed4290-4a94-44ae-9458-3c8413ecabc9 "HTTP/1.1 404 Not Found"
  • sending delete to director-v2
HTTP Request: DELETE http://staging_director-v2:8000/v2/dynamic_services/05ed4290-4a94-44ae-9458-3c8413ecabc9?can_save=true "HTTP/1.1 204 No Content"
  • after some time starting the service again
Starting rPC call run_dynamic_service(<fastapi.applications.FastAPI object at 0x7f5ec7efc430>, rpc_dynamic_service_create=key='simcore/services/dynamic/sim4life-8-0-0-dy' version='3.2.19' user_id=8 project_id=UUID('e97230a6-c9bd-11ee-beb7-02420a0b0047') node_uuid=UUID('05ed4290-4a94-44ae-9458-3c8413ecabc9') basepath=None service_resources={'s4l-core': ImageResources(image='simcore/services/dynamic/s4l-core-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=12.6, reservation=12.6), 'RAM': ResourceValue(limit=55405078118, reservation=55405078118), 'VRAM': ResourceValue(limit=1, reservation=1)}, boot_modes=[<BootMode.GPU: 'GPU'>]), 'rt-web-dy': ImageResources(image='simcore/services/dynamic/sim4life-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=1.0, reservation=1.0), 'RAM': ResourceValue(limit=2147483648, reservation=2147483648)}, boot_modes=[<BootMode.CPU: 'CPU'>]), 's4l-stream': ImageResources(image='simcore/services/dynamic/s4l-stream-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=0.5, reservation=0.5), 'RAM': ResourceValue(limit=1073741824, reservation=1073741824)}, boot_modes=[<BootMode.CPU: 'CPU'>]), 'sym-server': ImageResources(image='simcore/services/dynamic/sym-server-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=0.5, reservation=0.5), 'RAM': ResourceValue(limit=2147483648, reservation=2147483648)}, boot_modes=[<BootMode.CPU: 'CPU'>])} product_name='s4l' can_save=True wallet_info=WalletInfo(wallet_id=37, wallet_name='Prof. Niels Kuster Credits') pricing_info=PricingInfo(pricing_plan_id=5, pricing_unit_id=11, pricing_unit_cost_id=20) hardware_info=HardwareInfo(aws_ec2_instances=['g4dn.4xlarge']) request_dns='sim4life.io' request_scheme='https' simcore_user_agent='undefined') ...
  • starting to get 200 again
HTTP Request: GET http://staging_director-v2:8000/v2/dynamic_services/05ed4290-4a94-44ae-9458-3c8413ecabc9 "HTTP/1.1 200 OK"
  • This process of redirect/stop and start is repeated ones again, until there is final stop_dynamic_service call

UPDATE: from the sidecars logs in the end it seems this might also be problem because of not been able to connect to storage:

'e=ClientConnectorError(ConnectionKey(host='staging_storage', port=8080, is_ssl=False, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=None), gaierror(-2, 'Name or service not known'))

@GitHK
Copy link
Contributor Author

GitHK commented Feb 15, 2024

12.2.2024 Manuel service 05ed4290-4a94-44ae-9458-3c8413ecabc9 search in Graylog (log_service:/.*director-v2.*/ OR log_service:/.*schdlr.*/) AND "05ed4290-4a94-44ae-9458-3c8413ecabc9"

dynamic-schdlr

Starting rPC call run_dynamic_service(<fastapi.applications.FastAPI object at 0x7fcc2057b550>, rpc_dynamic_service_create=key='simcore/services/dynamic/sim4life-8-0-0-dy' version='3.2.19' user_id=8 project_id=UUID('e97230a6-c9bd-11ee-beb7-02420a0b0047') node_uuid=UUID('05ed4290-4a94-44ae-9458-3c8413ecabc9') basepath=None service_resources={'s4l-core': ImageResources(image='simcore/services/dynamic/s4l-core-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=12.6, reservation=12.6), 'RAM': ResourceValue(limit=55405078118, reservation=55405078118), 'VRAM': ResourceValue(limit=1, reservation=1)}, boot_modes=[<BootMode.GPU: 'GPU'>]), 'rt-web-dy': ImageResources(image='simcore/services/dynamic/sim4life-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=1.0, reservation=1.0), 'RAM': ResourceValue(limit=2147483648, reservation=2147483648)}, boot_modes=[<BootMode.CPU: 'CPU'>]), 's4l-stream': ImageResources(image='simcore/services/dynamic/s4l-stream-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=0.5, reservation=0.5), 'RAM': ResourceValue(limit=1073741824, reservation=1073741824)}, boot_modes=[<BootMode.CPU: 'CPU'>]), 'sym-server': ImageResources(image='simcore/services/dynamic/sym-server-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=0.5, reservation=0.5), 'RAM': ResourceValue(limit=2147483648, reservation=2147483648)}, boot_modes=[<BootMode.CPU: 'CPU'>])} product_name='s4l' can_save=True wallet_info=WalletInfo(wallet_id=37, wallet_name='Prof. Niels Kuster Credits') pricing_info=PricingInfo(pricing_plan_id=5, pricing_unit_id=11, pricing_unit_cost_id=20) hardware_info=HardwareInfo(aws_ec2_instances=['g4dn.4xlarge']) request_dns='sim4life.io' request_scheme='https' simcore_user_agent='undefined')
  • getting a lot of times status OK
HTTP Request: GET http://staging_director-v2:8000/v2/dynamic_services/05ed4290-4a94-44ae-9458-3c8413ecabc9 "HTTP/1.1 200 OK"
  • ⚠️ at some point getting redirected:
HTTP Request: GET http://staging_director-v2:8000/v2/dynamic_services/05ed4290-4a94-44ae-9458-3c8413ecabc9 "HTTP/1.1 307 Temporary Redirect"
  • then getting 404 from director 0:
HTTP Request: GET http://staging_director:8080/v0/running_interactive_services/05ed4290-4a94-44ae-9458-3c8413ecabc9 "HTTP/1.1 404 Not Found"
  • sending delete to director-v2
HTTP Request: DELETE http://staging_director-v2:8000/v2/dynamic_services/05ed4290-4a94-44ae-9458-3c8413ecabc9?can_save=true "HTTP/1.1 204 No Content"
  • after some time starting the service again
Starting rPC call run_dynamic_service(<fastapi.applications.FastAPI object at 0x7f5ec7efc430>, rpc_dynamic_service_create=key='simcore/services/dynamic/sim4life-8-0-0-dy' version='3.2.19' user_id=8 project_id=UUID('e97230a6-c9bd-11ee-beb7-02420a0b0047') node_uuid=UUID('05ed4290-4a94-44ae-9458-3c8413ecabc9') basepath=None service_resources={'s4l-core': ImageResources(image='simcore/services/dynamic/s4l-core-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=12.6, reservation=12.6), 'RAM': ResourceValue(limit=55405078118, reservation=55405078118), 'VRAM': ResourceValue(limit=1, reservation=1)}, boot_modes=[<BootMode.GPU: 'GPU'>]), 'rt-web-dy': ImageResources(image='simcore/services/dynamic/sim4life-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=1.0, reservation=1.0), 'RAM': ResourceValue(limit=2147483648, reservation=2147483648)}, boot_modes=[<BootMode.CPU: 'CPU'>]), 's4l-stream': ImageResources(image='simcore/services/dynamic/s4l-stream-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=0.5, reservation=0.5), 'RAM': ResourceValue(limit=1073741824, reservation=1073741824)}, boot_modes=[<BootMode.CPU: 'CPU'>]), 'sym-server': ImageResources(image='simcore/services/dynamic/sym-server-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=0.5, reservation=0.5), 'RAM': ResourceValue(limit=2147483648, reservation=2147483648)}, boot_modes=[<BootMode.CPU: 'CPU'>])} product_name='s4l' can_save=True wallet_info=WalletInfo(wallet_id=37, wallet_name='Prof. Niels Kuster Credits') pricing_info=PricingInfo(pricing_plan_id=5, pricing_unit_id=11, pricing_unit_cost_id=20) hardware_info=HardwareInfo(aws_ec2_instances=['g4dn.4xlarge']) request_dns='sim4life.io' request_scheme='https' simcore_user_agent='undefined') ...
  • starting to get 200 again
HTTP Request: GET http://staging_director-v2:8000/v2/dynamic_services/05ed4290-4a94-44ae-9458-3c8413ecabc9 "HTTP/1.1 200 OK"
  • This process of redirect/stop and start is repeated ones again, until there is final stop_dynamic_service call

UPDATE: from the sidecars logs in the end it seems this might also be problem because of not been able to connect to storage:

'e=ClientConnectorError(ConnectionKey(host='staging_storage', port=8080, is_ssl=False, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=None), gaierror(-2, 'Name or service not known'))

@matusdrobuliak66 What you have reported is not related to this issue.
The service closes without raising an error and without any incoming call from the director-v2.

@matusdrobuliak66
Copy link
Contributor

12.2.2024 Manuel service 05ed4290-4a94-44ae-9458-3c8413ecabc9 search in Graylog (log_service:/.*director-v2.*/ OR log_service:/.*schdlr.*/) AND "05ed4290-4a94-44ae-9458-3c8413ecabc9"

dynamic-schdlr

Starting rPC call run_dynamic_service(<fastapi.applications.FastAPI object at 0x7fcc2057b550>, rpc_dynamic_service_create=key='simcore/services/dynamic/sim4life-8-0-0-dy' version='3.2.19' user_id=8 project_id=UUID('e97230a6-c9bd-11ee-beb7-02420a0b0047') node_uuid=UUID('05ed4290-4a94-44ae-9458-3c8413ecabc9') basepath=None service_resources={'s4l-core': ImageResources(image='simcore/services/dynamic/s4l-core-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=12.6, reservation=12.6), 'RAM': ResourceValue(limit=55405078118, reservation=55405078118), 'VRAM': ResourceValue(limit=1, reservation=1)}, boot_modes=[<BootMode.GPU: 'GPU'>]), 'rt-web-dy': ImageResources(image='simcore/services/dynamic/sim4life-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=1.0, reservation=1.0), 'RAM': ResourceValue(limit=2147483648, reservation=2147483648)}, boot_modes=[<BootMode.CPU: 'CPU'>]), 's4l-stream': ImageResources(image='simcore/services/dynamic/s4l-stream-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=0.5, reservation=0.5), 'RAM': ResourceValue(limit=1073741824, reservation=1073741824)}, boot_modes=[<BootMode.CPU: 'CPU'>]), 'sym-server': ImageResources(image='simcore/services/dynamic/sym-server-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=0.5, reservation=0.5), 'RAM': ResourceValue(limit=2147483648, reservation=2147483648)}, boot_modes=[<BootMode.CPU: 'CPU'>])} product_name='s4l' can_save=True wallet_info=WalletInfo(wallet_id=37, wallet_name='Prof. Niels Kuster Credits') pricing_info=PricingInfo(pricing_plan_id=5, pricing_unit_id=11, pricing_unit_cost_id=20) hardware_info=HardwareInfo(aws_ec2_instances=['g4dn.4xlarge']) request_dns='sim4life.io' request_scheme='https' simcore_user_agent='undefined')
  • getting a lot of times status OK
HTTP Request: GET http://staging_director-v2:8000/v2/dynamic_services/05ed4290-4a94-44ae-9458-3c8413ecabc9 "HTTP/1.1 200 OK"
  • ⚠️ at some point getting redirected:
HTTP Request: GET http://staging_director-v2:8000/v2/dynamic_services/05ed4290-4a94-44ae-9458-3c8413ecabc9 "HTTP/1.1 307 Temporary Redirect"
  • then getting 404 from director 0:
HTTP Request: GET http://staging_director:8080/v0/running_interactive_services/05ed4290-4a94-44ae-9458-3c8413ecabc9 "HTTP/1.1 404 Not Found"
  • sending delete to director-v2
HTTP Request: DELETE http://staging_director-v2:8000/v2/dynamic_services/05ed4290-4a94-44ae-9458-3c8413ecabc9?can_save=true "HTTP/1.1 204 No Content"
  • after some time starting the service again
Starting rPC call run_dynamic_service(<fastapi.applications.FastAPI object at 0x7f5ec7efc430>, rpc_dynamic_service_create=key='simcore/services/dynamic/sim4life-8-0-0-dy' version='3.2.19' user_id=8 project_id=UUID('e97230a6-c9bd-11ee-beb7-02420a0b0047') node_uuid=UUID('05ed4290-4a94-44ae-9458-3c8413ecabc9') basepath=None service_resources={'s4l-core': ImageResources(image='simcore/services/dynamic/s4l-core-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=12.6, reservation=12.6), 'RAM': ResourceValue(limit=55405078118, reservation=55405078118), 'VRAM': ResourceValue(limit=1, reservation=1)}, boot_modes=[<BootMode.GPU: 'GPU'>]), 'rt-web-dy': ImageResources(image='simcore/services/dynamic/sim4life-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=1.0, reservation=1.0), 'RAM': ResourceValue(limit=2147483648, reservation=2147483648)}, boot_modes=[<BootMode.CPU: 'CPU'>]), 's4l-stream': ImageResources(image='simcore/services/dynamic/s4l-stream-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=0.5, reservation=0.5), 'RAM': ResourceValue(limit=1073741824, reservation=1073741824)}, boot_modes=[<BootMode.CPU: 'CPU'>]), 'sym-server': ImageResources(image='simcore/services/dynamic/sym-server-8-0-0-dy:3.2.19', resources={'CPU': ResourceValue(limit=0.5, reservation=0.5), 'RAM': ResourceValue(limit=2147483648, reservation=2147483648)}, boot_modes=[<BootMode.CPU: 'CPU'>])} product_name='s4l' can_save=True wallet_info=WalletInfo(wallet_id=37, wallet_name='Prof. Niels Kuster Credits') pricing_info=PricingInfo(pricing_plan_id=5, pricing_unit_id=11, pricing_unit_cost_id=20) hardware_info=HardwareInfo(aws_ec2_instances=['g4dn.4xlarge']) request_dns='sim4life.io' request_scheme='https' simcore_user_agent='undefined') ...
  • starting to get 200 again
HTTP Request: GET http://staging_director-v2:8000/v2/dynamic_services/05ed4290-4a94-44ae-9458-3c8413ecabc9 "HTTP/1.1 200 OK"
  • This process of redirect/stop and start is repeated ones again, until there is final stop_dynamic_service call

UPDATE: from the sidecars logs in the end it seems this might also be problem because of not been able to connect to storage:

'e=ClientConnectorError(ConnectionKey(host='staging_storage', port=8080, is_ssl=False, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=None), gaierror(-2, 'Name or service not known'))

@matusdrobuliak66 What you have reported is not related to this issue. The service closes without raising an error and without any incoming call from the director-v2.

Agree, there is UPDATE in the last sentence when i realized this one was also connected to the connection to storage issue.

@matusdrobuliak66
Copy link
Contributor

matusdrobuliak66 commented Feb 15, 2024

UPDATE: obsolete

@GitHK So I was looking into the logs of the second problem of not being able to start a service. Firstly I thought it is connected to a garbage collector but in the end, it seems it is connected to something else:

"2024-02-13T12:56:29.165Z","sim4lifeManager3","DEBUG","staging-simcore_staging_director-v2","Observing dynamic-sidecars ['dy-sidecar_b957ecc9-a89d-5d45-a85d-738463a3a80d', 'dy-sidecar_04e687dc-6bde-418c-9e5c-ec4940bba462', 'dy-sidecar_c7358ba3-a4b3-4027-88de-b31866c9ad54', 'dy-sidecar_de39a6e2-23df-4afe-9391-f83c9583fd55', 'dy-sidecar_d93ffc01-ad2c-47c3-b67e-7b7767b47fe7', 'dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5', 'dy-sidecar_d537f6f9-8a34-4d13-9cc1-c0e03f81493a', 'dy-sidecar_507bcf97-db78-4282-a9a3-5b036a48a747', 'dy-sidecar_05e78019-5ea4-41d3-91be-f55dfba21c19']","simcore_service_director_v2.modules.dynamic_sidecar.scheduler._core._scheduler:_run_scheduler_task(565)","log_level=DEBUG | log_timestamp=2024-02-13 12:56:29,165 | log_source=simcore_service_director_v2.modules.dynamic_sidecar.scheduler._core._scheduler:_run_scheduler_task(565) | log_uid=None | log_msg=Observing dynamic-sidecars ['dy-sidecar_b957ecc9-a89d-5d45-a85d-738463a3a80d', 'dy-sidecar_04e687dc-6bde-418c-9e5c-ec4940bba462', 'dy-sidecar_c7358ba3-a4b3-4027-88de-b31866c9ad54', 'dy-sidecar_de39a6e2-23df-4afe-9391-f83c9583fd55', 'dy-sidecar_d93ffc01-ad2c-47c3-b67e-7b7767b47fe7', 'dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5', 'dy-sidecar_d537f6f9-8a34-4d13-9cc1-c0e03f81493a', 'dy-sidecar_507bcf97-db78-4282-a9a3-5b036a48a747', 'dy-sidecar_05e78019-5ea4-41d3-91be-f55dfba21c19']"
"2024-02-13T12:56:29.167Z","sim4lifeManager3","INFO","staging-simcore_staging_director-v2","Handling observation for dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5","simcore_service_director_v2.modules.dynamic_sidecar.scheduler._core._scheduler:_run_trigger_observation_queue_task(539)","log_level=INFO | log_timestamp=2024-02-13 12:56:29,166 | log_source=simcore_service_director_v2.modules.dynamic_sidecar.scheduler._core._scheduler:_run_trigger_observation_queue_task(539) | log_uid=None | log_msg=Handling observation for dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5"
"2024-02-13T12:56:29.207Z","ip-10-0-0-40","WARNING","dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5","No compose-spec was found","simcore_service_dynamic_sidecar.modules.long_running_tasks:task_runs_docker_compose_down(236)","log_level=WARNING | log_timestamp=2024-02-13 12:56:29,207 | log_source=simcore_service_dynamic_sidecar.modules.long_running_tasks:task_runs_docker_compose_down(236) | log_uid=None | log_msg=No compose-spec was found"
"2024-02-13T12:56:29.207Z","ip-10-0-0-40","INFO","dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5","10.11.4.218:51370 - ""POST /v1/containers%3Adown HTTP/1.1"" 202","uvicorn.access:send(478)","log_level=INFO | log_timestamp=2024-02-13 12:56:29,207 | log_source=uvicorn.access:send(478) | log_uid=None | log_msg=10.11.4.218:51370 - ""POST /v1/containers%3Adown HTTP/1.1"" 202"
"2024-02-13T12:56:29.207Z","ip-10-0-0-40","DEBUG","dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5","Progress update: message='starting' percent=0.0","models_library.api_schemas_long_running_tasks.base:update(43)","log_level=DEBUG | log_timestamp=2024-02-13 12:56:29,207 | log_source=models_library.api_schemas_long_running_tasks.base:update(43) | log_uid=None | log_msg=Progress update: message='starting' percent=0.0"
"2024-02-13T12:56:29.210Z","sim4lifeManager3","INFO","staging-simcore_staging_director-v2","HTTP Request: POST http://dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5:8000/v1/containers:down ""HTTP/1.1 202 Accepted""","httpx:_send_single_request(1729)","log_level=INFO | log_timestamp=2024-02-13 12:56:29,210 | log_source=httpx:_send_single_request(1729) | log_uid=None | log_msg=HTTP Request: POST http://dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5:8000/v1/containers:down ""HTTP/1.1 202 Accepted"""
"2024-02-13T12:56:29.211Z","ip-10-0-0-40","DEBUG","dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5","Progress update: message='finished' percent=1.0","models_library.api_schemas_long_running_tasks.base:update(43)","log_level=DEBUG | log_timestamp=2024-02-13 12:56:29,207 | log_source=models_library.api_schemas_long_running_tasks.base:update(43) | log_uid=None | log_msg=Progress update: message='finished' percent=1.0"
"2024-02-13T12:56:29.212Z","ip-10-0-0-40","INFO","dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5","10.11.4.218:51384 - ""GET /task/simcore_service_dynamic_sidecar.modules.long_running_tasks.task_runs_docker_compose_down.4e01744a-78d4-4c10-91d4-2c837e1a17f5 HTTP/1.1"" 200","uvicorn.access:send(478)","log_level=INFO | log_timestamp=2024-02-13 12:56:29,212 | log_source=uvicorn.access:send(478) | log_uid=None | log_msg=10.11.4.218:51384 - ""GET /task/simcore_service_dynamic_sidecar.modules.long_running_tasks.task_runs_docker_compose_down.4e01744a-78d4-4c10-91d4-2c837e1a17f5 HTTP/1.1"" 200"
"2024-02-13T12:56:29.214Z","sim4lifeManager3","INFO","staging-simcore_staging_director-v2","HTTP Request: GET http://dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5:8000/task/simcore_service_dynamic_sidecar.modules.long_running_tasks.task_runs_docker_compose_down.4e01744a-78d4-4c10-91d4-2c837e1a17f5 ""HTTP/1.1 200 OK""","httpx:_send_single_request(1729)","log_level=INFO | log_timestamp=2024-02-13 12:56:29,214 | log_source=httpx:_send_single_request(1729) | log_uid=None | log_msg=HTTP Request: GET http://dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5:8000/task/simcore_service_dynamic_sidecar.modules.long_running_tasks.task_runs_docker_compose_down.4e01744a-78d4-4c10-91d4-2c837e1a17f5 ""HTTP/1.1 200 OK"""
"2024-02-13T12:56:29.219Z","ip-10-0-0-40","INFO","dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5","10.11.4.218:51398 - ""GET /task/simcore_service_dynamic_sidecar.modules.long_running_tasks.task_runs_docker_compose_down.4e01744a-78d4-4c10-91d4-2c837e1a17f5/result HTTP/1.1"" 200","uvicorn.access:send(478)","log_level=INFO | log_timestamp=2024-02-13 12:56:29,219 | log_source=uvicorn.access:send(478) | log_uid=None | log_msg=10.11.4.218:51398 - ""GET /task/simcore_service_dynamic_sidecar.modules.long_running_tasks.task_runs_docker_compose_down.4e01744a-78d4-4c10-91d4-2c837e1a17f5/result HTTP/1.1"" 200"
"2024-02-13T12:56:29.221Z","sim4lifeManager3","INFO","staging-simcore_staging_director-v2","HTTP Request: GET http://dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5:8000/task/simcore_service_dynamic_sidecar.modules.long_running_tasks.task_runs_docker_compose_down.4e01744a-78d4-4c10-91d4-2c837e1a17f5/result ""HTTP/1.1 200 OK""","httpx:_send_single_request(1729)","log_level=INFO | log_timestamp=2024-02-13 12:56:29,221 | log_source=httpx:_send_single_request(1729) | log_uid=None | log_msg=HTTP Request: GET http://dy-sidecar_a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5:8000/task/simcore_service_dynamic_sidecar.modules.long_running_tasks.task_runs_docker_compose_down.4e01744a-78d4-4c10-91d4-2c837e1a17f5/result ""HTTP/1.1 200 OK"""

@elisabettai
Copy link
Collaborator

@GitHK, @matusdrobuliak66 I am closing the Fogbugz case (it is probably a duplicate of this one).

So please use only this one.

I don't think there's useful information in the Fogbugz case (there @matusdrobuliak66 added logs for the GC hypothesis, which doesn't seem to be the cause anymore).

@elisabettai elisabettai changed the title Services shutdown after starting Spurious premature shutdown of services Feb 16, 2024
@matusdrobuliak66 matusdrobuliak66 self-assigned this Feb 19, 2024
@matusdrobuliak66
Copy link
Contributor

matusdrobuliak66 commented Feb 19, 2024

Garbage collector premature shutdown investigation

deployment: sim4life.io
date: 13.2.2023
user: 8 (Manuel)
node id: a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5
project id: 146dccf0-ca6f-11ee-beb7-02420a0b0047

Webserver logs

log_level=INFO | log_timestamp=2024-02-13 12:54:57,026 | log_source=gunicorn.access:log(206) | log_uid=None | log_msg=10.4.66.165 [13/Feb/2024:12:54:56 +0000] "POST /v0/projects/146dccf0-ca6f-11ee-beb7-02420a0b0047:open HTTP/1.1" 200 1043 [873125us] "https://sim4life.io/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"

log_level=INFO | log_timestamp=2024-02-13 12:55:04,804 | log_source=gunicorn.access:log(206) | log_uid=None | log_msg=10.4.66.165 [13/Feb/2024:12:55:03 +0000] "POST /v0/projects/146dccf0-ca6f-11ee-beb7-02420a0b0047/nodes HTTP/1.1" 201 224 [1210787us] "https://sim4life.io/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"

log_level=INFO | log_timestamp=2024-02-13 12:55:05,393 | log_source=gunicorn.access:log(206) | log_uid=None | log_msg=10.4.66.165 [13/Feb/2024:12:55:04 +0000] "GET /v0/projects/146dccf0-ca6f-11ee-beb7-02420a0b0047/nodes/a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5 HTTP/1.1" 200 704 [481869us] "https://sim4life.io/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"



# probably second start
log_level=INFO | log_timestamp=2024-02-13 12:59:17,604 | log_source=gunicorn.access:log(206) | log_uid=None | log_msg=10.4.66.165 [13/Feb/2024:12:59:16 +0000] "POST /v0/projects/146dccf0-ca6f-11ee-beb7-02420a0b0047/nodes/a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5:start HTTP/1.1" 204 131 [1147487us] "https://sim4life.io/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"

# probably third start
log_level=INFO | log_timestamp=2024-02-13 13:02:04,907 | log_source=gunicorn.access:log(206) | log_uid=None | log_msg=10.4.66.163 [13/Feb/2024:13:02:03 +0000] "POST /v0/projects/146dccf0-ca6f-11ee-beb7-02420a0b0047/nodes/a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5:start HTTP/1.1" 204 131 [1356853us] "https://sim4life.io/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"

# ?
log_level=WARNING | log_timestamp=2024-02-13 13:20:17,557 | log_source=simcore_service_webserver.projects.projects_api:try_close_project_for_user(1022) | log_uid=8 | log_msg=project_uuid='146dccf0-ca6f-11ee-beb7-02420a0b0047' is not registered as resource of 8. Skipping close project

  • project creates: 2024-02-13 12:54:52.792
  • it was not the first node a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5 created 2024-02-13 12:55:04.009272+00
  • the first node 8a6d679c-99e1-4d60-b743-d45bc8414ae2 created 2024-02-13 12:54:52.799045+00

TIMELINE: GC (triggers delete) -> dynamic-scheduler (Delete) -> director-v2 (Delete) -> Sidecar (Stop containers)

Garbage collector logs
This is the last garbage collector cycle before the issue occurs

log_level=INFO | log_timestamp=2024-02-13 12:54:31,467 | log_source=simcore_service_webserver.garbage_collector._core:log_context(279) | log_uid=None | log_msg=Starting step 1: Removes disconnected user sessions

log_level=INFO | log_timestamp=2024-02-13 12:54:31,472 | log_source=simcore_service_webserver.garbage_collector._core:log_context(286) | log_uid=None | log_msg=Finished step 1: Removes disconnected user sessions

# GC was restarted (BUT STILL THE SAME CONTAINER!)
2024-02-13 12:54:31.486 |\/ \/ \___\|_.__/ \|___/ \___\|_\| \_/ \___\|_\| v0.37.1|
  • Why did GC container restarted?
  • Deleting API keys from Manuel. It is logged in the GC & WEBSERVER (probably is not connected to this issue, it just deleted one of his 5 services, what is strange why only some of the delete_keys are logged in the GC and not all of them?)
# THIS LOG IS FROM GARBAGE COLLECTOR (probably should not be logged here!):
log_level=INFO | log_timestamp=2024-02-13 12:54:44,883 | log_source=rpc.access:log_context(279) | log_uid=None | log_msg=Starting rPC call delete_api_keys(<Application 0x7f512a7fed70>, user_id=8, product_name=s4l, name=_auto_384de5cc-ca7e-59b2-b7b1-0371eac73d12) ...

# THIS LOG IS FROM WEBSERVER:
"2024-02-13T12:54:44.884Z","sim4lifeManager1","staging-simcore_staging_wb-garbage-collector.1.mohlo4qgp2bbub6iwmvcfc079","log_level=INFO | log_timestamp=2024-02-13 12:54:44,883 | log_source=rpc.access:log_context(279) | log_uid=None | log_msg=Starting rPC call delete_api_keys(<Application 0x7f512a7fed70>, user_id=8, product_name=s4l, name=_auto_384de5cc-ca7e-59b2-b7b1-0371eac73d12) ..."
"2024-02-13T12:54:44.887Z","sim4lifeManager1","staging-simcore_staging_wb-garbage-collector.1.mohlo4qgp2bbub6iwmvcfc079","log_level=INFO | log_timestamp=2024-02-13 12:54:44,887 | log_source=rpc.access:log_context(286) | log_uid=None | log_msg=Finished rPC call delete_api_keys(<Application 0x7f512a7fed70>, user_id=8, product_name=s4l, name=_auto_384de5cc-ca7e-59b2-b7b1-0371eac73d12) in 0.003637s",

In the next garbage collect cycle dead session of Manuel was found:

log_level=INFO | log_timestamp=2024-02-13 12:55:25,804 | log_source=simcore_service_webserver.garbage_collector._tasks_core:log_context(279) | log_uid=None | log_msg=Starting garbage collect cycle ...

log_level=INFO | log_timestamp=2024-02-13 12:55:25,804 | log_source=simcore_service_webserver.garbage_collector._core:log_context(279) | log_uid=None | log_msg=Starting step 1: Removes disconnected user sessions ...

log_level=INFO | log_timestamp=2024-02-13 12:55:25,810 | log_source=simcore_service_webserver.garbage_collector._core_disconnected:remove_disconnected_user_resources(65) | log_uid=None | log_msg=dead_session={'user_id': '8', 'client_session_id': '96a73f68-edbd-4fff-935e-860d667ccf62'} expired. Checking resources to cleanup

log_level=INFO | log_timestamp=2024-02-13 12:55:25,816 | log_source=simcore_service_webserver.garbage_collector._core_disconnected:remove_disconnected_user_resources(98) | log_uid=None | log_msg=(1) Releasing resource resource_name='project_id':resource_value='146dccf0-ca6f-11ee-beb7-02420a0b0047' acquired by expired {'user_id': '8', 'client_session_id': '96a73f68-edbd-4fff-935e-860d667ccf62'}

log_level=INFO | log_timestamp=2024-02-13 12:56:30,381 | log_source=simcore_service_webserver.garbage_collector._core_disconnected:remove_disconnected_user_resources(139) | log_uid=None | log_msg=(2) Removing field for released resource resource_name='project_id':resource_value='146dccf0-ca6f-11ee-beb7-02420a0b0047' from registry keys: [{'user_id': '8', 'client_session_id': '96a73f68-edbd-4fff-935e-860d667ccf62'}]

log_level=INFO | log_timestamp=2024-02-13 12:56:30,381 | log_source=simcore_service_webserver.garbage_collector._core:log_context(286) | log_uid=None | log_msg=Finished step 1: Removes disconnected user sessions
  • Why did GC thought that user was disconnected, as clearly the service was starting? Is it somehow connected with GC restart?

Dynamic-scheduler logs

log_level=INFO | log_timestamp=2024-02-13 12:56:30,205 | log_source=rpc.access:log_context(286) | log_uid=None | log_msg=Finished rPC call stop_dynamic_service(<fastapi.applications.FastAPI object at 0x7f9be2c00a90>, save_state=True, simcore_user_agent=undefined, node_id=a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5) in 64.234464s

log_level=INFO | log_timestamp=2024-02-13 12:56:30,205 | log_source=httpx:_send_single_request(1729) | log_uid=None | log_msg=HTTP Request: DELETE http://staging_director-v2:8000/v2/dynamic_services/a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5?can_save=true "HTTP/1.1 204 No Content"

Director-v2 logs

log_level=INFO | log_timestamp=2024-02-13 12:56:30,205 | log_source=uvicorn.access:send(496) | log_uid=None | log_msg=10.4.66.7:37710 - "DELETE /v2/dynamic_services/a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5?can_save=true HTTP/1.1" 204

NOTE: For some reason 2 more times delete was received from dynamic-scheduler:

log_level=INFO | log_timestamp=2024-02-13 13:00:35,138 | log_source=uvicorn.access:send(496) | log_uid=None | log_msg=10.4.66.150:40458 - "DELETE /v2/dynamic_services/a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5?can_save=true HTTP/1.1" 204

log_level=INFO | log_timestamp=2024-02-13 13:04:51,065 | log_source=uvicorn.access:send(496) | log_uid=None | log_msg=10.4.66.150:53978 - "DELETE /v2/dynamic_services/a0d8bc8d-42a6-4ad9-ab04-ffdca4a2b2c5?can_save=true HTTP/1.1" 204

Sidecar logs

log_level=INFO | log_timestamp=2024-02-13 12:56:29,207 | log_source=uvicorn.access:send(478) | log_uid=None | log_msg=10.11.4.218:51370 - "POST /v1/containers%3Adown HTTP/1.1" 202

Notes:

  • It probably is connected with GC restart (but I don't yet know why..), but the service has not restart a lot (for the last 7 days)

image

@matusdrobuliak66
Copy link
Contributor

Closing this issue as we have already dealt with most of the known issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a:director-v2 issue related with the director-v2 service bug buggy, it does not work as expected High Priority a totally crucial bug/feature to be fixed asap
Projects
None yet
Development

No branches or pull requests

6 participants