Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hwp-pid direction read bugfix #728

Merged
merged 11 commits into from
Oct 14, 2024
Merged

hwp-pid direction read bugfix #728

merged 11 commits into from
Oct 14, 2024

Conversation

bbixler500
Copy link
Contributor

Initial attempt to fix occasional issues in reading the hwp-pid direction

Description

Adds additional logic to handle read strings that are formatted in a non standard way.

Motivation and Context

Recently changes were pushed to the hwp-pid agent #686 which changed how response messages were processed. As an unintended consequence, a new bug surfaced when reading the rotation direction #727. From these error logs it appears as though the response message is mostly correct, but it is missing an initial 0 digit. This PR is a simple fix which will remedy this specific issue. Feedback is appreciated if the issue is more complicated than described above.

How Has This Been Tested?

This fix has not been tested and needs to be before it is merged into main

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@ykyohei
Copy link
Contributor

ykyohei commented Aug 19, 2024

Yeah, I was not sure if 'R2400000' means miscellaneous or it actually means direction but zero is missing.
I'm fine with merging this and see if it resolves this type of error.
But I request to test this for a while to make sure that this does not introduce other errors.

@ykyohei
Copy link
Contributor

ykyohei commented Aug 19, 2024

Actually, I got IndexError from pid agent probably for the first time.
I think this means that there is a problem in the current string decoding for edge cases..
https://github.com/simonsobs/chwp-discussions/discussions/28

We had another error which seems to be related with the drop of response string (reported in the same discussion).
I think it might be better to check the length of the response string, and skip or retry if the sting length is different from the expected length.

@BrianJKoopman BrianJKoopman requested review from jlashner and removed request for BrianJKoopman August 28, 2024 18:05
Copy link
Collaborator

@jlashner jlashner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How sure are we that there is actually a missing 0 in the return string, and not that this is some other error?

Instead of assuming this is the case, I'd much rather add proper error handling to the string parsing such that unexpected responses do no crash the main process. I think string parsing like this must be wrapped in a try/catch, where if there is an error, we log the unexpected response and return an "error" response so that the main process can handle it properly.

@BrianJKoopman BrianJKoopman linked an issue Sep 16, 2024 that may be closed by this pull request
@bbixler500
Copy link
Contributor Author

I've added a decorator to the read functions which will cause those functions to try multiple times before failing. This should hopefully fix some of the crashes this agent has been getting. I've also reverted some of the response handling changes that were previously implemented. The manner in which the read statements fail doesn't seem to be as consistent as I had believed, so those changes did not cover all failure cases. The most recent additions have not been tested yet, but I've pushed what they are currently in case anyone has comments.

Copy link
Collaborator

@jlashner jlashner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Bryce, thanks for the changes! Looks good to me for the most part but there are a couple of important points in inline comments that I think need to be addressed.

except:
time.sleep(0.2)
print(f'Could not complete {func.__name__} after {loops} attempt(s)')
return None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this return None changes the failure behavior of the functions. Before, on failure they would raise an error.

I'm afraid this will make it look as though the agent get_state process is successful, while the values are actually None. I'd either change this so that it raises an error, or modify the agent to properly handle None return vals.

if decoded_resp.msg_type == 'measure':
return decoded_resp.measure
elif decoded_resp.msg_type == 'error':
raise ValueError(f"Error reading freq: {decoded_resp.msg}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be swallowed by the try/catch in the retry_multiple_times decorator, so it won't be logged to stdout. Make sure to log an error here with the problematic response.

@bbixler500
Copy link
Contributor Author

I've added handling of the None response to the agent and changed how the errors are logged. Let me know if there is any other issue or if this PR looks good to merge

@BrianJKoopman BrianJKoopman requested a review from jlashner October 7, 2024 20:06
@jlashner
Copy link
Collaborator

jlashner commented Oct 7, 2024

I'm sorry I don't think I agree with this method of handling the readout errors. I think that if any of the parameters fail to read-out, we should throw an error in get_pid_state rather than returning a valid dict but with specific fields removed.

Agents calling this function expect to get a complete response, and if that is not possible then they need decide how to handle that case. Removing fields like you do here kind of just hides the problem, and moves the responsibility of checking whether fields exist to the caller, which they're not currently doing.

We'll also need to add error handling to the functions in this agent that call get_pid_state. I think the behavior we want is:

  • If get_pid_state fails in the main process, we should mark the process as "degraded", and continue onto the next iteration after a short sleep
  • If the function fails in the get_state task, then I think we want that task to fail.

@bbixler500
Copy link
Contributor Author

Okay, I've changed the handling of readout errors to follow Jack's suggestions. Read errors in the main process will print a warning message but will continue on to the next iteration, while read errors in the get_state will raise a ValueError

@jlashner
Copy link
Collaborator

Ok cool, I think this is starting to look better. One quick thing and then I think this is good to merge... to set the process to be degraded, we want to run:

session.degraded = True

and then we want to set this back to False after it has recovered.

This will make it so the operation state will appear as "degraded" in grafana, and we can set alarms based on that.

Copy link
Collaborator

@jlashner jlashner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @bbixler500, looks like there is a test failing, but as soon as that's fixed and this is tested I think we can merge this.

@bbixler500
Copy link
Contributor Author

Okay, all tests are passing now

@BrianJKoopman
Copy link
Member

Okay, all tests are passing now

Great, has the most recent version been tested against hardware?

@bbixler500
Copy link
Contributor Author

Yeah, every step of the way I was testing on satp2 before pushing my updates

@BrianJKoopman BrianJKoopman merged commit 6731bb9 into main Oct 14, 2024
5 checks passed
@BrianJKoopman BrianJKoopman deleted the hwp-pid-direction-bugfix branch October 14, 2024 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error reading direction in HWP PID Agent
4 participants