-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HLD for Memory_Statistics #1760
Conversation
reviewer is expected and please leave your comments if you want to be a reviewer. Thanks. |
community review recording https://zoom.us/rec/share/DtubGviNDO1Rt8VRnHsqWpnF17tu62g_CSP8aTY-hjIZct50ChlNJVWsjBqLaYIl.eYe4gvza4Lsqjhs3 |
Metric Current High Low D1-D3 D3-D5 D5-D7 D7-D9 D9-D11 D11-D13 D13-D15 | ||
Value Value Value 01Jun24 03Jun24 05Jun24 07Jun24 09Jun24 11Jun24 13Jun24 | ||
-------------- -------- -------- -------- --------- --------- --------- --------- --------- --------- --------- | ||
Total Memory 15.6G 15.6G 15.1G 15.1G 15.2G 15.3G 15.3G 15.4G 15.5G 15.5G | ||
Used Memory 2.3G 2.5G 2.0G 2.1G 2.2G 2.2G 2.3G 2.4G 2.3G 2.2G | ||
Free Memory 11.9G 12.4G 10.6G 11.0G 11.2G 11.4G 11.5G 11.7G 11.8G 11.9G | ||
Available Memory 13.0G 14.3G 12.4G 12.6G 12.7G 12.8G 12.9G 13.0G 13.1G 13.2G | ||
Cached Memory 1.2G 1.5G 1.0G 1.1G 1.2G 1.3G 1.4G 1.3G 1.4G 1.2G | ||
Buffer Memory 0.3G 0.4G 0.2G 0.2G 0.3G 0.3G 0.4G 0.3G 0.3G 0.4G | ||
Shared Memory 0.5G 0.6G 0.4G 0.5G 0.5G 0.5G 0.4G 0.5G 0.5G 0.5G | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Arham-Nasirare these stats collected from the output of free
linux command?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data is not directly collected from the output of the free Linux command. Instead, they are gathered using the psutil Python library, which directly accesses system memory information via the same underlying mechanisms that the free command relies on. Specifically, the psutil.virtual_memory() function retrieves memory stats such as total, used, free, buffers, cached, and available memory.This approach allows for more direct access to memory data, avoiding the need to parse command output while providing flexibility for programmatic use.
Analysis Period: From 2024-06-01 09:00:00 to 2024-06-15 09:00:00 | ||
Interval: 2 days | ||
-------------------------------------------------------------------------------- | ||
Metric Current High Low D1-D3 D3-D5 D5-D7 D7-D9 D9-D11 D11-D13 D13-D15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sonic-gnmi (telemetry) already has a comprehensive implementation https://github.com/sonic-net/sonic-gnmi/blob/64ed32b715e1673c3376701ee192d6292f5ec0ec/sonic_data_client/non_db_client.go#L235. Please check and ensure matrix are aligned.
Even better, suggest reuse:
- delegate telemetry container as the memory collector
- reduce this HLD scope to only memory-stats and storage, connecting to telemetry container by gnmi protocol
- ensure this design could run inside and outside sonic switch
- ensure this feature could be completely disabled inside a sonic switch and no performance burden if disabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Thank you for the suggestion. We will review the sonic-gnmi implementation to align our metrics with the existing telemetry framework for system consistency. We will also explore reusing the telemetry container for memory collection and narrow the HLD scope to focus on memory statistics and storage.
- Currently, we use the psutil.virtual_memory() function from the psutil library to gather memory data, which is similar to the getProcMeminfo() function in non_db_client.go, as both methods pull information from /proc/meminfo. We believe that psutil offers a more efficient and Pythonic approach compared to relying solely on the telemetry container for this purpose.
- Can you elaborate more on how the design could run inside and outside the sonic switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this design rely on telemetry container as the memory stats collector, it can run inside or outside sonic switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this thread is still applicable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To provide some context on the current design, we gather memory statistics directly from SONiC using the psutil library. The collected data is stored as gzip files in the /var/log directory and is accessible via show commands through Unix sockets.
We appreciate your suggestion to leverage the telemetry container. As part of enhancements for this feature, we will certainly consider integrating with the telemetry container and aligning the design with your recommendations.
Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In enable/disable case, do you still use SIGHUP which means memstatsd will keep running? or just SIGTERM the memoryStats daemon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hostcfgd will detect the enable/disable flag in ConfigDB. For disable, it sends SIGTERM to stop the daemon, where the daemon gracefully stops all processes, releases resources, and completely shuts down. For enable, it restarts the daemon using systemd, and SIGHUP is used for configuration reloads while the daemon is running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, better to mention it explicitly, memorystats is not daemon but process controlled by hostcfgd, what if hostcfgd crashes, will it reload memorystats and recalculate retention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a misunderstanding. Yes, we are introducing memorystatsd as a process and not a separate container. If the hostcfgd crashes, memorystats will restart and when the user enables memory statistics feature, it will use the default retention time unless the user explicitly mentions a custom retention period. Lets say, user enters 5 as retention period, then it will start the retention period from that time and will show the last 5 days data.
Add dynamic config reload and graceful shutdown for memorystatsd
@qiluo-msft @xincunli-sonic @zbud-msft @FengPan-Frank We have linked the code PRs. Please help review the feature. |
Signed-off-by: Arham-Nasir <[email protected]>
@qiluo-msft @xincunli-sonic @zbud-msft @FengPan-Frank We have linked the code PRs. Please help review the feature. |
@qiluo-msft @xincunli-sonic @FengPan-Frank Please help review the feature. |
@FengPan-Frank please help review and approve this feature. Code PRs are already in review. Target is 202411. We are running out of time for HLD review. Please help expedite review. |
Pls get final signoff from Qi, I think code PR will need to be merged before HLD. |
@qiluo-msft help with this HLD review. |
| sampling_interval | unit16 | Interval for memory data collection. | | ||
| retention_period | unit16 | Duration for which memory data is retained. | | ||
| sampling_interval | unit8 | Interval for memory data collection. | | ||
| retention_period | unit8 | Duration for which memory data is retained. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the units for interval and period? in seconds? IF yes, is uint8 too small?
typo: unit8->uint8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your valuable feedback. To clarify:
- The sampling_interval is provided by the user in minutes, with a configurable range from 3 to 15 minutes.
- The retention_period is provided by the user in days, with a configurable range from 1 to 30 days.
Since the values for sampling_interval and retention_period fall within the range of uint8 (0 to 255), we believe uint8 is appropriate as the data type for representing the user input.
Additionally, the typo has been corrected to uint8 as suggested and the descriptions have been updated for further clarity
…ated descriptions and hld for clarity
code PRs and HLD PR are not merged, move to backlog |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This High-Level Design (HLD) document explains the Memory Statistics feature in SONiC. The aim is to improve memory monitoring and management to optimize network performance and reliability.
The Linked PR are as follows: