Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HLD for Memory_Statistics #1760

Merged
merged 16 commits into from
Dec 14, 2024
Merged

Conversation

Arham-Nasir
Copy link
Contributor

@Arham-Nasir Arham-Nasir commented Jul 23, 2024

This High-Level Design (HLD) document explains the Memory Statistics feature in SONiC. The aim is to improve memory monitoring and management to optimize network performance and reliability.

The Linked PR are as follows:

Repo PR title State
sonic-host-services Support for Memory Statistics Host-Services GitHub issue/pull request detail
sonic-buildimage Add YANG Model and Configuration Support for Memory Statistics GitHub issue/pull request detail
sonic-buildimage Support for Memory Statisticsd Process and Configurations GitHub issue/pull request detail
sonic-utilities Memory Statistics Config and Show Commands GitHub issue/pull request detail
sonic-swss-common Add definition for MEMORY_STATISTICS table in schema GitHub issue/pull request detail

Copy link

linux-foundation-easycla bot commented Jul 23, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@zhangyanzhao
Copy link
Collaborator

reviewer is expected and please leave your comments if you want to be a reviewer. Thanks.

@zhangyanzhao
Copy link
Collaborator

Comment on lines 267 to 277
Metric Current High Low D1-D3 D3-D5 D5-D7 D7-D9 D9-D11 D11-D13 D13-D15
Value Value Value 01Jun24 03Jun24 05Jun24 07Jun24 09Jun24 11Jun24 13Jun24
-------------- -------- -------- -------- --------- --------- --------- --------- --------- --------- ---------
Total Memory 15.6G 15.6G 15.1G 15.1G 15.2G 15.3G 15.3G 15.4G 15.5G 15.5G
Used Memory 2.3G 2.5G 2.0G 2.1G 2.2G 2.2G 2.3G 2.4G 2.3G 2.2G
Free Memory 11.9G 12.4G 10.6G 11.0G 11.2G 11.4G 11.5G 11.7G 11.8G 11.9G
Available Memory 13.0G 14.3G 12.4G 12.6G 12.7G 12.8G 12.9G 13.0G 13.1G 13.2G
Cached Memory 1.2G 1.5G 1.0G 1.1G 1.2G 1.3G 1.4G 1.3G 1.4G 1.2G
Buffer Memory 0.3G 0.4G 0.2G 0.2G 0.3G 0.3G 0.4G 0.3G 0.3G 0.4G
Shared Memory 0.5G 0.6G 0.4G 0.5G 0.5G 0.5G 0.4G 0.5G 0.5G 0.5G

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Arham-Nasirare these stats collected from the output of free linux command?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data is not directly collected from the output of the free Linux command. Instead, they are gathered using the psutil Python library, which directly accesses system memory information via the same underlying mechanisms that the free command relies on. Specifically, the psutil.virtual_memory() function retrieves memory stats such as total, used, free, buffers, cached, and available memory.This approach allows for more direct access to memory data, avoiding the need to parse command output while providing flexibility for programmatic use.

Analysis Period: From 2024-06-01 09:00:00 to 2024-06-15 09:00:00
Interval: 2 days
--------------------------------------------------------------------------------
Metric Current High Low D1-D3 D3-D5 D5-D7 D7-D9 D9-D11 D11-D13 D13-D15
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sonic-gnmi (telemetry) already has a comprehensive implementation https://github.com/sonic-net/sonic-gnmi/blob/64ed32b715e1673c3376701ee192d6292f5ec0ec/sonic_data_client/non_db_client.go#L235. Please check and ensure matrix are aligned.
Even better, suggest reuse:

  1. delegate telemetry container as the memory collector
  2. reduce this HLD scope to only memory-stats and storage, connecting to telemetry container by gnmi protocol
  3. ensure this design could run inside and outside sonic switch
  4. ensure this feature could be completely disabled inside a sonic switch and no performance burden if disabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Thank you for the suggestion. We will review the sonic-gnmi implementation to align our metrics with the existing telemetry framework for system consistency. We will also explore reusing the telemetry container for memory collection and narrow the HLD scope to focus on memory statistics and storage.
  • Currently, we use the psutil.virtual_memory() function from the psutil library to gather memory data, which is similar to the getProcMeminfo() function in non_db_client.go, as both methods pull information from /proc/meminfo. We believe that psutil offers a more efficient and Pythonic approach compared to relying solely on the telemetry container for this purpose.
  • Can you elaborate more on how the design could run inside and outside the sonic switch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this design rely on telemetry container as the memory stats collector, it can run inside or outside sonic switch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this thread is still applicable.

Copy link
Contributor Author

@Arham-Nasir Arham-Nasir Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To provide some context on the current design, we gather memory statistics directly from SONiC using the psutil library. The collected data is stored as gzip files in the /var/log directory and is accessible via show commands through Unix sockets.

We appreciate your suggestion to leverage the telemetry container. As part of enhancements for this feature, we will certainly consider integrating with the telemetry container and aligning the design with your recommendations.

Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In enable/disable case, do you still use SIGHUP which means memstatsd will keep running? or just SIGTERM the memoryStats daemon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hostcfgd will detect the enable/disable flag in ConfigDB. For disable, it sends SIGTERM to stop the daemon, where the daemon gracefully stops all processes, releases resources, and completely shuts down. For enable, it restarts the daemon using systemd, and SIGHUP is used for configuration reloads while the daemon is running.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, better to mention it explicitly, memorystats is not daemon but process controlled by hostcfgd, what if hostcfgd crashes, will it reload memorystats and recalculate retention?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a misunderstanding. Yes, we are introducing memorystatsd as a process and not a separate container. If the hostcfgd crashes, memorystats will restart and when the user enables memory statistics feature, it will use the default retention time unless the user explicitly mentions a custom retention period. Lets say, user enters 5 as retention period, then it will start the retention period from that time and will show the last 5 days data.

@Arham-Nasir
Copy link
Contributor Author

Arham-Nasir commented Sep 26, 2024

@qiluo-msft @xincunli-sonic @zbud-msft @FengPan-Frank We have linked the code PRs. Please help review the feature.

@ridahanif96
Copy link
Collaborator

ridahanif96 commented Oct 25, 2024

@qiluo-msft @xincunli-sonic @zbud-msft @FengPan-Frank We have linked the code PRs. Please help review the feature.
Target 202411, help with expedite review.

@ridahanif96
Copy link
Collaborator

ridahanif96 commented Nov 6, 2024

@qiluo-msft @xincunli-sonic @FengPan-Frank Please help review the feature.
Target 202411, help with expedite review.

@ridahanif96
Copy link
Collaborator

@FengPan-Frank please help review and approve this feature. Code PRs are already in review. Target is 202411. We are running out of time for HLD review. Please help expedite review.

@FengPan-Frank
Copy link
Contributor

@FengPan-Frank please help review and approve this feature. Code PRs are already in review. Target is 202411. We are running out of time for HLD review. Please help expedite review.

Pls get final signoff from Qi, I think code PR will need to be merged before HLD.

@ridahanif96
Copy link
Collaborator

@FengPan-Frank please help review and approve this feature. Code PRs are already in review. Target is 202411. We are running out of time for HLD review. Please help expedite review.

Pls get final signoff from Qi, I think code PR will need to be merged before HLD.

@qiluo-msft help with this HLD review.

| sampling_interval | unit16 | Interval for memory data collection. |
| retention_period | unit16 | Duration for which memory data is retained. |
| sampling_interval | unit8 | Interval for memory data collection. |
| retention_period | unit8 | Duration for which memory data is retained. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the units for interval and period? in seconds? IF yes, is uint8 too small?
typo: unit8->uint8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your valuable feedback. To clarify:

  • The sampling_interval is provided by the user in minutes, with a configurable range from 3 to 15 minutes.
  • The retention_period is provided by the user in days, with a configurable range from 1 to 30 days.

Since the values for sampling_interval and retention_period fall within the range of uint8 (0 to 255), we believe uint8 is appropriate as the data type for representing the user input.

Additionally, the typo has been corrected to uint8 as suggested and the descriptions have been updated for further clarity

@zhangyanzhao
Copy link
Collaborator

code PRs and HLD PR are not merged, move to backlog

Copy link
Contributor

@xincunli-sonic xincunli-sonic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ridahanif96 ridahanif96 merged commit 07851ff into sonic-net:master Dec 14, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: MovedToBacklog
Status: 🏗 In Progress
Development

Successfully merging this pull request may close these issues.

8 participants