Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removal of LAG with no members takes around 1 sec on 202012 #8071

Closed
dprital opened this issue Jul 6, 2021 · 0 comments · Fixed by sonic-net/sonic-swss#1841
Closed

Removal of LAG with no members takes around 1 sec on 202012 #8071

dprital opened this issue Jul 6, 2021 · 0 comments · Fixed by sonic-net/sonic-swss#1841

Comments

@dprital
Copy link
Collaborator

dprital commented Jul 6, 2021

Recently we have faced one very strange degradation in LAGs deinit process which requires your attention.
It takes place in 202012 while works perfectly fine in 201911.

The bug is caused by teamd.service long deinit time:

Jun 29 12:19:05.007799 sonic NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel57
Jun 29 12:19:05.023175 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel58 admin:0 oper:0 addr:1c:34:da:a1:96:80 ifindex:269 master:0 type:team
Jun 29 12:19:05.024577 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:PortChannel58 admin:0 oper:0 addr:1c:34:da:a1:96:80 ifindex:269 master:0 type:team
Jun 29 12:19:05.927276 sonic INFO teamd#supervisord 2021-06-29 12:19:05,924 INFO reaped unknown pid 453 (exit status 0)
Jun 29 12:19:05.928010 sonic INFO teamd#supervisord 2021-06-29 12:19:05,926 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd to die
Jun 29 12:19:06.024504 sonic NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel58
Jun 29 12:19:06.040217 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel59 admin:0 oper:0 addr:1c:34:da:a1:96:80 ifindex:270 master:0 type:team
Jun 29 12:19:06.044142 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:PortChannel59 admin:0 oper:0 addr:1c:34:da:a1:96:80 ifindex:270 master:0 type:team
Jun 29 12:19:06.931980 sonic INFO teamd#supervisord 2021-06-29 12:19:06,930 INFO reaped unknown pid 461 (exit status 0)

This results in teammgrd is stopped before the cleanup is completed:

Jun 29 12:19:07.368929 sonic INFO dockerd[693]: time="2021-06-29T12:19:07.366572636Z" level=info msg="Container ed1dbc48e9acb7f01984c7e26e6c36a0496b906b2f38bcc092b8680c32dcde65 failed to exit within 60 seconds of signal 15 - using the force"

In general, it takes approximately ~ 1 sec to remove LAG with no members:

root@sonic:/# /usr/bin/teamd -r -t PortChannel8 -c '{"device":"PortChannel8","hwaddr":"1c:34:da:a1:96:80","runner":{"active":true,"name":"lacp"}}' -L /var/warmboot/teamd/ -g -d
Using team device "PortChannel8".
Using PID file "/var/run/teamd/PortChannel8.pid"
This program is not intended to be run as root.

root@sonic:/# time teamd -t PortChannel8 -k

real 0m1.020s
user 0m0.006s
sys 0m0.006s

The current teammgrd implementation uses teamd -k LAG cleanup interface:

https://github.com/Azure/sonic-swss/blob/202012/cfgmgr/teammgr.cpp#L116

bool TeamMgr::removeLag(const string &alias)
{
SWSS_LOG_ENTER();

stringstream cmd;
string res;

cmd << TEAMD_CMD << " -k -t " << shellquote(alias);
EXEC_WITH_ERROR_THROW(cmd.str(), res);

SWSS_LOG_NOTICE("Stop port channel %s", alias.c_str());

return true;

}

void TeamMgr::cleanTeamProcesses()
{
SWSS_LOG_ENTER();
SWSS_LOG_NOTICE("Cleaning up LAGs during shutdown...");
for (const auto& it: m_lagList)
{
//This will call team -k kill -t which internally send SIGTERM
removeLag(it);
}

return;

}

After some more investigation, it appears that the LAG remove call (teamd -k - t <port_channel_name>) is a blocking operation,
which consumes extra time because of supervisord is executed with python3.
Things work fine on 201911, where supervisord is executed with python2.7.

It seems, that the new python3 version somehow interferes teamd -k call,
which invokes libdaemon daemon_pid_file_kill_wait function, which uses kill system call to manage the process and select system call to deal with a timeout.

https://github.com/Distrotech/libdaemon/blob/master/libdaemon/dpid.c#L182

int daemon_pid_file_kill_wait(int s, int m) {
pid_t pid;
time_t t;

if ((pid = daemon_pid_file_is_running()) < 0)
    return -1;

if (kill(pid, s) < 0)
    return -1;

t = time(NULL) + m;

for (;;) {
    int r;
    struct timeval tv = { 0, 100000 };

    if (time(NULL) > t) {
        errno = ETIME;
        return -1;
    }

    if ((r = kill(pid, 0)) < 0 && errno != ESRCH)
        return -1;

    if (r)
        return 0;

    if (select(0, NULL, NULL, NULL, &tv) < 0)
        return -1;
}

}

Here is the steps to reproduce:

  1. Build teamd docker for 202012:
    make target/docker-teamd.gz

  2. Load teamd docker:
    docker load < target/docker-teamd.gz

  3. Compose a Dockerfile:
    FROM docker-teamd:latest

RUN apt-get update && apt-get install python2.7 python-pip -y
RUN apt-get remove python3 python3-minimal -y

RUN pip2 install supervisor setuptools

  1. Build teamd docker for python2.7:
    docker build -t docker-teamd:p2.7 -f Dockerfile .

  2. Test teamd python2.7 docker:
    docker rm -f team_2.7
    docker run --privileged -t -d --cap-add NET_ADMIN --name team_2.7 docker-teamd:p2.7
    docker exec -ti team_3.7 bash

root@9742295dc210:/# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 26748 13780 pts/0 Ss+ Jul01 0:21 /usr/bin/python /usr/local/bin/supervisord
root 532 0.0 0.0 3996 3192 pts/1 Ss 15:29 0:00 bash
root 703 0.0 0.0 7640 2692 pts/1 R+ 15:51 0:00 ps -aux

root@9742295dc210:/# teamd -r -t PortChannel6 -c '{"device":"PortChannel6","hwaddr":"1c:34:da:a1:96:80","runner":{"active":true,"name":"lacp"}}' -L /var/warmboot/teamd/ -g -d
Using team device "PortChannel6".
Using PID file "/var/run/teamd/PortChannel6.pid"
This program is not intended to be run as root.

root@9742295dc210:/# time teamd -k -t PortChannel6

real 0m0.108s
user 0m0.004s
sys 0m0.004s

  1. Test teamd python3 docker:

docker rm -f team_3.7
docker run --privileged -t -d --cap-add NET_ADMIN --name team_3.7 docker-teamd:latest
docker exec -ti team_3.7 bash

root@9cd2d740705f:/# rm /etc/supervisor/critical_processes
root@9cd2d740705f:/# supervisorctl stop rsyslogd dependent-startup supervisor-proc-exit-listener teammgrd teamsyncd tlm_teamd

root@9cd2d740705f:/# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.5 0.0 32288 24144 pts/0 Ss+ 12:57 0:00 /usr/bin/python3 /usr/local/bin/supervisord
root 27 0.0 0.0 3988 3260 pts/1 Ss 12:57 0:00 bash
root 36 0.0 0.0 7640 2708 pts/1 R+ 12:59 0:00 ps -aux

root@9cd2d740705f:/# teamd -r -t PortChannel6 -c '{"device":"PortChannel6","hwaddr":"1c:34:da:a1:96:80","runner":{"active":true,"name":"lacp"}}' -L /var/warmboot/teamd/ -g -d
Using team device "PortChannel6".
Using PID file "/var/run/teamd/PortChannel6.pid"
This program is not intended to be run as root.

root@9cd2d740705f:/# time teamd -k -t PortChannel6

real 0m1.010s
user 0m0.004s
sys 0m0.004s

Profiler output:

root@9742295dc210:/# apt-get install linux-perf

root@9742295dc210:/# teamd -r -t PortChannel6 -c '{"device":"PortChannel6","hwaddr":"1c:34:da:a1:96:80","runner":{"active":true,"name":"lacp"}}' -L /var/warmboot/teamd/ -g -d
Using team device "PortChannel6".
Using PID file "/var/run/teamd/PortChannel6.pid"
This program is not intended to be run as root.

root@9742295dc210:/# perf_4.19 record teamd -k -t PortChannel6
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (39 samples) ]

root@9742295dc210:/# perf_4.19 report --stdio -g none --show-cpu-utilization

To display the perf.data header info, please use --header/--header-only options.

Total Lost Samples: 0

Samples: 39 of event 'cycles:ppp'

Event count (approx.): 10648759

Overhead sys usr Command Shared Object Symbol

........ ........ ........ ....... ................. .......................................

 5.46%     5.46%     0.00%  teamd    [kernel.kallsyms]          [k] filemap_map_pages 
 5.24%     0.00%     5.24%  teamd    ld-2.28.so                 [.] 0x000000000000980e
 3.72%     3.72%     0.00%  teamd    [kernel.kallsyms]          [k] syscall_return_via_sysret
 3.59%     3.59%     0.00%  teamd    [kernel.kallsyms]          [k] dget_parent
 3.57%     0.00%     3.57%  teamd    ld-2.28.so                 [.] 0x000000000001e0d0
 3.10%     0.00%     3.10%  teamd    ld-2.28.so                 [.] 0x0000000000006189
 2.99%     2.99%     0.00%  teamd    [kernel.kallsyms]          [k] do_syscall_64
 2.93%     2.93%     0.00%  teamd    [kernel.kallsyms]          [k] legitimize_path.isra.29
 2.83%     2.83%     0.00%  teamd    [kernel.kallsyms]          [k] kfree
 2.79%     0.00%     2.79%  teamd    ld-2.28.so                 [.] 0x00000000000010f0
 2.76%     0.00%     2.76%  teamd    ld-2.28.so                 [.] 0x0000000000009a95
 2.73%     0.00%     2.73%  teamd    ld-2.28.so                 [.] 0x000000000001ab70
 2.71%     2.71%     0.00%  teamd    [kernel.kallsyms]          [k] down_read_trylock
 2.69%     0.00%     2.69%  teamd    ld-2.28.so                 [.] 0x00000000000097ca
 2.67%     0.00%     2.67%  teamd    ld-2.28.so                 [.] 0x000000000000bc4e
 2.66%     0.00%     2.66%  teamd    ld-2.28.so                 [.] 0x00000000000097d3
 2.64%     0.00%     2.64%  teamd    ld-2.28.so                 [.] 0x000000000000a3e3
 2.62%     0.00%     2.62%  teamd    ld-2.28.so                 [.] 0x000000000000981e
 2.61%     0.00%     2.61%  teamd    ld-2.28.so                 [.] 0x0000000000009881
 2.60%     0.00%     2.60%  teamd    ld-2.28.so                 [.] 0x000000000000a25a
 2.60%     0.00%     2.60%  teamd    ld-2.28.so                 [.] 0x0000000000009a35
 2.59%     0.00%     2.59%  teamd    ld-2.28.so                 [.] 0x000000000001ab1c
 2.59%     0.00%     2.59%  teamd    ld-2.28.so                 [.] 0x000000000001ab76
 2.59%     0.00%     2.59%  teamd    ld-2.28.so                 [.] 0x000000000000a293
 2.58%     0.00%     2.58%  teamd    ld-2.28.so                 [.] 0x0000000000009605
 2.58%     0.00%     2.58%  teamd    libc-2.28.so               [.] 0x0000000000082690
 2.58%     0.00%     2.58%  teamd    libnl-route-3.so.200.26.0  [.] 0x000000000002b788
 2.58%     2.58%     0.00%  teamd    [kernel.kallsyms]          [k] __radix_tree_lookup
 2.18%     2.18%     0.00%  teamd    [kernel.kallsyms]          [k] fput
 2.11%     2.11%     0.00%  teamd    [kernel.kallsyms]          [k] _raw_spin_lock
 2.03%     2.03%     0.00%  teamd    [kernel.kallsyms]          [k] page_remove_rmap
 1.94%     1.94%     0.00%  teamd    [kernel.kallsyms]          [k] unmap_page_range
 1.90%     1.90%     0.00%  teamd    [kernel.kallsyms]          [k] vma_interval_tree_insert
 1.85%     1.85%     0.00%  teamd    [kernel.kallsyms]          [k] copy_user_generic_unrolled
 1.85%     1.85%     0.00%  teamd    [kernel.kallsyms]          [k] radix_tree_next_chunk
 1.85%     0.00%     1.85%  teamd    libjansson.so.4.11.1       [.] 0x0000000000003800
 0.29%     0.29%     0.00%  perf_4.  [kernel.kallsyms]          [k] x86_pmu_enable
 0.28%     0.28%     0.00%  teamd    [kernel.kallsyms]          [k] nmi_restore
 0.05%     0.05%     0.00%  perf_4.  [kernel.kallsyms]          [k] native_flush_tlb_one_user
 0.04%     0.04%     0.00%  teamd    [kernel.kallsyms]          [k] fill_pud
 0.01%     0.01%     0.00%  perf_4.  [kernel.kallsyms]          [k] native_apic_mem_write
 0.01%     0.01%     0.00%  teamd    [kernel.kallsyms]          [k] native_apic_mem_write
 0.00%     0.00%     0.00%  perf_4.  [kernel.kallsyms]          [k] native_write_msr
 0.00%     0.00%     0.00%  teamd    [kernel.kallsyms]          [k] native_write_msr

Summary:

python2.7:
root@9742295dc210:/# ./run.sh | grep -a spin
Using team device "PortChannel6".
Using PID file "/var/run/teamd/PortChannel6.pid"
This program is not intended to be run as root.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB (null) ]

python3:
root@9cd2d740705f:/# ./run.sh | grep -a spin
Using team device "PortChannel6".
Using PID file "/var/run/teamd/PortChannel6.pid"
This program is not intended to be run as root.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB (null) ]
2.84% 2.84% 0.00% teamd [kernel.kallsyms] [k] _raw_spin_lock

IMHO, the culprit is python2.7 -> python3 migration.

P.S: teamd -k call fails constantly with a timeout when python2.7/python3 is used as docker entrypoint and no issues observed with bash

qiluo-msft pushed a commit to sonic-net/sonic-swss that referenced this issue Sep 17, 2021
…o PID. (#1841)

This PR is intended to fix LAGs cleanup degradation caused by python2.7 -> python3 migration.
The approach is to replace `teamd -k -t` call with the raw `SIGTERM` and add PID alive check.
This will make sure the `teammgrd` is stopped only after all managed processes are being killed.

resolves: sonic-net/sonic-buildimage#8071

**What I did**
* Replaced `teamd -k -t` call with raw `SIGTERM`
* Added PID alive check

**Why I did it**
* To fix LAGs cleanup timeout issue caused by python2.7 -> python3 upgrade

**How I verified it**
1. Configure 64 LAG RIFs
2. Reload config
nazariig added a commit to nazariig/sonic-swss that referenced this issue Sep 17, 2021
…c-net#1841)

This PR is intended to fix LAGs cleanup degradation caused by python2.7 -> python3 migration.
The approach is to replace `teamd -k -t` call with the raw `SIGTERM` and add PID alive check.
This will make sure the `teammgrd` is stopped only after all managed processes are being killed.

resolves: sonic-net/sonic-buildimage#8071

**What I did**
* Replaced `teamd -k -t` call with raw `SIGTERM`
* Added PID alive check

**Why I did it**
* To fix LAGs cleanup timeout issue caused by python2.7 -> python3 upgrade

**How I verified it**
1. Configure 64 LAG RIFs
2. Reload config
qiluo-msft pushed a commit to sonic-net/sonic-swss that referenced this issue Sep 20, 2021
This PR is intended to fix LAGs cleanup degradation caused by python2.7 -> python3 migration.
The approach is to replace `teamd -k -t` call with the raw `SIGTERM` and add PID alive check.
This will make sure the `teammgrd` is stopped only after all managed processes are being killed.

resolves: sonic-net/sonic-buildimage#8071

**What I did**
* Replaced `teamd -k -t` call with raw `SIGTERM`
* Added PID alive check

**Why I did it**
* To fix LAGs cleanup timeout issue caused by python2.7 -> python3 upgrade

**How I verified it**
1. Configure 64 LAG RIFs
2. Reload config

**Details if related**
* N/A
judyjoseph pushed a commit to sonic-net/sonic-swss that referenced this issue Sep 27, 2021
…o PID. (#1841)

This PR is intended to fix LAGs cleanup degradation caused by python2.7 -> python3 migration.
The approach is to replace `teamd -k -t` call with the raw `SIGTERM` and add PID alive check.
This will make sure the `teammgrd` is stopped only after all managed processes are being killed.

resolves: sonic-net/sonic-buildimage#8071

**What I did**
* Replaced `teamd -k -t` call with raw `SIGTERM`
* Added PID alive check

**Why I did it**
* To fix LAGs cleanup timeout issue caused by python2.7 -> python3 upgrade

**How I verified it**
1. Configure 64 LAG RIFs
2. Reload config
raphaelt-nvidia pushed a commit to raphaelt-nvidia/sonic-swss that referenced this issue Oct 5, 2021
…o PID. (sonic-net#1841)

This PR is intended to fix LAGs cleanup degradation caused by python2.7 -> python3 migration.
The approach is to replace `teamd -k -t` call with the raw `SIGTERM` and add PID alive check.
This will make sure the `teammgrd` is stopped only after all managed processes are being killed.

resolves: sonic-net/sonic-buildimage#8071

**What I did**
* Replaced `teamd -k -t` call with raw `SIGTERM`
* Added PID alive check

**Why I did it**
* To fix LAGs cleanup timeout issue caused by python2.7 -> python3 upgrade

**How I verified it**
1. Configure 64 LAG RIFs
2. Reload config
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants