[Metricbeat] Flaky test in Elasticsearch module #10866

Closed
sayden opened this issue Feb 21, 2019 · 11 comments
Labels: Feature:Stack Monitoring · flaky-test (Unstable or unreliable test cases) · Metricbeat

Comments


sayden commented Feb 21, 2019

Flaky Test

Stack Trace

```
17:42:59 --- FAIL: TestFetch (115.91s)
17:42:59 	assertions.go:256: 
17:42:59 			Error Trace:	elasticsearch_integration_test.go:81
17:42:59 			Error:      	Received unexpected error:
17:42:59 			            	Put http://elasticsearch:9200/rats/_ccr/follow: EOF
17:42:59 			Test:       	TestFetch
17:42:59     --- FAIL: TestFetch/ccr (0.00s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/ccr
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/ccr
17:42:59     --- FAIL: TestFetch/cluster_stats (0.00s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/cluster_stats
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/cluster_stats
17:42:59     --- FAIL: TestFetch/index (0.00s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/index
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/index
17:42:59     --- FAIL: TestFetch/index_recovery (0.01s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/index_recovery
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/index_recovery
17:42:59     --- FAIL: TestFetch/index_summary (0.01s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/index_summary
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/index_summary
17:42:59     --- FAIL: TestFetch/ml_job (0.01s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/ml_job
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/ml_job
17:42:59     --- FAIL: TestFetch/node (0.01s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error making http request: Get http://elasticsearch:9200/_nodes/_local: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/node
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/node
17:42:59     --- FAIL: TestFetch/node_stats (0.00s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error making http request: Get http://elasticsearch:9200/_nodes/_local/stats: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/node_stats
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/node_stats
17:42:59     --- FAIL: TestFetch/shard (0.00s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/shard
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/shard
17:42:59 Killing metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_http_1 ... 
17:42:59 
17:42:59 Killing metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_http_1 ... done
17:42:59 Killing metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_haproxy_1 ... 
17:42:59 
17:42:59 Killing metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_haproxy_1 ... done
17:42:59 Building elasticsearch
17:42:59 Step 1/2 : FROM docker.elastic.co/elasticsearch/elasticsearch:6.6.0
17:42:59  ---> 13aa43015aa1
17:42:59 Step 2/2 : HEALTHCHECK --interval=1s --retries=300 CMD curl -f http://localhost:9200/_xpack/license
17:42:59  ---> Using cache
17:42:59  ---> 92eb12fcce35
17:42:59 Successfully built 92eb12fcce35
17:42:59 Successfully tagged metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_elasticsearch:latest
17:42:59 Recreating metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_elasticsearch_1 ... 
17:42:59 
17:42:59 Recreating metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_elasticsearch_1 ... done
17:42:59 FAIL
17:42:59 coverage: 14.4% of statements
17:42:59 FAIL	github.com/elastic/beats/metricbeat/module/elasticsearch	163.051s
```
@sayden added the Metricbeat, flaky-test (Unstable or unreliable test cases), and Team:Integrations labels on Feb 21, 2019

sayden commented Feb 21, 2019

The error originates here #L81, so it might be something related to the setup of CCR, specifically an EOF when setting up the follower index #L278.

Once that fails, all the subsequent failures here #L89 and here #L90 seem related.
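
For context, here is a minimal, self-contained sketch of the kind of follower-index request involved; the helper name and body values are assumptions for illustration, not the actual test code. The `PUT http://elasticsearch:9200/rats/_ccr/follow` in the trace above is the equivalent call that returns EOF:

```go
// ccrsketch is an illustrative example; createFollowerIndex, "local" and
// "leader" are assumed names, not taken from the Metricbeat test code.
package ccrsketch

import (
	"bytes"
	"fmt"
	"net/http"
)

// createFollowerIndex issues a CCR follow request against the given
// Elasticsearch URL, the same shape of call that fails with EOF above.
func createFollowerIndex(esURL, followerIndex string) error {
	body := []byte(`{"remote_cluster": "local", "leader_index": "leader"}`)

	req, err := http.NewRequest(http.MethodPut, esURL+"/"+followerIndex+"/_ccr/follow", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// A connection closed abruptly by the server surfaces here as "EOF".
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return nil
}
```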

@elasticmachine

Pinging @elastic/stack-monitoring

@alvarolobato removed the Team:Integrations label on Feb 25, 2019
@cachedout self-assigned this on Feb 25, 2019
cachedout added a commit to cachedout/beats that referenced this issue Feb 25, 2019
Refs elastic#10866

There was a bug fixed in Go a while ago which looks very much
like what's happening here. When a series of requests were made
with the HTTP client, a race condition would be triggered wherein
the first request would return an EOF error and subsequent requests
would fail. This was a bug in the connection reuse code.

This is supposedly fixed upstream, and discussed in depth here:

golang/go#4677

However, this is similar enough to what we are seeing that I disabled
connection re-use in these tests to see if things improve for us.

If that does not resolve it, there may be no harm in simply adding
some 1s sleeps here as an easy mitigation path.
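
For illustration, a minimal sketch of what disabling connection re-use looks like with net/http, assuming a plain http.Client (the real test helper may build its client differently):

```go
// Illustrative only: a plain net/http client with connection re-use turned off.
package esintegtest

import (
	"net/http"
	"time"
)

// newNoReuseClient returns a client that opens a fresh TCP connection for
// every request instead of reusing idle connections from the pool, which
// avoids EOF errors caused by reusing a connection the server already closed.
func newNoReuseClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			DisableKeepAlives: true,
		},
	}
}
```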


ruflin commented Mar 6, 2019

@cachedout Some Elasticsearch test errors went away when I commented out the ccr tests. Not sure if it's related.

@cachedout

@ruflin Hmm. I have seen other issues around those tests in recent days. I, too, suspect that something funny is going on with them.

@ycombinator

I also couldn't exactly reproduce the EOF error reported in this issue. However, while investigating, I did find another potential source of flakiness, which I'm trying to address in #11172.

ruflin pushed a commit that referenced this issue Mar 13, 2019
After repeatedly running the Elasticsearch module integration test in Metricbeat, I found that sometimes Elasticsearch doesn't get enough time to perform CCR and generate CCR stats. This causes the following error, but only sometimes:

```
--- FAIL: TestFetch (2.44s)
    --- FAIL: TestFetch/ccr (0.08s)
        elasticsearch_integration_test.go:92:
                Error Trace:    elasticsearch_integration_test.go:92
                Error:          Should NOT be empty, but was []
                Test:           TestFetch/ccr
```

So this PR adds a 300ms sleep to give Elasticsearch enough time to perform CCR and generate CCR stats. After testing various sleep durations, I found that 300ms seemed to be the lowest (round) value I could use that consistently passed this test.

Possibly related: #10866
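
In outline, the mitigation looks roughly like the sketch below; the wrapper function and its callbacks are hypothetical stand-ins, not the actual test code:

```go
// Illustrative sketch of the mitigation from #11172.
package esintegtest

import "time"

// fetchCCRWithSettleDelay runs the CCR setup, pauses briefly so Elasticsearch
// has time to generate CCR stats, then runs the fetch.
func fetchCCRWithSettleDelay(setup func() error, fetch func() error) error {
	if err := setup(); err != nil {
		return err
	}
	// 300ms was reported as the smallest round value that consistently
	// produced non-empty ccr stats in the test.
	time.Sleep(300 * time.Millisecond)
	return fetch()
}
```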
@ycombinator

I did some more testing here. It appears that the EOF error happens when the ES Docker container has started but the ES process inside the container hasn't fully started up yet. If the ES integration test starts making ES REST API calls during that window, it hits the EOF error.
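
One way to avoid hitting that window, sketched below with hypothetical names (this is not the fix that was ultimately applied), is to poll Elasticsearch until it actually answers requests before the test begins:

```go
// Illustrative only: block until Elasticsearch answers HTTP requests.
package esintegtest

import (
	"fmt"
	"net/http"
	"time"
)

// waitForElasticsearch returns once ES responds with 200 OK on the cluster
// health endpoint, or an error if it never becomes ready within the timeout.
func waitForElasticsearch(esURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(esURL + "/_cluster/health")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // the ES process inside the container is up
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("elasticsearch at %s not ready after %s", esURL, timeout)
}
```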

@ycombinator

I'm going to be debugging the ES test flakiness over here in this draft PR: #11224

ycombinator added a commit that referenced this issue Jun 6, 2019
* Add sleep to allow ES sufficient time for CCR (#11172)

After repeatedly running the Elasticsearch module integration test in Metricbeat, I found that sometimes Elasticsearch doesn't get enough time to perform CCR and generate CCR stats. This causes the following error, but only sometimes:

```
--- FAIL: TestFetch (2.44s)
    --- FAIL: TestFetch/ccr (0.08s)
        elasticsearch_integration_test.go:92:
                Error Trace:    elasticsearch_integration_test.go:92
                Error:          Should NOT be empty, but was []
                Test:           TestFetch/ccr
```

So this PR adds a 300ms sleep to give Elasticsearch enough time to perform CCR and generate CCR stats. After testing various sleep durations, I found that 300ms seemed to be the lowest (round) value I could use that consistently passed this test.

Possibly related: #10866

* Fixing formatting
@cachedout assigned ycombinator and unassigned cachedout on Jul 17, 2019
@cachedout

@ycombinator Since you have a PR up, I reassigned this. Please let me know if that's all right.

@ycombinator

@sayden The flaky tests mentioned in this issue have been un-skipped since June 4. AFAICT there haven't been any flakiness issues since then. Are you okay resolving this issue now?

@ycombinator

@sayden I'm closing this issue per my previous comment. If you disagree, feel free to reopen.

leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023
…lastic#12437)

* Add sleep to allow ES sufficient time for CCR (elastic#11172)

After repeatedly running the Elasticsearch module integration test in Metricbeat, I found that sometimes Elasticsearch doesn't get enough time to perform CCR and generate CCR stats. This causes the following error, but only sometimes:

```
--- FAIL: TestFetch (2.44s)
    --- FAIL: TestFetch/ccr (0.08s)
        elasticsearch_integration_test.go:92:
                Error Trace:    elasticsearch_integration_test.go:92
                Error:          Should NOT be empty, but was []
                Test:           TestFetch/ccr
```

So this PR adds a 300ms sleep to give Elasticsearch enough time to perform CCR and generate CCR stats. After testing various sleep durations, I found that 300ms seemed to be the lowest (round) value I could use that consistently passed this test.

Possibly related: elastic#10866

* Fixing formatting