[Metricbeat] Flaky test in Elasticsearch module #10866

Closed
sayden opened this issue Feb 21, 2019 · 11 comments
Labels: Feature:Stack Monitoring · flaky-test (Unstable or unreliable test cases) · Metricbeat

Comments


sayden commented Feb 21, 2019

Flaky Test

Stack Trace

```
17:42:59 --- FAIL: TestFetch (115.91s)
17:42:59 	assertions.go:256: 
17:42:59 			Error Trace:	elasticsearch_integration_test.go:81
17:42:59 			Error:      	Received unexpected error:
17:42:59 			            	Put http://elasticsearch:9200/rats/_ccr/follow: EOF
17:42:59 			Test:       	TestFetch
17:42:59     --- FAIL: TestFetch/ccr (0.00s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/ccr
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/ccr
17:42:59     --- FAIL: TestFetch/cluster_stats (0.00s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/cluster_stats
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/cluster_stats
17:42:59     --- FAIL: TestFetch/index (0.00s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/index
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/index
17:42:59     --- FAIL: TestFetch/index_recovery (0.01s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/index_recovery
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/index_recovery
17:42:59     --- FAIL: TestFetch/index_summary (0.01s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/index_summary
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/index_summary
17:42:59     --- FAIL: TestFetch/ml_job (0.01s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/ml_job
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/ml_job
17:42:59     --- FAIL: TestFetch/node (0.01s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error making http request: Get http://elasticsearch:9200/_nodes/_local: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/node
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/node
17:42:59     --- FAIL: TestFetch/node_stats (0.00s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error making http request: Get http://elasticsearch:9200/_nodes/_local/stats: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/node_stats
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/node_stats
17:42:59     --- FAIL: TestFetch/shard (0.00s)
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:89
17:42:59     			Error:      	Should be empty, but was [error determining if connected Elasticsearch node is master: error making http request: Get http://elasticsearch:9200/_nodes/_local/nodes: dial tcp 172.18.0.7:9200: connect: connection refused]
17:42:59     			Test:       	TestFetch/shard
17:42:59     	assertions.go:256: 
17:42:59     			Error Trace:	elasticsearch_integration_test.go:90
17:42:59     			Error:      	Should NOT be empty, but was []
17:42:59     			Test:       	TestFetch/shard
17:42:59 Killing metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_http_1 ... 
17:42:59 
17:42:59 Killing metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_http_1 ... done
17:42:59 Killing metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_haproxy_1 ... 
17:42:59 
17:42:59 Killing metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_haproxy_1 ... done
17:42:59 Building elasticsearch
17:42:59 Step 1/2 : FROM docker.elastic.co/elasticsearch/elasticsearch:6.6.0
17:42:59  ---> 13aa43015aa1
17:42:59 Step 2/2 : HEALTHCHECK --interval=1s --retries=300 CMD curl -f http://localhost:9200/_xpack/license
17:42:59  ---> Using cache
17:42:59  ---> 92eb12fcce35
17:42:59 Successfully built 92eb12fcce35
17:42:59 Successfully tagged metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_elasticsearch:latest
17:42:59 Recreating metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_elasticsearch_1 ... 
17:42:59 
17:42:59 Recreating metricbeatb8f075ca3ba9f0ca8d77c37c7ab1a9ce06333331_elasticsearch_1 ... done
17:42:59 FAIL
17:42:59 coverage: 14.4% of statements
17:42:59 FAIL	github.com/elastic/beats/metricbeat/module/elasticsearch	163.051s
```
@sayden added the Metricbeat, flaky-test (Unstable or unreliable test cases), and Team:Integrations labels on Feb 21, 2019

sayden commented Feb 21, 2019

The error originates here #L81, so it might be something related to the setup of CCR, specifically an EOF when setting up the follower index #L278.

Once that fails, all the subsequent failures here #L89 and here #L90 seem related.
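
For context, here is a minimal, self-contained sketch of the kind of follower-index request involved; the helper name and body values are assumptions for illustration, not the actual test code. The `PUT http://elasticsearch:9200/rats/_ccr/follow` in the trace above is the equivalent call that returns EOF:

```go
// ccrsketch is an illustrative example; createFollowerIndex, "local" and
// "leader" are assumed names, not taken from the Metricbeat test code.
package ccrsketch

import (
	"bytes"
	"fmt"
	"net/http"
)

// createFollowerIndex issues a CCR follow request against the given
// Elasticsearch URL, the same shape of call that fails with EOF above.
func createFollowerIndex(esURL, followerIndex string) error {
	body := []byte(`{"remote_cluster": "local", "leader_index": "leader"}`)

	req, err := http.NewRequest(http.MethodPut, esURL+"/"+followerIndex+"/_ccr/follow", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// A connection closed abruptly by the server surfaces here as "EOF".
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return nil
}
```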

@elasticmachine

Pinging @elastic/stack-monitoring

@alvarolobato removed the Team:Integrations label on Feb 25, 2019
@cachedout self-assigned this on Feb 25, 2019
cachedout added a commit to cachedout/beats that referenced this issue Feb 25, 2019
Refs elastic#10866

There was a bug fixed in Go a while ago which looks very much
like what's happening here. When a series of requests were made
with the HTTP client, a race condition would be triggered wherein
the first request would return an EOF error and subsequent requests
would fail. This was a bug in the connection reuse code.

This is supposedly fixed upstream, and discussed in depth here:

golang/go#4677

However, this is similar enough to what we are seeing that I disabled
connection re-use in these tests to see if things improve for us.

If that does not resolve it, there may be no harm in simply adding
some 1s sleeps here as an easy mitigation path.
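
For illustration, a minimal sketch of what disabling connection re-use looks like with net/http, assuming a plain http.Client (the real test helper may build its client differently):

```go
// Illustrative only: a plain net/http client with connection re-use turned off.
package esintegtest

import (
	"net/http"
	"time"
)

// newNoReuseClient returns a client that opens a fresh TCP connection for
// every request instead of reusing idle connections from the pool, which
// avoids EOF errors caused by reusing a connection the server already closed.
func newNoReuseClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			DisableKeepAlives: true,
		},
	}
}
```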


ruflin commented Mar 6, 2019

@cachedout Some Elasticsearch test errors went away when I commented out the ccr tests. Not sure if it's related.

@cachedout

@ruflin Hmm. I have seen other issues around those tests in recent days. I, too, suspect that something funny is going on with them.

@ycombinator

I also couldn't exactly reproduce the EOF error reported in this issue. However, while investigating, I did find another potential source of flakiness, which I'm trying to address in #11172.

ruflin pushed a commit that referenced this issue Mar 13, 2019
After repeatedly running the Elasticsearch module integration test in Metricbeat, I found that sometimes Elasticsearch doesn't get enough time to perform CCR and generate CCR stats. This causes the following error, but only sometimes:

```
--- FAIL: TestFetch (2.44s)
    --- FAIL: TestFetch/ccr (0.08s)
        elasticsearch_integration_test.go:92:
                Error Trace:    elasticsearch_integration_test.go:92
                Error:          Should NOT be empty, but was []
                Test:           TestFetch/ccr
```

So this PR adds a 300ms sleep to give Elasticsearch enough time to perform CCR and generate CCR stats. After testing various sleep durations, I found that 300ms seemed to be the lowest (round) value I could use that consistently passed this test.

Possibly related: #10866
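
In outline, the mitigation looks roughly like the sketch below; the wrapper function and its callbacks are hypothetical stand-ins, not the actual test code:

```go
// Illustrative sketch of the mitigation from #11172.
package esintegtest

import "time"

// fetchCCRWithSettleDelay runs the CCR setup, pauses briefly so Elasticsearch
// has time to generate CCR stats, then runs the fetch.
func fetchCCRWithSettleDelay(setup func() error, fetch func() error) error {
	if err := setup(); err != nil {
		return err
	}
	// 300ms was reported as the smallest round value that consistently
	// produced non-empty ccr stats in the test.
	time.Sleep(300 * time.Millisecond)
	return fetch()
}
```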
@ycombinator

I did some more testing here. It appears that the EOF error happens when the ES Docker container has started but the ES process inside the container hasn't fully started up yet. If the ES integration test starts making ES REST API calls during that window, it hits the EOF error.
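
One way to avoid hitting that window, sketched below with hypothetical names (this is not the fix that was ultimately applied), is to poll Elasticsearch until it actually answers requests before the test begins:

```go
// Illustrative only: block until Elasticsearch answers HTTP requests.
package esintegtest

import (
	"fmt"
	"net/http"
	"time"
)

// waitForElasticsearch returns once ES responds with 200 OK on the cluster
// health endpoint, or an error if it never becomes ready within the timeout.
func waitForElasticsearch(esURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(esURL + "/_cluster/health")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // the ES process inside the container is up
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("elasticsearch at %s not ready after %s", esURL, timeout)
}
```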

@ycombinator

I'm going to be debugging the ES test flakiness over here in this draft PR: #11224

ycombinator added a commit that referenced this issue Jun 6, 2019
* Add sleep to allow ES sufficient time for CCR (#11172)

After repeatedly running the Elasticsearch module integration test in Metricbeat, I found that sometimes Elasticsearch doesn't get enough time to perform CCR and generate CCR stats. This causes the following error, but only sometimes:

```
--- FAIL: TestFetch (2.44s)
    --- FAIL: TestFetch/ccr (0.08s)
        elasticsearch_integration_test.go:92:
                Error Trace:    elasticsearch_integration_test.go:92
                Error:          Should NOT be empty, but was []
                Test:           TestFetch/ccr
```

So this PR adds a 300ms sleep to give Elasticsearch enough time to perform CCR and generate CCR stats. After testing various sleep durations, I found that 300ms seemed to be the lowest (round) value I could use that consistently passed this test.

Possibly related: #10866

* Fixing formatting
@cachedout assigned ycombinator and unassigned cachedout on Jul 17, 2019
@cachedout

@ycombinator Since you have a PR up, I reassigned this. Please let me know if that's all right.

@ycombinator

@sayden The flaky tests mentioned in this issue have been un-skipped since June 4. AFAICT there haven't been any flakiness issues since then. Are you okay resolving this issue now?

@ycombinator

@sayden I'm closing this issue per my previous comment. If you disagree, feel free to reopen.

leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023
…lastic#12437)

* Add sleep to allow ES sufficient time for CCR (elastic#11172)

After repeatedly running the Elasticsearch module integration test in Metricbeat, I found that sometimes Elasticsearch doesn't get enough time to perform CCR and generate CCR stats. This causes the following error, but only sometimes:

```
--- FAIL: TestFetch (2.44s)
    --- FAIL: TestFetch/ccr (0.08s)
        elasticsearch_integration_test.go:92:
                Error Trace:    elasticsearch_integration_test.go:92
                Error:          Should NOT be empty, but was []
                Test:           TestFetch/ccr
```

So this PR adds a 300ms sleep to give Elasticsearch enough time to perform CCR and generate CCR stats. After testing various sleep durations, I found that 300ms seemed to be the lowest (round) value I could use that consistently passed this test.

Possibly related: elastic#10866

* Fixing formatting