Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transactional producer fixes #3971

Merged
merged 7 commits into from
Sep 20, 2022

Conversation

emasab
Copy link
Contributor

@emasab emasab commented Sep 8, 2022

This is a collection of fixes developed while running different long running chaos tests with single and multiple brokers and several configurations.

  • When requesting an abort with drain and bump after message timeouts, the bump was executed first, causing a series of retries on InitProducerIdRequest until a transaction timeout happen and the producer was fenced. This commit changes the order of the operations, starting the drain bump after the complete transaction ack.
  • While a commit operation was in queue, a timeout happens that can cause an abort. The state changes from COMMITTING_TRANSACTION to ABORTABLE_ERROR to ABORTING_TRANSACTION. When the broker comes up the error is permanent or fatal because the state has changed from the initial one.
  • When doing a drain and bump txn_curr_coord is not null in state WAIT_TRANSPORT, but if txn_coord has to be requested or broker is down it's retried. During this retry the txn_curr_coord can be set to null after a COORDINATOR_NOT_AVAILABLE error so when it comes back to the WAIT_TRANSPORT case a fatal error happens in the assert. It has to query for a transaction coordinator and retry.
  • When aborting a transaction, if a local TIMED_OUT, TIMED_OUT_QUEUE or OUTDATED error happens, the error was not retriable (nor fenced or abortable) so rd_kafka_txn_op_abort_transaction_ack is never called, caller code doesn't know what to do so starts a new transaction but begin transaction fails because transactional state is not READY. These errors are now retriable.
  • When calling TxnOffsetCommit, if retries were needed they were made at full speed, causing too much logging and network calls. A one second timeout has been added between retries.

@emasab emasab changed the base branch from master to rel193-rc2 September 8, 2022 13:57
@emasab emasab marked this pull request as draft September 9, 2022 08:47
@emasab emasab force-pushed the bugfix/fenced-producer-after-disconnection branch 3 times, most recently from 45a6035 to d161222 Compare September 16, 2022 10:30
@emasab emasab marked this pull request as ready for review September 16, 2022 10:31
The producer was incorrectly fenced after a temporary
coordinator unavailability. It was calling InitProducerId to
bump the epoch before aborting the transaction, while the correct
order is the other way around.
Invalid transaction state transition attempted:
AbortingTransaction -> CommitNotAcked
the fatal error rd_kafka_idemp_pid_fsm: Assertion
`rk->rk_eos.txn_curr_coord' failed
and outdated errors, causing Operation not valid
in state AbortingTransaction or AbortedNotAcked
@emasab emasab force-pushed the bugfix/fenced-producer-after-disconnection branch from d161222 to 9f814a6 Compare September 16, 2022 10:34
and TxnOffsetCommitRequest producing
too much logging and network requests
@emasab emasab force-pushed the bugfix/fenced-producer-after-disconnection branch from 9f814a6 to 1e8c576 Compare September 16, 2022 10:44
@emasab emasab force-pushed the bugfix/fenced-producer-after-disconnection branch from 1e8c576 to 1bd5ad6 Compare September 19, 2022 11:00
Copy link
Contributor

@mhowlett mhowlett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor comments.
LGTM, for an RC release, given the amount of testing you've done and earlier discussions.

@emasab emasab merged commit 067cab7 into rel193-rc2 Sep 20, 2022
@emasab emasab deleted the bugfix/fenced-producer-after-disconnection branch September 20, 2022 19:38
@emasab emasab changed the title Fix for fenced producer after disconnection Transactional producer fixes Sep 20, 2022
@emasab emasab restored the bugfix/fenced-producer-after-disconnection branch October 7, 2022 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants