0

I'm facing an issue with my 3-node Kafka cluster, and perhaps you can help me troubleshoot.

Several times a day, I notice that my Kafka service restarts on my servers, and I'm struggling to understand why.

For instance, this morning around 09:15 AM, when checking the journalctl logs, I can see that there was a restart.

Jul 01 05:56:39 myserver kafka-server-start.sh[57215]: WARNING: All illegal access operations will be denied in a future release
Jul 01 09:15:02 myserver systemd[1]: Stopping kafka...
Jul 01 09:15:07 myserver systemd[1]: kafka.service: Main process exited, code=exited, status=143/n/a
Jul 01 09:15:07 myserver systemd[1]: kafka.service: Failed with result 'exit-code'.
Jul 01 09:15:07 myserver systemd[1]: Stopped kafka.
Jul 01 09:15:07 myserver systemd[1]: Started kafka.
Jul 01 09:16:39 myserver kafka-server-start.sh[61301]: WARNING: An illegal reflective access operation has occurred
Jul 01 09:16:39 myserver kafka-server-start.sh[61301]: WARNING: Illegal reflective access by org.xerial.snappy.pure.UnsafeUtil (file:/opt/kafka/kafka_2.13-3.2.3/libs/snappy-java-1.1.8.4.jar) t>
Jul 01 09:16:39 myserver kafka-server-start.sh[61301]: WARNING: Please consider reporting this to the maintainers of org.xerial.snappy.pure.UnsafeUtil
Jul 01 09:16:39 myserver kafka-server-start.sh[61301]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Jul 01 09:16:39 myserver kafka-server-start.sh[61301]: WARNING: All illegal access operations will be denied in a future release

Since the return code is 143, does that mean Kafka received a SIGTERM signal?

If I look at the Zookeeper logs at that time, here's what I find:

2024-07-01 09:14:26,474 [myid:2] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1396] - Connection broken for id 1, my id = 2
java.io.EOFException
        at java.base/java.io.DataInputStream.readInt(Unknown Source)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1384)
2024-07-01 09:14:26,476 [myid:2] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1402] - Interrupting SendWorker thread from RecvWorker. sid: 1. myId: 2
2024-07-01 09:14:26,480 [myid:2] - WARN  [SendWorker:1:QuorumCnxManager$SendWorker@1282] - Interrupted while waiting for message on queue
java.lang.InterruptedException
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
        at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1447)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:98)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1271)
2024-07-01 09:14:26,482 [myid:2] - WARN  [SendWorker:1:QuorumCnxManager$SendWorker@1294] - Send worker leaving thread id 1 my id = 2
2024-07-01 09:14:28,602 [myid:2] - INFO  [ListenerHandler-myserverfv1.com/10.10.1.106:3888:QuorumCnxManager$Listener$ListenerHandler@1070] - Received connection request from /10.58.1.69:44330
2024-07-01 09:14:28,657 [myid:2] - INFO  [QuorumConnectionThread-[myid=2]-14:SaslQuorumServerCallbackHandler@143] - Successfully authenticated learner: authenticationID=learner;  authorizationID=learner.
2024-07-01 09:14:28,658 [myid:2] - INFO  [QuorumConnectionThread-[myid=2]-14:SaslQuorumAuthServer@110] - Successfully completed the authentication using SASL. learner addr: /10.58.1.69:44330
2024-07-01 09:14:28,669 [myid:2] - INFO  [QuorumConnectionThread-[myid=2]-15:SecurityUtils@70] - QuorumLearner will use DIGEST-MD5 as SASL mechanism.
2024-07-01 09:14:28,673 [myid:2] - INFO  [QuorumConnectionThread-[myid=2]-15:SaslQuorumAuthLearner@142] - Successfully completed the authentication using SASL. server addr: /10.58.1.69:3888, status: SUCCESS
2024-07-01 09:14:28,687 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:FOLLOWING; n.sid:1, n.state:LOOKING, n.leader:1, n.round:0x1, n.peerEpoch:0x78, n.zxid:0x78000020ea, message format version:0x2, n.config version:0x0

And here's the content of the Kafka logs:

<[2024-07-01 09:14:32,208][KAFKA][INFO][kafka.server.ReplicaFetcherThread][ReplicaFetcherThread-0-1] [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Partition qual-qrs-4 has an older epoch (1783) than the current leader. Will await the new LeaderAndIsr state before resuming fetching.>
<[2024-07-01 09:14:32,208][KAFKA][WARN][kafka.server.ReplicaFetcherThread][ReplicaFetcherThread-0-1] [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Partition qual-qrs-4 marked as failed>
<[2024-07-01 09:14:32,208][KAFKA][INFO][kafka.server.ReplicaFetcherThread][ReplicaFetcherThread-0-1] [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Partition qual-qualifier-12 has an older epoch (1788) than the current leader. Will await the new LeaderAndIsr state before resuming fetching.>
<[2024-07-01 09:14:32,208][KAFKA][WARN][kafka.server.ReplicaFetcherThread][ReplicaFetcherThread-0-1] [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Partition qual-qualifier-12 marked as failed>
<[2024-07-01 09:14:32,208][KAFKA][INFO][kafka.server.ReplicaFetcherThread][ReplicaFetcherThread-0-1] [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Partition qual-qualifier-6 has an older epoch (1786) than the current leader. Will await the new LeaderAndIsr state before resuming fetching.>
<[2024-07-01 09:14:32,208][KAFKA][WARN][kafka.server.ReplicaFetcherThread][ReplicaFetcherThread-0-1] [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Partition qual-qualifier-6 marked as failed>
<[2024-07-01 09:14:32,208][KAFKA][INFO][kafka.server.ReplicaFetcherThread][ReplicaFetcherThread-0-1] [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Partition qual-qualifier-24 has an older epoch (1773) than the current leader. Will await the new LeaderAndIsr state before resuming fetching.>
<[2024-07-01 09:14:32,208][KAFKA][WARN][kafka.server.ReplicaFetcherThread][ReplicaFetcherThread-0-1] [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Partition qual-qualifier-24 marked as failed>
<[2024-07-01 09:14:32,208][KAFKA][INFO][kafka.server.ReplicaFetcherThread][ReplicaFetcherThread-0-1] [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Partition qual-qualifier-18 has an older epoch (1779) than the current leader. Will await the new LeaderAndIsr state before resuming fetching.>
<[2024-07-01 09:14:32,208][KAFKA][WARN][kafka.server.ReplicaFetcherThread][ReplicaFetcherThread-0-1] [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Partition qual-qualifier-18 marked as failed>
<[2024-07-01 09:14:32,208][KAFKA][INFO][kafka.server.ReplicaFetcherThread][ReplicaFetcherThread-0-1] [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Partition qual-qualifier-30 has an older epoch (1778) than the current leader. Will await the new LeaderAndIsr state before resuming fetching.>
<[2024-07-01 09:14:32,208][KAFKA][WARN][kafka.server.ReplicaFetcherThread][ReplicaFetcherThread-0-1] [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Partition qual-qualifier-30 marked as failed>

This happens several times a day and results in communication errors between Kafka brokers, with errors such as these:

<[2024-07-01 09:15:28,515][KAFKA][ERROR][kafka.server.ReplicaManager][data-plane-kafka-request-handler-1] [ReplicaManager broker=1] Error processing append operation on partition __transaction_state-7org.apache.kafka.common.errors.NotEnoughReplicasException: The size of the current ISR Set(1) is insufficient to satisfy the min.isr requirement of 2 for partition __transaction_state-7
>

Here is my configuration: RHEL 8.8 kafka_2.13-3.2.3 zookeeper 3.6.3 openjdk 11.0.22 2024-01-16

Thank you for your help

I modified my Kafka server.properties configuration file and upgraded my JDK version.

Here's an excerpt of the diff between the old file (server.properties.before-update-jdk) and the new one (server.properties):

[root@myserver config]# diff server.properties.before-update-jdk server.properties
17,19c17,18
< listeners=METIER://:9093,ADMIN://:29093
< advertised.listeners=METIER://kafka1.qual.com:9093,ADMIN://agbt00j4bv0.enedis.fr:29093
< listener.security.protocol.map=METIER:SASL_SSL,ADMIN:SASL_SSL
---

AFTER

> listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL,METIER:SASL_SSL,ADMIN:SSL
>
20a20,23
>
> listeners=METIER://:9093,ADMIN://:29093
> advertised.listeners=METIER://kafka1.qual.com:9093,ADMIN://kafka1.qual.com:29093
>
24d26
<

0

Browse other questions tagged or ask your own question.