Kafka, Windows, Logs, and KAFKA-1194

Overview

This KB article discusses issues related to Windows that put limitations on Kafka “log file” cleanup operations, workarounds for them, and more permanent solutions.

Kafka’s use of the term “log file” is (confusingly) different from the standard sysadmin definition of a “log file” (i.e., a sequential, chronologically ordered list of informational, warning, or debug statements documenting the operation of the system). The term “database” (DB) may be more appropriate, as Kafka uses “log files” to actually store and process the event queue/messages.

Since Kafka also has the concept of a traditional log file, for the purpose of this article we will henceforth use the term database to refer to the files that Kafka uses to store and process the event queue/messages, and log files to refer to the standard sysadmin definition of a log file.

Applies to

  • Syndeia Cloud 3.3

  • Syndeia Cloud 3.4

Symptoms

In the Kafka service wrapped log file (located in %SystemDrive%\cygwin64\opt\kafka-current\logs) you will notice errors similar to the following (the key text to note here is: Error Failed to clean up log for [...] The process cannot access the file because it is being used by another process):

INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46|[2022-01-19 10:44:46,289] ERROR Failed to clean up log for __consumer_offsets-30 in dir C:\cygwin64\opt\kafka-current\logs due to IOException (kafka.server.LogDirFailureChannel)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46|java.nio.file.FileSystemException: C:\cywin64\opt\kafka-current\logs__consumer_offsets-30\00000000000000000000.log.cleaned -> C:\cygwin64\opt\kafka-current\logs__consumer_offsets-30\00000000000000000000.log.swap: The process cannot access the file because it is being used by another process.
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46|
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at sun.nio.fs.WindowsFileCopy.move(Unknown Source)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at sun.nio.fs.WindowsFileSystemProvider.move(Unknown Source)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at java.nio.file.Files.move(Unknown Source)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:212)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at kafka.log.Log.replaceSegments(Log.scala:1662)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:535)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at scala.collection.immutable.List.foreach...
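
To check a wrapper log for this symptom without reading it by eye, a small script can search for the key text quoted above. The following is a minimal sketch, not an official tool: the function name find_cleanup_errors and the regular expression are assumptions based only on the sample error line in this article, so other Kafka versions may word the message differently.

```python
import re

# Pattern derived from the sample error line in this article (an assumption,
# not an official Kafka log format). Group 1 is the partition directory name,
# group 2 is the Kafka data directory.
CLEANUP_ERROR = re.compile(
    r"ERROR Failed to clean up log for (\S+) in dir (.+?) due to IOException"
)


def find_cleanup_errors(lines):
    """Return (partition, directory) pairs for each cleanup failure found."""
    hits = []
    for line in lines:
        match = CLEANUP_ERROR.search(line)
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits
```

To use it, open the wrapper log file and pass the file object (or a list of lines) to find_cleanup_errors; an empty result means the key text was not found.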

Cause

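By default, Windows does not allow a file to be renamed while a process still has it open. Kafka's cleaner renames database segment files as part of cleanup (in the error above, from .log.cleaned to .log.swap) while the files are still in use, so the rename fails. Other operating systems, such as Linux, do not have this limitation.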
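
The OS-level behavior can be reproduced outside Kafka. The sketch below is only an illustration of the limitation, not Kafka's own code path: it opens a temporary file, tries to rename it while the handle is still open, and reports whether the rename succeeded. The function name can_rename_while_open is hypothetical, and the file names merely mimic the ones in the error above. On Linux the rename is expected to succeed; on Windows it is expected to fail with a "being used by another process" error.

```python
import os
import shutil
import tempfile


def can_rename_while_open():
    """Try to rename a file while it is still open; return True on success."""
    workdir = tempfile.mkdtemp()
    source = os.path.join(workdir, "00000000000000000000.log.cleaned")
    target = os.path.join(workdir, "00000000000000000000.log.swap")
    handle = open(source, "w")
    try:
        handle.write("segment data")
        handle.flush()
        try:
            # The file is still open here, as it is during Kafka's cleanup.
            os.rename(source, target)
            return True
        except OSError:
            # On Windows this is typically a PermissionError (sharing violation).
            return False
    finally:
        # Close the handle first so the temporary directory can be removed.
        handle.close()
        shutil.rmtree(workdir, ignore_errors=True)
```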

Fix

As of December 2021, there is no official release of Apache Kafka for Windows that permanently resolves this issue. There is currently a PR with a patch provided by Windows users, but as of 2022-01-29 it has not yet been merged/accepted (see KAFKA-1194: Fix renaming open files on Windows by robertbraeutigam · Pull Request #6329 · apache/kafka).

Meanwhile, Intercax advises its users to host Syndeia and peer services on Red Hat or CentOS Linux operating systems. Organizations that nevertheless elect to use Windows should either know how to manually compile and apply patches to the Kafka project OR implement the workaround below.

Workaround

The current workarounds on Windows are to either:

A. disable database file cleanup via log.cleaner.enable=false (no downtime, but requires additional storage over time)

B. periodically purge/archive the Kafka database (downtime, no additional storage over time)

Option A: Disable Database Cleanup

To disable cleanup and set the retention time, edit %SystemDrive%\cygwin64\opt\kafka-current\config\server.properties and make the following changes to it:

# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=-1

# Add this property at the end of your properties file.
log.cleaner.enable=false

The above configures the Kafka broker to set the retention period to infinity (-1) and disables the database cleaning.
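
After editing, it can be useful to confirm that both settings are actually in effect in the file. The following is a minimal sketch under stated assumptions: the helper names parse_properties and check_option_a are hypothetical, and the parser handles only simple key=value lines and comments, which is a simplification of the full Java properties syntax.

```python
# Settings from Option A of this article.
EXPECTED = {"log.retention.hours": "-1", "log.cleaner.enable": "false"}


def parse_properties(text):
    """Parse simple key=value lines; a later duplicate key overrides an earlier one."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_option_a(text):
    """Return {setting: actual value} for each setting that differs from Option A."""
    props = parse_properties(text)
    return {
        key: props.get(key)
        for key, wanted in EXPECTED.items()
        if props.get(key) != wanted
    }
```

Read the contents of server.properties into a string and pass it to check_option_a; an empty dictionary means both settings match, and a value of None means the setting is missing from the file.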

Note, since the Kafka database contains the complete historical evolution of your Total System Model (TSM) graph, it is also recommended to set log.retention.hours=-1 to retain database files indefinitely (as of SC 3.3 & 3.4, this historical data is not used but future versions of SC may allow you to “replay” this data and view the evolution of your graph). This incidentally will also give more time to consumers (ie: sc-graph service) to consume any remaining data into the “live” JG database if operations are ever interrupted (ex: maintenance, outage).

Warning, since cleanup will now be disabled and retention set to infinity, we suggest monitoring storage on the system and periodically adding storage as required by normal DB operations. DB size growth depends on the event ingestion rate. As a sample point, we have observed the Kafka database (logs folder) to be ~663MB on an SC 3.4 system with ~26k artifacts, 365 artifact types, ~12k relations, and 1389 containers that has had ~2 years of events on it (i.e., create/update/delete).
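
One simple way to monitor storage is to periodically record the size of the logs folder and the free space left on its drive. The sketch below shows one possible approach; the function names directory_size_bytes and storage_report are hypothetical, and how often to run it or when to alert is left to your own monitoring practice.

```python
import os
import shutil


def directory_size_bytes(root):
    """Sum the sizes of all files under root, in bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # The file may have been removed or locked since it was listed.
                continue
    return total


def storage_report(root):
    """Return the folder size and the free space on the drive that holds it."""
    usage = shutil.disk_usage(root)
    return {"logs_bytes": directory_size_bytes(root), "free_bytes": usage.free}
```

For example, storage_report could be called with the path of the Kafka logs folder from a scheduled task, and the two numbers written to a file so that growth can be tracked over time.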

Option B: Purge/Archive Kafka Database

If you do not wish to retain this data, you may periodically purge/archive it. Before doing so, check the CURRENT-OFFSET, LOG-END-OFFSET, and LAG output columns from the command below to ensure all consumers have consumed the data:
./kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --describe --group graph
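
If you save the output of the command above to a file, the LAG column can also be checked programmatically. The following is a minimal sketch: the function names are hypothetical, and it assumes the output is a whitespace-separated table whose header row contains the CURRENT-OFFSET and LAG column names. It treats any value other than 0 (including "-") as "not fully consumed", to err on the side of caution.

```python
def parse_describe_output(text):
    """Return one dict per data row, keyed by the column names in the header row."""
    rows = []
    header = None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if "LAG" in fields and "CURRENT-OFFSET" in fields:
            header = fields
            continue
        # Ignore anything before the header and lines too short to be a data row.
        if header is None or len(fields) < len(header):
            continue
        rows.append(dict(zip(header, fields)))
    return rows


def all_consumed(text):
    """True only if at least one row was parsed and every row reports LAG 0."""
    rows = parse_describe_output(text)
    return bool(rows) and all(row["LAG"] == "0" for row in rows)
```

Only proceed with a purge when all_consumed returns True, and still review the raw output yourself if anything looks unusual.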

When you wish to purge/archive the data, stop Kafka services and purge/archive the logs folder.

Warning, if purging, this will result in loss of graph evolution data and possibly live data if any events have not yet been consumed.
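
If you choose to archive rather than purge, the logs folder can be compressed into a timestamped zip file before it is cleared. The following is a minimal sketch using only the Python standard library; the function name archive_directory and the archive file naming are assumptions. Run it only after the Kafka services have been stopped, and choose an archive location outside the logs folder.

```python
import os
import shutil
import time


def archive_directory(source_dir, archive_dir):
    """Zip the contents of source_dir into archive_dir; return the archive path."""
    os.makedirs(archive_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Hypothetical naming scheme: kafka-logs-<timestamp>.zip
    base_name = os.path.join(archive_dir, "kafka-logs-" + stamp)
    return shutil.make_archive(base_name, "zip", root_dir=source_dir)
```

Verify that the archive opens and contains the expected files before deleting anything from the logs folder.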

References

Additional information on this bug is available on Apache Kafka's JIRA (KAFKA-1194) and on GitHub (Pull Request #6329 in apache/kafka).
