Kafka, Windows, Logs, and KAFKA-1194
Overview
This KB article discusses a Windows limitation that affects Kafka “log file” cleanup operations, the available workarounds, and the status of a permanent fix.
Kafka’s use of the term “log file” is (confusingly) different from the standard sysadmin definition of a “log file” (ie: a historical, sequential, temporally-sorted list of informational, warning, or debug statements documenting operation of the system). The term “database” (DB) may be more appropriate, as Kafka actually uses its “log files” to store and process the event queue/messages.
Since Kafka also has the concept of a traditional log file, for the purposes of this article we will henceforth use the term database to refer to the files Kafka uses to store and process the event queue/messages, and log file to refer to the standard sysadmin definition above.
Applies to
Syndeia Cloud 3.3
Syndeia Cloud 3.4
Symptoms
In the Kafka service wrapper log file (located in %SystemDrive%\cygwin64\opt\kafka-current\logs) you will notice errors similar to the following (the key text to note here is: ERROR Failed to clean up log for [...] The process cannot access the file because it is being used by another process):
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46|[2022-01-19 10:44:46,289] ERROR Failed to clean up log for __consumer_offsets-30 in dir C:\cygwin64\opt\kafka-current\logs due to IOException (kafka.server.LogDirFailureChannel)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46|java.nio.file.FileSystemException: C:\cygwin64\opt\kafka-current\logs\__consumer_offsets-30\00000000000000000000.log.cleaned -> C:\cygwin64\opt\kafka-current\logs\__consumer_offsets-30\00000000000000000000.log.swap: The process cannot access the file because it is being used by another process.
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46|
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at sun.nio.fs.WindowsFileCopy.move(Unknown Source)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at sun.nio.fs.WindowsFileSystemProvider.move(Unknown Source)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at java.nio.file.Files.move(Unknown Source)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:212)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at kafka.log.Log.replaceSegments(Log.scala:1662)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:535)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
INFO|5652/0|Service org.apache.kafka|22-01-19 10:44:46| at scala.collection.immutable.List.foreach...
Cause
Windows does not allow a file to be renamed (or deleted) while any process, including the owning process itself, still holds an open handle or memory mapping on it. Kafka's cleaner thread renames database segment files (ex: .cleaned to .swap, per the stack trace above) while the broker still has those files open, so the rename fails. Other OS-es such as Linux do not have this limitation, which is why this issue (tracked upstream as KAFKA-1194) only manifests on Windows.
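To confirm that it is the broker itself holding these database files open, Microsoft Sysinternals' handle.exe utility (a free download, not installed by default, and offered here only as one diagnostic option) can list the processes with open handles under the Kafka database folder, ex:
# from an elevated prompt: show open handles whose path contains the Kafka database folder
handle.exe kafka-current\logs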
Fix
As of December 2021, there is no official release of Apache Kafka for Windows that permanently resolves this issue. There is currently a PR with a patch provided by Windows users, but as of 2022-01-29 it has not been merged/accepted yet (see https://github.com/apache/kafka/pull/6329).
In the meantime, Intercax advises its users to host Syndeia Cloud and its peer services on Linux (RedHat or CentOS). Organizations that nevertheless elect to use Windows should either be prepared to manually patch and compile the Kafka project themselves OR implement one of the workarounds below.
Workaround
The current workarounds on Windows are to either:
A. disable database file cleanup via log.cleaner.enable=false (no downtime, but requires additional storage over time), OR
B. periodically purge/archive the Kafka database (requires downtime, but no additional storage growth over time)
Option A: Disable Database Cleanup
To disable cleanup and set the retention time, edit %SystemDrive%\cygwin64\opt\kafka-current\config\server.properties and make the following changes to it:
# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=-1
# Add this property at the end of your properties file.
log.cleaner.enable=false
The above configures the Kafka broker to set the retention period to infinity (-1) and disables the database cleaning.
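Note that the broker only reads server.properties at startup, so the Kafka service must be restarted for these changes to take effect. To double-check the edits from a Cygwin shell (the path below is the Cygwin view of the default %SystemDrive%\cygwin64\opt\kafka-current install location used in this article):
# print the two cleanup-related settings from the broker config
grep -E '^(log\.retention\.hours|log\.cleaner\.enable)' /opt/kafka-current/config/server.properties
If both edits were applied, you should see log.retention.hours=-1 and log.cleaner.enable=false echoed back.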
Note, since the Kafka database contains the complete historical evolution of your Total System Model (TSM) graph, it is also recommended to set log.retention.hours=-1 to retain database files indefinitely (as of SC 3.3 & 3.4, this historical data is not used, but future versions of SC may allow you to “replay” this data and view the evolution of your graph). This incidentally also gives consumers (ie: the sc-graph service) more time to consume any remaining data into the “live” JG database if operations are ever interrupted (ex: maintenance, outage).
Warning: since cleanup will now be disabled & retention set to infinity, we suggest monitoring storage on the system and periodically adding additional storage as required by normal DB operations. DB size growth depends on the event ingestion rate (as a sample data point, we have observed the Kafka database (logs folder) to be ~663MB on a SC 3.4 system with ~26k artifacts, 365 artifact types, ~12k relations, and 1389 containers, after ~2 years of create/update/delete events).
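To spot-check the database's current size from a Cygwin shell (same default-install-path assumption as above):
# report the total on-disk size of the Kafka database folder
du -sh /opt/kafka-current/logs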
Option B: Purge/Archive Kafka Database
If you do not wish to retain this data, you may periodically purge/archive it after first checking the CURRENT-OFFSET, LOG-END-OFFSET, and LAG output columns from the below command to ensure all consumers have consumed the data:
./kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --describe --group graph
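For reference, the output of that command resembles the following (topic names and offset values here are purely illustrative); all data has been consumed when LAG is 0, ie: CURRENT-OFFSET equals LOG-END-OFFSET, for every partition:
GROUP  TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST  CLIENT-ID
graph  example-topic  0          26412           26412           0    ...          ...   ...
graph  example-topic  1          25980           25980           0    ...          ...   ...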
When you wish to purge/archive the data, stop the Kafka services and purge/archive the logs folder.
Warning: purging will result in the permanent loss of graph evolution data, and possibly of live data if any events have not yet been consumed.
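As a minimal sketch of the purge/archive step, runnable from a Cygwin shell once the Kafka services have been stopped via your service manager (the install path is this article's default; the archive destination is an arbitrary example, adjust to suit):
# archive the Kafka database (logs) folder under a datestamped filename
tar -czf /opt/kafka-db-backup-$(date +%F).tar.gz -C /opt/kafka-current logs
# purge the database files so Kafka starts with an empty event queue
rm -rf /opt/kafka-current/logs/*
# restart the Kafka services afterwards via your service manager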
References
Additional information on this bug is available on Apache Kafka's JIRA + GitHub:
https://issues.apache.org/jira/browse/KAFKA-1194
https://github.com/apache/kafka/pull/6329