
File Integrity Checks

Release 2.9: MD5 checks are optional for large files. When the Uploader (source connector) configuration property file.maximum.size.bytes.for.md5 is set, MD5 checks are not performed for files exceeding this size. Instead, the Downloader (sink connector) completes the file integrity check by verifying that the uploaded and downloaded (merged) files are identical in size. This behaviour is configurable because MD5 checks are computationally costly and create memory pressure on the JVM, which can cause an OutOfMemoryError for files exceeding 2GB in size. When the downloaded file is merged, if the MD5 checksums or file sizes do not match, the Downloader logs an error and does not finalize the filename of the downloaded file.
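The choice between the two checks can be sketched as follows. This is a minimal illustration in Python, not the connector's actual code: the function names are hypothetical, and treating a file exactly at the threshold as still eligible for MD5 is an assumption, since the text only says that files exceeding the size skip the check.

```python
def choose_integrity_check(file_size_bytes: int, md5_max_bytes: int) -> str:
    """Return "size" or "md5" for a file of the given size.

    md5_max_bytes stands in for file.maximum.size.bytes.for.md5.
    Assumption: a file exactly at the threshold still gets an MD5 check.
    """
    return "size" if file_size_bytes > md5_max_bytes else "md5"


def merged_file_is_intact(check, uploaded_size, merged_size,
                          uploaded_md5=None, merged_md5=None) -> bool:
    """Apply the selected check to the uploaded and merged file metadata."""
    if check == "size":
        # Large files: only the byte counts are compared.
        return uploaded_size == merged_size
    # Smaller files: the checksums must both be present and equal.
    return uploaded_md5 is not None and uploaded_md5 == merged_md5
```

With a hypothetical threshold of 2 GiB, a 3 GiB file would be verified by size only, while a 1 KiB file would be verified by MD5.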

If the size of an input file is below file.maximum.size.bytes.for.md5, the file-chunk source connector generates an MD5 checksum for each uploaded file, and the file-chunk sink connector generates an MD5 checksum for each merged file. The sink connector compares the two checksums. If they match, the file is renamed to filename__FINISHED. If they do not match, the file is renamed to filename__ERROR.
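The compare-and-rename step might look like the sketch below. It is an illustration only: the helper names are hypothetical, and reading the file in 1 MiB blocks is a design choice made here to keep memory use flat, not a description of how the connectors compute their checksums.

```python
import hashlib
import os


def md5_of_file(path: str, block_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it block by block."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def finalize_merged_file(path: str, source_md5: str) -> str:
    """Rename the merged file to <name>__FINISHED or <name>__ERROR.

    source_md5 is the checksum generated for the uploaded file.
    Returns the new path.
    """
    matches = md5_of_file(path) == source_md5
    final_path = path + ("__FINISHED" if matches else "__ERROR")
    os.rename(path, final_path)
    return final_path
```

Calling finalize_merged_file("report.csv", source_md5) would leave either report.csv__FINISHED or report.csv__ERROR on disk, depending on whether the checksums agree.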

Various integrity checks are performed during pipeline streaming:

  • File chunks are numbered from 1 to n, where n = (filesize / binary.chunk.size.bytes), rounded up to the next whole number. Each chunk is exactly binary.chunk.size.bytes bytes, except for the final chunk, which is usually smaller than binary.chunk.size.bytes.
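  • The sink connector verifies that the size of the (partially) merged target file is always (binary.chunk.size.bytes * chunk-number) after each chunk is merged; the exception is the final chunk, after which the size equals the full file size.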
  • After merging the final chunk, the MD5 checksum for the target file is compared with the previously generated MD5 checksum for the source file, which is populated into the header of each chunk (for files smaller than file.maximum.size.bytes.for.md5).
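The chunk arithmetic behind these checks can be worked through in a few lines. The sketch below is illustrative only: the function names are hypothetical, and chunk_size stands in for binary.chunk.size.bytes.

```python
def chunk_count(file_size: int, chunk_size: int) -> int:
    """Number of chunks needed for a file: file_size / chunk_size, rounded up."""
    # Integer ceiling division avoids floating-point error on very large files.
    return (file_size + chunk_size - 1) // chunk_size


def expected_merged_size(chunk_number: int, file_size: int, chunk_size: int) -> int:
    """Expected size of the partially merged file after chunk_number chunks (1-based)."""
    # Every chunk is full-sized except possibly the last, so the running
    # total is capped at the size of the complete file.
    return min(chunk_number * chunk_size, file_size)
```

For example, a 2,500-byte file with a 1,000-byte chunk size yields three chunks, and the merged file should measure 1,000, 2,000 and then 2,500 bytes as the chunks arrive.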