
I'm currently working on a multi-threaded downloader built with the PycURL module. It downloads a file in parts and merges the parts afterwards.

The parts are downloaded separately by multiple threads and written to temporary files in binary mode, but when I merge them into a single file (in the correct order), the checksums do not match.

This only happens on Linux. The same script works flawlessly on Windows.

This is the part of the script that merges the files:

with open(filename,'wb') as outfile:
    print('Merging temp files ...')
    for tmpfile in self.tempfile_arr:
        with open(tmpfile, 'rb') as infile:
            shutil.copyfileobj(infile, outfile)
    print('Done!')

I tried the write() method as well, but it produces the same issue, and it would use a lot of memory for large files.

If I manually cat the part files into a single file on Linux, the file's checksum matches, so the issue is with Python's merging of the files.

EDIT:
Here are the files and checksums (SHA-256) that I used to reproduce the issue:

  • Original file
    • HASH: 158575ed12e705a624c3134ffe3138987c64d6a7298c5a81794ccf6866efd488
  • File merged by the script
    • HASH: c3e5a0404da480f36d37b65053732abe6d19034f60c3004a908b88d459db7d87
  • File merged manually using cat
    • HASH: 158575ed12e705a624c3134ffe3138987c64d6a7298c5a81794ccf6866efd488
    • Command used:

      for i in /tmp/pycurl_*_{0..7}; do cat $i >> manually_merged.tar.gz; done
      
  • Part files - numbered at the end, from 0 through 7
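For anyone reproducing the comparison, the hashes above can be recomputed from Python without loading a whole file into memory. This is only a sketch: the helper name sha256_of and the chunk size are illustrative choices, not part of the original script.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read() until it returns b''.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling sha256_of() on the script-merged file and on the cat-merged file should give the two different hashes listed above if the files really differ on disk.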

  • I think your open mode is not right (wb). Based on stackoverflow.com/a/4388244/3727050 you need ab (or r+b and seek). Commented Dec 28, 2019 at 16:48
  • You need to provide a minimal reproducible example including some example tempfiles. I think you should be able to reproduce the issue with some tempfiles of just a few bytes each. Hopefully buffer size is not part of the problem. Also binary mode is probably not important, so you could use plain text files. Commented Dec 28, 2019 at 17:05
  • FWIW, I wasn't able to reproduce the problem with two very short text files on Linux, unfortunately. Commented Dec 28, 2019 at 17:19
  • OK, the files help, but your code is still incomplete: filename, self.tempfile_arr, and shutil are undefined. Commented Dec 28, 2019 at 19:08
  • Your two files appear to have the same contents, just in a different order. In other words, you didn't merge the pieces in the correct order. Commented Jan 13, 2020 at 15:00
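If the ordering suggested in the last comment is the cause, one way to rule it out is to sort the part files by their numeric suffix before merging, rather than relying on the order in which they were appended to a list. The following is a minimal sketch, assuming part names end in _<number> as in the question; part_index and merge_parts are hypothetical helper names, not functions from the original script.

```python
import re
import shutil


def part_index(path):
    """Extract the trailing part number from a name such as 'pycurl_abc_7'."""
    match = re.search(r'_(\d+)$', path)
    if match is None:
        raise ValueError(f'no part number in {path!r}')
    return int(match.group(1))


def merge_parts(part_paths, out_path):
    """Concatenate the parts into out_path in ascending part-number order."""
    with open(out_path, 'wb') as outfile:
        # Sort numerically so that part 10 comes after part 9, not after part 1.
        for part in sorted(part_paths, key=part_index):
            with open(part, 'rb') as infile:
                shutil.copyfileobj(infile, outfile)
```

Sorting on the integer value rather than the raw string matters once there are ten or more parts, because a plain string sort would place _10 before _2.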

1 Answer


A minimal reproducible example would help, but I'd suspect universal newlines to be the issue: by default, when a file is opened in text mode and its contents are Windows-style text (newlines are \r\n), the newlines are translated to Unix-style newlines (\n) on reading. Those Unix-style newlines are then written to the output file rather than the Windows-style ones you were expecting. That would explain the divergence between Python and cat (which does no translation whatsoever).

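Try running your script with newline='' (the empty string) passed to open. Note that newline is only accepted in text mode; open raises ValueError if it is combined with a binary mode such as 'rb' or 'wb'.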
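To see when newline translation actually takes place, the sketch below reads the same bytes back in three ways: binary mode, default text mode, and text mode with newline=''. The helper name read_three_ways and the sample data are made up for illustration.

```python
import os
import tempfile


def read_three_ways(data):
    """Write data to a temp file, then read it in binary, text, and newline='' modes."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        with open(path, 'rb') as f:                      # binary: bytes are untouched
            raw = f.read()
        with open(path, 'r', encoding='ascii') as f:     # text: \r\n becomes \n
            translated = f.read()
        with open(path, 'r', encoding='ascii', newline='') as f:  # text, no translation
            untranslated = f.read()
    finally:
        os.remove(path)
    return raw, translated, untranslated


raw, translated, untranslated = read_three_ways(b'one\r\ntwo\r\n')
print(repr(raw))           # b'one\r\ntwo\r\n'
print(repr(translated))    # 'one\ntwo\n'
print(repr(untranslated))  # 'one\r\ntwo\r\n'
```

Only the default text-mode read changes the line endings; the binary read returns the bytes exactly as stored.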
