
I am trying to put multiple files (around 25k) into a zip file using multithreading in a Python CGI script. I have written the script below, but somehow the response I get has content length 0 and there is no data in the response. This is my first time using multithreading in Python. Is there anything I am missing in the code? Does the output get printed even before the data is posted?

Any help will be appreciated.

Here is my code:

import cgi
import sys
import time
import zipfile
from StringIO import StringIO
from multiprocessing.dummy import Pool  # thread-based Pool

b = StringIO()
z = zipfile.ZipFile(b, 'w', zipfile.ZIP_DEFLATED)

def read_file(link):
    # use the last path component as the name inside the archive
    fname = link.split('/')
    fname = fname[-1]
    z.write(link, fname)

if __name__ == '__main__':
    form = cgi.FieldStorage()
    fileLinks = form.getvalue("fileLink")

    p = Pool(10)
    p.map(read_file, fileLinks)
    p.close()
    p.join()
    z.close()
    zipFilename = "DataFiles-" + str(time.time()) + ".zip"
    length = b.tell()
    # HEADERS is a response-header format string defined elsewhere in the script
    sys.stdout.write(
        HEADERS % ('application/zip', zipFilename, zipFilename, length)
    )
    b.seek(0)
    sys.stdout.write(b.read())
    b.close()

Sequential version of the same code:

for fileLink in fileLinks:
    fname = fileLink.split('/')
    filename = fname[-1]
    z.write(fileLink, filename)
z.close()
5 Comments

  • Does a single-thread version of your algorithm work as expected?
  • Thanks for the comment, let me try it and see.
  • I have tried with a single thread and limited the number of files to 1000. It doesn't work. It gives the same response with zero content length.
  • Then the issue is not multithreading!
  • Sequential version of the same code works. Adding the code to the question.

1 Answer


The problem is likely that ZipFile.write() (and ZipFile in general) is not thread-safe.

You must somehow serialize thread access to the zip file. This is one way to do it (in Python 3):

import threading

ziplock = threading.Lock()

def read_file(link):
    fname = link.split('/')
    fname = fname[-1]
    # only one thread at a time may touch the shared ZipFile
    with ziplock:
        z.write(link, fname)

There should be no speed advantage to doing it that way, though, because the lock effectively serializes the entire creation of the zip file.

Some parallelization may be achieved with this version, which reads the file contents before adding them to the zip file:

def read_file(link):
    fname = link.split('/')
    fname = fname[-1]
    # the file is read in parallel (binary mode, since the data goes into a zip)
    with open(link, 'rb') as f:
        contents = f.read()
    with ziplock:
        # writes to the zip file are serialized
        z.writestr(fname, contents)

Yet, if the files reside on the same file system, it is likely that the reads will, for all practical purposes, be serialized by the operating system anyway.

Because the work is file I/O, the only worthwhile target for parallelization is the CPU-bound part of the process, the compression, and that doesn't seem possible with the zip format (because a zip file behaves like a directory: every write() must leave the archive in a state where close() can produce a complete archive).

If you can use a different compression format, then parallelization would work without locks, using gzip for compression and tar (tarfile) as the archive format: each file can be read and compressed in parallel, and only the tar concatenation has to be done serially (see the sketch below).
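Here is a minimal sketch of that approach, assuming Python 3 and a thread pool; the links list, the compress_and_add name, and the output filename are illustrative, not part of the original code:

import gzip
import io
import tarfile
import threading
from multiprocessing.dummy import Pool  # thread-based pool

tar = tarfile.open('DataFiles.tar', 'w')  # plain tar; each member is a .gz file
tarlock = threading.Lock()

def compress_and_add(link):
    fname = link.split('/')[-1]
    with open(link, 'rb') as f:
        contents = f.read()
    # the CPU-bound compression happens in parallel; zlib releases
    # the GIL while compressing, so threads genuinely overlap here
    compressed = gzip.compress(contents)
    info = tarfile.TarInfo(name=fname + '.gz')
    info.size = len(compressed)
    with tarlock:
        # only the tar concatenation is serialized
        tar.addfile(info, io.BytesIO(compressed))

p = Pool(10)
p.map(compress_and_add, links)
p.close()
p.join()
tar.close()

Note that the result is a plain .tar whose members are individually gzipped files (rather than a single .tar.gz compressed as a whole), which is exactly what lets each file be compressed independently.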


1 Comment

Thanks for your answer. Appreciate your help. I will try it out and let you know.
