
I am working with a lot of very large files (e.g. 1000 x 512 MB) and I implemented a way to speed things up by writing certain information into databases which can be accessed when I re-run the software. For this reason I need to be able to generate unique filenames for arbitrary subsets of these files.

I have tried to generate filenames based on the total file size of the subset, the file modification dates, and combinations of the two. The problem is that many files share the same size and the same modification date, which makes my current identifier string ambiguous. The only hard requirement is that the identifier is always the same for the same list of files, so that I can always access the correct database entry. Any ideas are greatly appreciated!

Here is what I use at the moment, which does not work...

import os
import glob
import datetime

file_paths = glob.glob("path/to/files/*.foo")

def modification_date(file_path):
    return datetime.datetime.fromtimestamp(os.path.getmtime(file_path))

# Sum the date components of each file's modification time...
uid = [modification_date(f) for f in file_paths]
uid = [d.year + d.month + d.day + d.hour + d.minute + d.second for d in uid]
# ...then combine their average with the total size of the subset.
# Files sharing sizes and mtimes collapse to the same value, hence the ambiguity.
uid = sum(uid) // len(uid) + sum(os.path.getsize(f) for f in file_paths)
  • If all the metadata (size, modification date) are the same, you need to use the contents of the file. A simple md5 hash might do. Commented Jan 29, 2016 at 14:12
  • Do these files have names? Can you base your subset name on a combination of those file names? ex. The subset consisting of "foo.txt" and "bar.exe" is named "(foo.txt)(bar.exe)". Commented Jan 29, 2016 at 14:15
  • @Kevin I also thought about using the filenames, but they are quite long and I would end up with a very long unique identifier. Maybe there is a quick way to convert all filenames into a sequence of numbers...the filenames contain all sorts of letters and symbols though... Commented Jan 29, 2016 at 14:18
  • @Selcuk Wouldn't an md5 hash take a long time to generate for such large files? That's why I avoided it in the first place. It has to be quick! Commented Jan 29, 2016 at 14:19
  • I think you're going to end up with a long unique identifier no matter what you do. There are 2^1000 different possible subsets of 1000 files, so by the pigeonhole principle at least some of them will have identifiers that are at least log(2^1000, 128) ≈ 143 characters long, assuming your filenames can contain any of the 128 ASCII characters. Commented Jan 29, 2016 at 14:22
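A middle ground the comments point toward is hashing per-file metadata (path, size, mtime) rather than file contents, which gives a fixed-length identifier without reading any of the 512 MB files. A sketch of that idea (the helper name `subset_id` is mine, not from the thread):

```python
import hashlib
import os

def subset_id(file_paths):
    """Fixed-length identifier for a set of files, based on metadata only.

    Sorting makes the id independent of the order the paths were listed in;
    combining path, size, and mtime disambiguates files that happen to share
    one of those attributes.
    """
    h = hashlib.md5()
    for path in sorted(file_paths):
        st = os.stat(path)
        h.update(f"{path}|{st.st_size}|{st.st_mtime}".encode())
    return h.hexdigest()
```

This reads only directory metadata, so it stays fast regardless of file size, at the cost of not noticing content changes that preserve size and mtime.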

2 Answers


This may well be wildly out, but as long as the list of included files doesn't change, could you get what you are after with hashlib?

import hashlib, glob

# sha256 needs bytes, so encode the stringified list; sorting makes the
# hash independent of the (filesystem-dependent) order glob returns.
file_paths = sorted(glob.glob("/home/rolf/StackOverflow/*.py"))
hash_object = hashlib.sha256(str(file_paths).encode())
file_name = hash_object.hexdigest()
file_name
'f69dd9583eabae55319e9a56dbfc737cc16ab58634d9042f4c530e9a85e7d77c'

file_paths = sorted(glob.glob("/home/rolf/Research/*.py"))
hash_object = hashlib.sha256(str(file_paths).encode())
file_name = hash_object.hexdigest()
file_name
'dc4beec8e859b190d966068a52674d135f22520514ac1020b3e11a4af090e579'

2 Comments

Hmmm. This works, but how can I be sure that this is unique for all potential combinations?
As @Michael Mullany states in the attached link, the probability of a collision among the inputs depends only on the number of inputs and is 1 - (2^256)! / ((2^256)^inputcount * (2^256 - inputcount)!), basically zero for any reasonable number of inputs. stackoverflow.com/questions/2299846/…
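That factorial expression is impossible to evaluate directly, but the standard birthday-problem approximation p ≈ 1 − exp(−n(n−1)/2d) with d = 2^256 makes the "basically zero" claim concrete. A quick check (my illustration, not from the answer):

```python
import math

def collision_probability(n, bits=256):
    """Birthday-problem approximation for n uniformly random hashes
    of the given bit width."""
    d = 2 ** bits
    return 1 - math.exp(-n * (n - 1) / (2 * d))

# Even a billion distinct subsets give a collision probability that
# underflows to zero in double precision:
print(collision_probability(10**9))  # prints 0.0
```

Collisions only become likely once the number of inputs approaches 2^128, i.e. the square root of the hash space.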

If the master list of all files never changes, then you can uniquely encode any subset of that list into a string of about len(master_list)/8 characters. First, you create a bit array where a 0 at index i means "the ith element of the master list is not part of this subset" and a 1 means "the ith element of the master list is part of this subset". Then you convert the bit array into an integer, which you can then pack into a string.

import base64

# Encodes a positive integer of any size into bytes (little-endian).
def string_from_int(value):
    return value.to_bytes((value.bit_length() + 7) // 8 or 1, "little")

# Decodes bytes back into a positive integer.
def int_from_string(s):
    return int.from_bytes(s, "little")

# Encodes a subset of a list into a filename-safe string.
def encode(master_file_list, subset):
    bitmask = [int(filename in subset) for filename in master_file_list]
    int_value = sum(bit << i for i, bit in enumerate(bitmask))
    string_value = string_from_int(int_value)
    # The raw bytes may contain slashes or other characters illegal in
    # filenames, so we base64-encode them, at the expense of increasing
    # the length by about a third.
    return base64.urlsafe_b64encode(string_value).decode("ascii")

# The reverse of `encode`.
def decode(master_file_list, filename):
    string_value = base64.urlsafe_b64decode(filename)
    int_value = int_from_string(string_value)
    subset = [value for i, value in enumerate(master_file_list) if (1 << i) & int_value]
    return subset

master_file_list = ['caddy.jpg', 'fjeld.dat', 'gyve.ini', 'karts.png', 'laves.txt', 'nimbi.jpg', 'ocas.ini', 'sipes.dat', 'wacky.png', 'waff.png']
subset = ["fjeld.dat", "laves.txt", "ocas.ini"]
filename = encode(master_file_list, subset)
print("Subset is:", subset)
print("Filename is:", filename)
print("Filename decodes back to:", decode(master_file_list, filename))

Result:

Subset is: ['fjeld.dat', 'laves.txt', 'ocas.ini']
Filename is: Ug==
Filename decodes back to: ['fjeld.dat', 'laves.txt', 'ocas.ini']
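To see how this scales to the question's roughly 1000 files: the bitmask needs one bit per file in the master list, and base64 then inflates the byte string by about a third — close to the ~143-character pigeonhole floor mentioned in the question's comments. A quick check (the 1000-file figure is from the question, the rest is arithmetic):

```python
import base64

n_files = 1000
raw_bytes = (n_files + 7) // 8              # 1000 bits of bitmask -> 125 bytes
encoded = base64.urlsafe_b64encode(b"\x00" * raw_bytes)
print(raw_bytes, len(encoded))              # prints: 125 168
```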

