I'm implementing feature vectors as bit maps for documents in a corpus. I already have the vocabulary for the entire corpus (as a list/set) and a list of the terms in each document.
For example, if the corpus vocabulary is ['a', 'b', 'c', 'd'] and the terms in document d1 are ['a', 'b', 'd', 'd'], the feature vector for d1 should be [1, 1, 0, 2].
To generate the feature vector, I'd iterate over the corpus vocabulary and check if each term is in the list of document terms, then set the bit in the correct position in the document's feature vector.
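To make the approach concrete, here is a minimal sketch of the loop described above. The function name `naive_feature_vector` is hypothetical, and the sketch assumes the vocabulary should be traversed in sorted order and that the document's terms are copied into a `set` first so each membership check is cheap.

```python
def naive_feature_vector(vocab, doc_terms):
    """Return one presence bit per vocabulary term, in sorted vocab order."""
    # Assumption: building a set of the document's terms up front keeps
    # each "is this term in the document?" check cheap.
    doc_set = set(doc_terms)
    # Walk the whole (sorted) vocabulary and set a bit for each term present.
    return [1 if term in doc_set else 0 for term in sorted(vocab)]


vocab = ['a', 'b', 'c', 'd']
d1_terms = ['a', 'b', 'd', 'd']
d1_bits = naive_feature_vector(vocab, d1_terms)
print(d1_bits)
```

Note that this yields presence bits, so d1 comes out as [1, 1, 0, 1]; the [1, 1, 0, 2] in the example corresponds to counting occurrences instead. Either way, the loop visits every vocabulary term for every document, which is the cost I'd like to avoid.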
What would be the most efficient way to implement this? Here are some things I've considered:
- Using a `set` would make checking vocab membership very efficient, but `set`s have no ordering, and the feature vector bits need to be in the order of the sorted corpus vocabulary.
- Using a `dict` for the corpus vocab (mapping each vocab term to an arbitrary value, like `1`) would allow iteration over `sorted(dict.keys())` so I could keep track of the index. However, I'd have the space overhead of `dict.values()`.
- Using a `sorted(list)` would be inefficient to check membership.
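One variation on the `dict` idea that might sidestep these tradeoffs is to store each term's position in the sorted vocabulary as the value, so the values are no longer wasted space. The sketch below is only an illustration under that assumption. The names `build_index`, `bit_vector`, and `count_vector` are hypothetical, and terms missing from the vocabulary are assumed to be ignored.

```python
def build_index(vocab):
    """Map each term to its position in the sorted vocabulary (built once)."""
    return {term: pos for pos, term in enumerate(sorted(vocab))}


def bit_vector(index, doc_terms):
    """Presence bits: 1 if the term occurs in the document, else 0."""
    vec = [0] * len(index)
    for term in doc_terms:
        pos = index.get(term)  # average-case constant-time dict lookup
        if pos is not None:    # assumption: skip out-of-vocabulary terms
            vec[pos] = 1
    return vec


def count_vector(index, doc_terms):
    """Occurrence counts: how many times each term occurs in the document."""
    vec = [0] * len(index)
    for term in doc_terms:
        pos = index.get(term)
        if pos is not None:
            vec[pos] += 1
    return vec


index = build_index(['a', 'b', 'c', 'd'])
d1_terms = ['a', 'b', 'd', 'd']
d1_bits = bit_vector(index, d1_terms)
d1_counts = count_vector(index, d1_terms)
print(d1_bits, d1_counts)
```

With this layout the per-document work is one pass over the document's terms plus allocating the vector, rather than a membership check for every vocabulary term.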
What would StackOverflow suggest?