Let's say I have a list:
a = [['a','b'],['a','c'],['d','e'],['c','a']]
I need it to become:
a = [[1,2],[1,3],[4,5],[3,1]]
I've tried to change the values using a counter, but it does not work.
This will work if you have a list of lists and no other nesting. It works not only for strings of any length, but for all hashable types. It also has the benefit of always using contiguous indexes starting at 0 (or 1), no matter which strings are used:
>>> from collections import defaultdict
>>> d = defaultdict(lambda: len(d))  # use 'len(d) + 1' if you want 1-based counting
>>> a = [[d[s] for s in x] for x in a]
>>> a
[[0, 1], [0, 2], [3, 4], [2, 0]]
# with 'len(d) + 1' the result is [[1, 2], [1, 3], [4, 5], [3, 1]]
It uses a defaultdict that always returns its current size for unknown items, thus always producing a unique integer.
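To make the mechanism a bit more concrete, here is a minimal sketch that wraps the same idea in a function and also returns the mapping it built. The names `encode_nested`, `ids`, and `start` are made up for illustration and are not part of the answer above.

```python
from collections import defaultdict


def encode_nested(rows, start=1):
    """Replace each distinct item in a list of lists with a sequential integer id."""
    # The factory runs only when a key is missing, so len(ids) at that moment
    # is the number of items seen so far.
    ids = defaultdict(lambda: len(ids) + start)
    encoded = [[ids[item] for item in row] for row in rows]
    return encoded, dict(ids)


a = [['a', 'b'], ['a', 'c'], ['d', 'e'], ['c', 'a']]
encoded, mapping = encode_nested(a)
print(encoded)  # [[1, 2], [1, 3], [4, 5], [3, 1]]
print(mapping)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
```

Returning the mapping as a plain dict is a design choice of this sketch: it lets you translate the integers back later, and it avoids accidentally adding new keys through lookups on the defaultdict.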
You could also use Gensim's Dictionary:
from gensim.corpora import Dictionary
a = [['a','b'],['a','c'],['d','e'],['c','a']]
# create Dictionary object
dic = Dictionary(a)
# map each token in a sublist to its integer id
def mapper(l):
    # take in a list and return its numeric representation using dic above
    return [dic.token2id[x] for x in l]
# run mapper on each sublist; a list comprehension is used because map() is lazy in Python 3
a2 = [mapper(x) for x in a]
a2 # [[0, 1], [0, 2], [4, 3], [2, 0]] (the exact ids depend on the order in which Gensim assigns them)
And if you want to convert to ids with counts (bag of words), you can use:
[dic.doc2bow(x) for x in a]
# [[(0, 1), (1, 1)], [(0, 1), (2, 1)], [(3, 1), (4, 1)], [(0, 1), (2, 1)]]
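If you don't want the Gensim dependency, the bag-of-words idea can be approximated with the standard library. This is only a sketch of the concept, not Gensim's implementation; `build_ids` and `to_bow` are hypothetical helper names, and ids are assigned in first-seen order, which is an assumption of this example.

```python
from collections import Counter


def build_ids(docs):
    """Assign each distinct token an integer id in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            # len(token2id) is evaluated before the insert, so ids start at 0.
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return (token_id, count) pairs for one document, sorted by id."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


a = [['a', 'b'], ['a', 'c'], ['d', 'e'], ['c', 'a']]
token2id = build_ids(a)
bows = [to_bow(doc, token2id) for doc in a]
print(bows)
# [[(0, 1), (1, 1)], [(0, 1), (2, 1)], [(3, 1), (4, 1)], [(0, 1), (2, 1)]]
```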
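If you're looking to extend this to multi-character strings, Python's built-in hash function maps each string to an integer with a low chance of collisions. Keep in mind that in Python 3 string hashes are randomized per process by default (see PYTHONHASHSEED), so the values you get will differ from the ones shown here and from run to run: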
a = [['a','b'],['a','c'],['d','e'],['c','a']]
b = [[hash(f), hash(s)] for f,s in a]
b
Out[11]:
[[-351532772472455791, 5901277274144560781],
[-351532772472455791, 791873246212810409],
[3322017449728496367, 3233520255652808905],
[791873246212810409, -351532772472455791]]
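One caveat with the built-in hash: for strings it is salted per process in Python 3 unless PYTHONHASHSEED is set, so the integers are not reproducible across runs. If you need stable values, a possible alternative is to derive the integer from a cryptographic digest. The sketch below swaps in `hashlib.sha256` for that purpose; `stable_id` and `num_bytes` are illustrative names, not part of the answer above.

```python
import hashlib


def stable_id(text, num_bytes=8):
    """Derive an integer from the SHA-256 digest of a string (same on every run)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep only the first num_bytes bytes; fewer bytes means smaller integers
    # but a higher chance of collisions.
    return int.from_bytes(digest[:num_bytes], "big")


a = [['a', 'b'], ['a', 'c'], ['d', 'e'], ['c', 'a']]
b = [[stable_id(s) for s in pair] for pair in a]
print(b)
```

Like the built-in hash, this gives large, non-contiguous integers, so it only fits if you do not need the ids to be 1, 2, 3, and so on.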
If the strings are single lowercase letters, I'd define a function that translates them, then use the same list comprehension:
def to_int(char):
    # 'a' -> 1, 'b' -> 2, ..., 'z' -> 26
    return ord(char) - ord('a') + 1
b = [[to_int(f), to_int(s)] for f,s in a]
b
Out[14]: [[1, 2], [1, 3], [4, 5], [3, 1]]