Is there a more efficient way to write this Python code?

Question

I am dealing with a preprocessing stage of a data table. My current code works but I am wondering if there is a more efficient way.

My data table looks like this

object A    object B     features of A     features of B   
   aaa          w          1                    0
   aaa          q          1                    1 
   bbb          x          0                    0
   ccc          w          1                    0

Stored in the variable X (one tuple per column), it would be

[ (aaa, aaa, bbb, ccc), (w, q, x, w), (1, 1, 0, 1), (0, 1, 0, 0)]

Now I am writing code to build a table that contains every possible pairing of object A and object B (each combination of object A and object B exactly once), with A and B each keeping their own features. The table would look like the following (rows marked with a star are the added rows):

object A    object B     features of A     features of B   
   aaa         w           1                    0
   aaa         q           1                    1 
 * aaa         x           1                    0
---------------------------------------------------------
   bbb         x           0                    0
 * bbb         w           0                    0
 * bbb         q           0                    1
---------------------------------------------------------
   ccc         w           1                    0 
 * ccc         x           1                    0 
 * ccc         q           1                    1

The whole data set is named X. My code to build the table is below, but it runs very slowly:

-----------------------------------------
#This part is still fast 

#to make the combination of object A and object B with no repetition

def uprod(*seqs):
    def inner(i):
        if i == n:
            yield tuple(result)
            return
        for elt in sets[i] - seen:
            seen.add(elt)
            result[i] = elt
            for t in inner(i+1):
                yield t
            seen.remove(elt)

    sets = [set(seq) for seq in seqs]
    n = len(sets)
    seen = set()
    result = [None] * n
    for t in inner(0):
        yield t

#add all possibility into a new list named "new_data"

new_data = list(uprod(X[0],X[1]))

X_8v = X[:]
y_8v = y[:]

-----------------------------------------
#if the current X_8v( content equals to X) does not have the match of object A and object B
#in the list "new_data"
#append a new row to the current X_8v
#Now this part is super slow, I think because I iterate a lot

for i, j in list(enumerate(X_8v[0])):
    for k, w in list(enumerate(X_8v[1])):
            if (X_8v[0][i], X_8v[1][k]) not in new_data:
                X_8v[0] + (X_8v[0][i],)
                X_8v[1] + (X_8v[1][k],)
                X_8v[2] + (X_8v[2][i],)
                X_8v[3] + (X_8v[3][k],)  
                X_8v[4] + (X_8v[4][i],)
                X_8v[5] + (0,)
                X_8v[6] + (0,)
                y_8v.append(0)

Is there any way to improve the code above?

Many thanks!

This would be a lot easier with an example of what the data structure looks like (i.e. X = ...).
– user94559, Commented Jul 24, 2016 at 20:59
@smarx : X = [ (aaa, aaa, bbb, ccc), (w, q, x, w), (1, 1, 0, 1), (0, 1, 0, 0)]
– Winds, Commented Jul 24, 2016 at 22:00
@LeighTsai That's not valid Python (unless aaa and the like are variables?).
– user94559, Commented Jul 24, 2016 at 22:08
Yes, the original data is interval (id no). To make it more understandable I used letters instead.
– Winds, Commented Jul 24, 2016 at 22:31

eddiewould · Accepted Answer · 2016-07-24 22:07:28Z

In relational algebra terms, it sounds like you want

π[features of A, features of B] ((object A) X (object B))

i.e. project fields 'features of A', 'features of B' from the cross-product of "object A" and "object B".

This is very natural to express in SQL.
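As a rough illustration of the SQL route, here is a minimal sketch using Python's built-in sqlite3 module. The table names (object_a, object_b) and column names (name, feature) are made up for this example, and the rows are taken from the sample data in the question.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE object_a (name TEXT PRIMARY KEY, feature INTEGER)")
conn.execute("CREATE TABLE object_b (name TEXT PRIMARY KEY, feature INTEGER)")

# One row per distinct object, using the sample values from the question.
conn.executemany("INSERT INTO object_a VALUES (?, ?)",
                 [("aaa", 1), ("bbb", 0), ("ccc", 1)])
conn.executemany("INSERT INTO object_b VALUES (?, ?)",
                 [("w", 0), ("q", 1), ("x", 0)])

# CROSS JOIN pairs every object A with every object B.
rows = conn.execute(
    "SELECT a.name, b.name, a.feature, b.feature "
    "FROM object_a AS a CROSS JOIN object_b AS b "
    "ORDER BY a.name, b.name"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

With three distinct A objects and three distinct B objects this yields nine rows, one per pairing.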

For Python, you probably want to load your data into a couple of dictionaries, i.e.

object_a_to_features = {"aaa": 1, "bbb": 0}
object_b_to_features = {"w": 0, "q": 1}

You'll then want to generate the cross-product of object_a_to_features.keys() and object_b_to_features.keys() and then for each row, look up the features in the appropriate dictionary.

Have a look at product() from itertools.

Something like:

import itertools

# yield only works inside a function, so wrap the loop in a generator
def cross_product_rows():
    for pair in itertools.product(object_a_to_features.keys(), object_b_to_features.keys()):
        yield (pair[0], pair[1], object_a_to_features[pair[0]], object_b_to_features[pair[1]])

Sample output:

('aaa', 'q', 1, 1)
('aaa', 'w', 1, 0)
('bbb', 'q', 0, 1)
('bbb', 'w', 0, 0)
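Putting the pieces together, here is a minimal end-to-end sketch that builds both dictionaries from the asker's column-wise X and then generates every pairing. It assumes each object always carries the same feature value (if an object appeared with two different values, the later one would silently win), and the function name cross_rows is just a placeholder.

```python
import itertools

# Column-wise sample data from the question: A names, B names, A features, B features.
X = [
    ("aaa", "aaa", "bbb", "ccc"),
    ("w", "q", "x", "w"),
    (1, 1, 0, 1),
    (0, 1, 0, 0),
]

# Map each distinct object to its feature (assumes one feature value per object).
object_a_to_features = dict(zip(X[0], X[2]))
object_b_to_features = dict(zip(X[1], X[3]))


def cross_rows(a_features, b_features):
    """Yield (A, B, feature of A, feature of B) for every A/B pairing."""
    for a, b in itertools.product(a_features, b_features):
        yield (a, b, a_features[a], b_features[b])


rows = list(cross_rows(object_a_to_features, object_b_to_features))
for row in rows:
    print(row)
```

Each dictionary lookup is constant time on average, so the total work grows with the number of output rows rather than with repeated scans of the input.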

user94559 · Accepted Answer · 2016-07-24 22:48:09Z

Assuming the data actually looks like I think it does, this should do what you want quite efficiently:

import itertools

x = [('aaa', 'aaa', 'bbb', 'ccc'), ('w', 'q', 'x', 'w'), (1, 1, 0, 1), (0, 1, 0, 0)]

a_list = set((x[0][i], x[2][i]) for i in range(len(x[0])))
b_list = set((x[1][i], x[3][i]) for i in range(len(x[1])))

for combination in itertools.product(a_list, b_list):
    print(combination)

# Output:
# (('ccc', 1), ('w', 0))
# (('ccc', 1), ('x', 0))
# (('ccc', 1), ('q', 1))
# (('aaa', 1), ('w', 0))
# (('aaa', 1), ('x', 0))
# (('aaa', 1), ('q', 1))
# (('bbb', 0), ('w', 0))
# (('bbb', 0), ('x', 0))
# (('bbb', 0), ('q', 1))

Of course you can easily convert the data back into the order you originally had:

reordered = [[a[0], b[0], a[1], b[1]] for a, b in itertools.product(a_list, b_list)]

for row in reordered:
    print(row)

# ['ccc', 'w', 1, 0]
# ['ccc', 'x', 1, 0]
# ['ccc', 'q', 1, 1]
# ['aaa', 'w', 1, 0]
# ['aaa', 'x', 1, 0]
# ['aaa', 'q', 1, 1]
# ['bbb', 'w', 0, 0]
# ['bbb', 'x', 0, 0]
# ['bbb', 'q', 0, 1]

EDIT

Based on the comment below, if you want to add a column with 1 indicating "This row was in the original dataset" and 0 indicating "This row was not in the original dataset," give this a try:

existing_combinations = set(zip(x[0], x[1]))
reordered = [
    [a[0], b[0], a[1], b[1],
     1 if (a[0], b[0]) in existing_combinations else 0
    ] for a, b in itertools.product(a_list, b_list)
]

for row in reordered:
    print(row)

# Output:
# ['ccc', 'x', 1, 0, 0]
# ['ccc', 'q', 1, 1, 0]
# ['ccc', 'w', 1, 0, 1]
# ['bbb', 'x', 0, 0, 1]
# ['bbb', 'q', 0, 1, 0]
# ['bbb', 'w', 0, 0, 0]
# ['aaa', 'x', 1, 0, 0]
# ['aaa', 'q', 1, 1, 1]
# ['aaa', 'w', 1, 0, 1]
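If the extra column already carries its own 0/1 values in the input (as with the "past collaboration" column described in the comments), one possible variation is to look the value up per (A, B) pair and fall back to 0 for pairs that were not in the input. This is only a sketch: the fifth tuple of values below is invented for illustration, and it assumes each object has a single feature value and each (A, B) pair appears at most once.

```python
import itertools

x = [
    ("aaa", "aaa", "bbb", "ccc"),
    ("w", "q", "x", "w"),
    (1, 1, 0, 1),
    (0, 1, 0, 0),
    (1, 0, 1, 0),  # hypothetical "past collaboration" values, not from the question
]

a_features = dict(zip(x[0], x[2]))
b_features = dict(zip(x[1], x[3]))

# Map each input (A, B) pair to its recorded collaboration value.
past_collaboration = dict(zip(zip(x[0], x[1]), x[4]))

rows = [
    # Pairs absent from the input get 0: no recorded collaboration.
    [a, b, a_features[a], b_features[b], past_collaboration.get((a, b), 0)]
    for a, b in itertools.product(a_features, b_features)
]

for row in rows:
    print(row)
```

Note that a 0 in the last column can then mean either "input row with value 0" or "newly created row". If the two need to be told apart, a separate membership flag like the one built from existing_combinations can be kept alongside it.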

edited Jul 24, 2016 at 22:48

answered Jul 24, 2016 at 22:12


6 Comments

Winds Over a year ago

Wow, it is so nice to ask here. Thank you so much!

Winds Over a year ago

If I have a column named "existed", where the original data has value 1 and the added data has value 0, the rows would look like ['ccc', 'w', 1, 0, 1], ['aaa', 'w', 1, 0, 1]. How do I deal with this when I want to identify whether a row is added or original?

user94559 Over a year ago

@LeighTsai I'm not sure what "added" and "original" mean here. I guess "original" means a row that is identical to a row in the input data, and "added" is a row that is not identical to any row in the input data? And I'm not sure why the "original" data would have a different value in it than the "added" data. Can you give an example?

Winds Over a year ago

And your understanding of "added" and "original" is correct: "added" rows are the ones newly created by the processing, and "original" rows are the input rows.

Winds Over a year ago

To be honest, the column name was "past collaboration". The data I originally had (before preprocessing) shows the collaboration result of object a and object b, and that column contains both "1" and "0" values. Now we are making a table of all combinations of object a and object b, which means that records for previously unmatched pairs of object a and object b are also created. In these created rows the objects have no collaboration experience with each other, so the column has value "0".
