
I am working on a preprocessing stage for a data table. My current code works, but I am wondering if there is a more efficient way.

My data table looks like this

object A    object B     features of A     features of B   
   aaa          w          1                    0
   aaa          q          1                    1 
   bbb          x          0                    0
   ccc          w          1                    0 

Stored as X (one tuple per column), it would be

[ (aaa, aaa, bbb, ccc), (w, q, x, w), (1, 1, 0, 1), (0, 1, 0, 0)]

Now I am writing code to build a table that includes every possible pairing of object A and object B (each combination of object A and object B exactly once), while A and B each keep their own features. The table would look as follows (rows marked with a star are the added rows):

object A    object B     features of A     features of B   
   aaa         w           1                    0
   aaa         q           1                    1 
 * aaa         x           1                    0
---------------------------------------------------------
   bbb         x           0                    0
 * bbb         w           0                    0
 * bbb         q           0                    1
---------------------------------------------------------
   ccc         w           1                    0 
 * ccc         x           1                    0 
 * ccc         q           1                    1 

The whole data set is named X. My code to build this table is as follows, but it runs very slowly:

-----------------------------------------
#This part is still fast 

#to make the combination of object A and object B with no repetition

def uprod(*seqs):
    def inner(i):
        if i == n:
            yield tuple(result)
            return
        for elt in sets[i] - seen:
            seen.add(elt)
            result[i] = elt
            for t in inner(i+1):
                yield t
            seen.remove(elt)

    sets = [set(seq) for seq in seqs]
    n = len(sets)
    seen = set()
    result = [None] * n
    for t in inner(0):
        yield t

#add all possible combinations to a new list named "new_data"

new_data = list(uprod(X[0],X[1]))

X_8v = X[:]
y_8v = y[:]

-----------------------------------------
#if the current X_8v (whose content equals X) does not have the match of object A and object B
#in the list "new_data",
#append a new row to the current X_8v
#Now this part is super slow, I think because I iterate a lot

for i, j in list(enumerate(X_8v[0])):
    for k, w in list(enumerate(X_8v[1])):
        if (X_8v[0][i], X_8v[1][k]) not in new_data:
            # tuples are immutable, so the extended tuple has to be assigned back with +=
            X_8v[0] += (X_8v[0][i],)
            X_8v[1] += (X_8v[1][k],)
            X_8v[2] += (X_8v[2][i],)
            X_8v[3] += (X_8v[3][k],)
            X_8v[4] += (X_8v[4][i],)
            X_8v[5] += (0,)
            X_8v[6] += (0,)
            y_8v.append(0)

Is there any possible improvement to the code above?

Many thanks!

  • This would be a lot easier with an example of what the data structure looks like (i.e. X = ...). Commented Jul 24, 2016 at 20:59
  • I'll get OP started: pastebin.com/Acu5ZQND Commented Jul 24, 2016 at 21:04
  • @smarx : X = [ (aaa, aaa, bbb, ccc), (w, q, x, w), (1, 1, 0, 1), (0, 1, 0, 0)] Commented Jul 24, 2016 at 22:00
  • @LeighTsai That's not valid Python (unless aaa and the like are variables?). Commented Jul 24, 2016 at 22:08
  • yes, the original data is interval(id no). to make it more understandable i use alphabet instead Commented Jul 24, 2016 at 22:31

2 Answers


In relational algebra terms, it sounds like you want

π[features of A, features of B] ((object A) × (object B))

i.e. project fields 'features of A', 'features of B' from the cross-product of "object A" and "object B".

This is very natural to express in SQL.
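As a rough sketch of the SQL route, the standard-library sqlite3 module can run the cross join in memory. The table names a_objects and b_objects and their column names are made up for illustration; the rows are the sample data from the question.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a_objects (name TEXT PRIMARY KEY, feature INTEGER)")
conn.execute("CREATE TABLE b_objects (name TEXT PRIMARY KEY, feature INTEGER)")
conn.executemany("INSERT INTO a_objects VALUES (?, ?)",
                 [("aaa", 1), ("bbb", 0), ("ccc", 1)])
conn.executemany("INSERT INTO b_objects VALUES (?, ?)",
                 [("w", 0), ("q", 1), ("x", 0)])

# Every object A paired with every object B, each keeping its own feature.
rows = conn.execute(
    "SELECT a.name, b.name, a.feature, b.feature "
    "FROM a_objects AS a CROSS JOIN b_objects AS b "
    "ORDER BY a.name, b.name"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

With three distinct A objects and three distinct B objects this yields nine rows.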

For Python, you probably want to load your data into a couple of dictionaries, i.e.

object_a_to_features = {"aaa": 1, "bbb": 0}
object_b_to_features = {"w": 0, "q": 1}
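One possible way to build such dictionaries from the column-oriented X shown in the question is to zip each object column with its feature column. This is only a sketch and assumes that a given object always carries the same feature value.

```python
# Column-oriented sample data from the question:
# X[0] = object A, X[1] = object B, X[2] = features of A, X[3] = features of B
X = [
    ("aaa", "aaa", "bbb", "ccc"),
    ("w", "q", "x", "w"),
    (1, 1, 0, 1),
    (0, 1, 0, 0),
]

# Assumption: each object has one consistent feature value.
# If it does not, the last occurrence silently wins.
object_a_to_features = dict(zip(X[0], X[2]))
object_b_to_features = dict(zip(X[1], X[3]))

print(object_a_to_features)
print(object_b_to_features)
```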

You'll then want to generate the cross-product of object_a_to_features.keys() and object_b_to_features.keys() and then for each row, look up the features in the appropriate dictionary.

Have a look at product() from itertools.

Something like:

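import itertools

# yield is only valid inside a function, so the loop is wrapped in a generator
# (the name cross_rows is chosen here for illustration)
def cross_rows(object_a_to_features, object_b_to_features):
    for pair in itertools.product(object_a_to_features.keys(), object_b_to_features.keys()):
        yield (pair[0], pair[1], object_a_to_features[pair[0]], object_b_to_features[pair[1]])

for row in cross_rows(object_a_to_features, object_b_to_features):
    print(row)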

Sample output:

('aaa', 'q', 1, 1)
('aaa', 'w', 1, 0)
('bbb', 'q', 0, 1)
('bbb', 'w', 0, 0)


Assuming the data actually looks like I think it does, this should do what you want quite efficiently:

import itertools

x = [('aaa', 'aaa', 'bbb', 'ccc'), ('w', 'q', 'x', 'w'), (1, 1, 0, 1), (0, 1, 0, 0)]

a_list = set((x[0][i], x[2][i]) for i in range(len(x[0])))
b_list = set((x[1][i], x[3][i]) for i in range(len(x[1])))

for combination in itertools.product(a_list, b_list):
    print(combination)

# Output:
# (('ccc', 1), ('w', 0))
# (('ccc', 1), ('x', 0))
# (('ccc', 1), ('q', 1))
# (('aaa', 1), ('w', 0))
# (('aaa', 1), ('x', 0))
# (('aaa', 1), ('q', 1))
# (('bbb', 0), ('w', 0))
# (('bbb', 0), ('x', 0))
# (('bbb', 0), ('q', 1))

Of course you can easily convert the data back into the order you originally had:

reordered = [[a[0], b[0], a[1], b[1]] for a, b in itertools.product(a_list, b_list)]

for row in reordered:
    print(row)

# ['ccc', 'w', 1, 0]
# ['ccc', 'x', 1, 0]
# ['ccc', 'q', 1, 1]
# ['aaa', 'w', 1, 0]
# ['aaa', 'x', 1, 0]
# ['aaa', 'q', 1, 1]
# ['bbb', 'w', 0, 0]
# ['bbb', 'x', 0, 0]
# ['bbb', 'q', 0, 1]
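The row order shown above may differ between runs because sets are unordered. If the first-seen order of the objects matters, one option (a sketch, not part of the original answer) is to deduplicate with dict.fromkeys, which keeps insertion order on Python 3.7+. The names a_pairs, b_pairs and ordered are introduced here for illustration.

```python
import itertools

x = [('aaa', 'aaa', 'bbb', 'ccc'), ('w', 'q', 'x', 'w'), (1, 1, 0, 1), (0, 1, 0, 0)]

# dict.fromkeys drops duplicates but keeps first-seen order (Python 3.7+)
a_pairs = list(dict.fromkeys(zip(x[0], x[2])))
b_pairs = list(dict.fromkeys(zip(x[1], x[3])))

# Same row layout as before: object A, object B, features of A, features of B
ordered = [[a[0], b[0], a[1], b[1]] for a, b in itertools.product(a_pairs, b_pairs)]

for row in ordered:
    print(row)
```

The rows then follow the order in which each object first appears in the input, starting with ['aaa', 'w', 1, 0].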

EDIT

Based on the comment below, if you want to add a column with 1 indicating "This row was in the original dataset" and 0 indicating "This row was not in the original dataset," give this a try:

existing_combinations = set(zip(x[0], x[1]))
reordered = [
    [a[0], b[0], a[1], b[1],
     1 if (a[0], b[0]) in existing_combinations else 0
    ] for a, b in itertools.product(a_list, b_list)
]

for row in reordered:
    print(row)

# Output:
# ['ccc', 'x', 1, 0, 0]
# ['ccc', 'q', 1, 1, 0]
# ['ccc', 'w', 1, 0, 1]
# ['bbb', 'x', 0, 0, 1]
# ['bbb', 'q', 0, 1, 0]
# ['bbb', 'w', 0, 0, 0]
# ['aaa', 'x', 1, 0, 0]
# ['aaa', 'q', 1, 1, 1]
# ['aaa', 'w', 1, 0, 1]
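If the extra column should instead carry over a value recorded for the original rows (such as the "past collaboration" column mentioned in the comments) rather than a plain presence flag, a dictionary keyed on the (object A, object B) pair is one option. This is a sketch under assumptions: the collab values below are invented for illustration, past_collab is a hypothetical name, and pairs that were not in the input default to 0.

```python
import itertools

x = [('aaa', 'aaa', 'bbb', 'ccc'), ('w', 'q', 'x', 'w'), (1, 1, 0, 1), (0, 1, 0, 0)]
collab = (1, 0, 1, 1)  # hypothetical "past collaboration" value per input row

a_list = set(zip(x[0], x[2]))
b_list = set(zip(x[1], x[3]))

# Map each original (object A, object B) pair to its recorded value.
past_collab = dict(zip(zip(x[0], x[1]), collab))

rows = [
    # .get(..., 0): pairs absent from the input have no collaboration history
    [a[0], b[0], a[1], b[1], past_collab.get((a[0], b[0]), 0)]
    for a, b in itertools.product(a_list, b_list)
]

for row in sorted(rows):
    print(row)
```

An original row keeps its recorded value even when that value is 0, so a separate presence flag is still needed if added and original rows must be told apart.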

6 Comments

Wow, so nice to ask here. Thank you so much!
If I have a column named "existed", where the original data has value 1 and the added data has value 0, which makes rows like ['ccc', 'w', 1, 0, 1], ['aaa', 'w', 1, 0, 1], how do I deal with this when I want to identify whether a row is added or original?
@LeighTsai I'm not sure what "added" and "original" mean here. I guess "original" means a row that is identical to a row in the input data, and "added" is a row that is not identical to any row in the input data? And I'm not sure why the "original" data would have a different value in it than the "added" data. Can you give an example?
And your understanding of "added" and "original" is correct: "added" means the rows newly added by the processing, and "original" means the input rows.
To be honest, the column name was "past collaboration". The data I originally had (before preprocessing) shows the collaboration result of object a and object b (the column contains both "1" and "0"). Now we are making a table of all combinations of object a and object b, which means that records for newly matched pairs of object a and object b are also created. Within these created rows the objects have no collaboration experience with each other, so the value is "0".
