
How do I use pandas to join aoiFeatures with allFeaturesReadings so that the result looks like this:

183  0.03
845  0.03
853  0.01

Given the following starting code and data:

import numpy
import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

allFeaturesReadings = zip(allFeatures, allReadings)
#
# Use pandas to create Series and Join here?
#
sAllFeaturesReadings = pd.Series(dict(allFeaturesReadings))
sAOIFeatures = pd.Series(numpy.ma.filled(aoiFeatures))
sIndexedAOIFeatures = sAOIFeatures.reindex(numpy.ma.filled(aoiFeatures))
result = pd.concat([sIndexedAOIFeatures,sAllFeaturesReadings], axis=1, join='inner')
  • Does this look correct or is there an easier way? Commented May 21, 2018 at 21:20

2 Answers


Without needing to zip, you can do:

df = pd.DataFrame(data={"allFeatures":allFeatures, "allReadings":allReadings})
df[df["allFeatures"].isin(aoiFeatures)]
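If you also want the output shaped exactly like the example in the question (feature label on the left, reading on the right), a minimal follow-up sketch, assuming the same input lists as in the question, is to set the feature column as the index after filtering:

import pandas as pd

allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

# Filter to the features of interest, then use the feature values as the row labels
df = pd.DataFrame(data={"allFeatures": allFeatures, "allReadings": allReadings})
result = df[df["allFeatures"].isin(aoiFeatures)].set_index("allFeatures")
print(result)

which yields

             allReadings
allFeatures
183                 0.03
845                 0.03
853                 0.01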

2 Comments

Worked great, Toby... much faster as well!
Thanks - @unutbu's answer is the same concept but much more thorough

You could use isin:

import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

df = pd.DataFrame({'features':allFeatures, 'readings':allReadings})
result = df.loc[df['features'].isin(aoiFeatures)]
print(result)

yields

    features  readings
3        183      0.03
6        845      0.03
10       853      0.01

If you plan on selecting rows based on feature values often, if the features can be made into a unique Index, and if the DataFrame is at least moderately large (say ~10K rows), then it may be better for performance to make features the index:

import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

df = pd.DataFrame({'readings':allReadings}, index=allFeatures)
result = df.loc[aoiFeatures]
print(result)

yields

     readings
183      0.03
845      0.03
853      0.01

Here is the setup I used to make the IPython %timeit tests:

import numpy as np
import pandas as pd
N = 10000
allFeatures = np.repeat(np.arange(N), 1)
allReadings = np.random.random(N)
aoiFeatures = np.random.choice(allFeatures, N//10, replace=False)

def using_isin():
    df = pd.DataFrame({'features':allFeatures, 'readings':allReadings})
    for i in range(1000):
        result = df.loc[df['features'].isin(aoiFeatures)]
    return result


def using_index():
    df = pd.DataFrame({'readings':allReadings}, index=allFeatures)
    for i in range(1000):
        result = df.loc[aoiFeatures]
    return result

This shows using_index can be a bit faster:

In [108]: %timeit using_isin()
1 loop, best of 3: 697 ms per loop

In [109]: %timeit using_index()
1 loop, best of 3: 432 ms per loop

Note, however, that if allFeatures contains duplicates, then making it the Index is NOT advantageous. For example, if you change the setup above to use:

allFeatures = np.repeat(np.arange(N//2), 2)    # repeat every value twice

then

In [114]: %timeit using_isin()
1 loop, best of 3: 667 ms per loop

In [115]: %timeit using_index()
1 loop, best of 3: 3.47 s per loop

4 Comments

I wanted to eventually do exactly as you said in your second example, using allFeatures as the index... thank you!
@Jack: I posted my answer a bit too soon. I did some timeit tests and found that the benefit of making features the index comes only under certain conditions: the features index has to be unique and the DataFrame has to be moderately large (on my machine, at least 10K rows) before there is a performance benefit.
So... say I had another array of readings like: nextReadings = [0.04, 0.09, 0.21, 0.01, 0.06, 0.08, 0.13, 0.01, 0.01, 0.02, 0.04, 0.06] that still indexed in order with allFeatures. Is there an easy way to find the values from each array that were largest for a given allFeatures element? (In the example above, the readings for 845 and 853 would change to 0.13 and 0.04, respectively). Also, in the real world, allFeatures and allReadings each have 2.4 million elements, and aoiFeatures has 67,000 elements. I have 17 other arrays of readings like allReadings to compare
df.loc[[845, 853]] selects the rows whose index labels correspond with 845 or 853. To find the maximum for each row, you could use df.loc[[845, 853]].max(axis=1). But I'm not sure I'm understanding your question properly. If I'm not, then please post a new question with all the details. (Sample data and expected output as you provided in this question is very helpful.)
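For the multiple-readings question in the comments, a minimal sketch, assuming the nextReadings list quoted above and the index-based DataFrame from the second example: put each readings array in its own column and take the row-wise maximum with max(axis=1).

import pandas as pd

allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
nextReadings = [0.04, 0.09, 0.21, 0.01, 0.06, 0.08, 0.13, 0.01, 0.01, 0.02, 0.04, 0.06]
aoiFeatures = [183, 845, 853]

# One column per readings array, indexed by feature
df = pd.DataFrame({'readings': allReadings, 'nextReadings': nextReadings},
                  index=allFeatures)

# Row-wise maximum across all readings columns, restricted to the features of interest
maxReadings = df.loc[aoiFeatures].max(axis=1)
print(maxReadings)

which yields

183    0.03
845    0.13
853    0.04
dtype: float64

With more readings arrays (the 17 mentioned in the comment), the same pattern applies: add one column per array, or build the DataFrame from a dict of all of them, and max(axis=1) still gives the per-feature maximum.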
