
How do I use pandas to join aoiFeatures with allFeaturesReadings so that the result looks like this:

183  0.03
845  0.03
853  0.01

Given the following starting code and data:

import numpy
import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

allFeaturesReadings = zip(allFeatures, allReadings)
#
# Use pandas to create Series and Join here?
#
sAllFeaturesReadings = pd.Series(dict(allFeaturesReadings))
sAOIFeatures = pd.Series(numpy.ma.filled(aoiFeatures))
sIndexedAOIFeatures = sAOIFeatures.reindex(numpy.ma.filled(aoiFeatures))
result = pd.concat([sIndexedAOIFeatures,sAllFeaturesReadings], axis=1, join='inner')
  • Does this look correct or is there an easier way? Commented May 21, 2018 at 21:20

2 Answers


Without needing to zip, you can do:

df = pd.DataFrame(data={"allFeatures":allFeatures, "allReadings":allReadings})
df[df["allFeatures"].isin(aoiFeatures)]
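If you also want the output shaped exactly like the example in the question (feature label on the left, reading on the right), a minimal follow-up sketch, assuming the same input lists as in the question, is to set the feature column as the index after filtering:

import pandas as pd

allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

# Filter to the features of interest, then use the feature values as the row labels
df = pd.DataFrame(data={"allFeatures": allFeatures, "allReadings": allReadings})
result = df[df["allFeatures"].isin(aoiFeatures)].set_index("allFeatures")
print(result)

which yields

             allReadings
allFeatures
183                 0.03
845                 0.03
853                 0.01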

2 Comments

Worked great, Toby... much faster as well!
Thanks - @unutbu's answer is the same concept but much more thorough

You could use isin:

import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

df = pd.DataFrame({'features':allFeatures, 'readings':allReadings})
result = df.loc[df['features'].isin(aoiFeatures)]
print(result)

yields

    features  readings
3        183      0.03
6        845      0.03
10       853      0.01

If you plan on selecting rows based on feature values often, if the features can be made into a unique Index, and if the DataFrame is at least moderately large (say ~10K rows), then it may be better for performance to make features the index:

import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

df = pd.DataFrame({'readings':allReadings}, index=allFeatures)
result = df.loc[aoiFeatures]
print(result)

yields

     readings
183      0.03
845      0.03
853      0.01

Here is the setup I used to make the IPython %timeit tests:

import numpy as np
import pandas as pd
N = 10000
allFeatures = np.repeat(np.arange(N), 1)
allReadings = np.random.random(N)
aoiFeatures = np.random.choice(allFeatures, N//10, replace=False)

def using_isin():
    df = pd.DataFrame({'features':allFeatures, 'readings':allReadings})
    for i in range(1000):
        result = df.loc[df['features'].isin(aoiFeatures)]
    return result


def using_index():
    df = pd.DataFrame({'readings':allReadings}, index=allFeatures)
    for i in range(1000):
        result = df.loc[aoiFeatures]
    return result

This shows using_index can be a bit faster:

In [108]: %timeit using_isin()
1 loop, best of 3: 697 ms per loop

In [109]: %timeit using_index()
1 loop, best of 3: 432 ms per loop

Note, however, that if allFeatures contains duplicates, then making it the Index is NOT advantageous. For example, if you change the setup above to use:

allFeatures = np.repeat(np.arange(N//2), 2)    # repeat every value twice

then

In [114]: %timeit using_isin()
1 loop, best of 3: 667 ms per loop

In [115]: %timeit using_index()
1 loop, best of 3: 3.47 s per loop

4 Comments

I wanted to eventually do exactly as you said in your second example, using allFeatures as the index... thank you!
@Jack: I posted my answer a bit too soon. I did some timeit tests and found that the benefit of making features the index comes only under certain conditions: the features index has to be unique and the DataFrame has to be moderately large (on my machine, at least 10K rows) before there is a performance benefit.
So... say I had another array of readings like: nextReadings = [0.04, 0.09, 0.21, 0.01, 0.06, 0.08, 0.13, 0.01, 0.01, 0.02, 0.04, 0.06] that still indexed in order with allFeatures. Is there an easy way to find the values from each array that were largest for a given allFeatures element? (In the example above, the readings for 845 and 853 would change to 0.13 and 0.04, respectively). Also, in the real world, allFeatures and allReadings each have 2.4 million elements, and aoiFeatures has 67,000 elements. I have 17 other arrays of readings like allReadings to compare
df.loc[[845, 853]] selects the rows whose index labels correspond with 845 or 853. To find the maximum for each row, you could use df.loc[[845, 853]].max(axis=1). But I'm not sure I'm understanding your question properly. If I'm not, then please post a new question with all the details. (Sample data and expected output as you provided in this question is very helpful.)
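For the multiple-readings question in the comments, a minimal sketch, assuming the nextReadings list quoted above and the index-based DataFrame from the second example: put each readings array in its own column and take the row-wise maximum with max(axis=1).

import pandas as pd

allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
nextReadings = [0.04, 0.09, 0.21, 0.01, 0.06, 0.08, 0.13, 0.01, 0.01, 0.02, 0.04, 0.06]
aoiFeatures = [183, 845, 853]

# One column per readings array, indexed by feature
df = pd.DataFrame({'readings': allReadings, 'nextReadings': nextReadings},
                  index=allFeatures)

# Row-wise maximum across all readings columns, restricted to the features of interest
maxReadings = df.loc[aoiFeatures].max(axis=1)
print(maxReadings)

which yields

183    0.03
845    0.13
853    0.04
dtype: float64

With more readings arrays (the 17 mentioned in the comment), the same pattern applies: add one column per array, or build the DataFrame from a dict of all of them, and max(axis=1) still gives the per-feature maximum.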
