I have sample data that looks like this (the real dataset has more columns):

import pandas as pd

data = {'stringID':['AB CD Efdadasfd','RFDS EDSfdsadf dsa','FDSADFDSADFFDSA'],'IDct':[1,3,4]}
data = pd.DataFrame(data)
data['Index1'] = [[3,6],[7,9],[5,6]]
data['Index2'] = [[4,8],[10,13],[8,9]]


What I want to achieve is to slice the stringID column using the second element of Index1 and Index2 (both are lists) as the start and stop positions, but only if the IDct value is greater than 1; otherwise return NaN.

I tried this and it works (see the Output1 column), but there must be a better way to do it (I mean faster when applied to a large dataset). Please advise, thanks!

data['pos'] = data.Index1.map(lambda x: x[1])
data['pos1'] = data.Index2.map(lambda x: x[1])

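def cal(m):
    # Slice only when IDct is greater than 1
    if m['IDct'] > 1:
        return m['stringID'][m['pos']:m['pos1']]
    else:
        # a real missing value rather than the string 'NaN'
        return float('nan')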

data['Output1'] = data.apply(cal,axis=1)


  • You say there "must be a better way to do it". In your case, what would define a "better" way? What is the problem you have with the current method? Memory efficiency, time efficiency, etc.? Commented Sep 24, 2020 at 19:39
  • I'm thinking of a clearer or faster way, if that makes sense. For example, calculation time when applied to a very large dataset. Commented Sep 24, 2020 at 19:40
  • Here is a really, really good overview of some times when native pandas methods are best, when loops or apply are just as good, and when to drop back to regular old Python. Commented Sep 24, 2020 at 21:20

1 Answer


I love pandas - but realistically speaking it's just one of many tools that belong in your tool belt.

pandas and numpy really shine for computation and analysis. It's okay to use pandas to visualize and analyze your data, but that doesn't mean it's the right tool for every job.

This kind of problem is better suited to regular Python. Assuming we can, let's move StringID and IDct out of the dict and back into lists, and assume the data is regular in shape (all lists are of equal length):

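StringID = ['AB CD Efdadasfd','RFDS EDSfdsadf dsa','FDSADFDSADFFDSA']
IDct = [1,3,4]
Index1 = [[3,6],[7,9],[5,6]]
Index2 = [[4,8],[10,13],[8,9]]

# Create the result list once, outside the loop, so it is not reset on every row
result = []
# Loop variables use different names so they do not overwrite the lists above
for string_id, id_ct, index1, index2 in zip(StringID, IDct, Index1, Index2):
    if id_ct > 1:
        # Slice between the second elements of Index1 and Index2
        result.append(string_id[index1[1]:index2[1]])
    else:
        result.append(None)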

You can then blend the result data back in as you see fit.

data = {
    'StringID': StringID,
    'IDct': IDct,
    'Index1': Index1,
    'Index2': Index2,
    'Result': result
}

pd.DataFrame(data)
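
If the data already lives in a DataFrame, as in the question, the same plain-Python loop can be run over the existing columns and assigned straight back, without rebuilding the frame. The following is only a sketch under that assumption; the column names are taken from the question, and the short variable names inside the comprehension are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'stringID': ['AB CD Efdadasfd', 'RFDS EDSfdsadf dsa', 'FDSADFDSADFFDSA'],
    'IDct': [1, 3, 4],
    'Index1': [[3, 6], [7, 9], [5, 6]],
    'Index2': [[4, 8], [10, 13], [8, 9]],
})

# Walk the four columns in lockstep; no intermediate 'pos'/'pos1' columns needed.
# Rows with IDct <= 1 get None, which pandas treats as a missing value.
data['Output1'] = [
    s[i1[1]:i2[1]] if n > 1 else None
    for s, n, i1, i2 in zip(data['stringID'], data['IDct'],
                            data['Index1'], data['Index2'])
]
```

This avoids building a Series for every row the way apply(axis=1) does, which usually helps on large frames, but it is worth timing both versions on your own data.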

3 Comments

Thank you! I have a follow-up question about lists of varying length: for example, I want to pick out the second element of each list, but some lists contain only one value. I tried np.where(data['IDct']>1, data.Index1.map(lambda x: x[1]),0) and np.where(data['IDct']>1, [x[1] for x in data['Index1']],0), but both raise a "list index out of range" error...
Use regular Python logic - simple is better. If Index1 and Index2 are of variable length, then use their lengths to decide what to do, i.e. if len(Index1) < 1: None/NaN, elif len(Index1) == 1: Index1[0], else: Index1[1].
Thanks! I tried data.loc[data['IDct']>1]['Index1'].apply(lambda x:x[1]) and it worked as well!
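
The length check suggested in the comments above could be written as a small helper along these lines. This is only a sketch: the names pick_position and slice_or_none are hypothetical, and returning None when either list is empty is an assumption rather than something stated in the thread.

```python
def pick_position(index_list):
    """Second element if there is one, else the first, else None."""
    if len(index_list) < 1:
        return None
    elif len(index_list) == 1:
        return index_list[0]
    else:
        return index_list[1]


def slice_or_none(string_id, id_ct, index1, index2):
    """Slice string_id only when id_ct > 1 and both positions exist."""
    if id_ct <= 1:
        return None
    start = pick_position(index1)
    stop = pick_position(index2)
    # Assumption: an empty index list means there is nothing to slice.
    if start is None or stop is None:
        return None
    return string_id[start:stop]


# Example rows, including lists shorter than two elements.
rows = [
    ('AB CD Efdadasfd', 1, [3, 6], [4, 8]),
    ('RFDS EDSfdsadf dsa', 3, [7], [10, 13]),
    ('FDSADFDSADFFDSA', 4, [], [8, 9]),
]
result = [slice_or_none(*row) for row in rows]
```

Because the helper never indexes past the end of a list, it avoids the "list index out of range" error regardless of which rows are filtered out beforehand.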
