
Replace a loop with lambda or something else to increase run speed

I have a loop which works, but for my real data set it's going to be far too slow. I basically have a huge text file, with each line separated by a \n character.

There is a distinctive message fingerprint at the beginning of each unique message; for the purposes of this, let's say they begin with a #. I've recorded the occurrence of this # ('Y') or not ('N') in a separate column, called 'Beginning'.

I want to look for lines which don't begin with a #, and if the line below also doesn't begin with a #, I want to concatenate the two. Ignore any desire to strip out \ns for the moment; I've got that covered.

My loop works, but how can I do this using a lambda function or any other way to get a good speed up?

Huge thanks in advance

for i in range(2, len(df) - 1):
    # .loc avoids pandas chained-assignment pitfalls, where
    # df['Message'][i] = ... can silently write to a copy
    if df.loc[i, 'Beginning'] == 'N' and df.loc[i + 1, 'Beginning'] == 'N':
        df.loc[i, 'Message'] = df.loc[i, 'Message'] + df.loc[i + 1, 'Message']
        df.loc[i + 1, 'Message'] = ""
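For anyone trying to reproduce this, here is a minimal self-contained setup (the column names 'Beginning' and 'Message' and the toy data are assumptions based on the snippet above; the loop starts from row 0 here rather than row 2 so the whole toy frame is processed):

```python
import pandas as pd

# Toy frame mirroring the question's assumed layout
df = pd.DataFrame({
    "Beginning": ["Y", "N", "N", "N", "Y", "N"],
    "Message": ["#msg1", "part a", "part b", "part c", "#msg2", "part d"],
})

# Same pairwise concatenation as the loop in the question, via .loc
for i in range(len(df) - 1):
    if df.loc[i, "Beginning"] == "N" and df.loc[i + 1, "Beginning"] == "N":
        df.loc[i, "Message"] = df.loc[i, "Message"] + df.loc[i + 1, "Message"]
        df.loc[i + 1, "Message"] = ""
```

Note that a single pass only merges adjacent pairs, which is why the question later mentions running the loop repeatedly for longer messages.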

An attempt at an edit to add an example:

Message-begins-now 01:01:2018:12:15:28 \n

bla bla text message \n

details about location of issue \n

specifics about something else \n

Message-begins-now 01:01:2018:12:16:78 \n

bla bla text message type 2 something xxxxxx \n

Message-begins-now 01:01:2018:12:21:05 \n

bla bla text message type 3 something xxxxxx \n

location detail for this thing \n

location detail for that thing \n

price detail for me \n

price detail for you \n

lots \n

more \n

boring \n

text \n

Message-begins-now 01:01:2018:12:35:01 \n

bla bla text message type 2 something xxxxxx \n

So the above is 4 different messages of different lengths, and I want to concatenate the text so that I have one row per message which contains all the info from beginning to end.
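(Editorial note, not from the original post: for arbitrary-length messages like the example above, one fully vectorized approach is to give every line a message id via a cumulative sum of the fingerprint marker, then join each group. The '#' prefix and the exact line text here are assumptions for illustration.)

```python
import pandas as pd

lines = [
    "#Message-begins-now 01:01:2018:12:15:28",
    "bla bla text message",
    "details about location of issue",
    "#Message-begins-now 01:01:2018:12:16:78",
    "bla bla text message type 2",
]
df = pd.DataFrame({"Message": lines})

# Each '#' line starts a new message; cumsum assigns every line
# the id of the most recent fingerprint line above it
msg_id = df["Message"].str.startswith("#").cumsum()

# Join all lines belonging to the same message into one row
out = df.groupby(msg_id)["Message"].agg(" ".join).reset_index(drop=True)
```

This handles messages of any length in one pass, with no repeated looping.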

  • Can you clarify: I see df in your code - do you use pandas.DataFrame? Commented Feb 8, 2019 at 16:18
  • I'm also having trouble visualizing your example input. Could you include some lines as an example? Commented Feb 8, 2019 at 16:18
  • Yes, I am using pandas; cheers for the prompt. Commented Feb 8, 2019 at 16:24
  • There's no reason a lambda would be any faster than your current solution. For faster processing, you'd need to change the data format, the processing, or both. Commented Feb 8, 2019 at 16:27

1 Answer


I think what you're looking for is df.shift()

For example, you can replace the iteration and if statement with something like this:

df[(df['Beginning'] == df['Beginning'].shift(1)) & (df['Beginning'] == 'N')]

or (what I would actually do)

mask = (df['Beginning'] == df['Beginning'].shift(1)) & (df['Beginning'] == 'N')

df.loc[mask, 'Message'] = df.loc[mask, 'Message'] + df.loc[mask, 'Message'].shift(1)  # you'd have to check that this is what you want; perhaps you need to shift the mask rather than the df, I'm not sure
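(Editorial note: a quick check of the shift direction, since the answer hedges on it. This sketch is my own reading, not part of the original answer: to append the *next* row's text onto the current row, the comparison and the pulled-up text both use shift(-1), and the absorbed row's mask is the concat mask shifted down by one.)

```python
import pandas as pd

df = pd.DataFrame({
    "Beginning": ["Y", "N", "N"],
    "Message": ["#head", "part a", "part b"],
})

# Rows whose next row is also 'N': concatenate the next row's text here
mask = (df["Beginning"] == "N") & (df["Beginning"].shift(-1) == "N")
df.loc[mask, "Message"] = df.loc[mask, "Message"] + df["Message"].shift(-1)[mask]

# Blank the rows that were absorbed into the row above
absorbed = mask.shift(1, fill_value=False).astype(bool)
df.loc[absorbed, "Message"] = ""
```

With this toy frame, 'part b' is pulled up into the 'part a' row and its own row is blanked.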

edit: oops, typos

edit 2 - your question has changed; I'm less sure this will be helpful to you.


4 Comments

Thanks. It's too late on a Friday to really register this at the moment, brain fried, but just to check - will this work with varying-length messages like my example above? At the moment I run my loop e.g. 20 times to capture messages from length 2 to 20 lines.
@KieranIngram when I wrote this I was under the impression that you had your data already organised in a dataframe and you were looking for a way to take the loop out of your code (I think you're right that it is inefficient, but it's hard to know if it's the limiting factor in your code). When you first posted it looked like you had a data frame already, and now it looks like you're asking for help processing the text file, so I'm not sure how helpful this will be.
I'm not sure of the best way to present my question. I've got quite a few places within my larger code where I'm currently making it work using a loop on a small segment of the real data, but as soon as I try the real large dataset the code is unbearably slow. I guess my real question is how to do something to a huge dataset without a loop. I come from an R background where I'd replace a loop with a direct mapping, e.g. newdata$var = olddata$var * 2 instead of using a loop for i in 1 to 100000 newdata$var[i] = olddata$var[i] * 2.
@KieranIngram I think that's a good instinct - and using df.apply is probably the right way to go about it (best is to define a function that works on the lines you want, then get the filters correct, then use apply to map that function where you want it). You'll appreciate, though, that it's hard for me to comment on performance without knowing more about the code. There are also profiling tools that can help you identify which bits of your code are taking the time (I don't know what's good; I use PyCharm and there is a tool built in for that). Good luck, hope you get it working. :)
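(Editorial note: the R-style direct mapping mentioned in the comment above translates directly to pandas. The column name var and the data are illustrative assumptions.)

```python
import pandas as pd

olddata = pd.DataFrame({"var": range(5)})

# Vectorized: operate on the whole column at once, no explicit Python loop
# (pandas equivalent of R's newdata$var = olddata$var * 2)
newdata = pd.DataFrame({"var": olddata["var"] * 2})

# apply-based version: slower, but works for arbitrary per-element functions
newdata_apply = olddata["var"].apply(lambda x: x * 2)
```

Prefer the vectorized form where the operation exists as a column-wise op; reach for apply only when no vectorized equivalent is available.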
