0

I'm stuck trying to get a substring from a list of strings using pandas. Basically, the application returns data in this way:

['com.server.application.service.sprint.Sprint@20137b52[id=8837,rapidViewId=7061,state=CLOSED,name=name_of_the_sprint_1,startDate=2022-02-21T13:07:00.000Z,endDate=2022-03-11T13:07:00.000Z,completeDate=2022-03-14T17:19:29.271Z,activatedDate=2022-02-21T20:57:03.111Z,sequence=8837,goal=,autoStartStop=false]', 'com.server.application.service.sprint.Sprint@5fcc83c9[id=8919,rapidViewId=7061,state=CLOSED,name=name_of_the_sprint_2,startDate=2022-03-14T14:52:00.000Z,endDate=2022-04-01T14:52:00.000Z,completeDate=2022-04-04T18:25:08.141Z,activatedDate=2022-03-14T20:52:24.680Z,sequence=8919,goal=,autoStartStop=false]']

This list has two items, and what I'm trying to get from each one is the sprint name that follows name=, i.e. name_of_the_sprint_1 and name_of_the_sprint_2.

What I've done so far (I don't know whether it's the best or only way) is the following:

df['sprints'].iloc[idx][0].split(',') creates a list from which I can get the information I want. But I'd need to split again (after finding 'name=name_of_the_sprint_1' in this sublist) to get only the name I need.

Is there a better way to extract this information from my dataframe? I'll need to iterate over a dataframe with 3500 rows and do it for each item.

Thanks, folks for the help.

2
  • Try using the expand=True parameter of pd.Series.str.split (pandas.pydata.org/docs/reference/api/…); then you can do another split and get the data you need. Note that expand=True belongs to the pandas .str accessor, not to Python's built-in str.split. Ex: df['sprints'].str[0].str.split(',', expand=True) Alternatively, a regex or the pd.Series.str.extract method would work as well, though it's probably not needed since your data is nicely structured for splitting. Commented Apr 7, 2022 at 2:02
  • Just curious: does every row of the dataframe have two elements in that column? The list in your post contains two elements. Commented Apr 7, 2022 at 2:26
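
The str.extract route mentioned in the first comment could look roughly like the sketch below. It assumes, as the question suggests, that each cell of df['sprints'] holds a list of sprint strings; the sample strings are abbreviated stand-ins for the real data, and the pattern and variable names are only illustrative.

```python
import pandas as pd

# Abbreviated, hypothetical sample: one row whose cell is a list of sprint strings.
df = pd.DataFrame({
    "sprints": [
        [
            "Sprint@20137b52[id=8837,state=CLOSED,name=name_of_the_sprint_1,goal=,autoStartStop=false]",
            "Sprint@5fcc83c9[id=8919,state=CLOSED,name=name_of_the_sprint_2,goal=,autoStartStop=false]",
        ],
    ]
})

# One string per row; the original row label is repeated in the index.
flat = df["sprints"].explode()

# Capture everything after 'name=' up to the next ',' or ']'.
# The leading [\[,] keeps keys that merely end in 'name' from matching.
names = flat.str.extract(r"[\[,]name=([^,\]]*)", expand=False)

# Optional: regroup the names into one list per original row.
names_per_row = names.groupby(level=0).agg(list)
```

This avoids an explicit Python loop over the 3500 rows, although whether it is actually faster than the loops in the answers would need to be measured on the real data.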

2 Answers

1

A nested for loop works well if you arrange your code neatly. I tried this with 7000 rows of your data:

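def function(df):
    result = []
    for i in df['sprints']:
        # Each comma-separated piece looks like 'key=value'
        split_string = i.split(',')
        for row in split_string:
            # startswith avoids matching keys that merely contain 'name='
            if row.startswith('name='):
                result.append(row[5:])  # drop the 'name=' prefix
    return result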

%timeit function(df)
14.4 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Extra

I've just realized that, since you already know the keyword you're looking for, you can simply use re.search to get your output:

import re

def function(df):
    # Matches the literal prefix followed by digits, e.g. name_of_the_sprint_1
    return [re.search(r"name_of_the_sprint_\d+", row).group() for row in df['sprints']]

%timeit function(df)
10.9 ms ± 328 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Or, if different names can follow name=, you can try this (note that \w+ only matches letters, digits and underscores):

result = [re.search(r'name=(\w+)', row).group(1) for row in df['sprints']]
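
One caveat with the re.search one-liners: re.search returns None when a row has no match, so calling .group() on it raises AttributeError. A more defensive variant might look like the sketch below; the helper name extract_name, the pattern, and the shortened sample rows are illustrative assumptions, not part of the original data.

```python
import re

# Compile once; capture everything after 'name=' up to the next ',' or ']'.
# The leading [\[,] requires 'name' to be the whole key.
NAME_RE = re.compile(r"[\[,]name=([^,\]]*)")

def extract_name(text):
    """Return the value after name=, or None when the field is missing."""
    match = NAME_RE.search(text)
    return match.group(1) if match else None

# Shortened, hypothetical rows; the second one has no name field.
rows = [
    "Sprint@20137b52[id=8837,state=CLOSED,name=name_of_the_sprint_1,goal=]",
    "Sprint@5fcc83c9[id=8919,state=CLOSED,goal=]",
]
names = [extract_name(row) for row in rows]
```

Unlike \w+, the character class [^,\]]* also accepts names containing spaces or hyphens, at the cost of assuming that a name never contains a comma or a closing bracket.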

1

The first thing that comes to mind is to slice the string, starting just after the = and ending at the next comma. If the list of strings were named data, it might look like this:

data = ["whatever items, not important, name=your_thing_name, some more random stuff,", "even more random stuff, name=a_different_name, some more random things"]

for d in data:
  sub = d.index("name=") + 5      # position just past 'name='
  val = d[sub:d.index(",", sub)]  # slice up to the next comma

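As far as performance goes, I ran the script below and the total time measured was about 0.2 seconds. That figure includes building the 3500 test strings and printing every value, not just the slicing: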

from time import perf_counter as pc

start = pc()

data = []
for i in range(3500):
  data.append(f"this, things, name={i}_loop, very cool, ik")

for d in data:
  sub = d.index("name=") + 5      # position just past 'name='
  val = d[sub:d.index(",", sub)]  # slice up to the next comma
  print(val)

print(pc() - start)
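
If more than the name is needed, the same slicing idea can be extended to turn the whole bracketed part into a dictionary. This is only a sketch: parse_sprint is a hypothetical helper, the sample string is a shortened version of the one in the question, and it assumes no field value (for example goal) contains a comma.

```python
def parse_sprint(text):
    """Split the 'key=value' pairs between '[' and ']' into a dict."""
    # Keep only the text between the first '[' and the last ']'.
    payload = text[text.index("[") + 1 : text.rindex("]")]
    fields = {}
    for pair in payload.split(","):
        # partition splits on the first '=' only; empty values stay empty.
        key, _, value = pair.partition("=")
        fields[key] = value
    return fields

# Shortened version of one item from the question.
sample = (
    "com.server.application.service.sprint.Sprint@20137b52"
    "[id=8837,rapidViewId=7061,state=CLOSED,name=name_of_the_sprint_1,"
    "goal=,autoStartStop=false]"
)
sprint = parse_sprint(sample)
sprint_name = sprint["name"]
```

All values come back as strings, so fields such as id or the dates would still need converting if they are used as numbers or timestamps.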
