
I have a series that looks like this:

s = pd.Series(['abdhd', 'abadh', 'aba', 'djjb', 'kjsdhf', 'abwer', 'djd', 'kja'])

I need to select all rows whose strings begin with 'dh' or 'kj'.

I tried .startswith() and .match(), but I get Boolean True/False values back instead of the matching strings themselves.

I also tried this as part of a dictionary and got the same Boolean results rather than the values themselves.

Is there something else I can do?

  • Just slice the series by that Boolean list. Commented Mar 21, 2018 at 3:25

2 Answers


Try

s[(s.str.startswith('dh')) | (s.str.startswith('kj'))]

Explanation: (s.str.startswith('dh')) | (s.str.startswith('kj')) is the Boolean condition you care about. Putting it inside s[] selects rows of the series, returning only those where the condition is True.
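To see why the methods in the question return True/False, it can help to look at the mask on its own. This is a minimal sketch, assuming the series from the question with the missing quote restored; the names `mask` and `selected` are illustrative and not part of the original answer.

```python
import pandas as pd

# Series from the question (with the quoting of 'djd' repaired).
s = pd.Series(['abdhd', 'abadh', 'aba', 'djjb', 'kjsdhf', 'abwer', 'djd', 'kja'])

# Each .str.startswith call returns a Boolean Series aligned with s;
# | combines them element-wise.
mask = s.str.startswith('dh') | s.str.startswith('kj')

# Indexing s with the mask keeps only the rows where the mask is True.
selected = s[mask]
print(selected.tolist())  # ['kjsdhf', 'kja']
```

Keeping the mask in its own variable also makes it easy to reuse, for example `s[~mask]` for the rows that do not match.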


2 Comments

@M-M feel free to upvote this answer as well if you found it useful.
@ALolls I posted time tests if you are interested. Sometimes I get bored. Nice answer btw. +1

pd.Series.str.contains

s[s.str.contains('^(?:dh|kj)')]

4    kjsdhf
7       kja
dtype: object
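One detail worth checking with regex alternation: `|` has lower precedence than `^`, so an ungrouped `^dh|kj` anchors only `dh` and matches `kj` anywhere in the string. The sketch below compares the ungrouped and grouped patterns on a hypothetical series `t` (not from the question), and also shows `str.match`, which anchors at the start of each string.

```python
import pandas as pd

# Hypothetical data: 'abkj' contains 'kj' but does not start with it.
t = pd.Series(['dhx', 'kjy', 'abkj', 'xdh'])

# '^' applies to 'dh' only, so 'kj' may appear anywhere.
ungrouped = t[t.str.contains('^dh|kj')]

# A non-capturing group makes '^' apply to both alternatives.
grouped = t[t.str.contains('^(?:dh|kj)')]

# str.match only matches at the beginning of each string.
matched = t[t.str.match('dh|kj')]

print(ungrouped.tolist())  # ['dhx', 'kjy', 'abkj']
print(grouped.tolist())    # ['dhx', 'kjy']
print(matched.tolist())    # ['dhx', 'kjy']
```

On the series in the question both patterns happen to select the same rows, because no string there contains 'kj' after the first position.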

pd.Series.isin

s[s.str[:2].isin(['dh', 'kj'])]

4    kjsdhf
7       kja
dtype: object

str.startswith within a comprehension

s[[any(map(x.startswith, ['dh', 'kj'])) for x in s]]

4    kjsdhf
7       kja
dtype: object

Time Tests

Functions
pir1 = lambda s: s[s.str.contains('^(?:dh|kj)')]
pir2 = lambda s: s[s.str[:2].isin(['dh', 'kj'])]
pir3 = lambda s: s[[any(map(x.startswith, ['dh', 'kj'])) for x in s]]
alol = lambda s: s[(s.str.startswith('dh')) | (s.str.startswith('kj'))]
Testing
from timeit import timeit

import numpy as np
import pandas as pd

res = pd.DataFrame(
    np.nan, [10, 30, 100, 300, 1000, 3000, 10000, 30000],
    'pir1 pir2 pir3 alol'.split()
)

for i in res.index:
    s_ = pd.concat([s] * i)
    for j in res.columns:
        stmt = f'{j}(s_)'
        setp = f'from __main__ import {j}, s_'
        res.at[i, j] = timeit(stmt, setp, number=200)
Results
res.plot(loglog=True)

[Figure: log-log plot of res, one line per function]

res.div(res.min(axis=1), axis=0)

           pir1      pir2      pir3      alol
10     2.424637  3.272403  1.000000  4.747473
30     2.756702  2.812140  1.000000  4.446757
100    2.673724  2.190306  1.000000  3.128486
300    1.787894  1.000000  1.342434  1.997433
1000   2.164429  1.000000  1.788028  2.244033
3000   2.325746  1.000000  1.922993  2.227902
10000  2.424354  1.000000  2.042643  2.242508
30000  2.153505  1.000000  1.847457  1.935085
Conclusions

The only real winner (and only just barely) is isin, and only from the 300-repetition row of the table upward; for the smaller inputs the comprehension is fastest. isin also happens to be the least general: slicing with str[:2] works here only because both prefixes are exactly two characters long.

Beyond that, all four methods appear to scale similarly with the length of the series.
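If the prefixes had different lengths, the fixed-width slice would no longer apply. One possible alternative, sketched here under the assumption that the series holds only strings (no missing values), relies on Python's built-in `str.startswith` accepting a tuple. The `prefixes` tuple is hypothetical and chosen only to mix lengths.

```python
import pandas as pd

s = pd.Series(['abdhd', 'abadh', 'aba', 'djjb', 'kjsdhf', 'abwer', 'djd', 'kja'])

# Hypothetical prefixes of different lengths (not from the question).
prefixes = ('dh', 'kjs')

# Built-in str.startswith returns True if the string starts with
# any element of the tuple.
result = s[[x.startswith(prefixes) for x in s]]
print(result.tolist())  # ['kjsdhf']
```

This keeps the comprehension form from the answer but avoids the inner `any(map(...))` call.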
