select rows where values match specific characters python

Question

I have a series that looks like this:

s = pd.Series(['abdhd', 'abadh', 'aba', 'djjb', 'kjsdhf', 'abwer', 'djd', 'kja'])

I need to select all rows whose strings begin with 'dh' or 'kj'.

I tried .startswith() and .match(), but I get Boolean True/False values back instead of the matching values themselves.

I also tried this as part of a dictionary and got the same Boolean results rather than the values themselves.

Is there something else I can do?

Just slice the series by that Boolean list.

– ALollz, Commented Mar 21, 2018 at 3:25

ALollz · Accepted Answer · 2018-03-21 03:27:49Z

4

Try

s[(s.str.startswith('dh')) | (s.str.startswith('kj'))]

Explanation: (s.str.startswith('dh')) | (s.str.startswith('kj')) is the logical condition you care about. It evaluates to a Boolean series with one True/False per row, and putting that inside s[] slices the series by rows, returning only the rows where the condition is True.
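To make the two steps explicit, here is a minimal sketch that builds the Boolean mask first and then indexes with it. It uses the data from the question, with the closing quote on 'djd' added so that it parses; the names `mask` and `selected` are only illustrative.

```python
import pandas as pd

# Data from the question (closing quote on 'djd' added so it parses).
s = pd.Series(['abdhd', 'abadh', 'aba', 'djjb', 'kjsdhf', 'abwer', 'djd', 'kja'])

# Step 1: build the Boolean mask, one True/False per row.
mask = s.str.startswith('dh') | s.str.startswith('kj')

# Step 2: index the series with the mask to keep only the True rows.
selected = s[mask]
print(selected)
```

The mask itself is the True/False output described in the question; indexing `s` with it is the missing step.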


2 Comments

piRSquared Over a year ago

@M-M feel free to upvote this answer as well if you found it useful.

piRSquared Over a year ago

@ALollz I posted time tests if you are interested. Sometimes I get bored. Nice answer btw. +1

piRSquared · Accepted Answer · 2018-03-21 04:12:25Z

`pd.Series.str.contains`

s[s.str.contains('^(?:dh|kj)')]

4    kjsdhf
7       kja
dtype: object
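A caveat on anchoring: in a regular expression, `^` applies only to the alternative it is written in, so a pattern of the form `^dh|kj` matches `dh` at the start or `kj` anywhere in the string. The sketch below uses a hypothetical series `t` (not from the original post) to show the difference, alongside `str.match`, which only matches at the beginning of each string.

```python
import pandas as pd

# Hypothetical data: 'abkjx' contains 'kj' but does not start with it.
t = pd.Series(['dhabc', 'kjabc', 'abkjx', 'xyz'])

# '^' binds only to 'dh', so 'kj' may appear anywhere in the string.
loose = t[t.str.contains('^dh|kj')]

# A non-capturing group anchors both alternatives at the start.
strict = t[t.str.contains('^(?:dh|kj)')]

# str.match tests from the start of each string, so no '^' is needed.
matched = t[t.str.match('dh|kj')]

print(loose.tolist())
print(strict.tolist())
print(matched.tolist())
```

On the data in the question both forms happen to return the same rows, because no string there contains 'kj' past the first position.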

`pd.Series.isin`

s[s.str[:2].isin(['dh', 'kj'])]

4    kjsdhf
7       kja
dtype: object

`str.startswith` within a comprehension

s[[any(map(x.startswith, ['dh', 'kj'])) for x in s]]

4    kjsdhf
7       kja
dtype: object
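As a possible simplification, Python's built-in `str.startswith` accepts a tuple of prefixes, so the `any(map(...))` construct can be written as a single call. This is a sketch that assumes every element of the series is a string (no missing values); `prefixes` and `result` are illustrative names.

```python
import pandas as pd

s = pd.Series(['abdhd', 'abadh', 'aba', 'djjb', 'kjsdhf', 'abwer', 'djd', 'kja'])

# str.startswith takes a tuple of prefixes (a list would raise TypeError).
prefixes = ('dh', 'kj')

# One Boolean per element, then index the series with that list.
result = s[[x.startswith(prefixes) for x in s]]
print(result)
```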

Time Tests

Functions

pir1 = lambda s: s[s.str.contains('^dh|kj')]
pir2 = lambda s: s[s.str[:2].isin(['dh', 'kj'])]
pir3 = lambda s: s[[any(map(x.startswith, ['dh', 'kj'])) for x in s]]
alol = lambda s: s[(s.str.startswith('dh')) | (s.str.startswith('kj'))]

Testing

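from timeit import timeit

import numpy as np
import pandas as pd

# One row per series-length multiplier, one column per function under test.
res = pd.DataFrame(
    np.nan, [10, 30, 100, 300, 1000, 3000, 10000, 30000],
    'pir1 pir2 pir3 alol'.split()
)

for i in res.index:
    # Repeat the original series i times to scale the input.
    s_ = pd.concat([s] * i)
    for j in res.columns:
        stmt = f'{j}(s_)'
        setp = f'from __main__ import {j}, s_'
        # Total time for 200 calls of function j on s_.
        res.at[i, j] = timeit(stmt, setp, number=200)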

Results

res.plot(loglog=True)

res.div(res.min(1), 0)

           pir1      pir2      pir3      alol
10     2.424637  3.272403  1.000000  4.747473
30     2.756702  2.812140  1.000000  4.446757
100    2.673724  2.190306  1.000000  3.128486
300    1.787894  1.000000  1.342434  1.997433
1000   2.164429  1.000000  1.788028  2.244033
3000   2.325746  1.000000  1.922993  2.227902
10000  2.424354  1.000000  2.042643  2.242508
30000  2.153505  1.000000  1.847457  1.935085

Conclusions

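For series of about 300 rows or more, the only real winner (and only just barely) is isin, and it also happens to be the least general: slicing with s.str[:2] only works when every prefix you are looking for has the same length (here, the first 2 characters). For the smallest inputs in the table (up to 100 rows), the comprehension is fastest.

Beyond that, the methods all appear to scale similarly as the series grows.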
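To illustrate the generality point, here is a sketch using hypothetical prefixes of different lengths ('dj' and 'abw', chosen only for illustration and not part of the original question). The fixed-width slice silently misses the longer prefix, while a prefix test handles both.

```python
import pandas as pd

s = pd.Series(['abdhd', 'abadh', 'aba', 'djjb', 'kjsdhf', 'abwer', 'djd', 'kja'])

# Hypothetical prefixes with different lengths.
prefixes = ('dj', 'abw')

# Fixed-width slice: a 2-character slice can never equal 'abw'.
by_isin = s[s.str[:2].isin(list(prefixes))]

# Prefix test: works for any mix of prefix lengths.
by_startswith = s[[x.startswith(prefixes) for x in s]]

print(by_isin.tolist())
print(by_startswith.tolist())
```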
