Python: split a string using any word from a list of words

Question

I have a list of words.

trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail")

I need to split another string based on any of these words.
So, say, if the names to check are:

Poverty Point FT
Cedar Party Fire Trails
Mailbox Trail
Carpet Snake Creek Firetrail
Pretty Gully firetrail - Roayl NP

I want to modify them to look like this:

Poverty Point
Cedar Party
Mailbox
Carpet Snake Creek
Pretty Gully

That is, split before any of the words from the trails list and keep only the part before it.

Thanks!

I should add, my code starts with:

for f in arcpy.da.SearchCursor("firetrail_O_noD_Layer", "FireTrailName", None, None):
...     if any(var in str(f[0]) for var in trails):
...         new_field = *that part of string without any fire trails and anything after it*

str(f[0]) refers to the names from the first list; new_field refers to the names in my second list, which I need to create.

Are your strings in files or in a list? What format do you have them in?
– gtlambert, Commented Mar 13, 2016 at 22:46
From your question it seems you need to strip the trailing parts of the lines rather than split them. Is that correct?
– Jan Vlcinsky, Commented Mar 13, 2016 at 22:51
gtlambert, my string is in rows (if that makes sense). I am reading it from a field, one by one, in a loop. It comes as part of a tuple. I then just refer to it as str(f[0]). I hope it makes sense. I am very new to Python!
– lida, Commented Mar 13, 2016 at 23:04
Jan, I have no idea what you mean! I am going through records one by one; that's why I have them listed with bullet points. Did I answer your question?
– lida, Commented Mar 13, 2016 at 23:05
@lida Yes, you have answered my question. In Python, split means splitting a string into parts, creating a list. strip, on the other hand, removes part of a string, if that is possible. Your question uses the word split, which confused me a bit. You mean strip.
– Jan Vlcinsky, Commented Mar 13, 2016 at 23:10

Bharel · Accepted Answer · 2016-03-13 23:28:54Z

3

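I believe that's what you're looking for. You may also add the flag re.IGNORECASE, like so: res = re.split(regex, s, flags=re.IGNORECASE), if you wish for it to be case insensitive. The flag has to be passed by keyword, because the third positional parameter of re.split() is maxsplit. See re.split() for further documentation.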

import re
trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail")

# \b means word boundaries.
regex = r"\b(?:{})\b".format("|".join(trails))

s = """Poverty Point FT
Cedar Party Fire Trails
Mailbox Trail
Carpet Snake Creek Firetrail
Pretty Gully firetrail - Roayl NP"""

res = re.split(regex, s)
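
The case-insensitive variant can be sketched as follows. The third positional parameter of re.split() is maxsplit, so the flag is passed by keyword. The sample string and the use of re.escape are illustrative assumptions, not part of the answer.

```python
import re

trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail")

# \b means word boundaries; re.escape guards against regex metacharacters
# in the phrases (an added precaution, assumed here for safety).
regex = r"\b(?:{})\b".format("|".join(re.escape(t) for t in trails))

# Hypothetical sample whose casing matches none of the listed phrases.
sample = "Pretty Gully FIRETRAIL - Roayl NP"

# Without the flag, no phrase matches and the string comes back whole.
case_sensitive = re.split(regex, sample)

# flags must be given by keyword: the third positional slot is maxsplit.
case_insensitive = re.split(regex, sample, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```

With the flag, the string is split around "FIRETRAIL" even though that exact casing is not in the list.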

UPDATE:

If you go line by line and don't want the rest of the line, you can do this:

import re
trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail", "Trail", "Trails")

# \b means word boundaries.
regex = r"\b(?:{}).*".format("|".join(trails))

s = """Poverty Point FT
Cedar Party Fire Trails
Mailbox Trail
Carpet Snake Creek Firetrail
Pretty Gully firetrail - Roayl NP"""

res = [r.strip() for r in re.split(regex, s)]
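
For the loop in the question, which handles one name at a time, the same pattern can be applied to a single string. This is a minimal sketch rather than a definitive implementation: the helper name cut_at_trail is hypothetical, and re.escape plus the longest-first ordering are added precautions that the answer itself does not use.

```python
import re

trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail", "Trail", "Trails")

# Longest phrases first, so that "Fire Trail" is tried before "Trail".
ordered = sorted(trails, key=len, reverse=True)

# A phrase starting at a word boundary, followed by the rest of the string.
pattern = re.compile(r"\b(?:{}).*".format("|".join(re.escape(t) for t in ordered)))


def cut_at_trail(name):
    """Return the part of name before the first trail phrase (hypothetical helper)."""
    return pattern.sub("", name).strip()


names = [
    "Poverty Point FT",
    "Cedar Party Fire Trails",
    "Mailbox Trail",
    "Carpet Snake Creek Firetrail",
    "Pretty Gully firetrail - Roayl NP",
]

for name in names:
    print(cut_at_trail(name))
```

Inside the question's cursor loop, the call would presumably look like new_field = cut_at_trail(str(f[0])).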

edited Mar 13, 2016 at 23:28

answered Mar 13, 2016 at 22:53


1 Comment

lida Over a year ago

Bharel, thanks. I ran it. The result is not what I need. From your code I got: ['Poverty Point ', '\nCedar Party Fire Trails\nMailbox Trail\nCarpet Snake Creek ', '\nPretty Gully ', ' - Roayl NP']. However, I need no 'Fire Trails', 'Trail', or 'Royal NP' (it comes after the trail word, so it needs to be removed) left over, so I can copy the value into another field. How would you write it if there was just one line from my list (but you don't know which one)? I already have a loop to go through it line by line.

midori · Accepted Answer · 2016-03-13 23:14:20Z

1

You can use re.split here:

import re

_string = "Pretty Gully firetrail - Roayl NP"  # sample input, one name from the question
_list = re.split(r'Fire trail|Firetrail|Fire Trail|FT|firetrail', _string)

edited Mar 13, 2016 at 23:14

answered Mar 13, 2016 at 22:53


3 Comments

Saleem Over a year ago

Doesn't this script render the wrong result for Pretty Gully firetrail - Roayl NP? The OP is expecting just Pretty Gully.

midori Over a year ago

@Saleem I gave him a solution written off the top of my head; he should fix his list of phrases to make it work the way he wants.

Saleem Over a year ago

Got it. I just want to make sure newcomers don't get confused. +1

Saleem · Accepted Answer · 2016-03-13 23:16:27Z

1

Well, here is a more dynamic way to perform the task:

import re

courses = r"""
Poverty Point FT
Cedar Party Fire Trails
Mailbox Trail
Carpet Snake Creek Firetrail
Pretty Gully firetrail - Roayl NP
"""

trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail")

rx_str = '|'.join(trails)
rx_str = r"^.+?(?=(?:{0}|$))".format(rx_str)

rx = re.compile(rx_str, re.IGNORECASE | re.MULTILINE)

for course in rx.finditer(courses):
    print(course.group())

As you can see, I'm converting the list into a regex dynamically, without hardcoding. The script will render the following result:

Poverty Point 
Cedar Party 
Mailbox Trail
Carpet Snake Creek 
Pretty Gully

answered Mar 13, 2016 at 23:16


Jan Vlcinsky · Accepted Answer · 2016-03-14 14:06:20Z

Since it seems the requirements and the solution need to be clarified and tested iteratively, here is a proposed solution including a test suite to be used with pytest.

First, create test_trails.py file:

import pytest


def fix_trails(trails):
    """Clean up list of trails to make sure, longest phrases are processed
    with highest priority (are sooner in the list).

    This is needed, if some trail phrases contain other ones.
    """
    trails.sort(key=len, reverse=True)
    return trails


@pytest.fixture
def trails():
    phrases = ["Fire trail", "Firetrail", "Fire Trail",
               "FT", "firetrail", "Trail", "Fire Trails"]
    return fix_trails(phrases)


def remove_trails(line, trails):
    for trail in trails:
        if trail in line:
            res = line.replace(trail, "").strip()
            return res.replace("  ", " ")
    return line


scenarios = [
    ["Poverty Point FT", "Poverty Point"],
    ["Cedar Party Fire Trails", "Cedar Party Fire"],
    ["Mailbox Trail", "Mailbox"],
    ["Carpet Snake Creek Firetrail", "Carpet Snake Creek"],
    ["Pretty Gully firetrail - Roayl NP", "Pretty Gully - Roayl NP"],
]


@pytest.mark.parametrize("scenario", scenarios, ids=lambda itm: itm[0])
def test(scenario, trails):
    line, expected = scenario
    result = remove_trails(line, trails)
    assert result == expected

The file defines the function remove_trails, which removes unneeded text from the processed lines, and also contains the test case (the function test).
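
Why fix_trails sorts the phrases longest-first can be seen in a small standalone sketch. The helper remove_first_match below is a hypothetical, simplified stand-in for remove_trails, and the two-phrase list is chosen only to make the overlap visible.

```python
def remove_first_match(line, phrases):
    """Remove the first phrase (in list order) that occurs in line."""
    for phrase in phrases:
        if phrase in line:
            return line.replace(phrase, "").strip()
    return line


# "Trail" is contained in "Fire Trails", so the order of the list matters.
phrases = ["Trail", "Fire Trails"]
line = "Cedar Party Fire Trails"

# Shorter phrase first: only "Trail" is removed, leaving a stray "Fire s".
unsorted_result = remove_first_match(line, phrases)

# Longest phrase first, as fix_trails does: the whole "Fire Trails" is removed.
sorted_result = remove_first_match(line, sorted(phrases, key=len, reverse=True))

print(repr(unsorted_result))
print(repr(sorted_result))
```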

To test it, install pytest:

$ pip install pytest

Then run the test:

$ py.test -sv test_trails.py
========================================= test session starts =========================================
platform linux2 -- Python 2.7.9, pytest-2.8.7, py-1.4.31, pluggy-0.3.1 -- /home/javl/.virtualenvs/stack/bin/python2
cachedir: .cache
rootdir: /home/javl/sandbox/stack, inifile:
collected 5 items

test_trails.py::test[Poverty Point FT] PASSED
test_trails.py::test[Cedar Party Fire Trails] FAILED
test_trails.py::test[Mailbox Trail] PASSED
test_trails.py::test[Carpet Snake Creek Firetrail] PASSED
test_trails.py::test[Pretty Gully firetrail - Roayl NP] PASSED

================ FAILURES ==================
______ test[Cedar Party Fire Trails] _______

scenario = ['Cedar Party Fire Trails', 'Cedar Party Fire']
trails = ['Fire Trails', 'Fire trail', 'Fire Trail', 'Firetrail', 'firetrail', 'Trail', ...]

    @pytest.mark.parametrize("scenario", scenarios, ids=lambda itm: itm[0])
    def test(scenario, trails):
        line, expected = scenario
        result = remove_trails(line, trails)
>       assert result == expected
E       assert 'Cedar Party' == 'Cedar Party Fire'
E         - Cedar Party
E         + Cedar Party Fire
E         ?            +++++

test_trails.py:42: AssertionError
======== 1 failed, 4 passed in 0.01 seconds ============

The py.test command discovers the test case in the file, finds its input arguments, and injects the value of the trails fixture; the parametrization of the test case provides the scenario parameter.

You may then fine-tune the function remove_trails and the list of trails until all tests pass.

When you are finished, you may move the remove_trails function to wherever you need it (probably together with the trails list).

You may use this approach to test any of the solutions proposed for your question.

Jan, that did not work the way I need it to. It still had firetrail words in the print statement: ['Poverty Point ', 'Cedar Party Fire Trails', 'Mailbox Trail', 'Carpet Snake Creek ', 'Pretty Gully firetrail - Roayl NP']. I need Fire Trails and firetrail - Royal NP to be gone. I already have a loop, so I don't need code for multiple lines. How would you do it if it was just one line?
@lida I see, you are asking neither for splitting nor for stripping the end of the line; you want to remove that text.

donkopotamus · Accepted Answer · 2016-03-13 22:58:45Z

0

You could do this using a regular expression, for example:

def make_matcher(trails):
    import re
    rgx = re.compile(r"{}".format("|".join(trails)))
    return lambda txt: rgx.split(txt)[0]

>>> m = make_matcher(["Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail"])
>>> examples = ["Poverty Point FT", "Cedar Party Fire Trails", "Mailbox Trail", "Carpet Snake Creek Firetrail", "Pretty Gully firetrail - Roayl NP"]
>>> for x in examples:
...     print(m(x))
Poverty Point 
Cedar Party 
Mailbox Trail
Carpet Snake Creek 
Pretty Gully

Note that in this example the trailing space before the occurrence of, e.g., Firetrail is maintained. That might not be what you want.
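
If the trailing space is unwanted, one possible adjustment is to strip it from the returned prefix. This is a sketch under assumptions: make_trimmed_matcher is a hypothetical variant of make_matcher, and re.escape and maxsplit=1 are additions that the original does not use.

```python
import re


def make_trimmed_matcher(trails):
    """Like make_matcher, but drops the whitespace left before the phrase."""
    # re.escape keeps any regex metacharacters in the phrases literal.
    rgx = re.compile("|".join(re.escape(t) for t in trails))
    # Only the text before the first phrase is needed, hence maxsplit=1.
    return lambda txt: rgx.split(txt, maxsplit=1)[0].rstrip()


m = make_trimmed_matcher(["Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail"])

trimmed = [
    m("Poverty Point FT"),
    m("Aberdare Fire Trail"),
    m("Pretty Gully firetrail - Roayl NP"),
    m("Mailbox Trail"),  # "Trail" alone is not in the list, so nothing is cut
]
print(trimmed)
```

The pattern is simply the phrases joined with |, meaning "match any one of these"; split then cuts the text at the first match, and [0] keeps what came before it.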

answered Mar 13, 2016 at 22:58


3 Comments

lida Over a year ago

I just adjusted your version for my table and it worked the way I wanted. Thanks. I am not sure about the space though; it's not a problem at this stage, I think. Many thanks!

lida Over a year ago

donkopotamus, for some reason some of the cases I have are not picked up, though they look exactly the same, e.g. Aberdare Fire Trail, Winters Fire Trail - Karuah NP, Carpet Snake Creek Fire Trail. Would it be because of some space issue or something else? Could you please explain what your rgx expression does? Thanks!

donkopotamus Over a year ago

@lida m("Aberdare Fire Trail") => 'Aberdare ' ... is that not what you expect?
