Python function to remove strings from a list if a substring already exists

Question

I'm wondering if they're is a "pythonic" way of removing elements from a list, if that element contains a substring of another element.

For example say we have a list like this:

["/var/html/site1", "/var/html/site1/nested_web_root1", "/var/html/site1/nested_web_root2"]

/var/html/site1 is contained within both: /var/html/site1/nested_web_root1 & /var/html/site1/nested_web_root2 As such I'd like them removed from the list.

I've already written a function to do this, and it does mostly work, but the code is just horrible and overly-complex. There are also edge-cases where it just fails.

Here is what I've written so far:

def substringSieve(string_list):
    string_list.sort(key=lambda s: len(s), reverse=False)
    out = []
    bad_list = []
    for outer_string in string_list:
        for inner_string in string_list:
            if outer_string != inner_string:
                if outer_string in inner_string:
                    out.append(outer_string)
                    bad_list.append(inner_string)
        if outer_string not in out and outer_string not in bad_list:
            out.append(outer_string)
    return out

Can anyone offer some insight?

Thanks.

Please post your code, and give examples of the edge cases that make it fail. — Swifty
– Swifty, Commented Nov 14, 2024 at 16:42
Please edit your post to add your own attempt. Even tho they might not work, they help us answering the question in a way you learn something from it, instead of 'dumping the code'. It also shows you've done your research — 0stone0
– 0stone0, Commented Nov 14, 2024 at 16:42
Please add examples of your failing edge cases; and about sorting, if you keep the default sort (lexicographic), a "base url" will always be followed immediately by its "sub urls", which will make your task easier and faster. — Swifty
– Swifty, Commented Nov 14, 2024 at 16:51
Side-note: In the specific case here (where parent directories eliminate their sub-directories), you almost certainly don't want "contains" checks, you want "starts with" checks. So your test should really be if inner_string.startswith(outer_string):. — ShadowRanger
– ShadowRanger, Commented Nov 14, 2024 at 16:52

MT0 · Accepted Answer · 2024-11-14 23:29:42Z

1

You appear to be comparing file paths and want to compare the start of the strings rather than comparing sub-strings at any location.

Sort the list and then, for each value in the list, do not include it in the output if it starts with any previous value:

def substring_sieve(data):
    output = []
    for value in sorted(data):
        if not any(value.startswith(prev) for prev in output):
            output.append(value)
    return output

data = ["/var/html/site1", "/var/html/site1/nested_web_root1", "/var/html/site1/nested_web_root2"]

print(substring_sieve(data))

Which outputs:

['/var/html/site1']

You can make it more efficient by just checking that the latest value does not start with the most recently output value:

def substring_sieve(data):
    if not data:
        return data
    prev, *remaining = sorted(data)
    output = [prev]
    for value in remaining:
        if not value.startswith(prev):
            output.append(value)
            prev = value
    return output

Which outputs the same.

@nocomment gave an alternate version using list comprehension:

def substring_sieve(data):
    prev = ()
    return [
        prev := value
        for value in sorted(data)
        if not value.startswith(prev)
    ]

If you do want to match sub-strings rather than the start of strings then use:

def substring_sieve(data):
    output = []
    for value in sorted(data, key=len):
        if not any(prev in value for prev in output):
            output.append(value)
    return output

edited Nov 14, 2024 at 23:29

answered Nov 14, 2024 at 22:11

MT0

173k12 gold badges70 silver badges136 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

no comment Over a year ago

Two variations that use that startswith accepts tuples.

no comment Over a year ago

One more because I like list comprehensions. (Thinking about itertools now, which I love.)

the_slug Over a year ago

Thanks guys. These were perfect. I learned a lot too!

Eternal Dreamer · Accepted Answer · 2024-11-14 18:15:39Z

0


def remove_nested(my_list: list) -> list:
    i = 0
    vlen = len(my_list)
    while i != vlen:    # Until we reach end of list
        pos_search = 0
        for _str in my_list:
            # If the current item isn't at the same position
            # And if the string _str contains the item
            if i != pos_search and my_list[i] in _str:
                del my_list[i]  # Delete
                vlen = vlen - 1 # Update list length
                i = i - 1       # Decrement i that will be incremented after (so i remains the same)
                break
            pos_search = pos_search + 1
        i = i + 1
    return my_list

my_list = ["/var/html/site1", "/var/html/site1/nested_web_root1", "/var/html/site1/nested_web_root2"]

print(remove_nested(my_list))

Output:

['/var/html/site1/nested_web_root1', '/var/html/site1/nested_web_root2']

As I commented my code I'm not sure I have to explain more

answered Nov 14, 2024 at 18:15

Eternal Dreamer

5631 gold badge6 silver badges19 bronze badges

1 Comment

the_slug Over a year ago

Hey, you've got it backwards. I just need the parent, not the children. So ideally it should return just ["/var/html/site1"]

Michael Cao · Accepted Answer · 2024-11-14 22:28:23Z

Create url trees and add branches whenever a url has a shared root. We note any nodes on a tree that is an actual listed url with an end flag.

Then we do a depth-first-search traversal of each tree but we terminate searching whenever hitting an end node.

from collections import deque

class UrlNode:
    def __init__(self, name, end = False, parent = None):
        self.name = name
        self.children = {}
        self.end = end
        self.parent = parent
    
    def full_name(self):
        if not self.parent:
            return f"/{self.name}"
        else:
            return f"{self.parent.full_name()}/{self.name}"


def create_trees(urls):
    roots = {}

    for url in urls:
        root, *descs = url.split("/")[1:]

        roots[root] = roots.get(root, UrlNode(root))

        parent = roots[root]

        for name in descs:
            parent.children[name] = parent.children.get(name, UrlNode(name))
            parent.children[name].parent = parent
            parent = parent.children[name]
        
        parent.end = True

    return roots

def dfs(root):
    nodes = deque([root])

    urls = []

    while nodes:
        node = nodes.pop()

        if node.end:
            urls.append(node)
        else:
            for child in node.children.values():
                nodes.append(child)

    return urls


urls = [
    "/var/html",
    "/var/html/site1/nested_web_root1",
    "/var/html/site1/nested_web_root2",
    "/var2",
    "/var3/html/site1",
    "/var3/html/site2",
]

roots = create_trees(urls)

[url.full_name() for root in roots.values() for url in dfs(root)]

Output:

['/var/html', '/var2', '/var3/html/site2', '/var3/html/site1']

the_slug · Accepted Answer · 2024-11-15 11:20:39Z

-1

Thanks to everyone who made suggestions. Especially MTO and no-comment.

In the end the following code worked:

def substring_sieve(data):
    prev, *remaining = sorted(data)
    output = [prev]
    for value in remaining:
        value = value.rstrip('/') + '/'
        if not value.startswith(prev):
            output.append(value)
            prev = value
    return output

This handles edge cases where there is a duplicate entry in the input list, and where one of the paths is a substring of another. For example: ['/home/greatlon/test_site2', '/home/greatlon/test_site']

Thanks again all!

answered Nov 15, 2024 at 11:20

the_slug

1276 bronze badges

Collectives™ on Stack Overflow

Python function to remove strings from a list if a substring already exists

4 Answers 4

3 Comments

1 Comment

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

3 Comments

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related