1

I'm wondering if they're is a "pythonic" way of removing elements from a list, if that element contains a substring of another element.

For example say we have a list like this:

["/var/html/site1", "/var/html/site1/nested_web_root1", "/var/html/site1/nested_web_root2"]

/var/html/site1 is contained within both: /var/html/site1/nested_web_root1 & /var/html/site1/nested_web_root2 As such I'd like them removed from the list.

I've already written a function to do this, and it does mostly work, but the code is just horrible and overly-complex. There are also edge-cases where it just fails.

Here is what I've written so far:

def substringSieve(string_list):
    string_list.sort(key=lambda s: len(s), reverse=False)
    out = []
    bad_list = []
    for outer_string in string_list:
        for inner_string in string_list:
            if outer_string != inner_string:
                if outer_string in inner_string:
                    out.append(outer_string)
                    bad_list.append(inner_string)
        if outer_string not in out and outer_string not in bad_list:
            out.append(outer_string)
    return out

Can anyone offer some insight?

Thanks.

9
  • 2
    Please post your code, and give examples of the edge cases that make it fail. Commented Nov 14, 2024 at 16:42
  • 1
    Please edit your post to add your own attempt. Even tho they might not work, they help us answering the question in a way you learn something from it, instead of 'dumping the code'. It also shows you've done your research Commented Nov 14, 2024 at 16:42
  • Done @Swifty :) Commented Nov 14, 2024 at 16:49
  • 1
    Please add examples of your failing edge cases; and about sorting, if you keep the default sort (lexicographic), a "base url" will always be followed immediately by its "sub urls", which will make your task easier and faster. Commented Nov 14, 2024 at 16:51
  • 2
    Side-note: In the specific case here (where parent directories eliminate their sub-directories), you almost certainly don't want "contains" checks, you want "starts with" checks. So your test should really be if inner_string.startswith(outer_string):. Commented Nov 14, 2024 at 16:52

4 Answers 4

1

You appear to be comparing file paths and want to compare the start of the strings rather than comparing sub-strings at any location.

Sort the list and then, for each value in the list, do not include it in the output if it starts with any previous value:

def substring_sieve(data):
    output = []
    for value in sorted(data):
        if not any(value.startswith(prev) for prev in output):
            output.append(value)
    return output

data = ["/var/html/site1", "/var/html/site1/nested_web_root1", "/var/html/site1/nested_web_root2"]

print(substring_sieve(data))

Which outputs:

['/var/html/site1']

You can make it more efficient by just checking that the latest value does not start with the most recently output value:

def substring_sieve(data):
    if not data:
        return data
    prev, *remaining = sorted(data)
    output = [prev]
    for value in remaining:
        if not value.startswith(prev):
            output.append(value)
            prev = value
    return output

Which outputs the same.

@nocomment gave an alternate version using list comprehension:

def substring_sieve(data):
    prev = ()
    return [
        prev := value
        for value in sorted(data)
        if not value.startswith(prev)
    ]

If you do want to match sub-strings rather than the start of strings then use:

def substring_sieve(data):
    output = []
    for value in sorted(data, key=len):
        if not any(prev in value for prev in output):
            output.append(value)
    return output
Sign up to request clarification or add additional context in comments.

3 Comments

Two variations that use that startswith accepts tuples.
One more because I like list comprehensions. (Thinking about itertools now, which I love.)
Thanks guys. These were perfect. I learned a lot too!
0

def remove_nested(my_list: list) -> list:
    i = 0
    vlen = len(my_list)
    while i != vlen:    # Until we reach end of list
        pos_search = 0
        for _str in my_list:
            # If the current item isn't at the same position
            # And if the string _str contains the item
            if i != pos_search and my_list[i] in _str:
                del my_list[i]  # Delete
                vlen = vlen - 1 # Update list length
                i = i - 1       # Decrement i that will be incremented after (so i remains the same)
                break
            pos_search = pos_search + 1
        i = i + 1
    return my_list

my_list = ["/var/html/site1", "/var/html/site1/nested_web_root1", "/var/html/site1/nested_web_root2"]

print(remove_nested(my_list))

Output:

['/var/html/site1/nested_web_root1', '/var/html/site1/nested_web_root2']

As I commented my code I'm not sure I have to explain more

1 Comment

Hey, you've got it backwards. I just need the parent, not the children. So ideally it should return just ["/var/html/site1"]
0

Create url trees and add branches whenever a url has a shared root. We note any nodes on a tree that is an actual listed url with an end flag.

Then we do a depth-first-search traversal of each tree but we terminate searching whenever hitting an end node.

from collections import deque

class UrlNode:
    def __init__(self, name, end = False, parent = None):
        self.name = name
        self.children = {}
        self.end = end
        self.parent = parent
    
    def full_name(self):
        if not self.parent:
            return f"/{self.name}"
        else:
            return f"{self.parent.full_name()}/{self.name}"


def create_trees(urls):
    roots = {}

    for url in urls:
        root, *descs = url.split("/")[1:]

        roots[root] = roots.get(root, UrlNode(root))

        parent = roots[root]

        for name in descs:
            parent.children[name] = parent.children.get(name, UrlNode(name))
            parent.children[name].parent = parent
            parent = parent.children[name]
        
        parent.end = True

    return roots

def dfs(root):
    nodes = deque([root])

    urls = []

    while nodes:
        node = nodes.pop()

        if node.end:
            urls.append(node)
        else:
            for child in node.children.values():
                nodes.append(child)

    return urls


urls = [
    "/var/html",
    "/var/html/site1/nested_web_root1",
    "/var/html/site1/nested_web_root2",
    "/var2",
    "/var3/html/site1",
    "/var3/html/site2",
]

roots = create_trees(urls)

[url.full_name() for root in roots.values() for url in dfs(root)]

Output:

['/var/html', '/var2', '/var3/html/site2', '/var3/html/site1']

Comments

-1

Thanks to everyone who made suggestions. Especially MTO and no-comment.

In the end the following code worked:

def substring_sieve(data):
    prev, *remaining = sorted(data)
    output = [prev]
    for value in remaining:
        value = value.rstrip('/') + '/'
        if not value.startswith(prev):
            output.append(value)
            prev = value
    return output

This handles edge cases where there is a duplicate entry in the input list, and where one of the paths is a substring of another. For example: ['/home/greatlon/test_site2', '/home/greatlon/test_site']

Thanks again all!

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.