Given an initial list of URLs crawled from a site:

https://somesite.com/
https://somesite.com/advertise
https://somesite.com/articles
https://somesite.com/articles/read
https://somesite.com/articles/read/1154
https://somesite.com/articles/read/1155
https://somesite.com/articles/read/1156
https://somesite.com/articles/read/1157
https://somesite.com/articles/read/1158
https://somesite.com/blogs

I am trying to turn the list into a tab-organized tree hierarchy:

https://somesite.com
    /advertise
    /articles
        /read
            /1154
            /1155
            /1156
            /1157
            /1158
    /blogs

I've tried using lists, tuples, and dictionaries. So far I have figured out two flawed ways to output the content.

Method 1 will miss elements that share a name and depth with an element under a different parent:

Input:
https://somesite.com
https://somesite.com/missions
https://somesite.com/missions/playit
https://somesite.com/missions/playit/extbasic
https://somesite.com/missions/playit/extbasic/0
https://somesite.com/missions/playit/stego
https://somesite.com/missions/playit/stego/0
Output:
https://somesite.com/
    /missions
        /playit
            /extbasic
                /0
            /stego

----------------^ Missing expected output "/0"

Method 2 will not miss any elements, but it will print redundant content:

Input:
https://somesite.com
https://somesite.com/missions
https://somesite.com/missions/playit
https://somesite.com/missions/playit/extbasic
https://somesite.com/missions/playit/extbasic/0
https://somesite.com/missions/playit/stego
https://somesite.com/missions/playit/stego/0
Output:
https://somesite.com/
    /missions
        /playit
            /extbasic
                /0
    /missions       <- Redundant content
        /playit     <- Redundant content
            /stego      
                /0

I'm not sure how to properly do this, and my googling has only turned up references to urllib that don't seem to be what I need. Perhaps there is a much better approach, but I have been unable to find it.
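
For what it's worth, urllib.parse will at least split a URL into its parts more robustly than a hand-rolled regex, even though it doesn't build the tree itself (a minimal sketch):

from urllib.parse import urlparse

parts = urlparse("https://somesite.com/articles/read/1154")
print(parts.scheme)                       # https
print(parts.netloc)                       # somesite.com
print(parts.path)                         # /articles/read/1154
print(parts.path.strip("/").split("/"))   # ['articles', 'read', '1154']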

My code for getting the content into a usable list:

#!/usr/bin/python3

import re

# Read the original list of URLs from file
with open("sitelist.raw", "r") as f:
    raw_site_list = f.readlines()

# Extract the prefix and domain from the first line
first_line = raw_site_list[0]
prefix, domain = re.match(r"(https?://)([^/\s]*)", first_line).group(1, 2)

# Remove the prefix and domain and trailing newlines, and drop any lines
# that are only a slash or end with one. Slicing is used here because
# str.strip(chars) removes a *set* of characters from both ends, not a
# literal prefix, and can eat too much.
clean_site_list = []
for line in raw_site_list:
    clean_line = line[len(prefix) + len(domain):].strip()
    if clean_line and clean_line != "/" and not clean_line.endswith("/"):
        clean_site_list.append(clean_line)

# Split the resulting relative paths into their component parts and filter out empty strings
split_site_list = []
for site in clean_site_list:
    split_site_list.append(list(filter(None, site.split("/"))))

This gives a list to manipulate, but I've run out of ideas on how to output it without losing elements or outputting redundant elements.
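
For the sample list at the top, split_site_list ends up as:

[['advertise'],
 ['articles'],
 ['articles', 'read'],
 ['articles', 'read', '1154'],
 ['articles', 'read', '1155'],
 ['articles', 'read', '1156'],
 ['articles', 'read', '1157'],
 ['articles', 'read', '1158'],
 ['blogs']]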

Thanks


Edit: This is the final working code I put together based on the answer chosen below:

# Read list of URLs from file
with open("sitelist.raw", "r") as f:
    urls = f.readlines()

# Remove trailing newlines and any trailing slashes
# (rstrip is safe even if the last line has no newline)
urls = [url.rstrip("\n").rstrip("/") for url in urls]

# Remove duplicate lines, preserving order
unique_urls = []
for url in urls:
    if url not in unique_urls:
        unique_urls.append(url)

# Do the actual work (modified to use unique_urls, tabs instead of 4x spaces, and to write to file)
base = unique_urls[0]
tabdepth = 0
tlen = len(base.split('/'))

final_urls = []
for url in unique_urls[1:]:
    t = url.split('/')
    lt = len(t)
    if lt != tlen:
        tabdepth += 1 if lt > tlen else -1
        tlen = lt
    pad = '\t' * tabdepth
    final_urls.append(f'{pad}/{t[-1]}')

with open("sitelist.new", "wt") as f:
    f.write(base + "\n")
    for url in final_urls:
        f.write(url + "\n")
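
Note: the depth tracking above steps one level at a time, so a list that jumps back more than one level at once (for example /blogs coming right after /articles/read/1158 in the list at the top) ends up indented too deep. A minimal variant that computes the depth directly from the component count avoids this (sketch; variable names follow the code above):

# Indent depth is simply the number of path components beyond the base URL
base = unique_urls[0]
base_len = len(base.split('/'))

final_urls = []
for url in unique_urls[1:]:
    t = url.split('/')
    pad = '\t' * (len(t) - base_len)
    final_urls.append(f'{pad}/{t[-1]}')
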
  • Not an exact duplicate but close: stackoverflow.com/questions/8484943 Commented Dec 17, 2021 at 4:49
  • Show how you coded the actual methods... Commented Dec 17, 2021 at 7:02
  • Consider using a trie; there are lots of sources on how to build one, plus Python libraries (a minimal sketch follows below). Commented Jan 23 at 14:46
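
A minimal sketch of that trie idea, assuming the split_site_list built by the question's cleanup code (function names here are illustrative); the base URL would still be printed separately, as in the code above:

# Build a trie: each node is a dict mapping a path component to its children
def build_tree(paths):
    tree = {}
    for path in paths:
        node = tree
        for part in path:
            node = node.setdefault(part, {})
    return tree

# Print the trie depth-first, indenting one tab per level
def print_tree(node, depth=0):
    for name, child in node.items():
        print("\t" * depth + "/" + name)
        print_tree(child, depth + 1)

print_tree(build_tree(split_site_list))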

2 Answers


This works with your sample data:

urls = ['https://somesite.com',
        'https://somesite.com/missions',
        'https://somesite.com/missions/playit',
        'https://somesite.com/missions/playit/extbasic',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/stego',
        'https://somesite.com/missions/playit/stego/0']


base = urls[0]
print(base)
tabdepth = 0
tlen = len(base.split('/'))  # component count of the previously seen URL

for url in urls[1:]:
    t = url.split('/')
    lt = len(t)
    # Step the indent in or out by one level when the component count changes
    if lt != tlen:
        tabdepth += 1 if lt > tlen else -1
        tlen = lt
    pad = '    ' * tabdepth
    print(f'{pad}/{t[-1]}')
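
For reference, running this against the sample list prints:

https://somesite.com
    /missions
        /playit
            /extbasic
                /0
            /stego
                /0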



This code will help you with your task. I agree it is a bit long and contains some redundant checks, but it builds a dictionary holding the hierarchy of the URLs, and you can use that dictionary however you like: print it, store it, and so on.

Moreover, this code will also parse URLs from different domains and build a separate tree for each of them (see the code and output).

EDIT: This also takes care of redundant URLs.

Code:

from json import dumps  # dumps can serialize the finished tree (see the note after the output)


def process_urls(urls: list):
    tree = {}

    for url in urls:
        url_components = url.split("/")
        # The first three components are the protocol, an empty entry
        # (from the "//"), and the base domain
        base_domain = url_components[:3]
        base_domain = base_domain[0] + "//" + "".join(base_domain[1:])
        # Add the base domain to the tree if it is not already there
        if base_domain not in tree:
            tree[base_domain] = {}

        structure = url_components[3:]

        for i in range(len(structure)):
            if i == 0:
                # Add the first path component directly under the domain
                if "/" + structure[i] not in tree[base_domain]:
                    tree[base_domain]["/" + structure[i]] = {}
            else:
                # Walk down to this component's parent node
                base = tree[base_domain]["/" + structure[0]]
                for j in range(1, i):
                    base = base["/" + structure[j]]

                if "/" + structure[i] not in base:
                    base["/" + structure[i]] = {}

    return tree


def print_tree(tree: dict, depth=0):
    for key in tree.keys():
        print("\t" * depth + key)

        # Recurse into non-empty child dictionaries one tab level deeper;
        # empty dictionaries are leaves
        if isinstance(tree[key], dict) and tree[key]:
            print_tree(tree[key], depth + 1)


if __name__ == "__main__":
    urls = [
        'https://somesite.com',
        'https://somesite.com/missions',
        'https://somesite.com/missions/playit',
        'https://somesite.com/missions/playit/extbasic',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/extbasic/0',
        'https://somesite.com/missions/playit/stego',
        'https://somesite.com/missions/playit/stego/0',
        'https://somesite2.com/missions/playit',
        'https://somesite2.com/missions/playit/extbasic',
        'https://somesite2.com/missions/playit/extbasic/0',
        'https://somesite2.com/missions/playit/stego',
        'https://somesite2.com/missions/playit/stego/0',
    ]
    tree = process_urls(urls)
    print_tree(tree)

Output:

https://somesite.com
    /missions
        /playit
            /extbasic
                /0
            /stego
                /0
https://somesite2.com
    /missions
        /playit
            /extbasic
                /0
            /stego
                /0
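
Since the tree is plain nested dicts, it can also be written out as JSON instead of printed, which is presumably what the dumps import at the top was for (a minimal sketch; the filename is just an example):

# Serialize the finished tree to a JSON file
with open("sitetree.json", "w") as f:
    f.write(dumps(tree, indent=4))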

1 Comment

Thank you for this really beautiful solution. It's a bit too complex for my current project, but I will save it as an example in case the requirements grow in the future, as I agree that dicts allow for more versatile functionality should I need it.
