I have a large JSON object that contains URLs among its values. The links can appear at any depth and under any key; neither the depth nor the keys are known in advance. For example:

data = {
  "name": "John Doe",
  "a": "https:/example.com",
  "b": {
    "c": "https://example.com/path",
    "d": {
      "e": "https://example.com/abc/?q=u",
    }
  }
}

I want to extract all the links into a list like this:

links = ["https://example.com", "https://example.com/path", "https://example.com/abc/?q=u"]

How can I extract all the links from the object using Python?

  • How do you identify URLs? Is it okay to assume they all start with "http"? Commented Jun 19, 2020 at 5:23
  • Yes, they all start with http or https. Any string without one of these protocols is not treated as a valid URL. Commented Jun 19, 2020 at 5:26

1 Answer

Here's a recursive solution:

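def extract_urls(d):
    """Recursively collect every string value that starts with "http"."""
    urls = []
    for k, v in d.items():
        # Case-insensitive prefix check covers both http and https.
        if isinstance(v, str) and v.lower().startswith("http"):
            urls.append(v)
        elif isinstance(v, dict):
            # Recurse into nested dictionaries.
            urls.extend(extract_urls(v))
    return urls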

extract_urls(data)

Output:

['https:/example.com',
 'https://example.com/path',
 'https://example.com/abc/?q=u']
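
Parsed JSON can also contain arrays, which the function above does not descend into. The following is a minimal sketch of one way to cover that case, not part of the original answer: the name extract_urls_nested, the prefixes parameter, and the sample data are all hypothetical. It also assumes, per the comments on the question, that only strings starting with "http://" or "https://" count as URLs.

```python
def extract_urls_nested(obj, prefixes=("http://", "https://")):
    """Collect URL-like strings from nested dicts and lists.

    Hypothetical helper: a string counts as a URL only if it starts
    with one of `prefixes` (compared case-insensitively).
    """
    urls = []
    if isinstance(obj, str):
        # str.startswith accepts a tuple of candidate prefixes.
        if obj.lower().startswith(prefixes):
            urls.append(obj)
    elif isinstance(obj, dict):
        # Keys are ignored; only values are searched.
        for value in obj.values():
            urls.extend(extract_urls_nested(value, prefixes))
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            urls.extend(extract_urls_nested(item, prefixes))
    # Numbers, booleans, and None fall through and contribute nothing.
    return urls


# Illustrative data (an assumption, not the asker's actual object).
sample = {
    "name": "John Doe",
    "a": "https://example.com",
    "b": {"c": ["https://example.com/path", "not a url"]},
    "d": [{"e": "http://example.com/abc/?q=u"}, 42, None],
}

print(extract_urls_nested(sample))
# ['https://example.com', 'https://example.com/path', 'http://example.com/abc/?q=u']
```

Because the prefix check here requires the full "http://" or "https://", a value such as "https:/example.com" (single slash) would be skipped, whereas the simpler startswith("http") test accepts it.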
1 Comment

Note on a typo in this answer: the function was originally named etract_urls, and a later edit renamed it to extract_urls without updating the recursive call. The recursive call inside the function must use the same name, extract_urls; otherwise the recursion fails with a NameError as soon as a nested dictionary is reached.