I have a JSON file that contains metadata for 900 articles. I want to delete all the data except for the lines that contain URLs and resave the file as .txt. I wrote this code, but I couldn't work out the saving step:

import re

with open("path\url_example.json") as file:
    for line in file:
         urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
         print(urls)

A part of the results:

['http://www.google.com.']
['https://www.tutorialspoint.com']

Another issue is that the results are wrapped in [' '] and may end with a trailing '.', which I don't need. My expected result is:

 http://www.google.com
 https://www.tutorialspoint.com

3 Comments

A JSON file named url_example.txt? How so? Commented Dec 15, 2018 at 16:32
I'd have thought "path\url_example.txt" would raise a SyntaxError as well... Commented Dec 15, 2018 at 16:36
Could you show an example of your input file? Is it a JSON object per line, for instance? If so, does it have attributes called "url" or "link" or "href" or whatever, so that you can parse the line as JSON using json.loads and then just retrieve the appropriate parts instead of regexing stuff out? Commented Dec 15, 2018 at 16:38

2 Answers

0

If you know which key your URLs are stored under in your JSON, an easier approach may be to deserialize the JSON using the json module from the Python standard library and work with a dict instead of using regex.
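
As a rough sketch of that approach, assuming the file holds a single JSON array of objects and each URL sits under a key named "url" (both are assumptions, since the question doesn't show the input; the function name is also just illustrative):

```python
import json

def extract_urls(json_text, key="url"):
    """Return the values stored under `key` in a JSON array of objects."""
    records = json.loads(json_text)
    # Skip records that have no such key instead of raising KeyError.
    return [record[key] for record in records if key in record]

# Hypothetical sample input; the real file layout may differ.
sample = (
    '[{"title": "a", "url": "http://www.google.com"},'
    ' {"title": "b", "url": "https://www.tutorialspoint.com"},'
    ' {"title": "c"}]'
)

for url in extract_urls(sample):
    print(url)
```

If the file instead has one JSON object per line, the same idea applies with json.loads called on each line.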

However, if you want to work with regex, remember that urls is a list of regex matches. If you know there is never more than one match per line, just print the first entry and rstrip the trailing ".", if it's there.

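import re

# Raw strings keep "\u" in the path and "\w" in the pattern from being read as escapes.
with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        if urls:  # skip lines that contain no URL
            print(urls[0].rstrip('.'))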

If you expect to see multiple matches per line:

import re

# Raw strings keep "\u" in the path and "\w" in the pattern from being read as escapes.
with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        for url in urls:
            print(url.rstrip('.'))
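
To cover the saving step the question asks about, one option is to write each cleaned URL to a text file instead of printing it. This is a minimal sketch that reuses the regex above; the function names and the output file name in the usage note are illustrative assumptions, not something from the question:

```python
import re

URL_PATTERN = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')

def extract_urls(lines):
    """Yield every URL found in `lines`, with any trailing dots removed."""
    for line in lines:
        for url in URL_PATTERN.findall(line):
            yield url.rstrip('.')

def save_urls(src_path, dst_path):
    """Read src_path line by line and write one URL per line to dst_path."""
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for url in extract_urls(src):
            dst.write(url + '\n')
```

It could then be called as save_urls(r"path\url_example.json", r"path\urls.txt"), where urls.txt is a hypothetical output name.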

2 Comments

You can just print(url.rstrip('.')) - seems a bit of waste using the if/else here to check it ends with . to remove it... just print it stripped, and if it didn't have a dot, it still won't, and if it did, it won't now... so no need to check it first.
@JonClements thanks for picking that up, having a dim moment.
0

Without further information on the file you have (txt or JSON?) and on the kind of input lines you are looping through, here is a simple attempt without re.findall().

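with open(r"path\url_example.txt") as handle:
    for line in handle:
        # Skip lines that contain no URL at all.
        if 'http' not in line:
            continue
        spos = line.find('http')
        # Take everything up to the next whitespace, so a URL at the
        # end of the line is kept whole.
        url = line[spos:].split()[0]
        print(url)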

4 Comments

I guess your file is a txt and not a json otherwise your code wouldn't work. - well, it would if it was one json object per line, or formatted such that it's pretty printed and the urls happen to be accessible on a single line... :)
Also... that re.search could just be if 'http' not in line... and try running your code with line = 'http://example.com'... you'll get the wrong output...
Modified the intro text; it should be more precise now.
Given an input of http://example.com where there isn't a space, you end up with epos == -1, which means you slice off the last character, giving you an output of 'http://example.co'...
