delete everything except URL with python [duplicate]

Question

I have a JSON file that contains metadata for 900 articles. I want to delete all the data except for the lines that contain URLs and save the result as a .txt file. I wrote this code, but I couldn't work out the saving step:

import re

with open("path\url_example.json") as file:
    for line in file:
         urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
         print(urls)

A part of the results:

['http://www.google.com.']
['https://www.tutorialspoint.com']

Another issue is that the results are wrapped in [' '] and may end with a trailing '.', which I don't need. My expected result is:

 http://www.google.com
 https://www.tutorialspoint.com

I'd have thought "path\url_example.txt" would raise a SyntaxError as well...
– Jon Clements, Commented Dec 15, 2018 at 16:36
Could you show an example of your input file? Is it a JSON object per line, for instance? If so, does it have attributes called "url" or "link" or "href" or whatever, so that you can parse the line as JSON using json.loads and then just retrieve the appropriate parts instead of regexing stuff out?
– Jon Clements, Commented Dec 15, 2018 at 16:38

Benjamin Rowell · Accepted Answer · 2018-12-15 16:48:55Z

0

If you know which key your URLs will be found under in your JSON, you might find an easier approach is to deserialize the JSON using the JSON module from the Python standard library and work with a dict instead of using regex.
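
As a minimal sketch of that approach, assuming the file is a single JSON array of objects and that each object stores its link under a key named "url" (both are assumptions, since the question does not show the file's structure; extract_urls and the sample data are hypothetical):

```python
import json

def extract_urls(json_text, key="url"):
    """Return the values stored under `key` in a JSON array of objects.

    The array-of-objects layout and the "url" key name are assumptions.
    """
    records = json.loads(json_text)
    return [record[key] for record in records
            if isinstance(record, dict) and key in record]

# Hypothetical sample data standing in for the real file's contents.
sample = ('[{"title": "a", "url": "http://www.google.com"}, '
          '{"title": "b", "url": "https://www.tutorialspoint.com"}]')
for url in extract_urls(sample):
    print(url)
```

For a real file, the text would come from reading the file (or from json.load on the open file object) rather than from a string literal.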

However, if you want to work with regex, remember that urls is a list of regex matches. If you know there is definitely only going to be one match per line, just print the first entry and rstrip the trailing ".", if it's there.

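import re

# Raw strings stop Python treating "\u" in the path and "\w" in the pattern as escapes.
with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        if urls:  # skip lines with no match to avoid an IndexError
            print(urls[0].rstrip('.'))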

If you expect to see multiple matches per line:

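import re

# Raw strings stop Python treating "\u" in the path and "\w" in the pattern as escapes.
with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        # An empty list simply means the loop body never runs.
        for url in urls:
            print(url.rstrip('.'))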
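
The snippets in this answer print the URLs, while the question also asks for them to be saved to a .txt file. One possible sketch of that step follows. It is not part of the original answer: find_urls, save_urls and URL_PATTERN are hypothetical names, and the paths in the example call are placeholders.

```python
import re

# The pattern from the question, written as a raw string.
URL_PATTERN = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')

def find_urls(text):
    """Return every match of the pattern in text, minus any trailing dots."""
    return [match.rstrip('.') for match in URL_PATTERN.findall(text)]

def save_urls(source_path, target_path):
    """Write each URL found in source_path to target_path, one per line."""
    with open(source_path, encoding="utf-8") as source, \
         open(target_path, "w", encoding="utf-8") as target:
        for line in source:
            for url in find_urls(line):
                target.write(url + "\n")

# Example call (both paths are placeholders):
# save_urls(r"path\url_example.json", r"path\url_example.txt")
```

Note that this pattern stops at the first character that is not a letter, digit, underscore, hyphen, dot or percent-escape, so it captures the scheme and host but not a path such as /page.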

edited Dec 15, 2018 at 16:48

answered Dec 15, 2018 at 16:45

Benjamin Rowell


2 Comments

Jon Clements Over a year ago

You can just print(url.rstrip('.')) - seems a bit of waste using the if/else here to check it ends with . to remove it... just print it stripped, and if it didn't have a dot, it still won't, and if it did, it won't now... so no need to check it first.

Benjamin Rowell Over a year ago

@JonClements thanks for picking that up, having a dim moment.

Densetsu_No · Accepted Answer · 2018-12-15 16:50:53Z

0

Without further information on the file you have (txt or json?) and on the kind of input lines you are looping through, here is a simple attempt without re.findall().

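# The raw string stops Python treating "\u" in the path as an escape.
with open(r"path\url_example.txt") as handle:
    for line in handle:
        # Skip lines with no URL; a plain substring test needs no regex.
        if 'http' not in line:
            continue
        spos = line.find('http')
        epos = line.find(' ', spos)
        # find() returns -1 when no space follows the URL; slice to the end then.
        url = line[spos:] if epos == -1 else line[spos:epos]
        print(url.strip())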

edited Dec 15, 2018 at 16:50

answered Dec 15, 2018 at 16:41

Densetsu_No


4 Comments

Jon Clements Over a year ago

I guess your file is a txt and not a json otherwise your code wouldn't work. - well, it would if it was one json object per line, or formatted such that it's pretty printed and the urls happen to be accessible on a single line... :)

Jon Clements Over a year ago

Also... that re.search could be if 'http' not in line... also... try running your code with line = 'http://example.com'... you'll get the wrong output...

Densetsu_No Over a year ago

Modified the intro text; it should be more precise now.

Jon Clements Over a year ago

Given an input of http://example.com where there isn't a space, you end up with epos == -1, which means you slice off the last character, giving you an output of: 'http://example.co'...
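
The slicing behaviour described in this comment can be reproduced with a short sketch. url_up_to_space is a hypothetical helper name, not code from the answer; it shows one way to handle the case where no space follows the URL.

```python
def url_up_to_space(line):
    """Return the text from the first 'http' up to the next space,
    or up to the end of the line when no space follows."""
    spos = line.find('http')
    if spos == -1:
        return None  # no URL on this line
    epos = line.find(' ', spos)
    # str.find returns -1 when there is no space, and line[spos:-1]
    # would silently drop the last character.
    return line[spos:] if epos == -1 else line[spos:epos]

line = 'http://example.com'
print(line[0:line.find(' ')])   # slice with epos == -1: http://example.co
print(url_up_to_space(line))    # http://example.com
```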
