
I want to delete all the URLs in the sentence.

Here is my code:

import re
import ijson
f = open("/content/drive/My Drive/PTT 爬蟲/content/MakeUp/PTT_MakeUp_content_0_1000.json")
objects = ijson.items(f, 'item')

for obj in list(objects):
    article = obj['content']
    ret = re.findall("http[s*]:[a-zA-Z0-9_.+-/#~]+ ", article) # Question here
    for r in ret:
        article = article.replace(r, "")
    print(article)

But a URL starting with "http" is still left in the sentence.

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"

How can I fix it?


3 Answers

5

One simple fix would be to replace every match of the pattern https?://\S+ with an empty string:

import re

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
# Replace each http:// or https:// URL with an empty string.
output = re.sub(r'https?://\S+', '', article_example)
print(output)

This prints:

眼影盤長這樣  說真的 很不好拍

My pattern assumes that any non-whitespace characters following http:// or https:// are part of the URL.
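
Because only the URL itself is removed, the spaces on either side of it stay behind, which is why the output above contains two consecutive spaces. Here is a minimal sketch of one way to tidy that up, assuming that collapsing runs of spaces is acceptable for your text. The helper name remove_urls is hypothetical, not part of the original code.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def remove_urls(text):
    """Remove http/https URLs, then collapse leftover runs of spaces."""
    without_urls = URL_PATTERN.sub("", text)
    # Collapse two or more consecutive spaces into one and trim the ends.
    return re.sub(r" {2,}", " ", without_urls).strip()

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
print(remove_urls(article_example))
```

If your articles use runs of spaces on purpose (for alignment, say), skip the second substitution and keep only the URL removal.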


2

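The URL starts with http, but in your pattern http is followed by [s*]. That is a character class, so it matches exactly one character, either s or *. In the example the character right after http is :, so the pattern cannot match that URL.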

I think you are looking for

https?:[a-zA-Z0-9_.+-/#~]+


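import re
# s? makes the "s" optional. The trailing space is part of the match,
# so the space after the URL is removed together with it.
# Note: inside the class, +-/ is read as a range from "+" to "/",
# which also covers ",", "-" and ".".
regex = r"https?:[a-zA-Z0-9_.+-/#~]+ "
article = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
result = re.sub(regex, "", article)
print(result)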

Result

眼影盤長這樣 說真的 很不好拍

A shorter and somewhat broader alternative is to match one or more non-whitespace characters with \S+, followed by zero or more spaces to consume the trailing space, as your original pattern did.

\bhttps?:\S+ *
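
As a quick check of the shorter pattern, the sketch below applies it to the example sentence and to a made-up sentence whose URL sits at the very end, where there is no trailing space to match. The second sentence and the variable names are illustrative assumptions.

```python
import re

pattern = re.compile(r"\bhttps?:\S+ *")

mid_sentence = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
at_end = "photo here https://i.imgur.com/uxvRo3h.jpg"  # hypothetical input

# The optional " *" lets the pattern match with or without a trailing space.
print(pattern.sub("", mid_sentence))
print(repr(pattern.sub("", at_end)))
```

One caveat: \b needs a non-word character (or the start of the string) before http. In Python 3, Chinese characters count as word characters, so a URL glued directly onto Chinese text with no space in between would not be matched by this pattern.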


1

Change the [s*] to s?. The former is a character class containing two characters, s and *, and it must match exactly one of them. The latter makes the s optional. Websites like regex101.com let you experiment with regular expressions in the Python dialect, and they explain how each part of the regex is interpreted.
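
To see the difference without leaving Python, here is a small sketch comparing the two forms against a few prefixes. The sample strings are made up for illustration.

```python
import re

samples = ["http:", "https:", "http*:"]

char_class = re.compile(r"http[s*]:")  # exactly one character: "s" or "*"
optional_s = re.compile(r"https?:")    # "s" may appear zero or one time

for text in samples:
    # Print whether each form matches the whole sample string.
    print(text, bool(char_class.fullmatch(text)), bool(optional_s.fullmatch(text)))
```

Only the optional form accepts plain http:, which is the case the original pattern missed.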
