Python: Find and replace under certain conditions with regex

Question

Basically, I want to write a script that sanitizes URLs by replacing each dot with the string "(dot)". For example, http://www.google.com should become http://www(dot)google(dot)com. This is easy to achieve with .replace when the text file contains only URLs or other strings, but my file also contains IP addresses, and I don't want the dots in the IP addresses changed to "(dot)".

I tried to do this using regex, but my output is " http://ww(dot)oogl(dot)om 192.60.10.10 33.44.55.66 "

This is my code

from __future__ import print_function

import sys
import re

nargs = len(sys.argv)
if nargs < 2:
    sys.exit('You did not specify a file')
else:
    inputFile = sys.argv[1]
    fp = open(inputFile)
    content = fp.read()

replace = '(dot)'
regex = '[a-z](\.)[a-z]'
print(re.sub(regex, replace, content, re.M| re.I| re.DOTALL))

I guess I need a condition that skips the replacement when the pattern is number.number.

Padraic Cunningham · Accepted Answer · 2015-12-04 21:42:44Z

3

You can use lookahead and lookbehind assertions:

import re

s = "http://www.google.com 127.0.0.1"

# The lookbehind and lookahead check the neighbouring letters without consuming them
print(re.sub(r"(?<=[a-z])\.(?=[a-z])", "(dot)", s))

http://www(dot)google(dot)com 127.0.0.1

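To handle both letters and digits, this should hopefully do the trick. The first lookahead makes sure at least one letter appears later on the same line:

s = "http://www.googl1.2com 127.0.0.1"

# flags must be passed by keyword; the fourth positional argument of re.sub is count
print(re.sub(r"(?=.*[a-z])(?<=\w)\.(?=\w)", "(dot)", s, flags=re.I))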

http://www(dot)googl1(dot)2com 127.0.0.1
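
A note on the flags, as a minimal sketch: re.sub is declared as re.sub(pattern, repl, string, count=0, flags=0), so a flag passed as the fourth positional argument is read as count. re.I has the integer value 2, which means at most two replacements are made and the flag itself is never applied. The sample string below is made up for illustration.

```python
import re

pattern = r"(?<=\w)\.(?=\w)"
text = "A.com B.com C.com"  # hypothetical sample input

# Same effect as re.sub(pattern, "(dot)", text, re.I): the flag lands in count
limited = re.sub(pattern, "(dot)", text, count=int(re.I))

# Passing the flag by keyword leaves count at 0, so every match is replaced
complete = re.sub(pattern, "(dot)", text, flags=re.I)

print(limited)
print(complete)
```

The first call stops after two replacements and leaves C.com untouched; the second one handles all three hosts.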

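For your file the same call works on the whole content. re.M is not actually required: it only changes how ^ and $ behave, and . already stops at a newline, so the lookahead is checked line by line:

In [1]: cat test.txt
google8.com
google9.com
192.60.10.10
33.44.55.66
google10.com
192.168.1.1
google11.com

In [2]: import re

In [3]: with open("test.txt") as f:
   ...:     print(re.sub(r"(?=.*[a-z])(?<=\w)\.(?=\w)", "(dot)", f.read(), flags=re.I))
   ...: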
google8(dot)com
google9(dot)com
192.60.10.10
33.44.55.66
google10(dot)com
192.168.1.1
google11(dot)com
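
One caveat that may be worth checking against your data: the (?=.*[a-z]) lookahead only asks whether a letter appears somewhere later on the same line. If an IP address comes before a URL on one line, its dots qualify as well. A small sketch with made-up lines:

```python
import re

pattern = re.compile(r"(?=.*[a-z])(?<=\w)\.(?=\w)", re.I)

# IP after the URL: no letter follows the IP's dots, so they are left alone
ip_last = pattern.sub("(dot)", "example.com 10.0.0.1")

# IP before the URL: letters follow on the same line, so the IP's dots match too
ip_first = pattern.sub("(dot)", "10.0.0.1 example.com")

print(ip_last)
print(ip_first)
```

With one entry per line, as in test.txt, this does not come up.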

If the file is large and memory is a concern, you can also process it line by line, either storing all the lines in a list or using each line as you go:

import re

r = re.compile(r"(?=.*[a-z])(?<=\w)\.(?=\w)", re.I)
with open("test.txt") as f:
    # sub() on a compiled pattern takes the replacement first, then the string
    lines = [r.sub("(dot)", line) for line in f]
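
For the "using each line as you go" variant, a generator keeps only one line in memory at a time. This is only a sketch: sanitize_lines is a name made up here, and io.StringIO stands in for the open file.

```python
import io
import re

DOT = re.compile(r"(?=.*[a-z])(?<=\w)\.(?=\w)", re.I)

def sanitize_lines(lines):
    """Yield each line with the URL dots replaced, one line at a time."""
    for line in lines:
        yield DOT.sub("(dot)", line)

# io.StringIO stands in for open("test.txt") in this sketch
fake_file = io.StringIO("google8.com\n192.60.10.10\ngoogle9.com\n")
result = list(sanitize_lines(fake_file))
print("".join(result), end="")
```

In a real script you would probably write each yielded line straight to an output file instead of collecting them in a list.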

edited Dec 4, 2015 at 21:42

answered Dec 4, 2015 at 17:00

Padraic Cunningham


9 Comments

Steven Rumbalski Over a year ago

Good. This will work in nearly all cases. However, numbers are legal in domain names. Also, this counts on domain names being stored in lower case.

Padraic Cunningham Over a year ago

@StevenRumbalski, I was just basing it on the OP's own code and the fact they only used [a-z] in their pattern, I left the flags to the OP

Steven Rumbalski Over a year ago

Fair enough. For what it's worth, I did upvote your answer. But I still think it's a big problem that this can't handle 'io9.com'. The issue with case is minor and easily solved.

Padraic Cunningham Over a year ago

@StevenRumbalski, no worries, I am throwing together something to work for the case where numbers are involved
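
To make the two limitations from these comments concrete, here is a quick check of the [a-z]-only pattern. The upper-case host is made up; io9.com is the example from the comment.

```python
import re

letters_only = r"(?<=[a-z])\.(?=[a-z])"

# A digit sits right before the dot, so the lookbehind fails
digit_host = re.sub(letters_only, "(dot)", "io9.com")

# Upper-case letters are not in [a-z] unless re.I is applied
upper_host = re.sub(letters_only, "(dot)", "WWW.EXAMPLE.COM")
upper_host_ignorecase = re.sub(letters_only, "(dot)", "WWW.EXAMPLE.COM", flags=re.I)

print(digit_host)
print(upper_host)
print(upper_host_ignorecase)
```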

Vlad Og Over a year ago

I tested this and it works fine with one or two lines of text, but not beyond that: it only replaces dots near the beginning of the text. For a text file containing google9.com google9.com 192.60.10.10 33.44.55.66, it only replaces the first URL.


Marc Schulder · Answer · 2015-12-05 00:00:11Z

1

Judging by your code, you were hoping to replace only the first group within your pattern. However, re.sub replaces the entire match, not just a group. In your case the match is the single character right before the period, the period itself, and the single character after it.
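
A short sketch of that behaviour, using a string like the one in the question. The second call is one possible way to keep the neighbouring characters, by capturing them and writing them back with \1 and \2.

```python
import re

s = "http://www.google.com 192.60.10.10"

# The whole match ("w.g", then "e.c") is replaced, so the neighbouring letters vanish
whole_match = re.sub(r"[a-z](\.)[a-z]", "(dot)", s)

# Capture the neighbours and put them back around the replacement
with_groups = re.sub(r"([a-z])\.([a-z])", r"\1(dot)\2", s)

print(whole_match)
print(with_groups)
```

The first result reproduces the output from the question. The second works for this input, although it still relies on letters on both sides of every dot.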

Even if sub worked as you hoped, your regex would be missing edge cases where numbers are part of URLs, such as www.2048game.com. Defining what an IP looks like is far easier. It's always a set of four numbers with one, two or three digits each, separated by dots. (In the case of IPv4, anyway. But IPv6 does not use periods, so it doesn't matter here.)
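
As a sketch of that definition: a \d{1,3} pattern checks the shape only, so it also accepts something like 999.999.999.999. If stricter checking matters, the standard-library ipaddress module (Python 3) can validate the value. The helper names looks_like_ipv4 and is_ipv4 are made up for this example.

```python
import ipaddress
import re

IPV4_SHAPE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def looks_like_ipv4(token):
    """Shape check only: four dot-separated groups of one to three digits."""
    return IPV4_SHAPE.fullmatch(token) is not None

def is_ipv4(token):
    """Stricter check: let ipaddress decide whether the value is a valid address."""
    try:
        ipaddress.IPv4Address(token)
    except ValueError:
        return False
    return True

for token in ["192.60.10.10", "999.999.999.999", "www.2048game.com"]:
    print(token, looks_like_ipv4(token), is_ipv4(token))
```

For the purpose of skipping IPs in a list of hosts, the shape check is probably enough; the stricter version is there in case out-of-range values should be treated as ordinary text.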

Assuming you have only URLs and IPs in your text file, simply filter out all IPs and then replace the periods in the remaining URLs:

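import re

# Four groups of one to three digits separated by dots (IPv4)
is_ip = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
urls = content.split(" ")  # splits on spaces only, so newlines stay inside the tokens
for i, url in enumerate(urls):
    # match() anchors at the start of the token only
    if not is_ip.match(url):
        urls[i] = url.replace('.', '(dot)')
content = ' '.join(urls)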

Of course, if content contains regular text, this will also replace every ordinary period, not just those in URLs. In that case you would first need more intricate URL detection. See "In search of the perfect URL validation regex".

edited Dec 5, 2015 at 0:00

answered Dec 4, 2015 at 17:21

Marc Schulder


4 Comments

Vlad Og Over a year ago

tested this, it also replaces the dots in the ip address

Marc Schulder Over a year ago

Sorry, had a typo in the code. You should of course call is_ip.match(url), not is_ip.match(content)

Vlad Og Over a year ago

Great, but why does it change the structure of the output? When I open the text document with the saved content, it is all on one long line instead of having the same structure as before the changes.

Marc Schulder Over a year ago

@Vlad Ah, another slip of mine. split() without parameter splits by any whitespace, be it space character or newline. Use split(' ') instead.
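
One way to keep the original layout, as a sketch rather than a drop-in fix: split with a capturing group so the whitespace runs are kept as tokens, then join everything back unchanged. This assumes Python 3.4+ for fullmatch; sanitize is a hypothetical helper name.

```python
import re

IS_IP = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def sanitize(content):
    """Replace dots in every non-IP token, keeping spaces and newlines as they were."""
    # The capturing group makes re.split return the whitespace separators too
    tokens = re.split(r"(\s+)", content)
    out = []
    for token in tokens:
        if IS_IP.fullmatch(token):
            out.append(token)  # leave IP addresses alone
        else:
            # Whitespace tokens contain no dots, so they pass through unchanged
            out.append(token.replace(".", "(dot)"))
    return "".join(out)

print(sanitize("google8.com\n192.60.10.10 io9.com\n"), end="")
```

Because each token is checked on its own, this also works when entries are separated by newlines rather than spaces.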

Serhii Nosov · Answer · 2015-12-04 17:23:45Z

0

import re

content = "I tried to do this using regex, but my output is http://www.googl.com 192.60.10.10 33.44.55.66\nhttp://ya.ru\n..."

reg = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

all_urls = re.findall(reg, content, re.M| re.I| re.DOTALL)
repl_urls = [u.replace('.', '(dot)') for u in all_urls]

for u, r in zip(all_urls, repl_urls):
    content = content.replace(u, r)

print(content)
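
A variation on the code above, as a sketch: re.sub also accepts a function as the replacement, so each URL can be rewritten where it is found instead of collecting the URLs first and calling str.replace on the whole text. The regex is the one from this answer; the helper name defang and the sample text are made up.

```python
import re

URL = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

def defang(match):
    """Return the matched URL with its dots replaced."""
    return match.group(0).replace('.', '(dot)')

content = "Visit http://www.google.com now, host 10.0.0.1 is up"
sanitized = URL.sub(defang, content)
print(sanitized)
```

Like the original, this only touches URLs that start with http:// or https://, so a bare host such as google8.com is left as it is.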

answered Dec 4, 2015 at 17:23

Serhii Nosov


Antonio Ragagnin (edited by qneill) · Answer · 2015-12-04 17:34:12Z

0

You have to capture the [a-z] content before and after the dot so that you can put it back into the replacement string. Here is how I solved it:

from __future__ import print_function
import sys
import re

nargs = len(sys.argv)
if nargs < 2:
    sys.exit('You did not specify a file')
else:
    inputFile = sys.argv[1]
    fp = open(inputFile)
    content = fp.read()

replace = '\\1(dot)\\3'
regex = '(.*[a-z])(\.)([a-z].*)'
print(re.sub(regex, replace, content, re.M| re.I| re.DOTALL))

edited Dec 4, 2015 at 17:34

qneill


answered Dec 4, 2015 at 17:02

Antonio Ragagnin


2 Comments

Steven Rumbalski Over a year ago

Fails on 'www.google.com' giving 'www.google(dot)com'. Also, numbers are legal in domain names, so this would fail on those as well (consider io9.com).

Steven Rumbalski Over a year ago

The problem is that matches don't overlap. See Padraic Cunningham's answer for a way around this with lookahead and lookbehind assertions.
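
A small sketch of the overlap problem described in these comments. The string a.b.c is a made-up minimal case; www.google.com is the example from the comment above.

```python
import re

# Capturing groups consume the neighbouring letters, so the dot right after is skipped
grouped = re.sub(r"([a-z])\.([a-z])", r"\1(dot)\2", "a.b.c")

# Lookarounds consume nothing, so every dot is found
lookaround = re.sub(r"(?<=[a-z])\.(?=[a-z])", "(dot)", "a.b.c")

# The greedy pattern from this answer matches the whole string once, at the last dot
greedy = re.sub(r"(.*[a-z])(\.)([a-z].*)", r"\1(dot)\3", "www.google.com")

print(grouped)
print(lookaround)
print(greedy)
```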
