Basically, I want to write a script that sanitizes a URL by replacing its dots with the string "(dot)". For example, if I have http://www.google.com, after running the script I would like it to be http://www(dot)google(dot)com. This is easy to achieve with .replace when my text file consists only of URLs or other strings, but in my case the file also contains IP addresses, and I don't want the dots in the IP addresses to change to "(dot)".

I tried to do this using a regex, but my output is "http://ww(dot)oogl(dot)om 192.60.10.10 33.44.55.66".

This is my code:

from __future__ import print_function


import sys
import re

nargs = len(sys.argv)
if nargs < 2:

    sys.exit('You did not specify a file')
else:
    inputFile = sys.argv[1]
    fp = open(inputFile)
    content = fp.read()

replace = '(dot)'
regex = '[a-z](\.)[a-z]'
print(re.sub(regex, replace, content, re.M| re.I| re.DOTALL))

I guess I need a condition that checks whether the pattern is number.number and, if so, does not replace.

4 Answers

3

You can use lookahead and lookbehind assertions:

import re

s = "http://www.google.com 127.0.0.1"

print(re.sub(r"(?<=[a-z])\.(?=[a-z])", "(dot)", s))
http://www(dot)google(dot)com 127.0.0.1

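To handle letters and digits as well, this should hopefully do the trick. The extra lookahead (?=.*[a-z]) requires at least one letter somewhere after the dot on the same line, so an IP address that is followed by a hostname on the same line would still be rewritten:

s = "http://www.googl1.2com 127.0.0.1"

print(re.sub(r"(?=.*[a-z])(?<=\w)\.(?=\w)", "(dot)", s, flags=re.I))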

http://www(dot)googl1(dot)2com 127.0.0.1
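One pitfall worth spelling out with re.sub: its fourth positional parameter is count, not flags, so a flag passed positionally is silently read as a replacement limit. The snippet below is a small sketch of that behaviour; the sample string and variable names are made up for illustration.

```python
import re

s = "A.B c.d e.f g.h"
pattern = r"(?<=[a-z])\.(?=[a-z])"

# re.I has the integer value 2, so passed positionally it is read as count=2:
# the match stays case-sensitive and at most two dots are replaced.
positional = re.sub(pattern, "(dot)", s, re.I)

# Passed by keyword, the flag is applied and every matching dot is replaced.
keyword = re.sub(pattern, "(dot)", s, flags=re.I)

print(positional)  # A.B c(dot)d e(dot)f g.h
print(keyword)     # A(dot)B c(dot)d e(dot)f g(dot)h
```

This is why flags=... is the safer spelling whenever flags are combined with re.sub.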

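For your file, read the contents and apply the same substitution. re.M is not actually needed, since it only changes how ^ and $ behave and the pattern uses neither:

In [1]: cat test.txt
google8.com
google9.com
192.60.10.10
33.44.55.66
google10.com
192.168.1.1
google11.com

In [2]: with open("test.txt") as f:
   ...:         import re
   ...:         print(re.sub(r"(?=.*[a-z])(?<=\w)\.(?=\w)", "(dot)", f.read(), flags=re.I))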
   ...:     
google8(dot)com
google9(dot)com
192.60.10.10
33.44.55.66
google10(dot)com
192.168.1.1
google11(dot)com

If the files were large and memory was an issue, you could also do it line by line, either storing all the lines in a list or using each line as you go:

import re
r = re.compile(r"(?=.*[a-z])(?<=\w)\.(?=\w)", re.I)
with open("test.txt") as f:
    # A compiled pattern's sub() takes (replacement, string)
    lines = [r.sub("(dot)", line) for line in f]

9 Comments

Good. This will work in nearly all cases. However, numbers are legal in domain names. Also, this counts on domain names being stored in lower case.
@StevenRumbalski, I was just basing it on the OP's own code and the fact they only used [a-z] in their pattern, I left the flags to the OP
Fair enough. For what it's worth, I did upvote your answer. But I still think it's a big problem that this can't handle 'io9.com'. The issue with case is minor and easily solved.
@StevenRumbalski, no worries, I am throwing together something to work for the case where numbers are involved
I tested this and it works fine with 1 or 2 lines of text, but not beyond that. It only replaces the beginning of the text: for a text file containing google9.com google9.com 192.60.10.10 33.44.55.66, it only replaces the first URL.
1

Judging by your code, you were hoping to replace the first group within your pattern. However, re.sub replaces the entire match rather than a group. In your case that is the single character right before the period, the period itself, and the single character after it.
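To make that concrete, here is a minimal sketch comparing the pattern from the question with a variant that captures the neighbouring letters and writes them back. The variable names are only illustrative.

```python
import re

s = "http://www.google.com"

# The replacement overwrites the whole match (letter, dot, letter),
# so the letter on each side of the dot is lost.
whole = re.sub(r"[a-z](\.)[a-z]", "(dot)", s)

# Capturing the neighbours and restoring them with \1 and \2 keeps them.
kept = re.sub(r"([a-z])\.([a-z])", r"\1(dot)\2", s)

print(whole)  # http://ww(dot)oogl(dot)om
print(kept)   # http://www(dot)google(dot)com
```

The first result is exactly the mangled output shown in the question.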

Even if sub worked as you hoped, your regex would be missing edge cases where numbers are part of URLs, such as www.2048game.com. Defining what an IP looks like is far easier. It's always a set of four numbers with one, two or three digits each, separated by dots. (In the case of IPv4, anyway. But IPv6 does not use periods, so it doesn't matter here.)

Assuming you have only URLs and IPs in your text file, simply filter out all IPs and then replace the periods in the remaining URLs:

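import re

# Dotted-quad IPv4 address such as 192.60.10.10
is_ip = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
urls = content.split(" ")  # split on spaces only, so newlines survive
for i, url in enumerate(urls):
    # match() anchors at the start of the token only
    if not is_ip.match(url):
        urls[i] = url.replace('.', '(dot)')
content = ' '.join(urls)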

Of course, if you have regular text in content, this will also replace all regular periods, not just those in URLs. In that case you would first need more intricate URL detection; see "In search of the perfect URL validation regex".
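If the file mixes spaces, tabs and newlines, one possible variation on the same filter-out-the-IPs idea is to let re.sub walk over each run of non-whitespace and decide per token, which leaves all whitespace exactly as it was. This is only a sketch: the helper name defang and the sample text are hypothetical, it assumes Python 3 (for fullmatch), and it shares the limitation above that periods in ordinary prose are replaced too.

```python
import re

# Assumption: an IPv4 address is four dot-separated groups of 1-3 digits.
IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def defang(content):
    """Replace dots in every whitespace-delimited token that is not an IPv4 address."""
    def fix_token(match):
        token = match.group(0)
        if IPV4.fullmatch(token):
            return token  # leave IP addresses untouched
        return token.replace(".", "(dot)")

    # \S+ matches each run of non-whitespace, so spaces, tabs and
    # newlines between tokens are never touched.
    return re.sub(r"\S+", fix_token, content)


text = "http://www.google.com 192.60.10.10\nio9.com\t33.44.55.66\n"
print(defang(text))
```

Because the check uses fullmatch, a token such as 1.2.3.4.example.com is treated as a hostname rather than an IP.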

4 Comments

tested this, it also replaces the dots in the ip address
Sorry, had a typo in the code. You should of course call is_ip.match(url), not is_ip.match(content)
Great, but why is it changing the output structure? I mean, when I open the text document with the saved content, it is all in one long line instead of having the same structure as before the changes.
@Vlad Ah, another slip of mine. split() without parameter splits by any whitespace, be it space character or newline. Use split(' ') instead.
0
import re

content = "I tried to do this using regex, but my output is http://www.googl.com 192.60.10.10 33.44.55.66\nhttp://ya.ru\n..."

reg = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

all_urls = re.findall(reg, content, re.M| re.I| re.DOTALL)
repl_urls = [u.replace('.', '(dot)') for u in all_urls]

for u, r in zip(all_urls, repl_urls):
    content = content.replace(u, r)

print(content)

0

You have to capture the [a-z] content before and after the dot so that you can put it back in the replaced string. Here is how I solved it:

from __future__ import print_function
import sys
import re

nargs = len(sys.argv)
if nargs < 2:
    sys.exit('You did not specify a file')
else:
    inputFile = sys.argv[1]
    fp = open(inputFile)
    content = fp.read()

replace = '\\1(dot)\\3'
regex = '(.*[a-z])(\.)([a-z].*)'
print(re.sub(regex, replace, content, re.M| re.I| re.DOTALL))

2 Comments

Fails on 'www.google.com' giving 'www.google(dot)com'. Also, numbers are legal in domain names, so this would fail on those as well (consider io9.com).
The problem is that matches don't overlap. See Padraic Cunningham's answer for a way around this with lookahead and lookbehind assertions.
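A short sketch of what the two comments above describe, using made-up variable names: the greedy .* swallows everything up to the last dot, and even a tighter capturing pattern skips dots whose neighbouring letter was already consumed by the previous match, whereas lookarounds consume only the dot.

```python
import re

# Greedy .* grabs as much as possible, so only the last dot is replaced.
greedy = re.sub(r"(.*[a-z])(\.)([a-z].*)", r"\1(dot)\3", "www.google.com")

# Matches cannot overlap: "a.b" consumes the "b", so ".c" has no
# unconsumed letter in front of it and is skipped.
consuming = re.sub(r"([a-z])\.([a-z])", r"\1(dot)\2", "a.b.c")

# Lookbehind and lookahead are zero-width, so only the dot itself is
# consumed and every qualifying dot is found.
lookaround = re.sub(r"(?<=[a-z])\.(?=[a-z])", "(dot)", "a.b.c")

print(greedy)      # www.google(dot)com
print(consuming)   # a(dot)b.c
print(lookaround)  # a(dot)b(dot)c
```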
