I want to find URLs without a protocol in text and then prepend the protocol to them. That means I don't want URLs that begin with http(s):// or http(s)://www., only the bare kind such as example.com. I'm aware that I might accidentally match any text1.text2 where I forgot to put a space after a period, so I came up with some rules to make the match look more like an actual URL:
(?<=^|\s)(\w*-?\w+\.[a-z]{2,}\S*)
(?<=^|\s): The URL must come right after the start of a line or a whitespace character.
\w*-?\w+: The domain part, which may or may not contain a dash (-). Since it has to follow a newline or a space, a URL with a protocol is excluded.
\.[a-z]{2,}: The extension, which should be at least 2 letters.
\S*: The rest of the URL.
It works well: it matches example.com and example.com/x1/x2 but not https://example.com. But I think it's a bit clumsy, and it fails when a . or , directly follows the URL, because the trailing \S* swallows the punctuation.
How can I achieve the same goal more elegantly? I don't need to match URLs like 1.1.1.1. Are there loopholes in the above rules that I haven't considered yet?
Use (?<!\S) in place of (?<=^|\s) (with this simple negation you avoid the alternation). If you want to avoid a dot (that ends a sentence), change the last \S* to \S*(?<![.]). (But whatever you do, don't dream: it can't be perfect, even if your pattern fully and precisely describes the URL syntax.)
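For what it's worth, here is a minimal sketch of both suggestions combined, in Python. Several things here are my assumptions and not part of the question: the language, the helper name `add_protocol`, the default `https://` prefix, and widening the final lookbehind from `[.]` to `[.,]` so that a trailing comma is dropped as well. As far as I know, Python's `re` module rejects the original `(?<=^|\s)` because the two alternatives differ in width, which is one more reason to prefer `(?<!\S)` there.

```python
import re

# (?<!\S)       not preceded by a non-space char (start of text or whitespace)
# \w*-?\w+      domain part, with an optional dash
# \.[a-z]{2,}   extension of at least two lowercase letters
# \S*(?<![.,])  rest of the URL, backtracking off a trailing "." or ","
#               (the "," is an assumption added to the suggested [.])
BARE_URL = re.compile(r"(?<!\S)(\w*-?\w+\.[a-z]{2,}\S*(?<![.,]))")


def add_protocol(text, protocol="https://"):
    """Prepend `protocol` to every bare URL found in `text`.

    Hypothetical helper; the default protocol is an assumption.
    """
    # A function replacement avoids escape handling in the protocol string.
    return BARE_URL.sub(lambda m: protocol + m.group(1), text)


if __name__ == "__main__":
    sample = "Visit example.com, or example.com/x1/x2. Not https://example.com."
    print(add_protocol(sample))
    # Visit https://example.com, or https://example.com/x1/x2. Not https://example.com.
```

Note that this still treats any lowercase word.word run after whitespace as a URL, so a sentence with a missing space after the period is rewritten too. That is the kind of case where the pattern cannot be perfect.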