I am trying to replace plain-text links with hyperlinks in an HTML document. My logic is:

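// Matches a leading "http://" or "https://" in any case. The original "http|https://"
// parsed as "http" OR "https://", so it also matched a bare "http" anywhere in the URL.
private static final Pattern WEB_URL_PROTOCOL = Pattern.compile("(?i)^https?://");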

StringBuffer sb = new StringBuffer();
        if (text != null) {
            // Escape any inadvertent HTML in the text message
            text = EmailHtmlUtil.escapeCharacterToDisplay(text);
            // Find any embedded URL's and linkify
            Matcher m = Patterns.WEB_URL.matcher(text);

            while (m.find()) {
                int start = m.start();

                if (start == 0 || text.charAt(start - 1) != '@') {
                    String url = m.group();
                    Matcher proto = WEB_URL_PROTOCOL.matcher(url);
                    String link;
                    if (proto.find()) {
                        // Lower-case the protocol part of the link.
                        link = proto.group().toLowerCase() + url.substring(proto.end());
                    } else {

                        link = "http://" + url;
                    }
                    String href = String.format("<a href=\"%s\">%s</a>", link, url);
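                    // Quote the replacement so '$' or '\' inside a URL is not treated as a group reference.
                    m.appendReplacement(sb, Matcher.quoteReplacement(href));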
                }
                else {
                    m.appendReplacement(sb, "$0");
                }
            }
            m.appendTail(sb);
        }

This code successfully finds all links in an HTML document, but the problem is that it also matches URLs that are already hyperlinks. I want to skip existing hyperlinks and match only plain links. For example, it should leave this alone:

<p class="MsoNormal"><a href="awbs://www.google.com" target="_BLANK">https://www.google.com</a> normal address https</p> 

but a plain link such as https://www.google.com should be replaced by a hyperlink.

Edit: suppose the document contains text like this:

1. https://www.yahoo.com

2. https://www.google.com normal address https

Here 1 is a plain link and 2 is already a hyperlink, as in the example above. I want to replace https://www.yahoo.com with

<p class="MsoNormal"><a href="https://www.yahoo.com">https://www.yahoo.com</a></p>

and 2 should not be affected at all.

1 Answer

I would recommend using Jsoup here.

Sample code

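import java.util.List;
import java.util.regex.Matcher;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Tag;

String text = "<html><head></head><body><a href='http://google.com'>Don't change this link</a> Change this: http://yahoo.com foo.com</body></html>";
Document d = Jsoup.parse(text);
String oldHtmlCode = d.outerHtml();
// Only the text nodes that are direct children of <body>; text inside <a> is not visited.
List<TextNode> textNodes = d.body().textNodes();

// Patterns.WEB_URL is the same pattern used in the question.
Matcher m = Patterns.WEB_URL.matcher("");
for (TextNode textNode : textNodes) {
    m.reset(textNode.text());

    // replaceAll() resets the matcher and rewrites every URL at once, so one find() check is enough.
    if (m.find()) {
        // Prefix each new href with "***" so the protocol can be normalized afterwards.
        String fragment = m.replaceAll("<a href=\"\\*\\*\\*$1\">$1</a>");
        textNode.replaceWith(new Element(Tag.valueOf("span"), "").html(fragment));
    }
}

// Resolve the "***" marker: add "http://" where the URL has no protocol, otherwise keep the existing one.
String newHtmlCode = d.outerHtml()
        .replaceAll("\"\\Q***\\E(?!https?://)", "\"http://")
        .replaceAll("\"\\Q***\\E(https?://)", "\"$1");

System.out.println("BEFORE:\n\n" + oldHtmlCode);
System.out.println("----------------------------");
System.out.println("AFTER:\n\n" + newHtmlCode);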

Output

BEFORE:

<html>
 <head></head>
 <body>
  <a href="http://google.com">Don't change this link</a> Change this: http://yahoo.com foo.com
 </body>
</html>
----------------------------
AFTER:

<html>
 <head></head>
 <body>
  <a href="http://google.com">Don't change this link</a>
  <span> Change this: <a href="http://yahoo.com">http://yahoo.com</a> <a href="http://foo.com">foo.com</a></span>
 </body>
</html>
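
If a parser really is not an option, a common regex-only trick is to let one alternation match existing <a>...</a> elements and other tags first, copy those through unchanged, and linkify only what the last alternative matches. The following is a minimal sketch under stated assumptions, not a robust HTML rewriter: the class name PlainLinkifier, the method linkify and the pattern TOKEN are hypothetical, it uses only java.util.regex, and it recognizes only URLs that start with http:// or https:// (unlike Patterns.WEB_URL, it will not pick up a bare foo.com). It can be confused by malformed markup, comments or script blocks, and it keeps trailing punctuation as part of the URL.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainLinkifier {
    // Alternatives are tried left to right at each position:
    //   anchor: a whole existing <a ...>...</a> element (copied through unchanged)
    //   tag:    any other tag, so URLs inside attributes are left alone
    //   url:    a plain http(s) URL in text, which is the only thing we rewrite
    private static final Pattern TOKEN = Pattern.compile(
            "(?is)(?<anchor><a\\b[^>]*>.*?</a>)"
            + "|(?<tag><[^>]+>)"
            + "|(?<url>\\bhttps?://[^\\s<>\"']+)");

    static String linkify(String html) {
        StringBuffer sb = new StringBuffer();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String url = m.group("url");
            String replacement;
            if (url == null) {
                // Existing hyperlink or some other tag: keep it exactly as it was.
                replacement = m.group();
            } else {
                // Lower-case only the scheme, as the question's code does.
                int schemeEnd = url.indexOf("://");
                String link = url.substring(0, schemeEnd).toLowerCase(Locale.ROOT)
                        + url.substring(schemeEnd);
                replacement = "<a href=\"" + link + "\">" + url + "</a>";
            }
            // quoteReplacement keeps '$' and '\' in the text from being read as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "Visit HTTPS://www.yahoo.com or "
                + "<a href=\"http://google.com\">http://google.com</a>";
        System.out.println(linkify(html));
    }
}
```

Because the anchor alternative consumes an existing hyperlink as a single match, the URL in its href and in its link text is never offered to the url alternative, which is what keeps already-linked addresses untouched.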

3 Comments

Is it possible to do this without using a parser? I am looking for a regex.
@Subham Check my answer: I use a regex to find the plain links. Jsoup is pretty fast.
@Subham Check my new answer.
