0

I am trying to extract both the tag and the text between the tags from a text file, using regex (the file contains only a few XML tags).

Below is what I have tried so far:

    String txt = "<DATE>December</DATE>";

    String re1 = "(<[^>]+>)"; // Tag 1
    String re2 = "(.*?)";     // Variable Name 1
    String re3 = "(<[^>]+>)"; // Tag 2

    Pattern p = Pattern.compile(re1 + re2 + re3, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = p.matcher(txt);
    if (m.find())
    {
        String tag1 = m.group(1);
        String var1 = m.group(2);
        String tag2 = m.group(3);
        //System.out.print("("+tag1.toString()+")"+"("+var1.toString()+")"+"("+tag2.toString()+")"+"\n");

        System.out.println(tag1.toString().replaceAll("<>", ""));
        System.out.println(var1.toString());
    }

As output, I get:

<DATE>
December

How do I get rid of the <>?

2 Answers

2

Don't use regex to parse markup syntax, such as XML, HTML, XHTML and so on.

Many reasons are shown here.

Instead, do yourself a favor and use XPath and XQuery.
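As a rough illustration of the XPath route, here is a minimal sketch using the JDK's built-in javax.xml APIs. The class name XPathTagDemo, the helper extract, and the synthetic <root> wrapper are assumptions made for this example; the wrapper is only there because a text fragment with loose tags is not a complete XML document on its own. It assumes every tag in the input is properly closed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathTagDemo {
    // Returns one "TAG=text" entry per tagged span found in the fragment.
    // Hypothetical helper: wraps the fragment in a <root> element so it parses as XML.
    static List<String> extract(String fragment) {
        try {
            String xml = "<root>" + fragment + "</root>";
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // Select every element that is a direct child of the wrapper.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("/root/*", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                result.add(node.getNodeName() + "=" + node.getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // Malformed input (for example an unclosed tag) ends up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract("Flights in <DATE>December</DATE>")); // [DATE=December]
    }
}
```

Unlike a regex, the parser rejects input whose tags are not properly closed, so this only fits text that is well-formed once wrapped.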


1 Comment

Yes, you're right. But I only have a few tags in my text file (max 10 tags), hence regex.
1

It is a bad idea to use regex to parse XML. With a regex there is no reliable way to identify a complete element from opening to closing tag once elements can nest (a regex cannot "remember" a number of occurrences).

However, here is why your regex fails in this specific case:

In re1 and re3, the capturing groups include the < and >, so the brackets end up in the captured tag (also, re3 does not account for the / of the closing tag). You could simply change them to:

String re1="<([^>]+)>"; // Tag 1
String re2="([^<]*)"; // Variable Name 1
String re3="</([^>]+)>"; // Tag 2

or use a suitable regex to remove < and > from tag1:

System.out.println(tag1.toString().replaceAll("<|>", ""));

or

System.out.println(tag1.toString().replaceAll("[<>]", ""));
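Putting the first suggestion together, a minimal self-contained sketch might look like the following. The class name TagRegexDemo and the helper firstTag are made up for this example. The original flags are omitted because the pattern contains neither letters nor a dot, so CASE_INSENSITIVE and DOTALL would have no effect on it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagRegexDemo {
    // The groups capture only the tag name and the text; the angle brackets
    // and the closing slash sit outside the parentheses.
    private static final Pattern TAGGED = Pattern.compile("<([^>]+)>([^<]*)</([^>]+)>");

    // Returns {tagName, text} for the first tagged span, or null if there is none.
    static String[] firstTag(String txt) {
        Matcher m = TAGGED.matcher(txt);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] result = firstTag("<DATE>December</DATE>");
        System.out.println(result[0]); // DATE
        System.out.println(result[1]); // December
    }
}
```

Note that this pattern does not check that the closing tag name matches the opening one; it only requires some closing tag to follow the text.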

2 Comments

It works, but it does not recognize any further tags in the sentence. E.g., for "American Airlines made <TRIPS> 100 <TRIPS> flights in <DATE> December </DATE>" it only recognizes TRIPS and 100, but not the next tag.
@Betafish: <TRIPS> is not closed by a </TRIPS> tag in your example. If you want to ignore that, you could use re3 = "</?([^>]+)>" or re3 = re1.
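To pick up every tagged span rather than just the first, the matcher has to be advanced in a loop: if (m.find()) stops after one match, while (m.find()) continues through the input. The sketch below combines that with the relaxed re3 from the comment above; the class name AllTagsDemo and the helper allTags are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllTagsDemo {
    // "</?" makes the slash optional, so a span may be ended by either
    // an opening-style or a closing-style tag.
    private static final Pattern TAGGED = Pattern.compile("<([^>]+)>([^<]*)</?([^>]+)>");

    // Returns one "TAG=text" entry per tagged span, in order of appearance.
    static List<String> allTags(String txt) {
        List<String> result = new ArrayList<>();
        Matcher m = TAGGED.matcher(txt);
        while (m.find()) { // each call resumes where the previous match ended
            result.add(m.group(1) + "=" + m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String txt = "American Airlines made <TRIPS> 100 <TRIPS> flights in <DATE> December </DATE>";
        System.out.println(allTags(txt)); // [TRIPS=100, DATE=December]
    }
}
```

Because each match consumes both of its tags, this works only while the tagged spans do not nest or overlap.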
