
I want to remove some particular text from my HTML content. I am using the replaceAll method in Java to replace that text with "" to achieve this.

My content is

<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA"> or 
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-AU" xml:lang="en-AU"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-IE" xml:lang="en-IE"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="es-PR" xml:lang="es-PR"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">

I want to remove lang="-" xml:lang="-". As you can see, the values of lang and xml:lang change dynamically, so I want a regular expression that detects this particular string sequence. I will then replace it with "" using the replaceAll(regex, string) method in Java.

3 Comments

  • You should use, for example, jsoup to do that. Take a look at this post: stackoverflow.com/questions/18281894/… Commented Apr 29, 2015 at 11:00
  • Don't use regex; use a parser. Generate the DOM, remove the elements/attributes you don't want, and return the altered structure. One of the simplest and cleanest parsers is Jsoup. Commented Apr 29, 2015 at 11:02
  • Is <html xmlns=".." ...> or <html xmlns=".." ...> real content, or did you perhaps mean that the content can be <html xmlns=".." ...> or <html xmlns=".." ...>? Commented Apr 29, 2015 at 11:18

3 Answers

3

This answer is based on the assumption that

<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA"> or 
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-AU" xml:lang="en-AU"> or
...

means that you have HTML structures like

<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA">
   ...
</html>

or

<html xmlns="http://www.w3.org/1999/xhtml" lang="en-AU" xml:lang="en-AU">
   ...
</html>

In that case, use an HTML/XML parser such as Jsoup instead of regex. Your code could look like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

String htmlText = 
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"fr-CA\" xml:lang=\"fr-CA\">" +
        "   <body>hello</body>" +
        "</html>";

// use the XML parser if you don't want Jsoup to change (normalize) your HTML code
Document doc = Jsoup.parse(htmlText, "", Parser.xmlParser());
Elements htmlTag = doc.select("html");
// remove these attributes from the selected tag
htmlTag.removeAttr("lang").removeAttr("xml:lang");

String replaced = doc.toString();
System.out.println(replaced);

2

You can try this (the example is in PHP, but the regex itself is not PHP-specific):

$strings = <<< LOL
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-AU" xml:lang="en-AU">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-IE" xml:lang="en-IE">
<html xmlns="http://www.w3.org/1999/xhtml" lang="es-PR" xml:lang="es-PR">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
LOL;

$strings = preg_replace('/(lang=".*?"|xml:lang=".*?")/', '', $strings);

echo $strings;

Output:

<html xmlns="http://www.w3.org/1999/xhtml"  >
<html xmlns="http://www.w3.org/1999/xhtml"  >
<html xmlns="http://www.w3.org/1999/xhtml"  >
<html xmlns="http://www.w3.org/1999/xhtml"  >
<html xmlns="http://www.w3.org/1999/xhtml"  >
<html xmlns="http://www.w3.org/1999/xhtml"  >

Demo:

http://ideone.com/vhtVcW


Regex Explanation:

(lang=".*?"|xml:lang=".*?")

Match the regex below and capture its match into backreference number 1 «(lang=".*?"|xml:lang=".*?")»
   Match this alternative «lang=".*?"»
      Match the character string “lang="” literally «lang="»
      Match any single character that is NOT a line break character «.*?»
         Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
      Match the character “"” literally «"»
   Or match this alternative «xml:lang=".*?"»
      Match the character string “xml:lang="” literally «xml:lang="»
      Match any single character that is NOT a line break character «.*?»
         Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
      Match the character “"” literally «"»
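
Since the question asks about Java's replaceAll, here is a minimal sketch of the same alternation in Java. The class name StripLangAttributes and the method stripLang are made up for illustration, and the capturing group is left out because nothing refers to it.

```java
public class StripLangAttributes {

    // Same alternation as the PHP answer, without the unused capturing group.
    private static final String LANG_ATTRS = "lang=\".*?\"|xml:lang=\".*?\"";

    // Hypothetical helper: removes lang="..." and xml:lang="..." wherever they occur.
    public static String stripLang(String html) {
        return html.replaceAll(LANG_ATTRS, "");
    }

    public static void main(String[] args) {
        String html = "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"fr-CA\" xml:lang=\"fr-CA\">";
        System.out.println(stripLang(html));
    }
}
```

As in the Output above, the spaces that separated the attributes are left behind. The pattern is also applied to the whole string, so it would touch any other tag that carries a lang attribute.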


0
text.replaceAll("\\w?{3}:?lang=\"\\S*\"", "");

This should do the job.
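
One way to read the intent of this pattern is "an optional xml: prefix, then lang="..."". A sketch of that idea written out explicitly might look like the following. It is not the answerer's exact regex: it assumes attribute values are double-quoted, and the class and method names are illustrative only. It also consumes the whitespace in front of each attribute, which keeps it from matching inside names such as hreflang.

```java
public class LangAttributeRemover {

    // \s+        whitespace that separates the attribute from what precedes it
    // (?:xml:)?  optional "xml:" prefix
    // lang="..." the attribute itself; [^"]* stops at the closing quote
    private static final String LANG_ATTR = "\\s+(?:xml:)?lang=\"[^\"]*\"";

    // Hypothetical helper: strips lang and xml:lang attributes together with their leading whitespace.
    public static String removeLang(String html) {
        return html.replaceAll(LANG_ATTR, "");
    }

    public static void main(String[] args) {
        String html = "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"es-PR\" xml:lang=\"es-PR\">";
        // prints <html xmlns="http://www.w3.org/1999/xhtml">
        System.out.println(removeLang(html));
    }
}
```

This is still plain text matching, so the parser-based approach suggested in the comments remains the more robust choice when the input is arbitrary HTML.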

1 Comment

While this code may solve the problem, a few words of explanation would help all of the readers to gain more insight into the solution.
