
Hi, I have some HTML like this:

<html>
   <head>
     <title>
          Some title
   </title>
</head>
<body>
    <div id="one">         some sample info </div>
</body>
</html>

How can I remove the whitespace in this HTML, except for the spaces inside the text content and inside the tags themselves, using a regex with preg_replace? I would like to get something like this:

<html><head><title>Some title</title></head><body><div id="one">some sample info</div></body></html>

Can anyone please help me with this? :)

  • What if there are <pre> elements? Commented Feb 1, 2012 at 12:09

1 Answer


You can replace matches of (?<=>)\s+(?=<)|(?<=>)\s+(?!<)|(?<!>)\s+(?=<) with the empty string.

Edit: There's a simpler form: replace (?<=>)\s+|\s+(?=<)

Put simply, this regex matches a run of one or more whitespace characters if it has a > immediately to its left or a < immediately to its right, and each match is replaced with nothing.

It has two parts joined by OR (symbol: |), so either one may match:

  1. (?<=>)\s+ - this will match one or more whitespace characters (\s+ in the regex) if they are preceded by a > (in regex: the lookbehind (?<=>)).

  2. \s+(?=<) - this will match one or more whitespace characters if they are followed by a < (in regex: the lookahead (?=<)).
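
To make the two parts concrete, here is a minimal sketch that applies the simpler pattern with preg_replace to the HTML from the question. The function name strip_tag_whitespace is made up for illustration, and the sketch assumes that no text content contains a literal < or >.

```php
<?php
// Hypothetical helper name; the pattern is the simpler form given above.
function strip_tag_whitespace(string $html): string
{
    // Remove whitespace runs that directly follow '>' or directly precede '<'.
    // preg_replace returns null on error, so fall back to the input unchanged.
    return preg_replace('/(?<=>)\s+|\s+(?=<)/', '', $html) ?? $html;
}

$html = <<<HTML
<html>
   <head>
     <title>
          Some title
   </title>
</head>
<body>
    <div id="one">         some sample info </div>
</body>
</html>
HTML;

echo strip_tag_whitespace($html), PHP_EOL;
```

For this input the result should be the single line asked for in the question. The space in "Some title" and the space inside <div id="one"> survive because neither touches a > on its left or a < on its right.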

Learn more about regex.


1 Comment

This answer is completely unstable and relies on the notion that there are no lingering > or < symbols in any of the text nodes in the HTML document. I would not recommend this technique to anyone. This is just another case where using regex to do a DOM parser's job is inappropriate. Researchers, please be informed that regex is "DOM-ignorant" -- it doesn't know if it is matching the start/end of a tag or merely something that resembles the start/end of a tag. At the very least, this regex is too primitive to do a consistently good job.
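
For comparison, a DOM-based version of the same cleanup might look roughly like the sketch below. It is not the answerer's method. It assumes PHP's DOM extension is available, the function name strip_whitespace_dom is made up, and it leaves text inside <pre> elements alone, as the comment on the question suggests. Note that loadHTML/saveHTML may add a DOCTYPE or otherwise normalize the markup, so the output is not guaranteed to match the input byte for byte.

```php
<?php
// Hypothetical helper; a sketch of a parser-based alternative to the regex.
function strip_whitespace_dom(string $html): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select every text node that is not inside a <pre> element.
    $nodes = $xpath->query('//text()[not(ancestor::pre)]');
    // Copy to an array first so removing nodes does not disturb iteration.
    foreach (iterator_to_array($nodes) as $node) {
        $trimmed = trim($node->data);
        if ($trimmed === '') {
            // Whitespace-only node between tags: drop it entirely.
            $node->parentNode->removeChild($node);
        } else {
            // Real content: trim the edges, keep the inner spaces.
            $node->data = $trimmed;
        }
    }
    return $doc->saveHTML();
}

echo strip_whitespace_dom(
    "<html><body>\n  <div id=\"one\">   some sample info </div>\n  <pre>  keep  </pre>\n</body></html>"
);
```

Because the parser decides what is a tag and what is text, a stray < or > inside a text node should not confuse this version the way it can confuse the regex.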
