I would like to replace all leading spaces and tabs on each line of an encoded XML/HTML string with HTML entities.

Every group of 4 spaces and every tab should be replaced by a tab entity (&#09;), and the remaining spaces by a non-breaking space (&nbsp;). The replacements must only happen at the start of each line, up to the first character that is not a space or tab.

Example

Beginning of line: (^|(\\r|\\n)+), where (\\r|\\n)+ lets several consecutive line breaks be matched together

Characters to replace: [ ], [\t]

21 spaces = 5 x &#09; + 1 x &nbsp;
10 spaces + 1 tab + 6 spaces = 2 x &#09; + 2 x &nbsp; + 1 x &#09; + 1 x &#09; + 2 x &nbsp;

:: 10 spaces = 2 x &#09; + 2 x &nbsp;
:: 1 tab = 1 x &#09;
:: 6 spaces = 1 x &#09; + 2 x &nbsp;
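To make the rule above concrete, here is a minimal sketch of the arithmetic for a single run of leading whitespace. The class name `LeadingWhitespaceRule` and the helpers `encodeRun` and `spaces` are hypothetical names chosen for illustration. It assumes a tab closes the current group of spaces, as in the 10 spaces + 1 tab + 6 spaces example.

```java
import java.util.Arrays;

class LeadingWhitespaceRule {

    // Hypothetical helper: encodes one run consisting only of spaces and tabs.
    static String encodeRun(String run) {
        StringBuilder out = new StringBuilder();
        int pendingSpaces = 0;
        for (char c : run.toCharArray()) {
            if (c == ' ') {
                // Every complete group of 4 spaces becomes one tab entity.
                if (++pendingSpaces == 4) {
                    out.append("&#09;");
                    pendingSpaces = 0;
                }
            } else if (c == '\t') {
                // Spaces left over before a tab become &nbsp;, then the tab itself.
                appendNbsp(out, pendingSpaces);
                pendingSpaces = 0;
                out.append("&#09;");
            } else {
                throw new IllegalArgumentException("not a space or tab: '" + c + "'");
            }
        }
        appendNbsp(out, pendingSpaces); // leftover spaces at the end of the run
        return out.toString();
    }

    private static void appendNbsp(StringBuilder out, int count) {
        for (int i = 0; i < count; i++) {
            out.append("&nbsp;");
        }
    }

    // Builds a string of n spaces without relying on String.repeat (Java 11+).
    static String spaces(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, ' ');
        return new String(chars);
    }

    public static void main(String[] args) {
        // prints &#09;&#09;&#09;&#09;&#09;&nbsp;
        System.out.println(encodeRun(spaces(21)));
        // prints &#09;&#09;&nbsp;&nbsp;&#09;&#09;&nbsp;&nbsp;
        System.out.println(encodeRun(spaces(10) + "\t" + spaces(6)));
    }
}
```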

The input is a string that has already been processed by other regular expressions:

text = text.replace(regex1, replacement1)
text = text.replace(regex2, replacement2)
text = text.replace(regex3, replacement3)
text = text.replace(regex4, replacement4)

The new regular expression has to be applied at this point.

Visual XML

<TEST>
    <NODE1>
        <VALUE>         Test</VALUE>
    </NODE1>
    <NODE1>
        <VALUE>         Test</VALUE>
    </NODE1>
</TEST>

Encoded XML structure (the visual XML above, as it appears in the input string)

&lt;TEST&gt;
    &lt;NODE1&gt;
        &lt;VALUE&gt;         Test&lt;/VALUE&gt;
    &lt;/NODE1&gt;
    &lt;NODE1&gt;
        &lt;VALUE&gt;         Test&lt;/VALUE&gt;
    &lt;/NODE1&gt;
&lt;/TEST&gt;

Expected output

&lt;TEST&gt;
&#09;&lt;NODE1&gt;
&#09;&#09;&nbsp;&lt;VALUE&gt;         Test&lt;/VALUE&gt; <- NOT replaced in <VALUE>
&#09;&lt;/NODE1&gt;
&#09;&lt;NODE1&gt;
&#09;&#09;&nbsp;&lt;VALUE&gt;         Test&lt;/VALUE&gt; <- NOT replaced in <VALUE>
&#09;&lt;/NODE1&gt;
&lt;/TEST&gt;

I tried a lot.

I tried, and failed, to capture the beginning of the line in a regex group and replace the groups of whitespace.

result: the beginning of the line is repeated together with the HTML-encoded spaces/tabs
example: \r&#09;\r&#09;\r&#09;\r&#09;
expected:\r&#09;&#09;&#09;&#09;

"(^|(\\r|\\n))[ ]{4}", "\\1&#09"

I tried to do this in 2 steps: first replace every 4 spaces and every tab with a tab entity, then replace the remaining spaces with &nbsp;, but then every space in the text gets replaced. I also tried the same with "&#09;[ ]", "&#09;&nbps;".

I also tried a Matcher.find() loop with substring, which gave the best results, but they were still not 100% correct.

I keep failing to get the regex right. Can anyone help?

1 Answer

How about the following program, which uses a chain of replaceAll calls and lookbehinds:

    public static void main (String[] args) {
        final String[] INPUT = new String[] {
            "<TEST>",
            "    <NODE1>",
            "         <VALUE>         Test</VALUE>",                // 2 tabs 1 space here
            "    </NODE1>",
            "    <NODE1>",
            "        <VALUE>         Test</VALUE>",
            "    </NODE1>",
            "</TEST>"
        };

        for (String str: INPUT) {
            System.out.println("NEW: " + htmlspecialchars(str));
        }
    }

    private static String htmlspecialchars(String str) {
        return str
            .replaceAll("&", "&amp;")                   // replace html entities (& first)
            .replaceAll("<", "&lt;")
            .replaceAll(">", "&gt;")
            .replaceAll("(?<=^\\s*)\t", "    ")         // replace tabs by 4 spaces
            .replaceAll("(?<=^\\s*)    ", "&#09;")      // replace 4 spaces by &#09;
            .replaceAll("(?<=^(?:&#09;)*) ", "&nbsp;"); // replace rest spaces by &nbsp;
    }

The resulting output is:

NEW: &lt;TEST&gt;
NEW: &#09;&lt;NODE1&gt;
NEW: &#09;&#09;&nbsp;&lt;VALUE&gt;         Test&lt;/VALUE&gt;
NEW: &#09;&lt;/NODE1&gt;
NEW: &#09;&lt;NODE1&gt;
NEW: &#09;&#09;&lt;VALUE&gt;         Test&lt;/VALUE&gt;
NEW: &#09;&lt;/NODE1&gt;
NEW: &lt;/TEST&gt;
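
The program above calls the method once per line. If the input is one string containing line breaks, as in the question, a variant along the following lines might fit better. This is only a sketch under a few assumptions: the class name `LeadingWhitespaceEncoder` and the method `encodeLeadingWhitespace` are made up for illustration, the text has already been entity-encoded, and a tab is encoded on its own instead of being expanded to 4 spaces (following the 10 spaces + 1 tab + 6 spaces example in the question). It uses `Pattern.MULTILINE` so that `^` matches after every line break, and it does not depend on lookbehind.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingWhitespaceEncoder {

    // With MULTILINE, ^ matches at the start of the input and after each line terminator,
    // so this finds the run of spaces/tabs that opens every line.
    private static final Pattern LEADING = Pattern.compile("^[ \\t]+", Pattern.MULTILINE);

    // Hypothetical helper: rewrites only the leading whitespace of each line.
    static String encodeLeadingWhitespace(String text) {
        Matcher m = LEADING.matcher(text);
        StringBuffer out = new StringBuffer(); // StringBuffer keeps this Java 8 compatible
        while (m.find()) {
            String encoded = m.group()
                .replaceAll(" {4}|\\t", "&#09;") // groups of 4 spaces and single tabs
                .replace(" ", "&nbsp;");         // whatever spaces are left over
            // quoteReplacement guards against '$' or '\' being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(encoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "&lt;TEST&gt;\n"
            + "    &lt;NODE1&gt;\n"
            + "         &lt;VALUE&gt;         Test&lt;/VALUE&gt;\n"
            + "    &lt;/NODE1&gt;\n"
            + "&lt;/TEST&gt;";
        System.out.println(encodeLeadingWhitespace(input));
    }
}
```

In the chain from the question, this would be one more step after the existing replacements, for example `text = LeadingWhitespaceEncoder.encodeLeadingWhitespace(text);`. Spaces inside a line, such as those inside the VALUE element, are never part of a match and stay untouched.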