4

I've been trying to achieve this: I want to wrap each word in a <w> tag and each run of spaces (there may be several in a row) in an <s> tag, assuming the original text can contain HTML tags that should not be touched.

This is   <b>very bold</b> word. 

convert to -->

<w>This</w><s> </s><w>is</w><s>   </s><b><w>very</w><s> </s><w>bold</w></b><s> </s><w>word</w>

What is the right regex to achieve that?

2 Answers

1

You should use two replacements:

s.replace(/([^\s<>]+)(?:(?=\s)|$)/g, '<w>$1</w>').replace(/(\s+)/g, '<s>$1</s>')

Check this demo.


EDIT:

For more complex inputs (based on your comment below), go with:

s.replace(/([^\s<>]+)(?![^<>]*>)(?:(?=[<\s])|$)/g, '<w>$1</w>').replace(/(\s+)(?![^<>]*>)/g, '<s>$1</s>');

Check this demo.
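
To make the behaviour of the second pair of patterns concrete, here is a minimal sketch that wraps them in a function and runs them on the question's sample and on the "complex tag" case from the comments. The function name `wrapWordsAndSpaces` is made up for illustration; the regexes are the ones from the EDIT above, unchanged.

```javascript
// Wrap words in <w> and whitespace runs in <s>, leaving tags untouched.
// `wrapWordsAndSpaces` is a hypothetical name; the patterns come from the EDIT above.
function wrapWordsAndSpaces(s) {
  return s
    // A word: a run of non-space, non-bracket characters that is not inside
    // a tag (no '>' ahead before the next '<') and is followed by
    // whitespace, '<', or the end of the string.
    .replace(/([^\s<>]+)(?![^<>]*>)(?:(?=[<\s])|$)/g, '<w>$1</w>')
    // A whitespace run that is not inside a tag.
    .replace(/(\s+)(?![^<>]*>)/g, '<s>$1</s>');
}

console.log(wrapWordsAndSpaces('This is   <b>very bold</b> word.'));
// <w>This</w><s> </s><w>is</w><s>   </s><b><w>very</w><s> </s><w>bold</w></b><s> </s><w>word.</w>

console.log(wrapWordsAndSpaces('<span style="font-weight:bold">very bold</span>'));
// <span style="font-weight:bold"><w>very</w><s> </s><w>bold</w></span>
```

Note that punctuation attached to a word (the final "." here) ends up inside the <w> element. The "not inside a tag" check is only a heuristic: it assumes that "<" and ">" never appear inside attribute values or as literal text.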


5 Comments

Can you explain the (?=...) part?
@SeanVaughn - Part (?=\s) means "followed by whitespace"
Great, but is it possible to modify your solution to handle "complex" tags for example <span style="font-weight:bold">very bold</span> instead of <b>very bold</b>?
Doesn't work for nested elements jsfiddle.net/EfzW8/1 ("bold" not wrapped in <w>). You can add as many special cases as you want, I will always find a counterexample by definition en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy.
@Prinzhorn - First of all, OP asks for regex solution, so it is nice to find the closest regex solution. Your most recent example has data-foo="<bar>" tag parameter, which is possible, but very unlikely. As I don't know what kind of HTML source code OP needs to parse, it is hard to say how far with regex complexity we have to go. I believe my most recent code should work for OP.
0

Regular expressions are not suited to every task. If your string can contain arbitrary HTML, then it's not possible to handle all cases using regular expressions, because HTML is a context-free language and regular expressions cover only a subset of those languages (the regular ones). Now, before messing around with loops and a load of code to handle this, let me suggest the following:

If you are in a browser environment or have access to a DOM library, you could put this string inside a temporary DOM element, then work on the text nodes and then read the string back.

Here's an example using a library I wrote some months ago and have just updated, called Linguigi:

var element = document.createElement('div');
element.innerHTML = 'This is   <b>very bold</b> word.';

var ling = new Linguigi(element);

ling.eachWord(true, function(text) {
    return '<w>' + text + '</w>';
});

ling.eachToken(/ +/g, true, function(text) {
    return '<s>' + text + '</s>';
});

alert(element.innerHTML);

Example: http://prinzhorn.github.com/Linguigi/ (hit the Stackoverflow 12758422 button)
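
If you'd rather not pull in a library, the same "work on the text nodes" idea can be sketched with standard DOM APIs only. This is a rough illustration, not how Linguigi works internally; the names `tokenize` and `wrapTextNodes` are made up here, and the code assumes a browser-like environment that provides `createTreeWalker` and `NodeFilter`.

```javascript
// Split a text run into alternating word / whitespace tokens.
// `tokenize` is a hypothetical helper name.
function tokenize(text) {
  // \s+ matches a run of whitespace, \S+ a run of anything else.
  return (text.match(/\s+|\S+/g) || []).map(function (chunk) {
    return { tag: /^\s/.test(chunk) ? 's' : 'w', text: chunk };
  });
}

// Replace every text node under `root` with <w>/<s> elements.
// Assumes a browser-like DOM (createTreeWalker, NodeFilter).
function wrapTextNodes(root) {
  var doc = root.ownerDocument;
  var walker = doc.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  var textNodes = [];
  // Collect the nodes first so the tree is not modified while it is walked.
  while (walker.nextNode()) {
    textNodes.push(walker.currentNode);
  }
  textNodes.forEach(function (node) {
    var fragment = doc.createDocumentFragment();
    tokenize(node.nodeValue).forEach(function (token) {
      var el = doc.createElement(token.tag);
      el.textContent = token.text;
      fragment.appendChild(el);
    });
    node.parentNode.replaceChild(fragment, node);
  });
}

// Usage in a browser:
//   var element = document.createElement('div');
//   element.innerHTML = 'This is   <b>very bold</b> word.';
//   wrapTextNodes(element);
//   // element.innerHTML now holds the wrapped markup
```

Because the browser's parser has already separated tags from text, nesting and attribute values need no special handling here; only the text nodes are ever inspected.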
