I have data in the following format:

string1='<id1> <id2> "abc <id3> ".'
string2='<id_4> <id_5> <id_6>.'

I want to split these into (<id1>, <id2>, "abc <id3> ") and (<id_4>, <id_5>, <id_6>). I tried re.split('(?<=)\s+(?=<)',string1), but it incorrectly splits string1 into (<id1>,<id2>,"abc <id3>"), although it splits string2 as desired.

How can I split on <> without splitting when <> appears inside quotes?

The delimiters here are <> and "". If we find <, we look for the closing >. If we find ", we look for the closing ". For string1 (string1='<id1> <id2> "abc <id3> ".'): I start with <, find id1 and its closing angle bracket; then I find < again and look for the closing angle bracket, which gives <id2>; then I start with " and look for the " before the dot, which gives "abc <id3> ".

  • Try a new approach with re.findall, it's easier. Commented Apr 6, 2015 at 22:56
  • @CasimiretHippolyte Thanks a lot, but I did not get it. Can you please explain with the help of an example? Commented Apr 6, 2015 at 22:58
  • Instead of trying to split the string, try to describe the items you want. (so you want parts between angle brackets or parts between double quotes) Commented Apr 6, 2015 at 23:00
  • @CasimiretHippolyte If angle brackets appear first, then I want the part between angle brackets, e.g. <id_6>. However, if quotes appear before angle brackets, then I want the part between quotes, e.g. "abc <id3> ". It's just like this for string1: you start with < and find the closing > for <id1>, then you start with < again and find the closing > for <id2>, then you start with a quote and look for the last quote before the dot, i.e. "abc <id3> ". Commented Apr 6, 2015 at 23:14
  • You only need an alternation | (a logical OR) to separate the two different subpatterns. Keep in mind that the regex engine tests the pattern at each position in the string from left to right. So if an angle bracket is found, one subpattern succeeds; if a double quote is found, the other subpattern succeeds. Commented Apr 6, 2015 at 23:20
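
To make the alternation idea from the comments concrete, here is a minimal sketch. The group names angle and quoted and the helper label_items are made up for illustration; they are not from the question. It only shows which side of the | succeeded for each item as the engine scans left to right.

```python
import re

# Two alternatives joined by | : one for <...> items, one for "..." items.
# The group names "angle" and "quoted" are illustrative assumptions.
ITEM = re.compile(r'(?P<angle><.*?>)|(?P<quoted>".*?")')

def label_items(line):
    """Return (alternative_name, matched_text) pairs in left-to-right order."""
    # m.lastgroup is the name of the group that matched for this item.
    return [(m.lastgroup, m.group(0)) for m in ITEM.finditer(line)]

string1 = '<id1> <id2> "abc <id3> ".'
labelled = label_items(string1)
for kind, text in labelled:
    print(kind, text)
```

Because the quoted alternative starts matching at the first ", it consumes the <id3> inside the quotes before the angle-bracket alternative ever gets a chance to start there.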

1 Answer

I think that you should get what you need using the following regular expression and re.findall:

re.findall('<.*?>|".*?"', string1)

This matches <id1>, <id2> and "abc <id3> "

Similarly,

re.findall('<.*?>|".*?"', string2)

matches <id_4>, <id_5> and <id_6>.
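
As a quick check, the two calls can be wrapped in a small script. This is only a sketch: the helper name split_terms is hypothetical, and the result is converted to a tuple purely to mirror the format used in the question.

```python
import re

# Same pattern as above: a lazy <...> match OR a lazy "..." match.
PATTERN = re.compile(r'<.*?>|".*?"')

def split_terms(line):
    """Return the <...> and "..." items of line as a tuple (hypothetical helper)."""
    return tuple(PATTERN.findall(line))

string1 = '<id1> <id2> "abc <id3> ".'
string2 = '<id_4> <id_5> <id_6>.'

print(split_terms(string1))
print(split_terms(string2))
```

Note that this describes the items rather than the separators, so the trailing dot is simply never matched and needs no special handling.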

4 Comments

Sorry, I need (<id1>, <id2>, "abc <id3> "). I made an edit to the post. Sorry for the blunder, but perhaps an expert like you can help with this too.
@StegVerner OK, I edited my answer. Please let me know if that works better.
Perfect, this works well. But if my input is str1='<P> <c> <I("N")>', then the output should be (<P>, <c>, <I("N")>), yet I get (<P>, <c>, '"N"'). That looks incorrect to me: the opening bracket in <I comes before the ", so the match should end at the closing bracket. Thanks for all the help; I guess this tweak would be super easy for a genius like you.
@StegVerner I think this last edit should do the trick. Let me know if it does not.
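
For the nested case raised in the comments, here is a small sketch comparing two patterns. The stricter pattern <\w+> is only an assumption used to reproduce the reported '"N"' output; the earlier revision of the answer is not shown here, so it may have been different. The variable names are illustrative.

```python
import re

str1 = '<P> <c> <I("N")>'

# Assumed stricter pattern: only word characters allowed between < and >.
# <I("N")> fails this alternative, so the engine later picks up "N" alone.
strict = re.findall(r'<\w+>|".*?"', str1)

# Pattern from the answer: anything (lazily) up to the next > or ".
lazy = re.findall(r'<.*?>|".*?"', str1)

print(strict)
print(lazy)
```

With the lazy pattern, the match that starts at <I runs to the next >, so the quotes inside the brackets are kept as part of that item. An item containing a literal > inside the brackets would still be cut short; that case does not appear in the question.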