
I have a large text file that acts as a "database", and I need to extract a specific part that starts with one string and ends with another.

To be more specific, part of the "database" looks like this:

i:24;s:5:"sName";s:12:"adsfasdffdfd";s:7:"iStatus";i:1;s:9:"iPosition";i:0;s:17:"sDescriptionShort";s:29:"<p>test short description</p>";s:16:"sDescriptionFull";s:28:"<p>test full description</p>";

I need to extract the part between <p> and </p>, using the number in the leading i:24 as the parameter.
I tried using a regexp, but with no success so far.
I know it's not good practice to ask for the code itself, but this time I'm really stuck! Any ideas?
P.S. The file contains strings like this one after another. So I need the regexp to find i:$a, where $a is my number, and return the first paragraph it encounters after that.

So what I expect to be returned is: <p>test short description</p>, since this is the first paragraph encountered AFTER i:24.

  • This seems more like an invalid serialized string. Commented Jun 27, 2016 at 19:32
  • I know, but the CMS stores the whole content like this... :( Commented Jun 27, 2016 at 19:33
  • Then there must be something wrong with the CMS. This seems like it should be a serialized string. Try to fix the real cause of this instead of implementing a workaround with an unnecessary regex. Commented Jun 27, 2016 at 19:34
  • You can use: i:24.*?\K<p>[^<]*</p> Commented Jun 27, 2016 at 19:49

1 Answer

So you're looking for text that comes after the literal i:24? Since none of these characters are special in a regex, let's begin our pattern construction with that literal sequence...

i:24

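Next, there may or may not be more characters to consume between the i:24 and the opening <p> tag. Let's assume that these characters can be anything, so we'll use the wildcard metacharacter . with the zero-or-more quantifier * (equivalent to {0,}), giving us the pattern below. Keep in mind that . does not match a newline unless the single-line (DOTALL) flag is set.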

i:24.*

We want to tame the regex engine's appetite, so let's make the quantifier non-greedy (lazy) by appending ?, so that it stops at the first <p> it can reach rather than the last one.

i:24.*?
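
To see what the non-greedy modifier changes, here is a minimal sketch in Python (the thread does not name a language, so Python's re module is an assumption, as is the variable name record). It appends <p> to both forms of the pattern and compares where each match stops on the sample record from the question.

```python
import re

# Sample record copied from the question.
record = (
    'i:24;s:5:"sName";s:12:"adsfasdffdfd";s:7:"iStatus";i:1;'
    's:9:"iPosition";i:0;s:17:"sDescriptionShort";'
    's:29:"<p>test short description</p>";'
    's:16:"sDescriptionFull";s:28:"<p>test full description</p>";'
)

# Greedy: .* runs to the end of the text, then backtracks to the LAST <p>.
greedy = re.search(r'i:24.*<p>', record)

# Lazy: .*? expands one character at a time and stops at the FIRST <p>.
lazy = re.search(r'i:24.*?<p>', record)

# The text that follows each match shows which <p> was reached.
greedy_rest = record[greedy.end():]
lazy_rest = record[lazy.end():]

print(greedy_rest[:21])  # text right after the last <p>
print(lazy_rest[:22])    # text right after the first <p>
```

With the greedy form the match ends at the opening tag of the full description; with the lazy form it ends at the opening tag of the short description, which is the one the question asks for.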

Next we want to match AND CAPTURE an opening <p>...

i:24.*?(<p>)

...and the content inside the <p> tag, which we'll assume can be anything (read: wildcard) and possibly nothing, so again the zero-or-more quantifier * ({0,}).

i:24.*?(<p>.*)

Remember to tame this * quantifier's appetite as well, so that it doesn't run past the first closing tag and swallow later <p> tags.

i:24.*?(<p>.*?)

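And finally we'll close it off by consuming and capturing the closing </p> tag. The forward slash is escaped because / is commonly used as the pattern delimiter (for example /.../ in PHP or JavaScript); it is not otherwise a special regex character, and the escape is harmless where it isn't needed.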

i:24.*?(<p>.*?<\/p>)
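
As a rough sketch of how the finished pattern could be used with the number as a parameter, here is a Python version (Python is an assumption, since the thread does not name a language; the function name first_paragraph_after and the second record in sample are made up for illustration). It makes two additions beyond the pattern above, both assumptions about the data: a \b before i and a ; after the number so that 24 does not also match i:240 or ii:24, and the DOTALL flag so that . can cross line breaks.

```python
import re
from typing import Optional


def first_paragraph_after(data: str, record_id: int) -> Optional[str]:
    """Return the first <p>...</p> found after 'i:<record_id>;', or None."""
    # \b and the trailing ';' are assumptions about the format: they keep
    # id 24 from matching 'i:240' or 'ii:24'. Note that values such as
    # 'i:1;' (iStatus) look the same as keys, so small ids may match a value.
    pattern = re.compile(
        rf'\bi:{record_id};.*?(<p>.*?</p>)',
        re.DOTALL,  # let '.' match newlines in case a record spans lines
    )
    match = pattern.search(data)
    return match.group(1) if match else None


# First record is from the question; the second one is hypothetical.
sample = (
    'i:24;s:5:"sName";s:12:"adsfasdffdfd";s:7:"iStatus";i:1;'
    's:9:"iPosition";i:0;s:17:"sDescriptionShort";'
    's:29:"<p>test short description</p>";'
    's:16:"sDescriptionFull";s:28:"<p>test full description</p>";'
    'i:25;s:5:"sName";s:3:"abc";s:17:"sDescriptionShort";'
    's:13:"<p>second</p>";'
)

print(first_paragraph_after(sample, 24))
print(first_paragraph_after(sample, 25))
```

The capture group is what gets returned, so the i:24 prefix and everything before the first <p> are matched but discarded. If the file really is PHP-serialized data, as the comments on the question suggest, parsing it with a proper unserializer would be more robust than any regex.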

Hope this works for what you're trying to accomplish.


1 Comment

Thank you very much! Your explanation is very clean and easy to understand. Great content :)
