
I am trying to extract a value from a specific piece of HTML, so far without success. I cannot use HTML Agility Pack because, as far as I can tell, it only returns the text between HTML tags.

public static string[] split_comments(string html)
{
    html = html.ToLower();

    html = html.Replace(@"""", " ");

The actual line in the HTML is this:

<meta itemprop="rating" content="4.7">

The 4.7 value changes every time, and I need to extract it.

Match match = Regex.Match(html, @"<meta itemprop=rating content=([A-Za-z0-9\-]+)\>$");
if (match.Success)
{
    // Finally, we get the Group value and display it.
    string key = match.Groups[1].Value;
}

So I am trying to find an HTML tag and read a value from it that is different every time.

  • See the question stackoverflow.com/questions/10077320/… Commented Apr 10, 2012 at 7:08
  • Not sure I understand why you can't use HtmlAgilityPack Commented Apr 10, 2012 at 7:12
  • Do not use a regex to parse structured content. What happens when unparsed entities like &amp; or &eacute; are used, or numeric entities like &#x61;? What if the attributes are in another order, the attribute content is spread over multiple lines, or XML rather than HTML tag endings are used? Regexes are good, but not for this task. See stackoverflow.com/questions/3406174/… and check HtmlAgilityPack to do this the stable and reliable way. Commented Apr 10, 2012 at 7:13
  • "as it gives the data only present in between html tags" >> No. It also gives the data in attributes. Show what you tried and we'll help you fix it. Commented Apr 10, 2012 at 7:15
  • HTML Agility Pack only gives the data between the tags, like <h1>I give you this data, html agility pack</h1>, but what about this: <h1 style="I need this"></h1> Commented Apr 10, 2012 at 7:16

5 Answers

string html = "<meta itemprop=\"rating\" content=\"4.7\">";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var content = doc.DocumentNode
                .Element("meta")
                .Attributes["content"].Value;

--EDIT--

From your first accepting and then unaccepting the answer, I guess you took the code, ran it against your real HTML, and saw that it returned the wrong result.

That doesn't mean the answer is incorrect, since it works correctly with the snippet you posted.

So, making a wild guess and assuming that your real HTML contains other meta tags with itemprop attributes, like

<meta itemprop="rating" content="4.7">
<meta itemprop="somekey" content="somevalue">

the code would be:

var content = doc.DocumentNode
                .Descendants("meta")
                .Where(n => n.Attributes["itemprop"] != null && n.Attributes["itemprop"].Value == "rating")
                .Select(n => n.Attributes["content"].Value)
                .First();

First, you should replace this:

html = html.Replace(@""""," ");

with this:

html = html.Replace(@"""","");

and change your regex to:

Match match = Regex.Match(html, @"<meta itemprop=rating content=([A-Za-z0-9\-.]+)\>$");

Otherwise your if will always be false: the quotes become spaces, which the pattern does not expect, and the character class has no dot, so it cannot match 4.7. After that, you could simply use Substring:

 html = html.Substring(html.IndexOf("content=") + 8);

 html = html.Substring(0, html.Length - 1);
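The two Substring calls above assume that html contains nothing but the single meta tag with its quotes already removed. As a rough sketch of the same idea applied to a larger page with the quotes left in place, something like the following could work. RatingSubstring and Extract are made-up names for illustration, and the code assumes that content="..." comes after itemprop="rating" inside the same tag:

```csharp
using System;

static class RatingSubstring
{
    // Hypothetical helper: returns the content value that follows the first
    // itemprop="rating" marker, or null if the marker or attribute is missing.
    // Assumption: content="..." appears after itemprop="rating" in the same tag.
    public static string Extract(string html)
    {
        const string marker = "itemprop=\"rating\"";
        const string attr = "content=\"";

        int tag = html.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (tag < 0) return null;

        int start = html.IndexOf(attr, tag, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += attr.Length;

        // The value ends at the next double quote.
        int end = html.IndexOf('"', start);
        return end < 0 ? null : html.Substring(start, end - start);
    }

    static void Main()
    {
        string html = "<html><head><meta itemprop=\"rating\" content=\"4.7\"></head></html>";
        Console.WriteLine(Extract(html)); // prints 4.7
    }
}
```

This is still fragile if the attribute order changes, so treat it as a quick fix rather than a replacement for a real HTML parser.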

I hope that helps.


Here

html = html.Replace(@""""," "); 

you replace double quotes with spaces. Thus, your example string now looks like this:

<meta itemprop= rating  content= 4.7 > 

Your regex, however, expects text without those extra spaces. Also, its character class [A-Za-z0-9\-] contains no dot, so it cannot match 4.7. (The \> before the $ is only an escaped >, so it does match the literal > in the example.)
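To make this concrete, here is a small sketch that prints what the Replace call produces and then matches the original, still quoted tag directly. RatingRegex and Extract are names I made up, and the pattern assumes that itemprop comes immediately before content, as in your example:

```csharp
using System;
using System.Text.RegularExpressions;

static class RatingRegex
{
    // Hypothetical helper: matches the tag with its quotes left in place.
    // Assumption: itemprop="rating" is directly followed by content="...".
    public static string Extract(string html)
    {
        Match m = Regex.Match(
            html,
            @"<meta\s+itemprop=""rating""\s+content=""([^""]*)""\s*/?>",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        string html = "<meta itemprop=\"rating\" content=\"4.7\">";

        // Replacing quotes with spaces leaves gaps the original pattern does not expect.
        Console.WriteLine(html.Replace(@"""", " "));

        // Matching the unmodified HTML avoids the problem entirely.
        Console.WriteLine(Extract(html)); // prints 4.7
    }
}
```

Because the capture group is [^"]*, it accepts the dot in 4.7 as well as any other value that does not contain a double quote.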

1 Comment

I did change it to html = html.Replace(@"""",""); and also to meta itemprop=rating content=([A-Za-z0-9\-]+)\>$"). Doesn't it have the backslash already?

Your regex should be something like @"<meta.+?content=""(.+?)"">" (inside a verbatim string, a double quote is written as ""). That said, parsing HTML with regex is a bad idea.

1 Comment

HTML Agility Pack only gives the data between the tags, like <h1>I give you this data, html agility pack</h1>, but what about this: <h1 style="I need this"></h1>

Try this:

        double searchedValue;
        Regex reg = new Regex(@"content= (?<groupname>.*?) >");
        var matches = reg.Match(@"<meta itemprop= rating  content= 4.7 >");
        var value = matches.Groups["groupname"].Value;
        // Parse with the invariant culture so "4.7" works even where ',' is the
        // decimal separator (requires using System.Globalization;)
        double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out searchedValue);

(?<groupname> ... ) sets up a named group. You can access its value with matches.Groups["groupname"].Value

.*? is lazy: it reads only up to the first occurrence of " >".

If you leave out the "?", the quantifier becomes greedy and reads up to the last occurrence of " >" in your text.
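A quick way to see the difference is to run both patterns over an input with two tags. This is only a sketch: LazyVsGreedy and Capture are made-up names, and the second tag in the input is invented for the demonstration:

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyVsGreedy
{
    // Hypothetical helper: returns what the named group captured for a pattern.
    public static string Capture(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups["groupname"].Value;
    }

    static void Main()
    {
        // Two tags on one line, with quotes already replaced by spaces.
        string input = "<meta itemprop= rating  content= 4.7 ><meta itemprop= other  content= x >";

        // Lazy: stops at the first " >" after "content= ".
        Console.WriteLine(Capture(@"content= (?<groupname>.*?) >", input));

        // Greedy: runs on to the last " >" in the input.
        Console.WriteLine(Capture(@"content= (?<groupname>.*) >", input));
    }
}
```

The lazy version captures only 4.7, while the greedy one swallows everything up to the end of the second tag.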

Good luck =)
