Extracting data from an HTML file using a C# script

Question

What I need to do: Extract the From, To, Cc and Subject information from an HTML file and remove it from the file, without using any third-party library (HtmlAgilityPack, etc.).

What I am having trouble with: What approach should I take to get the From, To, Subject and Cc values out of the HTML tags?

Steps I tried: I tried getting the index of <p class=MsoNormal> and the last index of the email address (@sampleemail.com), but I think that is a bad approach because some HTML files contain many "<p class=MsoNormal>" tags. To remove the From, To, Cc and Subject, I used string.Remove(indexOf, count), where count is the number of characters from indexOf to lastIndexOf, and it worked.

Sample tag containing information of from:

<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>[email protected]<o:p></o:p></span></p>

HTML FILE output:

Alfred Luu · Accepted Answer · 2020-05-05 08:46:17Z

3

HtmlAgilityPack is your friend. You can use an XPath expression such as //p[@class='MsoNormal'] to get the content of the matching tags in the HTML:

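using System;
using HtmlAgilityPack; // third-party NuGet package

public static class Program
{
    public static void Main()
    {
        var html =
        @"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>[email protected]<o:p></o:p></span></p>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath expression.
        var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class='MsoNormal']");
        if (nodes == null)
            return;

        // InnerText is the text of the node with all nested tags stripped.
        foreach (var node in nodes)
            Console.WriteLine(node.InnerText);
    }
}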

Result:

From:[email protected]

Update

Without a library, we can use Regex to write a simple parser. Keep in mind that a regex cannot handle every case in a complicated HTML document.

    // Requires: using System; using System.Text.RegularExpressions;
    public static void MainFunc()
    {
        string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>[email protected]<o:p></o:p></span></p>";
        // Remove every tag, allowing '>' inside quoted attribute values, so only the text remains.
        var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
        Console.WriteLine(result);
    }
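The same idea can be pushed a little further to cover both halves of the question: reading the From, To, Cc and Subject values and removing their paragraphs. What follows is only a minimal sketch, not a general HTML parser. The class name HeaderExtractor and the method Extract are hypothetical names chosen for illustration, and the sketch assumes that each header sits in its own <p class=MsoNormal> paragraph, that paragraphs are not nested, and that no attribute value contains a '>' character. A body paragraph whose visible text happens to start with "To:" or similar would also be treated as a header.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HeaderExtractor
{
    // One <p class=MsoNormal ...>...</p> block; the class value may be quoted or unquoted.
    private static readonly Regex Paragraph = new Regex(
        @"<p\s+class=['""]?MsoNormal['""]?[^>]*>.*?</p>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any tag. Simplified: assumes there is no '>' inside attribute values.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // Visible text of a header paragraph, for example "From: someone".
    private static readonly Regex Header = new Regex(
        @"^\s*(From|To|Cc|Subject)\s*:\s*(.*?)\s*$",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Stores each header value in 'headers' and returns the HTML without the header paragraphs.
    public static string Extract(string html, IDictionary<string, string> headers)
    {
        return Paragraph.Replace(html, match =>
        {
            // Strip the tags, then decode entities such as &lt; or &nbsp;.
            string text = WebUtility.HtmlDecode(Tag.Replace(match.Value, ""));
            Match header = Header.Match(text);
            if (!header.Success)
                return match.Value; // not a header paragraph, keep it unchanged

            headers[header.Groups[1].Value] = header.Groups[2].Value;
            return "";              // header paragraph, drop it from the output
        });
    }
}
```

To use it, create a dictionary such as new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase), pass it to HeaderExtractor.Extract(html, headers), and read headers["From"] and the other keys afterwards. The returned string is the HTML with the matched header paragraphs removed. Matching whole paragraphs instead of counting characters between IndexOf and LastIndexOf avoids depending on where the email address ends.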

edited May 5, 2020 at 8:46

answered May 5, 2020 at 5:20

Alfred Luu


8 Comments

keinz Over a year ago

I forgot to add in my question that my requirement is not to rely on third-party libraries like HtmlAgilityPack. Is this possible?

Alfred Luu Over a year ago

@lonewolfkein HtmlAgilityPack is built on System.Xml.XPath.XPathDocument. You have three choices: HtmlAgilityPack for simplicity, more code with System.Xml, or complicated code with your own parser.

Alfred Luu Over a year ago

@lonewolfkein Yes, it seems we cannot use System.Xml because it validates the HTML before applying XPath. The more specific the HTML is, the more cases we can test against a regex when writing a simple parser. If you can isolate that line, I suggest a simple regex pattern. I updated my answer.

keinz Over a year ago

It seems I can't use it because it's a complicated HTML document. Thanks for the help, tan.

keinz Over a year ago

Again, thank you for all this information; I'm new to this. I'm really grateful and will check it out.
