
What I need to do: Extract the From, To, Cc, and Subject information from an HTML file and then remove it from the file, without using any third-party library (HtmlAgilityPack, etc.).

What I am having trouble with: What approach should I take to get these fields (From, To, Cc, Subject) out of the HTML tags?

Steps I tried: I looked up the index of <p class=MsoNormal> and the last index of the email domain @sampleemail.com, but I think that is a bad approach, since some HTML files contain many <p class=MsoNormal> tags. To remove the From, To, Cc, and Subject fields, I used string.Remove(startIndex, count), where count is the number of characters from the IndexOf position to the LastIndexOf position, and that worked.

Sample tag containing information of from:

<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>[email protected]<o:p></o:p></span></p>                                     


1 Answer

HtmlAgilityPack is your friend. You can use an XPath expression such as //p[@class='MsoNormal'] to get the content of the matching tags in the HTML:

using System;
using HtmlAgilityPack;

public static class Program
{
    public static void Main()
    {
        var html =
        @"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>[email protected]<o:p></o:p></span></p>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class='MsoNormal']");
        if (nodes == null) return;

        // InnerText is the paragraph's text with all nested tags dropped.
        foreach (var node in nodes)
            Console.WriteLine(node.InnerText);
    }
}

Result:

From:[email protected]

Update

If a third-party library is not an option, a regular expression can serve as a simple tag stripper. Keep in mind that it will not handle every case in a complicated HTML document.

    // Requires: using System; using System.Text.RegularExpressions;
    public static void MainFunc()
    {
        string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>[email protected]<o:p></o:p></span></p>";
        // Remove every tag; the quoted alternatives tolerate a '>' inside attribute values.
        var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
        Console.WriteLine(result);
    }
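
The tag stripper above only yields the text. To also pull out the From, To, Cc, and Subject values and delete their paragraphs, something along these lines might work using only the base class library. This is a rough sketch, not a general HTML parser: the class name MailHeaderStripper and the method ExtractAndRemove are made up for illustration, and it assumes each header sits in its own <p class=MsoNormal> paragraph (as in the sample) with the label at the start of the paragraph text.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class MailHeaderStripper
{
    // One <p ... class=MsoNormal ...>...</p> paragraph; the class value may be bare or quoted.
    private static readonly Regex Paragraph = new Regex(
        @"<p\b[^>]*\bclass=['""]?MsoNormal\b['""]?[^>]*>(.*?)</p>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any remaining tag inside the paragraph (assumes no '>' inside attribute values).
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // A header label at the start of the paragraph text, followed by its value.
    private static readonly Regex Header = new Regex(
        @"^\s*(From|To|Cc|Subject)\s*:\s*(.*)$",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Fills 'headers' with the values found and returns the HTML without those paragraphs.
    public static string ExtractAndRemove(string html, IDictionary<string, string> headers)
    {
        return Paragraph.Replace(html, m =>
        {
            // Drop nested tags, then decode entities such as &lt; or &amp;.
            string text = WebUtility.HtmlDecode(Tag.Replace(m.Groups[1].Value, ""));
            Match h = Header.Match(text);
            if (!h.Success) return m.Value;   // not a header paragraph: keep it unchanged

            headers[h.Groups[1].Value] = h.Groups[2].Value.Trim();
            return "";                        // header paragraph: remove it from the output
        });
    }
}
```

Call it with a dictionary created as new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase) so that "From" and "FROM" land on the same key. Note that a body paragraph that happens to start with "To:" or "Subject:" would be removed as well, so on real files it is probably worth stopping after the first block of header paragraphs.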


8 Comments

I forgot to add in my question that one of my requirements is not to rely on third-party libraries such as HtmlAgilityPack. Is this possible?
@lonewolfkein HtmlAgilityPack builds on the System.Xml.XPath API. You have three choices: HtmlAgilityPack for simplicity, more code with System.Xml, or complicated code with your own parser.
@lonewolfkein Yes, it seems we cannot use System.Xml, because it validates the HTML as XML before XPath can be used. The more specific the HTML, the more cases can be tested against a regex when writing a simple parser. If you can isolate that line, I suggest a simple regex pattern. I have updated my answer.
Seems like I can't use it because it's a complicated HTML document. Thanks for the help, Tan.
Again, thank you for all this information; I'm new to this. I'm really grateful and will check it out.
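
To illustrate the System.Xml point from the comments: XmlDocument only accepts well-formed XML, and Word-style markup such as class=MsoNormal (an unquoted attribute value) is not. A small check like the following, with the hypothetical names XmlLoadCheck and IsWellFormedXml, shows the difference.

```csharp
using System;
using System.Xml;

// Hypothetical helper for illustration only.
public static class XmlLoadCheck
{
    // Returns true when the markup parses as XML, false when XmlDocument rejects it.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for unquoted attributes, undeclared prefixes such as o:p, unclosed tags, etc.
            return false;
        }
    }
}
```

With this, "<p class=MsoNormal>From:</p>" is rejected while "<p class='MsoNormal'>From:</p>" loads, which is why the XPath route through System.Xml needs cleaned-up input first.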