I want to extract some data from a site using HTML Agility Pack. Surprisingly, the page cannot be parsed correctly when I load it from its remote address, so I have to save the file to the local system first and then run HTML Agility Pack on it. How can I copy this file to my server and then use HTML Agility Pack to analyze it and extract the data?

For instance, this is my remote file: www.testsite.com/testfile.html

I want to save this file to my server and then work with the local copy (I use C#).

2 Answers

1

After some investigation, I found that WebRequest does not give you the complete HTML source, because other parts of the page are loaded separately (data fetched via AJAX, CSS, images, etc.). One way to get the complete HTML of a page is to use the WebBrowser control, but that requires a Windows Forms application. Try this solution:

  1. Create a Windows Forms application.

  2. Drag a WebBrowser control from the Toolbox onto the form.

  3. In the form's Load event handler, add the following code.

    webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
    webBrowser1.Url = new Uri("http://tse.ir/default.aspx");

  4. Add the following method.

    // Requires: using System.IO; using System.Text; using System.Windows.Forms;
    private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        var browser = sender as WebBrowser;
        if (browser == null) return;

        var htmlPath = Path.Combine("C:\\Test", "testhtml.html");
        // The using block flushes and closes the writer, so no explicit Close() is needed.
        using (var writer = new StreamWriter(htmlPath, false, Encoding.UTF8))
        {
            writer.WriteLine(browser.DocumentText);
        }
    }
    
  5. Run your application and check the saved file.

6 Comments

Thanks, it works, but there is something strange. Please try this address (it is in Persian, but you can easily see the difference between the remote and local files). When this page is saved, almost nothing is stored locally; there is only a warning that the page is currently unavailable and to try again later. What is going wrong? Why is this page not saved correctly?
Can you give an example URL of a site where you encounter that warning?
tse.ir/default.aspx is not saved correctly, and neither is this page: jahanesanat.ir/currency.html
-1 for 1) eating exceptions, and 2) not using using blocks.
Try one more time; I added an encoding. The line becomes var writer = new StreamWriter(htmlPath, false, Encoding.UTF8); I just updated the answer.
0

You can use HttpWebRequest and HttpWebResponse:

// Requires: using System.IO; using System.Net;
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.testsite.com/testfile.html");
// A plain GET is enough to download a page.
req.Method = "GET";
// Only if the site expects form data, switch to POST and send a body instead:
//req.Method = "POST";
//req.ContentType = "application/x-www-form-urlencoded";
//string login = string.Format("go=&Fuser={0}&Fpass={1}", user, password);
//byte[] postbuf = Encoding.ASCII.GetBytes(login);
//req.ContentLength = postbuf.Length;
//using (Stream rs = req.GetRequestStream()) rs.Write(postbuf, 0, postbuf.Length);
WebResponse resp = req.GetResponse();

Now you can read the response as a stream and save it as an HTML file:

// Read the data via the response stream and copy it to a local file.
string filename = ...;

byte[] buffer = new byte[1024];
// The using blocks close both streams even if the copy fails.
using (Stream receiveStream = resp.GetResponseStream())
using (FileStream outFile = new FileStream(filename, FileMode.Create))
{
    int bytesRead;
    while ((bytesRead = receiveStream.Read(buffer, 0, buffer.Length)) != 0)
    {
        outFile.Write(buffer, 0, bytesRead);
    }
}
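
For reference, the request and the save step can be folded into one small helper. This is only a sketch under a few assumptions: the class name PageDownloader and the method names SaveStreamToFile and DownloadToFile are made up for illustration, the page is reachable with a plain GET, and the raw response body is all you need. It will not contain anything the page loads later through scripts, which is the limitation the other answer describes.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper; the names are illustrative and not part of any library.
public static class PageDownloader
{
    // Copies everything from source into a file at path, overwriting an existing file.
    public static void SaveStreamToFile(Stream source, string path)
    {
        using (var outFile = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            source.CopyTo(outFile);
        }
    }

    // Requests url (GET is the default method for HTTP) and writes the raw
    // response body to path. Assumes no login or form data is required.
    public static void DownloadToFile(string url, string path)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            SaveStreamToFile(body, path);
        }
    }
}
```

A call might look like PageDownloader.DownloadToFile("http://www.testsite.com/testfile.html", "C:\\Test\\testfile.html"); where the local path is just a placeholder. The saved file can then be opened with HTML Agility Pack, for example through HtmlDocument.Load with that path.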

1 Comment

It gives me an error. What is postbuf? How can I use it? Why is it commented out?
