
I have a web application that needs to pull the h1 tag, the first image, and the first few lines of text from an external web page. I have never done this before and I think it is best done with jQuery, but I'm not sure. Can you please point me in the right direction or give a coding example in .NET and jQuery? Thanks.

I'm thinking of something like how Facebook pulls out the picture and a few lines when you type a URL into the post box for a new post.

2
  • If possible, can you post the URL of the external web page? Thanks. Commented May 24, 2014 at 20:00
  • Are you interested in all the h1 tags or just the first one; as for the image, it is clear? First five lines of text ok? Commented May 27, 2014 at 16:56

4 Answers

1

You cannot fetch an arbitrary URL's markup with AJAX because of the browser's same-origin policy: a cross-origin request only succeeds if the target site opts in through CORS (cross-origin resource sharing), and most sites on the web won't permit just anyone to use their content. What you should do in your case is use a proxy method on your server.

Create an action that receives a URL and fetches its markup on your server, then use AJAX to request the page's HTML through your new action.

From there you have two options: either parse the HTML on the server, extract the data you need, and send only that back to the client, or send all of the HTML back to the client. I highly recommend parsing on the server; it uses less bandwidth, and your server probably has better performance than most browsers provide.

If you decide to analyze the markup on the client, the simplest way is to load the HTML into a root element and then query for the data using regular methods.

For example:

var $root = $('<div>').html(response.html);
console.log($root.find('h1')); // all h1 tags in response's html

The downside here is that once you've allowed the browser to parse your markup it will automatically load any resources that were present, such as images.
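
One possible workaround, sketched here under the assumption that the markup arrives as a plain string (response.html above): rename each img's src attribute before handing the string to jQuery, so the browser has nothing to fetch. The helper name deferImages and the data-src attribute are hypothetical choices for illustration, and the regex is deliberately simple; it only touches img tags, not stylesheets, scripts, or srcset.

```javascript
// Hypothetical helper: rename src to data-src on every <img> tag in an HTML string.
// Regex-based sketch; it does not handle a ">" inside attribute values.
function deferImages(html) {
    // the \s before "src" requires whitespace, so an existing data-src is not matched
    return html.replace(/<img\b([^>]*?\s)src\s*=/gi, '<img$1data-src=');
}

var safeHtml = deferImages('<h1>Title</h1><img alt="x" src="a.png">');
console.log(safeHtml);
```

With something like this in place, $('<div>').html(deferImages(response.html)) should build the tree without image requests, and the first image URL can be read with .find('img').first().attr('data-src').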

I don't use .NET, so I can't point you to the exact tools you need, but I suggest looking for ways to accomplish these two tasks on the server:

  1. Read a given URL content into a string.
  2. Use any given DOM parser, pass it the HTML string and query for the data.

0

You could try a mix of jQuery and PHP, or whatever you have on the server:

//requestExternalURL.php

<?php
    $url = "http://url...";
    $homepage = file_get_contents($url);
    echo $homepage;
?>

and with ajax/jquery:

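$(".target").load("requestExternalURL.php", function(){
    // search only inside the loaded markup, not the whole page
    var $loaded = $(this);
    var h1 = $loaded.find("h1").first();
    var img = $loaded.find("img").first().attr("src");
    //do something
});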

A simple .NET equivalent, saved as requestExternalURL.aspx:

<%@ Page Language="C#" %>
<script runat="server">
    string homepage = new System.Net.WebClient().DownloadString("http://url...");
</script>
<%=homepage%>

and again with ajax/jquery:

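$(".target").load("requestExternalURL.aspx", function(){
    // search only inside the loaded markup, not the whole page
    var $loaded = $(this);
    var h1 = $loaded.find("h1").first();
    var img = $loaded.find("img").first().attr("src");
    //do something
});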
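
One caveat with both versions: an src read this way may be relative (for example /images/logo.png), and on your own page it would resolve against your site rather than the original one. Below is a minimal sketch of resolving it against the source page with the standard URL constructor. The function name toAbsoluteUrl and the example.com addresses are placeholders, and older browsers without URL support would need a polyfill.

```javascript
// Hypothetical helper: resolve a possibly relative src against the page it came from.
function toAbsoluteUrl(src, pageUrl) {
    // new URL(relative, base) applies standard URL resolution rules;
    // an src that is already absolute ignores the base
    return new URL(src, pageUrl).href;
}

var page = 'http://www.example.com/articles/post.html';
console.log(toAbsoluteUrl('/images/logo.png', page));
console.log(toAbsoluteUrl('photo.jpg', page));
```

The pageUrl argument would be the same address the server-side script downloaded.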

hope it helps.

2 Comments

What exactly are you doing with the PHP? I use .net, so do you know how to do it in .net?
The PHP part is requesting the content of the URL and prints it out.
0

Option 1: If the external page is on the same server as the calling page then just ensure that you have included a modern version of jQuery and then set up the following JS:

//let's say that page is external.html
$(function() {
    $.get( 'external.html', function( data ) {
        //$.parseHTML() drops <html>/<body> and returns the top-level nodes,
        //so wrap them in a <div> to let .find() see all of them
        var html = $( '<div>' ).append( $.parseHTML( data ) );
        var h1 = html.find( 'h1' ).first(); //first h1 tag
        var img = html.find( 'img' ).first(); //first img tag
        var text = html.contents().filter(function() {
            //keep non-empty text nodes only
            return this.nodeType === 3 && $.trim( this.nodeValue ) !== '';
        }).slice( 0, 5 ); //first few top-level lines of text
        //h1, img and text may be added to the DOM or processed 
        //however you want
    });
});

Option 2: If, however, the external page is on another server, you may want to create a .NET proxy to fetch the page for you. You would then make a call similar to the one above, replacing external.html with myproxy.aspx?url=http://www.example.com/somepage.html.
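
Since the target address travels inside a query string, it is safer to URL-encode it; otherwise any ? or & in the external URL would be read as part of the proxy's own query string. A small sketch, where myproxy.aspx and the url parameter are the hypothetical names used above and buildProxyUrl is made up for illustration:

```javascript
// Hypothetical helper: build the proxy request URL with the target safely encoded.
function buildProxyUrl(proxyPath, targetUrl) {
    // encodeURIComponent escapes :, /, ?, & and = so the target stays one parameter value
    return proxyPath + '?url=' + encodeURIComponent(targetUrl);
}

var proxied = buildProxyUrl('myproxy.aspx', 'http://www.example.com/somepage.html?a=1&b=2');
console.log(proxied);
```

The result can be passed to $.get() in place of external.html; the proxy then decodes the url parameter before fetching.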

Option 3: If the other server, the one hosting the content you want to fetch, supports CORS, you do not need a server-side proxy; just supply the full URL of the external page, http://www.example.com/somepage.html, in place of external.html in the Option 1 code.


0

Facebook encourages the usage of Open Graph Protocol data to pull this kind of metadata. They have infrastructure that does the work of scraping pages and parsing available metadata.

You indicate you're using .NET; if that's the case, you may be able to use a library that parses Open Graph data for your purpose, such as OpenGraph-Net or OpenGraph .NET.
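
To give a rough idea of what such a library reads, here is a dependency-free sketch in JavaScript that collects og: properties from an HTML string. It is not taken from either library; the function name extractOpenGraph is hypothetical, and the regex matching is a shortcut that a real HTML parser would replace.

```javascript
// Hypothetical helper: collect Open Graph <meta> properties from an HTML string.
// Regex-based sketch; it assumes quoted attribute values with no ">" inside them.
function extractOpenGraph(html) {
    var result = {};
    var metaTags = html.match(/<meta\b[^>]*>/gi) || [];
    metaTags.forEach(function (tag) {
        var property = /\bproperty\s*=\s*["']og:([^"']+)["']/i.exec(tag);
        var content = /\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag);
        if (!property || !content) {
            return; // not an Open Graph tag, or it has no content attribute
        }
        var value = content[1] !== undefined ? content[1] : content[2];
        // keep the first value seen for each property
        if (!Object.prototype.hasOwnProperty.call(result, property[1])) {
            result[property[1]] = value;
        }
    });
    return result;
}

var og = extractOpenGraph(
    '<head><meta property="og:title" content="Hello">' +
    '<meta content="http://example.com/a.png" property="og:image" /></head>'
);
console.log(og.title, og.image);
```

Pages that publish og:title, og:image, and og:description already provide the title, picture, and summary directly, so falling back to the h1 and first image is only needed when those tags are missing.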
