0

I would like to extract text from HTML with pure JavaScript (this is for a Chrome extension).

Specifically, I would like to be able to find text on a page and extract text after it.

Even more specifically, on a page like

https://picasaweb.google.com/kevin.smilak/BestOfAmericaSGrandCircle#4974033581081755666

I would like to find the text "Latitude" and extract the value that follows it. The HTML there is not very structured.

What is an elegant solution to do it?

5 Answers

2

There is no elegant solution in my opinion because, as you said, the HTML is not structured, and the words "Latitude" and "Longitude" depend on the page localization. The best I can think of is relying on the cardinal points, which might not change...

var data = document.getElementById("lhid_tray").innerHTML;
var lat = data.match(/((\d)*\.(\d)*)°(\s*)(N|S)/)[1];
var lon = data.match(/((\d)*\.(\d)*)°(\s*)(E|W)/)[1];

3 Comments

I really don't think you can rely on ° W and ° N not changing, but you can easily change N to N|S and W to E|W in the regexes.
I was convinced that lat & lon were always expressed in terms of N, W. I'll edit the regex.
lat & lon should have minus signs if element [5] of the regexp match is S or W respectively, but these are further details that could be implemented with two extra lines of code...
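The sign handling mentioned in the last comment could look roughly like the sketch below. The helper name `toSignedDegrees` is hypothetical (it is not part of the answer above), and the sketch assumes the value always appears as a decimal number followed by "°" and a single hemisphere letter.

```javascript
// Hypothetical helper: converts text such as "36.872068° N" into a signed number.
// Assumes the format "<decimal>° <N|S|E|W>"; returns null when nothing matches.
function toSignedDegrees(text) {
    var m = /(\d+(?:\.\d+)?)\s*°\s*([NSEW])/.exec(text);
    if (!m) {
        return null; // no coordinate found
    }
    var value = parseFloat(m[1]);
    // South and West are negative by convention.
    return (m[2] === 'S' || m[2] === 'W') ? -value : value;
}

console.log(toSignedDegrees('Latitude: 36.872068° N'));   // 36.872068
console.log(toSignedDegrees('Longitude: 111.387291° W')); // -111.387291
```

Returning null instead of throwing keeps the caller in control when the page layout changes and the pattern no longer matches.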
1

You could do:

var str = document.getElementsByClassName("gphoto-exifbox-exif-field")[4].innerHTML;
var latPos = str.indexOf('Latitude');
// Take the text between the first <em>...</em> pair that follows "Latitude".
var lat = str.substring(str.indexOf('<em>', latPos) + 4, str.indexOf('</em>', latPos));
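As a sketch of the same indexOf idea with guards for missing text, the string handling can be pulled into a small function. The name `textBetweenAfter` and the sample markup are assumptions for illustration; the real page markup may differ.

```javascript
// Hypothetical helper: finds `label`, then returns the text between the next
// `open` and `close` markers after it, or null if any of the three is missing.
function textBetweenAfter(str, label, open, close) {
    var labelPos = str.indexOf(label);
    if (labelPos === -1) {
        return null;
    }
    var start = str.indexOf(open, labelPos);
    if (start === -1) {
        return null;
    }
    start += open.length;
    var end = str.indexOf(close, start);
    return end === -1 ? null : str.substring(start, end);
}

// Assumed sample markup, modelled on the <em> structure used above.
var sampleMarkup = 'Latitude: <em>36.872068° N</em><br>Longitude: <em>111.387291° W</em>';
console.log(textBetweenAfter(sampleMarkup, 'Longitude', '<em>', '</em>')); // 111.387291° W
```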

1

The text you're interested in is found inside of a div with class gphoto-exifbox-exif-field. Since this is for a Chrome extension, we have document.querySelectorAll which makes selecting that element easy:

var div = document.querySelectorAll('div.gphoto-exifbox-exif-field')[4],
    text = div.innerText;

/* text looks like:
"Filename: img_3474.jpg
Camera: Canon
Model: Canon EOS DIGITAL REBEL
ISO: 800
Exposure: 1/60 sec
Aperture: 5.0
Focal Length: 18mm
Flash Used: No
Latitude: 36.872068° N
Longitude: 111.387291° W"
*/

It's easy to get what you want now:

var lng = text.split('Longitude:')[1].trim(); // "111.387291° W"

I used trim() instead of split('Longitude: ') because the character after the colon in the innerText is not a regular space. URL-encoded, it's %C2%A0, which is the UTF-8 encoding of U+00A0, the non-breaking space; trim() strips it along with ordinary whitespace.
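If more than one field is needed, the same split-and-trim idea extends to every line of the text. This is only a sketch: `parseFields` is a hypothetical helper, and it assumes each line has the form "Label: value" as in the sample text above.

```javascript
// Hypothetical helper: turns "Label: value" lines into an object keyed by label.
function parseFields(text) {
    var fields = {};
    text.split('\n').forEach(function (line) {
        var idx = line.indexOf(':');
        if (idx === -1) {
            return; // skip lines without a label
        }
        // trim() removes regular spaces and the non-breaking space (U+00A0).
        fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
    });
    return fields;
}

// Assumed sample, using U+00A0 after the colon as observed on the page.
var sampleText = 'ISO: 800\nLatitude:\u00A036.872068° N\nLongitude:\u00A0111.387291° W';
var parsed = parseFields(sampleText);
console.log(parsed.Latitude);  // 36.872068° N
console.log(parsed.Longitude); // 111.387291° W
```

In the extension, the argument would be the innerText of the selected div rather than a hard-coded string.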

1 Comment

It does. Firefox is the outlier here (use textContent instead). quirksmode.org/dom/w3c_html.html#t04
0

I would query the DOM and just collect the image information into an object, so you can reference any property you want.

E.g.

function getImageData() {
    var props = {};
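    // Borrow Array's forEach to iterate over the NodeList.
    Array.prototype.forEach.call(
        document.querySelectorAll('.gphoto-exifbox-exif-field > em'),
        function (prop) {
            // The label is the text node just before each <em>, e.g. "Latitude: ".
            props[prop.previousSibling.nodeValue.replace(/[\s:]+/g, '')] = prop.textContent;
        }
    );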
    return props;
}

var data = getImageData();
console.log(data.Latitude); // 36.872068° N

0

Well if a more general answer is required for other sites then you can try something like:

var text = document.body.innerHTML;
text = text.replace(/(<([^>]+)>)/ig,"");  //strip out all HTML tags
var latArray = text.match(/Latitude:?\s*[^0-9]*[0-9]*\.?[0-9]*\s*°\s*[NS]/gim);
//Returns an array of every match for: "Latitude", an optional ":", optional whitespace,
//any non-digit characters, a decimal number, optional whitespace, "°", optional whitespace, then N or S.
//Flags: g = global, i = ignore case, m = multiline (no effect here, since the pattern has no ^ or $).

For that example an array of 1 element containing "Latitude: 36.872068° N" is returned (which should be easy to parse).
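Parsing the matched string could be folded into the search itself by using capture groups. The sketch below is one way to do that; `valueAfterLabel` is a hypothetical name, the sample HTML is an assumed simplification of the page, and the pattern only handles values written as a decimal number followed by "°" and a hemisphere letter.

```javascript
// Hypothetical helper: strips tags from an HTML string, then looks for
// "<label>: <number>° <N|S|E|W>" and returns the captured parts, or null.
function valueAfterLabel(html, label) {
    // Replace each tag with a space so adjacent text nodes do not run together.
    var text = html.replace(/<[^>]+>/g, ' ');
    // Escape regex metacharacters so the label is matched literally.
    var escaped = label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    var re = new RegExp(escaped + '\\s*:?\\s*(\\d+(?:\\.\\d+)?)\\s*°\\s*([NSEW])', 'i');
    var m = re.exec(text);
    return m ? { value: parseFloat(m[1]), hemisphere: m[2].toUpperCase() } : null;
}

// Assumed sample markup for illustration only.
var sampleHtml = '<div>Latitude: <em>36.872068° N</em><br>Longitude: <em>111.387291° W</em></div>';
var lat = valueAfterLabel(sampleHtml, 'Latitude');
console.log(lat.value);      // 36.872068
console.log(lat.hemisphere); // N
```

On a live page the first argument would be document.body.innerHTML, as in the snippet above.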

1 Comment

Note: I am not a regex expert by any means. That example should work for almost anything, but I am sure there are more complete and elegant solutions.
