I need to parse the JavaScript file at http://timetable.ait.ie/js/filter.js. I have been using BeautifulSoup over the past few days to parse HTML pages and I understand what I am doing there, but this .js file is killing me.

At the moment I am using the following code:

import urllib
page = urllib.urlopen("http://timetable.ait.ie/js/filter.js")
pageInfo = page.read()

This returns the whole file as a single string, 18283 lines of code. I am trying to get the staff names, which are stored in an array towards the bottom of the file:

staffarray[373][0] = "BRADY, DAMIEN";
staffarray[373][1] = "SCI";
staffarray[373][2] = "BRADY001608";

I need the values from [0] and [1] for each entry so that I can build a database with them and reference it later.

I have tried using a regex to find the staffarray entries, but I am completely frustrated trying to get this information out. Can anyone help me, please?

  • urllib and requests only fetch data from the server. BeautifulSoup lets you find tags in HTML, e.g. a <script> tag containing JavaScript code, but to find anything inside the JavaScript code itself you need standard string functions or a regex. Commented Nov 12, 2016 at 1:20
  • If you have a problem with regex, use standard string functions: first split the file into lines, then find the lines containing staffarray[, e.g. with if "staffarray[" in line:. After that you can easily extract the interesting values using other string functions or slicing. Commented Nov 12, 2016 at 1:23

2 Answers

If you have a problem with regex, use standard string functions and slicing instead.

First split the code into lines, then look for lines that contain staffarray[ together with [0] or [1]. Finally, use slicing to extract the value between the quotes.

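import urllib

# Python 2: download the file and split it into lines
req = urllib.urlopen("http://timetable.ait.ie/js/filter.js")
lines = req.read().split('\n')

for x in lines:
    if 'staffarray[' in x:
        # the value sits between the first and the last double quote;
        # rfind avoids relying on a fixed number of trailing characters
        start = x.find('"')+1
        end = x.rfind('"')
        if '[0] = ' in x:
            print '0', x[start:end]
        elif '[1] = ' in x:
            print '1', x[start:end]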
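
To go from printed values to something that can be looked up later, the same string checks can fill a dictionary keyed by the staff index. This is only a minimal sketch, assuming the lines look like the sample in the question. The function name parse_staff_lines, the keys 'name' and 'dept', and the sample text are made up for illustration, and treating [1] as a department code is a guess based on the "SCI" value.

```python
def parse_staff_lines(text):
    """Group staffarray[n][0] and staffarray[n][1] values by the index n."""
    staff = {}
    prefix = 'staffarray['
    for line in text.splitlines():
        pos = line.find(prefix)
        if pos == -1:
            continue
        # text between "staffarray[" and the first closing bracket
        close = line.find(']', pos)
        index_text = line[pos + len(prefix):close]
        if not index_text.isdigit():
            continue  # skip non-literal indexes such as staffarray[i]
        if '][0] = ' in line:
            key = 'name'
        elif '][1] = ' in line:
            key = 'dept'  # assumption: [1] holds a department code
        else:
            continue
        # the value sits between the first and the last double quote
        start = line.find('"') + 1
        end = line.rfind('"')
        staff.setdefault(int(index_text), {})[key] = line[start:end]
    return staff


# sample lines copied from the question
sample = '''staffarray[373][0] = "BRADY, DAMIEN";
staffarray[373][1] = "SCI";
staffarray[373][2] = "BRADY001608";'''

staff = parse_staff_lines(sample)
print(staff[373]['name'])
print(staff[373]['dept'])
```

In real use the text argument would be the string returned by read() on the downloaded file rather than the sample.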

1 Comment

Thanks a million for the help @furas

You could write a regexp pattern with capturing groups:

import re

# assumes the script was saved locally as filter.js
pattern = re.compile(r'staffarray\[(?P<first_index>\d+)\]\s*\[(?P<second_index>\d+)\] = "(?P<value>.+)"')
with open('filter.js') as f:
    for line in f:
        match = pattern.search(line)
        if match:
            # groups are strings, e.g. ('373', '0', 'BRADY, DAMIEN')
            first_index, second_index, value = match.groups()
            # do something with data
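
For the "build a database" part of the question, one possible way to fill in the "do something with data" step is to collect the matches per staff index and write them to SQLite with the standard sqlite3 module. This is a rough sketch, not the method from the answer: build_staff_db, the staff table and its dept column are hypothetical names, the pattern is a slightly looser variant of the one above, and an in-memory database stands in for a real file.

```python
import re
import sqlite3

# captures: first index, second index, quoted value
PATTERN = re.compile(r'staffarray\[(\d+)\]\s*\[(\d+)\]\s*=\s*"([^"]*)"')


def build_staff_db(js_text, conn):
    """Store the [0] and [1] values of every staffarray entry in a table."""
    fields = {}
    for first, second, value in PATTERN.findall(js_text):
        fields.setdefault(int(first), {})[int(second)] = value

    # assumption: [0] is the staff name and [1] a department code
    conn.execute('CREATE TABLE IF NOT EXISTS staff '
                 '(id INTEGER PRIMARY KEY, name TEXT, dept TEXT)')
    rows = [(idx, f.get(0), f.get(1)) for idx, f in sorted(fields.items())]
    conn.executemany('INSERT OR REPLACE INTO staff VALUES (?, ?, ?)', rows)
    conn.commit()


# sample lines copied from the question
sample = '''staffarray[373][0] = "BRADY, DAMIEN";
staffarray[373][1] = "SCI";
staffarray[373][2] = "BRADY001608";'''

conn = sqlite3.connect(':memory:')  # use a file path to keep the data
build_staff_db(sample, conn)
row = conn.execute('SELECT name, dept FROM staff WHERE id = ?',
                   (373,)).fetchone()
print(row)
```

Running findall over the whole text instead of line by line means the file does not need to be split first, and INSERT OR REPLACE lets the script be re-run against an updated filter.js without duplicate rows.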

1 Comment

Thanks for the answer, I got it sorted a while back.
