Read and process data from a URL in Python

Question

I am trying to get data from a URL. The format of the data is shown below.

What I am trying to do:
1) Read the data line by line and check whether each line contains the desired keyword.
2) If it does, store the previous line's content ("GETCONTENT") in a list.

<http://www.example.com/XYZ/a-b-c/w#>DONTGETCONTENT    
 a       <http://www.example.com/XYZ/mount/v1#NNNN> , 
<http://www.w3.org/2002/w#Individual> ;
        <http://www.w3.org/2000/01/rdf-schema#label>
                "some content , "some url content ;
        <http://www.example.com/XYZ/log/v1#hasRelation>
                <http://www.example.com/XYZ/data/v1#Change> ;
        <http://www.example.com/XYZ/log/v1#ServicePage>
                <https://dev.org.net/apis/someLabel> ;
        <http://www.example.com/XYZ/log/v1#Description>
                "Some API Content .

<http://www.example.com/XYZ/model/v1#GETBBBBBB>
a       <http://www.w3.org/01/07/w#BBBBBB> ;
        <http://www.w3.org/2000/01/schema#domain>
                <http://www.example.com/XYZ/data/v1#xyz> ;
        <http://www.w3.org/2000/01/schema#label1>
               "some content , "some url content ;
        <http://www.w3.org/2000/01/schema#range>
                <http://www.w3.org/2001/XMLSchema#boolean> ;
       <http://www.example.com/XYZ/log/v1#Description>
            "Some description .

<http://www.example.com/XYZ/datamodel-ee/v1#GETAAAAAA>
 a       <http://www.w3.org/01/07/w#AAAAAA> ;
        <http://www.w3.org/2000/01/schema#domain>
                <http://www.example.com/XYZ/data/v1#Version> ;
        <http://www.w3.org/2000/01/schema#label>
                "some content ;
        <http://www.w3.org/2000/01/schema#range>
            <http://www.example.com/XYZ/data/v1#uuu> .

<http://www.example.com/XYZ/datamodel/v1#GETCCCCCC>
 a       <http://www.w3.org/01/07/w#CCCCCC , 
<http://www.w3.org/2002/07/w#Name> 
        <http://www.w3.org/2000/01/schema#domain>
                <http://www.example.com/XYZ/data/v1#xyz> ;
        <http://www.w3.org/2000/01/schema#label1>
              "some content , "some url content ;
        <http://www.w3.org/2000/01/schema#range>
               <http://www.w3.org/2001/XMLSchema#boolean> ;
        <http://www.example.com/XYZ/log/v1#Description>
               "Some description .

Below is the code I have tried so far, but it prints the entire content of the file:

import re

def read_from_url():
    try:
        from urllib.request import urlopen
    except ImportError:
        from urllib2 import urlopen
    url_link = "examle.com"
    html = urlopen(url_link)
    previous = None
    for line in html:
        previous = line
        line = re.search(r"^(\s*a\s*)|\#GETBBBBBB|#GETAAAAAA|#GETCCCCCC\b",
                         line.decode('UTF-8'))
        print(previous)

if __name__ == '__main__':
    read_from_url()

Expected output:

GETBBBBBB , GETAAAAAA , GETCCCCCC

Thanks in advance!!

Can you include what you expect your code to produce from the example data?
– glibdud, Commented May 22, 2019 at 14:02
The expected output is to print the GETCONTENT value, or store it in a list, 3 times: whenever ACACAC, BCBCBC or ABABAB is found in the line that starts with "a", i.e. the 2nd line.
– RJ_Singh, Commented May 23, 2019 at 5:43
Please add it to the question itself and show exactly what you expect. Don't just describe it.
– glibdud, Commented May 23, 2019 at 10:35

Malekai · Accepted Answer · 2022-10-04 14:49:04Z

When it comes to reading data from URLs, the requests library is much simpler:

import requests

url = "https://www.example.com/your/target.html"
text = requests.get(url).text

If you don't have it installed, you can install it with:

pip3 install requests
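
If installing a third-party package isn't an option, the standard library's urllib.request can fetch the text too. This is only a minimal sketch: the fetch_text name, the UTF-8 default and the 10-second timeout are my own assumptions, not anything taken from the question.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8", timeout=10):
    """Download url and return the response body decoded as text.

    The helper name, default encoding and timeout are illustrative choices.
    """
    # The context manager closes the connection once the body has been read.
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding)


# Hypothetical usage (not executed here):
# text = fetch_text("https://www.example.com/your/target.html")
```

The returned string can be used exactly like the text variable above.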

Next, why go through the hassle of cramming all of your words into a single regular expression when you could keep them in a list and check each one in a for loop?

For example:

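import re

search_words = "hello word world".split(" ")
matching_lines = []

for (i, line) in enumerate(text.splitlines()):
  line = line.strip()
  if len(line) < 1:
    continue  # skip empty lines
  for word in search_words:
    # raw strings so that \b means "word boundary", not a backspace character
    if re.search(r"\b" + re.escape(word) + r"\b", line):
      matching_lines.append(line)
      break  # one match is enough; don't add the same line twice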

Then you'd output the result, like this:

print(matching_lines)

Running this where the text variable equals:

"""
this word will save the line
ignore me!
hello my friend!
what about me?
"""

Should output:

[
  "this word will save the line",
  "hello my friend!"
]

You could make the search case insensitive by using the lower method, like this:

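import re

search_words = "hello word world".lower().split(" ")
matching_lines = []

for (i, line) in enumerate(text.splitlines()):
  line = line.strip()
  if len(line) < 1:
    continue  # skip empty lines
  line = line.lower()
  for word in search_words:
    # raw strings so that \b means "word boundary", not a backspace character
    if re.search(r"\b" + re.escape(word) + r"\b", line):
      matching_lines.append(line)
      break  # one match is enough; don't add the same line twice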

Notes and information:

stopping the inner loop at the first match means a line is added only once, even if it contains several of the search words
the enumerate function lets us iterate over both the index and the current line
I lowercase the search words once, outside of the for loop, so that lower isn't called again for every line and every word
I don't call lower on the line until after the empty-line check because there's no point in lowercasing an empty line
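
One side effect of calling lower on each line is that the stored lines lose their original casing. If that matters, a possible alternative is to leave the lines alone and pass re.IGNORECASE instead. The sketch below wraps the same loop in a function; find_matching_lines and its ignore_case parameter are names I made up for illustration.

```python
import re


def find_matching_lines(text, words, ignore_case=False):
    """Return the non-empty lines of text that contain any of words as a whole word."""
    flags = re.IGNORECASE if ignore_case else 0
    # Compile one whole-word pattern per search word, once, before the loop.
    patterns = [re.compile(r"\b" + re.escape(word) + r"\b", flags) for word in words]
    matching = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # any() stops at the first pattern that matches, so a line is added once.
        if any(pattern.search(line) for pattern in patterns):
            matching.append(line)
    return matching


text = """
this word will save the line
ignore me!
Hello my friend!
what about me?
"""

print(find_matching_lines(text, ["hello", "word", "world"], ignore_case=True))
# ['this word will save the line', 'Hello my friend!']
```

With ignore_case left at False, only the first of those two lines matches, because "Hello" is capitalised in this sample.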

Good luck.

Baruch Spinoza · Accepted Answer · 2019-05-30 06:59:44Z

I'm puzzled about a few things, and answering them may help the community better assist you. Specifically, I can't tell what form the data is in (i.e. is it a txt file, or a URL you're making a request to and parsing the response of?). I also can't tell if you're trying to get the entire line, just the URL, or just the bit that follows the hash symbol.

Nonetheless, you stated you were looking for the program to output GETBBBBBB , GETAAAAAA , GETCCCCCC, and here's a quick way to get those specific values (assuming the values are in the form of a string):

search = re.findall(r'#(GET[ABC]{6})>', string)

Otherwise, if you're reading from a txt file, this may help:

import re

lst = []
with open('example_file.txt', 'r') as file:
    for line in file:
        # findall returns an empty list when nothing matches, so extend adds nothing then
        lst.extend(re.findall(r'#(GET[ABC]{6})', line))
print(lst)
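
If the goal is closer to what the question describes (keep the fragment from the line before a matching "a ..." line), remembering the previous line while looping is one way to do it. This is only a sketch under my reading of the question: the KEYWORDS tuple, the previous_line_fragments name and the shortened SAMPLE text are my assumptions, built from the example data in the question.

```python
import re

# Shortened copy of the question's example data (assumed representative).
SAMPLE = """\
<http://www.example.com/XYZ/a-b-c/w#>DONTGETCONTENT
 a       <http://www.example.com/XYZ/mount/v1#NNNN> ,

<http://www.example.com/XYZ/model/v1#GETBBBBBB>
a       <http://www.w3.org/01/07/w#BBBBBB> ;
        <http://www.w3.org/2000/01/schema#domain>

<http://www.example.com/XYZ/datamodel-ee/v1#GETAAAAAA>
 a       <http://www.w3.org/01/07/w#AAAAAA> ;

<http://www.example.com/XYZ/datamodel/v1#GETCCCCCC>
 a       <http://www.w3.org/01/07/w#CCCCCC ,
"""

KEYWORDS = ("AAAAAA", "BBBBBB", "CCCCCC")

# A line that starts with "a" and whose URL fragment is one of the keywords.
TYPE_LINE = re.compile(
    r"^\s*a\s+.*#(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b"
)
# The part between "#" and the closing ">" on the subject line.
FRAGMENT = re.compile(r"#(\w+)>")


def previous_line_fragments(lines):
    """Collect the fragment of the line that precedes each matching 'a ...' line."""
    found = []
    previous = ""
    for line in lines:
        if TYPE_LINE.search(line):
            match = FRAGMENT.search(previous)
            if match:
                found.append(match.group(1))
        previous = line  # update only after the current line has been checked
    return found


print(previous_line_fragments(SAMPLE.splitlines()))
# ['GETBBBBBB', 'GETAAAAAA', 'GETCCCCCC']
```

The important detail is that previous is updated at the end of each iteration, after the check; assigning it at the top of the loop makes it equal to the current line.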

Of course, these are just some quick suggestions in case they may be of help. Otherwise, please answer the questions I mentioned at the beginning of my response and maybe it can help someone on SO better understand what you're looking to get.
