I have an HTML page like this:

<div>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
</div>

I need to select a group like this:

<a href='link'>
<u class>name</u>
</a>
text
<br>

I need to extract three values from each group: link, name, and text. Is there any way to select a group like this and extract these values from each group in Scrapy, using CSS selectors, XPath, or anything else?

2 Answers

1

Scrapy lets you yield several values from one page together as an item: a Python object (or a plain dict) that holds key-value pairs.

You can extract the values individually but yield them together as key-value pairs.

  • To extract the value of an element's attribute, use ::attr(name), for example a::attr(href).
  • To extract an element's text (not its inner HTML), use ::text.

For example, you can define your parse function in Scrapy like this:

def parse(self, response):
    # href attribute of every <a> inside the target <div>
    for_link = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8) a::attr(href)').getall()

    # Text of every <u> nested in those <a> elements
    for_name = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8) a u::text').getall()

    # Text nodes that sit directly inside the target <div>
    for_text = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8)::text').getall()

    # Yield the three lists together as one dict
    yield {"link": for_link, "name": for_name, "text": for_text}

If you prefer a declared Item to a plain dict, open the items.py file and define the fields:

# Define the models for your scraped items here
import scrapy


# Replace "YourSpider" with your own spider or project name
class YourSpiderItem(scrapy.Item):
    # href of the <a> element
    for_link = scrapy.Field()

    # Text of the <u> element
    for_name = scrapy.Field()

    # Loose text that follows the <a> element
    for_text = scrapy.Field()

For more details, read this tutorial.

7 Comments

Hi, in the example I provided, 'text' is not in a <span>, so I guess this will not work. Can you answer for the case where 'text' is not in a <span>?
Hi @Vishnu, try it now and feel free to ask any further questions :)
Hi, I am sorry, but the above code does not work. In my case 'name' is inside <a>, but both 'text' and <a> are inside <div>, so to select 'text' I would maybe need something like 'div *::text'. That does not work either, as it gets 'name' again and also gets '\n' for some reason, maybe from <br>.
Can you share the website you are trying to scrape? I will share code accordingly.
Hi, sorry for the late reply (power outage). I am trying to scrape https://www.mangaupdates.com/series/r4ayzg7/mairimashita-iruma-kun and on this page I want to get the "Related Series" section.
1

If it's okay to wrap text in a span like so:

<a href='link'>
<u class>name</u>
</a>
<span>text</span>
<br>

Then you can select everything in CSS like so:

a, a + span {}

Or you can style these two separately:

a {}
a + span {}

The + is the adjacent sibling combinator: a + span matches a span only when it comes immediately after an a element that shares the same parent.

1 Comment

Sorry @Sam, but I do not own the HTML, I just receive the HTML in the specified format
