Cutting part of a string variable in Python (web scraping)

Question

I'm trying to scrape a website, and I managed to extract all the text that I wanted using this code:

nameList = bsObj.findAll("strong")
for text in nameList:
    string = text.get_text()
    if "Title" in string:
        print(string)

And I get the texts in this fashion:

Title 1: textthatineed

Title 2: textthatineed

Title 3: textthatineed

Title 4: textthatineed

Title 5: textthatineed

Title 6: textthatineed

Title 7: textthatineed ....

Is there any way to cut the string in Python, using BeautifulSoup or anything else, so that I get only the "textthatineed" part without the "Title (number): " prefix?

ren · Accepted Answer · 2016-12-31 22:17:10Z

1

Say we have

s = 'Title 1: textthatineed'

The text we want starts two characters after the colon (the colon is followed by a space), so we find the colon's index, move two characters forward, and take the substring from that index to the end:

index = s.find(':') + 2
title = s[index:]

Note that find() returns the index of the first occurrence only, so text that itself contains colons is left intact.
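One caveat with the find() approach: if a string has no colon at all, find() returns -1, so the slice starts at index 1 and silently drops the first character. A minimal sketch of a more defensive variant, using str.partition, is below. The helper name strip_label and the assumption that the separator is always a colon followed by one space are mine, not part of the answer above.

```python
def strip_label(s, sep=": "):
    """Return the text after the first `sep`, or `s` unchanged if `sep` is absent."""
    # partition() splits on the first occurrence only and returns a 3-tuple;
    # the middle element is empty when the separator was not found.
    head, found, tail = s.partition(sep)
    return tail if found else s


print(strip_label("Title 1: textthatineed"))
print(strip_label("Title 2: a: b"))
print(strip_label("no label here"))
```

Because partition() also stops at the first separator, later colons in the text are preserved, just as with find().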

answered Dec 31, 2016 at 22:17


Apara · Accepted Answer · 2017-01-01 19:07:34Z

1

In Python, there is a very handy operation that can be done on strings called slicing.

An example taken from the docs:

>>> word = 'Python'
>>> word[0:2]  # characters from position 0 (included) to 2 (excluded)
'Py'
>>> word[2:5]  # characters from position 2 (included) to 5 (excluded)
'tho'
>>> word[:2] + word[2:]
'Python'
>>> word[:4] + word[4:]
'Python'
>>> word[:2]   # character from the beginning to position 2 (excluded)
'Py'
>>> word[4:]   # characters from position 4 (included) to the end
'on'
>>> word[-2:]  # characters from the second-last (included) to the end
'on'

So in your case you would do something like this:

text = 'Title 1: important information here'
# 'Title 1: ' is the first 9 characters, i.e. indices 0 through 8,
# so the information you need begins at index 9
info = text[9:]

# For the general case, where the prefix length varies (e.g. 'Title 10: '),
# search the original string, not the already-sliced one
index = text.find(':') + 2
info = text[index:]
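To see why the general form matters once the numbering reaches two or more digits, here is a small sketch that applies it to several strings at once. The sample list stands in for the scraped texts and is made up for illustration, and after_colon is a hypothetical helper name.

```python
# Sample strings standing in for the scraped <strong> texts (made up for illustration).
scraped = ["Title 9: first value", "Title 10: second value", "Title 100: third: value"]


def after_colon(s):
    # Hypothetical helper: slice from two characters past the first colon,
    # assuming every string contains ': ' after its label.
    return s[s.find(":") + 2:]


values = [after_colon(s) for s in scraped]
print(values)
```

A fixed slice such as s[9:] would only be right for the first of these strings; the find()-based index adjusts to each prefix length.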

edited Jan 1, 2017 at 19:07

answered Dec 31, 2016 at 22:12


2 Comments

user4663715 Over a year ago

Hi @Apara, thank you for your solution, but the information does not always begin at index 9: once the numbering goes from Title 9 to Title 10, I have to start from index 10. This was a good solution, though: index = s.find(':') + 2; title = s[index:]

Apara Over a year ago

Oops. I seem to have overlooked that nuance. Let me edit my answer with your correction.
