I'm trying to scrape a website. I managed to extract all the text I wanted using this code:

# bsObj is the BeautifulSoup object for the page
nameList = bsObj.findAll("strong")
for tag in nameList:
    text = tag.get_text()
    if "Title" in text:
        print(text)

And I get the text in this form:

Title 1: textthatineed

Title 2: textthatineed

Title 3: textthatineed

Title 4: textthatineed

Title 5: textthatineed

Title 6: textthatineed

Title 7: textthatineed ....

Is there any way to cut the string in Python, using BeautifulSoup or otherwise, and get only "textthatineed" without the "Title (number): " prefix?

2 Answers

Say we have

s = 'Title 1: textthatineed'

The text we want starts two characters after the colon (past the colon itself and the space that follows it), so we find the colon's index, add two, and take the substring from that index to the end:

index = s.find(':') + 2
title = s[index:]

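Note that find() returns the index of the first occurrence only, so any colons later in the text are left alone. If the string contains no colon, find() returns -1, so index becomes 1 and only the first character is dropped.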
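
If some of the strings might have no colon at all, str.partition is one way to guard against that. This is only a minimal sketch; the helper name strip_label is made up for illustration and is not part of the original code:

```python
def strip_label(s):
    """Return the text after the first colon, or s unchanged if there is no colon."""
    # partition() splits on the first occurrence of the separator only
    _, sep, rest = s.partition(':')
    if not sep:
        # No colon found: leave the string as it is
        return s
    # Drop the space (or other whitespace) that follows the colon
    return rest.lstrip()


print(strip_label('Title 1: textthatineed'))  # textthatineed
print(strip_label('Title 2: a: b'))           # a: b
print(strip_label('no label here'))           # no label here
```

Unlike the find() version, a string without a colon comes back untouched instead of losing its first character.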

In Python, there is a very handy operation that can be done on strings called slicing.

An example taken from the docs:

>>> word = 'Python'
>>> word[0:2]  # characters from position 0 (included) to 2 (excluded)
'Py'
>>> word[2:5]  # characters from position 2 (included) to 5 (excluded)
'tho'
>>> word[:2] + word[2:]
'Python'
>>> word[:4] + word[4:]
'Python'
>>> word[:2]   # characters from the beginning to position 2 (excluded)
'Py'
>>> word[4:]   # characters from position 4 (included) to the end
'on'
>>> word[-2:]  # characters from the second-last (included) to the end
'on'

So in your case you would do something like this:

text = 'Title 1: important information here'
# 'Title 1: ' is the first 9 characters, i.e. indices 0 through 8,
# so the information you want begins at index 9
info = text[9:]

# For the general case, locate the colon instead of hard-coding the offset
index = text.find(':') + 2
info = text[index:]
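
To see why the colon-based version is the safer one once the numbers reach two digits, here is a small comparison. The sample strings are invented for illustration:

```python
lines = ['Title 9: first value', 'Title 10: second value']

# A fixed offset of 9 is only right for single-digit labels
fixed = [s[9:] for s in lines]

# Locating the colon adapts to the length of the label
found = [s[s.find(':') + 2:] for s in lines]

print(fixed)  # ['first value', ' second value']
print(found)  # ['first value', 'second value']
```

With the fixed offset, the second result keeps a leading space because its label is one character longer.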

2 Comments

Hi @Apara, thank you for your solution, but the information does not always begin at index 9: the titles go from Title 9 to Title 10, and from there I have to start at index 10. This part was a good solution, though: index = s.find(':') + 2; title = s[index:]
Oops. I seem to have overlooked that nuance. Let me edit my answer with your correction.
