
I have been trying to remove unnecessary parts of a scraped string and I'm having difficulty. I'm sure it's simple but I'm probably lacking the terminology to search for an effective solution.

I have all the information I need and am now trying to create a clean output. I am using this code...

for each in soup.findAll('div', attrs={'class': 'className'}):
    print(each.text.split('\n'))

And the output, a mix of numbers and text with variable spaces, is similar to...

['', '', '', '                    1                ', '  Text Example', '                        (4)']

What I need to produce is a list like...

['1', 'Text Example', '(4)']

Perhaps even removing the brackets "()" from the number 4.

Thanks.

  • Possible duplicate of How to remove whitespace in BeautifulSoup Commented Nov 28, 2017 at 21:37
  • I have tried removing the whitespace with split() and strip() variants and I haven't been able to figure out the combination I need. Commented Nov 28, 2017 at 21:39
  • text.strip() without parameters removes spaces, tabs, and newlines. If you have a list, then you have to do result = [x.strip() for x in your_list if x.strip() != ''] Commented Nov 28, 2017 at 21:40
  • @furas and yet when I'm doing it that way, it keeps splitting the two word text I need, eg. ['text', 'example']. I need them together. Commented Nov 28, 2017 at 21:42
  • strip() only removes whitespace at the ends. split() without arguments splits text into words, so don't use it. Commented Nov 28, 2017 at 21:44

2 Answers

clean = []
for each in soup.findAll('div', attrs={'class': 'className'}):
    # Split on newlines first; iterating over each.text directly would yield single characters
    clean.append([s.strip() for s in each.text.split('\n') if s.strip()])
print(clean)

That should do it. The full loop is included to show where the list comprehension goes.
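To see why the comments warn against a bare split(), here is a small sketch that uses a plain string as a stand-in for each.text. The string is an assumption, rebuilt from the output shown in the question rather than taken from a real page.

```python
# Hypothetical stand-in for each.text, reconstructed from the question's output
text = "\n\n\n                    1                \n  Text Example\n                        (4)"

# Splitting on newlines keeps multi-word lines together
by_line = [s.strip() for s in text.split('\n') if s.strip()]

# split() with no arguments splits on any whitespace, so words get separated
by_word = text.split()

print(by_line)
print(by_word)
```

The first list keeps 'Text Example' as one item; the second breaks it into 'Text' and 'Example', which is the behaviour described in the comments under the question.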

Update:

Since there was a comment about inefficiency, I timed the double strip against the nested generator on Python 3 out of curiosity. It seems there is something to the advice that it's best to profile...

%timeit [s.strip() for s in l if s.strip()]
1.83 µs ± 21.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit [i for i in (s.strip() for s in l) if i]
2.16 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

As usual, the results are a bit different with larger amounts of data...

%timeit [s.strip() for s in l*1000 if s.strip()]
1.57 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit [i for i in (s.strip() for s in l*1000) if i]
1.45 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
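
If you are not in IPython, something similar can be put together with the standard timeit module. This is only a sketch: the function names double_strip and nested_generator are made up for illustration, and the timings depend on the machine, so no numbers are shown here.

```python
import timeit

l = ['', '', '', '                    1                ', '  Text Example', '                        (4)']

def double_strip(items):
    # Calls strip() twice for every non-empty string
    return [s.strip() for s in items if s.strip()]

def nested_generator(items):
    # Strips once in the inner generator, then filters out empty results
    return [i for i in (s.strip() for s in items) if i]

for label, data in (("small", l), ("large", l * 1000)):
    for func in (double_strip, nested_generator):
        # number=100 is an arbitrary repeat count chosen to keep the run short
        seconds = timeit.timeit(lambda: func(data), number=100)
        print(f"{label:5} {func.__name__:16} {seconds / 100 * 1e6:.2f} µs per call")
```

Both functions return the same list, so the only thing being compared is speed.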

6 Comments

I've seen that used, but where do I put it. I'm learning, I very much appreciate the help.
How about a string with only spaces? if s will not remove it.
True, added a strip() in the if. I haven't really tested it myself, but I suppose it works for strings with only spaces. If you want to deal with other characters, I would probably put that in its own loop over clean afterwards; it makes it a bit easier to understand what is done where.
This is inefficient as you are stripping each string twice
yepp, can't argue with that, but not everything in the world needs to be efficient, you are welcome to do a regex solution or a nested list or something else you had in mind.

Let's reduce your problem down to a basic list:

l = ['', '', '', '                    1                ', '  Text Example', '                        (4)']

then use a list-comp:

[i for i in (s.strip() for s in l) if i]

to get your result of:

['1', 'Text Example', '(4)']
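
The question also mentions possibly dropping the brackets around the 4. One way to sketch that, assuming the brackets only ever appear at the ends of an item, is a second strip() with the characters to remove. The helper name clean_parts and its drop_chars parameter are made up for this example.

```python
l = ['', '', '', '                    1                ', '  Text Example', '                        (4)']

def clean_parts(parts, drop_chars="()"):
    """Strip whitespace, skip empty strings, then trim drop_chars from both ends."""
    cleaned = []
    for part in parts:
        stripped = part.strip()
        if not stripped:
            continue  # skip strings that were empty or whitespace only
        # strip(chars) only trims the ends, so brackets inside the text are kept
        cleaned.append(stripped.strip(drop_chars))
    return cleaned

result = clean_parts(l)
print(result)  # ['1', 'Text Example', '4']
```

Because strip() only works on the ends, an item such as 'Text (Example) here' would keep its brackets; str.replace() would be needed for that case.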

1 Comment

I just wanted to say thank you. You have helped me out a great deal.
