Clean up a scraped text string with Python

Question

I have been trying to remove unnecessary parts of a scraped string and I'm having difficulty. I'm sure it's simple but I'm probably lacking the terminology to search for an effective solution.

I have all the information I need and am now trying to create a clean output. I am using this code...

for each in soup.findAll('div', attrs={'class': 'className'}):
    print(each.text.split('\n'))

And the output, a mix of numbers and text with variable spaces, is similar to...

['', '', '', '                    1                ', '  Text Example', '                        (4)']

What I need to produce is a list like...

['1', 'Text Example', '(4)']

Ideally I would also remove the brackets "()" around the number 4.

Thanks.

Possible duplicate of How to remove whitespace in BeautifulSoup
– Cfreak, Commented Nov 28, 2017 at 21:37
I have tried removing the whitespace with split() and strip() variants and I haven't been able to figure out the combination I need.
– Toby Booth, Commented Nov 28, 2017 at 21:39
text.strip() without arguments removes leading and trailing whitespace (spaces, tabs, newlines). If you have a list, then you have to do result = [x.strip() for x in your_list if x.strip() != '']
– furas, Commented Nov 28, 2017 at 21:40
@furas and yet when I do it that way, it keeps splitting the two-word text I need, e.g. ['text', 'example']. I need them together.
– Toby Booth, Commented Nov 28, 2017 at 21:42
strip() only removes whitespace at the ends; split() without arguments splits the text into words, so don't use it.
– furas, Commented Nov 28, 2017 at 21:44
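
The difference between strip() and split() discussed in the comments can be seen on one of the strings from the question. This is only an illustrative sketch; the variable names are made up for the example.

```python
s = '  Text Example'

# strip() only trims whitespace at the two ends; the inner space survives
stripped = s.strip()

# split() with no argument breaks on every run of whitespace, so the words separate
words = s.split()

# split('\n') only breaks on newlines, so a single line stays in one piece
lines = s.split('\n')

print(repr(stripped))  # 'Text Example'
print(words)           # ['Text', 'Example']
print(lines)           # ['  Text Example']
```

So splitting on '\n' and then stripping each piece keeps "Text Example" together, while a bare split() does not.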

ahed87 · Accepted Answer · 2017-11-28 23:24:51Z

2

clean = []
for each in soup.findAll('div', attrs={'class': 'className'}):
    # Split the div's text into lines, strip each line, and drop the empty ones
    clean.append([s.strip() for s in each.text.split('\n') if s.strip()])
print(clean)

That should do it. The full loop is shown, since you asked where to put it.
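
To try the cleaning step without a live page, here is a minimal sketch that works on a plain string. The helper name clean_lines and the sample string raw are assumptions for illustration; raw stands in for each.text from one scraped div.

```python
def clean_lines(text):
    """Split text on newlines, strip each piece, and drop the empty ones."""
    return [s.strip() for s in text.split('\n') if s.strip()]


# Hypothetical stand-in for each.text, built to match the output in the question
raw = ('\n\n\n                    1                \n'
       '  Text Example\n'
       '                        (4)')

result = clean_lines(raw)
print(result)  # ['1', 'Text Example', '(4)']
```

Inside the scraping loop, the same helper would be called as clean.append(clean_lines(each.text)).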

Update:

Since there was a comment about inefficiency, I timed the double strip against the nested generator version on Python 3 out of curiosity. It seems there is something to the advice that it's best to profile:

%timeit [s.strip() for s in l if s.strip()]
1.83 µs ± 21.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit [i for i in (s.strip() for s in l) if i]
2.16 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

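As usual, the results are a bit different with larger amounts of data. On the six-element list the double strip was faster, but with the list repeated 1000 times the nested version came out slightly ahead in this run: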

%timeit [s.strip() for s in l*1000 if s.strip()]
1.57 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit [i for i in (s.strip() for s in l*1000) if i]
1.45 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
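
For anyone without IPython's %timeit, a rough equivalent can be sketched with the standard timeit module. The function names and the repeat count below are arbitrary choices for the example, and wrapping each expression in a function adds call overhead, so the absolute numbers will not match the ones above.

```python
import timeit

l = ['', '', '', '                    1                ', '  Text Example', '                        (4)']


def double_strip(items):
    # Strips every string twice: once in the filter, once for the result
    return [s.strip() for s in items if s.strip()]


def nested(items):
    # Strips every string once inside a generator, then filters the results
    return [i for i in (s.strip() for s in items) if i]


runs = 200
for size in (1, 1000):
    data = l * size
    for fn in (double_strip, nested):
        seconds = timeit.timeit(lambda: fn(data), number=runs)
        print(f"{fn.__name__:>12} x{size}: {seconds / runs * 1e6:.2f} µs per call")
```

Both functions return the same list, so only the timing differs; the numbers will vary between machines and Python versions.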

edited Nov 28, 2017 at 23:24

answered Nov 28, 2017 at 21:43

ahed87


6 Comments

Toby Booth Over a year ago

I've seen that used, but where do I put it? I'm learning, and I very much appreciate the help.

furas Over a year ago

What about a string containing only spaces? "if s" will not remove it.

ahed87 Over a year ago

True, I added a strip() in the if. I haven't really tested it myself, but I suppose that works for strings with only spaces. If you want to deal with other characters, I would probably put that in its own loop over clean afterwards, which makes it a bit easier to understand what is done where.

Joe Iddon Over a year ago

This is inefficient, as you are stripping each string twice.

ahed87 Over a year ago

yepp, can't argue with that, but not everything in the world needs to be efficient, you are welcome to do a regex solution or a nested list or something else you had in mind.


Joe Iddon · Accepted Answer · 2017-11-28 22:12:55Z

1

Let's reduce your problem down to a basic list:

l = ['', '', '', '                    1                ', '  Text Example', '                        (4)']

then use a list comprehension over a generator that strips each string once:

[i for i in (s.strip() for s in l) if i]

to get your result of:

['1', 'Text Example', '(4)']
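
The question also mentions removing the brackets around the 4. One possible extension, offered as a sketch rather than as part of the original answer, is to chain a second strip with the characters to drop. The name cleaned is made up for the example.

```python
l = ['', '', '', '                    1                ', '  Text Example', '                        (4)']

# Strip whitespace once per string, drop the empties,
# then trim any "(" or ")" characters from both ends of what is left
cleaned = [i.strip('()') for i in (s.strip() for s in l) if i]

print(cleaned)  # ['1', 'Text Example', '4']
```

Note that strip('()') only removes brackets at the ends of each string, and an item consisting solely of brackets would become an empty string.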

answered Nov 28, 2017 at 22:12

Joe Iddon


1 Comment

Toby Booth Over a year ago

I just wanted to say thank you. You have helped me out a great deal.
