14

I'm building a data extract using scrapy and want to normalize a raw string pulled out of an HTML document. Here's an example string:

  Sapphire RX460 OC  2/4GB

Notice two groups of two whitespaces preceeding the string literal and between OC and 2.

Python provides trim as described in How do I trim whitespace with Python? But that won't handle the two spaces between OC and 2, which I need collapsed into a single space.

I've tried using normalize-space() from XPath while extracting data with my scrapy Selector and that works but the assignment verbose with strong rightward drift:

product_title = product.css('h3').xpath('normalize-space((text()))').extract_first()

Is there an elegant way to normalize whitespace using Python? If not a one-liner, is there a way I can break the above line into something easier to read without throwing an indentation error, e.g.

product_title = product.css('h3')
    .xpath('normalize-space((text()))')
    .extract_first()
2
  • 1
    In reply to your secondary question, about how to format the Python code without throwing an indentation error. You can break a Python statement across multiple lines by using parentheses. Commented Jul 28, 2023 at 17:39
  • 1
    Here's an example of how to use parentheses to format Python code across several lines. Commented Jul 28, 2023 at 17:46

4 Answers 4

40

You can use:

" ".join(s.split())

where s is your string.

Sign up to request clarification or add additional context in comments.

Comments

4

Instead of using regex's for this, a more efficient solution is to use the join/split option, observe:

>>> timeit.Timer((lambda:' '.join(' Sapphire RX460 OC  2/4GB'.split()))).timeit()
0.7263979911804199

>>> def f():
        return re.sub(" +", ' ', "  Sapphire RX460 OC  2/4GB").split()

>>> timeit.Timer(f).timeit()
4.163465976715088

2 Comments

I will circle back on this answer as my extracts grow in size. Thank you!!
Pleasure's all mine.
2

The accepted answer is the right way to normalize the whitespace. This is an answer to your secondary question about formatting.

You also asked about how to format Python code across multiple lines without throwing an indentation error. You can do that in Python using parentheses. Here's what the example code from your question would look like formatted across several lines for readability.

product_title = (
    product.css("h3")
    .xpath("normalize-space((text()))")
    .extract_first()
)

Note that these parentheses don't create a tuple because there is no comma. The outer parentheses are just for formatting purposes.

The multi-line code above is exactly equivalent to chaining all the method calls together on one line.

product_title = product.css("h3").xpath("normalize-space((text()))").extract_first()

Comments

0

You can use a function like below with regular expression to scan for continuous spaces and replace them by 1 space

import re

def clean_data(data):
    return re.sub(" {2,}", " ", data.strip())

product_title = clean(product.css('h3::text').extract_first())

And then improve clean function anyway you like it

1 Comment

Not as elegant as what I was looking for, but points for extensibility.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.