I have this code

site = hxs.select("//h1[@class='state']")
mydata = site.select("string()").extract()
cleaned_mydata = re.sub(ur'(\s)\s+', ur'\1', mydata[0], flags=re.MULTILINE + re.UNICODE)

log.msg(str(mydata), level=log.ERROR)
log.msg(str(cleaned_mydata), level=log.ERROR)

The first output is

ERROR: [u'\r\n 212\r\n jobs containing php in xxxx \r\n ']

The second output is

jobs containing php in xxxxxx

The regex also seems to strip the number 212. How can I fix that?

  • What are you trying to match? Commented Nov 22, 2012 at 6:04
  • I am trying to remove runs of more than one space, and the line endings. I copied this from the internet; I don't know exactly what it does. Commented Nov 22, 2012 at 6:12
  • As a test, why not replace with something visible: instead of ur'\1', use 'XYZ'? Before you run the regex, why not remove the \r\n? Also, flags passed via flags= should be OR'd together, not added (i.e. use | not +). Commented Nov 22, 2012 at 6:17

1 Answer

The problem is that this regex keeps the first whitespace character of each run of two or more and strips only the ones that follow it.

This means that

u'\r\n 212\r\n jobs containing php in xxxx \r\n '

becomes

u'\r212\rjobs containing php in xxxx '

When you print this, the 212 will be printed, then a carriage return will return the cursor to the first column, so that the following jobs... will overwrite the 212.
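
To see the carriage returns themselves rather than their effect on the terminal, it can help to print the repr() of the result. This is a minimal sketch written for Python 3, so the ur'' prefixes from the question become r''. The input string is copied from the log output above, and the names raw and cleaned are only illustrative:

```python
import re

# Input copied from the logged output in the question.
raw = '\r\n 212\r\n jobs containing php in xxxx \r\n '

# Same pattern as in the question: for each run of two or more whitespace
# characters, keep the first one (group 1) and drop the rest.
cleaned = re.sub(r'(\s)\s+', r'\1', raw, flags=re.MULTILINE | re.UNICODE)

# repr() shows the remaining \r characters instead of letting the
# terminal act on them.
print(repr(cleaned))  # '\r212\rjobs containing php in xxxx '
```

The 212 is still in the string; it only looks missing once the terminal interprets the \r.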

This raises two questions:

  • You appear to be reading a text file in binary mode (otherwise the \r\n would have been normalized into \ns) - why?
  • Do you really want the regex to work this way?

Edit:

So, according to your comment, you want to

  • strip leading and trailing whitespace completely
  • condense multiple consecutive whitespace characters into a single space (ASCII 32).

Then use

cleaned_mydata = re.sub(r'\s+', ' ', mydata[0].strip())
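
As a self-contained sketch of that fix (Python 3 syntax; the helper name normalize_whitespace is hypothetical and not part of Scrapy or the re module):

```python
import re

def normalize_whitespace(text):
    """Strip leading/trailing whitespace and collapse each inner run to one space."""
    return re.sub(r'\s+', ' ', text.strip())

# Input copied from the logged output in the question.
raw = '\r\n 212\r\n jobs containing php in xxxx \r\n '

result = normalize_whitespace(raw)
print(repr(result))  # '212 jobs containing php in xxxx'

# For input like this, splitting on whitespace and rejoining with single
# spaces gives the same result without a regex.
alternative = ' '.join(raw.split())
```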

3 Comments

I am scraping the website with Scrapy and using it there. I don't know why it appears like that. I only want to remove the blank spaces and \r\n.
@user32: Please be more specific. What exactly do you want to remove? I don't think you want your result to be 212jobscontainingphpinxxxx.
I want the result to be 212 jobs containing php in xxx
