I'm trying to extract the search text from URL requests, but not every parsed dict contains a key with the text. When I apply {k: v[0] for k, v in parse_qs(str).items()} to the URLs, I lose a lot of requests, so I tried str = urllib.unquote(u[0]) instead. After that I get strings like

смотреть лучше не бывает&clid=1955453&win=176
Jade+Jantzen&ie=utf-8&oe=utf-8&gws_rd=cr&ei=FQB0V9WbIoahsAH5zZGACg
как+скрыть+лопоухость&newwindow=1&biw=1366&bih=657&source=lnms&sa=X&sqi=2&pjf=1&ved=0ahUKEwju5cPJy83NAhUPKywKHVHXBesQ_AUICygA&dpr=1
смотреть лучше не бывает&clid=1955453&win=176
2&clid=1976874&win=85&msid=1467228292.64946.22901.24595&text=как выбрать смартфон
маскаи гейла&lr=10750&clid=1985551-210&win=213

And I want to get

смотреть лучше не бывает
Jade Jantzen
как скрыть лопоухость
смотреть лучше не бывает
как выбрать смартфон
маскаи гейла

Is there any way to extract that?

1 Answer

Just split by & and take the first part:

txt = urllib.unquote(u[0]).split("&")[0]

And don't use str as a variable name - it's a built-in type name in Python.
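
A minimal sketch of the split above, assuming Python 3, where unquote lives in urllib.parse rather than directly on urllib as in the Python 2 code here. The function name first_part and the samples list are made up for illustration; the strings themselves come from the question.

```python
from urllib.parse import unquote


def first_part(raw):
    """Decode percent-escapes, then keep everything before the first '&'."""
    return unquote(raw).split("&")[0]


# Two of the strings from the question (already decoded, so unquote is a no-op here).
samples = [
    "смотреть лучше не бывает&clid=1955453&win=176",
    "маскаи гейла&lr=10750&clid=1985551-210&win=213",
]

for s in samples:
    print(first_part(s))
```

This only covers the lines where the wanted text comes first.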

EDIT: Unfortunately, the line 2&clid=1976874&win=85&msid=1467228292.64946.22901.24595&text=как выбрать смартфон follows a different pattern from the others. There's no common way to handle it together with the others. I was tempted to use a regex to match Cyrillic characters, but then Jade Jantzen wouldn't match. So for this one line, where the desired text is at the end, something like

txt = urllib.unquote(u[0]).split("=")[-1]

would work. Still, you didn't provide any actual criteria for the desired text. As humans, we can see how to transform what you get into what you want for this specific sample, but without clear rules for what to match, we can't provide a complete solution.

I'm aware that some (again, only some) of the lines have "+" in place of " ". This can possibly be solved with .replace("+", " ").
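
For the sample shown, the two patterns and the "+" issue could be combined into one heuristic. This is only a sketch under assumptions the question does not confirm: that the wanted text is either the value of a text= parameter or the leading segment before the first "&", and that Python 3 is used (urllib.parse). The name extract_query_text and its key argument are hypothetical.

```python
from urllib.parse import parse_qs, unquote_plus


def extract_query_text(raw, key="text"):
    """Heuristic: prefer an explicit text= parameter, else take the leading segment."""
    # parse_qs decodes percent-escapes and turns '+' into a space in values;
    # segments without '=' (such as a leading "2") are skipped by default.
    params = parse_qs(raw)
    if key in params:
        return params[key][0]
    # Assumption: with no text= parameter, the search text is the part before
    # the first '&'. unquote_plus decodes it and replaces '+' with a space.
    return unquote_plus(raw.split("&")[0])


print(extract_query_text("Jade+Jantzen&ie=utf-8&oe=utf-8&gws_rd=cr"))
print(extract_query_text("2&clid=1976874&win=85&text=как выбрать смартфон"))
```

Note that this splits on "&" before decoding, unlike the one-liners above, so an encoded "&" inside the search text would not cut it short. It is still a guess at the rule, not a general solution.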

2 Comments

And what can you say if the string looks like 213&msid=1466344978.51184.22872.22654&text=дэрил диксон?
I overlooked this one line. There will be no generic way to handle this one together with the others. For this one, the split should happen on = and the last part should be taken.
