0

I'm having trouble extracting the words that are in bold from strings like these:

Médoc, Rouge
2ème Vin, Margaux, Rosé
2ème vin, Pessac-Léognan, Blanc

To clarify my question: I'm trying to extract some information from web pages. Each page contains a sentence like the ones above, but I'm only interested in the part in bold. Here are the addresses of the three web pages:

Any ideas?

1
  • What's the criteria here? Commented Sep 3, 2013 at 12:55

2 Answers

2

You can use a positive lookahead to check that Rouge, Blanc or Rosé comes after the word we are looking for:

>>> import re
>>> l = [u"Médoc, Rouge", u"2ème Vin, Margaux, Rosé", u"2ème vin, Pessac-Léognan, Blanc"]
>>> for s in l:
...     print re.search(ur'([\w-]+)(?=\W+(Rouge|Blanc|Rosé))', s, re.UNICODE).group(0)
... 
Médoc
Margaux
Pessac-Léognan
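
The session above uses Python 2 syntax (the print statement and ur'' literals). As a rough sketch of the same lookahead idea in Python 3, where str patterns are Unicode-aware by default, something like the following should work. The names labels, APPELLATION and extract_appellation are made up for illustration and are not part of the original answer:

```python
import re

# Sample strings taken from the question.
labels = ["Médoc, Rouge", "2ème Vin, Margaux, Rosé", "2ème vin, Pessac-Léognan, Blanc"]

# One or more word characters or hyphens, followed (lookahead, not consumed)
# by non-word characters and one of the three colours.
APPELLATION = re.compile(r"[\w-]+(?=\W+(?:Rouge|Blanc|Rosé))")

def extract_appellation(label):
    """Return the term just before the colour, or None if nothing matches."""
    match = APPELLATION.search(label)
    return match.group(0) if match else None

results = [extract_appellation(s) for s in labels]
print(results)  # ['Médoc', 'Margaux', 'Pessac-Léognan']
```

Because the character class does not include spaces, a multi-word name in front of the colour would only yield its last word, so this sketch assumes single-word or hyphenated names like the ones in the question.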

10 Comments

Last output should be Pessac-Léognan not Léognan.
@MartijnPieters yeah, I think you caught him.
@alecxe The -1 on your answer was from me (my connection got yanked, so I was not able to comment). +1 now ;)
@alecxe: You mean the serial downvote I just received? shrug, that'll be reverted tonight anyway. I am pretty sure who did that in any case.
@MartijnPieters yeah, he was very upset because of eval().
1

It seems like it's always the second-to-last term in the comma-separated list? You can split the string and select the second-to-last element, for example:

>>> myStr = '2ème vin, Pessac-Léognan, Blanc'
>>> res = myStr.split(', ')[-2]

Otherwise, if you want a regex-only solution, I'd suggest this:

>>> import re
>>> res = re.search(r'([^,]+),[^,]+$', myStr).group(1)

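The captured group can keep the space that follows the previous comma, so strip the result if necessary (for example with res.strip()).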
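
To make the two variants easier to compare, here is a minimal sketch that wraps each one in a function and strips the whitespace. The function names are hypothetical, and returning None for input without a comma is an assumption added for safety, not something the snippets above do:

```python
import re

def second_to_last_split(label):
    """Split on commas and return the second-to-last field, stripped."""
    parts = [part.strip() for part in label.split(",")]
    return parts[-2] if len(parts) >= 2 else None

def second_to_last_regex(label):
    """Same idea with a regex anchored at the end of the string."""
    match = re.search(r"([^,]+),[^,]+$", label)
    return match.group(1).strip() if match else None

for label in ["Médoc, Rouge", "2ème Vin, Margaux, Rosé", "2ème vin, Pessac-Léognan, Blanc"]:
    print(second_to_last_split(label), second_to_last_regex(label))
```

Both functions should agree on the three strings from the question. The split version is probably the easier one to read and maintain.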

5 Comments

Don't use str as a variable name.
@AshwiniChaudhary Okay, are there perhaps functions having names containing str?
str is a built-in type in Python.
Thank you for your answer, but I have to clarify my question: I'm trying to extract some information from web pages. Each page contains a sentence like these, but I'm only interested in the part in bold. Here are the addresses of the three web pages: - link - link - [link] (nicolas.com/page.php/fr/18_409_9068_leshautsdesmith.htm)
@xeroxSO Oh, but that changes everything... You might try this one, which searches for the specific title element on the page: res = re.search(r'<div class="pro_blk_trans_titre">.*?\s([^,]+),[^,]+</div>', myPage).group(1).
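
As an illustration of the regex in the comment above, here is a small sketch run against a made-up page fragment. Only the class name pro_blk_trans_titre comes from the comment; the text inside the div, the my_page variable and the extract_from_page helper are assumptions, since the real page markup is not shown here:

```python
import re

# Hypothetical fragment: the real page layout may differ.
my_page = '<div class="pro_blk_trans_titre">Example Château 2ème vin, Pessac-Léognan, Blanc</div>'

# Lazily skip ahead inside the title div, then capture the comma-free run
# that precedes the last comma-separated field before </div>.
TITLE = re.compile(r'<div class="pro_blk_trans_titre">.*?\s([^,]+),[^,]+</div>')

def extract_from_page(page):
    """Return the second-to-last comma-separated term of the title, or None."""
    match = TITLE.search(page)
    return match.group(1) if match else None

print(extract_from_page(my_page))
```

Note that the pattern requires whitespace before the captured term and does not cross line breaks, so it is fairly sensitive to how the title is laid out in the HTML.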
