Python Regular Expressions - Ignore sequence during searching

Question

I am trying to make a kind of data miner with Python. What I am about to examine is a dictionary of the Greek language. The dictionary was originally in PDF format, and I turned it into a roughly corresponding HTML format to parse it more easily. I have done some further formatting on it, since the data structure was heavily distorted.

My current task is to find and separately store the individual words, along with their descriptions. My first thought was to identify the words first, apart from their descriptions. The header of each word's entry has a very specific syntax, and I use that to create a corresponding regular expression to match each and every one of them.

There is one problem though. Despite the formatting I have done to the HTML so far, there are still many points where a series of logical data is interrupted by a br tag followed by a newline, at random positions. Is there any way to direct my regular expression to "ignore" that sequence, that is, to treat it as non-existent when met, and therefore include those matches which are interrupted by it?

That is, without putting an optional group for that sequence (the br tag plus \n) in every part of my RE to cover every possible case.

The regular expression I use is the following:

(ο|η|το)?( )?<b>([α-ωάέήίόύώϊϋΐΰ])*</b>(, ((ο|η|το)? <b>([α-ωάέήίόύώϊϋΐΰ])*</b>))*( \(.*\))? ([Α-Ω])*\.( \(.*\))?<b>:</b>

and does a fine job with the matching, when the data is not interrupted by the sequence given above.

In case it was not clear, the problem is that the interrupting sequence can occur anywhere within the match. I am therefore looking for a way to ignore the sequence when deciding whether to return a match, other than covering every single spot where it might occur, as I explained earlier.

Have you tried removing the br tags before doing the regex search? myDocument = myDocument.replace("<br />", "")?
– Kevin, Commented Dec 29, 2014 at 15:28
That is a solution. Still, if there is an answer to what I am asking, I would have the solution ready right away, and moreover I imagine there will (generally) be cases where one would like to ignore some specific sequence without altering the text given, so I believe it's worth researching that possibility.
– Noob Doob, Commented Dec 29, 2014 at 15:33
This seems similar to stackoverflow.com/questions/2078915/…
– aberna, Commented Dec 29, 2014 at 15:45
I don't want to drop the match if it is interrupted. I don't want to "exclude" a string. I want to accept the match, whether it contains the interrupting sequence or not. That is, to direct the RE to treat the sequence as if it did not exist, and not to use it to decide whether to return a match or not. The problem lies in that the interrupting sequence can be placed anywhere within the match, and not in a specific position.
– Noob Doob, Commented Dec 29, 2014 at 15:50
It would be best to give four or five lines with this kind of interruption. But if I understood correctly, you have to add [<\/br>\n] to your matching character classes which are not a . (dot). I'm pretty sure removing the \n as @Kevin said is the best option to get proper matches as output.
– Tensibai, Commented Dec 29, 2014 at 16:35

Borealid · Accepted Answer · 2014-12-29 16:49:26Z

1

What you're asking for is a different regular expression.

The new regular expression would be the old one, with (<br\s*?/>\n?)? or the like after every non-quantifier character.

You could write something to transform a regular expression into the form you're looking for. It would take in your existing regex and produce a br-tolerant regex. No construct in the regular expression grammar exists to do this for you automatically.
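A rough sketch of such a transformer is below. It is not a general solution: the function name make_tolerant and the IGNORE pattern are made up for illustration, the exact form of the interrupting sequence (a br tag in any of its common spellings, plus an optional newline) is an assumption, and only a small subset of regex syntax is handled (literals, backslash escapes, character classes, plain groups, alternation, and the usual quantifiers).

```python
import re

# Assumed form of the interrupting sequence: <br>, <br/>, <br /> or </br>,
# optionally followed by a newline. Adjust to match the real document.
IGNORE = r"(?:</?br\s*/?>\n?)?"

# Quantifiers recognised by this sketch: *, +, ?, {m}, {m,}, {m,n}, plus lazy "?".
QUANT = re.compile(r"(?:[*+?]|\{\d+(?:,\d*)?\})\??")


def make_tolerant(pattern: str, ignore: str = IGNORE) -> str:
    """Return a copy of `pattern` that allows `ignore` after every atom."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "(":
            if pattern.startswith("(?", i):
                raise ValueError("extension groups are not supported in this sketch")
            out.append(ch)
            i += 1
            continue
        if ch in ")|":
            out.append(ch)
            i += 1
            continue
        q = QUANT.match(pattern, i)
        if q:
            # A quantifier here follows a group; copy it through unchanged.
            out.append(q.group())
            i = q.end()
            continue
        # Find the end of the current atom.
        if ch == "\\":
            end = i + 2                      # escape such as \( or \.
        elif ch == "[":
            end = i + 1
            if pattern.startswith("^", end):
                end += 1
            if pattern.startswith("]", end):
                end += 1                     # a leading ] is a literal
            while pattern[end] != "]":
                end += 2 if pattern[end] == "\\" else 1
            end += 1
        else:
            end = i + 1                      # single literal character or "."
        atom = pattern[i:end]
        q = QUANT.match(pattern, end)
        if q:
            # Keep the quantifier applied to "atom plus optional interruption".
            out.append(f"(?:{atom}{ignore}){q.group()}")
            end = q.end()
        else:
            out.append(atom + ignore)
        i = end
    return "".join(out)


tolerant = make_tolerant(r"<b>([α-ω])*</b>:")
print(bool(re.search(tolerant, "<b>λογ<br />\nος</b>:")))
```

Note that any text captured by groups in the tolerant pattern will still contain the interrupting sequence, so it would have to be stripped from the results afterwards anyway.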

I think the easier thing to do is to preprocess the source document so that it no longer contains the sequences you wish to ignore. This should be an easy text substitution.
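As a minimal sketch of that substitution: the names BR_SEQUENCE, strip_interruptions and HEADWORD are hypothetical, the br pattern is an assumption about what the interrupting sequence looks like, and HEADWORD is a cut-down stand-in for the full expression in the question.

```python
import re

# Assumed form of the interrupting sequence: any common spelling of a br tag,
# optionally followed by a newline.
BR_SEQUENCE = re.compile(r"</?br\s*/?>\n?")

# Simplified stand-in for the question's regex: a bold headword only.
HEADWORD = re.compile(r"<b>([α-ωάέήίόύώϊϋΐΰ]*)</b>")


def strip_interruptions(text: str) -> str:
    """Remove every interrupting sequence so the original regex can run as-is."""
    return BR_SEQUENCE.sub("", text)


raw = "<b>λογ<br />\nος</b><b>:</b>"
cleaned = strip_interruptions(raw)
match = HEADWORD.search(cleaned)
print(match.group(1) if match else None)
```

The original string is left untouched, since sub returns a new string; the search simply runs on the cleaned copy.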

If it weren't for your explicit use of the <b> tags for meaning, an alternative would be to just take the plain-text document content instead of the HTML content.

