I am trying to build a kind of data miner in Python. The data I want to examine is a dictionary of the Greek language. The dictionary was originally in PDF format, and I converted it to a roughly corresponding HTML format to parse it more easily. I have done some further formatting on it, since the data structure was heavily distorted.

My current task is to find and separately store the individual words, along with their descriptions. My first thought was to identify the words first, apart from their descriptions. The header of each word's entry has a very specific syntax, and I use that to build a regular expression that matches each of them.

There is one problem, though. Despite the formatting I have done on the HTML so far, there are still many places where a run of logically contiguous data is interrupted by the sequence < /br> followed by a newline, at arbitrary positions. Is there any way to tell my regular expression to "ignore" that sequence, that is, to treat it as non-existent when it is encountered, and so include the matches that it interrupts?

That is, without putting a (< br/>\n)? in every part of my RE, to cover every possible case.

The regular expression I use is the following:

(ο|η|το)?( )?<b>([α-ωάέήίόύώϊϋΐΰ])*</b>(, ((ο|η|το)? <b>([α-ωάέήίόύώϊϋΐΰ])*</b>))*( \(.*\))? ([Α-Ω])*\.( \(.*\))?<b>:</b>  

and does a fine job with the matching, when the data is not interrupted by the sequence given above.

To restate the problem: the interrupting sequence can occur anywhere within a match, so I am looking for a way to have it ignored when deciding whether there is a match, other than covering every single spot where it might occur, as I explained earlier.

  • Have you tried removing the br before doing the regex search? myDocument = myDocument.replace("</br>", "")? Commented Dec 29, 2014 at 15:28
  • That is a solution. Still, if there is an answer to what I am asking, I would have the solution ready right away, and moreover I imagine there will (generally) be cases where one would like to ignore some specific sequence without altering the text given, so I believe it's worth researching that possibility. Commented Dec 29, 2014 at 15:33
  • seems similar to this stackoverflow.com/questions/2078915/… Commented Dec 29, 2014 at 15:45
  • I don't want to drop the match, if it is interrupted. I don't want to "exclude" a string. I want to accept the match, whether it contains the interrupting sequence or not. That is, to direct the RE to treat the sequence, as if it did not exist, and not to use it to decide whether to return a match or not. The problem lies in that the interrupting sequence can be placed anywhere within the match, and not in a specific position. Commented Dec 29, 2014 at 15:50
  • It would be best to give four or five lines with this kind of interruption. But if I understood correctly, you have to add [<\/br>\n] to those of your matching character classes that are not a dot (.). I'm pretty sure removing the </br>\n as @Kevin said is the best option to get proper matches as output. Commented Dec 29, 2014 at 16:35

1 Answer

What you're asking for is a different regular expression.

The new regular expression would be the old one, with (<br\s*?/>\n?)? or the like after every non-quantifier character.

You could write something to transmute a regular expression into the form you're looking for. It would take in your existing regex and produce a br-tolerant regex. No construct in the regular expression grammar exists to do this for you automatically.
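A rough sketch of such a converter follows, to show how much machinery the idea needs. Everything in it is an assumption made for illustration: the names `SKIP` and `make_tolerant` are made up, the interruption is taken to be any run of `<br/>`-like tags and newlines, and only a small regex subset is handled (literal characters, backslash escapes of the form `\x`, character classes, plain and `(?:...)` groups, alternation, and the usual quantifiers). Backreferences, lookarounds, and named groups are not supported.

```python
import re

# Assumption: the interruption is any run of <br/>-like tags ("<br/>", "</br>",
# "< /br>", ...) and newlines, in either order.
SKIP = r"(?:<\s*/?\s*br\s*/?\s*>|\n)*"

# A quantifier that directly follows an atom or a group: *, +, ?, {m}, {m,n},
# optionally followed by ? (lazy).
_QUANT = re.compile(r"(?:[*+?]|\{\d+(?:,\d*)?\})\??")


def _class_end(pattern, start):
    """Return the index just past the character class that opens at `start`."""
    j = start + 1
    if pattern.startswith("^", j):
        j += 1
    if pattern.startswith("]", j):  # a leading ] is a literal member
        j += 1
    while pattern[j] != "]":
        j += 2 if pattern[j] == "\\" else 1
    return j + 1


def make_tolerant(pattern, skip=SKIP):
    """Return `pattern` with `skip` inserted after every atom (hypothetical helper)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "(":
            if pattern.startswith("(?:", i):
                out.append("(?:")
                i += 3
            elif pattern.startswith("(?", i):
                raise NotImplementedError("only plain and (?:...) groups are handled")
            else:
                out.append("(")
                i += 1
            continue
        if ch in "|^$":  # structural characters are copied unchanged
            out.append(ch)
            i += 1
            continue
        if ch == ")":  # copy the group end together with its quantifier, if any
            i += 1
            q = _QUANT.match(pattern, i)
            quant = q.group(0) if q else ""
            out.append(")" + quant)
            i += len(quant)
            continue
        # Otherwise read one atom: an escape, a character class, or one character.
        if ch == "\\":
            end = i + 2
        elif ch == "[":
            end = _class_end(pattern, i)
        else:
            end = i + 1
        atom = pattern[i:end]
        i = end
        q = _QUANT.match(pattern, i)
        if q:
            # Wrap atom + skip so the quantifier repeats both together.
            out.append("(?:" + atom + skip + ")" + q.group(0))
            i = q.end()
        else:
            out.append(atom + skip)
    return "".join(out)


# Usage: the same header pattern, with and without tolerance.
header = r"<b>[α-ωό]+</b>"
interrupted = "<b>λό<br/>\nγος</b>"
print(re.search(header, interrupted))
print(re.search(make_tolerant(header), interrupted))
```

The generated pattern is long and hard to read, and wherever the original pattern has something like `.*` that can itself match the tag, the engine has more than one way to consume it, which can slow down failing matches.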

I think the easier thing to do is to transform the source document so that it no longer contains the sequences you wish to ignore. This should be an easy text substitution.
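As a minimal sketch of that substitution, assuming the interruption is a `<br/>`-like tag with an optional newline on either side: the names `BR_NOISE`, `strip_breaks`, and `HEADWORD` are made up for the example, and `HEADWORD` is only a cut-down stand-in for the full header regex in the question.

```python
import re

# Assumption: the interruption is a <br/>-like tag ("<br/>", "</br>", "< /br>", ...)
# with an optional newline directly before or after it.
BR_NOISE = re.compile(r"\n?<\s*/?\s*br\s*/?\s*>\n?")


def strip_breaks(html):
    """Return `html` with the interrupting sequences removed."""
    return BR_NOISE.sub("", html)


# Simplified stand-in for the header regex: one bold lowercase Greek word.
HEADWORD = re.compile(r"<b>([α-ωάέήίόύώϊϋΐΰ]+)</b>")

raw = "<b>λό< /br>\nγος</b>"
clean = strip_breaks(raw)
match = HEADWORD.search(clean)
print(clean)
print(match.group(1) if match else None)
```

The original document does not have to be modified on disk: the cleaned string only needs to exist for the duration of the search, and the unmodified regex can then be used as it is.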

If it weren't for your explicit use of the <b> tags for meaning, an alternative would be to just take the plain-text document content instead of the HTML content.
