Grab required field values from a paragraph block using regex in Python

Question

I have a text file from which I have extracted paragraph blocks like the ones below.

Text Example:

EXONERAR, com validade a contar de 19 de agosto de 2020, DE- NILSON DE BRITO LIMA, ID FUNCIONAL Nº 2100423-4, do cargo em comissão de Coordenador, símbolo DAS-8, da Coordenadoria de Gestão Centralizada de Serviços, da Superintendência de Gestão Centralizada, da Subsecretaria de Logística, da Secretaria de Estado de Planejamento e Gestão. Processo nº SEI-120001/010643/2020

EXONERAR, a pedido, NADIA NAKAMURA VIEIRA, ID FUNCIONAL Nº 5099589-8, do cargo em comissão de Assessor Especial, símbolo DG, da Secretaria de Estado de Planejamento e Gestão. Processo nº SEI-150001/004627/2020

EXONERAR, com validade a contar de 26 de novembro de 2020, BRUNO RAFAEL ROCHA COSTA, ID FUNCIONAL Nº 5108093-1, do cargo em comissão de Assessor, símbolo DAS-7, da Assessoria de Planejamento e Gestão, da Presidência, da Superintendência de Des- portos do Estado do Rio de Janeiro - SUDERJ, da Secretaria de Es- tado de Esporte, Lazer e Juventude. Processo nº SEI- 3 0 0 0 0 2 / 0 0 0 4 11 / 2 0 2 0 .

EXONERAR, com validade a contar de 16 de novembro de 2020, LUIS HENRIQUE FERREIRA DE AQUINO, ID FUNCIONAL Nº 1914315-0, do cargo em comissão de Assistente II, símbolo DAI-6, da Secretaria de Estado de Planejamento e Gestão. Processo nº SEI120001/014825/2020:

From the above text block, I want to grab only the bold values from each paragraph, with each paragraph as an individual row.

What I have tried:

r"\b(?:(?:EXONERAR|d[ae]|por|símbolo)\s([^,]+?)(?: e Gestão)?,|\b(?!SEI\b)([A-Z\d]+-\s*\d+)|SEI-\s*([\d /]+)\b)"

My Current Output:

https://regex101.com/r/FCimoW/1

My current output is almost right, but it does not match all the required parts, e.g. the CAPITALIZED name.
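To reproduce the issue outside regex101, here is a minimal sketch that applies the pattern from the question to each paragraph separately and collects the non-empty capture groups into one row per paragraph. The variable names (`pattern`, `text`, `rows`) and the blank-line split are assumptions for illustration; only the regex itself and the sample text come from the question.

```python
import re

# The pattern from the question, unchanged (split across lines for readability).
pattern = re.compile(
    r"\b(?:(?:EXONERAR|d[ae]|por|símbolo)\s([^,]+?)(?: e Gestão)?,"
    r"|\b(?!SEI\b)([A-Z\d]+-\s*\d+)"
    r"|SEI-\s*([\d /]+)\b)"
)

# Two of the sample paragraphs, separated by a blank line as in the text file.
text = (
    "EXONERAR, com validade a contar de 19 de agosto de 2020, DE- NILSON DE BRITO LIMA, "
    "ID FUNCIONAL Nº 2100423-4, do cargo em comissão de Coordenador, símbolo DAS-8, "
    "da Coordenadoria de Gestão Centralizada de Serviços, da Superintendência de Gestão "
    "Centralizada, da Subsecretaria de Logística, da Secretaria de Estado de Planejamento "
    "e Gestão. Processo nº SEI-120001/010643/2020"
    "\n\n"
    "EXONERAR, a pedido, NADIA NAKAMURA VIEIRA, ID FUNCIONAL Nº 5099589-8, do cargo em "
    "comissão de Assessor Especial, símbolo DG, da Secretaria de Estado de Planejamento "
    "e Gestão. Processo nº SEI-150001/004627/2020"
)

# Assumption: paragraphs are separated by one or more blank lines.
paragraphs = re.split(r"\n\s*\n", text.strip())

# One row per paragraph: keep whichever capture group took part in each match.
rows = [
    [group for match in pattern.finditer(paragraph) for group in match.groups() if group]
    for paragraph in paragraphs
]

for row in rows:
    print(row)
```

Matching paragraph by paragraph also keeps `[^,]+?` from running across a paragraph boundary, since that character class matches newlines too. The uppercase names do not appear in either row, which is the gap described above.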

Perhaps like this? regex101.com/r/gpbqU9/1

– The fourth bird, Commented Dec 1, 2020 at 17:03

The fourth bird · Accepted Answer · 2020-12-01 17:09:01Z


For the bold uppercase parts, you can add an alternation that matches two or more uppercase words separated by whitespace chars or hyphens and followed by a comma.

\b([A-Z]+(?:[\s-]+[A-Z]+)+)(?=,)

Regex demo for the full pattern
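As a rough illustration of the name alternative on its own (not the full pattern from the demo), the sketch below runs it with Python's built-in `re` module on shortened copies of two sample paragraphs. The names `NAME`, `paragraphs` and `names` are placeholders chosen for this example.

```python
import re

# Name alternative only: two or more uppercase ASCII words separated by
# whitespace or hyphens, asserted (not consumed) to be followed by a comma.
NAME = re.compile(r"\b([A-Z]+(?:[\s-]+[A-Z]+)+)(?=,)")

# Shortened copies of two paragraphs from the question.
paragraphs = [
    "EXONERAR, com validade a contar de 19 de agosto de 2020, DE- NILSON DE BRITO LIMA, "
    "ID FUNCIONAL Nº 2100423-4, do cargo em comissão de Coordenador, símbolo DAS-8",
    "EXONERAR, a pedido, NADIA NAKAMURA VIEIRA, ID FUNCIONAL Nº 5099589-8, "
    "do cargo em comissão de Assessor Especial, símbolo DG, da Secretaria",
]

# findall returns the contents of the single capture group for each match.
names = [NAME.findall(paragraph) for paragraph in paragraphs]
print(names)
# [['DE- NILSON DE BRITO LIMA'], ['NADIA NAKAMURA VIEIRA']]
```

Single uppercase words such as `EXONERAR` or `DG` are skipped because the pattern requires at least two words, and `ID FUNCIONAL Nº` is skipped because it is not directly followed by a comma.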


6 Comments

AlwaysSunny Over a year ago

[A-Z]+ captures the CAPITALIZED name but not the international characters. See: regex101.com/r/wqAaSg/1

The fourth bird Over a year ago

@AlwaysSunny Try it like this using \p{Lu} regex101.com/r/7iNy7o/1

AlwaysSunny Over a year ago

Maybe it is not valid in Python; I am getting the error sre_constants.error: bad escape \p at position 113

AlwaysSunny Over a year ago

I added a `\` before it, like `\\p`, but now it is causing issues with capturing in Python

AlwaysSunny Over a year ago

Please don't be. I've installed that regex package and it is working now. Thanks for the link. I saw that link earlier but was not sure whether it would work for me. When you suggested it, I tried it, and it works as per my requirements. Thanks a million :)
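For readers following the `\p{Lu}` exchange above, here is a hedged sketch of the same name pattern with Unicode uppercase letters. It assumes the third-party `regex` package (`pip install regex`) is the one referred to in the comments. The sample sentence with an accented name is invented for illustration, and the fallback character class `[A-ZÀ-ÖØ-Þ]` is an assumption added here so the snippet still runs without that package. It covers only Latin-1 uppercase letters.

```python
import re

try:
    import regex  # third-party package; supports \p{...} Unicode properties
except ImportError:
    regex = None

# Hypothetical sample with accented uppercase letters (not from the question).
text = "EXONERAR, a pedido, JOSÉ ANTÔNIO DA CONCEIÇÃO, ID FUNCIONAL Nº 0000000-0, do cargo"

# The built-in re module rejects \p{Lu}, which is the "bad escape \p" error above.
try:
    re.compile(r"\p{Lu}+")
    stdlib_accepts_p = True
except re.error:
    stdlib_accepts_p = False

if regex is not None:
    # \p{Lu} matches any Unicode uppercase letter.
    name_pattern = regex.compile(r"\b(\p{Lu}+(?:[\s-]+\p{Lu}+)+)(?=,)")
else:
    # Fallback assumption: spell out ASCII plus Latin-1 uppercase ranges for re.
    name_pattern = re.compile(r"\b([A-ZÀ-ÖØ-Þ]+(?:[\s-]+[A-ZÀ-ÖØ-Þ]+)+)(?=,)")

names = name_pattern.findall(text)
print(names)
# ['JOSÉ ANTÔNIO DA CONCEIÇÃO']
```

Doubling the backslash (`\\p`) inside a raw string does not help with `re`, because the pattern then looks for a literal backslash followed by `p` instead of a Unicode property.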
