PHP reverse regex match

Question

I'm having real trouble parsing a large text file (around 12 MB) with PHP. I need to match one regex, then search backwards from that match for the first occurrence of a second regex, and extract the string between the two matches. Here is a real example:

PROCESSO:583.00.2012.105981
No ORDEM:01.19.2012/000154
CLASSE:PROCEDIMENTO SUMÁRIO (EM GERAL)
REQUERENTE:ASSETJ ASSOCIAÇÃO DOS SERVIDORES DO TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO
ADVOGADO:273919/SP - THIAGO PUGINA
Requerido:TIM CELULAR S/A E OUTRO
VARA:19a. VARA CÍVEL

PROCESSO:583.00.2012.105970
No ORDEM:01.07.2012/000134
CLASSE:PROCEDIMENTO ORDINÁRIO (EM GERAL)
REQUERENTE:CARLOS NEUMANN
ADVOGADO:79117/SP - ROSANA CHIAVASSA
Requerido:SUL AMÉRICA SEGURO SAÚDE S/A
VARA:7a. VARA CÍVEL

The script should find this code: 273919/SP (regex: [0-9]{6}/SP), then check backwards for the code 583.00.2012.105981 (regex: [0-9]{3}.[0-9]{2}.[0-9]{4}.[0-9]{6}).

And then get all the text between them.

I can't use a single preg_match with both regexes in the same pattern, because some blocks in the file contain more than one code like 273919/SP, and that would mess everything up.

What can I do? Do you have any ideas?

Sorry if my regex is crappy; I am new to it and it is very difficult to learn :P

EDIT:

Here is another form in which the code appears:

583.00.2012.100905-6/000000-000 - no ordem 82/2012 - Procedimento Sumário (em geral) - JOSE APARECIDO DOS
SANTOS X SEGURADORA LIDER DOS CONSORCIOS DO SEGUROS DPVAT S/A - Fls. 79 - Demonstre o autor, por meio
de documento idôneo (declaração de bens e renda e comprovante de pagamento), a necessidade de obtenção do benefício
da justiça gratuita, a fim de ser cumprido o disposto no artigo 5o, LXXIV da CF. Após, tornem os autos conclusos. Int. - ADV
GUILHERME DIAS GONÇALVES OAB/SP 302632 - ADV TIAGO RAFAEL OLIVEIRA ALEGRE OAB/SP 302811

That is my problem. Now I have two occurrences, OAB/SP 302632 and OAB/SP 302811, and I need to get the last one and extract the text between the id 583.00.2012.100905-6/000000-000 and OAB/SP 302811.

Those numbers aren't fixed, so I can't search for the literal OAB/SP 302811; I have to search for OAB\/SP\s\d{6} instead.

Because I can't stop at the first occurrence of the 273919/SP regex, as a block may contain more than one. So I must extract this string for every 273919/SP match I encounter, and then go backwards to find the 583.00.2012.105981 regex.
– Vinicius Tavares, Commented Jan 27, 2012 at 16:31
Wouldn't it be better (feasible?) to look for the ADVOGADO: and PROCESSO: keys? Or do you need to extract a single block only? Have you tried using the search strings in the natural order with .*? in between?
– mario, Commented Jan 27, 2012 at 16:31
If your block of text has more than one 273919/SP, do you want to match up to the first or last 273919/SP?
– Jonathan Kuhn, Commented Jan 27, 2012 at 16:32

Qtax · Accepted Answer · 2012-01-27 16:49:27Z

2

You have two expressions, re1 and re2, and you want to match re1 and then find the first re2 match before it, and get the content between them.

Assuming that there's always a re2 match before a re1 match, then this is equivalent to: Match re2, followed by a string not containing any re2 matches and capturing it, followed by a re1 match.

This can be written as:

(?s)re2((?:(?!re2).)*?)re1

If re1 is \d{6}/SP and re2 is \d{3}\.\d{2}\.\d{4}\.\d{6} you get:

(?s)(\d{3}\.\d{2}\.\d{4}\.\d{6})((?:(?!\d{3}\.\d{2}\.\d{4}\.\d{6}).)*?)(\d{6}/SP)

I've put the re1 and re2 matches in capturing groups here in case you'd want their values as well.
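For illustration, here is roughly how that pattern could be used from PHP with preg_match_all. This is only a sketch run on a shortened copy of the question's sample; the variable names and the ~ delimiter are my own choices. Note that 79117/SP in the second record has only five digits, so \d{6}/SP does not match it and only the first record is returned.

```php
<?php
// Shortened sample from the question (two records).
$txt = <<<TEXT
PROCESSO:583.00.2012.105981
No ORDEM:01.19.2012/000154
CLASSE:PROCEDIMENTO SUMÁRIO (EM GERAL)
ADVOGADO:273919/SP - THIAGO PUGINA
Requerido:TIM CELULAR S/A E OUTRO

PROCESSO:583.00.2012.105970
No ORDEM:01.07.2012/000134
ADVOGADO:79117/SP - ROSANA CHIAVASSA
TEXT;

$re2 = '\d{3}\.\d{2}\.\d{4}\.\d{6}'; // process number
$re1 = '\d{6}/SP';                   // lawyer code

// re2, then text that never starts another re2 match, then re1.
// The ~ delimiter avoids having to escape the slash in /SP.
$pattern = '~(?s)(' . $re2 . ')((?:(?!' . $re2 . ').)*?)(' . $re1 . ')~';

preg_match_all($pattern, $txt, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] = process number, $m[2] = text in between, $m[3] = lawyer code
    echo $m[1], ' -> ', $m[3], "\n";
}
```

With PREG_SET_ORDER each element of $matches is one complete match, which makes it easy to loop over the records one at a time.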

edited Jan 27, 2012 at 16:49

answered Jan 27, 2012 at 16:44


mario · Accepted Answer · 2012-01-27 16:46:48Z

I would assume it is actually as simple as just looking for the two keys/id tokens and fetching the text block in between with an .*? substitute:

 preg_match_all('~

     (?: ^  PROCESSO:  \d+(?:\.\d+){3}  \s* )
   ( (?: ^  [\w\s]+:   .*               \s* )+ )  # multiple lines in between
     (?: ^  ADVOGADO:  273919/SP            )

     ~mx',
     $input, $matches
 )
 and print_r($matches);

This looks for your data block, and will return the middle part in $matches[1]. So you could use end($matches[1]) to get the last entry for the 273919/SP id. You probably don't need that much assertion for the inner text, just as illustration to avoid the empty lines.

But in essence, you don't "match in reverse", but simply make it more specific for the inner part. Then you can just list the two things you want to search for in the very order they would occur in your file.
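The same forward-order idea might be applied to the second format from the question's edit. The sketch below is an assumption-laden illustration, not part of the pattern above: it assumes every entry starts with an id shaped like 583.00.2012.100905-6/000000-000, it adds a negative lookahead so the inner text cannot run into the next id, and it uses a greedy inner part so that the last OAB/SP number of each entry is the one captured. The entry text is shortened and the second entry is hypothetical.

```php
<?php
// First entry shortened from the question's edit; the second entry is
// hypothetical and only there to show where a match stops.
$txt = <<<TEXT
583.00.2012.100905-6/000000-000 - no ordem 82/2012 - Procedimento Sumário (em geral) - JOSE APARECIDO DOS
SANTOS X SEGURADORA LIDER DOS CONSORCIOS DO SEGUROS DPVAT S/A - Fls. 79 - Int. - ADV
GUILHERME DIAS GONÇALVES OAB/SP 302632 - ADV TIAGO RAFAEL OLIVEIRA ALEGRE OAB/SP 302811
583.00.2012.100906-4/000000-000 - no ordem 83/2012 - Procedimento Sumário (em geral) - ADV FULANO DE TAL OAB/SP 123456
TEXT;

// Assumed shape of the entry id in this second format.
$id = '\d{3}\.\d{2}\.\d{4}\.\d{6}-\d/\d{6}-\d{3}';

// id, then as much text as possible that never starts another id,
// then an OAB/SP number. The greedy * makes the engine back off from the
// end of the entry, so the LAST OAB/SP number in the entry is captured.
$pattern = '~(?s)(' . $id . ')((?:(?!' . $id . ').)*)(OAB/SP\s\d{6})~';

preg_match_all($pattern, $txt, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] = id, $m[2] = text in between, $m[3] = last OAB/SP number
    echo $m[1], ' -> ', $m[3], "\n";
}
```

Switching the inner quantifier to *? would capture the first OAB/SP number of each entry instead of the last.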

FtDRbwLXw6 · Accepted Answer · 2012-01-27 16:36:49Z

1

I don't see why you have to do some weird backwards search. Just do something like this:

$search = '273919'; // assume this would come from user input of some sort?
$pattern = '#PROCESSO:(\d{3}\.\d{2}\.\d{4}\.\d{6}).+?ADVOGADO:' . preg_quote($search, '#') . '/SP#ms';
// Only read $matches when the pattern actually matched.
if (preg_match($pattern, $fileContents, $matches)) {
    echo $matches[1]; // 583.00.2012.105981
}

answered Jan 27, 2012 at 16:36


ghoti · Accepted Answer · 2012-01-27 17:42:57Z

1

You're trying to extract the lines between PROCESSO and ADVOGADO for each record, where records are identified by a new PROCESSO line?

For a very large consistently formatted text file like this, I wouldn't use regexp this way at all. I'd use standard file handling and do my own record keeping.

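<?php

$fh = fopen("/path/to/file.txt", "r");
if ($fh === false) {
  die("Cannot open file\n");
}

$keep = false;
$buffer = "";

// No length argument: fgets() then returns the whole line, however long it is
// (a limit of 80 would split the longer lines in this file).
while (($line = fgets($fh)) !== false) {
  if (strpos($line, "PROCESSO:") !== false) {
    // Start of a new record: begin collecting lines.
    $keep = true;
    $buffer = "";
    continue;
  }
  if (strpos($line, "ADVOGADO:") !== false) {
    print $buffer; // or do whatever you want with it
    $keep = false;
    $buffer = "";
    continue;
  }
  if ($keep) {
    $buffer .= $line;
  }
}

fclose($fh);

?>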

edited Jan 27, 2012 at 17:42

answered Jan 27, 2012 at 17:26


Susam Pal · Accepted Answer · 2012-01-27 16:35:50Z

<?php

$txt = <<<TEXT
PROCESSO:583.00.2012.105981
No ORDEM:01.19.2012/000154
CLASSE:PROCEDIMENTO SUMÁRIO (EM GERAL)
REQUERENTE:ASSETJ ASSOCIAÇÃO DOS SERVIDORES DO TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO
ADVOGADO:273919/SP - THIAGO PUGINA
Requerido:TIM CELULAR S/A E OUTRO
VARA:19a. VARA CÍVEL

PROCESSO:583.00.2012.105970
No ORDEM:01.07.2012/000134
CLASSE:PROCEDIMENTO ORDINÁRIO (EM GERAL)
REQUERENTE:CARLOS NEUMANN
ADVOGADO:79117/SP - ROSANA CHIAVASSA
Requerido:SUL AMÉRICA SEGURO SAÚDE S/A
VARA:7a. VARA CÍVEL
TEXT;

$matches = array();
// Dots are escaped so they match literal periods; .*? stops at the next process number.
preg_match('/[0-9]{6}\/SP(.*?)[0-9]{3}\.[0-9]{2}\.[0-9]{4}\.[0-9]{6}/s', $txt, $matches);
echo $matches[1];
?>

Output:

 - THIAGO PUGINA
Requerido:TIM CELULAR S/A E OUTRO
VARA:19a. VARA CÍVEL

PROCESSO:

Jeremy Harris · Accepted Answer · 2012-01-27 16:29:26Z

-1

It seems your data has a repeating pattern. If so, you could explode() it into an array and process each array element individually which effectively limits the scope of your regex calls.

// Get data
$file_data = file_get_contents('/path/to/my/file.txt');

// Explode data into chunks using repeated delimiter
$data = explode("PROCESSO:", $file_data);

// Process array
foreach($data as $chunk)
{
    // Perform regex functions on $chunk here
}
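As a rough idea of what the per-chunk processing could look like, the sketch below pulls the process number from the start of each chunk and collects every lawyer code found in it. The inline sample stands in for the file contents, the $records array is a hypothetical name, and \d{5,6}/SP is an assumption made so that five-digit codes such as 79117/SP are picked up too.

```php
<?php
// Stand-in for file_get_contents(); shortened sample from the question.
$fileData = <<<TEXT
PROCESSO:583.00.2012.105981
No ORDEM:01.19.2012/000154
ADVOGADO:273919/SP - THIAGO PUGINA

PROCESSO:583.00.2012.105970
No ORDEM:01.07.2012/000134
ADVOGADO:79117/SP - ROSANA CHIAVASSA
TEXT;

$records = [];
foreach (explode('PROCESSO:', $fileData) as $chunk) {
    // explode() drops the delimiter, so a real record now starts with its id.
    if (!preg_match('~^\d{3}\.\d{2}\.\d{4}\.\d{6}~', $chunk, $id)) {
        continue; // text before the first record, or a chunk without an id
    }
    // Every lawyer code inside this record only.
    preg_match_all('~\d{5,6}/SP~', $chunk, $codes);
    $records[$id[0]] = $codes[0];
}

print_r($records);
```

Because each regex only ever sees one record, a block with several lawyer codes cannot bleed into the next one.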

answered Jan 27, 2012 at 16:29


4 Comments

mario Over a year ago

Let's not use explode for everything, okay? It doesn't assert much structure, and specifically in this case wouldn't ease fetching the intended data from the resulting chunks.

Vinicius Tavares Over a year ago

Thanks cillosis, but it is a little more complicated. The text comes from a PDF that I extracted to txt, and it has a lot of dirty characters and useless strings. I am trying to clean them all out, but there are a lot of headers and footers that change, and I am having trouble with that, so I considered just checking the content data instead of stripping out everything I don't need.

Jeremy Harris Over a year ago

@ViniciusTavares Ah, I see. So I'm guessing this is a one time run (assuming it works correctly) to parse out data and do something with it?

Vinicius Tavares Over a year ago

Yep. It will have around four thousand results that I'll have to match in a database
