0

I am having real trouble reading a large text file (around 12 MB) with PHP. I have to match a regex, then search backwards from that match for the first occurrence of another regex, and then extract the string between the two matches. Here is a real example:

PROCESSO:583.00.2012.105981
No ORDEM:01.19.2012/000154
CLASSE:PROCEDIMENTO SUMÁRIO (EM GERAL)
REQUERENTE:ASSETJ ASSOCIAÇÃO DOS SERVIDORES DO TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO
ADVOGADO:273919/SP - THIAGO PUGINA
Requerido:TIM CELULAR S/A E OUTRO
VARA:19a. VARA CÍVEL

PROCESSO:583.00.2012.105970
No ORDEM:01.07.2012/000134
CLASSE:PROCEDIMENTO ORDINÁRIO (EM GERAL)
REQUERENTE:CARLOS NEUMANN
ADVOGADO:79117/SP - ROSANA CHIAVASSA
Requerido:SUL AMÉRICA SEGURO SAÚDE S/A
VARA:7a. VARA CÍVEL

The script should find this code: 273919/SP (regex: [0-9]{6}/SP), then check backwards for the code 583.00.2012.105981 (regex: [0-9]{3}.[0-9]{2}.[0-9]{4}.[0-9]{6}).

And then get all the text between them.

I can't do a preg_match with both of those regexes in the same pattern, because throughout the file some of the blocks contain more than one code of the 273919/SP type, and that would mess everything up.

What can I do? Do you have any ideas?

Sorry if my regex is crappy, I am new at it and it is very difficult to learn :P

EDIT:

Here is another form in which the code appears:

583.00.2012.100905-6/000000-000 - no ordem 82/2012 - Procedimento Sumário (em geral) - JOSE APARECIDO DOS
SANTOS X SEGURADORA LIDER DOS CONSORCIOS DO SEGUROS DPVAT S/A - Fls. 79 - Demonstre o autor, por meio
de documento idôneo (declaração de bens e renda e comprovante de pagamento), a necessidade de obtenção do benefício
da justiça gratuita, a fim de ser cumprido o disposto no artigo 5o, LXXIV da CF. Após, tornem os autos conclusos. Int. - ADV
GUILHERME DIAS GONÇALVES OAB/SP 302632 - ADV TIAGO RAFAEL OLIVEIRA ALEGRE OAB/SP 302811

That is my problem. Now I have two occurrences: OAB/SP 302632 and OAB/SP 302811, and I need to get the last one and extract the text between the id 583.00.2012.100905-6/000000-000 and OAB/SP 302811.

Those numbers aren't fixed, so I can't search for the literal OAB/SP 302811; I have to search for the pattern OAB\/SP\s\d{6}.

4
  • Why do you have to do the search in opposite order? Commented Jan 27, 2012 at 16:28
  • Because I can't stop at the first occurrence of the 273919/SP regex, as a block may contain more than just one. So I must extract this string for every 273919/SP match I encounter, and then go backwards to find the 583.00.2012.105981 regex. Commented Jan 27, 2012 at 16:31
  • Wouldn't it be better (feasible?) to look for the ADVOGADO: and PROCESSO: keys? Or do you need to extract a single block only? Have you tried using the search strings in the natural order with .*? in between? Commented Jan 27, 2012 at 16:31
  • If your block of text has more than one 273919/SP do you want to match up to the first or last 273919/SP? Commented Jan 27, 2012 at 16:32

6 Answers

2

You have two expressions, re1 and re2, and you want to match re1 and then find the first re2 match before it, and get the content between them.

Assuming that there's always a re2 match before a re1 match, then this is equivalent to: Match re2, followed by a string not containing any re2 matches and capturing it, followed by a re1 match.

This can be written as:

(?s)re2((?:(?!re2).)*?)re1

If re1 is \d{6}/SP and re2 is \d{3}\.\d{2}\.\d{4}\.\d{6} you get:

(?s)(\d{3}\.\d{2}\.\d{4}\.\d{6})((?:(?!\d{3}\.\d{2}\.\d{4}\.\d{6}).)*?)(\d{6}/SP)

I've put the re1 and re2 matches in capturing groups here in case you'd want their values as well.
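
A minimal PHP sketch of the pattern above, for illustration only. The helper name `extract_between`, its parameters and the returned array shape are assumptions, not part of the answer. The `$last` switch is also an addition: it swaps the lazy `*?` for a greedy `*`, which makes the match end at the last re1 before the next re2 instead of the first one. That seems to be what the EDIT in the question asks for.

```php
<?php
// Hypothetical helper (not from the answer): $idRe plays the role of re2,
// $codeRe the role of re1. Neither pattern may contain the "~" delimiter.
function extract_between(string $text, string $idRe, string $codeRe, bool $last = false): array
{
    // Lazy "*?" stops at the first code after the id; greedy "*" runs up to
    // the next id (or the end of the text) and backtracks to the last code.
    $quantifier = $last ? '*' : '*?';
    $pattern = '~(' . $idRe . ')((?:(?!' . $idRe . ').)' . $quantifier . ')(' . $codeRe . ')~s';

    if (preg_match_all($pattern, $text, $matches, PREG_SET_ORDER) === false) {
        // For example, a PCRE backtrack limit was hit on a very long span.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }

    $rows = [];
    foreach ($matches as $m) {
        $rows[] = ['id' => $m[1], 'between' => $m[2], 'code' => $m[3]];
    }
    return $rows;
}

// Shortened sample record from the question.
$text = "PROCESSO:583.00.2012.105981\nNo ORDEM:01.19.2012/000154\nADVOGADO:273919/SP - THIAGO PUGINA\n";
$rows = extract_between($text, '\d{3}\.\d{2}\.\d{4}\.\d{6}', '\d{6}/SP');
print_r($rows);
```

For the second form shown in the question, the same helper could be called with a longer id pattern such as `\d{3}\.\d{2}\.\d{4}\.\d{6}-\d/\d{6}-\d{3}`, the code pattern `OAB/SP\s\d{6}`, and `$last` set to `true`. On a 12 MB input it is worth checking for a `false` return, as the sketch does, since long spans can run into PCRE limits.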


1

I would assume it is actually as simple as just looking for the two keys/id tokens and fetching the text block in between with an .*? substitute:

 preg_match_all('~

     (?: ^  PROCESSO:  \d+(?:\.\d+){3}  \s* )
   ( (?: ^  [\w\s]+:   .*               \s* )+ )  # multiple lines in between
     (?: ^  ADVOGADO:  273919/SP            )

     ~mx',
     $input, $matches
 )
 and print_r($matches);

This looks for your data block, and will return the middle part in $matches[1]. So you could use end($matches[1]) to get the last entry for the 273919/SP id. You probably don't need that much assertion for the inner text; it is just an illustration of how to avoid the empty lines.

But in essence, you don't "match in reverse", but simply make it more specific for the inner part. Then you can just list the two things you want to search for in the very order they would occur in your file.


1

I don't see why you have to do some weird backwards search. Just do something like this:

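$search = '273919'; // assume this would come from user input of some sort?
// (?:(?!PROCESSO:).)+? keeps the match from running across a record boundary
$pattern = '#PROCESSO:(\d{3}\.\d{2}\.\d{4}\.\d{6})(?:(?!PROCESSO:).)+?ADVOGADO:' . preg_quote($search, '#') . '/SP#ms';
if (preg_match($pattern, $fileContents, $matches)) {
    echo $matches[1]; // 583.00.2012.105981
}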


1

You're trying to extract the lines between PROCESSO and ADVOGADO for each record, where records are identified by a new PROCESSO line?

For a very large consistently formatted text file like this, I wouldn't use regexp this way at all. I'd use standard file handling and do my own record keeping.

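<?php

$fh = fopen("/path/to/file.txt", "r");
if ($fh === false) {
  die("Could not open file");
}

$keep = 0;
$buffer = "";

// No length argument, so fgets() returns each line whole, however long it is.
// Comparing with false avoids stopping early on a line that evaluates as falsy.
while (($line = fgets($fh)) !== false) {
  if (strpos($line, "PROCESSO:") !== FALSE) {
    // A new record starts here; the PROCESSO line itself is not kept.
    $keep = 1;
    continue;
  }
  if (strpos($line, "ADVOGADO:") !== FALSE) {
    print $buffer; // or do whatever you want with it
    $keep = 0;
    $buffer = "";
    continue;
  }
  if ($keep == 1) {
    $buffer .= $line;
  }
}

fclose($fh);

?>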


0
<?php

$txt = <<<TEXT
PROCESSO:583.00.2012.105981
No ORDEM:01.19.2012/000154
CLASSE:PROCEDIMENTO SUMÁRIO (EM GERAL)
REQUERENTE:ASSETJ ASSOCIAÇÃO DOS SERVIDORES DO TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO
ADVOGADO:273919/SP - THIAGO PUGINA
Requerido:TIM CELULAR S/A E OUTRO
VARA:19a. VARA CÍVEL

PROCESSO:583.00.2012.105970
No ORDEM:01.07.2012/000134
CLASSE:PROCEDIMENTO ORDINÁRIO (EM GERAL)
REQUERENTE:CARLOS NEUMANN
ADVOGADO:79117/SP - ROSANA CHIAVASSA
Requerido:SUL AMÉRICA SEGURO SAÚDE S/A
VARA:7a. VARA CÍVEL
TEXT;

$matches = array();
preg_match('/[0-9]{6}\/SP(.*)[0-9]{3}\.[0-9]{2}\.[0-9]{4}\.[0-9]{6}/s', $txt, $matches);
echo $matches[1];
?>

Output:

 - THIAGO PUGINA
Requerido:TIM CELULAR S/A E OUTRO
VARA:19a. VARA CÍVEL

PROCESSO:


-1

It seems your data has a repeating pattern. If so, you could explode() it into an array and process each array element individually which effectively limits the scope of your regex calls.

// Get data
$file_data = file_get_contents('/path/to/my/file.txt');

// Explode data into chunks using repeated delimiter
$data = explode("PROCESSO:", $file_data);

// Process array
foreach($data as $chunk)
{
    // Perform regex functions on $chunk here
}
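
One way the per-chunk step might look, as a rough sketch only. The function name `codes_by_process` and the returned array shape are made up for illustration, and it assumes every record starts with a `PROCESSO:` label followed directly by the id, as in the first sample in the question.

```php
<?php
// Hypothetical helper: maps each PROCESSO id to the list of 6-digit "/SP"
// codes found in that record. Name and return shape are assumptions.
function codes_by_process(string $fileData): array
{
    $result = [];

    // Every chunk after the first starts right after a "PROCESSO:" label.
    $chunks = explode('PROCESSO:', $fileData);
    array_shift($chunks); // text before the first label is not a record

    foreach ($chunks as $chunk) {
        // The id is expected at the very start of the chunk.
        if (!preg_match('~^\s*(\d{3}\.\d{2}\.\d{4}\.\d{6})~', $chunk, $id)) {
            continue; // not a well-formed record, skip it
        }
        // Collect every code in this record; the list may be empty.
        preg_match_all('~\d{6}/SP~', $chunk, $codes);
        $result[$id[1]] = $codes[0];
    }

    return $result;
}

$sample = "PROCESSO:583.00.2012.105981\nADVOGADO:273919/SP - THIAGO PUGINA\n";
print_r(codes_by_process($sample));
```

Because each regex only ever sees one record, a block with several codes cannot bleed into its neighbours. The text between the id and any given code can then be taken from the chunk itself.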

4 Comments

Let's not use explode for everything, okay? It doesn't assert much structure, and specifically in this case wouldn't ease fetching the intended data from the resulting chunks.
Thanks cillosis, but it is a little more complicated. It comes from a PDF that I extracted to txt, and it has a lot of dirty chars and useless strings. I am trying to clean them all, but there are a lot of headers and footers that change, and I am having trouble with that, so I thought about just checking the content data instead of stripping all the content that I don't need.
@ViniciusTavares Ah, I see. So I'm guessing this is a one time run (assuming it works correctly) to parse out data and do something with it?
Yep. It will have around four thousand results that I'll have to match in a database
