I am trying to build an array from the HTML below:

<div class="singlepost">
    
<ul class="linha_status" style="">
<li>Status: <b>Objeto em trânsito - por favor aguarde</b></li>
<li>Data  : 24/10/2021 | Hora: 12:04</li>           
<li>Origem: Unidade de Tratamento - Jaboatao Dos Guararapes / PE</li>
<li>Destino: Agência dos Correios - Cuitegi / PB</li>
</ul>

<ul class="linha_status" style="">
<li>Status: <b>Objeto em trânsito - por favor aguarde</b></li>
<li>Data  : 19/10/2021 | Hora: 00:03</li>           
<li>Origem: Unidade de Logística Integrada - Curitiba / PR</li>
<li>Destino: Unidade de Tratamento - Recife / PE</li>
</ul>

<ul class="linha_status" style="">
<li>Status: <b>Fiscalização aduaneira finalizada</b></li>
<li>Data  : 18/10/2021 | Hora: 23:35</li>
<li>Local: Unidade Operacional - Curitiba / PR</li>
</ul>

<ul class="linha_status" style="">
<li>Status: <b>Objeto recebido pelos Correios do Brasil</b></li>
<li>Data  : 16/10/2021 | Hora: 11:45</li>
<li>Local: Unidade de Logística Integrada - Curitiba / PR</li>
</ul>

<ul class="linha_status" style="">
<li>Status: <b>Objeto postado</b></li>
<li>Data  : 14/10/2021 | Hora: 20:30</li>
<li>Local: País -  / </li>
</ul>

</div>

I am using XPath and foreach to build the array, but I have had no luck with the result. The code runs, but it does not produce the output I need. This is the code I have written:

$doc = new DOMDocument();
$doc->loadHTML($htmlString);
$xpath = new DOMXPath($doc);

$geral = $xpath->evaluate('//ul[@class="linha_status"]');

foreach ($geral as $name) {
    $total[] = $name->nodeValue;
}
var_dump($total);

My current code produces this output:

  array(5) {
    [0] => string(195)
    " Status: Objeto em trânsito - por favor aguarde Data : 24/10/2021 | Hora: 12:04 Origem: Unidade de Tratamento - Jaboatao Dos Guararapes / PE Destino: Agência dos Correios - Cuitegi / PB" 
    [1] => string(189)
    " Status: Objeto em trânsito - por favor aguarde Data : 19/10/2021 | Hora: 00:03 Origem: Unidade de Logística Integrada - Curitiba / PR Destino: Unidade de Tratamento - Recife / PE" 
    [2] => string(128)
    " Status: Fiscalização aduaneira finalizada Data : 18/10/2021 | Hora: 23:35 Local: Unidade Operacional - Curitiba / PR" 
    [3] => string(145)
    " Status: Objeto recebido pelos Correios do Brasil Data : 16/10/2021 | Hora: 11:45 Local: Unidade de Logística Integrada - Curitiba / PR" 
    [4] => string(83)
    " Status: Objeto postado Data : 14/10/2021 | Hora: 20:30 Local: País - / "
  }

This is my desired output:

"eventos": [{
    "status": "Objeto em trânsito - por favor aguarde",
    "data": "24/10/2021",
    "hora": "12:04",
    "origem": "Unidade de Tratamento - Jaboatao Dos Guararapes / PE",
    "destino": "Agência dos Correios - Cuitegi / PB"
  }, {
    "status": "Objeto em trânsito - por favor aguarde",
    "data": "19/10/2021",
    "hora": "00:03",
    "origem": "Unidade de Logística Integrada - Curitiba / PR",
    "destino": "Unidade de Tratamento - Recife / PE"
  }, {
    "status": "Fiscalização aduaneira finalizada",
    "data": "18/10/2021",
    "hora": "23:35",
    "local": "Unidade Operacional - Curitiba / PR"
  }, {
    "status": "Objeto recebido pelos Correios do Brasil",
    "data": "16/10/2021",
    "hora": "11:45",
    "local": "Unidade de Logística Integrada - Curitiba / PR"
  }, {
    "status": "Objeto postado",
    "data": "14/10/2021",
    "hora": "20:30",
    "local": "País - /"
  }]

4 Answers

2

Maybe something like this:

function json_encode_pretty($data, int $extra_flags = 0, int $exclude_flags = 0): string
{
    // prettiest flags for: 7.3.9
    $flags = JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | (defined("JSON_UNESCAPED_LINE_TERMINATORS") ? JSON_UNESCAPED_LINE_TERMINATORS : 0) | JSON_PRESERVE_ZERO_FRACTION | (defined("JSON_THROW_ON_ERROR") ? JSON_THROW_ON_ERROR : 0);
    $flags = ($flags | $extra_flags) & ~ $exclude_flags;
    return (json_encode($data, $flags));
}


function loadHTML_noemptywhitespace(string $html, int $extra_flags = 0, int $exclude_flags = 0): \DOMDocument
{
    $flags = LIBXML_HTML_NODEFDTD | LIBXML_NOBLANKS | LIBXML_NONET;
    $flags = ($flags & ~ $exclude_flags) | $extra_flags;

    $domd = new \DOMDocument();
    $domd->preserveWhiteSpace = false;
    @$domd->loadHTML('<?xml encoding="UTF-8">' . $html, $flags);
    $removeAnnoyingWhitespaceTextNodes = function (\DOMNode $node) use (&$removeAnnoyingWhitespaceTextNodes): void {
        if ($node->hasChildNodes()) {
            // Warning: it's important to do it backwards; if you do it forwards, the index for DOMNodeList might become invalidated;
            // that's why i don't use foreach() - don't change it (unless you know what you're doing, ofc)
            for ($i = $node->childNodes->length - 1; $i >= 0; --$i) {
                $removeAnnoyingWhitespaceTextNodes($node->childNodes->item($i));
            }
        }
        if ($node->nodeType === XML_TEXT_NODE && !$node->hasChildNodes() && !$node->hasAttributes() && ! strlen(trim($node->textContent))) {
            //echo "Removing annoying POS";
            // var_dump($node);
            $node->parentNode->removeChild($node);
        } //elseif ($node instanceof DOMText) { echo "not removed"; var_dump($node, $node->hasChildNodes(), $node->hasAttributes(), trim($node->textContent)); }
    };
    $removeAnnoyingWhitespaceTextNodes($domd);
    return $domd;
}

$domd = loadHTML_noemptywhitespace($html); // $html holds the markup from the question
$xp = new DOMXPath($domd);
$extracted = [];
foreach ($xp->query("//div[contains(@class,'singlepost')]/ul") as $ul) {
    // Reset for every <ul>. PHP variable names are case-sensitive, so this
    // name must match the one used below, otherwise keys leak between events.
    $uldata = [];
    foreach ($xp->query("./li", $ul) as $li) {
        // Split on the first colon only, so values such as "12:04" stay intact.
        $data = explode(":", $li->nodeValue, 2);
        $uldata[trim($data[0])] = trim($data[1]);
    }
    $extracted[] = $uldata;
}
echo json_encode_pretty($extracted);

which prints:

[
    {
        "Status": "Objeto em trânsito - por favor aguarde",
        "Data": "24/10/2021 | Hora: 12:04",
        "Origem": "Unidade de Tratamento - Jaboatao Dos Guararapes / PE",
        "Destino": "Agência dos Correios - Cuitegi / PB"
    },
    {
        "Status": "Objeto em trânsito - por favor aguarde",
        "Data": "19/10/2021 | Hora: 00:03",
        "Origem": "Unidade de Logística Integrada - Curitiba / PR",
        "Destino": "Unidade de Tratamento - Recife / PE"
    },
    {
        "Status": "Fiscalização aduaneira finalizada",
        "Data": "18/10/2021 | Hora: 23:35",
        "Local": "Unidade Operacional - Curitiba / PR"
    },
    {
        "Status": "Objeto recebido pelos Correios do Brasil",
        "Data": "16/10/2021 | Hora: 11:45",
        "Local": "Unidade de Logística Integrada - Curitiba / PR"
    },
    {
        "Status": "Objeto postado",
        "Data": "14/10/2021 | Hora: 20:30",
        "Local": "País -  /"
    }
]
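
The result above still keeps the date and the time together under "Data". A minimal sketch of one way to separate them, assuming the value always looks like `dd/mm/yyyy | Hora: hh:mm`. The helper name `splitDataHora` is hypothetical and not part of the code above:

```php
<?php
// Hypothetical helper (not part of the answer above): splits a combined
// "Data" value such as "24/10/2021 | Hora: 12:04" into separate fields.
function splitDataHora(string $value): array
{
    // Assumes date and time are separated by "|" and that the time is
    // prefixed with "Hora:". The limit of 2 keeps the colon inside "12:04".
    $parts = explode('|', $value, 2);
    $hora = null;
    if (isset($parts[1])) {
        $hora = trim(explode(':', $parts[1], 2)[1] ?? '');
    }
    return ['Data' => trim($parts[0]), 'Hora' => $hora];
}

$fields = splitDataHora('24/10/2021 | Hora: 12:04');
echo json_encode($fields, JSON_UNESCAPED_SLASHES), PHP_EOL;
```

Inside the inner loop this could be applied only when the key is "Data", merging the returned pair into the current event. If the "| Hora:" part is missing, the sketch returns null for "Hora" rather than failing.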

2 Comments

Hey @hanshenrik, just one thing: I am trying to understand your code so I can separate values like "Data": "14/10/2021 | Hora: 20:30" into "Data": "14/10/2021" and "Hora": "20:30". Could you help me once more, please? I really need those values separated so I can insert them into the database.
Check how @Percian solved that particular issue. Something like: $data = explode(":", $li->nodeValue, 2); if (trim($data[0]) === "Data") { $data[1] = explode(" | Hora: ", $data[1]); $uldata["Data"] = trim($data[1][0]); $uldata["Hora"] = trim($data[1][1]); } else { $uldata[trim($data[0])] = trim($data[1]); }
1
$total = [];
$ind = 0;
foreach ($geral as $name) {
    // nodeValue holds the text of the whole <ul>, one <li> per line
    $s = explode("\n", $name->nodeValue);
    foreach ($s as $ss) {
        // Strip surrounding spaces and "\r" so they do not end up in the values
        $ss = trim($ss);
        if (str_contains($ss, "Status: ")) {
            $total[$ind]["status"] = str_replace('Status: ', '', $ss);
        }
        if (str_contains($ss, "Data  : ")) {
            // "Data  : 24/10/2021 | Hora: 12:04" becomes "24/10/2021 | 12:04"
            $data = str_replace('Data  : ', '', $ss);
            $data = str_replace('Hora: ', '', $data);
            $data = explode(" | ", $data);
            $total[$ind]["data"] = $data[0];
            $total[$ind]["hora"] = $data[1];
        }
        if (str_contains($ss, "Origem: ")) {
            $total[$ind]["origem"] = str_replace('Origem: ', '', $ss);
        }
        if (str_contains($ss, "Destino: ")) {
            $total[$ind]["destino"] = str_replace('Destino: ', '', $ss);
        }
        if (str_contains($ss, "Local: ")) {
            $total[$ind]["local"] = str_replace('Local: ', '', $ss);
        }
    }
    $ind++;
}

print_r($total);

Just make sure there is a newline after every li. Inconsistencies in the HTML may break the output. Sorry about that.

Tested on PHP 8.0 (str_contains() requires PHP 8.0 or later).
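
For PHP versions before 8.0, the same line-by-line idea can be sketched with strpos(). This is only an illustration under a few assumptions: the helper name `parseEventText` is hypothetical, every field sits on its own line, and the "Data" line always contains "| Hora:".

```php
<?php
// Hypothetical helper: parses the text of one <ul> block, one field per line.
// Uses strpos() so it also runs on PHP 7.x, where str_contains() is missing.
function parseEventText(string $text): array
{
    $event = [];
    foreach (explode("\n", $text) as $line) {
        $line = trim($line);
        foreach (['Status', 'Origem', 'Destino', 'Local'] as $label) {
            // strpos() === 0 means the line starts with the label.
            if (strpos($line, $label . ':') === 0) {
                $event[strtolower($label)] = trim(substr($line, strlen($label) + 1));
            }
        }
        if (strpos($line, 'Data') === 0) {
            // "Data  : 24/10/2021 | Hora: 12:04" is split into date and time.
            [$data, $hora] = explode('|', explode(':', $line, 2)[1], 2);
            $event['data'] = trim($data);
            $event['hora'] = trim(explode(':', $hora, 2)[1]);
        }
    }
    return $event;
}

$sample = "\nStatus: Objeto postado\nData  : 14/10/2021 | Hora: 20:30\nLocal: País -  / \n";
print_r(parseEventText($sample));
```

In the loop above, each `$name->nodeValue` would be passed to the helper and the returned array appended to `$total`.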

6 Comments

Hello Percian, thanks for the reply, but it isn't working. I am getting a 500 error (I am using PHP 7.3).
I think you should use explode(":", $nodeValue, 2).
str_contains only works on PHP 8.0 and later. On earlier versions of PHP you can use strpos instead, for example strpos($ss, "Status: ") !== false.
Right, I made the changes (replacing every "str_contains" with "str_pos"), but it is still not working. Any idea?
@hanshenrik Hey man, thanks for the comment, but could you explain where I need to put this code?
1

The solution is a little simpler if XPath is used twice: once for the ul tags and once for the li tags inside each ul. The splitting is then done with explode.

$doc = new DOMDocument();
$doc->loadHTML($htmlString);
$xpath = new DOMXPath($doc);

$geral = $xpath->query('//ul[@class="linha_status"]');

$total = [];
foreach ($geral as $node) {
  $sArr = [];
  $li = $xpath->query('li',$node);
  foreach($li as $item){
    $liVal = $item->nodeValue;
    $parts = explode("|",$liVal);
    foreach($parts as $part){
      list($key,$val) = explode(':',$part, 2); // limit 2: values such as "12:04" contain a colon
      $sArr[trim($key)] = trim($val);
    }
  }
  $total[] = $sArr;
}

$result = json_encode($total, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);

Try it yourself in the sandbox or at https://3v4l.org/3E2G3.

1

I definitely support @jspit's recommendation of using nested XPath calls for convenience; however, I prefer a few different coding choices. Here is the breakdown of my snippet below:

  1. Load the document with UTF-8 encoding to preserve multibyte characters
  2. Use xpath to iterate all <ul> tags with the qualifying class
  3. Use xpath to iterate all <li> tags nested within the qualifying <ul>
  4. Split the <li> text by pipes to form 1 or more segments -- no limiter is necessary
  5. Split each segment by the first occurring colon -- limiting the explosion to 2 parts is crucial because some segments contain multiple colons; removing spaces during this explosion saves having to call trim() twice later
  6. Push the key-value pair into the result array with a first-level index relating to the parent ul.
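
Step 5 can be looked at in isolation. The following is only a small sketch with a hard-coded segment (the variable names are illustrative), showing what the limit of 2 changes when the value itself contains a colon:

```php
<?php
// A segment whose value itself contains a colon.
$segment = ' Hora: 12:04';

// Without a limit, the time is split as well.
$unlimited = preg_split('/\s*:\s*/', trim($segment));

// With a limit of 2, only the first colon separates key from value.
$limited = preg_split('/\s*:\s*/', trim($segment), 2);

var_export($unlimited); // three parts: 'Hora', '12', '04'
echo PHP_EOL;
var_export($limited);   // two parts: 'Hora', '12:04'
```

Because the pattern also consumes the whitespace around the colon, neither the key nor the value needs a separate trim() afterwards.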

Code: (Demo)

$result = [];
$doc = new DOMDocument;
$doc->loadHTML('<?xml encoding="UTF-8">' . $htmlString);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//ul[@class="linha_status"]') as $i => $ul) {
    foreach ($xpath->query('li', $ul) as $li) {
        foreach (explode("|", $li->nodeValue) as $segment) {
            [$key, $result[$i][$key]] = preg_split('/\s*:\s*/', trim($segment), 2);
        }
    }
}
var_export($result);

Output:

array (
  0 => 
  array (
    'Status' => 'Objeto em trânsito - por favor aguarde',
    'Data' => '24/10/2021',
    'Hora' => '12:04',
    'Origem' => 'Unidade de Tratamento - Jaboatao Dos Guararapes / PE',
    'Destino' => 'Agência dos Correios - Cuitegi / PB',
  ),
  1 => 
  array (
    'Status' => 'Objeto em trânsito - por favor aguarde',
    'Data' => '19/10/2021',
    'Hora' => '00:03',
    'Origem' => 'Unidade de Logística Integrada - Curitiba / PR',
    'Destino' => 'Unidade de Tratamento - Recife / PE',
  ),
  2 => 
  array (
    'Status' => 'Fiscalização aduaneira finalizada',
    'Data' => '18/10/2021',
    'Hora' => '23:35',
    'Local' => 'Unidade Operacional - Curitiba / PR',
  ),
  3 => 
  array (
    'Status' => 'Objeto recebido pelos Correios do Brasil',
    'Data' => '16/10/2021',
    'Hora' => '11:45',
    'Local' => 'Unidade de Logística Integrada - Curitiba / PR',
  ),
  4 => 
  array (
    'Status' => 'Objeto postado',
    'Data' => '14/10/2021',
    'Hora' => '20:30',
    'Local' => 'País -  /',
  ),
)
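
The question asks for lowercase keys under an "eventos" wrapper. A possible follow-up step, sketched here with a shortened, hard-coded copy of the array so that it runs on its own (the `$payload` name is an assumption, not something from the question):

```php
<?php
// Shortened copy of the parsed structure, two events only, for illustration.
$result = [
    ['Status' => 'Objeto postado', 'Data' => '14/10/2021', 'Hora' => '20:30', 'Local' => 'País -  /'],
    ['Status' => 'Fiscalização aduaneira finalizada', 'Data' => '18/10/2021', 'Hora' => '23:35',
     'Local' => 'Unidade Operacional - Curitiba / PR'],
];

// array_change_key_case() lowercases the keys of each event by default.
$eventos = array_map('array_change_key_case', $result);

// Wrap the list under the "eventos" key used in the desired output.
$payload = ['eventos' => $eventos];

echo json_encode($payload, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
```

Note that json_encode() wraps the whole structure in an outer object, so the printed text starts with `{ "eventos": [ ... ] }` rather than with the bare key shown in the question.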
