2

I have the following string:

<w:pPr>
    <w:spacing w:line="240" w:lineRule="exact"/>
    <w:ind w:left="1890" w:firstLine="360"/>
    <w:rPr>
        <w:b/>
        <w:color w:val="00000A"/>
        <w:sz w:val="24"/>
    </w:rPr>
</w:pPr>

and I am trying to extract the value of the w:val attribute on the <w:sz> element using preg_match().

So far, I've tried:

preg_match('/<w:sz w:val="(\d)"/', $p, $fonts);

but this does not match anything, and I'm unsure why.

Any ideas?

Thank you in advance!

4
  • Why not: php.net/manual/en/book.simplexml.php? Commented Nov 12, 2015 at 19:42
  • @AbraCadaver I have looked at that a little bit. Do you know of any other PHP libraries or packages that convert docx xml to html? Commented Nov 12, 2015 at 20:02
  • Never used it but here's one: github.com/PHPOffice/PHPWord Commented Nov 12, 2015 at 20:16
  • @jldavis76 Have a look at my answer with DomDocument and SimpleXML below. Commented Nov 12, 2015 at 21:20

3 Answers

4

Your pattern (\d) captures exactly one digit, so it cannot match the two-digit value 24. Add a + after \d to mean "one or more":

preg_match('/<w:sz w:val="(\d+)"/', $p, $fonts);

I prefer [0-9]+ because it is easier to read and avoids any question of whether the \ needs doubling inside a PHP string:

preg_match('/<w:sz w:val="([0-9]+)"/', $p, $fonts);
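
For completeness, here is a minimal, self-contained sketch of the corrected call. It assumes $p holds a fragment like the one in the question (shortened here), and it checks the return value of preg_match() before reading the captured group. The variable name $size is only for illustration.

```php
<?php
// Sample input: a shortened copy of the fragment from the question.
$p = <<<'XML'
<w:pPr>
    <w:rPr>
        <w:b/>
        <w:color w:val="00000A"/>
        <w:sz w:val="24"/>
    </w:rPr>
</w:pPr>
XML;

// preg_match() returns 1 on a match, 0 on no match, and false on error.
if (preg_match('/<w:sz w:val="([0-9]+)"/', $p, $fonts) === 1) {
    // $fonts[0] is the full match, $fonts[1] is the first capture group.
    $size = (int) $fonts[1];
    echo "w:sz = $size\n"; // w:sz = 24
} else {
    echo "no w:sz element found\n";
}
```

Note that the pattern expects exactly one space between w:sz and w:val and no other attributes in between, so it is only as robust as the XML is regular.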

1 Comment

You, sir, are amazing. That was exactly the problem. Thank you.
3

You already have working code, but there are two other options: DOMDocument and SimpleXML. Both are slightly tricky here because the colons in the tag names are namespace prefixes. In the examples below I have added a container element to declare the namespace; your real XML will already declare one.

Solution 1 (the DOM way) looks up the element by its namespace URI and reads the attribute. Solution 2 (SimpleXML) does the same, perhaps in a more intuitive way.

The XML (using PHP heredoc syntax):

$xml = <<<EOF
<?xml version="1.0"?>
<container xmlns:w="http://example">
    <w:pPr>
        <w:spacing w:line="240" w:lineRule="exact"/>
        <w:ind w:left="1890" w:firstLine="360"/>
        <w:rPr>
            <w:b/>
            <w:color w:val="00000A"/>
            <w:sz w:val="24"/>
        </w:rPr>
    </w:pPr>
</container>
EOF;

Solution 1: Using DOMDocument

$dom = new DOMDocument();
$dom->loadXML($xml);

$ns = 'http://example';

$data = $dom->getElementsByTagNameNS($ns, 'sz')->item(0);
$attr = $data->getAttribute('w:val');
echo $attr; // 24

Solution 2: Using SimpleXML with Namespaces

$simplexml = simplexml_load_string($xml);
$namespaces = $simplexml->getNamespaces(true);
$items = $simplexml->children($namespaces['w']);

$val = $items->pPr->rPr->sz["val"]->__toString();
echo "val: $val"; // val: 24
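
The http://example URI above is only a placeholder. In a real docx document.xml, the w prefix is normally bound to the WordprocessingML main namespace, so a variant for actual Word XML might look like the sketch below. It uses DOMXPath rather than the two approaches above, and the sample document is a made-up, heavily trimmed stand-in for a real file.

```php
<?php
// Assumption: the document binds "w" to the WordprocessingML main namespace.
$ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical, trimmed-down stand-in for a real word/document.xml.
$xml = <<<XML
<?xml version="1.0"?>
<w:document xmlns:w="$ns">
    <w:body>
        <w:p>
            <w:pPr>
                <w:rPr>
                    <w:b/>
                    <w:sz w:val="24"/>
                </w:rPr>
            </w:pPr>
        </w:p>
    </w:body>
</w:document>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

// Bind the prefix used in the XPath expression to the namespace URI.
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('w', $ns);

// string() returns '' when no node matches, so no null check is needed.
$val = $xpath->evaluate('string(//w:rPr/w:sz/@w:val)');
echo $val === '' ? "no w:sz found\n" : "val: $val\n"; // val: 24
```

The prefix registered with registerNamespace() does not have to be the one used in the file; only the URI has to match.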

3 Comments

That definitely looks interesting. When I try the second solution, I get an error: "Trying to get property of non-object". Any idea why?
@jldavis76 You could use var_dump($items); to see if the items are found in the first place. Remember, this only works with my xml at the moment as I have made up a namespace. You will have to use your own, obviously.
I guess I'm just a little unfamiliar with the use of namespaces in this aspect. I'll have to look into it more. Thanks.
2

You just need a little correction to your regex:

<w:sz w:val="(\d+)"

So it goes:

preg_match('/<w:sz w:val="(\d+)"/', $p, $fonts);

Why? Because \d on its own matches exactly one digit, whereas \d+ matches one or more.
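
Where the + goes matters too. A small sketch comparing three variants on a one-line sample input (the variable names are only for illustration):

```php
<?php
$p = '<w:sz w:val="24"/>';

// (\d): exactly one digit before the closing quote, so "24" does not match.
$one = preg_match('/<w:sz w:val="(\d)"/', $p, $a);

// (\d)+: the pattern matches, but the group keeps only its last repetition.
$last = preg_match('/<w:sz w:val="(\d)+"/', $p, $b);

// (\d+): the quantifier is inside the group, so all digits are captured.
$all = preg_match('/<w:sz w:val="(\d+)"/', $p, $c);

var_dump($one);  // int(0)
var_dump($b[1]); // string(1) "4"
var_dump($c[1]); // string(2) "24"
```

So the quantifier belongs inside the parentheses if you want the whole number in the capture group.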

EDIT:

In case you need it, there are some great online regex testing tools, like https://regex101.com/. Try your expressions there before using them, just in case. You never know ;)

1 Comment

Oh, sorry. You're right! I'll correct it right away.
