Should I Use Regex to Find the XML Namespace Definition?

Question

The script below works. It parses an XML document and looks up a particular node under the namespace "dei".

But is relying on regex for the namespace definition the proper way? (I do not really know XML. So I worry that such regex is not fool-proof for all Edgar XMLs. For example -- are such definitions always enclosed in double quotes and preceded by xmlns: ?)

Thanks.

use strict;
use warnings;

use LWP::Simple;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $url = 'https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664/acef-20161231.xml';
my $xml = LWP::Simple::get($url);
my $dom = XML::LibXML->load_xml(string => $xml);

my @nsDefs = ($xml =~ /xmlns:dei="(.+?)"/g);
die "Namespace definition must be unique!\n" unless @nsDefs == 1;

my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs('dei', $nsDefs[0]);

my @matches = $xpc->findnodes('//dei:TradingSymbol');
print 'Number of matches = ', scalar(@matches), "\n";

Output:

Number of matches = 1

No, they can be in single quotes, and someone could have the weird idea of replacing a / with an equivalent character reference such as &#47; for instance. Long story short, you can't parse XML with regexes; it will never do the full job. More importantly, you can't search for a node that contains xmlns:something. This information has no value, and there is no reason why the node that declares it is the one you want, nor for this declaration to be unique in the document. Maybe it is, maybe it's not, and it's none of your business. You shouldn't be looking for it. What you're looking for is something else.
– kumesana, Commented Sep 12, 2017 at 20:54
Thanks, kumesana. What you said is exactly what I feared. But what is the proper way then? My situation: all the XMLs I work with will use a "dei" namespace, which is of interest to me. But different XMLs may have different definitions for "dei". So how am I supposed to know what the definition is (in order to parse it with a DOM)? For example, this XML has a different definition than the one in my OP: sec.gov/Archives/edgar/data/104207/000010420712000098/…
– Shang Zhang, Commented Sep 12, 2017 at 21:06
See the other answer; they understood better than I did what you had in mind.
– kumesana, Commented Sep 12, 2017 at 21:09
Re "So how am I supposed to know what the definition is": that's not the right question. Both namespaces/specs could be used in the same doc. The correct question is: which specs (and thus namespaces) are used by the doc?
– ikegami, Commented Sep 13, 2017 at 6:21

Grant McLean · Accepted Answer · 2017-09-12 21:00:27Z

1

The only important thing about a namespace in XML is the URI. Your code is assuming a namespace prefix of dei, using that to locate the namespace declaration and determine that the URI is http://xbrl.sec.gov/dei/2014-01-31. This is exactly backwards. The thing you should be hard-coding in your script is the URI - it won't change. The namespace prefix is theoretically variable and a different prefix might be used for the same URI in other documents.
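To illustrate the point about URIs versus prefixes, here is a minimal sketch. The two inline documents, the root element name `xbrl`, the symbol `ABC`, and the prefixes `foo` and `d` are made up for the example; only the namespace URI comes from the question. One document binds the URI to an arbitrary prefix and the other uses it as a default namespace with no prefix at all, yet the same XPath finds the element in both.

```perl
use strict;
use warnings;

use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

# Hypothetical sample documents: same namespace URI, bound in two different ways.
my %docs = (
    prefixed => '<xbrl xmlns:foo="http://xbrl.sec.gov/dei/2014-01-31">'
              . '<foo:TradingSymbol>ABC</foo:TradingSymbol></xbrl>',
    default  => '<xbrl><TradingSymbol xmlns="http://xbrl.sec.gov/dei/2014-01-31">'
              . 'ABC</TradingSymbol></xbrl>',
);

my $xpc = XML::LibXML::XPathContext->new();
# "d" is a prefix chosen by this script; it does not need to appear in any document.
$xpc->registerNs( d => 'http://xbrl.sec.gov/dei/2014-01-31' );

my %symbols;
for my $name (sort keys %docs) {
    my $doc = XML::LibXML->load_xml(string => $docs{$name});
    # The match is made on the namespace URI, not on the prefix used in the document.
    $symbols{$name} = $xpc->findvalue('//d:TradingSymbol', $doc);
    print "$name: $symbols{$name}\n";
}
```

The prefix registered with `registerNs` lives only in the XPath context, which is why the script never needs to know what prefix a given filing happens to use.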

3 Comments

kumesana Over a year ago

Come to think of it, there could also be no prefix at all.

Shang Zhang Over a year ago

Grant. Please see my comments above. My actual situation is that I know all the XMLs will hold the information I need under a namespace "dei". But sometimes it is xbrl.sec.gov/dei/2014-01-31 (but other times, it could be "xbrl.sec.gov/dei/2012-01-31" -- depending on the time the XML was produced). What is the proper thing to do?

Grant McLean Over a year ago

OK, I understand now. @ikegami's solution of registering both URIs and using an XPath query to match either is the way I would do it too.

ikegami · Accepted Answer · 2017-09-13 06:22:41Z

dei is not a namespace; it's a prefix that's only meaningful in that particular document. You can't count on the namespace's prefix always being dei.

http://xbrl.sec.gov/dei/2014-01-31 is the namespace. That's the thing that can't change, and that you should be basing your code around.

In a comment, you mentioned you have to deal with multiple specs. Just create an XPath prefix for each spec you support.

use strict;
use warnings;

use LWP::Simple               qw( );
use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

my $url = 'https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664/acef-20161231.xml';

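# LWP::Simple::get returns undef if the download fails.
my $xml = LWP::Simple::get($url);
defined($xml)
   or die("Can't fetch $url\n");

my $doc = XML::LibXML->load_xml(string => $xml);

# Register one prefix of our own choosing per supported namespace URI.
# These prefixes are local to the XPath context; they need not match the document's.
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( d1 => 'http://xbrl.sec.gov/dei/2012-01-31' );
$xpc->registerNs( d2 => 'http://xbrl.sec.gov/dei/2014-01-31' );

# Match TradingSymbol in either namespace.
my @matches = $xpc->findnodes('//d1:TradingSymbol|//d2:TradingSymbol', $doc);
print "Number of matches = ", 0+@matches, "\n";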

Miller · Accepted Answer · 2017-09-13 13:07:17Z

1

Use getNamespaces():

my @ns_dei = grep { $_->name eq 'xmlns:dei' } $dom->documentElement()->getNamespaces();

die "Namespace definition must be unique!\n" if @ns_dei != 1;

my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs( 'dei', $ns_dei[0]->value );

edited Sep 13, 2017 at 13:07

answered Sep 13, 2017 at 3:49

kumesana · Accepted Answer · 2017-09-13 10:07:24Z

I understand that your problem is that the XML you read will not always use the same namespace URI for the dei: prefix and the elements you're looking for.

In that case the XML you're stuck with is ill-designed and there is no good practice established for that. This XML is using namespaces wrong and you will need to work around that. For information, changing an element's namespace is by definition changing its name, and therefore the most basic information you're using to find it.

Your best bet is to ignore namespaces altogether. You can do that with

//*[local-name() = "TradingSymbol"]

If the number of different namespaces you can get is limited to a select few, you could instead list them all, as dei: and dei2012: for instance, and select for both:

//dei:TradingSymbol | //dei2012:TradingSymbol
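A possible middle ground between the two queries above is to match on the local name but restrict the namespace URI to a known base. The sketch below assumes that every namespace of interest starts with `http://xbrl.sec.gov/dei/`, which holds for the 2012 and 2014 URIs mentioned in this thread but may not hold for other taxonomy versions. The sample document, its prefixes, and its values are made up.

```perl
use strict;
use warnings;

use XML::LibXML qw( );

# Hypothetical sample: two dei namespaces plus an unrelated one.
my $xml = <<'XML';
<xbrl xmlns:a="http://xbrl.sec.gov/dei/2012-01-31"
      xmlns:b="http://xbrl.sec.gov/dei/2014-01-31"
      xmlns:other="http://example.com/not-dei">
  <a:TradingSymbol>OLD</a:TradingSymbol>
  <b:TradingSymbol>NEW</b:TradingSymbol>
  <other:TradingSymbol>NOPE</other:TradingSymbol>
</xbrl>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Match on the local name, but only in namespaces under the assumed base URI.
my $xpath = '//*[local-name() = "TradingSymbol"'
          . ' and starts-with(namespace-uri(), "http://xbrl.sec.gov/dei/")]';

# No prefix registration is needed because the expression uses no prefixes.
my @symbols = map { $_->textContent } $doc->findnodes($xpath);
print "@symbols\n";
```

This skips same-named elements from unrelated namespaces, at the cost of hard-coding an assumption about how the URIs are formed.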

Michael Kay · Accepted Answer · 2017-09-13 16:28:27Z

0

Never use regular expressions to process XML: your code will always be wrong. Your example has at least five bugs: it will fail to match if a different prefix is used, it will fail to match if single quotes are used, it will fail to match if there is whitespace around the "=" sign, it will error if the namespace declaration is duplicated, and it will give a spurious match if there is "commented out" XML in the source document.

It is theoretically impossible to eliminate these bugs, because regular expressions are not powerful enough to parse XML correctly.

Always use a real XML parser, and XPath.
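Two of the failure modes listed above (single quotes and whitespace around the "=" sign) can be reproduced with a small, made-up document. The sketch below runs the regex from the question against it and then asks the parser to resolve the same prefix with `lookupNamespaceURI`. It still assumes the prefix is `dei`, so it only illustrates the regex problem and is not a recommendation to look namespaces up by prefix.

```perl
use strict;
use warnings;

use XML::LibXML qw( );

# Hypothetical but well-formed: single quotes and whitespace around "=".
my $xml = q{<xbrl xmlns:dei = 'http://xbrl.sec.gov/dei/2014-01-31'>}
        . q{<dei:TradingSymbol>ABC</dei:TradingSymbol></xbrl>};

# The regex from the question expects double quotes and no whitespace.
my @by_regex = ($xml =~ /xmlns:dei="(.+?)"/g);

# The parser resolves the prefix declared on the root element.
my $root      = XML::LibXML->load_xml(string => $xml)->documentElement;
my $by_parser = $root->lookupNamespaceURI('dei');

print "regex matches: ", scalar(@by_regex), "\n";
print "parser says:   $by_parser\n";
```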

edited Sep 13, 2017 at 16:28

answered Sep 12, 2017 at 22:46

5 Comments

kjhughes Over a year ago

Anonymous downvoter: Downvote only wrong, especially harmfully wrong, answers -- not answers that you don't want to hear. This answer is correct; its reasoning, sound.

ikegami Over a year ago

Re "regular expressions are not powerful enough to parse XML correctly.", That's not true. XML is trivial to parse using regex. That's not the reason regex are discouraged. They are discouraged because using them to parse XML is reinventing the wheel (XML parser), and it's almost guaranteed to be reinvented really, really poorly.

Michael Kay Over a year ago

@ikegami, you are 100% wrong. XML is not a regular language, because its grammar is recursive. It cannot be parsed correctly using regular expressions.

ikegami Over a year ago

It doesn't have to be a regular language to be parsed using the regular expressions the OP is using. Furthermore, the OP didn't give any indication that they would parse the document using a single match operator. When you're done erecting straw men (by pretending the OP is doing something completely different than they are doing) just so you can lecture them and sound smart, please fix your answer.

Michael Kay Over a year ago

I'll fix my answer when I see a regular expression used to parse XML without any bugs in it.

Shang Zhang · Accepted Answer · 2017-09-14 20:32:32Z

Thanks to everyone who answered. I am very inexperienced in using Perl to grab data from the Internet (SEC Edgar filings in this particular case). So I am probably not even asking the most intelligent questions.

The business problem (per my best understanding):

1) When a company files its 10K/Q using XBRL, the SEC wants the trading symbol information disclosed based on one of the SEC's published schemas.

2) The complete list of schema locations is known (and will grow):

-- http://taxonomies.xbrl.us/us-gaap/2009/non-gaap/dei-2009-01-31.xsd
-- https://xbrl.sec.gov/dei/2012/dei-2012-01-31.xsd
-- https://xbrl.sec.gov/dei/2013/dei-2013-01-31.xsd
-- https://xbrl.sec.gov/dei/2014/dei-2014-01-31.xsd

3) I want to grab such trading symbol information.

I now understand that the "dei" namespace-prefix has no real significance. But it seems that even the namespace-name itself e.g. 'http://xbrl.sec.gov/dei/2012-01-31' has no significance. Only the schema location is truly meaningful. Is this correct?

My understanding is that the XBRL instance document references a schema document which "maps" the namespace (e.g. http://xbrl.sec.gov/dei/2012-01-31) to the schema location. (So the namespace-name only needs to be a unique string.)

So is there a way to modify ikegami's code to use the schema locations instead of the namespace names?

Example of a complete XBRL filing: https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664
