
The script below works. It parses an XML document and looks up a particular node under the namespace "dei".

But is relying on a regex to find the namespace definition the proper way? (I do not really know XML, so I worry that such a regex is not foolproof for all EDGAR XMLs. For example, are such definitions always enclosed in double quotes and preceded by xmlns:?)

Thanks.

use strict;
use warnings;

use LWP::Simple;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $url = 'https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664/acef-20161231.xml';
my $xml = LWP::Simple::get($url);
my $dom = XML::LibXML->load_xml(string => $xml);

my @nsDefs = ($xml =~ /xmlns:dei="(.+?)"/g);
die "Namespace definition must be unique!\n" unless @nsDefs == 1;

my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs('dei', $nsDefs[0]);

my @matches = $xpc->findnodes('//dei:TradingSymbol');
print 'Number of matches = ', scalar(@matches), "\n";

Output:

Number of matches = 1
  • No, they can be in single quotes, and someone could have the weird idea of replacing a / with &#47; for instance. Long story short, you can't parse XML with regexes; it will never do the full job. More importantly, you can't search for a node that contains xmlns:something. This information has no value, and there is no reason why the node that declares it is the one you want. Nor is there any reason for this declaration to be unique in the document. Maybe it is, maybe it's not, and it's none of your business. You shouldn't be looking for it. What you're looking for is something else. Commented Sep 12, 2017 at 20:54
  • Thx Kumesana. What you said is exactly what I feared. But what is the proper way then? My situation: All the XMLs I work with will use a "dei" namespace, which is of interest to me. But different XMLs may have different definitions for "dei". So how am I supposed to know what the definition is (in order to parse it with a DOM)? For example, this XML has a different definition than that in my OP. sec.gov/Archives/edgar/data/104207/000010420712000098/… Commented Sep 12, 2017 at 21:06
  • See the other answer, they understood better than I what you had in mind. Commented Sep 12, 2017 at 21:09
  • Re "So how am I supposed to know what the definition is", That's not the right question. Both namespaces/specs could be used in the same doc. The correct question is: Which specs (and thus namespaces) are used by the doc? Commented Sep 13, 2017 at 6:21

6 Answers


The only important thing about a namespace in XML is the URI. Your code is assuming a namespace prefix of dei, using that to locate the namespace declaration and determine that the URI is http://xbrl.sec.gov/dei/2014-01-31. This is exactly backwards. The thing you should be hard-coding in your script is the URI - it won't change. The namespace prefix is theoretically variable and a different prefix might be used for the same URI in other documents.
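To illustrate the point, here is a minimal sketch, assuming three made-up sample documents (not real EDGAR filings) that bind the same URI to different prefixes, or to no prefix at all. The prefix d registered on the XPath context is an arbitrary choice and is private to the query.

```perl
use strict;
use warnings;

use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

# Hypothetical sample documents: same namespace URI, different prefixes.
my $uri  = 'http://xbrl.sec.gov/dei/2014-01-31';
my @docs = (
    qq{<root xmlns:dei="$uri"><dei:TradingSymbol>ABC</dei:TradingSymbol></root>},
    qq{<root xmlns:x='$uri'><x:TradingSymbol>DEF</x:TradingSymbol></root>},
    qq{<root><TradingSymbol xmlns="$uri">GHI</TradingSymbol></root>},
);

# The prefix used in the XPath expression does not have to match
# whatever prefix the document happens to use; only the URI matters.
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( d => $uri );

my @symbols;
for my $xml (@docs) {
    my $doc = XML::LibXML->load_xml( string => $xml );
    push @symbols, map { $_->textContent } $xpc->findnodes( '//d:TradingSymbol', $doc );
}

print join( ',', @symbols ), "\n";
```

All three documents should yield a match, even though only the first one uses the prefix dei.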


3 Comments

Come to think of it, there could also be no prefix at all.
Grant. Please see my comments above. My actual situation is that I know all the XMLs will hold the information I need under a namespace "dei". But sometimes it is xbrl.sec.gov/dei/2014-01-31 (but other times, it could be "xbrl.sec.gov/dei/2012-01-31" -- depending on the time the XML was produced). What is the proper thing to do?
OK, I understand now. @ikegami's solution of registering both URIs and using an XPath query to match either is the way I would do it too.

dei is not a namespace; it's a prefix that's only meaningful in that particular document. You can't count on the namespace's prefix always being dei.

http://xbrl.sec.gov/dei/2014-01-31 is the namespace. That's the thing that can't change, and that you should be basing your code around.

In a comment, you mentioned you have to deal with multiple specs. Just create an XPath prefix for each spec you support.

use strict;
use warnings;

use LWP::Simple               qw( );
use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

my $url = 'https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664/acef-20161231.xml';

my $xml = LWP::Simple::get($url);

my $doc = XML::LibXML->load_xml(string => $xml);

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( d1 => 'http://xbrl.sec.gov/dei/2012-01-31' );
$xpc->registerNs( d2 => 'http://xbrl.sec.gov/dei/2014-01-31' );

my @matches = $xpc->findnodes('//d1:TradingSymbol|//d2:TradingSymbol', $doc);
print "Number of matches = ", 0+@matches, "\n";


Use getNamespaces():

my @ns_dei = grep { $_->name eq 'xmlns:dei' } $dom->documentElement()->getNamespaces();

die "Namespace definition must be unique!\n" if @ns_dei != 1;

my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs( 'dei', $ns_dei[0]->value );


I understand that your problem is that the XML you read will not always bind the same namespace URI to the dei: prefix and to the elements you're looking for.

In that case the XML you're stuck with is ill-designed and there is no good practice established for that. This XML is using namespaces wrong and you will need to work around that. For information, changing an element's namespace is by definition changing its name, and therefore the most basic information you're using to find it.

Your best bet is to ignore namespaces altogether. You can do that with

//*[local-name() = "TradingSymbol"]

If the number of different namespaces you can get is limited to a select few, you could instead list them all, as dei: and dei2012: for instance, and select for both:

//dei:TradingSymbol | //dei2012:TradingSymbol
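As a rough sketch of the local-name() approach, the snippet below runs it against a made-up document that also contains an unrelated element with the same local name. The second query narrows the match with the standard XPath functions namespace-uri() and starts-with(); it assumes that every dei namespace of interest begins with http://xbrl.sec.gov/dei/, which holds for the two URIs mentioned in this thread but is not guaranteed for others.

```perl
use strict;
use warnings;

use XML::LibXML qw( );

# Hypothetical sample document; the "other" namespace is made up.
my $xml = <<'XML';
<root xmlns:dei="http://xbrl.sec.gov/dei/2012-01-31" xmlns:other="http://example.com/other">
  <dei:TradingSymbol>ABC</dei:TradingSymbol>
  <other:TradingSymbol>not wanted</other:TradingSymbol>
</root>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# Ignores namespaces entirely, so it also picks up other:TradingSymbol.
my @any = $doc->findnodes('//*[local-name() = "TradingSymbol"]');

# Assumption: all dei namespace URIs share this leading part.
my @dei = $doc->findnodes(
    '//*[local-name() = "TradingSymbol"'
  . ' and starts-with(namespace-uri(), "http://xbrl.sec.gov/dei/")]'
);

print "Any namespace: ", scalar(@any), "\n";
print "dei only:      ", scalar(@dei), "\n";
```

No prefix needs to be registered here, because neither expression uses one. The trade-off is that the first query can match same-named elements from unrelated vocabularies.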


Never use regular expressions to process XML: your code will always be wrong. Your example has at least five bugs: it will fail to match if a different prefix is used, it will fail to match if single quotes are used, it will fail to match if there is whitespace around the "=" sign, it will error if the namespace declaration is duplicated, and it will give a spurious match if there is "commented out" XML in the source document.

It is theoretically impossible to eliminate these bugs, because regular expressions are not powerful enough to parse XML correctly.

Always use a real XML parser, and XPath.
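Two of the failure modes listed above (single quotes and whitespace around the "=" sign) can be reproduced with a small sketch. The sample document is made up but is well-formed XML; the regex is the one from the question, and the prefix d registered for XPath is arbitrary.

```perl
use strict;
use warnings;

use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

# Hypothetical sample: single quotes and spaces around "=" are both legal XML.
my $xml = <<'XML';
<root xmlns:dei = 'http://xbrl.sec.gov/dei/2014-01-31'>
  <dei:TradingSymbol>ABC</dei:TradingSymbol>
</root>
XML

# The regex from the question finds nothing in this document.
my @regex_hits = ( $xml =~ /xmlns:dei="(.+?)"/g );

# A parser plus XPath, keyed on the namespace URI, is unaffected.
my $doc = XML::LibXML->load_xml( string => $xml );
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( d => 'http://xbrl.sec.gov/dei/2014-01-31' );
my @nodes = $xpc->findnodes( '//d:TradingSymbol', $doc );

print "Regex matches = ", scalar(@regex_hits), "\n";
print "XPath matches = ", scalar(@nodes), "\n";
```

With this input the regex should report zero matches, while the XPath query should find the one TradingSymbol element.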

5 Comments

Anonymous downvoter: Downvote only wrong, especially harmfully wrong, answers -- not answers that you don't want to hear. This answer is correct; its reasoning, sound.
Re "regular expressions are not powerful enough to parse XML correctly.", That's not true. XML is trivial to parse using regex. That's not the reason regex are discouraged. They are discouraged because using them to parse XML is reinventing the wheel (XML parser), and it's almost guaranteed to be reinvented really, really poorly.
@ikegami, you are 100% wrong. XML is not a regular language, because its grammar is recursive. It cannot be parsed correctly using regular expressions.
It doesn't have to be a regular language to be parsed using the regular expressions the OP is using. Furthermore, the OP didn't give any indication that they would parse the document using a single match operator. When you're done erecting straw men (by pretending the OP is doing something completely different than they are doing) just so you can lecture them and sound smart, please fix your answer.
I'll fix my answer when I see a regular expression used to parse XML without any bugs in it.

Thanks to everyone who answered. I am very inexperienced at using Perl to grab data from the Internet (SEC EDGAR filings in this particular case), so I am probably not even asking the most intelligent questions.

The business problem (to the best of my understanding): 1) When a company files its 10-K/10-Q using XBRL, the SEC wants the trading symbol disclosed based on one of the SEC's published schemas. 2) The complete list of schema locations is known (and will grow):

-- http://taxonomies.xbrl.us/us-gaap/2009/non-gaap/dei-2009-01-31.xsd
-- https://xbrl.sec.gov/dei/2012/dei-2012-01-31.xsd
-- https://xbrl.sec.gov/dei/2013/dei-2013-01-31.xsd
-- https://xbrl.sec.gov/dei/2014/dei-2014-01-31.xsd

3) I want to grab such trading symbol information.

I now understand that the "dei" namespace-prefix has no real significance. But it seems that even the namespace-name itself e.g. 'http://xbrl.sec.gov/dei/2012-01-31' has no significance. Only the schema location is truly meaningful. Is this correct?

My understanding is that the XBRL instance document references a schema document which "maps" the namespace (e.g. http://xbrl.sec.gov/dei/2012-01-31) to the schema location. (So the namespace-name only needs to be a unique string.)

So is there a way to modify ikegami's code to use the schema locations instead of the namespace names?

Example of a complete XBRL filing: https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664
