I'm dealing with an XML sequence file like the one below. Can anyone suggest how to parse it?

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<name>ccccc</name>
<document-id>
<country>US</country>
<doc-number>D0629997</doc-number>
<kind>S1</kind>
<date>20110104</date>
</document-id>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<name>dddd</name>
<document-id>
<country>US</country>
<doc-number>D0629998</doc-number>
<kind>S2</kind>
<date>20110104</date>
</document-id>
  • Why do you have such an XML file in the first place? Commented Apr 19, 2011 at 15:03
  • As per XML spec your document looks invalid ... multiple identical processing instructions, no unique root node, afaik !DOCTYPE is not a valid node name, it is not closed ... I doubt there is a parser that will take it without complaining ... Commented Apr 19, 2011 at 15:09
  • X-Ref: stackoverflow.com/q/10780560/367456 Commented Aug 3, 2014 at 23:08

3 Answers

That's not a valid XML file. It looks like two files in one, but even then each part is invalid. Assuming those are two separate files, you could try "tidying" them first. Assuming $xml is a string containing the XML contents:

$xml = tidy_repair_string($xml, array(
    'output-xml' => true,
    'input-xml' => true
)); 

Then you could use SimpleXml on it:

$xml = new SimpleXmlElement($xml);
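
As an alternative under the same "separate files" assumption, the string can be split at each XML declaration before parsing. The sketch below is only an illustration: the helper name `splitXmlSequence` and the synthetic `<record>` root are assumptions, the latter added because each chunk of the sample in the question has two top-level elements. It also assumes the DOCTYPE has an empty internal subset, as in that sample.

```php
<?php
/**
 * Split a string holding several concatenated XML documents into chunks,
 * one per XML declaration. Hypothetical helper for illustration.
 *
 * @return string[] chunk bodies without the declaration and the DOCTYPE
 */
function splitXmlSequence(string $xml): array
{
    // Split in front of every "<?xml" so each chunk holds one document.
    $chunks = preg_split('/(?=<\?xml\s)/', $xml, -1, PREG_SPLIT_NO_EMPTY);

    $documents = [];
    foreach ($chunks as $chunk) {
        // Drop the declaration and the DOCTYPE (assumes no ">" inside it).
        $body = preg_replace('/<\?xml[^?]*\?>/', '', $chunk);
        $body = preg_replace('/<!DOCTYPE[^>]*>/', '', $body);
        $body = trim($body);
        if ($body !== '') {
            $documents[] = $body;
        }
    }
    return $documents;
}

$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<name>ccccc</name>
<document-id>
<doc-number>D0629997</doc-number>
</document-id>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<name>dddd</name>
<document-id>
<doc-number>D0629998</doc-number>
</document-id>
XML;

foreach (splitXmlSequence($xml) as $body) {
    // Wrap the chunk in one root element so it becomes well-formed.
    $record = new SimpleXMLElement('<record>' . $body . '</record>');
    echo $record->name, ' ', $record->{'document-id'}->{'doc-number'}, "\n";
}
```

If every chunk already has a single root element, the wrapping step can be dropped and the chunk passed to SimpleXMLElement directly.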

1 Comment

Hi, thanks for your response. I have a lot of files like this to parse, so can you explain in more detail? What is tidying, and how can I read the contents of the file? fread does not work properly for this kind of file.

I know where this XML file has come from and I find it quite strange that Google would provide some invalid XML (unless they are simply just hosting this file that they got from somewhere else). This suggestion for parsing it worked for me: How to parse an xml file with multiple xml declaration using PHP? (A concatenation of several XML files)

That file contains a sequence of XML documents concatenated one after another. You need to register a PHP stream wrapper that transparently divides the file for you; then you can process each document individually, even in a streaming fashion. Example:

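// Register the stream wrapper shipped with the XMLReaderIterator library
// (the library must already be loaded, e.g. through an autoloader).
stream_wrapper_register('xmlseq', 'XMLSequenceStream');

$path = "xmlseq://zip://ipg140107.zip#ipg140107.xml";

// Each pass of the outer loop processes one document of the sequence.
while (XMLSequenceStream::notAtEndOfSequence($path)) {
    $reader = new XMLReader();
    $reader->open($path);
    // just consume the whole document
    while ($reader->next()) {
        XMLReaderNode::dump($reader);
    }
}

XMLSequenceStream::clean();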

That stream wrapper is part of the XMLReaderIterator library and also works with SimpleXMLElement or DOMDocument, although XMLReader is a better fit for larger files.
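
To illustrate the idea of dividing the sequence without the library, here is a minimal sketch that reads a stream line by line and yields one document at a time. It is not how XMLSequenceStream is implemented; the function name `readXmlSequence` is hypothetical, and the sketch assumes every XML declaration starts on its own line, as in the sample from the question.

```php
<?php
/**
 * Yield each XML document of a concatenated sequence as a string.
 * Hypothetical helper; assumes every "<?xml" declaration starts a line.
 *
 * @param resource $handle readable stream, e.g. from fopen()
 */
function readXmlSequence($handle): Generator
{
    $buffer = '';
    while (($line = fgets($handle)) !== false) {
        // A new declaration marks the start of the next document.
        if (strncmp($line, '<?xml', 5) === 0 && trim($buffer) !== '') {
            yield $buffer;
            $buffer = '';
        }
        $buffer .= $line;
    }
    if (trim($buffer) !== '') {
        yield $buffer; // last document in the stream
    }
}

// Usage with an in-memory stream standing in for the real file.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "<?xml version=\"1.0\"?>\n<a><b>1</b></a>\n<?xml version=\"1.0\"?>\n<a><b>2</b></a>\n");
rewind($handle);

foreach (readXmlSequence($handle) as $document) {
    $element = new SimpleXMLElement($document);
    echo $element->b, "\n";
}
fclose($handle);
```

Only one document is held in memory at a time, which is the main point of splitting the sequence lazily rather than loading the whole file into a string.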

For the file used in my example (http://storage.googleapis.com/patents/grant_full_text/2014/ipg140107.zip from https://www.google.com/googlebooks/uspto-patents-grants-text.html), the overall element structure, with the number of occurrences of each element across all trees in the sequence, is:

\-us-patent-grant (473)
  |-us-bibliographic-data-grant (473)
  | |-publication-reference (473)
  | | \-document-id (473)
  | |   |-country (473)
  | |   |-doc-number (473)
  | |   |-kind (473)
  | |   \-date (473)
  | |-application-reference (473)
  | | \-document-id (473)
  | |   |-country (473)
  | |   |-doc-number (473)
  | |   \-date (473)
  | |-us-application-series-code (473)
  | |-us-term-of-grant (470)
  | | |-length-of-grant (450)
  | | |-disclaimer (18)
  | | | \-text (18)
  | | \-us-term-extension (20)
  | |-classification-locarno (450)
  | | |-edition (450)
  | | \-main-classification (450)
  | |-classification-national (473)
  | | |-country (473)
  | | |-main-classification (473)
  | | \-further-classification (143)
  | |-invention-title (473)
  | | \-i (12)
  | |-us-references-cited (458)
  | | \-us-citation (11000)
  | |   |-patcit (10265)
  | |   | \-document-id (10265)
  | |   |   |-country (10265)
  | |   |   |-doc-number (10265)
  | |   |   |-kind (9884)
  | |   |   |-name (9811)
  | |   |   \-date (10264)
  | |   |-category (10999)
  | |   |-classification-national (6309)
  | |   | |-country (6309)
  | |   | \-main-classification (6309)
  | |   |-nplcit (735)
  | |   | \-othercit (735)
  | |   |   |-sub (281)
  | |   |   |-i (7)
  | |   |   \-sup (1)
  | |   \-classification-cpc-text (1)
  | |-number-of-claims (472)
  | |-us-exemplary-claim (472)
  | |-us-field-of-classification-search (472)
  | | \-classification-national (8991)
  | |   |-country (8991)
  | |   |-main-classification (8991)
  | |   \-additional-info (1205)
  | |-figures (472)
  | | |-number-of-drawing-sheets (472)
  | | \-number-of-figures (472)
  | |-us-parties (472)
  | | |-us-applicants (472)
  | | | \-us-applicant (765)
  | | |   |-addressbook (765)
  | | |   | |-last-name (573)
  | | |   | |-first-name (573)
  | | |   | |-address (765)
  | | |   | | |-city (765)
  | | |   | | |-country (765)
  | | |   | | \-state (423)
  | | |   | \-orgname (192)
  | | |   \-residence (765)
  | | |     \-country (765)
  | | |-inventors (472)
  | | | \-inventor (969)
  | | |   \-addressbook (969)
  | | |     |-last-name (969)
  | | |     |-first-name (969)
  | | |     \-address (969)
  | | |       |-city (969)
  | | |       |-country (969)
  | | |       \-state (519)
  | | \-agents (429)
  | |   \-agent (500)
  | |     \-addressbook (500)
  | |       |-orgname (361)
  | |       |-address (500)
  | |       | \-country (500)
  | |       |-last-name (139)
  | |       \-first-name (139)
  | |-assignees (385)
  | | \-assignee (391)
  | |   |-addressbook (390)
  | |   | |-orgname (386)
  | |   | |-role (390)
  | |   | |-address (390)
  | |   | | |-city (355)
  | |   | | |-country (390)
  | |   | | \-state (192)
  | |   | |-last-name (4)
  | |   | \-first-name (4)
  | |   |-orgname (1)
  | |   \-role (1)
  | |-examiners (472)
  | | |-primary-examiner (472)
  | | | |-last-name (472)
  | | | |-first-name (472)
  | | | \-department (472)
  | | \-assistant-examiner (65)
  | |   |-last-name (65)
  | |   \-first-name (65)
  | |-us-related-documents (65)
  | | |-continuation-in-part (16)
  | | | \-relation (16)
  | | |   |-parent-doc (16)
  | | |   | |-document-id (16)
  | | |   | | |-country (16)
  | | |   | | |-doc-number (16)
  | | |   | | \-date (16)
  | | |   | |-parent-status (11)
  | | |   | \-parent-grant-document (5)
  | | |   |   \-document-id (5)
  | | |   |     |-country (5)
  | | |   |     |-doc-number (5)
  | | |   |     \-date (2)
  | | |   \-child-doc (16)
  | | |     \-document-id (16)
  | | |       |-country (16)
  | | |       \-doc-number (16)
  | | |-continuation (21)
  | | | \-relation (21)
  | | |   |-parent-doc (21)
  | | |   | |-document-id (21)
  | | |   | | |-country (21)
  | | |   | | |-doc-number (21)
  | | |   | | \-date (21)
  | | |   | |-parent-status (16)
  | | |   | \-parent-grant-document (5)
  | | |   |   \-document-id (5)
  | | |   |     |-country (5)
  | | |   |     |-doc-number (5)
  | | |   |     \-date (2)
  | | |   \-child-doc (21)
  | | |     \-document-id (21)
  | | |       |-country (21)
  | | |       \-doc-number (21)
  | | |-division (32)
  | | | \-relation (32)
  | | |   |-parent-doc (32)
  | | |   | |-document-id (32)
  | | |   | | |-country (32)
  | | |   | | |-doc-number (32)
  | | |   | | \-date (32)
  | | |   | |-parent-grant-document (24)
  | | |   | | \-document-id (24)
  | | |   | |   |-country (24)
  | | |   | |   |-doc-number (24)
  | | |   | |   \-date (1)
  | | |   | \-parent-status (8)
  | | |   \-child-doc (32)
  | | |     \-document-id (32)
  | | |       |-country (32)
  | | |       \-doc-number (32)
  | | \-related-publication (9)
  | |   \-document-id (9)
  | |     |-country (9)
  | |     |-doc-number (9)
  | |     |-kind (9)
  | |     \-date (9)
  | |-priority-claims (140)
  | | \-priority-claim (182)
  | |   |-country (182)
  | |   |-doc-number (182)
  | |   \-date (182)
  | |-us-sir-flag (1)
  | |-classifications-ipcr (23)
  | | \-classification-ipcr (24)
  | |   |-ipc-version-indicator (24)
  | |   | \-date (24)
  | |   |-classification-level (24)
  | |   |-section (24)
  | |   |-class (24)
  | |   |-subclass (24)
  | |   |-main-group (24)
  | |   |-subgroup (24)
  | |   |-symbol-position (24)
  | |   |-classification-value (24)
  | |   |-action-date (24)
  | |   | \-date (24)
  | |   |-generating-office (24)
  | |   | \-country (24)
  | |   |-classification-status (24)
  | |   \-classification-data-source (24)
  | |-us-botanic (21)
  | | |-latin-name (21)
  | | \-variety (21)
  | \-classifications-cpc (1)
  |   \-main-cpc (1)
  |     \-classification-cpc (1)
  |       |-cpc-version-indicator (1)
  |       | \-date (1)
  |       |-section (1)
  |       |-class (1)
  |       |-subclass (1)
  |       |-main-group (1)
  |       |-subgroup (1)
  |       |-symbol-position (1)
  |       |-classification-value (1)
  |       |-action-date (1)
  |       | \-date (1)
  |       |-generating-office (1)
  |       | \-country (1)
  |       |-classification-status (1)
  |       |-classification-data-source (1)
  |       \-scheme-origination-code (1)
  |-drawings (472)
  | \-figure (3033)
  |   \-img (3033)
  |-description (472)
  | |-description-of-drawings (472)
  | | |-p (3955)
  | | | |-figref (4478)
  | | | |-b (86)
  | | | \-i (6)
  | | \-heading (22)
  | |-heading (162)
  | \-p (340)
  |   |-figref (15)
  |   |-b (250)
  |   |-i (146)
  |   |-ul (96)
  |   | \-li (444)
  |   |   |-ul (215)
  |   |   | \-li (273)
  |   |   |   |-ul (199)
  |   |   |   | \-li (1192)
  |   |   |   |   |-i (1219)
  |   |   |   |   |-b (1)
  |   |   |   |   |-sup (10)
  |   |   |   |   \-sub (2)
  |   |   |   \-i (11)
  |   |   |-sup (2)
  |   |   \-i (26)
  |   |-tables (15)
  |   | \-table (15)
  |   |   \-tgroup (49)
  |   |     |-colspec (175)
  |   |     |-thead (15)
  |   |     | \-row (27)
  |   |     |   \-entry (51)
  |   |     \-tbody (49)
  |   |       \-row (291)
  |   |         \-entry (997)
  |   |           \-sup (28)
  |   \-sup (2)
  |-us-claim-statement (472)
  |-claims (472)
  | \-claim (476)
  |   \-claim-text (476)
  |     |-figref (1)
  |     |-claim-text (5)
  |     |-claim-ref (4)
  |     \-i (15)
  \-abstract (22)
    \-p (22)
      |-i (27)
      \-ul (2)
        \-li (2)
          \-ul (2)
            \-li (11)
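
A tally like the one above can be produced by walking each document with XMLReader and counting element paths. The following is a rough sketch of that idea, not the tool that generated the listing; the function name `countElementPaths` is hypothetical, and the input is a small inline document rather than the USPTO file.

```php
<?php
/**
 * Count how often each element path occurs in one XML document.
 * Hypothetical helper for illustration.
 *
 * @return array<string,int> map of "/root/child" style paths to counts
 */
function countElementPaths(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $counts = [];
    $stack = []; // names of the currently open ancestor elements
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            // depth is 0 for the root, so cut the stack back to that depth
            $stack = array_slice($stack, 0, $reader->depth);
            $stack[] = $reader->name;
            $path = '/' . implode('/', $stack);
            $counts[$path] = ($counts[$path] ?? 0) + 1;
        }
    }
    $reader->close();
    return $counts;
}

$sample = '<doc><id><country>US</country></id>'
        . '<id><country>US</country><kind>S1</kind></id></doc>';

foreach (countElementPaths($sample) as $path => $count) {
    echo $path, ' (', $count, ")\n";
}
```

For a whole sequence, the per-document arrays would be summed key by key across all documents.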
