I'm trying to parse a huge XML file that lists directory names, file names, permissions, creation date/time, user names, file sizes and directory sizes.

The structure of the XML is as follows:

<parent_dir>
   <file_1>
   <sub-dir_1>
           <file_2>
            <sub-dir_2>
                ....
                  .....

An actual file looks like this:

<browse path="">
  <dir date="2018-04-17 23:31:59" internal="0" group="TrustedInstaller" protection="drwxrwxrwx" name="C:" size="593181949" links="0" user="unknown">
    <dir date="2017-12-13 23:30:44" internal="0" group="unknown" protection="drwxrwxr-x" name="Documents and Settings" size="174" links="0" user="SYSTEM">
      <file date="2017-03-18 22:01:11" internal="0" group="unknown" protection="-rwxrwxr-x" name="desktop.ini" size="174" links="0" user="Administrators" />
    </dir>
    <dir date="2017-12-14 03:17:04" internal="0" group="None" protection="d--x------" name="Test_Software" size="516708762" links="0" user="srt">
      <file date="2017-02-09 14:58:53" internal="0" group="None" protection="----------" name="26.avi" size="13263184" links="0" user="srt" />
      <file date="2016-11-01 00:31:40" internal="0" group="None" protection="----------" name="6.avi" size="13569536" links="0" user="srt" />
      <dir date="2017-12-13 23:41:27" internal="0" group="None" protection="d--x------" name=".vs" size="5120" links="0" user="srt">
        <dir date="2017-12-13 23:41:27" internal="0" group="None" protection="d--x------" name="Forest_Protector" size="5120" links="0" user="srt">
          <dir date="2017-12-13 23:41:27" internal="0" group="None" protection="d--x------" name="v14" size="5120" links="0" user="srt">
            <file date="2017-12-12 14:35:36" internal="0" group="None" protection="----------" name=".suo" size="5120" links="0" user="srt" />
          </dir>
        </dir>
      </dir>
      <dir date="2017-12-14 03:17:15" internal="0" group="None" protection="d--x------" name="Debug" size="379090369" links="0" user="srt">
        <file date="2017-12-14 03:06:03" internal="0" group="None" protection="-rwx------" name="Current_Frame1 (2).mp4" size="321612800" links="0" user="Administrators" />
        <file date="2018-04-16 21:35:17" internal="0" group="None" protection="-rwx------" name="Current_Frame1.avi" size="94102" links="0" user="Administrators" />
        <dir date="2017-12-14 03:17:20" internal="0" group="None" protection="d--x------" name="Fire" size="7502723" links="0" user="srt">
          <file date="2017-12-12 21:35:13" internal="0" group="None" protection="----------" name="Fire_Detected_02_05_12.bmp" size="921654" links="0" user="srt" />
          <file date="2017-12-12 21:35:13" internal="0" group="None" protection="----------" name="Fire_Detected_02_05_13.bmp" size="921654" links="0" user="srt" />
        </dir>
        <dir date="2017-12-13 23:41:28" internal="0" group="None" protection="d--x------" name="Smoke" size="3686616" links="0" user="srt">
          <file date="2017-12-12 21:35:50" internal="0" group="None" protection="----------" name="Smoke_Detected_02_05_50.bmp" size="921654" links="0" user="srt" />
          <file date="2017-12-12 21:39:17" internal="0" group="None" protection="----------" name="Smoke_Detected_02_09_17.bmp" size="921654" links="0" user="srt" />
          <file date="2017-12-12 21:39:18" internal="0" group="None" protection="----------" name="Smoke_Detected_02_09_18.bmp" size="921654" links="0" user="srt" />
          <file date="2017-12-12 21:42:29" internal="0" group="None" protection="----------" name="Smoke_Detected_02_12_29.bmp" size="921654" links="0" user="srt" />
        </dir>
      </dir>
      <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="Scripts" size="25590875" links="0" user="srt">
        <file date="2016-12-18 04:57:14" internal="0" group="None" protection="----------" name="_hashlib.pyd" size="1482240" links="0" user="srt" />
        <file date="2016-12-18 04:56:16" internal="0" group="None" protection="----------" name="_socket.pyd" size="50688" links="0" user="srt" />
        <file date="2016-12-18 04:56:54" internal="0" group="None" protection="----------" name="_ssl.pyd" size="2100736" links="0" user="srt" />
        <dir date="2017-12-13 23:41:28" internal="0" group="None" protection="d--x------" name="build" size="2248287" links="0" user="srt">
          <dir date="2017-12-13 23:41:28" internal="0" group="None" protection="d--x------" name="bdist.win-amd64" size="2248287" links="0" user="srt">
            <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="winexe" size="2248287" links="0" user="srt">
              <dir date="2017-12-13 22:46:02" internal="0" group="None" protection="d--x------" name="bundle-2.7" size="0" links="0" user="srt" />
              <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="collect-2.7" size="2246148" links="0" user="srt">
                <file date="2017-12-13 14:41:41" internal="0" group="None" protection="----------" name="__future__.pyc" size="4103" links="0" user="srt" />
                <file date="2017-12-13 14:41:41" internal="0" group="None" protection="----------" name="_abcoll.pyc" size="23604" links="0" user="srt" />
                <dir date="2017-12-13 23:41:29" internal="0" group="None" protection="d--x------" name="email" size="125408" links="0" user="srt">
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="__init__.pyc" size="2752" links="0" user="srt" />
                  <dir date="2017-12-13 23:41:29" internal="0" group="None" protection="d--x------" name="mime" size="110" links="0" user="srt">
                    <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="__init__.pyc" size="110" links="0" user="srt" />
                  </dir>
                </dir>
                <dir date="2017-12-13 23:41:30" internal="0" group="None" protection="d--x------" name="encodings" size="413685" links="0" user="srt">
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="__init__.pyc" size="4298" links="0" user="srt" />
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="aliases.pyc" size="8750" links="0" user="srt" />
                </dir>
                <dir date="2017-12-13 23:41:30" internal="0" group="None" protection="d--x------" name="json" size="41094" links="0" user="srt">
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="__init__.pyc" size="13824" links="0" user="srt" />
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="decoder.pyc" size="11720" links="0" user="srt" />
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="encoder.pyc" size="13381" links="0" user="srt" />
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="scanner.pyc" size="2169" links="0" user="srt" />
                </dir>
                <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="logging" size="55152" links="0" user="srt">
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="__init__.pyc" size="55152" links="0" user="srt" />
                </dir>
                <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="python_http_client" size="12280" links="0" user="srt">
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="__init__.pyc" size="611" links="0" user="srt" />
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="client.pyc" size="8422" links="0" user="srt" />
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="exceptions.pyc" size="3247" links="0" user="srt" />
                </dir>
                <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="sendgrid" size="46335" links="0" user="srt">
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="__init__.pyc" size="290" links="0" user="srt" />
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="sendgrid.pyc" size="2552" links="0" user="srt" />
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="version.pyc" size="369" links="0" user="srt" />
                  <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="helpers" size="43124" links="0" user="srt">
                    <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="__init__.pyc" size="116" links="0" user="srt" />
                    <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="mail" size="43008" links="0" user="srt">
                      <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="__init__.pyc" size="156" links="0" user="srt" />
                      <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="mail.pyc" size="42852" links="0" user="srt" />
                    </dir>
                  </dir>
                </dir>
                <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="unittest" size="91680" links="0" user="srt">
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="__init__.pyc" size="2944" links="0" user="srt" />

                </dir>
              </dir>
              <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="temp" size="2139" links="0" user="srt">
                <file date="2017-12-13 14:41:41" internal="0" group="None" protection="----------" name="_hashlib.py" size="358" links="0" user="srt" />
              </dir>
            </dir>
          </dir>
        </dir>
        <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="dist" size="11668574" links="0" user="srt">
          <file date="2016-12-18 04:57:14" internal="0" group="None" protection="----------" name="_hashlib.pyd" size="1482240" links="0" user="srt" />
          <file date="2016-12-18 04:56:16" internal="0" group="None" protection="----------" name="_socket.pyd" size="50688" links="0" user="srt" />
        </dir>
      </dir>
    </dir>
    <dir date="2018-02-05 19:41:24" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="Windows" size="76473013" links="0" user="unknown">
      <dir date="2018-04-17 22:11:40" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="System32" size="76473013" links="0" user="unknown">
        <dir date="2017-12-14 00:11:42" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="drivers" size="76473013" links="0" user="unknown">
          <file date="2017-03-18 21:56:34" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="BtaMPM.sys" size="23552" links="0" user="unknown" />
          <file date="2017-03-18 21:56:19" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="BthAvrcpTg.sys" size="43520" links="0" user="unknown" />
          <dir date="2017-03-19 03:49:53" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="en-US" size="1423360" links="0" user="unknown">
            <file date="2017-03-18 06:45:38" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="hidbth.sys.mui" size="5120" links="0" user="unknown" />
            <file date="2017-03-18 06:45:50" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="hidclass.sys.mui" size="6656" links="0" user="unknown" />
          </dir>
          <dir date="2017-03-18 22:03:39" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="etc" size="23907" links="0" user="unknown">
            <file date="2017-03-18 22:01:13" internal="0" group="unknown" protection="-rwxrwx---" name="hosts" size="824" links="0" user="SYSTEM" />
            <file date="2017-03-18 22:01:13" internal="0" group="unknown" protection="-rwxrwx---" name="lmhosts.sam" size="3683" links="0" user="SYSTEM" />
            <file date="2017-03-18 22:01:13" internal="0" group="unknown" protection="-rwxrwx---" name="networks" size="407" links="0" user="SYSTEM" />
            <file date="2017-03-18 22:01:13" internal="0" group="unknown" protection="-rwxrwx---" name="protocol" size="1358" links="0" user="SYSTEM" />
            <file date="2017-03-18 22:01:13" internal="0" group="unknown" protection="-rwxrwx---" name="services" size="17635" links="0" user="SYSTEM" />
          </dir>
          <dir date="2017-07-11 06:41:34" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="UMDF" size="1743264" links="0" user="unknown">
            <file date="2017-03-18 21:56:19" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="EhStorPwdDrv.dll" size="85504" links="0" user="unknown" />
            <file date="2017-07-11 06:40:08" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="NfcCx.dll" size="710656" links="0" user="unknown" />
            <dir date="2017-03-19 03:47:47" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="en-US" size="66048" links="0" user="unknown">
              <file date="2017-03-18 06:47:42" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="hidscanner.dll.mui" size="2560" links="0" user="unknown" />
              <file date="2017-03-18 06:47:42" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="IddCx.dll.mui" size="7168" links="0" user="unknown" />
            </dir>
          </dir>
        </dir>
      </dir>
    </dir>
  </dir>
</browse>

Now, from this XML, I would like to get the following:

1) Parse almost everything in the following format:

Parent_DIR: which in this case is C:

File-to-directory relation in SQLite: the folder 'Test_Software', residing in C:, contains 26.avi, 6.avi, etc. For every folder, list the files in that folder and store that information in the DIR_INFO table of the DB in the following format: the directory name goes to the DIR_Name column of the table and the file list goes to the File_Name column.

2) Get the list of all files with their parent path. Note: look for the file named '__future__.pyc'; its original path is C:/Test_Software/Scripts/build/bdist.win-amd64/winexe/collect-2.7/__future__.pyc. I would like to add the file name, __future__.pyc, to the File_name column of the DB table File_info and the full path to the file_full_path column of the same table.

So far, I have been able to get the File_info and Dir_info using the following code:

from __future__ import print_function  # lets print() behave the same on Python 2.7

from lxml import etree

# single_bk is assumed to hold the XML listing shown above as a string
xml_out = etree.fromstring(single_bk)
file_info = xml_out.xpath('/browse/dir/file')  # only files directly under a top-level dir
dir_info = xml_out.xpath('/browse/dir')  # only the top-level dirs
dir_names = []
file_names = []
file_dates = []
dir_dates = []
dir_group = []
file_group = []
file_permissions = []
dir_permissions = []
file_size = []
dir_size = []
file_user = []
dir_user = []

# Collect one list per attribute for the matched files
for node in file_info:
    file_names.append(node.xpath("@name")[0])
    file_dates.append(node.xpath("@date")[0])
    file_group.append(node.xpath("@group")[0])
    file_permissions.append(node.xpath("@protection")[0])
    file_size.append(node.xpath("@size")[0])
    file_user.append(node.xpath("@user")[0])

# Same for the matched directories
for node in dir_info:
    dir_names.append(node.xpath("@name")[0])
    dir_dates.append(node.xpath("@date")[0])
    dir_group.append(node.xpath("@group")[0])
    dir_permissions.append(node.xpath("@protection")[0])
    dir_size.append(node.xpath("@size")[0])
    dir_user.append(node.xpath("@user")[0])
print(file_names, file_dates, file_group, file_permissions, file_size, file_user)
print('\n------------------------------------------------------------------\n')
print(dir_names, dir_dates, dir_group, dir_permissions, dir_size, dir_user)
print('\n------------------------------------------------------------------\n')
list_of_attributes = []
for node in dir_info:
    attrs = []
    for att in node.attrib:
        #attrs.append(("@" + att, node.attrib[att]))
        attrs.append(node.attrib[att])
    # append once per directory, after all of its attribute values are collected
    list_of_attributes.append(attrs)
print(list_of_attributes)
print('\n------------------------------------------------------------------\n')
print(attrs)  # attribute values of the last directory only

But it has the following limitations:

1) I can't map the whole XML using this method, because the XML can be nested to an arbitrary depth; maybe I need to put everything inside a loop that keeps descending through the XML.

2) I am not able to map the file -> dir relation, because it seems too complicated to map a file to its outermost parent using this XML. For example, if I am reading the file '__future__.pyc', whose original path is C:/Test_Software/Scripts/build/bdist.win-amd64/winexe/collect-2.7/__future__.pyc, I can't think of a way to walk back up to C:\Test_Software and build the complete file path.

3) I want to store the directory info in the SQLite Dir_info table and the file info in the File_info table.
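For limitations 1 and 2, one possible approach is to stream the document and keep a stack of the directory names that are currently open, so the full path is known whenever a file element is seen, regardless of depth. The following is only a sketch under some assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, the function name iter_file_paths is made up for illustration, and SAMPLE is a trimmed stand-in for the real listing.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed stand-in for the real listing; only a few attributes are kept.
SAMPLE = b"""<browse path="">
  <dir name="C:" size="593181949">
    <dir name="Test_Software" size="516708762">
      <file name="26.avi" size="13263184" />
      <dir name="Scripts" size="25590875">
        <file name="_ssl.pyd" size="2100736" />
      </dir>
    </dir>
  </dir>
</browse>"""


def iter_file_paths(source):
    """Yield (file_name, parent_dir, full_path) for every <file>, at any depth."""
    dir_stack = []  # names of the <dir> elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag == "dir":
            if event == "start":
                # attributes are already available on the "start" event
                dir_stack.append(elem.get("name"))
            else:
                dir_stack.pop()
                elem.clear()  # drop the finished subtree so it can be freed
        elif elem.tag == "file" and event == "end":
            name = elem.get("name")
            parent = dir_stack[-1] if dir_stack else ""
            yield name, parent, "/".join(dir_stack + [name])


rows = list(iter_file_paths(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

Because the parser is fed incrementally and finished directories are cleared, this style tends to suit very large files; for a file on disk, the path or an open binary file object can be passed in place of the BytesIO wrapper.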

Here is my SQLite table information:

DB name: abc.db
Tables: a) DIR_INFO b) FILE_INFO

Columns of the DIR_INFO table:
ID, DIR_DATE, DIR_NAME, DIR_FILE, DIR_SIZE, DIR_PERMISSIONS, DIR_USER

Columns of the FILE_INFO table:
ID, FILE_DATE, FILE_NAME, FILE_PARENT_DIR, FILE_FULL_PATH, FILE_SIZE, FILE_PERMISSIONS, FILE_USER
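As a rough sketch of how rows could be written to tables with these columns, the example below uses Python's built-in sqlite3 module. Several things here are assumptions rather than part of the original setup: the column types, the auto-incrementing ID, storing the file list in DIR_FILE as a comma-separated string, and the hand-written sample rows (in practice they would come from walking the XML). An in-memory database is used so the snippet runs without touching abc.db.

```python
import sqlite3

# Hand-written sample rows in column order (without ID), taken from the listing.
dir_rows = [
    ("2017-12-14 03:17:04", "Test_Software", "26.avi, 6.avi",
     516708762, "d--x------", "srt"),
]
file_rows = [
    ("2017-02-09 14:58:53", "26.avi", "Test_Software",
     "C:/Test_Software/26.avi", 13263184, "----------", "srt"),
    ("2016-11-01 00:31:40", "6.avi", "Test_Software",
     "C:/Test_Software/6.avi", 13569536, "----------", "srt"),
]

conn = sqlite3.connect(":memory:")  # replace with "abc.db" for the real database

# Column types are assumptions; only the column names come from the question.
conn.execute("""CREATE TABLE IF NOT EXISTS DIR_INFO (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    DIR_DATE TEXT, DIR_NAME TEXT, DIR_FILE TEXT,
    DIR_SIZE INTEGER, DIR_PERMISSIONS TEXT, DIR_USER TEXT)""")
conn.execute("""CREATE TABLE IF NOT EXISTS FILE_INFO (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    FILE_DATE TEXT, FILE_NAME TEXT, FILE_PARENT_DIR TEXT, FILE_FULL_PATH TEXT,
    FILE_SIZE INTEGER, FILE_PERMISSIONS TEXT, FILE_USER TEXT)""")

# Parameterized inserts; executemany sends all rows in one call.
conn.executemany(
    "INSERT INTO DIR_INFO (DIR_DATE, DIR_NAME, DIR_FILE, DIR_SIZE, "
    "DIR_PERMISSIONS, DIR_USER) VALUES (?, ?, ?, ?, ?, ?)",
    dir_rows,
)
conn.executemany(
    "INSERT INTO FILE_INFO (FILE_DATE, FILE_NAME, FILE_PARENT_DIR, FILE_FULL_PATH, "
    "FILE_SIZE, FILE_PERMISSIONS, FILE_USER) VALUES (?, ?, ?, ?, ?, ?, ?)",
    file_rows,
)
conn.commit()

stored = conn.execute(
    "SELECT FILE_NAME, FILE_FULL_PATH FROM FILE_INFO ORDER BY ID"
).fetchall()
dir_count = conn.execute("SELECT COUNT(*) FROM DIR_INFO").fetchone()[0]
print(stored, dir_count)
```

Using placeholders (?) rather than string formatting avoids quoting problems with names such as "Current_Frame1 (2).mp4".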

If you have read this far, first of all I appreciate your time, and secondly, any help would be greatly appreciated.

  • The XML file contains a tree, so your algorithm should be recursive. Commented Apr 22, 2018 at 6:10
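A minimal sketch of the recursive idea from this comment, assuming the standard library's xml.etree.ElementTree rather than lxml: each call handles one dir element, records the files directly inside it, and passes the accumulated path down to its sub-directories. The function name walk and the trimmed SAMPLE are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the listing; other attributes are omitted for brevity.
SAMPLE = """<browse path="">
  <dir name="C:">
    <dir name="Test_Software">
      <file name="26.avi" />
      <file name="6.avi" />
      <dir name="Debug">
        <file name="Current_Frame1.avi" />
      </dir>
    </dir>
  </dir>
</browse>"""


def walk(dir_elem, parent_path, dirs, files):
    """Record one directory, then recurse into its sub-directories."""
    name = dir_elem.get("name")
    path = parent_path + "/" + name if parent_path else name
    # files that sit directly inside this directory
    child_files = [f.get("name") for f in dir_elem.findall("file")]
    dirs.append((path, child_files))
    for file_name in child_files:
        files.append((file_name, path + "/" + file_name))
    # the path built so far is handed down, so no walking back up is needed
    for sub_dir in dir_elem.findall("dir"):
        walk(sub_dir, path, dirs, files)


root = ET.fromstring(SAMPLE)
dirs, files = [], []
for top_dir in root.findall("dir"):
    walk(top_dir, "", dirs, files)

print(dirs)
print(files)
```

The dirs list gives the directory-to-files relation and the files list gives each file with its full path. Python's default recursion limit (1000 frames) is normally far above any realistic directory depth, but an explicit stack is an alternative if that is a concern.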

1 Answer

I solved this by using regular expressions instead of the lxml parser. See this for more details: Parse Output for Python
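The linked post is not reproduced here, so the following is not the code actually used; it is only a guess at what a regex-based scan could look like. It assumes that attribute values never contain a ">" or a double quote, that XML entities in names do not need decoding, and that the names TAG_RE, ATTR_RE and scan are purely illustrative. A stack of open directory names gives the full path of each file.

```python
import re

# Trimmed excerpt of the listing, including the self-closing "bundle-2.7" dir.
SAMPLE = """<browse path="">
  <dir name="C:" size="593181949">
    <dir name="Test_Software" size="516708762">
      <dir name="Scripts" size="25590875">
        <dir name="build" size="2248287">
          <dir name="bdist.win-amd64" size="2248287">
            <dir name="winexe" size="2248287">
              <dir name="bundle-2.7" size="0" />
              <dir name="collect-2.7" size="2246148">
                <file name="__future__.pyc" size="4103" />
              </dir>
            </dir>
          </dir>
        </dir>
      </dir>
    </dir>
  </dir>
</browse>"""

# Groups: optional "/" of a closing tag, tag name, raw attributes,
# optional "/" of a self-closing tag.
TAG_RE = re.compile(r'<(/?)(dir|file)\b([^>]*?)(/?)>')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def scan(text):
    """Return the full path of every <file> found in text."""
    stack = []  # names of the directories that are currently open
    paths = []
    for match in TAG_RE.finditer(text):
        closing, tag, attr_text, self_closing = match.groups()
        if closing:
            if tag == "dir":
                stack.pop()
            continue
        attrs = dict(ATTR_RE.findall(attr_text))
        if tag == "file":
            paths.append("/".join(stack + [attrs["name"]]))
        elif not self_closing:
            # an empty <dir ... /> has no children, so it is never pushed
            stack.append(attrs["name"])
    return paths


paths = scan(SAMPLE)
print(paths)
```

A regex scan like this depends on the listing staying well-formed and on the assumptions above, so a real XML parser is usually the safer choice when the input is not fully under control.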
