I am trying to compare the text of all instances of a particular tag in two XML files. The OCR engine I am using outputs an XML file with all the OCR characters in an <OCRCharacters>...</OCRCharacters> tag.

I am using Python 2.7.11 and Beautiful Soup 4 (bs4). From the terminal, I call my Python program with two XML file names as arguments.

I want to extract all the strings in the <OCRCharacters> tag for each file, compare them line by line with difflib, and write a new file with the differences.

I use $ python parse_xml_file.py file1.xml file2.xml to call the program from the terminal.

The code below opens each file and prints each string in the <OCRCharacters> tag. How should I convert the objects made with bs4 to strings that I can use with difflib? I am open to better ways (using Python) to do this.

import sys

with open(sys.argv[1], "r") as f1:
    xml_doc_1 = f1.read()

with open(sys.argv[2], "r") as f2:
    xml_doc_2 = f2.read()

from bs4 import BeautifulSoup
soup1 = BeautifulSoup(xml_doc_1, 'xml')
soup2 = BeautifulSoup(xml_doc_2, 'xml')

print("#####################",sys.argv[1],"#####################")
for tag in soup1.find_all('OCRCharacters'):
    print(repr(tag.string))
    temp1 = repr(tag.string)
    print(temp1)
print("#####################",sys.argv[2],"#####################")    
for tag in soup2.find_all('OCRCharacters'):
    print(repr(tag.string))
    temp2 = repr(tag.string)

1 Answer

You can try this:

import sys
import difflib
from bs4 import BeautifulSoup

# One list of <OCRCharacters> strings per input file
text = []

for arg in sys.argv[1:3]:
    # Parse each file with the 'xml' parser (requires lxml)
    with open(arg, "r") as f:
        soup = BeautifulSoup(f.read(), 'xml')

    # get_text() returns the content of each tag as a plain string
    text.append([tag.get_text() for tag in soup.find_all('OCRCharacters')])

d = difflib.Differ()

# Compare the n-th tag of the first file with the n-th tag of the second
for first_string, second_string in zip(text[0], text[1]):
    diff = d.compare(first_string.splitlines(), second_string.splitlines())
    print('\n'.join(diff))

With xml1.xml:

<node>
  <OCRCharacters>text1_1</OCRCharacters>
  <OCRCharacters>text1_2</OCRCharacters>
  <OCRCharacters>Same Value</OCRCharacters>
</node>

and xml2.xml:

<node>
  <OCRCharacters>text2_1</OCRCharacters>
  <OCRCharacters>text2_2</OCRCharacters>
  <OCRCharacters>Same Value</OCRCharacters>
</node>

The output will be:

- text1_1
?     ^

+ text2_1
?     ^

- text1_2
?     ^

+ text2_2
?     ^

  Same Value
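
The question also asks for the differences to be written to a new file. A minimal sketch of that step follows, assuming the strings have already been extracted into two lists as in the code above (text[0] and text[1]). It swaps in difflib.unified_diff over the whole lists rather than Differ over tag pairs, because zip stops at the shorter list and so silently ignores extra tags in the longer file. The function names diff_ocr_text and write_diff and the output file name used below are hypothetical, chosen only for illustration.

```python
import difflib
import io


def diff_ocr_text(lines_a, lines_b, name_a="file1.xml", name_b="file2.xml"):
    """Return unified-diff lines for two lists of OCR strings.

    Each list is assumed to hold one string per <OCRCharacters> tag.
    """
    return list(difflib.unified_diff(
        lines_a, lines_b, fromfile=name_a, tofile=name_b, lineterm=""))


def write_diff(diff_lines, out_path):
    """Write the diff lines to out_path as UTF-8, one per line."""
    with io.open(out_path, "w", encoding="utf-8") as out:
        for line in diff_lines:
            out.write(line + u"\n")
```

With the lists from the code above, this could be called as write_diff(diff_ocr_text(text[0], text[1], sys.argv[1], sys.argv[2]), "ocr_diff.txt"). If both files contain the same strings, the diff is empty and the output file is written with no content.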