I am trying to compare the text of all instances of a particular tag in two XML files. The OCR engine I am using outputs an XML file with all the OCR characters in a tag <OCRCharacters>...</OCRCharacters>.
I am using Python 2.7.11 and Beautiful Soup 4 (bs4). From the terminal, I call my Python program with the two XML file names as arguments.
I want to extract all the strings in the <OCRCharacters> tag for each file, compare them line by line with difflib, and write a new file with the differences.
I use $ python parse_xml_file.py file1.xml file2.xml to call the program from the terminal.
The code below opens each file and prints each string in the tag <OCRCharacters>. How should I convert the objects made with bs4 to strings that I can use with difflib? I am open to better ways (using Python) to do this.
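# Make print() behave as a function on Python 2.7 (otherwise the
# multi-argument print calls below would print tuples).
from __future__ import print_function
import sys
from bs4 import BeautifulSoup

# Read both XML files named on the command line.
with open(sys.argv[1], "r") as f1:
    xml_doc_1 = f1.read()
with open(sys.argv[2], "r") as f2:
    xml_doc_2 = f2.read()

# Parse each document with the XML parser.
soup1 = BeautifulSoup(xml_doc_1, 'xml')
soup2 = BeautifulSoup(xml_doc_2, 'xml')

# Print the string inside every <OCRCharacters> tag of the first file.
print("#####################", sys.argv[1], "#####################")
for tag in soup1.find_all('OCRCharacters'):
    print(repr(tag.string))
    temp1 = repr(tag.string)  # overwritten on every iteration
    print(temp1)

# Same for the second file.
print("#####################", sys.argv[2], "#####################")
for tag in soup2.find_all('OCRCharacters'):
    print(repr(tag.string))
    temp2 = repr(tag.string)  # overwritten on every iteration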
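
For illustration, here is a minimal sketch of one way the whole task might be done. It assumes that tag.get_text() gives a good enough string for each tag, and that a unified diff is an acceptable output format. The function names extract_ocr_strings, diff_lines and write_diff and the output file name ocr_diff.txt are made up for this example. The idea is to collect the strings into a list per file instead of overwriting a single variable in the loop, then hand the two lists to difflib.unified_diff.

```python
import difflib
import io


def extract_ocr_strings(path, tag_name="OCRCharacters"):
    """Return the text of every <tag_name> element in the file as a list."""
    # Imported here so that diff_lines() below can be used without bs4.
    from bs4 import BeautifulSoup

    with io.open(path, "rb") as f:
        soup = BeautifulSoup(f, "xml")  # the "xml" parser needs lxml installed
    # get_text() returns a plain unicode string, not a NavigableString.
    return [tag.get_text() for tag in soup.find_all(tag_name)]


def diff_lines(lines1, lines2, name1="file1", name2="file2"):
    """Return a unified diff of two lists of strings as a list of lines."""
    return list(difflib.unified_diff(
        lines1, lines2, fromfile=name1, tofile=name2, lineterm=""))


def write_diff(path1, path2, out_path):
    """Diff the OCR strings of two XML files and write the result to out_path."""
    diff = diff_lines(extract_ocr_strings(path1),
                      extract_ocr_strings(path2), path1, path2)
    with io.open(out_path, "w", encoding="utf-8") as out:
        for line in diff:
            out.write(line + u"\n")


# Hypothetical use from a script called with two file names:
#   import sys
#   write_diff(sys.argv[1], sys.argv[2], "ocr_diff.txt")
```

If a tag can contain nested elements, tag.string may be None while get_text() still returns the concatenated text, which is why the sketch uses get_text(). Whether that is the right choice depends on what the OCR engine actually emits.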