I have some (working) code that searches for and modifies text within a PDF file.
It takes a list of byte strings that I want to find and replaces each one with a string of spaces of the same length as the found string. (You can't simply remove the strings, because that breaks the re-encoding of the PDF; I'll admit the PDF encoding format is almost entirely foreign to me.)
pdfInputFile = "input.pdf"
pdfOutputFile = "out.pdf"

with open(pdfInputFile, "rb") as reader:
    pdfByteStr = reader.read()

toReplace = [
    b'shortstring1',
    b'string2',
    b'longstring3',
    ### I'd love to be able to do r'some.[0-9].regex.here'
]

for origStr in toReplace:
    spaceBytes = b' ' * len(origStr)
    pdfByteStr = pdfByteStr.replace(origStr, spaceBytes)

with open(pdfOutputFile, "wb") as writer:
    writer.write(pdfByteStr)
This all works, but as I dig a little deeper it would be very nice to be able to match
some of these things using regular expressions rather than literal strings. Does Python's regex
machinery "natively" support byte strings instead of "regular" strings as patterns? I tried a
couple of variations of this using re.sub and couldn't get it to work, but it's 100% possible
that I just hadn't figured out the correct usage/syntax. Is this something that I could expect
to do without having separate loops, one for the "byte strings" and another for the "regex strings"?
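For concreteness, this is roughly the shape of what I was trying; the pattern here is made up,
and I'm only guessing that a lambda is the right way to keep the replacement the same length as
whatever the pattern matched:

import re

# Roughly what I attempted: a bytes regex pattern, replaced by a run of
# spaces the same length as the matched text (pattern is hypothetical).
pattern = rb'some.[0-9].regex'
pdfByteStr = re.sub(pattern, lambda m: b' ' * len(m.group(0)), pdfByteStr)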