
I have some (working) code that searches for and modifies text within a PDF file.

It takes a list of byte strings that I want to find and replaces each with a string of spaces of the same length as the found string. (You can't just remove the strings, because that breaks the re-encoding of the PDF. I admit the PDF format is almost entirely foreign to me.)

pdfInputFile = "input.pdf"
pdfOutputFile = "out.pdf"

with open(pdfInputFile, "rb") as reader:
    pdfByteStr = reader.read()

toReplace = [
    b'shortstring1',
    b'string2',
    b'longstring3',
    ### I'd love to be able to do r'some.[0-9].regex.here'
]

for origStr in toReplace:
    # pad with spaces of the same length so the overall byte count stays unchanged
    spaceBytes = b' ' * len(origStr)
    pdfByteStr = pdfByteStr.replace(origStr, spaceBytes)

with open(pdfOutputFile, "wb") as writer:
    writer.write(pdfByteStr)

This all works, but as I dig a little deeper it would be very nice to be able to match some of these things with regular expressions rather than plain strings. Does Python's regex module natively support byte strings instead of regular str strings? I tried a couple of variations of this using re.sub and couldn't get it to work, but it's entirely possible that I just hadn't figured out the correct usage/syntax. Is this something I could do without separate loops, one for the byte strings and another for the regex patterns?

  • What do you mean by 'strings'? Yes, you can compose a regex from a list of strings, then substitute with a callback that builds a replacement string of spaces from the matched string's length. It's all about creating the regex ahead of time, i.e. string1|string2|string3, so it's a single-pass regex (see the sketch after these comments). Commented Sep 27 at 19:55
  • Another option is to take each string and do them one at a time. If the combined length of all the strings were something like 1 megabyte, you'd have to process them in smaller chunks, say 25k of strings at a time. Commented Sep 27 at 20:03
  • Usually strings are in UTF-8, but you could do a conversion: decode the bytes to UTF-8, construct the regex, do the replacement on the decoded text, then encode back to bytes? Commented Sep 27 at 20:07
  • fwiw I'd always keep it as bytes to make it more generic, as PDF is a hilariously complex format - plausibly some work may be needed to avoid changing whatever built-in words might exist too, either by excluding them or by locating some header(s) and only working after them. Commented Sep 27 at 20:10
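
A minimal sketch of the single-pass idea from the first comment, assuming the targets are literal byte strings (re.escape handles any regex metacharacters in them) and that pdfByteStr has already been read in as in the question:

import re

# escape each literal and join into one alternation pattern: lit1|lit2|lit3
combinedPat = re.compile(b'|'.join(re.escape(s) for s in toReplace))

# single pass over the data: the callback pads every match with spaces of the same length
pdfByteStr = combinedPat.sub(lambda m: b' ' * len(m.group(0)), pdfByteStr)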

1 Answer


This is mentioned at the top of the module docs:

Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a bytes pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.

It is simply a matter of using consistent types.

For your use case, re.sub supports bytestrings. Use a callable for the replacement; it will be called with an re.Match instance, and from the match object you can determine the appropriate replacement string.

Demo:

>>> import re
>>> pat = b"bar[0-9]"
>>> s = b"foobar1x"
>>> re.sub(pat, lambda m: b' '*(m.end() - m.start()), s)
b'foo    x'
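
Applied to the loop in the question, a sketch along these lines keeps everything as bytes and handles literals and regexes in one loop. The regex entry is just the placeholder pattern from the question, and escaping the literals with re.escape is an assumption about what they contain:

import re

patterns = [
    re.escape(b'shortstring1'),   # literal byte strings, escaped so they match verbatim
    re.escape(b'string2'),
    re.escape(b'longstring3'),
    rb'some.[0-9].regex.here',    # a real bytes regex pattern
]

for pat in patterns:
    # the callback pads each match with spaces of equal byte length
    pdfByteStr = re.sub(pat, lambda m: b' ' * len(m.group(0)), pdfByteStr)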

