How does regex filtering work in Python re while logging sensitive info?

Question

I am trying to write a Python script that redacts/hides certain data present in a string before logging it to the console. Below is my code so far.

import re
from logging import DEBUG, Logger, basicConfig, getLogger, Filter, LogRecord

SENSITIVE_PATTERNS = [
    (
        "email_address",
        r"([A-Za-z0-9]+[.-_])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+",
    ),
]


def create_logger(sensitive_patterns: list = None) -> Logger:
    basicConfig(level=DEBUG)
    logger = getLogger()
    sensitive_data_filter = SensitiveDataFilter(sensitive_patterns)
    logger.addFilter(sensitive_data_filter)
    return logger


class SensitiveDataFilter(Filter):
    def __init__(self, patterns=None):
        super().__init__()
        self.patterns = patterns or []

    def filter(self, record: LogRecord) -> bool:
        for pattern in self.patterns:
            should_redact = re.search(pattern[1], record.msg)

            if should_redact:
                record.msg = re.sub(pattern[1], f"<HIDDEN {pattern[0]}>", record.msg)

        return True


logger = create_logger(
    sensitive_patterns=SENSITIVE_PATTERNS,
)

test1 = "[email protected]"
test2 = "A"*55
test3 = test2.lower()
logger.info(f"this is test1 : {test1}")
logger.info(f"this is a test3 : {test3}")
logger.info(f"this is a test2 : {test2}")

My objective is to hide certain strings in the log. For example, I want to hide email addresses whenever they are logged. This part works fine:

INFO:root:this is test1 : <HIDDEN email_address>

However, I also want to keep other log messages as they are when nothing matches a redaction pattern. This leads to an interesting problem. Whenever I log a long string that is all upper case, the script keeps executing and never finishes (I am guessing it goes into some kind of infinite loop?). However, when the same code is run with the string lowercased, it works:

INFO:root:this is a test3 : aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

What am I missing here?

I have tried running it in debug mode. The code seems to be stuck in one of the internal calls of the re module, but I am not able to figure out why.

Does this answer your question? why python regex is so slow?
– relent95, Commented Jun 25, 2024 at 3:52
@relent95 Thanks for sharing, but unfortunately it doesn't explain my issue. If this were really down to how regex works internally, why would converting the string to lower case make the same regex work? It seems very strange to me.
– John Bosman, Commented Jun 25, 2024 at 5:35
No, the linked question covers your case. It's just an implementation detail of the character class. Looking at the lookup implementation, it seems to be related to memory cache hits.
– relent95, Commented Jun 25, 2024 at 9:22
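
On why lowercasing changes the behaviour: inside a character class, .-_ is parsed as a range from '.' (U+002E) to '_' (U+005F), not as three literal characters. That range contains the digits and the uppercase letters but no lowercase letters, so on uppercase input the separator class [.-_] can match the same characters as the preceding [A-Za-z0-9]+, and the engine has many ways to split the string. A minimal sketch to check this (the helper and variable names are illustrative, not from the question):

```python
import re
import string

# In the question's pattern, [.-_] is a range from '.' (0x2E) to '_' (0x5F).
separator = re.compile(r"[.-_]")


def matched(chars: str) -> str:
    """Return the characters of `chars` that the separator class accepts."""
    return "".join(c for c in chars if separator.fullmatch(c))


upper_hits = matched(string.ascii_uppercase)
lower_hits = matched(string.ascii_lowercase)
digit_hits = matched(string.digits)

print(repr(upper_hits))  # every uppercase letter is accepted
print(repr(lower_hits))  # no lowercase letter is accepted
print(repr(digit_hits))  # every digit is accepted
```

Because lowercase letters fall outside the range, a lowercase-only string gives the separator class nothing to match, which is consistent with test3 returning immediately while test2 does not.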

vassiliev · Accepted Answer · 2024-07-02 02:28:48Z

-1

Please change your regex
([A-Za-z0-9]+[.-_])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+
to
([A-Za-z0-9]+[.-_])?[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+

The * quantifier on the first group causes a huge number of unnecessary matching attempts, which takes an extremely long time.
With ?, the group can match at most once.
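
Note that ? also changes what is matched: only one separator-terminated segment is allowed before the final part of the local name, so an address with several dotted parts may be only partly redacted. An alternative sketch, assuming the separators were meant to be the literal characters '.', '_' and '-': putting the hyphen last in the class makes it a literal rather than a range operator, so the separator class no longer overlaps with [A-Za-z0-9] and the * can stay. The sketch also drops the '|' from [A-Z|a-z], where it is a literal pipe rather than an alternation. The names EMAIL_PATTERN and redact are hypothetical.

```python
import re

# Assumption: the separators are the literal characters '.', '_' and '-'.
# With '-' placed last, [._-] is three literals and not a range.
EMAIL_PATTERN = re.compile(
    r"([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+"
)


def redact(message: str, label: str = "email_address") -> str:
    """Replace every match of EMAIL_PATTERN with a placeholder."""
    return EMAIL_PATTERN.sub(f"<HIDDEN {label}>", message)


# The uppercase input from the question is left unchanged.
upper = redact("this is a test2 : " + "A" * 55)
# An address with a dotted local part is replaced as a whole.
email = redact("contact: first.last@example.com")

print(upper)
print(email)
```

In the filter from the question, this would amount to swapping the pattern string in SENSITIVE_PATTERNS; re.sub already returns the input unchanged when nothing matches, so the preceding re.search call is optional.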
