I am trying to write a Python script that redacts certain data in a string before it is logged to the console. Below is my code so far.

import re
from logging import DEBUG, Logger, basicConfig, getLogger, Filter, LogRecord

SENSITIVE_PATTERNS = [
    (
        "email_address",
        r"([A-Za-z0-9]+[.-_])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+",
    ),
]


def create_logger(sensitive_patterns: list = None) -> Logger:
    basicConfig(level=DEBUG)
    logger = getLogger()
    sensitive_data_filter = SensitiveDataFilter(sensitive_patterns)
    logger.addFilter(sensitive_data_filter)
    return logger


class SensitiveDataFilter(Filter):
    def __init__(self, patterns=None):
        super().__init__()
        self.patterns = patterns or []

    def filter(self, record: LogRecord) -> bool:
        for pattern in self.patterns:
            should_redact = re.search(pattern[1], record.msg)

            if should_redact:
                record.msg = re.sub(pattern[1], f"<HIDDEN {pattern[0]}>", record.msg)

        return True


logger = create_logger(
    sensitive_patterns=SENSITIVE_PATTERNS,
)

test1 = "[email protected]"
test2 = "A"*55
test3 = test2.lower()
logger.info(f"this is test1 : {test1}")
logger.info(f"this is a test3 : {test3}")
logger.info(f"this is a test2 : {test2}")

My objective is to hide certain strings in the log; for example, I want to hide email addresses whenever they are logged. That part works fine: INFO:root:this is test1 : <HIDDEN email_address>. However, I also want other log messages to pass through unchanged when there is no redaction match, and this leads to an interesting problem. Whenever I log a long string that is all upper case, the script keeps executing and never finishes (I am guessing it ends up in some kind of infinite loop?). When the same string is lowercased, it works: INFO:root:this is a test3 : aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

What am I missing here?

I have tried running it in debug mode; the code seems to be stuck in one of the internal calls of the re module, but I cannot figure out why.

  • Does this answer your question? why python regex is so slow? Commented Jun 25, 2024 at 3:52
  • @relent95 Thanks for sharing, but unfortunately it doesn't explain my issue. If this were just how regex works internally, why would converting the string to lower case make the same regex work? It seems very strange to me. Commented Jun 25, 2024 at 5:35
  • No, the commented question covers your case. It's just an implementation detail on the character class. Looking at the lookup implementation, it seems to be related to the memory cache hit. Commented Jun 25, 2024 at 9:22

1 Answer

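Change your regex from
([A-Za-z0-9]+[.-_])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+
to
([A-Za-z0-9]+[.-_])?[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+

* lets the first group repeat any number of times, so on a long input with no @ the engine tries a huge number of ways to split the string before giving up, which takes a very long time. Inside a character class, .-_ is a range from . to _ that includes the uppercase letters and digits, which is why an all-uppercase string triggers this and an all-lowercase one does not.
? lets the group match at most once, so far fewer combinations are tried.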
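
To make the explanation above concrete, here is a minimal sketch. The first part only checks which characters the class [.-_] accepts. The second part is one possible rewrite of the pattern, not the only fix: the names SAFE_EMAIL and redact are hypothetical, and the pattern is an assumption about what should count as an email address here, not a full validator. It keeps the * but makes the separator class ([._-], with the hyphen last so it is literal) disjoint from the alphanumeric class, so there is only one way to split a run of letters.

```python
import re

# In [.-_] the hyphen sits between two characters, so it denotes the range
# from '.' (0x2E) to '_' (0x5F) rather than three literal characters.
range_class = re.compile(r"[.-_]")
print(bool(range_class.fullmatch("A")))  # uppercase letters are inside the range
print(bool(range_class.fullmatch("7")))  # so are digits
print(bool(range_class.fullmatch("a")))  # lowercase letters are above '_'
print(bool(range_class.fullmatch("-")))  # '-' (0x2D) is just below '.'

# Hypothetical replacement pattern (an assumption, not the asker's regex):
# alphanumeric runs joined by single '.', '_' or '-' separators, then '@',
# a domain label, and one or more dot-separated alphabetic parts.
SAFE_EMAIL = re.compile(
    r"[A-Za-z0-9]+(?:[._-][A-Za-z0-9]+)*@[A-Za-z0-9-]+(?:\.[A-Za-z]{2,})+"
)


def redact(text: str) -> str:
    """Replace every match of SAFE_EMAIL with the redaction marker."""
    return SAFE_EMAIL.sub("<HIDDEN email_address>", text)


print(redact("contact: john.doe-smith@example.co.uk"))
print(redact("A" * 55))  # no '@' present: returns quickly and unchanged
```

Because the group can still repeat, a local part with several separators such as john.doe-smith is redacted as a whole, whereas the ? version can match only from the last separator onward.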
