I have a working routine that determines which categories a news item belongs to. It works when the title, category, subcategory, and search words (as regular expressions) are assigned directly in Python.

But when I retrieve these values from PostgreSQL as strings, the same routine raises no errors and returns no results.

I checked the data types; both are Python strings.

What can be done to fix this?

import re

# set the text to be analyzed
title = "next week there will be a presentation. The location will be aat"

# these could be the categories
category = "presentation"
subcategory = "scientific"

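# these are the regular expressions
# note: re.X (verbose mode) ignores unescaped whitespace inside the pattern,
# so the space in "asm microbe" has to be escaped to match a literal space
main_category_search_words = r'\bpresentation\b'
sub_category_search_words = r'\basm\ microbe\b | \basco\b | \baat\b'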

category_final = ''
subcategory_final = ''

# identify main category
r = re.compile(main_category_search_words, flags=re.I | re.X)
result = r.findall(title)

if len(result) == 1:
    category_final = category

    # identify sub category
    r2 = re.compile(sub_category_search_words, flags=re.I | re.X)
    result2 = r2.findall(title)
    if len(result2) > 0:
        subcategory_final = subcategory

print("analysis result:", category_final, subcategory_final)

1 Answer

I'm pretty sure that what you get back from PostgreSQL is not a raw string literal, hence your RegEx is invalid. You will have to escape the backslashes in your pattern explicitly in the DB.

print(r"\basm\b")
print("\basm\b")
print("\\basm\\b")

# output
\basm\b

as       # yes, including the line break above here
\basm\b
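
Whatever the origin of a pattern string, a quick way to see which characters it really contains is to look at its repr() and its length. The sketch below is only an illustration: db_value is a hypothetical stand-in for a value fetched from PostgreSQL, and it assumes the column stores a single backslash followed by the letter b, which the question does not confirm.

```python
import re

# A regular literal turns \b into a backspace character (0x08).
plain = "\basm\b"
# A raw literal keeps the backslash and the letter b as two separate characters.
raw = r"\basm\b"
# Hypothetical stand-in for a database value, built without any literal escapes:
# chr(92) is a single backslash character.
db_value = chr(92) + "basm" + chr(92) + "b"

print(repr(plain), len(plain))  # length 5: backspace, a, s, m, backspace
print(repr(raw), len(raw))      # length 7: \, b, a, s, m, \, b
print(db_value == raw)          # True: same characters, however they were produced

# A string holding backslash + b compiles to a word boundary.
matches = re.findall(db_value, "the asm meeting", flags=re.I)
print(matches)
```

If repr() of the fetched value shows doubled backslashes inside the quotes and the length matches the raw literal, the value already holds the characters the regex engine needs.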

4 Comments

Thanks, this definitely sheds some light on what should be corrected! As a test I changed the main category entry in PostgreSQL to \\bpresentation\\b and then used r = re.compile(r"'"+main_category_search_words+"'", flags=re.I | re.X), but got no result. I think I'm close, but I'm not sure where to proceed from here. Advice is very welcome! :)
You can print your compiled expression to verify it is what you are looking for. To me it looks like you now end up with '\bams\b' (including the single quotes) due to your string concatenation. I don't think that concatenation is necessary at all since you already changed the DB value.
Shmee, thanks, you pushed me in the right direction, and now it works!
Just want to point this out to anyone having this challenge in the future. To get the raw string into Python from PostgreSQL I used r""+search_words, because without it the string is not seen as raw: r = re.compile(r""+main_category_search_words, flags=re.I | re.X)
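
For readers following the comments, a small sketch may help separate the two concatenations discussed. It follows the suggestion to inspect the compiled expression, using the pattern attribute of the compiled object. Note that in Python r"" evaluates to the empty string, so prefixing it leaves a value unchanged; "raw" describes how a literal is written in source code, not a property of the resulting string. The names search_words and title are illustrative, and the value is assumed to hold single backslashes.

```python
import re

# Hypothetical value, as if it had been read from the database.
search_words = r"\bpresentation\b"
title = "next week there will be a presentation"

# Wrapping the value in quote characters makes the quotes part of the pattern.
quoted = re.compile(r"'" + search_words + "'", flags=re.I | re.X)
# r"" is the empty string, so this concatenation yields the same string.
prefixed = r"" + search_words
# Compiling the value directly, with no concatenation at all.
direct = re.compile(search_words, flags=re.I | re.X)

print(quoted.pattern)            # the pattern exactly as the regex engine receives it
print(quoted.findall(title))     # [] because the title contains no quote characters
print(prefixed == search_words)  # True
print(direct.findall(title))     # ['presentation']
```

Under these assumptions, the working version in the last comment behaves the same as passing the value straight to re.compile.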
