Need a string comparison solution for partial string comparison in Python

Question

Scenario: I run a task for each "Section Header" (stored as a string), and the result of each task has to be saved against the matching "Existing Section Header" (also stored as a string).

While mapping, if a task's "Section Header" is one of the "Existing Section Headers", the task results are added to it. If not, the new Section Header is appended to the Existing Section Header list.

The Existing Section Header list looks like this:

[ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"]

For the following strings, the expected behaviour is:

"Activity (Last 30 Days)" - a new section should be added.

"Executables running from disk" - the existing "Executable running from disk" should be reused [treating "Executables", with its extra "s", as the same as "Executable"].

"Actions from a file" - the existing "Actions from File" should be reused [ignoring the extra article "a"].

Is there a built-in function in Python that can help implement this logic? Any suggestion for an algorithm is also highly appreciated.

acattle · Accepted Answer · 2014-12-26 06:27:55Z

1

This is a case where you may find regular expressions helpful. You can use re.sub() to find specific substrings and replace them. It searches for non-overlapping matches of a regular expression and replaces each one with the specified string.

import re #this will allow you to use regular expressions

def modifyHeader(header):
    #change the # of days to 30 (literal parentheses must be escaped as \( and \))
    modifiedHeader = re.sub(r"Activity \(Last \d+ Days?\)", "Activity (Last 30 Days)", header)
    #add an s to "executable"
    modifiedHeader = re.sub(r"Executable running from disk", "Executables running from disk", modifiedHeader)
    #add "a"
    modifiedHeader = re.sub(r"Actions from File", "Actions from a file", modifiedHeader)

    return modifiedHeader

The r"" prefix marks a raw string, which makes it a bit easier to deal with the \ characters needed for regular expressions. \d matches any digit character, and + means "1 or more". See the documentation for Python's re module for more information.
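As a quick check of the pattern syntax described above, here is a minimal sketch. The sample headers are borrowed from the question (with one singular "Day" added for illustration), and the variable names are only illustrative. It also shows why the parentheses need escaping: unescaped, ( and ) form a group and never match the literal brackets in the header.

```python
import re

# Sample headers based on the question; "Last 7 Day" is an illustrative variant.
headers = ["Activity (Last 3 Days)", "Activity (Last 7 Day)", "Actions from File"]

# \( and \) match literal parentheses, \d+ matches one or more digits,
# and s? makes the trailing "s" optional.
pattern = re.compile(r"Activity \(Last \d+ Days?\)")

rewritten = [pattern.sub("Activity (Last 30 Days)", h) for h in headers]
print(rewritten)

# Without the backslashes the parentheses become a group, so nothing is replaced.
unescaped = re.sub(r"Activity (Last \d+ Days?)", "X", "Activity (Last 3 Days)")
print(unescaped)
```

Only the two "Activity" headers are rewritten; the third is left alone because it does not match.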


2 Comments

Unnati Shukla Over a year ago

There is a whole lot of Section Headers, so I cannot write a condition for each single section. And the Section Header list is generated dynamically each time by the task operations.

dom0 Over a year ago

Look for fuzzy matching/searching algorithms.
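One way to try fuzzy matching with only the standard library is difflib.SequenceMatcher. The sketch below is an illustration, not a tuned solution: the function name find_section, the 0.85 cutoff, and the rule that the digits in both headers must agree are all assumptions. The digit rule is there because plain similarity scores "Last 3 Days" and "Last 30 Days" as nearly identical.

```python
import re
from difflib import SequenceMatcher

def find_section(header, sections, cutoff=0.85):
    """Return the most similar existing header, or None if nothing is close enough."""
    key = header.lower()
    numbers = re.findall(r"\d+", key)
    best, best_score = None, cutoff
    for existing in sections:
        candidate = existing.lower()
        # Assumed rule: headers with different numbers are never the same section.
        if re.findall(r"\d+", candidate) != numbers:
            continue
        score = SequenceMatcher(None, key, candidate).ratio()
        if score >= best_score:
            best, best_score = existing, score
    return best

sections = ["Activity (Last 3 Days)", "Activity (Last 7 days)",
            "Executable running from disk", "Actions from File"]
print(find_section("Executables running from disk", sections))
print(find_section("Actions from a file", sections))
print(find_section("Activity (Last 30 Days)", sections))
```

The cutoff would need adjusting against real headers; too low a value merges unrelated sections, too high a value misses small variations.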

Irshad Bhat · Accepted Answer · 2014-12-26 10:16:20Z

0

Since you want to compare only the stem or "root word" of each word, I suggest using a stemming algorithm. Stemming algorithms attempt to automatically remove suffixes (and in some cases prefixes) in order to find the "root word" or stem of a given word. This is useful in various natural language processing scenarios, such as search. Luckily there is a Python package for stemming (the stemming package, imported in the code below).

Next, you want to compare the strings without stop-words (a, an, the, from, etc.), so you need to filter these words out before comparing. You can get a list of stop-words from the internet, or you can use the nltk package to import a stop-words list.

If there is any issue with nltk, here is the list of stop words:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
 'should', 'now']

Now use this simple code to get your desired output:

from stemming.porter2 import stem
from nltk.corpus import stopwords
stopwords_ = stopwords.words('english')

def addString(x):
    # `section` is the global list of existing section headers (see the demo below)
    flag = True
    # stem every word of the new header, skipping stop-words
    y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
    for i in section:
        i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
        if y == i:
            # an equivalent header already exists
            flag = False
            break
    if flag:
        section.append(x)
        print("\tNew Section Added")

Demo:

>>> from stemming.porter2 import stem
>>> from nltk.corpus import stopwords
>>> stopwords_ =  stopwords.words('english')
>>> 
>>> def addString(x):
...    flag = True
...    y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
...    for i in section:
...       i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
...       if y==i:
...          flag = False
...          break
...    if flag:
...       section.append(x)
...       print("\tNew Section Added")
... 
>>> section = [ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"]  # initial Section list
>>> addString("Activity (Last 30 Days)")
    New Section Added
>>> addString("Executables running from disk")
>>> addString("Actions from a file")
>>> section
['Activity (Last 3 Days)', 'Activity (Last 7 days)', 'Executable running from disk', 'Actions from File', 'Activity (Last 30 Days)']  # Final section list
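Since the question needs to save results against the matching existing header, a variant that returns that header may be handy. The following is only a sketch under stated assumptions: find_or_add and normalize are hypothetical names, the stop-word set is a small excerpt, and the trailing-"s" stripping is a crude stand-in for a real stemmer such as porter2.

```python
import re

# Small illustrative stop-word set; the full list appears earlier in this answer.
STOP_WORDS = {"a", "an", "the", "from", "of", "in", "on"}

def normalize(header):
    """Lowercase, drop punctuation and stop-words, strip a trailing 's'."""
    words = re.findall(r"[a-z0-9]+", header.lower())
    # Crude plural stripping: a stand-in for a real stemmer, not equivalent to one.
    return tuple(w[:-1] if w.endswith("s") and len(w) > 3 else w
                 for w in words if w not in STOP_WORDS)

def find_or_add(header, sections):
    """Return the existing header equivalent to `header`, appending it if none is."""
    key = normalize(header)
    for existing in sections:
        if normalize(existing) == key:
            return existing
    sections.append(header)
    return header

section = ["Activity (Last 3 Days)", "Activity (Last 7 days)",
           "Executable running from disk", "Actions from File"]
print(find_or_add("Executables running from disk", section))
print(find_or_add("Actions from a file", section))
print(find_or_add("Activity (Last 30 Days)", section))
```

The returned string can be used directly as the key under which the task result is stored.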

edited Dec 26, 2014 at 10:16

answered Dec 26, 2014 at 7:18


4 Comments

Unnati Shukla Over a year ago

Thanks, this really sounds interesting, but I am not able to access the stop-words list. It gives me this error:

Unnati Shukla Over a year ago

Resource u'corpora/stopwords' not found. Please use the NLTK Downloader to obtain the resource: >>> nltk.download() Searched in: - 'C:\\Users\\Unnati_Shukla/nltk_data' - 'C:\\nltk_data' - 'D:\\nltk_data' - 'E:\\nltk_data' - 'C:\\Python27\\nltk_data' - 'C:\\Python27\\lib\\nltk_data' - 'C:\\Users\\Unnati_Shukla\\AppData\\Roaming\\nltk_data'

Irshad Bhat Over a year ago

I've uploaded the stop-words list that nltk.stopwords returns. Use it directly: remove the lines "from nltk.corpus import stopwords" and "stopwords_ = stopwords.words('english')" from the code, and assign the list that I've uploaded to stopwords_ instead.

Unnati Shukla Over a year ago

Hi, I was able to adapt this approach for my application, but there are still cases where stemming does not give a smart solution, like (24 hrs) & (24 hours), or XYZ(Persistence) & XYX ( Find Persistence). I am supposed to treat such pairs as the same. Is there anything I need to add to my existing solution so that these few cases are also treated as the same string?
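For abbreviation pairs like (24 hrs) and (24 hours), one possible extension is to map known abbreviations to a canonical word before comparing. This is a sketch only: the ALIASES table, the canonical function, and the "Uptime" header are hypothetical. It does not cover the Persistence pair, which would need a looser rule than exact equality.

```python
import re

# Hypothetical alias table; extend it with the abbreviations seen in real headers.
ALIASES = {"hrs": "hours", "hr": "hour"}

def canonical(header, aliases=ALIASES):
    """Lowercase, split into alphanumeric tokens, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", header.lower())
    return tuple(aliases.get(token, token) for token in tokens)

# "Uptime" is an invented header used only to illustrate the comparison.
same = canonical("Uptime (24 hrs)") == canonical("Uptime (24 hours)")
print(same)
```

The same mapping step could run before stemming and stop-word filtering in the existing solution.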
