1

Scenario: I have tasks that are each performed for a particular "Section Header" (stored as a string). The result of each task has to be saved against the matching "Existing Section Header" (also stored as a string).

During mapping, if a task's "Section Header" is one of the "Existing Section Headers", the task results are added to it. If not, the new Section Header is appended to the Existing Section Header list.

The Existing Section Header list looks like this:

[ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"]

For the strings below, the expected behaviour is as follows:

"Activity (Last 30 Days)" - a new section should be added.

"Executables running from disk" - the existing "Executable running from disk" should be used [treating "Executables", with its extra "s", the same as "Executable"].

"Actions from a file" - the existing "Actions from File" should be used [ignoring the extra article "a"].

Is there a built-in function in Python that could help implement this logic? Any suggestion for an algorithm would also be highly appreciated.

2 Answers

1

This is a case where you may find regular expressions helpful. You can use re.sub() to find specific substrings and replace them. It searches for non-overlapping matches of a regular expression and replaces each one with the specified string.

import re  # this will allow you to use regular expressions

def modifyHeader(header):
    # change the # of days to 30
    # (the parentheses are escaped so they match literally instead of forming a group;
    #  [Dd] accepts both "Days" and "days")
    modifiedHeader = re.sub(r"Activity \(Last \d+ [Dd]ays?\)", "Activity (Last 30 Days)", header)
    # add an s to "executable"
    modifiedHeader = re.sub(r"Executable running from disk", "Executables running from disk", modifiedHeader)
    # add "a"
    modifiedHeader = re.sub(r"Actions from File", "Actions from a file", modifiedHeader)

    return modifiedHeader

The r"" prefix marks a raw string, which makes it a bit easier to deal with the \ characters needed for regular expressions. \d matches any digit character, and + means "1 or more". See the documentation of Python's re module for more information.
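One detail that is easy to miss with headers like these: parentheses are special characters in a regular expression, so they have to be escaped to match literally. A small sketch of the difference (the "MATCHED" replacement text is only a placeholder for illustration):

```python
import re

header = "Activity (Last 7 days)"

# Unescaped: "(...)" is a capture group, so the pattern looks for the
# literal text "Activity Last 7 days" and finds nothing to replace.
unescaped = re.sub(r"Activity (Last \d+ [Dd]ays?)", "MATCHED", header)

# Escaped: "\(" and "\)" match the literal parentheses in the header.
escaped = re.sub(r"Activity \(Last \d+ [Dd]ays?\)", "MATCHED", header)

print(unescaped)  # Activity (Last 7 days)
print(escaped)    # MATCHED
```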

2 Comments

There are a whole lot of section headers, so I cannot write a condition for each individual section. The section header list is also generated dynamically each time by the task operations.
Look for fuzzy matching/searching algorithms.
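For what it's worth, the standard library has a fuzzy matcher in difflib. Below is a rough sketch of how it might be applied here, not a definitive solution. The function name find_match and the 0.9 cutoff are assumptions made up for this example. Plain similarity scores would rate "Last 3 Days" and "Last 30 Days" as almost identical, so the sketch only compares headers whose numbers are exactly equal.

```python
import difflib
import re

def find_match(header, existing, cutoff=0.9):
    """Return the existing header that fuzzily matches `header`, or None."""
    numbers = re.findall(r"\d+", header)
    # Only keep candidates with identical numbers, so that
    # "Last 3 Days" and "Last 30 Days" are never merged.
    candidates = {e.lower(): e for e in existing
                  if re.findall(r"\d+", e) == numbers}
    close = difflib.get_close_matches(header.lower(), list(candidates),
                                      n=1, cutoff=cutoff)
    return candidates[close[0]] if close else None

existing = ["Activity (Last 3 Days)", "Activity (Last 7 days)",
            "Executable running from disk", "Actions from File"]

print(find_match("Activity (Last 30 Days)", existing))
print(find_match("Executables running from disk", existing))
print(find_match("Actions from a file", existing))
```

The cutoff is a judgment call: too low and unrelated headers get merged, too high and small differences such as an extra article are rejected.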
0

Since you want to compare only the stem or "root word" of each word, I suggest using a stemming algorithm. Stemming algorithms attempt to automatically remove suffixes (and in some cases prefixes) in order to find the "root word" or stem of a given word. This is useful in various natural language processing scenarios, such as search. Luckily there is a Python package for stemming, called stemming, which provides the stemming.porter2 module used in the code below.

Next, you want to compare the strings without stop words (a, an, the, from, etc.), so you need to filter these words out before comparing. You can get a list of stop words from the internet, or you can use the nltk package to import a stop-word list.

If there is any issue with nltk, here is the list of stop words:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
 'should', 'now']

Now use this simple code to get your desired output:

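from stemming.porter2 import stem
from nltk.corpus import stopwords
stopwords_ =  stopwords.words('english')
def addString(x):
   # section is the global list of existing section headers
   flag = True
   # stem every word of the new header, skipping stop words
   y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
   for i in section:
      # normalise each existing header the same way and compare
      i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
      if y==i:
         flag = False
         break
   if flag:
      # no existing header matched, so append the new one
      section.append(x)
      print("\tNew Section Added")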

Demo:

>>> from stemming.porter2 import stem
>>> from nltk.corpus import stopwords
>>> stopwords_ =  stopwords.words('english')
>>> 
>>> def addString(x):
...    flag = True
...    y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
...    for i in section:
...       i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
...       if y==i:
...          flag = False
...          break
...    if flag:
...       section.append(x)
...       print("\tNew Section Added")
... 
>>> section = [ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"]  # initial Section list
>>> addString("Activity (Last 30 Days)")
    New Section Added
>>> addString("Executables running from disk")
>>> addString("Actions from a file")
>>> section
['Activity (Last 3 Days)', 'Activity (Last 7 days)', 'Executable running from disk', 'Actions from File', 'Activity (Last 30 Days)']  # Final section list
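If the header list is large and rebuilt often, one possible variation is to index the headers by their normalised form so that each lookup is a dictionary access and the matching existing header is returned to the caller. This is only a sketch: SectionIndex, key_for and crude_stem are names invented for the example, the stop-word set is a small hand-picked subset, and crude_stem is a stand-in that only strips a trailing "s" (it is not the Porter2 stemmer, which should be swapped in for real use).

```python
STOPWORDS = {"a", "an", "the", "from", "of", "in", "on"}  # small subset, for illustration

def crude_stem(word):
    # Stand-in for a real stemmer: only strips a trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def key_for(header):
    """Normalised form of a header: lowercased, stop words removed, words stemmed."""
    return tuple(crude_stem(w) for w in header.lower().split()
                 if w not in STOPWORDS)

class SectionIndex:
    def __init__(self, headers):
        self.headers = []   # headers in insertion order
        self._by_key = {}   # normalised key -> stored header
        for h in headers:
            self.resolve(h)

    def resolve(self, header):
        """Return the stored header equivalent to `header`, adding it if new."""
        key = key_for(header)
        if key not in self._by_key:
            self._by_key[key] = header
            self.headers.append(header)
        return self._by_key[key]

index = SectionIndex(["Activity (Last 3 Days)", "Activity (Last 7 days)",
                      "Executable running from disk", "Actions from File"])
print(index.resolve("Executables running from disk"))
print(index.resolve("Actions from a file"))
print(index.resolve("Activity (Last 30 Days)"))
```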

4 Comments

Thanks, this sounds really interesting, but I am not able to access the stop-words list. It gives me this error:
Resource u'corpora/stopwords' not found. Please use the NLTK Downloader to obtain the resource: >>> nltk.download() Searched in: - 'C:\\Users\\Unnati_Shukla/nltk_data' - 'C:\\nltk_data' - 'D:\\nltk_data' - 'E:\\nltk_data' - 'C:\\Python27\\nltk_data' - 'C:\\Python27\\lib\\nltk_data' - 'C:\\Users\\Unnati_Shukla\\AppData\\Roaming\\nltk_data'
I've uploaded the stop-words list that nltk.stopwords returns. Use it directly and remove the from nltk.corpus import stopwords and stopwords_ = stopwords.words('english') lines from the code. Assign stopwords_ to the list that I've uploaded...
Hi, I was able to adapt this approach to my application, but there are still cases where stemming does not give a good result, such as (24 hrs) & (24 hours), or XYZ(Persistence) & XYX ( Find Persistence). I need to treat such pairs as the same. Is there anything I should add to my existing solution so that these few cases are also treated as the same string?
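For the abbreviation case, one option is to expand known abbreviations before stemming and comparing. A minimal sketch, assuming a hand-maintained mapping (the ABBREVIATIONS table and the tokens helper are hypothetical, and the "Uptime" headers are made-up examples). It also splits on punctuation, so "(24" and "24" compare equal. It does not attempt the Persistence pair, which differs by a whole word rather than by spelling.

```python
import re

# Hypothetical mapping, to be extended with whatever abbreviations occur in practice.
ABBREVIATIONS = {"hrs": "hours", "mins": "minutes"}

def tokens(header):
    """Lowercase the header, split on non-alphanumerics, expand abbreviations."""
    words = re.findall(r"[a-z0-9]+", header.lower())
    return [ABBREVIATIONS.get(w, w) for w in words]

print(tokens("Uptime (24 hrs)") == tokens("Uptime (24 hours)"))
```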
