Locating and extracting a string from multiple text files in Python

Question

I am just starting to learn Python. For work I go through a lot of PDFs, so I found the PDFMiner tool, which converts a directory of PDFs to text. I then wrote the code below to tell me whether each PDF is an approved claim or a denied claim. What I don't understand is how to say: find the string that starts with "Tracking Identification Number...", take the 18 characters after it, and put them into an array?

import os
import glob
import csv
def check(filename):
    if 'DELIVERY NOTIFICATION' in open(filename).read():
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
    elif 'Dear Customer:' in open(filename).read():
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")

def iterate():

    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        filename = infile
        check(filename)


iterate()

Any help would be appreciated. This is what the text file looks like:

Shipper Number............................577140Pickup Date....................................06/27/17
Number of Parcels........................1Weight.............................................1 LBS
Shipper Invoice Number..............30057010Tracking Identification Number...1Z000000YW00000000
Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B
WE HAVE BEEN UNABLE TO PROVIDE SATISFACTORY PROOF OF DELIVERY FOR THE ABOVE
SHIPMENT.  WE APOLOGIZE FOR THE INCONVENIENCE THIS CAUSES.
NPT8AEQ:000A0000LDI 07
----------------Page (1) Break----------------

Update: Many helpful answers. Here is the route I took, and it is working quite nicely, if I do say so myself. This is going to save tons of time! Here is the entire code for any future viewers.

import os
import glob
import csv

arrayDenied = []
arrayApproved = []

def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print('current file is:' + infile)
        check(infile)

def check(filename):
    # Read each file once and reuse the text for every check.
    with open(filename, 'rt') as file_contents:
        myText = file_contents.read()
    if 'DELIVERY NOTIFICATION' in myText:
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        print("Denied: " + myNumber)
        arrayDenied.append(myNumber)
    elif 'Dear Customer:' in myText:
        print("This claim was Approved")

        startTrackingNum = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[startTrackingNum : startTrackingNum+18]

        startClaimNumberIndex = myText.index("Claim Number ") + len("Claim Number ")
        myClaimNumber = myText[startClaimNumberIndex : startClaimNumberIndex+11]

        arrayApproved.append(myNumber + " - " + myClaimNumber)
    else:
        print("I don't know if this is approved or denied")

iterate()
with open('Approved.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayApproved:
        writer.writerow([val])
with open('Denied.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayDenied:
        writer.writerow([val])
print(arrayDenied)
print(arrayApproved)

Update: Added the rest of my finished code. It writes the lists to CSV files, where I run a few =LEFT() formulas and such, and boom, I have 1000 tracking numbers in a matter of minutes. This is why programming is great.
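One possible refinement, sketched under assumptions: if the tracking number and claim number are written as separate CSV columns, the spreadsheet formulas may not be needed at all. The helper name write_rows, the column headers, and the claim number value below are hypothetical and not part of the original code; the demo writes to a temporary directory so it does not touch real files.

```python
import csv
import os
import tempfile

def write_rows(path, header, rows):
    """Write a header line plus one CSV row per tuple in `rows`."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as output:
        writer = csv.writer(output)
        writer.writerow(header)
        writer.writerows(rows)

# Hypothetical values; the claim number below is made up for illustration.
approved = [("1Z000000YW00000000", "00000000000")]
demo_path = os.path.join(tempfile.mkdtemp(), "Approved.csv")
write_rows(demo_path, ["tracking_number", "claim_number"], approved)
```

With this layout, check() would append a (tracking number, claim number) tuple instead of a joined string.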

Are the dots really in the file? Is the tracking number always 18 characters starting with 1Z?
– pault, Commented Feb 7, 2018 at 15:38
Yes. I have thousands of PDFs to go through, and typically I copy and paste them into an Excel sheet, so I am trying to automate this painful process. The approval PDFs are a little different, but yes, essentially they are all structured the same.
– tjb, Commented Feb 7, 2018 at 15:42
It does, the syntax is off. Please see my answer and let me know if this solves the issue.
– FatihAkici, Commented Feb 7, 2018 at 16:32
@Bluestreak22 Also, in general you should avoid opening files manually with calls such as open(filename).read(). You can open the file once using with open(), then do your if check and all the other operations on the text you read. I cover that in the answer.
– FatihAkici, Commented Feb 7, 2018 at 16:53

FatihAkici · Accepted Answer · 2018-02-07 19:47:47Z

2

If your goal is just to find the "Tracking Identification Number..." string and the 18 characters that follow it, you can find the index of that string, move to where it ends, and slice from that point through the next 18 characters.

# Read the text file into memory:
with open(filename, 'rt') as txt_file:
    myText = txt_file.read()
    if 'DELIVERY NOTIFICATION' in myText:
        # Find the desired string and get the subsequent 18 characters:
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        arrayDenied.append(myNumber)

You can also change the append line to something like arrayDenied.append(filename + ' ' + myNumber), for example to record which file each number came from.
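The index-and-slice idea can also be wrapped in a small reusable function. This is only a sketch: the name extract_after is hypothetical, and it uses str.find (which returns -1) rather than str.index (which raises ValueError) so that files without the label can be skipped quietly.

```python
def extract_after(text, label, length):
    """Return the `length` characters that follow `label`, or None if absent."""
    pos = text.find(label)  # find() returns -1 instead of raising ValueError
    if pos == -1:
        return None
    start = pos + len(label)
    return text[start:start + length]

# Sample line taken from the question's text file.
sample = ("Shipper Invoice Number..............30057010"
          "Tracking Identification Number...1Z000000YW00000000\n")
print(extract_after(sample, "Tracking Identification Number...", 18))
```

The same function could serve for the claim number in approval letters by passing a different label and length.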

edited Feb 7, 2018 at 19:47

answered Feb 7, 2018 at 16:32

FatihAkici


5 Comments

tjb Over a year ago

I just took a swing at this. I am getting a traceback error saying that Tracking Identification Number... is not in the list, which I would assume is because it's not reading the text file right, or maybe because there is a string bunched up against it without a space in the original text file?

tjb Over a year ago

Actually all I did was remove .splitlines() and boom it worked :)

FatihAkici Over a year ago

@Bluestreak22 Oh awesome, that is right! Glad it worked! :) Edited the answer accordingly.

tjb Over a year ago

Could you maybe explain what the index of the string is? To me an index would be a value in an array but to my knowledge a string or text file is not an array?

FatihAkici Over a year ago

index is the starting location of a substring in a string. Say your string is myText = "helloabc1234hello"; then myText.index("abc") gives you 5, because "abc" starts at index 5 of myText (counting from 0). Then you add the length of "abc", which is 3, to reach where it ends, so start is 8. That is where 1234 starts, which is the part you are interested in, so myText[start : start+4] gets those 4 characters.
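The example from this comment can be run directly. Strings in Python are sequences, so individual characters and slices can be addressed by position much like items in a list; the variable names pos and digits are just for illustration.

```python
myText = "helloabc1234hello"

first = myText[0]                  # "h": strings are indexable like lists
pos = myText.index("abc")          # 5: "hello" occupies indexes 0 to 4
start = pos + len("abc")           # 8: first character after "abc"
digits = myText[start:start + 4]   # "1234"

print(first, pos, start, digits)
```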

pault · Accepted Answer · 2018-02-07 16:45:44Z

1

Regular expressions are the way to go for your task. Here is a way to modify your code to search for the pattern.

import re
pattern = r"(?<=Tracking Identification Number)(?:(\.+))[A-Za-z0-9]{18}"

def check(filename):
    # Read the file once and reuse its contents for every check.
    with open(filename, 'r') as f:
        file_contents = f.read()
    if 'DELIVERY NOTIFICATION' in file_contents:
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
        matches = re.finditer(pattern, file_contents)
        for match in matches:
            print("Tracking Number = %s" % match.group().strip("."))
    elif 'Dear Customer:' in file_contents:
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")

Explanation:

r"(?<=Tracking Identification Number)(?:(\.+))[A-Za-z0-9]{18}"

(?<=Tracking Identification Number) is a lookbehind: the match must be immediately preceded by the string "Tracking Identification Number"
(?:(\.+)) matches one or more dots (.) (we strip these out afterwards)
[A-Za-z0-9]{18} matches 18 (capital or lowercase) letters or digits
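As an alternative to the lookbehind, a capturing group can return only the 18 characters, so no dots need to be stripped afterwards. This is a sketch rather than the answer's own code; the names TRACKING_RE and find_tracking_numbers are hypothetical, and it assumes the number is always 18 letters or digits preceded by at least one dot.

```python
import re

# The parentheses capture only the 18-character number, not the label or dots.
TRACKING_RE = re.compile(r"Tracking Identification Number\.+([A-Za-z0-9]{18})")

def find_tracking_numbers(text):
    """Return every captured tracking number found in `text`."""
    # With exactly one group, findall() returns a list of that group's strings.
    return TRACKING_RE.findall(text)

# Sample lines taken from the question's text file.
sample = ("Shipper Invoice Number..............30057010"
          "Tracking Identification Number...1Z000000YW00000000\n"
          "Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B\n")
print(find_tracking_numbers(sample))
```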

2 Comments

tjb Over a year ago

I am not gonna lie whenever I see something like this "(?:(\.+))[A-Z-a-z0-9]{18" I get the heeby jeebies and think like holy crap lol I will try this out though along with other answers just to know two ways of doing something.

pault Over a year ago

@Bluestreak22 I am by no means an expert in regex, but I find this site regex101.com to be extremely useful in testing patterns. Paste your text in there, select your programming language, and try to make your own pattern.

Setti7 · Accepted Answer · 2018-02-07 16:06:51Z

0

I think this solves your issue, just turn it into a function.

import re

string = 'Tracking Identification Number...1Z000000YW00000000'

no_dots = re.sub(r'\.', '', string)  # Removes all dots from the string

# Matches anything after "Tracking Identification Number"
matchObj = re.search(r'^Tracking Identification Number(.*)', no_dots)

if matchObj:
    print(matchObj.group(1))
else:
    print("No match!")

If you want to read the documentation it is here: https://docs.python.org/3/library/re.html#re.search

answered Feb 7, 2018 at 16:06

Setti7


2 Comments

pault Over a year ago

What if there were extra stuff after the tracking number as in s = 'Tracking Identification Number...1Z000000YW00000000...Extra Stuff'

Setti7 Over a year ago

@pault The file he showed has a line break at the end of that number, so it should stop there.
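The points raised in these two comments can be checked against the sample line from the question. The sketch below is illustrative only: the bounded pattern is an assumption (18 letters or digits), not something either commenter proposed, and the variable names are hypothetical.

```python
import re

# Sample line copied from the question, followed by the start of the next line.
text = ("Shipper Invoice Number..............30057010"
        "Tracking Identification Number...1Z000000YW00000000\n"
        "Merchandise")
no_dots = text.replace(".", "")

# "^" without re.MULTILINE only matches at the very start of the string,
# and the label sits mid-line here, so this search finds nothing.
anchored = re.search(r"^Tracking Identification Number(.*)", no_dots)

# Without the anchor, "." does not match a newline, so (.*) stops at the line end.
to_line_end = re.search(r"Tracking Identification Number(.*)", no_dots)

# Bounding the capture to 18 alphanumerics ignores trailing text on the same line.
extra = "Tracking Identification Number...1Z000000YW00000000...Extra Stuff"
bounded = re.search(r"Tracking Identification Number\.*([A-Za-z0-9]{18})", extra)

print(anchored)
print(to_line_end.group(1))
print(bounded.group(1))
```

So the line break does end the match as Setti7 says, but only once the leading anchor is dropped, and a fixed-length capture covers the case pault describes.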
