I am just picking up and learning Python. For work I go through a lot of PDFs, so I found a PDFMiner tool that converts a directory of PDFs to text files. I then wrote the code below to tell me whether each PDF is an approved claim or a denied claim. What I don't understand is how to say: find the string that starts with "Tracking Identification Number...", take the 18 characters after it, and put them into an array?

import os
import glob
import csv
def check(filename):
    if 'DELIVERY NOTIFICATION' in open(filename).read():
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
    elif 'Dear Customer:' in open(filename).read():
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")

def iterate():

    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        filename = infile
        check(filename)


iterate()

Any help would be appreciated. This is what the text file looks like:

Shipper Number............................577140Pickup Date....................................06/27/17
Number of Parcels........................1Weight.............................................1 LBS
Shipper Invoice Number..............30057010Tracking Identification Number...1Z000000YW00000000
Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B
WE HAVE BEEN UNABLE TO PROVIDE SATISFACTORY PROOF OF DELIVERY FOR THE ABOVE
SHIPMENT.  WE APOLOGIZE FOR THE INCONVENIENCE THIS CAUSES.
NPT8AEQ:000A0000LDI 07
----------------Page (1) Break----------------

Update: Many helpful answers. Here is the route I took, and it is working quite nicely, if I do say so myself. This is going to save tons of time! Here is the entire code for any future viewers.

import os
import glob
import csv

arrayDenied = []
arrayApproved = []

def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print('current file is:' + infile)
        check(infile)

def check(filename):
    # Read the file once and reuse the text for every check
    with open(filename, 'rt') as file_contents:
        myText = file_contents.read()
        if 'DELIVERY NOTIFICATION' in myText:
            start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
            myNumber = myText[start : start+18]
            print("Denied: " + myNumber)
            arrayDenied.append(myNumber)
        elif 'Dear Customer:' in myText:
            print("This claim was Approved")

            startTrackingNum = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
            myNumber = myText[startTrackingNum : startTrackingNum+18]

            startClaimNumberIndex = myText.index("Claim Number ") + len("Claim Number ")
            myClaimNumber = myText[startClaimNumberIndex : startClaimNumberIndex+11]

            arrayApproved.append(myNumber + " - " + myClaimNumber)
        else:
            print("I don't know if this is approved or denied")

iterate()
with open('Approved.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayApproved:
        writer.writerow([val])
with open('Denied.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayDenied:
        writer.writerow([val])
print(arrayDenied) 
print(arrayApproved)

Update: Added the rest of my finished code. It writes the lists to CSV files, where I run some =LEFT() formulas and such, and boom, I have 1000 tracking numbers in a matter of minutes. This is why programming is great.
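
As a possible refinement of the CSV step, the tracking number and claim number could be written as separate columns, which might remove the need for spreadsheet formulas afterwards. This is only a sketch: the helper name write_pairs, the column names, the sample claim number, and the in-memory buffer standing in for Approved.csv are all assumptions for illustration.

```python
import csv
import io

def write_pairs(rows, output):
    """Write (tracking_number, claim_number) pairs as two CSV columns."""
    writer = csv.writer(output, lineterminator='\n')
    writer.writerow(['tracking_number', 'claim_number'])  # assumed column names
    for tracking, claim in rows:
        writer.writerow([tracking, claim])

# Hypothetical data. A real run would pass an open file such as
# open('Approved.csv', 'w', newline='') instead of the StringIO buffer.
approved = [('1Z000000YW00000000', '00000000000')]
buffer = io.StringIO()
write_pairs(approved, buffer)
print(buffer.getvalue())
```

To use this with the script above, check() would need to append a (myNumber, myClaimNumber) tuple rather than a joined string.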

  • Are the dots really in the file? Is the tracking number always 18 characters starting with 1Z? Commented Feb 7, 2018 at 15:38
  • Yes. I have thousands of PDFs to go through, and typically I copy and paste them into an Excel sheet, so I am trying to automate this painful process. The approval PDFs are a little different, but essentially they are all structured the same. Commented Feb 7, 2018 at 15:42
  • computerhope.com/issues/ch001721.htm Commented Feb 7, 2018 at 15:42
  • It does, the syntax is off. Please see my answer and let me know if this solves the issue. Commented Feb 7, 2018 at 16:32
  • @Bluestreak22 Also, you should in general avoid manually opening files with open(filename).read(). You can open the file once using with open(), and do your if check and all the rest of the operations inside it. I cover that in the answer. Commented Feb 7, 2018 at 16:53

3 Answers

If your goal is just to find the "Tracking Identification Number..." string and the 18 characters that follow it, you can find the index of that string, move to where it ends, and slice from that point to 18 characters later.

# Read the text file into memory:
with open(filename, 'rt') as txt_file:
    myText = txt_file.read()
    if 'DELIVERY NOTIFICATION' in myText:
        # Find the desired string and get the subsequent 18 characters:
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        arrayDenied.append(myNumber)

You can also change the append line to something like arrayDenied.append(filename + ' ' + myNumber) to record which file each number came from.
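
One thing to watch with this approach is that str.index raises ValueError when the label is missing from the text. A minimal sketch of a reusable helper follows; it uses str.find so that a missing label returns None. The function name extract_after and its parameters are hypothetical and not part of the answer.

```python
def extract_after(text, label, length):
    """Return the `length` characters that follow `label` in `text`, or None if absent."""
    pos = text.find(label)  # find() returns -1 instead of raising ValueError
    if pos == -1:
        return None
    start = pos + len(label)
    return text[start:start + length]

# One line copied from the sample text file in the question.
sample = ("Shipper Invoice Number..............30057010"
          "Tracking Identification Number...1Z000000YW00000000")

found = extract_after(sample, "Tracking Identification Number...", 18)
missing = extract_after(sample, "Claim Number ", 11)
print(found)
print(missing)
```

Returning None leaves it to the caller to decide whether a missing label is an error or just a file of a different type.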


5 Comments

I just took a swing at this. I am getting a traceback error stating that Tracking Identification Number... is not in the list, which I would assume is because it's not reading the text file right, or maybe because there is a string bunched up against it without a space in the original text file?
Actually all I did was remove .splitlines() and boom it worked :)
@Bluestreak22 Oh awesome, that is right! Glad it worked! :) Edited the answer accordingly.
Could you maybe explain what the index of the string is? To me an index would be a value in an array but to my knowledge a string or text file is not an array?
index gives the starting position of a substring within a string. Say your string is myText = "helloabc1234hello"; then start = myText.index("abc") gives you 5, because "abc" starts at index 5 of myText (counting from 0). Adding the length of "abc" to start takes you to where it ends. That position is where 1234, the part you are interested in, begins, so myText[start : start+4] gets those 4 characters.
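
The example from the last comment can be run directly. Strings support the same zero-based indexing and slicing as lists, which is why index and slices work on them. The variable names are the ones used in the comment.

```python
myText = "helloabc1234hello"

start = myText.index("abc")      # position where "abc" begins
print(start)                     # 5 (positions are counted from 0)

start = start + len("abc")       # move past "abc" to where "1234" begins
print(myText[start : start+4])   # 1234
```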

Regular expressions are the way to go for your task. Here is a way to modify your code to search for the pattern.

import re
pattern = r"(?<=Tracking Identification Number)(?:(\.+))[A-Z-a-z0-9]{18}"

def check(filename):
    # Open the file once and close it automatically
    with open(filename, 'r') as f:
        file_contents = f.read()
    if 'DELIVERY NOTIFICATION' in file_contents:
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
        matches = re.finditer(pattern, file_contents)
        for match in matches:
            print("Tracking Number = %s" % match.group().strip("."))
    elif 'Dear Customer:' in file_contents:
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")

Explanation:

r"(?<=Tracking Identification Number)(?:(\.+))[A-Z-a-z0-9]{18}"

  • (?<=Tracking Identification Number) is a lookbehind: it requires the match to be immediately preceded by the string "Tracking Identification Number", without including that string in the match
  • (?:(\.+)) matches one or more dots (.) (we strip these out after)
  • [A-Z-a-z0-9]{18} matches 18 instances of (capital or lowercase) letters or numbers

More on Regex.
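
As an alternative sketch (not the pattern above), a capturing group can return the tracking number directly, so there are no dots to strip afterwards. It assumes the number is always 18 letters or digits preceded by at least one dot, as in the sample file. The names TRACKING_RE and find_tracking_numbers are made up for illustration.

```python
import re

# Label, one or more literal dots, then capture exactly 18 letters/digits.
TRACKING_RE = re.compile(r"Tracking Identification Number\.+([A-Za-z0-9]{18})")

def find_tracking_numbers(text):
    """Return every tracking number found in `text`, in order of appearance."""
    return TRACKING_RE.findall(text)

# Two lines copied from the sample text file in the question.
sample = (
    "Shipper Invoice Number..............30057010"
    "Tracking Identification Number...1Z000000YW00000000\n"
    "Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B\n"
)
numbers = find_tracking_numbers(sample)
print(numbers)
```

Because findall returns a list, the same call would also cover a text file that contains more than one page, each with its own tracking number.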

2 Comments

I am not gonna lie whenever I see something like this "(?:(\.+))[A-Z-a-z0-9]{18" I get the heeby jeebies and think like holy crap lol I will try this out though along with other answers just to know two ways of doing something.
@Bluestreak22 I am by no means an expert in regex, but I find this site regex101.com to be extremely useful in testing patterns. Paste your text in there, select your programming language, and try to make your own pattern.

I think this solves your issue, just turn it into a function.

import re

string = 'Tracking Identification Number...1Z000000YW00000000'

no_dots = re.sub(r'\.', '', string)  # Removes all dots from the string

matchObj = re.search(r'^Tracking Identification Number(.*)', no_dots)  # Matches anything after the "Tracking Identification Number"

# re.search returns None when nothing matches
if matchObj:
    print(matchObj.group(1))
else:
    print("No match!")

If you want to read the documentation it is here: https://docs.python.org/3/library/re.html#re.search

2 Comments

What if there were extra stuff after the tracking number as in s = 'Tracking Identification Number...1Z000000YW00000000...Extra Stuff'
@pault The file he showed has a line break at the end of that number, so it should stop there.
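
For the situation raised in these comments, one option is to bound the match to 18 characters instead of capturing everything with (.*). The sketch below assumes the tracking number is always exactly 18 letters or digits. It also drops the ^ anchor, since in the sample file the label appears in the middle of a line.

```python
import re

s = 'Tracking Identification Number...1Z000000YW00000000...Extra Stuff'

# \.* skips the dot leader; {18} stops the capture after 18 characters.
match = re.search(r'Tracking Identification Number\.*([A-Za-z0-9]{18})', s)

if match:
    print(match.group(1))
else:
    print("No match!")
```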
