
Good evening,

I am converting a PDF into CSV using Python, and I am using a regex to extract the information.

The raw text, after extracting text from PDF, could look like this:

Account Transaction Details
Twin Account   123-456-789-1
Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78  
03 Jan Funds Transfer 195.04 123,456.78  
mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78  
WIRE OTHR
ANTON HARLEY
Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78  
PIB8452145632845963
Abricot 480
OTHR Transfer

I used a RegEx [0-3]{1}[0-9]{1}\s[A-Z]{1}[a-z]{2}\s[?A-Za-z]{1,155} and managed to get the needed transactions:

01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
03 Jan Funds Trf - SPEED 3,500.00 123,345.78

However, the additional information between the matches was dropped, because I split the text on \n and then ran the regex on each line.

How do I write the code so that I also get the additional information that sits between the regex matches, with that information attached to the preceding match? This is my envisaged output:

01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78 OTHR ANTON HARLEY Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78 PIB8452145632845963 Abricot 480 OTHR Transfer

Edit:

I have adapted @dcsuka's solution and got the following:

06 Jan Debit-Consumer 12.60 123,456.78   SNIP AVENU13568100 4265884035605848

06 Jan Inward DR - 828.24 123,456.78   SHIP G12345HUJ ITX

07 Jan Funds Transfer 50.00 123,456.78   Pleasenotethatyouareboundbyadutyundertherulesgoverningtheoperationofthisaccount,tochecktheentriesintheabovestatement. Ifyoudonotnotifyusinwritingofanyerrors, omissionsorunauthoriseddebitswithinfourteen(14)daysofthisstatement,theentriesaboveshallbedeemedvalid,correct,accurateandconclusivelybindinguponyou,andyoushallhaveno claim against the bank in relation thereto. XYZ Ltd  •  80 QuincyPlace ABC Plaza XXX 12345  •  Co. Reg. No. 1234567890Z  •  GST Reg. No. YY-8121234-2  •   www.xyzabc.com

07 Jan Inward CR - SPEED 9,092.06 123,456.78   SALAD SALAS Payment CARL QWE 817264950

How do I remove the excess text "Pleasenotethatyouareboundbyadut..."? The only pattern I can see is that it starts with a very long word (probably more than 20 characters). Is that the way to go?

Edit2:

@dcsuka has adjusted the code to remove the 'noise' based on words of 20 or more characters. Thank you, dcsuka!
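For illustration, the long-word filter could look roughly like the sketch below. The helper name strip_footer is hypothetical, and the 20-character threshold is only the guess made above. It assumes that no genuine reference number or description word reaches 20 word characters; if one does, it would be cut off as well.

```python
import re

# Hypothetical helper; the 20-character threshold is the guess from the question.
# Matches optional whitespace, then a run of 20+ word characters, then the rest.
LONG_WORD = re.compile(r"\s*\w{20,}.*$", flags=re.S)

def strip_footer(record: str) -> str:
    """Drop everything from the first run of 20+ word characters onwards."""
    return LONG_WORD.sub("", record).rstrip()

noisy = (
    "07 Jan Funds Transfer 50.00 123,456.78   "
    "Pleasenotethatyouareboundbyadutyundertherulesgoverningtheoperationofthisaccount,"
    "tochecktheentries. XYZ Ltd"
)
clean = strip_footer(noisy)
print(clean)

# A record whose longest token is shorter than 20 characters is left alone.
untouched = strip_footer(
    "06 Jan Debit-Consumer 12.60 123,456.78 SNIP AVENU13568100 4265884035605848"
)
print(untouched)
```

The 16-digit card number in the second record survives because it is below the threshold, which is why the cut-off should not be set much lower.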

2 Answers


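You can split the string on newlines that are followed by a number, using a positive lookahead, so that you get bigger chunks that better reflect your expected output:

import re

# text1 holds the raw text extracted from the PDF
split_text = re.split(r"\n(?=\d{1,3}\s)", text1)

# keep chunks that start with a two-digit day, collapse whitespace,
# and cut everything from the first word of 20+ characters onwards
result = [re.sub(r"\s?\w{20,}.*$", "", " ".join(i.split())) for i in split_text if re.search(r"^\d\d\s", i)]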

# ['01 Jan BALANCE B/F 123,456.78',
#  '03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690',
#  '03 Jan Inward Credit-QUICK 3,000.84 123,456.78 WIRE OTHR ANTON HARLEY Other',
#  '03 Jan Funds Trf - SPEED 3,500.00 123,345.78 PIB8452145632845963 Abricot 480 OTHR Transfer']
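
If you would rather keep the line-by-line loop from the question, a rough alternative is to remember the last matched transaction and append every non-matching line to it. This is only a sketch under the assumption that every transaction line starts with a two-digit day and a three-letter month; the names START and records are made up for the example, and the sample text is copied from the question.

```python
import re

# Sample text copied from the question.
text = """Account Transaction Details
Twin Account   123-456-789-1
Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
WIRE OTHR
ANTON HARLEY
Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78
PIB8452145632845963
Abricot 480
OTHR Transfer"""

# Assumption: a transaction line starts with "DD Mon ".
START = re.compile(r"^\d{2}\s[A-Z][a-z]{2}\s")

records = []
for line in text.splitlines():
    line = line.strip()
    if START.match(line):
        records.append(line)        # a new transaction begins
    elif records and line:
        records[-1] += " " + line   # continuation of the previous transaction
    # lines before the first transaction (the headers) are skipped

for record in records:
    print(record)
```

The long-word filter from above can then be applied to each entry of records in a second pass.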

I looked at this again after gaining more knowledge of regex.

As @dcsuka suggested, I needed a positive lookahead, so that the regex does not consume the start of the next transaction, which I use as the end marker.

This was the code I used:

line_re = re.compile(r'(^[0-9]{2}) ([A-Z]{1}[a-z]{2}) (.*?)(?=\n[0-9]{2} [A-Z]{1}[a-z]{2}|[A-Za-z]{15,})', flags=re.M | re.S)

I built the pattern from these parts:

  1. Day using (^[0-9]{2}); the '^' anchors the match at the start of a line, and the day is always two digits (01 or 11)
  2. Month using ([A-Z]{1}[a-z]{2}), since the month would be Dec/ Jan/ Feb ...
  3. My main capture using (.*?), which is the description in this case; it is lazy, so it stops at the first point where the lookahead succeeds
  4. A lookahead (?=\n[0-9]{2} [A-Z]{1}[a-z]{2}|[A-Za-z]{15,}) that ends the capture either just before the next line starting with a day and month, or just before a run of 15 or more letters (the footer text)
  5. Lastly, flags=re.M | re.S: re.M lets '^' match at the start of every line, and re.S lets '.' match newlines, so the description capture can span several lines.

Once done, I called line_re.findall() on the extracted text to collect all matches.
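
Put together, the approach might look like the sketch below. The sample text is shortened from the question. The \Z alternative in the lookahead is an addition that is not in my original pattern: without it, the last transaction has nothing after it for the lookahead to match, so it would be dropped. The whitespace clean-up in rows is also an extra step for the example.

```python
import re

# Shortened sample based on the text in the question.
text = (
    "Date Description Withdrawals Deposits Balance\n"
    "01 Jan BALANCE B/F 123,456.78\n"
    "03 Jan Funds Transfer 195.04 123,456.78\n"
    "mBK-4653112690\n"
    "03 Jan Funds Trf - SPEED 3,500.00 123,345.78\n"
    "PIB8452145632845963\n"
    "Abricot 480\n"
    "OTHR Transfer\n"
)

line_re = re.compile(
    r"^([0-9]{2}) ([A-Z][a-z]{2}) (.*?)"
    # Stop before the next "DD Mon" line, before a run of 15+ letters,
    # or at the end of the text (\Z is an assumption added for the last record).
    r"(?=\n[0-9]{2} [A-Z][a-z]{2}|[A-Za-z]{15,}|\Z)",
    flags=re.M | re.S,
)

# findall returns one (day, month, description) tuple per transaction;
# the description may contain newlines, so collapse all whitespace runs.
rows = [
    (day, month, " ".join(desc.split()))
    for day, month, desc in line_re.findall(text)
]

for row in rows:
    print(row)
```

Each tuple can then be passed to csv.writer as one row.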

Hope this helps.
