Python, extracting numbers from Excel column and write as output

Question

I am trying to extract numbers from two columns in an Excel file and write them into the next columns.

Matching criteria: any number exactly five digits long, whether or not it is prefixed with “PB”.

I have limited the match to numbers of length five, yet a “16” is still extracted (row 2, column D).

How can I improve it? Thank you.

import xlwt, xlrd, re
from xlutils.copy import copy 

workbook = xlrd.open_workbook("C:\\Documents\\num.xlsx")
old_sheet = workbook.sheet_by_name("Sheet1")

wb = copy(workbook) 
sheet = wb.get_sheet(0)

number_of_ships = old_sheet.nrows

for row_index in range(0, old_sheet.nrows):

    Column_a = old_sheet.cell(row_index, 0).value   
    Column_b = old_sheet.cell(row_index, 1).value

    a_b = Column_a + Column_b

    found_PB = re.findall(r"[PB]+(\d{5})", a_b, re.I)
    list_of_numbers = re.findall(r'\d+', a_b)

    for f in found_PB:
        if len(f) == 5:
            sheet.write(row_index, 2, ";".join(found_PB))

    for l in list_of_numbers:
        if len(l) == 5:
            sheet.write(row_index, 3, ";".join(list_of_numbers))

wb.save("C:\\Documents\\num-1.xls")

If you use \d+, it will just extract chunks of one or more digits, so you have not restricted anything. If you need numbers after PB, write PB, not [PB] (a character class matching either P or B).
– Wiktor Stribiżew, Commented Aug 27, 2018 at 7:46
@WiktorStribiżew, thank you! Could you please post this as an answer so that I can close the question? Or would it be better to delete it?
– Mark K, Commented Aug 27, 2018 at 7:51
Well, I do not know how to answer it. What exactly do you need? What are the pattern requirements?
– Wiktor Stribiżew, Commented Aug 27, 2018 at 7:53
@WiktorStribiżew, your comment already answered it. I changed them to "found_PB = re.findall(r"PB+(\d{5})", a_b, re.I)" and "list_of_numbers = re.findall(r'\d{5}', a_b)". Problem solved!
– Mark K, Commented Aug 27, 2018 at 7:54
That PB+ does not work the way you think. It will match PB, PBB, PBBB, PBBBB, etc., and cannot match a number unless it is preceded by PB (or PBBBB, ...). The + affects only the previous character or group. If you want to quantify both letters, wrap them in a group: (?:PB). Also, + means 1 or more times. You probably want * (0 or more times) or even ? (0 or 1 times).
– Julio, Commented Aug 27, 2018 at 8:00
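
The difference between the three prefix patterns discussed in these comments can be checked with a small sketch. The sample string is made up for illustration and is not taken from the asker's spreadsheet.

```python
import re

# Hypothetical sample text, chosen to exercise each prefix variant.
text = "PB12345 B54321 PBB11111 67890"

# [PB]+ is a character class: one or more of 'P' or 'B', in any order.
class_matches = re.findall(r"[PB]+(\d{5})", text)

# PB+ is a literal 'P' followed by one or more 'B'.
plus_matches = re.findall(r"PB+(\d{5})", text)

# (?:PB)? makes the whole literal prefix 'PB' optional.
optional_matches = re.findall(r"(?:PB)?(\d{5})", text)

print(class_matches)     # ['12345', '54321', '11111']
print(plus_matches)      # ['12345', '11111']
print(optional_matches)  # ['12345', '54321', '11111', '67890']
```

Only the last form picks up the bare number 67890, because the other two require at least one letter before the digits.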

Wiktor Stribiżew · Accepted Answer · 2018-08-27 08:10:30Z

3

Your \d+ pattern matches any run of one or more digits, which is why the 16 value is matched. Your [PB]+ character class matches P or B one or more times, so it requires the digits to be preceded by a P or a B. Since you want to match the digits whether or not the prefix is present, you do not need that restriction (if A may optionally be preceded by something, the restriction no longer makes sense).

You also seem to need to extract strings of exactly 5 digits, with no other digit immediately before or after them. You may do that with (?<!\d)\d{5}(?!\d). The (?<!\d) negative lookbehind makes sure there is no digit immediately to the left of the current location, \d{5} consumes 5 digits, and the (?!\d) negative lookahead makes sure there is no digit immediately to the right of the current location. That makes the if len(l) == 5: check redundant, and you may omit the whole part of the code related to list_of_numbers.
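
As a quick check of the lookaround pattern in isolation, here is a minimal sketch comparing it with a plain \d{5}. The sample strings are invented for illustration.

```python
import re

bounded = re.compile(r"(?<!\d)\d{5}(?!\d)")  # exactly 5 digits, no digit on either side
unbounded = re.compile(r"\d{5}")             # any 5 consecutive digits

# Hypothetical cell contents.
samples = ["PB12345", "16 crates, ref 54321", "1234567", "id 123456 and 99999"]

bounded_results = [bounded.findall(s) for s in samples]
unbounded_results = [unbounded.findall(s) for s in samples]

print(bounded_results)    # [['12345'], ['54321'], [], ['99999']]
print(unbounded_results)  # [['12345'], ['54321'], ['12345'], ['12345', '99999']]
```

The short "16" is skipped by both patterns, while the 6- and 7-digit runs are only rejected by the version with lookarounds.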

So, you may just use

import xlwt, xlrd, re
from xlutils.copy import copy 

workbook = xlrd.open_workbook("C:\\Documents\\num.xlsx")
old_sheet = workbook.sheet_by_name("Sheet1")

wb = copy(workbook) 
sheet = wb.get_sheet(0)

number_of_ships = old_sheet.nrows

for row_index in range(0, old_sheet.nrows):

    Column_a = old_sheet.cell(row_index, 0).value   
    Column_b = old_sheet.cell(row_index, 1).value

    a_b = Column_a + Column_b

    found_PB = re.findall(r"(?<!\d)\d{5}(?!\d)", a_b)

    if found_PB:
        # Write all matches once, joined with ";".
        sheet.write(row_index, 2, ";".join(found_PB))

wb.save("C:\\Documents\\num-1.xls")
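
Note that xlrd 2.0 and later read only .xls files, so opening num.xlsx as above needs an older xlrd. The following is a sketch of the same idea using openpyxl instead, which is a swapped-in library and not what the answer uses. The function names join_matches and annotate_workbook are hypothetical, and the cell values are joined with a space (unlike the plain concatenation above) so that digits from column A and column B cannot merge into one longer number.

```python
import re

FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")


def join_matches(cell_values):
    """Return every standalone 5-digit number in the cell values, joined by ';'."""
    # Skip empty cells; str() also covers numeric cells.
    text = " ".join(str(v) for v in cell_values if v is not None)
    return ";".join(FIVE_DIGITS.findall(text))


def annotate_workbook(in_path, out_path, sheet_name="Sheet1"):
    """Read columns A and B of an .xlsx sheet and write the matches to column C."""
    # Imported here so join_matches can be used without openpyxl installed.
    from openpyxl import load_workbook

    wb = load_workbook(in_path)
    ws = wb[sheet_name]
    for row in ws.iter_rows(min_col=1, max_col=2):
        result = join_matches(cell.value for cell in row)
        if result:
            ws.cell(row=row[0].row, column=3, value=result)
    wb.save(out_path)


print(join_matches(["PB12345", "ref 16 and 67890"]))  # 12345;67890
```

A call such as annotate_workbook("num.xlsx", "num-1.xlsx") would then write the output as a new .xlsx file.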


2 Comments

Mark K Over a year ago

But in case the cell contains more than one 5-digit number, how can I catch them all? (For example, cell A2 is "PB65352, 456789".)

Wiktor Stribiżew Over a year ago

@MarkK If you do not need a "digit boundary" check, remove (?<!\d) and (?!\d). See this demo. Then, the solution becomes too trivial.
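
For the cell value quoted in the question above, a small sketch shows what each variant returns. re.findall already returns every match in the string, so the only choice is whether to keep the digit-boundary check.

```python
import re

cell = "PB65352, 456789"  # the example value from the comment above

with_boundary = re.findall(r"(?<!\d)\d{5}(?!\d)", cell)
without_boundary = re.findall(r"\d{5}", cell)

print(with_boundary)     # ['65352']
print(without_boundary)  # ['65352', '45678']
```

With the boundary check, the 6-digit 456789 is rejected entirely. Without it, its first five digits are returned as a match.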

Julio · Answer · 2018-08-27 07:57:01Z

1

You may use this: ^(?:PB)?\d{5}$

Demo

Explained:

^           # Begin of line/string
  (?:       # Begin of group
     PB     #   Literal 'PB'
  )         # End of group
  ?         # Make the previous group optional (? means 0 or 1 times)
  \d{5}     # 5 digits
$           # End of line/string

It is important to use the $. If you just wrote ^(?:PB)?\d{5}, you would also match 6-digit numbers, even though you wrote \d{5}: the regex would match the first five digits and stop there, without checking whether more digits follow.

If your data may start or end with spaces, you may use this instead: ^\s*(?:PB)?\d{5}\s*$ It simply adds \s* at the beginning and the end of the regex. \s* means 0 or more whitespace characters.
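
The effect of the $ anchor and of the \s* padding can be seen in a short sketch. It assumes each cell is tested on its own as a whole string, and the sample values are made up.

```python
import re

strict = re.compile(r"^(?:PB)?\d{5}$")
no_end = re.compile(r"^(?:PB)?\d{5}")         # missing the $ anchor
padded = re.compile(r"^\s*(?:PB)?\d{5}\s*$")  # tolerates surrounding whitespace

strict_ok = [bool(strict.match(s)) for s in ["PB12345", "12345", "123456", " 12345 "]]
no_end_hit = no_end.match("123456").group()   # matches only the first five digits
padded_ok = bool(padded.match(" 12345 "))

print(strict_ok)   # [True, True, False, False]
print(no_end_hit)  # 12345
print(padded_ok)   # True
```

These patterns are case-sensitive as written. Passing re.I, as the question's code does, would also accept a lowercase "pb" prefix.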


3 Comments

Mark K Over a year ago

Thank you for sharing and for the help. Would you mind if I accept Wiktor Stribiżew's answer, since it has more upvotes? I can upvote yours too. :)

Julio Over a year ago

Just accept what works better for you, and upvote if it is useful for you :) @MarkK

Mark K Over a year ago

Sharing knowledge and helping others is always a light in the learning world. You are superb, @Julio.
