Data: I have a fairly large Excel file with more than 20 columns. Each cell contains a comment.

Desired Goal: I am trying to read all the comments in column M, named 'Engine', from the first row through the last row.

Desired Output: I want to extract all the comments in column M and save them in a list or a pandas DataFrame.

Below is what I tried after reading other threads:

# load the worksheet for iteration
from win32com.client import Dispatch
xlApp = Dispatch("Excel.Application")
workbook = xlApp.Workbooks.Open('My_large_data_file.xls')
worksheet = workbook.Sheets('Mysheet')


# get the row counts for iteration
from openpyxl import load_workbook
wb = load_workbook('My_large_data_file.xls', read_only=True)
sheet = wb.get_sheet_by_name('Mysheet')
row_count = sheet.max_row

comments = []
# iteration
for i in range(2, row_count + 1): # first row is column names
    print(i)
    comment = worksheet.Cells(i, 13).Comment.Text() # Column M = #13
    comments.append(comment)

However, this method only works for cells whose comments are visible by default. If a cell's comment is invisible, Comment is returned as None, and I get an error like this:

Traceback (most recent call last):

  File "<ipython-input-64-dead2ed27460>", line 5, in <module>
    comment = worksheet.Cells(i, 13).Comment.Text() # Column M = #13

AttributeError: 'NoneType' object has no attribute 'Text'

Problem:

1) How can I make all the cells' comments visible so that I can extract them? I am not sure whether this requires running some VBA code from Python.

2) My current code is not efficient, especially since I am dealing with 60+ such Excel files, each containing 70000+ rows. Any suggestions to improve it?

Thanks in advance!

#####################################

Comments in Excel files can be in one of several states:

  1. completely hidden without indicator - (double click triggers comments to display)
  2. hidden with a red indicator - (mouse hover triggers comments to display)
  3. displayed.

worksheet.Cells(i, j).Comment.Text()

This method works fine for cases #2 and #3, but it does not work for case #1 (hidden without an indicator).

  • You must use the iter_rows method on read-only mode. Commented Aug 15, 2018 at 16:01
  • Cells(i, j).Comment.Text() works fine for me even when comments are hidden. What did you do to make them invisible enough for Comment to become None? Commented Aug 15, 2018 at 16:14

1 Answer

As mentioned in the comment, I am unable to reproduce the issue you describe with hidden comments, so I cannot say much about it. However, the approach below may well solve that issue regardless.

Regarding performance, one thing you could try would be to avoid the overhead of COM altogether, as openpyxl actually has everything you need.

As such, you could do the following:

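from openpyxl import load_workbook

# Full (not read-only) load; openpyxl reads .xlsx/.xlsm, so a legacy .xls file needs converting first
wb = load_workbook('My_large_data_file.xls')
sheet = wb['Mysheet']  # replaces the deprecated get_sheet_by_name
# Comment text for every cell in column M, skipping the header row
comments = [c.comment.text for c in sheet['M'][1:]]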
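
The list comprehension above assumes every cell in the column carries a comment; openpyxl sets `cell.comment` to `None` otherwise, which would raise the same kind of `AttributeError` as in the question. A minimal sketch of a None-tolerant variant follows. The helper names `comment_texts` and `column_comments` are hypothetical, and the demonstration uses stand-in objects rather than a real workbook so that it runs without a file on disk.

```python
from types import SimpleNamespace


def comment_texts(cells):
    """Return the comment text for each cell, or None where a cell has no comment."""
    texts = []
    for cell in cells:
        comment = cell.comment  # openpyxl leaves this as None when no comment is attached
        texts.append(comment.text if comment is not None else None)
    return texts


def column_comments(path, sheet_name, column="M"):
    """Load a workbook with openpyxl and collect the comments of one column."""
    # Imported lazily so comment_texts stays usable without openpyxl installed.
    from openpyxl import load_workbook

    wb = load_workbook(path)  # full load, since comments are not available in read-only mode
    return comment_texts(wb[sheet_name][column][1:])  # [1:] skips the header row


# Stand-in cells (not real openpyxl objects) to show the behaviour of comment_texts
cells = [
    SimpleNamespace(comment=SimpleNamespace(text="check valve")),
    SimpleNamespace(comment=None),
]
result = comment_texts(cells)
```

With a real file, `column_comments('My_large_data_file.xlsx', 'Mysheet')` would return one entry per data row, and the resulting list can be passed to `pandas.Series` or `pandas.DataFrame` if a data frame is preferred.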

Performance-wise, reading the comments with openpyxl should be several orders of magnitude faster, as the following 1000-row comparison suggests:

In [64]: %timeit [c.comment.text for c in sheet['M'][1:1000]]
1.31 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [72]: %timeit [worksheet.Cells(i, 13).Comment.Text() for i in range(2, 1000)]
1.7 s ± 330 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The difference comes from the fact that openpyxl parses the Excel file directly, whereas win32com dispatches every call to the Excel process. By going the openpyxl route you of course lose the full power of COM, but you will likely find that dipping into COM only makes sense when it is the only option. Here, besides gaining a lot of speed, you no longer need an Excel process running alongside your script (indeed, you do not need Excel installed at all), which has the added benefit of making your scripts much more testable.
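
For cases where COM really is the only option, the `AttributeError` from the question can at least be avoided by checking `Comment` for `None` before calling `Text()`. The sketch below is an illustration under assumptions: `com_column_comments` is a hypothetical helper, and a stand-in object replaces the COM worksheet so the example runs without Excel. It does not make unreadable comments readable; it only records `None` instead of raising.

```python
from types import SimpleNamespace


def com_column_comments(worksheet, column, first_row, last_row):
    """Collect comment text from one column of an Excel COM worksheet."""
    texts = []
    for row in range(first_row, last_row + 1):
        comment = worksheet.Cells(row, column).Comment  # None when the cell has no comment
        texts.append(comment.Text() if comment is not None else None)
    return texts


# Stand-in for a COM worksheet: even rows carry a comment, odd rows do not
def _fake_cells(row, column):
    comment = SimpleNamespace(Text=lambda: f"note {row}") if row % 2 == 0 else None
    return SimpleNamespace(Comment=comment)


fake_sheet = SimpleNamespace(Cells=_fake_cells)
com_result = com_column_comments(fake_sheet, 13, 2, 4)
```

With a real worksheet obtained through win32com, the last row could be taken from `worksheet.UsedRange.Rows.Count` (assuming the used range starts at row 1), which avoids opening the file a second time just to count rows.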

7 Comments

Yes, but memory might be an issue: Excel uses quite a bit less memory per workbook than openpyxl currently manages. Comments are not available in read-only mode. So your timing should include opening the workbook.
@fuglede Thanks so much for the detailed info. It helps a lot!! But how can I read the invisible comments? I still cannot extract those invisible comments using openpyxl. Is it possible to change comments' visibility using openpyxl?
@CharlieClark: That's fair; assuming that the Excel process is hot and ready, I'm seeing about 170 ms in Excel COM and 270 ms in openpyxl, so there's a difference.
@ElsaLi: Unfortunately, it's still not clear to me what you mean by invisible comments. It certainly works for what Excel refers to as hidden comments. How would you go about creating such an invisible comment, and in which version of Excel?
@fuglede The original Excel file contains embedded VBA code that makes some comments invisible. It was designed by other engineers... If I open the Excel file, the comments only show up when I double-click the cell. Comments can be in several states: 1) completely hidden without an indicator; 2) hidden with a red indicator (mouse hover triggers comments to display); 3) displayed. Your method works fine for the second and third cases. Unfortunately, mine is the first case. I guess I need to investigate the VBA code a little bit...