14

I am trying to find a way to look in a folder and search the contents of all of the PowerPoint documents within that folder for specific strings, preferably using Python. When those strings are found, I want to report the text after that string as well as the document it was found in. I would like to compile the information and report it in a CSV file.

So far I've only come across the olefile package, https://bitbucket.org/decalage/olefileio_pl/wiki/Home. It returns all of the text contained in a specific document, which is not what I am looking to do. Please help.

2
  • 4
    Hi Kacey! Welcome to Stack Overflow! Here we help people fix, and sometimes rewrite, their existing code so that it works correctly. I'm afraid your question is a bit off-topic for the site. Here is why: what you're basically asking is, "How can I write some code to perform x, then y, then z?" While those types of questions can be appropriate, you should show what you have tried. Make an attempt at solving your problem before asking here. Who knows, you may figure it out yourself! If what you tried didn't work, we'll be more than happy to help you fix it. Good luck! Commented Sep 9, 2016 at 21:57
  • Files with type ".pptx" are zip files. Commented Sep 9, 2016 at 22:13
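
Because a .pptx file is a zip archive, the slide text can be read with the standard library alone. The following is a minimal sketch, assuming the usual Office Open XML layout (slides stored as ppt/slides/slideN.xml, text runs stored in DrawingML a:t elements). The function name slide_texts is hypothetical.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs (<a:t>) inside slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def slide_texts(pptx_file):
    """Return {slide file number: [text runs]} for a .pptx path or file object.

    The number comes from the part name (slide1.xml, slide2.xml, ...),
    which is not guaranteed to match the on-screen slide order.
    """
    result = {}
    with zipfile.ZipFile(pptx_file) as archive:
        for name in archive.namelist():
            match = SLIDE_RE.match(name)
            if not match:
                continue  # skip layouts, media, relationships, etc.
            root = ET.fromstring(archive.read(name))
            runs = [t.text for t in root.iter(A_NS + "t") if t.text]
            result[int(match.group(1))] = runs
    return result
```

This avoids any third-party dependency, at the cost of ignoring notes, tables of other part types, and the real slide ordering.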

5 Answers

19

Actually working

If you want to extract text:

  • import Presentation from pptx (pip install python-pptx)
  • loop over each file in the directory (using the glob module)
  • look at every slide and at every shape in each slide
  • if a shape has a text attribute, print shape.text

from pptx import Presentation
import glob

for eachfile in glob.glob("*.pptx"):
    prs = Presentation(eachfile)
    print(eachfile)
    print("----------------------")
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)

4 Comments

Also, if PackageNotFoundError is thrown, it can be fixed by passing a file object instead: f = open(<filepath>, "rb") and then prs = Presentation(f).
The os.listdir() command in Python 2.7 won't work unless it reads something like os.listdir('.'). Other than that, it worked well for me.
Yes, in python 2.7 you have to use os.listdir('.'). I am gonna change the code.
This solution worked for me. The only remark is that python package is called python-pptx, so the installation command should be "pip install python-pptx".
8

tika-python

A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.

Note: It also works well with PyInstaller.

Install with pip:

pip install tika

Sample:

#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])  # metadata of the file
print(parsed["content"])   # text content of the file

Link to official GitHub

1 Comment

This worked like a charm, thanks! I forgot to filter to pptx and it included pdfs. Read them perfectly from what i can tell so far.
5

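python-pptx can be used to do what you propose. At a high level, you would do something like this (a rough outline of the overall approach rather than a finished solution):

import glob

from pptx import Presentation

# Loop over every .pptx file in the current directory.
for pptx_filename in glob.glob("*.pptx"):
    prs = Presentation(pptx_filename)
    for slide in prs.slides:
        for shape in slide.shapes:
            # Not every shape (e.g. a picture) has a text attribute.
            if hasattr(shape, "text"):
                print(shape.text)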

You'd need to add the bits about searching shape text for key strings and adding them to a CSV file or whatever, but this general approach should work just fine. I'll leave it to you to work out the finer points :)
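
The remaining pieces mentioned above (searching shape text for key strings and writing a CSV) could be sketched as follows. This assumes python-pptx is installed and that "the text after the string" means the rest of the line following the key. The function names, the example key strings, and the report.csv file name are all hypothetical.

```python
import csv
import glob
import os


def text_after(text, key):
    """Return the remainder of each line that follows an occurrence of key."""
    found = []
    for line in text.splitlines():
        _, sep, rest = line.partition(key)
        if sep:  # key occurred somewhere in this line
            found.append(rest.strip())
    return found


def search_presentations(folder, keys):
    """Yield (file path, key, text after key) for every .pptx in folder."""
    from pptx import Presentation  # assumption: pip install python-pptx

    for path in sorted(glob.glob(os.path.join(folder, "*.pptx"))):
        prs = Presentation(path)
        for slide in prs.slides:
            for shape in slide.shapes:
                if not hasattr(shape, "text"):
                    continue  # e.g. pictures have no text attribute
                for key in keys:
                    for hit in text_after(shape.text, key):
                        yield path, key, hit


def write_report(rows, csv_path):
    """Write the matches to a CSV file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "search string", "text after string"])
        writer.writerows(rows)
```

A call such as write_report(search_presentations(".", ["Owner:", "Status:"]), "report.csv") would then scan the current folder. Keeping the string matching in its own function makes it easy to swap in a regular expression if "text after the string" should mean something other than the rest of the line.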

2 Comments

This is not suitable for .ppt files; it only works for .pptx files.
As not all shapes (e.g. images) have a text attribute, a simple check avoids exceptions: if hasattr(shape, 'text'): print(shape.text)
0

Textract-Plus

Use textract-plus, which can extract text from most document formats, including pptx and pptm. Refer to the docs.

Install-

pip install textract-plus

Sample-

import textractplus as tp
text=tp.process('path/to/yourfile.pptx')

for your case-

import os
import pandas as pd
import textractplus as tp
files_csv = []
your_dir = '.'
for f in os.listdir(your_dir):
    if f.endswith('pptx') or f.endswith('pptm'):
        # extract the text of each PowerPoint file in the directory
        text = tp.process(os.path.join(your_dir, f))
        files_csv.append([f, text])
# one row per file: filename, extracted text
pd.DataFrame(files_csv, columns=['filename', 'text']).to_csv('your_csv.csv')

This code fetches all the pptx and pptm files from the directory and creates a CSV with the filename in the first column and the text extracted from that file in the second.

Comments

0
import os
import textract

your_dir = '.'

for f in os.listdir(your_dir):
    if f.endswith('pptx') or f.endswith('pptm'):
        # extract and print the text of each PowerPoint file in the directory
        text = textract.process(os.path.join(your_dir, f))
        print(text)

1 Comment

New answers to old, well-answered questions should contain ample explanation on how they complement the other answers.
