Extracting text from multiple PowerPoint files using Python

Question

I am trying to find a way to search the contents of every PowerPoint document in a folder for specific strings, preferably using Python. When one of those strings is found, I want to report the text that follows it, along with the document it was found in. I would like to compile that information into a CSV file.

So far I've only come across the olefile package, https://bitbucket.org/decalage/olefileio_pl/wiki/Home. It returns all of the text contained in a specific document, which is not what I am looking to do. Please help.

Hi kacey! Welcome to Stack Overflow! Here we help people fix, and sometimes rewrite, their existing code so that it works correctly. I'm afraid your question is a bit off-topic for the site. Here is why: what you're basically asking is, "How can I write some code to perform x, then y, then z?" While those types of questions can be appropriate, you should show what you have tried. Make an attempt at solving your problem before asking here. Who knows, you may figure it out yourself! If what you tried didn't work, we'll be more than happy to help you fix it. Good luck!
– Chris, Commented Sep 9, 2016 at 21:57

PythonProgrammi · Accepted Answer · 2019-11-21 08:09:45Z

19

Actually working

If you want to extract text:

import Presentation from pptx (pip install python-pptx)
loop over each file in the directory (using the glob module)
look in every slide and in every shape on each slide
if a shape has a text attribute, print shape.text

from pptx import Presentation
import glob

for eachfile in glob.glob("*.pptx"):
    prs = Presentation(eachfile)
    print(eachfile)
    print("----------------------")
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
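
The hasattr(shape, "text") check above may skip shapes that hold their text indirectly, such as tables and grouped shapes. Below is a minimal sketch of a recursive walker, assuming the python-pptx attribute names has_text_frame, has_table, table.rows, row.cells, cell.text, and the shapes collection on group shapes. The function name iter_shape_text is hypothetical; it looks attributes up defensively with getattr so it does not depend on a particular shape class.

```python
def iter_shape_text(shapes):
    """Yield text from shapes, descending into tables and groups.

    Hypothetical helper: relies only on attribute names that python-pptx
    exposes (has_text_frame, has_table, shapes), looked up defensively.
    """
    for shape in shapes:
        if getattr(shape, "has_text_frame", False):
            # Ordinary text box, title, or placeholder
            yield shape.text_frame.text
        elif getattr(shape, "has_table", False):
            # Table: emit the text of every cell, row by row
            for row in shape.table.rows:
                for cell in row.cells:
                    yield cell.text
        elif hasattr(shape, "shapes"):
            # Group shape: recurse into its child shapes
            yield from iter_shape_text(shape.shapes)
```

Inside the loop over prs.slides, this could replace the inner loop with: for text in iter_shape_text(slide.shapes): print(text).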

edited Nov 21, 2019 at 8:09

answered Nov 13, 2017 at 19:56

PythonProgrammi


4 Comments

Viseshini Reddy Over a year ago

Also, if PackageNotFoundError is thrown, it can be fixed by passing a file object instead: f = open(<filepath>, "rb") and then prs = Presentation(f).

Tensigh Over a year ago

The os.listdir() command in Python 2.7 won't work unless it reads something like os.listdir('.'). Other than that, it worked well for me.

PythonProgrammi Over a year ago

Yes, in python 2.7 you have to use os.listdir('.'). I am gonna change the code.

mskoryk Over a year ago

This solution worked for me. The only remark is that python package is called python-pptx, so the installation command should be "pip install python-pptx".

Dhinesh kumar M · Accepted Answer · 2018-08-18 05:22:55Z

8

tika-python

A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1,500 file formats.

Note: it also works well with PyInstaller.

Install with pip:

pip install tika

Sample:

#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])  # the file's metadata
print(parsed["content"])  # the file's text content
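
The extracted content arrives as a single string, often padded with blank lines. It may also be None when no text is found; treat that as an assumption to verify against your tika version. A small sketch of tidying it into lines before searching follows. The helper name content_lines is hypothetical, and it works on any dict shaped like the sample's parsed result.

```python
def content_lines(parsed):
    """Return the non-empty, stripped lines of a Tika-style result dict."""
    # Assumption: "content" may be missing or None when nothing was extracted
    content = parsed.get("content") or ""
    return [line.strip() for line in content.splitlines() if line.strip()]
```

Each returned line can then be checked for the strings of interest, for example with the in operator.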

Link to official GitHub

edited Aug 18, 2018 at 5:22

answered Aug 18, 2018 at 5:09

Dhinesh kumar M


1 Comment

Jeremy Giaco Over a year ago

This worked like a charm, thanks! I forgot to filter to pptx and it included PDFs. It read them perfectly from what I can tell so far.

scanny · Accepted Answer · 2016-09-10 21:04:27Z

5

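python-pptx can be used to do what you propose. At a high level, you would do something like this (a sketch of the overall approach, not a complete solution):

import glob

from pptx import Presentation

# Loop over every .pptx file in the current directory
for pptx_filename in glob.glob("*.pptx"):
    prs = Presentation(pptx_filename)
    for slide in prs.slides:
        for shape in slide.shapes:
            # Not every shape (e.g. a picture) has a text attribute
            if hasattr(shape, "text"):
                print(shape.text)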

You'd need to add the bits about searching shape text for key strings and adding them to a CSV file or whatever, but this general approach should work just fine. I'll leave it to you to work out the finer points :)
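
One way the searching and CSV bits might look is sketched below, under stated assumptions: "the text after that string" is taken to mean the remainder of the same shape's text, and all function names (text_after, find_matches, write_report, collect_texts) are hypothetical. python-pptx is imported lazily inside collect_texts so the search and reporting helpers can be used without it.

```python
import csv


def text_after(text, keyword):
    """Return the text following the first occurrence of keyword, or None."""
    index = text.find(keyword)
    if index == -1:
        return None
    return text[index + len(keyword):].strip()


def find_matches(texts_by_file, keywords):
    """Build (filename, keyword, following text) rows for every hit.

    texts_by_file maps a filename to the list of text strings found in it.
    """
    rows = []
    for filename, texts in texts_by_file.items():
        for text in texts:
            for keyword in keywords:
                found = text_after(text, keyword)
                if found is not None:
                    rows.append((filename, keyword, found))
    return rows


def write_report(rows, csv_path):
    """Write the rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "keyword", "text_after"])
        writer.writerows(rows)


def collect_texts(pattern="*.pptx"):
    """Gather shape text per file with python-pptx (imported lazily)."""
    import glob
    from pptx import Presentation

    texts_by_file = {}
    for filename in glob.glob(pattern):
        prs = Presentation(filename)
        texts_by_file[filename] = [
            shape.text
            for slide in prs.slides
            for shape in slide.shapes
            if hasattr(shape, "text")
        ]
    return texts_by_file
```

A run over the current folder could then be write_report(find_matches(collect_texts(), ["Owner:", "Due date:"]), "report.csv"), where the two keywords and the output filename are placeholders.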

answered Sep 10, 2016 at 21:04

scanny


2 Comments

Arun Kumar Over a year ago

This is not suitable for .ppt files; it only works for .pptx files.
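
Given that limitation, it may help to separate legacy .ppt files up front so they can be reported or converted instead of being passed to python-pptx. A minimal sketch using only the standard library follows; the function name split_by_format is hypothetical.

```python
from pathlib import Path


def split_by_format(folder):
    """Return (pptx_files, legacy_ppt_files) found directly in folder."""
    pptx_files, legacy_files = [], []
    for path in sorted(Path(folder).iterdir()):
        # Compare extensions case-insensitively (.PPTX and .pptx both match)
        suffix = path.suffix.lower()
        if suffix == ".pptx":
            pptx_files.append(path)
        elif suffix == ".ppt":
            legacy_files.append(path)
    return pptx_files, legacy_files
```

The first list can be fed to Presentation, while the second can be logged as skipped.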

Rustam Over a year ago

As not all shapes (e.g. images) have a text attribute, a simple check avoids exceptions: if hasattr(shape, 'text'): print(shape.text)

vhx.ai · Accepted Answer · 2022-01-22 20:33:41Z

0

Textract-Plus

Use textract-plus, which can extract text from most document formats, including pptx and pptm. Refer to the docs.

Install:

pip install textract-plus

Sample:

import textractplus as tp
text=tp.process('path/to/yourfile.pptx')

For your case:

import os
import pandas as pd
import textractplus as tp

files_csv = []
your_dir = '.'
for f in os.listdir(your_dir):
    # Only process PowerPoint files
    if f.endswith('pptx') or f.endswith('pptm'):
        text = tp.process(os.path.join(your_dir, f))
        files_csv.append([f, text])
# One row per file; index=False keeps filename as the first column
pd.DataFrame(files_csv, columns=['filename', 'text']).to_csv('your_csv.csv', index=False)

This code fetches all the pptx and pptm files from the directory and creates a CSV with the filename in the first column and the text extracted from that file in the second.

edited Jan 22, 2022 at 20:33

answered Jan 22, 2022 at 20:14

vhx.ai


Ashish Verma · Accepted Answer · 2022-02-04 12:00:57Z

0

import os
import textract

your_dir = '.'

for f in os.listdir(your_dir):
    if f.endswith('pptx') or f.endswith('pptm'):
        # Extract the text of each matching file, not a hard-coded one
        text = textract.process(os.path.join(your_dir, f))
        print(text)

answered Feb 4, 2022 at 12:00

Ashish Verma


1 Comment

Gert Arnold Over a year ago

New answers to old, well-answered questions should contain ample explanation on how they complement the other answers.
