2

I have several .txt files and I need to extract certain data from them. The files look similar, but each of them stores different data. Here is an example of such a file:

Start Date:        21/05/2016
Format:            TIFF
Resolution:        300dpi
Source:            X Company
...

There is more information in the text files, but I only need to extract the start date, format and resolution. The files are under the same parent directory ("E:\Images"), but each file has its own folder, so I need a script that reads them recursively. Here is my script so far:

#importing a library
import os

#defining location of parent folder (raw string so the backslash is kept literally)
BASE_DIRECTORY = r'E:\Images'

#scanning through subfolders
for dirpath, dirnames, filenames in os.walk(BASE_DIRECTORY):
    for filename in filenames:

        #defining file type
        if not filename.lower().endswith('.txt'):
            continue

        #os.walk yields bare file names, so build the full path before opening
        txtfile_full_path = os.path.join(dirpath, filename)
        start_date = data_format = resolution = None

        with open(txtfile_full_path, "r") as txtfile:
            for line in txtfile:
                if line.startswith('Start Date:'):
                    start_date = line.split()[-1]
                elif line.startswith('Format:'):
                    data_format = line.split()[-1]
                elif line.startswith('Resolution:'):
                    resolution = line.split()[-1]

        print(txtfile_full_path, start_date, data_format, resolution)

Ideally, Python would extract this data together with the name of each file and save it to a text file. Because I don't have much experience in Python, I don't know how to progress any further.

3 Answers

3

Here is the code I've used:

# importing libraries
import os

# defining location of parent folder (raw string so the backslash is kept literally)
BASE_DIRECTORY = r'E:\Images'
KEYWORDS = ('Start Date:', 'Format', 'Resolution')
output = {}
file_list = []

# scanning through sub folders and collecting the paths of all .txt files
for (dirpath, dirnames, filenames) in os.walk(BASE_DIRECTORY):
    for f in filenames:
        if f.lower().endswith('.txt'):
            e = os.path.join(dirpath, f)
            file_list.append(e)

# collecting the lines of interest from each file
for f in file_list:
    print(f)
    output[f] = []
    with open(f, 'r') as txtfile:
        for line in txtfile:
            if any(keyword in line for keyword in KEYWORDS):
                output[f].append(line)

# writing the results, sorted by file path
with open('output.txt', 'w') as output_file:
    for tab in sorted(output):
        output_file.write(tab + '\n')
        output_file.write('\n')
        for row in output[tab]:
            output_file.write(row)
        output_file.write('\n')
        output_file.write('----------------------------------------------------------\n')

# keeps the console window open until Enter is pressed
input()
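
The script above writes the matching lines as they are. If you would rather have one row per file with just the values, something like the following might work. It is only a sketch: the function names (parse_fields, collect, write_report) and the tab-separated layout are my own choices, and it assumes every field sits on a single line of the form "Label: value", as in the sample from the question.

```python
import csv
import os

# Labels to look for; taken from the sample file in the question.
FIELDS = ("Start Date", "Format", "Resolution")


def parse_fields(lines):
    """Return a dict mapping each label in FIELDS to its value ('' if absent)."""
    found = dict.fromkeys(FIELDS, "")
    for line in lines:
        # Split on the first colon only, so the value may contain spaces.
        label, sep, value = line.partition(":")
        if sep and label.strip() in found:
            found[label.strip()] = value.strip()
    return found


def collect(base_directory):
    """Walk base_directory and return (path, fields) pairs sorted by path."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(base_directory):
        for name in filenames:
            if name.lower().endswith(".txt"):
                path = os.path.join(dirpath, name)
                with open(path, "r") as handle:
                    results.append((path, parse_fields(handle)))
    return sorted(results, key=lambda item: item[0])


def write_report(results, out_path):
    """Write a header row, then one tab-separated row per file."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(("File",) + FIELDS)
        for path, fields in results:
            writer.writerow([path] + [fields[label] for label in FIELDS])
```

Calling write_report(collect(r"E:\Images"), "output.tsv") would then give one line per text file. I used a .tsv extension for the report so that it is not picked up as an input file if it ends up inside the parent folder.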

2 Comments

It also sorts the text files alphabetically.
Great! I have similar processing to do. For example, my files also have an address field that spans multiple lines. How should I detect where it ends?
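
For a field that spans several lines, one possible approach is to treat every line that does not start with a label as a continuation of the previous field. This is only a sketch under assumptions I cannot check against your files: that a new field always starts with "Label:" at the very beginning of a line, and that continuation lines are indented or otherwise lack such a label. The name parse_multiline and the LABEL pattern are made up for illustration.

```python
import re

# Assumption: a new field starts with "Label:" at the very beginning of a line.
LABEL = re.compile(r"^([A-Za-z][A-Za-z ]*):\s*(.*)$")


def parse_multiline(lines):
    """Return {label: value}, joining continuation lines onto the previous field."""
    fields = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = LABEL.match(line)
        if match:
            # A labelled line closes the previous field and opens a new one.
            current = match.group(1).strip()
            fields[current] = match.group(2).strip()
        elif line.strip() and current is not None:
            # No label: treat the text as a continuation of the open field.
            fields[current] = (fields[current] + " " + line.strip()).strip()
        else:
            # Blank line, or text before any label: no field is open.
            current = None
    return fields
```

With this, the field ends as soon as the next labelled line or a blank line is reached.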
0

You do not need regular expressions. You can use basic string functions:

    with open(filename, "r") as txtfile:
        for line in txtfile:
            if line.startswith("Start Date:"):
                start_date = line.split()[-1]
            ...

Break out of the loop once you have collected all the information.
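
Combined with os.walk from the question, the idea might look roughly like this. Treat it as a sketch: the key names in PREFIXES and the helper names extract and extract_all are mine, and I keep everything after the prefix instead of only the last word so that values containing spaces are not cut off.

```python
import os

# Line prefixes from the sample file, mapped to (hypothetical) key names.
PREFIXES = {
    "Start Date:": "start_date",
    "Format:": "data_format",
    "Resolution:": "resolution",
}


def extract(path):
    """Read one file and return the values found for the prefixes above."""
    values = {}
    with open(path, "r") as txtfile:
        for line in txtfile:
            for prefix, key in PREFIXES.items():
                if line.startswith(prefix):
                    # Keep everything after the prefix, so spaces survive.
                    values[key] = line[len(prefix):].strip()
            if len(values) == len(PREFIXES):
                break  # all information collected, skip the rest of the file
    return values


def extract_all(base_directory):
    """Yield (full_path, values) for every .txt file below base_directory."""
    for dirpath, _dirnames, filenames in os.walk(base_directory):
        for filename in filenames:
            if filename.lower().endswith(".txt"):
                # os.walk yields bare names; join with dirpath before opening.
                full_path = os.path.join(dirpath, filename)
                yield full_path, extract(full_path)
```

You could then loop over extract_all(r"E:\Images") and print or save each path together with its values.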

1 Comment

I used your example in my code, but it still doesn't work. I may not have inserted it correctly. I want Python to read recursively through all the subfolders of the parent folder and extract all the info at once.
0

To grab the Start Date, you can use the following regex:

^(?:Start Date:)\D*(\d+/\d+/\d+)$
# ^ anchors the regex to the start of the line
# (?:Start Date:) matches the literal string "Start Date:" in a non-capturing group
# \D* matches non-digits zero or more times
# (\d+/\d+/\d+) captures the start date in group 1
# $ anchors the regex to the end of the line

In Python this would be:

import re

regex = r"^(?:Start Date:)\D*(\d+/\d+/\d+)$"

# the variable line points to your line in the file
match = re.search(regex, line)
if match:
    # group(1) holds the captured date, e.g. "21/05/2016"
    start_date = match.group(1)

See a demo on regex 101.
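
To try the pattern above outside of a file loop, a small helper like the one below may be handy. The STRICT pattern is the one from this answer. The LOOSE pattern is a hypothetical alternative of mine that simply captures whatever text follows the label, which could be a starting point if the dates are not always in the dd/mm/yyyy form. The helper name start_date is also just for illustration.

```python
import re

# Strict pattern from the answer: only matches digit/digit/digit dates.
STRICT = re.compile(r"^(?:Start Date:)\D*(\d+/\d+/\d+)$")
# Looser, hypothetical alternative: capture whatever follows the label.
LOOSE = re.compile(r"^Start Date:\s*(.+?)\s*$")


def start_date(line, pattern=STRICT):
    """Return the captured date text, or None if the line does not match."""
    match = pattern.search(line)
    return match.group(1) if match else None
```

The trade-off is that the looser pattern accepts any text after "Start Date:", so it no longer checks that the value actually looks like a date.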

1 Comment

Hi, thanks for your answer. The thing is that the date format is not always the same (e.g. 22 Sept.-18 Oct. 2003), so I cannot really use this code.
