Parse a very large text file with Python?

Question

The file has about 57,000 entries, each with a book title, author names and an ETEXT No. I am trying to parse the file to get only the ETEXT Nos.

The file looks like this:

TITLE and AUTHOR                                                     ETEXT NO.

Aspects of plant life; with special reference to the British flora,      56900
 by Robert Lloyd Praeger

The Vicar of Morwenstow, by Sabine Baring-Gould                          56899
 [Subtitle: Being a Life of Robert Stephen Hawker, M.A.]

Raamatun tutkisteluja IV, mennessä Charles T. Russell                    56898
 [Subtitle: Harmagedonin taistelu]
 [Language: Finnish]

Raamatun tutkisteluja III, mennessä Charles T. Russell                   56897
 [Subtitle: Tulkoon valtakuntasi]
 [Language: Finnish]

Tom Thatcher's Fortune, by Horatio Alger, Jr.                            56896

A Yankee Flier in the Far East, by Al Avery                              56895
 and George Rutherford Montgomery
 [Illustrator: Paul Laune]

Nancy Brandon's Mystery, by Lillian Garis                                56894

Nervous Ills, by Boris Sidis                                             56893
 [Subtitle: Their Cause and Cure]

Pensées sans langage, par Francis Picabia                                56892
 [Language: French]

Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss     56891
 [Subtitle: A picture of Judaism, in the century
  which preceded the advent of our Savior]

Fra Tommaso Campanella, Vol. 1, di Luigi Amabile                         56890
 [Subtitle: la sua congiura, i suoi processi e la sua pazzia]
 [Language: Italian]

The Blue Star, by Fletcher Pratt                                         56889

Importanza e risultati degli incrociamenti in avicoltura,                56888
 di Teodoro Pascal
 [Language: Italian]

And this is what I tried:

def search_by_etext():

    fhand = open('GUTINDEX.ALL')
    print("Search by ETEXT:")

    for line in fhand:
        if not line.startswith(" [") and not line.startswith("~"):
            if not line.startswith(" ") and not line.startswith("TITLE"):
                    words = line.rstrip()
                    words = line.lstrip()
                    words = words[-7:]
                    print (words)


search_by_etext()

The code mostly works. However, for some lines it gives me part of the title or other text. For example, the output contains 'decott', which is part of an author name and shouldn't be there.

For this input:

The Bashful Earthquake, by Oliver Herford 56765 [Subtitle: and Other Fables and Verses]

The House of Orchids and Other Poems, by George Sterling 56764

North Italian Folk, by Alice Vansittart Strettel Carr 56763 and Randolph Caldecott [Subtitle: Sketches of Town and Country Life]

Wild Life in New Zealand. Part 1, Mammalia, by George M. Thomson 56762 [Subtitle: New Zealand Board of Science and Art, Manual No. 2]

Universal Brotherhood, Volume 13, No. 10, January 1899, by Various 56761

De drie steden: Lourdes, door Émile Zola 56760 [Language: Dutch]

Another example:

For Rhandensche Jongens, door Jan Lens 56702 [Illustrator: Tjeerd Bottema] [Language: Dutch]

The Story of The Woman's Party, by Inez Haynes Irwin 56701

Mormon Doctrine Plain and Simple, by Charles W. Penrose 56700 [Subtitle: Or Leaves from the Tree of Life]

The Stone Axe of Burkamukk, by Mary Grant Bruce 56699 [Illustrator: J. Macfarlane]

The Latter-Day Prophet, by George Q. Cannon 56698 [Subtitle: History of Joseph Smith Written for Young People]

Here, "Life]" shouldn't be there. Lines starting with a blank space should have been filtered out by this:

if not line.startswith(" [") and not line.startswith("~"):

But I am still getting those stray values in my output.

Consider posting text in your question as text, instead of as pictures.
– khelwood, Commented Apr 27, 2018 at 10:35
Isn't it better to check whether the line has the maximum length and its last word is a number, in order to identify a record delimiter? That should be more robust to whatever text you have on the left side.
– fferri, Commented Apr 27, 2018 at 10:36
While you edit your question please include a small sample of the input that illustrates the problem.
– user9455968, Commented Apr 27, 2018 at 10:36
What khelwood said. You should make it easy for someone to fix your code and test their fix with your sample data. They can't test it on an image! For further details, please see "Why may I not upload images of code on SO when asking a question?"
– PM 2Ring, Commented Apr 27, 2018 at 10:37
Yes, please edit your question and include the relevant parts of your data and code as text.
– user9455968, Commented Apr 27, 2018 at 10:41

bruno desthuilliers · Accepted Answer · 2018-04-27 11:23:40Z

4

Simple solution: regexps to the rescue!

import re
with open("etext.txt") as f:
    for line in f:
        match = re.search(r" (\d+)$", line.strip())
        if match:
            print(match.group(1))

The regular expression " (\d+)$" matches a space followed by one or more digits at the end of the string, and captures only the digits.

You can tighten the regexp if needed: for example, if you know all etext codes are exactly 5 digits long, you can change it to " (\d{5})$".

This works with the example text you posted. If it doesn't properly work on your own file then we need enough of the real data to find out what you really have.
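As a rough illustration of the approach above, here is a minimal sketch that runs the pattern over a few in-memory lines instead of a file. The names ETEXT_RE, extract_etext_numbers and sample are made up for this example, and the last sample line uses a tab before the number purely as an assumption about what the real file might contain. The pattern uses \s instead of a literal space so that it also accepts a tab.

```python
import re

# Whitespace followed by digits at the very end of the line.
# \s (rather than a literal space) also accepts a tab before the number.
ETEXT_RE = re.compile(r"\s(\d+)$")


def extract_etext_numbers(lines):
    """Yield the trailing number of every line that ends in one."""
    for line in lines:
        match = ETEXT_RE.search(line.rstrip())
        if match:
            yield match.group(1)


# Hypothetical sample; the last line uses a tab instead of spaces.
sample = [
    "Aspects of plant life; with special reference to the British flora,      56900",
    " by Robert Lloyd Praeger",
    "",
    "North Italian Folk, by Alice Vansittart Strettel Carr                    56763",
    " and Randolph Caldecott",
    " [Subtitle: Sketches of Town and Country Life]",
    "Tom Thatcher's Fortune, by Horatio Alger, Jr.\t56896",
]

numbers = list(extract_etext_numbers(sample))
print(numbers)
```

Note that this only looks at how a line ends, so a continuation line that happens to end in a number (a year, for instance) would also be picked up. Whether that matters depends on the real data.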

edited Apr 27, 2018 at 11:23

answered Apr 27, 2018 at 11:01

12 Comments

Azazel Over a year ago

Misses the first line ETEXT. Should start with 56900, but starts with 56899.

bruno desthuilliers Over a year ago

@Azazel I copy pasted your text example (the one in a code block) to test my script and it DOES output all of the etext nums. If something is missing then your example is not truly representative of your real data.

bruno desthuilliers Over a year ago

@Azazel I obviously can not debug this without the original file (well, the start of it at least) - to fix a problem you have to have it first and I don't have it with what you posted ;)

bruno desthuilliers Over a year ago

@Azazel to answer your previous question : the regexp module is part of the stdlib and extensively documented: docs.python.org/3/library/re.html

bruno desthuilliers Over a year ago

@Azazel can you test with this regexp instead : r"\s(\d+)$" ? I suspect you have a tab or something on some lines.

brabster · Accepted Answer · 2018-04-27 10:41:55Z

1

It could be that those extra lines that are not being filtered out start with whitespace other than a " " char, like a tab for example. As a minimal change that might work, try filtering out lines that start with any whitespace rather than specifically a space char?

To check for whitespace in general rather than a space char, you can use a regular expression. Try if not re.match(r'^\s', line) and ...
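A small sketch of the "any leading whitespace" filter, under the assumption that real entries start in the first column and continuation lines start with some whitespace character. It uses str.isspace() as an alternative to a regular expression. The helper name is_entry_line and the sample lines (including the tab and the non-breaking space) are invented for illustration.

```python
def is_entry_line(line):
    """Return True for a line that starts a new index entry.

    Assumption: entries begin with a non-whitespace character, while
    continuation lines begin with a space, a tab, or similar.
    """
    if not line.strip():
        return False  # blank line
    if line[0].isspace():
        return False  # continuation line, whatever the whitespace is
    return not line.startswith(("TITLE", "~"))


# Hypothetical sample: one continuation line starts with a tab and
# another with a non-breaking space, so startswith(" ") misses both.
sample = [
    "TITLE and AUTHOR                                                     ETEXT NO.",
    "North Italian Folk, by Alice Vansittart Strettel Carr                    56763",
    "\tand Randolph Caldecott",
    "\xa0[Subtitle: Sketches of Town and Country Life]",
    "",
    "The Blue Star, by Fletcher Pratt                                         56889",
]

entries = [line for line in sample if is_entry_line(line)]
etexts = [line.split()[-1] for line in entries]
print(etexts)
```

Taking the last whitespace-separated token with split()[-1] avoids depending on a fixed slice width such as [-7:].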

answered Apr 27, 2018 at 10:41

3 Comments

Azazel Over a year ago

Well, seems to work after your edit....Gonna test the whole code. Thanks.

brabster Over a year ago

There must have been invisible characters at the start of your strings: something that looked like a space but was something else. We can't see them, but Python can, and so startswith(" ") wasn't true. Glad I could help out.

brabster Over a year ago

I think both solutions are good to have on here. Mine is a minimal change to what the OP posted which explains exactly what the problem was. Yours is a much more compact solution, better starting point for someone else finding the question and teaches a little about regex. Room for both, I'll vote you up now!
