Python Extract data from file

Question

I have a text file, say:

text1 text2 text text
text text text text

I want to first count the number of strings in the file (all delimited by spaces) and then output the first two (text1 text2).

Any ideas?

Thanks in advance for the help

Edit: This is what I have so far:

>>> f=open('test.txt')
>>> for line in f:
...     print line
...
ï»¿text1 text2 text text text text hello
>>> words=line.split()
>>> words
['\xef\xbb\xbftext1', 'text2', 'text', 'text', 'text', 'text', 'hello']
>>> len(words)
7
>>> if len(words) > 2:
...     print "there are more than 2 words"

The first problem I have is that my text file contains: text1 text2 text text text

But when I print words I get: ['\xef\xbb\xbftext1', 'text2', 'text', 'text', 'text', 'text', 'hello']

Where does the '\xef\xbb\xbf' come from?

What have you tried so far? What problems did you run into? This is quite basic Python, but we can help if you have specific problems with your code.
– Martijn Pieters, Commented Sep 24, 2012 at 8:07

Martijn Pieters · Accepted Answer · 2012-09-24 09:28:37Z

17

To read a file line by line, just loop over the open file object in a for loop:

for line in open(filename):
    # do something with line

To split a line by whitespace into a list of separate words, use str.split():

words = line.split()

To count the number of items in a Python list, use len(yourlist):

count = len(words)

To select the first two items from a Python list, use slicing:

firsttwo = words[:2]

I'll leave constructing the complete program to you, but you won't need much more than the above, plus an if statement to see if you already have your two words.
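As one possible way to combine the steps above, here is a minimal sketch. The function name count_and_first_two is illustrative and not from the question, and it assumes the count should cover every line of the file. It leaves the BOM aside, so on a file that starts with one the first word would still carry it.

```python
def count_and_first_two(filename):
    """Return (total word count, first two words) for a whitespace-delimited file."""
    count = 0
    first_two = []
    with open(filename) as f:
        for line in f:
            words = line.split()
            count += len(words)
            # Keep collecting only until two words have been gathered.
            if len(first_two) < 2:
                first_two.extend(words[:2 - len(first_two)])
    return count, first_two
```

Calling count_and_first_two('test.txt') returns a tuple, so the count and the first two words can be printed or checked separately.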

The three extra bytes you see at the start of your file are the UTF-8 BOM (Byte Order Mark); it marks your file as UTF-8 encoded, but it is redundant and only really used on Windows.

You can remove it with:

import codecs
# Drop the BOM bytes if the line starts with them
if line.startswith(codecs.BOM_UTF8):
    line = line[len(codecs.BOM_UTF8):]
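For reference, codecs.BOM_UTF8 is exactly the three bytes '\xef\xbb\xbf' seen in the question's output. The sketch below wraps the check in a helper; strip_utf8_bom is a hypothetical name, and it assumes byte strings as input (a plain str read from a file on Python 2, or a file opened in 'rb' mode on Python 3).

```python
import codecs


def strip_utf8_bom(raw):
    """Return the byte string without a leading UTF-8 BOM, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# The first line from the question, as raw bytes.
first_line = b'\xef\xbb\xbftext1 text2 text text'
cleaned = strip_utf8_bom(first_line)
```

Input without a BOM passes through unchanged, so the helper can be applied to every file regardless of how it was saved.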

You may want to decode your strings to unicode using that encoding:

line = line.decode('utf-8')

You could also open the file using codecs.open():

file = codecs.open(filename, encoding='utf-8')

Note that codecs.open() will not strip the BOM for you; the easiest way to do that is to use .lstrip():

import codecs
BOM = codecs.BOM_UTF8.decode('utf8')
with codecs.open(filename, encoding='utf-8') as f:
    for line in f:
        line = line.lstrip(BOM)
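An alternative not covered above is Python's 'utf-8-sig' codec, which decodes UTF-8 and skips a leading BOM if one is present. This is a sketch rather than a drop-in replacement: it assumes io.open (available from Python 2.6 and the same as the built-in open on Python 3), and read_words is an illustrative name.

```python
import io


def read_words(filename):
    """Read a UTF-8 file, skip any leading BOM, and return all words."""
    words = []
    # 'utf-8-sig' decodes like 'utf-8' but drops a BOM at the start of the file.
    with io.open(filename, encoding='utf-8-sig') as f:
        for line in f:
            words.extend(line.split())
    return words
```

With this codec the first word comes back as 'text1' without the extra bytes, so no manual stripping is needed.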

edited Sep 24, 2012 at 9:28

answered Sep 24, 2012 at 8:13

Martijn Pieters


1 Comment

scrayon Over a year ago

Thanks so much! I was originally working with a numpy/ascii module that read files. I am new to Python (2nd day), so I will crack away at it and update as I go.
