Read a file into a buffer from FTP in Python

Question

I am trying to read a file from an FTP server. The file is a .gz file. I would like to know if I can perform actions on this file while the socket is open. I tried to follow what was mentioned in two StackOverflow questions on reading files without writing to disk and reading files from FTP without downloading but was not successful.

I know how to extract data/work on the downloaded file but I'm not sure if I can do it on the fly. Is there a way to connect to the site, get data in a buffer, possibly do some data extraction and exit?

When trying StringIO I got the error:

>>> from ftplib import FTP
>>> from StringIO import StringIO
>>> ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')

Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
File "C:\Python27\lib\ftplib.py", line 117, in __init__
self.connect(host)
File "C:\Python27\lib\ftplib.py", line 132, in connect
self.sock = socket.create_connection((self.host, self.port), self.timeout)
File "C:\Python27\lib\socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 11004] getaddrinfo failed

I just need to know how I can get the data into a variable and loop over it until the whole file has been read from the FTP server.

I appreciate your time and help. Thanks!

Do you need to read the file into a local buffer (like read()) or to manipulate it remotely using FTP commands?
– Stefano Sanfilippo, Commented Sep 12, 2013 at 19:27
I want to manipulate it remotely using FTP. Correct me if I am wrong, but if I read it into local buffer would that mean downloading the file?
– smandape, Commented Sep 12, 2013 at 19:28
I mean, you want to transfer data from the FTP server to your PC and then use that, is this right? (that's what happens in the SO question you linked)
– Stefano Sanfilippo, Commented Sep 12, 2013 at 19:32
I am sorry for the confusion but I don't want to transfer data from server on my PC.
– smandape, Commented Sep 12, 2013 at 19:33
So, do you want to process data on the server and then transfer results on your PC? Or what? Please clarify.
– Stefano Sanfilippo, Commented Sep 12, 2013 at 19:35

Kyle Kelley · Accepted Answer · 2013-09-12 20:52:00Z

30

Make sure to log in to the FTP server first. After that, use retrbinary, which retrieves the file in binary mode and invokes a callback on each chunk of the file. You can use the callback to collect the chunks into a string.

from ftplib import FTP
ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@

# Set up a cheap way to catch the data (could use StringIO too)
data = []
def handle_binary(more_data):
    data.append(more_data)

resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
data = "".join(data)
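
The getaddrinfo failure in the question comes from passing a full URL to FTP(), which expects only a hostname. A minimal sketch of separating the two parts, written for Python 3 (where urlsplit lives in urllib.parse); the helper name split_ftp_url is hypothetical:

```python
from urllib.parse import urlsplit

def split_ftp_url(url):
    """Return (host, remote_path) for an ftp:// URL."""
    parts = urlsplit(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError("expected an ftp:// URL, got %r" % url)
    # Drop the leading slash so the path is relative to the login directory,
    # matching the "RETR pub/pmc/..." form used in the answer's code.
    return parts.hostname, parts.path.lstrip("/")

host, remote_path = split_ftp_url(
    "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz"
)
# host is what FTP(...) expects; remote_path goes into the RETR command.
```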

Bonus points: how about we decompress the string while we're at it?

Easy mode, using the data string above:

import gzip
import StringIO
zippy = gzip.GzipFile(fileobj=StringIO.StringIO(data))
uncompressed_data = zippy.read()

Little bit better, full solution:

from ftplib import FTP
import gzip
import StringIO

ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@

sio = StringIO.StringIO()
def handle_binary(more_data):
    sio.write(more_data)

resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
sio.seek(0) # Go back to the start
zippy = gzip.GzipFile(fileobj=sio)

uncompressed = zippy.read()
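
On Python 3 the StringIO module no longer exists and retrbinary delivers bytes, so an io.BytesIO buffer takes its place. A sketch under those assumptions; the function name fetch_gzip is hypothetical, and ftp is assumed to be any logged-in ftplib.FTP instance:

```python
import gzip
import io

def fetch_gzip(ftp, remote_path):
    """Download a .gz file into memory and return its decompressed bytes."""
    buf = io.BytesIO()
    # retrbinary calls the callback once per chunk; buf.write accepts bytes,
    # so no wrapper function is needed.
    ftp.retrbinary("RETR " + remote_path, callback=buf.write)
    buf.seek(0)  # rewind before reading
    with gzip.GzipFile(fileobj=buf) as gz:
        return gz.read()
```

After ftp = FTP('ftp.ncbi.nlm.nih.gov') and ftp.login(), calling fetch_gzip(ftp, 'pub/pmc/PMC-ids.csv.gz') would return the CSV contents as a bytes object held entirely in memory.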

In reality, it would be much better to decompress on the fly but I don't see a way to do that with the built in libraries (at least not easily).
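
One possible way to decompress on the fly with the standard library is zlib.decompressobj, which accepts gzip-framed data when wbits is set to 16 + zlib.MAX_WBITS. A sketch only; GunzipStream and its sink parameter are hypothetical names, and the class assumes a single-member gzip file:

```python
import zlib

class GunzipStream:
    """Callback object that decompresses gzip chunks as they arrive."""

    def __init__(self, sink):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        self._decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self._sink = sink  # any callable that takes decompressed bytes

    def __call__(self, chunk):
        out = self._decomp.decompress(chunk)
        if out:
            self._sink(out)

    def close(self):
        # Flush whatever the decompressor is still holding.
        tail = self._decomp.flush()
        if tail:
            self._sink(tail)
```

An instance could be passed as callback= to retrbinary, followed by a call to close() once the transfer returns. Each decompressed piece reaches sink as soon as its compressed chunk arrives, so the full archive never has to sit in memory.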

edited Sep 12, 2013 at 20:52

answered Sep 12, 2013 at 20:07

10 Comments

smandape Over a year ago

Thanks for the answer. A quick question: does this download the data to my computer or not? If not, where does it hold the data?

Kyle Kelley Over a year ago

It holds it in memory, within a string named data (or uncompressed if you go the whole way).

smandape Over a year ago

So, the final variable that holds the data would be uncompressed, right?

tags Over a year ago

I'm not sure why, but one has to replace StringIO with BytesIO to get this snippet working with Python 3.4.

Martin Prikryl Over a year ago

You actually do not need the handle_binary function. Just use callback=data.append or callback=sio.write, respectively.

nickie · Answer · 2013-09-12 19:49:52Z

6

There are two easy ways I can think of to download a file using FTP and store it locally:

Using ftplib:

from ftplib import FTP

ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login()
ftp.cwd('pub/pmc')
ftp.retrbinary('RETR PMC-ids.csv.gz', open('PMC-ids.csv.gz', 'wb').write)
ftp.quit()

Using urllib:

from urllib import urlretrieve

urlretrieve("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz", "PMC-ids.csv.gz")

If you don't want to download and store it to a file, but you want to process it gradually as it comes, I suggest using urllib2:

from urllib2 import urlopen

u = urlopen("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/readme.txt")

for line in u:
   print line

which prints your file line by line.
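
The same line-by-line approach can be extended to the compressed file itself, since gzip.GzipFile can wrap any readable binary file object and decompress as it reads. A sketch for Python 3, where urllib2.urlopen became urllib.request.urlopen; iter_gzip_lines is a hypothetical helper, and the UTF-8 encoding is an assumption about the file:

```python
import gzip
import io

def iter_gzip_lines(fileobj, encoding="utf-8"):
    """Yield decoded lines from a gzip-compressed binary stream."""
    with gzip.GzipFile(fileobj=fileobj) as gz:
        # TextIOWrapper decodes bytes and splits on newlines lazily.
        for line in io.TextIOWrapper(gz, encoding=encoding):
            yield line.rstrip("\n")
```

Passing the response object returned by urllib.request.urlopen for the .gz URL as fileobj should decompress the download incrementally, so lines can be processed before the transfer finishes.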

edited Sep 12, 2013 at 19:49

answered Sep 12, 2013 at 19:37

2 Comments

Tom Busby Over a year ago

I could be wrong, but in option 1, wouldn't it overwrite the file with the next chunk if reading the binary takes more than one chunk? Shouldn't the file be opened with 'ab' rather than 'wb'?

nickie Over a year ago

@TomBusby, no, 'wb' is just fine. Parameter passing in Python is eager (call-by-value). The callback passed to the retrbinary method is just the second parameter. It is eagerly computed, therefore open(..., 'wb') is evaluated just once and the write method of the returned file object is the callback that is passed to retrbinary. The file is opened just once for writing, not each time the callback is called, as you may have thought.
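
The point about eager evaluation can be checked without an FTP server. In this sketch, open_sink and feed_chunks are hypothetical stand-ins for open(..., 'wb') and retrbinary:

```python
import io

opened = []  # records every buffer that open_sink creates

def open_sink():
    """Stand-in for open(path, 'wb'): create one writable buffer."""
    buf = io.BytesIO()
    opened.append(buf)
    return buf

def feed_chunks(callback, chunks):
    """Stand-in for retrbinary: call the callback once per chunk."""
    for chunk in chunks:
        callback(chunk)

# open_sink() runs once, while the argument list is being evaluated;
# only its bound write method is handed to feed_chunks.
feed_chunks(open_sink().write, [b"first ", b"second ", b"third"])

print(len(opened))           # 1
print(opened[0].getvalue())  # b'first second third'
```

The inline form never closes the file explicitly. Opening the file in a with block and passing f.write as the callback is a tidier variant with the same single-open behavior.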

Stefano Sanfilippo · Answer · 2013-09-12 19:40:01Z

0

That is not possible. To process data on the server, you need to have some sort of execution permissions, be it for a shell script you would send or SQL access.

FTP is pure file transfer, no execution allowed. You will need to either enable SSH access, load the data into a database and access it with queries, or download the file with urllib and then process it locally, like this:

import urllib
handle = urllib.urlopen('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
# Use data, maybe: buffer = handle.read()

In particular, I think the third one is the only zero-effort solution.

answered Sep 12, 2013 at 19:40

1 Comment

nickie Over a year ago

On second thought and on second careful reading of the comments exchanged between Kyle and Stefano, right below the question, I apologise for having downvoted this answer. However, it seems that what Kyle wanted to ask was not what he actually asked. If you read Stefano's answer as a reply to the original question, it doesn't seem to be true. In any case, if Stefano clarifies what he answered to (and edits the answer, to let me take back my negative vote), I'll be glad to make amends.
