14

I am trying to read a file from an FTP server. The file is a .gz file. I would like to know if I can perform actions on this file while the socket is open. I tried to follow what was mentioned in two StackOverflow questions on reading files without writing to disk and reading files from FTP without downloading but was not successful.

I know how to extract data/work on the downloaded file but I'm not sure if I can do it on the fly. Is there a way to connect to the site, get data in a buffer, possibly do some data extraction and exit?

When I tried the StringIO approach, I got this error:

>>> from ftplib import FTP
>>> from StringIO import StringIO
>>> ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')

Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
File "C:\Python27\lib\ftplib.py", line 117, in __init__
self.connect(host)
File "C:\Python27\lib\ftplib.py", line 132, in connect
self.sock = socket.create_connection((self.host, self.port), self.timeout)
File "C:\Python27\lib\socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 11004] getaddrinfo failed

I just need to know how I can get the data into a variable and loop over it until the whole file has been read from the FTP server.

I appreciate your time and help. Thanks!

8
  • Do you need to read the file into a local buffer (like read()) or to manipulate it remotely using FTP commands? Commented Sep 12, 2013 at 19:27
  • I want to manipulate it remotely using FTP. Correct me if I am wrong, but if I read it into local buffer would that mean downloading the file? Commented Sep 12, 2013 at 19:28
  • I mean, you want to transfer data from the FTP server to your PC and then use that, is this right? (that's what happens in the SO question you linked) Commented Sep 12, 2013 at 19:32
  • I am sorry for the confusion but I don't want to transfer data from server on my PC. Commented Sep 12, 2013 at 19:33
  • So, do you want to process data on the server and then transfer results on your PC? Or what? Please clarify. Commented Sep 12, 2013 at 19:35

3 Answers

30

Make sure to log in to the FTP server first. After that, use retrbinary, which retrieves the file in binary mode and calls a callback on each chunk as it arrives. You can use this to load the file into a string.

from ftplib import FTP
ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@

# Setup a cheap way to catch the data (could use StringIO too)
data = []
def handle_binary(more_data):
    data.append(more_data)

resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
data = "".join(data)
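
For context, the gaierror in the question comes from passing a full ftp:// URL to FTP(), which expects only a host name; the path belongs in the RETR command instead. A minimal Python 3 sketch of splitting such a URL, where split_ftp_url is a hypothetical helper name and not part of ftplib:

```python
from urllib.parse import urlsplit

def split_ftp_url(url):
    """Return (host, path) for an ftp:// URL. Hypothetical helper, not part of ftplib."""
    parts = urlsplit(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError("expected an ftp:// URL, got %r" % url)
    # Drop the leading slash so the path is relative to the login directory.
    return parts.hostname, parts.path.lstrip("/")

host, path = split_ftp_url("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz")
# host is what FTP(host) expects; path goes into the command: "RETR " + path
```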

Bonus points: how about we decompress the string while we're at it?

Easy mode, using data string above

import gzip
import StringIO
zippy = gzip.GzipFile(fileobj=StringIO.StringIO(data))
uncompressed_data = zippy.read()

Little bit better, full solution:

from ftplib import FTP
import gzip
import StringIO

ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@

sio = StringIO.StringIO()
def handle_binary(more_data):
    sio.write(more_data)

resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
sio.seek(0) # Go back to the start
zippy = gzip.GzipFile(fileobj=sio)

uncompressed = zippy.read()
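
On Python 3 the chunks arrive as bytes, so io.BytesIO takes the place of StringIO, and the buffer's write method can serve directly as the callback. A minimal sketch under those assumptions; fetch_gzip_bytes is a hypothetical helper name, and ftp is assumed to be an already logged-in ftplib.FTP object:

```python
import gzip
import io

def fetch_gzip_bytes(ftp, remote_path):
    """Download remote_path over an open FTP connection and return the gunzipped bytes."""
    buf = io.BytesIO()
    # buf.write is called once per received chunk, so no wrapper function is needed.
    ftp.retrbinary("RETR " + remote_path, buf.write)
    buf.seek(0)  # rewind before reading
    with gzip.GzipFile(fileobj=buf) as gz:
        return gz.read()

# Usage (needs network access):
#   from ftplib import FTP
#   ftp = FTP("ftp.ncbi.nlm.nih.gov")
#   ftp.login()
#   data = fetch_gzip_bytes(ftp, "pub/pmc/PMC-ids.csv.gz")
```

The whole compressed file is still held in memory before decompression starts, just as in the Python 2 version.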

In practice, it would be much better to decompress on the fly, but I don't see a way to do that with the built-in libraries (at least not easily).
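
One possibility worth trying is zlib.decompressobj, which accepts compressed data piece by piece. The following is a sketch in Python 3, not the answerer's method; GzipStreamDecoder is a hypothetical class name, and it assumes the file is a single gzip member:

```python
import gzip
import zlib

class GzipStreamDecoder:
    """Feeds compressed chunks to zlib and passes decompressed bytes to a consumer."""

    def __init__(self, consumer):
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        self._decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self._consumer = consumer

    def feed(self, chunk):
        out = self._decomp.decompress(chunk)
        if out:
            self._consumer(out)

    def close(self):
        # Hand over anything zlib is still buffering.
        out = self._decomp.flush()
        if out:
            self._consumer(out)

# Offline demonstration: feed a gzip payload in small pieces, as retrbinary would.
pieces = []
decoder = GzipStreamDecoder(pieces.append)
payload = gzip.compress(b"id,title\n1,example\n")
for i in range(0, len(payload), 7):
    decoder.feed(payload[i:i + 7])
decoder.close()
```

With a real connection, decoder.feed would be passed as the retrbinary callback, followed by decoder.close() once the transfer finishes. Only the decompressed pieces you choose to keep stay in memory.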


10 Comments

Thanks for the answer. I got a quick question, does this download the data on my computer or not? If not where it holds the data?
It holds it in memory, within a string named data (or uncompressed if you go the whole way).
So, the final variable that holds the data would be uncompressed, right?
I'm not sure why, but one has to replace StringIO with BytesIO to get this snippet working with Python 3.4.
You actually do not need the handle_binary function. Just use callback=data.append or callback=sio.write, respectively.
6

There are two easy ways I can think of to download a file using FTP and store it locally:

  1. Using ftplib:

    from ftplib import FTP
    
    ftp = FTP('ftp.ncbi.nlm.nih.gov')
    ftp.login()
    ftp.cwd('pub/pmc')
    ftp.retrbinary('RETR PMC-ids.csv.gz', open('PMC-ids.csv.gz', 'wb').write)
    ftp.quit()
    
  2. Using urllib

    from urllib import urlretrieve
    
    urlretrieve("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz", "PMC-ids.csv.gz")
    

If you don't want to download and store it to a file, but you want to process it gradually as it comes, I suggest using urllib2:

from urllib2 import urlopen

u = urlopen("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/readme.txt")

for line in u:
   print line

which prints your file line by line.
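
The example above reads a plain-text file. For a gzip-compressed file such as PMC-ids.csv.gz, the same line-by-line pattern might look like this on Python 3. This is a sketch: iter_gzip_lines is a hypothetical helper name, the UTF-8 encoding is an assumption, and the stream is assumed to behave like a readable binary file:

```python
import gzip
import io

def iter_gzip_lines(stream, encoding="utf-8"):
    """Yield decoded text lines from a gzip-compressed binary stream."""
    # gzip.open accepts an existing file object as well as a file name.
    with gzip.open(stream, "rt", encoding=encoding) as text:
        for line in text:
            yield line.rstrip("\n")

# Offline demonstration with an in-memory stream standing in for the FTP response.
sample = io.BytesIO(gzip.compress(b"a,b\n1,2\n"))
lines = list(iter_gzip_lines(sample))

# Usage (needs network access; assumes the response can be read like a binary file):
#   from urllib.request import urlopen
#   with urlopen("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz") as resp:
#       for line in iter_gzip_lines(resp):
#           print(line)
```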

2 Comments

I could be wrong, but in option 1, wouldn't it overwrite the file with the next chunk if reading the binary takes more than one chunk? Shouldn't the file be opened with 'ab' rather than 'wb'?
@TomBusby, no, 'wb' is just fine. Python evaluates arguments eagerly, before the call is made. The callback passed to retrbinary is just its second argument, so open(..., 'wb') is evaluated exactly once, and the write method of the resulting file object is what retrbinary receives. The file is opened once for writing, not each time the callback is called, as you may have thought.
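
The point about single evaluation can be checked without a server. In this sketch, open_sink and fake_retrbinary are hypothetical stand-ins for open(..., 'wb').write and FTP.retrbinary:

```python
open_calls = 0
received = []

def open_sink():
    """Stand-in for open(..., 'wb').write: counts how often it runs, returns a callback."""
    global open_calls
    open_calls += 1
    return received.append

def fake_retrbinary(callback):
    """Stand-in for FTP.retrbinary: invokes the callback once per chunk."""
    for chunk in (b"first", b"second", b"third"):
        callback(chunk)

# The argument expression open_sink() runs once, before fake_retrbinary is entered.
fake_retrbinary(open_sink())
print(open_calls)     # 1
print(len(received))  # 3
```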
0

That is not possible. To process data on the server, you need to have some sort of execution permissions, be it for a shell script you would send or SQL access.

FTP is pure file transfer; no execution is allowed. You will need to either enable SSH access, load the data into a database and access it with queries, or download the file with urllib and then process it locally, like this:

import urllib
handle = urllib.urlopen('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
# Use data, maybe: buffer = handle.read()

In particular, I think the third option (downloading with urllib) is the only zero-effort solution.

1 Comment

On second thought and on second careful reading of the comments exchanged between Kyle and Stefano, right below the question, I apologise for having downvoted this answer. However, it seems that what Kyle wanted to ask was not what he actually asked. If you read Stefano's answer as a reply to the original question, it doesn't seem to be true. In any case, if Stefano clarifies what he answered to (and edits the answer, to let me take back my negative vote), I'll be glad to make amends.
