5

How do I get a list of files from an HDFS (Hadoop) directory using a Python script?

I have tried the following line:

dir = sc.textFile("hdfs://127.0.0.1:1900/directory").collect()

The directory contains a list of files "file1,file2,file3....fileN". Using that line I only get the contents of all the files, but what I need is the list of file names.

Can anyone please help me figure this out?

Thanks in advance.

6 Answers

9

Use subprocess:

import subprocess

p = subprocess.Popen("hdfs dfs -ls <HDFS Location> | awk '{print $8}'",
                     shell=True,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)

# awk '{print $8}' keeps only the path column of the `hdfs dfs -ls` output
for line in p.stdout.readlines():
    print(line.decode().strip())

EDIT: An answer without Python. The first command can also be used to recursively print all the sub-directories. The final redirect can be omitted or changed to suit your needs.

hdfs dfs -ls -R <HDFS LOCATION> | awk '{print $8}' > output.txt
hdfs dfs -ls <HDFS LOCATION> | awk '{print $8}' > output.txt
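
If you want the same listing back in Python without the awk step, here is a minimal Python 3 sketch (the directory below is a placeholder):

import subprocess

# '/some/hdfs/dir' is a placeholder; substitute your own HDFS location
out = subprocess.check_output(['hdfs', 'dfs', '-ls', '/some/hdfs/dir'])
# Keep only the 8th column (the path), mirroring awk '{print $8}';
# the leading "Found N items" line has fewer columns and is skipped.
paths = [line.split()[7] for line in out.decode().splitlines()
         if len(line.split()) >= 8]
print(paths)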

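Since the question already runs under PySpark, here is a minimal sketch of the same listing through the JVM Hadoop FileSystem API instead of a shell call, assuming sc is a live SparkContext (the URI is taken from the question):

hadoop = sc._jvm.org.apache.hadoop
conf = sc._jsc.hadoopConfiguration()
path = hadoop.fs.Path("hdfs://127.0.0.1:1900/directory")  # URI from the question
fs = path.getFileSystem(conf)
for status in fs.listStatus(path):
    # getName() gives the file name only; use toString() for the full path
    print(status.getPath().getName())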



2
import subprocess

path = "/data"
args = "hdfs dfs -ls " + path + " | awk '{print $8}'"
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)

s_output, s_err = proc.communicate()
# stores the files and sub-directories of 'path' as a list
all_dart_dirs = s_output.decode().split()

1 Comment

This looks like an improvement of this answer. You might want to edit your answer to explain the improvements.

1

Why not have the HDFS client do the hard work by using the -C flag, instead of relying on awk or Python to print the specific columns of interest?

i.e. Popen(['hdfs', 'dfs', '-ls', '-C', dirname])

Afterwards, split the output on newlines and you will have your list of paths.

Here's an example along with logging and error handling (including for when the directory/file doesn't exist):

from subprocess import Popen, PIPE
import logging

logger = logging.getLogger(__name__)

FAILED_TO_LIST_DIRECTORY_MSG = 'No such file or directory'

class HdfsException(Exception):
    pass

def hdfs_ls(dirname):
    """Returns a list of HDFS directory entries."""
    logger.info('Listing HDFS directory %s', dirname)
    proc = Popen(['hdfs', 'dfs', '-ls', '-C', dirname], stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate()
    out, err = out.decode(), err.decode()  # communicate() returns bytes on Python 3
    if out:
        logger.debug('stdout:\n%s', out)
    if proc.returncode != 0:
        errmsg = ('Failed to list HDFS directory "%s", return code %d'
                  % (dirname, proc.returncode))
        logger.error(errmsg)
        logger.error(err)
        if FAILED_TO_LIST_DIRECTORY_MSG not in err:
            raise HdfsException(errmsg)
        return []
    if err:
        logger.debug('stderr:\n%s', err)
    return out.splitlines()
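
A quick usage sketch (the directory name and logging setup here are illustrative):

import logging
logging.basicConfig(level=logging.INFO)

# '/data' is a placeholder HDFS directory
for entry in hdfs_ls('/data'):
    print(entry)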


0

For Python 3:

from subprocess import Popen, PIPE

hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()

# Skip the leading "Found N items" line; the last column of each row is the full path.
lines = std_out.decode().splitlines()[1:]
list_of_file_names = [line.split(' ')[-1].split('/')[-1] for line in lines]
list_of_file_names_with_full_address = [line.split(' ')[-1] for line in lines]


0

Use the following (this relies on the third-party sh package to invoke the hdfs CLI):

import sh  # third-party "sh" package

hdfsdir = "hdfs://VPS-DATA1:9000/dir/"
# last column of each `-ls` row is the path; [1:] drops the "Found N items" line
filepaths = [line.rsplit(None, 1)[-1]
             for line in sh.hdfs('dfs', '-ls', hdfsdir).split('\n')
             if len(line.rsplit(None, 1))][1:]

for path in filepaths:
    print(path)


0

To get the list of HDFS files in a directory (this uses the third-party sh package to invoke the hdfs CLI; the hdfs.open call is assumed to come from pydoop):

import json
import sh                    # third-party "sh" package
import pydoop.hdfs as hdfs   # assumption: pydoop provides the hdfs.open used below

hdfsdir = '/path/to/hdfs/directory'

# list of all files in the HDFS directory: the last column of each `-ls` row,
# with [1:] dropping the leading "Found N items" line
filelist = [line.rsplit(None, 1)[-1]
            for line in sh.hdfs('dfs', '-ls', hdfsdir).split('\n')
            if len(line.rsplit(None, 1))][1:]

for path in filelist:
    # read each data file from HDFS
    with hdfs.open(path, "r") as read_file:
        # do whatever you want with the data
        data = json.load(read_file)

