
I'm trying to read CSV files from a directory, matching every file whose name contains the string "logs_455DD_33". It should match anything like:

machine_logs_455DD_33.csv

logs_455DD_33_2018.csv

machine_logs_455DD_33_2018.csv

I've tried the following pattern, but it doesn't match files with the above names:

file = "hdfs://data/logs/{*}logs_455DD_33{*}.csv"
df = spark.read.csv(file)
Try this: file = "hdfs://data/logs/*logs_455DD_33*.csv" (Commented Jun 15, 2018 at 11:12)
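Hadoop path globbing uses bare * wildcards rather than {*}; a minimal sketch of that suggested fix, assuming the hdfs://data/logs/ path from the question:

# Bare * wildcards match any prefix/suffix around the fixed string
file = "hdfs://data/logs/*logs_455DD_33*.csv"
df = spark.read.csv(file)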

2 Answers


I had to do something similar in my PySpark program, where I needed to pick a file in HDFS by cycle_date, and I did it like this:

df = spark.read.parquet(pathtoFile + "*" + cycle_date + "*")
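The same wildcard concatenation should carry over to the asker's CSV files; a quick sketch, assuming the hdfs://data/logs/ path from the question:

# Hypothetical adaptation: splice the pattern between * wildcards
pattern = "logs_455DD_33"
df = spark.read.csv("hdfs://data/logs/*" + pattern + "*.csv")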



You could use a subprocess to list the files in HDFS and grep for the pattern:

import subprocess

# Define the HDFS directory and the pattern to match
dir_in = "data/logs"
your_pattern = "logs_455DD_33"

# List the directory, keep the path column, and grep for the pattern
args = "hdfs dfs -ls " + dir_in + " | awk '{print $8}' | grep " + your_pattern
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)

# communicate() returns bytes in Python 3: decode before splitting,
# and drop the empty entry left by the trailing newline
s_output, s_err = proc.communicate()
l_file = [f for f in s_output.decode("utf-8").split("\n") if f]

# Read each matched file (note: df is rebound on every iteration)
for file in l_file:
    df = spark.read.csv(file)

1 Comment

Doesn't this specifically not use Spark, which means it will be slower across a cluster with a distributed file system (HDFS/EMRFS)?
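One way to keep the read itself distributed while using the subprocess only for listing: spark.read.csv also accepts a list of paths, so all matched files can be loaded in a single Spark job instead of one per file. A sketch reusing l_file from the answer above:

# Pass the whole list of matched paths to a single read call
if l_file:
    df = spark.read.csv(l_file)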
