How can I automatically extract the part of a string that has a .csv extension? The following example shows the complex string from which I am trying to extract 2010_USACE_VA_minmax.csv. A simple slice won't work in my case; I need some sort of pattern matching.

sample = "1001        15707 May 08 23:01 2010_USACE_VA_metadata.xml\r\n-rw-rw-r--    1 311      1001         1784 May 08 23:01 2010_USACE_VA_minmax.csv\r\ndrwxrwxr-x    2 311      2013"

Intended output

2010_USACE_VA_minmax.csv
  • What does your code look like? This looks like a job for regex. Commented Sep 26, 2015 at 17:05
  • Go here: pythex.org. Play around with some regex with the string you want to test. Should give you the regex formula you are looking for. Commented Sep 26, 2015 at 17:08
  • Why use regex and not csv? Commented Sep 26, 2015 at 17:15
  • What is the format of the file names? In particular, is it possible for them to contain spaces? Commented Sep 26, 2015 at 17:31
  • @idjaw Sorry, I misread the question. Your approach is the right one. Commented Sep 26, 2015 at 17:46

4 Answers

3

If you know the fields are whitespace-separated, the names themselves contain no whitespace, and you're trying to find a token that ends with .csv, you could also do

>>> tokens = sample.split()
>>> matches = [ i for i in tokens if i.endswith('.csv') ]
>>> matches
['2010_USACE_VA_minmax.csv']

The same behaviour is achievable with the regular expression \S+\.csv(?!\S), which is not quite so readable:

>>> import re
>>> re.findall(r'\S+\.csv(?!\S)', sample)
['2010_USACE_VA_minmax.csv']

Here \S+ means one or more consecutive non-whitespace characters, \. is the literal . character, and (?!\S) means that the .csv cannot be followed by a non-whitespace character (a negative zero-width lookahead assertion).
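To see what the lookahead contributes, here is a small sketch comparing the pattern with and without it. The listing string and the variable names are made up for illustration; they are not from the question.

```python
import re

# Hypothetical listing: the first token merely contains ".csv",
# the second one actually ends with it.
listing = "report.csv.bak  totals.csv"

# With the lookahead, ".csv" must be followed by whitespace or end of string.
with_lookahead = re.findall(r'\S+\.csv(?!\S)', listing)

# Without it, the prefix "report.csv" of "report.csv.bak" also matches.
without_lookahead = re.findall(r'\S+\.csv', listing)

print(with_lookahead)     # ['totals.csv']
print(without_lookahead)  # ['report.csv', 'totals.csv']
```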


However, it looks like you're parsing the output of the *nix ls command. If the files are on the local filesystem, yet another way would be to find matching files with the glob module:

>>> from glob import glob
>>> glob('*.csv')
['2010_USACE_VA_minmax.csv']
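A self-contained sketch of the glob approach follows. It assumes the files live on the machine where Python runs; the temporary directory and empty files exist only to make the example runnable, with the names borrowed from the question. If the listing comes from somewhere else (a remote server, say), the text still has to be parsed.

```python
import glob
import os
import tempfile

# Create a throwaway directory holding two empty files (illustration only).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2010_USACE_VA_metadata.xml", "2010_USACE_VA_minmax.csv"):
        open(os.path.join(tmp, name), "w").close()

    # glob returns paths that include the directory; basename strips it off.
    pattern = os.path.join(tmp, "*.csv")
    csv_names = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(csv_names)  # ['2010_USACE_VA_minmax.csv']
```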

4 Comments

Why would you split every part of the line when the file is always at the end?
How would I know that? :P If it is the output of ls, then the proper way is not to parse it at all, but to use glob.
Well, if I were using ls just to get the filename, it would be ls *.csv, and then no parsing would be needed at all.
I just added something similar. I'm not sure the OP only wants a single match, so I won't include 1. The OP must actually care about the other data, or else they don't know you can use ls by itself!
2

This regex extracts the CSV file name. There might be a more robust regex, as I'm no expert, but this works on the sample.

FYI: I used Pythex to test it.

The parentheses are important: they form the capture group that extracts what you are looking for.

(\s\w+\.csv)

If you want to handle spaces in the filename, I believe this should work:

(\s[\w,\s-]+\.csv)
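One detail worth checking: with \s inside the parentheses, the captured text starts with the whitespace character. A minimal sketch, assuming the sample string from the question, that compares the pattern as written with a variant that keeps \s outside the group (the variable names are illustrative):

```python
import re

sample = (
    "1001        15707 May 08 23:01 2010_USACE_VA_metadata.xml\r\n"
    "-rw-rw-r--    1 311      1001         1784 May 08 23:01 "
    "2010_USACE_VA_minmax.csv\r\ndrwxrwxr-x    2 311      2013"
)

# Pattern as written above: the group includes the leading whitespace.
with_space = re.search(r'(\s\w+\.csv)', sample).group(1)

# Variant: \s still anchors the match but sits outside the capture group.
match = re.search(r'\s(\w+\.csv)', sample)
filename = match.group(1) if match else None

print(repr(with_space))  # ' 2010_USACE_VA_minmax.csv'
print(repr(filename))    # '2010_USACE_VA_minmax.csv'
```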

Here is information on regex in Python: https://docs.python.org/3/library/re.html


1

If there were no spaces in the path:

print(sample[:sample.find(".csv")+4].rsplit(None, 1)[1])
2010_USACE_VA_minmax.csv

The output also looks like it comes from a Unix command, so it might be an idea to use a Linux tool to parse it. If it is a Unix command, the format is most probably consistent, so you can split the lines to get the filenames:

sample = "1001        15707 May 08 23:01 2010_USACE_VA_metadata.xml\r\n-rw-rw-r--    1 311      1001         1784 May 08 23:01 2010_USACE_VA_minmax.csv\r\ndrwxrwxr-x    2 311      2013"


for line in sample.splitlines():
    f = line.rsplit(None, 1)[1]
    print(f)
2010_USACE_VA_metadata.xml
2010_USACE_VA_minmax.csv
2013

I presume 2013 comes from you having truncated some of the output.
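If only the .csv entries are wanted, the same loop can filter on the last field. This is a sketch under the same assumption as above (no whitespace inside the file names); the list name csv_files is illustrative.

```python
sample = (
    "1001        15707 May 08 23:01 2010_USACE_VA_metadata.xml\r\n"
    "-rw-rw-r--    1 311      1001         1784 May 08 23:01 "
    "2010_USACE_VA_minmax.csv\r\ndrwxrwxr-x    2 311      2013"
)

csv_files = []
for line in sample.splitlines():
    # Split off the last whitespace-separated field; blank lines give [].
    fields = line.rsplit(None, 1)
    if fields and fields[-1].endswith(".csv"):
        csv_files.append(fields[-1])

print(csv_files)  # ['2010_USACE_VA_minmax.csv']
```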

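If you are using subprocess to run the command and don't need any of the other data, ls can take a wildcard. The wildcard is expanded by the shell, not by ls, so the command has to be run through the shell:

from subprocess import check_output
# shell=True lets the shell expand *.csv; with an argument list and no shell,
# ls would receive the literal string "*.csv"
f = check_output("ls *.csv", shell=True)

Or, to get the permissions etc. as in your own command:

data = check_output("ls -l *.csv", shell=True)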

That will give you just the .csv files and their permissions, so you only need to iterate over the output again with splitlines; the last field on every line will be a .csv file name.
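In Python 3, check_output returns bytes, so the output needs decoding before the string methods behave as shown. A minimal sketch, using a hard-coded bytes value shaped like what `ls -l *.csv` might return instead of actually running the command (the value and the names are assumptions for illustration):

```python
# Hypothetical stand-in for the bytes returned by check_output.
data = (
    b"-rw-rw-r--    1 311      1001         1784 May 08 23:01 "
    b"2010_USACE_VA_minmax.csv\n"
)

# Decode, split into lines, and keep the last field of each non-blank line.
names = [
    line.rsplit(None, 1)[-1]
    for line in data.decode().splitlines()
    if line.strip()
]

print(names)  # ['2010_USACE_VA_minmax.csv']
```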


0
import re
# Capture the name in a group so the whitespace matched by \s is not printed
mobj = re.search(r'\s(\d{4}_\S*\.csv)', sample)
print(mobj.group(1))

output

2010_USACE_VA_minmax.csv
