How to find full string from substring with Python?

Question

How can I automatically extract the part of a string that has a .csv extension? The following example shows the complex string from which I am trying to extract 2010_USACE_VA_minmax.csv. A simple slice won't work in my case; instead I need some sort of pattern matching.

sample = "1001        15707 May 08 23:01 2010_USACE_VA_metadata.xml\r\n-rw-rw-r--    1 311      1001         1784 May 08 23:01 2010_USACE_VA_minmax.csv\r\ndrwxrwxr-x    2 311      2013"

Intended output

2010_USACE_VA_minmax.csv

What does your code look like? This looks like a job for regex.
– idjaw, Commented Sep 26, 2015 at 17:05
Go here: pythex.org. Play around with some regex on the string you want to test. That should give you the regex you are looking for.
– idjaw, Commented Sep 26, 2015 at 17:08
What is the format of the file names? In particular, is it possible for them to contain spaces?
– ekhumoro, Commented Sep 26, 2015 at 17:31
@idjaw Sorry, I misread the question. Your approach is the right one.
– Cristian Lupascu, Commented Sep 26, 2015 at 17:46

Antti Haapala · Accepted Answer · 2015-09-26 17:59:46Z

3

If you know the fields are whitespace-separated, the names themselves do not contain any whitespace, and you're trying to find a token that ends with .csv, you could also do

>>> tokens = sample.split()
>>> matches = [ i for i in tokens if i.endswith('.csv') ]
>>> matches
['2010_USACE_VA_minmax.csv']

The same behaviour is achievable with the regular expression \S+\.csv(?!\S), which is not quite so readable:

>>> import re
>>> re.findall(r'\S+\.csv(?!\S)', sample)
['2010_USACE_VA_minmax.csv']

Here \S+ means one or more consecutive non-whitespace characters, \. is the literal . character, and (?!\S) means that .csv cannot be followed by a non-whitespace character (a negative zero-width lookahead assertion).
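To see what the lookahead buys you, here is a small sketch that runs the pattern with and without (?!\S). The listing string and the report.csv.bak name are made up for illustration; they are not from the question.

```python
import re

# Hypothetical listing: one real .csv file and one backup whose name merely contains ".csv".
listing = "report.csv.bak  2010_USACE_VA_minmax.csv  notes.txt"

# Without the lookahead, the pattern also matches the "report.csv" prefix of "report.csv.bak".
loose = re.findall(r'\S+\.csv', listing)

# With (?!\S), a match must be followed by whitespace or by the end of the string.
strict = re.findall(r'\S+\.csv(?!\S)', listing)

print(loose)   # ['report.csv', '2010_USACE_VA_minmax.csv']
print(strict)  # ['2010_USACE_VA_minmax.csv']
```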

However, it looks like you're parsing the output of the *nix ls command. If so, and the files are in the current directory on the local machine, yet another way would be to find matching files with the glob module:

>>> from glob import glob
>>> glob('*.csv')
['2010_USACE_VA_minmax.csv']
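glob only sees files that actually exist on the local filesystem, so the result above depends on what is in the current directory. As a self-contained sketch, the snippet below first creates a throwaway directory holding two empty files named after the ones in the listing (the directory and the variable names are assumptions for illustration), then globs inside it.

```python
import glob
import os
import tempfile

# Create a temporary directory with two empty files named like those in the listing.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2010_USACE_VA_metadata.xml", "2010_USACE_VA_minmax.csv"):
        open(os.path.join(tmp, name), "w").close()

    # glob returns paths including the directory; basename keeps only the file name.
    csv_files = [os.path.basename(p) for p in glob.glob(os.path.join(tmp, "*.csv"))]

print(csv_files)  # ['2010_USACE_VA_minmax.csv']
```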

edited Sep 26, 2015 at 17:59

answered Sep 26, 2015 at 17:54

Antti Haapala

135k23 gold badges297 silver badges349 bronze badges


4 Comments

Padraic Cunningham Over a year ago

Why would you split every part of the line when the file is always at the end?

Antti Haapala Over a year ago

How would I know that? :P If it is the output of ls, then the proper way is not to parse it at all, but to use glob.

Padraic Cunningham Over a year ago

Well, if I were using ls just to get the filename, it would be ls *.csv, and then no parsing would be needed at all.

Padraic Cunningham Over a year ago

Just added something similar. I'm not sure the OP only wants a single match, so I won't include 1. The OP must actually care about the other data, or else they don't know that you can use ls by itself!

idjaw · Accepted Answer · 2015-09-26 17:44:10Z

2

This regex extracts the CSV file name. There might be a more robust regex, as I'm no expert, but this works:

FYI: I used Pythex to test this.

The parentheses are important, as they form the capture group that extracts what you are looking for.

(\s\w+\.csv)

If you want to handle spaces in the filename, I believe this should work:

(\s[\w,\s-]+\.csv)

Here is information on regex in Python: https://docs.python.org/3/library/re.html
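As a rough sketch of how the first pattern could be applied with re.search: because \s sits inside the parentheses, the captured group starts with the matched whitespace character. The second pattern, with \s moved outside the group, is an assumed variant and not part of the answer.

```python
import re

sample = ("1001        15707 May 08 23:01 2010_USACE_VA_metadata.xml\r\n"
          "-rw-rw-r--    1 311      1001         1784 May 08 23:01 2010_USACE_VA_minmax.csv\r\n"
          "drwxrwxr-x    2 311      2013")

# Pattern from the answer: the group includes the whitespace matched by \s.
with_space = re.search(r'(\s\w+\.csv)', sample).group(1)

# Assumed variant: \s is outside the parentheses, so only the name is captured.
name = re.search(r'\s(\w+\.csv)', sample).group(1)

print(repr(with_space))  # ' 2010_USACE_VA_minmax.csv'
print(repr(name))        # '2010_USACE_VA_minmax.csv'
```

Calling .strip() on the first result would give the same name without changing the pattern.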

edited Sep 26, 2015 at 17:44

answered Sep 26, 2015 at 17:12

idjaw

26.8k10 gold badges68 silver badges84 bronze badges


Padraic Cunningham · Accepted Answer · 2015-09-26 18:13:16Z

If there were no spaces in the path:

print(sample[:sample.find(".csv")+4].rsplit(None, 1)[1])
2010_USACE_VA_minmax.csv

The output also looks like it comes from a Unix command, so it might be an idea to use a Linux tool to parse it. If it is a Unix command, the format is most probably consistent, so you can split the lines to get the filenames:

sample = "1001        15707 May 08 23:01 2010_USACE_VA_metadata.xml\r\n-rw-rw-r--    1 311      1001         1784 May 08 23:01 2010_USACE_VA_minmax.csv\r\ndrwxrwxr-x    2 311      2013"


for line in sample.splitlines():
    f = line.rsplit(None, 1)[1]
    print(f)
2010_USACE_VA_metadata.xml
2010_USACE_VA_minmax.csv
2013

I presume 2013 comes from you having truncated some of the output.

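If you are using subprocess to run the command and you don't need any of the other data, ls can take a wildcard. The wildcard is expanded by the shell rather than by ls, so the command has to run through a shell:

from subprocess import check_output
# shell=True lets the shell expand *.csv before ls runs
f = check_output("ls *.csv", shell=True)

Or, to get the permissions etc. as per your own command:

data = check_output("ls -l *.csv", shell=True)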

That will give you just the .csv files and their permissions, so you just need to iterate over the output again with splitlines; the last field on every line will be a csv file name.
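That last step could look roughly like the sketch below. To keep it runnable anywhere, data is a hard-coded stand-in for the bytes check_output would return (an assumption, not real command output), and filenames is a hypothetical helper name. Like the loop above, it assumes file names contain no whitespace.

```python
# Stand-in for the bytes returned by check_output; assumed for illustration only.
data = b"-rw-rw-r--    1 311      1001         1784 May 08 23:01 2010_USACE_VA_minmax.csv\n"

def filenames(ls_output):
    """Return the last whitespace-separated field of each non-empty line."""
    names = []
    for line in ls_output.decode().splitlines():
        fields = line.rsplit(None, 1)  # split once, from the right, on whitespace
        if fields:                     # skip blank lines, where rsplit returns []
            names.append(fields[-1])
    return names

print(filenames(data))  # ['2010_USACE_VA_minmax.csv']
```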

Alfred Huang · Accepted Answer · 2015-09-27 09:23:13Z

0

import re
# The group captures the name without the leading whitespace that \s matches
mobj = re.search(r'\s(\d{4}_[^ ]*csv)', sample)
print(mobj.group(1))

output

2010_USACE_VA_minmax.csv

edited Sep 27, 2015 at 9:23

Alfred Huang

18.4k33 gold badges128 silver badges196 bronze badges

answered Sep 26, 2015 at 17:14

LetzerWille

5,6965 gold badges26 silver badges28 bronze badges
