7

I am trying to find the extension of a file, given its name as a string. I know I can use os.path.splitext, but it does not do what I want when the extension is .tar.gz or .tar.bz2: it returns .gz and .bz2 instead of .tar.gz and .tar.bz2 respectively.
So I decided to find the extension myself using pattern matching.

import re

print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
# prints gz, but I want tar.gz
print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.bz2').group('ext'))
# prints bz2, but I want tar.bz2

I am using (?P<ext>...) in my pattern matching as I also want to get the extension.

Please help.

2
  • 1
    What if name="hi.c.java"? In that case you want just .java, right? Commented Jun 29, 2011 at 18:22
  • For the time being, yes. But I should be able to add more to the regex pattern later if I want to. Commented Jun 29, 2011 at 18:23

6 Answers

21
import os

root, ext = os.path.splitext('a.tar.gz')
# If the last extension is a compression suffix, prepend the extension before it.
if ext in ['.gz', '.bz2']:
    ext = os.path.splitext(root)[1] + ext

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

5 Comments

Ok, that will work in this case, but I want to solve it using Python regular expressions.
@Guanidene: If it's homework, mark the question homework. If it's not homework, don't use a regular expression when the function's already been written, debugged and works.
@S.Lott - It is not homework; I want to tackle the problem using regex, that's all. If just solving it were the aim, I could have done it long ago the way phihag suggests.
@phihag - One more reason why I wish to use regex: it is compact. If I go your way, I need 3 lines unnecessarily (which also makes my code clumsy), while with a regex I can get everything in a single line!
@Guanidene More compact does not equal more readable and maintainable. Also, why is a complicated regular expression less clumsy than three lines even non-programmers could understand? Anyway, to each his own.
5
>>> import re
>>> print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
gz
>>> print(re.compile(r'^.*?[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
tar.gz

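A ? placed after * makes the quantifier non-greedy, so it matches as little as possible. The greedy .* consumes "a.tar" and leaves only "gz" for the group; .*? instead consumes the minimum that still lets the rest of the pattern match, which allows tar.gz to be matched.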
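To make the difference concrete, here is a small sketch that runs both patterns from this answer over a few names. The helper name ext_of and the sample file names are only illustrative, not part of the original answer.

```python
import re

# Greedy prefix: .* grabs as much as it can before the final dot.
GREEDY = re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$')
# Lazy prefix: .*? grabs as little as it can, so the alternation is
# tried at the earliest dot from which the rest of the name still matches.
LAZY = re.compile(r'^.*?[.](?P<ext>tar\.gz|tar\.bz2|\w+)$')


def ext_of(pattern, name):
    """Return the 'ext' group for name, or None if the pattern does not match."""
    match = pattern.match(name)
    return match.group('ext') if match else None


for name in ('a.tar.gz', 'a.tar.bz2', 'hi.c.java', 'noext'):
    print(name, ext_of(GREEDY, name), ext_of(LAZY, name))
```

With the lazy version, a name such as hi.c.java still yields java, because no alternative can reach the end of the string from the first dot and the engine moves on to the next one.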

1 Comment

I really thank you for this! This thing took a lot of my time!
3

I have an idea that is much easier than breaking your head over a regex, though it might sound stupid:
name = "filename.tar.gz"
extensions = ('.tar.gz', '.py')
[x for x in extensions if name.endswith(x)]  # -> ['.tar.gz']

3 Comments

Suppose there is also a .gz extension that I may want to match. Then your tuple will be extensions=('.gz','.tar.gz','.py') (with name='filename.tar.gz'), and [x for x in extensions if name.endswith(x)] will also match .gz when I want only .tar.gz. Dude, what I want is a universal solution, not a data-specific one!
There is an option for that: put tar.gz first in the list and return on the first match. This will not work once you place gz before tar.gz, though.
>>> extensions = ('tar.gz', 'gz', 'py')
>>> name
'set.tar.gz'
>>> def test():
...     for x in extensions:
...         if name.endswith(x):
...             return x
...     return ' '
...
>>> test()
'tar.gz'
@Guanidene: Yes, agreed. A regex may well be the correct solution, and it is. But I started my answer by saying it might sound stupid, and it is one option; I am not arguing that it is the right one.
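The ordering concern raised in these comments can be sidestepped by choosing the longest matching suffix instead of the first one. This is only a sketch; the function name longest_known_extension is hypothetical, and the approach still relies on a fixed list of known extensions.

```python
def longest_known_extension(name, extensions):
    """Return the longest entry of extensions that name ends with, or None."""
    matches = [ext for ext in extensions if name.endswith(ext)]
    # max(..., key=len) makes the result independent of the tuple's order.
    return max(matches, key=len) if matches else None


print(longest_known_extension('filename.tar.gz', ('.gz', '.tar.gz', '.py')))
print(longest_known_extension('filename.gz', ('.gz', '.tar.gz', '.py')))
```

Because the longest match wins, '.gz' can appear before '.tar.gz' in the tuple without changing the result.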
3

Starting from phihag's answer:

import os

DOUBLE_EXTENSIONS = ['.tar.gz', '.tar.bz2']  # Add extra extensions where desired.

def guess_extension(filename):
    """
    Guess the extension of the given filename.

    Returns a (root, ext) pair, like os.path.splitext.
    """
    root, ext = os.path.splitext(filename)
    # The leading dot keeps names such as 'a.guitar.gz' from matching '.tar.gz'.
    if any(filename.endswith(x) for x in DOUBLE_EXTENSIONS):
        root, first_ext = os.path.splitext(root)
        ext = first_ext + ext
    return root, ext


2

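This is simple and works with both single and multiple extensions: take the last path component, then split it on '.'. The first piece is the file name and the remaining pieces are the extensions.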

In [1]: '/folder/folder/folder/filename.tar.gz'.split('/')[-1].split('.')[0]
Out[1]: 'filename'

In [2]: '/folder/folder/folder/filename.tar'.split('/')[-1].split('.')[0]
Out[2]: 'filename'

In [3]: 'filename.tar.gz'.split('/')[-1].split('.')[0]
Out[3]: 'filename'

2 Comments

In [4]: '/folder/folder/folder/filename.tar.gz'.split('/')[-1].split('.')[1:]
Out[4]: ['tar', 'gz']

Works well for me!
A variant I used of this to get the filename and file extension: f_name, f_ext = os.path.splitext(os.path.basename(body).split("/")[-1])
1

Continuing from phihag's answer: to generically strip all double or triple extensions, as in CropQDS275.jpg.aux.xml, use while '.' in:

import os

tempfilename, file_extension = os.path.splitext(filename)
while '.' in tempfilename:
    tempfilename, tempfile_extension = os.path.splitext(tempfilename)
    if not tempfile_extension:
        # Only a leading dot or a dotted directory is left; stop to avoid looping forever.
        break
    file_extension = tempfile_extension + file_extension
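As an alternative sketch, assuming Python 3.4 or later where the standard pathlib module is available, the suffixes property returns every extension of the final path component, so no loop is needed. The function name all_extensions is hypothetical.

```python
from pathlib import PurePath


def all_extensions(filename):
    """Join every suffix of the final path component, e.g. '.jpg.aux.xml'."""
    # PurePath only parses the string; it never touches the file system.
    return ''.join(PurePath(filename).suffixes)


print(all_extensions('CropQDS275.jpg.aux.xml'))
print(all_extensions('a.tar.gz'))
```

Like the loop above, this treats every dot-separated piece after the first as an extension, so a name such as report.v2.final.txt gives '.v2.final.txt'.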

