1

I have a directory containing a set of TSV files. Some are gzip-compressed (*.tsv.gz) and some are not (*.tsv).

I want to grep these files for a string and print each match on its own line.

I have a function that takes the input directory containing the *.tsv and *.tsv.gz files, and the string to search for.

import sys, os, traceback,subprocess,gzip,glob
def filter_from_tsvs(input_dir,string):

    tsvs = glob.glob(os.path.join(input_dir,'*.tsv*'))
    open_cmd=open
    for tsvfile in tsvs:
        print os.path.splitext
        extension = os.path.splitext(tsvfile)[1]
        if extension == ".gz":
          open_cmd = gzip.open
    print open_cmd
    try:
        print subprocess.check_output('grep string tsvfile', shell=True)

    except Exception as e:
        print "%s" %e
        print "%s" %traceback.format_exc()
return

I have also tried to use:

         try:
             fname = open_cmd(tsvfile,"r")
             print "opened"
             print subprocess.check_output('grep string fname', shell=True)

I got this error:

gzip: tsvfile.gz: No such file or directory
Command 'zgrep pbuf tsvfile' returned non-zero exit status 2
Traceback (most recent call last):
  File "ex.py", line 23, in filter_from_maintsvs
    print subprocess.check_output('zgrep pbuf tsvfile', shell=True)
  File "/datateam/tools/opt/lib/python2.7/subprocess.py", line 544, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command 'zgrep pbuf tsvfile' returned non-zero exit status 2

How can I use grep/zgrep within Python?

1
  • First step is to use subprocess.check_output(['grep', string, tsvfile]) Commented May 29, 2014 at 1:12

2 Answers

3

I got the following solution after going through a blog and it worked for me :)

import subprocess
import signal

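# Python ignores SIGPIPE by default; restore the default handler in the child process.
# Note: 'string' and 'tsvfile' are literal placeholders here, not Python variables.
output = subprocess.check_output('grep string tsvfile', shell=True, preexec_fn=lambda: signal.signal(signal.SIGPIPE, signal.SIG_DFL))

print output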

Hints:

  • If the string is not found, grep exits with status 1 and check_output raises a CalledProcessError.
  • check_output has been available since Python 2.7. For an alternative, look here.
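
To illustrate the first hint, here is a minimal sketch of treating exit status 1 as "no match" rather than as a failure. It is written for Python 3 (an assumption; the code above is Python 2), and the names grep_lines and grep_cmd are hypothetical helpers, not part of subprocess.

```python
import subprocess


def grep_lines(pattern, path, grep_cmd="grep"):
    """Return the lines of `path` that match `pattern`, or [] if none match."""
    try:
        # A list of arguments bypasses the shell, so the values of
        # `pattern` and `path` are passed to grep as they are.
        output = subprocess.check_output([grep_cmd, "-e", pattern, path])
    except subprocess.CalledProcessError as err:
        if err.returncode == 1:
            # grep uses exit status 1 for "no lines selected".
            return []
        # Any other non-zero status (e.g. 2 for a missing file) is a real error.
        raise
    return output.decode("utf-8", errors="replace").splitlines()
```

Calling grep_lines("pbuf", "/path/to/file.tsv") would then return a (possibly empty) list, and only genuine failures such as a missing file would raise.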

2

Some comments on your code:

At the moment, the search string and filename are hardcoded as the literal text 'string' and 'tsvfile' inside the command, so your Python variables are never used. Pass them as list elements instead:

subprocess.check_output(['grep', string, tsvfile])

Next, if you're using zgrep then you don't need to open your files with gzip.open. You can call zgrep on a tsv.gz file, and it will take care of opening it without any extra work from you. So instead try calling

subprocess.check_output(['zgrep', string, tsvfile]) 

Note that zgrep will also work on uncompressed tsv files, so you don't need to keep switching between grep and zgrep.
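
If you would rather keep the gzip.open idea from the question and skip the external process entirely, something along these lines might work. This is only a sketch under a few assumptions: Python 3, UTF-8 text, and a plain substring test instead of grep's regular expressions. The generator form and the (path, line) return shape are my own choices, not something from the question.

```python
import glob
import gzip
import os


def filter_from_tsvs(input_dir, string):
    """Yield (path, line) for each line containing `string` in *.tsv / *.tsv.gz files."""
    # Same glob as in the question; note it also matches names like x.tsv.bak.
    for path in sorted(glob.glob(os.path.join(input_dir, "*.tsv*"))):
        # Choose the opener per file, so one .gz file does not affect the others.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if string in line:  # substring match, not a regex
                    yield path, line.rstrip("\n")
```

Usage would be a simple loop, for example: for path, line in filter_from_tsvs(input_dir, "pbuf"): print(line). Each match is printed on its own line, which is what the question asks for.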
