0

I have a huge text file, which follows the structure:

SET
TAG1
...
...
SET
...
SET
TAG2
...
...
SET
...
...

For a specific tag (e.g. TAG54), I would like to extract its individual "substructure", which would be:

SET
TAG54
...
...
SET

Each substructure for a given TAG_i always contains:

first line: SET
second line: TAG_i (in this case TAG54)
an arbitrary number of lines
last line: SET

I wonder what would be the best way to do this, whether in bash or python, so for a given TAG, one can "extract" this substructure.

Thanks

1
  • Not a very good solution, but you can use my rough regex in Python: /TAG\d+?(.+?)SET/gsm. There is a better way to handle newlines, but the regex tool I was using doesn't like them. Commented Mar 16, 2010 at 18:09

5 Answers

1

Here's a Python approach: you pass in the open file handle as the first argument and the tag number as the second argument, and you get back a list of the relevant lines (including newline characters), or an empty list if the tag is not found in the file:

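def lookfor(f, tagnum):
    """Return the lines of the block for TAG<tagnum>, or [] if not found."""
    tag = 'TAG%s\n' % tagnum
    # Skip ahead to the tag line; the SET before it has already been consumed.
    for line in f:
        if line == tag:
            break
    else:  # file finished, tag not found
        return []
    result = ['SET\n', tag]
    # Collect everything up to and including the closing SET.
    for line in f:
        result.append(line)
        if line == 'SET\n':
            break
    return result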

This should perform reasonably well, since it reads the file line by line and stops at the closing SET. If you want other forms of arguments and/or results, it shouldn't be hard to tweak accordingly.
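As a variation on the same idea, here is a minimal sketch that also checks that the line before the tag really is SET. The name extract_block and the sample data are made up for illustration, and it assumes the tag sits alone on its line:

```python
import io

def extract_block(lines, tag):
    """Return the lines of the first SET/tag/.../SET block, or [] if absent."""
    prev = None
    it = iter(lines)
    for line in it:
        # Only accept the tag when the previous line was the opening SET.
        if line.rstrip("\n") == tag and prev == "SET":
            break
        prev = line.rstrip("\n")
    else:  # input exhausted, tag not found
        return []
    block = ["SET\n", tag + "\n"]
    # Keep reading from the same iterator up to and including the closing SET.
    for line in it:
        block.append(line)
        if line.rstrip("\n") == "SET":
            break
    return block

# Hypothetical sample input; with a real file, pass the open file object instead.
sample = io.StringIO("SET\nTAG1\na\nSET\nx\nSET\nTAG54\nb\nc\nSET\ny\n")
print("".join(extract_block(sample, "TAG54")), end="")
```

Because it works on any iterable of lines, the same function can be fed an open file handle, so the whole file never needs to be in memory.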

0

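If your system's grep supports -P for Perl regexps and -z for treating the whole input as one record (GNU grep does):

grep -Pzo '(?s)SET\nTAG54\n.*?\nSET\n' file.txt

Without -z, grep matches one line at a time, so a pattern containing \n can never match. Also, inside a character class . is a literal dot, so [.\n]* does not mean "any characters". Note that -z reads the whole file into memory and ends the output with a NUL byte (pipe through tr -d '\0' if that matters).

2 Comments

Hi, it does not work. Can you tell me what each part does? Thanks a lot.
grep is a search tool; the -P option makes grep use a Perl-type regexp (your system may not support -P); -z makes grep read the file as a single record so that \n can match; -o prints only the matched part; '(?s)SET\nTAG54\n.*?\nSET\n' is the regexp to match: (?s) lets . match newlines, then SET followed by a newline, followed by TAG54 and a newline, then as few arbitrary characters as possible (.*?), a newline, and SET; file.txt is the name of the file to search.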
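The same multi-line pattern idea can be sketched in Python with the re module. This is only an illustration: the function name extract_with_regex and the sample text are made up, and it assumes the file is small enough to read into memory in one go.

```python
import re

def extract_with_regex(text, tag):
    """Return the first SET/tag/.../SET block as a string, or None."""
    pattern = re.compile(
        # MULTILINE makes ^ and $ match at line boundaries;
        # DOTALL lets . match newlines; .*? stops at the first closing SET.
        r"^SET\n" + re.escape(tag) + r"\n.*?^SET$",
        re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(0) if match else None

# Hypothetical sample; for a real file use open(path).read() instead.
text = "SET\nTAG1\na\nSET\nx\nSET\nTAG54\nb\nc\nSET\ny\n"
print(extract_with_regex(text, "TAG54"))
```

Anchoring on whole lines means TAG5 does not accidentally match TAG54. For a really huge file a line-by-line approach is probably the safer choice.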
0

gawk:

BEGIN {
  state=0
}

state==0 && $0=="TAG54" {
  print "SET"
  state=1
}

state==1 {
  print
}

state==1 && $0=="SET" {
  exit
}

0
csplit -f tags input.txt '%^TAG54$%-1' '/^SET$/+1' '%.*%' '{*}'

0
$ awk -vRS="SET" '/TAG54/{print RT$0RT}' file
SET
TAG54
...
...
SET

If you are doing it with shell scripting, pass your shell variable to awk using -v (note that RT is a gawk extension), e.g.:

#!/bin/bash
read -r -p "what's your tag? " tag
awk -v RS="SET" -v tag="$tag" '$0~tag{print RT$0RT}' file

1 Comment

Hi, your approach is really nice and simple! I forgot to mention that I also need the lines with "SET" at the beginning and end of the file, but I will do that myself. Thanks.
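For comparison, the record-separator idea can be sketched in Python too. The names records and find_record are hypothetical, and the sketch assumes every block is closed by a SET line and that text between blocks never starts with the tag. Unlike a substring match such as /TAG54/, it compares the whole tag line, so TAG54 does not pick up TAG540:

```python
def records(lines, sep="SET"):
    """Yield the list of lines that precedes each separator line."""
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if line == sep:
            yield current
            current = []
        else:
            current.append(line)
    # Lines after the last separator are dropped: they have no closing SET.

def find_record(lines, tag, sep="SET"):
    """Return [sep, tag, ..., sep] for the first record starting with tag."""
    for rec in records(lines, sep):
        if rec and rec[0] == tag:
            return [sep] + rec + [sep]
    return []

# Hypothetical sample; an open file handle works as well as a list of lines.
sample = "SET\nTAG540\nz\nSET\nx\nSET\nTAG54\nb\nc\nSET\ny\n".splitlines()
print("\n".join(find_record(sample, "TAG54")))
```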
