Extract substructure from a text file using bash or python

Question

I have a huge text file, which follows the structure:

SET
TAG1
...
...
SET
...
SET
TAG2
...
...
SET
...
...

I would like to extract, for a specific TAG (e.g. TAG54), its individual "substructure", which would be:

SET
TAG54
...
...
SET

Each substructure for a given TAG_i always contains:

first line: SET
second line: TAG_i (in this case TAG54)
an arbitrary number of lines
last line: SET

I wonder what would be the best way to do this, whether in bash or python, so for a given TAG, one can "extract" this substructure.

Thanks

Not a very good solution, but you could use a rough regex in Python: /TAG\d+?(.+?)SET/gsm. There is a better way to handle the newlines, but the regex tool I was using doesn't like them.
– Davis, Commented Mar 16, 2010 at 18:09
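
The regex idea in the comment above can be sketched in Python with re.DOTALL, so that . also matches newlines, and re.MULTILINE, so that ^ and $ anchor at line boundaries. This is only a sketch under assumptions: the function name extract_with_regex and the sample text are hypothetical, and it reads the whole text into memory, which may not be practical for a huge file.

```python
import re

def extract_with_regex(text, tag):
    """Return the first 'SET / tag / ... / SET' block in text, or None."""
    pattern = re.compile(
        # SET line, then the exact tag line, then as little as possible
        # up to the next line that consists only of SET.
        r"^SET\n" + re.escape(tag) + r"\n.*?^SET$",
        re.DOTALL | re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(0) if match else None

# Hypothetical sample data following the structure in the question.
sample = "SET\nTAG1\na\nb\nSET\nx\nSET\nTAG54\nc\nd\nSET\ny\n"
print(extract_with_regex(sample, "TAG54"))
```

Because the tag must be followed directly by a newline, asking for TAG5 does not pick up the TAG54 block.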

Alex Martelli · Accepted Answer · 2010-03-16 18:08:29Z

1

Here's a Python approach: you pass in the open file handle as the first argument and the tag number as the second argument, and you get back a list of the relevant lines (including newline characters), or an empty list if the tag is not found in the file:

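def lookfor(f, tagnum):
    tag = 'TAG%s\n' % tagnum
    # Skip ahead to the tag line.
    for line in f:
        if line == tag:
            break
    else:  # file finished, tag not found
        return []
    # The line before the tag is always SET, so re-create it.
    result = ['SET\n', tag]
    # Collect lines up to and including the closing SET.
    for line in f:
        result.append(line)
        if line == 'SET\n':
            break
    return result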

This should perform reasonably well: it makes a single pass over the file and stops reading as soon as the closing SET is found. If you want other forms of arguments and/or results, it shouldn't be hard to tweak accordingly, of course.
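
As one example of such a tweak, here is a hedged sketch of a stricter variant: it accepts the tag line only when the line just before it is SET, and it tolerates a missing newline at the end of the file. The name extract_block and the sample data are illustrative and not part of the answer above; unlike lookfor, it takes the full tag string and returns lines without their trailing newlines.

```python
import io

def extract_block(lines, tag):
    """Return the lines of the 'SET / tag / ... / SET' block, or [] if absent.

    `lines` is any iterable of lines, e.g. an open file handle.
    """
    previous = None
    block = []
    for raw in lines:
        line = raw.rstrip("\n")
        if block:
            # Inside the block: collect lines until the closing SET.
            block.append(line)
            if line == "SET":
                return block
        elif line == tag and previous == "SET":
            # Found the tag right after a SET line: start the block.
            block = ["SET", tag]
        previous = line
    return []  # tag not found, or the block was never closed

# Hypothetical sample data; the last line has no trailing newline.
sample = io.StringIO("SET\nTAG1\na\nSET\nx\nSET\nTAG54\nc\nd\nSET\ny")
print(extract_block(sample, "TAG54"))
```

With a real file, the call would look like extract_block(open(path), "TAG54"), where path is whatever file you are reading.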


Isaac · Accepted Answer · 2010-03-16 18:22:51Z

0

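If your system's grep supports -P for Perl regexps (GNU grep does), you also need -z, because grep otherwise matches one line at a time and a pattern containing \n can never match. Adding -o prints only the matched block:

grep -Pzo '(?s)SET\nTAG54\n.*?\nSET\n' file.txt

With -z the whole file is treated as a single record, so for a very large file the line-by-line approaches in the other answers may be a better fit.

2 Comments

Open the way Over a year ago

hi, it does not work. can you tell me what each part does? thanks a lot

Isaac Over a year ago

grep is a search tool; the -P option makes grep use a Perl-type regexp (your system may not support -P); -z makes grep read the whole file as one record so the pattern can span newlines; -o prints only the matched text; (?s) lets . match newlines as well. The regexp itself matches SET followed by a newline, followed by TAG54 and a newline, then as few arbitrary characters and/or newlines as possible (.*?), a newline, and SET; file.txt is the name of the file to search.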

Ignacio Vazquez-Abrams · Accepted Answer · 2010-03-16 18:25:48Z

0

gawk:

BEGIN {
  state=0
}

state==0 && $0=="TAG54" {
  print "SET"
  state=1
}

state==1 {
  print
}

state==1 && $0=="SET" {
  exit
}


Ignacio Vazquez-Abrams · Accepted Answer · 2010-03-16 18:32:20Z

0

csplit -f tags input.txt '%^TAG54$%-1' '/^SET$/+1' '%.*%' '{*}'

edited Mar 16, 2010 at 18:32

answered Mar 16, 2010 at 18:23


ghostdog74 · Accepted Answer · 2010-03-16 23:53:23Z

0

$ awk -vRS="SET" '/TAG54/{print RT$0RT}' file
SET
TAG54
...
...
SET

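If you are doing this in a shell script, pass the shell variable to awk with -v and use that same variable name inside the awk program (RT is a gawk extension, so this needs gawk), e.g.:

#!/bin/bash
read -r -p "what's your tag? " tag
awk -v RS="SET" -v tag="$tag" '$0~tag{print RT$0RT}' file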
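
Note that $0~tag is a regexp match against the whole record, so a tag such as TAG5 would also select the TAG54 block. A rough Python sketch of the same record-splitting idea, but with an exact comparison of the tag line, is given here. The name extract_by_records and the sample text are hypothetical, and like the awk version it assumes SET only ever appears as a delimiter line.

```python
def extract_by_records(text, tag):
    """Split text on SET lines and return the block whose first line is tag."""
    records = [[]]
    for line in text.splitlines():
        if line == "SET":
            records.append([])  # a SET line ends one record and starts the next
        else:
            records[-1].append(line)
    # Skip the first and last records: they are not enclosed by SET on both sides.
    for record in records[1:-1]:
        if record and record[0] == tag:
            return "\n".join(["SET", *record, "SET"])
    return None

# Hypothetical sample data following the structure in the question.
sample = "SET\nTAG1\na\nSET\nx\nSET\nTAG54\nc\nd\nSET\ny\n"
print(extract_by_records(sample, "TAG54"))
```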


1 Comment

Open the way Over a year ago

hi, your approach is really nice and simple! I forgot to mention that I also need the lines with "SET" at the beginning and end of the file, but I will do that myself. thanks
