1

I have a simple perl script that does a number of text replacements along the following lines:

#!/usr/bin/perl
{
open(my $in, "<", "Texts.txt") or die "No input: $!";
open(my $out, ">",  "TeXed/Texts.tex") or die "No output directory: $!";
LINE: while (<$in>) {
    s/(txt@)(.*)(?<!\t|\[)\[(.*)/\1\2\\ovl{}\3/g;# 
    # there are a bunch of other replacements like the above
    print $out $_ ; 
    }
}

So far so good. The text I am running this script on is organized into blocks (not always of the same length). Each block starts with the same identifier (txt@) and then a unique label. Every label starts with a #.
What I would like to achieve is to delete all repeated labels – essentially I only want to keep every first instance of a label and replace/delete all subsequent ones until the label changes. In the below example the ones to be replaced/deleted are in bold.

txt@#Label1 some text
some more text
some more text

txt@#Label1 some other text
some more text
some more text
some more text

txt@#Label1 some random text
some more text
some more text

txt@#Label2 some text
some more text
some more text
some more text

txt@#Label1 some text
some more text
some more text

txt@#Label3 some text
some more text
some more text

txt@#Label3 some text
some more text
some more text

txt@#Label1 some text
some more text
some more text

etc.

Sorry for the long example – I couldn't come up with a better way of explaining this.

So I want to delete all repeated Label1, Label2, etc., but without modifying the rest of the text (some text, some more text) both on the same line but also on subsequent lines. The number of subsequent lines is not always the same (so it's not every n-th line that has to be replaced).

Is that possible with perl? Or any other way? (I'm not married to perl, if it is easier with another language I'd be happy to try that – I'm not a programmer though so detailed instructions would be highly appreciated).

5
  • Do you want to keep the text as well? How do you want to ogranize output? Commented May 10, 2016 at 9:05
  • Yes, the rest of the text has to be kept (I just edited the post to make it clearer). The output should be kept the way it is organized, there are a number of other replacement operations that I run on the text, but no lines are deleted. Commented May 10, 2016 at 9:10
  • OK. So the lines with duplicate labels just literally lose the label and everything else stays the same? Commented May 10, 2016 at 9:11
  • It does work!!! Great, thanks so much for the prompt solution! -- Just one more thing: I'd like to get rid of the hashtag too (which is part of the label). Commented May 10, 2016 at 9:49
  • Fixed. Since nothing depended on that regex (other than deletion) it was as simple as moving the # into the non-capturing group. Then the replacement doesn't put it back. Let me know if more comes up. Commented May 10, 2016 at 9:59

1 Answer 1

1

Introduce the 'current label' -- the latest one picked up -- and keep track of it. Once a line with a label comes up compare: if it is the same it repeats so delete it, otherwise replace it and we have the new 'current' one.

The processing goes line-by-line. Alternatively, one can read a whole block at a time to enable per-block processing, which may be more convenient. Code for this is shown at the end.

use warnings;
use strict;

open my $fh_out, '>', 'new_text_label.txt';
open my $fh_in, '<', 'text_label.txt';

# Our current (running) label
my $curr_label = '';

while (<$fh_in>)  
{
    # If line with label fetch it otherwise (print and) skip
    my ($label) = $_ =~ m/txt@#(\w+)/;
    if (not $label) {
        # ... process non-label line as needed ...
        print $fh_out $_;
        next;
    }       
    # Delete if repeated (matching the current), reset if new
    if ($curr_label eq $label) {
        s/(txt@)(?:#\w+)(.*)/$1$2/;
    }   
    else {
        $curr_label = $label;
    }   
    # ... process label-line as needed ...
    print $fh_out $_;
}

This yields the required file. Processing of lines with or without a label is separate, which may be good if further processing differs for them. Or, the pre-processing of the label-line can be all done in one place, which is better if further processing does not distinguish between lines with or without a label.

while (<$fh_in>) 
{
     # If this is the label line, process it: delete or replace the label
     if (my ($label) = $_ =~ m/txt@#(\w+)/) {
        # Delete if repeated (matching the current), reset if new
        if ($curr_label eq $label) {
            s/(txt@)(?:#\w+)(.*)/$1$2/;
        }   
        else {
            $curr_label = $label;
        }
     }
     # The label is now fixed as needed. Process lines normally ...
     print $fh_out $_;
}

This replaces the while loop above, the rest of the code is the same.


Derived from what was originally posted, commented

Here are changes in the code so that it reads the whole block at a time, which is good for processing that can make use of having the whole block of text in the variable. Note that a block contains new lines (so regexes may need /s etc). To enable possible bulk processing all blocks are first read into an array as well.

my @blocks = do { 
    # Set record separator to empty line to read blocks
    local $/ = "\n\n";
    open my $fh_in, '<', 'text_label.txt';
    <$fh_in>;    
};

# Our current (running) label
my $curr_label = '';

foreach my $bl (@blocks) 
{
     # The label pre-processing is exactly the same as above
     # Other processing can now utilize having the whole block in $bl
}
Sign up to request clarification or add additional context in comments.

15 Comments

Yes it works now (I might have copied something wrong). One more question: Where should I put the rest of my regexes (to be processed after the label replacement)? I tried to put the LINE: while ... sequence before the print command but it gives me an error.
OK, no worries, I'll play around with it. It's not urgent. I've now put the s/ patterns in the loop (before the print) but get a Use of uninitialized value $_ in substitution (s///) error?
@jan Changed and rearranged the code. Now it starts with a version that processes line-by-line. You should be able to literally copy your regexes (and whatever other processing you have), to where it says "process ...". It still gives you two options: where label and non-label lines are distinguished, or not. Then it shows what change in code to make if you wish to do per-block processing (that was the original post). It's all been tested. Please let me know how it goes.
@jan The error you were getting was because your regex uses the default $_ while I had had a named variable $bl. I changed that around as well so that you can simply copy your code. In principle I do recommend using properly named variables, specially in complex processing that calls $_ often. The code's often just clearer that way.
Excellent, thank you so much! First and second versions work fine. I'm using the second solution since other replacements search both, label and no-label lines. In the first version I tested putting these other replacements at the beginning, before the block with labels starts, and it worked as well. When would one be preferred over the other? (I didn't get the processing in blocks to work, despite changing $_ to $bl but I'm happy with what I've got!)
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.