10

I have a gzip archive with trailing data. If I unpack it using gzip -d it tells me: "decompression OK, trailing garbage ignored" (same goes for gzip -t which can be used as a method of detecting that there is such data).

Now I would like to look at this garbage, but strangely enough I couldn't find any way to extract it. gzip -l --verbose reports the "compressed" size of the archive as the size of the whole file (i.e. including the trailing data), which is wrong and not helpful. file is of no help either, so what can I do?
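
For anyone who wants to reproduce the situation, here is a minimal sketch. It assumes GNU gzip, whose man page documents exit status 2 for warnings; the temporary directory, the file name sample.gz and the appended bytes are made up for illustration:

```bash
#!/bin/bash
# Build a small gzip file and append bytes that are not part of the stream.
work=$(mktemp -d)
sample="$work/sample.gz"
printf 'hello\n' | gzip -c > "$sample"
printf 'TRAILING-GARBAGE-BYTES' >> "$sample"

# gzip -t is expected to exit non-zero here (a warning, not a hard error),
# so capture the status instead of letting the script abort.
status=0
gzip -t "$sample" 2>/dev/null || status=$?
echo "gzip -t exit status: $status"

# The payload in front of the garbage still decompresses.
payload=$(gzip -dc "$sample" 2>/dev/null) || true
echo "payload: $payload"

rm -rf "$work"
```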

0

3 Answers

10

Figured out now how to get the trailing data.

I wrote a Perl script that writes the trailing data to a file (or to stdout). It is heavily based on https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=604617#10:

#!/usr/bin/perl
use strict;
use warnings; 

use IO::Uncompress::Gunzip qw(:all);
use IO::File;

unshift(@ARGV, '-') unless -t STDIN;

my $input_file_name = shift;
my $output_file_name = shift;

if (! defined $input_file_name) {
  die <<END;
Usage:

  $0 ( GZIP_FILE | - ) [OUTPUT_FILE]

  ... | $0 [OUTPUT_FILE]

Extracts the trailing data of a gzip archive.
Outputs to stdout if no OUTPUT_FILE is given.
Passing - as the input file causes it to read from stdin.

Examples:

  $0 archive.tgz trailing.bin

  cat archive.tgz | $0

END
}

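# "<-" makes IO::File read from stdin.
my $in = IO::File->new("<$input_file_name") or die "Couldn't open gzip file.\n";
# Decompress to /dev/null; the data after the gzip stream is stored in $trailing.
gunzip $in => "/dev/null",
  TrailingData => my $trailing
  or die "gunzip failed: $GunzipError\n";
# With a filehandle, TrailingData may hold only what was already buffered,
# so append whatever is still unread.
{ local $/; my $rest = <$in>; $trailing .= $rest if defined $rest; }
undef $in;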

if (! defined $output_file_name) {
  print $trailing;
} else {
  open(my $fh, ">", $output_file_name) or die "Couldn't open output file.\n";
  print $fh $trailing;
  close $fh;
  print "Output file written.\n";
}
12
  • 2
    +1 but IMO, printing to stdout as in the original (but without appending a newline) is better than writing to a hard-coded filename. You can redirect to a file, or pipe to less or hd or hd | less or whatever. Commented Jul 14, 2016 at 6:13
  • @cas: Thank you for the input. Added a bit of parameter handling now. My first perl script BTW, I knew the time would come one day. Commented Jul 14, 2016 at 10:44
  • 1
    nice improvement. i'd upvote it again if i could :) one more idea - a program like this doesn't really need an input file, it works just as well processing stdin. and a while (<>) loop in perl will read stdin and any file(s) listed in @ARGV....that makes it easy to write scripts that work equally well as a filter (i.e. read stdin, write to stdout) and with named file(s). and stdout, of course, can always be redirected to a file. most of my perl scripts are written as filters to take advantage of this. Commented Jul 14, 2016 at 13:57
  • 1
    push @ARGV,'-' if (!@ARGV); before my $input_file_name = shift; is all that's needed here, i.e. a default arg of - (the help message could be printed if $ARGV[0] eq '-h' or $ARGV[0] eq '--help'). For a while(<>) loop you wouldn't even need to do that, but it's probably more trouble than it's worth to write it like that for IO::Uncompress::Gunzip. Commented Jul 15, 2016 at 0:36
  • 2
    it's fine. and unshift instead of push makes sense for how you want to use it, still allows an output filename to be specified as the only arg. I'm personally averse to having files being overwritten without some explicit order from the user - redirection or a -o option or something. having a script automagically switch from first arg of two being input to first and only arg being output seems risky and accident-prone to me (tempting murphy). Commented Jul 16, 2016 at 0:07
2

I really like @LuizAngeloDarosdeLuca's solution because of its simplicity and because it is completely based on bash!

In my experience, however, gzip often fails to recognize a few trailing 'garbage' bytes as such. In these cases it falsely reports a premature end of the archive instead. As soon as even one more byte is included as belonging to the archive, gzip immediately comes to the (correct) conclusion that there is trailing garbage. The algorithm presented by Luiz then gets stuck in an endless loop in which $size no longer changes, although the correct size of the archive has not been found and never will be.

One solution is to check at the end of each iteration of the while loop whether $size has changed at all compared to its previous value. If it has not, the loop is aborted and, starting from the last value of $min, $size is decremented in 1-byte steps until gzip confirms the correct size of the archive.

The complete algorithm then looks like this:

#!/bin/bash

set -e
gzip=${1:?Please specify a gzip file}
size=$(stat -c%s "$gzip")
size_previous=0
min=0
max=$size
while true; do
    if head -c "$size" "$gzip" | gzip -v -t - &>/dev/null; then
        echo $size
        exit
    else
        case "$?" in
            1) min=$size ;;
            2) max=$size ;;
        esac
        # Bisect between the known bounds; stop once the guess no longer changes.
        size_previous=$size
        size=$(((max-min)/2 + min))
        if (( size == size_previous )); then
            break
        fi
    fi
done
for (( size = min; size > 0; size-- )); do
    if head -c $size "$gzip" | gzip -t - &> /dev/null; then
        echo $size
        exit
    fi
done
1
  • Are you missing a change to size in your loop? Commented Aug 7 at 18:13
1

I created a small script to find the gzip size:

#!/bin/bash

set -e
gzip=${1:?Please specify a gzip file}
size=$(stat -c%s "$gzip")
min=0
max=$size
while true; do
        if head -c "$size" "$gzip" | gzip -v -t - &>/dev/null; then
                echo $size
                break
        else
                case "$?" in
                        1) min=$size ;;
                        2) max=$size ;;
                esac
                size=$(((max-min)/2 + min))
        fi
done

Then you can use it to extract the gzip and the trailing part:

file=gzip_with_trailing.gz
gzip_size=$(./find_gzip_size "$file")
head -c "$gzip_size" "$file" > data.gz
tail -c +$((1+gzip_size)) "$file" > trailing.raw

head and tail are not the fastest solution, but they work.
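
To double-check the off-by-one in the tail call, here is a minimal self-contained sketch. It assumes the gzip size is already known (it is recorded before the extra bytes are appended, instead of calling find_gzip_size), and the temporary directory and sample contents are made up for illustration:

```bash
#!/bin/bash
set -e
work=$(mktemp -d)
file="$work/gzip_with_trailing.gz"

# Build a gzip file, remember its size, then append some extra bytes.
printf 'some payload\n' | gzip -c > "$file"
gzip_size=$(( $(wc -c < "$file") ))
printf 'EXTRA' >> "$file"

# Same split as above: the first gzip_size bytes, then everything after them.
head -c "$gzip_size" "$file" > "$work/data.gz"
tail -c +"$((gzip_size + 1))" "$file" > "$work/trailing.raw"

# Check both halves, and that gluing them together gives the original back.
payload=$(gzip -dc "$work/data.gz")
trailing=$(cat "$work/trailing.raw")
if cat "$work/data.gz" "$work/trailing.raw" | cmp -s - "$file"; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
echo "payload=$payload trailing=$trailing roundtrip=$roundtrip"

rm -rf "$work"
```

tail -c +N starts output at byte N (counting from 1), which is why the offset is gzip_size + 1 rather than gzip_size.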
