
I have a file that appears to contain plaintext headers, which I would like to extract and convert to plain text.

Using HEXedit, this is what I see in the file:

3a40 - 31 65 33 38 00 00 00 00 00 00 00 00 00 00 00 00 - 1e38............
3a50 - 00 00 00 00 00 00 00 00 00 00 0a 00 74 00 65 00 - ............t.e.
3a60 - 78 00 74 00 2f 00 61 00 73 00 63 00 69 00 69 00 - x.t./.a.s.c.i.i.
3a70 - 00 00 18 00 61 00 66 00 66 00 79 00 6d 00 65 00 - ....a.f.f.y.m.e.
3a80 - 74 00 72 00 69 00 78 00 2d 00 61 00 72 00 72 00 - t.r.i.x.-.a.r.r.
3a90 - 61 00 79 00 2d 00 62 00 61 00 72 00 63 00 6f 00 - a.y.-.b.a.r.c.o.
3aa0 - 64 00 65 00 00 00 64 00 40 00 35 00 32 00 30 00 - d.e...d.@.5.2.0.
3ab0 - 38 00 32 00 36 00 30 00 30 00 39 00 31 00 30 00 - 8.2.6.0.0.9.1.0.
3ac0 - 37 00 30 00 36 00 31 00 31 00 31 00 38 00 31 00 - 7.0.6.1.1.1.8.1.
3ad0 - 31 00 34 00 31 00 32 00 31 00 33 00 34 00 35 00 - 1.4.1.2.1.3.4.5.
3ae0 - 35 00 30 00 39 00 38 00 39 00 00 00 00 00 00 00 - 5.0.9.8.9.......
3af0 - 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 - ................
3b00 - 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0a 00 - ................

and this is the output I'd like to get:

text/ascii  affymetrix-array-barcode d@52082600910706111811412134550989
3
  • What kind of file is it? Commented May 10, 2011 at 14:39
  • @Christoffer Hammarström: stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/… -- though in the specs there I do not see the data I'm seeing in the file, namely the barcode. Why do you ask? Commented May 10, 2011 at 15:06
  • To see if there already is a parser for it, and to see if there are any pitfalls to be aware of, like @PacoRG noted. Commented May 11, 2011 at 7:15

3 Answers

1

Try with the iconv command. Something like this should work:

tail -c +6 input.txt | iconv -f UTF16 -t ASCII >output.txt

Then split on the null bytes.
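A minimal sketch of that pipeline, assuming the strings are UTF-16LE (each character followed by a 00 byte, as in the dump) and that the byte range holding them is known. `decode_range` is a hypothetical helper name, and the offset and length in the example call are read off the dump in the question (0x3a5c = 14940, 142 bytes up to the last digit). Restricting the input to that range keeps iconv away from the binary data at the start of the file, and `tr` squeezes the NUL and control characters between the fields into single spaces.

```shell
#!/bin/sh
# decode_range FILE OFFSET LENGTH
# Hypothetical helper: decode LENGTH bytes starting at byte OFFSET (0-based)
# as UTF-16LE, then squeeze NUL/control characters into single spaces.
decode_range() {
    tail -c +"$(($2 + 1))" "$1" |
        head -c "$3" |
        iconv -f UTF-16LE -t ASCII |
        LC_ALL=C tr -s '\000-\037' ' '
}

# Example call with the values from the dump (the file name is a placeholder):
# decode_range input.cel 14940 142
```

This only helps when the offset is the same in every file; iconv will still fail if the range contains anything that is not representable in ASCII.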


7 Comments

@PacoRG: Thanks, I just ran iconv and it returned this error: "iconv: illegal input sequence at position 0". If I change UTF16 to UTF8, the position of the error changes to 428. The version of iconv is "iconv (GNU libc) 2.5". Any suggestions? Thanks!
I think you need to manually delete the first 4 bytes of your file.
@PacoRG: I'm not sure how to manually delete 4 bytes, and the solution can't be manual anyway, since this is part of a script.
cut -c4- input.txt |iconv -f UTF16 -t ASCII >output.txt
Oops, I didn't notice that your hexedit capture starts at 3a40 (14912). Try this: cut -c14928- |iconv -f UTF16 -t ASCII >output.txt. BTW, from the link you added, I see that the CEL file format is a complex one. iconv can only help you extract some strings in a quick and dirty way, nothing more.
1

Granted, I'm no wiz, but this does the job if all your files look very similar to the one you just posted:

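use strict;
use warnings;

# Read the file as raw bytes (no newline or encoding translation).
open my $fh, '<:raw', 'file.dat' or die "Cannot open file.dat: $!";
# Skip to byte offset 28 (adjust to where the strings start in your file).
seek $fh, 28, 0 or die "Cannot seek: $!";
# Slurp everything from that offset onward.
my $buf = do { local $/; <$fh> };
close $fh;
# Fields are separated by a pair of NUL bytes; print the first three.
# Each field still contains the NUL bytes interleaved with its characters.
my @s = split /\0\0/, $buf, 4;
print "$s[0] $s[1] $s[2]\n";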
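If the offset is not the same in every file, searching for the marker string instead of seeking to a fixed position may hold up better. Below is a rough shell sketch of that idea, not a parser for the file format. `extract_barcode` is a made-up name, and the sketch assumes that the text is stored with a 00 byte next to every character and that the barcode is "@" followed by digits only, as in the sample.

Reading the dump, the bytes just before the three strings are 0a, 18 and 64, that is 10, 24 and 100, and the first two equal the lengths of "text/ascii" and "affymetrix-array-barcode". So the "d" (0x64) in front of "@" may be a length field rather than part of the barcode. That is only a guess from the dump, which is why the sketch prints the value starting at "@".

```shell
#!/bin/sh
# extract_barcode FILE
# Hypothetical helper: drop all NUL bytes, find the marker string, and print
# the "@digits" value that follows it (allowing up to 4 stray bytes between).
extract_barcode() {
    LC_ALL=C tr -d '\000' < "$1" |
        LC_ALL=C grep -a -o 'affymetrix-array-barcode[^@]\{0,4\}@[0-9][0-9]*' |
        LC_ALL=C sed 's/^[^@]*//'
}

# Example call (the file name is a placeholder):
# extract_barcode input.cel
```

Deleting the NUL bytes first lets an ordinary regex match the marker, at the cost of also touching the binary parts of the file, so treat any match outside the header with suspicion.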

3 Comments

Ran the code in ptkdb (a perl debugger) and see the code doing something, but never get any print statements. What is the code doing and what output should I expect? Thanks!
@blunders Made a change -- it should work now. The expected output is "text/ascii affymetrix-array-barcode d@52082600910706111811412134550989"
+1 @sapht: Thanks, got it working. It's not clear yet whether the code will adapt to my needs, though: it appears to extract based on position, which will not be the same in every file. The only things I know will stay the same are the hex for "affymetrix-array-barcode" and the general format of the barcode itself. I guess I had assumed the text content was plain ASCII, so that I could use a regex to target the hex supplied, extract it, convert it to ASCII, and clean it up if needed.
0

A perl solution might be interesting, but wouldn't the unix strings command give you the plaintext portion of the file?
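For what it's worth, here is a sketch of how that could look with GNU `strings` (from binutils), whose `-e l` option scans for 16-bit little-endian characters instead of single bytes. `barcode_via_strings` is a made-up name, and the sketch assumes the value is the first string that `strings` reports right after the marker, as it would be for the dump in the question.

```shell
#!/bin/sh
# barcode_via_strings FILE
# Hypothetical helper: list 16-bit little-endian strings (GNU strings, -e l),
# then print the line that follows the marker line.
barcode_via_strings() {
    strings -e l "$1" |
        awk 'found { print; exit } $0 == "affymetrix-array-barcode" { found = 1 }'
}

# Example call (the file name is a placeholder):
# barcode_via_strings input.cel
```

The output can be captured from Perl with backticks or from any shell script. Note that `strings` only reports runs of at least 4 printable characters by default, so very short values would need the `-n` option.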

8 Comments

@pavium: Code is running on CentOS, if you're able to call a system command and extract the output provided from the input list with perl -- yes, that's fine. That said, I have very limited understanding of Perl, and have never used the "strings" command; meaning currently your answer reads more like a comment for me. Thanks!
@blunders, man strings should tell you more than you want to know about the command. I wouldn't want to pontificate on how to do it because it's very late here and I'd probably give you bad advice. Tomorrow, as a challenge, I might consider how to do it all in perl when I've had a good rest.
@pavium: Great, thank you! I've been looking at the man page, if I figure out a solution I'll comment again, again thanks!!
@blunders, the [ABSOLUTE PATH] you mentioned was the path to the binary file? I ask because in [ABSOLUTE PATH]/string_output.txt it should be a path to the directory containing the binary file.
@blunders, I upvoted the question because it seemed 'clear and useful' but someone downvoted it. Similarly, someone upvoted my answer soon after I posted it, but that's been downvoted too - probably because it doesn't actually provide a solution using Perl. I'm losing the enthusiasm I had last night ... so I probably won't rush to try a Perl solution, especially now you seem to have a working solution from sapht.
