I have a tab-delimited file, File A, like this:

establishment_of_protein_localization_to_endoplasmic_reticulum  GO:0072599
    lipid_oxidation GO:0034440
    endocytic_vesicle_lumen GO:0071682
    monocarboxylic_acid_metabolic_process   GO:0032787
    protein_transmembrane_transport GO:0071806
    cellular_response_to_topologically_incorrect_protein    GO:0035967
    preribosome GO:0030684
    negative_regulation_of_hematopoietic_progenitor_cell_differentiation    GO:1901533

and a second file structured like this:

font-family: Helvetica;
font-size: 10.86px;
font-weight: 700;
text-anchor: middle;
fill: #000000;
stroke: none;">
GO:0072599
</text>

<text x="509.10" y="-243.88"

style="
font-family: Helvetica;
font-size: 10.72px;
font-weight: 700;
text-anchor: middle;
fill: #000000;
stroke: none;">
GO:0034440
</text>

I want to use awk or sed to match the second column of File A against the second file, and replace each matching string in the second file with the corresponding first-column value from File A. Essentially, this should give the following output:

font-family: Helvetica;
font-size: 10.86px;
font-weight: 700;
text-anchor: middle;
fill: #000000;
stroke: none;">
establishment_of_protein_localization_to_endoplasmic_reticulum
</text>

<text x="509.10" y="-243.88"

style="
font-family: Helvetica;
font-size: 10.72px;
font-weight: 700;
text-anchor: middle;
fill: #000000;
stroke: none;">
lipid_oxidation
</text>

where each GO:####### identifier is matched against the second column of the first file. I tried using this command:

#!/bin/bash

    awk 'NR==FNR{a[$2]=$1;next}{$1=a[$1\2];}1' input.csv 

However, it replaces more than just the strings found in column 2 of File A.

  • the output is wrong: regulation_of_muscle_system_process GO:0090257 does not relate to GO:0045927. Update your description Commented Mar 9, 2018 at 6:32
  • Yeah, could you give us proper input and output so that we can help you? Commented Mar 9, 2018 at 6:36
  • Hi Allan, I just corrected the input and the output to match. I apologize, it was supposed to be symbolic, but it should now make more sense. Commented Mar 9, 2018 at 6:39
  • @Rnewbie, elaborate on whether those asterisks **est... really appear in your file. Commented Mar 9, 2018 at 6:41
  • Whoops, that was my attempt to make the change more clear. They do not; I have fixed that. Commented Mar 9, 2018 at 6:47

1 Answer

The solution you are looking for is something like the command below. Note, however, that your expected output does not match your input file.

awk 'FNR==NR{ hashKey[$2]=$1; next }$1 in hashKey{$1=hashKey[$1]}1' FS='\t' file1 file2

The idea is to hash the first file, which is tab-separated, using its second column as the key and its first column as the value. Then, while reading the second file, every line whose first field is present in the hash table has that field replaced with the stored value; all other lines are printed unchanged.
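As a minimal, self-contained sketch of the same two-pass idea: the file names `file1.tsv` and `file2.svg` and their contents are made up for illustration, and the lookup is done on the whole line (`$0`) rather than `$1`, which assumes each GO ID sits alone on its line in the second file.

```shell
#!/bin/sh
# Hypothetical sample data; file names and contents are assumptions.
tmp=$(mktemp -d)
printf 'lipid_oxidation\tGO:0034440\npreribosome\tGO:0030684\n' > "$tmp/file1.tsv"
printf 'stroke: none;">\nGO:0034440\n</text>\n' > "$tmp/file2.svg"

# First file (FNR==NR): build a map from GO ID to term name.
# Second file: replace a line only if the entire line is a known GO ID.
result=$(awk -F'\t' '
    FNR == NR  { name[$2] = $1; next }
    $0 in name { $0 = name[$0] }
               { print }
' "$tmp/file1.tsv" "$tmp/file2.svg")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Matching on `$0` means lines that merely contain a GO ID somewhere are left untouched, and no field is reassigned on non-matching lines, so their original spacing is preserved. If the first file has leading spaces or Windows line endings, those would need to be stripped before the keys can match.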
