I need to create an awk script that converts a glyph (https://en.wikipedia.org/wiki/Glyph) to its Unicode escape (JavaScript syntax), and the reverse: a Unicode escape back to a glyph.

Source data is edited in Notepad++ and saved with UTF-8 encoding.

Here's my progress.

Use_case_1

Dictionary file (dict_1_.txt):

A \u0041
À \u00C0

Input file (input_1_.txt):

A
À

awk script for generating Unicode for equivalent glyph:

awk 'NR == FNR { a[$1] = $2; next } $1 in a { $1 = a[$1] } $2 in a { $2 = a[$2] } 1' dict_1_.txt input_1_.txt

correctly producing:

\u0041
\u00C0

Use_case_2

Dictionary file (dict_2_.txt)

\u0041 A
\u00C0 À

Input file (input_2_.txt)

\u0041
\u00C0

awk script for generating glyphs for equivalent Unicode:

awk 'NR == FNR { a[$1] = $2; next } $1 in a { $1 = a[$1] } $2 in a { $2 = a[$2] } 1' dict_2_.txt input_2_.txt

correctly producing:

A
À

So I can successfully "round-trip" a single symbol per line.

But how do I deal with a more comprehensive dictionary and more than one word per row?

Here is sample data.

Input file (input_3_.txt)

PUDÍN, ALMIDÓN

Dictionary file (dict_3_.txt)

,   \u002C
A   \u0041
D   \u0044
I   \u0049
Í   \u00CD
L   \u004C
M   \u004D
N   \u006E
Ó   \u00D3
P   \u0050
U   \u0055
<space> \u0020

The awk script should generate:

\u0050\u0055\u0044\u00CD\u006E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u006E

Input file (input_4_.txt)

\u0050\u0055\u0044\u00CD\u006E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u006E

Dictionary file (dict_4_.txt)

\u002C  ,
\u0041  A
\u0044  D
\u0049  I
\u00CD  Í
\u004C  L
\u004D  M
\u006E  N
\u00D3  Ó
\u0050  P
\u0055  U
\u0020  <space>

The awk script should generate:

PUDÍN, ALMIDÓN

Here is a more complicated set of input strings (one per row):

MONO Y DIACETIL ÉSTERES DEL ÁCIDO TARTÁRICO DE MONO Y DIGLICÉRIDOS DE ÁCIDOS GRASOS AÑADIDOS
043 HUEVAS DE PESCADO (INCLUYENDO ESPERMA=HUEVAS BLANDAS) Y VÍSCERAS COMESTIBLES DE PESCADO
ACEITE DE SOJA OXIDADO TÉRMICAMENTE Y EN INTERACCIÓN CON MONO Y DIGLICÉRIDOS DE ÁCIDOS GRASOS
BANDEJA PLÁSTICA O CAZUELA, CUBIERTA DE PAPEL DE ALUMINIO O ENVOLTURA

In the dictionary examples above, I have used <space> to stand for the space character between words and after a comma. This probably means that a solution should use \t as the FS in both the dictionary file and the input file. Currently the FS is a single space character, and the RS is \n.
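
A small sketch of why the field separator matters here. It assumes the dictionary is rewritten with a real tab between the two columns and a literal space in place of the <space> placeholder; neither is how the files above are stored. The helper names dict_line, split_tab and split_default are made up for the illustration.

```shell
#!/bin/sh
# One dictionary line: literal space, TAB, \u0020 (assumed layout).
dict_line() { printf ' \t\\u0020\n'; }

# With FS set to a tab, the space survives as field 1.
split_tab() { awk -F'\t' '{ printf "[%s] -> %s\n", $1, $2 }'; }

# With the default FS, leading blanks are stripped and the glyph is lost:
# the code becomes field 1 and field 2 is empty.
split_default() { awk '{ printf "[%s] -> %s\n", $1, $2 }'; }

dict_line | split_tab       # [ ] -> \u0020
dict_line | split_default   # [\u0020] ->   (nothing after the arrow)
```

So with a tab as FS the space can be an ordinary dictionary entry, and no placeholder is needed.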

Further, I need to do the same for hexadecimal character references (HTML syntax), so a solution needs to process a dictionary file like this:

Í   &#xcd;
Ó   &#xd3;

as compared to the Dictionary example above:

Í   \u00CD
Ó   \u00D3
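
One practical difference with the hexadecimal form is that the tokens are not a fixed width (&#xcd; is six characters, but &#x20ac; would be eight), so for the code-to-glyph direction one option is to locate each token with match() instead of counting characters. The following is only a sketch under assumptions: a tab-separated dictionary, hypothetical file names, and a hypothetical helper name entity_to_glyph. The hex digits in the input must be written exactly as in the dictionary (cd, not CD).

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Dictionary: glyph <TAB> entity (the tab separator is an assumption).
printf '%s\t%s\n' 'Í' '&#xcd;' 'Ó' '&#xd3;' > "$tmp/dict_hex.tsv"
printf '%s\n' 'PUD&#xcd;N, ALMID&#xd3;N' > "$tmp/input_hex.txt"

# entity_to_glyph DICT INPUT   (hypothetical helper name)
entity_to_glyph() {
    awk -F'\t' '
        NR == FNR { glyph[$2] = $1; next }          # entity -> glyph
        {
            rest = $0; out = ""
            # Find each hex entity token; copy everything else unchanged.
            while (match(rest, /&#x[0-9A-Fa-f]+;/)) {
                tok = substr(rest, RSTART, RLENGTH)
                if (tok in glyph) tok = glyph[tok]  # unknown entities stay as-is
                out = out substr(rest, 1, RSTART - 1) tok
                rest = substr(rest, RSTART + RLENGTH)
            }
            print out rest
        }' "$1" "$2"
}

entity_to_glyph "$tmp/dict_hex.tsv" "$tmp/input_hex.txt"   # PUDÍN, ALMIDÓN
```

The glyph-to-code direction only looks up glyphs, so it should not care whether the second column holds \u00CD or &#xcd;.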

How to improve or replace my simple awk scripts with scripts that process the longer strings on multiple lines?

  • Wow, this question is way too long. How about shortening it? Commented Jan 5, 2017 at 21:40
  • The question is: How to improve or replace my simple awk scripts with scripts that process the longer strings on multiple lines? The text shows my progress (MCV) and data that hopefully can be processed by a proposed solution. Commented Jan 5, 2017 at 21:43

1 Answer

Here is one approach. Note that you don't need two different versions of the dictionary: a single file can populate both lookup directions.

With a little effort, the two scripts below can be combined into one, with the conversion direction controlled by a parameter. I intentionally kept the dictionary-loading part the same in both.

$ awk 'NR==FNR {$2=$2?$2:" "; u2a[$1]=$2; a2u[$2]=$1; next}
               {for(i=1;i<=NF;i++) $i=a2u[$i]}1' dict FS='' OFS='' input

\u0050\u0055\u0044\u00CD\u006E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u006E

Now working with the encoded input, which is cut into six-character \uXXXX tokens:

$ awk 'NR==FNR {$2=$2?$2:" "; u2a[$1]=$2; a2u[$2]=$1; next}
               {enc=$0; gsub(/....../,"& ",enc); n=split(enc,a); line="";   # reset for each input line
                for(i=1;i<=n;i++) line=line u2a[a[i]]; print line}' dict encoded_input

PUDÍN, ALMIDÓN

Both scripts use your dict_4_.txt as the dictionary (called dict in the commands above).
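
A minimal sketch of the combined script mentioned above, with the direction chosen by a variable. Several things here are assumptions and not part of the scripts above: the function name convert, the mode values enc and dec, the file names, and a tab-separated copy of dict_4_.txt with a literal space as the last glyph. It walks each line with substr() instead of setting FS='', so it does not depend on how a particular awk splits on an empty separator, and it starts every output line empty so multi-line input works.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Tab-separated copy of dict_4_.txt: code <TAB> glyph (layout is an assumption).
printf '%s\t%s\n' \
    '\u002C' ',' '\u0041' 'A' '\u0044' 'D' '\u0049' 'I' '\u00CD' 'Í' \
    '\u004C' 'L' '\u004D' 'M' '\u006E' 'N' '\u00D3' 'Ó' '\u0050' 'P' \
    '\u0055' 'U' '\u0020' ' ' > "$tmp/dict.tsv"
printf '%s\n' 'PUDÍN, ALMIDÓN' 'ALMIDÓN, PUDÍN' > "$tmp/input.txt"

# convert enc|dec DICT INPUT   (hypothetical name and mode values)
convert() {
    awk -F'\t' -v mode="$1" '
        NR == FNR {                                  # load both directions
            u2a[$1] = $2; a2u[$2] = $1
            if (length($2) > max) max = length($2)   # longest glyph seen
            next
        }
        {
            out = ""; i = 1; n = length($0)
            while (i <= n) {
                if (mode == "dec") {                 # fixed-width code tokens
                    step = 6
                    tok = substr($0, i, step)
                    if (tok in u2a) tok = u2a[tok]
                } else {                             # try the longest glyph first
                    for (step = max; step > 1; step--) {
                        tok = substr($0, i, step)
                        if (tok in a2u) break
                    }
                    if (step < 1) step = 1           # guard for an empty dictionary
                    tok = substr($0, i, step)
                    if (tok in a2u) tok = a2u[tok]   # unmapped input passes through
                }
                out = out tok
                i += step
            }
            print out
        }' "$2" "$3"
}

convert enc "$tmp/dict.tsv" "$tmp/input.txt" > "$tmp/encoded.txt"
cat "$tmp/encoded.txt"
convert dec "$tmp/dict.tsv" "$tmp/encoded.txt"   # prints the two original lines
```

Trying the longest glyph first is there so the same code behaves whether the awk in use counts characters or bytes; in a byte-oriented awk Í is two units long, and a one-unit lookup alone would miss it.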

4 Comments

I'm having a problem with 'dict' in your text. Should that be 'dict_4_.txt'?
That is a beautiful thing. I can reproduce your proposal. Of course the Spanish glyphs don't render properly in my BASH, but do when written to output.txt and opened with NotePad++. Gimme an hour to test on the longer strings.
@Jay Gray. Sorry, SO only.
ok - lemme think how best to do this. May trim the initial question, substitute your proposal and add data that is failing. May also submit a new question including progress-to-date. Do you have a preference?
