0

I have a file, file.txt, with the following contents:

+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `dv.par_kst`( |
|   `col1` string,                                   |
|   `col2` string,                                   |
|   `col3` int,                                      |
|   `col4` int,                                      |
|   `col5` string,                                   |
|   `col6` float,                                    |
|   `col7` int,                                      |
|   `col8` string,                                   |
|   `col9` string,                                   |
|   `col10` int,                                     |
|   `col11` int,                                     |
|   `col12` string,                                  |
|   `col13` float,                                   |
|   `col14` string,                                  |
|   `col15` string)                                  |
| PARTITIONED BY (                                   |
|   `part_col1` int,                                 |
|   `part_col2` int)                                 |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
| LOCATION                                           |
|   'hdfs://nameservicets1/dv/hdfsdata/par_kst' |
| TBLPROPERTIES (                                    |
|   'spark.sql.create.version'='2.2 or prior',       |
|   'spark.sql.sources.schema.numPartCols'='2',      |
|   'spark.sql.sources.schema.numParts'='1',         |
|   'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"col1","type":"string","nullable":true,"metadata":{}},{"name":"col2","type":"string","nullable":true,"metadata":{}},{"name":"col3","type":"integer","nullable":true,"metadata":{}},{"name":"col4","type":"integer","nullable":true,"metadata":{}},{"name":"col5","type":"string","nullable":true,"metadata":{}},{"name":"col6","type":"float","nullable":true,"metadata":{}},{"name":"col7","type":"integer","nullable":true,"metadata":{}},{"name":"col8","type":"string","nullable":true,"metadata":{}},{"name":"col9","type":"string","nullable":true,"metadata":{}},{"name":"col10","type":"integer","nullable":true,"metadata":{}},{"name":"col11","type":"integer","nullable":true,"metadata":{}},{"name":"col12","type":"string","nullable":true,"metadata":{}},{"name":"col13","type":"float","nullable":true,"metadata":{}},{"name":"col14","type":"string","nullable":true,"metadata":{}},{"name":"col15","type":"string","nullable":true,"metadata":{}},{"name":"part_col1","type":"integer","nullable":true,"metadata":{}},{"name":"part_col2","type":"integer","nullable":true,"metadata":{}}]}',  |
|   'spark.sql.sources.schema.partCol.0'='part_col1',  |
|   'spark.sql.sources.schema.partCol.1'='part_col2',  |
|   'transient_lastDdlTime'='1587487456')            |
+----------------------------------------------------+

From this file I want to extract the PARTITIONED BY column names.

Desired output :

part_col1 , part_col2

The number of PARTITIONED BY columns is not fixed: another file might contain three or more, so I want to extract all of them.

That is, I need every value between PARTITIONED BY and ROW FORMAT SERDE, with the spaces, the "`" characters and the data types removed.

Could you please help me with this?

4 Answers

1
sed -nr '/PARTITIONED BY/,/ROW FORMAT SERDE/p' a.txt|sed -nr '/`/p'|cut -d '`' -f 2|xargs -n 1 echo -n " "

2 Comments

Also, instead of having the records in file.txt, I have to run it like this: par_col=$(beeline --silent -u "$BEELINE_URL" -e "$sql"), where sql="show create table dvs_wk.par_kst". par_col holds the result shown above, but when I run result=$(sed -n '/PARTITIONED BY/,/ROW FORMAT SERDE/p' $par_col | sed -n '/`/p' | cut -d '`' -f 2 | xargs -n 1 echo -n " ") it gives me an error.
The first sed prints all lines from PARTITIONED BY to ROW FORMAT SERDE (inclusive); the second sed keeps only the lines that contain a "`" character; cut then splits each line on "`" and prints the second field (the column name); finally xargs collects all the names and prints them separated by spaces. It may not be the best pipeline, but it works on your example.
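For the follow-up case, where the table definition sits in a shell variable rather than a file, one possible sketch is to feed the variable to the pipeline on standard input, since sed treats a bare argument as a file name. The function name extract_partition_cols and the shortened sample text are made up for illustration, and paste is swapped in for xargs to join the names.

```shell
# Hypothetical helper: reads `show create table` output on stdin and prints
# the backquoted names found between PARTITIONED BY and ROW FORMAT.
extract_partition_cols() {
  sed -n '/PARTITIONED BY/,/ROW FORMAT/p' |
    sed -n '/`/p' |
    cut -d '`' -f 2 |
    paste -sd ' ' -
}

# Stand-in for the beeline output (assumption); in the real script this would be
#   par_col=$(beeline --silent -u "$BEELINE_URL" -e "$sql")
par_col='| PARTITIONED BY (                                   |
|   `part_col1` int,                                 |
|   `part_col2` int)                                 |
| ROW FORMAT SERDE                                   |'

# Quote the variable so its line breaks survive, and pipe it in on stdin.
result=$(printf '%s\n' "$par_col" | extract_partition_cols)
echo "$result"
```

Quoting "$par_col" matters here: unquoted, the shell would collapse the line breaks and the range match could no longer tell the lines apart.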
1
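use strict;
use warnings;

# Slurp the whole input (a file named on the command line, or STDIN).
my $text = do { local $/; <> };

my @partitioned;

# For each PARTITIONED BY ( ... ) clause, collect every backquoted name inside it.
while ($text =~ /PARTITIONED BY\s*\(([^()]*)\)/g) {
    my $fullcontent = $1;
    push @partitioned, $1 while $fullcontent =~ /`([^`]+)`/g;
}

print join(", ", @partitioned), "\n";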

Output:

part_col1, part_col2

1

If the layout of the result doesn't matter, you can have sed consider only the lines between a start and an end tag, and print a line only when it contains a field between two backquotes.

sed -rn '/PARTITIONED BY/,/ROW FORMAT/s/.*`(.*)`.*/\1/p' file.txt

Combining the results in a line as desired can be done with

printf "%s , " $(sed -rn '/PARTITIONED BY/,/ROW FORMAT/s/.*`(.*)`.*/\1 /p' file.txt) |
   sed 's/ , $/\n/'
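As a variation that does not depend on word splitting, the same range-and-substitute idea can be combined with paste. This is only a sketch: the function name list_partition_cols and the three-column sample are made up for illustration, and the expression uses basic regular expressions so that no -r flag is needed.

```shell
# Hypothetical helper: print the partition column names of the file in $1,
# joined with " , ".
list_partition_cols() {
  sed -n '/PARTITIONED BY/,/ROW FORMAT/s/.*`\(.*\)`.*/\1/p' "$1" |
    paste -sd ',' - |
    sed 's/,/ , /g'
}

# Made-up sample with three partition columns.
sample=$(mktemp)
cat > "$sample" <<'EOF'
|   `col15` string)                                  |
| PARTITIONED BY (                                   |
|   `part_col1` int,                                 |
|   `part_col2` int,                                 |
|   `part_col3` string)                              |
| ROW FORMAT SERDE                                   |
EOF

joined=$(list_partition_cols "$sample")
echo "$joined"
rm -f "$sample"
```

The col15 line sits outside the address range, so it is skipped even though it contains backquotes.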

-1

Small perl script

  • read whole file into $data variable
  • select all between PARTITIONED BY (....)
  • select into array only elements between `
  • print result joined with ,
use strict;
use warnings;
use feature 'say';

my $data = do { local $/; <> };
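my $re   = 'PARTITIONED BY \((.*?)\)';

# Capture the clause body into its own variable and stop if it is missing,
# instead of matching against an unchecked $1.
my ($body) = $data =~ /$re/s
    or die "No PARTITIONED BY clause found\n";

my @part = $body =~ /`(.*?)`/g;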

say join ', ', @part;
