Hi I'm dealing with a slightly difficult file format which I'm trying to clean for some future processing. I've been using Pyspark to process the data into a dataframe.
The file looks similar to this:
AA 1234 ZXYW
BB A 890
CC B 321
AA 1234 LMNO
BB D 123
CC E 321
AA 1234 ZXYW
CC E 456
Each 'AA' record defines the start of a logical group or records, and the data on each line is fixed length and has information encoded in it that I want to extract. There are at least 20-30 different record types. They are always identified with a two letter code at the start of each line. There can be 1 or many different record types in each group (i.e. not all record types are present for each group)
As a first stage, I've managed to group the records together in this format:
+----------------+---------------------------------+
| index| result|
+----------------+---------------------------------+
| 1|[AA 1234 ZXYV,BB A 890,CC B 321]|
| 2|[AA 1234 LMNO,BB D 123,CC E 321]|
| 3|[AA 1234 ZXYV,CC B 321] |
+----------------+---------------------------------+
And as a second stage I really want to get data into the following columns in a dataframe:
+----------------+---------------------------------+-------------+--------+--------+
| index| result| AA| BB| CC|
+----------------+---------------------------------+-------------+--------+--------+
| 1|[AA 1234 ZXYV,BB A 890,CC B 321]|AA 1234 ZXYV|BB A 890|CC B 321|
| 2|[AA 1234 LMNO,BB D 123,CC E 321]|AA 1234 LMNO|BB D 123|CC E 321|
| 3|[AA 1234 ZXYV,CC B 321] |AA 1234 ZXYV| Null|CC B 321|
+----------------+---------------------------------+-------------+--------+--------+
Because at that point extracting the information that I need should be trivial.
Does anyone have any suggestions as to how I might be able to do this?
Many Thanks.