0

I have the sample data as below:

# df
VAR1   SEQ    VAR2    VAR3       DATE    VAR4     VAR5    VAR6    VAR7
AAA      1     YYY      01   20000630      AL    11111    ABCD      PA
BBB      1     YYY      01   20100701      GA    12345    EDED      NY
BBB      2     YYY      01   20150815      GA    12345              NY
BBB      3     YYY      01   19950105      GA    12345    YTRU      NY
BBB      4     YYY      01   20000701      GA    12345    IIII      NY
BBB      5     YYY      01   20210701      GA    12345              NY
CCC      1     NNN      01   20210630      CA    33333    SSSS      NJ
CCC      2     NNN      01   20210629      CA    33333              NJ

In SAS, we can export fixed width format file as below:

BLANK_VAR1 = " "

%MACRO FRIST;
    PUT @  1  "00FIRST"
        @  8  VAR1       $CHAR5.
        @ 13  BLANK_VAR1 $CHAR2.
        @ 15  VAR2       $CHAR3.
    ;
%MEND FRIST;

%MACRO SECOND;
    PUT @  1  "00SECOND"
        @  9  VAR3       $CHAR2.
        @ 11  BLANK_VAR1 $CHAR2.
        @ 13  VAR4       $CHAR2.
        @ 15  VAR5       $CHAR5.
    ;
%MEND SECOND;

%MACRO THIRD(sequence);
    num = &sequence.;
    PUT @  1  num        Z2.0
        @  3  "THIRD"    $CHAR5.
        @  8  DATE       $CHAR8.
    ;
%MEND THIRD;

%MACRO FOURTH(sequence);
    num = &sequence.;
    PUT @  1  num        Z2.0
        @  3  "FOURTH"   $CHAR5.
        @  9  VAR6       $CHAR25.
        @ 34  BLANK_VAR1 $CHAR2.
        @ 36  VAR7       $CHAR2.
    ;
%MEND FOURTH;

filename outtmp "/home/folder/outfile_tmp";  

DATA _NULL_;                                                                    
   SET df;    
 
   BY VAR1 SEQ;                                                           
   FILE outtmp;  
                                                                
   IF FIRST.VAR1 THEN DO;
      %FRIST;
      %SECOND;
      REC_CNT = 0;                                                             
   END;                                                                       
   REC_CNT + 1;
   IF REC_CNT LE 3 THEN DO;
      %THIRD(REC_CNT);
      IF VAR6 NE ' ' THEN DO;
         %FOURTH(COUNTN);
      END;
   END;                                                         
RUN;


filename output "/home/folder/output"; 


%MACRO INREC;                                                                     
   PUT 001 RECIN $CHAR150.;
%MEND INREC;


%MACRO FILE_FIRST;
    DATE = TODAY();
    PUT @  1  "###FIRSTLINE###"
        @ 16  DATE       JULIAN5.
        @ 21  BLANK_VAR1 $CHAR2.
        @ 23  "###FIRSTLINEEND###"
    ;
%MEND FILE_FIRST;


%MACRO FILE_LAST;
    DATE = TODAY();
    PUT @  1  "###LASTLINE###"
        @ 15  DATE       JULIAN5.
        @ 20  BLANK_VAR1 $CHAR2.
        @ 22  "###LASTLINEEND###"
    ;
%MEND FILE_LAST;


DATA output;                                                                  
   INFILE outtmp truncover;                                                               
   INPUT                                                                       
      @ 001 RECIN $CHAR150.;                                                                         
RUN; 

DATA _NULL_;
   SET output  end=last;  
   file output  lrecl=256 ;
   IF _N_ = 1 THEN DO;
      %FILE_FIRST;
   END;

   %INREC;  
                                                                  
   IF last THEN DO;
      %FILE_LAST; 
   END;
RUN; 

This is the output:
enter image description here

###FIRSTLINE###21182  ###LASTLINEEND###
00FIRSTAAA  YYY
00SECOND01  AL11111
01THIRD20000630
01FOURTHABCD                       PA
00FIRSTBBB  YYY
00SECOND01  GA12345
01THIRD20100701
01FOURTHEDED                       NY
02THIRD20150815
03THIRD19950105
03FOURTHYTRU                       NY
00FIRSTCCC  NNN
00SECOND01  CA33333
01THIRD20210630
01FOURTHSSSS                       NJ
###LASTLINE###21182  ###LASTLINEEND###

The logic for above program is:

  1. There are four parts that needed to be output.
  2. If there are multiple same VAR1, then only output FIRST and SECOND once.
  3. Output THIRD part for SEQ is less than 3. If the SEQ is larger than 3, do not output. Ignore.
  4. Output the FOURTH part following the third logic and also if VAR6 is not missing.
  5. Note: In THIRD and FOURTH part, the first two string should change from 01 to 03 depends on the records.

How can I replicate this format in Python?
I found that np.savetxt() with fmt argument might be a way link; however, the file should be the same order as original dataframe.

pandas has function read_fwf() to read fixed width format file; however, no to_fwf() function to export.

I have been stuck for several days, so any idea should be helpful!

9
  • I imagine you'd just make the full line one string, then write that? Commented Jul 1, 2021 at 23:08
  • @Joe First and Last line might be done by this but the middle part is difficult to do so if you have a large dataset. Commented Jul 1, 2021 at 23:43
  • To be clear, you just want your data exported fixed width or something else? I can't tell from the output you posted? Commented Jul 1, 2021 at 23:50
  • 1
    @PeterChen I mean, it's difficult in SAS too, but ... it's basically what you are doing even in SAS - you are converting the records to strings (in a text file) and then reading them back in and writing them out again. (It's probably much easier to do this in SAS than you're doing it, honest, but leaving that aside...) The bigger problem here honestly is just stating the rules you're following - it's not really clear what the rules are, beyond the program instructions of course. You write First Second once, Third three times, and Fourth if it has a value in the first three rows? Commented Jul 2, 2021 at 0:21
  • 1
    @JonSG Fixed with format is what he's doing in SAS, at least. But it's more like it's being split across a bunch of rows, something that's sort of like the old "card column" format but not exactly that either... I presume it must be something for a bank system, as nothing else would be so odd. Commented Jul 2, 2021 at 0:22

1 Answer 1

1

This isn't exactly a good way to do it, but maybe it gives you an idea how to do the logic. I'm just writing to a list, you can then write the list out - but probably you should do it the way JonSG did in his (deleted) answer where you use a file writer instead. There's probably a better approach using a data class, but that's not my expertise.

import pandas as pd

df = pd.read_csv(r"h:\temp\df_text.csv")

outlist = []

for index,row in df.iterrows():
    if(row['SEQ']==1):
        tempstr = '00FIRST'+row.VAR1+'  '+row.VAR2
        outlist.append(tempstr)
        tempstr = '00SECOND'+str(row.VAR3)+'  '+str(row.VAR4)+str(row.VAR5)
        outlist.append(tempstr)
    if(row['SEQ'] <= 3):
        seqval ='0'+str(row.SEQ) if  row.SEQ < 10 else str(row.SEQ)
        tempstr = str(row['SEQ'])+'THIRD'+str(row.DATE)
        outlist.append(tempstr)
        if (row.VAR6 != '  '):
            tempstr = str(row['SEQ'])+'FOURTH'+row.VAR6+'  '+row.VAR7
            outlist.append(tempstr)

    
    
Sign up to request clarification or add additional context in comments.

4 Comments

Your answer fulfill the logic but the only thing is each output width/length (it seems that you use space directly to do so, cool!).. I'm also finding his delete answer but it seems that there is no record to track for now
is it possible to set up tuple in the list and use that for FIRST, SECOND, THIRD, and FOURTH output?
The way I did it, no, that wouldn't help - not that I can see anyway? I think the "best" way to do this is with a dataclass that has the output built into it as a method, honestly, but the above either with a list or with a file writer should accomplish what you want - maybe with a little more care about formatting, as Jon shows, using the {: <4} type formatters.
It seems that we cannot check the deleted answer that @JonSG posted...

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.