Splitting input string into words and saving to an array

Question

I have been going at this for hours, and I can't figure out what the problem in my code is. I'm currently writing a very simple assembler for a custom instruction set architecture. This assembler takes an input file and simply parses line by line. In the parsing process, I intend to split each line up by spaces, writing the tokens to an array for processing. Below is some of the code to do that:

    char** tokens = (char**) malloc(sizeof(char*));
    char* linecpy = strcpy(linecpy, line);

    char* tok_ptr = strtok(linecpy, " ");
    int tokenid = 0;
    while(tok_ptr) {
        tokens = (char**) realloc(tokens, (tokenid+1) * sizeof(char*));
        tokens[tokenid] = tok_ptr;
        tokenid++;
        tok_ptr = strtok(NULL, " ");
    }

To test that this is accurately working, I'm having it print out each token sequentially from the array, and I'm finding random splits in the middle of the tokens that shouldn't be there. Here is an example:

Line from assembly file:

    jsr fibloop      ; jump to the main program loop

Expected Output from splitting by spaces:

    jsr
    fibloop
    ;
    jump
    to
    the
    main
    program
    loop

Actual Output:

    jsr
    fibloo
    p
    ;
    jump
    to
    the
    main
    pr
    ogram
    l
    oop

I've spent so long trying to solve this to no avail, and feedback on how to potentially solve this would be greatly appreciated

EDIT: Solution to this was pointed out by Clifford and 4386427, the problem was that linecpy had no memory allocated to it, and strcpy doesn't directly return a new string as I had incorrect assumed. The working code has been put below, and I've included a comment filter to stop tokenization after the parser hits a comment character, something pointed out by Clifford

    char** tokens = (char**) malloc(sizeof(char*));
    char* linecpy = malloc(strlen(line) + 1);

    strcpy(linecpy, line);

    char* tok_ptr = strtok(linecpy, " ");
    int tokenid = 0;
    while(tok_ptr) {
         /* 
            If a token starts with a comment character then we stop tokenization, 
            as everything after will be commented and is of no use to the parser
        */
        if(tok_ptr[0] == ';') break;
        tokens = (char**) realloc(tokens, (tokenid+1) * sizeof(char*));
        tokens[tokenid] = tok_ptr;
        tokenid++;
        tok_ptr = strtok(NULL, " ");
    }
    // free memory allocated to tokens after parsing
    free(tokens);

Hopefully this helps anyone with the same problem I had, the quick responses given by members of this community was extremely helpful. Thanks guys!

Anyone answering this is likely going to need a complete example that produces the output illustrated. At least we would need to know what the actual input was to give that output. — Clifford
– Clifford, Commented Aug 9, 2021 at 14:04
Noted, I included the line from the assembly file being read from — CodeKraken
– CodeKraken, Commented Aug 9, 2021 at 14:07
You may be making this parser more complex than necessary. Everything from ; to the end of the line should probably be either discarded or considered a single token. What if the comment is ;jump to ... ? (i.e. no space after ;) — Clifford
– Clifford, Commented Aug 9, 2021 at 14:22
My problem with this was simple just splitting the text up by spaces to be processed in the first place, my intent for dealing with that is immediately stop parsing the line if the first character of a token is a ; — CodeKraken
– CodeKraken, Commented Aug 9, 2021 at 14:45

4386427 · Accepted Answer · 2021-08-09 14:09:16Z

1

char* linecpy = strcpy(linecpy, line);

is illegal. linecpy has no allocated memory. You need

char* linecpy = malloc(strlen(line) + 1);
strcpy(linecpy, line);

Besides that:

char** tokens = (char**) malloc(sizeof(char*));

should be

char** tokens = NULL;

answered Aug 9, 2021 at 14:09

4386427

44.7k4 gold badges44 silver badges69 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

CodeKraken Over a year ago

That did the trick, my misunderstanding of strcpy and memory allocation in general caused the mishap, as I tested putting the regular input line to be tokenized and it worked completely as intended. Thanks for the help!

4386427 Over a year ago

@CodeKraken You can also take a look at the non-standard strdup function... it allocates memory for you.

Clifford · Accepted Answer · 2021-08-09 14:17:41Z

1

strcpy() doe not allocate memory for the destination string, it needs to be allocated before calling strcpy() - nothing good will happen by copying to an unitialised pointer; I am surprised it did not seg-fault right there.

char linecpy[MAX_LINE_LEN] ;
strcpy( linecpy, line ) ;

or even

char* linecpy = malloc( strlen(line) + 1 ) ;
strcpy( linecpy, line ) ;

The following example outputs as you expect, with just teh above change:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main()
{
    char line[] = "jsr fibloop      ; jump to the main program loop" ;
    
    char** tokens = (char**) malloc(sizeof(char*));
    
    char* linecpy = malloc( strlen(line) + 1 ) ;
    strcpy( linecpy, line ) ;
    
    char* tok_ptr = strtok(linecpy, " ");
    int tokenid = 0;
    while(tok_ptr) {
        tokens = (char**) realloc(tokens, (tokenid+1) * sizeof(char*));
        tokens[tokenid] = tok_ptr;
        tokenid++;
        tok_ptr = strtok(NULL, " ");
    }


    for( int i = 0; i < tokenid; i++) printf( "%s\n", tokens[i] ) ;
    
    return 0;
}

answered Aug 9, 2021 at 14:17

Clifford

94.3k14 gold badges93 silver badges176 bronze badges

1 Comment

CodeKraken Over a year ago

I think it was the lack of a segfault that kept me from noticing that. I haven't had too much experience in the form of text parsing and memory manipulation with C, so something small like that slipping by me makes sense. Apologies for that! The quick responses from you and 4386427 certainly helped a lot!

Collectives™ on Stack Overflow

Splitting input string into words and saving to an array

2 Answers 2

2 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related