Create Pandas DataFrame from a string

Question

In order to test some functionality I would like to create a DataFrame from a string. Let's say my test data looks like:

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

What is the simplest way to read that data into a Pandas DataFrame?

Gustavo Lopes · Accepted Answer · 2019-06-17 14:43:45Z

852

A simple way to do this is to use StringIO.StringIO (python2) or io.StringIO (python3) and pass that to the pandas.read_csv function. E.g:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

TESTDATA = StringIO("""col1;col2;col3
    1;4.4;99
    2;4.5;200
    3;4.7;65
    4;3.2;140
    """)

df = pd.read_csv(TESTDATA, sep=";")

edited Jun 17, 2019 at 14:43

Gustavo Lopes

4,2534 gold badges24 silver badges63 bronze badges

answered Mar 24, 2014 at 9:21

Emil L

21.2k3 gold badges48 silver badges71 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

wkzhu Over a year ago

FYI - pd.read_table() is an equivalent function, just slightly better nomenclature: df = pd.read_table(TESTDATA, sep=";").

Emil L Over a year ago

@AntonvBR Noted that one could use pandas.compat.StringIO. That way we don't have to import StringIO separately. However the pandas.compat package is considered private according to pandas.pydata.org/pandas-docs/stable/api.html?highlight=compat so leaving the answer as is for now.

gdm Over a year ago

I receive 'Error tokenizing data. C error: Expected 2 fields in line 26, saw 12\n',)

skytaker Over a year ago

@gdm The "Expected fields error" is usually due to inconsistent number of fields in each row, or wrong sep/delimiter.

Mike T Over a year ago

@Asclepius the suggestion to use pandas.compat only works for versions before pandas 0.25, but will now raise ImportError.

|

mirekphd · Accepted Answer · 2022-10-20 08:33:30Z

74

In one line, but first import io

import pandas as pd
import io   

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

df = pd.read_csv(io.StringIO(TESTDATA), sep=";")
print(df)

edited Oct 20, 2022 at 8:33

mirekphd

7,2314 gold badges62 silver badges88 bronze badges

answered Nov 26, 2020 at 9:10

user3810512

8036 silver badges3 bronze badges

2 Comments

Ruli Over a year ago

What is the difference between this and the accepted answer? Except you move io operation to read_csv, which makes no difference... Please always check if similar answer is not posted already, redundancy is unnecessary.

eduardosufan Over a year ago

I like this answer because it is explained with one example in one line.

Shaurya Uppal · Accepted Answer · 2019-04-14 02:19:25Z

63

Split Method

data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)

edited Apr 14, 2019 at 2:19

answered Feb 21, 2019 at 17:27

Shaurya Uppal

3,73834 silver badges31 bronze badges

4 Comments

Mabyn Over a year ago

If you want the first line to be used for column names, change the 2nd line to this: df = pd.DataFrame([x.split(';') for x in data.split('\n')[1:]], columns=[x for x in data.split('\n')[0].split(';')])

Antonio Ercole De Luca Over a year ago

This is wrong, since on CSV files the newline (\n) character can be part of a field.

DanB Over a year ago

This is not very robust, and most people would be better with the accepted answer. There is a very partial list of things that can go wrong with this at thomasburette.com/blog/2014/05/25/…

ggorlen Over a year ago

This is a weak and unnecessary approach. This circumvents the built-in pd.read_csv() call which handles any CSV and has been battle-tested in favor of a naive couple of string splits that don't handle quoting and are liable to fail in many cases. It's also more verbose. Please use the CSV parser Pandas kindly gives you!

user2314737 · Accepted Answer · 2018-11-30 08:17:59Z

26

A quick and easy solution for interactive work is to copy-and-paste the text by loading the data from the clipboard.

Select the content of the string with your mouse:

In the Python shell use read_clipboard()

>>> pd.read_clipboard()
  col1;col2;col3
0       1;4.4;99
1      2;4.5;200
2       3;4.7;65
3      4;3.2;140

Use the appropriate separator:

>>> pd.read_clipboard(sep=';')
   col1  col2  col3
0     1   4.4    99
1     2   4.5   200
2     3   4.7    65
3     4   3.2   140

>>> df = pd.read_clipboard(sep=';') # save to dataframe

answered Nov 30, 2018 at 8:17

user2314737

29.7k20 gold badges109 silver badges126 bronze badges

1 Comment

Mabyn Over a year ago

Not good for reproducibility, but otherwise a pretty neat solution!

Asclepius · Accepted Answer · 2020-06-24 22:02:42Z

This answer applies when a string is manually entered, not when it's read from somewhere.

A traditional variable-width CSV is unreadable for storing data as a string variable. Especially for use inside a .py file, consider fixed-width pipe-separated data instead. Various IDEs and editors may have a plugin to format pipe-separated text into a neat table.

Using `read_csv`

Store the following in a utility module, e.g. util/pandas.py. An example is included in the function's docstring.

import io
import re

import pandas as pd


def read_psv(str_input: str, **kwargs) -> pd.DataFrame:
    """Read a Pandas object from a pipe-separated table contained within a string.

    Input example:
        | int_score | ext_score | eligible |
        |           | 701       | True     |
        | 221.3     | 0         | False    |
        |           | 576       | True     |
        | 300       | 600       | True     |

    The leading and trailing pipes are optional, but if one is present,
    so must be the other.

    `kwargs` are passed to `read_csv`. They must not include `sep`.

    In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can 
    be used to neatly format a table.

    Ref: https://stackoverflow.com/a/46471952/
    """

    substitutions = [
        ('^ *', ''),  # Remove leading spaces
        (' *$', ''),  # Remove trailing spaces
        (r' *\| *', '|'),  # Remove spaces between columns
    ]
    if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in str_input.strip().split('\n')):
        substitutions.extend([
            (r'^\|', ''),  # Remove redundant leading delimiter
            (r'\|$', ''),  # Remove redundant trailing delimiter
        ])
    for pattern, replacement in substitutions:
        str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
    return pd.read_csv(io.StringIO(str_input), sep='|', **kwargs)

Non-working alternatives

The code below doesn't work properly because it adds an empty column on both the left and right sides.

df = pd.read_csv(io.StringIO(df_str), sep=r'\s*\|\s*', engine='python')

As for read_fwf, it doesn't actually use so many of the optional kwargs that read_csv accepts and uses. As such, it shouldn't be used at all for pipe-separated data.

I found (by trial&error) that read_fwf takes more of read_csvs arguments than is documented, but it's true that some have no effect.

Hunaphu · Accepted Answer · 2023-07-01 14:48:46Z

4

Object: Take string make dataframe.

Solution

def str2frame(estr, sep = ',', lineterm = '\n', set_header = True):
    dat = [x.split(sep) for x in estr.strip(lineterm).split(lineterm)]
    df = pd.DataFrame(dat)
    if set_header:
        df = df.T.set_index(0, drop = True).T # flip, set ix, flip back
    return df

Example

estr = """
sym,date,strike,genus
APPLE,20MAY20,50.0,Malus
ORANGE,22JUL20,50.0,Rutaceae
"""

df = str2frame(estr)

print(df)
0     sym     date strike     genus
1   APPLE  20MAY20   50.0     Malus
2  ORANGE  22JUL20   50.0  Rutaceae

edited Jul 1, 2023 at 14:48

answered Jun 18, 2021 at 9:35

Hunaphu

73413 silver badges14 bronze badges

4 Comments

problemofficer - n.f. Monica Over a year ago

Please use descriptive variable names and not estr, dat, cdf.

Hunaphu Over a year ago

The style is intentional; feel free to copy this answer and use your own notation.

ggorlen Over a year ago

Same remark as this answer. Please don't reinvent the wheel, just use pd.read_csv().

Hunaphu Over a year ago

If you have one, a better solution depending only on pandas would be appreciated.

Engr. Khuram Shahzad · Accepted Answer · 2022-11-20 08:51:48Z

0

Emample:

text = [ ['This is the NLP TASKS ARTICLE written by Anjum**'] ,['IN this article I”ll be explaining various DATA-CLEANING techniques '], ['So stay tuned for FURther More && '],['Nah I dont think he goes to usf ; he lives around']]
df = pd.DataFrame({'text':text})

Output

answered Nov 20, 2022 at 8:51

Engr. Khuram Shahzad

4851 gold badge5 silver badges17 bronze badges

3 Comments

ggorlen Over a year ago

This has nothing to do with the question. OP wants to read a CSV from a string, completely missing here.

Engr. Khuram Shahzad Over a year ago

@ggorlen. First see the question: "Create Pandas DataFrame from a string" and then see my answer. lol

ggorlen Over a year ago

Sorry, I did that and don't see what you mean. There's no CSV string in your answer. You've hardcoded a list, completely bypassing a conversion step that all of the other answers here address. The question has nothing to do with converting lists to dataframes. Even if you were going to hardcode lists, this is a poor solution because it adds extra single-element inner list wrappers on each of the strings for no apparent reason, making the text column an object rather than a string datatype.

Collectives™ on Stack Overflow

Create Pandas DataFrame from a string

7 Answers 7

8 Comments

2 Comments

4 Comments

1 Comment

Using `read_csv`

Non-working alternatives

1 Comment

Solution

Example

4 Comments

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

7 Answers 7

8 Comments

2 Comments

4 Comments

1 Comment

Using read_csv

Non-working alternatives

1 Comment

Solution

Example

4 Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related

Using `read_csv`