2

I'm a 'newbie' in python (started learning 2 weeks ago) and I'm trying to plot a file that looks like this:

"1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112

Using any of the following: pandas, matplotlib and numpy. But I am not having much of a success. I tried searching for examples but none applied to my data format.

Can somebody help me find out how to load this file in a pandas dataframe or (what would be even better) show me how to plot this?

Details:

  • The number of lines in each series is different for different datasets that I have, but in the same dataset(file) the number of rows are the same (as pictured in the code excerpt).
  • There is a blank line between each serie exactly as reproduced in the code excerpt.
  • The title of the serie is the whole string, but with one word (if it is easier to import using two words/colums) I can change the titles.

UPDATE 1:

After the help from @Goyo I changed my method convert() to be something like this:

#!/usr/bin/env python3
def convert(in_file, out_file):
name = ""
for line in in_file:
    line = line.strip()
    print(line)
    if line == "":
        continue
    if line.startswith('"'):
        name = line.strip('"')
        print("NAME:: " + name)
    else:
        out_file.write("{0}\n".format(','.join([name] + line.split("\t")) ) )

To plot I'm using the following code:

with open('nro_caribou.dat') as in_file:
    with open('output.txt', 'w+') as out_file:
       convert(in_file, out_file)
    df = pd.read_csv('output.txt', header=None,names=['Methods', 'Param', 'Time'], sep=",", )
    print(df)
    df.pivot(values='Time', index='Param', columns='Methods').plot()

My original data: https://gist.github.com/pedro-stanaka/c3eda0aa2191950a8d83

And my plot:

the final chart

3
  • 1
    In this sample are headers "1st Serie" and "1st Serie" and then 3 rows of data. Has this file constant number of rows of data? Are this series separated by empty row ? Commented Feb 14, 2016 at 17:40
  • Just edited the post to answer your questions. Commented Feb 14, 2016 at 17:57
  • Names of Series are 1st Series and 2nd Series or can be a , b , it means is not created by two string - 1st and Series? Commented Feb 14, 2016 at 18:01

4 Answers 4

2

AFAIK there's no builtin features in pandas, matplotlib or numpy to read files like that one. If you have some control on the data format I encourage you to change it.

If you have no options but using that format, you can parse the data yourself using just the python I/O and string manipulation features (I do not think pandas can make this easier, it is not designed to deal with these kind of files).

This function can convert data from your format to another more suitable for pandas:

def convert(in_file, out_file):
    for line in in_file:
        line = line.rstrip(' \n\r') 
        if not line:
            continue
        if line.startswith('"'):
            name = line.strip('"')
        else:
            out_file.write('{}\n'.format(','.join([name] + line.split())))

If your original file is 'input.txt' you would use it this way:

with open('input.txt') as in_file:
    with open('output.txt', 'w') as out_file:
        convert(in_file, out_file)
df = pd.read_csv('output.txt', header=None,
                 names=['Series', 'X', 'Y'])
print(df)

      Series  X    Y
0  1st Serie  2   23
1  1st Serie  4   12
2  1st Serie  6   12
3  2nd Serie  2   51
4  2nd Serie  4   90
5  2nd Serie  6  112

df.pivot(index='X', columns='Series', values='Y').plot()

enter image description here

Sign up to request clarification or add additional context in comments.

4 Comments

Man, your solution was the best to me because it was more generic. I had to fix the convert method to be something like that: gist.github.com/pedro-stanaka/9f00abcc9efb820bb9d9
is there a way for you to see if you can reproduce the error I'm getting? The plot is not showing for me. Thanks in advance.
@PedroTamaka My answer assumed that each series had a unique name. After the conversion there is no way to tell apart the two series called "CRE". That could be achieved by adding a counter to convert so to give each series a unique identifier. I won't be able to give it a try until tomorrow.
I had an error in my data (it was not supposed to have two 'CRE' series). I fixed and your solution worked. Now everything is fine.
1

I think you can read_csv only once and then post processing create dataframe:

import pandas as pd
import io

temp=u""""1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112
"""

s = pd.read_csv(io.StringIO(temp), #after testing replace io.StringIO(temp) to filename
                 sep="\s+", 
                 engine='python', #because ParserWarning
                 squeeze=True,
                 header=None) #try convert output to series

print s

"1stSerie"    NaN
2              23
4              12
6              12
"2ndSerie"    NaN
2              51
4              90
6             112
Name: 0, dtype: float64

df = s.reset_index()
#set column names
df.columns = ['idx','val']
#try convert column idx to numeric, if string get NaN
print pd.to_numeric(df['idx'], errors='coerce')
0   NaN
1     2
2     4
3     6
4   NaN
5     2
6     4
7     6
Name: idx, dtype: float64

#find NaN - which values are string
print pd.isnull(pd.to_numeric(df['idx'], errors='coerce'))
0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
Name: idx, dtype: bool

#this values get to new column names
df.loc[pd.isnull(pd.to_numeric(df['idx'], errors='coerce')), 'names'] = df['idx']

#forward fill NaN values
df['names'] = df['names'].ffill()

#remove values, where column val in NaN
df = df[pd.notnull(df['val'])]
print df
  idx  val       names
1   2   23  "1stSerie"
2   4   12  "1stSerie"
3   6   12  "1stSerie"
5   2   51  "2ndSerie"
6   4   90  "2ndSerie"
7   6  112  "2ndSerie"

df.pivot(index='idx', columns='names', values='val').plot()

graphs

Or you can use read_csv and plot. If you need set names of Series to legend, use figure and legend:

import pandas as pd
import matplotlib.pyplot as plt
import io

temp=u""""1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112"""
s1 = pd.read_csv(io.StringIO(temp), #after testing replace io.StringIO(temp) to filename
                 sep="\s+", 
                 engine='python', #because ParserWarning
                 nrows=3, #read only 3 rows of data
                 squeeze=True) #try convert output to series

print s1
2    23
4    12
6    12
Name: "1stSerie", dtype: int64

#after testing replace io.StringIO(temp) to filename
s2 = pd.read_csv(io.StringIO(temp), 
                 sep="\s+", 
                 header=4, #read row 4 to header - series name
                 engine='python',      
                 nrows=3,
                 squeeze=True)

print s2
2     51
4     90
6    112
Name: "2ndSerie", dtype: int64

plt.figure()
s1.plot()
ax = s2.plot()
ax.legend(['1stSerie','2ndSerie'])

graph

Or you can read file only once and then cut Serie s to Series s1, s2 and s3 and then create DataFrame:

import pandas as pd
import matplotlib.pyplot as plt
import io

temp=u""""1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112

"3rdSerie"
2 51
4 90
6 112
"""
s = pd.read_csv(io.StringIO(temp), #after testing replace io.StringIO(temp) to filename
                 sep="\s+", 
                 engine='python', #because ParserWarning
                 squeeze=True) #try convert output to series

print s
2              23
4              12
6              12
"2ndSerie"    NaN
2              51
4              90
6             112
"3rdSerie"    NaN
2              51
4              90
6             112
Name: "1stSerie", dtype: float64
s1 = s[:3]
print s1
2    23
4    12
6    12
Name: "1stSerie", dtype: float64

s2 = s[4:7]
s2.name='2ndSerie'
print s2
2     51
4     90
6    112
Name: 2ndSerie, dtype: float64

s3 = s[8:]
s3.name='3rdSerie'
print s3
2     51
4     90
6    112
Name: 3rdSerie, dtype: float64

print pd.DataFrame({'a': s1, 'b': s2, 'c': s3})
    a    b    c
2  23   51   51
4  12   90   90
6  12  112  112

Comments

1

You can step through the file using itertools.groupby. The LastHeader class below checks each line for a sentinal character. If the character is there, the headerline is updated, and itertools.groupby starts a new segment. The only place this runs into trouble with your dataset is where you have two series labeled "CRE". My workaround was to just delete the second one from the textfile, but you'll probably want to do something else.

The upshot here is that you can just injest the data in a single pass. No writing out and reading back in required.

from itertools import groupby
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pandas import DataFrame, Series

class LastHeader():
    """Checks for new header strings. For use with groupby"""
    def __init__(self, sentinel='#'):
        self.sentinel = sentinel
        self.lastheader = ''
        self.index=0

    def check(self, line):
        self.index += 1
        if line.startswith(self.sentinel):
            self.lastheader = line
        return self.lastheader

fname = 'dist_caribou.dat'

with open(fname, 'r') as fobj:
    lastheader = LastHeader('"')
    data = []
    for headerline, readlines in groupby(fobj, lastheader.check):
        name = headerline.strip().strip('"')
        thisdat = np.loadtxt(readlines, comments='"')
        data.append(Series(thisdat[:, 1], index=thisdat[:, 0], name=name))
data = pd.concat(data, axis=1)
data.plot().set_yscale('log')
plt.show()

1 Comment

Thank you very much for your help. The code ran faster with this modification. And is very good to know the groupby function.
0

Given the appropriate parameters for read_csv in pandas, this is relatively trivial to plot.

s1 = pd.read_csv('series1.txt', 
                 index_col=0, 
                 sep=" ", 
                 squeeze=True, 
                 header=0, 
                 skipinitialspace=True)
>>> s1
tSerie
2    23
4    12
6    12
Name: Unnamed: 1, dtype: int64

s2 = pd.read_csv('series2.txt', 
                 index_col=0, 
                 sep=" ", 
                 squeeze=True, 
                 header=0, 
                 skipinitialspace=True)

%matplotlib inline  # If not already enabled.
s1.plot();s2.plot()

enter image description here

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.