
I am trying to load log files into a dataframe using pandas. I have 2 files I try to merge into 1. What happens is that the dataframe turns out empty, which is strange, because the same code works with other log files of the same type.

Here is the output I get:

rows of df1 146299.000000
columns of df1 6.000000
Columns: [timestamp, type, wait_at_db_queue, db_response_time, wait_server_queue, server_response_time]
Index: []
Empty DataFrame

It reports the right number of rows and columns, but does not show the data inside. What's happening? Here are the code and a data sample.

Code:

    trace_path = '/Users/ramapriyasridharan/Documents/new_exp/new_trace/m3xlarge/01'

    client_path = os.path.join(trace_path,'client')
    middleware_path = os.path.join(trace_path,'middleware')
    df = pd.DataFrame(columns=['timestamp','type','wait_at_db_queue','db_response_time','wait_server_queue','server_response_time'])
    #df = None
    for root, _,files in os.walk(middleware_path):
        for f in files:
            if 'server' not in f : continue
            print 'current file name %s:' %f

            #df.columns = ['timestamp','type','wait_at_db_queue','db_response_time','wait_server_queue','server_response_time']
            f1 = os.path.join(middleware_path,f)
            df1 = pd.read_csv(f1,header=None,sep=',')
            df1.columns = ['timestamp','type','wait_at_db_queue','db_response_time','wait_server_queue','server_response_time']
            #df1 = refine(df1)
            print ' rows of df1 %f' %df1.shape[0]
            print 'columns of df1 %f'%df1.shape[1]
            print 'len of df1 %f' %len(df1)
            df1 = refine(df1)
            print df1
            if df.shape[0] == 0:
                df = df1
                print df
            else:
                df = pd.concat([df,df1],axis=0)
                print df
    print df
    print ' rows of df %f' %df.shape[0]
    print 'columns of df %f'%df.shape[1]

Full output:

 python find_service_time.py 
current file name rsridhar-serverworker-1448992797827.log:
 rows of df1 146299.000000
columns of df1 6.000000
len of df1 146299.000000
Empty DataFrame
Columns: [timestamp, type, wait_at_db_queue, db_response_time, wait_server_queue, server_response_time]
Index: []
Empty DataFrame
Columns: [timestamp, type, wait_at_db_queue, db_response_time, wait_server_queue, server_response_time]
Index: []
current file name rsridhar-serverworker-1448992805710.log:
 rows of df1 194827.000000
columns of df1 6.000000
len of df1 194827.000000
Empty DataFrame
Columns: [timestamp, type, wait_at_db_queue, db_response_time, wait_server_queue, server_response_time]
Index: []
Empty DataFrame
Columns: [timestamp, type, wait_at_db_queue, db_response_time, wait_server_queue, server_response_time]
Index: []
Empty DataFrame
Columns: [timestamp, type, wait_at_db_queue, db_response_time, wait_server_queue, server_response_time]
Index: []
 rows of df 0.000000
columns of df 6.000000
 len of refined df 0.000000
min timestamp : nan
done
Traceback (most recent call last):
  File "find_service_time.py", line 170, in <module>
    main()
  File "find_service_time.py", line 94, in main
    t_per_sec = map(lambda x: len(df[df['timestamp']==x]), range(1,int(np.max(df['timestamp']))))
ValueError: cannot convert float NaN to integer

Sample data:

1448992805978,GET_QUEUE,1,2,0,2
1448992805978,SEND_MSG,18,147,1,157
1448992805978,SEND_MSG,26,153,0,159
1448992805979,SEND_MSG,20,149,1,163
1448992805979,GET_QUEUE,1,3,1,4
1448992805980,GET_QUEUE,1,3,0,3
1448992805981,GET_QUEUE,2,3,1,4
1448992805981,GET_QUEUE,1,3,1,4
1448992805982,SEND_MSG,5,129,0,133
1448992805983,GET_QUEUE,1,8,0,8
1448992805983,GET_QUEUE,3,5,1,6
1448992805983,GET_QUEUE,0,1,5,6
1448992805984,GET_QUEUE,3,5,2,7
1448992805984,GET_QUEUE,2,5,1,7
1448992805985,GET_QUEUE,0,5,3,8
1448992805985,GET_QUEUE,5,10,0,10
1448992805986,GET_QUEUE,4,9,1,10
1448992805986,GET_QUEUE,9,10,0,10
1448992805987,GET_QUEUE,0,7,3,10
1448992805987,GET_QUEUE,4,5,5,10
1448992805988,GET_QUEUE,5,6,5,11
1448992805989,GET_QUEUE,2,6,6,12
1448992805990,GET_QUEUE,1,4,7,11
1448992805990,GET_QUEUE,0,2,8,10
1448992805991,GET_QUEUE,5,10,4,14
1448992805991,GET_QUEUE,2,4,8,12
1448992805991,GET_QUEUE,0,6,7,13
1448992805992,GET_QUEUE,11,16,0,16
1448992805992,GET_QUEUE,0,4,9,13
1448992805993,GET_QUEUE,4,6,8,14
1448992805992,GET_QUEUE,8,15,0,15
1448992805993,GET_QUEUE,1,7,8,15
1448992805993,GET_QUEUE,1,7,8,15
1448992805993,GET_QUEUE,0,10,6,16
1448992805993,GET_QUEUE,6,9,7,16
1448992805994,GET_QUEUE,1,6,8,14
1448992805994,GET_LATEST_MSG_DELETE,1,8,7,15
1448992805995,GET_QUEUE,2,7,9,16
1448992805995,GET_QUEUE,4,6,6,12
1448992805996,GET_QUEUE,10,20,0,20
1448992805996,GET_QUEUE,12,13,6,19

Any suggestions are welcome; this is just a portion of the code.

  • What's the reason for using dataframes? Do you want to use them for further data handling or just in order to merge two files? If you just want to merge two files there might be other ways to do this... Commented Dec 1, 2015 at 18:54
  • What is the refine(df1) function doing? Commented Dec 1, 2015 at 19:10
  • Sorry, refine() just removes some rows from the dataframe. Commented Dec 1, 2015 at 19:39
  • I believe the problem is in pd.concat after initialising df as a dataframe with no rows and then trying to concatenate on a dataframe without rows. Rather try df = df.append(df1). Or if that is not what you want, and you wish to join on the index, initialise df as the first log file. Commented Dec 1, 2015 at 19:57

1 Answer


refine() is not removing some rows from your DataFrame; it's removing all of them. You've got a print df1 after you call it, and your output shows Empty DataFrame each time. The most immediate problem seems to lie in whatever filtering you're doing there.
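Since refine() isn't shown, here is a minimal sketch of one common way a filter silently drops every row: comparing a numeric column against a string. The filter raises no error, it simply matches nothing, which produces exactly the "right shape before, Empty DataFrame after" symptom in your output. The filter condition below is hypothetical, not your actual refine() logic.

```python
import pandas as pd
from io import StringIO

# Two rows in the same format as the sample data.
csv = StringIO(
    "1448992805978,GET_QUEUE,1,2,0,2\n"
    "1448992805978,SEND_MSG,18,147,1,157\n"
)
df1 = pd.read_csv(csv, header=None,
                  names=['timestamp', 'type', 'wait_at_db_queue',
                         'db_response_time', 'wait_server_queue',
                         'server_response_time'])

# Hypothetical filter: 'timestamp' is int64, so comparing it to a *string*
# matches nothing -- the result is an empty DataFrame, no error raised.
refined = df1[df1['timestamp'] == '1448992805978']
print(refined.empty)   # True: every row was dropped

# Debugging tip: inspect dtypes and the boolean mask before filtering.
mask = df1['timestamp'] == 1448992805978
print(mask.sum())      # 2: rows match once the types agree
```

Printing `df1.dtypes` and `mask.sum()` inside refine() before applying each condition will show which condition zeroes out the row count.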
