0

This questions is probably another instance of

BigQuery UDF memory exceeded error on multiple rows but works fine on single row

But it was suggested that I post as a question instead of an answer.

I'm using javascript to parse logfiles into a table. The javascript parse function is relatively simple. It works at 1M rows but fails at 3M rows. Log files can be a lot bigger than the 3M so failing is a problem.

The function is below.

function parseLogRow(row, emit) {

    r =  (row.logrow ? row.logrow : "") + (row.l2 ? " " + row.l2 : "") + (row.l3 ? " " + row.l3 : "")
    ts = null
    category = null
    user = null
    message = null
    db = null
    seconds = null
    found = false
    if (r) {
        m = r.match(/^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (\+|\-)\d\d\d\d) \[([^|]*)\|([^|]*)\|([^\]]*)\] ::( \(([\d\.]+)s\))? (.*)/ )
        if( m){
          ts = new Date(m[1])*1
          category = m[3] || null
          user = m[4] || null
          db = m[5] || null
          seconds = m[7] || null
          message = m[8] || null
          found = true
        }
        else {
          message = r
          found = false
        }
     }

    emit({
      ts:  ts,
      category: category,
      user: user,
      db: db,
      seconds: seconds*1.0,
      message: message,
      found: found
      });
  }


  bigquery.defineFunction(
    'parseLogRow',                           // Name of the function exported to SQL
    ['logrow',"l2","l3"],                    // Names of input columns
    [
      {'name': 'ts', 'type': 'float'},  // Output schema
      {'name': 'category', 'type': 'string'},
      {'name': 'user', 'type': 'string'},
      {'name': 'db', 'type': 'string'},
      {'name': 'seconds', 'type': 'float'},
      {'name': 'message', 'type': 'string'},
      {'name': 'found', 'type': 'boolean'},
    ],
    parseLogRow                          // Reference to JavaScript UDF
  );

I'm referencing the function with this query:

SELECT
    ROW_NUMBER() OVER() as row_num,
    ts,category,user,
    db,seconds,message,found,
FROM parseLogRow((SELECT * FROM[#{dataset}.today]
      LIMIT 1000000
    ))

Some example data in the 'today' table looks like this (as a CSV):

logrow,l2,l3
# Logfile created on 2015-12-29 00:00:09 -0800 by logger.rb/v1.2.7,,
2015-12-29 00:00:09.262 -0800 [INFO|7aaa0|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:09.277 -0800 [DEBUG|7aaa0|] :: Restarted gulp process,,
2015-12-29 00:00:09.278 -0800 [INFO|7aaa0|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:14.343 -0800 [DEBUG|7aaa2|scheduler] :: Polling for pending tasks (master: true),,
2015-12-29 00:00:19.396 -0800 [INFO|7aaa4|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:19.409 -0800 [DEBUG|7aaa4|] :: Restarted gulp process,,
2015-12-29 00:00:19.410 -0800 [INFO|7aaa4|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:29.487 -0800 [INFO|7aaa6|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:29.500 -0800 [DEBUG|7aaa6|] :: Restarted gulp process,,
2015-12-29 00:00:29.500 -0800 [INFO|7aaa6|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:39.597 -0800 [INFO|7aaa8|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:39.610 -0800 [DEBUG|7aaa8|] :: Restarted gulp process,,
2015-12-29 00:00:39.611 -0800 [INFO|7aaa8|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:44.659 -0800 [DEBUG|7aaaa|scheduler] :: Polling for pending tasks (master: true),,
2015-12-29 00:00:49.687 -0800 [INFO|7aaac|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:49.689 -0800 [DEBUG|7aaac|] :: Restarted gulp process,,
2015-12-29 00:00:49.689 -0800 [INFO|7aaac|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:59.869 -0800 [INFO|7aaae|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:59.871 -0800 [DEBUG|7aaae|] :: Restarted gulp process,,

This is a bit of a hack as I'm importing the log as a 3 column table (really, just one column) by importing it as a CSV with the delimiter set to tab (we usually don't have any tabs in the log file), the using a query to turn it into the table I actually want.

I like this pattern because it the parsing is fast and distributed (when it works).

A job that failed is: bigquery-looker:bquijob_260be029_153dd96cfdb. Please contact me if you need a reproducible case.

Any help or suggestion would be appreciated.

1 Answer 1

1

Point # 1
I dont see issue with UDF - it worked for me even on 10M rows
I think the issue rather is in use of ROW_NUMBER() OVER() - remove it and it should work!

SELECT
  ts,category,user,
  db,seconds,message,found,
FROM parseLogRow((SELECT * FROM[#{dataset}.today]
))  

Point #2
From performance prospective, below should run faster (I think) and in general I would suggest avoid using UDF in cases when "plain" BQ works as well

SELECT 
  REGEXP_EXTRACT(logrow, r'^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[[^|]*\|[^|]*\|[^\]]*\] :: .*') AS ts,
  REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[([^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (?:.*)') AS category,
  REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|([^|]*)\|(?:[^\]]*)\] :: (?:.*)') AS user,
  REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|([^\]]*)\] :: (?:.*)') AS db,
  REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (.*)') AS message,
  REGEXP_MATCH(logrow, r'^((?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (?:.*))') AS found
FROM (
  SELECT logrow +IFNULL(' ' + l2, '') + IFNULL(' ' + l3, '') AS logrow 
  FROM YourTable    
)
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks, you are exactly right. Wasn't the javascript it was the window function. Here is more info: stackoverflow.com/questions/33247703/…

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.