
I am trying to implement a query in BigQuery that finds the top keywords for a document within a larger collection of documents using tf-idf scores.

Before calculating the tf-idf scores of the keywords, I clean the documents (e.g. remove stop words and punctuation), then create 1-, 2-, 3- and 4-grams out of the documents, and then stem the tokens inside the n-grams.

To perform this cleaning, n-gram creation and stemming, I am using JavaScript libraries inside a JS UDF. Here is the example query:

CREATE TEMP FUNCTION nlp_compromise_tokens(str STRING)
RETURNS ARRAY<STRUCT<ngram STRING, count INT64>> LANGUAGE js AS '''
  // Create 1-, 2-, 3- and 4-grams using compromise js.
  // Before that, stop words are removed via .removeStopWords,
  // a function borrowed from remove_stop_words.js.
  var tokens_from_compromise = nlp(str.removeStopWords()).normalize().ngrams({max:4}).data();

  // The stemming function: stems each space-separated
  // token inside the n-grams.
  // It uses snowball.babel.js.
  function stems_from_space_separated_string(tokens_string) {
    var stem = snowballFactory.newStemmer('english').stem;
    var splitted_tokens = tokens_string.split(" ");
    var splitted_stems = splitted_tokens.map(x => stem(x));
    return splitted_stems.join(" ");
  }

  // Return the n-grams from compromise (stemmed internally)
  // alongside the count of each n-gram inside the document.
  var ngram_count = tokens_from_compromise.map(function(item) {
    return {
      ngram: stems_from_space_separated_string(item.normal),
      count: item.count
    };
  });
  return ngram_count;
'''
OPTIONS (
  library=["gs://fh-bigquery/js/compromise.min.11.14.0.js","gs://syed_mag/js/snowball.babel.js","gs://syed_mag/js/remove_stop_words.js"]);

with doc_table as (
  SELECT 1 id, "A quick brown 20 fox fox fox jumped over the lazy-dog" doc UNION ALL
  SELECT 2, "another 23rd quicker browner fox jumping over Lazier broken! dogs." UNION ALL
  SELECT 3, "This dog is more than two-feet away." #UNION ALL
),
  ngram_table as(
  select
    id,
    doc,
    nlp_compromise_tokens(doc) as compromise_tokens
  from
    doc_table),
n_docs_table as (
  select count(*) as n_docs from ngram_table
),
df_table as (
SELECT
  compromise_token.ngram,
  count(*) as df
FROM
  ngram_table, UNNEST(compromise_tokens) as compromise_token
GROUP BY
  ngram
),

idf_table as(
SELECT
  ngram,
  df,
  n_docs,
  LN((1+n_docs)/(1+df)) + 1 as idf_smooth
FROM
  df_table
CROSS JOIN
  n_docs_table),

tf_idf_table as (  
SELECT
  id,
  doc,
  compromise_token.ngram,
  compromise_token.count as tf,
  idf_table.ngram as idf_ngram,
  idf_table.idf_smooth,
  compromise_token.count * idf_table.idf_smooth as tf_idf
FROM
  ngram_table, UNNEST(compromise_tokens) as compromise_token
JOIN
  idf_table
ON
  compromise_token.ngram = idf_table.ngram)

SELECT
  id,
  ARRAY_AGG(STRUCT(ngram,tf_idf)) as top_keyword,
  doc
FROM(
  SELECT
    id,
    doc,
    ngram,
    tf_idf,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY tf_idf DESC) AS rn
  FROM
    tf_idf_table)
WHERE
  rn < 5
group by
  id,
  doc

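For reference, the scoring in idf_table and tf_idf_table above is the standard smoothed tf-idf:

$\mathrm{idf\_smooth}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1, \qquad \mathrm{tf\_idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf\_smooth}(t)$

where $N$ is the number of documents (n_docs), $\mathrm{df}(t)$ is the number of documents containing the n-gram $t$, and $\mathrm{tf}(t, d)$ is the n-gram count returned by the UDF for document $d$.
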
Here is what the example output looks like: [screenshot of the query result]

There are only three handmade sample rows in this example.

When I try the same code on a slightly larger table with 1,000 rows, it again works fine, although it takes quite a bit longer to finish (around 6 minutes for only 1,000 rows). This sample table (1 MB) can be found here in JSON format.

Now, when I try the query on a larger dataset (159K rows, 155 MB), it fails after around 30 minutes with the following message:

Errors: User-defined function: UDF worker timed out during execution.; Unexpected abort triggered for worker worker-109498: job_timeout (error code: timeout)

Can I improve my UDF or the overall query structure to make sure it runs smoothly on even larger datasets (124,783,298 rows, 244 GB)?

N.B. I have set the proper permissions on the JS files in Google Cloud Storage, so these JavaScript libraries are accessible to anyone who wants to run the example queries.

1 Answer


BigQuery UDFs are very handy, but they are computationally expensive and can make your query slow or exhaust resources. See the documentation reference for limitations and best practices. In general, any UDF logic you can convert to native SQL will be way faster and use fewer resources.
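For example, the basic cleaning (lower-casing, stripping punctuation, dropping stop words) and the per-document term counts can be expressed in plain Standard SQL without any JS. This is only a rough sketch, not a drop-in replacement for your compromise/snowball pipeline, and the stop-word list here is purely illustrative:

WITH doc_table AS (
  SELECT 1 AS id, "A quick brown 20 fox fox fox jumped over the lazy-dog" AS doc UNION ALL
  SELECT 2, "another 23rd quicker browner fox jumping over Lazier broken! dogs."
),
tokens AS (
  SELECT
    id,
    token
  FROM
    doc_table,
    -- lower-case, strip everything that is not a letter, split on whitespace
    UNNEST(SPLIT(TRIM(REGEXP_REPLACE(LOWER(doc), r'[^a-z]+', ' ')), ' ')) AS token
  WHERE
    token != ''
    AND token NOT IN ('a', 'the', 'over', 'is', 'than')  -- illustrative stop-word list
)
SELECT
  id,
  token,
  COUNT(*) AS tf
FROM
  tokens
GROUP BY
  id,
  token

N-gram generation and stemming are harder to express in native SQL, so you would likely still keep the UDF for those parts, but the less work the JS worker has to do per row, the better.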

I would split your analysis into multiple steps, saving the result of each step into a new table:

  1. Clean the documents (e.g. remove stop words and punctuation).
  2. Create 1-, 2-, 3- and 4-grams out of the documents and then stem the tokens inside the n-grams.
  3. Calculate the tf-idf scores.

Side note: you might be able to run it using multiple CTEs to hold the stages instead of saving each step into a native table, but I do not know whether that would make the query exceed the resource limits.
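Here is a rough sketch of the staging idea. The dataset and table names (my_dataset, cleaned_docs, etc.) are placeholders, and it assumes the temp UDF from your question is defined in the same script (or registered as a persistent function):

-- Step 2: run the JS UDF exactly once and persist its output.
CREATE OR REPLACE TABLE my_dataset.ngram_table AS
SELECT
  id,
  doc,
  nlp_compromise_tokens(doc) AS compromise_tokens
FROM
  my_dataset.cleaned_docs;

-- Step 3: everything downstream is pure SQL over the saved table,
-- so the UDF never touches the full corpus again.
CREATE OR REPLACE TABLE my_dataset.tf_idf_table AS
WITH n_docs_table AS (
  SELECT COUNT(*) AS n_docs FROM my_dataset.ngram_table
),
df_table AS (
  SELECT compromise_token.ngram, COUNT(*) AS df
  FROM my_dataset.ngram_table, UNNEST(compromise_tokens) AS compromise_token
  GROUP BY ngram
)
SELECT
  t.id,
  t.doc,
  compromise_token.ngram,
  compromise_token.count AS tf,
  LN((1 + n.n_docs) / (1 + d.df)) + 1 AS idf_smooth,
  compromise_token.count * (LN((1 + n.n_docs) / (1 + d.df)) + 1) AS tf_idf
FROM
  my_dataset.ngram_table AS t,
  UNNEST(t.compromise_tokens) AS compromise_token
JOIN
  df_table AS d
ON
  compromise_token.ngram = d.ngram
CROSS JOIN
  n_docs_table AS n;

Whether this is fast enough at 244 GB I cannot say, but it at least isolates the UDF cost into one job and lets the scoring run as plain SQL over the materialized n-grams.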
