I have implemented a query in BigQuery that finds the top keywords for each document in a larger collection of documents using tf-idf scores.
Before calculating the tf-idf scores of the keywords, I clean the documents (e.g. remove stop words and punctuation), then create 1-, 2-, 3- and 4-grams out of the documents, and then stem the tokens inside the n-grams.
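For illustration, here is roughly what that pipeline does to a single document, sketched as a standalone Node.js script. The npm package names (compromise@11, snowball-stemmers) and the tiny inline stop-word list are assumptions for local testing only; the actual query loads the library builds from Cloud Storage:

// Standalone sketch of the clean -> n-gram -> stem pipeline.
// Assumes: npm install compromise@11 snowball-stemmers
// (compromise v11 matches the 11.14.0 build used in the UDF below;
// the stop-word list is a small stand-in for remove_stop_words.js).
const nlp = require('compromise');
const snowballFactory = require('snowball-stemmers');

const STOP_WORDS = new Set(['a', 'the', 'over', 'is', 'than']);
const stemmer = snowballFactory.newStemmer('english');

function removeStopWords(str) {
  return str
    .split(/\s+/)
    .filter(w => !STOP_WORDS.has(w.toLowerCase()))
    .join(' ');
}

const doc = 'A quick brown 20 fox fox fox jumped over the lazy-dog';

// 1- to 4-grams with per-document counts
const ngrams = nlp(removeStopWords(doc)).normalize().ngrams({ max: 4 }).data();

// Stem every space-separated token inside each n-gram
const result = ngrams.map(item => ({
  ngram: item.normal.split(' ').map(w => stemmer.stem(w)).join(' '),
  count: item.count,
}));

console.log(result);
// e.g. [ { ngram: 'fox', count: 3 }, { ngram: 'quick brown', count: 1 }, ... ]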
To perform this cleaning, n-gram creation and stemming inside BigQuery, I am using JavaScript libraries in a JS UDF. Here is the example query:
CREATE TEMP FUNCTION nlp_compromise_tokens(str STRING)
RETURNS ARRAY<STRUCT<ngram STRING, count INT64>> LANGUAGE js AS '''
  // Create 1-, 2-, 3- and 4-grams using compromise.js.
  // Stop words are removed first via .removeStopWords(),
  // a String method borrowed from remove_stop_words.js.
  var tokens_from_compromise = nlp(str.removeStopWords()).normalize().ngrams({max: 4}).data();

  // Stem each space-separated token inside an n-gram,
  // using the Snowball stemmer from snowball.babel.js.
  function stems_from_space_separated_string(tokens_string) {
    var stem = snowballFactory.newStemmer('english').stem;
    var splitted_tokens = tokens_string.split(" ");
    var splitted_stems = splitted_tokens.map(x => stem(x));
    return splitted_stems.join(" ");
  }

  // Return the internally stemmed n-grams from compromise,
  // alongside the count of each n-gram inside the document.
  var ngram_count = tokens_from_compromise.map(function(item) {
    return {
      ngram: stems_from_space_separated_string(item.normal),
      count: item.count
    };
  });
  return ngram_count;
'''
OPTIONS (
  library = ["gs://fh-bigquery/js/compromise.min.11.14.0.js",
             "gs://syed_mag/js/snowball.babel.js",
             "gs://syed_mag/js/remove_stop_words.js"]);
WITH doc_table AS (
  SELECT 1 id, "A quick brown 20 fox fox fox jumped over the lazy-dog" doc UNION ALL
  SELECT 2, "another 23rd quicker browner fox jumping over Lazier broken! dogs." UNION ALL
  SELECT 3, "This dog is more than two-feet away." #UNION ALL
),
ngram_table AS (
  SELECT
    id,
    doc,
    nlp_compromise_tokens(doc) AS compromise_tokens
  FROM
    doc_table
),
n_docs_table AS (
  SELECT COUNT(*) AS n_docs FROM ngram_table
),
df_table AS (
  SELECT
    compromise_token.ngram,
    COUNT(*) AS df
  FROM
    ngram_table, UNNEST(compromise_tokens) AS compromise_token
  GROUP BY
    ngram
),
idf_table AS (
  SELECT
    ngram,
    df,
    n_docs,
    LN((1 + n_docs) / (1 + df)) + 1 AS idf_smooth
  FROM
    df_table
  CROSS JOIN
    n_docs_table
),
tf_idf_table AS (
  SELECT
    id,
    doc,
    compromise_token.ngram,
    compromise_token.count AS tf,
    idf_table.idf_smooth,
    compromise_token.count * idf_table.idf_smooth AS tf_idf
  FROM
    ngram_table, UNNEST(compromise_tokens) AS compromise_token
  JOIN
    idf_table
  ON
    compromise_token.ngram = idf_table.ngram
)
SELECT
  id,
  ARRAY_AGG(STRUCT(ngram, tf_idf)) AS top_keyword,
  doc
FROM (
  SELECT
    id,
    doc,
    ngram,
    tf_idf,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY tf_idf DESC) AS rn
  FROM
    tf_idf_table
)
WHERE
  rn < 5
GROUP BY
  id,
  doc
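To make the scoring concrete: across the three sample documents, the stemmed unigram "fox" appears in documents 1 and 2, so df = 2 and n_docs = 3, which gives idf_smooth = LN((1+3)/(1+2)) + 1 ≈ 1.288. With tf = 3 in document 1 ("fox fox fox"), its tf-idf there is about 3.86. The same arithmetic in JavaScript:

// Smoothed idf exactly as computed in idf_table:
// LN((1 + n_docs) / (1 + df)) + 1
function idfSmooth(nDocs, df) {
  return Math.log((1 + nDocs) / (1 + df)) + 1;
}

const idf = idfSmooth(3, 2);   // "fox": in 2 of 3 docs -> ~1.2877
console.log(3 * idf);          // tf = 3 in doc 1 -> tf-idf ~3.8630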
Here is what the example output looks like:

There were only three handmade sample rows in this example.
When I try the same code on a slightly larger table with 1,000 rows, it again works fine, although it takes quite a bit longer to finish (around 6 minutes for only 1,000 rows). This sample table (1 MB) can be found here in JSON format.
Now, when I try the query on a larger dataset (159K rows, 155 MB), the query fails after around 30 minutes with the following message:
Errors: User-defined function: UDF worker timed out during execution.; Unexpected abort triggered for worker worker-109498: job_timeout (error code: timeout)
Can I improve my UDF, or the overall query structure, to make sure it runs smoothly on even larger datasets (124,783,298 rows, 244 GB)?
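One detail in the UDF that might be relevant here: snowballFactory.newStemmer('english') is currently instantiated inside stems_from_space_separated_string, i.e. once per n-gram. A possible micro-optimisation (a sketch, not a verified fix for the timeout) would be to hoist it so the stemmer is created once per document:

// Sketch: build the Snowball stemmer once per UDF invocation
// instead of once per n-gram.
var stem = snowballFactory.newStemmer('english').stem;

function stems_from_space_separated_string(tokens_string) {
  return tokens_string
    .split(" ")
    .map(function(x) { return stem(x); })
    .join(" ");
}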
N.B. I have set the appropriate permissions on the JS files in Google Cloud Storage, so these scripts are publicly readable and anyone can run the example queries.