This is the last step in completing an important data pipeline. We have the following newline-delimited JSON, which we exported from BigQuery into GCS and then downloaded locally:

{"name":"Terripins","fga":"42","fgm":"28","fgPct":0.67}
{"gameTime":"2019-01-12 12:00:00 UTC","gameDate":"2019-01-12","updated":"2019-01-12 20:25:03 UTC","isHome":true,"name":"","fga":"0","fgm":"0"}
{"gameTime":"2019-01-12 12:00:00 UTC","gameDate":"2019-01-12","updated":"2019-01-12 20:25:03 UTC","isHome":true,"name":"Crusaders","fga":"54","fgm":"33","fgPct":0.61}
{"gameTime":"2019-01-12 12:00:00 UTC","gameDate":"2019-01-12","updated":"2019-01-12 20:25:03 UTC","isHome":false,"name":"Greyhounds","fga":"54","fgm":"33","fgPct":0.61}
{"gameTime":"2019-01-12 12:00:00 UTC","gameDate":"2019-01-12","updated":"2019-01-12 20:25:03 UTC","isHome":false,"name":"Greyhounds","fga":"68","fgm":"20","fgPct":0.29}
{"gameTime":"2019-01-12 12:00:00 UTC","gameDate":"2019-01-12","updated":"2019-01-12 20:25:03 UTC","isHome":true,"name":"Crusaders","fga":"68","fgm":"20","fgPct":0.29}

We mongoimport this into our MongoDB cluster, and the collection is created successfully:

[Image: the created collection]

Unfortunately, when we export the JSON from BigQuery, the integer types are converted into strings (see fga, fgm), and the date columns are also converted into strings. This image shows the original schema from BigQuery.

[Image: original BigQuery schema]

We are trying to use pymongo, the Python MongoDB client library, to convert fga and fgm into integer types. Presumably it is easier to (a) load the "stringified" JSON file into MongoDB and then use pymongo to update the types, rather than (b) fix the types directly in the JSON file before running mongoimport. So we are trying (a).

import pymongo

# ... connect to db and set "db"
our_collection = db["our_coll_name"]

# query and new values for the update
myquery = {}  # empty filter: match every document in the collection
newvalues = {"$set": {"fga": int(fga)}}  # attempt to change to int; fails because fga is not a Python variable

# and update
new_output = our_collection.update_many(myquery, newvalues)
print(new_output.modified_count, "documents updated.")

This doesn't work: int(fga) raises NameError: name 'fga' is not defined, and if we instead run int("fga"), we get ValueError: invalid literal for int() with base 10: 'fga'.

These errors both make complete sense to us, but we're still unsure then of how to update fga and fgm in this example to int. Also, are there mongo-specific date and timestamp types we can use for the 3 fields [gameTime, gameDate, updated], and how can we make these conversions as well using pymongo?

1 Answer

Assuming MongoDB 4.2 or later, which accepts an aggregation pipeline as the update argument.

Use MongoDB's $toInt and $toDate aggregation operators.

I've split these into separate commands for clarity, but you could run them in a single update_many() if you prefer.

our_collection.update_many({}, [{'$set': {'fga': {'$toInt': '$fga'}}}])
our_collection.update_many({}, [{'$set': {'fgm': {'$toInt': '$fgm'}}}])
our_collection.update_many({}, [{'$set': {'gameTime': {'$toDate': '$gameTime'}}}])
our_collection.update_many({}, [{'$set': {'gameDate': {'$toDate': '$gameDate'}}}])
our_collection.update_many({}, [{'$set': {'updated': {'$toDate': '$updated'}}}])
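
As a rough sketch of the single-call variant mentioned above, all five conversions can go into one $set stage. The helper name build_conversion_stage is made up for illustration and is not part of pymongo; it only builds the pipeline, and the commented-out call at the end assumes the same our_collection and a 4.2+ server.

```python
def build_conversion_stage(int_fields, date_fields):
    """Build one $set stage that converts the named string fields in place."""
    conversions = {}
    for field in int_fields:
        # "$fga" refers to the current value of the fga field in each document
        conversions[field] = {"$toInt": f"${field}"}
    for field in date_fields:
        conversions[field] = {"$toDate": f"${field}"}
    return {"$set": conversions}


# Hypothetical helper applied to the fields from the question
pipeline = [build_conversion_stage(["fga", "fgm"], ["gameTime", "gameDate", "updated"])]

# With a live connection (MongoDB 4.2+), this would be applied as:
#   result = our_collection.update_many({}, pipeline)
#   print(result.modified_count, "documents updated.")
```

One caveat, as I understand the conversion operators: a missing input converts to null, so a document that lacks a field (the first sample row has no gameTime) would probably end up with that field set to null. If that matters, a filter such as {'gameTime': {'$type': 'string'}} on a per-field update avoids touching those documents.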

Documentation:

https://docs.mongodb.com/manual/reference/operator/aggregation/toInt/
https://docs.mongodb.com/manual/reference/operator/aggregation/toDate/
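
If the server is older than 4.2, update_many() will not accept a pipeline, so one fallback is to convert the values in Python and write them back per document. This is only a sketch under assumptions: the helper names and the strptime formats are mine, inferred from the sample rows in the question, and it assumes every string value is well formed (an empty string in fga would raise ValueError). pymongo stores datetime.datetime values as BSON dates; BSON has no date-only type, so gameDate becomes midnight UTC here.

```python
from datetime import datetime, timezone

INT_FIELDS = ("fga", "fgm")
# strptime format per field; the trailing "UTC" in the export is matched literally
DATE_FORMATS = {
    "gameTime": "%Y-%m-%d %H:%M:%S UTC",
    "updated": "%Y-%m-%d %H:%M:%S UTC",
    "gameDate": "%Y-%m-%d",
}


def converted_fields(doc):
    """Return typed values for the fields in doc that are still strings."""
    changes = {}
    for field in INT_FIELDS:
        if isinstance(doc.get(field), str):
            changes[field] = int(doc[field])
    for field, fmt in DATE_FORMATS.items():
        if isinstance(doc.get(field), str):
            parsed = datetime.strptime(doc[field], fmt)
            # The export is in UTC, so attach the UTC timezone explicitly
            changes[field] = parsed.replace(tzinfo=timezone.utc)
    return changes


def convert_collection(collection):
    """Update each document that has string-typed fields; return how many were updated."""
    updated = 0
    for doc in collection.find({}):
        changes = converted_fields(doc)
        if changes:
            collection.update_one({"_id": doc["_id"]}, {"$set": changes})
            updated += 1
    return updated
```

Calling convert_collection(our_collection) would then do the conversion. Because only string values are touched, running it a second time should be a no-op. For a large collection, batching the writes with bulk_write() would likely be faster than one update_one() per document.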

2 Comments

I think we have 4.0, although per the docs it seems like these work for version 4.0 as well - will give these a try shortly
Those operators are aggregation operators and will not work in an update operation.
