In the following document collection, I am trying to find the total word count of the unique sentences. The total should come out as 5 (hello\nworld, how are you?) + 5 (hello world, I am fine) + 3 (Is it raining?) + 5 (Look at the beautiful tiger!) = 18.

[
    {
        "sourceList": [
        {
            "source": "hello\nworld, how are you?",
            "_id": ObjectId("5f0eb9946db57c0007841153")
        },
        {
            "source": "hello world, I am fine",
            "_id": ObjectId("5f0eb9946db57c0007841153")
        },
        {
            "source": "Is it raining?",
            "_id": ObjectId("5f0eb9946db57c0007841153")
        }
        ]
    },
    {
        "sourceList": [
        {
            "source": "Look at the beautiful tiger!",
            "_id": ObjectId("5f0eb9946db57c0007841153")
        },
        {
            "source": "Is it raining?",
            "_id": ObjectId("5f0eb9946db57c0007841153")
        }
        ]
    }
]

But with the below query

    db.collection.aggregate([
    {
        "$unwind": "$sourceList"
    },
    {
        $project: {
        "sp": {
            $split: [
                "$sourceList.source",
                "\n"
            ],
            $split: [
                "$sourceList.source",
                " "
            ]
        }
        }
    },
    {
        "$group": {
            "_id": null,
            "elements": {
                $addToSet: "$sp"
            }
        }
    },
    {
        "$unwind": "$elements"
    },
    {
        "$project": {
            "sizes": {
                "$size": "$elements"
            }
        }
    },
    {
        "$group": {
            "_id": null,
            "count": {
                "$sum": "$sizes"
            }
        }
    }
])

it gives 17. What could be the reason for this? I am trying to split first by \n and then by space.

EDIT

I am trying to find word count for unique sentences and total unique sentences.

  • Are you looking for total unique sentences or total unique words or both? Commented Aug 5, 2020 at 7:53
  • @Gibbs Total Unique Sentences and Total Unique words Commented Aug 5, 2020 at 8:04
  • And you are expecting 4 and 18? Commented Aug 5, 2020 at 8:09
  • @Gibbs Yes..... I scrape a website, insert all text into mongo. After that I need to calculate total word count for unique sentences and also total unique sentences. Commented Aug 5, 2020 at 8:12
  • @Gibbs Please note that I am attempting word count for unique sentences and not unique word count for unique sentences Commented Aug 5, 2020 at 8:14

2 Answers

The problem is that here:

"sp": {
    $split: [
        "$sourceList.source",
        "\n"
    ],
    $split: [
        "$sourceList.source",
        " "
    ]
}

Only the second $split is actually evaluated by MongoDB, so hello\nworld stays a single string. There is no such "cascade" syntax: both entries use the same key, $split, inside one object, so the last one wins.
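A minimal sketch in plain JavaScript (no MongoDB involved) of why the last key wins and where 17 comes from. The names `sp`, `sentence`, and `tokens` are only illustrative, and the sketch assumes the pipeline is typed into the mongo shell, where it is read as an ordinary JavaScript object literal.

```javascript
// In an object literal, a repeated key keeps only its last value.
const sp = {
  $split: ["$sourceList.source", "\n"],
  $split: ["$sourceList.source", " "],
};
console.log(Object.keys(sp).length); // 1
console.log(sp.$split[1] === " ");   // true

// Splitting on spaces alone leaves "hello\nworld," as a single token.
const sentence = "hello\nworld, how are you?";
const tokens = sentence.split(" ");
console.log(tokens.length);          // 4
```

With four tokens instead of five for that sentence, the total becomes 4 + 5 + 3 + 5 = 17.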

In order to fix that you can use $reduce to apply $split by whitespace on an array of split by \n values:

{
    $project: {
        "sp": {
            $reduce: {
                input: { $split: [ "$sourceList.source", "\n" ] },
                initialValue: [],
                in: { $concatArrays: [ "$$value", { $split: [ "$$this", " " ] } ] }
            }
        }
    }
}
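To make the shape of that $reduce easier to follow, here is a rough plain-JavaScript equivalent. The helper name `splitWords` is hypothetical; `value` and `part` play the roles of `$$value` and `$$this`.

```javascript
// Split on newlines first, then split each piece on spaces and
// concatenate the results into one flat array of words.
function splitWords(source) {
  return source
    .split("\n")
    .reduce((value, part) => value.concat(part.split(" ")), []);
}

console.log(splitWords("hello\nworld, how are you?").length); // 5
console.log(splitWords("Is it raining?").length);             // 3
```

As with $split, consecutive spaces would produce empty strings in the result, so this sketch assumes single spaces between words.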

Mongo Playground


5 Comments

How is your query different from one posted by @gibbs? I am trying to find out word count of unique sentences and also the total unique sentences
@Amanda mine was added first and my goal was to answer why the number is different than 18 which was your initial question. Not sure what happened then :-)
Also added in the edit, I am trying to find word count for unique sentences and total unique sentences. Does this query give the word count for unique sentences?
Yep, @micki is correct. He resolved your primary problem. As per your comment, I resolved both of your count issues. I wish I could approve this answer :)
@Gibbs there's enough reputation for both of us in this question :-)

As per the comments, and in addition to @micki's answer and my previous answer:

play

db.collection.aggregate([
  {
    "$unwind": "$sourceList"
  },
  {
    $project: {
      "sp": {
        $reduce: {
          input: {
            $split: [
              "$sourceList.source",
              "\n"
            ]
          },
          initialValue: [],
          in: {
            $concatArrays: [
              "$$value",
              {
                $split: [
                  "$$this",
                  " "
                ]
              }
            ]
          }
        }
      }
    }
  },
  {
    "$group": {
      "_id": null,
      "elements": {
        $addToSet: "$sp"
      }
    }
  },
  {
    "$project": {
      "unique_sen": {
        "$size": "$elements"
      },
      "elements": 1
    }
  },
  {
    "$unwind": "$elements"
  },
  {
    "$project": {
      "sizes": {
        "$size": "$elements"
      },
      "unique_sen": 1
    }
  },
  {
    "$group": {
      "_id": null,
      "unique_count": {
        "$sum": "$sizes"
      },
      "data": {
        $push: "$$ROOT"
      }
    }
  },
  {
    "$project": {
      "unique_count": 1,
      "unique_sen": {
        $first: "$data.unique_sen"
      }
    }
  }
])
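The pipeline above can be mirrored in plain JavaScript as a way to check the expected numbers against the sample data from the question. This is only a sketch: `docs` and `summarize` are hypothetical names, and `JSON.stringify` is used here as a stand-in for the way $addToSet treats identical word arrays as duplicates.

```javascript
const docs = [
  { sourceList: [
    { source: "hello\nworld, how are you?" },
    { source: "hello world, I am fine" },
    { source: "Is it raining?" },
  ] },
  { sourceList: [
    { source: "Look at the beautiful tiger!" },
    { source: "Is it raining?" },
  ] },
];

function summarize(docs) {
  const unique = new Map(); // serialized word array -> word array
  for (const doc of docs) {
    for (const item of doc.sourceList) {
      // Same order as the pipeline: split on "\n", then on " ".
      const words = item.source.split("\n").flatMap((part) => part.split(" "));
      unique.set(JSON.stringify(words), words);
    }
  }
  let uniqueCount = 0;
  for (const words of unique.values()) uniqueCount += words.length;
  return { unique_count: uniqueCount, unique_sen: unique.size };
}

console.log(summarize(docs)); // { unique_count: 18, unique_sen: 4 }
```

Note that uniqueness is decided on the word arrays, so two sentences that differ only in a newline versus a space would count as one.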

Update:

You don't need to escape the newline (\n) in the query.

play

db.collection.aggregate([
  {
    "$match": {
      "url": "https://www.rootsresource.in"
    }
  },
  {
    "$unwind": "$translations"
  },
  {
    $project: {
      "sp": {
        $reduce: {
          input: {
            $split: [
              "$translations.source",
              "\n"
            ]
          },
          initialValue: [],
          in: {
            $concatArrays: [
              "$$value",
              {
                $split: [
                  "$$this",
                  " "
                ]
              }
            ]
          }
        }
      }
    }
  },
  {
    "$group": {
      "_id": null,
      "elements": {
        $addToSet: "$sp"
      }
    }
  },
  {
    "$project": {
      "unique_sen": {
        "$size": "$elements"
      },
      "elements": 1
    }
  },
  {
    "$unwind": "$elements"
  },
  {
    "$project": {
      "sizes": {
        "$size": "$elements"
      },
      "unique_sen": 1
    }
  },
  {
    "$group": {
      "_id": null,
      "unique_count": {
        "$sum": "$sizes"
      },
      "data": {
        $push: "$$ROOT"
      }
    }
  },
  {
    "$project": {
      "unique_count": 1,
      "unique_sen": {
        $first: "$data.unique_sen"
      }
    }
  }
])

UPDATE:

The above query works from MongoDB 4.4 onwards, because $first is only available as an expression in $project from 4.4.

For older versions:

db.test.aggregate([
  {
    "$match": {
      url: "https://www.rootsresource.in"
    }
  },
  {
    "$unwind": "$translations"
  },
  {
    $project: {
      "sp": {
        $reduce: {
          input: {
            $split: [
              "$translations.source",
              "\n"
            ]
          },
          initialValue: [],
          in: {
            $concatArrays: [
              "$$value",
              {
                $split: [
                  "$$this",
                  " "
                ]
              }
            ]
          }
        }
      }
    }
  },
  {
    "$group": {
      "_id": null,
      "elements": {
        $addToSet: "$sp"
      }
    }
  },
  {
    "$project": {
      "unique_sen": {
        "$size": "$elements"
      },
      "elements": 1
    }
  },
  {
    "$unwind": "$elements"
  },
  {
    "$project": {
      "sizes": {
        "$size": "$elements"
      },
      "unique_sen": 1
    }
  },
  {
    "$group": {
      "_id": null,
      "unique_count": {
        "$sum": "$sizes"
      },
      "data": {
        $push: "$$ROOT"
      }
    }
  },
  {
    "$project": {
      "unique_count": 1,
      unique_sen: { $arrayElemAt: [ "$data.unique_sen", 0 ] }
    }
  }
])
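As a rough illustration of the final stage, here is what the two variants of the last $project amount to, written in plain JavaScript. The `data` array is a hypothetical stand-in for what `$push: "$$ROOT"` would collect for the sample documents (element order may differ in practice). The path "$data.unique_sen" resolves to an array of that field's values, and both $first and $arrayElemAt with index 0 pick its first element.

```javascript
// Hypothetical contents of "data" after the second $group.
const data = [
  { _id: null, sizes: 5, unique_sen: 4 },
  { _id: null, sizes: 5, unique_sen: 4 },
  { _id: null, sizes: 3, unique_sen: 4 },
  { _id: null, sizes: 5, unique_sen: 4 },
];

// "$data.unique_sen" -> the unique_sen value of every element.
const uniqueSenValues = data.map((d) => d.unique_sen);

// $first / { $arrayElemAt: [..., 0] } -> the first of those values.
const uniqueSen = uniqueSenValues[0];

// { $sum: "$sizes" } -> total word count across unique sentences.
const uniqueCount = data.reduce((sum, d) => sum + d.sizes, 0);

console.log({ unique_count: uniqueCount, unique_sen: uniqueSen });
// { unique_count: 18, unique_sen: 4 }
```

Every element carries the same unique_sen value, which is why taking the first one is enough.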

18 Comments

Trying this directly in mongo gives an error. Screenshot: ibb.co/x1tnFL9
play - could you update the query with play which you are trying?
It provides when you remove match. Match error is different than the one in the image.
Not sure why. But I am running the query inside Robo3T
It says Unrecognized expression '$first'