I'm using Google's Speech-to-Text API to convert an audio file into text. The API can identify who is speaking (speaker diarization), which is really cool, but it returns the information in a format I'm having trouble working with. Here are their docs on separating out speakers.
My goal is a single string with one line per speaker turn, something like this:
Speaker1: Hello Tom
Speaker2: Howdy
Speaker1: How was your weekend
If I send an audio file to get transcribed, I get back something like this:
const wordsObjects = [
  {
    startTime: { seconds: '1' },
    endTime: { seconds: '1' },
    word: 'Hello',
    speakerTag: 1
  },
  {
    startTime: { seconds: '2' },
    endTime: { seconds: '2' },
    word: 'Tom',
    speakerTag: 1
  }
];
Of course there's an object for every word in the recording; I've truncated the array to save space. Anything Tom says in this example would come back with speakerTag: 2.
Here's the closest I've gotten so far:
const unformattedTranscript = wordsObjects.map((currentWord, idx, arr) => {
  if (arr[idx + 1]) {
    if (currentWord.speakerTag === arr[idx + 1].speakerTag) {
      return [currentWord.word, arr[idx + 1].word];
    } else {
      return ["SPEAKER CHANGE"];
    }
  }
});
const formattedTranscript = unformattedTranscript.reduce(
  (acc, wordArr, idx, arr) => {
    if (arr[idx + 1]) {
      if (wordArr[wordArr.length - 1] === arr[idx + 1][0]) {
        wordArr.pop();
        acc.push(wordArr.concat(arr[idx + 1]));
      } else {
        acc.push(["\n"]);
      }
    }
    return acc;
  },
  []
);
This solution does not work if a speaker says more than two words consecutively. I've managed to confuse myself thoroughly on this one, so I'd love to be nudged in the right direction.
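For what it's worth, here's a sketch of the single-pass reduce I suspect I need instead of the two-pass map/reduce above: extend the current line while the speakerTag matches the previous word, and start a new line when it changes. The function name formatTranscript is my own, and I haven't verified this against a real API response, only against sample data shaped like the above.

```javascript
// Group consecutive words by speakerTag, then render one line per speaker turn.
const formatTranscript = (words) =>
  words
    .reduce((lines, { word, speakerTag }) => {
      const last = lines[lines.length - 1];
      if (last && last.speakerTag === speakerTag) {
        last.words.push(word); // same speaker: extend the current line
      } else {
        lines.push({ speakerTag, words: [word] }); // speaker changed: start a new line
      }
      return lines;
    }, [])
    .map(({ speakerTag, words }) => `Speaker${speakerTag}: ${words.join(' ')}`)
    .join('\n');

// Sample data in the shape the API returns (timestamps are mine):
const wordsObjects = [
  { startTime: { seconds: '1' }, endTime: { seconds: '1' }, word: 'Hello', speakerTag: 1 },
  { startTime: { seconds: '2' }, endTime: { seconds: '2' }, word: 'Tom', speakerTag: 1 },
  { startTime: { seconds: '3' }, endTime: { seconds: '3' }, word: 'Howdy', speakerTag: 2 }
];

console.log(formatTranscript(wordsObjects));
// Speaker1: Hello Tom
// Speaker2: Howdy
```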
Thanks in advance for any advice.