How can I extract proper nouns and numeric values from a string using PHP or JavaScript? For example, given a string like
Xyz visited this page this page 53 mins ago.
I want to be able to recognize "Xyz" as a proper noun and "53" as a numeric value.
The one obvious way is to have a dictionary of proper nouns and some good indexing to search through it quickly, if such a dictionary exists.
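A minimal sketch of the dictionary approach, assuming the list of known proper nouns fits in memory. The `properNouns` set and the `findKnownProperNouns` helper are made up for illustration; a `Set` gives constant-time lookups, which is about as simple as "good indexing" gets.

```javascript
// Hypothetical dictionary of known proper nouns; a real one would be far larger.
const properNouns = new Set(["Xyz", "Bim", "Verdier"]);

// Return the words of `text` that appear in the dictionary.
function findKnownProperNouns(text, dictionary) {
  // Split on anything that is not an ASCII letter (an assumption of this sketch).
  const words = text.split(/[^A-Za-z]+/).filter(Boolean);
  return words.filter((word) => dictionary.has(word));
}

console.log(findKnownProperNouns("Xyz visited this page 53 mins ago.", properNouns));
// → ["Xyz"]
```

This only finds names that are already in the dictionary, so it does nothing for names it has never seen.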
But I get the feeling you are looking for a way to grammatically infer that a word is a proper noun.
I can't think of any perfect way to do this, but if you created a series of rules, you could use these to parse a passage.
Rules might include:
* Words that end with "ly" are not proper nouns.
* Noise words such as "and", "to", "but", etc. are not proper nouns.
* Words that have capital letters but don't start a sentence are proper nouns.
To improve it, you could use these rules to build a dictionary of proper nouns. Every time a word matches one of these rules, it either gets added to or deleted from the proper nouns dictionary.
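The rules above could be sketched roughly like this. The function name `guessProperNouns`, the `NOISE_WORDS` list, and the sentence splitting on `.`, `!` and `?` are all assumptions for the sake of the example, not a tested grammar.

```javascript
// Hypothetical noise-word list; extend as needed.
const NOISE_WORDS = new Set(["and", "to", "but", "the", "a", "of"]);

function guessProperNouns(text) {
  const found = [];
  // Naive sentence split; abbreviations such as "Dr." will confuse it.
  const sentences = text.split(/[.!?]+/);
  for (const sentence of sentences) {
    const words = sentence.split(/\s+/).filter(Boolean);
    words.forEach((rawWord, index) => {
      // Strip punctuation stuck to either end of the word.
      const word = rawWord.replace(/^[^A-Za-z]+|[^A-Za-z]+$/g, "");
      if (word === "") return;
      const lower = word.toLowerCase();
      if (NOISE_WORDS.has(lower)) return; // rule: noise words are not proper nouns
      if (lower.endsWith("ly")) return;   // rule: words ending in "ly" are not proper nouns
      // Rule: capitalized and not the first word of the sentence.
      if (index > 0 && /^[A-Z]/.test(word)) found.push(word);
    });
  }
  return found;
}

console.log(guessProperNouns("Yesterday Xyz visited this page. We think Bim de Verdier visited it quickly."));
// → ["Xyz", "Bim", "Verdier"]
```

Note that a name at the very start of a sentence, as in the question's example, is skipped by the third rule, so the rules would need a fallback for that case.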
This is very rough. If this is on the right track, then perhaps I can be more specific.
If there is always exactly one proper noun in the sentence, then you could find it by looking for the word that begins with a capital letter. If no word is capitalized except the first one, then the first word is it. Problems arise if Xyz is named Bim de Verdier or if the name is not actually capitalized.
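A rough sketch of that heuristic, assuming one name per sentence. The helper name `findSingleProperNoun` is made up for illustration.

```javascript
function findSingleProperNoun(sentence) {
  const words = sentence.split(/\s+/).filter(Boolean);
  if (words.length === 0) return null;
  const isCapitalized = (w) => /^[A-Z]/.test(w);
  // Prefer a capitalized word that is not the first word of the sentence.
  let choice = words.slice(1).find(isCapitalized);
  // Otherwise fall back to the first word, but only if it is capitalized.
  if (choice === undefined && isCapitalized(words[0])) choice = words[0];
  if (choice === undefined) return null;
  // Drop trailing punctuation such as a full stop or comma.
  return choice.replace(/[^A-Za-z]+$/, "");
}

console.log(findSingleProperNoun("Xyz visited this page 53 mins ago.")); // → "Xyz"
console.log(findSingleProperNoun("we thought Bim de Verdier visited"));  // → "Bim"
```

The second call shows the multi-word problem: only "Bim" comes back, not the full name.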
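    // Get the number with JavaScript and a regular expression.
    // A regex literal is used here: in the string "\d+" the backslash is
    // dropped, which leaves the pattern "d+" and matches the letter "d".
    var regex = /\d+/;
    var match = regex.exec("Xyz visited this page this page 53 mins ago.");
    if (match === null) {
        alert("No match");
    } else {
        var s = "";
        for (var i = 0; i < match.length; i++) {
            s = s + match[i] + "\n";
        }
        alert(s);
    }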
A capitalized word can be matched with "[A-Z][a-z]+[ ]".
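As a sketch, both patterns can be run over the whole string with the global flag. This version swaps the trailing `[ ]` for `\b` word boundaries so that a capitalized word followed by punctuation still matches; that swap and the helper name `extractTokens` are my own assumptions.

```javascript
// Hypothetical helper: returns the capitalized words and the numbers found in `text`.
function extractTokens(text) {
  // A capital letter followed by one or more lowercase letters, as a whole word.
  const capitalized = text.match(/\b[A-Z][a-z]+\b/g) || [];
  // Runs of digits, converted from strings to numbers.
  const numbers = (text.match(/\d+/g) || []).map(Number);
  return { capitalized, numbers };
}

const result = extractTokens("Xyz visited this page this page 53 mins ago.");
console.log(result.capitalized); // → ["Xyz"]
console.log(result.numbers);     // → [53]
```

The capitalized-word pattern will also pick up the first word of any ordinary sentence, so it only identifies names when the input is known to start with one.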
The PHP functions is_numeric and ucfirst may help recognize the words:
    function parse_name_and_number($sentence) {
        $words = explode(' ', $sentence);
        $name = array();
        $number = null; // stays null when the sentence contains no number
        foreach ($words as $word) {
            if (is_numeric($word)) {
                $number = $word;
            } elseif ($word == ucfirst($word)) {
                // ucfirst() changes nothing when the word does not start with a lowercase letter
                $name[] = $word;
            }
        }
        $name = implode(' ', $name);
        return array('name' => $name, 'number' => $number);
    }

    print_r(parse_name_and_number('Xyz visited this page 53 minutes ago'));
    // output: Array ( [name] => Xyz [number] => 53 )
    print_r(parse_name_and_number('we thought Bim de Verdier visited the page 5 seconds ago'));
    // output: Array ( [name] => Bim Verdier [number] => 5 )
    print_r(parse_name_and_number('Weirder input messes up the results'));
    // output: Array ( [name] => Weirder [number] => )
Xyz visited this page this page 53 mins ago.
Now, just get the position of "visited this page" or whatever follows the name, and that is the length of the name counted from the beginning of the sentence. If, for instance, "Person " is always at the beginning, then just set the starting point to 7 and subtract 7 from that position. Here's a quick JS example:
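    var str = "Person Xyz visited this page 53 mins ago."; // example input with the "Person " prefix
    alert(str.substr(7, str.indexOf(" visited") - 7)); // search includes the leading space so the result has no trailing space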
Which should return "Xyz". Hope that helps. Of course, this only works if you know the structure of your sentence, which would be the case in the example given.
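If the structure really is fixed, a single regular expression with capture groups can pull out both the name and the number in one go. This is only a sketch: the template "<name> visited this page <number> mins ago." and the function name `parseVisit` are assumptions, and the pattern would need adjusting for any other wording.

```javascript
// Hypothetical parser for sentences shaped like "<name> visited this page <number> mins ago."
function parseVisit(text) {
  // Group 1: everything before " visited this page "; group 2: the digits.
  const match = /^(.+?) visited this page (\d+) mins ago\.?$/.exec(text);
  if (match === null) return null; // the sentence does not follow the template
  return { name: match[1], minutes: Number(match[2]) };
}

console.log(parseVisit("Xyz visited this page 53 mins ago."));
// → { name: "Xyz", minutes: 53 }
console.log(parseVisit("Bim de Verdier visited this page 5 mins ago."));
// → { name: "Bim de Verdier", minutes: 5 }
```

Unlike the capital-letter heuristics, this handles multi-word and lowercase names, but it returns `null` as soon as the wording differs from the template.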
P.S. I know I'm two years late, but this might help someone in the future.