I am trying to add the Brazilian CPF as an entity in my NER app using spaCy. My current code is as follows:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("pt_core_news_sm")

text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "CPF", "pattern": [{"SHAPE": "ddd.ddd.ddd-dd"}]},
]

ruler.add_patterns(patterns)
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

The result was only:

João PER
Bahia LOC

I tried using regex too:

{"label": "CPF", "pattern": [{"TEXT": {"REGEX": r"^\d{3}\.\d{3}\.\d{3}\-\d{2}$"}}]},

But that did not work either.

How can I fix this so that the CPF is extracted?

1 Answer

Looking at the token boundaries, the Portuguese tokenizer splits the CPF into two tokens, so a pattern that expects it in a single token never matches:

token_spacings = [token.text_with_ws for token in doc]

Result:

['João ', 'mora ', 'na ', 'Bahia', ', ', '22/11/1985', ', ', 'seu ', 'cpf ', 'é ', '111.222.', '333-11']
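
To make the failure concrete, here is a small standard-library sketch (no spaCy needed) that applies the regex from the question to the token texts listed above. The token list is copied by hand from that output with trailing whitespace stripped, and the variable names are only illustrative.

```python
import re

# Token texts copied from the tokenizer output above (whitespace stripped).
tokens = ["João", "mora", "na", "Bahia", ",", "22/11/1985", ",",
          "seu", "cpf", "é", "111.222.", "333-11"]

# The regex from the question, anchored to a whole token.
cpf_re = re.compile(r"^\d{3}\.\d{3}\.\d{3}-\d{2}$")

# No single token matches, because the CPF is spread over two tokens.
single_matches = [t for t in tokens if cpf_re.match(t)]

# Joining each adjacent pair of tokens recovers the full CPF.
pair_matches = [a + b for a, b in zip(tokens, tokens[1:]) if cpf_re.match(a + b)]

print(single_matches)  # []
print(pair_matches)    # ['111.222.333-11']
```

Since each dict in an entity ruler pattern describes one token, a one-token pattern cannot cover a value that the tokenizer has split in two.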

So I think you could match the two tokens with one pattern entry each:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("pt_core_news_sm")

text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "CPF", "pattern": [
            {"SHAPE": "ddd.ddd."},
            {"SHAPE": "ddd-dd"},
    ]},
]

ruler.add_patterns(patterns)
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)
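
If the tokenizer ever splits the number differently, one possible fallback is to search the raw text instead of the tokens. The sketch below is an assumption on my part rather than part of the approach above: it uses only the standard library, with an unanchored variant of the question's regex, and collects character offsets. Those offsets could then be mapped onto a spaCy doc, for example with `doc.char_span`.

```python
import re

text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"

# Unanchored variant of the question's regex; the lookarounds keep it from
# matching inside a longer run of digits.
cpf_re = re.compile(r"(?<!\d)\d{3}\.\d{3}\.\d{3}-\d{2}(?!\d)")

# Collect (start, end, matched text) for every CPF-looking substring.
spans = [(m.start(), m.end(), m.group()) for m in cpf_re.finditer(text)]

for start, end, value in spans:
    print(start, end, value)
```

Note that this only checks the format; it does not validate the CPF check digits.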