How to create an Entity Ruler pattern that includes a dot and a hyphen?

Question

I am trying to add the Brazilian CPF as an entity in my NER app using spaCy. The current code is as follows:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("pt_core_news_sm")

text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "CPF", "pattern": [{"SHAPE": "ddd.ddd.ddd-dd"}]},
]

ruler.add_patterns(patterns)
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

The result was only:

João PER
Bahia LOC

I tried using regex too:

{"label": "CPF", "pattern": [{"TEXT": {"REGEX": r"^\d{3}\.\d{3}\.\d{3}\-\d{2}$"}}]},

But that did not work either.

How can I fix this so that the CPF is extracted?

Gabriel Souto · Accepted Answer · 2023-06-10 15:38:44Z

Inspecting the tokens together with their trailing whitespace shows that the Portuguese tokenizer splits the CPF into two tokens:

token_spacings = [token.text_with_ws for token in doc]

Result:

['João ', 'mora ', 'na ', 'Bahia', ', ', '22/11/1985', ', ', 'seu ', 'cpf ', 'é ', '111.222.', '333-11']
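This token list explains why both attempts in the question fail: a SHAPE or REGEX entry is tested against one token at a time, so an expression written for the whole CPF never sees the whole string. Below is a minimal standard-library sketch of that effect, reusing the token texts from the output above. The names `FULL_CPF`, `FIRST_HALF`, and `SECOND_HALF` are illustrative and not part of spaCy.

```python
import re

# Token texts copied from the tokenizer output above (whitespace stripped).
tokens = ["João", "mora", "na", "Bahia", ",", "22/11/1985", ",",
          "seu", "cpf", "é", "111.222.", "333-11"]

# The regex from the question, written for the whole CPF.
FULL_CPF = re.compile(r"^\d{3}\.\d{3}\.\d{3}\-\d{2}$")

# Hypothetical per-token regexes, one for each piece the tokenizer produced.
FIRST_HALF = re.compile(r"^\d{3}\.\d{3}\.$")
SECOND_HALF = re.compile(r"^\d{3}-\d{2}$")

# Tested token by token, the full-CPF regex never matches.
full_matches = [t for t in tokens if FULL_CPF.search(t)]

# Adjacent token pairs where each piece matches its own regex.
pair_matches = [
    (a, b)
    for a, b in zip(tokens, tokens[1:])
    if FIRST_HALF.search(a) and SECOND_HALF.search(b)
]

print(full_matches)  # []
print(pair_matches)  # [('111.222.', '333-11')]
```

If a regex is preferred over SHAPE, the same two expressions could presumably be used as consecutive `{"TEXT": {"REGEX": ...}}` entries in the ruler pattern, mirroring the syntax tried in the question.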

So I think you can try a pattern with one entry per token:

import spacy

nlp = spacy.load("pt_core_news_sm")

text = "João mora na Bahia, 22/11/1985, seu cpf é 111.222.333-11"
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    # One entry per token: the tokenizer splits the CPF into "111.222." and "333-11"
    {"label": "CPF", "pattern": [
        {"SHAPE": "ddd.ddd."},
        {"SHAPE": "ddd-dd"},
    ]},
]

ruler.add_patterns(patterns)
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)
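To see why these two SHAPE values line up with the tokens, here is a simplified approximation of how a token shape is derived: digits become `d`, lowercase letters `x`, uppercase letters `X`, other characters are kept, and runs of the same shape character are capped at four. The function `approx_shape` is a hypothetical helper written for illustration, not spaCy's own implementation. In a real pipeline, `token.shape_` is the value to check.

```python
def approx_shape(text: str) -> str:
    """Rough, illustrative approximation of a token's SHAPE value."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation is kept as-is
        # Track the length of the current run of identical shape characters.
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Assumption: runs longer than four characters are truncated.
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


# The two pieces the tokenizer produced for the CPF.
for piece in ["111.222.", "333-11"]:
    print(piece, approx_shape(piece))
```

The whole-string shape `ddd.ddd.ddd-dd` from the question is a plausible shape in itself, but no single token carries it because the tokenizer splits the number first.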
