Complex data cleaning using regex in Python

Question

I have data in Devanagari that I need to extract some information from. Here are a few example lines:

तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1 <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि

<अभ्युदय-अर्थः>T4 अपि यः <प्रवृत्ति-लक्षणः>Bs6 धर्मः वर्णान् आश्रमान् च उद्दिश्य विहितः सः <<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6 अपि सन् <<ईश्वर-अर्पण>T6-बुद्ध्या>T6 अनुष्ठीयमानः <सत्त्व-शुद्धये>T6 भवति <<फल-अभिसन्धि>T6-वर्जितः>T3

The alphanumeric strings are tags on the text. I need to extract the binary compounds from each line along with their tags (the alphanumerics immediately after the compound). A binary compound is two hyphenated words inside one pair of angle brackets.

<अभ्युदय-अर्थः>

<गीता-शास्त्रम्>

<विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6

The first two are binary compounds, whereas the third is not. The simplest way to identify a binary compound is to look for two hyphenated words enclosed by a single pair of angle brackets and followed by a single tag. So after extraction from, say, the first line, I should get a list containing <गीता-शास्त्रम्>K7 and <दुर्विज्ञेय-अर्थम्>K1.

This is the code I tried:

import re
# f holds the input text
cw = re.findall(r'<(.*?)>', f)
tags = re.findall(r'[a-zA-Z0-9]+', f)
cc = re.sub(r'[<>a-zA-Z0-9]', '', f)
print(cw, tags, cc)

Unfortunately, this returns everything as separate lists, so I cannot map the tags back to their original compounds. Is there a more intuitive way to do this?

Do you mean you need re.findall(r'<([^<>]*)>(\w+)', text)? Or re.findall(r'<[^<>]*>\w+', text)?
– Wiktor Stribiżew, Commented May 31, 2021 at 13:53
These do the job perfectly, thank you! I am quite new to regex, so I am not very good at it yet.
– Adideva98, Commented May 31, 2021 at 13:57

Wiktor Stribiżew · Accepted Answer · 2021-06-09 11:38:36Z

3

You can use

re.findall(r'<([^<>]*)>(\w+)', text)

See the regex demo. Details:

<([^<>]*)> - <, then zero or more chars other than < and > captured into Group 1, and then >
(\w+) - Group 2: one or more word chars.
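As a minimal sketch of what the two groups return, re.findall yields one (compound, tag) tuple per match when the pattern has two capturing groups. The shortened sample string below is an assumption made for illustration; it is not the full text from the question.

```python
import re

# Hypothetical shortened sample in the question's format.
sample = "<गीता-शास्त्रम्>K7 <<अर्थ-निर्धारण>T6-अर्थम्>T4"

# Two capturing groups, so each match is a (compound, tag) tuple.
pairs = re.findall(r'<([^<>]*)>(\w+)', sample)

for compound, tag in pairs:
    print(compound, tag)
```

Only bracket pairs with no further brackets inside them can match, so the outer compound tagged T4 is skipped while the inner one tagged T6 is returned.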

See the Python demo:

import re
text = "तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1  <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि\n<अभ्युदय-अर्थः>T4 अपि यः <प्रवृत्ति-लक्षणः>Bs6 धर्मः वर्णान् आश्रमान् च उद्दिश्य विहितः  सः <<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6 अपि सन्  <<ईश्वर-अर्पण>T6-बुद्ध्या>T6 अनुष्ठीयमानः <सत्त्व-शुद्धये>T6 भवति <<फल-अभिसन्धि>T6-वर्जितः>T3"
matches = list(re.finditer(r'<([^<>]*)>(\w+)', text))
# Show overall matches and their positions:
for m in matches:
    print( "Match: ", m.group(), ", Start position: ", m.start(), sep="")
print("---")
# Show groups and their positions:
for m in matches:
    print( "Word: ", m.group(1), ", Word start position: ", m.start(1),
           ", Tag: ", m.group(2), ", Tag start position: ", m.start(2), sep="")

Output:

Match: <गीता-शास्त्रम्>K7, Start position: 9
Match: <समस्त-वेद>K1, Start position: 32
Match: <दुर्विज्ञेय-अर्थम्>K1, Start position: 80
Match: <तत्-अर्थ>T6, Start position: 105
Match: <पद-अर्थ>T6, Start position: 152
Match: <वाक्य-अर्थ>T6, Start position: 164
Match: <अत्यन्त-विरुद्ध>K1, Start position: 202
Match: <अनेक-अर्थ>K1, Start position: 222
Match: <अर्थ-निर्धारण>T6, Start position: 285
Match: <अभ्युदय-अर्थः>T4, Start position: 341
Match: <प्रवृत्ति-लक्षणः>Bs6, Start position: 366
Match: <देव-आदि>Bs6, Start position: 436
Match: <ईश्वर-अर्पण>T6, Start position: 489
Match: <सत्त्व-शुद्धये>T6, Start position: 530
Match: <फल-अभिसन्धि>T6, Start position: 555
---
Word: गीता-शास्त्रम्, Word start position: 10, Tag: K7, Tag start position: 25
Word: समस्त-वेद, Word start position: 33, Tag: K1, Tag start position: 43
Word: दुर्विज्ञेय-अर्थम्, Word start position: 81, Tag: K1, Tag start position: 100
Word: तत्-अर्थ, Word start position: 106, Tag: T6, Tag start position: 115
Word: पद-अर्थ, Word start position: 153, Tag: T6, Tag start position: 161
Word: वाक्य-अर्थ, Word start position: 165, Tag: T6, Tag start position: 176
Word: अत्यन्त-विरुद्ध, Word start position: 203, Tag: K1, Tag start position: 219
Word: अनेक-अर्थ, Word start position: 223, Tag: K1, Tag start position: 233
Word: अर्थ-निर्धारण, Word start position: 286, Tag: T6, Tag start position: 300
Word: अभ्युदय-अर्थः, Word start position: 342, Tag: T4, Tag start position: 356
Word: प्रवृत्ति-लक्षणः, Word start position: 367, Tag: Bs6, Tag start position: 384
Word: देव-आदि, Word start position: 437, Tag: Bs6, Tag start position: 445
Word: ईश्वर-अर्पण, Word start position: 490, Tag: T6, Tag start position: 502
Word: सत्त्व-शुद्धये, Word start position: 531, Tag: T6, Tag start position: 546
Word: फल-अभिसन्धि, Word start position: 556, Tag: T6, Tag start position: 568
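The question's expected output for the first line lists only <गीता-शास्त्रम्>K7 and <दुर्विज्ञेय-अर्थम्>K1, while the pattern above also returns compounds nested inside larger ones. If only standalone compounds are wanted, one possible variation adds lookarounds. This is a sketch and not part of the original answer; the name STANDALONE and the shortened sample line are illustrative assumptions.

```python
import re

# Assumption: a "standalone" compound is not nested, i.e. its '<' is not
# directly preceded by '<' or '-', and its tag is not directly followed
# by '-' or '>' (the \w in the lookahead stops the tag from being cut short).
STANDALONE = re.compile(r'(?<![<-])<([^<>]*)>(\w+)(?![\w>-])')

# Hypothetical shortened line built from pieces of the question's data.
line = ("<गीता-शास्त्रम्>K7 <<तत्-अर्थ>T6-आविष्करणाय>T6 "
        "<दुर्विज्ञेय-अर्थम्>K1 <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1")

standalone = STANDALONE.findall(line)
for compound, tag in standalone:
    print(compound, tag)
```

On this sample only the two unnested compounds are returned; every compound that sits inside a larger one is rejected by the lookbehind or the lookahead.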

edited Jun 9, 2021 at 11:38

answered May 31, 2021 at 13:59

Wiktor Stribiżew


7 Comments

Adideva98 Over a year ago

I have another question. If I wanted the position of the compound in the sentence to be displayed along with these data, how would I do that?

Wiktor Stribiżew Over a year ago

@Adideva98 Please see the updated answer. Please consider accepting the original answer.

Adideva98 Over a year ago

I tried something similar to this, but actually, I wanted its position in terms of its word number. For example, Match: <गीता-शास्त्रम्>K7, Start position: 9 should actually have the position as 3 as this is the 3rd word in the sentence. By position, I meant position in a clean sentence if I were to remove all alphanumeric tags and angular brackets.

Wiktor Stribiżew Over a year ago

@Adideva98 See this demo: matches = list(re.finditer(r'<([^<>]*)>(\w+)', text)), for m in matches: print( "Match: ", m.group(), ", Start position: ", len(text[:m.start()].split())+1, sep=""). Here, I get the substring before the match, split it with whitespace and count the resulting text chunks.
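Written out as a script, the word-number idea from the comment above might look like the sketch below. One caveat: counting the chunks before m.start() and adding 1 overcounts for a nested compound, because the leading brackets of its enclosing compound already form a partial chunk. The variant here counts the chunks up to m.end() instead, which gives the 1-based number of the whitespace-separated word that contains the match. The function name word_positions and the shortened sample are assumptions for illustration.

```python
import re

def word_positions(text):
    """Return (match, word_number) pairs, with word numbers starting at 1."""
    result = []
    for m in re.finditer(r'<([^<>]*)>(\w+)', text):
        # Every whitespace-separated chunk up to the end of the match,
        # including the chunk the match itself sits in.
        word_no = len(text[:m.end()].split())
        result.append((m.group(), word_no))
    return result

# Hypothetical shortened sample: words 3 and 6 are standalone compounds,
# word 4 is a nested compound.
text = "तत् इदम् <गीता-शास्त्रम्>K7 <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <दुर्विज्ञेय-अर्थम्>K1"

positions = word_positions(text)
for match, word_no in positions:
    print(match, word_no)
```

Because removing brackets and tags does not change the whitespace, these word numbers are the same in the cleaned sentence.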

Adideva98 Over a year ago

I am using a variation of the original answer. I modified the code a bit to print the compound and tag along with the clean sentence they appear in, and I want to incorporate the position into the same format. While your solution works as is, it does not work properly when integrated into my program. Kindly look at this demo (ideone.com/e.js/jFNuCs) to get an understanding of my requirement.
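The linked demo is not reproduced here, so the following is only a guess at the requested format: each compound with its tag, its word number, and the clean sentence. It assumes that "clean sentence" means the line with every '<' removed and every '>' removed together with the tag that follows it. The names PAIR and describe are hypothetical.

```python
import re

PAIR = re.compile(r'<([^<>]*)>(\w+)')

def describe(line):
    """Return (compound, tag, word_number, clean_sentence) for each match."""
    # Assumption: cleaning drops '<' and drops '>' plus the tag after it.
    clean = re.sub(r'>\w+|<', '', line)
    rows = []
    for m in PAIR.finditer(line):
        # 1-based number of the whitespace-separated word containing the match.
        word_no = len(line[:m.end()].split())
        rows.append((m.group(1), m.group(2), word_no, clean))
    return rows

# Hypothetical shortened sample in the question's format.
line = "तत् इदम् <गीता-शास्त्रम्>K7 <<तत्-अर्थ>T6-आविष्करणाय>T6"

rows = describe(line)
for compound, tag, word_no, clean in rows:
    print(compound, tag, word_no, clean, sep=" | ")
```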


blackraven · Accepted Answer · 2021-06-09 12:06:28Z

2

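Similar to @WiktorStribizew's answer, with a slight variation.

[A-Z]\d looks for exactly one uppercase letter followed by one digit, for example 'K7'. Note that it does not match longer tags such as 'Bs6' or tags without a digit such as 'Di'; use \w+ if those are needed.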

import re
f = "तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1  <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि"
cw = re.findall(r'<[^<>]+>[A-Z]\d', f)
print(cw)

Output

['<गीता-शास्त्रम्>K7', '<समस्त-वेद>K1', '<दुर्विज्ञेय-अर्थम्>K1', '<तत्-अर्थ>T6', '<पद-अर्थ>T6', '<वाक्य-अर्थ>T6', '<अत्यन्त-विरुद्ध>K1', '<अनेक-अर्थ>K1', '<अर्थ-निर्धारण>T6']

To locate each item found, the code below prints its index in the string (the position of its first character):

for item in cw:
    print(f.index(item))

9
32
80
105
152
164
202
222
285
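One thing to keep in mind is that str.index always returns the first occurrence, so if the same compound and tag appear twice in a line, both lookups report the same index. A possible alternative, sketched here with a made-up sample that repeats a compound, is to take the position from each match object via re.finditer.

```python
import re

# Hypothetical sample in which the same tagged compound appears twice.
f = "<पद-अर्थ>T6 च <पद-अर्थ>T6"
pattern = re.compile(r'<[^<>]+>[A-Z]\d')

# str.index finds the first occurrence each time, so duplicates collapse.
by_index = [f.index(item) for item in pattern.findall(f)]

# m.start() is the position of that particular match.
by_match = [m.start() for m in pattern.finditer(f)]

print(by_index)
print(by_match)
```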

edited Jun 9, 2021 at 12:06

answered May 31, 2021 at 14:14

blackraven


2 Comments

Adideva98 Over a year ago

I have another question. If I wanted the position of the compound in the sentence to be displayed along with these data, how would I do that?

blackraven Over a year ago

You can find the index, see edited answer. For example if you count 9 characters, you'd get '<गीता-शास्त्रम्>K7'
