
I have data in Devanagari from which I need to extract some information. Here are a few example lines:

तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1 <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि

<अभ्युदय-अर्थः>T4 अपि यः <प्रवृत्ति-लक्षणः>Bs6 धर्मः वर्णान् आश्रमान् च उद्दिश्य विहितः सः <<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6 अपि सन् <<ईश्वर-अर्पण>T6-बुद्ध्या>T6 अनुष्ठीयमानः <सत्त्व-शुद्धये>T6 भवति <<फल-अभिसन्धि>T6-वर्जितः>T3

The alphanumeric strings are the tags of the text. I need to extract the binary compounds from each line along with their tags (the alphanumerics immediately after the compound). A binary compound is two hyphenated words inside angle brackets.

<अभ्युदय-अर्थः>

<गीता-शास्त्रम्>

<विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6

The first two are examples of binary compounds, whereas the third is not. The simplest way to identify a binary compound is to find two hyphenated words enclosed in a single pair of angle brackets and followed by a single tag. So after extraction from, say, the first line, I should get a list containing <गीता-शास्त्रम्>K7, <दुर्विज्ञेय-अर्थम्>K1

The code that I tried was this:

import re
# f holds the input text shown above
cw = re.findall(r'<(.*?)>', f)
tags = re.findall(r'[a-zA-Z0-9]+', f)
cc = re.sub(r'[<>a-zA-Z0-9]', '', f)
print(cw, tags, cc)

This, unfortunately, collects everything into separate lists, so I cannot map the tags back to their original compounds. Is there a more intuitive way to do this?

  • Do you mean you need re.findall(r'<([^<>]*)>(\w+)', text)? Or re.findall(r'<[^<>]*>\w+', text)? Commented May 31, 2021 at 13:53
  • These do the job perfectly, thank you! I am quite new to regex, so unfortunately I am not very good at it. Commented May 31, 2021 at 13:57

2 Answers

You can use

re.findall(r'<([^<>]*)>(\w+)', text)

See the regex demo. Details:

  • <([^<>]*)> - <, then zero or more chars other than < and > captured into Group 1, and then >
  • (\w+) - Group 2: one or more word chars.
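As a minimal sketch of how the two groups come back from `re.findall`, the snippet below uses a shortened sample taken from the question's first line. The variable names `pairs` and `tagged` are illustrative only and are not part of the original answer.

```python
import re

# Shortened sample taken from the first line in the question.
text = "तत् इदम् <गीता-शास्त्रम्>K7 <<तत्-अर्थ>T6-आविष्करणाय>T6"

# With two capture groups, findall returns one (compound, tag) tuple per match.
pairs = re.findall(r'<([^<>]*)>(\w+)', text)

# Rebuild "<compound>tag" strings so each tag stays attached to its compound.
tagged = [f"<{compound}>{tag}" for compound, tag in pairs]

for compound, tag in pairs:
    print(compound, tag)
```

Because `[^<>]*` cannot cross another bracket, the outer `<<तत्-अर्थ>T6-आविष्करणाय>T6` is not returned as a whole; only its innermost pair `<तत्-अर्थ>T6` matches.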

See the Python demo:

import re
text = "तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1  <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि\n<अभ्युदय-अर्थः>T4 अपि यः <प्रवृत्ति-लक्षणः>Bs6 धर्मः वर्णान् आश्रमान् च उद्दिश्य विहितः  सः <<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6 अपि सन्  <<ईश्वर-अर्पण>T6-बुद्ध्या>T6 अनुष्ठीयमानः <सत्त्व-शुद्धये>T6 भवति <<फल-अभिसन्धि>T6-वर्जितः>T3"
matches = list(re.finditer(r'<([^<>]*)>(\w+)', text))
# Show overall matches and their positions:
for m in matches:
    print(f"Match: {m.group()}, Start position: {m.start()}")
print("---")
# Show groups and their positions:
for m in matches:
    print(f"Word: {m.group(1)}, Word start position: {m.start(1)}, "
          f"Tag: {m.group(2)}, Tag start position: {m.start(2)}")

Output:

Match: <गीता-शास्त्रम्>K7, Start position: 9
Match: <समस्त-वेद>K1, Start position: 32
Match: <दुर्विज्ञेय-अर्थम्>K1, Start position: 80
Match: <तत्-अर्थ>T6, Start position: 105
Match: <पद-अर्थ>T6, Start position: 152
Match: <वाक्य-अर्थ>T6, Start position: 164
Match: <अत्यन्त-विरुद्ध>K1, Start position: 202
Match: <अनेक-अर्थ>K1, Start position: 222
Match: <अर्थ-निर्धारण>T6, Start position: 285
Match: <अभ्युदय-अर्थः>T4, Start position: 341
Match: <प्रवृत्ति-लक्षणः>Bs6, Start position: 366
Match: <देव-आदि>Bs6, Start position: 436
Match: <ईश्वर-अर्पण>T6, Start position: 489
Match: <सत्त्व-शुद्धये>T6, Start position: 530
Match: <फल-अभिसन्धि>T6, Start position: 555
---
Word: गीता-शास्त्रम्, Word start position: 10, Tag: K7, Tag start position: 25
Word: समस्त-वेद, Word start position: 33, Tag: K1, Tag start position: 43
Word: दुर्विज्ञेय-अर्थम्, Word start position: 81, Tag: K1, Tag start position: 100
Word: तत्-अर्थ, Word start position: 106, Tag: T6, Tag start position: 115
Word: पद-अर्थ, Word start position: 153, Tag: T6, Tag start position: 161
Word: वाक्य-अर्थ, Word start position: 165, Tag: T6, Tag start position: 176
Word: अत्यन्त-विरुद्ध, Word start position: 203, Tag: K1, Tag start position: 219
Word: अनेक-अर्थ, Word start position: 223, Tag: K1, Tag start position: 233
Word: अर्थ-निर्धारण, Word start position: 286, Tag: T6, Tag start position: 300
Word: अभ्युदय-अर्थः, Word start position: 342, Tag: T4, Tag start position: 356
Word: प्रवृत्ति-लक्षणः, Word start position: 367, Tag: Bs6, Tag start position: 384
Word: देव-आदि, Word start position: 437, Tag: Bs6, Tag start position: 445
Word: ईश्वर-अर्पण, Word start position: 490, Tag: T6, Tag start position: 502
Word: सत्त्व-शुद्धये, Word start position: 531, Tag: T6, Tag start position: 546
Word: फल-अभिसन्धि, Word start position: 556, Tag: T6, Tag start position: 568

7 Comments

I have another question. If I wanted the position of the compound in the sentence to be displayed along with these data, how would I do that?
@Adideva98 Please see the updated answer. Please consider accepting the original answer.
I tried something similar to this, but actually, I wanted its position in terms of its word number. For example, Match: <गीता-शास्त्रम्>K7, Start position: 9 should actually have the position as 3 as this is the 3rd word in the sentence. By position, I meant position in a clean sentence if I were to remove all alphanumeric tags and angular brackets.
@Adideva98 See this demo: matches = list(re.finditer(r'<([^<>]*)>(\w+)', text)), for m in matches: print( "Match: ", m.group(), ", Start position: ", len(text[:m.start()].split())+1, sep=""). Here, I get the substring before the match, split it with whitespace and count the resulting text chunks.
I am using a different variation of the original answer. I modified the code a bit to print the compound and tag along with the clean sentence it appears in, and I want to include the position in the same format. While your solution works as is, it does not work properly when integrated into my program. Kindly look at this demo (ideone.com/e.js/jFNuCs) to get an understanding of my requirement.
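The word-number idea discussed in these comments could also be sketched by splitting the sentence on whitespace first and searching each token separately. This is only a sketch under two assumptions: compounds never contain whitespace (true for the sample lines), and the helper name `compounds_with_word_positions` is hypothetical. It differs from the split-and-count one-liner above in one respect: for a nested compound, the text before the match can end in a partial token such as `<<<<`, which splitting the prefix counts as an extra word.

```python
import re

PATTERN = re.compile(r'<([^<>]*)>(\w+)')

def compounds_with_word_positions(sentence):
    """Return (compound, tag, word_number) for each innermost compound.

    word_number is the 1-based index of the whitespace-separated token
    that contains the compound. Removing tags and brackets does not
    change the number of tokens, so it also applies to the clean sentence.
    """
    results = []
    for position, token in enumerate(sentence.split(), start=1):
        for m in PATTERN.finditer(token):
            results.append((m.group(1), m.group(2), position))
    return results

# Shortened sample taken from the first line in the question.
sentence = "तत् इदम् <गीता-शास्त्रम्>K7 <<तत्-अर्थ>T6-आविष्करणाय>T6"
results = compounds_with_word_positions(sentence)
for compound, tag, position in results:
    print(compound, tag, position)
```

Here the first compound is reported as word 3, matching the position requested in the comments, and the nested compound is reported as word 4.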

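Similar to @WiktorStribizew's answer, with a slight variation.

[A-Z]\d matches exactly one uppercase letter followed by one digit, for example 'K7'. Note that it does not match longer tags such as 'Bs6' or tags without a digit such as 'Di'.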

import re
f = "तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1  <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि"
cw = re.findall(r'<[^<>]+>[A-Z]\d', f)
print(cw)

Output

['<गीता-शास्त्रम्>K7', '<समस्त-वेद>K1', '<दुर्विज्ञेय-अर्थम्>K1', '<तत्-अर्थ>T6', '<पद-अर्थ>T6', '<वाक्य-अर्थ>T6', '<अत्यन्त-विरुद्ध>K1', '<अनेक-अर्थ>K1', '<अर्थ-निर्धारण>T6']

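To locate the position of each item found, the code below prints its index (the location of its first character). Note that index returns the first occurrence, so a compound that appears more than once in the text is reported at the same position each time: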

for item in cw:
    print(f.index(item))

9
32
80
105
152
164
202
222
285
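If the same tagged compound can occur more than once, one option is to take positions from the match objects instead of `str.index`. The sketch below uses a constructed sample (the repetition is hypothetical and not taken from the question's data) to show the difference; the names `index_positions` and `match_positions` are illustrative.

```python
import re

# Constructed sample: the same tagged compound appears twice.
f = "<पद-अर्थ>T6 च <पद-अर्थ>T6"
pattern = r'<[^<>]+>[A-Z]\d'

cw = re.findall(pattern, f)

# str.index always reports the first occurrence of each item.
index_positions = [f.index(item) for item in cw]

# finditer keeps the actual start offset of every individual match.
match_positions = [m.start() for m in re.finditer(pattern, f)]

print(index_positions)
print(match_positions)
```

With `str.index` both entries point at the start of the string, while `finditer` reports a separate offset for the second occurrence.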

2 Comments

I have another question. If I wanted the position of the compound in the sentence to be displayed along with these data, how would I do that?
You can find the index, see edited answer. For example if you count 9 characters, you'd get '<गीता-शास्त्रम्>K7'
