I have a text file containing the following text:

Table of Contents

Preface 1

Chapter 1: Tokenizing Text and WordNet Basics 7

Tokenizing text into sentences 8

Tokenizing sentences into words 10

Tokenizing sentences using regular expressions 12

Suppose the string I have is:

input = "Tokenzing sentence using expressions"

I thought of using the first and last words to extract the matching line, but many lines share the same words.

So what's the best way to get this output?

Tokenizing sentences using regular expressions

  • Are you sure you want to match "Tokenzing" with "Tokenizing", or is it just a typo? Commented May 28, 2017 at 13:47
  • Yes. I want to find the most similar text. Commented May 28, 2017 at 13:52

1 Answer

If you are prepared to preprocess your chapter headings by removing page numbers and similar clutter, then this:

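import difflib

# Chapter headings with page numbers already removed
contents = ["Tokenizing Text and WordNet Basics",
            "Tokenizing text into sentences",
            "Tokenizing sentences into words",
            "Tokenizing sentences using regular expressions"]

# Named `query` rather than `input` to avoid shadowing the built-in input()
query = "Tokenzing sentence using expressions"

# n=1 returns only the single closest match, as a list
print(difflib.get_close_matches(query, contents, n=1))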

will give you this output:

['Tokenizing sentences using regular expressions']
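
The preprocessing step mentioned above could look something like the following. This is only a rough sketch: the helper name `clean_heading` and both regular expressions are assumptions based on the sample lines in the question (a leading "Chapter N:" label and a trailing page number), so adjust them to match your actual file.

```python
import difflib
import re

# Sample lines copied from the question; in practice these would be read
# from the text file, e.g. with open(path).read().splitlines().
raw_lines = [
    "Table of Contents",
    "",
    "Preface 1",
    "",
    "Chapter 1: Tokenizing Text and WordNet Basics 7",
    "",
    "Tokenizing text into sentences 8",
    "",
    "Tokenizing sentences into words 10",
    "",
    "Tokenizing sentences using regular expressions 12",
]


def clean_heading(line):
    """Hypothetical helper: drop a leading 'Chapter N:' label and a trailing page number."""
    line = line.strip()
    line = re.sub(r"^Chapter\s+\d+:\s*", "", line)  # assumed chapter-label format
    line = re.sub(r"\s+\d+$", "", line)             # assumed trailing page number
    return line


# Clean every line and discard the blank ones.
headings = [h for h in (clean_heading(line) for line in raw_lines) if h]

query = "Tokenzing sentence using expressions"
best = difflib.get_close_matches(query, headings, n=1)
print(best)
```

Note that `get_close_matches` is case-sensitive and silently drops candidates scoring below its default `cutoff` of 0.6, so for noisier input you may want to lowercase both sides or pass a lower `cutoff`.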