Questions tagged [text-processing]

-1 votes
3 answers
439 views

I'm interested in finding a text distance (or string similarity) algorithm which computes a greater distance (or lower similarity) when characters are further apart. For example, I want the distance ...
Vermillion
-4 votes
3 answers
246 views

Let's say I were to create a scraper. At some point I'll need to come up with an algorithm for identifying whether a piece of newly scraped text matches one that's already in the DB. How would ...
Nicholas E. Harding
-3 votes
1 answer
163 views

Suppose I have files a.txt, b.txt and c.txt:
a.txt: Hello, I like cake.
b.txt: Hello, I like turtles.
c.txt: go away, I don't like you
I suspect the difference between a.txt and b.txt is ...
user32882
  • 267
0 votes
1 answer
175 views

I am struggling a bit with a database setup. I found posts about similar problems, but the reasoning behind the answers was not what I was looking for, so I am asking again with my specifics. I am building ...
Cerealz
  • 41
0 votes
1 answer
274 views

Users have the ability to enter and save text in a rich text editor which is eventually stored in a database and then rendered on a site. Is it better to convert the RTF to HTML when it's stored to ...
Coupcoup
  • 220
0 votes
1 answer
485 views

Is there a de facto standard algorithm for finding good places to put line breaks in a paragraph of text rendered in a monospace font (e.g. to a text console)? The algorithm should aim to output lines ...
Lassi
  • 125
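
For the line-breaking question above, here is a minimal sketch of greedy first-fit wrapping. It is one common baseline, not an established standard, and the excerpt cuts off before the question's full requirements. The function name `wrap_greedy` and the sample sentence are illustrative assumptions.

```python
def wrap_greedy(text, width):
    """Break text into lines of at most `width` characters, greedily.

    Each word goes on the current line if it fits; otherwise a new line
    starts. A single word longer than `width` is left on its own line
    unbroken (an assumption; the question does not say how to handle it).
    """
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


for line in wrap_greedy("the quick brown fox jumps", 10):
    print(line)
```

Greedy wrapping never looks ahead, so it can leave very uneven line lengths. Approaches that minimise raggedness over the whole paragraph exist but need more code.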
3 votes
1 answer
263 views

Looking to integrate TeX equations in a TeX-agnostic fashion, suitable for either ConTeXt or LaTeX, into a Java-based desktop Markdown editor. The possibilities are numerous, but I'm not sure what ...
Dave Jarvis
-1 votes
1 answer
59 views

In search engine indexing, a body of text is often processed before it is indexed. A common example is stemming, where words are reduced to their root form (plurals are dropped, tense is normalized). ...
Deane
  • 171
2 votes
0 answers
62 views

I am creating a text generation algorithm for my master's research. I have a dialogue between two people and I would like to simulate one part of the conversation with naturally generated text (not ...
Bennie van Eeden
6 votes
2 answers
408 views

I have some Arabic content that is justified according to western conventions. I justified it because it is justified in ancient sources: However, the way Arabic text justification works is by ...
Lance Pollard
4 votes
2 answers
647 views

I would like to store the frequencies with which words co-occur with each other over a variety of contexts in a large (> 1 billion tokens) text corpus. I need to store the word pair, the type of co-...
pgtn
  • 51
0 votes
1 answer
133 views

A long time ago I learned that text files are not like random-access files, i.e., adding or updating info at the beginning of a text file involves moving all the rest of the file "forward" (or ...
Mdot
  • 1
0 votes
1 answer
90 views

I want to implement tracking of changes in plain-text documents, in a way similar to how it works in MS Word or Apple Pages. What I am unsure of is the data model and how to store it. Goal The ...
Adam Libuša
  • 2,077
2 votes
3 answers
850 views

With a list of thousands of words and a small list of letters I am trying to find the least amount of words to make use of all given letters, assuming my dictionary of words covers all letters. The ...
kontur
  • 131
0 votes
2 answers
13k views

I’m seeking a term and possibly the code behind what would help me implement that term in Python. I have been working on a text-based Python journaling application. When I want to review my ...
Iam Pyre
4 votes
4 answers
4k views

A follow-up to Difference between '\n' and '\r\n'. It's been a few decades since the schism was introduced. Nowadays, when documents are being exchanged over the internet, typically ...
Ondra Žižka
0 votes
1 answer
389 views

It took me by surprise yesterday that ordinal indicators are considered letters. I thought letters were only [a-zA-Z]. Why are they considered letters and not symbols? char.IsLetter('º'); // true ...
NullOrEmpty
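
The excerpt's observation can be reproduced outside .NET. This is a small Python sketch, not the asker's code, that looks up the Unicode name and general category for a few characters using the standard `unicodedata` module. The helper name `describe` and the sample characters are illustrative assumptions.

```python
import unicodedata


def describe(ch):
    """Return (Unicode name, general category, str.isalpha()) for one character."""
    return unicodedata.name(ch), unicodedata.category(ch), ch.isalpha()


# Categories starting with "L" are letters; "S" categories are symbols.
for ch in ("a", "º", "ª", "$"):
    print(ch, describe(ch))
```

Unicode places the ordinal indicators in a letter category, which is why letter tests based on Unicode categories accept them while a plain [a-zA-Z] check does not.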
1 vote
2 answers
352 views

I am writing a tool that will give users the ability to summarize text content on a webpage, by highlighting the text that they wish to get summarized. So far, I've received results that I can work ...
Fluppe
  • 111
1 vote
1 answer
187 views

I'm working on a quiz system that will allow users to enter text as an answer. The question could be something simple to start with, looking for a short phrase, or a select few words as the "correct" ...
simonw16
  • 133
1 vote
0 answers
118 views

I know that most POS tagger algorithms measure their accuracy token-wise, i.e. whether each token is tagged correctly or not. Some POS taggers provide sentence accuracy too. How is sentence ...
Harwee
  • 179
1 vote
1 answer
96 views

I process a lot of tweets in real time using Python, and I need to assign each tweet to a specific bucket. I have about 50 buckets, each with their own rules. The majority of them are simple ...
Mo.
  • 113
1 vote
1 answer
2k views

Firstly, I realize that question title is about as terrible as the sample code I'll post below, so please bear with me while I explain the problem more clearly, and if you have a better idea for the ...
Violet Giraffe
3 votes
1 answer
1k views

My first thought here is to use a dynamic array, but I am looking for something better. Currently I have the text files open into "chunks". Every word or group of spaces makes up a "chunk". Then I ...
Joe
  • 379
3 votes
1 answer
155 views

I have been working with some fairly large text files containing about two million lines of text. I don't know the length of the content or the lines in advance, just the number of lines. I have been ...
user4752157
9 votes
7 answers
1k views

I'm trying to develop a small reporting tool (with sqlite backend). I can best describe this tool as a "transaction" ledger. What I'm trying to do is keep track of "transactions" from weekly data ...
Swartz
  • 141
0 votes
0 answers
64 views

I would like to solve the following problem. On my website, I have a list of my publications. I also have my list of publications in a LaTeX file of my CV. The issue is that I update these manually, ...
Bob Johns
1 vote
2 answers
809 views

A very useful learning tool I stumbled across for Chinese was a massive list of sentences that, barring the first 10 or 15, only differed from the ones before by one or two words, or at least as few as ...
William Brun
8 votes
2 answers
2k views

For instance, you let the user define the notorious path variable. How do you interpret apppath = C:\Program Files\App? This looks like a programming language adopted practice to ignore the white ...
Val
  • 367
2 votes
2 answers
3k views

Graphics processing units (GPUs) are very common and allow for efficient, parallel processing of floating point numbers. PPUs (Physics Processing Units) used to be a buzzword several years ago but ...
toniedzwiedz
  • 1,353
-1 votes
1 answer
280 views

I have multiple patterns that I want to expand. Expansion should expand number and letter ranges between curly braces. Numbers need to support padding. I want to have it expand into a List(Of String) ...
Timberwolf
2 votes
1 answer
650 views

I am building a text editor which makes use of a Ragel based tokenizer to support syntax highlighting. I am considering the use of a rope data structure to support efficient modifications and undo/...
sesteel
  • 75
-1 votes
1 answer
941 views

I'm working in a fairly old yet sufficiently unproductive code base that I need to create a(some) script(s) to help me out. For example: we add a version # and timestamp at the header of the file (...
cbrulak
  • 367
4 votes
1 answer
511 views

I was recently tasked with building a Named Entity Recognizer as part of a project. The objective was to parse a given sentence and come up with all the possible combinations of the entities. One ...
Rohit Jose
3 votes
3 answers
548 views

Lately, I've been noticing that a lot of software, be it a website, a client application, or a video game, often writes a representation of quantity as follows: "1 result(s)". Now, I can understand why ...
Lars
  • 67
5 votes
1 answer
790 views

If a computer program analyses texts written by an author, how much can it tell today about that author, given texts long enough to be statistically significant? Can ...
Niklas Rosencrantz
1 vote
1 answer
2k views

I have a requirement to read a text file with lines in tag=value format and then output the file with specific tags listed first and the rest sorted alphabetically. The incoming file is randomly ...
Pablo Vadear
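
The reordering described in that excerpt can be sketched in Python as below, under stated assumptions. The function name `reorder`, the sample tags, and the choice to drop lines without an `=` are all hypothetical, since the excerpt does not show the real file or tag list.

```python
def reorder(lines, priority):
    """Return tag=value lines with `priority` tags first, then the rest sorted by tag.

    Lines without "=" are skipped (an assumption; the question does not
    say how malformed lines should be treated).
    """
    pairs = [line.rstrip("\n").split("=", 1) for line in lines if "=" in line]
    rank = {tag: i for i, tag in enumerate(priority)}
    # sorted() is stable, so repeated tags keep their original relative order.
    first = sorted((p for p in pairs if p[0] in rank), key=lambda p: rank[p[0]])
    rest = sorted((p for p in pairs if p[0] not in rank), key=lambda p: p[0])
    return [f"{tag}={value}" for tag, value in first + rest]


sample = ["zeta=1", "id=7", "alpha=2", "name=x"]
print(reorder(sample, priority=["name", "id"]))
```

Splitting on the first `=` only keeps values that themselves contain `=` intact.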
0 votes
2 answers
1k views

I'm not quite sure if this is a question for programmers.se rather than stackoverflow, but here goes. So Facebook [or any other large company] when given something like an apostrophe or html, can ...
Someone
  • 191
3 votes
3 answers
1k views

I work for an organization that does a lot of work with government data. We have a couple of different projects where we've abstracted out common text search/manipulation operations into reusable ...
Andrew Pendleton
4 votes
1 answer
992 views

There is an endless data stream of XML messages (and "heartbeats"), that I receive via a telnet connection and through a site-to-site VPN IPsec tunnel. I'm still pondering. What is the best/most ...
derphil
  • 859
6 votes
1 answer
1k views

I am building an app that analyzes people's posts by pulling their Tweets and Facebook posts. I need to process all the posts and find useful phrases. What I mean by useful is that any word or ...
Can Poyrazoğlu
9 votes
4 answers
492 views

I want to make a simple, proof-of-concept application (REPL) that takes a number and then processes commands on that number. Example: I start with 1. Then I write "add 2", it gives me 3. Then I ...
Nini Michaels
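
The command loop from that excerpt could look roughly like the following table-driven Python sketch. Only the `add` command comes from the question; `sub`, `mul`, and the helper names `step` and `run` are hypothetical additions for illustration.

```python
import operator

# Only "add" appears in the question; "sub" and "mul" are hypothetical extras.
COMMANDS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def step(value, line):
    """Apply one command line such as "add 2" to the current value."""
    name, arg = line.split()
    return COMMANDS[name](value, int(arg))


def run(start, lines):
    """Feed each command to step() and collect the running results."""
    results, value = [], start
    for line in lines:
        value = step(value, line)
        results.append(value)
    return results


print(run(1, ["add 2", "mul 5", "sub 4"]))
```

Keeping the commands in a dictionary means a new command is one more entry rather than another branch in the loop.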
3 votes
1 answer
2k views

I have to read a big XML file with a lot of information. Afterwards I extract the needed information (~20 points (columns) / ~80 relevant data (rows, some of them with subdatasets)) and write them out in ...
MemLeak
  • 133
6 votes
1 answer
336 views

Consider a case when I want to try some idea of an application. But I want to avoid investing a lot of effort in coding UI/work flows/database schema etc before I see that it's going to be useful to ...
Alexey
  • 1,269
2 votes
2 answers
5k views

How do I separate words in a string? In the following I have a random sample of words in a string extracted from a text file with over a million words. Here's the string: "intervene Pockets ...
Ji Park
  • 129
4 votes
1 answer
2k views

One of the features in our project is to implement a comparison algorithm between two versions of text and provide a % change between the two versions. While I was researching, I came across Google ...
java_mouse
  • 2,657
1 vote
4 answers
213 views

I have written a program that can rapidly (within 5 sec on a 2 GB RAM desktop, 2.33 GHz CPU) differentiate between structured text (e.g. English text) and random alphanumeric strings. It can also ...
rooznom
  • 11
3 votes
6 answers
14k views

I need a programming language for text editing and processing (replace, formatting, regular expressions, string comparison, word processing, text analysis, etc.). Which programming language is more ...
Googlebot
  • 3,253
3 votes
2 answers
695 views

I am researching ways to classify words in text and I'm wondering what options there are and which are best suited to this job. I'm mostly interested in keywords which are most often nouns. So far I ...
Xeoncross
  • 1,213
3 votes
2 answers
2k views

Alright, people. With the help of the Stack Overflow community, I have finally gathered product pages from 20 commercial product-selling websites, with the following features: Product URL, Product Price, Product ...
Furkan Gözükara
24 votes
4 answers
42k views

I want to write something that takes a sentence and identifies each word it contains and defines what part of speech each word is. For example Hello World, I am a sentence would return this verb ...
Vinny
  • 259