Replace all newline characters using python

Question

I am trying to read a PDF using Python, and the content has many newline (CRLF) characters. I tried removing them with the code below:

from tika import parser

filename = 'myfile.pdf'
raw = parser.from_file(filename)
content = raw['content']
content = content.replace("\r\n", "")
print(content)

But the output remains unchanged. I also tried using double backslashes, which didn't fix the issue. Can someone please advise?

What sort of data structure is "content"? Post a sample of it to help us help you? – Nick, Commented Feb 19, 2019 at 7:34
This example is not reproducible without knowing what content contains. – Ignatius, Commented Feb 19, 2019 at 7:35
You can't just read a literal PDF file and make text replacements like this. You need a Python library which can parse PDF content. – Tim Biegeleisen, Commented Feb 19, 2019 at 7:37
content is a string. I checked it using type(content). @TimBiegeleisen I use the text after parsing the file from tika as you can see in code. – Leni, Commented Feb 19, 2019 at 7:40

user7849416 · Accepted Answer · 2019-02-19 07:34:26Z

16

content = content.replace("\\r\\n", "")

You need to double escape them.
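Whether this helps depends on what the string actually holds. Here is a minimal sketch of the two cases, using made-up sample strings rather than anything taken from the asker's PDF; `repr()` makes the difference visible:

```python
# Hypothetical sample values, not taken from the original PDF.
real_breaks = "line one\r\nline two"      # holds actual CR and LF characters
literal_text = "line one\\r\\nline two"   # holds backslash, 'r', backslash, 'n'

print(repr(real_breaks))    # 'line one\r\nline two'
print(repr(literal_text))   # 'line one\\r\\nline two'

# "\r\n" matches the two control characters only.
cleaned_real = real_breaks.replace("\r\n", "")

# "\\r\\n" matches the four literal characters only.
cleaned_literal = literal_text.replace("\\r\\n", "")

# The wrong pattern for the data leaves the string untouched.
unchanged = literal_text.replace("\r\n", "")
```

If `print(repr(content))` shows doubled backslashes, the double-escaped pattern is the one that matches; if it shows single `\r\n` sequences, the plain pattern should work.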

answered Feb 19, 2019 at 7:34

user7849416


6 Comments

Tim Biegeleisen Over a year ago

You don't need to escape \r\n, because they are already valid character literals.

alexopoulos7 Over a year ago

To be sure for all cases (windows or unix ascii string) please use content = content.replace("\r", "").replace("\n", "")

Breadtruck Over a year ago

In case someone is looking at this in the future ... I had to reference stackoverflow.com/questions/47178459/… to really overcome my issue because python converts windows crlf behind the scenes according to the answer in that post

Cecil Curry Over a year ago

Most upvoted answer has no idea what it is talking about. "You need to double escape them." Well, alrighty then. Meanwhile, the actual answer that addresses this topic by actually importing tika was ignored. </sigh>

user7849416 Over a year ago

@CecilCurry ok, but if those characters are literally in there (not encoded), then you'd need to replace those characters? So, I do know what I am talking about ?


Life is complex · Accepted Answer · 2019-02-20 04:30:11Z

4

I don't have access to your PDF file, so I processed one on my system. I also don't know if you need to remove all newlines or just double newlines. The code below removes double newlines, which makes the output more readable.

Please let me know if this works for your current needs.

from tika import parser

filename = 'myfile.pdf'

# Parse the PDF
parsedPDF = parser.from_file(filename)

# Extract the text content from the parsed PDF
pdf = parsedPDF["content"]

# Convert double newlines into single newlines
pdf = pdf.replace('\n\n', '\n')

#####################################
# Do something with the PDF
#####################################
print (pdf)
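One caveat with a single `replace('\n\n', '\n')` pass: it does not rescan its own output, so runs of three or more newlines are shortened but not fully collapsed. A small sketch, assuming the goal is at most one newline in a row (the helper name `collapse_newlines` and the sample text are made up for illustration):

```python
def collapse_newlines(text: str) -> str:
    """Repeat the replacement until no double newline remains."""
    while "\n\n" in text:
        text = text.replace("\n\n", "\n")
    return text


sample = "first\n\n\n\nsecond\n\nthird"  # hypothetical input

single_pass = sample.replace("\n\n", "\n")
collapsed = collapse_newlines(sample)

print(repr(single_pass))  # 'first\n\nsecond\nthird'
print(repr(collapsed))    # 'first\nsecond\nthird'
```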

edited Feb 20, 2019 at 4:30

answered Feb 20, 2019 at 1:44

Life is complex


andyw · Accepted Answer · 2022-11-19 15:32:54Z

3

You can also just use

text = '''
As she said these words her foot slipped, and in another moment, splash! she
was up to her chin in salt water. Her first idea was that she had somehow
fallen into the sea, “and in that case I can go back by railway,”
she said to herself.”'''

text = ' '.join(text.splitlines())

print(text)
# As she said these words her foot slipped, and in another moment, splash! she was up to her chin in salt water. Her first idea was that she had somehow fallen into the sea, “and in that case I can go back by railway,” she said to herself.”

answered Nov 19, 2022 at 15:32

andyw


Aaron Ford · Accepted Answer · 2021-03-24 04:03:27Z

2

If you are having issues with different forms of line break, try the str.splitlines() function and then re-join the result using the string you're after. Like this:

content = "".join(l for l in content.splitlines() if l)

Then, you just have to change the value within the quotes to whatever you need to join on. This will allow you to detect all of the line boundaries found here. Be aware, though, that str.splitlines() returns a list, not an iterator. So, for large strings, this will blow out your memory usage. In those cases, you are better off using the file stream or io.StringIO and reading line by line.
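A rough sketch of the line-by-line approach, for illustration only. The helper name `nonblank_lines` and the sample text are made up, and `io.StringIO` stands in for a real file object. Note that iterating a stream this way splits on `\n` only, so this version strips `\n` and `\r\n` endings but does not treat a lone `\r` as a line boundary the way `str.splitlines()` does.

```python
import io
from typing import Iterable, Iterator


def nonblank_lines(stream: Iterable[str]) -> Iterator[str]:
    """Yield each line without its trailing CR/LF, skipping empty lines."""
    for line in stream:
        stripped = line.rstrip("\r\n")
        if stripped:
            yield stripped


# io.StringIO stands in for a large file opened with open(...).
stream = io.StringIO("alpha\r\n\r\nbeta\ngamma\r\n")

joined = " ".join(nonblank_lines(stream))
print(joined)  # alpha beta gamma
```

The joined result is still built in memory, but the intermediate list of every line is avoided.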

edited Mar 24, 2021 at 4:03

answered Mar 24, 2021 at 1:34

Aaron Ford


Ryan M · Accepted Answer · 2021-03-19 05:01:38Z

1

print(open('myfile.txt').read().replace('\n', ''))

edited Mar 19, 2021 at 5:01

Ryan M♦


answered Mar 19, 2021 at 3:41

A Sravan Kumar Reddy


1 Comment

The Grand J Over a year ago

So what is this meant to do? How does this answer the question? Please edit your answer and explain the answer. Additionally please read How to Answer

Toothpick Anemone · Accepted Answer · 2022-08-08 16:47:58Z

When you write something like t.replace("\r\n", "") python will look for a carriage-return followed by a new-line.

Python will not replace carriage returns by themselves or replace new-line characters by themselves.

Consider the following:

t = "abc abracadabra abc"
t.replace("abc", "x")

Will t.replace("abc", "x") replace every occurrence of the letter a with the letter x? No
Will t.replace("abc", "x") replace every occurrence of the letter b with the letter x? No
Will t.replace("abc", "x") replace every occurrence of the letter c with the letter x? No

What will t.replace("abc", "x") do?

t.replace("abc", "x") will replace the entire string "abc" with the letter "x"

Consider the following:

test_input = "\r\nAPPLE\rORANGE\nKIWI\n\rPOMEGRANATE\r\nCHERRY\r\nSTRAWBERRY"

t = test_input
for _ in range(0, 3):
    t = t.replace("\r\n", "")
    print(repr(t))

result2 = "".join(test_input.split("\r\n"))
print(repr(result2))

The output sent to the console is as follows:

'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'

Note that:

str.replace() replaces every occurrence of the target string, not just the left-most occurrence.
str.replace() replaces the target string, but not every character of the target string.

If you want to delete all new-line and carriage returns, something like the following will get the job done:

in_string = "\r\n-APPLE-\r-ORANGE-\n-KIWI-\n\r-POMEGRANATE-\r\n-CHERRY-\r\n-STRAWBERRY-"

out_string = "".join(filter(lambda ch: ch not in "\n\r", in_string))

print(repr(out_string))
# prints '-APPLE--ORANGE--KIWI--POMEGRANATE--CHERRY--STRAWBERRY-'
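The same character-by-character deletion can also be written with `str.translate`. This is an alternative technique, not part of the answer above, shown here as a short sketch on the same input:

```python
in_string = "\r\n-APPLE-\r-ORANGE-\n-KIWI-\n\r-POMEGRANATE-\r\n-CHERRY-\r\n-STRAWBERRY-"

# The third argument of str.maketrans lists characters to delete.
drop_breaks = str.maketrans("", "", "\r\n")
out_string = in_string.translate(drop_breaks)

print(repr(out_string))
# '-APPLE--ORANGE--KIWI--POMEGRANATE--CHERRY--STRAWBERRY-'
```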

peniel charles · Accepted Answer · 2022-12-29 15:58:39Z

0

# Write a sample file
write_File = open("sample.txt", "w")
write_File.write("line1\nline2\nline3\nline4\nline5\nline6\n")
write_File.close()

# Read the file back and replace each newline character with a space
open_file = open("sample.txt", "r")
open_new_File = open_file.read()
replace_string = open_new_File.replace("\n", " ")
print(replace_string, end=" ")
open_file.close()

OUTPUT

line1 line2 line3 line4 line5 line6

answered Dec 29, 2022 at 15:58

peniel charles


1 Comment

Community Over a year ago

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.

weAreStarsDust · Accepted Answer · 2023-07-18 16:04:15Z

0

A regex will work in this case.

The pattern r'(\r\n)+' matches one or more consecutive occurrences of \r\n, which are then replaced with a single \r\n.

import re

content = '\r\n\r\n\r\n\r\n\r\ntest'
content = re.sub(r'(\r\n)+', r'\r\n', content)  # '\r\ntest'

edited Jul 18, 2023 at 16:04

answered Jul 18, 2023 at 11:46

weAreStarsDust
