
Basic task: I converted the response from a URL request into text and dumped it to a text file (almost a usable CSV).

Goal: a clean CSV. Across multiple lines, I'm trying to remove several different characters:

brackets, tildes (~), and the extra comma at the end of each line.

I cannot find any reasonably simple examples that accomplish this. I'm looking for something that can cycle through the file line by line and replace as it goes.

PLEASE NOTE: I expect this file to grow large over time, so the approach needs to be memory friendly.

Below is the code that created the file:

import urllib.request

# URL1 is defined elsewhere and holds the address of the data source
with urllib.request.urlopen(URL1) as response:
    data = response.read()
decoded_data = data.decode(encoding='UTF-8')

# decode() already returns a str, so no further conversion is needed
with open("test.txt", 'w') as saveFile:
    saveFile.write(decoded_data)

Here is a simplified sample from the file. The first line has the field names; the 2nd and 3rd lines represent records.

[["F1","F2","F3","F4","F5","F6"],

["string11","string12","string13","s~ring14","string15","string16"],

["string21","string22","s~ring23","string24","string25","string26"]]

3 Answers


If you want to remove characters from the beginning or end of a string, use strip. If the character you want to remove can appear anywhere in the string, use replace instead, like this: line.replace("~",""). Note that, unlike strip, replace cannot handle several different characters in one call, but calls can be chained, like this: line.replace("~","").replace(",","").replace("[","")
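To make the difference concrete, here is a small sketch run on one line shaped like the question's sample. The variable names are made up for illustration, and str.translate is shown only as one possible alternative to chaining replace calls, not as something the question requires.

```python
line = '["string11","string12","string13","s~ring14"],\n'

# strip() treats its argument as a set of characters and trims them
# from both ends only; it stops at the first character not in the set.
trimmed = line.strip(" [],\n")

# replace() works anywhere in the string, one substring per call.
cleaned = trimmed.replace("~", "")

# str.translate() with a deletion table removes several different
# characters in a single pass (note the trailing comma and newline stay).
delete_table = str.maketrans("", "", "~[]")
translated = line.translate(delete_table)
```

Note that translate deletes the listed characters everywhere, including inside field values, so it only suits data where brackets never appear as real content.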

Just a quick mockup of what might work for you:

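# Stream the file line by line so it never has to fit in memory.
# "test.txt" is the file written by the code in the question.
with open("test.txt", 'r') as f:
    with open("result.txt", 'w') as new_f:
        for line in f:
            # Trim brackets, commas and whitespace from both ends, then drop tildes anywhere.
            new_line = line.strip(" [],\n\t\r").replace("~","")
            print(new_line)
            new_f.write(new_line+"\n")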

I used strip for the brackets and commas because they generally appear at the ends of a line, and replace for the tildes because they can be anywhere. I also added "\n", "\t", "\r" and a space to the strip call, because these characters may appear at the end of each line ("\n" certainly will).
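If it helps to check the behaviour before touching a real file, the same cleaning step can be wrapped in a generator and tried on a few sample lines. This is only a sketch: clean_lines and the sample data are hypothetical names and values, and skipping empty results is an extra assumption that the original loop does not make.

```python
def clean_lines(lines):
    """Yield cleaned rows one at a time from any iterable of strings."""
    for line in lines:
        cleaned = line.strip(" [],\n\t\r").replace("~", "")
        if cleaned:  # assumption: drop lines that end up empty
            yield cleaned


sample = [
    '[["F1","F2","F3"],\n',
    '\n',
    '["string11","s~ring12","string13"],\n',
    '["string21","string22","s~ring23"]]\n',
]
result = list(clean_lines(sample))
```

Because an open file is itself an iterable of lines, passing the file object to clean_lines instead of a list keeps only one line in memory at a time.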


2 Comments

Yes, this did it. Perfect! THANK YOU!! :-) wow. Handles both the tilde(s) and the brackets.
I found the original reason the brackets ended up in the text file: the URL returns JSON that is meant to convey a table of data (columns and rows). The issue was that I could not find a solid example showing how to handle it. I have re-posted my code with corrections below. Note that the "scrubber" above is not in my corrected code.

You could use a simple for-loop to iterate through the file and replace the characters in each line:

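clean_txt = ""
with open("text.txt", "r") as file:
    for line in file:
        line = line.replace("~", "").replace("[", "").replace("]", "")
        # Strings are immutable, so build a new one without the newline and trailing comma.
        line = line.rstrip("\n").rstrip(",")
        # Collect every cleaned line; otherwise an empty string is written back.
        clean_txt += line + "\n"
# Reopen the same file for writing only after the read has finished.
with open("text.txt", "w") as w:
    w.write(clean_txt)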
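Since the question expects the file to grow, one possible variation on the loop is to write each cleaned line to a temporary file in the same directory and swap it in at the end, so the whole text is never held in memory. This is a sketch under assumptions, not the answer's original method: scrub_file is a hypothetical helper name and UTF-8 encoding is assumed.

```python
import os
import tempfile


def scrub_file(path):
    """Rewrite `path` in place, processing one line at a time."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r", encoding="utf-8") as src, tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=directory, delete=False
    ) as tmp:
        for line in src:
            line = line.replace("~", "").replace("[", "").replace("]", "")
            # Drop the newline, then any trailing comma, then restore the newline.
            tmp.write(line.rstrip("\n").rstrip(",") + "\n")
    # Both files are closed here; swap the cleaned copy over the original.
    os.replace(tmp.name, path)
```

Writing to a separate file first also means the original is left untouched if the loop fails partway through.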

1 Comment

Thanks for the input. It actually deletes all the contents of the file. I tried this approach before I posted. When I did get it to work, it would only "perform surgery" on the first line. I'm looking for something that will go through the whole file.

#!/usr/bin/env python3

# Note, I used the print function as a way to visually confirm the code worked.
# The URL call returns bytes holding serialized data for a basic table (columns and rows, where the first row holds the column names, just like Excel or SQL)

URL_call = "http://www.zzz.com/blabla.html"

# urllib module & function: the response bytes first have to be decoded from UTF-8
import urllib.request
with urllib.request.urlopen(URL_call) as response:
    URL_data = response.read()

URL_data_decoded = URL_data.decode(encoding='UTF-8')

# use json to convert decoded response into a python structure (from a JSON structure)
import json
URL_data_JSON = json.loads(URL_data_decoded)

# pandas will transition the python data structure from a "list-like" array to a table.
import pandas as pd
URL_data_panda = pd.DataFrame(URL_data_JSON)

# this will create the text (in this case a CSV) file
URL_data_panda.to_csv("test.csv")

# The file will need the first row removed (pandas writes its own numeric column labels as a header row)

# read the lines and determine the line count
with open("test.csv") as src:
    lines = src.readlines()
num_lines = len(lines)

print(num_lines)

# the zero position is assigned to the first row of text. Writing from the second row (indexed as 1) gets the removal done.
with open("test2.csv", "w") as dst:
    dst.writelines(lines[1:])


# Changes the name of the first column from zero to a normalized name.

import fileinput

# Note, below you could set up a back-up file, in the file input, by adding an extra argument in the parens ("test2.csv", inplace=True, backup='.bak')
with fileinput.FileInput("test2.csv", inplace=True) as file:
    for line in file:
        # Rename only on the header line; replacing "0," on every line would also alter data such as "10,".
        if file.isfirstline():
            line = line.replace("0,", "REC_NUM,", 1)
        print(line, end='')
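If the record-number column is not needed, a shorter route might be to skip pandas and hand the parsed JSON rows straight to the standard csv module, since the first row already holds the field names. The sketch below assumes the response is a JSON list of lists like the sample in the question; json_table_to_csv and the sample string are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io
import json


def json_table_to_csv(json_text, out_file):
    """Write a JSON list of rows (first row = field names) as CSV."""
    rows = json.loads(json_text)
    writer = csv.writer(out_file)
    writer.writerows(rows)


sample = '[["F1","F2","F3"],["string11","s~ring12","string13"]]'
buffer = io.StringIO()
json_table_to_csv(sample, buffer)
csv_text = buffer.getvalue()
```

For a real file, open it with open("test.csv", "w", newline="") and pass that in place of the buffer. Tildes are left in the values here because they are part of the data. If pandas is preferred, pd.DataFrame(rows[1:], columns=rows[0]).to_csv("test.csv", index=False) should avoid both the extra header row and the index column.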
