0

I have this large SQL file with about 1 milllion inserts in it, some of the inserts are corrupted (about 6000) with weird characters that i need to remove so i can insert them into my DB.

Ex: INSERT INTO BX-Books VALUES ('2268032019','Petite histoire de la d�©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');

i want to remove only the weird characters and leave all of the normal ones

I tried using the following code to do so:

import fileinput
import string

fileOld = open('text1.txt', 'r+')
file = open("newfile.txt", "w")

for line in fileOld: #in fileinput.input(['C:\Users\Vashista\Desktop\BX-SQL-Dump\test1.txt']):
    print(line)
    s = line
    printable = set(string.printable)
    filter(lambda x: x in printable, s)
    print(s)
    file.write(s)

but it doesnt seem to be working, when i print s it is the same as what is printed during line and whats stranger is that nothing gets written to the file.

Any advice or tips on how to solve this would be useful

2
  • those are not weird characters. try reading the file using an encoding like this: fileinput.input(filename, openhook=fileinput.hook_encoded("utf-8")) or open(filename, 'r', encoding="utf-8") Commented May 27, 2016 at 2:21
  • Which version of Python are you using: 2 or 3? They handle character-set conversion (and illegal bytes in encoded character data) differently. As muratgu pointed out, you're probably making an assumption about the character encoding of text1.txt that isn't true. Commented May 27, 2016 at 3:33

2 Answers 2

2
    import string

strg = "'2268032019', Petite histoire de la d�©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');"
newstrg = ""
acc = """ '",{}[].`;:  """
for x in strg:
    if x in string.ascii_letters or x in string.digits or x in acc:
        newstrg += x
print (newstrg)

Output;

'2268032019', Petite histoire de la dsinformation','Vladimir Volkoff',1999,'Editions du Rocher','http:images.amazon.comimagesP2268032019.01.THUMBZZZ.jpg','http:images.amazon.comimagesP2268032019.01.MZZZZZZZ.jpg','http:images.amazon.comimagesP2268032019.01.LZZZZZZZ.jpg';
>>>

You can check if the element of the string is in ASCII letters and then create a new string without non-ASCII letters.

Also it depends on your variable type. If you work with lists, you don't have to define a new variable. Just del mylist[x] will work.

Sign up to request clarification or add additional context in comments.

6 Comments

Hey i tried that, however it removes all of the other formatting like periods, and quotes and so on, is there any way to handle that?
@VashistaAmarjit You said you split them, so if you split them you could work with specific elements. What's your input type? List? Str?
imput type is string
Tell me what you want to keep excatly (periods quotes etc.) Tell me like : "%&'@ then I'll edit my answer.
I want to keep single and double quotes, commas, colons, semi colons , and spaces, periods, and this character `. Thanks a lot i appreciate it a lot!
|
-1

You can use regular expressions sub() to do simple string replacements. https://docs.python.org/2/library/re.html#re.sub

# -*- coding: utf-8 -*-

import re

dirty_string = u'©sinformation'
# in first param, put a regex to screen for, in this case I negated the desired characters.
clean_string = re.sub(r'[^a-zA-Z0-9./]', r'', dirty_string)

print clean_string
# Outputs
>>> sinformation

4 Comments

how do i do the same but include commas, quotes , and spaces?
just add those characters in the regular expression. Find the set of characters you want to accept, than use [^...] to negate that in the replacement expression.
it seems like when i add the quotes it gives an error, should i add another carat before it?
if you ad quotes, you need to escape it. Be careful though, you may need to escape certain characters 2x (1 in the python string, and in the regex expression) stackoverflow.com/questions/280435/…

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.