
I need to remove some duplicates.

A list contains elements that are strings, each made up of substrings separated by “;”. The substrings within each string may be duplicated, for example:

"15-105;ZH0311;TZZGJJ; ZH0311; ZH0311;DOC",

“ZH0311” appears 3 times in the string (the number of occurrences is not fixed). I need to eliminate the duplicates and reduce the string to the following (the order of the substrings doesn't matter):

"15-105;TZZGJJ; ZH0311;DOC",

I am thinking of splitting each string on ";" and joining the parts back together. How can I do this for the whole list?

a_list = [

"15~105;~ PO185-400CT;NGG;DOC",
"15~105;-1;NGG;DOC",
"15~105; 15~105; NGG;-10;NGG;DOC",
"15~55;J205~J208;POI;DOC",
"15-105;15-105;ZH0305~;WER /;TZZGJJ;DOC",
"15-105;ZH0311;TZZGJJ; ZH0311; ZH0311;DOC",
"15-115;15-115; PL026~ PL028; Dry;PTT"]

Please note that the strings contain non-ASCII characters.

A side question: does it make a difference if the list contains nested lists rather than strings, and the elements in each nested list are duplicated?

  • Thanks, figs. The original strings contain non-ASCII characters, and "set" doesn't seem to produce the needed result. Commented Nov 21, 2014 at 5:03

3 Answers

>>> a = "15-105;ZH0311;TZZGJJ; ZH0311; ZH0311;DOC"
>>> a = [part.strip() for part in a.split(';')]  # a list, so a.index works on Python 3 too
>>> a
['15-105', 'ZH0311', 'TZZGJJ', 'ZH0311', 'ZH0311', 'DOC']
>>> a = sorted(set(a),key=lambda x:a.index(x))
>>> a
['15-105', 'ZH0311', 'TZZGJJ', 'DOC']
>>> ";".join(a)
'15-105;ZH0311;TZZGJJ;DOC'

I used split to break the string apart and strip to remove the extra spaces. I then used set to remove the duplicates, but a set does not preserve order, so I sort the result by each item's first position in the original list.
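
As an alternative to sorting the set by index, here is a minimal sketch that assumes Python 3.7 or later, where dict keeps insertion order. The helper name dedupe_fields is only illustrative and is not part of the answer above:

```python
def dedupe_fields(text, sep=";"):
    """Strip each field and drop repeats, keeping first-seen order."""
    fields = (field.strip() for field in text.split(sep))
    # dict.fromkeys keeps one key per distinct field, in insertion order
    return sep.join(dict.fromkeys(fields))


print(dedupe_fields("15-105;ZH0311;TZZGJJ; ZH0311; ZH0311;DOC"))
# prints: 15-105;ZH0311;TZZGJJ;DOC
```

This avoids the repeated index lookups, which can matter for strings with many fields.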

>>> def remove_duplication(my_list):
...     my_newlist = []
...     for x in my_list:
...         x = [part.strip() for part in x.split(';')]
...         my_newlist.append(";".join(sorted(set(x),key=lambda y:x.index(y))))
...     return my_newlist
... 
>>> remove_duplication(a_list)
['15~105;~ PO185-400CT;NGG;DOC', '15~105;-1;NGG;DOC', '15~105;NGG;-10;DOC', '15~55;J205~J208;POI;DOC', '15-105;ZH0305~;WER /;TZZGJJ;DOC', '15-105;ZH0311;TZZGJJ;DOC', '15-115;PL026~ PL028;Dry;PTT']
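
For the side question about nested lists, the same idea should apply without the split and join steps. A rough sketch, assuming Python 3.7 or later and that the inner items are strings; dedupe_nested and the sample data are hypothetical:

```python
def dedupe_nested(rows):
    """Remove repeated items inside each inner list, keeping first-seen order."""
    result = []
    for row in rows:
        cleaned = (item.strip() for item in row)
        # dict.fromkeys drops repeats while preserving insertion order
        result.append(list(dict.fromkeys(cleaned)))
    return result


nested = [
    ["15-105", "ZH0311", "TZZGJJ", " ZH0311", " ZH0311", "DOC"],
    ["15~105", " 15~105", " NGG", "-10", "NGG", "DOC"],
]
print(dedupe_nested(nested))
# prints: [['15-105', 'ZH0311', 'TZZGJJ', 'DOC'], ['15~105', 'NGG', '-10', 'DOC']]
```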

If your string is delimited by spaces:

>>> a = "# -- coding: utf-8 --"
>>> a = a.split()  # split() with no argument already discards surrounding whitespace
>>> a
['#', '--', 'coding:', 'utf-8', '--']
>>> a = " ".join(sorted(set(a),key=lambda x:a.index(x)))
>>> a
'# -- coding: utf-8'

split splits the string on a delimiter, which may be a space, punctuation, or any other character or substring.
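
A small illustration of how the delimiter choice changes the result (the sample strings are made up for this sketch):

```python
# Splitting on an explicit delimiter keeps surrounding spaces and empty fields
parts_semicolon = "a; b;;c".split(";")
print(parts_semicolon)   # ['a', ' b', '', 'c']

# Splitting with no argument splits on runs of whitespace and drops empty fields
parts_whitespace = "a  b \t c".split()
print(parts_whitespace)  # ['a', 'b', 'c']
```

This is why the ";" examples need strip on each part, while the space-delimited example does not.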

Go through this documentation and you will understand: Built-in Types, Built-in Functions.


7 Comments

Ooh you're right about mapping to str.strip. Your answer is better -- deleted mine. However OP mentions that order doesn't matter so you don't have to sort your set.
Thanks, Hackaholic. It runs fine. However, the non-ASCII characters in the strings don't display well when I print them. Is there a way to fix that?
Thanks again. I put "# -- coding: utf-8 --" on the first line but it is still the same...
If you give input delimited by ';', such as "#; --; coding:; utf-8; --", then it will work, since the strings in your post are delimited by ';'.
Sorry, Hackaholic, but I still don't understand. Would it be possible to put it in your updated answer? Thanks.

Try putting all the strings into a set after stripping them, like so (note that this returns a single set for the whole list):

def myFilter(lines):
    strings = []
    for curLine in lines:
        strings.extend([curString.strip() for curString in curLine.split(";")])
    return set(strings)
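
If one result per input line is wanted rather than one merged set, a variant could look like the sketch below. The name filter_per_line is hypothetical, and because sets are unordered this only fits the case where the order of the substrings doesn't matter:

```python
def filter_per_line(lines):
    """Return one set of stripped fields for each input line."""
    return [{field.strip() for field in line.split(";")} for line in lines]


result = filter_per_line([
    "15~105;-1;NGG;DOC",
    "15~105; 15~105; NGG;-10;NGG;DOC",
])
# Each element of result is a set holding the distinct fields of one line
print(len(result))
```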


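You can use str.split and set. A set has no defined order, so the parts are sorted here to give a predictable result:

>>> s = "15-105;ZH0311;TZZGJJ; ZH0311; ZH0311;DOC"
>>> ';'.join(sorted(set(part.strip() for part in s.split(";"))))
'15-105;DOC;TZZGJJ;ZH0311'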
