
I am trying to clean a list of URLs that have garbage characters at the end, as shown:

  1. /gradoffice/index.aspx(
  2. /gradoffice/index.aspx-
  3. /gradoffice/index.aspxjavascript$
  4. /gradoffice/index.aspx~

I have a CSV file with over 190k records of different URLs. I loaded the CSV into a pandas DataFrame and took the entire column of URLs with the statement

str = df['csuristem']

It clearly gave me all the values in the column. However, when I use the following code, it only prints about 40k records, and the output starts somewhere in the middle. I don't know where I am going wrong. The program runs without errors but shows only a partial set of results. Any help would be much appreciated.

import pandas
table = pandas.read_csv("SS3.csv", dtype=object)
df = pandas.DataFrame(table)
str = df['csuristem']
for s in str:
    s = s.split(".")[0]
    print s

I am looking to get output like this:

  1. /gradoffice/index.
  2. /gradoffice/index.
  3. /gradoffice/index.
  4. /gradoffice/index.

Thank you, Santhosh.

1 Answer


You need to call .str.split on the column and then use .str[0] to access the first portion of each split string:

In [6]:

df['csuristem'].str.split('.').str[0]
Out[6]:
0    /gradoffice/index
1    /gradoffice/index
2    /gradoffice/index
3    /gradoffice/index
Name: csuristem, dtype: object
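
Applied to your file, a minimal sketch of the whole fix might look like the following (it assumes the column is named csuristem as in your code; the output filename is just an example):

import pandas as pd

# read_csv already returns a DataFrame, so no separate
# pd.DataFrame(...) call is needed
df = pd.read_csv("SS3.csv", dtype=object)

# vectorised split on '.' and take the first piece for every row,
# instead of looping over the column in Python
df['csuristem'] = df['csuristem'].str.split('.').str[0]

# optionally write the cleaned data back out (hypothetical filename)
df.to_csv("SS3_clean.csv", index=False)

The vectorised string methods operate on the whole column at once, so there is no need for the explicit for loop, and rebinding the name str is avoided (in your snippet it shadows the built-in str type).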