I have a table of actors from IMDb.

[Image of the table]

From this table I want to drop all rows where imdb_actors.birthYear is missing or less than 1950, and also drop those where imdb_actors.deathYear has a value.

The idea is to get a dataset of actors who are alive and not retired.

imdb_actors.birthYear.dtype
Out:dtype('O')

Converting to string doesn't help either: imdb_actors['birthYear'] = imdb_actors['birthYear'].astype('|S') just ruins all the years.

That's why I can't execute imdb_actors[imdb_actors.birthYear >= 1955]. When I try imdb_actors.birthYear.astype(str).astype(int), I get the message: ValueError: invalid literal for int() with base 10: '\\N'

How can I drop the missing values and apply the >= 1950 condition?

2 Answers

First convert numeric data to numeric series:

num_cols = ['birthYear', 'deathYear']
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')

Specifying errors='coerce' forces non-convertible elements to NaN.

Then create masks for your 3 conditions, combine via the vectorised | "or" operator, negate via ~, and apply Boolean indexing on your dataframe:

m1 = df['birthYear'].isnull()
m2 = df['birthYear'] < 1950
m3 = df['deathYear'].notnull()

res = df[~(m1 | m2 | m3)]
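
As a rough end-to-end sketch of the steps above: the small frame below is made-up sample data standing in for the real IMDb table (the primaryName column and all values are assumptions), with '\\N' marking a missing year as in the question's error message.

```python
import pandas as pd

# Hypothetical sample data; '\\N' is the missing-value marker from the question.
imdb_actors = pd.DataFrame({
    "primaryName": ["A", "B", "C", "D", "E"],
    "birthYear": ["1960", "\\N", "1940", "1975", "1980"],
    "deathYear": ["\\N", "\\N", "\\N", "2010", "\\N"],
})

# Convert both year columns to numbers; anything unparseable becomes NaN.
num_cols = ["birthYear", "deathYear"]
imdb_actors[num_cols] = imdb_actors[num_cols].apply(pd.to_numeric, errors="coerce")

m1 = imdb_actors["birthYear"].isnull()    # birth year missing
m2 = imdb_actors["birthYear"] < 1950      # born before 1950
m3 = imdb_actors["deathYear"].notnull()   # death year present

# Keep only rows that match none of the three drop conditions.
res = imdb_actors[~(m1 | m2 | m3)]
print(res)
```

With this sample, only the rows for A and E should remain: B has no birth year, C was born before 1950, and D has a death year.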

2 Comments

df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce') loses good cells with years and just gives me back each column as NaN.
I'd love to check this out, but I'm having trouble copy-pasting the image you posted on your question into my code. See also How to make good reproducible pandas examples.

Your problem is that the dtype of your birthYear Series is object, which is used for strings or a mix of types.

You will want to clean this Series first by applying a function like this:

imdb_actors.birthYear = imdb_actors.birthYear.map(lambda x: int(x) if str(x) != '\\N' else float('nan'))

then you can do your filtering:

imdb_actors[imdb_actors.birthYear >= 1955]
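
If the column ends up holding bytes objects rather than str, the comparison with '\\N' above will not match. The b'\\N' in the error reported in the comments hints at this, and an earlier astype('|S') would be one possible cause, since 'S' is NumPy's byte-string dtype. In that case, decoding before converting may help. This is only a sketch on made-up sample values, and it uses the 1950 threshold from the question text:

```python
import pandas as pd

# Hypothetical sample: years stored as bytes, with b'\\N' as the missing marker.
imdb_actors = pd.DataFrame({
    "birthYear": [b"1960", b"\\N", b"1940", b"1975"],
})

# Decode bytes to str; values that are not bytes are left unchanged.
decoded = imdb_actors["birthYear"].map(
    lambda x: x.decode("utf-8") if isinstance(x, bytes) else x
)

# Parse to numbers; the '\\N' marker cannot be parsed and becomes NaN.
imdb_actors["birthYear"] = pd.to_numeric(decoded, errors="coerce")

# NaN >= 1950 is False, so rows with a missing birth year are dropped as well.
filtered = imdb_actors[imdb_actors["birthYear"] >= 1950]
print(filtered)
```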

5 Comments

Trying to run the first line I get the error: ValueError: invalid literal for int() with base 10: b'\\N'
Could you post a sample of your data?
You should use vectorised operations. map + lambda on object dtype is no better than a simple loop with a list.
I have great trouble converting tables to a nice format on SO. Is there a way to paste them nicely? Sorry for the silly question.
@Pinkythemouse, That's a good question. See this question for answers: How to make good reproducible pandas examples.
