66

I am transitioning from R to Python. I just began using Pandas. I have an R code that subsets nicely:

k1 <- subset(data, Product = p.id & Month < mn & Year == yr, select = c(Time, Product))

Now, I want to do similar stuff in Python. this is what I have got so far:

import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")


#first, index the dataset by Product. And, get all that matches a given 'p.id' and time.
 data.set_index('Product')
 k = data.ix[[p.id, 'Time']]

# then, index this subset with Time and do more subsetting..

I am beginning to feel that I am doing this the wrong way. perhaps, there is an elegant solution. Can anyone help? I need to extract month and year from the timestamp I have and do subsetting. Perhaps there is a one-liner that will accomplish all this:

k1 <- subset(data, Product = p.id & Time >= start_time & Time < end_time, select = c(Time, Product))

thanks.

1

6 Answers 6

98

I'll assume that Time and Product are columns in a DataFrame, df is an instance of DataFrame, and that other variables are scalar values:

For now, you'll have to reference the DataFrame instance:

k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]

The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators. The & operator is actually an overloaded bitwise operator which has the same precedence as arithmetic operators which in turn have a higher precedence than comparison operators.

In pandas 0.13 a new experimental DataFrame.query() method will be available. It's extremely similar to subset modulo the select argument:

With query() you'd do it like this:

df[['Time', 'Product']].query('Product == p_id and Month < mn and Year == yr')

Here's a simple example:

In [9]: df = DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': poisson(100, size=10)})

In [10]: df
Out[10]:
  gender  price
0      m     89
1      f    123
2      f    100
3      m    104
4      m     98
5      m    103
6      f    100
7      f    109
8      f     95
9      m     87

In [11]: df.query('gender == "m" and price < 100')
Out[11]:
  gender  price
0      m     89
4      m     98
9      m     87

The final query that you're interested will even be able to take advantage of chained comparisons, like this:

k1 = df[['Time', 'Product']].query('Product == p_id and start_time <= Time < end_time')
Sign up to request clarification or add additional context in comments.

4 Comments

Thanks Philip. It works well. This is what I was looking for - a simple, quick solution. Many thanks again. For those searching for such solution, the time I used is like this: (data.ts >= '2012-10-01') & (data.ts < '2013-05-01') .
@Philip, I tried your suggestion from iPython with concrete values in my conditions. They worked fine. But, when I embed the same in a program and call with parameters, I get an error - the final lines are: code File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 225, in wrapper if len(self) != len(other): Type Error: len() of unsized object
More about the above (ill-formatted) error message: I checked my data set and made sure there are no NaNs or NAs, and nothing is empty. Not sure why a call to my function errors out. I have my function like this: code def get_data(data, p_id, start_time, end_time): test_data = data.loc[(data.product == p_id) & (data.ts >= start_time) & (data.ts < end_time), ['product', 'ts']] code
please ignore that error. It was error coming out of converting Python array to numpy array - and finding the mean of a column resulted in some 'dtype' error.
20

Just for someone looking for a solution more similar to R:

df[(df.Product == p_id) & (df.Time> start_time) & (df.Time < end_time)][['Time','Product']]

No need for data.loc or query, but I do think it is a bit long.

Comments

15

I've found that you can use any subset condition for a given column by wrapping it in []. For instance, you have a df with columns ['Product','Time', 'Year', 'Color']

And let's say you want to include products made before 2014. You could write,

df[df['Year'] < 2014]

To return all the rows where this is the case. You can add different conditions.

df[df['Year'] < 2014][df['Color' == 'Red']

Then just choose the columns you want as directed above. For instance, the product color and key for the df above,

df[df['Year'] < 2014][df['Color'] == 'Red'][['Product','Color']]

Comments

0

Regarding some points mentioned in previous answers, and to improve readability:

No need for data.loc or query, but I do think it is a bit long.

The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators.

I like to write such expressions as follows - less brackets, faster to type, easier to read. Closer to R, too.

q_product = df.Product == p_id
q_start = df.Time > start_time
q_end = df.Time < end_time

df.loc[q_product & q_start & q_end, c('Time,Product')]

# c is just a convenience
c = lambda v: v.split(',') 

Comments

0

I created a function that works a bit like the subset function in R.

similar to what is asked for here

I haven't found a way to use both the %in% = list, while using And or Or operators, but I just do those subsets one by one:

def subset(df, query=None, select=None, unselect=None, asindex=False, returnFullDFIfError=False, **kwargs):
    """
    Subsets a pandas DataFrame based on query conditions, and selects or unselects specified columns.

    Parameters:
    df (pd.DataFrame): The DataFrame to be subsetted.
    query (str, optional): A query string to filter rows. Default is None.
    select (list, optional): Columns to be selected. Default is None.
    unselect (list, optional): Columns to be unselected. Default is None.
    asindex (bool, optional): Whether to return only the index if True. Default is False.
    returnFullDFIfError (bool, optional): Whether to return the full DataFrame if an error occurs. Default is True.

    Returns:
    pd.DataFrame or pd.Index: The subsetted DataFrame or Index, based on the given parameters.

    Examples:
    #>>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
    #>>> subset(df, query='A > 1', select=['B', 'C'])
    #>>> subset(df, query='A < 2', unselect=['C'])

    names_list = ['Alice', 'David']
    result = subset(df, query="Name %in% names_list", select=['Name', 'Age'],names_list=names_list)
    print(result)
        Name  Age
    0  Alice   25
    3  David   40

    IMPORTANT:
    You cannot combine the use of %in% and other operators like and, or, & and |.

    names_list = ['Alice', 'David']
    result = subset(df, query="Name %in% names_list or Age > 16" , select=['Name', 'Age'],names_list=names_list)
    print(result)
        Name  Age
    0  Alice   25
    3  David   40


    """
    import pandas as pd
    import numpy as np
    import re

    # Ensure proper types for select and unselect
    select = list(select) if select else []
    unselect = list(unselect) if unselect else []

    # Preprocess for %in% and %!in% conditions and standardize logical operators
    if query:
        df, query = _preprocess_query(df, query, kwargs)

    # Execute query
    try:
        if asindex:
            return df.query(query).index
        else:
            filtered_df = df.query(query) if query else df
            if select:
                return filtered_df[select]
            elif unselect:
                return filtered_df[[col for col in df.columns if col not in unselect]]
            else:
                return filtered_df
    except Exception as e:
        if returnFullDFIfError:
            return df
        else:
            raise e

def _preprocess_query(df, query, variables):
    """
    Preprocesses the DataFrame for %in% and %!in% conditions and standardizes logical operators.

    Parameters:
    df (pd.DataFrame): The DataFrame to be processed.
    query (str): The query string.
    variables (dict): A dictionary of variables to be used in the query.

    Returns:
    tuple: The processed DataFrame and the updated query string.
    """
    # Standardize logical operators
    query = query.replace(" or ", " | ").replace(" OR ", " | ").replace(" Or ", " | ")
    query = query.replace(" and ", " & ").replace(" AND ", " & ").replace(" And ", " & ")

    # Process %in% and %!in% conditions
    in_conditions = re.findall(r'(\w+)\s*%(!?in)%\s*(\w+)', query)
    for col, operator, var in in_conditions:
        values = variables.get(var, [])
        if operator == 'in':
            df = df[df[col].isin(values)]
        else:  # operator == '!in'
            df = df[~df[col].isin(values)]

    # Remove %in% and %!in% from the query
    updated_query = re.sub(r'\w+\s*%!?in%\s*\w+', '', query)

    return df, updated_query.strip()

Comments

-1

Creating an Empty Dataframe with known Column Name:

Names = ['Col1','ActivityID','TransactionID']
df = pd.DataFrame(columns = Names)

Creating a dataframe from csv:

df = pd.DataFrame('...../file_name.csv')

Creating a dynamic filter to subset a dtaframe:

i = 12
df[df['ActivitiID'] <= i]

Creating a dynamic filter to subset required columns of dtaframe

df[df['ActivityID'] == i][['TransactionID','ActivityID']]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.