I have data listing several DBs, with one row for each date on which a DB was in use.

 DB             Dates        USAGE

 ABC            03-06-2018   IN USE
 ABC            07-06-2018   IN USE 
 XYZ            04-06-2018   IN USE
 XYZ            08-06-2018   IN USE

What I want is the full calendar month for every DB, not just the dates on which it was in use:

 DB             Dates        USAGE
 ABC            01-06-2018    NOT IN USE
 ABC            02-06-2018    NOT IN USE
 ABC            03-06-2018    IN USE
 .
 .
 ABC            07-06-2018    IN USE
 .
 .
 ABC            30-06-2018    NOT IN USE 
 XYZ            01-06-2018    NOT IN USE
 .
 .
 XYZ            30-06-2018    NOT IN USE
  • If I understood you correctly, you can query the dataframe based on the column "usage"; see this question: stackoverflow.com/questions/17071871/… . Is this what you need? Commented Jul 30, 2018 at 5:45
  • Possible duplicate Commented Jul 30, 2018 at 5:55
  • Not sure about a downvote, but pretty sure it is a dupe. Both the OP and the linked question are about adding missing dates from a range. Commented Jul 30, 2018 at 5:58
  • @jezrael It's up to you, of course. I say it is a possible dupe. Commented Jul 30, 2018 at 6:00
  • @jezrael I agree that, while the linked answer may be used to answer the OP, a complete answer requires more bits and pieces than the linked one. Commented Jul 30, 2018 at 6:06

1 Answer

Use:

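import pandas as pd

# Parse the day-first date strings
df['Dates'] = pd.to_datetime(df['Dates'], format='%d-%m-%Y')

# Every calendar day from the start of the first month to the end of the last month
a = df['Dates'].dt.to_period('M')
dates = pd.date_range(a.min().to_timestamp(how='start'),
                      a.max().to_timestamp(how='end').normalize())

# All (DB, date) combinations
mux = pd.MultiIndex.from_product([df['DB'].unique(), dates], names=['DB','Dates'])

# Rows missing from the original data get 'NOT IN USE'
df = df.set_index(['DB','Dates'])['USAGE'].reindex(mux, fill_value='NOT IN USE').reset_index()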
print (df.head())
    DB      Dates       USAGE
0  ABC 2018-06-01  NOT IN USE
1  ABC 2018-06-02  NOT IN USE
2  ABC 2018-06-03      IN USE
3  ABC 2018-06-04  NOT IN USE
4  ABC 2018-06-05  NOT IN USE

print (df.tail())
     DB      Dates       USAGE
55  XYZ 2018-06-26  NOT IN USE
56  XYZ 2018-06-27  NOT IN USE
57  XYZ 2018-06-28  NOT IN USE
58  XYZ 2018-06-29  NOT IN USE
59  XYZ 2018-06-30  NOT IN USE

Detail:

print (dates)
DatetimeIndex(['2018-06-01', '2018-06-02', '2018-06-03', '2018-06-04',
               '2018-06-05', '2018-06-06', '2018-06-07', '2018-06-08',
               '2018-06-09', '2018-06-10', '2018-06-11', '2018-06-12',
               '2018-06-13', '2018-06-14', '2018-06-15', '2018-06-16',
               '2018-06-17', '2018-06-18', '2018-06-19', '2018-06-20',
               '2018-06-21', '2018-06-22', '2018-06-23', '2018-06-24',
               '2018-06-25', '2018-06-26', '2018-06-27', '2018-06-28',
               '2018-06-29', '2018-06-30'],
              dtype='datetime64[ns]', freq='D')

Explanation:

  1. First convert the column with to_datetime.
  2. Create all possible dates: convert the column with to_period, then build a date_range from the start of the first month to the end of the last month using to_timestamp.
  3. Then create a MultiIndex with from_product.
  4. Finally, reindex and replace the missing values via fill_value.
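
The steps above can be put together as a self-contained sketch that first rebuilds the sample data from the question. The helper name fill_calendar_months and the variable names sample and result are hypothetical, and the month boundaries are taken from the Period attributes start_time and end_time instead of to_timestamp. Treat it as one possible arrangement of the same idea, not as the answer's exact code.

```python
import pandas as pd


def fill_calendar_months(df, fill_value='NOT IN USE'):
    """Return one row per (DB, calendar day) for every month covered by df.

    Hypothetical helper: expects columns 'DB', 'Dates' (dd-mm-YYYY strings)
    and 'USAGE', as in the question.
    """
    out = df.copy()
    # Step 1: parse the date strings
    out['Dates'] = pd.to_datetime(out['Dates'], format='%d-%m-%Y')

    # Step 2: all days between the first day of the earliest month
    # and the last day of the latest month
    months = out['Dates'].dt.to_period('M')
    dates = pd.date_range(months.min().start_time,
                          months.max().end_time.normalize())

    # Step 3: every (DB, date) combination
    mux = pd.MultiIndex.from_product([out['DB'].unique(), dates],
                                     names=['DB', 'Dates'])

    # Step 4: reindex, filling combinations absent from the input
    return (out.set_index(['DB', 'Dates'])['USAGE']
               .reindex(mux, fill_value=fill_value)
               .reset_index())


sample = pd.DataFrame({
    'DB': ['ABC', 'ABC', 'XYZ', 'XYZ'],
    'Dates': ['03-06-2018', '07-06-2018', '04-06-2018', '08-06-2018'],
    'USAGE': ['IN USE'] * 4,
})
result = fill_calendar_months(sample)
print(result.head())
```

Note that reindex needs each (DB, Dates) pair to appear at most once in the input, so duplicated rows would have to be dropped or aggregated beforehand.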
4 Comments

@jezrael Is there any way to ignore the dates falling on weekends?
@techdoodle - Do you mean removing the weekend datetimes from dates, like dates = dates[~dates.weekday.isin([5,6])] ?
@techdoodle - Or setting a different value for these dates as the last step, like df.loc[df['Dates'].dt.weekday.isin([5,6]), 'USAGE'] = 'no using' ?
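
The two weekend suggestions above can be compared side by side in a small sketch on the question's sample data. The variable names and the 'WEEKEND' label are placeholders chosen for illustration; the comment itself uses 'no using'.

```python
import pandas as pd

df = pd.DataFrame({
    'DB': ['ABC', 'ABC', 'XYZ', 'XYZ'],
    'Dates': pd.to_datetime(['03-06-2018', '07-06-2018',
                             '04-06-2018', '08-06-2018'], format='%d-%m-%Y'),
    'USAGE': ['IN USE'] * 4,
})

months = df['Dates'].dt.to_period('M')
dates = pd.date_range(months.min().start_time,
                      months.max().end_time.normalize())
usage = df.set_index(['DB', 'Dates'])['USAGE']

# Option 1: drop Saturdays (5) and Sundays (6) before building the index,
# so weekend rows never appear in the result
weekdays = dates[~dates.weekday.isin([5, 6])]
mux = pd.MultiIndex.from_product([df['DB'].unique(), weekdays],
                                 names=['DB', 'Dates'])
without_weekends = usage.reindex(mux, fill_value='NOT IN USE').reset_index()

# Option 2: keep every day, then overwrite USAGE on weekend rows
# ('WEEKEND' is a placeholder label)
mux_all = pd.MultiIndex.from_product([df['DB'].unique(), dates],
                                     names=['DB', 'Dates'])
relabelled = usage.reindex(mux_all, fill_value='NOT IN USE').reset_index()
is_weekend = relabelled['Dates'].dt.weekday.isin([5, 6])
relabelled.loc[is_weekend, 'USAGE'] = 'WEEKEND'
```

With either option, an 'IN USE' row that falls on a weekend (here ABC on 03-06-2018, a Sunday) is dropped or overwritten, so it is worth checking whether that is the intended behaviour.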
How can I get the same range of dates, but at an hourly interval too, i.e. 24 entries for each date?
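
One possible way to get hourly rows, sketched here under an assumption that the thread does not confirm: a day's USAGE value applies to all 24 hours of that day. The column name Timestamp and the variable names are hypothetical. The hourly grid is built with a one-hour pd.Timedelta as the frequency, and each hour is matched back to its calendar day with a left merge.

```python
import pandas as pd

df = pd.DataFrame({
    'DB': ['ABC', 'ABC', 'XYZ', 'XYZ'],
    'Dates': pd.to_datetime(['03-06-2018', '07-06-2018',
                             '04-06-2018', '08-06-2018'], format='%d-%m-%Y'),
    'USAGE': ['IN USE'] * 4,
})

# Hourly timestamps from 00:00 on the first day of the earliest month
# to 23:00 on the last day of the latest month
months = df['Dates'].dt.to_period('M')
hours = pd.date_range(months.min().start_time,
                      months.max().end_time,
                      freq=pd.Timedelta(hours=1))

# Every (DB, hour) combination as a plain DataFrame
mux = pd.MultiIndex.from_product([df['DB'].unique(), hours],
                                 names=['DB', 'Timestamp'])
hourly = mux.to_frame(index=False)

# Assumption: usage recorded for a day covers each hour of that day,
# so match every hour to its calendar date and look up the daily value
hourly['Dates'] = hourly['Timestamp'].dt.normalize()
hourly = hourly.merge(df, on=['DB', 'Dates'], how='left')
hourly['USAGE'] = hourly['USAGE'].fillna('NOT IN USE')
```

If usage is actually tracked per hour in the source data, the daily lookup would not be appropriate, and reindexing directly against the hourly MultiIndex would be closer to the original answer.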
