Open Source For Geeks: Python

Friday, 11 April 2025

Understanding Global Interpreter Lock (GIL) in Python

Background

If you have been using Python, you must have come across the term GIL, or the claim that Python is primarily a single-threaded language. In this post, we will see what the Global Interpreter Lock (GIL) in Python is.

Understanding Global Interpreter Lock (GIL) in Python

The GIL, or Global Interpreter Lock, is a mutex that allows only one thread at a time to hold control of the Python interpreter. This means that only one thread can be executing Python code at any moment, so a multi-threaded, CPU-bound Python program cannot use more than one CPU core at a time.

This protects Python's internal memory from being corrupted. The GIL ensures there are no dangling pointers or memory leaks. Think of it as the alternative to every object in Python having its own lock that must be acquired before the object is accessed. Such fine-grained locking is prone to deadlocks; the GIL avoids that, but it essentially makes Python single-threaded.

However, note that even with the GIL there can be race conditions in your own code, because not every operation is atomic. Say we have a method that takes an object and adds it at the end of an array. Thread 1 enters the method, reads the index at which the new object should be added, and is then suspended (releasing the GIL) before it can insert. Thread 2 gets the GIL, reads the same index, writes its own object there, and is suspended. When thread 1 resumes, it writes to the same index thread 2 wrote to, overwriting thread 2's data. That is a race condition.
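The usual fix for the scenario above is to make the "read the index, then write" step atomic with an explicit lock. The following is a minimal sketch of that idea; the SafeList class and the worker function are hypothetical names made up for illustration, not part of any library.

```python
import threading


class SafeList:
    """Hypothetical fixed-size list whose append is protected by a lock."""

    def __init__(self, capacity):
        self.items = [None] * capacity
        self.next_index = 0
        self._lock = threading.Lock()

    def append(self, obj):
        # Without the lock, a thread could be suspended between reading
        # next_index and writing the slot, and another thread could reuse it.
        with self._lock:
            idx = self.next_index
            self.items[idx] = obj
            self.next_index = idx + 1


def worker(target, start, count):
    for value in range(start, start + count):
        target.append(value)


safe = SafeList(capacity=4000)
threads = [threading.Thread(target=worker, args=(safe, i * 1000, 1000))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(safe.next_index)       # 4000
print(len(set(safe.items)))  # 4000, no slot was overwritten
```

The GIL protects the interpreter's internals; protecting your own multi-step updates is still your job.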

How does GIL prevent memory corruption?

Python does memory management by keeping a count of the references to each object. When this count reaches 0, the memory held by that object is freed.

Let's see an example of this

import sys

a = ["A", "B"]
b = a
print(f"References to a : {sys.getrefcount(a)}")

The output is: References to a : 3

It is 3 because one reference is a, the 2nd reference is b, and the 3rd is a temporary reference created when the object is passed as an argument to the sys.getrefcount method. When the count reaches 0, the memory associated with this list object will be released.

Now, if multiple threads were allowed to run at the same time, two threads could increase or decrease the count simultaneously. This can lead to one of two problems:

There are no actual references to the object but the count is 1 due to a race condition. This is essentially a memory leak, i.e. the object is not referenced but cannot be garbage collected either because of the bad reference count.
There are still references to the object but the count is 0 due to a race condition, and Python frees the object's memory. This leads to a dangling pointer.

One way to handle the above case is to lock each object in Python before its reference count is updated. But as we know, locking comes with drawbacks. With many locks, deadlocks become possible: for example, two threads can each hold one lock while waiting for the lock the other one holds. It also hurts performance, as the threads would constantly acquire and release locks.

The alternative is the GIL: a single lock on the interpreter itself. Any thread that wants to execute Python code must first acquire the GIL. Having a single lock avoids these deadlocks, but it essentially makes Python single-threaded.

NOTE: Many potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL.
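To illustrate the note above, here is a small sketch that uses time.sleep as a stand-in for a blocking I/O call (sleeping releases the GIL in the same way). The fake_io function is a hypothetical placeholder, and the exact timing will vary from machine to machine.

```python
import threading
import time


def fake_io(seconds):
    # time.sleep releases the GIL while waiting, like a blocking read would
    time.sleep(seconds)


start = time.perf_counter()
threads = [threading.Thread(target=fake_io, args=(0.2,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Run one after another the four waits would need about 0.8s;
# because they overlap, the total stays close to a single 0.2s wait.
print(f"4 waits of 0.2s took {elapsed:.2f}s in total")
```

This is why threads are still useful in Python for I/O-bound work even though they do not speed up CPU-bound work.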

Monday, 31 March 2025

Understanding Bisect module in Python

Background

You will often have come across the use case of finding an element in a sorted list. Typically you would use a binary search, which takes O(logN). Python provides a module that does that for us: it is called bisect. The methods in this API allow us to find the position in a sorted list where a new element can be inserted to keep the list sorted. In this post, we will explain how to use this module.

Understanding Bisect module in Python

Let's try to see the method bisect_left that gives us a position in a sorted list where a new element can be inserted to keep the list in sorted order. The syntax is

bisect_left(list, element, low, high)

The arguments are

list - the sorted list to search
element - the element to be inserted
low - the index at which the search starts (optional, defaults to 0)
high - the index at which the search ends, exclusive (optional, defaults to the length of the list)
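The optional low and high arguments are easiest to understand by trying them out. Here is a short sketch (the list and variable names are just for illustration) that restricts the search to a slice of the list:

```python
import bisect

data = [1, 3, 5, 9, 10, 15]

# Search the whole list
whole = bisect.bisect_left(data, 9)
print(whole)       # 3

# Search only data[4:6], i.e. [10, 15]; 9 belongs at the start of that slice
upper = bisect.bisect_left(data, 9, 4, 6)
print(upper)       # 4

# Search only data[0:2], i.e. [1, 3]; 9 belongs at the end of that slice
lower = bisect.bisect_left(data, 9, 0, 2)
print(lower)       # 2
```

Note that the returned index is always relative to the full list, not to the slice.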

See the "Related links" section at the end for the official documentation link.

Now let's see an example

import bisect

# Named data rather than list to avoid shadowing the built-in list type
data = [1, 3, 5, 9, 10, 15]
idx = bisect.bisect_left(data, 6)
print(f"Index to insert 6 is {idx}")
data.insert(idx, 6)
print(f"New list {data}")

Above code prints:

Index to insert 6 is 3

New list [1, 3, 5, 6, 9, 10, 15]

You can also do both of the steps shown above - getting the index to be inserted for maintaining sorting and inserting it into the list in a single API provided by bisect. The API is

insort_left(list, element)

Let's see how we can use above API to achieve the same result

import bisect

# Named data rather than list to avoid shadowing the built-in list type
data = [1, 3, 5, 9, 10, 15]
bisect.insort_left(data, 6)
print(f"New list {data}")

Output is: New list [1, 3, 5, 6, 9, 10, 15]

NOTE: bisect_left as the name suggests gives the left-most possible index to insert, whereas there is a similar API called bisect_right that gives the right-most possible index to insert. Similarly to insort_left we also have insort_right that does in place insertion of given element in given list.

Let's see an example of the above

import bisect

# Named data rather than list to avoid shadowing the built-in list type
data = [1, 3, 5, 5, 5, 9, 10, 15]
idx = bisect.bisect_left(data, 5)
print(f"Left most index to insert 5 is {idx}")
idx = bisect.bisect_right(data, 5)
print(f"Right most index to insert 5 is {idx}")

Above code prints:

Left most index to insert 5 is 2

Right most index to insert 5 is 5

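NOTE: The time complexity of bisect_left and bisect_right is O(logN), the same as binary search. insort_left and insort_right also find the position in O(logN), but the insertion itself is O(N) because the elements after it have to be shifted.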

We can actually use this API for multiple use-cases, let's see them below

1. Binary Search

As we saw above, you can use bisect to implement binary search.

import bisect
def binary_search(arr, element, low, high):
    idx = bisect.bisect_left(arr, element, low, high)
    # idx == high means every element in arr[low:high] is smaller
    if idx < high and arr[idx] == element:
        return idx
    return -1


arr = [1, 3, 5, 7, 12, 20]
search_idx = binary_search(arr, 7, 0, len(arr))
print(f"Binary search index for 7 : {search_idx}")

Output is: Binary search index for 7 : 3

2. Prefix search

If your list contains only strings (all lower case) in sorted order, then we can use bisect for prefix search as well, as follows:

import bisect
def prefix_search(arr, prefix):
    idx = bisect.bisect_left(arr, prefix)
    if idx >= len(arr):
        return None
    el = arr[idx]
    return el if el.startswith(prefix) else None


arr = ["apple", "cat", "dog","elephant"]
print(prefix_search(arr, "ap"))

Output is: apple

3. Find no of repeating values

If you have a sorted array and you want to find the number of times a number is repeated, then we can use bisect again. Note that since the array is sorted, all occurrences of the number sit next to each other.

import bisect
def count_repeated_no(arr, no):
    # Equal values are contiguous in a sorted list, so the count is the
    # distance between the left-most and right-most insertion points
    # (0 when the number is not present)
    left_idx = bisect.bisect_left(arr, no)
    right_idx = bisect.bisect_right(arr, no)
    return right_idx - left_idx

# The input must be sorted for bisect to work
arr = [1, 2, 5, 5, 5, 5, 9, 10]
print(f"Count of 5 in array {count_repeated_no(arr, 5)}")

Output: Count of 5 in array 4

Performance

Note that this API is really fast.

First, it uses binary search, so its time complexity is O(logN).
Second, it is implemented in C, so it is faster than an equivalent function written in pure Python (See SO question on comparison with direct lookup in list)

Friday, 7 February 2025

Handling missing values in DataFrame with Pandas

Background

In the last few posts, we have been seeing the basics of Pandas - Series, DataFrame , how to manipulate data etc. In this post, we will try to see how to handle missing values in DataFrame.

Handling missing values in DataFrame with Pandas

Pandas accepts all of the following values as missing data

np.nan
pd.NA
None

We can use the isna or notna function to detect these missing data.

The isna function evaluates each cell in a DataFrame and returns True to indicate a missing value.
The notna function evaluates each cell in a DataFrame and returns True to indicate a non-missing value.

Let's try to see an example for the above:

Code:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

print(df.to_string())
print(df.isna())
print(df.notna())

Output:

      Name    Role  Joining Date
0   Aniket  IT Dev      20190101
1  abhijit    None      20200202
2     <NA>   IT QA      20210303
3     Anvi     NaN      20220404

    Name   Role  Joining Date
0  False  False         False
1  False   True         False
2   True  False         False
3  False   True         False

    Name   Role  Joining Date
0   True   True          True
1   True  False          True
2  False   True          True
3   True  False          True

You can now use the truth data to filter rows as follows

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

print(df[df["Role"].notna()])

Output:

     Name    Role  Joining Date
0  Aniket  IT Dev      20190101
2    <NA>   IT QA      20210303

Dropping (dropna) & replacement (fillna) of missing data

dropna: The dropna function is used to drop rows or columns with missing values. It takes the following arguments

axis - 0 for rows and 1 for columns
how - "any" drops the row/column if any data point is missing, "all" drops it only if all data points are missing
thresh - the minimum number of non-missing data points a row/column must have in order to be kept
inplace - if True, drops in place on the same DataFrame instead of returning a new modified DataFrame

fillna: The fillna function is used to fill missing values with some data.

Example for dropna:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

# Drop every column that has any missing value
print(df.dropna(axis=1, how="any"))

# Drop every row that has any missing value
print(df.dropna(axis=0, how="any"))

Output:

   Joining Date
0      20190101
1      20200202
2      20210303
3      20220404

     Name    Role  Joining Date
0  Aniket  IT Dev      20190101
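The thresh argument is not used in the example above, so here is a small sketch of it on the same data. It assumes thresh means "keep rows that have at least this many non-missing values"; the variable names strict and lenient are only for illustration.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

# Keep rows with at least 3 non-missing values (only row 0 qualifies)
strict = df.dropna(axis=0, thresh=3)

# Keep rows with at least 2 non-missing values (every row qualifies)
lenient = df.dropna(axis=0, thresh=2)

print(strict.index.tolist())   # [0]
print(lenient.index.tolist())  # [0, 1, 2, 3]
```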

Example for fillna:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, np.nan, 20220404]})

# Replace missing values with some default value
print(df.fillna({"Name": "Default Name", "Role": "Default Role"}))

# Replace missing joining date with latest/max joining date in data
print(df["Joining Date"].fillna(value=df["Joining Date"].max()))

# forward fill the missing data
print(df.ffill(limit=1))

Output:

           Name          Role  Joining Date
0        Aniket        IT Dev    20190101.0
1       abhijit  Default Role    20200202.0
2  Default Name         IT QA           NaN
3          Anvi  Default Role    20220404.0

0    20190101.0
1    20200202.0
2    20220404.0
3    20220404.0
Name: Joining Date, dtype: float64

      Name    Role  Joining Date
0   Aniket  IT Dev    20190101.0
1  abhijit  IT Dev    20200202.0
2  abhijit   IT QA    20200202.0
3     Anvi   IT QA    20220404.0

NOTE: Previously we could pass the method argument to fillna as bfill or ffill, but that is deprecated now. Warning message: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead

Thursday, 6 February 2025

String and Date manipulation in DataFrame with Pandas

Background

In the last post, we saw how to filter data in a DataFrame provided by the pandas library in Python. In this post, we will see how we can manipulate string and date type columns in a DataFrame.

String and Date manipulation in DataFrame with Pandas

String manipulation

You can use df.dtypes to check the data types of columns in a DataFrame. You can also use df.astype to convert the column types. In this section, we will see how to manipulate and operate on string data types in a DataFrame. You can do this by accessing the string content using the .str accessor.

Code:

import pandas as pd

df = pd.DataFrame({"Name": ["Aniket", "abhijit", "awantika", "Anvi"],
                   "Role": ["IT Dev", "Finance Analyst", "IT QA", "Fun Play"],
                   "Joining Date": [20190101, 20200101, 20210101, 20220101]})

print(df)
# Print the column data types as well (shown in the output below)
print(df.dtypes)

# .str accessor to operate on string data type
df["Name_initials"] = df["Name"].str[0:3]
df["Department"] = df["Role"].str.split(" ", expand=True)[1]

# Use + operator to concatenate data
df["Name_Department_Combined"] = df["Name"] + "_" + df["Department"]

# Chain operations to get results in one line
df["Capitalized_Initials"] = df["Name"].str.capitalize().str[0:3]

print(df.to_string())

Output:

       Name             Role  Joining Date
0    Aniket           IT Dev      20190101
1   abhijit  Finance Analyst      20200101
2  awantika            IT QA      20210101
3      Anvi         Fun Play      20220101

Name            object
Role            object
Joining Date     int64
dtype: object

       Name             Role  Joining Date  Name_initials  Department  Name_Department_Combined  Capitalized_Initials
0    Aniket           IT Dev      20190101            Ani         Dev                Aniket_Dev                   Ani
1   abhijit  Finance Analyst      20200101            abh     Analyst           abhijit_Analyst                   Abh
2  awantika            IT QA      20210101            awa          QA               awantika_QA                   Awa
3      Anvi         Fun Play      20220101            Anv        Play                 Anvi_Play                   Anv

In the above example, you can see various ways you can manipulate the str column type on data frame.

Date manipulation

Similar to manipulating strings, we can also manipulate date data types in pandas using the "dt" accessor. We will use the same example as above for date manipulation. For this we use a different data type called "datetime64[ns]"; this data type represents a timestamp, meaning it represents a date and time.

You can try executing below code to see how this data type works

print(pd.to_datetime("20240701"))

This outputs

2024-07-01 00:00:00

In the example above we have a "Joining Date" column and, as you can see in the output of the code above, its dtype is currently int64. We need to convert it to a datetime type before we do further date manipulations. Because the values are integers in YYYYMMDD form, we use pd.to_datetime with an explicit format; the df.astype() method mentioned above is an option when the dates are already in the standard YYYY-MM-DD format.

Code:

import pandas as pd

df = pd.DataFrame({"Name": ["Aniket", "abhijit", "awantika", "Anvi"],
                   "Role": ["IT Dev", "Finance Analyst", "IT QA", "Fun Play"],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

df['Joining Date'] = pd.to_datetime(df['Joining Date'], format='%Y%m%d')

# If the dates were in the standard YYYY-MM-DD format you could use the line below instead
# df = df.astype({"Joining Date": "datetime64[ns]"})
print(df.dtypes)

df["Joining Year"] = df["Joining Date"].dt.year
df["Joining Month"] = df["Joining Date"].dt.month
df["Joining Day"] = df["Joining Date"].dt.day
print(df.to_string())

Output:

Name                    object
Role                    object
Joining Date    datetime64[ns]
dtype: object

       Name             Role  Joining Date  Joining Year  Joining Month  Joining Day
0    Aniket           IT Dev    2019-01-01          2019              1            1
1   abhijit  Finance Analyst    2020-02-02          2020              2            2
2  awantika            IT QA    2021-03-03          2021              3            3
3      Anvi         Fun Play    2022-04-04          2022              4            4

You can do more operations using the ".dt" accessor for date data types.
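As a quick illustration of a few more ".dt" operations, here is a small sketch on a standalone Series. The sample dates and the variable name dates are made up for the example.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2019-01-01", "2020-02-02"]))

# Name of the weekday for each date
weekdays = dates.dt.day_name().tolist()
# Dates formatted back into strings
formatted = dates.dt.strftime("%d/%m/%Y").tolist()
# Quarter of the year each date falls in
quarters = dates.dt.quarter.tolist()

print(weekdays)    # ['Tuesday', 'Sunday']
print(formatted)   # ['01/01/2019', '02/02/2020']
print(quarters)    # [1, 1]
```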

Sunday, 2 February 2025

Filtering a DataFrame in Pandas

Background

In the last post, we saw the basics of using the pandas library in Python, which is used for data analysis. We saw two basic data structures supported by pandas

Series
DataFrame

In this post, we will further see how we can filter data in a data frame. These are some of the most common operations performed for data analysis.

Filtering a data frame in Pandas

loc & iloc methods

To recap, a data frame is a two-dimensional data structure consisting of rows and columns. So we need a way to filter rows and columns efficiently. The two main methods exposed by a data frame for this are

loc - selects by row and column labels
iloc - selects by row and column integer positions

For example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df.loc[["a", "b"], ["colA", "colC"]])
print(df.iloc[:2, :3])

Output:

   colA  colC
a     4     7
b     1     4

   colA  colB  colC
a     4     9     7
b     1     1     4

The loc and iloc methods are frequently used for selecting or extracting a part of a data frame. The main difference is that loc works with labels whereas iloc works with indices.
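One difference that often surprises people is how slices behave. The sketch below uses a small fixed data frame (values chosen only for illustration) to show that a loc slice includes its end label, while an iloc slice excludes its end position, like a normal Python slice.

```python
import pandas as pd

df = pd.DataFrame({"colA": [1, 2, 3], "colB": [4, 5, 6]}, index=["a", "b", "c"])

# loc slices by label and includes the end label "b"
by_label = df.loc["a":"b", "colA"]

# iloc slices by position and excludes the end position 1
by_position = df.iloc[0:1, 0]

print(by_label.tolist())      # [1, 2]
print(by_position.tolist())   # [1]
```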

Selecting subset of columns

We can get a Series (a single column of data) from the data frame using df["column_name"]. Similarly, we can get a new data frame with a subset of columns by passing a list of the columns needed. For example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[["colA", "colC"]])
print(type(df[["colA", "colC"]]))

Output:

   colA  colC
a     5     2
b     7     2
c     4     3
d     9     1

<class 'pandas.core.frame.DataFrame'>

As you can see from the output, we selected 2 columns, colA and colC, and the result is a new DataFrame object.

Filtering by condition

You can also filter a data frame by conditions. Consider the following example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[df["colA"] >= 1])

Output:

   colA  colB  colC  colD
a     3     9     5     6
b     8     5     9     6
c     9     4     1     4
d     8     4     3     5

The data frame has randomly generated data, so the output will differ on every run. Note that np.random.randint(1, 10, ...) only produces values from 1 to 9, so the condition colA >= 1 matches every row here; use a higher threshold (for example df["colA"] >= 5) to see rows being filtered out.

You can also specify multiple conditions with the & (and) or | (or) operators. Each condition must be wrapped in parentheses. Consider the following example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[(df["colA"] >= 1) | (df["colB"] <= 5)])

Output:

   colA  colB  colC  colD
a     2     4     4     7
b     4     4     3     6
c     1     9     1     9
d     2     3     8     3

Again, the output will vary due to the randomness of the data, but it will always match the filtering conditions. The following comparison operators are supported

==: equal
!=: not equal
>: greater than
>=: greater than or equal to
<: less than
<=: less than or equal to
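Because the examples above use random data, it is hard to see the conditions actually removing rows. Here is a small sketch with fixed values (chosen only for illustration) that combines the comparison operators with & (and), | (or) and ~ (not):

```python
import pandas as pd

df = pd.DataFrame({"colA": [1, 5, 8, 3], "colB": [9, 2, 7, 4]},
                  index=["a", "b", "c", "d"])

# Both conditions must hold
both = df[(df["colA"] >= 3) & (df["colB"] <= 5)]
# At least one condition must hold
either = df[(df["colA"] >= 8) | (df["colB"] >= 9)]
# Negate a condition
negated = df[~(df["colA"] >= 3)]

print(both.index.tolist())      # ['b', 'd']
print(either.index.tolist())    # ['a', 'c']
print(negated.index.tolist())   # ['a']
```

The parentheses around each condition are required because & and | bind more tightly than the comparison operators in Python.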

You can also use the .isin method to filter data as follows.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[df["colA"].isin([1, 2, 3])])

Output:

   colA  colB  colC  colD
b     1     7     6     4
c     2     4     9     7

Getting started with Pandas library in Python

Background

If you have worked in data analysis or data science roles, you will have worked with the pandas and NumPy libraries in Python, which come in handy. In this post, we will see how to get started with the pandas library. This post assumes you have a basic understanding of Python.

Installing pandas library

You can install the pandas library using pip

pip install pandas

Or you can install it directly in PyCharm as a package. You can go to

Settings -> Project -> Python Interpreter and click on the "+" icon to search for and add the pandas package.

Once you have installed pandas library you can import it as

import pandas as pd

and start using it

Data structures supported in Pandas

The pandas library supports 2 main data structures

Series: One-dimensional labelled array.
DataFrame: Two-dimensional data structure with labelled rows and columns

Series

Let's try to see how Series works. You can simply create a Series from a Python list

import pandas as pd

test_series = pd.Series([11, 12, 13])
print(test_series)

Output is:

0    11
1    12
2    13
dtype: int64

As you can see from the output, we have an integer index (0,1,2) and 1-dimensional data with values [11,12,13]. You can also have a string index for this one-dimensional array, and you can use this index to access the data in the Series.

import pandas as pd

test_series = pd.Series([11, 12, 13], index=["RowA", "RowB", "RowC"])
print(test_series)
print(test_series["RowA"])

Output is:

RowA    11
RowB    12
RowC    13
dtype: int64
11

DataFrame

To create a data frame you can simply pass a dictionary where the keys of the dictionary form the columns and the values of those keys form the data.

import pandas as pd

df = pd.DataFrame({
    "ColA": [11, 12, 13],
    "Col B": [21, 22, 23],
    "Col C": [31, 32, 33],
}, index=["Row A", "Row B", "Row C"])

print(df)

Output is:

       ColA  Col B  Col C
Row A    11     21     31
Row B    12     22     32
Row C    13     23     33

As with Series, passing an index is optional; if you do not pass one, the default behavior is to use an integer index (0, 1, 2, etc.). Similarly, if you do not assign explicit column names then integer column labels are used.

DataFrame supports various methods and attributes

df.head(): Gives the first 5 rows.
df.size: Gives the number of cells in the data frame (no of rows * no of columns). For the above example the output will be 9.
df.shape: Gives the dimensions of the data frame. For the above example the output will be (3, 3).
len(df): Gives the number of rows in the data frame. For the above example the output will be 3.

For checking the data type and converting the column type we can use the below:

df.dtypes: Gives the data types of the columns present in the data frame
df.astype: Method to convert the data type of a column

Consider the following example:

import pandas as pd

df = pd.DataFrame({
    "ColA": [11, 12, 13],
    "Col B": [21, 22, 23],
    "Col C": [31, 32, 33],
}, index=["Row A", "Row B", "Row C"])

print(df.dtypes)
df = df.astype({"ColA": float})
print(df.dtypes)

Output:

ColA     int64
Col B    int64
Col C    int64
dtype: object
ColA     float64
Col B      int64
Col C      int64
dtype: object

Once you have a data frame in place, you can reference individual columns and perform analysis on them. A single column referenced from a data frame supports the below operations:

df.col.nunique(): Returns the number of unique elements
df.col.unique(): Returns the actual unique elements
df.col.mean(): Returns the mean of the column values
df.col.median(): Returns the median of the column values
df.col.value_counts(): Returns the unique values and their counts

Consider below example:

import pandas as pd

df = pd.DataFrame({
    "Col A": [11, 12, 13],
    "Col B": [21, 22, 23],
    "Col C": [31, 32, 33],
}, index=["Row A", "Row B", "Row C"])

print(df["Col A"].nunique())
print(df["Col A"].unique())
print(df["Col A"].mean())
print(df["Col A"].median())
print(df["Col A"].value_counts())

Output is:

3
[11 12 13]
12.0
12.0
Col A
11    1
12    1
13    1
Name: count, dtype: int64

Note that a column of a data frame is actually a Series

import pandas as pd

df = pd.DataFrame({"A":[1,2,3], "C": [1,2,3]})
print(type(df))
print(type(df.A))

Output:

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.series.Series'>

Saturday, 25 January 2025

Using 'Union find' pattern for problem solving questions

About the pattern

The pattern is used to group elements into sets based on specified properties. Each element is unique and the sets are non-overlapping meaning each set has distinct elements. Each set forms a tree data structure, each element has a parent and the root of the tree is called the representative element of that set. The parent of this representative node is the same node (itself). If we pick any element of a set and follow its parent node then we will always reach the representative element of the set (root of the tree representing the disjoint set).

The pattern is implemented using two methods

find(node): Find the representative of the set containing the node.
union(node1, node2): Merges the sets containing node1 and node2 into one.

Let's say we have an array representing numbers 0 to 5. Before we start applying this pattern each element has itself as the parent.

This means we have 6 different unique sets each containing one element. Now we want to start grouping them into a unique set (typically we will have more than one unique set which we will see with an actual example in some time but for this example consider we want to create a single set).

Do union(0, 1): This will merge nodes 0 and 1 together. Change the representative of 0 to 1.
Similarly do

union(1,2)
union(2,3)
union(4,5): Notice how we merged 4 and 5 instead of 3 and 4. This is done just to give you an idea that an element can merge into any disjoint sets based on the property under consideration.
union(3,5)

At the end, we have a tree below:

As you may have realized by now, as you build the disjoint set this way the height of the tree can be O(N) in the worst case, and so is the time complexity of find.
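The walkthrough above can be sketched in a few lines of code. This is a deliberately naive version (no optimization yet), with the parent list and the find and union helpers written just for this illustration:

```python
parent = list(range(6))  # every element starts as its own representative


def find(node):
    # Follow parent links until we reach a node that is its own parent
    while parent[node] != node:
        node = parent[node]
    return node


def union(node1, node2):
    # Naive merge: the root of node1's set is attached under the root of node2's set
    root1, root2 = find(node1), find(node2)
    if root1 != root2:
        parent[root1] = root2


for a, b in [(0, 1), (1, 2), (2, 3), (4, 5), (3, 5)]:
    union(a, b)

print(parent)    # [1, 2, 3, 5, 5, 5]
print(find(0))   # 5
```

Finding the representative of 0 has to walk 0 -> 1 -> 2 -> 3 -> 5, which is the O(N) chain described above.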

Optimization: We can optimize this further by maintaining the rank of each element, which here denotes the number of nodes in the tree rooted at that element (counting the element itself). So the rank of a representative node is the size of the tree it represents. We can then use this rank to decide which of the two representative nodes should become the new representative when merging the trees corresponding to two disjoint sets. E.g., above node 5 at the end of the iteration has the rank of 6 (5 nodes under it + 1 counting itself).

With the above optimization, the new representative is whichever of the two representatives being merged has the higher rank. So it will be something like below:

As you would have guessed by now, the tree stays balanced with the above approach, which guarantees that the TC for find is reduced to log(N), the height of the tree.

So if I have to find the representative of 4, I will find its parent, which is 5, and then keep following parents until we reach the root, which in this case is 1. Once the root is reached (the node with itself as a parent) we return that node, as it is the representative node. As we traverse the height of the tree, the TC is log(N).
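Here is a minimal sketch of the optimized version, replaying the same unions as before. The size list plays the role of the rank described above, and on a tie the second node's root is kept as the representative; that tie-breaking rule is an assumption chosen so that union(0, 1) makes 1 the root, as in the walkthrough.

```python
parent = list(range(6))
size = [1] * 6   # the rank from the text: number of nodes in the tree rooted here


def find(node):
    while parent[node] != node:
        node = parent[node]
    return node


def union(node1, node2):
    root1, root2 = find(node1), find(node2)
    if root1 == root2:
        return
    # Attach the smaller tree under the larger one; on a tie keep root2
    if size[root1] > size[root2]:
        root1, root2 = root2, root1
    parent[root1] = root2
    size[root2] += size[root1]


for a, b in [(0, 1), (1, 2), (2, 3), (4, 5), (3, 5)]:
    union(a, b)

print(parent)    # [1, 1, 1, 1, 5, 1]
print(find(4))   # 1, reached via 4 -> 5 -> 1
print(size[1])   # 6
```

Compared with the naive chain, no node is now more than two steps away from the representative.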

Using 'Union find' pattern

Now let's try to use this pattern to solve an actual problem-solving question

Problem statement:

For a given integer, n, and an array, edges, return the number of connected components in a graph containing n nodes.

The array edges[i] = [x, y] indicates that there’s an edge between x and y in the graph.

Constraint:

1<=n<=1000
0<=edges.length<=500
edges[i].length == 2
0<=x,y<n
x!=y
No repeated edges

Notice how the problem statement says that the elements (vertices of the graph) are unique, and there are no repeated edges. Let's see how we can implement this using the union-find method.

Solution:

class UnionFind:

    def __init__(self, n):
        # Every node starts as its own parent, i.e. in its own set
        self.parent = list(range(n))
        # rank[v] is the number of nodes in the tree rooted at v
        self.rank = [1] * n
        self.connected_components = n

    def find(self, v):
        # Walk up the parent links until we reach the representative
        if self.parent[v] != v:
            return self.find(self.parent[v])
        return v

    def union(self, x, y):
        p1, p2 = self.find(x), self.find(y)
        if p1 != p2:
            # Attach the lower-ranked root under the higher-ranked one
            # and update the rank of the new representative
            if self.rank[p1] < self.rank[p2]:
                self.parent[p1] = p2
                self.rank[p2] += self.rank[p1]
            else:
                self.parent[p2] = p1
                self.rank[p1] += self.rank[p2]
            self.connected_components -= 1


def count_components(n, edges):
    uf = UnionFind(n)
    for v1, v2 in edges:
        uf.union(v1, v2)
    return uf.connected_components

If we run the above for the following data sets it gives the correct results:

Example 1:

Input: 5 , [[0,1],[1,2],[3,4]]
Output: 2

Example 2:

Input: 6 , [[0,1],[3,4],[4,5]]
Output: 3

Now let's try to understand the solution and the implementation of the pattern. The core logic of the union-find pattern is in the UnionFind class. In the constructor, we have initialized 3 things

parent - list that tracks the parent of each element
rank - list that tracks the rank of each element
connected_components - number of connected components

At the start each node has itself assigned as the parent and the rank of each node is 1. Similarly, since each node is separate at the start, the number of connected components is equal to the number of nodes.

As we iterate over the edges, we pass each one to the union method to merge its two vertices into a single set. Remember that the pattern is used to split unique elements into disjoint sets (connected components in this case). Whenever a union actually merges two different sets, we reduce connected_components by 1, since two previously separate components have become one. At the end, when we have processed every edge, the disjoint sets correspond to the connected components of the graph, so we simply return connected_components, which we have been using to track the count.

Also, notice how we use rank so that, while merging two disjoint sets, the representative with the higher rank becomes the parent. This keeps the resulting tree balanced, and consequently find() operations take log(N).

Thursday, 16 January 2025

How to get "i" th bit of a number in Python

Background

In various problem-solving techniques, we often need to find the i'th bit of a number. For example, let's say we have the number 5; its binary notation is 101. The 0th bit is 1, the 1st bit is 0 and the 2nd bit is again 1. In this post, we will see how to get the ith bit of a given number in Python.

Use of getting i'th bit

Before we see how to get ith bit let's try to understand a problem-solving use case where we can use this logic. Consider the following problem statement:

Given an array of integers, "data" , find all possible subsets of data, including the empty set.

The question is basically asking us to find all possible subsets of a given array of integers.

For e.g, consider input as [2, 5, 7] the possible combinations are

i=0: {}
i=1: {2}
i=2: {5}
i=3: {2,5}
i=4: {7}
i=5: {2,7}
i=6: {5,7}
i=7: {2,5,7}

The total combinations possible for a given list is 2^N where N is the length of the array. In the above case since N=3 (length of input array) the expected combinations are 2^3 = 8 which is exactly what you see above.

Now to compute these we iterate from i=0 to 2^N-1 which in this case will be from 0 to 7.

For each of these numbers, we can find the binary notation and then check if the binary bit is 1 for a given index and include it in the combination.

For example, let's take i=3. Its binary notation is 011; reading the bits from bit 0 (the right-most bit) upwards gives 1, 1, 0, which means we should take the elements at positions 0 and 1 of the original array. Since the original array was [2,5,7], the subset is {2,5}, which is what you see in the above combination list as well.
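The selection rule just described can be sketched as follows. The helper name subset_for is hypothetical and only used for this illustration:

```python
data = [2, 5, 7]


def subset_for(i):
    # Keep data[j] whenever bit j of i is set
    return [data[j] for j in range(len(data)) if (i >> j) & 1]


print(format(3, "03b"), subset_for(3))   # 011 [2, 5]
print(format(5, "03b"), subset_for(5))   # 101 [2, 7]
```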

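For all "i" values, the binary notation and the subsets computed are shown below:

i  binary  subset
0  000     {}
1  001     {2}
2  010     {5}
3  011     {2,5}
4  100     {7}
5  101     {2,7}
6  110     {5,7}
7  111     {2,5,7}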

Now let's see the Python code to get the ith bit for a number.

NOTE: In Python, to compute 2^N we write 2 ** N. In Java you would use Math.pow(2, N).

How to get "i" th bit of a number in Python

We will test the same example we took above, where num = 3, and get all of its bits to decide which positions of the original array to include in the current subset. Python code for the above is as follows:

def get_ith_bit(num, i):
    # Shift the operand specified number of bits to the left
    temp = (1 << i)
    temp = temp & num
    if temp == 0:
        return 0
    return 1

print(get_ith_bit(3, 0))
print(get_ith_bit(3, 1))
print(get_ith_bit(3, 2))

Output is:

1
1
0

Reading from bit 0 upwards this is 1, 1, 0, the same pattern we saw before.
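An equivalent and slightly shorter way to write the same check is to shift the number right instead of shifting the mask left. This is just an alternative sketch of the function above, not a different technique:

```python
def get_ith_bit(num, i):
    # Move bit i into the lowest position, then mask everything else off
    return (num >> i) & 1


bits = [get_ith_bit(3, i) for i in range(3)]
print(bits)   # [1, 1, 0]
```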

Complete code

The complete code for the problem statement is as follows:

PS: Given an array of integers, "data" , find all possible subsets of data, including the empty set.

def get_ith_bit(num, i):
    # Shift the operand specified number of bits to the left
    temp = (1 << i)
    temp = temp & num
    if temp == 0:
        return 0
    return 1

def find_all_subsets(data):
    subsets = []
    # There are 2^N subsets; an empty input gives just the empty subset
    combinations = 2 ** len(data)
    for i in range(combinations):
        # Use a list so elements keep the order they have in data
        subset = []
        for j in range(len(data)):
            if get_ith_bit(i, j) == 1:
                subset.append(data[j])
        subsets.append(subset)
    return subsets

print(find_all_subsets([2,5,7]))

Output:

[[], [2], [5], [2, 5], [7], [2, 7], [5, 7], [2, 5, 7]]

Sunday, 12 January 2025

Working with heaps in Python

Background

As we all know a heap is a complete binary tree that satisfies the heap property:

Min heap: The value of children is greater than or equal to the parent node. This means the root node of a min heap stores the lowest value or minimum value data.
Max heap: The value of children is smaller than or equal to the parent. This means that the root node of a max heap stores the highest value or the maximum value data.

Heaps are used to implement priority queues for various use cases. See the below diagrams on what constitutes a valid and invalid heap (Credit: GeeksForGeeks)

Valid min heaps

Invalid min heaps

Valid max heaps

Invalid max heaps

It is worth reading up on how to insert into a heap, how to remove from a heap, how to heapify an array, how to use an array to represent a heap (the above diagrams show the binary tree representation), how to use heaps for implementing a priority queue, etc.
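
To make the array representation a bit more concrete: in the usual 0-based layout, the children of the node at index i sit at indexes 2*i + 1 and 2*i + 2. The sketch below checks the min heap property on a plain list under that assumption. The helper is_min_heap is hypothetical and not part of any library.

```python
def is_min_heap(arr):
    n = len(arr)
    for i in range(n):
        left, right = 2 * i + 1, 2 * i + 2
        # Every existing child must be >= its parent
        if left < n and arr[left] < arr[i]:
            return False
        if right < n and arr[right] < arr[i]:
            return False
    return True


print(is_min_heap([2, 4, 3, 14, 5, 24, 9]))  # True
print(is_min_heap([3, 5, 9, 14, 4, 24, 2]))  # False: 4 is smaller than its parent 5
```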

The time complexity for various heap operations is as follows:

Working with heaps in Python

Now that we have brushed up on the heap data structure, let's get to the original purpose of this post, which is to understand how to work with heaps in Python.

In Java, you would usually use the PriorityQueue implementation to work with heaps. In Python, we use the built-in heapq module.

Look at the following code:

import heapq

data = [3, 5, 9, 14, 4, 24, 2]
print(f"Original data: {data}")
heapq.heapify(data) # Create min heap
print(f"After heapify: {data}")
heapq.heappush(data, 1)
print(f"After pushing to heap: {data}")
heapq.heappop(data)
print(f"After popping from heap: {data}")

It prints:

Original data: [3, 5, 9, 14, 4, 24, 2]
After heapify: [2, 4, 3, 14, 5, 24, 9]
After pushing to heap: [1, 2, 3, 4, 5, 24, 9, 14]
After popping from heap: [2, 4, 3, 14, 5, 24, 9]

As you can see, we initially had a simple list called data, which we then converted into a min heap (the default heap behavior) using the heapify method (O(N) time complexity). We then pushed an element to the min heap (O(log N) time complexity) and popped one, which removes the element at the root, the minimum element in the case of a min heap (also O(log N) time complexity).

NOTE: If you want max heap implementation you can just negate the data and push it in the heap.
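
A minimal sketch of the negation trick, reusing the sample data from above. The variable names are only illustrative, and this assumes the values are numbers so that negating them reverses their order.

```python
import heapq

data = [3, 5, 9, 14, 4, 24, 2]
max_heap = [-x for x in data]  # store negated values
heapq.heapify(max_heap)

largest = -heapq.heappop(max_heap)  # negate again on the way out
print(largest)

heapq.heappush(max_heap, -30)
print(-max_heap[0])  # peek at the current maximum without popping
```

The smallest negated value is always at the root, so negating it back gives the maximum of the original data.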

NOTE: Heap elements can be tuples. This is useful for assigning comparison values (such as task priorities) alongside the main record being tracked. The 1st element in the tuple is used for the comparison/priority.

See below code of how we can add tuple in the heap:

import heapq

data = []
heapq.heappush(data, (3, "China"))
heapq.heappush(data, (1, "India"))
heapq.heappush(data, (2, "USA"))
print(data)

It prints:

[(1, 'India'), (3, 'China'), (2, 'USA')]

PriorityQueue class

We can also use the PriorityQueue class (from the queue module) directly in Python. It uses the heapq library internally but differs in the following ways:

It's synchronized, so it can be safely shared between concurrent threads.
It's a class interface as opposed to the function-based interface of heapq.

The following example shows how to use the PriorityQueue class in Python:

from queue import PriorityQueue

pQueue = PriorityQueue()
pQueue.put((3, "China"))
pQueue.put((1, "India"))
pQueue.put((2, "USA"))

while not pQueue.empty():
    print(pQueue.get())

Above also prints:

(1, 'India')
(2, 'USA')
(3, 'China')
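
Since PriorityQueue is synchronized, one queue can be filled from several threads without extra locking. The sketch below shows that idea; the producer function and the way the items are split between the two threads are assumptions made for this example.

```python
import threading
from queue import PriorityQueue

tasks = PriorityQueue()
results = []


def producer(items):
    for item in items:
        tasks.put(item)  # put() is safe to call from multiple threads


threads = [
    threading.Thread(target=producer, args=([(3, "China"), (1, "India")],)),
    threading.Thread(target=producer, args=([(2, "USA")],)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until both producers are done

while not tasks.empty():
    results.append(tasks.get())
print(results)
```

Whichever thread runs first, the items still come out in priority order once both producers have finished.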

Saturday, 11 January 2025

Working with dictionaries in Python

Background

In this post, we will see how we can use dictionaries and specifically how we can enable a custom class in Python to be used as a dict key. I am originally from a Java background and have written posts before on how HashMap/HashTable works in Java (See Related links section towards the end of this post).

If you are new to dictionaries:

Dictionaries are data structures that store data as key: value pairs
Data stored in dict is unique (does not allow duplicate keys), mutable (can edit), and ordered

NOTE: As of Python version 3.7, dictionaries are ordered. In Python 3.6 and earlier, dictionaries are unordered.

Dictionaries in Python

You can initialize a dictionary in Python using {} or the dict() constructor

my_dict = {}
my_dict_1 = {"Name": "Aniket", "Country": "India"}
my_dict_2 = dict(name="Aniket")

You can print these and see the dictionary. It will print

{}

{'Name': 'Aniket', 'Country': 'India'}

{'name': 'Aniket'}

Before Python 3.7, a normal dict like the above was unordered, which means the order in which you inserted data was not guaranteed to be maintained; iterating over the items could give you a different order. This is the same as HashMap in Java. Similar to LinkedHashMap in Java, Python has OrderedDict (in the collections module) for cases where insertion order must be preserved. As of Python 3.7, plain dictionaries also preserve insertion order.

You can also merge one dict into another using the update method. Remember that the keys of a dict are unique and, similar to HashMap, if you insert a key that is already present in the dict, the existing value is overwritten with the new value (see the code below).

my_dict = {}
my_dict["name"] = "Aniket"
my_dict["Country"] = "India"
for key in my_dict:
    print(f"{key} : {my_dict[key]}")
print("--------------")
others_dict = {}
others_dict["name"] = "John"
others_dict["State"] = "NYC"
for key in others_dict:
    print(f"{key} : {others_dict[key]}")
print("--------------")
my_dict.update(others_dict)
for key in my_dict:
    print(f"{key} : {my_dict[key]}")

Output is:

name : Aniket

Country : India

--------------

name : John

State : NYC

--------------

name : John

Country : India

State : NYC

See how the value of the name key was overridden by the value from the new dict, the Country key was not overridden so it stayed the same, and finally a new key called State was added.
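
If you do not want to modify the original dict, the same merge can be sketched without update. The snippet below uses dict unpacking and, assuming Python 3.9 or later, the | operator. The variable names merged and merged_39 are only illustrative.

```python
my_dict = {"name": "Aniket", "Country": "India"}
others_dict = {"name": "John", "State": "NYC"}

merged = {**my_dict, **others_dict}  # keys from the later dict win
print(merged)
print(my_dict)  # the original dict is unchanged

# Python 3.9+ also offers the | operator for the same merge
merged_39 = my_dict | others_dict
print(merged == merged_39)
```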

Using a custom class as a key in dict

In Java, for a new class you would typically override the equals and hashCode methods to make it work as a map key. In Python we have to do something similar.

Let's create an Employee class with just name and age, try to create objects out of it, push to dict and print it.

class Employee:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return f"Employee with name: {self.name} & age: {self.age}"


my_dict = {}
e1 = Employee("Aniket", 30)
e2 = Employee("Aniket", 30)
my_dict[e1] = 1
my_dict[e2] = 2
for key, value in my_dict.items():
    print(f"{key} : {value}")

Output is:

Employee with name: Aniket & age: 30 : 1

Employee with name: Aniket & age: 30 : 2

As you can see from the output, the dict treated the two equal employees as different keys and added separate entries. This happens because, by default, objects are compared and hashed by identity, so e1 and e2 are different keys. Since the employee (uniquely defined by name and age) is the same, we want just one entry in the dict for a unique name and age. In this case it should have overridden the value from 1 to 2 and printed a single entry with value 2. Let's see how we do that.

For the above to work similarly to Java (where we override equals and hashcode), in Python we override the two methods below

__eq__()
__hash__()

So your code will now look like

class Employee:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return f"Employee with name: {self.name} & age: {self.age}"

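    def __eq__(self, other):
        # Only compare against other Employee objects
        if not isinstance(other, Employee):
            return NotImplemented
        return (self.name, self.age) == (other.name, other.age)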

    def __hash__(self):
        return hash((self.name, self.age))

my_dict = {}
e1 = Employee("Aniket", 30)
e2 = Employee("Aniket", 30)
my_dict[e1] = 1
my_dict[e2] = 2
for key, value in my_dict.items():
    print(f"{key} : {value}")

This now prints:

Employee with name: Aniket & age: 30 : 2

which is in line with our expectations.
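
As an alternative sketch (not what this post uses above), a frozen dataclass generates __eq__ and __hash__ from the declared fields, so its instances can be used as dict keys without writing those methods by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    name: str
    age: int


my_dict = {}
my_dict[Employee("Aniket", 30)] = 1
my_dict[Employee("Aniket", 30)] = 2  # same name and age, so same key
print(len(my_dict))
print(my_dict[Employee("Aniket", 30)])
```

frozen=True also makes the instances immutable, which is what you want for a key: its hash should not change while it sits in the dict.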

See the below diagram for all methods supported by Python dict

Wednesday, 8 January 2025

Different ways to iterate over a list in Python

Background

In this post, we will see the different ways to iterate over a list in Python.

Different ways to iterate over a list in Python

Using for loop

A simple way to iterate over a list is to use a for loop.

  
names = ["A", "B", "C"]
for name in names:
    print(name)

This prints:

A

B

C

Using for loop with range

You can also use a for loop with range if you want to access the list by its index

  
names = ["A", "B", "C"]
for i in range(len(names)):
    print(names[i])

This also prints:

A

B

C

Using enumerate

If you want index and value both then you can use enumerate as follows

  
names = ["A", "B", "C"]
for idx, name in enumerate(names):
    print(f"{name} at index {idx}")

This prints

A at index 0

B at index 1

C at index 2

Using while loop

You can also use a while loop for iterating over a list as follows

  
names = ["A", "B", "C"]
i=0
while i<len(names):
    print(names[i])
    i = i + 1

This prints:

A

B

C

List Comprehension

List comprehension is a more concise way to iterate over a list.

  
names = ["A", "B", "C"]
[print(name) for name in names]

This prints:

A

B

C

The comprehension itself evaluates to [None, None, None], because print returns None for every element, so no useful new list is created.

NOTE: This creates a new list and is not a recommended way to iterate over a list only for side effects such as printing. Use a list comprehension when you want to build a new list from an existing one by filtering or transforming its elements.
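
A short sketch of the use case where a list comprehension does fit, building a new list by transforming or filtering the original one. The names lowered and without_b are only illustrative.

```python
names = ["A", "B", "C"]

lowered = [name.lower() for name in names]           # transform every element
without_b = [name for name in names if name != "B"]  # filter elements

print(lowered)
print(without_b)
```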

Tuesday, 7 January 2025

Slicing strings in Python

Background

Working with strings is very common in any programming language, and slicing comes in handy for this in Python. In this post, we will try to understand how slicing works for strings in Python.

Slicing in Python

The syntax for slicing is: Object[start:stop:step] where

"start" specifies the starting index of a slice
"stop" specifies the ending index of a slice (the element at this index is not included)
"step" specifies the number of elements to jump/step between start and stop.

Slicing can be done with negative indexes as well. If start/stop is negative, then the index is counted backward from the end (See diagram below to understand the indexes - positive & negative).
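
The relationship between the two kinds of index can also be sketched in code: for a sequence of length n, the negative index for position pos is pos - n. The loop below prints both for each character of a sample string.

```python
text = "ABCDE"
for pos, ch in enumerate(text):
    neg = pos - len(text)  # the same character counted from the end
    print(f"{ch}: index {pos} or {neg}")

print(text[-1] == text[len(text) - 1])
```

So -1 always refers to the last element and -len(text) to the first one.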

See a few examples below

  
text = "ABCDE"
print(text[0:3])
print(text[0:3:2])
print(text[-4:-1])
print(text[-4:-1:2])

This prints:

ABC

AC

BCD

BD

You can also have the step as a negative number which will essentially consider elements backwards (reversing the data structure). See the following examples (more details on using this to reverse data structure are added in a later section).

  
text = "ABCDE"
print(text[3:0:-1])
print(text[3:0:-2])
print(text[::-1])
print(text[-1:-5:-1])

This prints:

DCB

DB

EDCBA

EDCB

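NOTE: text[::] or text[:] selects every element. For a mutable sequence such as a list, this creates a new (shallow) copy; for an immutable string, the result is simply an equal string.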
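
A quick sketch of the copy behaviour with a list, where it actually matters. The variable names are only illustrative.

```python
original = [1, 2, 3]
copy = original[:]  # new list with the same elements
copy.append(4)

print(original)
print(copy)
print(copy is original)
```

Changing the copy leaves the original list untouched because they are two separate list objects.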

Reversing Elements of Data Structure

You can use a negative step to reverse the elements of the following data structures

  
# list
l = ["a", "b", "c"]
print(l)
print(l[::-1])

# string
s = "abc"
print(s)
print(s[::-1])

# tuple
t = ("a", "b", "c")
print(t)
print(t[::-1])

This prints:

['a', 'b', 'c']

['c', 'b', 'a']

abc

cba

('a', 'b', 'c')

('c', 'b', 'a')

Cheat Sheet

a[start:stop] # items start through stop-1
a[start:] # items start through the rest of the array
a[:stop] # items from the beginning through stop-1
a[:] # a copy of the whole array

We can also pass step to the slicing which determines how many steps to jump for slicing

a[start:stop:step] # start through not past stop, by step

Step can also be a negative number

a[::-1] # all items in the array, reversed
a[1::-1] # the first two items, reversed
a[:-3:-1] # the last two items, reversed
a[-3::-1] # everything except the last two items, reversed
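
The entries of the cheat sheet can be tried out on a small list. This is just a sketch with made-up sample values:

```python
a = [0, 1, 2, 3, 4]

print(a[1:3])     # items at index 1 and 2
print(a[::2])     # every second item
print(a[::-1])    # all items, reversed
print(a[1::-1])   # the first two items, reversed
print(a[:-3:-1])  # the last two items, reversed
print(a[-3::-1])  # everything except the last two items, reversed
```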

Saturday, 1 June 2024

Understanding descriptors in Python

Background

As you know, Python does not have a concept of private variables or getters/setters. In one of the previous posts, we saw the use of property to achieve something similar. In this post, we will examine the concept underlying the property functionality, called descriptors. Using the descriptors is the pythonic way to handle attributes of a class. Descriptors are the mechanisms behind properties, methods, static methods, class methods, and super().

Understanding descriptors

Before we look at descriptors, let's take an example to see why we need them. Consider a simple Employee class below:

class Employee:
    def __init__(self, name):
        self.name = name


emp = Employee("Aniket")
print(emp.name)

For simplicity, it just has one instance variable called "name". In real life you would like to have some check when you set the name of an Employee, for example ensuring that it has at least one character and at most 10 characters. We saw how to do this via properties in the last post; in this post we will see how to use descriptors, which are the basis of property as well.

class Name:
    def __set__(self, instance, value):
        print("Invoking __set__ on Name")
        if not isinstance(instance, Employee):
            raise ValueError("Name descriptor is to be used with Employee instance only")
        if len(value) < 1 or len(value) > 10:
            raise ValueError("Name cannot be less than 1 char or more than 10 char")
        instance._name = value

    def __get__(self, instance, owner):
        print("Invoking __get__ on Name")
        if not isinstance(instance, Employee):
            raise ValueError("Name descriptor is to be used with Employee instance only")
        return instance._name


class Employee:
    name = Name()

    def __init__(self, name):
        self._name = None
        self.name = name


# Uncommenting the next two lines raises ValueError, because "" has less than 1 char
# emp = Employee("")
# print(emp.name)


emp = Employee("Aniket")
print(emp.name)
emp.name = "Abhijit"
print(emp.name)

The above code defines a descriptor called "Name" and uses it to manage attributes for the Employee class instead. Above prints:

Invoking __set__ on Name

Invoking __get__ on Name

Aniket

Invoking __set__ on Name

Invoking __get__ on Name

Abhijit

You can play around with passing the following names in the constructor:

"Aniket" - Works fine & prints Aniket
"" - Fails & prints ValueError: Name cannot be less than 1 char or more than 10 char
"Aniket Thakur" - Fails & prints ValueError: Name cannot be less than 1 char or more than 10 char

See how we now have more granular control over the Name attribute of the Employee class. That's the power of descriptors.

Notice how emp.name = "Abhijit" works. Normally it would have set name attribute of Employee class to a string "Abhijit" but since in this case it is a descriptor it called __set__ dunder / magic method of the corresponding descriptor class.
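
The Name descriptor above is tied to the Employee class and to the _name attribute. As a rough sketch of how the same idea could be made reusable, the version below relies on the __set_name__ hook (available since Python 3.6) to learn which attribute it manages. The BoundedString class, the city attribute and the sample values are all hypothetical and only here for illustration.

```python
class BoundedString:
    # Hypothetical reusable descriptor: validates the length of any string attribute
    def __init__(self, min_len=1, max_len=10):
        self.min_len = min_len
        self.max_len = max_len

    def __set_name__(self, owner, name):
        # Called once when the owning class is created (Python 3.6+)
        self.private_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class itself
        return getattr(instance, self.private_name)

    def __set__(self, instance, value):
        if not self.min_len <= len(value) <= self.max_len:
            raise ValueError(
                f"Expected {self.min_len} to {self.max_len} chars, got {len(value)}"
            )
        setattr(instance, self.private_name, value)


class Employee:
    name = BoundedString()
    city = BoundedString(max_len=20)

    def __init__(self, name, city):
        self.name = name
        self.city = city


emp = Employee("Aniket", "Pune")
print(emp.name, emp.city)
try:
    emp.name = ""
except ValueError as err:
    print(f"Rejected: {err}")
```

Each descriptor instance stores its value under its own private attribute name, so one class can serve several attributes.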

Descriptor protocol

A class is a descriptor if it defines at least one of the following methods:

__get__(self, obj, type=None) -> value
__set__(self, obj, value) -> None
__delete__(self, obj) -> None

Define any of these methods and an object is considered a descriptor and can override default behavior upon being looked up as an attribute.

If an object defines __set__() or __delete__() , it is considered a data descriptor.
Descriptors that only define __get__() are called non-data descriptors

NOTE: Data descriptors always override instance dictionaries.

The example we saw above was of a data descriptor. Consider following example

class Name:

    def __get__(self, instance, owner):
        print("Invoking __get__ on Name")
        if not isinstance(instance, Employee):
            raise ValueError("Name descriptor is to be used with Employee instance only")
        return instance._name


class Employee:
    name = Name()

    def __init__(self, name):
        self._name = name


emp = Employee("Aniket")
print(emp.name)
emp.name = "Abhijit"
print(emp.name)

Here we do not have a __set__ method and consequently, it is not a data descriptor hence we can override it with instance dictionaries. Above prints

Invoking __get__ on Name

Aniket

Abhijit

Notice how it invoked __get__ exactly once. When we set emp.name to "Abhijit", the plain string was stored in the instance dictionary, and since Name is a non-data descriptor here, the instance dictionary entry takes precedence over it on later lookups.

One last thing: as you saw above, a class is a data descriptor if it has __set__ or __delete__. So if it does not define __set__ but defines __delete__, it is still a data descriptor and you cannot override it in the instance dictionary.

class Name:

    def __get__(self, instance, owner):
        print("Invoking __get__ on Name")
        if not isinstance(instance, Employee):
            raise ValueError("Name descriptor is to be used with Employee instance only")
        return instance._name

    def __delete__(self, instance):
        del instance._name


class Employee:
    name = Name()

    def __init__(self, name):
        self._name = name


emp = Employee("Aniket")
print(emp.name)
emp.name = "Abhijit"
print(emp.name)

The above code will fail by printing

Invoking __get__ on Name

Aniket

Traceback (most recent call last):

File "/Users/aniketthakur/PycharmProjects/HelloWorld/descriptors.py", line 29, in <module>

emp.name = "Abhijit"

AttributeError: __set__

and that is because it could not find a __set__ method on the data descriptor.
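
To see both lookup rules side by side, here is a minimal sketch with one non-data and one data descriptor on the same class. The class names NonData, Data and Demo are made up for this example, and the values are written straight into the instance dictionary to bypass __set__.

```python
class NonData:
    def __get__(self, instance, owner=None):
        return "from non-data descriptor"


class Data:
    def __get__(self, instance, owner=None):
        return "from data descriptor"

    def __set__(self, instance, value):
        raise AttributeError("read-only")


class Demo:
    a = NonData()
    b = Data()


d = Demo()
# Write straight into the instance dictionary, bypassing __set__
d.__dict__["a"] = "from instance dict"
d.__dict__["b"] = "from instance dict"

print(d.a)  # the instance dict wins over a non-data descriptor
print(d.b)  # the data descriptor wins over the instance dict
```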
