5

I have data of a plot on two arrays that are stored in unsorted way, so the plot jumps from one place to another discontinuously: enter image description here I have tried one example of finding the closest point in a 2D array:

import numpy as np

def distance(pt_1, pt_2):
    pt_1 = np.array((pt_1[0], pt_1[1]))
    pt_2 = np.array((pt_2[0], pt_2[1]))
    return np.linalg.norm(pt_1-pt_2)

def closest_node(node, nodes):
    nodes = np.asarray(nodes)
    dist_2 = np.sum((nodes - node)**2, axis=1)
    return np.argmin(dist_2)

a = []
for x in range(50000):
    a.append((np.random.randint(0,1000),np.random.randint(0,1000)))
some_pt = (1, 2)

closest_node(some_pt, a)

Can I use it somehow to "clean" my data? (in the above code, a can be my data)

Exemplary data from my calculations is:

array([[  2.08937872e+001,   1.99020033e+001,   2.28260611e+001,
          6.27711094e+000,   3.30392288e+000,   1.30312878e+001,
          8.80768833e+000,   1.31238275e+001,   1.57400130e+001,
          5.00278061e+000,   1.70752624e+001,   1.79131456e+001,
          1.50746185e+001,   2.50095731e+001,   2.15895974e+001,
          1.23237801e+001,   1.14860312e+001,   1.44268222e+001,
          6.37680265e+000,   7.81485403e+000],
       [ -1.19702178e-001,  -1.14050879e-001,  -1.29711421e-001,
          8.32977493e-001,   7.27437322e-001,   8.94389885e-001,
          8.65931116e-001,  -6.08199292e-002,  -8.51922900e-002,
          1.12333841e-001,  -9.88131292e-324,   4.94065646e-324,
         -9.88131292e-324,   4.94065646e-324,   4.94065646e-324,
          0.00000000e+000,   0.00000000e+000,   0.00000000e+000,
         -4.94065646e-324,   0.00000000e+000]])
  • After using radial_sort_line (of Joe Kington) I have received the following plot: enter image description here
10
  • could you post your data or how you get the data? Commented Feb 24, 2016 at 16:04
  • 1
    a is not what made the plot that you're showing Commented Feb 24, 2016 at 16:16
  • 1
    Also, do you really need line plots? can't you plot the data points only (without the lines)? Commented Feb 24, 2016 at 16:16
  • 2
    Looks like what you need is just to sort your data by its y values, but post some sample data will help. Commented Feb 24, 2016 at 16:28
  • 1
    @Ohm, I think a nearest neighbor approach with a suitable distance metrics might be the best solution, see edit. Commented Mar 7, 2016 at 17:44

3 Answers 3

8

This is actually a problem that's tougher than you might think in general.

In your exact case, you might be able to get away with sorting by the y-values. It's hard to tell for sure from the plot.

Therefore, a better approach for somewhat circular shapes like this is to do a radial sort.

For example, let's generate some data somewhat similar to yours:

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(.2, 1.6 * np.pi)
x, y = np.cos(t), np.sin(t)

# Shuffle the points...
i = np.arange(t.size)
np.random.shuffle(i)
x, y = x[i], y[i]

fig, ax = plt.subplots()
ax.plot(x, y, color='lightblue')
ax.margins(0.05)
plt.show()

enter image description here

Okay, now let's try to undo that shuffle by using a radial sort. We'll use the centroid of the points as the center and calculate the angle to each point, then sort by that angle:

x0, y0 = x.mean(), y.mean()
angle = np.arctan2(y - y0, x - x0)

idx = angle.argsort()
x, y = x[idx], y[idx]

fig, ax = plt.subplots()
ax.plot(x, y, color='lightblue')
ax.margins(0.05)
plt.show()

enter image description here

Okay, pretty close! If we were working with a closed polygon, we'd be done.

However, we have one problem -- This closes the wrong gap. We'd rather have the angle start at the position of the largest gap in the line.

Therefore, we'll need to calculate the gap to each adjacent point on our new line and re-do the sort based on a new starting angle:

dx = np.diff(np.append(x, x[-1]))
dy = np.diff(np.append(y, y[-1]))
max_gap = np.abs(np.hypot(dx, dy)).argmax() + 1

x = np.append(x[max_gap:], x[:max_gap])
y = np.append(y[max_gap:], y[:max_gap])

Which results in:

enter image description here

As a complete, stand-alone example:

import numpy as np
import matplotlib.pyplot as plt

def main():
    x, y = generate_data()
    plot(x, y).set(title='Original data')

    x, y = radial_sort_line(x, y)
    plot(x, y).set(title='Sorted data')

    plt.show()

def generate_data(num=50):
    t = np.linspace(.2, 1.6 * np.pi, num)
    x, y = np.cos(t), np.sin(t)

    # Shuffle the points...
    i = np.arange(t.size)
    np.random.shuffle(i)
    x, y = x[i], y[i]

    return x, y

def radial_sort_line(x, y):
    """Sort unordered verts of an unclosed line by angle from their center."""
    # Radial sort
    x0, y0 = x.mean(), y.mean()
    angle = np.arctan2(y - y0, x - x0)

    idx = angle.argsort()
    x, y = x[idx], y[idx]

    # Split at opening in line
    dx = np.diff(np.append(x, x[-1]))
    dy = np.diff(np.append(y, y[-1]))
    max_gap = np.abs(np.hypot(dx, dy)).argmax() + 1

    x = np.append(x[max_gap:], x[:max_gap])
    y = np.append(y[max_gap:], y[:max_gap])
    return x, y

def plot(x, y):
    fig, ax = plt.subplots()
    ax.plot(x, y, color='lightblue')
    ax.margins(0.05)
    return ax

main()
Sign up to request clarification or add additional context in comments.

3 Comments

This works great, but for some cases when I plot the data it fills up areas in which there is more than one y value for a direction..
@joeKington, I am facing now similar problem of sorting coordinates in geometry such that two consecutive coordinate number place next to each other. Would you please see my question?
I am right now facing a similar problem. I am working on the task in which I have unordered inputs of coordinates and I have to make these coordinates aligned in such as way that, two consecutive coordinates place next to each other in the geometry. Would you please see my question and give some useful suggestions? stackoverflow.com/q/73528016/17289097
3

Sorting the data base on their angle relative to the center as in @JoeKington 's solution might have problems with some parts of the data:

In [1]:

import scipy.spatial as ss
import matplotlib.pyplot as plt
import numpy as np
import re
%matplotlib inline
In [2]:

data=np.array([[  2.08937872e+001,   1.99020033e+001,   2.28260611e+001,
                  6.27711094e+000,   3.30392288e+000,   1.30312878e+001,
                  8.80768833e+000,   1.31238275e+001,   1.57400130e+001,
                  5.00278061e+000,   1.70752624e+001,   1.79131456e+001,
                  1.50746185e+001,   2.50095731e+001,   2.15895974e+001,
                  1.23237801e+001,   1.14860312e+001,   1.44268222e+001,
                  6.37680265e+000,   7.81485403e+000],
               [ -1.19702178e-001,  -1.14050879e-001,  -1.29711421e-001,
                  8.32977493e-001,   7.27437322e-001,   8.94389885e-001,
                  8.65931116e-001,  -6.08199292e-002,  -8.51922900e-002,
                  1.12333841e-001,  -9.88131292e-324,   4.94065646e-324,
                 -9.88131292e-324,   4.94065646e-324,   4.94065646e-324,
                  0.00000000e+000,   0.00000000e+000,   0.00000000e+000,
                 -4.94065646e-324,   0.00000000e+000]])
In [3]:

plt.plot(data[0], data[1])
plt.title('Unsorted Data')
Out[3]:
<matplotlib.text.Text at 0x10a5c0550>

enter image description here

See x values between 15 and 20 are not sorted correctly.

In [10]:

#Calculate the angle in degrees of [0, 360]
sort_index = np.angle(np.dot((data.T-data.mean(1)), np.array([1.0, 1.0j])))
sort_index = np.where(sort_index>0, sort_index, sort_index+360)

#sorted the data by angle and plot them
sort_index = sort_index.argsort()
plt.plot(data[0][sort_index], data[1][sort_index])
plt.title('Data Sorted by angle relatively to the centroid')

plt.plot(data[0], data[1], 'r+')
Out[10]:
[<matplotlib.lines.Line2D at 0x10b009e10>]

enter image description here

We can sort the data based on a nearest neighbor approach, but since the x and y are of very different scale, the choice of distance metrics becomes an important issue. We will just try all the distance metrics available in scipy to get an idea:

In [7]:

def sort_dots(metrics, ax, start):
    dist_m = ss.distance.squareform(ss.distance.pdist(data.T, metrics))

    total_points = data.shape[1]
    points_index = set(range(total_points))
    sorted_index = []
    target    = start
    ax.plot(data[0, target], data[1, target], 'o', markersize=16)

    points_index.discard(target)
    while len(points_index)>0:
        candidate = list(points_index)
        nneigbour = candidate[dist_m[target, candidate].argmin()]
        points_index.discard(nneigbour)
        points_index.discard(target)
        #print points_index, target, nneigbour
        sorted_index.append(target)
        target    = nneigbour
    sorted_index.append(target)

    ax.plot(data[0][sorted_index], data[1][sorted_index])
    ax.set_title(metrics)
In [6]:

dmetrics = re.findall('pdist\(X\,\s+\'(.*)\'', ss.distance.pdist.__doc__)
In [8]:

f, axes = plt.subplots(4, 6, figsize=(16,10), sharex=True, sharey=True)
axes = axes.ravel()
for metrics, ax in zip(dmetrics, axes):
    try:
        sort_dots(metrics, ax, 5)
    except:
        ax.set_title(metrics + '(unsuitable)')

It looks like standardized euclidean and mahanalobis metrics give the best result. Note that we choose a starting point of the 6th data (index 5), it is the data point this the largest y value (use argmax to get the index, of course).

enter image description here

In [9]:

f, axes = plt.subplots(4, 6, figsize=(16,10), sharex=True, sharey=True)
axes = axes.ravel()
for metrics, ax in zip(dmetrics, axes):
    try:
        sort_dots(metrics, ax, 13)
    except:
        ax.set_title(metrics + '(unsuitable)')

This is what happens if you choose the starting point of max. x value (index 13). It appears that mahanalobis metrics is better than standardized euclidean as it is not affected by the starting point we choose.

enter image description here

1 Comment

Great idea, can it be used also to split the data into the different curves? The real data should contain a sort of semi circular curve, opening to the right, together with a straight horizontal line..
1

If we do the assumption that the data are 2D and the x axis should be in an increasing fashion, then you could:

  • sort the x axis data, e.g. x_old and store the result in a different variable, e.g. x_new
  • for each element in the x_new find its index in the x_old array
  • re-order the elements in the y_axis array according to the indices that you got from previous step

I would do it with python list instead of numpy array due to list.index method been more easily manipulated than the numpy.where method.

E.g. (and assume that x_old and y_old are your previous numpy variables for x and y axis respectively)

import numpy as np

x_new_tmp = x_old.tolist()
y_new_tmp = y_old.tolist()

x_new = sorted(x_new_tmp)

y_new = [y_new_tmp[x_new_tmp.index(i)] for i in x_new]

Then you can plot x_new and y_new

2 Comments

for the image shown in the question, sorting by x values will not work here. This could possibly work for y values
As tom noted, this won't work for x in this case, but might for y. Regardless, don't use a list for this if you have numpy arrays. Instead do x, y = x[y.argsort()], x[y.argsort()].

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.