Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-monotonic dataset #81

Open
tt7533 opened this issue Jan 4, 2021 · 9 comments
Open

Non-monotonic dataset #81

tt7533 opened this issue Jan 4, 2021 · 9 comments

Comments

@tt7533
Copy link

tt7533 commented Jan 4, 2021

Do you have any recommendations in case the dataset is not monotonically increasing as in the simple example below? It is clear that there are 2 linear segments but I suspect that the non-montonic behavior in x is creating the issue...

x = np.array([0.42486339, 0.55889496, 0.70537341, 0.79477838, 0.91180935, 1.        , 0.95309654, 0.85868245, 0.77580449, 0.68867638, 0.60731633, 0.523983  , 0.47161506, 0.39268367, 0.3167881 , 0.21584699, 0.11399514, 0.        ])

y = np.array([0.        , 0.12927029, 0.25908718, 0.34517628, 0.48018584, 0.64744466, 0.828915  , 0.83684067, 0.83602077, 0.84312654, 0.81279038, 0.81661656, 0.80595791, 0.81361028, 0.79967204, 0.84531293, 0.9150041 , 1.        ])

my_pwlf = pwlf.PiecewiseLinFit(x, y)

res = my_pwlf.fit(2)

xHat = np.linspace(min(x), max(x), num=10000)
yHat = my_pwlf.predict(xHat)

plt.figure()
plt.plot(x, y, 'o')
plt.plot(xHat, yHat, '-')
plt.show()

image

@cjekel
Copy link
Owner

cjekel commented Jan 4, 2021

If you are trying to fit a model that looks like:
image

It's not really going to be possible with pwlf. (I'm not sure if this is the type of fit you mean by monotonic)

There are some tricks you can do with arc lengths, but the setup and normalization can be tricky. One case is you can have one model, that predicts x and y as a function of the cumulative arc length. It may or may not work, and normalization will be very important. I can perhaps try to work on an example that does such.

@tt7533
Copy link
Author

tt7533 commented Jan 6, 2021

Yes, the purple lines are exactly the type of fit that I would like to get. In general, we can expect only 1 sharp kink and the purpose is to detect the direction of the lines before and after.

I saw that the same kind of data works just fine as long as the points define a function. For example, if you rotate the points in the picture so that there is only 1 y for a given x position, then the fit will work fine.

Although it would be possible the apply a rotation to the data before doing the pmlf fit, I wonder if it is not possible to do something smarter within pmlf itself. Does it make sense?

@cjekel
Copy link
Owner

cjekel commented Jan 7, 2021

Well.. Instead of doing a rotation, you could break x and y into separate 1 dimensional predictions. Let's say that both x and y are functions of the arc length distance from start to end. Then you can fit a 1D pwlf to x, and a 1D pwlf to y. In the end, you'll get something like this.

arclength2

arclength1

It's just a different way of thinking about the problem, I'm not sure how much help it will be. I can post the code that generated this if you are interested. It is one 3 line segment model to x, and another 3 segment model to y.

@tt7533
Copy link
Author

tt7533 commented Jan 8, 2021

Could you please share the code? It looks very good.

When you combine the 2 fits back together, how can you keep track of:

  • the linear coefficients for all the intermediate pieces
  • where the breakpoints are

@cjekel
Copy link
Owner

cjekel commented Jan 8, 2021

import numpy as np
import matplotlib.pyplot as plt
import pwlf
from scipy.spatial.distance import cdist

x = np.array([0.42486339, 0.55889496, 0.70537341, 0.79477838, 0.91180935,
              1.        , 0.95309654, 0.85868245, 0.77580449, 0.68867638,
              0.60731633, 0.523983  , 0.47161506, 0.39268367, 0.3167881 ,
              0.21584699, 0.11399514, 0.        ])

y = np.array([0.        , 0.12927029, 0.25908718, 0.34517628, 0.48018584,
              0.64744466, 0.828915  , 0.83684067, 0.83602077, 0.84312654,
              0.81279038, 0.81661656, 0.80595791, 0.81361028, 0.79967204,
              0.84531293, 0.9150041 , 1.        ])

# find the cumlative arc lengths
data = np.vstack((x, y)).T
n = x.size
a = data[0:n-1, :]
b = data[1:n, :]
d = cdist(a, b)
arcLengths = np.diagonal(d)
ArcLengths = np.zeros_like(x)
ArcLengths[1:] = np.cumsum(arcLengths)

# model for each x and y
my_pwlf_x = pwlf.PiecewiseLinFit(ArcLengths, x)
my_pwlf_y = pwlf.PiecewiseLinFit(ArcLengths, y)
res_x = my_pwlf_x.fit(3)
res_y = my_pwlf_y.fit(3)

# generate predictions
new_arc_lengths = np.linspace(ArcLengths.min(), ArcLengths.max(), 100)
xhat = my_pwlf_x.predict(new_arc_lengths)
yhat = my_pwlf_y.predict(new_arc_lengths)

# find new breakpoints... there is two from each fit of three segments
arc_length_breaks = np.array([res_x[1], res_x[2], res_y[1], res_y[2]])
arc_length_breaks_x = my_pwlf_x.predict(arc_length_breaks)
arc_length_breaks_y = my_pwlf_y.predict(arc_length_breaks)

plt.figure()
plt.plot(x, y, 'o-', label='Origingal data')
plt.plot(xhat, yhat, '-', label="pwlf fit")
plt.plot(arc_length_breaks_x, arc_length_breaks_y, 'o', label='breakpoints')
plt.legend()

plt.figure()
plt.plot(ArcLengths, x, label='X')
plt.plot(ArcLengths, y, label='Y')
plt.plot(new_arc_lengths, xhat, label='Xhat')
plt.plot(new_arc_lengths, yhat, label='Yhat')
plt.xlabel('Arc Length')
plt.show()

image

When you combine the 2 fits back together, how can you keep track of:

the linear coefficients for all the intermediate pieces

They will be in separate objects, one for the x predictions, and one for the y predictions.

where the breakpoints are

In this code, I show how you can figure where the new breakpoints are. It's a bit more complicated going in. If x has three segments, and y has three segments, then you can end up with 5 new arc length segments (I think, could be wrong)

I must say I think this fitting method is very cool! It seems this would extend to higher dimensions.

@tt7533
Copy link
Author

tt7533 commented Jan 8, 2021

That's great!

I would have to think more about the coefficients of the piecewise components. Since the fits are done on x/y vs. arclength, it is not immediately clear how to combine them together into a single set of coefficients in the original data representation. I guess it may be possible to estimate them from the breakpoints but it would be nicer (and probably more stable) to just combine the coefficients directly.

@tt7533 tt7533 closed this as completed Jan 8, 2021
@cjekel
Copy link
Owner

cjekel commented Jan 8, 2021

Which coefficients do you need directly?

The arc length stuff is definitely a more complicated modeling approach.

As far as doing one model, you could do fits to x as a function of y, and y as a function of x. If both x and y are normalized, then the fit that gives the lower sum of squared residuals (.ssr) is the correct rotation to perform the model. This would only work if one of the rotations ended up being a function (as you pointed out above).

@loveis98
Copy link

pwlf doesn't work for my case:

import numpy as np
import matplotlib.pyplot as plt
import pwlf

# your data
x = np.array([269.3048489 , 246.9444148 , 222.3136088 , 227.1488533 ,
       125.1934066 , 215.8092188 , 269.9395692 , 294.6073562 ,
       161.9504059 , 196.4256259 , 245.8174206 , 234.8829358 ,
       239.6989843 , 159.7639371 ,  81.80337907, 140.0753158 ,
        69.00680592,  76.18182163, 133.0220711 ,  95.4940555 ,
        89.97575454, 221.3455973 , 208.2065385 ,  97.75456468,
       220.2748249 , 200.3221571 ,  65.71646475,  78.42181787,
        47.07407336, 146.9804957 ,  98.75615348, 102.0246169 ,
       222.8051893 ,  63.80375236,  74.59112032, 219.1024445 ,
       118.9913702 , 221.2607285 , 135.4354503 ,  64.24703533,
        93.81510511, 105.0106551 ,  43.88495726,  54.49415032,
       100.9858707 , 215.3200696 , 160.9971602 ,  88.5518527 ,
        63.08653883,  95.84675806, 131.8640344 ,  58.82563146,
       219.9422989 ,  93.4774779 ,  98.02485276, 388.5100308 ,
       128.4080906 , 129.6742844 ,   0.        , 222.9426348 ,
        56.08195898,  70.81238332,  56.71379998, 248.7115561 ,
       222.2552054 , 200.081967  , 203.6329238 , 185.244671  ,
       236.0417996 , 222.2993177 ,  64.82775168,  89.53602526,
        96.89067067,  54.2605423 ,  76.72183015, 139.7620421 ,
        65.3025266 ])

y = np.array([  0.        ,  75.49108545,  61.51509185,  70.18320143,
       364.7402165 , 109.568719  ,  41.08417955,  40.83761634,
       207.3589071 ,  96.64997118,  32.1567632 ,  34.67529359,
        31.70877236, 114.413495  , 293.9381061 , 380.7518244 ,
       326.403998  , 323.0019917 , 400.0171516 , 337.6460195 ,
       348.2055969 , 440.1882369 , 431.5473839 , 355.7636629 ,
       433.1805758 , 417.0001996 , 229.2712733 , 240.2519802 ,
       290.1957078 , 393.7448968 , 343.0169598 , 334.7721919 ,
       432.505317  , 326.9564546 , 229.254949  , 421.8708399 ,
       197.019641  , 442.411859  , 370.5243872 , 257.5241186 ,
       339.6452082 , 318.8519392 , 256.6074448 , 302.322697  ,
       324.8686243 , 434.7269326 , 371.0977992 , 349.2665393 ,
       293.5102796 , 341.2584752 , 377.7476104 , 296.7567181 ,
       430.9114486 , 311.1336747 , 319.5093775 , 121.1779074 ,
       333.5332954 , 385.9592333 , 269.3048489 , 130.2690918 ,
       280.9712403 , 300.2219403 , 268.5752292 ,  58.5945506 ,
       132.1067479 , 193.9951334 , 170.2146256 , 132.3320591 ,
        90.50234207,  61.72976525, 319.2954143 , 283.9691051 ,
       301.5498668 , 306.1537013 , 249.3121043 , 164.37592   ,
       224.1410448 ])

# initialize piecewise linear fit with your x and y data
my_pwlf = pwlf.PiecewiseLinFit(x, y, lapack_driver='gelsd')

# fit the data for four line segments
res = my_pwlf.fit(2)

# predict for the determined points
xHat = np.linspace(min(x), max(x), num=10000)
yHat = my_pwlf.predict(xHat)

# Get the slopes
my_slopes = my_pwlf.slopes

# Get my model parameters
beta = my_pwlf.beta

# calculate the standard errors associated with each beta parameter
se = my_pwlf.standard_errors()

# calcualte the R^2 value
Rsquared = my_pwlf.r_squared()

# calculate the piecewise R^2 value
R2values = np.zeros(my_pwlf.n_segments)
for i in range(my_pwlf.n_segments):
    # segregate the data based on break point locations
    xmin = my_pwlf.fit_breaks[i]
    xmax = my_pwlf.fit_breaks[i+1]
    xtemp = my_pwlf.x_data
    ytemp = my_pwlf.y_data
    indtemp = np.where(xtemp >= xmin)
    xtemp = my_pwlf.x_data[indtemp]
    ytemp = my_pwlf.y_data[indtemp]
    indtemp = np.where(xtemp <= xmax)
    xtemp = xtemp[indtemp]
    ytemp = ytemp[indtemp]

    # predict for the new data
    yhattemp = my_pwlf.predict(xtemp)

    # calcualte ssr
    e = yhattemp - ytemp
    ssr = np.dot(e, e)

    # calculate sst
    ybar = np.ones(ytemp.size) * np.mean(ytemp)
    ydiff = ytemp - ybar
    sst = np.dot(ydiff, ydiff)

    R2values[i] = 1.0 - (ssr/sst)

# plot the results
plt.figure()
plt.plot(x, y, 'o')
plt.plot(xHat, yHat, '-')
plt.show()

@cjekel cjekel reopened this Mar 31, 2021
@cjekel
Copy link
Owner

cjekel commented Mar 31, 2021

@loveis98 Can you sketch on that plot what kind of fit you'd expect to that data set? Is your data following some order of occurrences? ie, does x[0] y[0] occur before x[1] y[1]

This figure is what your code spit out, and looking at the data, I can't quite see what kind of fit you'd expect.
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants