Discussion:
[task #7882] Implement Monte-Carlo simulation, where errors are generated with width of standard deviation or residuals
Troels E. Linnet
2015-01-16 16:14:30 UTC
Permalink
URL:
<http://gna.org/task/?7882>

Summary: Implement Monte-Carlo simulation, where errors are
generated with width of standard deviation or residuals
Project: relax
Submitted by: tlinnet
Submitted on: Fri 16 Jan 2015 04:14:30 PM UTC
Should Start On: Fri 16 Jan 2015 12:00:00 AM UTC
Should be Finished on: Fri 16 Jan 2015 12:00:00 AM UTC
Category: relax's source code
Priority: 5 - Normal
Status: In Progress
Percent Complete: 0%
Assigned to: tlinnet
Open/Closed: Open
Discussion Lock: Any
Effort: 0.00

_______________________________________________________

Details:

This is being implemented due to strange results.

A relaxation dispersion analysis of data with 61 spins, using a Monte Carlo
simulation with 500 steps, showed unexpectedly low errors.

-------
results.read(file=fname_results, dir=dir_results)

# Number of MC
mc_nr = 500

monte_carlo.setup(number=mc_nr)
monte_carlo.create_data()
monte_carlo.initial_values()
minimise.execute(min_algor='simplex', func_tol=1e-25, max_iter=int(1e7),
constraints=True)
monte_carlo.error_analysis()
--------

The fitted kex was 2111, with an error of 16.6.

When performing a dx.map, some odd results were found:

i_sort dw_sort pA_sort kex_sort chi2_sort
471 4.50000 0.99375 2125.00000 4664.31083
470 4.50000 0.99375 1750.00000 4665.23872

So even a tiny change in chi2 reflects a large deviation in kex.

It seems that perturbing the R2eff values according to their errors is not
"enough".

According to the GraphPad regression book
http://www.graphpad.com/faq/file/Prism4RegressionBook.pdf
(pages 33 and 104), the standard deviation of the residuals is:

Sxy = sqrt(SS/(N-p))

where SS is the sum of squares and N - p is the number of degrees of
freedom. In relax, SS corresponds to spin.chi2, which is weighted.

The random scatter added to each R2eff point should be drawn from a Gaussian
distribution with a mean of zero and an SD equal to Sxy.

Additionally, find the 2.5 and 97.5 percentiles for each parameter.
The range between these values is the confidence interval.
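A minimal plain-Python sketch of the proposed scheme. All numbers below are hypothetical (the chi2 value, point count, and back-calculated R2eff values are made up for illustration); in relax itself, spin.chi2 and the back-calculated curves would supply these.

```python
import math
import random

random.seed(0)

# Hypothetical inputs (not taken from the actual task data).
chi2 = 4664.3        # weighted sum of squares, spin.chi2 in relax
N = 61 * 20          # assumed total number of R2eff points
p = 3                # fitted parameters, e.g. dw, pA, kex

# Standard deviation of the residuals: Sxy = sqrt(SS / (N - p)).
sxy = math.sqrt(chi2 / (N - p))

# One Monte Carlo data set: scatter each back-calculated R2eff point
# with Gaussian noise of mean zero and SD equal to Sxy.
r2eff_calc = [20.0, 25.0, 30.0]          # hypothetical values
one_sim = [r + random.gauss(0.0, sxy) for r in r2eff_calc]

# 2.5/97.5 percentile confidence interval across simulations.  Here the
# per-simulation kex values are faked with a Gaussian spread purely to
# demonstrate the percentile step.
def percentile(sorted_vals, q):
    i = q * (len(sorted_vals) - 1)
    lo = int(i)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (i - lo) * (sorted_vals[hi] - sorted_vals[lo])

kex_sims = sorted(random.gauss(2111.0, 100.0) for _ in range(500))
ci = (percentile(kex_sims, 0.025), percentile(kex_sims, 0.975))
```

In a real run, the kex_sims list would instead come from re-minimising each of the 500 simulated data sets.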




_______________________________________________________

File Attachments:


-------------------------------------------------------
Date: Fri 16 Jan 2015 04:14:30 PM UTC Name: Screenshot-1.png Size: 161kB
By: tlinnet

<http://gna.org/task/download.php?file_id=23527>

_______________________________________________________

Reply to this item at:

<http://gna.org/task/?7882>

_______________________________________________
Message sent via/by Gna!
http://gna.org/


_______________________________________________
relax (http://www.nmr-relax.com)

This is the relax-devel mailing list
relax-***@gna.org

To unsubscribe from this list, get a password
reminder, or change your subscription options,
visit the list information page at
https://mail.gna.org/listinfo/relax-devel
Edward d'Auvergne
2015-01-16 16:30:38 UTC
Permalink
Hi Troels,

You should be very careful with your interpretation here. The
curvature of the chi-squared space does not correlate with the
parameter errors! Well, in most cases it doesn't. You will see this
if you map the space for different Monte Carlo simulations. Some
extreme edge cases might help in understanding the problem. Let's say
you have a kex value of 100 with a real error of 1000. In this case,
you could still have a small, perfectly quadratic minimum, but this
minimum will jump all over the place with the simulations. Another
extreme example might be a kex of 100 with a real error of 0.00000001.
In this case, the chi-squared space could look similar to the
screenshot you attached to the task (http://gna.org/task/?7882), yet
Monte Carlo simulations may hardly perturb the chi-squared space. I
have observed scenarios similar to these hypothetical cases with the
Lipari-Szabo model-free protein dynamics analysis.

There is one case where the chi-squared space and error space match,
and that is at the limit of the minimum when the chi-squared space
becomes quadratic. This happens when you zoom right into the minimum.
The correlation matrix approach makes this assumption. Monte Carlo
simulations do not. In fact, Monte Carlo simulations are the gold
standard. There is no technique which is better than Monte Carlo
simulations, if you use enough simulations. You can only match it by
deriving exact symbolic error equations.
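The point can be illustrated outside of relax. The sketch below (plain Python, using a hypothetical straight-line model rather than a dispersion model) runs the Monte Carlo procedure, namely scatter the back-calculated points by the measured errors, re-minimise, and take the SD of the fitted parameter, then compares the result to the known analytic error of the slope:

```python
import math
import random

random.seed(1)

# Hypothetical "experiment": y = a*x + b with known Gaussian noise.
a_true, b_true, sigma = 2.0, 1.0, 0.5
xs = [float(i) for i in range(10)]

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy_cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy_cov / sxx
    return a, my - a * mx

# One measured data set, then the back-calculated curve from its fit.
ys = [a_true * x + b_true + random.gauss(0.0, sigma) for x in xs]
a_fit, b_fit = fit_line(xs, ys)
y_back = [a_fit * x + b_fit for x in xs]

# Monte Carlo: re-scatter the back-calculated points by the measured
# error, re-fit, and take the SD of the fitted slope over all sims.
slopes = []
for _ in range(500):
    ys_sim = [y + random.gauss(0.0, sigma) for y in y_back]
    slopes.append(fit_line(xs, ys_sim)[0])
mean_a = sum(slopes) / len(slopes)
sd_mc = math.sqrt(sum((s - mean_a) ** 2 for s in slopes)
                  / (len(slopes) - 1))

# Analytic slope error for comparison: sigma / sqrt(sum((x - mean_x)^2)).
mx = sum(xs) / len(xs)
sd_exact = sigma / math.sqrt(sum((x - mx) ** 2 for x in xs))
```

With enough simulations, sd_mc converges on sd_exact regardless of what the chi-squared curvature around the minimum looks like.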

Therefore you really should investigate how your optimisation space is
perturbed by Monte Carlo simulations to understand the correlation -
or non-correlation - of the chi-squared curvature and the parameter
errors. Try mapping the minimum for the simulations and see if the
distribution of minima matches the chi-squared curvature
(http://gna.org/task/download.php?file_id=23527).

Regards,

Edward


Troels Emtekær Linnet
2015-01-16 16:48:34 UTC
Permalink
Hi Edward.

At the moment, I am fairly confident that I should investigate the
distribution from which the errors are drawn.

The method in relax draws from a Gaussian distribution of the R2eff
errors, but I should try drawing errors from the overall residuals
instead.

These are two different methods.

My PI has previously analysed the data with the aforementioned method
and got errors in the hundreds, i.e. errors of 5-10% of the fitted
global parameters.

Having 0.5-1 percent errors is way too small, and I see this for 4 of my
datasets.

So, something is fishy.

Best
Troels
Edward d'Auvergne
2015-01-16 17:09:26 UTC
Permalink
Hi,

Do the R2eff errors look reasonable? Another issue is that in a
clustered analysis, certain parameters can be over-constrained by being
shared between multiple data sets. This is the bias introduced by an
under-fitted problem, and it can artificially decrease the errors.
Anyway, you should plot the Monte Carlo simulations, a bit like I did
in figure 4 of my paper:

d'Auvergne, E. J. and Gooley, P. R. (2006). Model-free model
elimination: A new step in the model-free dynamic analysis of NMR
relaxation data. J. Biomol. NMR, 35(2), 117-135.
(http://dx.doi.org/10.1007/s10858-006-9007-z)

That might indicate if something is wrong, i.e. if the optimisation of
certain simulations has failed. However this problem only causes
errors to be bigger than they should be (unless all simulations have
failed). I don't know how Monte Carlo simulations could fail
otherwise. Monte Carlo simulations are the gold standard for error
analysis. All other error analysis techniques are judged by how
closely they approach this gold standard. Saying that the Monte Carlo
simulation technique failed is about equivalent to claiming the Earth
is flat! I challenge you to test the statement on a statistics
professor at your Uni ;) Anyway, if Monte Carlo failed, using
residuals will not save you, as the failure point will be present in
both techniques. What could have failed is the model or the input
data. Under-fitting due to too much R2eff data variability in the
spins of the cluster would be my guess. Do you see similarly small
errors in the non-clustered analysis of the same data?

Regards,

Edward






Troels Emtekær Linnet
2015-01-16 17:25:31 UTC
Permalink
Hi Edward.

I do not claim that Monte Carlo simulations are not the gold standard.

I am merely trying to investigate the method by which one draws the
errors.

In the current dispersion case, one trusts the R2eff errors to define
the distribution. These are individual per spin.

Another distribution could come from how well the clustered fit
performed, and this is what I am looking into.

Best
Troels
Edward d'Auvergne
2015-01-16 18:07:19 UTC
Permalink
Hi,

If you plot the R2eff errors from the Monte Carlo simulations of that
model, are they Gaussian? Well, that's assuming you have full
dispersion curves. Theoretically, from the white noise in the NMR
spectrum, they should be. Anyway, even if it is not claimed that
Monte Carlo simulations have failed: if you have small errors from MC
simulations and large errors from another error analysis technique,
and then pick the large errors, that implies that the Monte Carlo
simulations have failed. As for using residuals, this is a fall-back
technique for when experimental errors have not been, or cannot be,
measured. It relies on another convergence assumption: if you have
infinite data and the model has a bias of exactly 0.0, then the
residuals converge to the experimental errors. With clustering, you
might violate this bias == 0.0 condition (that is almost guaranteed).
You should also plot your residuals. If the average residual value is
not 0.0, or if you see a trend in the residual plot, then you have
model bias.
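A toy illustration of these two checks, with made-up numbers: the residuals below average to exactly zero, so the mean-residual check alone would pass, yet a plot would show a clear downward trend, which a simple correlation against position along the curve also catches.

```python
import math

# Hypothetical residuals (measured minus back-calculated R2eff) along
# one dispersion curve; the numbers are invented for illustration.
residuals = [0.9, 0.7, 0.4, 0.1, -0.2, -0.4, -0.7, -0.8]
n = len(residuals)

# Check 1: the average residual should be close to 0.0.
mean_resid = sum(residuals) / n

# Check 2: a trend in the residual plot.  A crude numerical stand-in
# for eyeballing the plot is the Pearson correlation between the
# residuals and their position along the curve.
xs = list(range(n))
mx = sum(xs) / n
cov = sum((x - mx) * (r - mean_resid) for x, r in zip(xs, residuals))
var_x = sum((x - mx) ** 2 for x in xs)
var_r = sum((r - mean_resid) ** 2 for r in residuals)
trend = cov / math.sqrt(var_x * var_r)   # correlation in [-1, 1]
```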

For using the residuals, how do these values compare to the error
values? If the residuals are bigger than the experimental errors, then
this also indicates that abs(bias) > 0.0. A scatter plot of R2eff
residuals vs. errors might be quite useful. This should be a linear
plot with a gradient of 1; anything else indicates bias. There might
even be a way of calculating the bias value from the residuals and
errors, though I've forgotten how this is done. Anyway, if you have a
large bias due to the residuals being bigger than the R2eff errors,
using the residuals for error analysis is not correct. It will
introduce larger errors, that is guaranteed. So you will have the
result that kex has larger errors, but it is useful to understand the
theoretical reason why: a large component of that kex error is the
modelling bias. So if sum_i(residual_i / error_i) > 1.0, then you
likely have under-fitting. This could be caused by the clustering or
by the 2-site model being insufficient. In any case, using the
residuals for an error analysis to work around kex errors being too
small only indicates a problem with the data modelling.
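A small numeric sketch of that comparison, with hypothetical residuals and errors. The sum over points is taken here as an average per point, so values well above 1.0 flag residuals that systematically exceed the measured errors:

```python
# Hypothetical R2eff residuals and measured errors for one spin;
# both lists are invented for illustration.
residuals = [0.8, -0.5, 1.2, -0.9, 1.5, -1.1, 0.7, -0.3]
errors    = [0.4,  0.5, 0.6,  0.5, 0.4,  0.6, 0.5,  0.4]

# Average |residual| / error ratio: values well above 1.0 mean the
# residuals are systematically bigger than the experimental errors,
# i.e. abs(bias) > 0.0 and likely under-fitting.
ratio = sum(abs(r) / e for r, e in zip(residuals, errors)) / len(residuals)

if ratio > 1.0:
    verdict = "residuals exceed errors: possible under-fitting / model bias"
else:
    verdict = "residuals consistent with the experimental errors"
```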

What about testing the covariance matrix technique? I guess that due
to small amounts of data that single-item-out cross validation is not
an option.

Regards,

Edward



On 16 January 2015 at 18:25, Troels Emtekær Linnet
Post by Troels Emtekær Linnet
Hi Edward.
I do not claim that "Monte Carlo simulations" is not the gold standard.
I am merely trying to investigate the method by which one draw the errors.
In the current case for dispersion, one trust the R2eff errors to be the
distribution.
These are individual per spin.
Another distribution could be from how well the clustered fit performed.
And this is what I am looking into.
Best
Troels
Post by Edward d'Auvergne
Hi,
Do the R2eff errors look reasonable? Another issue is in clustered
analysis, certain parameters can be over-constrained by being shared
between multiple data sets. This is the biased introduced by an
under-fitted problem. This can artificially decrease the errors.
Anyway, you should plot the Monte Carlo simulations, a bit like I did
d'Auvergne, E. J. and Gooley, P. R. (2006). Model-free model
elimination: A new step in the model-free dynamic analysis of NMR
relaxation data. J. Biomol. NMR, 35(2), 117-135.
(http://dx.doi.org/10.1007/s10858-006-9007-z)
That might indicate if something is wrong - i.e. if optimisation of
certain simulations have failed. However this problem only causes
errors to be bigger than they should be (unless all simulations have
failed). I don't know how Monte Carlo simulations could fail
otherwise. Monte Carlo simulations are the gold standard for error
analysis. All other error analysis techniques are judged based on how
close the approach this gold standard. Saying that the Monte Carlo
simulations technique failed is about equivalent to claiming the Earth
is flat! I challenge you to test the statement on a statistics
professor at your Uni ;) Anyway, if Monte Carlo failed, using
residuals will not save you as the failure point will be present in
both techniques. What could have failed is the model or the input
data. Under-fitting due to too much R2eff data variability in the
spins of the cluster would be my guess. Do you see similarly small
errors in the non-clustered analysis of the same data?
Regards,
Edward
On 16 January 2015 at 17:48, Troels Emtekær Linnet
Post by Troels Emtekær Linnet
Hi Edward.
At the moment, I am fairly confident that I should investigate the
distribution from which the errors are drawn.
The method in relax draws from a Gaussian distribution of the R2eff
errors, but I should try to draw errors from the overall residual instead.
It is two different methods.
My PI has previously analysed the data with the aforementioned
method, and got errors in the hundreds.
Errors are 5-10% of the fitted global parameters.
Having 0.5-1 percent error is way too small, and I see this for 4 of my
datasets.
So, something is fishy.
Best
Troels
Post by Edward d'Auvergne
Hi Troels,
You should be very careful with your interpretation here. The
curvature of the chi-squared space does not correlate with the
parameter errors! Well, in most cases it doesn't. You will see this
if you map the space for different Monte Carlo simulations. Some
extreme edge cases might help in understanding the problem. Let's say
you have a kex value of 100 with a real error of 1000. In this case,
you could still have a small, perfectly quadratic minimum. But this
minimum will jump all over the place with the simulations. Another
extreme example might be kex of 100 with a real error of 0.00000001.
In this case, the chi-squared space could look similar to the
screenshot you attached to the task ( http://gna.org/task/?7882).
However Monte Carlo simulations may hardly perturb the chi-squared
space. I have observed scenarios similar to these hypothetical cases
with the Lipari and Szabo model-free protein dynamics analysis.
There is one case where the chi-squared space and error space match,
and that is at the limit of the minimum when the chi-squared space
becomes quadratic. This happens when you zoom right into the minimum.
The correlation matrix approach makes this assumption. Monte Carlo
simulations do not. In fact, Monte Carlo simulations are the gold
standard. There is no technique which is better than Monte Carlo
simulations, if you use enough simulations. You can only match it by
deriving exact symbolic error equations.
Therefore you really should investigate how your optimisation space is
perturbed by Monte Carlo simulations to understand the correlation -
or non-correlation - of the chi-squared curvature and the parameter
errors. Try mapping the minimum for the simulations and see if the
distribution of minima matches the chi-squared curvature
(http://gna.org/task/download.php?file_id=23527).
Regards,
Edward
On 16 January 2015 at 17:14, Troels E. Linnet
Post by Troels E. Linnet
<http://gna.org/task/?7882>
_______________________________________________
Message sent via/by Gna!
http://gna.org/
_______________________________________________
relax (http://www.nmr-relax.com)
This is the relax-devel mailing list
To unsubscribe from this list, get a password
reminder, or change your subscription options,
visit the list information page at
https://mail.gna.org/listinfo/relax-devel
Edward d'Auvergne
2015-01-16 18:13:20 UTC
Permalink
Hi,

Sorry, I meant sum_i(residual_i / error_i)/N > 1.0. As for
calculating the bias value, it looks like Wikipedia has a reasonable
description (https://en.wikipedia.org/wiki/Bias_of_an_estimator).
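The corrected check could be sketched as follows (a minimal illustration, not relax code; the function name and arrays are hypothetical, and it assumes the residuals are taken as absolute values so that random scatter of opposite signs does not cancel out):

```python
import numpy as np

def underfitting_ratio(measured, back_calc, errors):
    """Mean of |residual| / error.  Values well above 1.0 mean the
    residuals exceed the experimental errors, suggesting under-fitting."""
    residuals = np.asarray(measured) - np.asarray(back_calc)
    return np.mean(np.abs(residuals) / np.asarray(errors))
```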

Regards,

Edward
Post by Edward d'Auvergne
Hi,
If you plot the R2eff errors from the Monte Carlo simulations of that
model, are they Gaussian? Well, that's assuming you have full
dispersion curves. Theoretically from the white noise in the NMR
spectrum they should be. Anyway, even if it is not claimed that Monte
Carlo simulations have failed, having small errors from MC
simulations and large errors from another error analysis technique, and
then picking the large errors, implies that the Monte Carlo
simulations have failed. As for using residuals, this is a fall-back
technique for when experimental errors have not been, or cannot be, measured.
This uses another convergence assumption - if you have infinite data
and the model has a bias value of exactly 0.0, then the residuals
converge to the experimental errors. For clustering, you might
violate this bias == 0.0 condition (that is almost guaranteed). You
should also plot your residuals. If the average residual value is not
0.0, then you have model bias. Or if you see a trend in the residual
plot.
For using the residuals, how do these values compare to the error
values? If the residuals are bigger than the experimental error, then
this also indicates that abs(bias) > 0.0. A scatter plot of R2eff
residuals vs. errors might be quite useful. This should be a linear
plot with a gradient of 1, anything else indicates bias. There might
even be a way of calculating the bias value from the residuals and
errors, though I've forgotten how this is done. Anyway, if you have a
large bias due to the residuals being bigger than the R2eff error,
using the residuals for error analysis is not correct. It will
introduce larger errors, that is guaranteed. So you will have the
result that kex has larger errors. But it is useful to understand the
theoretical reason why. A large component of that kex error is the
modelling bias. So if sum_i(residual_i / error_i) > 1.0, then you
likely have under-fitting. This could be caused by clustering or the
2-site model being insufficient. In any case, using the residuals for
an error analysis to work around kex errors being too small only
indicates a problem with the data modelling.
Edward d'Auvergne
2015-01-19 09:31:39 UTC
Permalink
Hi Troels,

Do you have a reference for the technique? You mentioned a 'fitting
guide' in one of your commit messages, but without a reference to it.
I would like to study the technique to understand the implementation.
Does it have another name? I would guess so as it breaks the
statistical principles that the Monte Carlo simulation technique uses.
That is why the monte_carlo.create_data user function says on the
first line that the technique is called bootstrapping rather than
Monte Carlo simulations if the 'method' argument is changed. I would
also like to check if it is implemented properly. For example
accessing the chi2, converting it to the reduced chi2 and using this
as the SSE may not be correct, as the former is normalised by the
errors and the later is not. You may need to recalculate the SSE as
there is no way to convert from the full/reduced chi2 value to an SSE
value. I would also like to see if this implementation is a new
methodology that sits beside Monte Carlo simulations and Bootstrapping
simulations, or if it can be used for both of these.
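The chi2/SSE distinction can be seen with a toy example (illustrative numbers only): with non-uniform errors, the error-normalised chi-squared and the plain sum of squares differ by more than a constant factor, so neither can be recovered from the other without the raw residuals and errors.

```python
import numpy as np

# Illustrative values only - three data points with different errors.
measured  = np.array([10.0, 12.0, 15.0])
back_calc = np.array([10.5, 11.5, 15.5])
errors    = np.array([0.2, 0.5, 1.0])

residuals = measured - back_calc
sse  = np.sum(residuals**2)              # plain sum of squared residuals
chi2 = np.sum((residuals / errors)**2)   # weighted by the per-point errors

# Here sse = 0.75 while chi2 = 7.5, and the ratio depends on the
# individual errors, not on any single scaling constant.
```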

Cheers,

Edward
Troels Emtekær Linnet
2015-01-19 09:53:48 UTC
Permalink
Hi Edward.

I have used this regression book.
http://www.graphpad.com/faq/file/Prism4RegressionBook.pdf

I like the language (fun and humorous), and it goes into great detail.
For example, there is quite a list of weighting methods.
I find the comments on page 28 a little disturbing.
(See also page 86+87)

The Monte-Carlo simulation is described at page 104.

1) Create an ideal data set.
-> Check. This is done with relax
monte_carlo.create_data(method='back_calc')

2) Add random scatter.
relax currently adds random scatter per individual data point. The random
scatter is drawn from the measured error of the data point.
But the book suggests adding errors described by the variance of the
residuals.
This is described at page 33.
"If you chose to weight the values and minimize the relative distance
squared (or some other weighting function), goodness-of-fit is quantified
with the weighted sum- of-squares."

Sy.x = sqrt(SS/dof) = sqrt(chi2 / dof)
The question is, of course, whether SS should be the sum of squared errors
for the weighted points or for the non-weighted R2eff points.
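The two steps above could be sketched like this (a rough sketch only, not relax's implementation; the function and variable names are hypothetical, and it assumes SS is taken directly as the chi-squared value):

```python
import numpy as np

def create_mc_data(back_calc, chi2, n_params, n_sims, seed=None):
    """Steps 1 and 2 of the Prism book's Monte Carlo scheme (p. 104):
    start from the ideal (back-calculated) data, then add Gaussian
    scatter with SD = Sy.x = sqrt(SS / dof) (p. 33)."""
    n = len(back_calc)
    dof = n - n_params          # degrees of freedom, N - p
    sxy = np.sqrt(chi2 / dof)   # SD of the residuals
    rng = np.random.default_rng(seed)
    # One Gaussian-perturbed copy of the ideal data per simulation.
    return np.asarray(back_calc) + rng.normal(0.0, sxy, size=(n_sims, n))
```

Each row would then be refitted as one simulation; the 2.5 and 97.5 percentiles of the resulting parameters give the confidence interval.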

Best
Troels
Post by Edward d'Auvergne
Hi Troels,
Do you have a reference for the technique? You mentioned a 'fitting
guide' in one of your commit messages, but without a reference to it.
I would like to study the technique to understand the implementation.
Does it have another name? I would guess so as it breaks the
statistical principles that the Monte Carlo simulation technique uses.
That is why the monte_carlo.create_data user function says on the
first line that the technique is called bootstrapping rather than
Monte Carlo simulations if the 'method' argument is changed. I would
also like to check if it is implemented properly. For example
accessing the chi2, converting it to the reduced chi2 and using this
as the SSE may not be correct, as the former is normalised by the
errors and the later is not. You may need to recalculate the SSE as
there is no way to convert from the full/reduced chi2 value to an SSE
value. I would also like to see if this implementation is a new
methodology that sits beside Monte Carlo simulations and Bootstrapping
simulations, or if it can be used for both of these.
Cheers,
Edward
Post by Edward d'Auvergne
Hi,
Sorry, I meant sum_i(residual_i / error_i)/N > 1.0. As for
calculating the bias value, it looks like Wikipedia has a reasonable
description (https://en.wikipedia.org/wiki/Bias_of_an_estimator).
Regards,
Edward
Post by Edward d'Auvergne
Hi,
If you plot the R2eff errors from the Monte Carlo simulations of that
model, are they Gaussian? Well, that's assuming you have full
dispersion curves. Theoretically from the white noise in the NMR
spectrum they should be. Anyway, even if it not claimed that Monte
Carlo simulations have failed, and you have small errors from MC
simulations and large errors from other error analysis technique, and
then pick the large errors, that implies that the Monte Carlo
simulations have failed. As for using residuals, this is a fall-back
technique when experimental errors have not, or cannot, be measured.
This uses another convergence assumption - if you have infinite data
and the model has a bias value of exactly 0.0, then the residuals
converge to the experimental errors. For clustering, you might
violate this bias == 0.0 condition (that is almost guaranteed). You
should also plot your residuals. If the average residual value is not
0.0, then you have model bias. Of if you see a trend in the residual
plot.
For using the residuals, how do these values compare to the error
values? If the residuals are bigger than the experimental error, then
this also indicates that abs(bias) > 0.0. A scatter plot of R2eff
residuals vs. errors might be quite useful. This should be a linear
plot with a gradient of 1, anything else indicates bias. There might
even be a way of calculating the bias value from the residuals and
errors, though I've forgotten how this is done. Anyway, if you have a
large bias due to the residuals being bigger than the R2eff error,
using the residuals for error analysis is not correct. It will
introduce larger errors, that is guaranteed. So you will have the
result that kex has larger errors. But it is useful to understand the
theoretical reason why. A large component of that kex error is the
modelling bias. So if sum_i(residual_i / error_i) > 1.0, then you
likely have under-fitting. This could be caused by clustering or the
2-site model being insufficient. In any case, using the residuals for
an error analysis to work around kex errors being too small only
indicates a problem with the data modelling.
What about testing the covariance matrix technique? I guess that due
to small amounts of data that single-item-out cross validation is not
an option.
Regards,
Edward
On 16 January 2015 at 18:25, Troels Emtekær Linnet
Post by Troels Emtekær Linnet
Hi Edward.
I do not claim that "Monte Carlo simulations" is not the gold standard.
I am merely trying to investigate the method by which one draws the errors.
In the current case for dispersion, one trusts the R2eff errors to be
the distribution.
These are individual per spin.
Another distribution could be from how well the clustered fit performed.
And this is what I am looking into.
Best
Troels
Post by Edward d'Auvergne
Hi,
Do the R2eff errors look reasonable? Another issue is that in clustered
analyses, certain parameters can be over-constrained by being shared
between multiple data sets. This is the bias introduced by an
under-fitted problem. This can artificially decrease the errors.
Anyway, you should plot the Monte Carlo simulations, a bit like I did in:
d'Auvergne, E. J. and Gooley, P. R. (2006). Model-free model
elimination: A new step in the model-free dynamic analysis of NMR
relaxation data. J. Biomol. NMR, 35(2), 117-135.
(http://dx.doi.org/10.1007/s10858-006-9007-z)
That might indicate if something is wrong - i.e. if optimisation of
certain simulations have failed. However this problem only causes
errors to be bigger than they should be (unless all simulations have
failed). I don't know how Monte Carlo simulations could fail
otherwise. Monte Carlo simulations are the gold standard for error
analysis. All other error analysis techniques are judged based on how
closely they approach this gold standard. Saying that the Monte Carlo
simulations technique failed is about equivalent to claiming the Earth
is flat! I challenge you to test the statement on a statistics
professor at your Uni ;) Anyway, if Monte Carlo failed, using
residuals will not save you as the failure point will be present in
both techniques. What could have failed is the model or the input
data. Under-fitting due to too much R2eff data variability in the
spins of the cluster would be my guess. Do you see similarly small
errors in the non-clustered analysis of the same data?
Regards,
Edward
On 16 January 2015 at 17:48, Troels Emtekær Linnet
Post by Troels Emtekær Linnet
Hi Edward.
At the moment, I am fairly confident that I should investigate the
distribution from which the errors are drawn.
The method in relax draws from a Gauss distribution of the R2eff errors,
but I should try to draw errors from the overall residual instead.
It is two different methods.
My PI has earlier analysed the data with the aforementioned method,
and got errors in the hundreds.
Errors are 5-10% of the fitted global parameters.
Having 0.5-1 percent error is way too small, and I see this for 4 of my
datasets.
So, something is fishy.
Best
Troels
Post by Edward d'Auvergne
Hi Troels,
You should be very careful with your interpretation here. The
curvature of the chi-squared space does not correlate with the
parameter errors! Well, in most cases it doesn't. You will see this
if you map the space for different Monte Carlo simulations. Some
extreme edge cases might help in understanding the problem. Let's say
you have a kex value of 100 with a real error of 1000. In this case,
you could still have a small, perfectly quadratic minimum. But this
minimum will jump all over the place with the simulations. Another
extreme example might be kex of 100 with a real error of 0.00000001.
In this case, the chi-squared space could look similar to the
screenshot you attached to the task (http://gna.org/task/?7882).
However Monte Carlo simulations may hardly perturb the chi-squared
space. I have observed scenarios similar to these hypothetical cases
with the Lipari and Szabo model-free protein dynamics analysis.
There is one case where the chi-squared space and error space match,
and that is at the limit of the minimum when the chi-squared space
becomes quadratic. This happens when you zoom right into the minimum.
The correlation matrix approach makes this assumption. Monte Carlo
simulations do not. In fact, Monte Carlo simulations are the gold
standard. There is no technique which is better than Monte Carlo
simulations, if you use enough simulations. You can only match it by
deriving exact symbolic error equations.
Therefore you really should investigate how your optimisation space is
perturbed by Monte Carlo simulations to understand the correlation -
or non-correlation - of the chi-squared curvature and the parameter
errors. Try mapping the minimum for the simulations and see if the
distribution of minima matches the chi-squared curvature
(http://gna.org/task/download.php?file_id=23527).
Regards,
Edward
On 16 January 2015 at 17:14, Troels E. Linnet
Post by Troels E. Linnet
<http://gna.org/task/?7882>
Summary: Implement Monte-Carlo simulation, where errors are
generated with width of standard deviation or residuals
Project: relax
Submitted by: tlinnet
Submitted on: Fri 16 Jan 2015 04:14:30 PM UTC
Should Start On: Fri 16 Jan 2015 12:00:00 AM UTC
Should be Finished on: Fri 16 Jan 2015 12:00:00 AM UTC
Category: relax's source code
Priority: 5 - Normal
Status: In Progress
Percent Complete: 0%
Assigned to: tlinnet
Open/Closed: Open
Discussion Lock: Any
Effort: 0.00
_______________________________________________________
This is implemented due to strange results.
A relaxation dispersion analysis on data with 61 spins, with a Monte
Carlo simulation of 500 steps, showed unexpectedly low errors.
-------
results.read(file=fname_results, dir=dir_results)

# Number of Monte Carlo simulations.
mc_nr = 500

monte_carlo.setup(number=mc_nr)
monte_carlo.create_data()
monte_carlo.initial_values()
minimise.execute(min_algor='simplex', func_tol=1e-25, max_iter=int(1e7), constraints=True)
monte_carlo.error_analysis()
--------
The kex was 2111 with an error of 16.6.
When performing a dx.map, some weird results were found:
i_sort dw_sort pA_sort kex_sort chi2_sort
471 4.50000 0.99375 2125.00000 4664.31083
470 4.50000 0.99375 1750.00000 4665.23872
So even a small change in chi2 should reflect a larger deviation in kex.
It seems that changing the R2eff values according to their errors is
not "enough".
According to the regression book of GraphPad
http://www.graphpad.com/faq/file/Prism4RegressionBook.pdf
pages 33 and 104, the standard deviation of the residuals is:

Sxy = sqrt(SS/(N-p))

where SS is the sum of squares and N - p is the number of degrees of
freedom. In relax, SS is spin.chi2, and is weighted.
The random scatter added to each R2eff point should be drawn from a
Gaussian distribution with a mean of zero and an SD equal to Sxy.
Additionally, find the 2.5 and 97.5 percentiles for each parameter.
The range between these values is the confidence interval.
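The percentile-based confidence interval can be sketched as follows, assuming the per-simulation parameter values have been collected into a plain array. The kex values here are synthetic, drawn from an invented distribution purely for illustration:

```python
import numpy as np

# Hypothetical kex values from 500 Monte Carlo simulations (synthetic data,
# invented mean and spread).
rng = np.random.default_rng(0)
kex_sims = rng.normal(loc=2111.0, scale=100.0, size=500)

# The 2.5 and 97.5 percentiles bound the 95% confidence interval.
lower, upper = np.percentile(kex_sims, [2.5, 97.5])
print("95%% CI for kex: [%.1f, %.1f]" % (lower, upper))
```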
_______________________________________________________
-------------------------------------------------------
Date: Fri 16 Jan 2015 04:14:30 PM UTC Name: Screenshot-1.png
161kB
By: tlinnet
<http://gna.org/task/download.php?file_id=23527>
_______________________________________________________
<http://gna.org/task/?7882>
_______________________________________________
Message sent via/by Gna!
http://gna.org/
_______________________________________________
relax (http://www.nmr-relax.com)
This is the relax-devel mailing list
To unsubscribe from this list, get a password
reminder, or change your subscription options,
visit the list information page at
https://mail.gna.org/listinfo/relax-devel
Edward d'Auvergne
2015-01-19 13:38:45 UTC
Hi Troels,

Looking at page 28, note that they are using the SSE value of
sum((Y_data - Y_curve)^2) and not the chi2 value of sum(((Y_data -
Y_curve)/sigma)^2). This is a major difference! The chi-squared value
is the SSE normalised to unit variance. Any statistics or techniques
relying on the SSE value cannot be used when the chi-squared has been
used - i.e. it cannot be mapped to the relax results. Also note that
this page/book is talking about non-linear regression
(https://en.wikipedia.org/wiki/Nonlinear_regression) whereas in relax
we use non-linear least squares fitting
(https://en.wikipedia.org/wiki/Non-linear_least_squares). Although
related, these are different fields, so care must be taken when
using techniques from one in the other. In most cases that would be
theoretically disallowed. This statement from the regression
wikipedia article
https://en.wikipedia.org/wiki/Nonlinear_regression#Ordinary_and_weighted_least_squares
is useful:

"The best-fit curve is often assumed to be that which minimizes
the sum of squared residuals. This is the (ordinary) least squares
(OLS) approach. However, in cases where the dependent variable does
not have constant variance, a sum of weighted squared residuals may be
minimized; see weighted least squares. Each weight should ideally be
equal to the reciprocal of the variance of the observation, but
weights may be recomputed on each iteration, in an iteratively
weighted least squares algorithm."

I.e. the two techniques are assumed to give the same result. This is
often the case if the errors of all points are the same (though not
always, due to the change in curvature of the space if the errors are
not 1.0), that the noise is Gaussian, and there are no outliers.
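The SSE/chi2 distinction can be made concrete with a few invented data points. This is not relax code, just the two definitions side by side:

```python
import numpy as np

# Hypothetical measured data, back-calculated curve, and per-point errors
# (invented numbers for illustration).
y_data = np.array([10.0, 12.0, 15.0, 9.0])
y_curve = np.array([10.2, 11.5, 15.3, 9.1])
sigma = np.array([0.1, 0.5, 0.3, 0.1])

# SSE: the plain sum of squared residuals used in ordinary regression.
sse = np.sum((y_data - y_curve) ** 2)

# chi2: each residual is first normalised by its own error, so points
# with small errors dominate.  This is what relax minimises.
chi2 = np.sum(((y_data - y_curve) / sigma) ** 2)

print("SSE  = %.4f" % sse)
print("chi2 = %.4f" % chi2)
```

The two values rank the same fit very differently, so statistics derived from the SSE cannot simply be re-used on a chi2-optimised result.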

Which part of page 28 is disturbing? We do not make the second
assumption "that the average amount of scatter is the same all the way
along the curve". The chi-squared value takes care of this - each
point has its own independent error. The first assumption is also
fine, unless the NMR experiment has failed in some strange way
affecting the spectral baseline or the spectral processing has
introduced baseline artefacts. As for the weighting, in the
relaxation dispersion analysis, this is not needed. It is used
sometimes in relax when combining RDC and PCS data in the N-state
model, but that is for when errors are not known and it is used to
make sure the two data types have roughly the same amount of weight
(which a full error analysis would have done). I'll continue with the
other sections later.

I would also recommend that you have a read of the introduction
chapters 1 and 2 of "Numerical Optimisation" by Jorge Nocedal and
Stephen J. Wright. This will give a good summary of the least squares
problem in comparison to this GraphPad Prism 4 software book which
covers regression. This book will be in your maths library, or full
PDFs can be found online. This book is a brilliant reference for all
optimisation concepts.

Cheers,

Edward

On 19 January 2015 at 10:53, Troels Emtekær Linnet
Post by Troels Emtekær Linnet
Hi Edward.
I have used this regression book.
http://www.graphpad.com/faq/file/Prism4RegressionBook.pdf
I like the language (fun and humoristic), and it goes into great detail.
For example, there is quite a list of weighting methods.
I find the comments on page 28 a little disturbing.
(See also page 86+87)
The Monte-Carlo simulation is described at page 104.
1) Create an ideal data set.
-> Check. This is done with relax
monte_carlo.create_data(method='back_calc')
2) Add random scatter.
relax now adds random scatter per individual data point. The random scatter
is drawn from the measured error of the data point.
But the book suggest adding errors described by variance of the residuals.
This is described at page 33.
"If you chose to weight the values and minimize the relative distance
squared (or some other weighting function), goodness-of-fit is quantified
with the weighted sum- of-squares."
Sy.x = sqrt(SS/dof) = sqrt(chi2 / dof)
The question is of course whether SS should be the sum of squared errors
for the weighted points or the non-weighted R2eff points.
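The proposed scheme could be sketched as below. The curve, the noise level and the parameter count are all invented for illustration, and this is not how relax's monte_carlo user functions are implemented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical back-calculated (ideal) R2eff curve and noisy measurements
# (invented numbers and noise level).
r2eff_calc = np.array([15.0, 14.2, 12.8, 11.5, 10.9, 10.6])
r2eff_meas = r2eff_calc + rng.normal(0.0, 0.3, size=r2eff_calc.size)

# Standard deviation of the residuals, Sxy = sqrt(SS / (N - p)), with SS
# the unweighted sum of squares and N - p the degrees of freedom.
n_params = 3  # e.g. dw, pA and kex (an assumption for this sketch).
ss = np.sum((r2eff_meas - r2eff_calc) ** 2)
dof = r2eff_calc.size - n_params
sxy = np.sqrt(ss / dof)

# One Monte Carlo data set: the ideal curve plus scatter drawn from a
# Gaussian with mean zero and SD equal to Sxy.
mc_data = r2eff_calc + rng.normal(0.0, sxy, size=r2eff_calc.size)
print("Sxy = %.4f" % sxy)
```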
Best
Troels
Post by Edward d'Auvergne
Hi Troels,
Do you have a reference for the technique? You mentioned a 'fitting
guide' in one of your commit messages, but without a reference to it.
I would like to study the technique to understand the implementation.
Does it have another name? I would guess so as it breaks the
statistical principles that the Monte Carlo simulation technique uses.
That is why the monte_carlo.create_data user function says on the
first line that the technique is called bootstrapping rather than
Monte Carlo simulations if the 'method' argument is changed. I would
also like to check if it is implemented properly. For example
accessing the chi2, converting it to the reduced chi2 and using this
as the SSE may not be correct, as the former is normalised by the
errors and the latter is not. You may need to recalculate the SSE as
there is no way to convert from the full/reduced chi2 value to an SSE
value. I would also like to see if this implementation is a new
methodology that sits beside Monte Carlo simulations and Bootstrapping
simulations, or if it can be used for both of these.
Cheers,
Edward
Post by Edward d'Auvergne
Hi,
Sorry, I meant sum_i(residual_i / error_i)/N > 1.0. As for
calculating the bias value, it looks like Wikipedia has a reasonable
description (https://en.wikipedia.org/wiki/Bias_of_an_estimator).
Regards,
Edward
Post by Edward d'Auvergne
Hi,
If you plot the R2eff errors from the Monte Carlo simulations of that
model, are they Gaussian? Well, that's assuming you have full
dispersion curves. Theoretically from the white noise in the NMR
spectrum they should be. Anyway, even if it is not claimed that Monte
Carlo simulations have failed, having small errors from the MC
simulations and large errors from another error analysis technique,
and then picking the large errors, implies that the Monte Carlo
simulations have failed. As for using residuals, this is a fall-back
technique when experimental errors have not, or cannot, be measured.
This uses another convergence assumption - if you have infinite data
and the model has a bias value of exactly 0.0, then the residuals
converge to the experimental errors. For clustering, you might
violate this bias == 0.0 condition (that is almost guaranteed). You
should also plot your residuals. If the average residual value is not
0.0, then you have model bias. Or if you see a trend in the residual
plot.
For using the residuals, how do these values compare to the error
values? If the residuals are bigger than the experimental error, then
this also indicates that abs(bias) > 0.0. A scatter plot of R2eff
residuals vs. errors might be quite useful. This should be a linear
plot with a gradient of 1, anything else indicates bias. There might
even be a way of calculating the bias value from the residuals and
errors, though I've forgotten how this is done. Anyway, if you have a
large bias due to the residuals being bigger than the R2eff error,
using the residuals for error analysis is not correct. It will
introduce larger errors, that is guaranteed. So you will have the
result that kex has larger errors. But it is useful to understand the
theoretical reason why. A large component of that kex error is the
modelling bias. So if sum_i(residual_i / error_i) > 1.0, then you
likely have under-fitting. This could be caused by clustering or the
2-site model being insufficient. In any case, using the residuals for
an error analysis to work around kex errors being too small only
indicates a problem with the data modelling.
What about testing the covariance matrix technique? I guess that due
to small amounts of data that single-item-out cross validation is not
an option.
Regards,
Edward
On 16 January 2015 at 18:25, Troels Emtekær Linnet
Post by Troels Emtekær Linnet
Hi Edward.
I do not claim that "Monte Carlo simulations" is not the gold standard.
I am merely trying to investigate the method by which one draws the errors.
In the current case for dispersion, one trusts the R2eff errors to be the
distribution.
These are individual per spin.
Another distribution could be from how well the clustered fit performed.
And this is what I am looking into.
Best
Troels
Post by Edward d'Auvergne
Hi,
Do the R2eff errors look reasonable? Another issue is that in clustered
analyses, certain parameters can be over-constrained by being shared
between multiple data sets. This is the bias introduced by an
under-fitted problem. This can artificially decrease the errors.
Anyway, you should plot the Monte Carlo simulations, a bit like I did in:
d'Auvergne, E. J. and Gooley, P. R. (2006). Model-free model
elimination: A new step in the model-free dynamic analysis of NMR
relaxation data. J. Biomol. NMR, 35(2), 117-135.
(http://dx.doi.org/10.1007/s10858-006-9007-z)
That might indicate if something is wrong - i.e. if optimisation of
certain simulations have failed. However this problem only causes
errors to be bigger than they should be (unless all simulations have
failed). I don't know how Monte Carlo simulations could fail
otherwise. Monte Carlo simulations are the gold standard for error
analysis. All other error analysis techniques are judged based on how
closely they approach this gold standard. Saying that the Monte Carlo
simulations technique failed is about equivalent to claiming the Earth
is flat! I challenge you to test the statement on a statistics
professor at your Uni ;) Anyway, if Monte Carlo failed, using
residuals will not save you as the failure point will be present in
both techniques. What could have failed is the model or the input
data. Under-fitting due to too much R2eff data variability in the
spins of the cluster would be my guess. Do you see similarly small
errors in the non-clustered analysis of the same data?
Regards,
Edward
On 16 January 2015 at 17:48, Troels Emtekær Linnet
Post by Troels Emtekær Linnet
Hi Edward.
At the moment, I am fairly confident that I should investigate the
distribution from which the errors are drawn.
The method in relax draws from a Gauss distribution of the R2eff errors,
but
I should try to draw errors from the
overall residual instead.
It is two different methods.
My PI has earlier analysed the data with the aforementioned
method, and got errors in the hundreds.
Errors are 5-10% of the fitted global parameters.
Having 0.5-1 percent error is way too small, and I see this for 4 of my
datasets.
So, something is fishy.
Best
Troels
2015-01-16 17:30 GMT+01:00 Edward d'Auvergne
Post by Edward d'Auvergne
Hi Troels,
You should be very careful with your interpretation here. The
curvature of the chi-squared space does not correlate with the
parameter errors! Well, in most cases it doesn't. You will see this
if you map the space for different Monte Carlo simulations. Some
extreme edge cases might help in understanding the problem. Let's say
you have a kex value of 100 with a real error of 1000. In this case,
you could still have a small, perfectly quadratic minimum. But this
minimum will jump all over the place with the simulations.
Another
extreme example might be kex of 100 with a real error of 0.00000001.
In this case, the chi-squared space could look similar to the
screenshot you attached to the task ( http://gna.org/task/?7882).
However Monte Carlo simulations may hardly perturb the chi-squared
space. I have observed scenarios similar to these hypothetical cases
with the Lipari and Szabo model-free protein dynamics analysis.
There is one case where the chi-squared space and error space match,
and that is at the limit of the minimum when the chi-squared space
becomes quadratic. This happens when you zoom right into the minimum.
The correlation matrix approach makes this assumption. Monte Carlo
simulations do not. In fact, Monte Carlo simulations are the gold
standard. There is no technique which is better than Monte Carlo
simulations, if you use enough simulations. You can only match it by
deriving exact symbolic error equations.
Therefore you really should investigate how your optimisation space is
perturbed by Monte Carlo simulations to understand the correlation -
or non-correlation - of the chi-squared curvature and the parameter
errors. Try mapping the minimum for the simulations and see if the
distribution of minima matches the chi-squared curvature
(http://gna.org/task/download.php?file_id=23527).
Regards,
Edward
Edward d'Auvergne
2015-01-19 17:48:29 UTC
Hi Troels,
Post by Troels Emtekær Linnet
I have used this regression book.
http://www.graphpad.com/faq/file/Prism4RegressionBook.pdf
I like the language (fun and humoristic), and it goes into great detail.
For example, there is quite a list of weighting methods.
I find the comments on page 28 a little disturbing.
(See also page 86+87)
We already weight by the individual data point error. As we are using
non-linear least squares fitting of the chi2 value, the weights
described on pages 86 and 87 are not relevant. For dispersion data
they are also not relevant as the NMR spectrum is so complicated that
any of these weights will not reliably replicate the field dependent
effects, the peak height effects, etc.
Post by Troels Emtekær Linnet
The Monte-Carlo simulation is described at page 104.
1) Create an ideal data set.
-> Check. This is done with relax
monte_carlo.create_data(method='back_calc')
This is correct.
Post by Troels Emtekær Linnet
2) Add random scatter.
relax now adds random scatter per individual data point. The random scatter
is drawn from the measured error of the data point.
Point 2) is correct, it is how you perform Monte Carlo simulations for
non-linear regression. The reason is twofold - because in regression
you don't know the error sigma_i, and because the residuals can be
used as an estimator of sigma_i. But there is a much better error
estimator. That is to use residuals in bootstrapping to estimate the
data errors (as mentioned on page 108). This would be far superior!
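A minimal sketch of residual bootstrapping as described there, with invented data (this is not a relax user function):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical best-fit curve and measured points (invented numbers).
y_curve = np.array([20.0, 18.5, 16.0, 14.8, 14.1])
y_data = np.array([20.3, 18.1, 16.2, 14.6, 14.3])
residuals = y_data - y_curve

# Residual bootstrap: each synthetic data set is the best-fit curve plus
# residuals resampled with replacement.
n_boot = 1000
boot_sets = y_curve + rng.choice(residuals, size=(n_boot, y_curve.size))
print(boot_sets.shape)
```

Each row of boot_sets would then be refitted, and the spread of the refitted parameters gives the error estimate.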

However we are using non-linear least squares optimisation, not
regression. Therefore we do not need an estimator of the errors, as
we already know the errors. The residuals or error bootstrapping
techniques are only there to estimate the measured error, which we
already have. If the residuals or error bootstrapping have been
successful, these should converge to the measured errors. So this is
unlikely to be the source of your too low kex error.
Post by Troels Emtekær Linnet
But the book suggest adding errors described by variance of the residuals.
This is described at page 33.
"If you chose to weight the values and minimize the relative distance
squared (or some other weighting function), goodness-of-fit is quantified
with the weighted sum- of-squares."
Sy.x = sqrt(SS/dof) = sqrt(chi2 / dof)
The question is of course, if SS should be sum of squared errors for the
weighted points, or the non-weighted R2eff points.
Note that the variance of the residuals is an error estimator. All
estimators also have an error (this is the error of an error, or sigma
of sigma_i). In the case of a single spin system, as there are not
many R2eff points, sigma of sigma_i will be big (you need 500+ points
before sigma of sigma_i is reasonably small). Hence your error
estimates from such methods will be very noisy.

In any case, because the chi-squared value has been optimised, this is
not the same solution as the regression of the SSE. The two minima
are not necessarily the same. They converge under certain strong
conditions in the regression problem (Gaussian errors, no outliers,
and errors for all points for all field strengths are the same).
Because the two solutions are not the same you cannot use the SSE
value, which would have to be calculated from the base data and
back-calculated data, for the error estimate. It can only be used
strictly under the convergence condition.

I don't know of any error estimate for the non-linear least squares
optimisation problem. But one probably has been derived for the cases
when the errors are not known. This would require a different
reference, as the GraphPad Prism 4 book only covers regression and not
least squares and hence cannot give the correct answer. The Numerical
Optimisation book by Nocedal and Wright
(https://books.google.de/books?id=VbHYoSyelFcC&lpg=PP1&dq=numerical%20optimisation%20nocedal%20wright&pg=PP1#v=onepage&q=numerical%20optimisation%20nocedal%20wright&f=false)
also doesn't seem to cover this. Do you know the Numerical Recipes
books (http://www.nr.com/)? Maybe there is something in there. Monte
Carlo simulations are described very clearly in section 15.6 of the
second edition. There might be something in chapter 15, "the
modelling of data" detailing the correct error estimate for this
problem.

As a side note, for the non-linear least squares problem when errors
are unknown and Monte Carlo simulations are not an option, the
covariance matrix might be the best error estimate.
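For a weighted least squares problem, the standard covariance matrix estimate at the minimum is (J^T.W.J)^-1, with J the Jacobian of the model and W the diagonal matrix of reciprocal variances. A sketch for a simple straight-line model with invented data (the same construction applies to any model once its Jacobian is known):

```python
import numpy as np

# Hypothetical straight-line model y = a*x + b fitted to weighted data
# (invented x values and errors).
x = np.array([0.0, 1.0, 2.0, 3.0])
sigma = np.array([0.1, 0.1, 0.2, 0.2])

# Jacobian of the model with respect to the parameters (a, b).
J = np.column_stack([x, np.ones_like(x)])

# Weight matrix: reciprocal variances on the diagonal.
W = np.diag(1.0 / sigma**2)

# Covariance matrix estimate at the minimum: (J^T W J)^-1.
cov = np.linalg.inv(J.T @ W @ J)

# Parameter standard errors are the square roots of the diagonal.
param_errors = np.sqrt(np.diag(cov))
print(param_errors)
```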

Regards,

Edward

Troels E. Linnet
2015-12-01 15:29:50 UTC
Update of task #7882 (project relax):

Open/Closed: Open => Closed


_______________________________________________________

Reply to this item at:

<http://gna.org/task/?7882>
