Warning
This is beta functionality. While usable, this has not yet been tested in a large number of different scenarios. The API may also still be subject to change.
6. Fd Fitting¶
Download this page as a Jupyter notebook
Pylake is capable of fitting force extension curves using various polymer models. This tutorial aims at being a gentle introduction to these fitting capabilities.
6.1. Models¶
When fitting data, everything revolves around models. One of these models is the so-called extensible worm-like-chain
model by Odijk et al. Let’s have a look at it. We can construct this model using the supplied function lk.odijk()
.
Note that we also have to give it a name. This name will be prefixed to model specific parameters in this model:
>>> model = lk.odijk("DNA")
Entering model prints the model equation and its default parameters:
>>> model
Model equation:
d(f) = DNA.Lc * (1 - (1/2)*sqrt(kT/(f*DNA.Lp)) + f/DNA.St)
Parameter defaults:
Name Value Unit Fitted Lower bound Upper bound
------ ------- -------- -------- ------------- -------------
DNA/Lp 40 [nm] True 0 inf
DNA/Lc 16 [micron] True 0 inf
DNA/St 1500 [pN] True 0 inf
kT 4.11 [pN*nm] False 0 8
Note how most of the parameters listed have the model name prefixed. For example, the persistence length in this model
is now named DNA/Lp
. Here DNA
refers to the name we gave to the model.
One parameter stands out, which is kT as it doesn’t have DNA
prefixed. The reason for this is that this parameter is
not model specific (in this case the parameter represents the Boltzmann constant times the temperature).
We can also obtain the parameters as a list:
>>> print(model.parameter_names)
['DNA/Lp', 'DNA/Lc', 'DNA/St', 'kT']
As we can see from the model equation, the model is given as distance as a function of force.
6.2. Simulating the model¶
We can simulate the model by passing a dictionary with parameters values:
dna = lk.odijk("DNA")
force = np.arange(0.1, 14, 0.1)
dna(force, {"DNA/Lp": 50.0, "DNA/Lc": 16.0, "DNA/St": 1500.0, "kT": 4.11})
Note how the model name is prefixed to the model specific parameters, forming a key-value pair like so: "name/parameter": value
.
6.3. Model composition and inversion¶
In practice, we would typically want to fit force as a function of distance however. For this we have the inverted Odijk model:
model = lk.inverted_odijk("DNA")
We can have a quick look at what this model looks like and which parameters are in there:
>>> model
Model equation:
f(d) = argmin[f](norm(DNA.Lc * (1 - (1/2)*sqrt(kT/(f*DNA.Lp)) + f/DNA.St)-(d - DNA.d_offset))) + DNA.f_offset
Parameter defaults:
Name Value Unit Fitted Lower bound Upper bound
------ ------- -------- -------- ------------- -------------
DNA/Lp 40 [nm] True 0 inf
DNA/Lc 16 [micron] True 0 inf
DNA/St 1500 [pN] True 0 inf
kT 4.11 [pN*nm] False 0 8
This seems to be what we want: force as a function of distance. Let’s assume we have acquired data, but upon analysis
we notice that the template matching wasn’t completely optimal. In addition, we experienced some force drift, while
forgetting to reset the force back to zero. In this case, we can incorporate offsets in our model. We can introduce an
offset in the independent parameter, by calling .subtract_independent_offset()
on our model:
>>> model = lk.inverted_odijk("DNA").subtract_independent_offset()
>>> model
Model: DNA(x-d)
Model equation:
f(d) = argmin[f](norm(DNA.Lc * (1 - (1/2)*sqrt(kT/(f*DNA.Lp)) + f/DNA.St)-(d - DNA.d_offset)))
Parameter defaults:
Name Value Unit Fitted Lower bound Upper bound
------------ ------- -------- -------- ------------- -------------
DNA/d_offset 0.01 [au] True -0.1 0.1
DNA/Lp 40 [nm] True 0 100
DNA/Lc 16 [micron] True 0 inf
DNA/St 1500 [pN] True 0 inf
kT 4.11 [pN*nm] False 0 8
If we also expect an offset in the dependent parameter, we can simply add an offset model to our model:
>>> model = lk.inverted_odijk("DNA").subtract_independent_offset() + lk.force_offset("DNA")
>>> model
Model: DNA(x-d)_with_DNA
Model equation:
f(d) = argmin[f](norm(DNA.Lc * (1 - (1/2)*sqrt(kT/(f*DNA.Lp)) + f/DNA.St)-(d - DNA.d_offset))) + DNA.f_offset
Parameter defaults:
Name Value Unit Fitted Lower bound Upper bound
------------ ------- -------- -------- ------------- -------------
DNA/d_offset 0.01 [au] True -0.1 0.1
DNA/Lp 40 [nm] True 0 100
DNA/Lc 16 [micron] True 0 inf
DNA/St 1500 [pN] True 0 inf
kT 4.11 [pN*nm] False 0 8
DNA/f_offset 0.01 [pN] True -0.1 0.1
From the above example, you can see how easy it is to composite models. Sometimes, models become more complicated. For instance, we may have two worm like chain models that we wish to add, and then invert. For the Odijk model, this can be done as follows:
model = lk.odijk("DNA") + lk.odijk("protein") + lk.distance_offset("offset")
model = model.invert()
Note how we added three models and then inverted the composition of those models. Models inverted via invert()
will
typically be slower than the pre-inverted counterparts. This is because the inversion is done numerically rather than
analytically. For more complex examples on how this inversion may be used, please see the examples.
For a full list of models that are available, please refer to the documentation by invoking help(lk.fitting.models)
or see Available models.
6.4. Fitting data¶
To fit Fd models, we have to create an FdFit
. This object will collect all the parameters involved in the models and
data, and will allow you to interact with the model parameters and fit them. We construct it using lk.FdFit
and
pass it one or more models. In return, we get an object we can interact with, which in this case we store in fit
:
fit = lk.FdFit(model)
6.4.1. Adding data to the fit¶
To do a fit, we have to add data. Let’s assume we have two data sets. One was acquired in the presence of a ligand, and
another was measured without a ligand. We expect this ligand to only affect the contour length of our DNA. Let’s add the
first data set which we name Control
. Adding it to the fit is simple:
fit.add_data("Control", force1, distance1)
For the second data set, we want the contour length to be different. We can achieve this by renaming the parameter when loading the data:
fit.add_data("RecA", force2, distance2, params={"DNA/Lc": "DNA/Lc_RecA"})
More specifically, we renamed the parameter DNA/Lc
to DNA/Lc_RecA
.
6.4.2. Setting parameter bounds¶
The parameters of the model can be accessed directly from FdFit
. Note that by default, parameters tend to have
reasonable initial guesses and bounds in pylake, but we can set our own as follows:
fit["DNA/Lp"].value = 50
fit["DNA/Lp"].lower_bound = 39
fit["DNA/Lp"].upper_bound = 80
After this, the model is ready to be fitted. We can fit the model to the data by calling the function .fit()
. This
estimates the model parameters by minimizing the least squares differences between the model’s dependent variable and
the data in the fit:
fit.fit()
After this call, the parameters will have new values that should bring the model closer to the data. Note that multiple models can be fit at once by supplying more than one model:
fit = lk.FdFit(model1, model2, model3)
Frequently, global fits have better statistical properties than fitting the data separately as more information is available to infer parameters shared between the various models.
6.5. Plotting the data¶
A model can be plotted before it is fitted. This can be useful when the default parameter values don’t seem to work very well. Parameter estimation is typically initiated from an initial guess. A poor initial guess can lead to a poor parameter estimate. Therefore, you might want to see what your initial model curve looks like and set some better initial guesses yourself when you run into trouble.
Fits can be plotted using the built-in plot functionality:
fit.plot()
plt.ylabel("Force [pN]")
plt.xlabel("Distance [$\\mu$M]");
Sometimes, more fine grained control over the plots is required. Let’s say we want to plot the model over a range of
values (in this case values from 2.0 to 5.0) for the conditions corresponding to the Control
and RecA
data. We can
do this by supplying different arguments to the plot function:
fit.plot("Control", "k--", np.arange(2.0, 5.0, 0.01))
fit.plot("RecA", "k--", np.arange(2.0, 5.0, 0.01))
Or what if we really only want the model prediction, then we can do:
fit.plot("Control", "k--", np.arange(2.0, 5.0, 0.01), plot_data=False)
It is also possible to obtain simulations from the model directly. We can do this by calling the model with values for the independent variable (here denoted as distance) and the parameters required to simulate the model. We obtain these parameters by grabbing them from our fit object using the data handles:
distance = np.arange(2.0, 5.0, 0.01)
simulation_result = model(distance, fit["Control"])
Basically what happens here is that fit["Control"]
grabs those parameters needed to simulate the condition
corresponding to the dataset with the name control
. By providing specifically those parameters to the model, we can
simulate that condition.
6.6. Incremental fitting¶
Fits can also be done incrementally:
>>> model = lk.inverted_odijk("DNA")
>>> fit = lk.FdFit(model)
>>> print(fit.params)
No parameters
We can see that there are no parameters to be fitted. The reason for this is that we did not add any data to the fit yet. Let’s add some and fit this data:
>>> fit.add_data("Control", f1, d1)
>>> fit.fit()
>>> print(fit.params)
Name Value Unit Fitted Lower bound Upper bound
------ ---------- -------- -------- ------------- -------------
DNA/Lp 59.409 [nm] True 0 inf
DNA/Lc 2.81072 [micron] True 0 inf
DNA/St 1322.9 [pN] True 0 inf
kT 4.11 [pN*nm] False 0 8
Let’s add a second data set where we expect a different contour length and refit:
>>> fit.add_data("RecA", f2, d2, params={"DNA/Lc": "DNA/Lc_RecA"})
>>> print(fit.params)
Name Value Unit Fitted Lower bound Upper bound
----------- ---------- -------- -------- ------------- -------------
DNA/Lp 89.3347 [nm] True 0 inf
DNA/Lc 2.80061 [micron] True 0 inf
DNA/St 1597.68 [pN] True 0 inf
kT 4.11 [pN*nm] False 0 8
DNA/Lc_RecA 3.7758 [micron] True 0 inf
We see that indeed the second parameter now appears. We also note that the parameters from the first fit changed. If
this was not intentional, we should have fixed these parameters after the first fit. For example, we can fix the
parameter DNA/Lp
by invoking:
>>> fit["DNA/Lp"].fixed = True
6.7. Calculating per point contour length¶
Sometimes, one wishes to invert the model with respect to one parameter (i.e. re-estimate one parameter on a per data point basis). This can be used to obtain dynamic contour lengths for instance. In pylake, such an analysis can easily be performed. We first set up a model and fit it to some data. This is all analogous to what we’ve learned before:
# Define the model to be fitted
model = lk.inverted_odijk("model") + lk.force_offset("model")
# Fit the overall model first
fit = lk.FdFit(model)
fit.add_data("Control", force, distance)
fit.fit()
Now, we wish to allow the contour length to vary on a per data point basis. For this, we use the function
parameter_trace
. Here we see a few things happening. The first argument specifies the model to use for the inversion.
The second argument should contain the parameters to be used in this method. Note how we select them from the parameters
in the fit
using the same syntax as before (i.e. fit[data_name]
). Next, we specify which parameter has to be fitted
on a per data point basis. This is the parameter that we will re-estimate for every data point. Finally, we supply the
data to use in this analysis. First the independent parameter is passed, followed by the dependent parameter:
lcs = lk.parameter_trace(model, fit["Control"], "model/Lc", distance, force)
plt.plot(lcs)
The result of this analysis is an estimated contour length per data point, which can be used in subsequent analyses.
6.8. Advanced usage¶
6.8.1. Adding many data sets¶
Sometimes, you may want to add a large number of data sets with different offsets. Consider two lists of distance and
force vectors stored in distances
and forces
. In this case, it may make sense to load them in a loop and set such
transformations programmatically. We can iterate over both lists at once by using zip
. In addition, we wanted to have
a different offset for each data set. This means that we’d need to give those new offsets a name. Let’s just number
them. By adding enumerate, we also obtain an iteration counter, which we store in i
. The whole procedure can then
succinctly be summarized in just two lines of code:
for i, (d, f) in enumerate(zip(distances, forces)):
fit.add_data(f"RecA {i}", f, d, params={"DNA/f_offset": f"DNA/f_offset_{i}"})
The syntax f"DNA/f_offset_i"
is parsed into DNA/f_offset_0
, DNA/f_offset_1
… etc. For more information on
how this works, read up on Python fantastic f-Strings.
6.8.2. Global fits versus single fits¶
The FdFit
object manages a fit. To illustrate its use, and how a global fit differs from a local fit, consider the
following two examples:
model = lk.inverted_odijk("DNA")
fit = lk.FdFit(model)
for i, (distance, force) in enumerate(zip(distances, forces)):
fit.add_data(f"RecA {i}", f=force, d=distance)
fit.fit()
print(fit["DNA/Lc"])
and:
for i, (distance, force) in enumerate(zip(distances, forces)):
model = lk.inverted_odijk("DNA")
fit = lk.FdFit(model)
fit.add_data(f"RecA {i}", f=force, d=distance)
fit.fit()
print(fit["DNA/Lc"])
The first example is what we refer to as a global fit whereas the second example is an example of a local fit. The
difference between these two is that the former sets up one model that has to fit all the data whereas the latter fits
all the data sets independently. The former has one parameter set, whereas the latter has a parameter set per data set.
Also note how in the second example a new Model
and FdFit
is created at every cycle of the for loop.
Statistically, it is typically more optimal to fit data using global fitting (meaning you use one model to fit all the data, as opposed to recreating the model for each new set of data), as more information goes into estimates of parameters shared between different conditions. It’s usually a good idea to think about which parameters you expect to be different between different experiments and only allow these parameters to be different in the fit. For example, if the only expected difference between the experiments is the contour length, then this can be achieved using:
model = lk.inverted_odijk("DNA")
fit = lk.FdFit(model)
for i, (distance, force) in enumerate(zip(distances, forces)):
fit.add_data(f"RecA {i}", force, distance, {"DNA/Lc": f"DNA/Lc_{i}"})
fit.fit()
print(fit.params)
Note that this piece of code will lead to parameters DNA/Lc_0
, DNA/Lc_1
etc.
6.8.3. Multiple models¶
When working with multiple models, things can get a little more complicated. Let’s say we have two models, model1
and
model2
and we want to fit both in a global fit. Constructing the FdFit
is easy:
model1 = lk.inverted_odijk("DNA")
model2 = (lk.odijk("DNA") + lk.odijk("protein")).invert()
fit = lk.FdFit(model1, model2)
But then the question arises, how do we add data to each model? Well, the trick is in the assignments to model1
and
model2
. We can use these now to add data to each model as follows:
fit[model1].add_data("data for model 1", forces_1, distances_1)
fit[model2].add_data("data for model 2", forces_2, distances_2)
See how we used the model handles? They are used to let the FdFit
know to which model each data set should be added.
You can add as many data sets as you want to both models and fit it all at once.
Plotting is straightforward in this setting. We can plot the data sets corresponding to model 1 and 2 as follows:
fit[model1].plot()
fit[model2].plot()
Accessing the model parameters for a specific data set is a little more complicated in this setting. If we want to obtain the parameters for “data for model 1”, we’d have to invoke:
params = fit[model1]["data for model 1"]
Note how we are now forced to index the model first using the square brackets, and only then access the data set by name. An unfortunate necessity when it comes to multi-model curve fitting.
6.8.4. Confidence intervals and standard errors¶
Once parameters have been fitted, standard errors can easily be obtained as follows:
fit["DNA/Lc"].stderr
Assuming that the parameters are not at the bounds, the sum of random variables with finite moments converges to a
Gaussian distribution. This allows for the computation of confidence intervals using the Wald test
[PT90]. To get these asymptotic intervals, we can use the member function .ci
with a desired
confidence interval:
fit["DNA/Lc"].ci(0.95)
Note that the bounds returned by this call are only asymptotically correct and should be used with caution. Better confidence intervals can be obtained using the profile likelihood method [RKM+09][MHS+16]. Note that these profiles require iterative computation and are therefore time consuming to produce. Determining confidence intervals via profiles has two big advantages however:
- The confidence intervals no longer depend on the parametrization of the model (for more information on this see [MHS+16]).
- By inspecting the profile, we can diagnose problems with the model we are using.
Profiles can easily be computed by calling profile_likelihood()
on the fit:
profile = fit.profile_likelihood("DNA/Lc", num_steps=1000)
For a well parametrized model with sufficient data, a profile plot results in a (near) parabolic shape, where the line of the parabola intersects with the confidence interval lines (dashed). The confidence intervals are then determined to be at those intersection points:
profile.plot()
One thing that may be of interest is to plot the relations between parameters in these profile likelihoods:
profile.plot_relations()
These inferred relations can provide information on the coupling between different parameters. This can be quite informative when diagnosing fitting issues. For example, when fitting a contour length in the presence of an distance offset, we can observe that the two are related. To produce the following figure, we set a lower bound and upper bound of -0.1 and 0.1 for the distance respectively. We can see that the profile is perfectly flat until the distance reaches the bound. Only then does the profile suddenly jump.
What this shows is that a change in one parameter (DNA/Lc_RecA
) can be compensated by a change in the other. This
highlights the importance of constraining distance offset parameters when trying to estimate an absolute contour length.
In this sample case, fixing the distance offset to zero recovers the parabolic profile from before.