Accommodating Error Analysis in Comparison and Clustering of
Molecular Fingerprints
Hugh Salamon,* Mark R. Segal,* Alfredo Ponce de Leon, and Peter M.
Small
*University of California, San Francisco, California, USA; Instituto Nacional de la
Nutrición, Zubirán, Mexico City, Mexico; and Stanford University Medical Center,
Stanford, California, USA
Figure 1. The align-and-count method finds the maximum number of mutually closest
bands within a threshold deviation value, for a search across a range S of scaling values. The
two lanes are scaled incrementally against each other to search for the best alignment.
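
The search described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' software: the function names, the relative form of the threshold, the default scaling range, and the step count are all assumptions made for the example.

```python
def mutual_matches(lane_a, lane_b, threshold):
    """Count bands that are each other's nearest neighbour across two lanes
    and whose lengths differ by at most `threshold` (a relative deviation,
    which is an assumption of this sketch)."""
    if not lane_a or not lane_b:
        return 0
    count = 0
    for i, a in enumerate(lane_a):
        # Index of the band in lane_b closest to band a.
        j = min(range(len(lane_b)), key=lambda k: abs(lane_b[k] - a))
        # Index of the band in lane_a closest to that lane_b band.
        i_back = min(range(len(lane_a)), key=lambda k: abs(lane_a[k] - lane_b[j]))
        # Keep the pair only if the choice is mutual and within the threshold.
        if i_back == i and abs(a - lane_b[j]) <= threshold * a:
            count += 1
    return count


def align_and_count(lane_a, lane_b, threshold=0.02,
                    scale_min=0.95, scale_max=1.05, steps=41):
    """Scale lane_b incrementally over [scale_min, scale_max] and return
    (best_count, best_scale). Ties keep the first scale encountered."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    best_count, best_scale = -1, None
    for n in range(steps):
        scale = scale_min + (scale_max - scale_min) * n / (steps - 1)
        scaled_b = [scale * b for b in lane_b]
        matched = mutual_matches(lane_a, scaled_b, threshold)
        if matched > best_count:
            best_count, best_scale = matched, scale
    return best_count, best_scale
```

With these hypothetical defaults, two lanes that differ only by a uniform 4% stretch share no bands within a 2% threshold before scaling, but all bands match once the search reaches a suitable scale.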
Figure 2. Means, with error bars of two standard errors of the mean, for pairwise
comparisons among 116 12-banded H37Rv lanes show that error is consistently larger when
comparing lanes between gels than when comparing lanes from the same gel. The x-axis
corresponds to w(b), and the y-axis to d(b), as presented in the text. Error is
evidently proportional to fragment length over the range of fragment lengths
found in H37Rv. The data exhibit 2% to 3% error for between-gel comparisons, but only
approximately 1% error on average for within-gel comparisons.
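
One way to check proportionality of this kind is to fit a line through the origin to the absolute difference between matched bands plotted against their mean length; the slope is then the fractional error. The sketch below assumes that reading. The function name and the input format (pairs of matched fragment lengths in kb) are hypothetical, and it does not reproduce the definitions of w(b) and d(b) given in the text.

```python
def proportional_error(matched_pairs):
    """Least-squares slope through the origin of |difference| against mean
    length for matched band pairs, i.e. an estimate of fractional error."""
    numerator = 0.0
    denominator = 0.0
    for length_1, length_2 in matched_pairs:
        mean_length = (length_1 + length_2) / 2.0
        difference = abs(length_1 - length_2)
        numerator += mean_length * difference
        denominator += mean_length * mean_length
    if denominator == 0.0:
        raise ValueError("need at least one pair with nonzero length")
    return numerator / denominator
```

Under this assumption, a slope near 0.01 would correspond to the roughly 1% within-gel error in the caption, and a slope between 0.02 and 0.03 to the between-gel error.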
Figure 3. Additional alignment of very similar patterns can identify clearly distinct
patterns. Measurement noise obscures the detailed relationships between 26 patterns that
were identified from 1,335 as being very similar. However, after alignment to a consensus
pattern, a clearly distinct pattern (an outlier from the other members of this
autocluster) can be readily identified. Fragment lengths are given in kilobasepairs (kb).
Figure 4. Histograms of the fragment lengths for 84 two-banded patterns connected by
identity (autoclustered with in-house software) exhibit enough spread in values to make
detecting outliers and band shifts difficult (a,b). Aligning the 84 lanes to the
mean-value lane for this collection reveals that the lanes do not align well but instead
show bimodal distributions for the fragment lengths (c,d). Dividing the 84 fingerprints
into two sets, separating the distinct distributions detected when aligning all 84
fingerprints, shows that 26 fingerprints align well to their mean-value lane (e,f) and that the
remaining 58 also align well to their respective mean-value lane (g,h). The shorter
fragment does not appear shifted between the two sets of 2-banders
(comparing e to g), but the longer fragment is clearly shifted (comparing f to h).
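
Alignment to a mean-value lane, as used in Figure 4, can be illustrated with a short sketch. It assumes that alignment means applying a single multiplicative scale factor to each lane, chosen by least squares, and that all lanes have the same number of bands in the same order. Both function names are hypothetical, and this is not the in-house software mentioned in the caption.

```python
def mean_value_lane(lanes):
    """Band-by-band mean of fragment lengths across lanes of equal band count."""
    if not lanes:
        raise ValueError("need at least one lane")
    n_bands = len(lanes[0])
    if any(len(lane) != n_bands for lane in lanes):
        raise ValueError("all lanes must have the same number of bands")
    return [sum(lane[i] for lane in lanes) / len(lanes) for i in range(n_bands)]


def align_to_reference(lane, reference):
    """Rescale `lane` by the factor s that minimises
    sum((s * lane[i] - reference[i]) ** 2) over all bands."""
    scale = (sum(l * r for l, r in zip(lane, reference))
             / sum(l * l for l in lane))
    return [scale * l for l in lane]
```

Lanes that differ only by a uniform stretch collapse onto the mean-value lane after this step. A lane in which one band has genuinely shifted cannot be brought into agreement by any single scale factor, and that residual is what makes a shift visible after alignment.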
Figure 5. Prior to alignment of two sets of 2-banders, lanes are difficult to cluster
(lanes a-d are from the distributions in Figures 4a and 4b). Subsequent to alignment,
lanes are much easier to cluster (lanes a' and b' are specific examples from the
distributions in Figures 4e and 4f; lanes c' and d' likewise correspond to Figures 4g and
4h). Fragment lengths are given in kilobasepairs (kb).