The goal of this part of the package is to calculate the theoretical
probability that the \(i\)-th excursion
reach the threshold score \(a\), for a
Markov chain with a known transition matrix and a given score function.
The main function for this objective is
proba_theoretical_ith_excursion_markov().
This section of the localScore package is the direct implementation of the pre publication: “Exact distribution of excursion heights of the Lindley process in a Markovian model” written by Carlos Cortés Rojas, Simona Grusea and Sabine Mercier.
The main goal of this function is to calculate the probability that
the first excursion of the Lindley sequence associated to the studied
sequence is greater or equal to \(a\),
conditionally to \(\alpha\) a potential
beginning of this sequence. It also computes the transition probability
matrix of the beginnings of excursions: matrix_M.
The product of this matrix, elevated to the power \(i-1\), with the first vector of the probabilities conditionally to \(\alpha\), gives the probabilities conditionally to \(\alpha\) for the \(i\)-th excursion. To return the global probability, the function multiplies the conditional vector with the distribution of the first letter of the sequence.
The number of an excursion is given by the mathematical definition of Karlin and Altschul (1990). Its corresponds to the number of record times of the Lindley sequence associated to the score sequence \((X_k)_{1\leq k\leq n}\) and recursively defined as follows: \[T_0:=0\qquad \mbox{and }\qquad T_{(k+1)}:=\inf\{i\geq T_k+1, \sum_{j=T_k+1}^i X_j<0\}\ .\] Please note that the sum \(\sum_{j=T_k+1}^i X_j\) must be non positive. If it is equal to zero, it will mathematically be still the same Lindley excursion. \[(A_k)_k\quad \mbox{with}\quad T_{(i-1)}+1\leq k <T_i\] is called the \(i\)-th positive excursion of the sequence in the term of Karlin and Altschul. This mathematical definition includes the index with a Linldey score equal to 0, whereas visually we “see” the excursion beginning at the following index.
The mathematical number of an excursion must be distinguished from the one of the corresponding visible excursion because of the possible appearance of flat excursions. See below for an example.
seq <- c(-1,2,-3,-4,-1,3,-1,-2,3,-4,2)
lindley(seq)
#> [1] 0 2 0 0 0 3 2 0 3 0 2
plot(1:length(seq), lindley(seq), type = 'l')localScoreC(seq)
#> $localScore
#> value begin end
#> 3 6 6
#>
#> $suboptimalSegmentScores
#> value begin end
#> 1 2 2 2
#> 2 3 6 6
#> 3 3 9 9
#> 4 2 11 11
#>
#> $RecordTime
#> [1] 0 1 3 4 5 10We can “see” three positive excursions, with top at index \(\{2; 6; 9\}\); plus one last not achieved at index \(\{11\}\). Their exists also five flats excursions: at index \(\{1 ; 3 ; 4 ; 5 ; 10\}\). Note also that the Lindley excursion 6-10 reach 0 at index 8 without going in non positive values, so it is still the same excursion.
In summary, the excursions list is: \(\{1 ; 2-3 ; 4 ; 5 ; 6-10 ; 11\}\). The following function gives the record time of the mathematical excursion that must be used to compute the statistical significance. Note that we choose to omit the conventional first record time which is always 0.
So the sub optimal excursion values given by
$suboptimalSegmentScores 2 3 3 2 are achieved by the
excursion number 2, 5 (twice) and 6. At this point, we wish to lighten a
difference between the notion of “excursion” defined mathematically by
Karlin and Dembo (1992), and the results given in
$suboptimalSegmentScores. The segment [6-10] is an
excursion in mathematical sense as the cumulative process starting at
position 7 never hit a negative value before the 10th position.
Nevetheless this process hit a null value indicating that this single
excursion contain at least two positive score segments (here : at
position 6 and 9). We chose to indicate in
$suboptimalSegmentScores all positive consecutive
sub-sequences, as it can be more meaningful when applied on real
data.
To calculate the \(p\)-value of a given excursion, we have to know which number of excursion it is in sequential order. Notice that it will not bring any problem of making a mistake in the number of the excursion for a number exceeding 10, and a small difference for a lower number, as the sequence composed by the begin-component of each excursion, which is also a Markov chain, can reach the stationary distribution. For the first excursions, it is far more easy to recover the record time of the corresponding mathematical excursion in the list.
As the function calculates a theoretical probability, we don’t need a
sequence of scores but the transition matrix of the markov chain: \(\Lambda\). It also requires a score
function with: integer scores, a negative expectation and at least one
possible positive score. We also need \(\theta\), an alphabet (can contains
numbers) with unique values ; and \(i\): the rank of the excursion on which we
calculate the probabilities. The optional parameters are
epsilon and prob0. epsilon is a
threshold for the computation of matrix \(M\) and prob0 is the
probability distribution of the first letter of the sequence. Obviously,
theta, the score function and prob0 should
have the same length \(p\), and the
dimension of lambda should be \((p,p)\).
In this example, we want to know the probability that the second excursion reach a score greater or equal than 5 with the matrix \(\Lambda\) and a score function of \((-1,1)\), given that the first score of the sequence have 50 percent chances to be -1.
proba_theoretical_ith_excursion_markov(a = 5,
theta = c("A", "B"),
lambda = matrix(c(0.5, 0.5, 0.8, 0.2),
ncol = 2, byrow = TRUE),
score_function = c(-1,1),
i = 2,
epsilon = 1e-16,
prob0 = c(0.5, 0.5))
#> $proba_q_i_geq_a
#> [1] 0.009661836
#>
#> $P_alpha
#> A B
#> 0.009661836 0.009661836In this following example, we see that this is not a problem to have
missing score values in the score function, and that theta can contain
numbers. Initial value of epsilon is 1e-16, and if we don’t
precise the value of prob0, the function compute the
stationary distribution of \(\Lambda\),
and use it for the distribution of the first letter of the sequence.
proba_theoretical_ith_excursion_markov(a = 5,
theta = c("K", 12, 1.54),
lambda = matrix(c(0.9, 0, 0.1, 0.9, 0.1, 0, 0, 0.8, 0.2),
ncol = 3, byrow = TRUE),
score_function = c(-3,-1,2),
i = 4)
#> $proba_q_i_geq_a
#> [1] 0.00405119
#>
#> $P_alpha
#> K 12 1.54
#> 0.004054779 0.004050819 0.004022805In the following example, the function takes longer as we increased the number of scores (complexity of \(O(length(\theta)^3))\), but here with 20 scores, the computational time is acceptable.
transition_matrix <- matrix(runif(400, min = 0, max = 1), nrow = 20)
transition_matrix <- t(apply(transition_matrix, 1, function(x) x/sum(x)))
theta <- letters[1:20]
score_f <- c(-3,-1,2,-3,-1,2,-3,-1,2,-1,
-3,-1,2,-3,-1,2,-3,-1,2,-1)
sum(stationary_distribution(transition_matrix)*score_f) #score expectation (stationary)
#> [1] -0.7100465
system.time(pv1 <- proba_theoretical_ith_excursion_markov(a=5, theta, transition_matrix, score_f, i = 4))
#> user system elapsed
#> 6.138 5.553 5.389
pv1$proba_q_i_geq_a
#> [1] 0.06333011Let us consider the protein Seq1093 of 1093 amino acid
proposed in the package localScore which corresponds to the
Q60519.fasta in UniProt Data base.
Using the hydrophobic score scale called HydroScore, we
compute the local score, corresponding to the height of the highest
excursion.
library(localScore)
data("Seq1093")
Seq1093
#> [1] "MVVPGPLALSLLLSSLTLLVSHLSSSQDIASESSSEQQMCTRREHPIVAFEDLKPWVFNFTYPGVRDFSQLALDPSRNQLIVGARNYLFRLSLANVSLLQATEWASSEDTRRSCQSKGKTEEECQNYVRVLIVSGRKVFMCGTNAFSPVCSSRQVGNLSRTIEKINGVARCPYDPRHNSTAVISSQGELYAATVIDFSGRDPAIYRSLGSGPPLRTAQYNSKWLNEPNFVAAFDIGLFAYFFLRENAVEHDCGRTVYSRVARVCKNDVGGRFLLEDTWTTFMKARLNCSRPGEVPFYYNELQSAFHLPEQDLIYGVFTTNVNSIAASAVCAFNLSAISKAFNGPFRYQENPRAAWLPIANPIPNFQCGTLPETGPNENLTERSLQDAQRLFLMSEAVQPVTPEPCVTQDSVRFSHLVVDLVQAKDTLYHVLYIGTESGTILKALSTASRSLRGCYLEELHVLPPGRLEPLRSLRILHSARALFVGLSDRVLRVPLERCSAYHSQGACLGARDPYCGWDGKRQLCSTLEDSSNMSLWIQNITTCPVRNVTRDGGFGPWSPWKPCEHLDGDNSGSCLCRARSCDSPRPRCGGLECLGPSIHIANCSRNGAWTAWSSWAQCSTSCGIGFQVRQRSCSNPAPRHGGRICVGKSREERFCNENTPCPVPIFWASWGSWSKCSNNCGGGVQSRRRSCENGNSCPGCGVEFKTCNPEACPEVRRNTPWTPWLPVNVTQGGARQEQRFRFTCRAPLPDPHGLQFGKRRTETRTCPADGTGACDTDALVEDLLRSGSTSPHTLNGGWATWGPWSSCSRDCELGFRVRKRTCTNPEPRNGGLPCVGDAAEYQDCNPQACPVRGAWSCWTAWSQCSASCGGGHYQRTRSCTSPAPSPGEDICLGLHTEEALCSTQACPEGWSLWSEWGVCTEDGAQSRSRSCEELLPGPGACVGNSSQSRPCPYSEIPVILPASSVEETTSCGGFNLIHLIVTGVSCFLVSGLLTLAVYLSCQHCQRQSQESTLVHPATPNHLHYKGGGTPKNEKYTPMEFKTLNKNNLIPDDRANFYPLQQTNVYTTTYYPSPLNKPSFRPEASPGQRCFPNS"
nchar(Seq1093)
#> [1] 1093
data(HydroScore)
LongSeqScore <- CharSequence2ScoreSequence(Seq1093, HydroScore)
table(LongSeqScore)
#> LongSeqScore
#> -5 -4 -3 -2 -1 0 2 3 4 5
#> 81 222 22 82 232 85 78 103 157 31
localScoreC(LongSeqScore)
#> $localScore
#> value begin end
#> 65 956 1001
#>
#> $suboptimalSegmentScores
#> value begin end
#> 1 40 1 20
#> 2 10 71 73
#> 3 23 80 99
#> 4 3 114 114
#> 5 3 124 124
#> 6 4 128 128
#> 7 23 130 150
#> 8 16 181 195
#> 9 2 217 217
#> 10 4 224 224
#> 11 33 229 243
#> 12 5 294 296
#> 13 4 301 301
#> 14 6 304 307
#> 15 36 312 337
#> 16 4 379 379
#> 17 4 384 384
#> 18 2 387 387
#> 19 14 390 397
#> 20 4 411 411
#> 21 3 413 413
#> 22 25 416 486
#> 23 9 523 527
#> 24 9 533 537
#> 25 4 548 548
#> 26 3 554 554
#> 27 3 563 563
#> 28 4 566 566
#> 29 10 574 576
#> 30 16 588 601
#> 31 3 633 633
#> 32 2 637 637
#> 33 12 644 646
#> 34 6 654 655
#> 35 12 661 668
#> 36 7 680 684
#> 37 3 691 691
#> 38 8 697 702
#> 39 5 711 712
#> 40 4 715 715
#> 41 6 725 727
#> 42 3 740 740
#> 43 5 742 744
#> 44 2 746 746
#> 45 4 748 748
#> 46 4 754 754
#> 47 3 756 756
#> 48 3 766 766
#> 49 5 773 774
#> 50 10 778 780
#> 51 4 794 794
#> 52 2 799 799
#> 53 3 807 807
#> 54 3 811 811
#> 55 7 813 815
#> 56 3 822 822
#> 57 9 832 835
#> 58 3 844 844
#> 59 7 848 851
#> 60 6 864 868
#> 61 3 879 879
#> 62 2 883 883
#> 63 16 890 894
#> 64 2 924 924
#> 65 3 931 931
#> 66 13 934 942
#> 67 3 951 951
#> 68 65 956 1001
#> 69 9 1048 1049
#> 70 2 1054 1054
#> 71 3 1056 1056
#> 72 4 1059 1059
#> 73 4 1064 1064
#> 74 4 1074 1074
#> 75 3 1079 1079
#> 76 2 1083 1083
#> 77 6 1089 1090
#>
#> $RecordTime
#> [1] 0 70 77 78 79 112 113 115 116 117 119 120 121 122 123
#> [16] 125 126 127 129 176 177 178 179 180 218 219 220 221 222 223
#> [31] 226 227 228 290 291 293 299 300 303 310 311 378 381 382 383
#> [46] 386 388 389 412 415 521 522 531 532 550 551 558 559 560 561
#> [61] 562 564 565 569 570 571 573 583 584 585 586 587 631 632 635
#> [76] 636 639 640 643 651 652 653 657 658 659 660 679 687 688 689
#> [91] 690 692 693 695 696 709 710 714 716 717 718 719 720 721 722
#> [106] 723 724 735 736 737 738 739 741 750 751 752 758 759 760 761
#> [121] 762 763 764 765 769 771 777 791 792 793 798 803 804 805 806
#> [136] 809 810 812 819 820 821 824 825 826 827 828 829 843 845 846
#> [151] 847 863 874 875 876 877 878 882 885 886 888 889 922 925 926
#> [166] 927 928 929 930 932 933 949 950 954 955 1045 1046 1047 1052 1053
#> [181] 1055 1061 1062 1063 1069 1070 1071 1072 1073 1076 1077 1078 1080 1081 1082
#> [196] 1085 1087 1088 1093
head(sort(localScoreC(LongSeqScore)$suboptimalSegmentScores[,1], decreasing = TRUE))
#> [1] 65 40 36 33 25 23
plot(1:1093,lindley(LongSeqScore), type = 'l')We get: local score = 65. Let us compute its \(p\)-value in a I.I.D. model.
LS <- localScoreC(LongSeqScore)$localScore["value"]
prob1 <- scoreSequences2probabilityVector(list(LongSeqScore))
prob1
#> -5 -4 -3 -2 -1 0 1
#> 0.07410796 0.20311070 0.02012809 0.07502287 0.21225984 0.07776761 0.00000000
#> 2 3 4 5
#> 0.07136322 0.09423605 0.14364135 0.02836231
daudin(local_score = LS, sequence_length = length(LongSeqScore),
score_probabilities = prob1,
sequence_min = min(LongSeqScore),
sequence_max = max(LongSeqScore))
#> [1] 0.07231933In an independent model the \(p\)-value equals \(7.2\%\), that is non significant using the nominal level \(5\%\).
Let us compute it in a markovian model. We need the transition matrix.
tmpMarkovParameters <- sequences2transmatrix(LongSeqScore)
lambda <- tmpMarkovParameters$transition_matrix
apply(lambda, 1, sum) #to check stochasticity
#> -5 -4 -3 -2 -1 0 2 3 4 5
#> 1 1 1 1 1 1 1 1 1 1
prob0 <- stationary_distribution(lambda)
print(prob0)
#> [1] 0.07415553 0.20331534 0.02016219 0.07512607 0.21241680 0.07789931
#> [7] 0.07052191 0.09425453 0.14375820 0.02839011
score_values <- tmpMarkovParameters$score_value
print(score_values)
#> [1] -5 -4 -3 -2 -1 0 2 3 4 5Notice that the score value 1 is not present in the sequence.
exact_mc(LS, lambda, sequence_length = length(LongSeqScore), score_values = score_values)
#> [1] 0.07695456The \(p\)-value in a Markovian model, equal to \(7.7\%\) is very similar in this case with the one of the independent model.
Let us consider a study on the first and the last sub optimal excursions.
subOptSegment <- localScoreC(LongSeqScore)$suboptimalSegmentScores
o <- order(subOptSegment[,1], decreasing = TRUE)
subOptSegment <- subOptSegment[o,] # reordering segments by decreasing score values
print(subOptSegment)
#> value begin end
#> 68 65 956 1001
#> 1 40 1 20
#> 15 36 312 337
#> 11 33 229 243
#> 22 25 416 486
#> 3 23 80 99
#> 7 23 130 150
#> 8 16 181 195
#> 30 16 588 601
#> 63 16 890 894
#> 19 14 390 397
#> 66 13 934 942
#> 33 12 644 646
#> 35 12 661 668
#> 2 10 71 73
#> 29 10 574 576
#> 50 10 778 780
#> 23 9 523 527
#> 24 9 533 537
#> 57 9 832 835
#> 69 9 1048 1049
#> 38 8 697 702
#> 36 7 680 684
#> 55 7 813 815
#> 59 7 848 851
#> 14 6 304 307
#> 34 6 654 655
#> 41 6 725 727
#> 60 6 864 868
#> 77 6 1089 1090
#> 12 5 294 296
#> 39 5 711 712
#> 43 5 742 744
#> 49 5 773 774
#> 6 4 128 128
#> 10 4 224 224
#> 13 4 301 301
#> 16 4 379 379
#> 17 4 384 384
#> 20 4 411 411
#> 25 4 548 548
#> 28 4 566 566
#> 40 4 715 715
#> 45 4 748 748
#> 46 4 754 754
#> 51 4 794 794
#> 72 4 1059 1059
#> 73 4 1064 1064
#> 74 4 1074 1074
#> 4 3 114 114
#> 5 3 124 124
#> 21 3 413 413
#> 26 3 554 554
#> 27 3 563 563
#> 31 3 633 633
#> 37 3 691 691
#> 42 3 740 740
#> 47 3 756 756
#> 48 3 766 766
#> 53 3 807 807
#> 54 3 811 811
#> 56 3 822 822
#> 58 3 844 844
#> 61 3 879 879
#> 65 3 931 931
#> 67 3 951 951
#> 71 3 1056 1056
#> 75 3 1079 1079
#> 9 2 217 217
#> 18 2 387 387
#> 32 2 637 637
#> 44 2 746 746
#> 52 2 799 799
#> 62 2 883 883
#> 64 2 924 924
#> 70 2 1054 1054
#> 76 2 1083 1083The sub optimal scores are 40, 36 and 33 realized by the “visual” excursions number 1, 15 and 11. Such number of excursion, 11 and 15 are large enough to avoid considering the mathematical number. The first excursion is mathematically the first one (see below).
lindley(LongSeqScore)
#> [1] 2 6 10 8 8 6 10 12 16 15 19 23 27 26 25 29 28 32 36 40 39 36 40 39
#> [25] 38 37 33 29 34 36 35 31 30 29 28 24 20 16 18 21 20 15 10 6 3 1 6 10
#> [49] 12 15 11 7 11 7 5 4 8 11 7 10 9 8 6 6 10 5 1 4 3 0 4 6
#> [73] 10 6 4 3 0 0 0 4 9 13 13 15 10 6 5 9 12 7 11 10 14 16 12 16
#> [97] 15 19 23 19 21 20 16 15 17 16 15 11 7 6 1 0 0 3 0 0 0 0 0 0
#> [121] 0 0 0 3 0 0 0 4 0 4 8 13 17 16 16 11 7 11 14 16 19 19 18 14
#> [145] 16 19 18 16 20 23 22 21 16 12 16 16 12 16 15 10 9 14 10 6 11 7 7 11
#> [169] 13 8 11 9 8 4 2 0 0 0 0 0 2 6 11 10 9 5 5 1 5 4 6 8
#> [193] 7 11 16 12 15 14 14 9 5 3 5 10 9 4 3 7 7 6 6 4 2 6 1 0
#> [217] 2 0 0 0 0 0 0 4 0 0 0 0 3 7 9 11 14 10 15 15 19 22 24 23
#> [241] 26 29 33 28 24 20 22 26 22 19 15 18 18 13 12 16 15 14 9 13 15 10 14 17
#> [265] 13 9 5 9 9 9 4 7 11 15 11 7 6 5 4 3 6 8 4 6 1 5 1 4
#> [289] 3 0 0 0 0 4 2 5 4 3 0 0 4 0 0 2 5 2 6 4 0 0 0 4
#> [313] 9 8 8 12 15 14 13 9 13 9 8 13 15 17 16 18 22 25 27 30 26 30 29 31
#> [337] 36 35 31 33 36 32 32 30 33 28 27 23 19 15 13 8 10 12 11 15 13 18 20 16
#> [361] 14 19 17 13 16 12 15 15 14 18 16 12 11 11 9 5 1 0 4 3 0 0 0 4
#> [385] 0 0 2 0 0 4 7 11 13 12 8 10 14 10 8 12 11 9 5 3 6 10 9 5
#> [409] 1 0 4 0 3 2 0 4 8 12 8 12 16 12 14 10 6 5 9 8 5 9 13 12
#> [433] 17 17 16 12 11 11 10 15 19 15 17 21 20 19 21 20 15 14 18 13 13 16 15 19
#> [457] 15 11 15 12 16 20 18 16 16 11 15 11 9 13 8 7 11 6 11 15 12 11 13 8
#> [481] 10 14 17 21 21 25 24 20 15 19 23 18 22 20 24 20 15 18 17 19 18 15 14 10
#> [505] 10 12 15 19 19 21 16 12 10 9 12 12 11 7 7 3 0 0 4 7 6 5 9 5
#> [529] 1 0 0 0 2 1 5 4 9 5 1 6 5 4 7 5 9 4 0 4 3 0 0 0
#> [553] 0 3 3 1 0 0 0 0 0 0 3 0 0 4 0 0 0 0 0 0 0 3 7 10
#> [577] 5 7 2 1 4 0 0 0 0 0 0 3 3 3 7 3 6 10 10 8 7 12 9 14
#> [601] 16 12 15 14 9 5 5 7 6 5 7 6 5 4 3 5 1 4 3 2 1 4 4 9
#> [625] 9 12 8 12 7 3 0 0 3 2 0 0 2 0 0 0 0 0 0 5 8 12 12 8
#> [649] 7 2 0 0 0 3 6 2 0 0 0 0 3 1 5 3 8 11 10 12 11 10 10 9
#> [673] 8 7 3 6 5 1 0 3 3 3 3 7 3 2 0 0 0 0 3 0 0 0 0 0
#> [697] 3 1 1 4 4 8 4 7 3 2 5 1 0 0 2 5 3 0 4 0 0 0 0 0
#> [721] 0 0 0 0 4 2 6 2 6 5 1 1 1 3 0 0 0 0 0 3 0 3 2 5
#> [745] 0 2 0 4 2 0 0 0 0 4 0 3 3 0 0 0 0 0 0 0 0 3 1 3
#> [769] 0 0 0 0 2 5 1 0 0 2 6 10 6 2 6 10 5 4 4 3 2 1 0 0
#> [793] 0 4 0 0 0 0 2 1 0 0 0 0 0 0 3 2 0 0 3 0 4 4 7 2
#> [817] 6 1 0 0 0 3 2 0 0 0 0 0 0 0 0 4 2 5 9 9 5 7 9 5
#> [841] 4 0 0 3 0 0 0 2 5 3 7 2 2 4 3 2 5 4 3 5 4 3 0 3
#> [865] 2 4 3 6 6 6 6 3 2 0 0 0 0 0 3 2 1 0 2 0 0 0 0 0
#> [889] 0 5 8 12 12 16 13 12 8 4 6 10 13 12 11 7 9 12 10 6 6 5 4 8
#> [913] 7 6 2 1 1 5 8 7 3 0 0 2 0 0 0 0 0 0 3 0 0 4 8 6
#> [937] 6 4 4 6 9 13 13 9 8 7 3 2 0 0 3 1 0 0 0 5 3 7 12 16
#> [961] 14 16 15 14 18 14 10 9 8 7 10 10 10 13 9 13 18 15 19 24 28 27 27 31
#> [985] 30 33 36 40 44 43 43 47 51 50 54 56 60 59 63 62 65 61 58 61 57 52 48 47
#> [1009] 43 39 38 37 41 45 42 40 42 41 39 35 32 36 33 32 28 28 28 28 27 25 21 17
#> [1033] 13 9 8 7 5 7 3 6 2 1 5 1 0 0 0 4 9 7 3 0 0 2 0 3
#> [1057] 2 0 4 0 0 0 0 4 3 2 1 0 0 0 0 0 0 4 0 0 0 0 3 0
#> [1081] 0 0 2 1 0 0 0 0 3 6 4 0 0
recordTimes(LongSeqScore)
#> [1] 70 77 78 79 112 113 115 116 117 119 120 121 122 123 125
#> [16] 126 127 129 176 177 178 179 180 218 219 220 221 222 223 226
#> [31] 227 228 290 291 293 299 300 303 310 311 378 381 382 383 386
#> [46] 388 389 412 415 521 522 531 532 550 551 558 559 560 561 562
#> [61] 564 565 569 570 571 573 583 584 585 586 587 631 632 635 636
#> [76] 639 640 643 651 652 653 657 658 659 660 679 687 688 689 690
#> [91] 692 693 695 696 709 710 714 716 717 718 719 720 721 722 723
#> [106] 724 735 736 737 738 739 741 750 751 752 758 759 760 761 762
#> [121] 763 764 765 769 771 777 791 792 793 798 803 804 805 806 809
#> [136] 810 812 819 820 821 824 825 826 827 828 829 843 845 846 847
#> [151] 863 874 875 876 877 878 882 885 886 888 889 922 925 926 927
#> [166] 928 929 930 932 933 949 950 954 955 1045 1046 1047 1052 1053 1055
#> [181] 1061 1062 1063 1069 1070 1071 1072 1073 1076 1077 1078 1080 1081 1082 1085
#> [196] 1087 1088 1093
LongSeqScore[1] > 0
#> [1] TRUELet us consider a study on the first and the last sub optimal excursions in the sequential order of the sequence.
First excursion height equal to 40; last one, the 77th one, height equal to 6.
subOptSegment["1",] #First excursion
#> value begin end
#> 1 40 1 20
subOptSegment[as.character(dim(subOptSegment)[1]),] #Last excursion
#> value begin end
#> 77 6 1089 1090Let us compute their \(p\)-values.
theta <- letters[1:length(score_values)] # arbitrary
score_function <- score_values # defined earlier
a <- 40
i <- 1
if (FALSE) { # To avoid long computation on CRAN servers... (Comment if you
# want to try)
system.time(pv2 <- proba_theoretical_ith_excursion_markov(a, theta, lambda,
score_function,i)$proba_q_i_geq_a)
} else {
pv2 <- 0.1741784
}
#> utilisateur système écoulé
#> 38.061 0.033 38.137
pv2
#> [1] 0.1741784
a <- 6
i <- 77
system.time(pv3<-proba_theoretical_ith_excursion_markov(a, theta, lambda,
score_function, i)$proba_q_i_geq_a)
#> user system elapsed
#> 1.370 0.007 1.378
pv3
#> [1] 0.1741784First excursion : height equal to 40 with \(p\)-value=\(0.45\%\); Last excursion, the 77th one, height equal to 6 and \(p\)-value=\(17\%\).
The time computation of the first \(p\)-value is larger because of the larger value of \(a\).
We can consider that from the 20th mountain we reached the stationary distribution for the beginning of excursion. We shall therefore take a lower value for \(i\) for the last excursion to evaluate the difference. Let us try \(i=20\) instead of 77.
a <- 6
i <- 20
system.time(pv4 <- proba_theoretical_ith_excursion_markov(a, theta,
lambda,
score_function,i)$proba_q_i_geq_a)
#> user system elapsed
#> 1.405 0.001 1.406
pv4
#> [1] 0.1741784
i <- 10
system.time(pv5 <- proba_theoretical_ith_excursion_markov(a, theta,
lambda,
score_function,i)$proba_q_i_geq_a)
#> user system elapsed
#> 1.368 0.000 1.368
pv5
#> [1] 0.1741784We obtain the same value as expected even for \(i=10\).
As the lecture of a protein could be done in both direction, we also consider the reverse sequence.
LongSeqScore.inv <- rev(LongSeqScore)
localScoreC(LongSeqScore.inv)
#> $localScore
#> value begin end
#> 65 93 138
#>
#> $suboptimalSegmentScores
#> value begin end
#> 1 6 4 5
#> 2 2 11 11
#> 3 3 15 15
#> 4 4 20 20
#> 5 4 30 30
#> 6 4 35 35
#> 7 2 40 40
#> 8 9 45 46
#> 9 4 51 51
#> 10 3 54 54
#> 11 2 56 56
#> 12 4 72 72
#> 13 2 77 77
#> 14 8 80 81
#> 15 3 90 90
#> 16 65 93 138
#> 17 7 243 246
#> 18 3 250 250
#> 19 4 255 256
#> 20 9 259 262
#> 21 3 272 272
#> 22 4 277 277
#> 23 7 279 281
#> 24 3 287 287
#> 25 2 295 295
#> 26 4 300 300
#> 27 8 310 311
#> 28 10 314 316
#> 29 3 338 338
#> 30 4 340 340
#> 31 4 346 346
#> 32 5 350 352
#> 33 3 354 354
#> 34 2 360 360
#> 35 4 365 365
#> 36 6 367 369
#> 37 4 379 379
#> 38 5 382 383
#> 39 3 387 387
#> 40 3 390 390
#> 41 8 392 397
#> 42 3 403 403
#> 43 7 410 414
#> 44 3 418 418
#> 45 12 426 433
#> 46 6 439 440
#> 47 12 448 450
#> 48 2 457 457
#> 49 3 461 461
#> 50 4 466 466
#> 51 11 468 472
#> 52 3 491 491
#> 53 16 493 506
#> 54 3 513 513
#> 55 2 516 516
#> 56 10 518 520
#> 57 4 528 528
#> 58 3 531 531
#> 59 3 540 540
#> 60 4 546 546
#> 61 8 549 554
#> 62 9 557 561
#> 63 9 567 571
#> 64 3 579 579
#> 65 11 584 588
#> 66 9 599 604
#> 67 25 608 678
#> 68 6 724 727
#> 69 5 732 732
#> 70 12 735 741
#> 71 3 749 749
#> 72 5 753 754
#> 73 36 757 782
#> 74 11 820 822
#> 75 8 830 834
#> 76 3 842 842
#> 77 6 846 847
#> 78 33 851 865
#> 79 7 890 891
#> 80 3 897 897
#> 81 16 899 913
#> 82 3 923 923
#> 83 7 925 929
#> 84 5 932 932
#> 85 4 936 936
#> 86 4 939 939
#> 87 23 944 964
#> 88 3 980 980
#> 89 2 989 989
#> 90 2 993 993
#> 91 23 995 1014
#> 92 14 1044 1047
#> 93 5 1054 1055
#> 94 7 1064 1065
#> 95 4 1071 1071
#> 96 40 1074 1093
#>
#> $RecordTime
#> [1] 0 1 2 3 7 9 10 12 13 14 18 19 23 24 25
#> [16] 26 27 28 29 32 33 34 41 42 43 44 49 50 53 55
#> [31] 58 59 60 61 62 63 64 65 69 70 71 74 75 76 79
#> [46] 85 86 87 88 89 92 249 251 252 253 254 267 268 269 270
#> [61] 271 274 275 276 278 285 286 291 293 294 299 303 304 305 306
#> [76] 308 309 330 331 332 333 334 335 336 339 343 344 345 349 355
#> [91] 356 357 358 359 363 364 374 375 376 377 378 381 385 386 389
#> [106] 391 401 402 405 406 407 408 409 416 417 419 420 421 422 424
#> [121] 425 438 442 443 444 445 446 455 456 459 460 463 464 465 489
#> [136] 490 492 512 515 517 527 530 533 534 535 536 537 538 543 544
#> [151] 545 548 565 566 574 576 577 582 583 598 607 717 718 719 721
#> [166] 722 723 731 734 745 746 747 748 752 819 829 840 843 844 845
#> [181] 849 850 894 896 898 920 921 922 924 931 934 935 941 942 943
#> [196] 975 977 978 979 982 983 984 985 986 987 988 991 992 994 1043
#> [211] 1052 1053 1057 1058 1059 1060 1061 1062 1063 1067 1068 1069 1070
head(sort(localScoreC(LongSeqScore.inv)$suboptimalSegmentScores[,1], decreasing = TRUE))
#> [1] 65 40 36 33 25 23
plot(1:1093,lindley(LongSeqScore.inv), type = 'l')That leads to: first excursion equal to 6; last excursion, the 96th one, equal to 38.
The corresponding \(p\)-values are:
markov_parameters <- sequences2transmatrix(LongSeqScore.inv)
lambda.inv <- markov_parameters$transition_matrix
if (FALSE) {
system.time(pv5 <- proba_theoretical_ith_excursion_markov(a = 38, theta,
lambda.inv,
score_function, i = 96
)$proba_q_i_geq_a)
} else {
pv5 <- 0.005460216
}
#> utilisateur système écoulé
#> 35.478 0.046 35.573
pv5
#> [1] 0.005460216
proba_theoretical_ith_excursion_markov(a = 6, theta, lambda.inv,
score_function,i = 1
)$proba_q_i_geq_a
#> [1] 0.1765172Last excursion, the 96th one, equal to 38, with \(p\)-value equal to \(0.55\%\). First excursion equal to 6, with \(p\)-value equal to \(17\%\).
The last excursion is still significant. The first one is still non significant.
Remark : Even with a Bonferroni correction to take into account the multiple test (here two studies excursions), the sequence possesses a significant segment. Whereas considering the highest value over all the excursions of the whole sequence, the local score value, the sequence is not significant.
Let us consider the excursion 30 as an excursion among the others. For the first way to read the protein:
a <- 65
i <- 30
if (FALSE) { # To avoid long computation on CRAN servers... (Comment if you
# want to try)
system.time(pv6 <- proba_theoretical_ith_excursion_markov(a, theta,
lambda, score_function,
i)$proba_q_i_geq_a)
} else {
pv6 <- 0.0004221151
}
#> utilisateur système écoulé
#> 104.861 0.101 105.062
pv6
#> [1] 0.0004221151There is a less than 4 in 10,000 chance of having a first mountain over 65.
Or with the second way to read the protein:
subOptSegment.inv <- localScoreC(LongSeqScore.inv)$suboptimalSegmentScores
which.max(subOptSegment.inv[,1])
#> [1] 16
print(subOptSegment.inv[16,])
#> value begin end
#> 16 65 93 138
a <- 65
i <- 16
if (FALSE) {# To avoid long computation on CRAN servers... (Comment if you
# want to try)
system.time(pv7 <- proba_theoretical_ith_excursion_markov(a, theta, lambda.inv,
score_function,i)$proba_q_i_geq_a)
} else {
pv7 <- 0.0004379929
}
#> utilisateur système écoulé
#> 103.018 0.088 103.229
pv7
#> [1] 0.0004379929As expected we obtain a similar \(p\)-value.
Although the highest mountain of height 65 in this sequence of length 1093 does not have a significant value at the \(5\%\) threshold, it does not mean that the sequence has not interesting segments. Observing a mountain exceeding 40 is significant for example, and it is the same for values 36 and 33. All these tests remain significant even if a correction of type Bonferroni is taken.