Research progress of CTG repeats in structural biology
Hu Haiming, Xue Yonglai, Feng Xizeng (The Key Laboratory of Bioactive Materials Ministry of Education, China, College of Life Science, Nankai University, Tianjin, 300071)
Abstract Myotonic dystrophy type 1 (DM) is an autosomal dominant neuromuscular disease caused by expansion of CTG repeats in the 3′-untranslated region of the myotonic dystrophy protein kinase (DMPK) gene. This paper reviews the structural features and biochemical characteristics of CTG repeats, their expansion mechanism, the factors that influence expansion, and the relationship between expansion and DM.
Keywords Myotonic dystrophy, trinucleotide repeats CTG, expansion, structure, S-DNA
Received on March 7, 2006; Major Research Program of the National Natural Science Foundation of China – Frontier Issues in Theoretical Physics and Its Interdisciplinary Sciences (90403140).
In the late 1980s, James and Wis discovered microsatellite DNA sequences, also known as short tandem repeats (STR) or simple sequence repeats (SSR), in which each repeat unit is 1-6 bp long. If these short tandem repeats expand excessively in the genome, they are very likely to affect the expression of normal genes. For example, trinucleotide repeats, mainly (CGG)n, (CCG)n, (CTG)n and (CAG)n, interfere with normal gene expression and cause a number of hereditary neurological disorders. One of these is myotonic dystrophy (DM), an autosomal dominant genetic disease. Its main clinical symptoms are myotonia (muscle stiffness) and muscle atrophy, and some patients also have cataracts, frontal baldness and mental retardation. The condition and its symptoms vary greatly among patients, and the age of onset spans a wide range.
DM is divided into three types according to clinical presentation and age of onset. The most severe form is congenital myotonic dystrophy (CDM), in which patients are born with hypotonia, respiratory distress, poor sucking and an abnormal facial appearance, show delayed motor and mental development, and have a high infant mortality rate. The mildest form is seen in middle-aged and elderly people, with cataracts and frontal baldness as the main symptoms and muscle involvement in only a few cases. The adult form is the most typical: onset is generally between 10 and 30 years of age, and the presentation is highly variable, with any of the above symptoms and a characteristic face. Because of this variability, some authors refer to the underlying change as a "dynamic mutation".
In DM patients, the CTG repeat in the 3′ non-coding region of the myotonic dystrophy protein kinase (DMPK) gene is abnormally expanded. In unaffected individuals the repeat number varies from 5 to 37 and remains stable over several generations of transmission. In DM patients, by contrast, the copy number reaches 50-2000 and can be as high as 9000. This expansion impairs the processing of DMPK primary transcripts into mature mRNA, reduces transcription of the adjacent gene SIX5/DMAHP, and also interferes with the splicing of some other RNAs, so that affected offspring show symptoms [4,5].
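The repeat-number ranges quoted above can be illustrated with a minimal Python sketch. It counts the longest uninterrupted CTG run in a DNA string and compares it with the thresholds given in the text (up to 37 repeats in unaffected individuals, 50 or more in DM patients). The function names, the toy sequence and the label for the gap between 38 and 49 repeats are assumptions made for illustration only; this is not a diagnostic procedure.

```python
import re

# Thresholds quoted in the text: 5-37 repeats in unaffected individuals,
# 50 or more in DM patients. The labels below are illustrative assumptions.
NORMAL_MAX = 37
DM_MIN = 50


def longest_ctg_run(sequence: str) -> int:
    """Return the number of units in the longest uninterrupted CTG run."""
    runs = re.findall(r"(?:CTG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_repeat_number(n: int) -> str:
    """Map a CTG repeat number onto the ranges given in the text."""
    if n <= NORMAL_MAX:
        return "not expanded"
    if n >= DM_MIN:
        return "expanded (DM range)"
    return "intermediate (not classified in the text)"


if __name__ == "__main__":
    # Hypothetical toy allele: 12 CTG units between arbitrary flanking bases.
    toy_allele = "GGAA" + "CTG" * 12 + "GGGGGATCAC"
    n = longest_ctg_run(toy_allele)
    print(n, classify_repeat_number(n))
```

The sketch only counts perfect, uninterrupted repeats; interrupted tracts would need a more tolerant pattern.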
In 1951, Mohr described linkage between the DM gene and the Lutheran blood group gene. DM was later found to be linked to the C3 gene as well; because the C3 gene lies on chromosome 19, the DM gene was also assigned to chromosome 19. Since 1992, a large number of studies have shown that the disease gene in DM patients is located at 19q13.3; it contains a conserved G-containing sequence, and the protein it encodes has an ATP-binding site. As for the mechanism by which trinucleotide repeat expansion causes genetic disease, a large body of experimental data shows that expansion of the repeat is related to its abnormal replication. Some scholars also believe that repeat expansion may be pathogenic in the following ways: (1) expansion of the tandem repeat affects the expression of the gene in which it lies; (2) expansion of the repeat affects the expression of adjacent genes; (3) the transcribed mRNA accumulates in the nucleus, where it binds and depletes important proteins needed for basic cellular activities, disrupting the biochemical function of the cell and causing a wide range of symptoms [7-10]. This paper therefore summarizes the structural characteristics, biochemical characteristics and expansion mechanism of the CTG repeat and the factors that affect its expansion.
2 Structural characteristics of CTG
2.1 Base mismatch
Biophysical and biochemical techniques have revealed that the CTG repeat can form a stable hairpin through base mismatches during DNA replication. Why this structure is so stable is not yet clear. It has also been proposed that abnormal base pairing may itself cause DNA mutation, so understanding this mispaired structure is essential.
During DNA replication, repair and recombination, single-stranded CTG repeats often undergo strand slippage, so T·T pairs form readily (Figure 2) and are retained inside the helix. Normal Watson-Crick pairing, by contrast, is A·T and G·C (Figure 1). Such abnormal base pairing is likely to cause gene mutations during transmission between generations, with harmful consequences for the offspring.
Fig. 1 Watson-Crick pairing
Fig. 2 T·T mismatch
These abnormal base pairs help the repeats form special secondary structures, and these higher-order structures probably promote expansion of the repeats.
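How T·T mismatches end up inside a folded CTG strand can be sketched in a few lines of Python. The model below is a deliberate simplification and an assumption of this example: the strand is folded symmetrically around a central loop of a chosen length, and each base on the 5′ side is paired with the base opposite it on the 3′ side. The function names are hypothetical.

```python
from collections import Counter

# Canonical Watson-Crick pairs, written as (5' base, 3' base).
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def fold_hairpin(strand: str, loop_len: int):
    """Fold a strand symmetrically around a central loop of loop_len bases.

    Returns (stem_pairs, loop). stem_pairs lists (5' base, 3' base) tuples
    from the open end of the stem towards the loop.
    """
    stem_len, remainder = divmod(len(strand) - loop_len, 2)
    if remainder or stem_len < 0:
        raise ValueError("loop length must leave an even number of stem bases")
    pairs = [(strand[i], strand[-1 - i]) for i in range(stem_len)]
    loop = strand[stem_len:stem_len + loop_len]
    return pairs, loop


def tally_pairs(pairs):
    """Count Watson-Crick pairs and each kind of mismatch in a stem."""
    counts = Counter()
    for five, three in pairs:
        if (five, three) in WATSON_CRICK:
            counts["Watson-Crick"] += 1
        else:
            counts[f"{five}·{three} mismatch"] += 1
    return counts


if __name__ == "__main__":
    # (CTG)4 folded around an assumed four-base central loop.
    pairs, loop = fold_hairpin("CTG" * 4, loop_len=4)
    print(loop, dict(tally_pairs(pairs)))
```

In this toy fold every third stem position is a T·T pair flanked by C·G and G·C pairs, which matches the qualitative picture in the text; real hairpins need not fold symmetrically.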
2.2 The structure of (CTG)n
Regarding the structure of the CTG repeat, research so far is clearest for sequences with a small number of repeats. Lai Man Chi and Sik Lok Lam used nuclear magnetic resonance (NMR) to analyze the structures of sequences with 1-10 repeats. Previous work reported that the structures formed by longer trinucleotide repeat tracts result from further folding of short repeats of similar stability. Ultraviolet melting experiments also showed that the structure folded by 30 CTG repeats differs little from that folded by 10 CTG repeats. On this basis, and using NMR, they identified three kinds of hairpin in CTG sequences with up to 10 repeats:
I: a hairpin with no hydrogen bonds in the loop, formed by (CTG)1, (CTG)2 and (CTG)3 (Figure 3).
II: a hairpin with a loop of four bases, TGCT, formed by (CTG)4, (CTG)6, (CTG)8 and (CTG)10 (Figure 4).
III: a hairpin with a loop of three bases, CTG, formed by (CTG)5, (CTG)7 and (CTG)9 (Figure 5).
The three hairpin structures are shown below:
Fig. 3 Hairpin without intra-loop hydrogen bonds
Fig. 4 Hairpin with a four-nucleotide loop
Fig. 5 Hairpin with a three-nucleotide loop
Thermodynamic experiments show that a hairpin loop is relatively stable when it contains 4-5 bases, and the TGCT loop is closed by a 5′-C·3′-G base pair, which further favors hairpin stability. In contrast, (CTG)5, (CTG)7 and (CTG)9 form a loop of only three bases that is closed by a 5′-G·3′-C base pair, an arrangement that does not favor loop stability, so it is speculated that more Watson-Crick base pairs are needed to keep this hairpin stable. They also found that, among sequences with up to 10 CTG repeats, those with an even number of repeats form hairpins more readily: (CTG)4, (CTG)6, (CTG)8 and (CTG)10 formed more hairpin than (CTG)5, (CTG)7 and (CTG)9 (Figure 6).
Fig. 6 Non-denaturing gel electrophoresis of (CTG)n at DNA concentrations of 10 mM and 1 mM (Lai Man Chi and Sik Lok Lam).
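The three hairpin classes reported by Chi and Lam for 1-10 repeats can be written as a small lookup, sketched here in Python. The function name and the returned labels are assumptions of this example. The function deliberately refuses repeat numbers above 10, because the source reports no simple rule for longer tracts.

```python
def hairpin_type(n_repeats: int) -> str:
    """Return the hairpin class reported for (CTG)n with n = 1-10."""
    if not 1 <= n_repeats <= 10:
        raise ValueError("the classification is only reported for 1-10 repeats")
    if n_repeats <= 3:
        return "I: no hydrogen bonds in the loop"
    if n_repeats % 2 == 0:
        return "II: four-base TGCT loop"
    return "III: three-base CTG loop"


if __name__ == "__main__":
    for n in range(1, 11):
        print(f"(CTG){n}: {hairpin_type(n)}")
```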
Among sequences with no more than 10 repeats, those with an even number of repeats form a TGCT loop. With more than 10 repeats, however, the odd-numbered (CTG)15 and (CTG)25 can also form a TGCT loop, whereas (CTG)16 and (CTG)20 form a CTG loop. The relationship between repeat number and structure is therefore not fixed, and different repeat numbers may form the same type of hairpin. Besides hairpins, CTG repeats can also form dimer structures, which likewise contain non-Watson-Crick base pairs.
2.3 CTG and S-DNA
Sequence-specific DNA secondary structures that contain non-Watson-Crick pairs may be an important mediator of gene mutation. Trinucleotide repeats can themselves form such an unusual structure: slipped-strand DNA. In general, there are two types of slipped-strand DNA containing these repeats, S-DNA and SI-DNA. In S-DNA, one strand contains the CTG repeat and the complementary strand contains the same number of CAG repeats, with stable Watson-Crick pairing between them. In SI-DNA, the two strands contain different numbers of CTG and CAG repeats. Studies indicate that S-DNA does not need supercoiling to be maintained and is very stable at physiological salt concentrations. S-DNA can cause errors in DNA replication, repair, recombination or transcription, resulting in gene mutations [13,14].
The relative stability of the S-DNA structure may depend on the sequence of the repeat unit and on the number of repeats. As the repeat tract lengthens, the complexity of the S-DNA isoforms increases and their stability is affected. If interrupting sequences are inserted into the repeat tract, the slipped structure forms less readily. This is probably because S-DNA and SI-DNA, unlike some other alternative structures (such as cruciforms and Z-DNA), are not maintained by supercoiling.
3 Biochemical characteristics of CTG
3.1 Expansion of CTG
The DM-associated CTG repeat generally has 3-35 copies in the normal population; it follows Mendelian inheritance, and its sequence rarely changes during transmission between generations. When the repeat number is 50-80, however, expansion occurs frequently between generations, and the number of CTG repeats increases to about 200. If the repeat number exceeds 80, instability increases markedly, jumping expansions become more frequent, and the repeat number after expansion is about 120-1250 [15]. This non-linear relationship between instability and CTG repeat number may arise because changes in the repeat are relatively small when the repeat number is below 80, whereas above 80 the repeat becomes unstable during replication, large fragments are amplified, and the probability of expansion keeps rising.
In prokaryotes, expansion is mainly incremental when the CTG repeat tract is shorter than an Okazaki fragment, and jumping expansion increases when the tract is longer than an Okazaki fragment. Partha S. Sarkar, Haw-Chin Chang et al. showed that in bacteria, over the range (CTG)120-500, the expansion rate of CTG increased nearly 12-fold. Thus, as the number of CTG repeats increases, both the extent and the probability of expansion rise sharply. The repeat number in the offspring is therefore expected to increase further, which can explain why the age of onset is earlier in the offspring of DM families. The genetic instability of the CTG repeat is mainly related to transcription: Marzena Wojciechowska et al. pointed out that variant fragments arise at a higher rate during transcription than during replication alone. They suggest that fragment excision is repair-dependent and amplified by transcription.
3.2 Regulation of CTG expansion
Expansion depends on the length of the sequence and is related to the formation of abnormal secondary structures. These structures can aggregate into complexes that cause mispairing between the template strand and the newly synthesized strand, leading to sequence expansion. In double-stranded DNA, non-B-DNA structures containing repeats readily cause strand breaks, which may be single- or double-stranded. The broken regions can be repaired by recombination protein A (RecA)-dependent homologous recombination. CTG expansion can occur only in the absence of SbcC: experiments by John C. Connelly and David R. F. Leach showed that mutation of SbcC promotes hairpin formation. SbcC and SbcD are the components of the ATP-dependent double-strand exonuclease SbcCD. SbcD exhibits a single-strand endonuclease activity that does not depend on ATP, which indicates that SbcD forms part of the catalytic center of SbcCD, while SbcC regulates SbcCD activity. When SbcC is missing, special secondary structures (such as single-stranded regions or incompletely double-stranded regions) form readily in the Okazaki fragment, and these lay the foundation for CTG expansion.
Single-strand DNA binding protein (SSB) is an important factor in DNA replication, repair and recombination because it prevents the formation of DNA secondary structure. Replication pauses temporarily at sites of DNA secondary structure, so frameshift mutations are more likely to occur there. By inhibiting secondary structure formation, SSB allows replication to proceed normally. Experiments in Escherichia coli show that SSB plays an important role in maintaining the stability of trinucleotide repeats: in E. coli carrying a temperature-sensitive ssb mutant, the (CTG)n repeat is less stable at 42°C than in the same cells cultured at 32°C or in ssb+ cells at 42°C. The hairpin formed by single-stranded DNA containing CTG repeats is very unstable, which may also increase the error rate during DNA polymerization. Hairpins can temporarily stall DNA polymerase, and these pause sites have a high mutation rate. SSB thus stabilizes trinucleotide repeats by preventing the formation of DNA secondary structures, since hairpins and other DNA secondary structures are important causes of the instability of the nucleotide repeats associated with human genetic diseases.
The stability of trinucleotide repeats in bacteria, yeast and mammalian cells also depends on their orientation relative to the origin of replication. Long repeat tracts located on the lagging strand are more prone to deletion; when the repeat is located on the leading strand, it is more prone to expansion, although the probability is low [2,6,20]. It is speculated that the abnormal secondary structures formed by these repeats, on either the lagging strand or the newly synthesized strand, disrupt DNA replication and lead to deletion or expansion, respectively. In normal humans, these repeats are interrupted by other, unrelated sequences, and in yeast such interrupting sequences also increase the stability of trinucleotide repeats, especially when they are located at the 5′ end of the (CTG)·(CAG) tract.
If formation of DNA secondary structure is the rate-limiting step in repeat expansion, then the temperature during replication can determine whether the process occurs, because temperature affects both the formation of the secondary structure and its stability. Partha S. Sarkar et al. showed that pTV-[CTG]140 and pTV-[CTG]200 underwent frequent fragment deletion at 37°C. At 25°C, by contrast, the plasmids showed a clear tendency to expand, giving discrete bands on the electrophoresis gel that ranged from [CTG]330 to [CTG]2000. Besides temperature, CTG expansion is also affected by the direction of replication. Sarkar et al. proposed that when the Okazaki fragment carries the CTG repeat and the replication temperature is relatively low (16°C-25°C), expansion predominates, whereas when the lagging strand carries the CTG repeat and the temperature is 37°C, deletion predominates.
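The two conditions reported above can be collected into a small lookup, sketched here in Python. The table layout, the location labels and the function name are assumptions of this example; combinations that the text does not describe are returned as "not reported" rather than guessed.

```python
# Conditions as summarized in the text (Sarkar et al.). The tuple layout and
# the location labels are illustrative assumptions.
REPORTED_OUTCOMES = [
    # (strand carrying the CTG repeat, min temp in C, max temp in C, outcome)
    ("okazaki_fragment", 16.0, 25.0, "expansion"),
    ("lagging_strand", 37.0, 37.0, "deletion"),
]


def dominant_outcome(ctg_location: str, temp_c: float) -> str:
    """Return the dominant outcome reported for a location and temperature."""
    for location, t_min, t_max, outcome in REPORTED_OUTCOMES:
        if ctg_location == location and t_min <= temp_c <= t_max:
            return outcome
    return "not reported"


if __name__ == "__main__":
    print(dominant_outcome("okazaki_fragment", 25.0))
    print(dominant_outcome("lagging_strand", 37.0))
    print(dominant_outcome("lagging_strand", 25.0))
```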
Sequence expansion is also affected by the number of DNA replication events: the more replication events occur, the more expansion there is. To obtain a large number of expanded sequences, a large number of cells transfected with plasmid DNA containing the repeat must therefore be cultured.
3.3 Characteristic constants of CTG
The melting temperature of (CTG)n is essentially independent of repeat length. Samir Amrane et al. measured the melting temperature of CTG oligonucleotides by ultraviolet absorption: it is 54°C for n = 6 and 58.7°C for n = 25. The thermal stability of the repeat does, however, depend on the KCl concentration. The Tm of the (CTG)8 oligonucleotide rises by 7°C when the KCl concentration increases 10-fold (50°C at 10 mM KCl and 57°C at 100 mM KCl).
3.4 Interaction between the CTG repeat and p53
K. Walter et al. found that p53 can bind selectively to specific regions of the CTG repeat, generally the stems of the hairpin. Binding protects these regions from DNase I digestion. This mode of binding is consistent with structure-specific recognition of DNA, in which the spatial structure of the DNA determines where p53 binds. It can therefore be inferred that the site-specific regions where p53 binds the (CTG)n hairpin are determined by the three-dimensional structure of the DNA. Binding of p53 to the (CTG)·(CAG) region depends on DNA conformation: p53 binds more readily to hairpins formed by CTG or CAG. At relatively low p53 concentrations, the two p53-bound regions lie close to each other near the end of the DNA; each contains three CTG repeats, and they are separated by two unbound repeats. p53 can also bind linear B-DNA formed by (CTG)·(CAG) and the mismatched dimer structure, and binding to the mismatched dimer can induce further changes in DNA topology. They also found that as the p53 concentration increased, the bound repeats became very sensitive to DNase I and the DNase I-sensitive regions changed, so p53 is likely to be a regulator of DNA topology.
4 Prospects for CTG research
Myotonic dystrophy results from abnormal expansion of a trinucleotide repeat, the CTG repeat, which represents a new mutational and genetic mechanism. At present there are few studies of CTG repeats in China, and most of them concern repeat polymorphism; reports on the structure and function of these repeats are scarce. In-depth study of the structure and function of the CTG repeat and of its expansion mechanism in vivo could provide a theoretical basis for the clinical treatment of myotonic dystrophy and inform the development of medicine and genetics. It could also provide an experimental basis for research on other genetic diseases caused by abnormal expansion of trinucleotide repeats, and thereby better protect human health.
[1] ER Moxon, C Wills. DNA microsatellites: agents of evolution. Sci. Am., 1999, 280, 94-100.
[2] M Wojciechowska, A Bacolla, JE Larson, et al. J. Biol. Chem., 2005, 280, 941-965.
[3] Zhao X P. Guowai YIxue (shenjinbinxue Shenjinwaike), 2004, 31, 134-137.
[4] CA Thornton, JP Wymer, Z Simmons, et al. Nature Genet., 1997, 16, 407-409.
[5] AV Philips, LT Timchenko, TA Cooper. Science, 1998, 280, 737-741.
[6] R Pelletier, MM Krasilnikova, GM Samadashwily, et al. Mol. Cell. Biol., 2003, 23, 1349-1357.
[7] K Ricker, T Grimm, MC Koch, et al. Neurology, 1999, 52, 170-171.
[8] J Finsterer. Eur. J. Neurol., 2002, 9, 441-447.
[9] JW Day, K Ricker, JF Jacobsen, et al. Neurology, 2003, 60, 657-664.
[10] Yu ZL, Xie HJ, Niu F, et al. Zhonghua Shenjin YIxue Zazhi, 2005, 4, 225-228.
[11] LM Chi, SL Lam. Nucleic Acids Res., 2005, 33, 1604-1617.
[12] M Tam, SE Montgomery, M Kekis, et al. J. Mol. Biol., 2003, 332, 585-600.
[13] CE Pearson, YH Wang, JD Griffith, et al. Nucleic Acids Res., 1998, 26, 816-823.
[14] CE Pearson, RR Sinden. Biochemistry, 1996, 35, 5041-5053.
[15] S Amrane, B Sacca, M Mills, et al. Nucleic Acids Res., 2005, 33, 4065-4077.
[16] C Jankowski, F Nasar, DK Nag. Proc. Natl. Acad. Sci., 2000, 97, 2134-2139.
[17] ML Hebert, LA Spitz, RD Wells. J. Mol. Biol., 2004, 336, 655-672.
[18] JC Connelly, DRF Leach. Genes to Cells, 1996, 1, 285-291.
[19] WA Rosche, A Jaworski, S Kang, et al. Journal of Bacteriology, 1996, 178, 5042-5044.
[20] PS Sarkar, HC Chang, FB Boudi, et al. Cell, 1998, 13, 531-540.
[21] K Walter, G Warnecke, R Bowater, et al. J. Biol. Chem., 2005, 10, 1074.