Strategy for Designing the Synthetic Gene Encoding Human papillomavirus Major Capsid L1 Protein for Heterologous Expression in Escherichia coli System

DNA is widely used to construct heterologously expressed genes. The adaptation of the codons to the host organism is necessary in order to ensure sufficient production of proteins. The GC content, codon identity and the mRNA from the translation site are also important in the design of the gene construct. This study performed a strategy for the design of synthetic gene encoding HPV52 L1 protein and several analyses at the genetic level to optimize its protein expression in the Escherichia coli BL21(DE3) host. The determination of the codon optimization was performed by collecting 75 HPV52 L1 protein sequences in the NCBI database. Furthermore, all the sequences were analyzed using multiple global alignments by Clustal Omega web server. Once the model was determined, codon optimization was performed using OPTIMIZER and the web server of the IDT codon optimization tool based on the E. coli B. The generated open reading frame (ORF) sequence was analyzed using Restriction mapper web server to choose the restriction site for facilitating the cloning stage, which is adjusted for pJExpress414 expression vector. To maximize the protein expression level, the mRNA secondary structure analysis around the ribosome binding site (rbs) was performed. A slight modification at the 5’-terminal end waa carried out in order to get more accessible rbs and increasing mRNA folding free energy. Finally, the construction of the synthetic gene was confirmed to ensure that no mutation occurs in the protein and to calculate its Codon Adaptation Index (CAI) and GC content. The above strategy, which leads to a good ORF sequence with the value of the free mRNA folding energy around rbs, is -5.5 kcal / mol, CAI = 0.787 and GC content 49.5%. This result is much better than its original gene. This result is much better compared to its native gene. Theoretically it is possible that this synthetic gene construct generates a high level protein expression in E. coli BL21 (DE3) under the regulation of the T7 promoter.


INTRODUCTION
Nowadays, synthetic DNA is widely used to construct the gene for heterologous protein expression. Adequate information from genome and metagenome sequencing projects provide virtual DNA, which can be further exploited. The modification of the native DNA sequences is mostly introduced into the synthetic gene to maximize the expression, especially in the heterologous host (Quan et al., 2011;Gaspar et al., 2012;Luo et al., 2016). Several factors should be carefully considered when designing the construct, including codon usage, codon identity, GC-content, mRNA folding free energy, especially around the ribosome binding site (RBS). These play a role in the production of high-level recombinant proteins in Escherichia coli (Rosano & Ceccarelli, 2014;Ghavim et al., 2017;Dewi & Fuad, 2020).
The final strategy for the design of synthetic gene is to mimic the characteristics of the natural highly-expressed gene of the native host that are relevant for increasing the protein expression. Many researchers have reported that synonymous codons substitution of rare codons in the introduced genes can increase the protein yield dramatically (Chaney & Clark, 2015;Wang et al., 2015;Buhr et al., 2016). Rare codons are associated with ribosome stalling leading to truncated translation (Handa et al., 2011;Guimaraes et al., 2020). In addition, synonymous codon changes can influence the protein stability and conformation (Saunders & Deane, 2010;Hunt et al., 2014;Mauro & Chappell, 2014). Therefore, codon optimization plays an important role, mainly when the heterologous protein expression was performed.
Human papillomavirus (HPV) is a nonenveloped, double-stranded DNA virus that infects a variety of hosts, with a preference for epithelial cells. High-risk HPV infection is a major cause of cervical cancer development, which was the fourth most common cancer in women worldwide (Yang et al., 2017;Arbyn et al., 2020). The HPV L1 protein is the major capsid protein with a molecular weight of 55-60 kDa that can be expressed in many expression systems and formed virus-like particles (VLP), which resemble the native HPV morphologically without the assistance of L2 minor capsid protein (Buck et al., 2013). VLPs are known to retain the native epitopes of viral particles. Their high immunogenicity can induce the production of neutralization antibodies against HPV infection (Schiller et al., 2010;Plummer & Manchester, 2011;Tumban et al., 2012). In addition, the VLPs are safe and have no potential cancerogenic to become the primary candidate for HPV vaccines because they do not contain the viral genome (Li et al., 2016;Huang et al., 2017).
There are many online software packages available for performing codon optimization for designing the synthetic gene. EuGene (Gaspar et al., 2012), Parts & pools (Marchisio, 2014), Codon Optimization OnLine (COOL) (Chin et al., 2014), etc. are the newest software in the last 10 years. The previous software also still be used for gene design. In this study, the synthetic gene encoding HPV52 L1 protein was design for the development of an E. coli-based HPV vaccine. Not all available software was used, though it can be an alternative for designing other synthetic genes. In addition, the codon optimization and mRNA secondary structure formation near the translation initiation region become the primary concern. To the best of our knowledge, this study is the only one that discusses the codon optimization strategy and the construction of the synthetic gene encoding L1 major capsid protein of HPV52 for E. coli expression system.

MATERIALS AND METHODS
Determination of the HPV52 L1 protein sequence. About 75 sequences of HPV52 L1 protein from various countries were collected from the National Centre for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/). All the sequences were then subjected to multiple alignment analysis using Clustal Omega web server (https://www.ebi.ac.uk/Tools/msa/clustalo/) (Sievers & Higgins, 2014). The HPV52 L1 protein sequence having 100% identity discovered from this analysis was selected for a template in codon optimization step.
Codon optimization for E. coli expression. The HPV52 L1 protein sequence was selected for codon optimization by using OPTIMIZER web server (http://genomes.urv.es/OPTIMIZER/) (Puigbò et al., 2007). Codon usage table of E. coli B used was used as a reference for accessing Kazusa codon usage database (http://www.kazusa.or.jp/codon/) (Nakamura et al., 2000). The "Guided random" method was selected for codon optimization. Generated codon-optimized sequence having a codon adaptation index (CAI) value near 0.8 and GCcontent near 51.06 % was chosen for further optimization.
Selection of restriction site for facilitating the cloning stage. In this study, pJExpress414 plasmid with T7 promoter was selected as an expression vector for gene cloning. RestrictionMapper web server (http://www.restrictionmapper.org/) was used to determine the restriction sites located inside the codon-optimized gene. When the desired restriction sites were discovered, they were removed using the feature available in OPTIMIZER software (Puigbò et al., 2007). After the digesting enzymes were selected, then the expression cassette of pJExpress414_HPV52L1 recombinant plasmid was arranged.
Determination of transcription start site (TSS). The first nucleotide of transcript (+1) was predicted using Neural Network Promoter Prediction (https://www.fruitfly.org/seq_tools/promoter.h Vol 8(2), December 2020 Biogenesis: Jurnal Ilmiah Biologi 227 tml). About 115 nt before and 115 nt after the RBS was subjected to this analysis. The possible promoter sequence and +1 were selected based on the highest score.
mRNA secondary structure analysis. The secondary structure formation of mRNA around the ribosome binding site was predicted using RNAstructure web server (https://rna.urmc.rochester.edu/RNAstructure Web/Servers/Predict1/Predict1.html) (Bellaousov et al., 2013). The folding free energy of mRNA secondary structure was also observed. Synonymous codon substitution near the start codon was performed to increase the MFE of mRNA secondary structure.
Confirmation of final synthetic gene construct. The final codon-optimized sequence was translated into proteins using ExPASy translate tools (https://web.expasy.org/translate/) (Gasteiger et al., 2003). Translated protein was subjected to pairwise sequence alignment analysis by Clustal Omega utilizing a template of the HPV52 L1 protein sequence as a tandem. If no mutation was confirmed, then the CAI value of the codon-optimized sequence was calculated using CAIcal (http://genomes.urv.es/CAIcal/E-CAI/) (Puigbò et al., 2008).

RESULTS AND DISCUSSION
Determination of the HPV52 L1 protein sequence. In the determination of HPV52 L1 amino acid sequence, about 75 samples were accessed from NCBI database and collected in Fasta format. Table 1 showed the accession number of HPV52 L1 protein sequence used in this experiment. To find the most promising L1 protein sequence as a template, multiple sequence alignment analysis was performed using a Clustal Omega web server. The percentage identity matrix created by Clustal2.1 demonstrated that all sequences shared >99% identity (data are not shown). However, in this study, only the sequences that shared 100% identity were selected as a template for codon optimization. One is accession number APQ44871.1. A study conducted by Wei et al. (2018) showed that the elimination of several amino acids at the N-terminal of HPV52 L1 proteins improves their solubility in E. coli expression system without affecting their immunogenicity. Therefore, the codon optimization lies in amino acids number 15-503, which is counted from the second Methionine. Fig. 1 showed the amino acid sequence with Accession Number APQ44871.1. The amino acid sequence that was subjected to codon optimization was shown in yellow. Codon optimization for E. coli expression and selection of restriction site for facilitating the cloning stage. The codon optimization was carried out with the OPTIMIZER web server. Using the "Guided Random" method, codons were selected at random with probabilities obtained from the codon usage table of E. coli B. Generated codon-optimized sequence using this software gave the highest CAI value of 0.739 and GCcontent of 49.2% (Table 2).

MVQILFYILVIFYYVAGVNVFHIFLQMSVWRPSEATVYLPPVPVSKVVSTDEYVSRTSIYYYAGSS
In this experiment, the pJExpress414 plasmid was used as an expression vector. The SalI and NcoI restriction sites were selected at 5' and 3' end to facilitate the cloning stage, respectively. However, by using the RestrictionMapper web server, it was reported that those restriction sites were inside the codon-optimized gene (data not shown). In addition, there were also repeated bases of A/T (Table 2). Therefore, both restriction sites and A/T repeated bases were removed using the feature discovered in OPTIMIZER web server.
To increase the CAI and GC-content, the DNA sequence was further optimized manually using IDT codon optimization tool (https://sg.idtdna.com/CodonOpt). Table 2 showed that the modified codon-optimized sequence has a higher CAI and GC-content, without SalI and NcoI restriction sites inside the gene.
In this study, the codon optimization of HPV52 L1 open reading frame (ORF) was carried out using OPTIMIZER web server according to E. coli B codon usage table. Furthermore, the codon usage bias in a gene towards frequent codons is often measured by Codon Adaptation Index (CAI). CAI reflected how well our optimized-codon sequences will adapt to the new host protein expression machinery. The CAI value ranges from 0 to 1 based on how close the foreign gene sequences against the reference gene sequences. However, the CAI value = 1 means "one amino acidone codon" approach has several disadvantages, including translational error, tRNA depletion, frame-shift, hard to avoid repetitive elements, and hard to introduce or eliminate the restriction sites inside the ORF. Therefore, in this experiment wa "Guided random" method was used to optimize the gene instead of the "Most frequent" method provided by Puigbò et al. (2007), Welch et al. (2009), andVillalobos et al. (2016). The "Guided random" flexibility makes it easier to remove the desired restriction site and A/T repeated bases located inside the gene (Table 2).  Determination of transcription start site (TSS) and mRNA secondary structure analysis around the ribosome binding site. The pJExpress414 plasmid was used as an expression vector. The expression cassette of pJExpress414_HPV52L1 recombinant plasmid was arranged as seen in Fig. 2. The TSS was predicted using Neural Network Promoter Prediction web server. After the first nucleotide of transcript (+1) was determined, the mRNA sequence was subsequently decided, which will be subjected to secondary structure analysis around the area of RBS.
[…TAATACGACTCACTATAGGGGAATTGTGAGCGGATAACAATTCCCCTCTAGAAATAATTTTGTT TAACTTTAGGAGGTAAAAAAAGTCGACATGCCAGTGCCGGTGTCTAAAGTTGTTTCCACGGATGAA TATGTGAGCCGTACATCGATATACTACTATGCCGGCTCGTCGCGTCTGTTGACCGTGGGTC….TAA CCATGGGTTTTTTG..] The sequence used for the analysis of the mRNA secondary structure started after codon start from +1 to seven amino acids (Table 3). The result showed that the original sequence has low mRNA folding free energy. In addition, the RBS sequence was shown in pairs with the downstream nucleotides (Fig. 3a). Therefore, the synonymous codon substitution was performed in order to increase the mRNA folding free energy and get unpaired RBS sequences. Fig. 3b showed that the mRNA folding free energy was increased from -7.3 to -5.5 kcal/mol and the RBS sequence was exposed after synonymous codon substitution was performed. Our previous experiment showed that mRNA folding free energy of -7.3 kcal/mol in the area around the RBS was enough to produce high protein expression.
Predicting the secondary structure of mRNA secondary is usually completed by finding the lowest folding free energy, which is the most likely folding ensemble structure (Reuter & Mathews, 2010). The increased folding free energy makes a more unstable mRNA secondary structure, which is preferred in E. coli expression. In addition, the open RBS makes it accessible for the ribosome to start the protein synthesis, which is expected to improve the expression of recombinant proteins in the E. coli system. Table 3. The mRNA sequences around the ribosome binding site used for mRNA used for the analysis of the secondary structure using RNAstructure web server. Transcription start (+1) is shown in a larger font, and blue color indicates the substituted codons.
Type mRNA sequences mRNA folding free energy (kcal/mol) Before synonymous codon substitution TGTTTAACTTTAGGAGGTAAAAAAAGTCGAC ATG CCA GTG CCG GTG TCT AAA GTT The expression of heterologous protein in E. coli was limited, especially at the initiation of translation. A strong mRNA downstream of the translation start site is known and can interfere with expression regardless of the general CAI or GC content (Gu et al., 2010;Gingold & Pilpel, 2011;Ghavim et al., 2017).
Our previous study confirmed that the synonymous codon substitution immediately after the initial codon can significantly improve the expression of human granulocyte colonystimulating factor (hG-CSF) recombinant proteins in E. coli system (Dewi & Fuad, 2020). Confirmation of final synthetic gene construct. The final construction of the synthetic gene needs to be confirmed to ensure that no mutations occur at the amino acid level. The ExPASy translation tool was used to convert DNA sequences into protein sequences (Fig. 4a). The sequence alignment was performed using by Clustal omega using sequence with Accession Number APQ44871.1 as a reference. Fig. 4b showed that no mutation occurred in the protein sequences except for the intentional N-terminal truncation. In the end, the final CAI was calculated using CAIcal web server, resulting in CAI = 0.787 with GCcontent 49.5%. This GC-content value is close to the required E. coli host (51.06%). The natural characteristic of the native HPV52 L1 gene is low GC-content. The CAI of native HPV52 L1 gene from Accession Number KY077853 was calculated using CAIcal web b a b a server with E. coli B codon usage table as a reference, and resulted in CAI = 0.616 with GCcontent 39.7%. Therefore, the synthetic gene construct showed good improvement of both CAI and GC-content compared to its native gene sequence.

CONCLUSION
The design of the synthetic gene encoding HPV52 L1 protein using the strategy above showed a good result with mRNA folding free energy around rbs of -5.5 kcal/mol, CAI = 0.787 and GC-content 49.5%. This result is much better compared to its native HPV52 L1 gene. Theoretically, thr obtained synthetic gene is prospective to produce high-level protein expression in E. coli BL21(DE3) under the regulation of T7 promoter.