APP下载

Dock-able linear and homodetic di,tri,tetra and pentapeptide library from canonical amino acids: SARS-CoV-2 Mpro as a case study

2023-06-26SarfrazAhmadMuhammadUsmanMirzaJohnTrant

Journal of Pharmaceutical Analysis 2023年5期

Sarfraz Ahmad ,Muhammad Usman Mirza ,*,John F.Trant ,**

a Department of Chemistry and Biochemistry,University of Windsor,Windsor N9B 3P4,Ontario,Canada

b Binary Star Research Services,LaSalle N9J 3X8,Ontario,Canada

Keywords:

Dipeptides

Tripeptides

Tetrapeptides

Pentapeptides

N-to-C-terminal cyclic peptides

Peptide library

ABSTRACT

Peptide-based therapeutics are increasingly pushing to the forefront of biomedicine with their promise of high specificity and low toxicity.Although noncanonical residues can always be used,employing only the natural 20 residues restricts the chemical space to a finite dimension allowing for comprehensive in silico screening.Towards this goal,the dataset comprising all possible di-,tri-,and tetra-peptide combinations of the canonical residues has been previously reported.However,with increasing computational power,the comprehensive set of pentapeptides is now also feasible for screening as the comprehensive set of cyclic peptides comprising four or five residues.Here,we provide both the complete and prefiltered libraries of all di-,tri-,tetra-,and penta-peptide sequences from 20 canonical amino acids and their homodetic (N-to-C-terminal) cyclic homologues.The FASTA,simplified molecular-input line-entry system (SMILES),and structure-data file (SDF)-three dimension (3D) libraries can be readily used for screening against protein targets.We also provide a simple method and tool for conducting identity-based filtering.Access to this dataset will accelerate small peptide screening workflows and encourage their use in drug discovery campaigns.As a case study,the developed library was screened against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease to identify potential small peptide inhibitors.

1.Introduction

With few exceptions,including insulin,therapeutic peptides have only recently become increasingly prevalent.Approximately 10% of the drugs approved between 2009 and 2013 were peptidederived,and most of these compounds are first-in-class drugs,e.g.,bivalirudin,desmopressin,and octreotide[1].As of this study,over 80 unique peptide drugs have been approved worldwide,while over 150 are in development [2,3].

Most drugs work by interfering,strengthening,or modifying protein-ligand interactions.This defines ligand in the broadest possible way: small molecules,ions,proteins,nucleic acids,lipids,peptides,or peptidomimetics.Over 40% of all intracellular interactions involve peptide-protein interactions,and many more involve protein-protein interactions.In both cases,peptides and peptidomimetics are ideal interfering agents [4].Cyclic peptides have emerged as a promising class of molecules with the potential to selectively target “undruggable” or “blank wall” surfaces of proteins involved in protein-protein interactions.Such targets are often challenging to access with conventional small molecules,and cyclic peptides offer an alternative approach to address this limitation.Moreover,cyclic peptides have a unique position in drug discovery as they occupy a niche between small molecules and biological therapeutics [5].

Biologics comprise a broad range of polysaccharides,proteins,nucleic acids or their conjugates,e.g.,vaccines,blood components,somatic cells,and recombinant proteins.They are usually produced in living organisms (animals,microorganisms,or even humans)and then processed ex vivo.They often offer treatment options for disorders with no available treatments[5,6].However,biologics are often complex mixtures,difficult to characterize.Furthermore,they are generally high-cost,low stability,heat sensitive,susceptible to microbial contamination,and immunogenic [7,8].Because of their limited membrane permeability,sensitivity to digestion,and firstpass metabolism,they are frequently administrated parenterally,which causes patient compliance issues [9].

In contrast,small molecules(usually <1,000 Da)offer enormous chemical diversity to interact with their targets.Inherent advantages typically associated with small molecules include low cost,high stability,good bioavailability,and easy administration [5].They often effectively bind to small binding pockets in target proteins but have limited application in blocking protein-protein interactions due to their small surface area[10].However,this small size means they can often interact with other proteins,and many exhibit off-target effects and toxicity.Their non-endogenous metabolites can also induce undesired physiological responses [11].

Small peptides lie between both classes of therapeutics.They are often better tolerated and more specific than small molecules,while being lower cost and more stable than biologics.They share the single characterizable entity benefit of small molecules but can present specificity similar to biologics [12,13].Tetra- and pentapeptides offer an optimum size to occupy catalytic or molecular recognition sites often occupied by small molecule metabolites or the termini of endogenous peptides or proteins [14].Larger peptides can imitate proteins to block surfaces.Together these opportunities have driven the rise of peptides [3,15-17].

Working with peptides has the benefit of occupying an extensive yet finite chemical space,provided that one only considers the canonical residues.This indicates that it can be thoroughly investigated either in vitro or,theoretically,in vivo.Even though these experiments are generally cost-prohibitive,a thorough preliminary in silico screening is still doable.

Recent advances in docking algorithms for the discovery of small molecules against particular targets have been astounding [18,19].Many docking algorithms are available for protein-protein and peptide-protein docking that can accommodate the conformational restrictions on the inherently flexible nature of peptides and specific amino acids.A number of benchmark studies have evaluated the accuracy of these algorithms,and they are continually getting better[20,21].Despite this progress,high throughput screens of small peptides in computer-aided drug discovery (CADD) campaigns are still not frequently performed[22],possibly due to the large number of calculations required,difficulties in handling the conformational complexities of peptides,and a more computationally expensive workflow compared to small molecules.But a primary limitation is that comprehensive machine-readable peptide libraries of dockable peptides with all possible sequences are unavailable.

To address these challenges,we have developed,and here make available,the comprehensive library of all possible di-,tri-,tetra-,and penta-peptides sequences from 20 canonical amino acids.We have also provided a library of N-to-C-terminal cyclic analogues of these peptides.To demonstrate the potential of this tool,as part of our broader effort in CADD,we provide a case study screening this tool against thetarget-du-jour,the main protease (Mpro) or 3-chymotrypsin-like protease (3CLpro) of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2),a cysteine protease [23].

Although the novel coronavirus has a unique Mpro,it is still very similar to that of the original SARS-CoV-1 and is a well-understood enzyme and a well-investigated drug target [24].This,of course,accelerated with the emergence of SARS-CoV-2 in late 2019 and early 2020,and the crystal structure of the SARS-CoV-2 Mpro was solved extremely quickly,and nearly simultaneously,by several different groups[25-28]as part of what is undoubtedly the fastest and most intense drug discovery effort in human history.Mpro cleaves polypeptides after a glutamine residue,making it a promising drug target because no human proteases share this substrate specificity,meaning off-target effects can be expected to be minimal [27,29].The active site of Mpro consists of a catalytic dyad of Cys145 and His41.Other anchoring residues important in ligand interactions in cocrystallized structures include Thr25,Thr26,Asn142,Gly143,Ser144,Met165,Glu166,and Gln189 [30].

Inhibiting SARS-CoV-2 Mpro would prevent the virus from processing the polypeptide properly and would consequently prevent capsid formation and release from the infected cell; this would halt infection [31]; however,despite very significant effort,there are no drugs in advanced clinical trials that are highly effective,and despite the efforts from many academic groups suggesting drug repurposing might prove helpful,it would be incredibly naïve to consider that all relevant extant drugs were not investigated immediately by major players within weeks and months of the emergence of the disease.Furthermore,61 independent reports evaluated by Macip and coworkers [32] do not agree or converge around a common series.Regardless,repurposed drugs have been thoroughly screened in vitro [33].Although some are initially promising[34-37],none have proven clinically effective,including those that were developed to target the main proteases of other related viruses: the active site is significantly different,even from closely related SARS-CoV-1 [38],and the entire protein is highly dynamic for a globular enzyme that makes small molecule drugs difficult to design.New chemical material is required.

The MPro enzyme is responsible for cleaving the viral polypeptide at various points along its sequence,and as such,its active site must be dynamic.While the natural ligand for MPro is a peptide,researchers have identified a peptidomimetic molecule with some activities against the enzyme.This peptidomimetic was discovered through a high-throughput screen of Ugi reaction products,a process that involves rapidly testing large numbers of chemical compounds for activity against a specific target [39].Although natural peptides are unlikely to be useful drugs themselves [40],they might prove to be useful starting points for structure-activity relationships through both incorporating unnatural residues and modifying the backbone to increase the stability of the peptide(both physiologically and to avoid cleavage by peptidase activity in the binding site)and affinity for the target[41].But for this to be viable,we require a method to identify promising peptides to be validated in vitro as potential inhibitors.This makes SARS-CoV-2 Mpro a relevant example target for our peptide screening library to be evaluated against.

2.Methods

2.1.Preparation of peptide library

Single letter codes of all 20 canonical amino acids(A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,and Y) were used to make all possible sequences of di,tri-,tetra-,and penta-peptides based on the relation 20n(n=2,3,4,and 5)using MS Excel.These peptides were either filtered for their physicochemical properties or linked via N-to-C-terminal amide linkage to get cyclic peptides.

2.1.1.Peptidefiltrationbasedonphysico-chemicalproperties

The interplay between the constituent amino acids of a peptide is responsible for its solubility,stability,and ease of synthesis;consequently,we chose seven filters that screen these parameters.Although many complex tools could be used to conduct these filtrations,the digital nature of the peptide sequences (consisting of only 20 values)makes them extremely easy to manipulate in Excel,which also lowers the barrier to use.Excel also simplifies modifying the filters we applied,allowing other users to either easily expand them should they believe we were insufficiently restrictive in our selection process or reduce them should the user believe we were overly zealous in our elimination of sequences.This can be accomplished through simple manipulation of the provided formulae.However,in the case of reducing the filters,we caution that the user will need to optimize additional peptide conformations.In all cases,we have provided examples of the process in the Supplementary data that can be applied to the full dataset.

2.1.1.1.Filter1:peptidesthatcanformgelsinanaqueoussolution.Peptides having a very high ratio (>75%) of amino acids that can form intermolecular hydrogen bonds (D,E,H,K,N,Q,R,S,T or Y)reliably result in gel formation in water,rendering them waterinsoluble.We removed these peptides by simply eliminating the entries in the Excel file.This was accomplished by copying the column containing all sequences(column A)to column B.Using the“replace”function on column B only,substitute“X”for all of D,E,H,K,N,Q,R,S,T,and Y.To remove peptides that are comprised of>75%of above stated resides,cells in column B containing >75%X were deleted.This is achieved by using replace option of Excel for the column:for a pentapeptide,put *X*X*X*X*in the “find”field,leaving the“replace”field blank.All cells containing four or five Xs(4 Xs is 80%) will become blank.Sort the entire worksheet by column B.This will put the blank cells at the top.Remove all rows that have no entry in column B.Column B can then be deleted.This removes all of these peptides from the list,and this new list can then be applied to the next filter.

2.1.1.2.Filter2:peptideswith50%ormorehydrophobicresiduesare insolubleorpartlysolubleinwater.We can employ an analogous method to remove entries with >50%hydrophobic residues(A,I,L,M,F,W,Y,and V).Again,duplicate column A to column B,convert the hydrophobic residues to X,and replace *X*X*X* with a blank for pentapeptides with blank cell.Once the empty rows are sorted and deleted,these sequences are removed.

2.1.1.3.Filter3:multiplecysteinesandmethioninesinapeptideare pronetooxidation,andmultiplecysteinescanalsoformadisulfide bridge.This filter should be run at the discretion of the user.As the peptides examined here are so small,intramolecular disulfide bridges are unlikely,and free cysteines could be more likely to form intermolecular bridges.So,any in vitro test of these peptides might provide data about complex mixtures of oligomers rather than the planned peptide.Although these peptides are synthetically accessible,the complexities they could introduce into both synthesis and screening,in our opinion and experience in peptide synthesis,merits their exclusion.The filter was applied as above by replacing any strings containing*C*C*and*M*M*with blank cells to remove entries having two or more cysteines or methionines.

2.1.1.4.Filter4:multipleprolinesorserinesinasmallpeptidecan readilycausedeletionsduringsynthesis,andmultipleprolinescan alsoexperiencecis/transisomerization.Again,this filter may or may not be appropriate for a given application.We employ it to ensure a higher hit rate for our peptides in any in vitro test.To employ this filter,replace *S*S* and *P*P* with blank cells to remove entries having two or more serine or proline residues.

2.1.1.5.Filter5:asparticacidnexttoglycine,serine,orprolinecan readilyundergohydrolyticpeptidecleavageinanacidicsolution.Again,this filter may not always be appropriate,and this complication can be evaded with sufficient care in synthesis.However,these anticipated side reactions could introduce challenges for a simple screen.To employ this filter,remove aspartic acid glycine pairs,and remove *GD* and *DG* with blank cells.Similarly,use*PD*and*DP*to remove aspartic acid and proline pairs and*SD*and *DS*to remove aspartic acid serine pairs.

2.1.1.6.Filter6:N-terminalglutaminecanformpyroglutamateinan acidicsolution.This could even be desirable under specific conditions.But it can be simpler to exclude these examples.To execute this filter,put the FASTA data in column A and put: in all cells of column B.Use formula“=B1&A1”to get:at the start of every entry(left-hand side,i.e.,N terminal).Paste the formula-generated column to column C as text to avoid the formula.Remove columns A and B.Put: Q* in find and leave replace field blank to remove sequences starting with Q.This same syntax can be used to remove any other problematic N-terminal sequence.An analogous process could be used to remove any problematic C-terminal sequence.

2.1.1.7.Filter7:protectinggrouponN-terminalasparagineare difficulttoremove.This filter was employed simply to ease synthesis.Again,this is synthetically solvable but requires specialist peptide synthesis knowledge.It may not be appropriate for all groups looking to screen and prepare peptides; if they are being made in-house by specialists,it can be ignored.If they are being ordered in bulk with limited characterization by the supplier,we suggest that it might be relevant.It was implemented using the protocol used in Filter 6,but employing: N*.

It is worth noting that these filters were only performed on pentapeptides.Filtered pentapeptides and all di-,tri-,and tetrapeptides were placed in text files in FASTA format along with their FASTA identifiers.The files were loaded in MarvinView [42],and the structures were saved as simplified molecular-input line-entry system(SMILES)[43] and structure-data file(SDF)files along with their FASTA[44]identifiers for later use in docking.The filters were employed to reduce the number of peptides needing to be screened.Of course problematic sequences can be removed from the final list by visual inspection by one skilled in the art; the balance of increased screening time with the risk of missing promising hits must be carefully considered by the user and all filters should be carefully considered.

2.1.2.Preparationofcyclicpeptides

All possible combinations of di-,tri-,tetra-,and penta-peptides were used without any prefiltration to generate head-to-tail cyclic amide-linked peptides.To generate this library,the peptides in FASTA format were placed in Excel in column A.Column B was populated with X(the FASTA notation for a general amino acid with an arbitrary sidechain).The formula“=B1&A1&B1”was applied in column C to get X at the start and end of each peptide (i.e.,XpeptideX).These peptides were copied in a.txt file,and the file opened in MarvinView.The structures were saved in SMILES format.The resulting SMILES file was opened in notepad++,and the following replacements were made:

For N terminal:([H])C(=O)[C@]([H])([*])N([H])to 9,and(C(=O)[C@]([H])([*])N([H])[H]) to 9.

For C terminal: [H]OC(=O)[C@]([H])([*])N([H])C to C9,and C(=O)N([H])[C@@]([H])([*])C(=O)O[H] to C9(=O).

The FASTA identifiers were placed next to the SMILES separated by a space.The resulting SMILES were filtered for duplicates using Open Babel [45] through the option “remove duplicates with descriptor” InChI (the IUPAC international chemical identifier).After removing the duplicate entries,the structures were exported as SMILES files.The SMILES files were loaded in MarvinView and the structures were saved as an SDF file for docking.

2.2.Virtual screening

2.2.1.Ligandpreparation

The ligands in the SDF file were imported into Maestro,and the three dimension (3D) structures of the compounds were batch prepared using the LigPrep utility of Maestro(Schrödinger Release 2020-4) [46].The possible tautomers of the compounds were produced using Epik [47] at the target pH of 7 ± 2.Only one structure was produced for each ligand.

2.2.2.Proteinpreparation

The crystal structure of SARS-CoV-2 Mpro was downloaded from the Protein Data Bank (PDB ID 6XR3,resolution 1.45 Å) [48].The Protein Preparation Wizard of the Maestro was used to optimize the protein for docking.The preparation steps used are standard: addition of hydrogens,optimization of the hydrogen bond networks,and assignment of protonation states of histidine residues,as described elsewhere [42,49].Water molecules were removed,followed by a restrained minimization step using the OPLS4 force field,which uses an implicit water model,generally useful for globular proteins [47].Maestro's Receptor Grid Generation utility was then employed to specify the docking site by generating a cubical grid box with the center at(9.11,27.14,22.64)and 23 Å length.This is the cener of the active site,so we are avoiding an examination of allosteric inhibitors or peptides that might be useful in preventing dimerization.

2.2.3.Docking

2.2.3.1.Glide.In the first step of virtual screening,the docking was performed using the Glide docking tool of Maestro with high throughput virtual screening(HTVS)mode giving one output pose for each ligand[50].Due to the sheer size of the ligand library,the top 600,000 (~30%) hits were selected for a robust screening step using Glide standard precision-peptide(SP-peptide)docking mode.The top 60,000 (10%) hits were subjected to Glide extra precision(XP) docking.

2.2.3.2.Gold.The peptide library was also screened using the Gold ChemPLP(piecewise linear potential)[51]scoring function.The top 600,000 (~30%) hits were subjected to the Gold fitness score [52]with 10% search efficiency (Goldscore 10%).The top 60,000 (10%)hits were selected for a vigorous screening using a 200% search efficiency of Gold fitness score(Goldscore 200%).

2.2.4.MolecularmechanicsgeneralizedBornsurfacearea(MMGBSA)

The top 6,000(10%)hits from Glide XP and the top 6,000(10%)hits from Gold docking were selected for the MMGBSA analysis.The estimated binding free energy of the top hits was calculated with the Maestro Prime/MMGBSA module using the VSGB solvation model and the OPLS4 force field [50].

2.2.5.Moleculardynamicssimulation

The Maestro System Builder utility was employed to prepare docked protein-ligand complexes for molecular dynamics (MD)simulation.The simple point-charge (SPC) water model was used with an orthorhombic box extending 10 Å from protein.The placement of ions was excluded within 20 Å of the ligand.Sodium chloride was added to make the concentration 0.15 M,and extra sodium or chloride ions were added to neutralize the system.The OPLS4 force field was used for simulation.Initially,the NPT ensemble was used at 300 K and 1.01325 bar,and the system was relaxed for 100 ps before simulation.The simulation was recorded for 12 ns with a 3 ps recording interval.The complexes presenting reasonable stability for 12 ns were selected for 100 ns simulation with recording interval of 4 ps.

3.Results

3.1.Peptide library production

The overall scheme to build the peptide library is displayed in Fig.1.All possible combinations of the twenty canonical amino acids provided 400 dipeptides,8,000 tripeptides,160,000 tetrapeptides,and 3.2 million pentapeptides according to the relation 20n,where n is 2,3,4,and 5 for tripeptide,tetrapeptide,and pentapeptide,respectively.Di-,tri- and tetra-peptides were not subjected to filters due to their smaller size and relative ease of synthesis compared to pentapeptides-they can be reasonably made in the solution phase,and the synthetic challenges associated with solid-phase synthesis can be evaded.Based on aqueous solubility,stability,and ease of synthesis,pentapeptides were passed through a set of seven filters,giving 1,169,013 pentapeptides.

Fig.1.Building peptide dataset.Process diagram displaying the key steps in building the peptide dataset in SMILES and SDF format after applying peptide filters.SDF: structured data files; SMILES: simplified molecular-input line-entry system.

Homodetic cyclic peptides were prepared using 400 dipeptides,8,000 tripeptides,160,000 tetrapeptides,and the 3.2 million pentapeptides.N-to-C-terminal amide cyclization of all possible sequences generates substantial number of duplicate peptides.The removal of duplicate entries yielded 210 cyclic dipeptides,2,680 cyclic tripeptides,40,110 cyclic tetrapeptides,and 640,016 cyclic pentapeptides.

Collectively,a total of 2,020,429 filtered peptides (linear and cyclic) were generated (Table 1).The datasets of linear and cyclic peptides are available in FASTA and SMILES format in separate files.These files can be readily converted into any desired format (PDB,MOL2,etc.) for molecular docking.

Table 1 Number of peptides generated during dataset construction.The three datasets are available in FASTA,simplified molecular-input line-entry system(SMILES)and structure-data file (SDF) formats.

A total of 2,020,429 peptides(di-,tri-,tetra-,and penta-peptides in their linear and cyclic forms) were used for docking against SARS-CoV-2 Mpro as a case study to implement the peptide dataset.For computational simplicity,multiple conformations of each peptide were not generated due to the large number of input compounds and the proof-of-principle nature of this work;instead,only one most suitable conformation for each peptide was selected for docking.A more rigorous effort,but considerably more computationally expensive,would want to generate 20 low lying conformations per peptide to ensure the conformational space is being sampled.

3.2.Virtual screening of peptide dataset on SARS-CoV-2 main protease

3.2.1.Glide

A quick docking screen of the library using Glide HTVS provided binding energy as low as -10.9 kcal/mol.The top 600,000 hits(~30% of the dataset,-10.9 to -5.1 kcal/mol) were subjected to Glide SP peptide docking,and the top 60,000 hits from this screen(10%,-12.1 to -5.2 kcal/mol) were docked using Glide XP mode.Following this,the top 6,000 hits (10%,-13.7 to -8.8 kcal/mol)were subjected to Prime MMGBSA calculations.The MMGBSA data of these“binders”ranged between-90.2 and-9.1 kcal/mol(Fig.2).We note that we selected the center of the docking grid at the MPro catalytic dyad.The screen is looking for orthosteric inhibitors and would not identify allosteric binders to the protein at other sites.

Fig.2.Schematic workflow for peptide screening.The schematic block diagram of the in silico workflow followed in the study to screen peptides against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease.Lighter to darker colour represents the process of a workflow.HTVS: high-throughout-virtual screening; SP: standard precision; XP: extra precision; ChemPLP: piecewise linear potential; MMGBSA: molecular mechanics generalized Born surface area.

3.2.2.Gold

An initial screen of the library was performed with the ChemPLP score in Gold.The top 600,000 hits (~30%,113.5 to 46 ChemPLP score) were docked using Goldscore with 10% efficiency.The top 60,000 hits from this assay (top 10% with 112.6 to 52.2 Goldscore 10% efficiency) were further docked using Goldscore 200% efficiency.The Goldscore 200%values of these hits ranged from 117.3 to 50.1.The top 6,000 hits from this step (top 10% with 117.3 to 89.6 Goldscore 200%efficiency)were subjected to Prime MMGBSA.The results from this MMGBSA calculation ranged between -89.3 and -12.4 kcal/mol (Figs.2 and 3).

3.2.3.Moleculardynamicssimulations

With this integrated parallel virtual screening pipeline,the top hits were selected for MD simulation after careful consideration,completed in 3 steps: 1) top 25 hits from Glide HTVS >Glide SP peptide >Glide XP >Prime MMGBSA workflow;2)top 25 hits from Gold ChemPLP > Goldscore 10% > Goldscore 200% > Prime MMGBSA workflow,and 3) top 25 common hits that displayed consensus binding poses after both workflows (Fig.3).The latter was selected extensively through visual inspection,i.e.,peptides that revealed consensus binding conformation by both docking approaches and displayed significant molecular interactions with the critical binding pocket residues,excluding peptides that showed unrealistic docking conformations.These might not be the top hits arising from either algorithm,but a major drawback of single tool-docking studies is their poor correlation to experimental data[21,32,53,54].Correlation is always improved when molecules are identified using multiple different workflows,and we consider two methods the absolute minimum acceptable.We suggest at least four mutually independent methods for any planned CADD campaign.

Fig.3.Peptide screening results from two independent workflows.(A) Top 6,000 (10%) hits obtained from Glide high throughput virtual screening (HTVS) >standard precision(SP) >extra percision (XP) score,distributed in Glide XP docking score slabs.(B) Top 6,000 (10%) hits obtained from Gold ChemPLP >Goldscore 10% efficiency >Goldscore 200%efficiency,distributed in Goldscore 200% efficiency slabs.(C) Scatter plot of Glide XP docking score vs Goldscore 200% efficiency of the 547 common top hits found from both top 6,000 hits from each study taken from corresponding top-ranked datasets.(D,E) Docked conformations of the top hits identified from workflow 1 (Glide HTVS >SPpeptide >XP >Prime MMGBSA) (D) and workflow 2 (Gold ChemPLP score >Goldscore 10% >Goldscore 200% >Prime-MMGBSA) €inside the binding pocket of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease.ChemPLP docking: piecewise linear potential; MMGBSA: molecular mechanics generalized Born surface area.

Fig.4.The binding conformation of (A) GQYWH and (B) HLFNT inside the binding pocket of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease.The binding pocket residues are labelled and displayed in stick representation,while catalytic dyad residues are colored red.

These 75 hits were subjected to 12 ns MD simulations to analyze the stability of the complex,interaction energetics,and the conservation of specific intermolecular interactions.After MDs,the following parameters were taken into consideration to select the best hits: 1) a favorable binding affinity (kcal/mol) in terms of MMGBSA calculations; 2) minimum fluctuations of peptide inside the binding pocket (RMSD <3.5 Å); 3) favorable interactions as determined from per-residue decomposition analysis,and 4) having at least one stable H-bond(highest occupancy)with one of the residues of the catalytic dyad over the 12 ns simulation.

The initial poses of all the selected 75 peptides fit pretty well in the Mpro binding pocket,predominantly interacting with the catalytic dyad.This is to be expected that only the best-fitting peptides would occupy the active site,and a strong hydrogen bond with a member of the catalytic dyad that is already primed to be nucleophilic and acidic was almost going to be a certainty.As the MD simulation progressed,some of the peptides migrated significantly and decreased their interaction with the catalytic dyad.This resulted in an unstable binding conformation;therefore,they could be excluded from further studies.The least disrupted 20 peptides were selected for further analysis and were subjected to a moderate 100 ns MD simulation.Of these,ten protein-peptide complexes displayed a stable RMSD over the simulation period,with the peptides staying in the binding pocket and interacting tightly with the protein's residues.All these peptides displayed significant interactions,including H-bond with His41 and Cys145 of the catalytic dyad(occupancy,>30%;average distance,<3 Å).We discussed the two best-binding peptides (HLFNT and GQYWH); the remainder(GHYFH,FKQH,IKRDM,CHQLN,GQRFL,AQRFS,AQQFN,and EEYCM)are discussed in the Supplementary data.We would like to highlight the diversity of these hits; it would be potentially expected that the algorithms would collapse into a series of closely related sequences with variability at only one residue.This is not the case here,and this variety maximizes the possibility of finding a hit that is amenable to further optimization by providing good coverage of the possible chemical space explored.

3.2.4.HLFNTandGQYWHashighaffinitybindersforMpro

An analyzes of HLFNT and GQYWH's interactions with Mpro was conducted on the MD simulation data(Figs.4-6).The final frame of the 12 ns production run was used as the initial frame for the extended 100 ns simulation.The RMSD of the Cα-backbone atoms of Mpro with bound peptide was examined and provided information about the overall stability of the system.The results indicated that the Mpro remained stable in the presence of peptides,and the peptides remained closely converged within 3 Å.This observation suggests that the formation of a stable complex was achieved,which was analyzed throughout the simulation duration.

In the HLFNT@Mpro complex,the RMSD values of Mpro initially(for the first ~50 ns)displayed slight fluctuation but then stabilized and shifted around 1 Å for the rest of ~50 ns,indicating a tight and conserved interaction mode of HLFNT(Fig.5A).The GQYWH@Mpro complex remained stable throughout the simulation period,and RMSD values rapidly converged (Fig.6A).In HLFNT@Mpro,the initial fluctuations of Cα-backbone atoms of Mpro were evident from the ligand-root-mean-square-fluctuations(L-RMSF)values of HLFNT (~3 Å) compared to the stable conformation of GQYWH(<3 Å)inside Mpro binding site throughout the simulation period.The presence of more aromatic residues in GQYWH added stability due to the presence of significant π-stacking interactions.Both peptides bind near the catalytic dyad.Besides their H-bond interactions with the catalytic dyad,both peptides formed H-bonds with adjacent binding site residues:Thr25,Asn119,Gly143,Glu166 and Gln189.Glu166 established a well-populated H-bond (occupancy,>90%) with backbone atoms of both peptides.In comparison,all other interactions were supported with more than 30%occupation (average distance <3 Å).The tryptophan in GQYWH established a well-populated π-π stacking interaction(occupancy,97%) with the catalytic His41.The stable GQYWH@Mpro complex formation over the simulation period was possibly due to the formation of this consistent stacking interaction.

Fig.5.Post-molecular dynamics(MD)simulation analysis of HLFNT@Mpro.(A)The root mean square deviation(RMSD)of Mpro Cα-backbone bound to HLFNT over the simulation period in ns.(B) Ligand-root mean square fluctuation (RMSF) of HLFNT for characterizing changes in the peptide atom positions during the final 45 ns of the MD simulations.(C)Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Mpro interactions with the peptide were monitored throughout the MD simulations.While interaction fraction represents the percentage of specific interaction maintained during simulation time,these are categorized into four types:hydrogen bonds,hydrophobic,ionic and water bridges.(D) A schematic plot of detailed ligand atom interactions with the protein residues; interactions that occur more than 30% during the simulation period are displayed.

Fig.6.Post-molecular dynamics(MD)simulation analysis of GQYWH@Mpro.(A)The root mean square deviation(RMSD)of Mpro Cα-backbone bound(RMSF)to GQYWH over the simulation period in ns.(B) Ligand-root mean square fluctuation (RMSF) of GQYWH for characterizing changes in the peptide atom positions during the final 50 ns of the MD simulations.(C) Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Mpro interactions with the peptide were monitored throughout the the final 50 ns of the MD simulations.While interaction fraction represents the percentage of specific interaction maintained during simulation time,these are categorized into four types:hydrogen bonds,hydrophobic,ionic and water bridges.(D) A schematic plot of detailed ligand atom interactions with the protein residues; interactions that occur more than 30% during the simulation period are displayed.

4.Discussion

A dataset of all sequences of linear di-,tri-,and tetra-peptides has been previously published by others to help perform CADD[55].This or in-house libraries have been used,and there are reports of efforts to comprehensively screen all possible di- or tripeptides against a target protein[56,57];however,to the best of our knowledge,no machine-readable database is available with all possible pentapeptides or the N-to-C-terminal cyclic tri-,tetra-,and penta-peptides.This manuscript is accompanied by the publication of this library for use by others.We also provide a simple method for conducting prefiltration on these peptides.These libraries are provided in FASTA and SMILES formats that can be easily converted to dock-able formats,e.g.,PDB,and MOL2.We also provide the SDF file with the prefiltered library we used in this study.

We chose to prefilter the dataset.It would not take significantly more computational time to screen all possible pentapeptides,but it is important to have a method for arbitrarily reducing the number of simulations to be run according to a research team's whims.The filtrations we conducted here are reasonable to eliminate problematic sequences,but it would also be reasonable for a researcher to not use these filters on a small dataset.The filters were selected in this case so that all peptides arising from the screen would be trivial to synthesize using automated solid phase peptide synthesis[58].The method of filtration and this filtered data set are provided for ease of use by others should they so desire.

Most small peptides (di-,tri-,tetra-,and penta-peptides) have suitable physicochemical properties and are easy to synthesize.The natural zwitterion assists with solubility.However,this charge density decreases as the peptide grows in length so some would be sparingly soluble,while other sequences can prove unstable.

The chosen filters fall into three categories: solubility,stability,and ease of synthesis.In terms of solubility,we filtered out sequences liable for gel formation (>75% D,E,H,K,N,Q,R,S,T or Y,Filter 1).These would not likely be soluble as distinct species and would likely give false positives on many binding assays.We also removed sequences with a preponderance of aliphatic hydrophobic residues (Filter 2),and again the concern is the solubility of these peptides providing false positives in assays due to precipitation and non-specific binding[59].We then removed a series of notoriously unstable sequences.Cysteine and methionine are liable to rapid oxidation and cleavage of protecting groups during synthesis,making subsequent purification complicated; peptides containing multiple of either sulfur-containing residues were removed (Filter 3).The final category is synthetically challenging sequences that might have reduced purity coming off a bead.A typical pentapeptide needs only a simple precipitation followed by solid phase extraction to render a sufficiently pure material for evaluation.Some combinations of residues are known to be problematic and could interfere with this workflow.Short peptides containing two or more serines or prolines are more likely to lead to deletions during synthesis and were removed (Filter 4).Similarly,aspartic acid residues adjacent to glycine,serine,or proline are susceptible to hydrolysis during acidic cleavage conditions,typically used in solid-phase peptide synthesis (Filter 5).Finally,N-terminal glutamines readily cyclize to pyroglutamate and N-terminal asparagines are irritating to deprotect without decomposition.Peptides with either motif were removed (Filters 6 and 7).As this is a virtual study,these filters could have been omitted,but were included to show the ease of use of the technique and the library manipulation.Even then,we didn't prefilter the tetra-,tri-,or di-peptides in any way(Fig.2).A limitation of employing filters,like those discussed in this article,is that they will eliminate some of promising sequences.We emphasize the employment of the filters on case-to-case bases,e.g.,Filter 5,removing sequences where D is adjacent to P,G,or S as aspartic acid-containing peptides can suffer hydrolysis resulting in peptide cleavage under acidic conditions with these sequences in place.This suggests that these combinations should be avoided when designing orally administrable peptides.However,RGD is also the key triplet for driving cell adhesion and is extremely common in many cell surface peptides and proteins [60].Depending on the desired application,the user should use filters with care.We provide this template for the end users to employ or ignore as they see fit for their given biomedical challenge.

To understand the challenges associated with screening these peptides in a CADD workflow,we selected SARS-CoV-2 Mpro as a potential target due to its compact and flexible catalytic site [61].Precise binding affinity and peptide conformational prediction of the protein-peptide complex is challenging[53,62].We established an integrated peptide virtual screening pipeline to overcome this by incorporating two state-of-the-art molecular docking suits(Fig.2).The benchmark studies have reported that the Goldscore is the most efficient for small peptide docking among different scoring functions available in the Gold docking suite[21].Goldscore can predict near-native conformations of top-scoring peptides(having more than 9 residues) with an accuracy higher than 30%[21].The Glide SP-peptide has been proven to be 53% accurate in predicting peptide poses.The accuracy could be improved to 58%when SP-peptide is preceded by Prime MMGBSA [63,64].

The binding pocket of SARS-CoV-2 Mpro is not deep,and most of it is solvent-exposed.The binding pocket of SARS-CoV-2 Mpro in two apo structures,7ALHand6M2Q,interact with 12 and 10 water molecules,respectively[65,66].The part of the inhibitors sitting in this solvent-accessible region could interact with the solvent,thereby reducing the strength of the protein-ligand interaction.Under these conditions,most docking protocols do not account for the role of water molecules,and docking accuracy can be very poor in highly solvent-accessible binding pockets.The MMGBSA and,especially the MD simulations become very important due to the involvement of water molecules.

Among the top ten selected hits,one linear tetrapeptide and nine linear pentapeptides were identified.No cyclic peptide was identified in the selected top hits because the binding site of SARSCoV-2 Mpro has a tight deep gorge that could not accommodate cyclic peptides.Neither of our top hits were previously identified in the literature nor evaluated against Mpro.Some custom small peptides (1-8) analogous to our selected top hits (GQYWH and HLFNT) tested as SARS-CoV-1 Mpro inhibitors are presented in Fig.7.The peptides 1,2,3,and 4 have been tested to inhibit Mpro at 100 μM with percentage inhibition values of 72%,23%,<10%,and<10%,respectively[67].The peptide 5 is a potent inhibitor of SARSCoV-1 Mpro withKiof 58 nM [68].Peptides 6,7,and 8 have been cocrystallized with Mpro of SARS-CoV-1.TheKivalue of 6 was 10.7 μM[69],and the IC50of 7 was 80 μM[70].TheKinact/Kivalue of 8 against Mpro of SARS-CoV was 1900 M-1s-1[71].

Fig.7.Structures of small peptidomimetics reported as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Mpro inhibitors.Structures of two top hits (GQYWH and HLFNT) identified in this study.The hydroxyl group of threonine (T) in entries 1,3,4,and 5 is in the ether form with benzyl or tert-butyl groups.PDB: Protein Data Bank.

Peptide 8 has a strong structural analogy with our identified hit,HLFNT.The N-terminal histidine of HLFNT is replaced with benzyl hydrogen carbonate in 8 and the following two amino acids,leucine and phenylalanine,are identical in both peptides.The fourth asparagine residue of HLFNT is replaced with glutamine type of residue having N in place of α-carbon in 8,having only one extra methylene residue.The C-terminal threonine of HLFNT is substituted with ethyl ester of 2-hydroxypropanoic acid in 8,both having hydroxyl groups.This close analogy validates the use of the prepared library and authenticates our in silico screening workflow.

5.Conclusion

Among the marketed therapeutics used today,peptides have proved a valuable niche in the drug development spectrum substituting small molecules and large biological therapeutics.The study provides a library of small linear peptides and their N-to-Cterminal cyclized analogues using all possible sequences of 20 canonical amino acids.We have also devised a comprehensive strategy to handle such large peptide datasets utilizing an appropriate array of tools to extract desirable outcomes.The given library could be screened against any desired target protein to identify small peptide hits.These hits can be further refined,e.g.,through the attachment of amino acids on the C- or N-terminal to extend the length,substitution with noncanonical amino acids,or coupling additional organic moieties to the termini to further improve activity.We hope and believe that this library of high-quality,curated peptides provides a valuable resource for any researcher looking to find new peptide ligands and therapeutic agents against a variety of targets.

CRediT author statement

Sarfraz Ahmad:Conceptualization,Methodology,Validation,Formal analysis,Investigation,Data curation,Writing - Original draft preparation,Reviewing and Editing,Visualization;Muhammad Usman Mirza:Methodology,Validation,Formal analysis,Investigation,Data curation,Writing - Original draft preparation,Reviewing and Editing,Visualization;John F.Trant:Conceptualization,Funding acquisition,Resources,Writing - Reviewing and Editing,Supervision.

Declaration of competing interest

The authors declare that there are no conflicts of interest.

Acknowledgments

The authors highly acknowledge the Natural Sciences and Engineering Research Council of Canada (Grant No.: 2018-06338) for providing the resources to conduct this work.Sarfraz Ahmad,Muhammad Usman Mirza and John F.Trant wish to recognize that this work was made possible by the facilities of the Shared Hierarchical Academic Research Computing Network (SHARCNET: www.sharcnet.ca) and Compute/Calculation Canada,now the Digital Research Alliance of Canada (https://alliancecan.ca/en).The authors are also affiliated with Binary Star Research Services (https://binarystarchem.ca/).The content of this article is exclusively the responsibility of the authors and does not necessarily reflect the official views of Binary Star Research Services.Binary Star Research Services had no input into the conclusions of this article.

Appendix A.Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.jpha.2023.04.008.