Atom Naming Features¶
This chapter describes how Grade2 produces the atom IDs (also known as atom names) of individual atoms in a ligand molecule.
Please note that, because of limitations in the legacy Protein Data Bank (PDB) format, Grade2 sets all atom IDs to uppercase and attempts, wherever possible, to keep them to 4 or fewer characters in length. This is because the PDB format is currently used by BUSTER and other crystallographic tools.
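The two constraints above can be illustrated with a minimal Python sketch. The function name normalise_atom_id and its max_len parameter are hypothetical and are not part of Grade2. The sketch only upper-cases an ID and reports whether it fits in 4 characters; what Grade2 itself does with longer IDs is not described here.

```python
from typing import Tuple


def normalise_atom_id(atom_id: str, max_len: int = 4) -> Tuple[str, bool]:
    """Upper-case an atom ID and report whether it fits the PDB-format limit.

    Hypothetical helper for illustration only, not Grade2 code.
    """
    upper = atom_id.strip().upper()
    return upper, len(upper) <= max_len


if __name__ == "__main__":
    for raw in ["cl24", "Ca", "h2a", "C10AB"]:
        fixed, fits = normalise_atom_id(raw)
        print(f"{raw!r} -> {fixed!r} (fits in 4 characters: {fits})")
```

An ID that does not fit is only flagged, not shortened, because any truncation rule would be a guess.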
Default Atom IDs if they are set in the input¶
Where possible Grade2, by default, will reuse atom names from the input file. For instance, all PDB chemical components have specified atom IDs and it is important to use these to ensure consistency and compatibility with existing PDB data.
Atom IDs are also set in:
All CIF restraint dictionaries.
Most (but not all) MOL2 files. MOL2 files offer a flexible method for manipulating atom IDs within a molecule. The CSD-core program Mercury provides a user-friendly interface for editing MOL2 files and adjusting atom IDs, as demonstrated in the FAQ on editing a molecule.
Some SDF files.
If atom IDs are set in the input but you want to use different atom names, then Grade2 has a number of options to set atom IDs that will override the input IDs.
Please note that all lowercase letters in atom IDs are converted to uppercase by Grade2, as programs such as BUSTER require atom IDs to be all uppercase.
Default atom IDs if not already set in the input¶
If atom IDs are not set in the input then Grade2 will, by default, base the atom IDs on the order of the atoms, unless the molecule is a typical amino acid. The first non-hydrogen atom will be assigned an atom ID composed of its element abbreviation (made upper case), followed by 1. Subsequent non-hydrogen atoms will be assigned IDs made up of their element followed by their position in the input atom list.
Using the SMILES string N(C)[C@@H](C)[C@H](O)c1ccccc1 for ephedrine as an example, Grade2 will set atom IDs as follows. The first atom in the SMILES string is a nitrogen so it is assigned atom ID N1. The second atom is a carbon and so it gets ID C2. The oxygen atom is the sixth non-hydrogen atom and so it is assigned O6.
Hydrogen atom IDs all start with H followed by the list number taken from the atom to which they are attached, and then A, B or C if there is more than one hydrogen atom attached. So in the ephedrine example above, the hydrogen atom attached to nitrogen N1 is given the ID H1. As there are three hydrogen atoms attached to C2, they are assigned IDs H2A, H2B and H2C.
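The default naming rules described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Grade2's implementation. The function default_atom_ids and the hand-written EPHEDRINE table (element and attached hydrogen count for each non-hydrogen atom, in SMILES order) are hypothetical. Continuing the suffix letters past C for more than three hydrogen atoms is also an assumption.

```python
from string import ascii_uppercase

# Non-hydrogen atoms of ephedrine in the order of the SMILES string
# N(C)[C@@H](C)[C@H](O)c1ccccc1, each with its attached hydrogen count.
# Hand-written for this illustration (C10H15NO: 12 heavy atoms, 15 H).
EPHEDRINE = [
    ("N", 1), ("C", 3), ("C", 1), ("C", 3), ("C", 1), ("O", 1),
    ("C", 0), ("C", 1), ("C", 1), ("C", 1), ("C", 1), ("C", 1),
]


def default_atom_ids(heavy_atoms):
    """Return a list of (heavy_atom_id, [hydrogen_ids]) in input order.

    Hypothetical sketch of the documented rules, not Grade2 code.
    """
    named = []
    for number, (element, n_hydrogens) in enumerate(heavy_atoms, start=1):
        heavy_id = f"{element.upper()}{number}"
        if n_hydrogens == 1:
            hydrogen_ids = [f"H{number}"]
        else:
            # Assumption: letters simply continue (D, E, ...) beyond C.
            hydrogen_ids = [
                f"H{number}{ascii_uppercase[k]}" for k in range(n_hydrogens)
            ]
        named.append((heavy_id, hydrogen_ids))
    return named


if __name__ == "__main__":
    for heavy_id, hydrogen_ids in default_atom_ids(EPHEDRINE):
        print(heavy_id, " ".join(hydrogen_ids))
```

Reordering the entries of the table changes every ID, which is the input-order dependence that the canonical ordering options are designed to remove.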
It should be noted that, as SMILES strings are not unique, different atom IDs can be assigned for the same molecule. If this is a problem then the Grade2 option --rdkit_canonical_atom_ids, discussed below, sets the IDs from a canonical atom order that is independent of the input order.
Default atom IDs for recognized amino acids¶
Typical alpha amino acids with an amino group and a single beta carbon atom¶
Grade2 will now, by default, recognize typical amino acids when supplied with an input that lacks atom IDs (also known as atom names), for instance a SMILES string. The exact requirement used is that the molecule matches the SMARTS pattern:
[$([NX3H2,NX4H3+])][CX4H]([#6])[CX3](=[OX1])[OX2H,OX1-]
The pattern specifies that the molecule must have either a neutral NH2 or an NH3+ amino group, followed by a 4-valent carbon atom with one hydrogen atom and one carbon atom attached, and then a neutral or charged carboxylic acid. A wider range of amino acids is recognized when the --aa_loose option is used (see next section).
If a typical amino acid is recognized then the PDB-standard atom IDs (N CA C O OXT CB) will be set for the main chain and beta carbon atoms and for the hydrogen atoms that they are bonded to. In addition, the ligand's atoms will be reordered so that the main chain atoms are first in the list. Currently, side chain atoms are assigned atom IDs using their numerical order (rather than PDB-style Greek letter remoteness codes CG CD CE etc).
So using 4-fluoroglutamate from the SMILES string C(C(F)C(=O)O)[C@@H](C(=O)O)N as an example, Grade2 will assign atom IDs:
It should be noted that the --antecedent option can normally be used to assign more atom IDs from the parent amino acid, as shown below for 4-fluoroglutamate.
If you prefer for the renaming not to happen, then the Grade2 command-line --no_aa_labels option turns it off, leaving standard numerical order based atom IDs.
Note that, currently, no alterations are made if the input file specifies atom IDs (for example CIF restraint dictionaries and most MOL2 files).
In addition to setting main chain atom IDs, the output restraint dictionary will have the CCP4-extension CIF item _chem_comp.group set to peptide. This enables Grade2 CIF restraint dictionaries to be used in Coot to replace protein residues with modified amino acids.
Setting atom IDs for "exotic" amino acids with the --aa_loose option¶
Following a user request, the atom naming feature has been extended to a wide range of "exotic" amino acids when the command line option --aa_loose is used. If the option is not used but atom names could be set then a warning message is produced in the terminal output, for instance:
WARNING: The molecule is an "GLY-like alpha amino acid with an amino group", so ....
WARNING: ---- could set conventional amino acid atom IDs. If you want ....
WARNING: ---- this done, then please rerun with the option: --aa_loose
WARNING:
If a molecule is recognized as an amino acid by the --aa_loose option, the output restraint dictionary will have the CCP4-extension CIF item _chem_comp.group set to peptide.
Please note that setup of restraints between an "exotic" amino acid
and adjacent monomers is dependent on the program using the restraint dictionary
and that setting atom IDs is not likely to be sufficient to ensure that
correct restraints are used.
The amino acid classes that are currently recognized by --aa_loose are detailed below. If there is any need for recognition of any other class of amino acid then please let us know.
Grade2 options to set atom IDs¶
The --antecedent_disregard_element option¶
The --antecedent_disregard_element option (that can be shortened to -ad) is similar to --antecedent except that atoms are not required to have the same element to match. Where possible atom IDs are altered so that the non-element part of matching atoms is maintained. So for example, if atom CL24 is matched to a fluorine atom it will be given the atom ID F24 (provided there is not another atom with that label).
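The renaming rule for matched atoms of different elements can be sketched as below. This is a hypothetical illustration, not Grade2's code. The function carry_over_atom_id and its arguments are invented for the example. Returning None when the new label is already taken is only a placeholder, since the fallback Grade2 uses in that case is not described here.

```python
def carry_over_atom_id(antecedent_id, old_element, new_element, used_ids):
    """Swap the element prefix of an antecedent atom ID, keeping the rest.

    Returns the new ID, or None if the antecedent ID does not start with
    its element symbol or the new ID is already in use.
    Hypothetical sketch for illustration only.
    """
    antecedent_id = antecedent_id.upper()
    old_element = old_element.upper()
    new_element = new_element.upper()
    if not antecedent_id.startswith(old_element):
        return None
    # Keep the non-element part (for example "24") and change the element.
    candidate = new_element + antecedent_id[len(old_element):]
    if candidate in used_ids:
        return None  # placeholder: the real fallback is not documented
    return candidate


if __name__ == "__main__":
    taken = {"C1", "N2", "C3"}
    print(carry_over_atom_id("CL24", "Cl", "F", taken))      # F24
    print(carry_over_atom_id("CL24", "Cl", "F", {"F24"}))    # None
```

The element of the antecedent atom is passed in explicitly because it cannot be read reliably from the ID alone: CA could be an alpha carbon or a calcium atom.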
Taking for example the cyclin-dependent kinase inhibitors SC8 and SC9, running grade2 for each in turn:
$ grade2 --PDB_ligand SC8
...
$ grade2 --PDB_ligand SC9
...
As can be seen below, the PDB component definitions of the two inhibitors SC8 and SC9 have consistent atom numbers for the central pyrazolopyrimidine ring, but the halogenophenyl and pyridine rings have distinct numbering and atom IDs.
Rerunning Grade2 for SC9 with the --antecedent_disregard_element option:
$ grade2 --PDB_ligand SC9 -ad SC8.restraints.cif -o SC9_ad_SC8
...
overrides the input atom IDs and instead sets atom IDs by matching atoms from SC8:
It can be seen that all atoms are matched to equivalents in SC8, including both the halogenophenyl and pyridine rings.
The --antecedent_disregard_element option is useful to set consistent IDs and produce aligned 2D diagrams for a series of related inhibitors.
Basing atom IDs on the RDKit canonical SMILES string with --rdkit_canonical_atom_ids¶
The default procedure for setting atom IDs used by Grade2, described above, uses the atom order of the input molecule. This means that it is common for two restraint dictionaries for a single compound to have completely different atom naming because the atom orders of the input descriptions are different. To avoid this problem the --rdkit_canonical_atom_ids option (short option -R) can be used. This uses the atom order in the RDKit canonical SMILES string as a basis for the atom IDs. As the RDKit canonical SMILES is independent of the input atom order, this will produce the same atom IDs for a single compound whatever the source.
For example, using three different SMILES strings describing ephedrine, grade2 -R will produce the same atom IDs:
Hydrogen atom IDs are based on the list number of the non-hydrogen atom to which they are attached, as described above.
Please note that --rdkit_canonical_atom_ids wipes any existing atom IDs and that atoms are reordered by the option.
Basing atom IDs on the InChI canonical atom order with --inchi_canonical_atom_ids¶
Using the RDKit canonical SMILES atom order to produce consistent atom IDs for a single molecule, with the --rdkit_canonical_atom_ids option, works well. But one problem is that canonical SMILES strings produced by different programs are not consistent and so the atom IDs are not universal.
Dashti et al. (2017) introduced the idea of using the canonical atom order found as part of calculating the International Chemical Identifier (InChI) of a molecule to produce ALATIS unique identifiers.
The --inchi_canonical_atom_ids option uses this idea and produces atom IDs from the InChI canonical atom order. For non-hydrogen atoms, the numerical part of the atom ID set by --inchi_canonical_atom_ids is the same as the ALATIS ID.
Once again using as an example three different SMILES strings describing ephedrine, grade2 --inchi_canonical_atom_ids produces:
As expected, consistent atom IDs are produced by --inchi_canonical_atom_ids regardless of the atom order in the input SMILES string. But atoms with consecutive IDs can be far apart in the molecule, for instance atom C1 is bonded to atom C8 and not adjacent to atom C2. This makes the IDs less "user-friendly" but more universal than those from --rdkit_canonical_atom_ids (that for me are more intuitive).