Atom Naming Features
This chapter describes how Grade2 produces the atom IDs (also known as atom names) of individual atoms in a ligand molecule.
Please note that, because of limitations in the legacy Protein Data Bank (PDB) format, Grade2 sets all atom IDs to uppercase and attempts wherever possible to keep them to 4 or fewer characters in length. This is because the PDB format is currently used by BUSTER and other crystallographic tools.
Default atom IDs if they are set in the input
Where possible Grade2, by default, will reuse atom names from the input file. For instance, all PDB chemical components have specified atom IDs and it is important to use these to ensure consistency and compatibility with existing PDB data.
Atom IDs are also set in:
All CIF restraint dictionaries.
Most (but not all) MOL2 files. MOL2 files offer a flexible method for manipulating atom IDs within a molecule. The CSD-core program Mercury provides a user-friendly interface for editing MOL2 files and adjusting atom IDs, as demonstrated in the FAQ on editing a molecule.
Some SDF files.
If atom IDs are set in the input but you want to use different atom names, Grade2 has a number of options for setting atom IDs that override the input IDs.
Please note that all lower case letters in atom IDs are altered to uppercase by Grade2 as programs such as BUSTER require that atom IDs are all uppercase.
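The two conventions described so far (uppercase atom IDs and a preference for 4 or fewer characters) can be illustrated with a short Python sketch. This is not Grade2 code: the helper names `normalize_atom_id` and `fits_pdb_atom_field` are hypothetical and only mirror the behaviour described in the text.

```python
# Hypothetical helpers illustrating the conventions described above;
# they are not part of Grade2.

PDB_ATOM_ID_WIDTH = 4  # the legacy PDB format atom name field is 4 characters wide


def normalize_atom_id(atom_id: str) -> str:
    """Return the atom ID with all lowercase letters made uppercase."""
    return atom_id.upper()


def fits_pdb_atom_field(atom_id: str) -> bool:
    """Report whether the atom ID fits the 4-character PDB atom name field."""
    return len(atom_id) <= PDB_ATOM_ID_WIDTH


print(normalize_atom_id("Cl24"))
print(fits_pdb_atom_field("CL24"))
print(fits_pdb_atom_field("C100A"))
```

An atom ID longer than 4 characters is not necessarily an error, but it cannot be written unchanged into the atom name field of a legacy PDB file.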
Default atom IDs if not already set in the input
If atom IDs are not set in the input then Grade2 will, by default, base the atom IDs on the order of the atoms, unless the molecule is a typical amino acid. The first non-hydrogen atom is assigned an atom ID composed of its element abbreviation (made upper case) followed by 1. Subsequent non-hydrogen atoms are assigned IDs made up of their element followed by their input list order.
Using a SMILES string as an example, Grade2 will set atom IDs:
The first atom in the SMILES string is a nitrogen so it is assigned atom ID N1. The second atom is a carbon and so it gets ID C2. The oxygen atom is the sixth non-hydrogen atom and so it is assigned O6.
Hydrogen atom IDs all start with H followed by the list number taken from the atom to which they are attached, and then a distinguishing suffix if there is more than one hydrogen atom attached. So in the ephedrine example above the hydrogen atom attached to nitrogen N1 is given the ID H1. As there are three hydrogen atoms attached to C2 they are assigned three distinct IDs that each start with H2.
It should be noted that, because SMILES strings are not unique, different atom IDs can be assigned for the same molecule. If this is a problem then the Grade2 option --rdkit_canonical_atom_ids, discussed below, sets the IDs from a canonical atom order that is independent of the input order.
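The default numbering rule can be sketched in a few lines of Python. This is an illustration of the rule as described above, not the Grade2 implementation: the function name `default_atom_ids` is hypothetical, and the letter suffixes (A, B, C, ...) used to distinguish several hydrogen atoms on one parent atom are an assumption made for this sketch.

```python
from string import ascii_uppercase

# Hypothetical sketch of the default naming rule; not Grade2 code.


def default_atom_ids(heavy_elements, hydrogen_counts):
    """Name atoms from the input order of the non-hydrogen atoms.

    heavy_elements: element symbols of the non-hydrogen atoms, in input order.
    hydrogen_counts: number of hydrogen atoms attached to each of those atoms.
    Returns a list of (heavy_atom_id, [hydrogen_atom_ids]) tuples.
    """
    result = []
    for number, (element, n_h) in enumerate(
        zip(heavy_elements, hydrogen_counts), start=1
    ):
        heavy_id = f"{element.upper()}{number}"
        if n_h == 1:
            # A single hydrogen takes just the parent atom's number.
            hydrogen_ids = [f"H{number}"]
        else:
            # ASSUMPTION: letters A, B, C, ... distinguish multiple hydrogens.
            hydrogen_ids = [f"H{number}{ascii_uppercase[k]}" for k in range(n_h)]
        result.append((heavy_id, hydrogen_ids))
    return result


# A nitrogen with one hydrogen followed by a methyl carbon.
for heavy_id, hydrogen_ids in default_atom_ids(["N", "C"], [1, 3]):
    print(heavy_id, hydrogen_ids)
```

Because the numbers come straight from the input order, two inputs that list the same atoms in a different order produce different IDs.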
Default atom IDs for recognized amino acids
Typical alpha amino acids with an amino group and a single beta carbon atom
By default, Grade2 will now recognize typical amino acids when supplied with an input that lacks atom IDs (aka atom names), for instance a SMILES string. The exact requirement used is that the molecule matches the SMARTS pattern:
The pattern specifies that the molecule must have either a neutral NH2 or an NH3+ amino group, followed by a 4-valent carbon atom with one hydrogen atom and one carbon atom attached, and then a neutral or charged carboxylic acid. A wider range of amino acids is recognized when the --aa_loose option is used (see next section).
If a typical amino acid is recognized then the PDB-standard atom IDs (N CA C O OXT CB) will be set for the main chain and beta carbon atoms and for the hydrogen atoms that they are bonded to. In addition, the ligand's atoms will be reordered so that the main chain atoms are first in the list. Currently, side chain atoms are assigned atom IDs using their numerical order (rather than PDB-style Greek letter remoteness codes such as CG, CD, CE).
So using 4-fluoroglutamate from SMILES as an example, Grade2 will assign atom IDs:
It should be noted that the --antecedent option can normally be used to assign more atom IDs from the parent amino acid, as shown below for 4-fluoroglutamate.
If you prefer that the renaming does not happen, the Grade2 command-line option --no_aa_labels turns it off, leaving the standard atom IDs based on numerical order.
Note that, currently, no alterations are made if the input file specifies atom IDs (for example CIF restraint dictionaries and most MOL2 files).
In addition to setting main chain atom IDs, Grade2 sets the CCP4-extension CIF item _chem_comp.group in the output restraint dictionary. This enables Grade2 CIF restraint dictionaries to be used in Coot to replace protein residues with modified amino acids.
Setting atom IDs for "exotic" amino acids with the --aa_loose option
Following a user request, the atom naming feature has been extended to a wide range of "exotic" amino acids when the command-line option --aa_loose is used. If the option is not used but atom names could be set, then a warning message is produced in the terminal output:
WARNING: The molecule is an "GLY-like alpha amino acid with an amino group", so ....
WARNING: ---- could set conventional amino acid atom IDs. If you want ....
WARNING: ---- this done, then please rerun with the option: --aa_loose
WARNING:
If a molecule is recognized as an amino acid by the --aa_loose option, the CCP4-extension CIF item _chem_comp.group is also set in the output restraint dictionary.
Please note that setup of restraints between an "exotic" amino acid
and adjacent monomers is dependent on the program using the restraint dictionary
and that setting atom IDs is not likely to be sufficient to ensure that
correct restraints are used.
The amino acid classes that are currently recognized by --aa_loose are detailed below. If there is any need for recognition of any other class of amino acid then please let us know.
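The SMARTS patterns listed below can be tried out independently of Grade2 using RDKit. The following is a minimal sketch, assuming RDKit is installed. It uses the "GLY-like alpha amino acid with an amino group" pattern quoted in this chapter. The function name `main_chain_ids` is hypothetical, and pairing the atom IDs N CA C O OXT with the SMARTS atoms in the order they are written is an assumption of this sketch, not a statement of how Grade2 works internally.

```python
from rdkit import Chem

# Pattern quoted in this chapter for GLY-like alpha amino acids.
GLY_LIKE_SMARTS = "[$([NX3H2,NX4H3+])][CX4][CX3](=[OX1])[OX2H,OX1-]"
# ASSUMPTION: IDs are paired with the SMARTS atoms in written order.
GLY_LIKE_IDS = ["N", "CA", "C", "O", "OXT"]


def main_chain_ids(smiles, smarts=GLY_LIKE_SMARTS, ids=GLY_LIKE_IDS):
    """Return {atom index: atom ID} for the first match, or None if no match."""
    mol = Chem.MolFromSmiles(smiles)
    pattern = Chem.MolFromSmarts(smarts)
    match = mol.GetSubstructMatch(pattern)  # empty tuple when nothing matches
    if not match:
        return None
    return dict(zip(match, ids))


# The GLY-like example SMILES from this chapter; atom indices follow SMILES order.
print(main_chain_ids("F[C@@H](C(=O)O)N"))
# Ethanol is not an amino acid, so no IDs are assigned.
print(main_chain_ids("CCO"))
```

The same function can be called with any of the other SMARTS patterns and ID lists given below, provided the ID list is written in the same order as the atoms in the pattern.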
alpha amino acid with CB and N-modification
This pattern allows modification of the nitrogen atom by a single carbon atom. The SMARTS used is: [$([NX3])]([#6])[CX4H]([#6])[CX3](=[OX1])[OX2H,OX1-]
The atom IDs N CN CA C O OXT CB will be set. Please note that for PDB chemical components there is no standard atom name for the carbon atom attached to the nitrogen, but CN is used in N-methyl-L-serine https://www.rcsb.org/ligand/5JP and seems sensible.
For an example, given the SMILES input C[C@@H](C(=O)O)NCC the following atom IDs will be set:
AIB-like alpha amino acid with an amino group
This pattern matches alpha amino acids with two C beta atoms and an unmodified amino group. The SMARTS used is: [$([NX3H2,NX4H3+])][CX4]([#6])([#6])[CX3](=[OX1])[OX2H,OX1-]
The atom IDs N CA CB1 CB2 C O OXT will be set. For an example, given the SMILES input NC(C)(CO)C(O)=O the following atom IDs will be set:
AIB-like alpha amino acid with N-modification
This pattern matches alpha amino acids with two C beta atoms and a nitrogen modified by a carbon atom. The SMARTS used is: [$([NX3])]([#6])[CX4]([#6])([#6])[CX3](=[OX1])[OX2H,OX1-]
The atom IDs N CN CA CB1 CB2 C O OXT will be set. For an example, given the SMILES input CNC(C)(CO)C(O)=O the following atom IDs will be set:
GLY-like alpha amino acid with an amino group
This pattern matches alpha amino acids that are similar to glycine in that no beta carbon atom is present and that the amino nitrogen atom is either a neutral NH2 or an NH3+. The SMARTS used is: [$([NX3H2,NX4H3+])][CX4][CX3](=[OX1])[OX2H,OX1-]
The atom IDs N CA C O OXT will be set. For an example, given the SMILES input F[C@@H](C(=O)O)N the following atom IDs will be set:
GLY-like alpha amino acid with N-modification
This pattern matches alpha amino acids that are similar to glycine but have an N-modification involving a carbon atom. The SMARTS used is: [$([NX3])]([#6])[CX4][CX3](=[OX1])[OX2H,OX1-]
The atom IDs N CN CA C O OXT will be set. For an example, given the SMILES input F[C@@H](C(=O)O)NC the following atom IDs will be set:
beta amino acid
This pattern matches beta amino acids. Please note that, unlike the previous patterns, the matching is promiscuous, allowing matches with N-modification and with modification at both of the carbon atoms between the nitrogen and the carboxyl group.
The SMARTS used is: [$([NX3])][#6][#6][CX3](=[OX1])[OX2H,OX1-]
The atom IDs N CB CA C O OXT will be set. Please note that for PDB chemical components there is no standard atom name for the extra main chain carbon atom, but CB is used in both beta-alanine https://www.rcsb.org/ligand/BAL and 62H https://www.rcsb.org/ligand/62H . For an example, given the SMILES input FCC(CN)C(=O)O the following atom IDs will be set:
Grade2 options to set atom IDs
The --antecedent_disregard_element option (that can be shortened to -ad) is similar to --antecedent except that atoms are not required to have the same element to match. Where possible, atom IDs are altered so that the non-element part of matching atoms is maintained. So, for example, if atom CL24 is matched to a fluorine atom it will be given the atom ID F24 (provided there is not another atom with that label).
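The renaming rule in the CL24 to F24 example can be written as a small Python sketch. It is an illustration only: the function name `rename_matched_atom` is hypothetical, and returning `None` when the new ID is already taken is an assumption, since the text does not say what Grade2 does in that case.

```python
# Hypothetical sketch of the "keep the non-element part" rule; not Grade2 code.


def rename_matched_atom(antecedent_id, antecedent_element, new_element, taken_ids):
    """Swap the element prefix of an antecedent atom ID for a new element.

    Returns the new atom ID, or None if the ID cannot be formed or is taken.
    """
    prefix = antecedent_element.upper()
    atom_id = antecedent_id.upper()
    if not atom_id.startswith(prefix):
        return None  # the ID does not begin with its element symbol
    non_element_part = atom_id[len(prefix):]
    candidate = new_element.upper() + non_element_part
    if candidate in taken_ids:
        # ASSUMPTION: the fallback behaviour is not described in the text.
        return None
    return candidate


print(rename_matched_atom("CL24", "Cl", "F", {"C1", "N2"}))
print(rename_matched_atom("CL24", "Cl", "F", {"F24"}))
```

The first call reproduces the example in the text; the second shows the case where another atom already carries the label F24.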
$ grade2 --PDB_ligand SC8 ...
$ grade2 --PDB_ligand SC9 ...
As can be seen below, the PDB component definitions of the two inhibitors SC8 and SC9 have consistent atom numbers for the central pyrazolopyrimidine ring, but the halogenophenyl and pyridine rings have distinct numbering and atom IDs.
Rerunning Grade2 for SC9 with the --antecedent_disregard_element (-ad) option
$ grade2 --PDB_ligand SC9 -ad SC8.restraints.cif -o SC9_ad_SC8 ...
overrides the input atom IDs and instead sets atom IDs by matching atoms from SC8:
It can be seen that all atoms are matched to their equivalents in SC8, including both the halogenophenyl and pyridine rings.
The --antecedent_disregard_element option is useful for setting consistent IDs and producing aligned 2D diagrams for a series of related inhibitors.
Basing atom IDs on the RDKit canonical SMILES string with --rdkit_canonical_atom_ids
The default procedure for setting atom IDs used by Grade2, described above, uses the atom order of the input molecule. This means that it is common for two restraint dictionaries of a single compound to have completely different atom naming because the atom orders of the input descriptions differ. To avoid this problem the --rdkit_canonical_atom_ids option (that can be shortened to -R) can be used. This uses the atom order in the RDKit canonical SMILES string as a basis for the atom IDs. As the RDKit canonical SMILES is independent of the input atom order, this will produce the same atom IDs for a single compound whatever the source.
For example, using three different SMILES strings describing the same molecule, grade2 -R will produce the same atom IDs:
Hydrogen atom IDs are based on the list number of the non-hydrogen atom to which they are attached, as described above.
Please note that
--rdkit_canonical_atom_ids wipes any existing atom IDs
and that atoms are reordered by the option.
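The idea behind canonical atom IDs can be reproduced outside Grade2 with RDKit. The sketch below, which assumes RDKit is installed, writes the canonical SMILES string, re-parses it so that the atoms are in canonical SMILES order, and then numbers the non-hydrogen atoms using the default element-plus-number scheme. The function name `canonical_heavy_atom_ids` is hypothetical, and the sketch is not claimed to reproduce the exact IDs that Grade2 writes.

```python
from rdkit import Chem

# Hypothetical sketch; not claimed to match Grade2's own output.


def canonical_heavy_atom_ids(smiles):
    """Name non-hydrogen atoms from their order in the RDKit canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    canonical_smiles = Chem.MolToSmiles(mol)  # canonical by default
    # Re-parsing the canonical string yields atoms in canonical SMILES order.
    reordered = Chem.MolFromSmiles(canonical_smiles)
    return [
        f"{atom.GetSymbol().upper()}{atom.GetIdx() + 1}"
        for atom in reordered.GetAtoms()
    ]


# Three different SMILES strings for ethanol give the same list of IDs.
for smiles in ("OCC", "CCO", "C(O)C"):
    print(smiles, canonical_heavy_atom_ids(smiles))
```

Note that the returned list follows the canonical order, so the original input order is discarded, in line with the remark above that atoms are reordered by the option.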
Basing atom IDs on the InChI canonical atom order with --inchi_canonical_atom_ids
Using the RDKit canonical SMILES atom order to produce consistent atom IDs for a single molecule, with the --rdkit_canonical_atom_ids option, works well. One problem, however, is that canonical SMILES strings produced by different programs are not consistent, and so the atom IDs are not universal.
Dashti et al. (2017) introduced the idea of using the canonical atom order, found as part of calculating the International Chemical Identifier (InChI) of a molecule, to produce ALATIS unique identifiers. The --inchi_canonical_atom_ids option uses this idea and produces atom IDs that are derived from the InChI canonical atom order.
For non-hydrogen atoms, the numerical part of the atom ID produced by --inchi_canonical_atom_ids is the same as the ALATIS ID.
Once again using as an example three different SMILES strings describing the same molecule, grade2 --inchi_canonical_atom_ids produces:
As expected, consistent atom IDs are produced by --inchi_canonical_atom_ids regardless of the atom order in the input SMILES string. But atoms with adjacent IDs can be far apart in the molecule: in this example one atom is bonded to atom C8 and is not adjacent to atom C2. This makes the --inchi_canonical_atom_ids IDs less "user-friendly" but more universal than
--rdkit_canonical_atom_ids (that for me are more