.. highlight:: none

.. _atom_naming:

********************
Atom Naming Features
********************

This chapter describes how Grade2 produces the atom IDs (also known as atom names)
of individual atoms in a ligand molecule.

Please note that, because of limitations in the legacy
`Protein Data Bank (PDB) Format `_, Grade2 sets all atom IDs to be uppercase
and attempts wherever possible to keep them to 4 or fewer characters in length.
This is because the PDB format is currently used by BUSTER and other
crystallographic tools.

Default Atom IDs if they are set in the input
=============================================

Where possible, Grade2 will by default reuse atom names from the input file.
For instance, all :ref:`PDB chemical components ` have specified atom IDs and
it is important to use these to ensure consistency and compatibility with
existing PDB data. Atom IDs are also set in:

* All CIF restraint dictionaries.
* Most (but not all) MOL2 files. MOL2 files offer a flexible method for
  manipulating atom IDs within a molecule. The CSD-core program `Mercury `_
  provides a user-friendly interface for editing MOL2 files and adjusting
  atom IDs, as demonstrated in the :ref:`FAQ on editing a molecule`.
* Some SDF files.

If atom IDs are set in the input but you want to use different atom names, then
Grade2 has a number of :ref:`options to set atom IDs ` that will override the
input IDs.

Please note that all lower case letters in atom IDs are altered to uppercase
by Grade2, as programs such as BUSTER require that atom IDs are all uppercase.

Default atom IDs if not already set in the input
=================================================

If atom IDs are not set in the input then Grade2 will, by default, base the
atom IDs on the order of the atoms, unless the molecule is a
:ref:`typical amino acid `. The first non-hydrogen atom will be assigned an
atom ID composed of its element abbreviation (in upper case), followed by ``1``.
Subsequent non-hydrogen atoms will be assigned IDs made up of their element
followed by their position in the input atom list. Using the SMILES string
``N(C)[C@@H](C)[C@H](O)c1ccccc1`` for `ephedrine `_ as an example, Grade2 will
set atom IDs:

|ephedine_smiles|

The first atom in the SMILES string is a nitrogen so it is assigned atom ID
``N1``. The second atom is a carbon and so it gets ID ``C2``. The oxygen atom is
the sixth non-hydrogen atom and so it is assigned ``O6``.

.. _hydrogen_naming:

Hydrogen atom IDs all start with ``H`` followed by the list number taken from
the atom to which they are attached and then ``A``, ``B`` or ``C`` if there is
more than one hydrogen atom attached. So in the ephedrine example above the
hydrogen atom attached to nitrogen ``N1`` is given the ID ``H1``. As there are
three hydrogen atoms attached to ``C2`` they are assigned the IDs ``H2A``,
``H2B`` and ``H2C``.

It should be noted that, as SMILES strings are not unique, different atom IDs
can be assigned for the same molecule. If this is a problem then the Grade2
option :ref:`--rdkit_canonical_atom_ids ` discussed below sets the IDs from a
canonical atom order that is independent of the input order.

.. _aa_atom_naming:

Default atom IDs for recognized amino acids
-------------------------------------------

.. _typical_aa:

Typical alpha amino acids with an amino group and a single beta carbon atom
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Grade2 will now, by default, recognize typical amino acids when supplied with
an input that lacks atom IDs (aka atom names), for instance a SMILES string.
The exact requirement used is that the molecule matches the `SMARTS`_ pattern:

::

    [$([NX3H2,NX4H3+])][CX4H]([#6])[CX3](=[OX1])[OX2H,OX1-]

The pattern specifies that the molecule must have either a neutral
NH\ :sub:`2` or an NH\ :sub:`3`\+ amino group, followed by a 4-valent carbon
atom with one hydrogen atom and one carbon atom attached, and then a neutral or
charged carboxylic acid. A wider range of amino acids is recognized when the
``--aa_loose`` option is used (see next section).

If a typical amino acid is recognized then the PDB-standard atom IDs
(``N CA C O OXT CB``) will be set for the main chain and beta carbon atoms and
for the hydrogen atoms that they are bonded to. In addition, the ligand's atoms
will be reordered so that the main chain atoms are first in the list. Currently,
side chain atoms are assigned atom IDs using their numerical order (rather than
PDB-style Greek letter remoteness codes ``CG CD CE`` etc). So using
4-fluoroglutamate from SMILES ``C(C(F)C(=O)O)[C@@H](C(=O)O)N`` as an example,
Grade2 will assign atom IDs:

|fluoroglutamate|

*It should be noted that the* ``--antecedent`` *option can normally be used to
assign more atom IDs from the parent amino acid,* :ref:`as shown below `
*for 4-fluoroglutamate.*

If you prefer the renaming not to happen, then the Grade2 command-line
:ref:`--no_aa_labels ` option turns it off, leaving the standard atom IDs based
on numerical order. Note that, currently, no alterations are made if the input
file specifies atom IDs (for example CIF restraint dictionaries and most MOL2
files).

In addition to setting main chain atom IDs, the output restraint dictionary
will have the CCP4-extension CIF item ``_chem_comp.group`` set to ``peptide``.
This enables Grade2 CIF restraint dictionaries to be used in Coot to replace
protein residues with modified amino acids.

.. _`SMARTS`: https://en.wikipedia.org/wiki/SMILES_arbitrary_target_specification

..
_exotic_aa:

Setting atom IDs for "exotic" amino acids with the ``--aa_loose`` option
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Following a user request, the atom naming feature is extended to a wide range
of "exotic" amino acids when the command-line option ``--aa_loose`` is used. If
the option is not used but atom names could be set then a warning message is
produced in the terminal output, for instance:

::

    WARNING: The molecule is an "GLY-like alpha amino acid with an amino group", so ....
    WARNING: ---- could set conventional amino acid atom IDs. If you want ....
    WARNING: ---- this done, then please rerun with the option: --aa_loose
    WARNING:

If a molecule is recognized as an amino acid by the ``--aa_loose`` option, the
output restraint dictionary will have the CCP4-extension CIF item
``_chem_comp.group`` set to ``peptide``.

Please note that the setup of restraints between an "exotic" amino acid and
adjacent monomers depends on the program using the restraint dictionary, and
that setting atom IDs is not likely to be sufficient to ensure that correct
restraints are used.

The amino acid classes that are currently recognized by ``--aa_loose`` are
detailed below. If you need any other class of amino acid to be recognized
then please let us know.

.. collapse:: Click to expand/hide section on amino acids recognized by --aa_loose

    *alpha amino acid with CB and N-modification*

    This pattern allows modification of the nitrogen atom by a single carbon
    atom. The SMARTS used is:

    ::

        [$([NX3])]([#6])[CX4H]([#6])[CX3](=[OX1])[OX2H,OX1-]

    Atom IDs ``N CN CA C O OXT CB`` will be set. Please note that for PDB
    chemical components there is no standard atom name for the carbon atom
    attached to the nitrogen, but ``CN`` is used in N-methyl-L-serine
    https://www.rcsb.org/ligand/5JP and seems sensible.
    For example, given the SMILES input ``C[C@@H](C(=O)O)NCC`` the following
    atom IDs will be set:

    |nethylalanine|

    *AIB-like alpha amino acid with an amino group*

    This pattern matches alpha amino acids with two C beta atoms and an
    unmodified amino group. The SMARTS used is:

    ::

        [$([NX3H2,NX4H3+])][CX4]([#6])([#6])[CX3](=[OX1])[OX2H,OX1-]

    Atom IDs ``N CA CB1 CB2 C O OXT`` will be set. For example, given the
    SMILES input ``NC(C)(CO)C(O)=O`` the following atom IDs will be set:

    |alpha_methyl_serine|

    *AIB-like alpha amino acid with N-modification*

    This pattern matches alpha amino acids with two C beta atoms and a nitrogen
    modified by a carbon atom. The SMARTS used is:

    ::

        [$([NX3])]([#6])[CX4]([#6])([#6])[CX3](=[OX1])[OX2H,OX1-]

    Atom IDs ``N CN CA CB1 CB2 C O OXT`` will be set. For example, given the
    SMILES input ``CNC(C)(CO)C(O)=O`` the following atom IDs will be set:

    |n_methyl_alpha_methyl_serine|

    *GLY-like alpha amino acid with an amino group*

    This pattern matches alpha amino acids that are similar to glycine in that
    no beta carbon atom is present and that the amino nitrogen atom is either a
    neutral NH\ :sub:`2` or an NH\ :sub:`3`\+. The SMARTS used is:

    ::

        [$([NX3H2,NX4H3+])][CX4][CX3](=[OX1])[OX2H,OX1-]

    Atom IDs ``N CA C O OXT`` will be set. For example, given the SMILES input
    ``F[C@@H](C(=O)O)N`` the following atom IDs will be set:

    |fluoroglycine|

    *GLY-like alpha amino acid with N-modification*

    This pattern matches alpha amino acids that are similar to glycine but have
    an N-modification involving a carbon atom. The SMARTS used is:

    ::

        [$([NX3])]([#6])[CX4][CX3](=[OX1])[OX2H,OX1-]

    Atom IDs ``N CN CA C O OXT`` will be set. For example, given the SMILES
    input ``F[C@@H](C(=O)O)NC`` the following atom IDs will be set:

    |n_methyl_fluoroglycine|

    *beta amino acid*

    This pattern matches beta amino acids. Please note that, unlike the
    previous patterns, the matching is promiscuous, allowing matches with
    N-modification and modification at both the ``CA`` and ``CB`` atoms.
    The SMARTS used is:

    ::

        [$([NX3])][#6][#6][CX3](=[OX1])[OX2H,OX1-]

    Atom IDs ``N CB CA C O OXT`` will be set. Please note that for PDB chemical
    components there is no standard atom name for the extra main chain carbon
    atom, but ``CB`` is used in both beta-alanine
    https://www.rcsb.org/ligand/BAL and 62H https://www.rcsb.org/ligand/62H .
    For example, given the SMILES input ``FCC(CN)C(=O)O`` the following atom
    IDs will be set:

    |beta_fluoromethylalanine|

.. |ephedine_smiles| image:: images/ephedrine_example.xyz.mol2_screenshot.png
   :width: 600
   :alt: Grade2 atom labels for ephedrine ``N(C)[C@@H](C)[C@H](O)c1ccccc1``

.. |fluoroglutamate| image:: images/fluoroglutamate.diagram.atom_labels.svg.png
   :width: 300
   :alt: Grade2 atom labels for fluoroglutamate C(C(F)C(=O)O)[C@@H](C(=O)O)N

.. |nethylalanine| image:: images/n-ethyl-alanine.diagram.atom_labels.png
   :width: 300
   :alt: Grade2 atom labels for n-ethyl-alanine C[C@@H](C(=O)O)NCC

.. |alpha_methyl_serine| image:: images/alpha_methyl_serine.diagram.atom_labels.png
   :width: 300
   :alt: Grade2 atom labels for alpha_methyl_serine NC(C)(CO)C(O)=O

.. |n_methyl_alpha_methyl_serine| image:: images/n_methyl_alpha_methyl_serine..diagram.atom_labels.png
   :width: 300
   :alt: Grade2 atom labels for n_methyl_alpha_methyl_serine CNC(C)(CO)C(O)=O

.. |fluoroglycine| image:: images/fluoroglycine.diagram.atom_labels.png
   :width: 300
   :alt: Grade2 atom labels for fluoroglycine F[C@@H](C(=O)O)N

.. |n_methyl_fluoroglycine| image:: images/n_methyl_fluoroglycine.diagram.atom_labels.png
   :width: 300
   :alt: Grade2 atom labels for n_methyl_fluoroglycine F[C@@H](C(=O)O)NC

.. |beta_fluoromethylalanine| image:: images/beta_fluoromethylalanine.diagram.atom_labels.png
   :width: 300
   :alt: Grade2 atom labels for beta_fluoromethylalanine FCC(CN)C(=O)O

-------

.. _grade2_options_to_set_atom_ids:

Grade2 options to set atom IDs
==============================

..
_antecedent_explained:

Basing atom IDs on those from a related molecule with ``--antecedent``
----------------------------------------------------------------------

When dealing with a molecule that is a derivative of another it is often
helpful for the atom IDs of the two molecules to be consistent. The Grade2
option ``--antecedent RELATED_RESTRAINTS_CIF`` allows this. The short version
of the option is ``-a RELATED_RESTRAINTS_CIF``. A filename
``RELATED_RESTRAINTS_CIF`` for a CIF restraint dictionary of a related molecule
must be provided. It is best if the restraint dictionary is produced by Grade2
itself.

``--antecedent`` uses the `RDKit maximum common substructure (MCS) routines `_
to compare ``RELATED_RESTRAINTS_CIF`` with the input molecule. Bond orders are
not required to match but rings only match other complete rings.

Please note that if the input already has atom IDs these will be wiped and
disregarded if either the ``--antecedent`` or the
``--antecedent_disregard_element`` option is used.

Atoms that are matched in the MCS are assigned the same atom ID. For atoms that
are not matched, IDs are assigned by first finding the largest number within
any atom IDs of non-hydrogen atoms in the antecedent molecule. For instance, in
the `N-acetyldopamine `_ example below the largest number within the atom IDs
from the antecedent dopamine, PDB component `LDP `_, is ``8`` (from ``C8``).
Non-hydrogen atoms that do not match are then assigned atom IDs that follow on
from this, so in the example below the acetyl group atoms are given atom IDs
``C9``, ``C10`` and ``O11``. Hydrogen atoms that are not matched are assigned
atom IDs based on the atom to which they are attached. So in the example below
the hydrogen atoms in the methyl group are assigned to be ``H9A``, ``H9B`` and
``H9C``.

::

    $ grade2 --PDB_ligand LDP
    ...
    $ grade2 'CC(=O)NCCC1=CC(=C(C=C1)O)O' -a LDP.restraints.cif -o N-acetyldopamine
    ...

As well as matching the atom IDs for the two molecules, the 2D coordinates and
diagram will also be aligned as shown here:

.. image:: images/n-acetyldopamine_comparison.png
   :alt: N-acetyldopamine using dopamine as an antecedent

Unfortunately, there is already a PDB component definition for
N-acetyldopamine, `7DP `_, that uses inconsistent atom labels, but in future
this option could be used to avoid similar incompatibilities.

.. _antecedent_fluoroglutamate:

The ``--antecedent`` option can also be used for modified amino acids. Taking
for example 4-fluoroglutamate from SMILES ``C(C(F)C(=O)O)[C@@H](C(=O)O)N``,
this involves first producing a Grade2 restraint dictionary for glutamate
``GLU`` and then using it with the ``--antecedent`` option.

::

    $ grade2 --PDB_ligand GLU
    ...
    $ grade2 'C(C(F)C(=O)O)[C@@H](C(=O)O)N' -a GLU.restraints.cif -o 4-fluoroglutamate
    ...

This results in the atom IDs being taken from ``GLU`` except for the extra
fluorine atom, which is labelled ``F3``:

.. image:: images/fluoroglutamate_comparison.png
   :alt: fluoroglutamate using GLU as an antecedent

Once again the 2D coordinates are carried over so that the SVG diagrams are
aligned.

.. _antecedent_disregard_element_explained:

The ``--antecedent_disregard_element`` option
---------------------------------------------

The ``--antecedent_disregard_element`` option (which can be shortened to
``-ad``) is similar to :ref:`--antecedent ` except that atoms are not required
to have the same element to match. Where possible, atom IDs are altered so that
the non-element part of matching atoms is maintained. So for example, if atom
``CL24`` is matched to a fluorine atom it will be given the atom ID ``F24``
(provided there is not another atom with that label).

Taking for example the cyclin-dependent kinase inhibitors `SC8 `_ and
`SC9 `_, running Grade2 for each in turn:

::

    $ grade2 --PDB_ligand SC8
    ...
    $ grade2 --PDB_ligand SC9
    ...
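
The renaming rule described above for ``--antecedent_disregard_element``, where
the non-element part of a matched atom ID is kept and only the element prefix
changes, can be sketched in Python. This is only an illustration of the rule as
stated in the text and not Grade2's own code: the function name
``rename_matched_atom`` and its arguments are hypothetical, and returning
``None`` on a clash is an assumption, since the text does not say what happens
when the label is already taken.

```python
def rename_matched_atom(antecedent_id, antecedent_element, new_element, taken_ids):
    """Hypothetical helper: swap the element prefix of a matched atom ID."""
    prefix = antecedent_element.upper()
    if not antecedent_id.upper().startswith(prefix):
        return None  # the ID does not begin with the element symbol
    suffix = antecedent_id[len(prefix):]        # "CL24" -> "24"
    candidate = new_element.upper() + suffix    # "F" + "24" -> "F24"
    if candidate in taken_ids:
        return None  # assumption: signal a clash to the caller
    return candidate


print(rename_matched_atom("CL24", "Cl", "F", set()))    # F24
print(rename_matched_atom("CL24", "Cl", "F", {"F24"}))  # None
```
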

As can be seen below, the PDB component definitions of the two inhibitors
`SC8 `_ and `SC9 `_ have consistent atom numbers for the central
pyrazolopyrimidine ring, but the halogenophenyl and pyridine rings have
distinct numbering and atom IDs.

.. image:: images/sc8_sc9.png
   :alt: comparing the atom names of PDB components SC8 and SC9

Rerunning Grade2 for SC9 with the ``--antecedent_disregard_element`` option:

::

    $ grade2 --PDB_ligand SC9 -ad SC8.restraints.cif -o SC9_ad_SC8
    ...

overrides the input atom IDs and instead sets atom IDs by matching atoms from
SC8:

.. image:: images/sc9_ad_sc8_restraints_cif.png
   :alt: atom names from grade2 --PDB_ligand SC9 -ad SC8.restraints.cif

It can be seen that all atoms are matched to equivalents in SC8, including both
the halogenophenyl and pyridine rings.

The ``--antecedent_disregard_element`` option is useful to set consistent IDs
and produce aligned 2D diagrams for series of related inhibitors.

.. _rdkit_canonical:

Basing atom IDs on the RDKit canonical SMILES string with ``--rdkit_canonical_atom_ids``
----------------------------------------------------------------------------------------

The default procedure used by Grade2 for setting atom IDs, described above,
uses the atom order of the input molecule. This means that it is common for two
restraint dictionaries for a single compound to have completely different atom
naming because the atom orders of the input descriptions differ.

To avoid this problem the ``--rdkit_canonical_atom_ids`` option (short option
``-R``) can be used. This uses the atom order in the `RDKit canonical SMILES `_
string as a basis for the atom IDs. As the RDKit canonical SMILES is
independent of the input atom order, this will produce the same atom IDs for a
single compound whatever the source. For example, using three different SMILES
strings describing `ephedrine `_, ``grade2 -R`` will produce the same atom IDs:

..
image:: images/ephedrine_comparison.png
   :alt: atom IDs for ephedrine from -R

Hydrogen atom IDs are based on the list number of the non-hydrogen atom to
which they are attached, :ref:`as described above `.

Please note that ``--rdkit_canonical_atom_ids`` wipes any existing atom IDs and
that atoms are reordered by the option.

.. _inchi_canonical:

Basing atom IDs on the InChI canonical atom order with ``--inchi_canonical_atom_ids``
-------------------------------------------------------------------------------------

Using the RDKit canonical SMILES atom order to produce consistent atom IDs for
a single molecule, with the :ref:`--rdkit_canonical_atom_ids option `, works
well. One problem, however, is that canonical SMILES strings produced by
different programs are not consistent and so the atom IDs are not universal.

`Dashti et al. (2017) `_ introduced the idea of using the canonical atom order
found as part of calculating the `International Chemical Identifier (InChI) `_
of a molecule to produce ALATIS unique identifiers. The
``--inchi_canonical_atom_ids`` option uses this idea and produces atom IDs from
the InChI canonical atom order. For non-hydrogen atoms, the numerical part of
the atom ID set by ``--inchi_canonical_atom_ids`` is the same as the ALATIS ID.

Once again using as an example three different SMILES strings describing
`ephedrine `_, ``grade2 --inchi_canonical_atom_ids`` produces:

.. image:: images/ephedrine_inchi_cf.png
   :alt: atom IDs for ephedrine from --inchi_canonical_atom_ids

As expected, consistent atom IDs are produced by
``--inchi_canonical_atom_ids`` regardless of the atom order in the input SMILES
string. However, atoms with consecutive numbers can be far apart in the
molecule: for instance atom ``C1`` is bonded to atom ``C8`` and is not adjacent
to atom ``C2``. This makes the IDs less "user-friendly" but more universal than
those from ``--rdkit_canonical_atom_ids`` (which, for me, are more intuitive).
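
The order-based schemes in this chapter (the default input order, the RDKit
canonical order and the InChI canonical order) differ only in how the
non-hydrogen atoms are ordered. Once an order is chosen, the IDs follow the
same pattern: element plus position for non-hydrogen atoms, and ``H`` plus the
parent atom's number (with ``A``, ``B``, ``C`` when there is more than one) for
hydrogen atoms. The Python sketch below illustrates that pattern. It is not
Grade2's implementation: the function ``atom_ids_from_order`` and its input
format are hypothetical, and continuing with ``D`` for a fourth hydrogen atom
is an assumption.

```python
from string import ascii_uppercase


def atom_ids_from_order(heavy_atoms):
    """Hypothetical sketch: heavy_atoms is a list of (element, n_hydrogens)
    tuples, already in the chosen atom order."""
    ids = []
    for number, (element, n_hydrogens) in enumerate(heavy_atoms, start=1):
        heavy_id = f"{element.upper()}{number}"
        if n_hydrogens == 1:
            hydrogen_ids = [f"H{number}"]
        else:
            # More than one hydrogen atom: add A, B, C (assumption: D onwards too).
            hydrogen_ids = [f"H{number}{ascii_uppercase[i]}" for i in range(n_hydrogens)]
        ids.append((heavy_id, hydrogen_ids))
    return ids


# Non-hydrogen atoms of ephedrine in the order of N(C)[C@@H](C)[C@H](O)c1ccccc1
ephedrine = [("N", 1), ("C", 3), ("C", 1), ("C", 3), ("C", 1), ("O", 1), ("C", 0)]
ephedrine += [("C", 1)] * 5

labels = atom_ids_from_order(ephedrine)
print(labels[0])     # ('N1', ['H1'])
print(labels[1])     # ('C2', ['H2A', 'H2B', 'H2C'])
print(labels[5][0])  # O6
```

Feeding the same function a canonical order instead of the input order is what
makes the resulting IDs independent of how the molecule was written down.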