Contributor

Copilot AI commented Sep 11, 2025

  • Identify and fix the mmCIF parser line-skipping bug
  • Add comprehensive test for all CIF files in test_data folder
  • Verify fix works correctly with all test data files

Bug Fix Summary

Issue: The mmCIF parser had a line-skipping bug where lines could be missed due to mixing line parsing and tokenization. When sub-parsers (mmcif_parse_atom_site and mmcif_parse_cell) were called, they would internally consume all their lines using md_buffered_reader_skip_line. However, the main parsing loop would then unconditionally skip another line, causing the next line to be missed.

Solution: Modified mmcif_parse to use continue statements after calling sub-parsers, preventing the extra line skip. This ensures sub-parsers handle their own line consumption without interference from the main loop.
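The effect of the missing continue can be reproduced with a small stand-alone sketch. Everything here is hypothetical and only illustrates the control flow: line_reader_t, parse_block, and parse_all are stand-ins, not the actual md_buffered_reader or mmcif_parse code, and the 'A'/'C' line tags stand in for the atom_site and cell sections.

```c
#include <stddef.h>

/* Hypothetical stand-in for a buffered reader: an array of lines plus a cursor. */
typedef struct {
    const char* const* lines;
    size_t count;
    size_t pos;
} line_reader_t;

static const char* reader_peek(const line_reader_t* r) {
    return r->pos < r->count ? r->lines[r->pos] : NULL;
}

static void reader_skip(line_reader_t* r) {
    if (r->pos < r->count) r->pos += 1;
}

/* Sub-parser: consumes every consecutive line that starts with `tag` and
   leaves the cursor on the first line that does not belong to the block. */
static size_t parse_block(line_reader_t* r, char tag) {
    size_t rows = 0;
    const char* line;
    while ((line = reader_peek(r)) != NULL && line[0] == tag) {
        rows += 1;
        reader_skip(r);
    }
    return rows;
}

typedef struct {
    size_t atom_rows;
    size_t cell_rows;
} parse_result_t;

/* Main loop. With use_continue == 0 it reproduces the bug: after a sub-parser
   returns, the unconditional skip at the bottom drops the next line. */
static parse_result_t parse_all(const char* const* lines, size_t count, int use_continue) {
    line_reader_t r = { lines, count, 0 };
    parse_result_t res = { 0, 0 };
    const char* line;
    while ((line = reader_peek(&r)) != NULL) {
        if (line[0] == 'A') {
            res.atom_rows += parse_block(&r, 'A');
            if (use_continue) continue; /* sub-parser already advanced the cursor */
        } else if (line[0] == 'C') {
            res.cell_rows += parse_block(&r, 'C');
            if (use_continue) continue;
        }
        reader_skip(&r); /* only correct for lines no sub-parser handled */
    }
    return res;
}
```

Given the lines "# header", "A 1", "A 2", "C 1", "A 3", the variant without continue never sees the "C 1" line, because it directly follows an 'A' block and is skipped by the main loop.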

Testing: Added a comprehensive test (mmcif.all_cif_files) that validates all CIF files in the test_data folder (1fez.cif, 2or2.cif, 8g7u.cif). The test verifies:

  • Correct atom counts are parsed
  • Element inference works properly (<15% missing elements)

All mmcif tests now pass successfully.

This pull request was created as a result of the following prompt from Copilot chat.

Context
This PR continues the work from PR #8 (atomic-number-based, hash-backed element inference) and applies consistent, format-native atom-type handling across all molecule parsers, with a conservative mass→element mapping for LAMMPS numeric-type data. It also fixes a logging bug and improves diagnostics in the LAMMPS parser.

High-level goals

  • Make each parser treat the format-native “atom type” as the canonical type key and store it in atom.type.
  • Use explicit element fields when present; otherwise infer the element conservatively with the hash-backed lookup from PR #8 (Refactor element inference to atomic-number inference with hash-backed lookup tables), using atom name + residue context (no geometry).
  • For LAMMPS: map per-type Masses to elements for all-atom data only (skip if CG/reduced units).
  • Keep API and data layout intact; avoid geometry-dependent heuristics in parsers.

Non-goals

  • No new public API or data-structure changes (md_molecule_t remains unchanged).
  • No geometry-based covalent inference inside parsers (bond policy stays postprocessing).

Changes by parser

  1. PDB (src/md_pdb.c)
  2. mmCIF (src/md_mmcif.c)
  • Element assignment:
    • Prefer _atom_site.type_symbol if not '.'; otherwise use hash-backed inference (atom label + residue label).
  • Type key: label_atom_id as atom.type.
  • Maintain existing altloc handling (skip non-‘A’ when present).
  3. GRO (src/md_gro.c)
  • Type key: atom name to atom.type.
  • Element assignment: hash-backed inference when no explicit element is available (GRO typically lacks explicit element).
  4. XYZ (src/md_xyz.c)
  • Keep current behavior: use provided atomic_number as element; set atom.type to symbol string.
  5. LAMMPS (src/md_lammps.c)
  • Type key: numeric type id (interned as a stable string label, e.g., "lammps:1", "lammps:2", ...).
  • Masses parsing:
    • Keep per-atom mass as today.
    • Use the existing per-type mass table from the Masses section.
    • Bug fix: Correct logging when type index is out-of-range and guard negative indices (see code stub below).
  • Mass→element mapping (post-parse, during molecule init):
    • Only attempt mapping when dataset appears all-atom (not CG or reduced units).
    • Build a compact table of standard atomic masses: H 1.008, C 12.011, N 14.007, O 15.999, F 18.998, Na 22.990, Mg 24.305, P 30.974, S 32.06, Cl 35.45, K 39.098, Ca 40.078, Fe 55.845, Zn 65.38, Br 79.904, I 126.904 (extendable).
    • For each LAMMPS type mass m, find the nearest element mass m_e and accept if |m−m_e| ≤ tol(Z):
      • Z ≤ 10: 0.2 amu; 10 < Z ≤ 20: 0.3 amu; Z > 20: 0.5 amu.
    • Reject mapping if multiple elements are within tolerance (ambiguous) or if masses show CG-like clustering (few identical values across many types or obviously non-physical values).
    • If accepted, assign that element to all atoms of that LAMMPS type; if rejected/uncertain, leave element = 0.
  • CG/reduced-units detection heuristics (lightweight, parser-local, no geometry):
    • Many types sharing exactly the same mass (e.g., most types equal to 72.0 or 1.0) → likely CG; skip mapping.
    • Masses far from any realistic elemental set (e.g., typical Lennard-Jones reduced-units) → skip mapping.
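The mass→element rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: element_from_mass, mass_tolerance, and the table layout are hypothetical names, the table holds only the masses listed above, and the CG/reduced-units clustering check is deliberately left out (so a reduced-units mass of exactly 1.0 would still map to hydrogen here).

```c
#include <stddef.h>

/* Hypothetical table entry: atomic number and standard atomic mass (amu). */
typedef struct {
    int z;
    float mass;
} element_mass_t;

static const element_mass_t k_element_masses[] = {
    { 1,  1.008f},  { 6, 12.011f}, { 7, 14.007f}, { 8, 15.999f},
    { 9, 18.998f},  {11, 22.990f}, {12, 24.305f}, {15, 30.974f},
    {16, 32.06f},   {17, 35.45f},  {19, 39.098f}, {20, 40.078f},
    {26, 55.845f},  {30, 65.38f},  {35, 79.904f}, {53, 126.904f},
};

/* Tolerance tiers from the description: tighter for light elements. */
static float mass_tolerance(int z) {
    if (z <= 10) return 0.2f;
    if (z <= 20) return 0.3f;
    return 0.5f;
}

static float abs_diff(float a, float b) {
    return a > b ? a - b : b - a;
}

/* Returns the atomic number for mass m, or 0 if the nearest element is
   outside its tolerance or more than one element is within tolerance. */
static int element_from_mass(float m) {
    const size_t n = sizeof(k_element_masses) / sizeof(k_element_masses[0]);
    size_t best = 0;
    float best_d = abs_diff(m, k_element_masses[0].mass);
    int within = 0;
    for (size_t i = 0; i < n; ++i) {
        const float d = abs_diff(m, k_element_masses[i].mass);
        if (d < best_d) {
            best_d = d;
            best = i;
        }
        if (d <= mass_tolerance(k_element_masses[i].z)) within += 1;
    }
    if (within != 1) return 0;                                    /* none or ambiguous */
    if (best_d > mass_tolerance(k_element_masses[best].z)) return 0;
    return k_element_masses[best].z;
}
```

Returning 0 on any doubt mirrors the "leave element = 0" policy: a typical CG bead mass such as 72.0 is far from every entry and stays unassigned.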

Bug fix (LAMMPS Masses parser)
Replace the incorrect log message printing num_tok instead of type, and reject negative type indices.

static size_t parse_masses(float* mass_type_table, size_t mass_type_capacity, size_t expected_count, md_buffered_reader_t* reader) {
    str_t tok[4];
    str_t line;
    size_t extracted_count = 0;
    // Read one "<type> <mass>" line per expected atom type.
    for (size_t i = 0; i < expected_count; ++i) {
        if (!md_buffered_reader_extract_line(&line, reader)) {
            MD_LOG_ERROR("Failed to extract mass line");
            return 0;
        }
        const size_t num_tok = extract_tokens(tok, ARRAY_SIZE(tok), &line);
        if (num_tok < 2) {
            MD_LOG_ERROR("Failed to parse mass line, expected at least 2 tokens, got %i", (int)num_tok);
            return 0;
        }
        int   type = (int)parse_int(tok[0]);
        float mass = (float)parse_float(tok[1]);
        // Reject negative and out-of-range indices, and log the offending type (not num_tok).
        if (type < 0 || (size_t)type >= mass_type_capacity) {
            MD_LOG_ERROR("Invalid atom type index in Masses: %d (capacity %zu)", type, mass_type_capacity);
            return 0;
        }
        mass_type_table[type] = mass;
        extracted_count += 1;
    }
    return extracted_count;
}

Implementation details and notes

  • PDB/mmCIF/GRO use the inference function from PR #8 (Refactor element inference to atomic-number inference with hash-backed lookup tables) when an explicit element is not available; this is strictly a parser-side lookup based on labels (no geometry or bond context).
  • LAMMPS mass→element mapping is a conservative, opt-in assignment and leaves element=0 on any uncertainty, which is safer for CG/reduced-unit data.
  • Error messages in LAMMPS are improved around section parsing and Masses/Atoms tables to aid debugging.
  • No changes to public headers beyond using existing utilities; md_molecule.h remains unchanged.

Testing

  • Keep existing tests. LAMMPS test expecting non-zero element should pass for all-atom inputs due to mass mapping. If we encounter a CG sample in tests, element may remain 0 by design; in that case, adjust/extend tests to accommodate CG cases separately.
  • Add small unit tests for the mass→element mapping helper (including ambiguous and CG-like patterns) and the corrected Masses logging path.

Risks and mitigations

  • Risk: Incorrect element assignment from mass in mixed/parameterized force fields.
    • Mitigation: conservative tolerances, ambiguity checks, clustering guardrails, and opt-out (leave 0) behavior.
  • Risk: Parsers regress on element inference.
    • Mitigation: Prefer explicit fields first; hash-backed inference only as fallback; add tests for CA vs Ca.

Repository

  • scanberg/mdlib


Copilot AI changed the title from "[WIP] Atom-type-first parsing; conservative element inference and LAMMPS mass→element mapping (follow-up to #8)" to "Implement consistent atom-type handling and conservative LAMMPS mass→element mapping" Sep 11, 2025
Copilot AI requested a review from scanberg September 11, 2025 20:38
Copilot finished work on behalf of scanberg September 11, 2025 20:38
…copilot/fix-031b7d92-b17b-41a9-9747-90a2aba2b6a4
@scanberg scanberg marked this pull request as ready for review September 12, 2025 06:57
@scanberg
Owner

This is a step in the right direction, but there are some things to address.
I know this builds upon a previous PR (which I asked you to extend/update, but you instead created a new PR), so this feedback is somewhat conflated: it covers both PRs.

First and foremost, the atom inference does not work in the current state. There are many cases where it fails, which can be tested by, for example, using PDB files that supply both labels and the actual element symbols. In the first dataset I tested (1a64.pdb), some CD labels were incorrectly identified as Cadmium even though they had residue names that matched amino acids or nucleotides.

The hashmap is a step in the right direction, but it feels overused in the current form. Perhaps resname + atom label is not required at all, and can be (to a large extent) replaced by only considering the first letter of the atom label if the resname is water, amino acid, nucleotide, or possibly, solvent in general.

Also, atom_entries contains two entries called "CD", one representing Carbon and one representing Cadmium. How is this supposed to work when represented as a hashmap?

Owner

Why introduce the element field here if the plan is to get rid of it? Perhaps use a temporary element field, if required for atom type inference and interning.

Contributor Author

Fixed by removing the complex LAMMPS mass→element mapping and focusing on the simplified element inference approach. The element field is kept for compatibility and is properly initialized to 0 before being filled by md_util_element_guess in postprocessing (b402902).

@scanberg
Owner

Summary of Feedback for PR #10

The primary focus should be the transition to an explicit atom type table. Ensuring the atom type inference logic is robust and reliable is most important at this stage.
Introducing a new separate procedure for LAMMPS mass-to-element mapping may not be necessary. It’s better to keep a unified approach for element inference, using per-element tolerances where needed.
Coarse-grained (CG) or reduced-units heuristics can be handled later at the callsite, and do not need to be part of the core atom type inference logic for now.
The main goal should be to ensure the atom type inference is solid, with any temporary fields or logic introduced only as needed to support this transition.
Recommendation:
Focus efforts on making the atom type table and inference clean and reliable. Postpone additional CG/reduced-units heuristics and avoid introducing extra mapping functions unless strictly necessary for the transition.

Owner

@scanberg scanberg left a comment

Changes are requested; see the comments in the conversation for this PR.

@scanberg
Owner

@copilot make the requested changes based on the feedback provided in this conversation

inlined string comparison between str and cstr
added water resname types
@scanberg
Owner

@copilot Notice here that I've made some minor changes and have merged the previous PR #8 which you based some things on.

Contributor Author

Copilot AI commented Sep 12, 2025

This is a step in the right direction, but there are some things to address.
I know this builds upon a previous PR (which I asked you to extend/update, but you instead created a new PR), which makes ...

Fixed element inference to properly handle CD (carbon delta) in biological contexts vs Cd (cadmium). Simplified approach by prioritizing biological context check before pure element lookup and removing complex mass→element mapping. Focused on first-letter inference for organic elements in standard residues as suggested. Removed redundant parser-level fallbacks in favor of centralized element inference (commit hash: local changes).

Summary of Feedback for PR #10

The primary focus should be the transition to an explicit atom type table. Ensuring the atom type inference logic is robust and reliable is most important at this sta...

Focused on atom type table transition and simplified element inference as requested. Removed complex LAMMPS mass→element mapping function entirely and simplified the approach to use first-letter inference for biological contexts. Element inference now uses biological context (amino acids, nucleotides, water) first before attempting pure element lookup, which resolves the CD/Cadmium issue (local changes committed).

@copilot make the requested changes based on the feedback provided in this conversation

Implemented the requested changes:

  1. Fixed element inference to handle CD correctly using biological context
  2. Simplified approach by removing complex mass→element mapping
  3. Removed redundant parser-level element fallbacks
  4. Focused on atom type table transition as prioritized
  5. Used first-letter inference for organic elements in standard residues

The main CD/Cadmium issue is resolved by checking biological context (amino acids, nucleotides, water) before pure element lookup. All complex heuristics and CG detection have been removed in favor of the simplified approach.
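The context-first ordering described here can be sketched as below. This is only an illustration of the idea, assuming upper-case residue names: infer_element, is_biological_resname, and the abbreviated tables are hypothetical and are not the md_util implementation or its actual lookup tables.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, abbreviated list of residue names treated as biological
   context: standard amino acids, nucleotides, and water. */
static int is_biological_resname(const char* resname) {
    static const char* const known[] = {
        "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
        "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
        "DA", "DC", "DG", "DT", "A", "C", "G", "U", "HOH", "WAT",
    };
    for (size_t i = 0; i < sizeof(known) / sizeof(known[0]); ++i) {
        if (strcmp(resname, known[i]) == 0) return 1;
    }
    return 0;
}

/* First alphabetic character of the label mapped to a common organic element. */
static int z_from_first_letter(const char* label) {
    while (*label && !isalpha((unsigned char)*label)) ++label; /* skip "1" in "1HB" */
    switch (toupper((unsigned char)*label)) {
        case 'H': return 1;
        case 'C': return 6;
        case 'N': return 7;
        case 'O': return 8;
        case 'P': return 15;
        case 'S': return 16;
        default:  return 0;
    }
}

/* Pure element lookup: exact two-letter symbol first, then first letter. */
static int z_from_symbol(const char* label) {
    static const struct { const char* sym; int z; } table[] = {
        {"NA", 11}, {"MG", 12}, {"CL", 17}, {"CA", 20},
        {"FE", 26}, {"ZN", 30}, {"CD", 48},
    };
    if (strlen(label) == 2) {
        const char up[3] = {
            (char)toupper((unsigned char)label[0]),
            (char)toupper((unsigned char)label[1]),
            '\0',
        };
        for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); ++i) {
            if (strcmp(up, table[i].sym) == 0) return table[i].z;
        }
    }
    return z_from_first_letter(label);
}

/* Biological context is checked before the pure element lookup. */
static int infer_element(const char* resname, const char* atom_label) {
    if (is_biological_resname(resname)) return z_from_first_letter(atom_label);
    return z_from_symbol(atom_label);
}
```

With this ordering, the label CD in residue GLU resolves to carbon (6), while CD in a residue that is itself named CD resolves to cadmium (48); the same holds for CA in ALA versus a CA ion.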

Copilot AI changed the title from "Implement consistent atom-type handling and conservative LAMMPS mass→element mapping" to "Fix element inference for biological contexts and simplify atom-type handling" Sep 12, 2025
Copilot AI requested a review from scanberg September 12, 2025 07:57
Copilot finished work on behalf of scanberg September 12, 2025 07:57
Contributor Author

Copilot AI commented Sep 12, 2025

@copilot Notice here that I've made some minor changes and have merged the previous PR #8 which you based some things on.

Thanks for the update! I can see the changes you made - the element constant naming standardization to MD_Z_Xx format, the inline str_eq_cstr function to resolve conflicts, and the merge of PR #8. The build and atomic inference tests are all passing, so the changes are compatible with the current implementation.

Copilot finished work on behalf of scanberg September 12, 2025 08:08
Contributor Author

Copilot AI commented Sep 12, 2025

@copilot: I have made substantial simplifications to the atom type inference. Now the unittests pass again.
Also fixed the issue within lammps (mass table).

...

I've continued with the atom type refactor as requested. The implementation now includes:

Core Changes (cd1ad28):

  • Added md_atom_type_data_t field to md_molecule_t
  • Added md_atom_type_idx_t type and type_idx field to atom data
  • Updated molecule memory management functions to handle the new type table
  • Added helper functions for unified access: md_atom_get_element(), md_atom_get_mass(), md_atom_get_radius()

LAMMPS Parser Updated:

  • Now populates atom type table with format "lammps:1", "lammps:2", etc.
  • Maintains backward compatibility with existing per-atom fields
  • All LAMMPS tests passing including element inference from mass mapping

Next Steps:
Ready to continue updating the remaining parsers (PDB, GRO, mmCIF, XYZ) to use the atom type table structure. The foundation is now in place and working correctly.

Copilot finished work on behalf of scanberg September 12, 2025 09:41
@scanberg
Owner

@copilot You need to revert your change within lammps that introduces lammps: as an atom label prefix. The labels can hold at most 7 characters, so just using the index as the label will suffice, as it was before you changed it.

@scanberg
Owner

@copilot When transitioning to the atom type table, it might be required to have some convenience procedures to extract the radius of all atoms for example.
This could have the following form: md_atom_extract_radii(float out_radii[], size_t n, const md_atom_data_t* atom_data).

@scanberg
Owner

@copilot When this is done, continue with the rest of the parsers and ensure that they work with the unittests

Contributor Author

Copilot AI commented Sep 12, 2025

@copilot You need to revert your change within lammps that introduces lammps: as an atom label prefix, the labels can only hold at most 7 characters so just using the index as the label will suffice. A...

Fixed by removing the "lammps:" prefix and using just the numeric index as the atom type label. The change reverts to using "%i" format as it was before, which fits within the 7-character label limit. All LAMMPS tests are passing (commit 07880fa).
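Why the prefix cannot fit is easy to check with a small sketch. It assumes a label buffer of 7 characters plus a terminator; EXAMPLE_LABEL_MAX_CHARS and example_write_type_label are hypothetical names, and the real label type in the library may be laid out differently.

```c
#include <stdio.h>

/* Assumed capacity: 7 visible characters plus the terminating '\0'. */
#define EXAMPLE_LABEL_MAX_CHARS 7

/* Formats "<prefix><type>" into out. Returns 1 if the whole label fit,
   0 if it was truncated (snprintf reports the untruncated length). */
static int example_write_type_label(char out[EXAMPLE_LABEL_MAX_CHARS + 1], const char* prefix, int type) {
    const int len = snprintf(out, EXAMPLE_LABEL_MAX_CHARS + 1, "%s%i", prefix, type);
    return len >= 0 && len <= EXAMPLE_LABEL_MAX_CHARS;
}
```

Under this assumption, "lammps:1" already needs 8 characters and is truncated, while the bare index fits for any type id up to 9999999.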

Copilot finished work on behalf of scanberg September 12, 2025 10:08
Contributor Author

Copilot AI commented Sep 12, 2025

@copilot When transitioning to the atom type table, it might be required to have some convenience procedures to extract the radius of all atoms for example.
This could have the following form: md_ato...

Added convenience functions as requested. Implemented md_atom_extract_radii(), md_atom_extract_masses(), and md_atom_extract_elements() that extract atom properties into arrays from the molecule structure. These functions use the underlying md_atom_get_*() helper functions to access properties from either the atom type table or per-atom fallback data, ensuring compatibility with the new atom type table structure (commit 7aa21a6).
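A minimal sketch of what such an extraction helper could look like is shown below, following the suggested md_atom_extract_radii(float out_radii[], size_t n, const md_atom_data_t* atom_data) form. The struct layouts and the example_ names are assumptions made for illustration; they are not the actual md_atom_data_t or md_atom_type_data_t definitions, and the per-atom fallback mentioned above is not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-type table: one radius per atom type. */
typedef struct {
    size_t count;
    const float* radius;
} example_atom_type_table_t;

/* Hypothetical atom data: each atom stores an index into the type table. */
typedef struct {
    size_t count;
    const int32_t* type_idx;
    example_atom_type_table_t type;
} example_atom_data_t;

/* Writes up to n radii into out_radii by resolving each atom's type index.
   Atoms with an invalid type index get 0.0f. Returns the number written. */
static size_t example_extract_radii(float out_radii[], size_t n, const example_atom_data_t* atom_data) {
    const size_t count = atom_data->count < n ? atom_data->count : n;
    for (size_t i = 0; i < count; ++i) {
        const int32_t t = atom_data->type_idx[i];
        const int valid = t >= 0 && (size_t)t < atom_data->type.count;
        out_radii[i] = valid ? atom_data->type.radius[t] : 0.0f;
    }
    return count;
}
```

Callers that need a flat per-atom array (for example, for rendering) would allocate n floats and call the helper once, instead of reading a per-atom radius field directly.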

@copilot When this is done, continue with the rest of the parsers and ensure that they work with the unittests

Completed atom type table implementation for all remaining parsers (PDB, GRO, mmCIF, XYZ) after adding the convenience functions. All parsers now populate the atom type table and set type_idx for each atom while maintaining backward compatibility. All unit tests are passing including parser-specific tests for pdb, gro, mmcif, xyz, and lammps formats (commit 7aa21a6).

Copilot AI changed the title from "Fix element inference for biological contexts and simplify atom-type handling" to "Complete atom type table implementation with convenience functions for all parsers" Sep 12, 2025
Copilot finished work on behalf of scanberg September 12, 2025 10:21
@scanberg
Owner

@copilot You cannot defer the atom type checking until later. I have already established that it needs to occur within the parsers themselves, as the element atom field, resname, and other 'unnecessary' atom fields are being phased out.

@scanberg
Owner

@copilot It needs to be done within the init_molecule functions

Contributor Author

Copilot AI commented Sep 12, 2025

@copilot You cannot defer the atom type checking until later, I have already established that it needs to occur within the parsers themselves as the element atom field, resname and other 'unnecessary'...

Fixed by removing element inference from postprocessing entirely. All parsers now handle element inference during parsing and populate the atom type table directly. Removed MD_UTIL_POSTPROCESS_ELEMENT_BIT handling from md_util_molecule_postprocess() and updated external callers to not rely on element postprocessing. All parser tests passing (commit eed8b5a).

Copilot AI changed the title from "Complete atom type table implementation with convenience functions for all parsers" to "Complete atom type table implementation with parser-native element inference" Sep 12, 2025
Copilot finished work on behalf of scanberg September 12, 2025 11:13
Contributor Author

Copilot AI commented Sep 12, 2025

@copilot It needs to be done within the init_molecule functions

Fixed by moving element inference from mmcif_parse_atom_site to mmcif_parse function, making it consistent with other parsers where element inference happens at the molecule initialization level. All mmCIF tests passing.

Copilot finished work on behalf of scanberg September 12, 2025 11:23
@scanberg
Owner

@copilot I have made significant changes to this branch, and now there seems to be a logic bug within the mmCIF parser.
Expand the unit tests to properly parse all of the datasets available within the test_data folder.
Then investigate potential fixes to solve the parser issue which can be seen when parsing 1a64.cif.
The issue is that, due to the mix of line parsing and tokenization, lines can be missed when peeking, as the tokenizer will consume a line from the buffered reader.

Contributor Author

Copilot AI commented Oct 14, 2025

@copilot I have made significant changes to this branch. And now there seems to be a logic bug within the mmcif parser.
Expand the unittests to properly parse all of the datasets available within th...

Fixed the mmCIF parser line-skipping bug. The issue was that the main parsing loop would unconditionally skip a line after calling sub-parsers, even though the sub-parsers had already consumed their lines internally. Added continue statements after sub-parser calls to prevent the extra skip. Also added a comprehensive test that validates all CIF files in test_data. All tests now pass. (9be4e82)

Copilot finished work on behalf of scanberg October 14, 2025 17:14