
Merge the develop branch into the master branch #330

Merged
merged 125 commits into from
Nov 21, 2024

Conversation


@AntObi AntObi commented Nov 20, 2024

Merge the develop branch

Description

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Test Configuration:

  • Python version: 3.10
  • Operating System: macOS

Reviewers

N/A

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Summary by CodeRabbit

Release Notes

  • Documentation

    • Updated pull request template for improved formatting and clarity.
    • Enhanced README.md with clearer structure and updated security references.
    • Added new tutorials and examples in the documentation to aid user understanding.
    • Introduced a new section in the documentation for oxidation states, including multiple new data files.
  • New Features

    • Introduced new Jupyter notebooks demonstrating various functionalities, such as compound electronegativity and oxidation states.
    • Added multiple new data files for oxidation states sourced from ICSD and Pymatgen.
  • Bug Fixes

    • Minor formatting corrections across various documentation files and code comments.
  • Improvements

    • Enhanced type hinting and docstrings in several modules for better readability and maintainability.
    • Streamlined the logic in several functions to improve performance and clarity.
    • Updated the Sphinx documentation configuration for improved build capabilities and integration with Jupyter notebooks.

aronwalsh and others added 30 commits July 25, 2024 10:07
  • Added comment lines on dataset origins
  • Populated from ICSD API
    • raw count json
    • raw oxidation state list
    • filtered list with > 5 entries per oxidation state
  • Add set that lists the oxidation state with the most results for each element
  • and Checking_smact_validity_of_GNoME.ipynb

coderabbitai bot commented Nov 20, 2024

Walkthrough

This pull request introduces a series of formatting and documentation enhancements across multiple files in the SMACT project. Key changes include updates to pull request templates, workflow configurations, documentation files, and the introduction of new Jupyter notebooks. The changes aim to improve readability, consistency, and clarity without altering the underlying functionality of the code. Additionally, several new data files related to oxidation states have been added, and existing files have undergone minor restructuring for better organisation.

Changes

| File Path | Change Summary |
| --- | --- |
| `.github/pull_request_template.md` | Reformatted "Test Configuration", updated "Reviewers" section, and improved "Checklist" section layout. |
| `.github/release.yml` | Added a newline at the end of the changelog section. |
| `.github/workflows/ci.yml` | Reformatted matrix configuration for readability and reindented steps within the test job. |
| `.github/workflows/combine-prs.yml` | Updated string literals from single to double quotes for consistency. |
| `.github/workflows/publish-to-pypi.yml` | Changed tag pattern quotes from single to double quotes. |
| `.gitignore` | Added entry to ignore backup files with `.bak` extension. |
| `.pre-commit-config.yaml` | Removed hooks for isort, pyupgrade, and nbQA; added hooks for ruff, pre-commit-hooks, markdownlint-cli, and mirrors-prettier. |
| `.readthedocs.yaml` | Updated OS to "ubuntu-22.04" and Python version to "3.10". |
| `CONTRIBUTING.md` | Minor textual modifications for fluidity. |
| `README.md` | Updated DOI badge URL, reformatted section headers, improved clarity, and added references to tutorials. |
| `dev_docs/contribution_guidelines.md` | Complete reformatting for consistency and clarity. |
| `dev_docs/design_docs/StructurePredictor.md` | Complete reformatting for clarity without content changes. |
| `docs/conf.py` | Added `myst_nb` extension and updated project metadata. |
| `docs/examples.rst` | Updated structure to include a table of contents for example files. |
| `docs/examples/*.ipynb` | Introduced several new Jupyter notebooks demonstrating various functionalities of the SMACT library. |
| `docs/index.rst` | Added new sections for "Introduction", "Install", and "License and citation". |
| `docs/introduction.rst` | Added "Features of :mod:`smact`" section and updated "Install" and "License and citation" sections. |
| `docs/requirements.in` | Added new dependencies for documentation. |
| `docs/requirements.txt` | Updated dependencies to reflect source changes from `setup.py` to `pyproject.toml`. |
| `setup.py` | Added module-level docstring, updated `long_description`, and removed unused imports. |
| `smact/__init__.py` | Updated docstrings and type hints, added new attribute to `Element` class. |
| `smact/benchmarking/*.py` | Updated import statements, added docstrings, and incorporated timing for benchmarking tests. |
| `smact/lattice.py` | Added module-level docstring, updated `Site` class to avoid mutable default arguments. |
| `smact/properties.py` | Enhanced type hints and error handling in various functions. |
| New files in `smact/data/` | Added multiple files containing oxidation states for various elements. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant PR_Template
    participant CI_Workflow
    participant Documentation
    participant Jupyter_Notebooks

    User->>PR_Template: Submit PR
    PR_Template->>User: Review formatting changes
    User->>CI_Workflow: Trigger CI checks
    CI_Workflow->>User: Provide feedback on CI status
    User->>Documentation: Update README and guidelines
    Documentation->>User: Confirm updates
    User->>Jupyter_Notebooks: Explore new examples
    Jupyter_Notebooks->>User: Provide learning resources
```


Poem

🐇 In the meadow, bright and fair,
Changes dance upon the air.
Templates neat, workflows aligned,
New notebooks for the curious mind.
With every tweak, the code does sing,
A joyful hop, let’s celebrate spring! 🌼




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

🧹 Outside diff range and nitpick comments (99)
smact/benchmarking/smact_benchmark.py (1)

32-32: Consider enhancing the docstring with additional details.

Whilst the current docstring is clear, consider expanding it to include:

  • Expected output format
  • Typical performance characteristics
  • Usage example

Example enhancement:

-    """Run benchmark tests for CationMutator."""
+    """Run benchmark tests for CationMutator.
+
+    Executes a series of timed tests to measure the performance of CationMutator
+    operations, including initialisation and pair correlation calculations.
+
+    Returns:
+        None. Results are printed to stdout in the format:
+        Operation: time_taken (in seconds)
+
+    Example:
+        >>> mutator_test_run()
+        Setting up CationMutator: 0.123s
+        Computing pair correlations: 0.456s
+    """
smact/benchmarking/pymatgen_benchmark.py (2)

Line range hint 16-35: Consider optimising the pair correlation implementation

Whilst the addition of timing decorators is excellent for performance profiling, the pair correlation implementation in __pair_corr could be optimised further. Consider using numpy's vectorised operations for better performance, especially when dealing with large datasets.

Example optimisation:

import numpy as np

@timeit
def __pair_corr(self):
    """Get pair correlation using vectorised operations."""
    species_array = np.array(list(self.sp.species))
    pairs = np.array(list(cwr(species_array, 2)))
    return np.vectorize(self.sp.pair_corr)(pairs[:, 0], pairs[:, 1])

Line range hint 38-41: Enhance the docstring with more detailed information

The current docstring could be more informative. Consider including:

  • Purpose of the benchmark
  • Explanation of the n=100 iterations
  • Expected output format
  • Any prerequisites or dependencies

Suggested docstring:

"""Run benchmarking tests for SubstitutionProbability.

Executes a series of benchmarking tests to measure the performance of
pymatgen's SubstitutionProbability calculations. The test is repeated
100 times to ensure statistical significance.

Returns:
    None: Results are logged via the timeit decorator
"""
smact/benchmarking/utilities.py (1)

31-31: Consider adding precision to the timing output

Whilst the logging statement is clear, it would be more useful for benchmarking if we formatted the timing with specific decimal places.

-            logging.info(f"{func.__name__} -- Average over {n} repeats = {mean(times)}s")
+            logging.info(f"{func.__name__} -- Average over {n} repeats = {mean(times):.6f}s")
smact/lattice.py (3)

Line range hint 8-31: Complete the method documentation section

The method section mentions lattice_vector_calc() but lacks description of its purpose, parameters, and return value. Consider completing this section following the NumPy docstring format.

     Methods:
     -------
-        lattice_vector_calc():
+        lattice_vector_calc()
+            Calculate the lattice vectors for the crystal structure.
+
+            Returns
+            -------
+            numpy.ndarray
+                The calculated lattice vectors

37-37: Enhance the init method documentation

The initialisation docstring should document its parameters following the NumPy format used elsewhere in the class.

-        """Initialize the Lattice object."""
+        """Initialize the Lattice object.
+
+        Parameters
+        ----------
+        sites : list[Site]
+            List of Site objects defining the crystal structure
+        space_group : int, optional
+            Space group number according to International Tables (default: 1)
+        strukturbericht : str or bool, optional
+            Strukturbericht designation, if applicable (default: False)
+        """

56-61: Excellent fix for mutable default argument

The change from oxidation_states=[0] to oxidation_states=None with conditional initialisation is a significant improvement. This addresses the Python mutable default argument pitfall, which could have caused unexpected behaviour in certain scenarios.

Consider adding type hints to improve code maintainability:

-    def __init__(self, position, oxidation_states=None):
+    def __init__(self, position: list[float], oxidation_states: list[int] | None = None):
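To see why this fix matters, here is a minimal, self-contained sketch of the pitfall (the classes below are stand-ins for illustration, not the actual `smact.lattice` code):

```python
# Minimal sketch of the mutable-default-argument pitfall this fix addresses.
# `BuggySite` shows the old pattern, `Site` the corrected one.

class BuggySite:
    def __init__(self, position, oxidation_states=[0]):  # default list is created ONCE
        self.position = position
        self.oxidation_states = oxidation_states


class Site:
    def __init__(self, position, oxidation_states=None):
        self.position = position
        # A fresh list is created per instance instead of sharing one default.
        self.oxidation_states = [0] if oxidation_states is None else oxidation_states


a, b = BuggySite([0, 0, 0]), BuggySite([0.5, 0.5, 0.5])
a.oxidation_states.append(2)
print(b.oxidation_states)  # the shared default leaks: [0, 2]

c, d = Site([0, 0, 0]), Site([0.5, 0.5, 0.5])
c.oxidation_states.append(2)
print(d.oxidation_states)  # instances stay independent: [0]
```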
docs/index.rst (3)

27-27: Fix spelling error in features list

There's a typo in "predcition" which should be "prediction".

- Structure predcition from chemical composition
+ Structure prediction from chemical composition

30-33: Consider expanding installation instructions

The installation section could be more comprehensive by including:

  • Python version requirements
  • Dependencies
  • Alternative installation methods (e.g., from source)
  • Development installation instructions

40-46: Consider adding formatted citations

To make it easier for users to cite the work, consider adding pre-formatted citations in common formats (BibTeX, APA, etc.).

.pre-commit-config.yaml (3)

10-11: Address the TODO comment regarding notebook formatting

The comment suggests there are outstanding issues with notebook formatting. Consider documenting the specific problems and creating a tracking issue for resolution.

Would you like me to help create a GitHub issue to track this notebook formatting task?


27-38: Consider simplifying the markdownlint configuration

The current array-style configuration could be simplified by using a .markdownlint.json configuration file instead. This would improve maintainability and make the rules more discoverable.

Example .markdownlint.json:

{
  "MD013": false,
  "MD024": false,
  "MD025": false,
  "MD033": false,
  "MD041": false,
  "ignores": ["dev_docs/design_docs/*"]
}

54-58: Document codespell configuration location

The setup references a TOML configuration, but its location isn't immediately clear. Consider adding a comment to indicate where the codespell configuration can be found in the pyproject.toml file.
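For reference, a codespell section in `pyproject.toml` typically looks something like the fragment below; the keys and values shown are illustrative, and the project's actual `pyproject.toml` is authoritative.

```toml
# Illustrative only - see the project's pyproject.toml for the real settings.
[tool.codespell]
skip = "*.json,*.ipynb"
ignore-words-list = "titel,materal"
count = true
```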

setup.py (2)

3-3: Consider enhancing the module docstring

The current docstring could be more informative. Consider adding details about what SMACT is, its primary features, and basic usage information.

-"""Installation for SMACT."""
+"""
+Installation script for SMACT (Semiconducting Materials by Analogy and Chemical Theory).
+
+This package provides tools for screening semiconducting materials and analyzing their properties.
+For more information, visit: https://github.com/WMD-group/SMACT
+"""

Line range hint 39-43: Remove duplicate package entry

The package "smact.utils" is listed twice in the packages list, which is redundant.

        packages=[
            "smact",
            "smact.utils",
            "smact.tests",
            "smact.structure_prediction",
            "smact.dopant_prediction",
-           "smact.utils",
        ],
docs/introduction.rst (4)

14-20: Consider enhancing the features list with examples.

Whilst the features list is well-structured, it could benefit from brief examples or explanations for each point. For instance, specify which chemical properties are available or what types of structure prediction are supported.


21-24: Installation instructions could be more comprehensive.

The pip installation instruction is clear but minimal. Consider adding:

  • Requirements (Python version, dependencies)
  • Alternative installation methods (e.g., from source)
  • Troubleshooting tips

26-37: Consider adding DOIs for citations.

Whilst the citations are well-organised, adding DOIs would make them more accessible and standardised. Additionally, consider using a consistent citation format across all references.

Example format:

- `Computational screening of all stoichiometric inorganic materials <https://www.sciencedirect.com/science/article/pii/S2451929416301553>`_
+ `Computational screening of all stoichiometric inorganic materials <https://www.sciencedirect.com/science/article/pii/S2451929416301553>`_ (DOI: 10.1016/j.chempr.2016.09.010)

39-42: Consider chronological ordering of publications.

The studies section would benefit from chronological ordering of publications, either newest-first or oldest-first, to help readers track the project's development over time.

docs/examples/valence_electron_count.ipynb (3)

7-18: Consider enhancing the theoretical background.

The explanation is clear and well-structured. Consider adding:

  • References to key scientific literature about VEC in intermetallic compounds
  • Typical VEC ranges for different classes of materials
  • Examples of how VEC correlates with specific material properties
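As a back-of-envelope illustration of the concept (independent of smact's own `valence_electron_count`), VEC is a concentration-weighted average of per-element valence electron counts. The counts below follow the convention common in high-entropy-alloy work (transition metals counted by group number) and are shown for illustration only:

```python
# Illustrative valence electron counts; treat these values as assumptions.
VALENCE_ELECTRONS = {"Ni": 10, "Al": 3, "Fe": 8, "Cr": 6}


def vec(composition: dict[str, float]) -> float:
    """Concentration-weighted average of valence electron counts."""
    total = sum(composition.values())
    return sum(VALENCE_ELECTRONS[el] * n for el, n in composition.items()) / total


print(vec({"Ni": 1, "Al": 1}))  # (10 + 3) / 2 = 6.5
```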

38-39: Consider using a more specific import.

Instead of importing the entire properties module, consider importing just the required function:

-import smact.properties  # Importing the SMACT properties module
+from smact.properties import valence_electron_count  # Import specific function

46-48: Consider enhancing the output format.

The current output could be more informative. Consider:

  • Adding a header row
  • Including the elements present in each compound
  • Formatting as a table for better readability
-# Print the VEC for each formula
-for formula, vec in zip(formulas, valence_electron_concentrations):
-    print(f"{formula} has a VEC of {vec:.2f}")
+# Print results in a tabulated format
+print("\nValence Electron Concentration Results:")
+print("-" * 50)
+print(f"{'Chemical Formula':<25} {'VEC':>10}")
+print("-" * 50)
+for formula, vec in zip(formulas, valence_electron_concentrations):
+    print(f"{formula:<25} {vec:>10.2f}")
smact/builder.py (3)

1-8: Consider enhancing module documentation

The module docstring could benefit from more detailed documentation, including:

  • Comprehensive overview of available lattice builders
  • Examples of usage
  • References to crystallographic literature

Track TODOs in issue system

The TODO comments should be tracked in the project's issue system for better visibility and tracking.

Would you like me to create GitHub issues for these TODO items?


16-35: Improve docstring completeness

While the docstring improvements are good, consider adding:

  • Example usage
  • Expected format for input parameters
  • Explanation of the crystallographic structure being built

Consider making oxidation states configurable

The hardcoded oxidation states [[2]] + [[4]] + [[-2]] * 3 might be too restrictive for some use cases.

Consider adding an optional parameter for oxidation states:

-def cubic_perovskite(species, cell_par=None, repetitions=None):
+def cubic_perovskite(species, cell_par=None, repetitions=None, oxidation_states=None):

81-81: Document oxidation state assumptions

Add a comment explaining why this comprehensive range of oxidation states was chosen and how it relates to typical wurtzite structures.

docs/examples/validity.ipynb (4)

9-18: Consider enhancing the documentation with examples of invalid cases

The markdown explanation is clear and well-structured. To make it more comprehensive, consider adding examples of:

  • What makes a composition charge-imbalanced
  • What electronegativity differences are considered inappropriate
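To make "charge-imbalanced" concrete, a minimal sketch of the charge-neutrality part of such a test is shown below. This is an illustration, not the actual `smact.screening.smact_validity` implementation: a composition passes if some combination of allowed oxidation states sums to zero total charge.

```python
from itertools import product

# Toy oxidation-state table for illustration only.
OX_STATES = {"Na": [1], "Cl": [-1], "Mg": [2], "K": [1]}


def charge_balanced(stoich: dict[str, int]) -> bool:
    """Return True if any combination of oxidation states is charge-neutral."""
    elements = list(stoich)
    for states in product(*(OX_STATES[el] for el in elements)):
        if sum(q * stoich[el] for q, el in zip(states, elements)) == 0:
            return True
    return False


print(charge_balanced({"Na": 1, "Cl": 1}))  # True: +1 - 1 = 0
print(charge_balanced({"Mg": 1, "Cl": 1}))  # False: +2 - 1 != 0
```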

44-46: Remove unused import

The re module is imported but not used in the code.

 import pandas as pd
 from smact.screening import smact_validity
-import re

48-49: Add comments explaining valid/invalid compounds

It would be helpful to add comments explaining why certain compounds in the dataset are expected to be valid or invalid.

 # Dataset
-data = ["NaCl", "K2O", "Fe2O3", "C12H22O11", "CsO8"]
+# Dataset with both valid and invalid compounds
+data = [
+    "NaCl",      # Valid: 1:1 ionic compound
+    "K2O",       # Valid: charge-balanced oxide
+    "Fe2O3",     # Valid: common iron oxide
+    "C12H22O11", # Invalid: organic compound
+    "CsO8"       # Invalid: unusual oxidation state
+]

60-62: Use more descriptive variable names for clarity

The variable name 'fractions' could be more descriptive to better represent its content.

-fractions = df["smact_allowed"].value_counts(normalize=True)
-print(fractions)
+validity_proportions = df["smact_allowed"].value_counts(normalize=True)
+print("Proportion of valid vs invalid compounds:")
+print(validity_proportions)
docs/examples/lists.ipynb (3)

8-15: Consider clarifying the Sphinx directive usage

The :mod: directive might not render correctly in all Markdown viewers. Consider either:

  • Adding a note about this being Sphinx-compatible documentation, or
  • Using standard Markdown formatting for better compatibility across different platforms

34-39: Consider adding error handling and docstring examples

Whilst the code demonstrates basic functionality, consider:

  1. Adding error handling for invalid atomic number ranges
  2. Including example outputs in the docstring
  3. Adding a brief comment explaining why the range 13-27 was chosen (these correspond to Al through Co)
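The first suggestion could be sketched with a hypothetical validation helper like the one below (`validate_atomic_range` is not part of smact; it only illustrates the error handling being proposed):

```python
MAX_Z = 118  # heaviest element currently on the periodic table


def validate_atomic_range(start: int, end: int) -> None:
    """Raise ValueError for atomic-number ranges that cannot be satisfied."""
    if not (1 <= start <= end <= MAX_Z):
        raise ValueError(
            f"Invalid atomic number range: {start}-{end} "
            f"(expected 1 <= start <= end <= {MAX_Z})"
        )


validate_atomic_range(13, 27)  # Al through Co: fine
try:
    validate_atomic_range(0, 200)
except ValueError as exc:
    print(exc)
```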

68-69: Consider enhancing the example output

Rather than printing the entire dictionary with memory addresses, consider demonstrating practical usage:

-print(element_list)
+# Show how to access specific elements
+print(f"Aluminium object: {element_list['Al'].name}")
+print(f"Iron oxidation states: {element_list['Fe'].oxidation_states}")
.gitignore (1)

227-227: Consider reorganising the .bak pattern

Whilst the addition is correct, consider:

  1. Moving this pattern under a more specific section (e.g., under "General" or create a "Backup Files" section)
  2. Adding a descriptive comment explaining what .bak files represent
-*.bak
+# Backup files
+*.bak
smact/distorter.py (5)

Line range hint 8-10: Address the TODO comment regarding equivalence checking

The TODO comment indicates a missing feature for checking equivalence between two Atoms objects. This functionality could be crucial for the module's completeness and reliability.

Would you like me to help implement this feature or create a GitHub issue to track this enhancement?
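A rough starting point for that equivalence check is sketched below using plain `(symbol, position)` data. A real comparison of two ASE `Atoms` objects would additionally need to handle the cell, periodic-boundary wrapping, and tolerance choices, so treat this purely as an outline:

```python
def structures_equal(a, b, tol=1e-5):
    """Treat two structures as equal if they contain the same (symbol, position)
    pairs within a tolerance, regardless of site ordering."""
    if len(a) != len(b):
        return False
    # Quantise coordinates so near-identical positions compare equal.
    key = lambda site: (site[0], tuple(round(x / tol) for x in site[1]))
    return sorted(map(key, a)) == sorted(map(key, b))


s1 = [("Ba", (0.0, 0.0, 0.0)), ("Ti", (0.5, 0.5, 0.5))]
s2 = [("Ti", (0.5, 0.5, 0.5)), ("Ba", (0.0, 0.0, 0.0))]
print(structures_equal(s1, s2))  # True: same sites, different order
```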


41-43: Enhance error handling in space group parsing

The space group parsing logic assumes a specific format from spglib. Consider adding error handling for malformed spglib output.

-    space_split = spacegroup.split()
-    spg_num = space_split[1].replace("(", "").replace(")", "")
-    return Spacegroup(int(spg_num))
+    try:
+        space_split = spacegroup.split()
+        spg_num = space_split[1].replace("(", "").replace(")", "")
+        return Spacegroup(int(spg_num))
+    except (IndexError, ValueError) as e:
+        raise ValueError(f"Failed to parse space group from '{spacegroup}': {e}")

68-75: Consider refactoring the site comparison logic

The nested loops and multiple conditions make the code complex and potentially inefficient. Consider extracting the site comparison logic into a separate helper function.

+    def is_equivalent_to_any_site(site, inequiv_site, sg):
+        """Helper function to check if a site is equivalent to another site or its symmetry equivalents."""
+        if smact.are_eq(site, inequiv_site) is True:
+            return True
+        equiv_sites, _ = sg.equivalent_sites(inequiv_site)
+        return any(smact.are_eq(site, equiv_site) is True for equiv_site in equiv_sites)
+
     for site in sub_lattice:
         new_site = True
         if len(inequivalent_sites) > 0:
-            for inequiv_site in inequivalent_sites:
-                if smact.are_eq(site, inequiv_site) is True:
-                    new_site = False
-                equiv_inequiv_sites, _ = sg.equivalent_sites(inequiv_site)
-                for equiv_inequiv_site in equiv_inequiv_sites:
-                    if smact.are_eq(site, equiv_inequiv_site) is True:
-                        new_site = False
+            new_site = not any(is_equivalent_to_any_site(site, inequiv_site, sg)
+                             for inequiv_site in inequivalent_sites)

Line range hint 100-102: Move the deepcopy explanation to the docstring

The important note about deepcopy usage should be in the function's docstring for better visibility in documentation.

     """
     Change atomic species on lattice site to new_species.

+    Note:
+        Uses deepcopy to ensure changes don't affect the original lattice object.
+
     Args:
     ----
         lattice (ASE crystal): Input lattice
         site (list): Cartesian coordinates of the substitution site
         new_species (str): New species

     Returns:
     -------
         lattice
     """
-    # NBNBNBNB  It is necessary to use deepcopy for objects, otherwise changes applied to a clone
-    # will also apply to the parent object.

Line range hint 126-134: Use enumerate for more Pythonic iteration

The current implementation uses a manual counter. Consider using enumerate for a more idiomatic Python approach.

-    i = 0
     atomic_labels = lattice.get_chemical_symbols()
     positions = lattice.get_scaled_positions()
-    for atom in atomic_labels:
+    for i, atom in enumerate(atomic_labels):
         if atom == symbol:
             sub_lattice.append(positions[i])
-        i = i + 1
docs/examples/distorter.ipynb (2)

7-11: Documentation enhancement suggestion

Consider adding a brief explanation of:

  • The expected outcome of the substitutions
  • The practical implications of these substitutions
  • Any limitations or considerations users should be aware of

64-72: Enhance the substitution loop with progress tracking

For larger structures with many substitution sites, consider adding a progress bar and memory management.

+from tqdm import tqdm
 substituted_structures = []

-for i, inequivalent_site in enumerate(inequivalent_sites):
+for i, inequivalent_site in enumerate(tqdm(inequivalent_sites, desc="Performing substitutions")):
     print(f"Substitution {i + 1}: Replacing Ba with Sr at site {inequivalent_site}")
     distorted = distorter.make_substitution(single_substitution, inequivalent_site, "Sr")
     substituted_structures.append(distorted)  # Store the structure for visualization
-    print("-" * 40)
docs/examples/doper.ipynb (6)

8-21: Consider enhancing the documentation with examples

The introduction is well-structured, but could be more helpful with:

  • An example of the input tuple format (e.g., ("Ti4+", "O2-"))
  • A sample output dictionary showing the actual format of results
  • The default value for num_dopants

31-44: Consider adding error handling examples

The basic usage example is clear, but would benefit from:

  • Examples of handling invalid input formats
  • Demonstration of error cases (e.g., unsupported oxidation states)
  • Sample output in the comments showing what users should expect

52-67: Enhance table presentation documentation

Consider adding:

  • Description of the table structure and columns
  • Example of the table output format
  • Options for customising the table display (if available)

74-86: Expand multicomponent systems documentation

The section would be more helpful with:

  • Explanation of how the algorithm handles multiple components
  • Any limitations or considerations for larger systems
  • Performance implications for multicomponent systems

93-106: Enhance visualisation documentation

Please consider adding:

  • Explanation of how to interpret the heatmap
  • Description of the axes and colour scale meaning
  • Additional customisation parameters beyond cmap
  • Link to specific matplotlib documentation for colour maps

113-138: Add practical guidance for alternative metrics

Consider enhancing this section with:

  • Comparison of results between different metrics
  • Guidelines for choosing between probability and similarity metrics
  • Performance implications of using skipspecies embeddings
smact/properties.py (1)

Line range hint 35-55: Consider standardising the docstring parameter descriptions

While the type hint updates are good, consider making the parameter descriptions more consistent. For example, the distance parameter description could be formatted similarly to other parameters.

-        distance (float or str): Nuclear separation between anion and cation
-                i.e. sum of ionic radii
+        distance (float or str): Nuclear separation between anion and cation,
+            i.e. sum of ionic radii
smact/oxidation_states.py (3)

Line range hint 39-50: Consider adding validation for probability data

Whilst the probability table loading logic is sound, consider adding validation to ensure:

  • Probability values are within [0,1]
  • The JSON file exists and is properly formatted
  • The loaded data contains the expected structure

This would improve robustness and provide clearer error messages.


Line range hint 114-166: Consider splitting input handling into separate method

The compound_probability method handles multiple responsibilities:

  1. Input validation and conversion
  2. Species arrangement
  3. Probability calculation

Consider extracting the input handling (lines 131-150) into a separate private method for improved readability and maintainability.

Example refactor:

def _convert_to_smact_species(self, input_structure) -> list[Species]:
    if isinstance(input_structure, list):
        if all(isinstance(i, Species) for i in input_structure):
            return input_structure
        if all(isinstance(i, pmgSpecies) for i in input_structure):
            return [Species(i.symbol, i.oxi_state) for i in input_structure]
        raise TypeError("Input requires a list of SMACT or Pymatgen species.")
    
    if isinstance(input_structure, Structure):
        species = input_structure.species
        if not all(isinstance(i, pmgSpecies) for i in species):
            raise TypeError("Structure must have oxidation states.")
        return [
            Species(
                get_el_sp(i.species_string).symbol,
                get_el_sp(i.species_string).oxi_state,
            )
            for i in input_structure
        ]
    
    raise TypeError("Input requires a list of SMACT or Pymatgen Species or a Structure.")

137-138: Enhance error messages with more context

The error messages could be more informative by including details about the actual types received.

-                raise TypeError("Input requires a list of SMACT or Pymatgen species.")
+                raise TypeError(f"Input requires a list of SMACT or Pymatgen species, received: {[type(i).__name__ for i in structure]}")
smact/lattice_parameters.py (3)

72-72: Consider extracting magic numbers as named constants

The value 0.335 (sin(109.6-90)) would be clearer as a named constant with a descriptive name, improving maintainability.

+ WURTZITE_ANGLE_FACTOR = 0.335  # sin(109.6-90)
- c = (shannon_radius[0] + shannon_radius[1]) * (2 + 2 * 0.335)
+ c = (shannon_radius[0] + shannon_radius[1]) * (2 + 2 * WURTZITE_ANGLE_FACTOR)

Line range hint 79-310: Standardise parameter naming across functions

There's inconsistent use of shannon_radius and covalent_radius parameters across similar functions. Consider standardising the parameter naming convention for better consistency.


326-338: Enhance clarity of trigonometric calculations

Consider these improvements:

  1. Extract angle values as named constants
  2. Add comments explaining the crystallographic significance of these angles

+ # Crystallographic angles for stuffed wurtzite structure
+ ANGLE_ALPHA = 19.5  # degrees
+ ANGLE_BETA = 70.5   # degrees
+ ANGLE_GAMMA = 120   # degrees
+ ANGLE_DELTA = 30    # degrees

  def stuffed_wurtzite(shannon_radii):
      rac = shannon_radii[2] + shannon_radii[1]
-     x = rac * np.sin(np.radians(19.5))
+     x = rac * np.sin(np.radians(ANGLE_ALPHA))
-     y = rac * np.sin(np.radians(70.5))
+     y = rac * np.sin(np.radians(ANGLE_BETA))
      c = 2 * rac + x
-     a = y * np.sin(np.radians(120)) / np.sin(np.radians(30))
+     a = y * np.sin(np.radians(ANGLE_GAMMA)) / np.sin(np.radians(ANGLE_DELTA))
docs/examples/oxidation_states.ipynb (2)

18-22: Consider adding version requirements documentation

The notebook imports specific components from pymatgen and smact. It would be helpful to document the minimum required versions of these dependencies.
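One lightweight way to surface this in the notebook itself is a stdlib importlib.metadata check; a sketch (the package names printed at the end are placeholders, not the notebook's confirmed requirements):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a mapping of package name -> installed version string,
    or None when the distribution is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Placeholder names; substitute the notebook's real dependencies.
print(installed_versions(["pymatgen", "smact"]))
```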


66-70: Improve readability of species list output

Consider formatting the species list for better readability, perhaps by sorting and grouping similar species or using a more structured output format like a table.

-print(
-    f"The species included in the probability table for the oxidation states model are show below \n{included_species}"
-)
+# Sort and format the species list
+formatted_species = sorted(included_species)
+print("Species included in the oxidation states probability table:")
+for i, species in enumerate(formatted_species, 1):
+    print(f"{i:3d}. {species:8s}", end="\t")
+    if i % 5 == 0:  # New line every 5 species
+        print()
pyproject.toml (3)

79-82: Improve formatting of crystal_space dependencies

The current formatting of the crystal_space group could be improved for consistency. Also note that "kaledio" is likely a typo for "kaleido", the Plotly static-image export package, so the dependency name is worth verifying.

-crystal_space = ["mp-api",
-"ElementEmbeddings",
-"umap-learn==0.5.3",
-"kaledio"]
+crystal_space = [
+    "mp-api",
+    "ElementEmbeddings",
+    "umap-learn==0.5.3",
+    "kaledio",
+]

93-158: Consider reducing the number of ignored lint rules

There are numerous ignored rules with TODO comments indicating future fixes. Consider:

  1. Creating tickets to track the technical debt for rules marked with TODO
  2. Reviewing if all rule exclusions are still necessary
  3. Documenting the rationale for long-term rule exclusions

Key areas to address:

  • E722: Bare except clauses
  • PLR2004: Magic value comparisons
  • PLW0603: Global statement usage
  • TRY002: Custom exceptions

Would you like me to help create GitHub issues to track these improvements?
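One way to keep the debt visible in the meantime is to scope exclusions with Ruff's per-file-ignores table rather than ignoring rules globally; a sketch (the file paths and rule codes are illustrative):

```toml
[tool.ruff.lint.per-file-ignores]
# Keep legacy exclusions local so new modules get the full rule set.
"smact/lattice_parameters.py" = ["PLR2004"]  # magic values, cleanup tracked separately
"smact/tests/*" = ["PLR2004"]
```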


170-173: Consider enabling more type checking features

Type checking is currently minimal with typeCheckingMode = "off". Consider gradually enabling more type checking features to catch potential type-related issues early in development.
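A staged rollout could look like this in pyproject.toml (keys follow Pyright's documented settings; whether this project actually uses Pyright should be confirmed first):

```toml
[tool.pyright]
# Step 1: move from "off" to "basic" to catch undefined names and bad calls.
typeCheckingMode = "basic"
# Step 2: temporarily silence the noisiest diagnostics while annotating.
reportMissingTypeStubs = false
# Step 3: re-enable diagnostics module by module, then consider "strict".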

docs/tutorials/crystal_space.ipynb (1)

70-81: Improve error handling in Colab detection

The current implementation could be enhanced for better error handling and consistency.

Consider this improvement:

# Install the required packages
try:
-    import google.colab
-
-    IN_COLAB = True
-except:
-    IN_COLAB = False
+    import google.colab
+    is_colab = True
+except ImportError:
+    is_colab = False

-if IN_COLAB:
+if is_colab:
    !pip install git+https://github.com/WMD-group/SMACT.git --quiet
    !pip install -q mp_api
docs/tutorials/crystal_space_visualisation.ipynb (3)

59-71: Consider pinning all package versions for reproducibility

While the Colab environment detection is well-implemented, consider pinning versions for all packages to ensure reproducibility:

-    !pip install git+https://github.com/WMD-group/SMACT.git --quiet
+    !pip install git+https://github.com/WMD-group/[email protected] --quiet
-    !pip install -q ElementEmbeddings
+    !pip install -q ElementEmbeddings==0.4.0

184-184: Enhance error handling in embedding computation

Consider improving the error handling to:

  1. Log specific exceptions for debugging
  2. Provide more informative error messages

     try:
         compositional_embedding = CompositionalEmbedding(f, embedding=embedding)
         embeddings.append(compositional_embedding.feature_vector(stats=stats))
     except Exception as e:
-        # the exception is raised when the embedding doesn't support the element
+        # Log specific exception for debugging
+        print(f"Warning: Could not compute embedding for {f}: {str(e)}")
         embeddings.append(np.full(embedding_dim, np.nan))

Also applies to: 190-191


301-301: Add error handling for file operations

Consider adding error handling for file operations to gracefully handle potential issues:

-        pd.DataFrame(reduced_embeddings, index=embeddings.index).to_pickle(save_dir / file_name)
+        try:
+            pd.DataFrame(reduced_embeddings, index=embeddings.index).to_pickle(save_dir / file_name)
+        except Exception as e:
+            print(f"Error saving reduced embeddings: {str(e)}")
+            raise

Also applies to: 319-319

README.md (7)

24-24: Fix quotation attribution format

The quote appears to be from Ronald Coase, but the formatting could be improved for clarity.

-_If you torture the data enough, nature will always confess_ - Roland Coase (from 'How should economists choose?')
+_"If you torture the data enough, nature will always confess"_ — Ronald Coase (from 'How should economists choose?')
🧰 Tools
🪛 LanguageTool

[misspelling] ~24-~24: Possible spelling mistake found.
Context: ...a enough, nature will always confess_ - Roland Coase (from 'How should economists choose?') ...

(EN_MULTITOKEN_SPELLING_TWO)


35-35: Fix typo in documentation reference

There's a typo in the word "used".

-The best place to start is looking at [the docs](https://smact.readthedocs.io/en/latest/), which highlight some simple examples of how these classes and functions can be usede
+The best place to start is looking at [the docs](https://smact.readthedocs.io/en/latest/), which highlight some simple examples of how these classes and functions can be used

46-46: Simplify wording for better readability

The phrase "in order to" can be simplified.

-Further filters can be applied to generated lists of compositions in order to screen for particular properties.
+Further filters can be applied to generated lists of compositions to screen for particular properties.
🧰 Tools
🪛 LanguageTool

[style] ~46-~46: Consider a shorter alternative to avoid wordiness.
Context: ...lied to generated lists of compositions in order to screen for particular properties. These...

(IN_ORDER_TO_PREMIUM)


79-79: Fix grammatical error in dopant prediction description

The article "a" doesn't match with the plural noun "collections".

-  - **dopant_prediction**: A submodule which contains a collections of tools for predicting dopants.
+  - **dopant_prediction**: A submodule which contains a collection of tools for predicting dopants.
🧰 Tools
🪛 LanguageTool

[grammar] ~79-~79: It looks like ‘collections’ doesn’t match ‘a’. Did you mean “a collection” or just “collections”?
Context: ...rediction**: A submodule which contains a collections of tools for predicting dopants. ## Re...

(A_NNS_IN)


89-89: Add missing commas for better readability

Several sentences would benefit from additional commas for clarity.

-The latest stable release can be installed via pip which will automatically set up other Python packages as required
+The latest stable release can be installed via pip, which will automatically set up other Python packages as required

-For developer installation SMACT can be installed from a copy of the source
+For developer installation, SMACT can be installed from a copy of the source

Also applies to: 101-101

🧰 Tools
🪛 LanguageTool

[uncategorized] ~89-~89: Possible missing comma found.
Context: ...est stable release can be installed via pip which will automatically set up other P...

(AI_HYDRA_LEO_MISSING_COMMA)


113-115: Use British English spelling for 'licence'

As per British English requirements, 'license' should be spelled as 'licence' when used as a noun.

-## License and attribution
+## Licence and attribution

-Python code and original data tables are licensed under the MIT License.
+Python code and original data tables are licensed under the MIT Licence.
🧰 Tools
🪛 LanguageTool

[locale-violation] ~113-~113: License must be spelled with a “c” when used as a noun in British English. Use “licence”.
Context: ... immediately reflected on the PATH. ## License and attribution Python code and origin...

(LICENCE_LICENSE_NOUN_SINGULAR)


[locale-violation] ~115-~115: License must be spelled with a “c” when used as a noun in British English. Use “licence”.
Context: ... data tables are licensed under the MIT License. ## Development notes ### Bugs, featu...

(LICENCE_LICENSE_NOUN_SINGULAR)


121-121: Fix subject-verb agreement in documentation note

The pronoun 'they' requires a different verb form.

-Please use the [Issue Tracker](https://github.com/WMD-group/smact/issues) to report bugs or request features in the first instance. While we hope that most questions can be answered by searching [the docs](https://smact.readthedocs.io/en/latest/), we welcome new questions on the issue tracker, especially if they helps us improve the docs!
+Please use the [Issue Tracker](https://github.com/WMD-group/smact/issues) to report bugs or request features in the first instance. While we hope that most questions can be answered by searching [the docs](https://smact.readthedocs.io/en/latest/), we welcome new questions on the issue tracker, especially if they help us improve the docs!
🧰 Tools
🪛 LanguageTool

[grammar] ~121-~121: The pronoun ‘they’ must be used with a non-third-person form of a verb.
Context: ...n the issue tracker, especially if they helps us improve the docs! For other queries ...

(NON3PRS_VERB)

docs/conf.py (1)

37-37: LGTM! Consider adding configuration comments

The MyST-NB extension and its configuration are well-structured. The disabled notebook execution is a sensible default for build performance.

Consider adding a brief comment explaining why notebook execution is disabled:

 jupyter_execute_notebooks = "off"
+# Notebooks are not executed during builds to ensure fast documentation generation

Also applies to: 40-48

docs/tutorials/smact_generation_of_solar_oxides.ipynb (3)

42-43: Consider improving package installation robustness

The package installation could be more reliable with these improvements:

  • Pin package versions to ensure reproducibility
  • Add error handling for failed installations
  • Install SMACT from a specific release tag instead of the main branch

-    !pip install git+https://github.com/WMD-group/SMACT.git --quiet
-    !pip install smact pandas pymatgen matminer
+    try:
+        !pip install git+https://github.com/WMD-group/[email protected] --quiet
+        !pip install smact==2.1.0 pandas==2.0.0 pymatgen==2023.5.31 matminer==0.7.3
+    except Exception as e:
+        print(f"Failed to install required packages: {e}")
+        raise

71-84: Add import error handling with helpful messages

Consider adding try-except blocks for third-party imports to provide helpful error messages when dependencies are missing.

 # Third-party imports
-import pandas as pd
-from matminer.featurizers import composition as cf
-from matminer.featurizers.base import MultipleFeaturizer
-from matminer.featurizers.conversions import StrToComposition
-from pymatgen.core import Composition
+try:
+    import pandas as pd
+    from matminer.featurizers import composition as cf
+    from matminer.featurizers.base import MultipleFeaturizer
+    from matminer.featurizers.conversions import StrToComposition
+    from pymatgen.core import Composition
+except ImportError as e:
+    print(f"Please install required packages: {e}")
+    raise

117-152: Consider moving excluded elements to a configuration file

The excluded elements list would be better maintained in a separate configuration file with documentation explaining the exclusion criteria for each element.

Consider creating a config.py file:

EXCLUDED_ELEMENTS = {
    "H": "Hydrogen forms volatile compounds",
    "He": "Noble gas, chemically inert",
    # ... document reasons for other exclusions
}
examples/Structure_Prediction/Li-Garnet_Generator.ipynb (5)

Line range hint 32-40: Consider using set operations for element filtering

The current implementation of filtering unwanted elements could be more efficient using set operations.

-dont_want = ["He", "Ne", "Ar", "Kr", "Xe", "Pm", "Tc"]
-for unwanted in dont_want:
-    symbols_list.remove(unwanted)
+dont_want = set(["He", "Ne", "Ar", "Kr", "Xe", "Pm", "Tc"])
+symbols_list = [symbol for symbol in symbols_list if symbol not in dont_want]

115-115: Consider using list comprehension for better readability

The list comprehension is a good choice here, but consider breaking it into multiple lines for better readability when the expression becomes complex.

-ABCD_pairs = [(x, y, z, a) for x in A_els for y in B_els for z in C_els for a in D_els]
+ABCD_pairs = [
+    (x, y, z, a)
+    for x in A_els
+    for y in B_els
+    for z in C_els
+    for a in D_els
+]

282-282: Potential performance improvement in sustainability calculation

The sustainability calculation constructs a new Composition and recomputes the weight fraction on every loop iteration. Build the composition and its weight-fraction lookup once, before the loop over elements, then index into it inside the loop.

+    comp_obj = Composition(comp)
+    wt_fractions = {el: comp_obj.get_wt_fraction(el) for el in comp_obj.elements}
-        sus_factor += Composition(comp).get_wt_fraction(i) * smact.Element(i.symbol).HHI_r
+        sus_factor += wt_fractions[i] * smact.Element(i.symbol).HHI_r

Line range hint 305-319: Consider vectorised operations for DataFrame preparation

The current implementation uses explicit loops to prepare data for the DataFrame. Consider using vectorised operations or list comprehensions for better performance.

-species = []
-A = []
-B = []
-C = []
-D = []
-for i in range(len(flat_list)):
-    species.append(pretty_formulas[i])
-    A.append(unparse_spec((flat_list[i][0][0], flat_list[i][1][0])))
-    B.append(unparse_spec((flat_list[i][0][1], flat_list[i][1][1])))
-    C.append(unparse_spec((flat_list[i][0][2], flat_list[i][1][2])))
-    D.append(unparse_spec((flat_list[i][0][3], flat_list[i][1][3])))
+species = pretty_formulas
+A = [unparse_spec((fl[0][0], fl[1][0])) for fl in flat_list]
+B = [unparse_spec((fl[0][1], fl[1][1])) for fl in flat_list]
+C = [unparse_spec((fl[0][2], fl[1][2])) for fl in flat_list]
+D = [unparse_spec((fl[0][3], fl[1][3])) for fl in flat_list]

Line range hint 529-529: Consider adding index to CSV export

When exporting data that might be used for analysis or tracking, consider including the index in the CSV file.

-df.to_csv("./Li-Garnet_Comps_sus.csv", index=False)
+df.to_csv("./Li-Garnet_Comps_sus.csv", index=True, index_label="id")
smact/dopant_prediction/doper.py (7)

33-35: Consider enhancing parameter validation

The type hints are well-updated, but consider adding validation for original_species to ensure it's not empty and contains valid chemical species.


54-58: Improve error message clarity

Consider enhancing the error messages to be more descriptive:

-            raise ValueError("Only one of filepath or embedding should be provided")
+            raise ValueError("Cannot specify both filepath and embedding - please provide only one")
-            raise ValueError(f"Embedding {embedding} is not supported")
+            raise ValueError(f"Unsupported embedding type '{embedding}'. Currently only 'skipspecies' is supported")

Line range hint 85-89: Consider more robust data structure validation

The assertion len(dopants) == 4 suggests a rigid data structure. Consider:

  1. Using a named tuple or dataclass for clearer data structure
  2. Adding more descriptive error handling instead of assertions
from dataclasses import dataclass

@dataclass
class DopantResult:
    selected_site: str
    original_specie: str
    sub_prob: float
    selectivity: float

160-177: Consider restructuring the documentation

While the docstring is comprehensive, consider:

  1. Moving the lengthy example to a separate documentation file or Jupyter notebook
  2. Adding a shorter, more focused example in the docstring

383-390: Enhance error handling in plotting function

Replace assertion with more user-friendly error handling:

-        assert self.results, "Dopants are not calculated. Run get_dopants first."
+        if not self.results:
+            raise ValueError("No dopants data available. Please run get_dopants() first.")

Line range hint 404-431: Add support for environments without ANSI colour

Consider adding a check for colour support or making it configurable:

def _supports_color(self):
    """Check if the terminal supports ANSI colour codes"""
    import sys
    return hasattr(sys.stdout, 'isatty') and sys.stdout.isatty()

def _colorize(self, text, color_code):
    """Conditionally apply ANSI colour codes"""
    if self._supports_color():
        return f"\033[{color_code}m{text}\033[0m"
    return text

Line range hint 1-431: Overall well-structured scientific computing module

The code demonstrates good scientific computing practices with proper type hints and documentation. While there are opportunities for improvement in error handling and code organisation, the core functionality appears robust and well-implemented.

Consider for future iterations:

  1. Adding logging for debugging complex calculations
  2. Implementing caching for expensive computations
  3. Adding parallel processing support for large datasets
smact/__init__.py (1)

443-444: Consider simplifying the nested loop logic.

The nested loop for lattice comparison could be simplified using list comprehension or any() for better readability and potentially better performance.

-    i = 0
-    for site1 in lattice1:
-        for site2 in lattice2:
-            if site1.symbol == site2.symbol and are_eq(site1.position, site2.position, tolerance=tolerance):
-                i += 1
-    if i == len(lattice1):
-        lattices_are_same = True
+    matches = sum(
+        1 for site1 in lattice1
+        if any(site1.symbol == site2.symbol and are_eq(site1.position, site2.position, tolerance=tolerance)
+               for site2 in lattice2)
+    )
+    lattices_are_same = matches == len(lattice1)
smact/data_loader.py (1)

Line range hint 946-952: Consider using pandas more efficiently.

The current implementation iterates over the CSV row by row with iterrows. Reading the element symbols as the index and building the dictionary in one pass avoids per-row overhead while preserving the "NValence" key and int type expected by callers.

-        df = pd.read_csv(os.path.join(data_directory, "element_valence_modified.csv"))
-        for _index, row in df.iterrows():
-            key = row.iloc[0]
-            dataset = {"NValence": int(row.iloc[1])}
-            _element_valence_data[key] = dataset
+        df = pd.read_csv(
+            os.path.join(data_directory, "element_valence_modified.csv"),
+            index_col=0,
+        )
+        _element_valence_data = {
+            element: {"NValence": int(n)} for element, n in df.iloc[:, 0].items()
+        }
examples/Structure_Prediction/Garnet_Database.ipynb (5)

16-25: Consider adding type hints and error handling for the imports.

The imports look clean, but consider:

  1. Adding error handling for import failures
  2. Using type hints with the annotations import
from __future__ import annotations

try:
    import pandas as pd
    from pymatgen.analysis import structure_matcher
    from pymatgen.ext.matproj import MPRester
    from smact.structure_prediction import database
except ImportError as e:
    raise ImportError(f"Required package not found: {e}")

Line range hint 322-377: Enhance structure matching implementation with progress tracking.

The structure matching implementation could benefit from:

  1. Progress tracking for long iterations
  2. Memory optimisation for large datasets
from tqdm import tqdm

fitted_data = []
for i in tqdm(data, desc="Matching structures"):
    if SM.fit_anonymous(i["structure"], known_garnet_structure):
        fitted_data.append(i)

Line range hint 443-570: Handle pymatgen warnings more gracefully.

The code generates multiple UserWarnings about electronegativity. Consider suppressing these warnings or handling them more elegantly.

import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=UserWarning, module="pymatgen.core.periodic_table")
    # Your existing query code here

Line range hint 746-754: Add error handling for database operations.

The database creation and table addition should include error handling to prevent crashes.

try:
    DB = database.StructureDB("Garnets.db")
    for table in ["Garnets", "Experimental", "Theoretical"]:
        DB.add_table(table)
except Exception as e:
    print(f"Database operation failed: {e}")
    raise

Line range hint 761-840: Optimise structure parsing and database insertion.

The current implementation could be more efficient:

  1. Consider batch processing for database insertions
  2. Add error handling for failed structure parsing
  3. Use list comprehension for cleaner code
def safe_parse_structure(item):
    try:
        return database.parse_mprest(item)
    except Exception as e:
        print(f"Failed to parse structure: {e}")
        return None

# Parse structures with error handling
structs = [s for s in (safe_parse_structure(i) for i in new_fitted_data) if s is not None]
exp_structs = [s for s in (safe_parse_structure(i) for i in new_experimental_list) if s is not None]
theo_structs = [s for s in (safe_parse_structure(i) for i in new_theoretical_list) if s is not None]

# Add structures with transaction handling
try:
    DB.add_structs(structs, "Garnets")
    DB.add_structs(exp_structs, "Experimental")
    DB.add_structs(theo_structs, "Theoretical")
except Exception as e:
    print(f"Failed to add structures to database: {e}")
    raise
docs/tutorials/oxidation_states.ipynb (3)

29-52: Consider adding a comment explaining the multiprocess import

The installation and import section is well-structured, but it would be helpful to add a brief comment explaining why the multiprocess package is needed (e.g., for parallel processing of combinations).


Line range hint 73-162: Consider improving code maintainability

The composition generation code could benefit from:

  1. Adding type hints to improve code readability and maintainability
  2. Breaking down the smact_filter function into smaller, more focused functions

Example refactor for type hints:

-def parse_species(species):
+def parse_species(species: str) -> tuple[str, int]:
    match = re.match(r"([A-Za-z]+)([0-9.]+)", species)
    return match.group(1), int(match.group(2))

Line range hint 393-450: Consider enhancing plot readability

The visualisation code could benefit from:

  1. Adding a title to the plot
  2. Including grid lines for better readability
  3. Adding a legend explaining the markers
 fig, ax = plt.subplots()
 
 ax.scatter(x=thresh, y=num_compounds, marker="x", color="orange")
+ax.set_title("Number of Compounds vs Probability Threshold")
+ax.grid(True, linestyle='--', alpha=0.7)
+ax.legend(['Compounds'], loc='best')
 plt.xlabel("Threshold")
 plt.ylabel("Number of compounds")
docs/tutorials/smact_validity_of_GNoMe.ipynb (3)

Line range hint 17-37: Consider adding error handling for package installation.

While the installation code works, it would be more robust to:

  1. Add version checks
  2. Handle installation failures gracefully
 try:
     import google.colab
     IN_COLAB = True
-except:
+except ImportError:
+    print("Not running in Google Colab")
     IN_COLAB = False

 if IN_COLAB:
-    !pip install git+https://github.com/WMD-group/SMACT.git --quiet
+    try:
+        !pip install git+https://github.com/WMD-group/SMACT.git --quiet
+        print("Successfully installed SMACT")
+    except Exception as e:
+        print(f"Failed to install SMACT: {e}")

340-342: Consider adding progress reporting for long-running operation.

The parallel processing of the validity test could benefit from progress reporting.

-df["smact_valid"] = df["Composition"].parallel_apply(smact_validity)  # Alloys will pass the test
+# Add progress reporting
+total = len(df)
+print(f"Processing {total} compositions...")
+df["smact_valid"] = df["Composition"].parallel_apply(smact_validity)  # Alloys will pass the test
+print(f"Completed processing {total} compositions")

Line range hint 356-391: Enhance plot readability and accessibility.

The visualisation could be improved with:

  1. Better colour contrast
  2. Larger font sizes
  3. More descriptive title
-fig, ax = plt.subplots(figsize=(8, 6))
+# Set style for better readability (the 'seaborn' style name was renamed in matplotlib 3.6)
+plt.style.use('seaborn-v0_8')
+fig, ax = plt.subplots(figsize=(10, 6))

 bars = ax.barh(bar_labels, counts)
 ax.bar_label(bars)
 ax.set_xlabel("Number of materials")
-ax.set_title("GNoME analysis")
+ax.set_title("Charge-Neutral Materials in GNoME Database", pad=20)
+
+# Increase font sizes
+plt.rcParams.update({'font.size': 12})
+ax.title.set_size(14)
+
 plt.xticks(rotation=45)
 ax.set_xlim(right=max(counts) * 1.1)  # Adjust x-axis limit based on data
 plt.tight_layout()
examples/Structure_Prediction/Li-Garnets_SP-Pym-new.ipynb (1)

933-935: Enhance the visualisation for better clarity

Consider these improvements to make the plot more informative:

  • Add axis labels to clarify what the probability represents
  • Include a title describing the distribution
  • Either remove or document why the save command is commented out
 plt.figure(figsize=(8, 6))
 ax1 = sns.histplot(data=results, x="probability", hue="In DB?", multiple="stack")
+ax1.set_xlabel("Prediction Probability")
+ax1.set_ylabel("Count")
+plt.title("Distribution of Structure Prediction Probabilities")
-# ax2=sns.histplot(new_series, label="New Garnets")
-# plt.savefig("Prediction_Probability_Distribution_pym.png")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a71c8d2 and 51d78b0.

⛔ Files ignored due to path filters (10)
  • SMACT.png is excluded by !**/*.png
  • docs/smact_toc.jpg is excluded by !**/*.jpg
  • docs/tutorials/stable_materials_hull.csv is excluded by !**/*.csv
  • examples/Cation_mutation/cation_mutation.png is excluded by !**/*.png
  • examples/Inverse_perovskites/Perovskite.jpg is excluded by !**/*.jpg
  • examples/Structure_Prediction/_Materials Project .csv is excluded by !**/*.csv
  • smact/data/Covalent_radii.csv is excluded by !**/*.csv
  • smact/data/SSE.csv is excluded by !**/*.csv
  • smact/data/SSE_2015.csv is excluded by !**/*.csv
  • smact/data/SSE_Pauling.csv is excluded by !**/*.csv
📒 Files selected for processing (82)
  • .github/pull_request_template.md (1 hunks)
  • .github/release.yml (1 hunks)
  • .github/workflows/ci.yml (1 hunks)
  • .github/workflows/combine-prs.yml (4 hunks)
  • .github/workflows/publish-to-pypi.yml (1 hunks)
  • .gitignore (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • .readthedocs.yaml (1 hunks)
  • CONTRIBUTING.md (1 hunks)
  • README.md (4 hunks)
  • dev_docs/contribution_guidelines.md (1 hunks)
  • dev_docs/design_docs/StructurePredictor.md (1 hunks)
  • docs/conf.py (4 hunks)
  • docs/examples.rst (1 hunks)
  • docs/examples/compound_electroneg.ipynb (1 hunks)
  • docs/examples/distorter.ipynb (1 hunks)
  • docs/examples/doper.ipynb (1 hunks)
  • docs/examples/elements_and_species.ipynb (1 hunks)
  • docs/examples/filter.ipynb (1 hunks)
  • docs/examples/lists.ipynb (1 hunks)
  • docs/examples/oxidation_states.ipynb (1 hunks)
  • docs/examples/valence_electron_count.ipynb (1 hunks)
  • docs/examples/validity.ipynb (1 hunks)
  • docs/index.rst (2 hunks)
  • docs/introduction.rst (1 hunks)
  • docs/requirements.in (1 hunks)
  • docs/requirements.txt (2 hunks)
  • docs/smact.dopant_prediction.doper.rst (1 hunks)
  • docs/smact.dopant_prediction.rst (1 hunks)
  • docs/smact.properties.rst (1 hunks)
  • docs/smact.utils.rst (1 hunks)
  • docs/tutorials.rst (1 hunks)
  • docs/tutorials/crystal_space.ipynb (1 hunks)
  • docs/tutorials/crystal_space_visualisation.ipynb (13 hunks)
  • docs/tutorials/oxidation_states.ipynb (14 hunks)
  • docs/tutorials/smact_generation_of_solar_oxides.ipynb (1 hunks)
  • docs/tutorials/smact_validity_of_GNoMe.ipynb (6 hunks)
  • examples/Cation_mutation/cation_mutation.ipynb (0 hunks)
  • examples/Counting/ElementCombinationsParallel.py (0 hunks)
  • examples/Counting/Generate_compositions_lists.ipynb (0 hunks)
  • examples/Counting/Raw_combinations.ipynb (0 hunks)
  • examples/Crystal_Space/0_screening.ipynb (0 hunks)
  • examples/Inverse_perovskites/Inverse_formate_perovskites.ipynb (0 hunks)
  • examples/Oxidation_states/mp-540839_CsPbI3_oxi.cif (0 hunks)
  • examples/README.md (0 hunks)
  • examples/Simple_wrappers/Simple_wrappers.ipynb (0 hunks)
  • examples/Solar_oxides/SolarOxides.ipynb (0 hunks)
  • examples/Structure_Prediction/Garnet_Database.ipynb (16 hunks)
  • examples/Structure_Prediction/Li-Garnet_Generator.ipynb (19 hunks)
  • examples/Structure_Prediction/Li-Garnets_SP-Pym-new.ipynb (34 hunks)
  • examples/vec_example.py (0 hunks)
  • paper.bib (6 hunks)
  • paper.md (1 hunks)
  • pyproject.toml (1 hunks)
  • requirements-dev.txt (1 hunks)
  • requirements.txt (7 hunks)
  • setup.py (2 hunks)
  • smact/__init__.py (19 hunks)
  • smact/benchmarking/pymatgen_benchmark.py (2 hunks)
  • smact/benchmarking/smact_benchmark.py (2 hunks)
  • smact/benchmarking/utilities.py (2 hunks)
  • smact/builder.py (3 hunks)
  • smact/data/oxidation_state_probability_table.json (1 hunks)
  • smact/data/oxidation_states.txt (1 hunks)
  • smact/data/oxidation_states_SP.txt (1 hunks)
  • smact/data/oxidation_states_icsd.txt (1 hunks)
  • smact/data/oxidation_states_icsd24_common.txt (1 hunks)
  • smact/data/oxidation_states_icsd24_filtered.txt (1 hunks)
  • smact/data/oxidation_states_icsd24_raw.txt (1 hunks)
  • smact/data/oxidation_states_pmg.txt (1 hunks)
  • smact/data/oxidation_states_wiki.txt (1 hunks)
  • smact/data/oxidationstates.data (0 hunks)
  • smact/data/solid_properties.txt (1 hunks)
  • smact/data_loader.py (32 hunks)
  • smact/distorter.py (4 hunks)
  • smact/dopant_prediction/__init__.py (1 hunks)
  • smact/dopant_prediction/doper.py (10 hunks)
  • smact/lattice.py (4 hunks)
  • smact/lattice_parameters.py (12 hunks)
  • smact/mainpage.py (1 hunks)
  • smact/oxidation_states.py (6 hunks)
  • smact/properties.py (9 hunks)
⛔ Files not processed due to max files limit (27)
  • smact/screening.py
  • smact/structure_prediction/__init__.py
  • smact/structure_prediction/database.py
  • smact/structure_prediction/mutation.py
  • smact/structure_prediction/prediction.py
  • smact/structure_prediction/probability_models.py
  • smact/structure_prediction/structure.py
  • smact/structure_prediction/utilities.py
  • smact/tests/files/CaTiO3.json
  • smact/tests/files/Fe3O4.json
  • smact/tests/files/NaCl.json
  • smact/tests/files/mp-540839_CsPbI3_oxi.json
  • smact/tests/files/pymatgen_structure.json
  • smact/tests/files/test_icsd_oxidation_states_filter_1000.txt
  • smact/tests/files/test_icsd_oxidation_states_filter_1000_w_0_ox_state.txt
  • smact/tests/files/test_lambda_tab.json
  • smact/tests/test_core.py
  • smact/tests/test_doper.py
  • smact/tests/test_structure.py
  • smact/tests/test_utils.py
  • smact/utils/band_gap_simple.py
  • smact/utils/composition.py
  • smact/utils/crystal_space/__init__.py
  • smact/utils/crystal_space/download_compounds_with_mp_api.py
  • smact/utils/crystal_space/generate_composition_with_smact.py
  • smact/utils/crystal_space/plot_embedding.py
  • smact/utils/oxidation.py
💤 Files with no reviewable changes (12)
  • examples/Cation_mutation/cation_mutation.ipynb
  • examples/Counting/ElementCombinationsParallel.py
  • examples/Counting/Generate_compositions_lists.ipynb
  • examples/Counting/Raw_combinations.ipynb
  • examples/Crystal_Space/0_screening.ipynb
  • examples/Inverse_perovskites/Inverse_formate_perovskites.ipynb
  • examples/Oxidation_states/mp-540839_CsPbI3_oxi.cif
  • examples/README.md
  • examples/Simple_wrappers/Simple_wrappers.ipynb
  • examples/Solar_oxides/SolarOxides.ipynb
  • examples/vec_example.py
  • smact/data/oxidationstates.data
✅ Files skipped from review due to trivial changes (28)
  • .github/pull_request_template.md
  • .github/release.yml
  • .github/workflows/ci.yml
  • .github/workflows/combine-prs.yml
  • .github/workflows/publish-to-pypi.yml
  • CONTRIBUTING.md
  • docs/examples.rst
  • docs/examples/compound_electroneg.ipynb
  • docs/examples/elements_and_species.ipynb
  • docs/examples/filter.ipynb
  • docs/smact.dopant_prediction.doper.rst
  • docs/smact.dopant_prediction.rst
  • docs/smact.properties.rst
  • docs/smact.utils.rst
  • docs/tutorials.rst
  • paper.bib
  • requirements-dev.txt
  • requirements.txt
  • smact/data/oxidation_state_probability_table.json
  • smact/data/oxidation_states.txt
  • smact/data/oxidation_states_SP.txt
  • smact/data/oxidation_states_icsd.txt
  • smact/data/oxidation_states_icsd24_filtered.txt
  • smact/data/oxidation_states_pmg.txt
  • smact/data/oxidation_states_wiki.txt
  • smact/data/solid_properties.txt
  • smact/dopant_prediction/__init__.py
  • smact/mainpage.py
🧰 Additional context used
🪛 LanguageTool
README.md

[misspelling] ~24-~24: Possible spelling mistake found.
Context: ...a enough, nature will always confess_ - Roland Coase (from 'How should economists choose?') ...

(EN_MULTITOKEN_SPELLING_TWO)


[style] ~46-~46: Consider a shorter alternative to avoid wordiness.
Context: ...lied to generated lists of compositions in order to screen for particular properties. These...

(IN_ORDER_TO_PREMIUM)


[style] ~66-~66: Would you like to use the Oxford spelling “initialize”? The spelling ‘initialise’ is also correct.
Context: ...es the loading of external data used to initialise the core smact.Element and `smact.Spe...

(OXFORD_SPELLING_Z_NOT_S)


[grammar] ~79-~79: It looks like ‘collections’ doesn’t match ‘a’. Did you mean “a collection” or just “collections”?
Context: ...rediction**: A submodule which contains a collections of tools for predicting dopants. ## Re...

(A_NNS_IN)


[uncategorized] ~89-~89: Possible missing comma found.
Context: ...est stable release can be installed via pip which will automatically set up other P...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~101-~101: Possible missing comma found.
Context: ....com/WMD-group/SMACT.git For developer installation SMACT can be installed from a copy of t...

(AI_HYDRA_LEO_MISSING_COMMA)


[locale-violation] ~113-~113: License must be spelled with a “c” when used as a noun in British English. Use “licence”.
Context: ... immediately reflected on the PATH. ## License and attribution Python code and origin...

(LICENCE_LICENSE_NOUN_SINGULAR)


[locale-violation] ~115-~115: License must be spelled with a “c” when used as a noun in British English. Use “licence”.
Context: ... data tables are licensed under the MIT License. ## Development notes ### Bugs, featu...

(LICENCE_LICENSE_NOUN_SINGULAR)


[grammar] ~121-~121: The pronoun ‘they’ must be used with a non-third-person form of a verb.
Context: ...n the issue tracker, especially if they helps us improve the docs! For other queries ...

(NON3PRS_VERB)

dev_docs/contribution_guidelines.md

[style] ~11-~11: Would you like to use the Oxford spelling “summarized”? The spelling ‘summarised’ is also correct.
Context: ...he steps for a new piece of work can be summarised as follows: 1. Push up or create [an i...

(OXFORD_SPELLING_Z_NOT_S)


[style] ~18-~18: Consider removing “of” to be more concise
Context: ...are finished with the work, ensure that all of the unit tests pass on your own machine....

(ALL_OF_THE)


[uncategorized] ~25-~25: Possible missing comma found.
Context: ...eral overview of using pull requests on GitHub look [in the GitHub docs](https://help....

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~27-~27: Possible missing comma found.
Context: ...t-pull-requests). When creating a pull request you should: - Ensure that the title su...

(AI_HYDRA_LEO_MISSING_COMMA)

dev_docs/design_docs/StructurePredictor.md

[uncategorized] ~8-~8: A comma may be missing after the conjunctive/linking adverb ‘Therefore’.
Context: ..., in terms of interfacing with SMACT. Therefore we propose implement a similar method i...

(SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA)


[uncategorized] ~39-~39: The abbreviation “e.g.” (= for example) requires two periods.
Context: ... Generate a list of target compositions eg for Ba$_2$OF$_2$ `[Ba2+, O2-, F1-, (2, ...

(E_G)


[grammar] ~85-~85: Possible agreement error. The noun ‘database’ seems to be countable.
Context: ...lysis of structures - Simple two field database - Field 1: database key - Field 2: ...

(CD_NN)

paper.md

[style] ~42-~42: Would you like to use the Oxford spelling “revolutionizing”? The spelling ‘revolutionising’ is also correct.
Context: ... The paradigm of data-driven science is revolutionising the materials discovery process. There ...

(OXFORD_SPELLING_Z_NOT_S)


[uncategorized] ~44-~44: Possible missing comma found.
Context: ... where sets of element combinations are generated then screened using chemical filters. I...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~60-~60: A comma might be missing here.
Context: ...from many members of the Walsh research group including Andrew Morris, Timothy Gaunle...

(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)

smact/data/oxidation_states_icsd24_common.txt

[grammar] ~7-~7: It is only written without an accent when it names an animal or a letter; otherwise write «bé» (adverb) or «ve» (from the verb 'venir').
Context: ...st common non-zero values # H 1 He Li 1 Be 2 B 3 C 4 N -3 O -2 F -1 Ne Na 1 Mg 2 A...

(BE)


[grammar] ~36-~36: Possible confusion. Did you mean «als»?
Context: ...Mn 2 Fe 3 Co 2 Ni 2 Cu 2 Zn 2 Ga 3 Ge 4 As 5 Se -2 Br -1 Kr 2 Rb 1 Sr 2 Y 3 Zr 4 N...

(AS_ALS)


[grammar] ~42-~42: If it is a conjunction, it should be written «I».
Context: ... 3 Ge 4 As 5 Se -2 Br -1 Kr 2 Rb 1 Sr 2 Y 3 Zr 4 Nb 5 Mo 6 Tc 7 Ru 4 Rh 3 Pd 2 Ag...

(Y_CONJUNCIO)


[grammar] ~70-~70: Did you mean «o» (conjunction) instead of «ho» (pronoun)?
Context: ...Pr 3 Nd 3 Pm 3 Sm 3 Eu 3 Gd 3 Tb 3 Dy 3 Ho 3 Er 3 Tm 3 Yb 3 Lu 3 Hf 4 Ta 5 W 6 Re ...

(FINS_I_TOT)


[grammar] ~80-~80: Unknown word treated as a proper noun.
Context: ... Tm 3 Yb 3 Lu 3 Hf 4 Ta 5 W 6 Re 7 Os 5 Ir 4 Pt 4 Au 1 Hg 2 Tl 1 Pb 2 Bi 3 Po 4 At...

(VERBS_NOMSPROPIS)


[grammar] ~102-~102: It takes an accent when it is a form of the verb 'ser'.
Context: ... Pa 5 U 6 Np 6 Pu 3 Am 3 Cm 3 Bk 3 Cf 3 Es 3 Fm Md No Lr

(ES)

smact/data/oxidation_states_icsd24_raw.txt

[misspelling] ~3-~3: Missing hyphen between the word and the number: "V-2"
Context: ...idation state set # Source: ICSD (2024, V2) # H -1 1 2 3 He Li 1 Be 2 B -5 -3 -2 -...

(NUMBER_BEFORE_DEFIS_MISSING)

🪛 Markdownlint
dev_docs/design_docs/StructurePredictor.md

79-79: null
No space after hash on atx style heading

(MD018, no-missing-space-atx)

🔇 Additional comments (70)
docs/requirements.in (1)

1-6: Verify package version compatibility

Please verify that these specific versions are compatible:

  • sphinx 5.3.0 with sphinx-book-theme 1.1.3
  • sphinx 5.3.0 with myst-nb 1.1.1

Consider checking the packages' release notes or documentation for any known compatibility issues.

.readthedocs.yaml (2)

4-6: Environment updates look good, please verify compatibility

The upgrade to Ubuntu 22.04 and Python 3.10 represents a positive move towards newer, supported versions. This aligns well with your testing environment.

Please ensure all documentation build dependencies are compatible with Python 3.10.


10-10: Configuration paths are properly specified

The Sphinx configuration and requirements paths are correctly specified and follow standard conventions.

Also applies to: 15-15

smact/benchmarking/smact_benchmark.py (1)

3-5: LGTM! Import changes follow best practices.

The changes to imports enhance code clarity and maintainability:

  • Using __future__ annotations is forward-compatible
  • Absolute imports are more explicit and easier to understand than relative imports
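As a quick illustration of what the `__future__` import buys (a generic sketch, not code from this PR): with postponed evaluation, an annotation may reference a class that is not yet bound at function-definition time.

```python
from __future__ import annotations  # annotations are stored as strings, not evaluated

class Node:
    def clone(self) -> Node:
        # Without the __future__ import, the "-> Node" annotation above would
        # raise NameError, because Node is not bound yet while the class body
        # is still executing.
        return Node()

# The annotation is kept as the literal string "Node"
stored = Node.clone.__annotations__["return"]
fresh = Node().clone()
```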
smact/benchmarking/pymatgen_benchmark.py (1)

3-3: Excellent addition of future annotations import!

This import enables postponed evaluation of type annotations, which is particularly beneficial for avoiding forward reference issues and circular imports.

smact/benchmarking/utilities.py (3)

3-3: Excellent addition of future annotations!

This change enables postponed evaluation of annotations, which is the preferred approach for modern Python development.


25-25: Proper defensive programming practice!

Initialising value before the loop prevents potential UnboundLocalError scenarios. This is particularly important for edge cases where n might be 0.
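A minimal sketch of the failure mode being avoided (generic, not the benchmarking code itself): if a name is first assigned inside a loop body that may never run, reading it afterwards raises `UnboundLocalError`.

```python
def last_square_unsafe(n):
    for i in range(n):
        value = i * i
    return value  # UnboundLocalError when n == 0: the loop body never ran

def last_square_safe(n):
    value = None  # initialised before the loop, so the name always exists
    for i in range(n):
        value = i * i
    return value
```

With `n == 0` the safe version simply returns `None`, while the unsafe one crashes.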


39-40: Verify logging configuration behaviour

Whilst the fname assignment is more concise, please verify that calling logging.basicConfig() multiple times doesn't cause any issues with your logging setup, as subsequent calls to basicConfig() are typically ignored by the logging system.
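The behaviour is easy to confirm; this generic sketch (not the benchmarking code itself) shows that a second `basicConfig()` call is a no-op once the root logger has handlers, unless `force=True` (available since Python 3.8) is passed.

```python
import logging

logging.basicConfig(level=logging.INFO, force=True)   # establish a known starting state
logging.basicConfig(level=logging.DEBUG)              # ignored: root already has a handler

root = logging.getLogger()
level_after_second_call = root.level                  # still logging.INFO

logging.basicConfig(level=logging.DEBUG, force=True)  # force=True replaces existing handlers
level_after_force = root.level                        # now logging.DEBUG
```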

smact/lattice.py (1)

2-4: Well-structured module documentation and imports

The module docstring clearly describes the purpose, and the __future__ import ensures proper type annotation handling.

docs/index.rst (1)

57-57: Consider clarifying the distinction between examples and tutorials

It would be helpful to add a brief description or comment explaining how the 'tutorials' section differs from the 'examples' section to help users navigate the documentation more effectively.

.pre-commit-config.yaml (2)

40-40: Consider using a stable version of prettier

The current configuration uses an alpha version (v4.0.0-alpha.8). For production environments, it's recommended to use stable releases to avoid potential breaking changes or instability.


1-67: Verify tool integration and execution order

The pre-commit configuration includes multiple formatting and linting tools. Please verify:

  1. The execution order doesn't cause conflicts (e.g., ruff-format and prettier)
  2. All tool versions are mutually compatible
  3. The pipeline performs efficiently without redundant operations
setup.py (2)

21-22: LGTM! Good use of context manager

The implementation correctly uses a context manager for file handling and appropriately separates the long description logic.


Line range hint 58-59: Verify pymatgen version constraints

The version constraint for pymatgen (>=2024.2.20,<2024.8.8) is quite specific and might be overly restrictive. Consider if such tight version bounds are necessary, as they could cause dependency conflicts in downstream projects.

docs/introduction.rst (1)

12-13: LGTM! Clear extension of application domains.

The addition of water splitting and thermoelectrics as application domains enhances the introduction's completeness.

docs/examples/valence_electron_count.ipynb (1)

52-69: LGTM! Notebook configuration is properly set up.

The metadata configuration is correct and follows Jupyter notebook standards. The Python version (3.10.14) aligns with the PR objectives.

docs/examples/validity.ipynb (1)

82-82: Verify Python version compatibility

The notebook uses Python 3.12.4, which is quite recent. Consider testing with Python 3.10 as specified in the PR objectives to ensure compatibility.

docs/examples/lists.ipynb (2)

46-48: LGTM! Clear and helpful explanation

The explanation effectively communicates the performance benefits of using a dictionary for element lookups.


89-89: Consider updating Python version

The notebook uses Python 3.9.16, but according to the PR objectives, the project is tested with Python 3.10. Consider updating the kernel to maintain consistency.

docs/examples/distorter.ipynb (1)

103-103: Verify Python version compatibility

The notebook uses Python 3.12.4, which is quite recent. Please verify that all dependencies support this version, or consider specifying a more widely supported version (e.g., 3.10).

docs/requirements.txt (3)

2-5: LGTM! Verify pip-compile command execution.

The file header correctly documents the generation method. Ensure the team knows to run pip-compile requirements.in when updating dependencies.


7-14: Review the necessity of Jupyter-related dependencies.

Several new packages have been added that appear to be related to Jupyter notebook functionality (ipykernel, ipython, etc.). Whilst these are valid for documentation, they significantly increase the dependency footprint.

Consider whether these could be moved to an optional dependencies file if they're only needed for notebook examples.

Also applies to: 58-63


159-160: Consider implementing version constraints for security updates.

Whilst exact version pinning ensures reproducibility, it may prevent security updates. For critical packages like requests and urllib3, consider using version constraints (e.g., >=2.32.0,<3.0.0) to allow minor security updates whilst preventing breaking changes.

Also applies to: 225-227
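For instance, a compatible-range pin might look like this in requirements.txt (the version numbers here are illustrative, not a tested recommendation):

```text
# allow patch/minor security releases, block the next major version
requests>=2.32.0,<3.0.0
urllib3>=2.2.0,<3.0.0
```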

docs/examples/doper.ipynb (1)

157-157: Update Python version to match PR requirements

The notebook specifies Python 3.9.16, but the PR indicates testing with Python 3.10. Consider updating the kernel to maintain consistency with the testing environment.

smact/properties.py (7)

1-3: LGTM! Good practice adding future annotations

The addition of from __future__ import annotations is appropriate for the modern type hint syntax used throughout the file.


11-23: LGTM! Well-structured docstring and modern type hints

The updated type hints using the | syntax and the well-formatted NumPy-style docstring improve code clarity and maintainability.


26-27: LGTM! Improved error handling with specific exception

Good practice using TypeError with a descriptive message that includes the actual type received.


91-94: LGTM! Well-structured type hints

The type hints are clear and properly handle optional parameters using the modern syntax.


Line range hint 96-120: LGTM! Comprehensive and well-formatted docstring

The docstring provides clear documentation with proper section markers and includes helpful mathematical explanations.


124-129: LGTM! Clear type checking and error handling

Good implementation of type checking with a helpful error message for invalid input types.


187-187: LGTM! Improved exception handling

Good use of from None to prevent unnecessary exception chaining and provide cleaner error messages.
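For reference, a minimal sketch of the pattern (generic, not the `smact.properties` code itself): `raise ... from None` sets `__suppress_context__`, so the traceback omits the "During handling of the above exception, another exception occurred" chain.

```python
def to_float(text):
    try:
        return float(text)
    except ValueError:
        # "from None" suppresses the implicit chaining of the ValueError
        raise TypeError(f"expected a numeric string, got {text!r}") from None

try:
    to_float("abc")
except TypeError as exc:
    caught = exc
```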

smact/oxidation_states.py (2)

9-10: Well-considered addition of annotations import

The addition of from __future__ import annotations is appropriate for enabling modern type hint syntax whilst maintaining compatibility with older Python versions.


28-37: Excellent modernisation of type hints

The updated type hints follow PEP 604 and provide clearer, more maintainable type annotations. The docstring is well-structured with proper argument documentation.
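A small sketch of the PEP 604 style (generic names, not the module's actual signatures): combined with the `__future__` import, the `|` union syntax parses in annotations even on Python 3.7–3.9.

```python
from __future__ import annotations  # lets "int | None" parse in annotations pre-3.10

def lookup_state(table: dict[str, int], symbol: str, default: int | None = None) -> int | None:
    """Return the tabulated oxidation state, or the default when absent."""
    return table.get(symbol, default)

common = lookup_state({"Fe": 3, "O": -2}, "Fe")   # 3
missing = lookup_state({"Fe": 3, "O": -2}, "Xx")  # None
```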

smact/lattice_parameters.py (3)

6-6: LGTM: Future annotations import added

Good preparation for future type hinting support.


12-25: Verify parameter terminology consistency

The docstring refers to "shannon_radius", but the AI summary suggests a terminology change to "covalent_radius". Please verify the intended parameter naming convention.


Line range hint 1-350: Consider adding unit tests

Given the complexity of the crystallographic calculations, it would be beneficial to add unit tests to verify:

  1. Expected lattice parameters for known structures
  2. Edge cases with extreme radius values
  3. Parameter validation

Would you like assistance in generating unit tests for these functions?

docs/examples/oxidation_states.ipynb (1)

221-221: Version mismatch with PR objectives

The notebook uses Python 3.11.5, but the PR objectives specify Python 3.10. Please ensure version consistency across the project.

pyproject.toml (3)

3-4: LGTM: Build system configuration is appropriate

The setuptools version constraint and build backend configuration are well-specified.


41-50: Consider relaxing version constraints

Two potential concerns with the dependency specifications:

  1. The pymatgen constraint >=2024.2.20,<2024.8.8 is quite narrow and might cause compatibility issues in the future.
  2. The numpy constraint <2 might be too restrictive as numpy 2.0 brings performance improvements.

Consider:

  • Using a more flexible version range for pymatgen
  • Testing compatibility with numpy 2.0 to potentially remove the constraint

175-279: LGTM: Codespell configuration is well-suited for chemistry project

The extensive ignore list for chemical elements and careful skip patterns demonstrate good attention to domain-specific requirements.

docs/tutorials/crystal_space.ipynb (2)

100-106: Ensure data directory exists before saving

The code uses a relative path for saving without ensuring the directory exists. This could cause FileNotFoundError if the directory structure isn't present.

Consider adding directory creation:

from pathlib import Path
save_dir = Path("data/binary")
save_dir.mkdir(parents=True, exist_ok=True)
df_smact = generate_composition_with_smact(
    num_elements=2,
    max_stoich=8,
    max_atomic_num=103,
    num_processes=8,
    save_path=str(save_dir / "df_binary_label.pkl"),
)

241-264: LGTM! Well-structured notebook metadata and next steps

The metadata section is complete and the next steps are clearly documented with appropriate links.

docs/tutorials/crystal_space_visualisation.ipynb (3)

7-7: Documentation improvements enhance clarity and accessibility

The markdown changes effectively improve the tutorial's structure and accessibility:

  • Clear title that reflects the tutorial's purpose
  • Added prerequisite link for better navigation
  • Google Colab integration for interactive learning
  • Consistent section header capitalisation

Also applies to: 14-22, 36-36, 46-52


128-128: Verify if the reduced sample size is sufficient

The sample size has been reduced from 3000 to 100. While this might speed up the tutorial, please verify that this smaller sample size:

  1. Provides statistically significant results
  2. Adequately represents the distribution of the original dataset
  3. Demonstrates the dimension reduction techniques effectively

385-385: Verify Python version compatibility

The notebook specifies Python 3.10.14. Please verify:

  1. All dependencies support Python 3.10
  2. No breaking changes affect the functionality
  3. CI/CD pipelines are updated accordingly
docs/conf.py (4)

65-67: LGTM! Project metadata updates are appropriate

The capitalisation of SMACT, updated copyright year, and generic author attribution align with standard open-source practices.


126-128: Please verify logo file and consider title consistency

The theme configuration looks good, but please:

  1. Verify that 'smact_toc.jpg' exists in the docs directory
  2. Consider using consistent capitalisation between project name and HTML title ("SMACT" vs "smact")

172-182: Review branch name and clean up commented code

The launch button configuration is excellent, but there are two items to address:

  1. The repository branch is set to "docs_updates" - should this be "master" instead?
  2. Remove the commented "home_page_in_toc" line if it's not needed

185-192: LGTM! GitHub integration is well-configured

The GitHub context configuration is complete and correctly structured, enabling proper source linking and edit functionality.

examples/Structure_Prediction/Li-Garnet_Generator.ipynb (1)

10-10: Good practice: Future annotations import added

The addition of from __future__ import annotations is a good practice as it enables postponed evaluation of annotations, making the code more maintainable and forward-compatible.

smact/dopant_prediction/doper.py (1)

1-6: Well-structured module documentation and modern type hints setup

The module docstring effectively communicates the purpose and the use of __future__.annotations ensures compatibility with modern type hinting.

smact/__init__.py (5)

2-2: LGTM! Good practices for type hinting.

The addition of annotations import and TYPE_CHECKING is aligned with modern Python type hinting practices.

Also applies to: 7-7, 14-14


Line range hint 38-73: Documentation improvements enhance clarity and completeness.

The documentation has been properly structured with clear sections and complete descriptions for all attributes.


200-203: Good addition of ICSD 2024 oxidation states.

The inclusion of the latest ICSD data enhances the library's capability to work with current oxidation state information.


Line range hint 221-232: Documentation improvements with proper academic citations.

The Species class documentation has been enhanced with:

  • Clear attribute descriptions
  • Proper citation for the machine-learning methodology
  • Well-structured parameter documentation

Also applies to: 272-283


367-368: Good modernisation of type hints.

The type hints have been updated to use the modern Python syntax (e.g., list[str] instead of List[str]).

Also applies to: 472-474

smact/data_loader.py (4)

9-11: LGTM! Documentation improvement enhances clarity.

The added documentation clarifies the module's purpose and behaviour regarding missing data warnings.


13-14: LGTM! Future-proofing with type hints.

The addition of from __future__ import annotations ensures compatibility with future Python versions and enables forward references in type hints.


398-398: LGTM! Type hints improve code maintainability.

The addition of type hints to function signatures enhances code readability and enables better IDE support and static type checking.

Also applies to: 844-844, 918-918


106-106: LGTM! Consistent string formatting.

The changes standardise string formatting across the module by using f-strings, which is more readable and maintainable than the older .format() method.

Also applies to: 152-152, 200-200, 248-248, 297-297, 342-342, 459-459, 551-551, 649-649, 715-715, 786-786, 835-835, 912-912, 957-957
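The two styles produce identical strings; a one-line sketch (hypothetical message, not one from the module):

```python
symbol, charge = "Fe", 3
old = "No data for {}{}.".format(symbol, charge)  # older .format() style
new = f"No data for {symbol}{charge}."            # equivalent f-string
```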

docs/tutorials/oxidation_states.ipynb (3)

7-13: Well-structured tutorial introduction!

The markdown cells provide a clear and logical introduction to the tutorial's objectives, outlining the high-throughput compound design workflow steps.


242-247: LGTM! Clear and focused implementation

The code for computing compound probabilities is concise and straightforward.


Line range hint 427-450: Verify Python version compatibility

The notebook has been downgraded from Python 3.10.13 to 3.9.16. Please ensure all functionality remains compatible with the older Python version.

docs/tutorials/smact_validity_of_GNoMe.ipynb (4)

5-13: LGTM! Documentation improvements look good.

The updated Colab link and enhanced introduction provide better context for users.


69-80: Well-structured function with clear documentation.

The new initialize_parallel_processing() function is well-documented with a clear docstring.


82-87: Good documentation of imported modules.

The comments explaining the purpose of each imported module improve code readability.


395-400: LGTM! Clear and informative conclusion.

The conclusion effectively summarises the findings and limitations of the SMACT validity test.

examples/Structure_Prediction/Li-Garnets_SP-Pym-new.ipynb (3)

9-21: Well-structured import statements!

The imports have been streamlined and modernised with the addition of future annotations. The removal of unused imports and the organisation of remaining ones improve code clarity.


498-501: Verify the probability threshold value

The threshold value of 10e-4 (0.001) seems quite low for structure prediction. This could potentially include many false positives in the predictions. Consider whether a higher threshold might be more appropriate for your use case.
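One detail worth double-checking when tuning this: in Python the literal `10e-4` evaluates to 0.001, whereas `1e-4` is 0.0001.

```python
threshold = 10e-4      # 10 × 10⁻⁴
print(threshold)       # 0.001
print(1e-4)            # 0.0001
```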


5-5: Execute the notebook to verify functionality

All cells show null execution counts, indicating the notebook hasn't been run. Please execute the entire notebook to ensure all cells function as expected and produce the intended outputs.

Also applies to: 51-51, 142-142, 166-166, 188-188, 214-214, 229-229, 308-308, 327-327, 418-418, 431-431, 457-457, 466-466, 522-522, 550-550, 679-679, 696-696, 710-710, 839-839, 853-853, 871-871, 912-912, 940-940, 965-965, 1100-1100, 1120-1120, 1133-1133, 1142-1142, 1151-1151, 1163-1163, 1183-1183, 1214-1214, 1237-1237

Comment on lines 2 to 3
sphinx==5.3.0
sphinx_rtd_theme==2.0.0

⚠️ Potential issue

Remove redundant theme package

There appears to be a conflict with two different Sphinx themes being specified: sphinx_rtd_theme and sphinx-book-theme. Since the AI summary indicates a switch to sphinx-book-theme, the sphinx_rtd_theme package should be removed to avoid potential conflicts.

Apply this diff to resolve the conflict:

sphinx==5.3.0
-sphinx_rtd_theme==2.0.0
readthedocs-sphinx-search==0.3.2
sphinx-book-theme==1.1.3

Also applies to: 5-5

Comment on lines +40 to +44
"# List of chemical formulas for which VEC will be calculated\n",
"formulas = [\"Cu0.5NiAlCoCrFeSi\", \"WNbMoTa\", \"AlCoCrCu0.5FeNi3\", \"Ca65Mg15Zn20\"]\n",
"\n",
"# Calculate VEC for each formula using a list comprehension\n",
"valence_electron_concentrations = [smact.properties.valence_electron_count(formula) for formula in formulas]\n",

⚠️ Potential issue

Add error handling for invalid chemical formulas.

The current implementation doesn't handle potential errors from invalid chemical formulas. Consider adding try-except blocks:

 # List of chemical formulas for which VEC will be calculated
 formulas = ["Cu0.5NiAlCoCrFeSi", "WNbMoTa", "AlCoCrCu0.5FeNi3", "Ca65Mg15Zn20"]

-# Calculate VEC for each formula using a list comprehension
-valence_electron_concentrations = [smact.properties.valence_electron_count(formula) for formula in formulas]
+# Calculate VEC for each formula with error handling
+valence_electron_concentrations = []
+for formula in formulas:
+    try:
+        vec = smact.properties.valence_electron_count(formula)
+        valence_electron_concentrations.append(vec)
+    except ValueError as e:
+        print(f"Error processing {formula}: {str(e)}")

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +52 to +71
def wurtzite(species, cell_par=None, repetitions=None):
    """
    Build a wurtzite cell using the crystal function in ASE.

    Args:
    ----
        species (str): Element symbols
        cell_par (list): Six floats/ints specifying 3 unit cell lengths and 3 unit cell angles.
        repetitions (list): Three floats specifying the expansion of the cell in x,y,z directions.

    Returns:
    -------
        SMACT Lattice object of the unit cell,
        ASE crystal system of the unit cell.

    """
    if repetitions is None:
        repetitions = [1, 1, 1]
    if cell_par is None:
        cell_par = [2, 2, 6, 90, 90, 120]

⚠️ Potential issue

Fix inconsistent cell parameters

There's a mismatch between the default cell parameters and those used in the crystal creation:

  • Default: [2, 2, 6, 90, 90, 120]
  • Used: [3, 3, 6, 90, 90, 120]

Please align these values or document why they differ.

Consider optimising oxidation states list

The current oxidation states list [[1], [2], [3], [4], [-1], [-2], [-3], [-4]] includes all possible states from -4 to +4, which might be unnecessarily broad for wurtzite structures.

Consider limiting this to commonly observed oxidation states in wurtzite materials.

Comment on lines +54 to +55
"# Apply smact_validity to check if compounds are valid according to SMACT rules\n",
"df[\"smact_allowed\"] = df[\"Compound\"].apply(smact_validity)\n",

⚠️ Potential issue

Add error handling for invalid chemical formulae

The smact_validity function might raise exceptions for malformed chemical formulae. Consider adding try-except blocks to handle potential errors gracefully.

-df["smact_allowed"] = df["Compound"].apply(smact_validity)
+def safe_validate(compound):
+    try:
+        return smact_validity(compound)
+    except Exception as e:
+        print(f"Warning: Could not validate {compound}: {str(e)}")
+        return False
+
+df["smact_allowed"] = df["Compound"].apply(safe_validate)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"# Apply smact_validity to check if compounds are valid according to SMACT rules\n",
"df[\"smact_allowed\"] = df[\"Compound\"].apply(smact_validity)\n",
def safe_validate(compound):
    try:
        return smact_validity(compound)
    except Exception as e:
        print(f"Warning: Could not validate {compound}: {str(e)}")
        return False

df["smact_allowed"] = df["Compound"].apply(safe_validate)

Comment on lines +81 to +83
"# The distorted objects are of the ase atoms class,\n",
"# so can be written out to a crystal structure file like a cif\n",
"distorted.write(\"example_output.cif\")"

⚠️ Potential issue

Consider saving all generated structures

Currently, only the last distorted structure is being saved. Consider saving all generated structures with meaningful filenames.

-# The distorted objects are of the ase atoms class,
-# so can be written out to a crystal structure file like a cif
-distorted.write("example_output.cif")
+# Save all generated structures
+for i, structure in enumerate(substituted_structures):
+    structure.write(f"Sr_substituted_BaTiO3_{i+1}.cif")

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +275 to +306
"def comp_maker(comp):\n",
" \"\"\"\n",
" Convert a composition tuple to a pretty formula string.\n",
"\n",
" Args:\n",
" comp (tuple): A tuple containing two lists - elements and their amounts.\n",
"\n",
" Returns:\n",
" str: The reduced formula of the composition as a string.\n",
" \"\"\"\n",
" # Create a list to store elements and their amounts\n",
" form = []\n",
" # Iterate through elements and their amounts\n",
" for el, ammt in zip(comp[0], comp[1]):\n",
" form.append(el)\n",
" form.append(ammt)\n",
" # Join all elements into a single string\n",
" form = \"\".join(str(e) for e in form)\n",
" # Convert to a Composition object and get the reduced formula\n",
" pmg_form = Composition(form).reduced_formula\n",
" return pmg_form\n",
"\n",
"\n",
"# Use multiprocessing to apply comp_maker to all compositions in flat_list\n",
"with multiprocessing.Pool() as p:\n",
" pretty_formulas = p.map(comp_maker, flat_list)\n",
"\n",
"# Create a list of unique formulas\n",
"unique_pretty_formulas = list(set(pretty_formulas))\n",
"# Print the number of unique composition formulas\n",
"print(f\"Number of unique compositions formulas: {len(unique_pretty_formulas)}\")"
]

🛠️ Refactor suggestion

Improve composition handling robustness and efficiency

The composition handling could be made more robust and efficient with explicit error handling and fewer intermediate conversions:

-def comp_maker(comp):
+def comp_maker(comp: tuple[list[str], list[int]]) -> str:
+    try:
         form = []
         for el, ammt in zip(comp[0], comp[1]):
-            form.append(el)
-            form.append(ammt)
+            form.extend([el, str(ammt)])
-        form = "".join(str(e) for e in form)
+        form = "".join(form)
         pmg_form = Composition(form).reduced_formula
         return pmg_form
+    except (ValueError, TypeError) as e:
+        raise ValueError(f"Invalid composition format: {e}")
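The string-building half of this suggestion can be exercised without pymatgen; the `Composition(...).reduced_formula` step is omitted here, so this sketch stops at the joined formula string:

```python
def formula_string(comp):
    """Join parallel element/amount lists into a formula string such as 'Ti1O2'."""
    try:
        elements, amounts = comp
        if len(elements) != len(amounts):
            raise ValueError("element and amount lists differ in length")
        parts = []
        for el, amt in zip(elements, amounts):
            parts.extend([el, str(amt)])
        return "".join(parts)
    except (ValueError, TypeError) as e:
        raise ValueError(f"Invalid composition format: {e}")

print(formula_string((["Ti", "O"], [1, 2])))  # Ti1O2
```

In the notebook, the resulting string is then fed to pymatgen's `Composition` to obtain the reduced formula.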

Comment on lines +176 to +211
"def smact_filter(els):\n",
" \"\"\"\n",
" Filter element combinations based on SMACT criteria.\n",
"\n",
" This function takes a combination of elements and applies SMACT\n",
" (Semiconducting Materials from Analogy and Chemical Theory) tests\n",
" to generate potential quaternary oxide compositions.\n",
"\n",
" Args:\n",
" els (tuple): A tuple containing three Element objects.\n",
"\n",
" Returns:\n",
" list: A list of tuples, each containing a set of elements and their ratios\n",
" that pass the SMACT criteria.\n",
" \"\"\"\n",
" all_compounds = []\n",
" elements = [e.symbol for e in els] + [\"O\"]\n",
"\n",
" # Get Pauling electronegativities\n",
" paul_a, paul_b, paul_c = (el.pauling_eneg for el in els)\n",
" electronegativities = [paul_a, paul_b, paul_c, 3.44] # 3.44 is for Oxygen\n",
"\n",
" # Iterate through all possible oxidation state combinations\n",
" for ox_states in product(*(el.oxidation_states for el in els)):\n",
" ox_states = list(ox_states) + [-2] # Add oxygen's oxidation state\n",
"\n",
" # Test for charge balance\n",
" cn_e, cn_r = smact.neutral_ratios(ox_states, threshold=8)\n",
"\n",
" if cn_e:\n",
" # Electronegativity test\n",
" if screening.pauling_test(ox_states, electronegativities):\n",
" compound = (elements, cn_r[0])\n",
" all_compounds.append(compound)\n",
"\n",
" return all_compounds"

🛠️ Refactor suggestion

Enhance function robustness and maintainability

The smact_filter function could be improved in several ways:

  1. Add input validation
  2. Include type hints
  3. Move magic numbers to constants
+OXYGEN_ELECTRONEGATIVITY = 3.44
+OXYGEN_OXIDATION_STATE = -2
+MAX_RATIO_THRESHOLD = 8

-def smact_filter(els):
+def smact_filter(els: tuple[smact.Element, ...]) -> list[tuple[list[str], list[int]]]:
+    if not els or len(els) != 3:
+        raise ValueError("Must provide exactly three elements")
+
     all_compounds = []
     elements = [e.symbol for e in els] + ["O"]
     
     # Get Pauling electronegativities
     paul_a, paul_b, paul_c = (el.pauling_eneg for el in els)
-    electronegativities = [paul_a, paul_b, paul_c, 3.44]  # 3.44 is for Oxygen
+    electronegativities = [paul_a, paul_b, paul_c, OXYGEN_ELECTRONEGATIVITY]
     
     for ox_states in product(*(el.oxidation_states for el in els)):
-        ox_states = list(ox_states) + [-2]  # Add oxygen's oxidation state
+        ox_states = list(ox_states) + [OXYGEN_OXIDATION_STATE]
         
-        cn_e, cn_r = smact.neutral_ratios(ox_states, threshold=8)
+        cn_e, cn_r = smact.neutral_ratios(ox_states, threshold=MAX_RATIO_THRESHOLD)
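For readers without SMACT installed, the charge-neutrality test can be imitated with a brute-force stdlib version; `neutral_ratios_toy` is a hypothetical stand-in and not the `smact.neutral_ratios` implementation:

```python
from itertools import product

def neutral_ratios_toy(ox_states, threshold=8):
    """Find stoichiometric ratios (1..threshold per site) whose total charge is zero."""
    ratios = []
    for counts in product(range(1, threshold + 1), repeat=len(ox_states)):
        if sum(q * n for q, n in zip(ox_states, counts)) == 0:
            ratios.append(counts)
    return (len(ratios) > 0), ratios

found, ratios = neutral_ratios_toy([2, -2], threshold=2)
print(found, ratios)  # True [(1, 1), (2, 2)]
```

The real function additionally reduces ratios and is optimised, but the charge-balance condition being checked is the same.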

Comment on lines +338 to +395
"def add_descriptors(data):\n",
" \"\"\"\n",
" Add composition-based descriptors to the dataframe.\n",
"\n",
" This function converts formula strings to composition objects and calculates\n",
" various features using matminer's composition featurizers.\n",
"\n",
" Args:\n",
" data (pd.DataFrame): DataFrame containing 'pretty_formula' column.\n",
"\n",
" Returns:\n",
" pd.DataFrame: DataFrame with added descriptor columns.\n",
" \"\"\"\n",
" # Convert formula strings to composition objects\n",
" str_to_comp = StrToComposition(target_col_id=\"composition_obj\")\n",
" str_to_comp.featurize_dataframe(data, col_id=\"pretty_formula\")\n",
"\n",
" # Initialize multiple featurizers\n",
" feature_calculators = MultipleFeaturizer(\n",
" [\n",
" cf.Stoichiometry(),\n",
" cf.ElementProperty.from_preset(\"magpie\"),\n",
" cf.ValenceOrbital(props=[\"avg\"]),\n",
" cf.IonProperty(fast=True),\n",
" cf.BandCenter(),\n",
" cf.AtomicOrbitals(),\n",
" ]\n",
" )\n",
"\n",
" # Calculate features\n",
" feature_calculators.featurize_dataframe(data, col_id=\"composition_obj\")\n",
"\n",
" # If you need to use feature_labels later, uncomment the following line:\n",
" # feature_labels = feature_calculators.feature_labels()\n",
"\n",
" return data\n",
"\n",
"\n",
"# Apply the function to add descriptors\n",
"new_data = add_descriptors(new_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7. Save results to a CSV file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Save as .csv file\n",
"new_data.to_csv(\"All_oxide_comps_dataframe_featurized.csv\", chunksize=10000)"
]

🛠️ Refactor suggestion

Enhance descriptor calculation efficiency and robustness

Consider implementing these improvements for better performance and reliability:

  1. Add progress tracking for long-running calculations
  2. Implement memory-efficient processing
  3. Add error handling for feature calculation
+from tqdm.auto import tqdm
+
 def add_descriptors(data):
+    print("Converting formulas to compositions...")
     str_to_comp = StrToComposition(target_col_id="composition_obj")
-    str_to_comp.featurize_dataframe(data, col_id="pretty_formula")
+    try:
+        str_to_comp.featurize_dataframe(
+            data,
+            col_id="pretty_formula",
+            ignore_errors=True
+        )
+    except Exception as e:
+        raise ValueError(f"Failed to convert formulas: {e}")

     feature_calculators = MultipleFeaturizer([
         cf.Stoichiometry(),
         cf.ElementProperty.from_preset("magpie"),
         cf.ValenceOrbital(props=["avg"]),
         cf.IonProperty(fast=True),
         cf.BandCenter(),
         cf.AtomicOrbitals(),
     ])

+    print("Calculating features...")
+    for i in tqdm(range(0, len(data), 1000), desc="Processing batches"):
+        batch = data.iloc[i:i+1000]
+        try:
+            feature_calculators.featurize_dataframe(
+                batch,
+                col_id="composition_obj",
+                ignore_errors=True
+            )
+        except Exception as e:
+            print(f"Warning: Failed to calculate features for batch {i}: {e}")
+            continue

Also, consider using a more robust saving method:

-new_data.to_csv("All_oxide_comps_dataframe_featurized.csv", chunksize=10000)
+try:
+    print("Saving results...")
+    new_data.to_csv(
+        "All_oxide_comps_dataframe_featurized.csv",
+        chunksize=10000,
+        index=False
+    )
+    print("Save completed successfully")
+except Exception as e:
+    print(f"Error saving results: {e}")
+    raise
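The batching idea generalises beyond matminer; a stdlib sketch of per-batch error handling (the progress bar is replaced by a plain loop, and `toy_featurize` is a hypothetical featurizer):

```python
def process_in_batches(items, batch_size, func):
    """Apply func to each batch, skipping batches that raise instead of aborting."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        try:
            results.extend(func(batch))
        except Exception as e:
            print(f"Warning: batch starting at {start} failed: {e}")
    return results

def toy_featurize(batch):
    """Toy featurizer: string length per item; fails on batches containing None."""
    return [len(x) for x in batch]

data = ["BaTiO3", "SrTiO3", None, "MgO"]
print(process_in_batches(data, 2, toy_featurize))  # [6, 6]
```

Note the trade-off: a failing batch drops all of its rows, so smaller batches lose less data per failure at the cost of more overhead.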

Comment on lines +341 to +356
dopants_lists[i] = self._get_selectivity(dopants_lists[i], cations, sub)

# if groupby
-groupby_lists = [
-    dict()
-] * 4  # create list of empty dict length of 4 (n-cat, p-cat, n-an, p-an)
+groupby_lists = [dict()] * 4  # create list of empty dict length of 4 (n-cat, p-cat, n-an, p-an)
# in case group_by_charge = False
if group_by_charge:
    for i, dl in enumerate(dopants_lists):
        # groupby first element charge
        dl = sorted(dl, key=lambda x: utilities.parse_spec(x[0])[1])
-        grouped_data = groupby(
-            dl, key=lambda x: utilities.parse_spec(x[0])[1]
-        )
-        grouped_top_data = {
-            str(k): list(g)[:num_dopants] for k, g in grouped_data
-        }
+        grouped_data = groupby(dl, key=lambda x: utilities.parse_spec(x[0])[1])
+        grouped_top_data = {str(k): list(g)[:num_dopants] for k, g in grouped_data}
        groupby_lists[i] = grouped_top_data
        del grouped_data

# select top n elements
-dopants_lists = [
-    dopants_list[:num_dopants] for dopants_list in dopants_lists
-]
+dopants_lists = [dopants_list[:num_dopants] for dopants_list in dopants_lists]

🛠️ Refactor suggestion

Consider breaking down complex logic

The grouping and sorting logic could be extracted into separate methods for improved maintainability:

def _group_dopants_by_charge(self, dopants_list, num_dopants):
    if not dopants_list:
        return {}
    sorted_data = sorted(dopants_list, key=lambda x: utilities.parse_spec(x[0])[1])
    grouped_data = groupby(sorted_data, key=lambda x: utilities.parse_spec(x[0])[1])
    return {str(k): list(g)[:num_dopants] for k, g in grouped_data}

def _select_top_dopants(self, dopants_lists, num_dopants):
    return [dopants_list[:num_dopants] for dopants_list in dopants_lists]
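A self-contained version of the grouping helper, with `parse_charge` as a toy stand-in for `utilities.parse_spec` (it handles single-digit charges only, e.g. "Fe3+" or "O2-"):

```python
from itertools import groupby

def parse_charge(species):
    """Toy parser: return the signed integer charge of a species string like 'Fe3+'."""
    sign = 1 if species.endswith("+") else -1
    return sign * int(species[-2])

def group_by_charge(dopants, num_dopants):
    """Group (species, score) pairs by charge, keeping the top entries per group."""
    if not dopants:
        return {}
    ordered = sorted(dopants, key=lambda x: parse_charge(x[0]))
    grouped = groupby(ordered, key=lambda x: parse_charge(x[0]))
    return {str(k): list(g)[:num_dopants] for k, g in grouped}

dopants = [("Fe3+", 0.9), ("Al3+", 0.8), ("Mg2+", 0.7)]
print(group_by_charge(dopants, 1))  # {'2': [('Mg2+', 0.7)], '3': [('Fe3+', 0.9)]}
```

The sort with the same key before `groupby` is essential: `itertools.groupby` only groups consecutive runs, so unsorted input would split one charge into several groups.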

Comment on lines +301 to +342
_el_ox_states_icsd24 = None


def lookup_element_oxidation_states_icsd24(symbol, copy=True):
"""
Retrieve a list of known oxidation states for an element.
The oxidation states list used contains only those found
in the 2024 version of the ICSD (and has >=5 reports).

Args:
symbol (str) : the atomic symbol of the element to look up.
copy (Optional(bool)): if True (default), return a copy of the
oxidation-state list, rather than a reference to the cached
data -- only use copy=False in performance-sensitive code
and where the list will not be modified!

Returns:
list: List of known oxidation states for the element.

Returns None if oxidation states for the Element were not
found in the external data.
"""
global _el_ox_states_icsd24

if _el_ox_states_icsd24 is None:
_el_ox_states_icsd24 = {}

for items in _get_data_rows(os.path.join(data_directory, "oxidation_states_icsd24_filtered.txt")):
_el_ox_states_icsd24[items[0]] = [int(oxidationState) for oxidationState in items[1:]]

if symbol in _el_ox_states_icsd24:
if copy:
# _el_ox_states_icsd24 stores lists -> if copy is set, make an implicit
# deep copy. The elements of the lists are integers, which are
# "value types" in Python.

return list(_el_ox_states_icsd24[symbol])
else:
return _el_ox_states_icsd24[symbol]
else:
if _print_warnings:
print(f"WARNING: Oxidation states for element {symbol} " "not found.")

🛠️ Refactor suggestion

Consider validating that the data file exists.

The new lookup_element_oxidation_states_icsd24 function follows the established pattern, but it should validate the existence of the data file before attempting to read it.

def lookup_element_oxidation_states_icsd24(symbol, copy=True):
    global _el_ox_states_icsd24

    if _el_ox_states_icsd24 is None:
        _el_ox_states_icsd24 = {}
+       data_file = os.path.join(data_directory, "oxidation_states_icsd24_filtered.txt")
+       if not os.path.exists(data_file):
+           raise FileNotFoundError(f"Required data file not found: {data_file}")

-       for items in _get_data_rows(os.path.join(data_directory, "oxidation_states_icsd24_filtered.txt")):
+       for items in _get_data_rows(data_file):


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (6)
README.md (6)

35-35: Fix typographical error

There's a typographical error in the word "used".

-The best place to start is looking at [the docs](https://smact.readthedocs.io/en/latest/), which highlight some simple examples of how these classes and functions can be usede
+The best place to start is looking at [the docs](https://smact.readthedocs.io/en/latest/), which highlight some simple examples of how these classes and functions can be used

46-46: Improve sentence structure

The phrase "in order to" can be simplified for better readability.

-Further filters can be applied to generated lists of compositions in order to screen for particular properties.
+Further filters can be applied to generated lists of compositions to screen for particular properties.


79-79: Fix grammatical error

There's a grammatical error in the article usage.

-A submodule which contains a collections of tools for predicting dopants.
+A submodule which contains a collection of tools for predicting dopants.


90-90: Add missing commas

Add commas after introductory phrases for better readability.

-The latest stable release can be installed via pip which will automatically set up other Python packages
+The latest stable release can be installed via pip, which will automatically set up other Python packages

-For developer installation SMACT can be installed from a copy of the source
+For developer installation, SMACT can be installed from a copy of the source

Also applies to: 102-102



114-114: Update to British English spelling

In British English, 'licence' is the noun form while 'license' is the verb form.

-## License and attribution
+## Licence and attribution

-Python code and original data tables are licensed under the MIT License.
+Python code and original data tables are licensed under the MIT Licence.

Also applies to: 116-116



122-122: Fix subject-verb agreement

Correct the verb form to match the pronoun 'they'.

-especially if they helps us improve the docs!
+especially if they help us improve the docs!

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 51d78b0 and 7c54244.

📒 Files selected for processing (4)
  • .github/workflows/ci.yml (1 hunks)
  • README.md (4 hunks)
  • requirements-dev.txt (1 hunks)
  • setup.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • .github/workflows/ci.yml
  • requirements-dev.txt
  • setup.py
🧰 Additional context used
🪛 LanguageTool
README.md

[misspelling] ~24-~24: Possible spelling mistake found.
Context: ...a enough, nature will always confess_ - Roland Coase (from 'How should economists choose?') ...

(EN_MULTITOKEN_SPELLING_TWO)


[style] ~46-~46: Consider a shorter alternative to avoid wordiness.
Context: ...lied to generated lists of compositions in order to screen for particular properties. These...

(IN_ORDER_TO_PREMIUM)


[style] ~66-~66: Would you like to use the Oxford spelling “initialize”? The spelling ‘initialise’ is also correct.
Context: ...es the loading of external data used to initialise the core smact.Element and `smact.Spe...

(OXFORD_SPELLING_Z_NOT_S)


[grammar] ~79-~79: It looks like ‘collections’ doesn’t match ‘a’. Did you mean “a collection” or just “collections”?
Context: ...rediction**: A submodule which contains a collections of tools for predicting dopants. ## R...

(A_NNS_IN)


[uncategorized] ~90-~90: Possible missing comma found.
Context: ...est stable release can be installed via pip which will automatically set up other P...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~102-~102: Possible missing comma found.
Context: ....com/WMD-group/SMACT.git For developer installation SMACT can be installed from a copy of t...

(AI_HYDRA_LEO_MISSING_COMMA)


[locale-violation] ~114-~114: License must be spelled with a “c” when used as a noun in British English. Use “licence”.
Context: ... immediately reflected on the PATH. ## License and attribution Python code and origin...

(LICENCE_LICENSE_NOUN_SINGULAR)


[locale-violation] ~116-~116: License must be spelled with a “c” when used as a noun in British English. Use “licence”.
Context: ... data tables are licensed under the MIT License. ## Development notes ### Bugs, featu...

(LICENCE_LICENSE_NOUN_SINGULAR)


[grammar] ~122-~122: The pronoun ‘they’ must be used with a non-third-person form of a verb.
Context: ...n the issue tracker, especially if they helps us improve the docs! For other queries ...

(NON3PRS_VERB)

🔇 Additional comments (1)
README.md (1)

66-66: Standardise British English spelling

For consistency with British English, 'initialise' should be used instead of 'initialize'.



codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 72.52964% with 139 lines in your changes missing coverage. Please review.

Project coverage is 73.56%. Comparing base (a71c8d2) to head (bef9c5f).

Files with missing lines Patch % Lines
smact/structure_prediction/prediction.py 21.62% 29 Missing ⚠️
smact/data_loader.py 59.57% 19 Missing ⚠️
smact/__init__.py 77.41% 7 Missing ⚠️
smact/distorter.py 12.50% 7 Missing ⚠️
smact/properties.py 46.15% 7 Missing ⚠️
smact/structure_prediction/mutation.py 68.18% 7 Missing ⚠️
smact/structure_prediction/structure.py 65.00% 7 Missing ⚠️
smact/dopant_prediction/doper.py 64.70% 6 Missing ⚠️
smact/utils/band_gap_simple.py 0.00% 6 Missing ⚠️
...ls/crystal_space/download_compounds_with_mp_api.py 0.00% 6 Missing ⚠️
... and 13 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #330      +/-   ##
==========================================
- Coverage   76.46%   73.56%   -2.91%     
==========================================
  Files          26       31       +5     
  Lines        2265     2531     +266     
==========================================
+ Hits         1732     1862     +130     
- Misses        533      669     +136     

☔ View full report in Codecov by Sentry.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
README.md (5)

24-24: Correct the economist's name in the quote attribution

The economist's name is "Ronald Coase", not "Roland Coase".

🧰 Tools
🪛 LanguageTool

[misspelling] ~24-~24: Possible spelling mistake found.
Context: ...a enough, nature will always confess_ - Roland Coase (from 'How should economists choose?') ...

(EN_MULTITOKEN_SPELLING_TWO)


35-35: Fix typographical error

Remove the trailing 'e' from "usede".

-The best place to start is looking at [the docs](https://smact.readthedocs.io/en/latest/), which highlight some simple examples of how these classes and functions can be usede
+The best place to start is looking at [the docs](https://smact.readthedocs.io/en/latest/), which highlight some simple examples of how these classes and functions can be used

79-79: Fix grammatical error

Change "a collections" to "a collection".

-  - **dopant_prediction**: A submodule which contains a collections of tools for predicting dopants.
+  - **dopant_prediction**: A submodule which contains a collection of tools for predicting dopants.
🧰 Tools
🪛 LanguageTool

[grammar] ~79-~79: It looks like ‘collections’ doesn’t match ‘a’. Did you mean “a collection” or just “collections”?
Context: ...rediction**: A submodule which contains a collections of tools for predicting dopants. ## Re...

(A_NNS_IN)


113-113: Use British English spelling for "Licence"

Change "License" to "Licence" to maintain British English consistency.

-## License and attribution
+## Licence and attribution

-Python code and original data tables are licensed under the MIT License.
+Python code and original data tables are licensed under the MIT Licence.

Also applies to: 115-115

🧰 Tools
🪛 LanguageTool

[locale-violation] ~113-~113: License must be spelled with a “c” when used as a noun in British English. Use “licence”.
Context: ... immediately reflected on the PATH. ## License and attribution Python code and origin...

(LICENCE_LICENSE_NOUN_SINGULAR)


121-121: Fix subject-verb agreement

Change "they helps" to "they help".

-Please use the [Issue Tracker](https://github.com/WMD-group/smact/issues) to report bugs or request features in the first instance. While we hope that most questions can be answered by searching [the docs](https://smact.readthedocs.io/en/latest/), we welcome new questions on the issue tracker, especially if they helps us improve the docs!
+Please use the [Issue Tracker](https://github.com/WMD-group/smact/issues) to report bugs or request features in the first instance. While we hope that most questions can be answered by searching [the docs](https://smact.readthedocs.io/en/latest/), we welcome new questions on the issue tracker, especially if they help us improve the docs!
🧰 Tools
🪛 LanguageTool

[grammar] ~121-~121: The pronoun ‘they’ must be used with a non-third-person form of a verb.
Context: ...n the issue tracker, especially if they helps us improve the docs! For other queries ...

(NON3PRS_VERB)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7c54244 and bef9c5f.

📒 Files selected for processing (2)
  • README.md (4 hunks)
  • requirements-dev.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • requirements-dev.txt
🧰 Additional context used
🪛 LanguageTool
README.md

[misspelling] ~24-~24: Possible spelling mistake found.
Context: ...a enough, nature will always confess_ - Roland Coase (from 'How should economists choose?') ...

(EN_MULTITOKEN_SPELLING_TWO)


[style] ~46-~46: Consider a shorter alternative to avoid wordiness.
Context: ...lied to generated lists of compositions in order to screen for particular properties. These...

(IN_ORDER_TO_PREMIUM)


[style] ~66-~66: Would you like to use the Oxford spelling “initialize”? The spelling ‘initialise’ is also correct.
Context: ...es the loading of external data used to initialise the core smact.Element and `smact.Spe...

(OXFORD_SPELLING_Z_NOT_S)


[grammar] ~79-~79: It looks like ‘collections’ doesn’t match ‘a’. Did you mean “a collection” or just “collections”?
Context: ...rediction**: A submodule which contains a collections of tools for predicting dopants. ## Re...

(A_NNS_IN)


[locale-violation] ~113-~113: License must be spelled with a “c” when used as a noun in British English. Use “licence”.
Context: ... immediately reflected on the PATH. ## License and attribution Python code and origin...

(LICENCE_LICENSE_NOUN_SINGULAR)


[locale-violation] ~115-~115: License must be spelled with a “c” when used as a noun in British English. Use “licence”.
Context: ... data tables are licensed under the MIT License. ## Development notes ### Bugs, featu...

(LICENCE_LICENSE_NOUN_SINGULAR)


[grammar] ~121-~121: The pronoun ‘they’ must be used with a non-third-person form of a verb.
Context: ...n the issue tracker, especially if they helps us improve the docs! For other queries ...

(NON3PRS_VERB)

🔇 Additional comments (3)
README.md (3)

Line range hint 1-13: LGTM! Well-structured badges section

The badges are properly formatted and use secure HTTPS links.

🧰 Tools
🪛 LanguageTool

[locale-violation] ~3-~3: License must be spelled with a “c” when used as a noun in British English. Use “licence”.
Context: ...eadthedocs.org/en/latest/?badge=latest) [![License: MIT](https://img.shields.io/badge/Lice...

(LICENCE_LICENSE_NOUN_SINGULAR)


62-80: LGTM! Clear and well-structured modules list

The module descriptions are concise and informative.

🧰 Tools
🪛 LanguageTool

[style] ~66-~66: Would you like to use the Oxford spelling “initialize”? The spelling ‘initialise’ is also correct.
Context: ...es the loading of external data used to initialise the core smact.Element and `smact.Spe...

(OXFORD_SPELLING_Z_NOT_S)


[grammar] ~79-~79: It looks like ‘collections’ doesn’t match ‘a’. Did you mean “a collection” or just “collections”?
Context: ...rediction**: A submodule which contains a collections of tools for predicting dopants. ## Re...

(A_NNS_IN)


Line range hint 81-112: LGTM! Comprehensive installation instructions

Clear instructions provided for multiple installation methods.

🧰 Tools
🪛 LanguageTool

[misspelling] ~24-~24: Possible spelling mistake found.
Context: ...a enough, nature will always confess_ - Roland Coase (from 'How should economists choose?') ...

(EN_MULTITOKEN_SPELLING_TWO)


[style] ~46-~46: Consider a shorter alternative to avoid wordiness.
Context: ...lied to generated lists of compositions in order to screen for particular properties. These...

(IN_ORDER_TO_PREMIUM)


[style] ~66-~66: Would you like to use the Oxford spelling “initialize”? The spelling ‘initialise’ is also correct.
Context: ...es the loading of external data used to initialise the core smact.Element and `smact.Spe...

(OXFORD_SPELLING_Z_NOT_S)


[grammar] ~79-~79: It looks like ‘collections’ doesn’t match ‘a’. Did you mean “a collection” or just “collections”?
Context: ...rediction**: A submodule which contains a collections of tools for predicting dopants. ## Re...

(A_NNS_IN)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (8)
pyproject.toml (3)

41-58: Consider relaxing the pymatgen version constraint

The current pymatgen constraint >=2024.2.20,<2024.8.8 is quite restrictive and may cause dependency conflicts. Consider using a more flexible version range unless there are specific compatibility requirements.
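
One option (a sketch of a relaxed constraint, not a tested dependency set) is to keep only the lower bound and reintroduce an upper bound if a breaking pymatgen release is actually identified:

```toml
[project]
dependencies = [
    # Lower bound only; pin an upper bound again only when a
    # specific incompatible pymatgen release is known.
    "pymatgen>=2024.2.20",
]
```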


79-82: Fix formatting inconsistency in crystal_space dependencies

The crystal_space dependencies list uses inconsistent formatting. Consider reformatting for better readability:

-crystal_space = ["mp-api",
-"ElementEmbeddings",
-"umap-learn==0.5.3",
-"kaleido"]
+crystal_space = [
+    "mp-api",
+    "ElementEmbeddings",
+    "umap-learn==0.5.3",
+    "kaleido",
+]

124-124: Address TODO comments in ruff configuration

Several TODO comments indicate pending fixes:

  • Line 124: Fix bare except usage
  • Line 136: Fix magic value comparisons
  • Line 138: Fix global statement usage
  • Line 140: Add match parameter to pytest.raises

Consider creating issues to track these improvements.

Would you like me to help create GitHub issues for tracking these improvements?

Also applies to: 136-136, 138-138, 140-140
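
For the `pytest.raises` TODO, the `match` parameter is a regular expression searched against the string form of the raised exception, which makes the assertion specific to the expected failure. A minimal sketch (the function and message here are illustrative, not from the SMACT codebase):

```python
import pytest


def divide(a: float, b: float) -> float:
    # Hypothetical helper used only to demonstrate the pattern.
    if b == 0:
        raise ValueError("division by zero is not allowed")
    return a / b


def test_divide_rejects_zero():
    # `match` is applied with re.search, so a substring (or regex)
    # of the exception message is sufficient.
    with pytest.raises(ValueError, match="division by zero"):
        divide(1.0, 0.0)
```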

smact/tests/test_utils.py (5)

99-113: Ensure cleanup occurs even if the test fails.

Currently, the cleanup code using shutil.rmtree(save_mp_dir) is executed at the end of the test. If an exception occurs before this point, the temporary directory may not be deleted. Consider using self.addCleanup() to register the cleanup action, ensuring it runs regardless of the test outcome.

Apply this change:

def test_download_compounds_with_mp_api(self):
    save_mp_dir = "data/binary/mp_data"
+   self.addCleanup(shutil.rmtree, save_mp_dir)
    download_compounds_with_mp_api.download_mp_data(
        mp_api_key=os.environ.get("MP_API_KEY"),
        num_elements=2,
        max_stoich=1,
        save_dir=save_mp_dir,
    )
    # Check if the data was downloaded
    self.assertTrue(os.path.exists(save_mp_dir))
    self.assertTrue(len(os.listdir(save_mp_dir)) > 0)
-   # Clean up
-   shutil.rmtree(save_mp_dir)

135-135: Incomplete commented-out assertion.

Line 135 contains a commented-out assertion with incomplete code: # self.assertEqual(filtered_df.loc[""]). This might have been left unintentionally. Consider removing or completing this assertion to maintain code clarity.


137-154: Consider using temporary files for test outputs.

The test writes output files to the current directory and manually deletes them after the test. If the test fails before cleanup, these files may remain. Using the tempfile module or self.addCleanup() can ensure proper cleanup even if the test fails.
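
A sketch of the `tempfile` + `addCleanup` approach (the test name and file written are illustrative stand-ins for the real download call):

```python
import os
import shutil
import tempfile
import unittest


class TestDownloadCleanup(unittest.TestCase):
    def test_output_dir_is_always_removed(self):
        # mkdtemp + addCleanup guarantees the directory is removed
        # even if an assertion below fails or an exception is raised.
        save_dir = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, save_dir, ignore_errors=True)

        # Stand-in for the code under test writing into save_dir.
        with open(os.path.join(save_dir, "result.json"), "w") as fh:
            fh.write("{}")

        self.assertTrue(os.listdir(save_dir))
```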


156-160: Avoid hardcoding expected lengths to prevent brittle tests.

Hardcoding expected lengths (e.g., 490, 358, 227) may cause the tests to fail if the underlying data changes. Consider deriving expected lengths dynamically based on the input data or adding comments to explain why these specific values are expected.


168-170: Potential brittleness due to hardcoded DataFrame values.

Hardcoding the expected DataFrame shape and specific values (e.g., shape (490, 2), species "O2-", count 116910) may make the tests brittle if the data set changes. Consider making the test more flexible or ensure that the data remains consistent across environments.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between bef9c5f and 534544b.

📒 Files selected for processing (2)
  • pyproject.toml (1 hunks)
  • smact/tests/test_utils.py (3 hunks)
🔇 Additional comments (10)
pyproject.toml (2)

3-40: LGTM: Well-structured project metadata

The build system and project metadata are properly configured. The Python version requirements are consistently defined across both requires-python and classifiers.


94-158: Review security-related rule exclusions

Several security-related rules are currently ignored:

  • S301: Unsafe pickle usage
  • S311: Random module for cryptographic purposes
  • S314: XML security
  • S603, S607: Subprocess security
  • S608: SQL injection

Consider reviewing these exclusions and implementing proper security measures where applicable.

✅ Verification successful

Let me analyze the specific code usage around these security patterns to make a more informed decision.


Security rule exclusions are appropriately managed

The security concerns in the ignored rules are adequately addressed in the codebase:

  • Pickle usage is confined to test files only and not used in production code
  • No subprocess usage was found in the codebase
  • No XML parsing was detected
  • SQL queries are properly parameterised using the ? placeholder syntax, preventing SQL injection risks

The current rule exclusions appear justified as the potential security risks are either:

  • Limited to test code (pickle)
  • Not present in the codebase (subprocess, XML)
  • Properly mitigated (SQL using parameterised queries)
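
For reference, the safe pattern the verification refers to is binding values through `?` placeholders rather than interpolating them into the SQL string (a generic `sqlite3` sketch, not code from the repository):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE species (symbol TEXT, charge INTEGER)")

# Values are bound by the driver, never spliced into the SQL text,
# so a malicious `symbol` value cannot alter the query.
conn.execute("INSERT INTO species VALUES (?, ?)", ("O", -2))

row = conn.execute(
    "SELECT charge FROM species WHERE symbol = ?", ("O",)
).fetchone()
print(row[0])  # -2
```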
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for potential security issues in the codebase
echo "Checking for pickle usage..."
rg "pickle\." --type py

echo "Checking for subprocess usage..."
rg "subprocess\." --type py

echo "Checking for XML parsing..."
rg "xml\." --type py

echo "Checking for SQL query construction..."
rg "execute\(|cursor\.execute" --type py

Length of output: 1241


Script:

#!/bin/bash
# Check the context of pickle usage in tests
echo "Checking pickle usage context..."
rg -B 2 -A 2 "pickle\." smact/tests/test_structure.py

echo "Checking SQL query construction context..."
rg -B 2 -A 2 "execute\(" smact/structure_prediction/database.py

# Check if there are any parameterised queries
echo "Checking for parameterised queries..."
ast-grep --pattern 'execute($query, $params)'

Length of output: 1893

smact/tests/test_utils.py (8)

1-2: LGTM!

The addition of from __future__ import annotations enhances type hinting capabilities, allowing for postponed evaluation of type annotations.


3-4: LGTM!

Imports of os and shutil are necessary for file operations in the new tests.


7-7: LGTM!

Importing pandas as pd is appropriate for handling DataFrame operations in the added tests.


8-8: LGTM!

Importing SETTINGS and Composition from pymatgen.core is suitable for the test functionalities.


13-14: LGTM!

New imports from smact.utils.crystal_space and smact.utils.oxidation are required for the added test cases.


69-69: LGTM!

The assertion correctly tests the composition with specified oxidation states, ensuring accurate functionality of comp_maker.


95-97: LGTM!

The use of @unittest.skipUnless ensures the test runs only when an API key is available, preventing unnecessary failures.


115-118: LGTM!

Setting up file paths for test data is appropriate and enhances test readability.

Comment on lines +87 to +93
def test_convert_formula(self):
combinations = [Element("Li"), Element("O")]
expected_formulas = ["Li2O2", "LiO2", "Li2O", "Li2O2"]
compounds = generate_composition_with_smact.convert_formula(
combinations=combinations, num_elements=2, max_stoich=2
)
self.assertListEqual(expected_formulas, compounds)


⚠️ Potential issue

Duplicate entry in expected_formulas list.

The expected_formulas list contains "Li2O2" twice, which may be unintended. This could cause the test to fail if the compounds list does not match due to the duplicate. Consider removing the duplicate entry to ensure the test accurately reflects the expected outputs.

Apply this diff to correct the expected_formulas:

-expected_formulas = ["Li2O2", "LiO2", "Li2O", "Li2O2"]
+expected_formulas = ["Li2O2", "LiO2", "Li2O"]


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
smact/tests/test_utils.py (2)

137-153: Remove or complete the commented assertion.

There's an incomplete assertion that should either be completed or removed:

# self.assertEqual(filtered_df.loc[""])

173-176: Consider using named constants for magic numbers.

The threshold values and expected lengths could be defined as class constants for better maintainability.

Example refactor:

+    THRESHOLD_LENGTHS = {
+        0: 490,
+        5: 358,
+        50: 227
+    }
+
     def test_oxidation_states_filter_species_list(self):
-        for threshold, length in [(0, 490), (5, 358), (50, 227)]:
+        for threshold, length in self.THRESHOLD_LENGTHS.items():
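
Looping over such a mapping also pairs naturally with `subTest`, so one failing threshold does not mask the others (the mapping and lookup below are illustrative stand-ins for the real filter):

```python
import unittest


class TestThresholds(unittest.TestCase):
    THRESHOLD_LENGTHS = {0: 490, 5: 358, 50: 227}

    def filtered_length(self, threshold):
        # Stand-in for the real filtering call; returns the expected
        # value so this sketch is runnable on its own.
        return self.THRESHOLD_LENGTHS[threshold]

    def test_each_threshold(self):
        for threshold, expected in self.THRESHOLD_LENGTHS.items():
            # Each threshold is reported separately on failure.
            with self.subTest(threshold=threshold):
                self.assertEqual(self.filtered_length(threshold), expected)
```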
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 534544b and a998ced.

📒 Files selected for processing (1)
  • smact/tests/test_utils.py (3 hunks)
🔇 Additional comments (5)
smact/tests/test_utils.py (5)

1-14: LGTM! Well-organised imports.

The imports are properly structured and all imported modules are utilised in the test implementations.


95-129: LGTM! Well-structured API-dependent tests.

The tests properly handle:

  • API key verification
  • Data generation and verification
  • Proper cleanup of test data

132-134: LGTM! Robust path handling.

The test file paths are properly constructed using platform-independent methods.


178-187: LGTM! Well-structured test for species occurrences.

The test properly validates:

  • DataFrame structure
  • Column names
  • Expected data values

87-93: ⚠️ Potential issue

Fix duplicate entry in expected formulas list.

The list contains 'Li2O2' twice, which could lead to inconsistent test results.

Apply this correction:

-expected_formulas = ["Li2O2", "LiO2", "Li2O", "Li2O2"]
+expected_formulas = ["Li2O2", "LiO2", "Li2O"]

Likely invalid or redundant comment.

@AntObi AntObi merged commit ace22a8 into master Nov 21, 2024
2 of 11 checks passed