PET¶

PET is a cleaner, more user-friendly reimplementation of the original PET model [1]. It is designed for better modularity and maintainability, while preseving compatibility with the original PET implementation in metatrain. It also adds new features like long-range features, better fine-tuning implementation, a possibility to train on arbitrarty targets, and a faster inference due to the fast attention.

Installation¶

To install this architecture along with the metatrain package, run:

pip install metatrain[pet]

where the square brackets indicate that you want to install the optional dependencies required for pet.

Default Hyperparameters¶

The description of all the hyperparameters used in pet is provided further down this page. However, here we provide you with a yaml file containing all the default hyperparameters, which might be convenient as a starting point to create your own hyperparameter files:

architecture:
  name: pet
  model:
    cutoff: 4.5
    num_neighbors_adaptive: null
    cutoff_function: Bump
    cutoff_width: 0.5
    d_pet: 128
    d_head: 128
    d_node: 256
    d_feedforward: 256
    num_heads: 8
    num_attention_layers: 2
    num_gnn_layers: 2
    normalization: RMSNorm
    activation: SwiGLU
    transformer_type: PreLN
    featurizer_type: feedforward
    zbl: false
    long_range:
      enable: false
      use_ewald: false
      smearing: 1.4
      kspace_resolution: 1.33
      interpolation_nodes: 5
  training:
    distributed: false
    distributed_port: 39591
    batch_size: 16
    num_epochs: 1000
    warmup_fraction: 0.01
    learning_rate: 0.0001
    optimizer: Adam
    weight_decay: null
    log_interval: 1
    checkpoint_interval: 100
    atomic_baseline: {}
    scale_targets: true
    fixed_scaling_weights: {}
    per_structure_targets: []
    num_workers: null
    log_mae: true
    log_separate_blocks: false
    best_model_metric: mae_prod
    grad_clip_norm: 1.0
    loss: mse
    finetune:
      read_from: null
      method: full
      config: {}
      inherit_heads: {}

Tuning hyperparameters¶

The default hyperparameters above will work well in most cases, but they may not be optimal for your specific dataset. There is good number of parameters to tune, both for the model and the trainer. Since seeing them for the first time might be overwhelming, here we provide a list of the parameters that are in general the most important (in decreasing order of importance):

ModelHypers.cutoff: float = 4.5

Cutoff radius for neighbor search.

This should be set to a value after which most of the interactions between atoms is expected to be negligible. A lower cutoff will lead to faster models.

ModelHypers.num_neighbors_adaptive: int | None = None

Target number of neighbors for the adaptive cutoff scheme.

This parameter activates the adaptive cutoff functionality. Each atomic environments has a different cutoff, that is chosen such that the number of neighbors is approximately equal to this value. This can be useful to have a more uniform number of neighbors per atom, especially in sparse systems. Setting it to None disables this feature and uses all neighbors within the fixed cutoff radius.

TrainerHypers.learning_rate: float = 0.0001: Learning rate.

TrainerHypers.batch_size: int = 16: The number of samples to use in each batch of training. This hyperparameter controls the tradeoff between training speed and memory usage. In general, larger batch sizes will lead to faster training, but might require more memory.

ModelHypers.d_pet: int = 128

Dimension of the edge features.

This hyperparameters controls width of the neural network. In general, increasing it might lead to better accuracy, especially on larger datasets, at the cost of increased training and evaluation time.

ModelHypers.d_node: int = 256

Dimension of the node features.

Increasing this hyperparameter might lead to better accuracy, with a relatively small increase in inference time.

ModelHypers.num_gnn_layers: int = 2

The number of graph neural network layers.

In general, decreasing this hyperparameter to 1 will lead to much faster models, at the expense of accuracy. Increasing it may or may not lead to better accuracy, depending on the dataset, at the cost of increased training and evaluation time.

ModelHypers.num_attention_layers: int = 2: The number of attention layers in each layer of the graph neural network. Depending on the dataset, increasing this hyperparameter might lead to better accuracy, at the cost of increased training and evaluation time.

TrainerHypers.loss: str | dict[str, LossSpecification | str] = 'mse': This section describes the loss function to be used. See the Loss functions for more details.

ModelHypers.long_range: LongRangeHypers = {'enable': False, 'interpolation_nodes': 5, 'kspace_resolution': 1.33, 'smearing': 1.4, 'use_ewald': False}: Long-range Coulomb interactions parameters.

Model hyperparameters¶

The parameters that go under the architecture.model section of the config file are the following:

ModelHypers.cutoff: float = 4.5¶

Cutoff radius for neighbor search.

This should be set to a value after which most of the interactions between atoms is expected to be negligible. A lower cutoff will lead to faster models.

ModelHypers.num_neighbors_adaptive: int | None = None¶

Target number of neighbors for the adaptive cutoff scheme.

This parameter activates the adaptive cutoff functionality. Each atomic environments has a different cutoff, that is chosen such that the number of neighbors is approximately equal to this value. This can be useful to have a more uniform number of neighbors per atom, especially in sparse systems. Setting it to None disables this feature and uses all neighbors within the fixed cutoff radius.

ModelHypers.cutoff_function: Literal['Cosine', 'Bump'] = 'Bump'¶

Type of the smoothing function at the cutoff

ModelHypers.cutoff_width: float = 0.5¶

Width of the smoothing function at the cutoff

ModelHypers.d_pet: int = 128¶

Dimension of the edge features.

This hyperparameters controls width of the neural network. In general, increasing it might lead to better accuracy, especially on larger datasets, at the cost of increased training and evaluation time.

ModelHypers.d_head: int = 128¶

Dimension of the attention heads.

ModelHypers.d_node: int = 256¶

Dimension of the node features.

Increasing this hyperparameter might lead to better accuracy, with a relatively small increase in inference time.

ModelHypers.d_feedforward: int = 256¶

Dimension of the feedforward network in the attention layer.

ModelHypers.num_heads: int = 8¶

Attention heads per attention layer.

ModelHypers.num_attention_layers: int = 2¶

The number of attention layers in each layer of the graph neural network. Depending on the dataset, increasing this hyperparameter might lead to better accuracy, at the cost of increased training and evaluation time.

ModelHypers.num_gnn_layers: int = 2¶

The number of graph neural network layers.

In general, decreasing this hyperparameter to 1 will lead to much faster models, at the expense of accuracy. Increasing it may or may not lead to better accuracy, depending on the dataset, at the cost of increased training and evaluation time.

ModelHypers.normalization: Literal['RMSNorm', 'LayerNorm'] = 'RMSNorm'¶

Layer normalization type.

ModelHypers.activation: Literal['SiLU', 'SwiGLU'] = 'SwiGLU'¶

Activation function.

ModelHypers.transformer_type: Literal['PreLN', 'PostLN'] = 'PreLN'¶

The order in which the layer normalization and attention are applied in a transformer block. Available options are PreLN (normalization before attention) and PostLN (normalization after attention).

ModelHypers.featurizer_type: Literal['residual', 'feedforward'] = 'feedforward'¶

Implementation of the featurizer of the model to use. Available options are residual (the original featurizer from the PET paper, that uses residual connections at each GNN layer for readout) and feedforward (a modern version that uses the last representation after all GNN iterations for readout). Additionally, the feedforward version uses bidirectional features flow during the message passing iterations, that favors features flowing from atom i to atom j to be not equal to the features flowing from atom j to atom i.

ModelHypers.zbl: bool = False¶

Use ZBL potential for short-range repulsion

ModelHypers.long_range: LongRangeHypers = {'enable': False, 'interpolation_nodes': 5, 'kspace_resolution': 1.33, 'smearing': 1.4, 'use_ewald': False}¶

Long-range Coulomb interactions parameters.

Trainer hyperparameters¶

The parameters that go under the architecture.trainer section of the config file are the following:

TrainerHypers.distributed: bool = False¶

Whether to use distributed training

TrainerHypers.distributed_port: int = 39591¶

Port for distributed communication among processes

TrainerHypers.batch_size: int = 16¶

The number of samples to use in each batch of training. This hyperparameter controls the tradeoff between training speed and memory usage. In general, larger batch sizes will lead to faster training, but might require more memory.

TrainerHypers.num_epochs: int = 1000¶

Number of epochs.

TrainerHypers.warmup_fraction: float = 0.01¶

Fraction of training steps used for learning rate warmup.

TrainerHypers.learning_rate: float = 0.0001¶

Learning rate.

TrainerHypers.optimizer: Literal['Adam', 'AdamW', 'Muon'] = 'Adam'¶

Optimizer to use for training the model.

TrainerHypers.weight_decay: float | None = None¶

Weight decay coefficient. If None, no weight decay is used.

TrainerHypers.log_interval: int = 1¶

Interval to log metrics.

TrainerHypers.checkpoint_interval: int = 100¶

Interval to save checkpoints.

TrainerHypers.atomic_baseline: dict[str, float | dict[int, float]] = {}¶

The baselines for each target.

By default, metatrain will fit a linear model (CompositionModel) to compute the least squares baseline for each atomic species for each target.

However, this hyperparameter allows you to provide your own baselines. The value of the hyperparameter should be a dictionary where the keys are the target names, and the values are either (1) a single baseline to be used for all atomic types, or (2) a dictionary mapping atomic types to their baselines. For example:

atomic_baseline: {"energy": {1: -0.5, 6: -10.0}} will fix the energy baseline for hydrogen (Z=1) to -0.5 and for carbon (Z=6) to -10.0, while fitting the baselines for the energy of all other atomic types, as well as fitting the baselines for all other targets.

atomic_baseline: {"energy": -5.0} will fix the energy baseline for all atomic types to -5.0.

atomic_baseline: {"mtt:dos": 0.0} sets the baseline for the “mtt:dos” target to 0.0, effectively disabling the atomic baseline for that target.

This atomic baseline is substracted from the targets during training, which avoids the main model needing to learn atomic contributions, and likely makes training easier. When the model is used in evaluation mode, the atomic baseline is added on top of the model predictions automatically.

Note

This atomic baseline is a per-atom contribution. Therefore, if the property you are predicting is a sum over all atoms (e.g., total energy), the contribution of the atomic baseline to the total property will be the atomic baseline multiplied by the number of atoms of that type in the structure.

TrainerHypers.scale_targets: bool = True¶

Normalize targets to unit std during training.

TrainerHypers.fixed_scaling_weights: dict[str, float | dict[int, float]] = {}¶

Weights for target scaling.

This is passed to the fixed_weights argument of Scaler.train_model, see its documentation to understand exactly what to pass here.

TrainerHypers.per_structure_targets: list[str] = []¶

Targets to calculate per-structure losses.

TrainerHypers.num_workers: int | None = None¶

Number of workers for data loading. If not provided, it is set automatically.

TrainerHypers.log_mae: bool = True¶

Log MAE alongside RMSE

TrainerHypers.log_separate_blocks: bool = False¶

Log per-block error.

TrainerHypers.best_model_metric: Literal['rmse_prod', 'mae_prod', 'loss'] = 'mae_prod'¶

Metric used to select best checkpoint (e.g., rmse_prod)

TrainerHypers.grad_clip_norm: float = 1.0¶

Maximum gradient norm value, by default inf (no clipping)

TrainerHypers.loss: str | dict[str, LossSpecification | str] = 'mse'¶

This section describes the loss function to be used. See the Loss functions for more details.

TrainerHypers.finetune: NoFinetuneHypers | FullFinetuneHypers | LoRaFinetuneHypers | HeadsFinetuneHypers = {'config': {}, 'inherit_heads': {}, 'method': 'full', 'read_from': None}¶

Parameters for fine-tuning trained PET models.

See Fine-tune a pre-trained model for more details.