Molecule Usage
AugLiChem has been designed from the ground up with ease-of-use in mind.
Fully functional notebooks are available in the examples/ directory of our GitHub repository.
In-depth documentation of each function is given in the docstrings and can be printed with Python's built-in help() function.
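For example, a quick way to inspect a transformation's docstring (any AugLiChem class can be inspected the same way):
>>> from auglichem.molecule import RandomAtomMask
>>> help(RandomAtomMask)   # prints the class docstring and constructor arguments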
Using PyTorch’s CUDA support, all models and data sets can be used with GPUs.
This guide explains all of the features of the package. We have also provided Jupyter notebooks in the examples/ directory that are ready to run after installation and demonstrate each type of training.
The first step is to import the relevant modules. AugLiChem is largely self-contained, and so we import the transformations, data wrapper, and models.
Setup
from auglichem.molecule import Compose, RandomAtomMask, RandomBondDelete, MotifRemoval
from auglichem.molecule.data import MoleculeDatasetWrapper
from auglichem.molecule.models import AttentiveFP, GCN, DeepGCN, GINE
Next, we set up our transformations. Transformations can be set up as a list or single transformation. When using a list, each molecule is transformed by all transformations passed in.
Creating Augmentations
transforms = Compose([
RandomAtomMask([0.1,0.3]),
RandomBondDelete([0.1, 0.4]),
MotifRemoval(0.6)
])
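As mentioned above, a single transformation can also be passed on its own instead of a Compose. A minimal example (the probability is passed positionally, as in the Compose above):
# Using a single transformation, no Compose needed
transform = RandomAtomMask(0.25)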
RandomAtomMask arguments:
- p (float, list of floats, default=0.5): Probability of each atom being masked in the molecule. At least one atom is always masked. If a list is passed, a value is sampled uniformly between the given bounds for each molecule.
RandomBondDelete arguments:
- p (float, list of floats, default=0.5): Probability of each bond being deleted in the molecule. If a list is passed, a value is sampled uniformly between the given bounds for each molecule.
MotifRemoval arguments:
- similarity_threshold (float): Threshold used to decide which motifs are retained in the augmented structure.
Note: MotifRemoval retains a copy of each motif while training. That is, the original data and each motif are used in training, along with the data and motifs augmented by any additional transformations.
The Compose object is used to apply multiple transformations at once. It takes in a list of transformations and applies them one at a time when called.
Compose arguments:
- transforms (list of transforms): A list of transforms to be applied.
- p (float, optional, default=1): The probability of each transformation being applied.
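For example, each transformation in the list can be applied with 50% probability by setting p, shown here as a keyword argument mirroring the p argument documented above:
transforms = Compose([
    RandomAtomMask([0.1, 0.3]),
    RandomBondDelete([0.1, 0.4])
], p=0.5)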
Data Loading
After initializing our transformations, we are ready to initialize our data set.
Data sets are selected with a string, and are automatically downloaded to ./data_download
by default.
This directory is created if it is not present, and the data is not downloaded again if it is already there.
Batch size, validation size, and test size for training and evaluation are set here.
The transforms are passed in here, and either random or scaffold splitting can be selected.
dataset = MoleculeDatasetWrapper(
dataset="ClinTox",
transform=transforms,
split="scaffold",
batch_size=128,
num_workers=0,
valid_size=0.1,
test_size=0.1,
aug_time=0,
data_path="./data_download",
seed=None
)
MoleculeDatasetWrapper arguments:
- dataset (str): Name of the data set to download and use (e.g. "ClinTox" or "QM8").
- transform (AbstractTransformation or Compose, optional): The molecule transformation(s) to apply.
- split (str, optional, default='scaffold'): 'random' or 'scaffold'. The splitting strategy used for train/validation/test set creation.
- batch_size (int, optional, default=64): Batch size used in training.
- num_workers (int, optional, default=0): Number of workers used in loading data.
- valid_size (float in [0,1], optional, default=0.1): Fraction of the data used for validation.
- test_size (float in [0,1], optional, default=0.1): Fraction of the data used for testing.
- aug_time (int, optional, default=0): Number of times to call each augmentation.
- data_path (str, optional, default=None): Path used to save and look up data. By default a data_download directory is created and data is stored there.
- seed (int, optional, default=None): Random seed to use for reproducibility.
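As an illustrative variation (not required), the same wrapper can be configured with a random split and multiple augmentation passes, re-using the arguments documented above:
dataset = MoleculeDatasetWrapper(
    dataset="ClinTox",
    transform=transforms,
    split="random",    # random instead of scaffold splitting
    batch_size=128,
    aug_time=2         # apply each augmentation twice
)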
After loading our data, the dataset object has additional information from its parent class, MoleculeDataset, that may be useful to inspect.
We can look at the SMILES representation of each molecule in the data, as well as the targets:
>>> print(dataset.smiles_data)
['[C@@H]1([C@@H]([C@@H]([C@H]([C@@H]([C@@H]1Cl)Cl)Cl)Cl)Cl)Cl'
'[C@H]([C@@H]([C@@H](C(=O)[O-])O)O)([C@H](C(=O)[O-])O)O'
'[H]/[NH+]=C(/C1=CC(=O)/C(=C\\C=c2ccc(=C([NH3+])N)cc2)/C=C1)\\N' ...
'O=[Zn]' 'OCl(=O)(=O)=O' 'S=[Se]=S']
and the labels can be viewed with:
>>> print(dataset.labels)
{'CT_TOX': array([0, 0, 0, ..., 0, 0, 0]), 'FDA_APPROVED': array([1, 1, 1, ..., 1, 1, 1])}
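Since labels is a dictionary of NumPy arrays keyed by target name, the label vector for a single target can be pulled out directly:
>>> print(dataset.labels["FDA_APPROVED"][:3])
[1 1 1]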
Data Splitting
Using the wrapper class is preferred for easy training because of its get_data_loaders function, which creates PyTorch Geometric data loaders that are easy to iterate over. With multi-target data sets, such as ClinTox, we specify the target we want here. If no target is selected, the first target in the downloaded data file is used. Multiple targets can be selected for multi-target training by passing in a list of targets, or 'all' to use all of them.
train_loader, valid_loader, test_loader = dataset.get_data_loaders("FDA_APPROVED")
MoleculeDatasetWrapper.get_data_loaders() arguments:
- target (str, list of str, optional): Target name(s) to get data loaders for. If None, returns loaders for the first target. If 'all', returns data for all targets at once, ideal for multi-target training.
Returns:
- train_loader, valid_loader, test_loader (DataLoader): Data loaders containing the train, validation, and test splits of our data.
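The returned loaders yield PyTorch Geometric batch objects, so a quick sanity check of one batch looks like this (the exact fields printed depend on the data set):
# Peek at a single batch from the training loader
data = next(iter(train_loader))
print(data)         # a torch_geometric Batch: node features, edge_index, labels y, ...
print(data.y.shape) # labels for the selected target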
Now that our data is ready for training and evaluation, we initialize our model. The task, either regression or classification, needs to be passed in; our dataset object stores this in its task attribute.
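For example, since ClinTox is a classification data set, this prints:
>>> print(dataset.task)
classification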
Model Initialization
model = AttentiveFP(
task=dataset.task,
emb_dim=300,
num_layers=5,
num_timesteps=3,
drop_ratio=0,
output_dim=None
)
AttentiveFP arguments:
- task (str): 'classification' or 'regression'.
- emb_dim (int): Embedding dimensionality.
- num_layers (int): Number of GNN layers.
- num_timesteps (int): Number of iterative refinement steps for global readout.
- drop_ratio (float, optional, default=0.0): Dropout probability.
- output_dim (int, optional): Output dimension. Defaults to 1 if task='regression', 2 if task='classification'. Pass in the number of targets if doing multi-target classification.
model = DeepGCN(
    emb_dim=128,
    aggr='softmax',
    t=1.0,
    learn_t=False,
    p=1.0,
    learn_p=False,
    msg_norm=False,
    learn_msg_scale=False,
    norm='batch',
    num_layer=2,
    eps=1e-7
)
DeepGCN arguments:
- emb_dim (int): Embedding dimensionality.
- aggr (str, optional, default='softmax'): Aggregation function, one of 'softmax', 'softmax_sg', 'power', 'add', 'mean', 'max'.
- t (float, optional, default=1.0): Scaling parameter for softmax and softmax_sg aggregation.
- learn_t (bool, optional, default=False): Flag to learn t or not.
- p (float, optional, default=1.0): Power used for power aggregation.
- learn_p (bool, optional, default=False): Flag to learn p or not.
- msg_norm (bool, optional, default=False): Flag to normalize messages or not.
- learn_msg_scale (bool, optional, default=False): Flag to learn the message norm scale or not.
- norm (str, optional, default='batch'): Type of norm to use in the MLP. One of 'batch', 'layer', or 'instance'.
- num_layer (int, optional, default=2): Number of layers in the network.
- eps (float, optional, default=1e-7): Small value added to the message output.
model = GCN(
task=dataset.task,
emb_dim=300,
feat_dim=256,
num_layers=5,
pool='mean',
drop_ratio=0,
output_dim=None
)
GCN arguments:
- task (str): 'classification' or 'regression'.
- emb_dim (int): Embedding dimensionality.
- feat_dim (int): Feature dimensionality before the final prediction layers.
- num_layers (int): Number of GNN layers.
- pool (str): Pooling function to be used. One of 'mean', 'add', 'max'.
- drop_ratio (float, optional, default=0.0): Dropout probability.
- output_dim (int, optional): Output dimension. Defaults to 1 if task='regression', 2 if task='classification'. Pass in the number of targets if doing multi-target classification.
model = GINE(
task=dataset.task,
emb_dim=300,
feat_dim=256,
num_layers=5,
pool='mean',
drop_ratio=0,
output_dim=None
)
GINE arguments:
- task (str): 'classification' or 'regression'.
- emb_dim (int): Embedding dimensionality.
- feat_dim (int): Feature dimensionality before the final prediction layers.
- num_layers (int): Number of GNN layers.
- pool (str): Pooling function to be used. One of 'mean', 'add', 'max'.
- drop_ratio (float, optional, default=0.0): Dropout probability.
- output_dim (int, optional): Output dimension. Defaults to 1 if task='regression', 2 if task='classification'. Pass in the number of targets if doing multi-target classification.
After initializing one of the models as seen above, we are ready to train using the standard PyTorch training procedure.
Single Target Training
import torch
from tqdm import tqdm

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
Now we have our training loop.
for epoch in range(100):
    for bn, data in tqdm(enumerate(train_loader)):
        optimizer.zero_grad()
        _, pred = model(data)
        loss = criterion(pred, data.y.flatten())
        loss.backward()
        optimizer.step()
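A validation pass over valid_loader can be added in the same way, for example to monitor the loss after each epoch. This is a minimal sketch using the objects defined above, not part of the AugLiChem API:
# Hypothetical validation check, run after each training epoch
with torch.no_grad():
    model.eval()
    val_loss = 0.
    for data in valid_loader:
        _, pred = model(data)
        val_loss += criterion(pred, data.y.flatten()).item()
    print("Validation loss: {0:.3f}".format(val_loss / len(valid_loader)))
model.train()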
Evaluation
Evaluation requires storing all predictions and labels for each batch, and so we have:
from sklearn.metrics import roc_auc_score

with torch.no_grad():
    model.eval()
    all_preds = torch.Tensor()
    all_labels = torch.Tensor()
    for data in test_loader:
        _, pred = model(data)
        # Hold on to all predictions and labels
        all_preds = torch.cat([all_preds, pred[:,1]])
        all_labels = torch.cat([all_labels, data.y])
    metric = roc_auc_score(all_labels.cpu(), all_preds.cpu().detach())
    print("TEST ROC: {0:.3f}".format(metric))
Multi-target Training
AugLiChem supports multi-target training as well. When working with a data set that has multiple targets, we can pass in a list of targets we want, or use all targets at once. In this example, we use QM8, a multi-target regression set.
dataset = MoleculeDatasetWrapper("QM8", data_path="./data_download", transform=transforms, batch_size=5)
train_loader, valid_loader, test_loader = dataset.get_data_loaders("all")
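Note that for multi-target training the model needs one output column per target. A minimal sketch of re-initializing a model for this setup (GCN is chosen arbitrarily here, and output_dim is assumed to behave for regression as documented above for classification):
model = GCN(
    task=dataset.task,                            # 'regression' for QM8
    output_dim=len(train_loader.dataset.target)   # one prediction column per target
)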
Because these data sets often have labels for some, but not all, targets, empty label values have been filled with a placeholder that we skip during training. Our training setup mirrors single-target training, but since QM8 is a regression data set we use a regression loss:
import numpy as np
from sklearn.metrics import mean_squared_error

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
In our training loop, we see that we only compute the loss when we have a label corresponding to a molecule.
for epoch in range(100):
    for bn, data in tqdm(enumerate(train_loader)):
        optimizer.zero_grad()
        loss = 0.
        # Get prediction for all data
        _, pred = model(data)
        for idx, t in enumerate(train_loader.dataset.target):
            # Get indices where target has a value
            good_idx = np.where(data.y[:,idx]!=-999999999)
            current_preds = pred[:,idx][good_idx]
            current_labels = data.y[:,idx][good_idx]
            loss += criterion(current_preds, current_labels)
        loss.backward()
        optimizer.step()
When evaluating, we need to iterate over all targets and, again, skip data when there is no label:
with torch.no_grad():
    # All targets we're evaluating
    target_list = test_loader.dataset.target
    # Dictionaries to keep track of predictions and labels for all targets
    all_preds = {target: [] for target in target_list}
    all_labels = {target: [] for target in target_list}
    model.eval()
    for data in test_loader:
        # Get prediction for all data
        _, pred = model(data)
        for idx, target in enumerate(target_list):
            # Get indices where target has a value
            good_idx = np.where(data.y[:,idx]!=-999999999)
            current_preds = pred[:,idx][good_idx]
            current_labels = data.y[:,idx][good_idx]
            # Save predictions and targets
            all_preds[target].extend(list(current_preds.detach().cpu().numpy()))
            all_labels[target].extend(list(current_labels.detach().cpu().numpy()))
    scores = {target: None for target in target_list}
    for target in target_list:
        scores[target] = mean_squared_error(all_labels[target], all_preds[target],
                                            squared=False)
        print("{0} TEST RMSE: {1:.5f}".format(target, scores[target]))
Training with CUDA
AugLiChem takes advantage of PyTorch's CUDA support to leverage GPUs for faster training and evaluation. To put a model on the GPU, we call its .cuda() method.
model = GCN(task=dataset.task)
model.cuda()
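If you want the example to fall back to the CPU when no GPU is present, the standard PyTorch device pattern (not specific to AugLiChem) can be used instead:
# Standard PyTorch pattern: use the GPU only if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GCN(task=dataset.task).to(device)
With this pattern, batches are moved with data.to(device) instead of data.cuda() in the loops below.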
Our training setup is the same as before:
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
The only difference in our training loop is putting our data on the GPU as we train:
for epoch in range(100):
    for bn, data in tqdm(enumerate(train_loader)):
        optimizer.zero_grad()
        # data -> GPU
        _, pred = model(data.cuda())
        loss = criterion(pred[:,0], data.y.flatten())
        loss.backward()
        optimizer.step()
Which we also do for evaluation:
from sklearn.metrics import mean_squared_error

task = test_loader.dataset.task
with torch.no_grad():
    model.eval()
    all_preds = torch.Tensor()
    all_labels = torch.Tensor()
    for data in test_loader:
        # data -> GPU
        _, pred = model(data.cuda())
        # Hold on to all predictions and labels, moved back to the CPU
        all_preds = torch.cat([all_preds, pred[:,0].detach().cpu()])
        all_labels = torch.cat([all_labels, data.y.flatten().cpu()])
    metric = mean_squared_error(all_labels, all_preds, squared=False)
    print("TEST RMSE: {0:.3f}".format(metric))