Datasets
This module provides benchmark datasets for graph-level and node-level prediction.
Datasets are automatically downloaded and saved locally on first usage.
You can configure the path where the data are stored by creating a `~/.spektral/config.json` file with the following content:

```json
{
  "dataset_folder": "/path/to/dataset/folder"
}
```
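The config file above can be written with the standard library alone. The snippet below is a minimal sketch; the `write_spektral_config` helper and its `home` override are illustrative, not part of Spektral.

```python
import json
import os

def write_spektral_config(dataset_folder, home=None):
    """Create <home>/.spektral/config.json pointing at a custom data folder.

    Hypothetical helper for illustration; Spektral only reads the file,
    it does not provide a writer like this.
    """
    home = home or os.path.expanduser("~")
    config_dir = os.path.join(home, ".spektral")
    os.makedirs(config_dir, exist_ok=True)
    path = os.path.join(config_dir, "config.json")
    with open(path, "w") as f:
        json.dump({"dataset_folder": dataset_folder}, f, indent=2)
    return path

# Uncomment to write the real config in your home directory:
# write_spektral_config("/path/to/dataset/folder")
```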
Citation
spektral.datasets.citation.Citation(name, random_split=False, normalize_x=False, dtype=<class 'numpy.float32'>)
The citation datasets Cora, Citeseer and Pubmed.
Node attributes are bag-of-words vectors representing the most common words in the text document associated with each node. Two papers are connected if either one cites the other. Labels represent the subject area of the paper.
The train, validation, and test splits are given as binary masks and are accessible via the `mask_tr`, `mask_va`, and `mask_te` attributes.
Arguments
- `name`: name of the dataset to load (`'cora'`, `'citeseer'`, or `'pubmed'`);
- `random_split`: if True, return a randomized split (20 nodes per class for training, 30 nodes per class for validation, and the remaining nodes for testing, as recommended by Shchur et al. (2018)). If False (default), return the "Planetoid" public splits defined by Yang et al. (2016);
- `normalize_x`: if True, normalize the features;
- `dtype`: numpy dtype of graph data.
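As an illustration of how boolean split masks like these partition a node set, here is a minimal numpy sketch. The sizes and the `y` array are made up; only the names `mask_tr`, `mask_va`, and `mask_te` mirror the attributes above.

```python
import numpy as np

# Toy example: 10 nodes with stand-in per-node targets.
n_nodes = 10
y = np.arange(n_nodes)

# One boolean entry per node; True marks membership in that split.
mask_tr = np.zeros(n_nodes, dtype=bool)
mask_va = np.zeros(n_nodes, dtype=bool)
mask_te = np.zeros(n_nodes, dtype=bool)
mask_tr[:4] = True
mask_va[4:7] = True
mask_te[7:] = True

# The masks partition the nodes: every node is in exactly one split.
assert not np.any(mask_tr & mask_va) and not np.any(mask_va & mask_te)
assert np.all(mask_tr | mask_va | mask_te)

# Boolean indexing selects the targets for one split.
y_train = y[mask_tr]
```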
DBLP
spektral.datasets.dblp.DBLP(normalize_x=False, dtype=<class 'numpy.float32'>)
A subset of the DBLP computer science bibliography website, as collected in the Fu et al. (2020) paper.
Arguments
- `normalize_x`: if True, normalize the features;
- `dtype`: numpy dtype of graph data.
Flickr
spektral.datasets.flickr.Flickr(normalize_x=False, dtype=<class 'numpy.float32'>)
The Flickr dataset from the Zeng et al. (2019) paper, containing descriptions and common properties of images.
Arguments
- `normalize_x`: if True, normalize the features;
- `dtype`: numpy dtype of graph data.
GraphSage
spektral.datasets.graphsage.GraphSage(name)
The datasets used in the paper "Inductive Representation Learning on Large Graphs" by William L. Hamilton et al.
The PPI dataset (originally Stark et al. (2006)) for inductive node classification uses positional gene sets, motif gene sets and immunological signatures as features and gene ontology sets as labels.
The Reddit dataset consists of a graph made of Reddit posts in the month of September, 2014. The label for each node is the community that a post belongs to. The graph is built by sampling 50 large communities and two nodes are connected if the same user commented on both. Node features are obtained by concatenating the average GloVe CommonCrawl vectors of the title and comments, the post's score and the number of comments.
The train, validation, and test splits are given as binary masks and are accessible via the `mask_tr`, `mask_va`, and `mask_te` attributes.
Arguments
- `name`: name of the dataset to load (`'ppi'` or `'reddit'`).
PPI
spektral.datasets.graphsage.PPI()
Alias for `GraphSage('ppi')`.
Reddit
spektral.datasets.graphsage.Reddit()
Alias for `GraphSage('reddit')`.
MNIST
spektral.datasets.mnist.MNIST(p_flip=0.0, k=8)
The MNIST images used as node features for a grid graph, as described by Defferrard et al. (2016).
This dataset is a graph signal classification task, where graphs are represented in mixed mode: one adjacency matrix, many instances of node features.
For efficiency, the adjacency matrix is stored in a special attribute of the dataset and the Graphs only contain the node features. You can access the adjacency matrix via the `a` attribute.
The node features of each graph are the MNIST digits vectorized and rescaled to [0, 1]. Two nodes are connected if they are neighbours on the grid. Labels represent the MNIST class associated with each sample.
Note: the last 10000 samples are the default test set of the MNIST dataset.
Arguments
- `p_flip`: if > 0, then edges are randomly flipped from 0 to 1 or vice versa with that probability;
- `k`: number of neighbours of each node.
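A k-nearest-neighbour grid adjacency and random edge flipping of the kind described above can be sketched in plain numpy. The helper names below are hypothetical, and the actual loader's implementation may differ (e.g. in distance metric or tie-breaking).

```python
import numpy as np

def grid_knn_adjacency(side, k):
    """Binary adjacency of a side x side grid, connecting each node to
    its k nearest neighbours by Euclidean distance (then symmetrized)."""
    ys, xs = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-loops
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbours per node
    n = side * side
    a = np.zeros((n, n), dtype=np.int8)
    a[np.repeat(np.arange(n), k), nn.ravel()] = 1
    return np.maximum(a, a.T)              # make the graph undirected

def flip_edges(a, p_flip, rng):
    """Flip each off-diagonal entry of a binary adjacency with
    probability p_flip, keeping the matrix symmetric."""
    mask = np.triu(rng.random(a.shape) < p_flip, k=1)
    mask = mask | mask.T
    out = np.where(mask, 1 - a, a)
    np.fill_diagonal(out, 0)
    return out

a = grid_knn_adjacency(side=5, k=4)
a_noisy = flip_edges(a, p_flip=0.1, rng=np.random.default_rng(0))
```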
ModelNet
spektral.datasets.modelnet.ModelNet(name, test=False, n_jobs=-1)
The ModelNet10 and ModelNet40 CAD model datasets from the paper "3D ShapeNets: A Deep Representation for Volumetric Shapes" by Zhirong Wu et al.
Each graph represents a CAD model belonging to one of 10 (or 40) categories.
The models are polygon meshes: the node attributes are the 3d coordinates of the vertices, and edges are computed from each face. Duplicate edges are ignored and the adjacency matrix is binary.
The datasets are pre-split into training and test sets: the `test` flag controls which split is loaded.
Arguments
- `name`: name of the dataset to load (`'10'` or `'40'`);
- `test`: if True, load the test set instead of the training set;
- `n_jobs`: number of CPU cores to use for reading the data (-1 to use all available cores).
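The edges-from-faces construction described above (each mesh face contributes its three edges, duplicates collapse, adjacency stays binary) can be sketched as follows; `adjacency_from_faces` is a hypothetical helper, not part of Spektral.

```python
import numpy as np

def adjacency_from_faces(n_nodes, faces):
    """Binary adjacency from triangular mesh faces: each face (i, j, k)
    contributes edges i-j, j-k, i-k; duplicate edges collapse to 1."""
    a = np.zeros((n_nodes, n_nodes), dtype=np.int8)
    for i, j, k in faces:
        for u, v in ((i, j), (j, k), (i, k)):
            a[u, v] = a[v, u] = 1
    return a

# Two triangles sharing an edge: the shared edge 1-2 is counted once.
faces = [(0, 1, 2), (1, 2, 3)]
a = adjacency_from_faces(4, faces)
```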
OGB
spektral.datasets.ogb.OGB(dataset)
Wrapper for datasets from the Open Graph Benchmark (OGB).
Arguments
- `dataset`: an OGB library-agnostic dataset.
QM7
spektral.datasets.qm7.QM7()
The QM7b dataset of molecules from the paper "MoleculeNet: A Benchmark for Molecular Machine Learning" by Zhenqin Wu et al.
The dataset has no node features. Edges and edge features are obtained from the Coulomb matrices of the molecules.
Each graph has a 14-dimensional label for regression.
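One plausible way to derive an edge list and scalar edge features from a symmetric Coulomb matrix, as a sketch under assumptions (the helper name and the toy 3-atom matrix are illustrative, not taken from the loader):

```python
import numpy as np

def coulomb_to_edges(c):
    """Treat nonzero off-diagonal entries of a symmetric Coulomb matrix
    as edges, with the matrix entry as a one-dimensional edge feature."""
    c = np.asarray(c, dtype=float)
    rows, cols = np.nonzero(np.triu(c, k=1))  # upper triangle, no diagonal
    edges = np.stack([rows, cols], axis=1)
    edge_features = c[rows, cols][:, None]
    return edges, edge_features

# Toy 3-atom matrix: diagonal holds self-interaction terms,
# off-diagonal entries encode pairwise Coulomb repulsion.
c = np.array([[36.9, 5.5, 0.0],
              [5.5, 36.9, 3.1],
              [0.0, 3.1, 36.9]])
edges, e = coulomb_to_edges(c)
```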
QM9
spektral.datasets.qm9.QM9(amount=None, n_jobs=1)
The QM9 chemical data set of small molecules.
In this dataset, nodes represent atoms and edges represent chemical bonds. There are 5 possible atom types (H, C, N, O, F) and 4 bond types (single, double, triple, aromatic).
Node features represent the chemical properties of each atom and include:
- The atomic number, one-hot encoded;
- The atom's position in the X, Y, and Z dimensions;
- The atomic charge;
- The mass difference from the monoisotope.
The edge features represent the type of chemical bond between two atoms, one-hot encoded.
Each graph has a 19-dimensional label for regression.
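The one-hot bond encoding described above can be illustrated with a small numpy sketch; `BOND_TYPES` and `one_hot_bonds` are hypothetical names, not part of the loader's API.

```python
import numpy as np

# The four QM9 bond types, one column each in the encoding.
BOND_TYPES = ["single", "double", "triple", "aromatic"]

def one_hot_bonds(bonds):
    """Encode a list of bond-type strings as one-hot edge features."""
    index = {b: i for i, b in enumerate(BOND_TYPES)}
    e = np.zeros((len(bonds), len(BOND_TYPES)), dtype=np.float32)
    for row, b in enumerate(bonds):
        e[row, index[b]] = 1.0
    return e

e = one_hot_bonds(["single", "aromatic", "double"])
```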
Arguments
- `amount`: int, load this many molecules instead of the full dataset (useful for debugging);
- `n_jobs`: number of CPU cores to use for reading the data (-1 to use all available cores).
TUDataset
spektral.datasets.tudataset.TUDataset(name, clean=False)
The Benchmark Data Sets for Graph Kernels from TU Dortmund.
Node features are computed by concatenating the following features for each node:
- node attributes, if available;
- node labels, if available, one-hot encoded.
Some datasets might not have node features at all. In this case, attempting to use the dataset with a Loader will result in a crash. You can create node features using some of the transforms available in `spektral.transforms`, or you can define your own features by accessing the individual samples in the `graph` attribute of the dataset (which is a list of `Graph` objects).
Edge features are computed by concatenating the following features for each edge:
- edge attributes, if available;
- edge labels, if available, one-hot encoded.
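The concatenation scheme described above for node and edge features can be sketched like this; `concat_features` is a hypothetical helper, and Spektral's internals may differ.

```python
import numpy as np

def concat_features(attributes, labels, n_classes):
    """Concatenate continuous attributes with one-hot encoded integer
    labels, mirroring how TUDataset-style features can be assembled."""
    one_hot = np.eye(n_classes, dtype=np.float32)[labels]
    if attributes is None:  # some datasets have labels but no attributes
        return one_hot
    return np.concatenate([attributes.astype(np.float32), one_hot], axis=1)

attrs = np.array([[0.5, 1.0], [0.1, 0.2]])  # 2 nodes, 2 attributes each
labels = np.array([0, 2])                   # integer node labels
x = concat_features(attrs, labels, n_classes=3)
```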
Graph labels are provided for each dataset.
Specific details about each individual dataset can be found in `~/spektral/datasets/TUDataset/<dataset name>/README.md`, after the dataset has been downloaded locally (datasets are downloaded automatically the first time `TUDataset('<dataset name>')` is called).
Arguments
- `name`: str, name of the dataset to load (see `TUD.available_datasets`);
- `clean`: if True, load a version of the dataset with no isomorphic graphs.