Datasets

Citation networks

load_data

spektral.datasets.citation.load_data(dataset_name='cora', normalize_features=True, random_split=False)

Loads a citation dataset (Cora, Citeseer, or Pubmed) using the "Planetoid" splits initially defined in Yang et al. (2016). The train, test, and validation splits are given as binary masks.

Node attributes are bag-of-words vectors representing the most common words in the text document associated with each node. Two papers are connected if either one cites the other. Labels represent the class of each paper.

Arguments

  • dataset_name: name of the dataset to load ('cora', 'citeseer', or 'pubmed');

  • normalize_features: if True, the node features are normalized;

  • random_split: if True, return a randomized split with 20 nodes per class for training, 30 nodes per class for validation, and the remaining nodes for testing, as in Shchur et al. (2018).

Return

  • Adjacency matrix;
  • Node features;
  • Labels;
  • Three binary masks for train, validation, and test splits.
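
A minimal usage sketch, unpacking the return values in the order listed above:

from spektral.datasets import citation

# Load Cora with the standard Planetoid splits
A, X, y, train_mask, val_mask, test_mask = citation.load_data('cora')

# The masks select the rows of X and y used in each phase
print(A.shape, X.shape, y.shape)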

GraphSAGE datasets

load_data

spektral.datasets.graphsage.load_data(dataset_name, max_degree=-1, normalize_features=True)

Loads one of the datasets (PPI or Reddit) used in Hamilton et al. (2017).

The PPI dataset (originally from Stark et al. (2006)) is used for inductive node classification: positional gene sets, motif gene sets, and immunological signatures are the node features, and gene ontology sets are the labels.

The Reddit dataset consists of a graph of Reddit posts from September 2014. The label of each node is the community that the post belongs to. The graph is built by sampling 50 large communities and connecting two posts if the same user commented on both. Node features are obtained by concatenating the average GloVe CommonCrawl vectors of the title and comments, the post's score, and the number of comments.

The train, test, and validation splits are returned as binary masks.

Arguments

  • dataset_name: name of the dataset to load ('ppi' or 'reddit');

  • max_degree: int; if positive, subsample edges so that no node exceeds the specified maximum degree;

  • normalize_features: if True, the node features are normalized.

Return

  • Adjacency matrix;
  • Node features;
  • Labels;
  • Three binary masks for train, validation, and test splits.
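
For example, a sketch that loads PPI and caps the node degree (unpacking follows the Return list above; the degree cap of 128 is an arbitrary choice):

from spektral.datasets import graphsage

# Load PPI, subsampling edges so that no node exceeds degree 128
A, X, y, train_mask, val_mask, test_mask = graphsage.load_data('ppi', max_degree=128)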

TU Dortmund Benchmark Datasets for Graph Kernels

load_data

spektral.datasets.tud.load_data(dataset_name, clean=False)

Loads one of the Benchmark Data Sets for Graph Kernels from TU Dortmund (link). The node features are computed by concatenating the following features for each node:

  • node attributes, if available, normalized as specified in normalize_features;
  • clustering coefficient, normalized with z-score;
  • node degrees, normalized as specified in normalize_features;
  • node labels, if available, one-hot encoded.

Arguments

  • dataset_name: name of the dataset to load (see spektral.datasets.tud.AVAILABLE_DATASETS);

  • clean: if True, return a version of the dataset with no isomorphic graphs.

Return

  • a list of adjacency matrices;
  • a list of node feature matrices;
  • a numpy array containing the one-hot encoded targets.
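
A minimal sketch, using 'PROTEINS' as an example of an available benchmark (check AVAILABLE_DATASETS for the full list):

from spektral.datasets import tud

# Load the PROTEINS benchmark without isomorphic graphs
A_list, X_list, y = tud.load_data('PROTEINS', clean=True)

print(len(A_list), y.shape)  # one adjacency matrix per graph; one-hot targets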

Open Graph Benchmark (OGB)

graph_to_numpy

spektral.datasets.ogb.graph_to_numpy(graph, dtype=None)

Converts a graph in OGB's library-agnostic format to a representation in Numpy/Scipy. See the Open Graph Benchmark's website for more information.

Arguments

  • graph: OGB library-agnostic graph;

  • dtype: if set, all output arrays will be cast to this dtype.

Return

  • X: np.array of shape (N, F) with the node features;
  • A: scipy.sparse adjacency matrix of shape (N, N) in COOrdinate format;
  • E: if edge features are available, np.array of shape (n_edges, S), None otherwise.
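
For example, a sketch that converts a single graph (assumes the ogb package is installed; 'ogbg-molhiv' is one of OGB's graph property prediction datasets):

from ogb.graphproppred import GraphPropPredDataset
from spektral.datasets.ogb import graph_to_numpy

dataset = GraphPropPredDataset(name='ogbg-molhiv')
graph, label = dataset[0]  # OGB datasets yield (graph, label) pairs

X, A, E = graph_to_numpy(graph, dtype='float32')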

dataset_to_numpy

spektral.datasets.ogb.dataset_to_numpy(dataset, indices=None, dtype=None)

Converts a dataset in OGB's library-agnostic version to lists of Numpy/Scipy arrays. See the Open Graph Benchmark's website for more information.

Arguments

  • dataset: OGB library-agnostic dataset (e.g., GraphPropPredDataset);

  • indices: optional, a list of integer indices; if provided, only these graphs will be converted;

  • dtype: if set, the arrays in the returned lists will have this dtype.

Return

  • X_list: list of np.arrays of (variable) shape (N, F) with node features;
  • A_list: list of scipy.sparse adjacency matrices of (variable) shape (N, N);
  • E_list: list of np.arrays of (variable) shape (n_edges, S) with edge attributes, or a list of None if edge attributes are not available;
  • y_list: np.array of shape (n_graphs, n_tasks) with the task labels.
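
For instance, a sketch that converts only the training split (get_idx_split() is part of OGB's dataset API):

from ogb.graphproppred import GraphPropPredDataset
from spektral.datasets.ogb import dataset_to_numpy

dataset = GraphPropPredDataset(name='ogbg-molhiv')
idx = dataset.get_idx_split()  # dict with 'train', 'valid', and 'test' indices

X_list, A_list, E_list, y_list = dataset_to_numpy(dataset, indices=idx['train'])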

QM9 Small Molecules

load_data

spektral.datasets.qm9.load_data(nf_keys=None, ef_keys=None, auto_pad=True, self_loops=False, amount=None, return_type='numpy')

Loads the QM9 chemical dataset of small molecules.

Nodes represent heavy atoms (hydrogens are discarded) and edges represent chemical bonds.

The node features represent the chemical properties of each atom, and are loaded according to the nf_keys argument. See spektral.datasets.qm9.NODE_FEATURES for possible node features, and see this link for the meaning of each property. Usually, it is sufficient to load the atomic number.

The edge features represent the type and stereochemistry of each chemical bond between two atoms. See spektral.datasets.qm9.EDGE_FEATURES for possible edge features, and see this link for the meaning of each property. Usually, it is sufficient to load the type of bond.

Arguments

  • nf_keys: list or str, node features to return (see qm9.NODE_FEATURES for available features);

  • ef_keys: list or str, edge features to return (see qm9.EDGE_FEATURES for available features);

  • auto_pad: if return_type='numpy', zero-pad the graph matrices so that all graphs have the same number of nodes;

  • self_loops: if return_type='numpy', add self-loops to the adjacency matrices;

  • amount: the number of molecules to return (in ascending order by number of atoms);

  • return_type: data format to return ('numpy', 'networkx', or 'sdf').

Return

  • if return_type='numpy', the adjacency matrix, node features, edge features, and a Pandas dataframe containing the labels;
  • if return_type='networkx', a list of graphs in NetworkX format and a dataframe containing the labels;
  • if return_type='sdf', a list of molecules in the internal SDF format and a dataframe containing the labels.
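
A minimal sketch in the 'numpy' format (the feature keys below are illustrative; check qm9.NODE_FEATURES and qm9.EDGE_FEATURES for the exact names):

from spektral.datasets import qm9

# Load the 1000 smallest molecules with zero-padded matrices
A, X, E, y = qm9.load_data(nf_keys='atomic_num', ef_keys='type',
                           amount=1000, return_type='numpy')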

MNIST KNN Grid

load_data

spektral.datasets.mnist.load_data(k=8, noise_level=0.0)

Loads the MNIST dataset and a K-NN graph to perform graph signal classification, as described by Defferrard et al. (2016). The K-NN graph is built once from the regular grid of pixels, using the 2D coordinates of each pixel.

The node features of each graph are the vectorized MNIST digits, rescaled to [0, 1]. Two nodes are connected if they are neighbours according to the K-NN graph. Labels are the MNIST class associated with each sample.

Arguments

  • k: int, number of neighbours for each node;

  • noise_level: fraction of edges to flip (i.e., from 0 to 1 and vice versa).

Return

  • X_train, y_train: training node features and labels;
  • X_val, y_val: validation node features and labels;
  • X_test, y_test: test node features and labels;
  • A: adjacency matrix of the grid.
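
For example, unpacking the return values in the order listed above:

from spektral.datasets import mnist

# 8-NN grid graph with 10% of the edges flipped as noise
X_train, y_train, X_val, y_val, X_test, y_test, A = mnist.load_data(k=8, noise_level=0.1)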

Delaunay Triangulations

generate_data

spektral.datasets.delaunay.generate_data(classes=0, n_samples_in_class=1000, n_nodes=7, support_low=0.0, support_high=10.0, drift_amount=1.0, one_hot_labels=True, support=None, seed=None, return_type='numpy')

Generates a dataset of Delaunay triangulations as described by Zambon et al. (2017).

Node attributes are the 2D coordinates of the points. Two nodes are connected if they share an edge in the Delaunay triangulation. Labels represent the class of the graph, from 0 to 20: each class index i represents the "difficulty" of the binary classification problem 0 vs. i. In other words, the higher the class index, the more similar the class is to class 0.

Arguments

  • classes: indices of the classes to load (integer, or list of integers between 0 and 20);

  • n_samples_in_class: number of generated samples per class;

  • n_nodes: number of nodes in a graph;

  • support_low: lower bound of the uniform distribution from which the support is generated;

  • support_high: upper bound of the uniform distribution from which the support is generated;

  • drift_amount: coefficient to control the amount of change between classes;

  • one_hot_labels: one-hot encode dataset labels;

  • support: custom support to use instead of generating it randomly;

  • seed: seed for Numpy's random number generator;

  • return_type: data format to return ('numpy' or 'networkx').

Return

  • if return_type='numpy', the adjacency matrix, node features, and an array containing the labels;
  • if return_type='networkx', a list of graphs in NetworkX format and an array containing the labels.
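
For example, a sketch that generates a two-class problem (class 0 vs. class 5):

from spektral.datasets import delaunay

# 100 triangulations per class, 7 nodes each, one-hot labels
A, X, y = delaunay.generate_data(classes=[0, 5], n_samples_in_class=100,
                                 n_nodes=7, seed=0)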