torchtext.datasets

All datasets are subclasses of torchtext.data.Dataset, which inherits from torch.utils.data.Dataset, i.e. they have splits and iters methods implemented.

General use cases are as follows:

Approach 1, splits:

# set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False)

# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)

# build the vocabulary
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300))
LABEL.build_vocab(train)

# make iterator for splits
train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=3, device=0)

Approach 2, iters:

# use default configurations
train_iter, test_iter = datasets.IMDB.iters(batch_size=4)

The following datasets are available:

Sentiment Analysis

SST

class torchtext.datasets.SST(path, text_field, label_field, subtrees=False, fine_grained=False, **kwargs)
classmethod iters(batch_size=32, device=0, root='.data', vectors=None, **kwargs)

Create iterator objects for splits of the SST dataset.

Parameters:
  • batch_size – Batch size.
  • device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
  • root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its trees subdirectory.
  • vectors – One of the available pretrained vectors, or a list of them (see Vocab.load_vectors).
  • Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, root='.data', train='train.txt', validation='dev.txt', test='test.txt', train_subtrees=False, **kwargs)

Create dataset objects for splits of the SST dataset.

Parameters:
  • text_field – The field that will be used for the sentence.
  • label_field – The field that will be used for label data.
  • root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its trees subdirectory.
  • train – The filename of the train data. Default: ‘train.txt’.
  • validation – The filename of the validation data, or None to not load the validation set. Default: ‘dev.txt’.
  • test – The filename of the test data, or None to not load the test set. Default: ‘test.txt’.
  • train_subtrees – Whether to use all subtrees in the training set. Default: False.
  • Remaining keyword arguments – Passed to the splits method of Dataset.
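
For example, a minimal end-to-end sketch (the field settings, vector choice, and batch size below are illustrative, not required by the API):

# set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False)

# SST provides train, dev and test files, so splits returns three datasets
train, val, test = datasets.SST.splits(TEXT, LABEL)

# build the vocabulary
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300))
LABEL.build_vocab(train)

# make iterators
train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=64, device=0)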

IMDb

class torchtext.datasets.IMDB(path, text_field, label_field, **kwargs)
classmethod iters(batch_size=32, device=0, root='.data', vectors=None, **kwargs)

Create iterator objects for splits of the IMDB dataset.

Parameters:
  • batch_size – Batch size.
  • device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
  • root – The root directory that contains the imdb dataset subdirectory.
  • vectors – One of the available pretrained vectors, or a list of them (see Vocab.load_vectors).
  • Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, root='.data', train='train', test='test', **kwargs)

Create dataset objects for splits of the IMDB dataset.

Parameters:
  • text_field – The field that will be used for the sentence.
  • label_field – The field that will be used for label data.
  • root – Root dataset storage directory. Default is ‘.data’.
  • train – The directory that contains the training examples
  • test – The directory that contains the test examples
  • Remaining keyword arguments – Passed to the splits method of Dataset.

Question Classification

TREC

class torchtext.datasets.TREC(path, text_field, label_field, fine_grained=False, **kwargs)
classmethod iters(batch_size=32, device=0, root='.data', vectors=None, **kwargs)

Create iterator objects for splits of the TREC dataset.

Parameters:
  • batch_size – Batch size.
  • device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
  • root – The root directory that contains the trec dataset subdirectory.
  • vectors – One of the available pretrained vectors, or a list of them (see Vocab.load_vectors).
  • Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, root='.data', train='train_5500.label', test='TREC_10.label', **kwargs)

Create dataset objects for splits of the TREC dataset.

Parameters:
  • text_field – The field that will be used for the sentence.
  • label_field – The field that will be used for label data.
  • root – Root dataset storage directory. Default is ‘.data’.
  • train – The filename of the train data. Default: ‘train_5500.label’.
  • test – The filename of the test data, or None to not load the test set. Default: ‘TREC_10.label’.
  • Remaining keyword arguments – Passed to the splits method of Dataset.
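
A sketch of typical usage (field settings and batch size are illustrative; fine_grained is the constructor flag documented in the class signature above):

# set up fields
TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)

# TREC ships only train and test files, so splits returns two datasets
train, test = datasets.TREC.splits(TEXT, LABEL, fine_grained=True)

TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300))
LABEL.build_vocab(train)

train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=64, device=0)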

Entailment

SNLI

class torchtext.datasets.SNLI(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)
classmethod iters(batch_size=32, device=0, root='.data', vectors=None, trees=False, **kwargs)

Create iterator objects for splits of the SNLI dataset.

This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.

Parameters:
  • batch_size – Batch size.
  • device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
  • root – The root directory that the dataset’s zip archive will be expanded into.
  • vectors – One of the available pretrained vectors, or a list of them (see Vocab.load_vectors).
  • trees – Whether to include shift-reduce parser transitions. Default: False.
  • Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, parse_field=None, root='.data', train='snli_1.0_train.jsonl', validation='snli_1.0_dev.jsonl', test='snli_1.0_test.jsonl')

Create dataset objects for splits of the SNLI dataset.

This is the most flexible way to use the dataset.

Parameters:
  • text_field – The field that will be used for premise and hypothesis data.
  • label_field – The field that will be used for label data.
  • parse_field – The field that will be used for shift-reduce parser transitions, or None to not include them.
  • extra_fields – A dict[json_key: Tuple(field_name, Field)]
  • root – The root directory that the dataset’s zip archive will be expanded into.
  • train – The filename of the train data. Default: ‘snli_1.0_train.jsonl’.
  • validation – The filename of the validation data, or None to not load the validation set. Default: ‘snli_1.0_dev.jsonl’.
  • test – The filename of the test data, or None to not load the test set. Default: ‘snli_1.0_test.jsonl’.
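
For example, a minimal sketch of both paths (field settings, vectors, and batch size are illustrative):

# simplest path: default fields, vocabulary and iterators
train_iter, dev_iter, test_iter = datasets.SNLI.iters(batch_size=128)

# flexible path: supply your own fields
TEXT = data.Field(lower=True, batch_first=True)
LABEL = data.Field(sequential=False)

train, dev, test = datasets.SNLI.splits(TEXT, LABEL)

TEXT.build_vocab(train, vectors=GloVe(name='840B', dim=300))
LABEL.build_vocab(train)

# each batch exposes batch.premise, batch.hypothesis and batch.label
train_iter, dev_iter, test_iter = data.BucketIterator.splits(
    (train, dev, test), batch_size=128, device=0)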

MultiNLI

class torchtext.datasets.MultiNLI(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)
classmethod iters(batch_size=32, device=0, root='.data', vectors=None, trees=False, **kwargs)

Create iterator objects for splits of the MultiNLI dataset.

This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.

Parameters:
  • batch_size – Batch size.
  • device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
  • root – The root directory that the dataset’s zip archive will be expanded into.
  • vectors – One of the available pretrained vectors, or a list of them (see Vocab.load_vectors).
  • trees – Whether to include shift-reduce parser transitions. Default: False.
  • Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, parse_field=None, genre_field=None, root='.data', train='multinli_1.0_train.jsonl', validation='multinli_1.0_dev_matched.jsonl', test='multinli_1.0_dev_mismatched.jsonl')

Create dataset objects for splits of the MultiNLI dataset.

This is the most flexible way to use the dataset.

Parameters:
  • text_field – The field that will be used for premise and hypothesis data.
  • label_field – The field that will be used for label data.
  • parse_field – The field that will be used for shift-reduce parser transitions, or None to not include them.
  • extra_fields – A dict[json_key: Tuple(field_name, Field)]
  • root – The root directory that the dataset’s zip archive will be expanded into.
  • train – The filename of the train data. Default: ‘multinli_1.0_train.jsonl’.
  • validation – The filename of the validation data, or None to not load the validation set. Default: ‘multinli_1.0_dev_matched.jsonl’.
  • test – The filename of the test data, or None to not load the test set. Default: ‘multinli_1.0_dev_mismatched.jsonl’.
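
Usage mirrors SNLI; a brief sketch, with the matched and mismatched dev sets serving as validation and test by default:

TEXT = data.Field(lower=True, batch_first=True)
LABEL = data.Field(sequential=False)

train, dev_matched, dev_mismatched = datasets.MultiNLI.splits(TEXT, LABEL)

TEXT.build_vocab(train, vectors=GloVe(name='840B', dim=300))
LABEL.build_vocab(train)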

Language Modeling

Language modeling datasets are subclasses of LanguageModelingDataset class.

class torchtext.datasets.LanguageModelingDataset(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)

Defines a dataset for language modeling.

__init__(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)

Create a LanguageModelingDataset given a path and a field.

Parameters:
  • path – Path to the data file.
  • text_field – The field that will be used for text data.
  • newline_eos – Whether to add an <eos> token for every newline in the data file. Default: True.
  • Remaining keyword arguments – Passed to the constructor of data.Dataset.
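
As an illustration, a LanguageModelingDataset can be built directly from a plain-text corpus and consumed through a BPTTIterator (the file name my_corpus.txt below is a hypothetical placeholder):

TEXT = data.Field(lower=True)

lm_data = datasets.LanguageModelingDataset(
    path='my_corpus.txt', text_field=TEXT)

TEXT.build_vocab(lm_data)

# BPTT-style batches of length 35
lm_iter = data.BPTTIterator(lm_data, batch_size=32, bptt_len=35, device=0)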

WikiText-2

class torchtext.datasets.WikiText2(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)
classmethod iters(batch_size=32, bptt_len=35, device=0, root='.data', vectors=None, **kwargs)

Create iterator objects for splits of the WikiText-2 dataset.

This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.

Parameters:
  • batch_size – Batch size.
  • bptt_len – Length of sequences for backpropagation through time.
  • device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
  • root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its wikitext-2 subdirectory.
  • vectors – One of the available pretrained vectors, or a list of them (see Vocab.load_vectors). The word vectors are accessible as train.dataset.fields[‘text’].vocab.vectors.
  • Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, root='.data', train='wiki.train.tokens', validation='wiki.valid.tokens', test='wiki.test.tokens', **kwargs)

Create dataset objects for splits of the WikiText-2 dataset.

This is the most flexible way to use the dataset.

Parameters:
  • text_field – The field that will be used for text data.
  • root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its wikitext-2 subdirectory.
  • train – The filename of the train data. Default: ‘wiki.train.tokens’.
  • validation – The filename of the validation data, or None to not load the validation set. Default: ‘wiki.valid.tokens’.
  • test – The filename of the test data, or None to not load the test set. Default: ‘wiki.test.tokens’.
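
For example, a minimal sketch of both paths (batch size, BPTT length, and vectors are illustrative):

# simplest path: default field and vocabulary, BPTT-style batches
train_iter, valid_iter, test_iter = datasets.WikiText2.iters(
    batch_size=32, bptt_len=35)

# flexible path: custom field plus an explicit BPTTIterator
TEXT = data.Field(lower=True)
train, valid, test = datasets.WikiText2.splits(TEXT)
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=200))
train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test), batch_size=32, bptt_len=35, device=0)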

WikiText103

class torchtext.datasets.WikiText103(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)
classmethod iters(batch_size=32, bptt_len=35, device=0, root='.data', vectors=None, **kwargs)

Create iterator objects for splits of the WikiText-103 dataset.

This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.

Parameters:
  • batch_size – Batch size.
  • bptt_len – Length of sequences for backpropagation through time.
  • device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
  • root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its wikitext-103 subdirectory.
  • vectors – One of the available pretrained vectors, or a list of them (see Vocab.load_vectors). The word vectors are accessible as train.dataset.fields[‘text’].vocab.vectors.
  • Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, root='.data', train='wiki.train.tokens', validation='wiki.valid.tokens', test='wiki.test.tokens', **kwargs)

Create dataset objects for splits of the WikiText-103 dataset.

This is the most flexible way to use the dataset.

Parameters:
  • text_field – The field that will be used for text data.
  • root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its wikitext-103 subdirectory.
  • train – The filename of the train data. Default: ‘wiki.train.tokens’.
  • validation – The filename of the validation data, or None to not load the validation set. Default: ‘wiki.valid.tokens’.
  • test – The filename of the test data, or None to not load the test set. Default: ‘wiki.test.tokens’.

PennTreebank

class torchtext.datasets.PennTreebank(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)

The Penn Treebank dataset. A relatively small dataset originally created for POS tagging.

References

Marcus, Mitchell P., Marcinkiewicz, Mary Ann & Santorini, Beatrice (1993). Building a Large Annotated Corpus of English: The Penn Treebank

classmethod iters(batch_size=32, bptt_len=35, device=0, root='.data', vectors=None, **kwargs)

Create iterator objects for splits of the Penn Treebank dataset.

This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.

Parameters:
  • batch_size – Batch size.
  • bptt_len – Length of sequences for backpropagation through time.
  • device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
  • root – The root directory where the data files will be stored.
  • vectors – One of the available pretrained vectors, or a list of them (see Vocab.load_vectors). The word vectors are accessible as train.dataset.fields[‘text’].vocab.vectors.
  • Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, root='.data', train='ptb.train.txt', validation='ptb.valid.txt', test='ptb.test.txt', **kwargs)

Create dataset objects for splits of the Penn Treebank dataset.

Parameters:
  • text_field – The field that will be used for text data.
  • root – The root directory where the data files will be stored.
  • train – The filename of the train data. Default: ‘ptb.train.txt’.
  • validation – The filename of the validation data, or None to not load the validation set. Default: ‘ptb.valid.txt’.
  • test – The filename of the test data, or None to not load the test set. Default: ‘ptb.test.txt’.
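
Usage mirrors the WikiText datasets; a brief sketch using an explicit BPTTIterator (batch size and BPTT length are illustrative):

TEXT = data.Field(lower=True)
train, valid, test = datasets.PennTreebank.splits(TEXT)
TEXT.build_vocab(train)

train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test), batch_size=20, bptt_len=35, device=0)

# each batch carries batch.text and batch.target, shifted by one token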

Machine Translation

Machine translation datasets are subclasses of TranslationDataset class.

class torchtext.datasets.TranslationDataset(path, exts, fields, **kwargs)

Defines a dataset for machine translation.

__init__(path, exts, fields, **kwargs)

Create a TranslationDataset given paths and fields.

Parameters:
  • path – Common prefix of paths to the data files for both languages.
  • exts – A tuple containing the extension to path for each language.
  • fields – A tuple containing the fields that will be used for data in each language.
  • Remaining keyword arguments – Passed to the constructor of data.Dataset.
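
As an illustration, a custom parallel corpus can be loaded directly; the paths data/train.de and data/train.en below are hypothetical placeholders:

SRC = data.Field(init_token='<sos>', eos_token='<eos>', lower=True)
TRG = data.Field(init_token='<sos>', eos_token='<eos>', lower=True)

# loads data/train.de and data/train.en as the source and target sides
mt_data = datasets.TranslationDataset(
    path='data/train', exts=('.de', '.en'), fields=(SRC, TRG))

SRC.build_vocab(mt_data, min_freq=2)
TRG.build_vocab(mt_data, min_freq=2)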

Multi30k

class torchtext.datasets.Multi30k(path, exts, fields, **kwargs)

The small-dataset WMT 2016 multimodal task, also known as Flickr30k

classmethod splits(exts, fields, root='.data', train='train', validation='val', test='test2016', **kwargs)

Create dataset objects for splits of the Multi30k dataset.

Parameters:
  • exts – A tuple containing the extension to path for each language.
  • fields – A tuple containing the fields that will be used for data in each language.
  • root – Root dataset storage directory. Default is ‘.data’.
  • train – The prefix of the train data. Default: ‘train’.
  • validation – The prefix of the validation data. Default: ‘val’.
  • test – The prefix of the test data. Default: ‘test2016’.
  • Remaining keyword arguments – Passed to the splits method of Dataset.
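
A sketch of typical usage (field settings and batch size are illustrative):

SRC = data.Field(init_token='<sos>', eos_token='<eos>', lower=True)
TRG = data.Field(init_token='<sos>', eos_token='<eos>', lower=True)

# German-to-English; each example has .src and .trg attributes
train, val, test = datasets.Multi30k.splits(
    exts=('.de', '.en'), fields=(SRC, TRG))

SRC.build_vocab(train, min_freq=2)
TRG.build_vocab(train, min_freq=2)

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=128, device=0)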

IWSLT

class torchtext.datasets.IWSLT(path, exts, fields, **kwargs)

The IWSLT 2016 TED talk translation task

classmethod splits(exts, fields, root='.data', train='train', validation='IWSLT16.TED.tst2013', test='IWSLT16.TED.tst2014', **kwargs)

Create dataset objects for splits of the IWSLT dataset.

Parameters:
  • exts – A tuple containing the extension to path for each language.
  • fields – A tuple containing the fields that will be used for data in each language.
  • root – Root dataset storage directory. Default is ‘.data’.
  • train – The prefix of the train data. Default: ‘train’.
  • validation – The prefix of the validation data. Default: ‘IWSLT16.TED.tst2013’.
  • test – The prefix of the test data. Default: ‘IWSLT16.TED.tst2014’.
  • Remaining keyword arguments – Passed to the splits method of Dataset.

WMT14

class torchtext.datasets.WMT14(path, exts, fields, **kwargs)

The WMT 2014 English-German dataset, as preprocessed by Google Brain.

Though this download contains test sets from 2015 and 2016, the train set differs slightly from WMT 2015 and 2016 and significantly from WMT 2017.

classmethod splits(exts, fields, root='.data', train='train.tok.clean.bpe.32000', validation='newstest2013.tok.bpe.32000', test='newstest2014.tok.bpe.32000', **kwargs)

Create dataset objects for splits of the WMT 2014 dataset.

Parameters:
  • exts – A tuple containing the extensions for each language. Must be either (‘.en’, ‘.de’) or the reverse.
  • fields – A tuple containing the fields that will be used for data in each language.
  • root – Root dataset storage directory. Default is ‘.data’.
  • train – The prefix of the train data. Default: ‘train.tok.clean.bpe.32000’.
  • validation – The prefix of the validation data. Default: ‘newstest2013.tok.bpe.32000’.
  • test – The prefix of the test data. Default: ‘newstest2014.tok.bpe.32000’.
  • Remaining keyword arguments – Passed to the splits method of Dataset.

Sequence Tagging

Sequence tagging datasets are subclasses of SequenceTaggingDataset class.

class torchtext.datasets.SequenceTaggingDataset(path, fields, separator='\t', **kwargs)

Defines a dataset for sequence tagging. Examples in this dataset contain paired lists, e.g. a list of words paired with a list of tags.

For example, in the case of part-of-speech tagging, an example consists of [I, love, PyTorch, .] paired with [PRON, VERB, PROPN, PUNCT].

See torchtext/test/sequence_tagging.py for an example of how to use this class.

__init__(path, fields, separator='\t', **kwargs)

Create a SequenceTaggingDataset given a path and fields.

Parameters:
  • path – Path to the data file.
  • fields – A list of (name, Field) tuples, one per column in the data file.
  • separator – The string that separates columns in the data file. Default: ‘\t’.
  • Remaining keyword arguments – Passed to the constructor of data.Dataset.
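
As an illustration, a custom tagging corpus can be loaded directly; pos.train below is a hypothetical tab-separated file with one token and its tag per line and blank lines between sentences:

WORD = data.Field(lower=True)
TAG = data.Field()

tagging_data = datasets.SequenceTaggingDataset(
    path='pos.train', fields=[('word', WORD), ('tag', TAG)])

WORD.build_vocab(tagging_data)
TAG.build_vocab(tagging_data)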

UDPOS

class torchtext.datasets.UDPOS(path, fields, separator='\t', **kwargs)
classmethod splits(fields, root='.data', train='en-ud-tag.v2.train.txt', validation='en-ud-tag.v2.dev.txt', test='en-ud-tag.v2.test.txt', **kwargs)

Downloads and loads the Universal Dependencies Version 2 POS Tagged data.
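
A sketch of typical usage, assuming the usual three-column layout of the tagged files (word, UD POS tag, PTB POS tag):

WORD = data.Field(lower=True)
UD_TAG = data.Field()
PTB_TAG = data.Field()

train, val, test = datasets.UDPOS.splits(
    fields=(('word', WORD), ('udtag', UD_TAG), ('ptbtag', PTB_TAG)))

WORD.build_vocab(train)
UD_TAG.build_vocab(train)
PTB_TAG.build_vocab(train)

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=32, device=0)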

CoNLL2000Chunking

class torchtext.datasets.CoNLL2000Chunking(path, fields, separator='\t', **kwargs)
classmethod splits(fields, root='.data', train='train.txt', test='test.txt', validation_frac=0.1, **kwargs)

Downloads and loads the CoNLL 2000 Chunking dataset. Note: only train and test sets are provided, so 10% of the train set is held out as a validation set (controlled by validation_frac).

Question Answering

BABI20

class torchtext.datasets.BABI20(path, text_field, only_supporting=False, **kwargs)
__init__(path, text_field, only_supporting=False, **kwargs)

Create a bAbI dataset given a path and a text field.

Parameters:
  • path – Path to the data file.
  • text_field – The field that will be used for text data.
  • only_supporting – Whether to keep only the sentences that support the answer in each story. Default: False.
  • Remaining keyword arguments – Passed to the constructor of data.Dataset.
classmethod splits(text_field, path=None, root='.data', task=1, joint=False, tenK=False, only_supporting=False, train=None, validation=None, test=None, **kwargs)

Create Dataset objects for multiple splits of a dataset.

Parameters:
  • path (str) – Common prefix of the splits’ file paths, or None to use the result of cls.download(root).
  • root (str) – Root dataset storage directory. Default is ‘.data’.
  • train (str) – Suffix to add to path for the train set, or None for no train set. Default is None.
  • validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
  • test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
  • Remaining keyword arguments – Passed to the constructor of the Dataset (sub)class being used.
Returns: Datasets for train, validation, and test splits in that order, if provided.

Return type: Tuple[Dataset]
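
A sketch of typical usage (task number, batch size, and iterator settings are illustrative; three splits are assumed here, matching the train/valid/test files of the bAbI release):

TEXT = data.Field(sequential=True)

# task 1 of the 1k-example English set; tenK=True would select the 10k set
train, val, test = datasets.BABI20.splits(TEXT, task=1, tenK=False)

TEXT.build_vocab(train)

train_iter, val_iter, test_iter = data.Iterator.splits(
    (train, val, test), batch_size=32, device=0, sort=False)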