torchtext.data

The data module provides the following:

  • Ability to define a preprocessing pipeline
  • Batching, padding, and numericalizing (including building a vocabulary object)
  • Wrapper for dataset splits (train, validation, test)
  • Loader for a custom NLP dataset
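
A minimal end-to-end sketch of these pieces working together (the file name "quotes.tsv" and its "text"/"label" columns are made up for illustration; adjust to your own data):

>>> from torchtext import data
>>>
>>> TEXT = data.Field(sequential=True, lower=True, batch_first=True)
>>> LABEL = data.Field(sequential=False, unk_token=None)
>>>
>>> dataset = data.TabularDataset(
...     path="quotes.tsv", format="tsv", skip_header=True,
...     fields=[("text", TEXT), ("label", LABEL)])
>>>
>>> TEXT.build_vocab(dataset)   # numericalizing requires a vocabulary
>>> LABEL.build_vocab(dataset)
>>>
>>> train_iter = data.BucketIterator(
...     dataset, batch_size=32,
...     sort_key=lambda ex: len(ex.text), sort_within_batch=True)
>>>
>>> for batch in train_iter:
...     print(batch.text.shape)   # (batch_size, seq_len) since batch_first=True
...     break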

Dataset, Batch, and Example

Dataset

class torchtext.data.Dataset(examples, fields, filter_pred=None)

Defines a dataset composed of Examples along with its Fields.

Variables:
  • sort_key (callable) – A key to use for sorting dataset examples for batching together examples with similar lengths to minimize padding.
  • examples (list(Example)) – The examples in this dataset.
  • fields (dict[str, Field]) – Contains the name of each column or field, together with the corresponding Field object. Two fields with the same Field object will have a shared vocabulary.
__init__(examples, fields, filter_pred=None)

Create a dataset from a list of Examples and Fields.

Parameters:
  • examples – List of Examples.
  • fields (List(tuple(str, Field))) – The Fields to use in this tuple. The string is a field name, and the Field is the associated field.
  • filter_pred (callable or None) – Use only examples for which filter_pred(example) is True, or use all examples if None. Default is None.
classmethod download(root, check=None)

Download and unzip an online archive (.zip, .gz, or .tgz).

Parameters:
  • root (str) – Folder to download data to.
  • check (str or None) – Folder whose existence indicates that the dataset has already been downloaded, or None to check the existence of root/{cls.name}.
Returns:

Path to extracted dataset.

Return type:

str

filter_examples(field_names)

Remove unknown words from dataset examples with respect to the given fields.

Parameters:field_names (list(str)) – Only the parts of each example whose field name appears in field_names will have their unknown words removed.
split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)

Create train/test (and optionally validation) splits from the instance’s examples.

Parameters:
  • split_ratio (float or list of floats) – a number in [0, 1] denoting the fraction of data to be used for the training split (the rest is used for validation), or a list of numbers denoting the relative sizes of the train, test, and valid splits respectively. If the relative size for valid is missing, only the train-test split is returned. Default is 0.7 (for the train set).
  • stratified (bool) – whether the sampling should be stratified. Default is False.
  • strata_field (str) – name of the examples Field stratified over. Default is ‘label’ for the conventional label field.
  • random_state (tuple) – the random seed used for shuffling. A return value of random.getstate().
Returns:

Datasets for train, validation, and test splits in that order, if the splits are provided.

Return type:

Tuple[Dataset]
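
An illustrative sketch of splitting an existing Dataset (here reusing the dataset built in the introductory example; the "label" field used for stratification is an assumption about that dataset):

>>> import random
>>>
>>> # float ratio: 80% train, the remaining 20% used for validation
>>> train_set, valid_set = dataset.split(
...     split_ratio=0.8, random_state=random.getstate())
>>>
>>> # list ratio (train, test, valid), stratified over the label field;
>>> # the splits are returned in (train, valid, test) order
>>> train_set, valid_set, test_set = dataset.split(
...     split_ratio=[0.7, 0.15, 0.15],
...     stratified=True, strata_field="label")
>>> print(len(train_set), len(valid_set), len(test_set))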

classmethod splits(path=None, root='.data', train=None, validation=None, test=None, **kwargs)

Create Dataset objects for multiple splits of a dataset.

Parameters:
  • path (str) – Common prefix of the splits’ file paths, or None to use the result of cls.download(root).
  • root (str) – Root dataset storage directory. Default is ‘.data’.
  • train (str) – Suffix to add to path for the train set, or None for no train set. Default is None.
  • validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
  • test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
  • keyword arguments (Remaining) – Passed to the constructor of the Dataset (sub)class being used.
Returns:

Datasets for train, validation, and test splits in that order, if provided.

Return type:

Tuple[Dataset]

TabularDataset

class torchtext.data.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)

Defines a Dataset of columns stored in CSV, TSV, or JSON format.

__init__(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)

Create a TabularDataset given a path, file format, and field list.

Parameters:
  • path (str) – Path to the data file.
  • format (str) – The format of the data file. One of “CSV”, “TSV”, or “JSON” (case-insensitive).
  • fields (list(tuple(str, Field)) or dict[str: tuple(str, Field)]) –

    If using a list, the format must be CSV or TSV, and the values of the list should be tuples of (name, field). The fields should be in the same order as the columns in the CSV or TSV file, while tuples of (name, None) represent columns that will be ignored.

    If using a dict, the keys should be a subset of the JSON keys or CSV/TSV columns, and the values should be tuples of (name, field). Keys not present in the input dictionary are ignored. This allows the user to rename columns from their JSON/CSV/TSV key names and also enables selecting a subset of columns to load.

  • skip_header (bool) – Whether to skip the first line of the input file.
  • csv_reader_params (dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.
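
A hedged sketch of the two fields styles described above (the directory ".data/reviews", the file names, and the column/JSON key names are made up for illustration):

>>> from torchtext import data
>>>
>>> TEXT = data.Field(lower=True)
>>> LABEL = data.Field(sequential=False, unk_token=None)
>>>
>>> # CSV/TSV: fields is a list ordered like the file's columns;
>>> # (name, None) skips a column. splits() builds one dataset per suffix.
>>> train_set, valid_set = data.TabularDataset.splits(
...     path=".data/reviews", train="train.tsv", validation="valid.tsv",
...     format="tsv", skip_header=True,
...     fields=[("text", TEXT), ("id", None), ("label", LABEL)])
>>>
>>> # JSON: fields is a dict keyed by the JSON keys; values are
>>> # (new_name, field), so columns can be renamed and unused keys ignored.
>>> json_set = data.TabularDataset(
...     path=".data/reviews/train.jsonl", format="json",
...     fields={"review_body": ("text", TEXT),
...             "star_rating": ("label", LABEL)})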

Batch

class torchtext.data.Batch(data=None, dataset=None, device=None)

Defines a batch of examples along with its Fields.

Variables:
  • batch_size – Number of examples in the batch.
  • dataset – A reference to the dataset object the examples come from (which itself contains the dataset’s Field objects).
  • train – Deprecated: this attribute is left for backwards compatibility, however it is UNUSED as of the merger with PyTorch 0.4.
  • input_fields – The names of the fields that are used as input for the model
  • target_fields – The names of the fields that are used as targets during model training

Also stores the Variable for each column in the batch as an attribute.

__init__(data=None, dataset=None, device=None)

Create a Batch from a list of examples.

classmethod fromvars(dataset, batch_size, train=None, **kwargs)

Create a Batch directly from a number of Variables.

Example

class torchtext.data.Example

Defines a single training or test example.

Stores each column of the example as an attribute.

classmethod fromCSV(data, fields, field_to_index=None)
classmethod fromJSON(data, fields)
classmethod fromdict(data, fields)
classmethod fromlist(data, fields)
classmethod fromtree(data, fields, subtrees=False)
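
A short sketch of building Examples by hand with fromlist and wrapping them in a Dataset (the sample rows are made up for illustration):

>>> from torchtext import data
>>>
>>> TEXT = data.Field(lower=True)
>>> LABEL = data.Field(sequential=False, unk_token=None)
>>> fields = [("text", TEXT), ("label", LABEL)]
>>>
>>> rows = [("The movie was great", "pos"),
...         ("Terrible plot and acting", "neg")]
>>> examples = [data.Example.fromlist(list(row), fields) for row in rows]
>>> dataset = data.Dataset(examples, fields)
>>> print(dataset[0].text)   # tokens after the field's preprocess step
['the', 'movie', 'was', 'great']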

Fields

RawField

class torchtext.data.RawField(preprocessing=None, postprocessing=None, is_target=False)

Defines a general datatype.

Every dataset consists of one or more types of data. For instance, a text classification dataset contains sentences and their classes, while a machine translation dataset contains paired examples of text in two languages. Each of these types of data is represented by a RawField object. A RawField object does not assume any property of the data type and it holds parameters relating to how a datatype should be processed.

Variables:
  • preprocessing – The Pipeline that will be applied to examples using this field before creating an example. Default: None.
  • postprocessing – A Pipeline that will be applied to a list of examples using this field before assigning to a batch. Function signature: (batch(list)) -> object Default: None.
  • is_target – Whether this field is a target variable. Affects iteration over batches. Default: False
__init__(preprocessing=None, postprocessing=None, is_target=False)

Initialize self. See help(type(self)) for accurate signature.

preprocess(x)

Preprocess an example if the preprocessing Pipeline is provided.

process(batch, *args, **kwargs)

Process a list of examples to create a batch.

Postprocess the batch with user-provided Pipeline.

Parameters:batch (list(object)) – A list of objects from a batch of examples.
Returns:Processed object given the input and custom postprocessing Pipeline.
Return type:object

Field

class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)

Defines a datatype together with instructions for converting to Tensor.

Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.

If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.

Variables:
  • sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
  • use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
  • init_token – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
  • eos_token – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
  • fix_length – A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.
  • dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
  • preprocessing – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
  • postprocessing – A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field’s Vocab. Default: None.
  • lower – Whether to lowercase the text in this field. Default: False.
  • tokenize – The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
  • tokenizer_language – The language of the tokenizer to be constructed. Various languages currently supported only in SpaCy.
  • include_lengths – Whether to return a tuple of a padded minibatch and a list containing the lengths of each example, or just a padded minibatch. Default: False.
  • batch_first – Whether to produce tensors with the batch dimension first. Default: False.
  • pad_token – The string token used as padding. Default: “<pad>”.
  • unk_token – The string token used to represent OOV words. Default: “<unk>”.
  • pad_first – Do the padding of the sequence at the beginning. Default: False.
  • truncate_first – Do the truncating of the sequence at the beginning. Default: False
  • stop_words – Tokens to discard during the preprocessing step. Default: None
  • is_target – Whether this field is a target variable. Affects iteration over batches. Default: False
__init__(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)

Initialize self. See help(type(self)) for accurate signature.

build_vocab(*args, **kwargs)

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • keyword arguments (Remaining) – Passed to the constructor of Vocab.
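
A hedged sketch of typical build_vocab usage (train_set is assumed to be a Dataset whose "text" column uses TEXT; min_freq and max_size are forwarded to the Vocab constructor):

>>> from torchtext import data
>>>
>>> TEXT = data.Field(init_token="<bos>", eos_token="<eos>",
...                   lower=True, include_lengths=True)
>>> TEXT.build_vocab(train_set, min_freq=2, max_size=25000)
>>> print(len(TEXT.vocab))            # vocabulary size
>>> print(TEXT.vocab.stoi["<pad>"])   # index of the padding token
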
numericalize(arr, device=None)

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
pad(minibatch)

Pad a batch of examples using this field.

Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.
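
An illustrative sketch of what process() does internally, calling pad() and numericalize() directly on two toy sentences:

>>> from torchtext import data
>>>
>>> TEXT = data.Field(lower=True)
>>> examples = [TEXT.preprocess("The cat sat"), TEXT.preprocess("A dog")]
>>> TEXT.build_vocab(examples)        # build a throwaway vocab from token lists
>>>
>>> padded = TEXT.pad(examples)
>>> print(padded)
[['the', 'cat', 'sat'], ['a', 'dog', '<pad>']]
>>> tensor = TEXT.numericalize(padded)
>>> print(tensor.shape)               # (seq_len, batch_size) since batch_first=False
torch.Size([3, 2])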

preprocess(x)

Load a single example using this field, tokenizing if necessary.

If the input is a Python 2 str, it will be converted to Unicode first. If sequential=True, it will be tokenized. Then the input will be optionally lowercased and passed to the user-provided preprocessing Pipeline.

process(batch, device=None)

Process a list of examples to create a torch.Tensor.

Pad, numericalize, and postprocess a batch and create a tensor.

Parameters:batch (list(object)) – A list of objects from a batch of examples.
Returns:Processed object given the input and custom postprocessing Pipeline.
Return type:torch.autograd.Variable
vocab_cls

alias of torchtext.vocab.Vocab

ReversibleField

class torchtext.data.ReversibleField(**kwargs)
__init__(**kwargs)

Initialize self. See help(type(self)) for accurate signature.

SubwordField

class torchtext.data.SubwordField(**kwargs)
__init__(**kwargs)

Initialize self. See help(type(self)) for accurate signature.

segment(*args)

Segment one or more datasets with this subword field.

Parameters:arguments (Positional) – Dataset objects or other indexable mutable sequences to segment. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
vocab_cls

alias of torchtext.vocab.SubwordVocab

NestedField

class torchtext.data.NestedField(nesting_field, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, tokenize=None, tokenizer_language='en', include_lengths=False, pad_token='<pad>', pad_first=False, truncate_first=False)

A nested field.

A nested field holds another field (called the nesting field), accepts an untokenized string or a list of string tokens, and groups and treats them as one field as described by the nesting field. Every token will be preprocessed, padded, etc. in the manner specified by the nesting field. Note that this means a nested field always has sequential=True. The two fields’ vocabularies will be shared. Their numericalization results will be stacked into a single tensor. NestedField also shares include_lengths with the nesting field, so include_lengths should not be specified on the nesting field. This field is primarily used to implement character embeddings. See tests/data/test_field.py for examples on how to use this field.

Parameters:
  • nesting_field (Field) – A field contained in this nested field.
  • use_vocab (bool) – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
  • init_token (str) – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
  • eos_token (str) – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
  • fix_length (int) – A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.
  • dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
  • preprocessing (Pipeline) – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
  • postprocessing (Pipeline) – A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field’s Vocab. Default: None.
  • include_lengths – Whether to return a tuple of a padded minibatch and a list containing the lengths of each example, or just a padded minibatch. Default: False.
  • tokenize – The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
  • tokenizer_language – The language of the tokenizer to be constructed. Various languages currently supported only in SpaCy.
  • pad_token (str) – The string token used as padding. If nesting_field is sequential, this will be set to its pad_token. Default: "<pad>".
  • pad_first (bool) – Do the padding of the sequence at the beginning. Default: False.
__init__(nesting_field, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, tokenize=None, tokenizer_language='en', include_lengths=False, pad_token='<pad>', pad_first=False, truncate_first=False)

Initialize self. See help(type(self)) for accurate signature.

build_vocab(*args, **kwargs)

Construct the Vocab object for nesting field and combine it with this field’s vocab.

Parameters:
  • arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for the nesting field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • keyword arguments (Remaining) – Passed to the constructor of Vocab.
numericalize(arrs, device=None)

Convert a padded minibatch into a variable tensor.

Each item in the minibatch will be numericalized independently and the resulting tensors will be stacked at the first dimension.

Parameters:
  • arr (List[List[str]]) – List of tokenized and padded examples.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
pad(minibatch)

Pad a batch of examples using this field.

If self.nesting_field.sequential is False, each example in the batch must be a list of string tokens, and pads them as if by a Field with sequential=True. Otherwise, each example must be a list of list of tokens. Using self.nesting_field, pads the list of tokens to self.nesting_field.fix_length if provided, or otherwise to the length of the longest list of tokens in the batch. Next, using this field, pads the result by filling short examples with self.nesting_field.pad_token.

Example

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)
>>>
>>> nesting_field = Field(pad_token='<c>', init_token='<w>', eos_token='</w>')
>>> field = NestedField(nesting_field, init_token='<s>', eos_token='</s>')
>>> minibatch = [
...     [list('john'), list('loves'), list('mary')],
...     [list('mary'), list('cries')],
... ]
>>> padded = field.pad(minibatch)
>>> pp.pprint(padded)
[   [   ['<w>', '<s>', '</w>', '<c>', '<c>', '<c>', '<c>'],
        ['<w>', 'j', 'o', 'h', 'n', '</w>', '<c>'],
        ['<w>', 'l', 'o', 'v', 'e', 's', '</w>'],
        ['<w>', 'm', 'a', 'r', 'y', '</w>', '<c>'],
        ['<w>', '</s>', '</w>', '<c>', '<c>', '<c>', '<c>']],
    [   ['<w>', '<s>', '</w>', '<c>', '<c>', '<c>', '<c>'],
        ['<w>', 'm', 'a', 'r', 'y', '</w>', '<c>'],
        ['<w>', 'c', 'r', 'i', 'e', 's', '</w>'],
        ['<w>', '</s>', '</w>', '<c>', '<c>', '<c>', '<c>'],
        ['<c>', '<c>', '<c>', '<c>', '<c>', '<c>', '<c>']]]
Parameters:minibatch (list) – Each element is a list of string if self.nesting_field.sequential is False, a list of list of string otherwise.
Returns:The padded minibatch, or (padded, sentence_lens, word_lengths) if self.include_lengths is True.
Return type:list
preprocess(xs)

Preprocess a single example.

First, tokenization and the supplied preprocessing pipeline are applied. Since this field is always sequential, the result is a list. Then, each element of the list is preprocessed using self.nesting_field.preprocess and the resulting list is returned.

Parameters:xs (list or str) – The input to preprocess.
Returns:The preprocessed list.
Return type:list

Iterators

Iterator

class torchtext.data.Iterator(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)

Defines an iterator that loads batches of data from a Dataset.

Variables:
  • dataset – The Dataset object to load Examples from.
  • batch_size – Batch size.
  • batch_size_fn – Function of three arguments (new example to add, current count of examples in the batch, and current effective batch size) that returns the new effective batch size resulting from adding that example to a batch. This is useful for dynamic batching, where this function would add to the current effective batch size the number of tokens in the new example.
  • sort_key – A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding. The sort_key provided to the Iterator constructor overrides the sort_key attribute of the Dataset, or defers to it if None.
  • train – Whether the iterator represents a train set.
  • repeat – Whether to repeat the iterator for multiple epochs. Default: False.
  • shuffle – Whether to shuffle examples between epochs.
  • sort – Whether to sort examples according to self.sort_key. Note that shuffle and sort default to train and (not train).
  • sort_within_batch – Whether to sort (in descending order according to self.sort_key) within each batch. If None, defaults to self.sort. If self.sort is True and this is False, the batch is left in the original (ascending) sorted order.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
__init__(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)

Initialize self. See help(type(self)) for accurate signature.

data()

Return the examples in the dataset in order, sorted, or shuffled.

init_epoch()

Set up the batch generator for a new epoch.

classmethod splits(datasets, batch_sizes=None, **kwargs)

Create Iterator objects for multiple splits of a dataset.

Parameters:
  • datasets – Tuple of Dataset objects corresponding to the splits. The first such object should be the train set.
  • batch_sizes – Tuple of batch sizes to use for the different splits, or None to use the same batch_size for all splits.
  • keyword arguments (Remaining) – Passed to the constructor of the iterator class being used.

BucketIterator

class torchtext.data.BucketIterator(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)

Defines an iterator that batches examples of similar lengths together.

Minimizes the amount of padding needed while producing freshly shuffled batches for each new epoch. See pool for the bucketing procedure used.
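
A hedged sketch of building bucketing iterators for existing splits (train_set, valid_set, and test_set are assumed Datasets with "text" and "label" fields; splits() is inherited from Iterator):

>>> import torch
>>> from torchtext import data
>>>
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> train_iter, valid_iter, test_iter = data.BucketIterator.splits(
...     (train_set, valid_set, test_set),
...     batch_sizes=(64, 256, 256),
...     sort_key=lambda ex: len(ex.text),
...     sort_within_batch=True,
...     device=device)
>>> for batch in train_iter:
...     text, label = batch.text, batch.label
...     break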

BPTTIterator

class torchtext.data.BPTTIterator(dataset, batch_size, bptt_len, **kwargs)

Defines an iterator for language modeling tasks that use BPTT.

Provides contiguous streams of examples together with targets that are one timestep further forward, for language modeling training with backpropagation through time (BPTT). Expects a Dataset with a single example and a single field called ‘text’ and produces Batches with text and target attributes.

Variables:
  • dataset – The Dataset object to load Examples from.
  • batch_size – Batch size.
  • bptt_len – Length of sequences for backpropagation through time.
  • sort_key – A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding. The sort_key provided to the Iterator constructor overrides the sort_key attribute of the Dataset, or defers to it if None.
  • train – Whether the iterator represents a train set.
  • repeat – Whether to repeat the iterator for multiple epochs. Default: False.
  • shuffle – Whether to shuffle examples between epochs.
  • sort – Whether to sort examples according to self.sort_key. Note that shuffle and sort default to train and (not train).
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
__init__(dataset, batch_size, bptt_len, **kwargs)

Initialize self. See help(type(self)) for accurate signature.
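
An illustrative sketch of BPTT iteration (LanguageModelingDataset comes from torchtext.datasets rather than this module, and the corpus path is made up):

>>> from torchtext import data, datasets
>>>
>>> TEXT = data.Field(lower=True)
>>> corpus = datasets.LanguageModelingDataset("corpus.txt", TEXT)
>>> TEXT.build_vocab(corpus)
>>>
>>> lm_iter = data.BPTTIterator(corpus, batch_size=32, bptt_len=35)
>>> for batch in lm_iter:
...     # batch.target is batch.text shifted one time step forward;
...     # both are (bptt_len, batch_size) with the default batch_first=False
...     print(batch.text.shape, batch.target.shape)
...     break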

Pipeline

Pipeline

class torchtext.data.Pipeline(convert_token=None)

Defines a pipeline for transforming sequence data.

The input is assumed to be utf-8 encoded str (Python 3) or unicode (Python 2).

Variables:
  • convert_token – The function to apply to input sequence data.
  • pipes – The Pipelines that will be applied to input sequence data in order.
__init__(convert_token=None)

Create a pipeline.

Parameters:convert_token – The function to apply to input sequence data. If None, the identity function is used. Default: None
add_after(pipeline)

Add a Pipeline to be applied after this processing pipeline.

Parameters:pipeline – The Pipeline or callable to apply after this Pipeline.
add_before(pipeline)

Add a Pipeline to be applied before this processing pipeline.

Parameters:pipeline – The Pipeline or callable to apply before this Pipeline.
call(x, *args)

Apply _only_ the convert_token function of the current pipeline to the input. If the input is a list, a list with the results of applying the convert_token function to all input elements is returned.

Parameters:
  • x – The input to apply the convert_token function to.
  • arguments (Positional) – Forwarded to the convert_token function of the current Pipeline.
static identity(x)

Return a copy of the input.

This is here for serialization compatibility with pickle.
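
A short sketch of composing Pipelines (the punctuation-stripping function is made up for illustration):

>>> from torchtext import data
>>>
>>> strip_punct = data.Pipeline(lambda tok: tok.strip(".,!?"))
>>> pipe = strip_punct.add_after(data.Pipeline(str.lower))  # strip, then lowercase
>>>
>>> print(pipe("Hello!"))
hello
>>> print(pipe(["Hello!", "World."]))   # lists are mapped element-wise
['hello', 'world']
>>>
>>> # A Pipeline is typically attached to a Field:
>>> TEXT = data.Field(preprocessing=pipe)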

Functions

batch

torchtext.data.batch(data, batch_size, batch_size_fn=None)

Yield elements from data in chunks of batch_size.

pool

torchtext.data.pool(data, batch_size, key, batch_size_fn=<function <lambda>>, random_shuffler=None, shuffle=False, sort_within_batch=False)

Sort within buckets, then batch, then shuffle batches.

Partitions data into chunks of size 100*batch_size, sorts examples within each chunk using sort_key, then batches these examples and shuffles the batches.

get_tokenizer

torchtext.data.get_tokenizer(tokenizer, language='en')
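
No description is given above; the following is a hedged sketch of common usage (the "spacy" option requires the spaCy package and an installed English model, and passing a plain callable is assumed to return it unchanged; verify against your torchtext version):

>>> from torchtext.data import get_tokenizer
>>>
>>> spacy_tok = get_tokenizer("spacy", language="en")
>>> print(spacy_tok("A self-contained sentence."))
>>>
>>> split_tok = get_tokenizer(str.split)   # assumption: callables pass through
>>> print(split_tok("plain whitespace tokenization"))
['plain', 'whitespace', 'tokenization']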

interleave_keys

torchtext.data.interleave_keys(a, b)

Interleave bits from two sort keys to form a joint sort key.

Examples that are similar in both of the provided keys will have similar values for the key defined by this function. Useful for tasks with two text fields like machine translation or natural language inference.
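
A hedged sketch of using interleave_keys as a sort_key for a paired-text dataset (the "src" and "trg" field names and train_set are assumptions):

>>> from torchtext import data
>>> from torchtext.data import interleave_keys
>>>
>>> def sort_key(ex):
...     # examples similar in both source and target length get similar keys,
...     # so buckets contain examples that need little padding
...     return interleave_keys(len(ex.src), len(ex.trg))
>>>
>>> train_iter = data.BucketIterator(train_set, batch_size=64, sort_key=sort_key)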