text_completion_dataset
- torchtune.datasets.text_completion_dataset(tokenizer: ModelTokenizer, source: str, column: str = 'text', add_eos: bool = True, packed: bool = False, split_across_pack: bool = True, split: str = 'train', filter_fn: Optional[Callable] = None, **load_dataset_kwargs: Dict[str, Any]) → Union[TextCompletionDataset, PackedDataset]
Build a configurable dataset from a freeform, unstructured text corpus similar to datasets used in pre-training. This method should be used to configure a custom text dataset from the yaml config instead of using TextCompletionDataset directly, as it is made to be config friendly.
- Parameters:
  - tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
  - source (str) – path to dataset repository on Hugging Face. For local datasets, define source as the data file type (e.g. "json", "csv", "text") and pass in the filepath in data_files. See Hugging Face's load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path) for more details.
  - column (str) – name of the column in the sample that contains the text data. This is typically required for Hugging Face datasets or tabular data. For local datasets with a single column (e.g. unstructured txt files), use the default "text", which is the column name Hugging Face datasets assign when loading into memory. Default is "text".
  - add_eos (bool) – whether to add an EOS token to the end of the sequence. Default is True.
  - packed (bool) – whether or not to pack the dataset to max_seq_len prior to training. Default is False.
  - split_across_pack (bool) – if the last sample in a pack does not fit in max_seq_len, whether to split the sample across the pack boundary into the next pack, or move it entirely to the beginning of the next pack. For pre-training, this is typically set to True for general text completion. For fine-tuning, this is typically set to False to avoid truncating sentences in instruct tuning. This argument is ignored if packed=False. Default is True.
  - split (str) – split argument for datasets.load_dataset. You can use this argument to load a subset of a given split, e.g. split="train[:10%]". Default is "train".
  - filter_fn (Optional[Callable]) – callable used to filter the dataset prior to any pre-processing. See the Hugging Face docs for more details.
  - **load_dataset_kwargs (Dict[str, Any]) – additional keyword arguments to pass to load_dataset.
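To make the split_across_pack semantics above concrete, the sketch below greedily packs tokenized samples into fixed-size packs. This is an illustrative toy, not torchtune's actual PackedDataset implementation, and the helper pack_sequences is hypothetical; it assumes every sample fits in max_seq_len when split_across_pack=False.

```python
# Illustrative sketch only: a simplified greedy packer showing how
# split_across_pack changes pack boundaries. NOT torchtune's actual
# PackedDataset implementation; pack_sequences is a hypothetical helper.

def pack_sequences(samples, max_seq_len, split_across_pack):
    """Greedily concatenate tokenized samples into packs of at most max_seq_len."""
    packs, current = [], []
    for tokens in samples:
        while tokens:
            space = max_seq_len - len(current)
            if len(tokens) <= space:
                # The whole (remaining) sample fits in the current pack.
                current.extend(tokens)
                tokens = []
            elif split_across_pack:
                # Fill the remainder of this pack; spill the rest into the next.
                current.extend(tokens[:space])
                tokens = tokens[space:]
                packs.append(current)
                current = []
            else:
                # Close this pack and move the whole sample to the next one.
                # (Assumes len(tokens) <= max_seq_len, else this would loop.)
                packs.append(current)
                current = []
            if len(current) == max_seq_len:
                packs.append(current)
                current = []
    if current:
        packs.append(current)
    return packs

samples = [[1, 2, 3], [4, 5, 6, 7]]
print(pack_sequences(samples, max_seq_len=4, split_across_pack=True))
# → [[1, 2, 3, 4], [5, 6, 7]]   second sample is split across the boundary
print(pack_sequences(samples, max_seq_len=4, split_across_pack=False))
# → [[1, 2, 3], [4, 5, 6, 7]]   second sample moves whole to the next pack
```

Note how split_across_pack=True yields a fully filled first pack at the cost of splitting a sample mid-sequence, which is acceptable for pre-training but usually undesirable for instruct tuning.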
Examples
>>> from torchtune.datasets import text_completion_dataset
>>> dataset = text_completion_dataset(
...     tokenizer=tokenizer,
...     source="allenai/c4",
...     column="text",
...     data_dir="realnewslike",
...     packed=False,
...     split="train",
... )
This can also be accomplished via the yaml config:
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: allenai/c4
  column: text
  data_dir: realnewslike
  packed: False
  split: train
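The local-file pattern from the source parameter description can be expressed the same way in yaml. This is a sketch assuming a plain-text corpus; the filename my_corpus.txt is a hypothetical placeholder:

```yaml
# Hypothetical config for a local unstructured text file.
# source is set to the file type and the path goes in data_files,
# as described for the source parameter above.
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: text
  column: text
  data_files: my_corpus.txt  # placeholder path
  split: train
```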
- Returns:
  the configured TextCompletionDataset, or PackedDataset if packed=True
- Return type:
  Union[TextCompletionDataset, PackedDataset]
- Raises:
  ValueError – If packed=True and tokenizer.max_seq_len is not set.
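The raise condition above amounts to a precondition check: packing needs a fixed pack length, which comes from the tokenizer. The guard below is an illustrative sketch of that check, not torchtune's exact source; check_packed_requirements is a hypothetical helper.

```python
# Illustrative sketch of the precondition implied by the Raises section;
# not torchtune's exact implementation.
def check_packed_requirements(packed: bool, max_seq_len) -> None:
    """Raise ValueError if packing is requested without a pack length."""
    if packed and max_seq_len is None:
        raise ValueError(
            "Packing requires max_seq_len to be set on the tokenizer."
        )

# Passing packed=True with no max_seq_len triggers the error:
try:
    check_packed_requirements(packed=True, max_seq_len=None)
except ValueError as err:
    print(f"raised: {err}")
```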