09 Feb 10:12

jogonba2

v0.2.2: Add sentence rewriting and polish documentation.

This release adds:

Sentence rewriting extractor and packer to generate mixcase datasets. Unlike the gap and masking extractors, this one selects a subset of the document's sentences and the LLM has to rewrite them in its own words.
Argument validation in extractors.
Removal of private methods from the documentation.
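
To illustrate the sentence rewriting idea described above, here is a minimal sketch in plain Python. It is not TextMachina's implementation: `split_sentences`, `rewrite_selected`, the `fraction` and `seed` parameters, and the per-sentence labels are all hypothetical names chosen for this example, and the `rewrite` callable stands in for an LLM call.

```python
import random
import re
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    # Naive splitter for illustration only: break after ., ! or ?
    # when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def rewrite_selected(
    text: str,
    rewrite: Callable[[str], str],  # stand-in for an LLM call (assumption)
    fraction: float = 0.5,
    seed: int = 0,
) -> Tuple[str, List[int]]:
    """Rewrite a random subset of sentences.

    Returns the mixed text and one label per sentence:
    1 = rewritten by the model, 0 = original sentence.
    """
    sentences = split_sentences(text)
    if not sentences:
        return "", []
    # Select at least one sentence, and never more than are available.
    n_selected = min(len(sentences), max(1, round(len(sentences) * fraction)))
    rng = random.Random(seed)
    selected = set(rng.sample(range(len(sentences)), n_selected))
    labels = [1 if i in selected else 0 for i in range(len(sentences))]
    mixed = [rewrite(s) if i in selected else s for i, s in enumerate(sentences)]
    return " ".join(mixed), labels
```

Calling `rewrite_selected(document, my_llm_rewrite)` with any callable that maps a sentence to its paraphrase yields a text that interleaves original and rewritten sentences, together with labels that could serve as mixcase annotations.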


26 Jan 18:22

jogonba2

v0.2.1: Add documentation and fix naming.

This release adds:

Documentation: https://textmachina.readthedocs.io/en/latest/
Documentation-related extras for developers in setup.py.
Fixes for some function names that were incorrectly autocompleted.


25 Jan 12:04

jogonba2

v.0.2.0: More providers, extractors, examples, and refactor 🥳

The 0.2.0 release of TextMachina includes:

New providers: Amazon Bedrock, AI21, Azure OpenAI, and inference servers (vllm and trt).
Refactor the HuggingFace Remote provider to handle retries through HTTPAdapter.
Two new extractors for mixcase tasks: sentence_masking and word_masking. Unlike the sentence_gap and word_gap extractors, where LLMs write text between boundaries, with these extractors LLMs must reconstruct masked spans within whole texts.
Extend the dataset generator for mixcase tasks to consider masking extractors.
Add config examples to learn about the extractors.
Small refactors: colors in logger, inheritance in some tokenizers, etc.
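
The difference between the gap and masking extractors can be sketched as follows. This is an illustration under stated assumptions, not TextMachina's code: the function names, the `MASK-n` token format, and the choice of a single boundary sentence on each side of the gap are all hypothetical.

```python
from typing import Dict, List


def build_gap_input(sentences: List[str], gap_index: int) -> Dict[str, str]:
    """Gap style: the model only sees the boundaries around the gap
    and writes the text that goes between them."""
    if not 0 < gap_index < len(sentences) - 1:
        raise ValueError("gap_index must have a sentence on both sides")
    return {
        "left": sentences[gap_index - 1],
        "right": sentences[gap_index + 1],
    }


def build_masked_input(sentences: List[str], masked: List[int]) -> str:
    """Masking style: the model sees the whole text, with the selected
    sentences replaced by numbered mask tokens it must reconstruct."""
    # Number the masks in order of appearance (hypothetical token format).
    mask_ids = {idx: n for n, idx in enumerate(sorted(set(masked)))}
    return " ".join(
        f"MASK-{mask_ids[i]}" if i in mask_ids else sentence
        for i, sentence in enumerate(sentences)
    )
```

With `sentences = ["One.", "Two.", "Three.", "Four."]`, the gap input for index 1 contains only `"One."` and `"Three."`, while masking indices 1 and 3 keeps the full context: `"One. MASK-0 Three. MASK-1"`.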


19 Jan 16:40

jogonba2

v.0.1.0: Mixcase tasks and more 🥳

This release of TextMachina includes:

Allow passing parameters to the extractors outside the prompt templates. The templates must be used only to define placeholders.
Add MixCaseDatasetGenerator to generate datasets for mixcase tasks (detection tagging). Other datasets, like mixcase classification, can be built outside TextMachina from the datasets this generator produces.
Add sentence_gap and word_gap extractors for mixcase tasks.
Refactor interactive exploration. Now we have one class per task, and each one must build its own panels.
Add exploration for mixcase datasets.
Add a TokenClassificationMetric to evaluate HF models on mixcase and boundary tasks.
Better structured and documented examples. Now we have examples/learning to illustrate how to use providers/tasks/extractors and examples/use_cases with additional config files.
Minor quality-of-life changes: require task_type in the CLI to prevent confusion, disable random_sample_human on boundary detection tasks, etc.
Document all the new code and improve existing documentation.
Extend the README to cover mixcase tasks and include figures that visualize each type of task.


09 Jan 08:28

jogonba2

v0.0.10

Updated arXiv citation in README.


08 Jan 16:46

jogonba2

v0.0.9

First release 🎉

First release of TextMachina that includes:

Dataset generators: for detection, attribution, and boundary detection tasks.
Five model providers: Anthropic, Cohere, HuggingFace (local and remote), OpenAI, and Vertex AI.
Six extractors to fill prompt templates: Auxiliary, Entities, Nouns, Sentence prefix, Word prefix, and Combined.
One decoding constrainer: Length constrainer.
Five metrics to assess task difficulty and dataset quality: MAUVE, Perplexity, Repetition, Diversity, and baseline models.
Post-processing functions to improve the quality of the datasets and prevent common biases.
CLI to generate and explore datasets.
Configuration examples, under the folder etc/examples, to test different tasks and model providers.
