FLD utility modules, such as the corpus loader, the corpus serializer, and the metrics calculators.
See the entry-point repository for an overview of the whole FLD project.
We currently have three branches:
NeurIPS_2024 branch (2024-12)
NLP_2024_KOBE_BEEF branch (2024-01-24)
ICML_2023 branch (2023-08-22)
Please read the instructions in the other FLD repositories CAREFULLY to determine which branch is required.
pip install -e .
python -c "import nltk; nltk.download('punkt')"
Once the raw FLD corpora have been created by FLD-generator, we have to prepare prompt-output pairs for LLM training as follows:
python ./scripts/serialize.py \
--train {train_jsonl_path} \
--valid {valid_jsonl_path} \
--test {test_jsonl_path} \
--output-dir {output_dir}
This command will output examples with added prompt_serial and proof_serial fields, corresponding to the prompt and output of the LLMs, respectively.
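Before going further, it can be useful to confirm that every serialized example actually carries both fields. The following is a minimal sketch, not part of this repository: it assumes the serialized files are JSON Lines (one JSON object per line), and the helper name check_serialized is hypothetical.

```python
import json
from pathlib import Path


def check_serialized(path, required=("prompt_serial", "proof_serial")):
    """Count examples in a JSONL file and report missing required fields.

    Returns (num_examples, missing), where missing is a list of
    (line_number, field_name) pairs. Hypothetical helper; it assumes one
    JSON object per line.
    """
    missing = []
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            example = json.loads(line)
            count += 1
            for field in required:
                if field not in example:
                    missing.append((line_no, field))
    return count, missing
```

For example, calling check_serialized on one of the files written to {output_dir} should return an empty missing list if serialization succeeded for every example.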
python ./scripts/push_to_hub.py \
--train {serialized_train_jsonl_path} \
--valid {serialized_valid_jsonl_path} \
--test {serialized_test_jsonl_path} \
--repo-id {your_name/dataset_name} \
--config-name default