diff --git a/README.md b/README.md
index 4c1e765..00d56df 100644
--- a/README.md
+++ b/README.md
@@ -16,6 +16,9 @@ ### Thanks to all the contributors !
+## News
+- **2024/10/08**: F5-TTS & E2 TTS base models on [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS), [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN).
+
 ## Installation
 
 ```bash
 pip install -e .
@@ -48,6 +51,31 @@ pip install -e .
 docker build -t f5tts:v1 .
 ```
 
+
+## Inference
+
+### 1. Basic usage
+
+```bash
+# cli inference
+f5-tts_infer-cli
+
+# gradio interface
+f5-tts_infer-gradio
+```
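+
+For example, reference audio and texts can be overridden from the command line (flag names as defined in `src/f5_tts/infer/infer_cli.py` in this PR; the paths below are illustrative placeholders):
+
+```bash
+f5-tts_infer-cli \
+--model "F5-TTS" \
+--ref_audio "path/to/your_ref_audio.wav" \
+--ref_text "The transcript of your reference audio." \
+--gen_text "The text you want the model to synthesize."
+```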
+
+### 2. More instructions
+
+- For better generation results, take a moment to read the [detailed guidance](src/f5_tts/infer/README.md).
+- The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) page is a useful resource; search for keywords related to the problem you encounter first, and feel free to open a new issue if no answer is found.
+
+
+## [Training](src/f5_tts/train/README.md)
+
+
+## [Evaluation](src/f5_tts/eval/README.md)
+
+
 ## Development
 
 Use pre-commit to ensure code quality (will run linters and formatters automatically)
 
@@ -65,95 +93,6 @@ pre-commit run --all-files
 
 Note: Some model components have linting exceptions for E722 to accommodate tensor notation
 
-## Inference
-
-```python
-import gradio as gr
-from f5_tts.gradio_app import app
-
-with gr.Blocks() as main_app:
-    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
-
-    # ... other Gradio components
-
-    app.render()
-
-main_app.launch()
-
-```
-
-The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or automatically downloaded with `inference-cli` and `gradio_app`.
-
-Currently support 30s for a single generation, which is the **TOTAL** length of prompt audio and the generated. Batch inference with chunks is supported by `inference-cli` and `gradio_app`.
-- To avoid possible inference failures, make sure you have seen through the following instructions.
-- A longer prompt audio allows shorter generated output. The part longer than 30s cannot be generated properly. Consider using a prompt audio <15s.
-- Uppercased letters will be uttered letter by letter, so use lowercased letters for normal words.
-- Add some spaces (blank: " ") or punctuations (e.g. "," ".") to explicitly introduce some pauses. If first few words skipped in code-switched generation (cuz different speed with different languages), this might help.
-
-### CLI Inference
-
-Either you can specify everything in `inference-cli.toml` or override with flags. Leave `--ref_text ""` will have ASR model transcribe the reference audio automatically (use extra GPU memory). If encounter network error, consider use local ckpt, just set `ckpt_file` in `inference-cli.py`
-
-for change model use `--ckpt_file` to specify the model you want to load,
-for change vocab.txt use `--vocab_file` to provide your vocab.txt file.
-
-```bash
-# switch to the main directory
-cd f5_tts
-
-python inference-cli.py \
---model "F5-TTS" \
---ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \
---ref_text "Some call me nature, others call me mother nature." \
---gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."
-
-python inference-cli.py \
---model "E2-TTS" \
---ref_audio "tests/ref_audio/test_zh_1_ref_short.wav" \
---ref_text "对,这就是我,万人敬仰的太乙真人。" \
---gen_text "突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道,我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"
-
-# Multi voice
-# https://github.com/SWivid/F5-TTS/pull/146#issue-2595207852
-python inference-cli.py -c samples/story.toml
-```
-
-### Gradio App
-Currently supported features:
-- Chunk inference
-- Podcast Generation
-- Multiple Speech-Type Generation
-- Voice Chat powered by Qwen2.5-3B-Instruct
-
-You can launch a Gradio app (web interface) to launch a GUI for inference (will load ckpt from Huggingface, you may also use local file in `gradio_app.py`). Currently load ASR model, F5-TTS and E2 TTS all in once, thus use more GPU memory than `inference-cli`.
-
-```bash
-python f5_tts/gradio_app.py
-```
-
-You can specify the port/host:
-
-```bash
-python f5_tts/gradio_app.py --port 7860 --host 0.0.0.0
-```
-
-Or launch a share link:
-
-```bash
-python f5_tts/gradio_app.py --share
-```
-
-### Speech Editing
-
-To test speech editing capabilities, use the following command.
-
-```bash
-python f5_tts/speech_edit.py
-```
-
-## [Training](src/f5_tts/train/README.md)
-
-## [Evaluation](src/f5_tts/eval/README.md)
 
 ## Acknowledgements
diff --git a/pyproject.toml b/pyproject.toml
index 251eb6f..d5f2e36 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -55,4 +55,5 @@ eval = [
 Homepage = "https://github.com/SWivid/F5-TTS"
 
 [project.scripts]
-"inference-cli" = "f5_tts.inference_cli:main"
+"f5-tts_infer-cli" = "f5_tts.infer.infer_cli:main"
+"f5-tts_infer-gradio" = "f5_tts.infer.infer_gradio:main"
diff --git a/src/f5_tts/api.py b/src/f5_tts/api.py
index 539e34f..3eccc0d 100644
--- a/src/f5_tts/api.py
+++ b/src/f5_tts/api.py
@@ -130,8 +130,8 @@ if __name__ == "__main__":
         ref_file=str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav")),
         ref_text="some call me nature, others call me mother nature.",
         gen_text="""I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.""",
-        file_wave=str(files("f5_tts").joinpath("../../api_test_out.wav")),
-        file_spect=str(files("f5_tts").joinpath("../../api_test_out.png")),
+        file_wave=str(files("f5_tts").joinpath("../../tests/api_out.wav")),
+        file_spect=str(files("f5_tts").joinpath("../../tests/api_out.png")),
         seed=-1,  # random seed = -1
     )
diff --git a/src/f5_tts/infer/README.md b/src/f5_tts/infer/README.md
new file mode 100644
index 0000000..e70f761
--- /dev/null
+++ b/src/f5_tts/infer/README.md
@@ -0,0 +1,92 @@
+## Inference
+
+The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or will be automatically downloaded when running inference scripts.
+
+A single generation currently supports up to **30s**, which is the **total length** of prompt and output audio combined. However, you can leverage `infer_cli` and `infer_gradio` for longer text; they will automatically perform chunked generation. A long reference audio will be clipped to ~15s.
+
+To avoid possible inference failures, make sure you have read through the following instructions.
+
+- Uppercase letters will be uttered letter by letter, so use lowercase letters for normal words.
+- Add some spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce some pauses.
+- Preprocess numbers into Chinese characters if you want them read in Chinese; otherwise they will be read in English.
+
+# TODO 👇 ...
+
+### CLI Inference
+
+The CLI entry point `f5-tts_infer-cli` can be used for the following commands.
+
+You can either specify everything in `inference-cli.toml` or override settings with flags. Leaving `--ref_text ""` will have the ASR model transcribe the reference audio automatically (using extra GPU memory). If you encounter a network error, consider using a local checkpoint: just set `ckpt_file` in `inference-cli.py`.
+
+To change the model, use `--ckpt_file` to specify the checkpoint you want to load;
+to change the vocabulary, use `--vocab_file` to provide your own vocab.txt file.
+
+```bash
+# switch to the main directory
+cd f5_tts
+
+python inference-cli.py \
+--model "F5-TTS" \
+--ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \
+--ref_text "Some call me nature, others call me mother nature." \
+--gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."
+
+python inference-cli.py \
+--model "E2-TTS" \
+--ref_audio "tests/ref_audio/test_zh_1_ref_short.wav" \
+--ref_text "对,这就是我,万人敬仰的太乙真人。" \
+--gen_text "突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道,我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"
+
+# Multi voice
+# https://github.com/SWivid/F5-TTS/pull/146#issue-2595207852
+python inference-cli.py -c samples/story.toml
+```
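+
+Assuming the console-script entry points registered in this PR's `pyproject.toml`, the same flows can be sketched with `f5-tts_infer-cli` (run from the repository root, since the example configs use repo-relative paths):
+
+```bash
+# uses the default config at src/f5_tts/infer/examples/basic/basic.toml
+f5-tts_infer-cli
+
+# multi-voice generation with the relocated story config
+f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
+```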
"," ".") to explicitly introduce some pauses. +- Preprocess numbers to Chinese letters if you want to have them read in Chinese, otherwise in English. + +# TODO 👇 ... + +### CLI Inference + +It is possible to use cli `f5-tts_infer-cli` for following commands. + +Either you can specify everything in `inference-cli.toml` or override with flags. Leave `--ref_text ""` will have ASR model transcribe the reference audio automatically (use extra GPU memory). If encounter network error, consider use local ckpt, just set `ckpt_file` in `inference-cli.py` + +for change model use `--ckpt_file` to specify the model you want to load, +for change vocab.txt use `--vocab_file` to provide your vocab.txt file. + +```bash +# switch to the main directory +cd f5_tts + +python inference-cli.py \ +--model "F5-TTS" \ +--ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \ +--ref_text "Some call me nature, others call me mother nature." \ +--gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences." + +python inference-cli.py \ +--model "E2-TTS" \ +--ref_audio "tests/ref_audio/test_zh_1_ref_short.wav" \ +--ref_text "对,这就是我,万人敬仰的太乙真人。" \ +--gen_text "突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道,我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?" + +# Multi voice +# https://github.com/SWivid/F5-TTS/pull/146#issue-2595207852 +python inference-cli.py -c samples/story.toml +``` + +### Gradio App +Currently supported features: +- Chunk inference +- Podcast Generation +- Multiple Speech-Type Generation +- Voice Chat powered by Qwen2.5-3B-Instruct + +It is possible to use cli `f5-tts_infer-gradio` for following commands. + +You can launch a Gradio app (web interface) to launch a GUI for inference (will load ckpt from Huggingface, you may also use local file in `gradio_app.py`). Currently load ASR model, F5-TTS and E2 TTS all in once, thus use more GPU memory than `inference-cli`. + +```bash +python f5_tts/gradio_app.py +``` + +You can specify the port/host: + +```bash +python f5_tts/gradio_app.py --port 7860 --host 0.0.0.0 +``` + +Or launch a share link: + +```bash +python f5_tts/gradio_app.py --share +``` + +```python +import gradio as gr +from f5_tts.gradio_app import app + +with gr.Blocks() as main_app: + gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app") + + # ... other Gradio components + + app.render() + +main_app.launch() +``` + +### Speech Editing + +To test speech editing capabilities, use the following command. + +```bash +python f5_tts/speech_edit.py +``` \ No newline at end of file diff --git a/src/f5_tts/infer/examples/basic/basic.toml b/src/f5_tts/infer/examples/basic/basic.toml index b6bea1c..cc3fbda 100644 --- a/src/f5_tts/infer/examples/basic/basic.toml +++ b/src/f5_tts/infer/examples/basic/basic.toml @@ -1,9 +1,9 @@ # F5-TTS | E2-TTS model = "F5-TTS" -ref_audio = "tests/ref_audio/test_en_1_ref_short.wav" +ref_audio = "src/f5_tts/infer/examples/basic/basic_ref_en.wav" # If an empty "", transcribes the reference audio automatically. ref_text = "Some call me nature, others call me mother nature." -gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences." +gen_text = "I don't really care what you call me. 
+
+The Gradio app can also be embedded within a larger Gradio application:
+
+```python
+import gradio as gr
+from f5_tts.gradio_app import app
+
+with gr.Blocks() as main_app:
+    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
+
+    # ... other Gradio components
+
+    app.render()
+
+main_app.launch()
+```
+
+### Speech Editing
+
+To test speech editing capabilities, use the following command.
+
+```bash
+python f5_tts/speech_edit.py
+```
\ No newline at end of file
diff --git a/src/f5_tts/infer/examples/basic/basic.toml b/src/f5_tts/infer/examples/basic/basic.toml
index b6bea1c..cc3fbda 100644
--- a/src/f5_tts/infer/examples/basic/basic.toml
+++ b/src/f5_tts/infer/examples/basic/basic.toml
@@ -1,9 +1,9 @@
 # F5-TTS | E2-TTS
 model = "F5-TTS"
-ref_audio = "tests/ref_audio/test_en_1_ref_short.wav"
+ref_audio = "src/f5_tts/infer/examples/basic/basic_ref_en.wav"
 # If an empty "", transcribes the reference audio automatically.
 ref_text = "Some call me nature, others call me mother nature."
-gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."
+gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
 # File with text to generate. Ignores the text above.
 gen_file = ""
 remove_silence = false
diff --git a/src/f5_tts/infer/examples/multi/story.toml b/src/f5_tts/infer/examples/multi/story.toml
index 93022c4..bdf4455 100644
--- a/src/f5_tts/infer/examples/multi/story.toml
+++ b/src/f5_tts/infer/examples/multi/story.toml
@@ -1,19 +1,19 @@
 # F5-TTS | E2-TTS
 model = "F5-TTS"
-ref_audio = "samples/main.flac"
+ref_audio = "src/f5_tts/infer/examples/multi/main.flac"
 # If an empty "", transcribes the reference audio automatically.
 ref_text = ""
 gen_text = ""
 # File with text to generate. Ignores the text above.
-gen_file = "samples/story.txt"
+gen_file = "src/f5_tts/infer/examples/multi/story.txt"
 remove_silence = true
-output_dir = "samples"
+output_dir = "tests"
 
 [voices.town]
-ref_audio = "samples/town.flac"
+ref_audio = "src/f5_tts/infer/examples/multi/town.flac"
 ref_text = ""
 
 [voices.country]
-ref_audio = "samples/country.flac"
+ref_audio = "src/f5_tts/infer/examples/multi/country.flac"
 ref_text = ""
diff --git a/src/f5_tts/infer/infer_cli.py b/src/f5_tts/infer/infer_cli.py
index 546428b..d56408f 100644
--- a/src/f5_tts/infer/infer_cli.py
+++ b/src/f5_tts/infer/infer_cli.py
@@ -21,15 +21,15 @@ from f5_tts.infer.utils_infer import (
 parser = argparse.ArgumentParser(
-    prog="python3 inference-cli.py",
+    prog="python3 infer-cli.py",
     description="Commandline interface for E2/F5 TTS with Advanced Batch Processing.",
-    epilog="Specify options above to override one or more settings from config.",
+    epilog="Specify options above to override one or more settings from config.",
 )
 parser.add_argument(
     "-c",
     "--config",
-    help="Configuration file. Default=inference-cli.toml",
-    default=os.path.join(files("f5_tts").joinpath("data"), "inference-cli.toml"),
+    help="Configuration file. Default=infer/examples/basic/basic.toml",
+    default=os.path.join(files("f5_tts").joinpath("infer/examples/basic"), "basic.toml"),
 )
 parser.add_argument(
     "-m",
     "--model",
@@ -80,6 +80,8 @@ args = parser.parse_args()
 config = tomli.load(open(args.config, "rb"))
 
 ref_audio = args.ref_audio if args.ref_audio else config["ref_audio"]
+if "src/f5_tts/infer/examples/basic" in ref_audio:  # for pip pkg user
+    ref_audio = str(files("f5_tts").joinpath(f"../../{ref_audio}"))
 ref_text = args.ref_text if args.ref_text != "666" else config["ref_text"]
 gen_text = args.gen_text if args.gen_text else config["gen_text"]
 gen_file = args.gen_file if args.gen_file else config["gen_file"]
@@ -90,8 +92,8 @@ model = args.model if args.model else config["model"]
 ckpt_file = args.ckpt_file if args.ckpt_file else ""
 vocab_file = args.vocab_file if args.vocab_file else ""
 remove_silence = args.remove_silence if args.remove_silence else config["remove_silence"]
-wave_path = Path(output_dir) / "out.wav"
-spectrogram_path = Path(output_dir) / "out.png"
+wave_path = Path(output_dir) / "infer_cli_out.wav"
+# spectrogram_path = Path(output_dir) / "infer_cli_out.png"
 vocos_local_path = "../checkpoints/charactr/vocos-mel-24khz"
 
 vocos = load_vocoder(is_local=args.load_vocoder_from_local, local_path=vocos_local_path)
@@ -161,6 +163,10 @@ def main_process(ref_audio, ref_text, text_gen, model_obj, remove_silence):
 
     if generated_audio_segments:
         final_wave = np.concatenate(generated_audio_segments)
+
+        if not os.path.exists(output_dir):
+            os.makedirs(output_dir)
+
         with open(wave_path, "wb") as f:
             sf.write(f.name, final_wave, final_sample_rate)
             # Remove silence
diff --git a/src/f5_tts/infer/utils_infer.py b/src/f5_tts/infer/utils_infer.py
index 6094520..0b7bd3c 100644
--- a/src/f5_tts/infer/utils_infer.py
+++ b/src/f5_tts/infer/utils_infer.py
@@ -186,13 +186,12 @@ def preprocess_ref_audio_text(ref_audio_orig, ref_text, show_info=print, device=
         non_silent_segs = silence.split_on_silence(aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=1000)
         non_silent_wave = AudioSegment.silent(duration=0)
         for non_silent_seg in non_silent_segs:
+            if len(non_silent_wave) > 10000 and len(non_silent_wave + non_silent_seg) > 18000:
+                show_info("Audio is over 18s, clipping short.")
+                break
             non_silent_wave += non_silent_seg
         aseg = non_silent_wave
 
-        audio_duration = len(aseg)
-        if audio_duration > 15000:
-            show_info("Audio is over 15s, clipping to only first 15s.")
-            aseg = aseg[:15000]
         aseg.export(f.name, format="wav")
         ref_audio = f.name