Dia: Innovative TTS System by Toby and Jay at Nari Labs

Image copyright Youtube

Two ambitious undergraduates, Toby and Jay, have released Dia, a 1.6-billion-parameter text-to-speech (TTS) model built under the banner of Nari Labs. The model stands alongside industry giants like ElevenLabs, with strong output quality and fine control over scripts and voices. The pair drew inspiration from SoundStorm and Parakeet. Compute was their biggest challenge, and they ultimately relied on grants from Google's TPU Research Cloud to train the model.

Dia is available on GitHub and Hugging Face, where enthusiasts can try text-to-speech synthesis and voice cloning, with results akin to the podcast-style audio of NotebookLM. The road was not without bumps: the team grappled with issues such as audio speed and variation in voices. Splitting scripts into segments and adjusting audio speed with tools like librosa and Rubber Band improved the output quality, though some quirks remain.
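The segmentation and speed adjustment described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual code. It assumes the script marks speaker turns with tags such as `[S1]` and `[S2]`; the helper names `split_turns`, `chunk_turns`, and `adjust_speed` are hypothetical, and only `librosa.effects.time_stretch` is a real library call.

```python
import re
from typing import List

# Assumption: speaker turns are marked with tags such as [S1] and [S2].
TURN_PATTERN = re.compile(r"(\[S\d+\])")


def split_turns(script: str) -> List[str]:
    """Split a tagged script into one string per speaker turn.

    Any text that appears before the first tag is ignored.
    """
    parts = TURN_PATTERN.split(script)
    turns = []
    # parts looks like ['', '[S1]', ' text ', '[S2]', ' text ', ...]
    for tag, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:
            turns.append(f"{tag} {text}")
    return turns


def chunk_turns(turns: List[str], max_turns: int = 2) -> List[str]:
    """Group consecutive turns into short segments to synthesize one by one."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    return [
        " ".join(turns[i:i + max_turns])
        for i in range(0, len(turns), max_turns)
    ]


def adjust_speed(samples, rate: float):
    """Time-stretch audio without changing pitch (rate < 1 slows it down)."""
    import librosa  # imported lazily so the text helpers work without it
    return librosa.effects.time_stretch(samples, rate=rate)


script = "[S1] Hello there. [S2] Hi! [S1] How are you?"
segments = chunk_turns(split_turns(script), max_turns=2)
```

Each entry in `segments` would be sent to the model as its own short generation, and the resulting clips could then be passed through `adjust_speed` and stitched together.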

The model's use of classifier-free guidance plays a pivotal role in determining audio speed, which is why generating short clips tends to give the best results. Future plans include integrating Dia into the MLX-Audio library, expanding its reach and usability. Real-time applications may be a stretch, but Dia's strength is high-quality audio tailored for podcast-style content. Enthusiasts are encouraged to dive into the code, experiment with the system, and share feedback on how it performs compared to established models like Kokoro.
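For readers unfamiliar with classifier-free guidance, the standard formulation blends a prediction conditioned on the text with an unconditioned one. The sketch below shows that generic blend only. It is an assumption that Dia applies it in this form, and the function name `apply_cfg` and the example values are hypothetical.

```python
from typing import List, Sequence


def apply_cfg(
    cond_logits: Sequence[float],
    uncond_logits: Sequence[float],
    scale: float,
) -> List[float]:
    """Blend conditional and unconditional logits.

    scale = 0 returns the unconditional logits, scale = 1 returns the
    conditional logits, and scale > 1 pushes further toward the condition.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit sequences must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Hypothetical logits for two candidate audio tokens.
guided = apply_cfg([2.0, 0.0], [1.0, 1.0], scale=3.0)
```

A larger `scale` makes the output follow the text more strictly, which is one way a guidance setting can end up influencing characteristics of the generated audio such as pacing.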

Watch Dia 1.6B TTS for NotebookLM Podcasts on YouTube

Viewer Reactions for Dia 1.6B TTS for NotebookLM Podcasts

Voice cloning is a popular topic of interest

Users are curious about fine-tuning the model for other languages

Some users are interested in the technical specifications for running the model

Comparison with other models like Kokoro is mentioned

Questions about audio stitching and maintaining consistent voices

Users are discussing the use of different TTS models for voice alternation

Some users express concerns about voice cloning and about transitioning from JAX to PyTorch

Comparison with Google's unreleased TTS model is brought up

Users are impressed by the capabilities of the model and aspire to reach similar levels

Some users express jealousy over the model's development timeline and their own programming experience.

Sam Witteveen

Unveiling Gemini 2.5 TTS: Mastering Single and Multi-Speaker Audio Generation

Discover the Gemini 2.5 TTS model unveiled at Google I/O, offering single and multi-speaker text-to-speech capabilities. Control speech style, experiment with different voices, and craft engaging audio experiences with Gemini's native audio out feature.

Sam Witteveen

Google IO 2025: Innovations in Models and Content Creation

Google I/O 2025 showcased continuous model releases, including 2.5 Flash and Gemini Diffusion. The event introduced the Imagen 4 and Veo 3 models, which power the new filmmaking product Flow. Gemini's integration of MCP and the AI Studio refresh highlight Google's commitment to technological advancement and user empowerment.

Sam Witteveen

Nvidia Parakeet: Lightning-Fast English Transcriptions for Precise Audio-to-Text Conversion

Explore the latest in speech-to-text technology with Nvidia's Parakeet model. This compact powerhouse offers lightning-fast and accurate English transcriptions, perfect for quick and precise audio-to-text conversion. Available for commercial use on Hugging Face, Parakeet is a game-changer in the world of transcription.

Sam Witteveen

Optimizing AI Interactions: Gemini's Implicit Caching Guide

The Gemini team introduces implicit caching, offering a 75% token discount based on previous prompts. Learn how it optimizes AI interactions and saves costs. Explore the benefits, limitations, and future potential in this guide.