Jacob Tomlinson - Accelerating fuzzy document deduplication to improve LLM training w/ RAPIDS & Dask
Description :
www.pydata.org Training Large Language Models (LLMs) requires a vast amount of input data, and the higher the quality of that data, the better the model will be at producing useful natural language. NVIDIA NeMo Data Curator is a toolkit built with RAPIDS and Dask for extracting, cleaning, filtering, and deduplicating training data for LLMs. In this...
Related Videos :
Sara Iris Garcia - API development for data analysts/scientists with FastApi | PyData Global 2023 By: PyData |
Ville Tuulos - Compute anything with Metaflow | PyData Global 2023 By: PyData |
Pattaniyil, Ravi, & Zengin - Using LLMs to improve your Search Engine | PyData Global 2023 By: PyData |
Cainã Max Couto da Silva - Intro to ML: How to Prevent Data Leakage and Build Efficient Workflows By: PyData |
AI can't cross this line and we don't know why. By: Welch Labs |
Run ALL Your AI Locally in Minutes (LLMs, RAG, and more) By: Cole Medin |