Dataset enrichment using LLM's ✨
11-30, 11:35–12:05 (Europe/Amsterdam), Bohr

Onboard the LLM hype-train! 🚂 Are you curious about what LLM's (Large Language Models) can do for you? One of their use cases is to enrich datasets, by converting text into structured data. This can have huge benefits. Useful facts that were previously hidden inside a large piece of text can now be unveiled, allowing you to more accurately filter and query your data.


Welcome! Let's discover more about LLM's together. You will learn about how to tame the LLM to get the output you want in a playful way 👏🏻.

Join us if you 🫵:
- Are interested in LLM's
- Want to know how you can use LLM's to extract structured information from text
- You know some Python

Contents of the talk 📌:

  1. [1 min] Intro
  2. [3 min] The current state of LLM's
  3. [3 min] Usecase: dataset with mixed structured/unstructured data
  4. [10 min] Prompt engineering
  5. [8 min] Going structured (JSON, Function Calling)
  6. [5 min] Conclusion & results

🏡 What you will take home

At the end of the talk, you will be taking home the following:
- What LLM's are currently around and how the landscape looks like
- How you can use a LLM to gather extract previously hidden features
- How you should deal with LLM's deviating from the requested output protocol
- What Function Calling is and how you can use it to your advantage
- How different LLM's perform (GPT, PaLM, ...) for extracting data from the housing usecase 🏡

❤️ Open Source Software

The Python libraries used to access the LLM and post-process the LLM results are openai and instructor, which are both Open Source. The LLM's used are hosted by cloud providers but could alternatively be swapped out for Open Source alternatives.

🎒 Pre-requisites

Some Python knowledge is recommended, but the talk can also be followed without! For the rest no previous knowledge about LLM's is required ✓.


Prior Knowledge Expected

No previous knowledge expected

Jeroen is a Machine Learning Engineer at Xebia Data (formerly GoDataDriven), in The Netherlands. Jeroen has a background in Software Engineering and Data Science and helps companies take their Machine Learning solutions into production.
Besides his usual work, Jeroen has been active in the Open Source community. Jeroen published several PyPi modules, npm modules, and has contributed to several large open source projects (Hydra from Facebook and Emberfire from Google). Jeroen also authored two chrome extensions, which are published on the web store.

Hope to see you at PyData Eindhoven 🇳🇱! 👋🏻