NLP with Large Language Models

Alex van Vorstenbosch

2025-02-01

The NLP lifecycle turned on its head


Regular ML:
Problem → Idea → Gather data → Train Model → Evaluate Model →
Repeat if necessary → Deploy
Duration: Months


Prompting workflow:
Problem → Idea → Gather (less) data → Finetune prompt → Evaluate Model →
Repeat if necessary → Already deployed
Duration: Days
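
A minimal sketch of this workflow, assuming the OpenAI Python SDK (v1.x) and an API key in the environment; the model name, prompt, and tiny labelled set are illustrative assumptions. Note there is no training step: you iterate on the prompt and re-run the evaluation, and the model behind the API is already deployed.

    from openai import OpenAI

    client = OpenAI()

    PROMPT = ("Classify the sentiment of the review as 'positive' or 'negative'. "
              "Answer with the label only.")  # the part you iterate on

    def predict(review: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",   # illustrative model choice
            temperature=0,         # deterministic output helps evaluation
            messages=[{"role": "system", "content": PROMPT},
                      {"role": "user", "content": review}],
        )
        return response.choices[0].message.content.strip().lower()

    # "Gather (less) data": a handful of labelled examples instead of thousands
    eval_set = [
        ("The service was great but the food was terrible", "negative"),
        ("Absolutely loved it, will come back", "positive"),
    ]

    # "Evaluate Model": measure accuracy, tweak PROMPT, repeat
    accuracy = sum(predict(text) == label for text, label in eval_set) / len(eval_set)
    print(f"accuracy: {accuracy:.0%}")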

NLP-tasks

  • Sentiment analysis
  • Named entity recognition
  • Natural language generation
  • Speech recognition
  • Speech synthesis
  • Question answering
  • Machine translation
  • Summarisation
  • Classification
  • Topic modeling
  • etc…

Jack of all trades, master of none…

  • LLMs are great at a wide range of tasks…
  • … but they aren’t state-of-the-art for specific tasks
  • Some metrics may also be skewed by a misalignment between benchmark scores and human evaluation.

… Except for QA and Reasoning

  • They are SOTA for Question Answering and Reasoning
  • This still fits the jack-of-all-trades analogy:
  • “The best student in the class on average, but not the best in any single subject”

Papers with Code

Semantic versus Pragmatic

  • Semantic meaning: Literal
  • Pragmatic meaning: Context dependent

“Wow, you really are an expert”

  • Semantic: Compliment
  • Pragmatic: sarcasm, a genuine compliment, etc., depending on the context
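
A small illustration of the point, assuming the OpenAI Python SDK: the same sentence can receive a different label once the surrounding context is supplied in the prompt. The prompt wording and model name are assumptions.

    from openai import OpenAI

    client = OpenAI()

    def interpret(utterance: str, context: str | None = None) -> str:
        user = f"Context: {context}\nUtterance: {utterance}" if context else utterance
        response = client.chat.completions.create(
            model="gpt-4o-mini",   # illustrative model choice
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "Is the utterance meant as a genuine compliment or as "
                            "sarcasm? Answer with one word."},
                {"role": "user", "content": user},
            ],
        )
        return response.choices[0].message.content

    print(interpret("Wow, you really are an expert"))
    print(interpret("Wow, you really are an expert",
                    context="Said after the speaker's colleague deleted the production database."))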

Performance for Aspect Based Sentiment Analysis

  • Aspect Based Sentiment Analysis.
    • The service was great but the food was terrible
      • service: positive
      • food: negative

Is ChatGPT a Good Sentiment Analyzer? A Preliminary Study
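
A hedged sketch of aspect-based sentiment analysis cast as a prompting task, in the spirit of the study cited above but not its exact setup; the prompt wording, JSON output format, and model name are assumptions.

    import json

    from openai import OpenAI

    client = OpenAI()

    def absa(text: str, aspects: list[str]) -> dict[str, str]:
        """Return a sentiment label per aspect, as predicted by the model."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",   # illustrative model choice
            temperature=0,
            response_format={"type": "json_object"},  # ask for parseable JSON
            messages=[
                {"role": "system",
                 "content": "For each aspect, label the sentiment expressed in the "
                            "text as 'positive', 'negative' or 'neutral'. "
                            "Respond with a JSON object mapping aspect to label."},
                {"role": "user",
                 "content": f"Text: {text}\nAspects: {', '.join(aspects)}"},
            ],
        )
        return json.loads(response.choices[0].message.content)

    print(absa("The service was great but the food was terrible",
               ["service", "food"]))
    # e.g. {"service": "positive", "food": "negative"} if the model behaves as hoped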

ChatGPT for summarization

ChatGPT for evaluating summarization
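
A hedged sketch that covers both points above: the same chat model writes a summary and then acts as a judge of summary quality (LLM-as-evaluator). The prompts, rating rubric, and model name are illustrative assumptions, not a specific paper's protocol.

    from openai import OpenAI

    client = OpenAI()

    def _ask(system: str, user: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",   # illustrative model choice
            temperature=0,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return response.choices[0].message.content.strip()

    def summarize(text: str) -> str:
        return _ask("Summarise the text in at most three sentences.", text)

    def judge_summary(source: str, summary: str) -> str:
        return _ask(
            "Rate the summary of the source text on a 1-5 scale for faithfulness "
            "and for conciseness. Answer as 'faithfulness: <n>, conciseness: <n>', "
            "followed by one sentence of justification.",
            f"Source:\n{source}\n\nSummary:\n{summary}",
        )

    article = "..."  # some longer source document
    summary = summarize(article)
    print(judge_summary(article, summary))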

What if your model is not performing up to your standard?

  • You can only include so many few-shot examples before the prompt stops being economical
  • OpenAI offers the option to finetune your model
  • Finetuning updates the model’s parameter weights to better fit your use case
  • The result is a ‘new’ model that you can call from the API in the future (a minimal sketch follows below)
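
A hedged sketch of that flow with the OpenAI Python SDK (v1.x): upload a JSONL file of example conversations, start a finetuning job, and later call the resulting model by its new name. The file name and base model below are illustrative assumptions.

    from openai import OpenAI

    client = OpenAI()

    # 1. Upload training data: one example conversation per JSONL line, e.g.
    #    {"messages": [{"role": "system", ...}, {"role": "user", ...},
    #                  {"role": "assistant", ...}]}
    training_file = client.files.create(
        file=open("train.jsonl", "rb"),   # hypothetical file name
        purpose="fine-tune",
    )

    # 2. Start the finetuning job on a base model that supports finetuning
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-4o-mini-2024-07-18",   # illustrative; check which models are supported
    )

    # 3. Poll the job; when it completes,
    #    client.fine_tuning.jobs.retrieve(job.id).fine_tuned_model holds the
    #    name of the 'new' model to use in later API calls
    print(job.id)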

Use cases of Finetuning

  • Improving reliability at producing your desired output
  • Correcting failures to follow instructions in more complex prompts
  • Performing a new skill or task that’s hard to articulate in a prompt
  • Show, don’t tell: it allows for more concise prompts, as you can show the model what answers you expect

What will Finetuning give you?

  • Higher quality results than prompting with examples.
  • Ability to train on more examples than can fit in a prompt.
  • Saving tokens due to shorter prompts.
  • Lower latency requests due to shorter prompts.

What will Finetuning give you?

  • ⇑ Finetuned models will have improved performance in the specific domain you train on.
  • ⇓ But reduced general performance.

How does finetuning work

  • OpenAI finetuning guide
    • Start with 50 examples
    • Check if this provides any improvements
    • Make sure you have an evaluation set
    • “With every doubling of the data, you may expect a similar gain in improvement”
  • Finetuned models currently cost about 3x as much as regular models
    • If this saves you 10+ few-shot examples per request, it’s quickly worth it (see the break-even sketch below).
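
A rough break-even sketch for that last point, expressed in prompt-token cost: with a 3x price multiplier, finetuning pays off once the dropped few-shot examples make the original prompt more than roughly three times as long as the finetuned one. All numbers below are illustrative assumptions, and output tokens are ignored for simplicity.

    # Back-of-the-envelope break-even check: a 3x price multiplier (per the
    # slide) versus the prompt tokens saved by dropping few-shot examples.
    # All token counts below are made-up illustrative numbers.
    PRICE_MULTIPLIER = 3        # finetuned model ~3x the base per-token price

    base_prompt_tokens = 200    # instruction + input, without few-shot examples
    tokens_per_example = 150    # assumed size of one few-shot example
    n_examples_saved = 10       # examples the finetuned model no longer needs

    few_shot_cost = base_prompt_tokens + n_examples_saved * tokens_per_example
    finetuned_cost = PRICE_MULTIPLIER * base_prompt_tokens

    print(f"few-shot prompt:  {few_shot_cost} base-price token units per request")
    print(f"finetuned prompt: {finetuned_cost} base-price token units per request")
    # 600 < 1700 here, so the finetuned model is cheaper per request; in general
    # it wins whenever the few-shot prompt is more than ~3x the short prompt.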

The optimization flow

OpenAI: A Survey of Techniques for Maximizing LLM Performance