What you'll learn
As an NLP Engineer, you'll often source text, code, transcripts, and embeddings. This lesson helps you quickly judge whether you can collect, train on, store, or ship data, and what obligations (like attribution) you must meet.
- Read common licenses (CC, custom ToS, APIs) and spot allowed/restricted uses.
- Differentiate rights for training, evaluation, and distribution of data or outputs.
- Set up a lightweight workflow: source vetting, attribution, and takedowns.
- Avoid common traps: non-commercial, share-alike, and no-derivatives clauses.
Who this is for
- NLP/ML engineers fine-tuning or pretraining models.
- Data scientists building datasets or evaluation sets.
- Product engineers integrating third-party text or APIs.
Prerequisites
- Basic understanding of training/fine-tuning pipelines.
- Comfort reading short legal texts (licenses/ToS). No legal background required.
Learning path
- Start: Data Licensing and Usage Rights Basics (this lesson)
- Next: Data Privacy and PII handling
- Then: Model and Output Safety Policies
- Finally: Governance and Auditability for ML
Why this matters
Real tasks you will face:
- Fine-tune a support bot on forum posts: can you use and redistribute them?
- Ship a demo that quotes news snippets: does your product need attribution?
- Scrape reviews for sentiment training: do site terms allow scraping and reuse?
- Use an API datasetâdoes its license allow training or only internal evaluation?
Mistakes here risk takedowns, reputational damage, and rework. A clear, simple process prevents surprises.
Concept explained simply
Licensing defines what you're allowed to do with data. Think of it as a set of green, amber, and red lights:
- Green: actions you're allowed to do (e.g., use, modify, train on).
- Amber: allowed if you meet conditions (e.g., attribution, share-alike).
- Red: forbidden uses (e.g., commercial use under a Non-Commercial license).
Key license families you'll see:
- Creative Commons (CC0, BY, BY-SA, BY-NC, BY-ND): public content, but often with conditions.
- Custom Terms of Service (websites, forums): may restrict scraping, training, or redistribution.
- API Terms/EULAs: usage typically gated by agreements; training may be disallowed.
- Database rights (EU/UK sui generis): restrict extraction or reuse of substantial parts of a database.
Training vs. distribution (very important):
- Training/Evaluation Use: ingesting data into a model or metrics pipeline.
- Distribution of Data: shipping raw or near-verbatim data (e.g., dataset release).
- Distribution of Outputs: model answers that may reproduce licensed text. Some licenses care if outputs include substantial or verbatim reproduction.
Mental model
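Use this RACI mnemonic for each source (here the letters stand for Record, Assess, Comply, Implement, not the project-management RACI roles):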
- R: Record source, license, and date captured.
- A: Assess permissions for train/eval/distribute.
- C: Comply with conditions (attribution, share-alike, non-commercial limits).
- I: Implement safeguards (filters, provenance tags, takedown path).
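The four steps can be kept as one small record per source. The sketch below is only an illustration: `SourceRecord`, its field names, and `allowed_uses` are hypothetical and not part of any library.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SourceRecord:
    # R: record the source, its license, and the date it was captured.
    source: str
    license_name: str
    captured_on: date
    # A: assessed permissions for each intended use.
    can_train: bool = False
    can_evaluate: bool = False
    can_distribute: bool = False
    # C: conditions to comply with (e.g. "attribution", "share-alike").
    conditions: list[str] = field(default_factory=list)
    # I: safeguards implemented so far (e.g. "output filter", "takedown path").
    safeguards: list[str] = field(default_factory=list)

    def allowed_uses(self) -> list[str]:
        """Return the uses that were assessed as permitted."""
        flags = [
            ("train", self.can_train),
            ("evaluate", self.can_evaluate),
            ("distribute", self.can_distribute),
        ]
        return [name for name, ok in flags if ok]
```

Storing these records next to the dataset means that a later question about any source can be answered from the record instead of from memory.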
Quick license glossary (practical)
- CC0: Do anything. Attribution not required. Still respect privacy/trademarks.
- CC BY: Allowed with attribution. Good for training; prepare an attribution page.
- CC BY-SA: Like BY, but derivatives must be under the same license. Risky if you redistribute text; for training, reduce risk by avoiding verbatim output reproduction and providing attribution.
- CC BY-NC: Non-commercial only. Not suitable for commercial product training.
- CC BY-ND: No derivatives. Risky for training; avoid for model training.
- Website ToS: May ban scraping, AI training, or commercial reuse. When a site's ToS is stricter than the license on its content, assume the ToS governs what you may collect from that site.
- API Terms: Often allow internal use; training or redistribution can be restricted.
- Database rights (EU/UK): Large extraction of databases may be restricted even if items are public.
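For the Creative Commons entries, the glossary can be turned into a small lookup that returns a default decision for commercial training. This is a sketch, not legal advice: `LICENSE_POLICY` and `triage` are hypothetical names, the decisions mirror this lesson's guidance only, and ToS, API terms, and database rights are left out because they have to be read case by case.

```python
# Default posture for COMMERCIAL training, following the glossary above.
# Values are (decision, notes); tuples keep the shared table immutable.
LICENSE_POLICY = {
    "CC0": ("allow", ()),
    "CC BY": ("allow-with-conditions", ("attribution",)),
    "CC BY-SA": (
        "allow-with-conditions",
        ("attribution", "share-alike review", "limit verbatim output"),
    ),
    "CC BY-NC": ("reject", ("non-commercial only",)),
    "CC BY-ND": ("reject", ("no derivatives",)),
}


def triage(license_id: str) -> tuple[str, tuple[str, ...]]:
    """Return (decision, notes) for a license string such as "CC BY-SA 4.0".

    Unknown or custom licenses are rejected pending manual review.
    """
    name = license_id.strip()
    # Drop a trailing version such as "4.0" ("CC0" has no space, so it is kept).
    base = name.rsplit(" ", 1)[0] if name[-1:].isdigit() else name
    return LICENSE_POLICY.get(base, ("reject", ("unknown license: manual review",)))
```

Rejecting by default is a deliberate choice: an unrecognised license should stop the pipeline until someone has read it.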
Regional notes (high-level, not legal advice)
- US fair use and EU/UK text-and-data-mining exceptions exist but are context-specific.
- Privacy laws (e.g., PII handling) are separate concerns; always filter sensitive data.
Worked examples
Example 1: Using Wikipedia for pretraining
License: CC BY-SA 4.0. Conditions: attribution; share-alike if you redistribute adapted text.
- Training: Generally acceptable with attribution.
- Distribution: If you ship text excerpts, provide attribution and avoid large verbatim dumps that would trigger share-alike obligations on your distributed content.
- Safeguard: Add a training-time filter and an output checker to reduce verbatim regurgitation; publish an attribution page listing Wikipedia as a source.
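One simple form of output checker compares word n-grams in a model output against a protected source text. The sketch below assumes whitespace tokenization, an n-gram size of 8, and a threshold of 0.5; all three are illustrative defaults, and the function names are hypothetical.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(output: str, source: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that also appear in the source."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        # Outputs shorter than n words have no n-grams to compare.
        return 0.0
    return len(out_grams & word_ngrams(source, n)) / len(out_grams)


def is_regurgitation(output: str, source: str, n: int = 8,
                     threshold: float = 0.5) -> bool:
    """Flag outputs whose n-gram overlap with the source meets the threshold."""
    return verbatim_overlap(output, source, n) >= threshold
```

Comparing against a single source string does not scale to a full corpus. A production checker would more likely look up n-gram hashes in a prebuilt index, but the scoring idea is the same.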
Example 2: News articles behind a paywall
License/ToS: Typically prohibit scraping and redistribution. Training is often not allowed without permission.
- Decision: Do not use without a license. Seek a data provider or use summaries made available under acceptable terms.
- Safeguard: Block those domains in your crawler; document the decision.
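A crawler-side blocklist can be a small host check that runs before any request is made. In this sketch the domain names are placeholders and `is_blocked` is a hypothetical helper; URLs are assumed to include a scheme such as `https://`.

```python
from urllib.parse import urlsplit

# Placeholder domains; replace with the sources you decided not to use.
BLOCKED_DOMAINS = {"paywalled-news.example", "reviews.example"}


def is_blocked(url: str, blocked: set[str] = BLOCKED_DOMAINS) -> bool:
    """True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)
```

Matching on the `"." + domain` suffix blocks subdomains such as `www.` without accidentally blocking unrelated hosts that merely end in the same characters.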
Example 3: Public forum under CC BY-NC
Conditions: Non-commercial only.
- Training for a commercial product: Not allowed.
- Alternative: Request permission from the forum owner or use another source.
Example 4: Product reviews from a retailer's site
ToS may restrict scraping, reuse, and training.
- Decision: Check the site's ToS. If restricted, either use the official API with allowed terms or avoid.
- Safeguard: Keep a record of terms at the date of collection.
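A dated record of the terms can be as simple as the text plus a timestamp and a content hash. The sketch below assumes you have already obtained the terms text by some permitted means; `snapshot_terms`, `to_jsonl_line`, and the field names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot_terms(source_url: str, terms_text: str) -> dict:
    """Build a dated, hash-stamped record of the terms as seen at collection time."""
    return {
        "source_url": source_url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(terms_text.encode("utf-8")).hexdigest(),
        "terms_text": terms_text,
    }


def to_jsonl_line(snapshot: dict) -> str:
    """Serialize one snapshot as a single JSON line for an append-only log."""
    return json.dumps(snapshot, ensure_ascii=False)
```

The hash lets you show later that the stored text has not changed since capture. If the site updates its terms, a new snapshot will have a different hash.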
Example 5: Research dataset marked CC0
Conditions: None.
- Use: Safe for training, evaluation, and distribution (subject to privacy and ethics).
- Safeguard: Still scan for PII and disallowed content.
Lightweight compliance workflow
- Source vetting: capture source URL/domain name, snapshot date, license/ToS text, and intended use (train/eval/distribute).
- Permission check: Is training allowed? Is commercial use allowed? Any special clauses (attribution, SA, ND)?
- Decide and tag: allow, allow-with-conditions, or reject. Store tags with the dataset.
- Comply: implement attribution, filters to reduce verbatim output, and any usage limits.
- Takedown path: maintain an email/contact and removal process; be able to filter and retrain if needed.
Quick checklist (copy-paste into your project)
- [ ] License/ToS saved and dated
- [ ] Intended use reviewed (train/eval/distribute)
- [ ] Commercial use allowed (if applicable)
- [ ] Attribution prepared (if required)
- [ ] Share-alike/No-derivatives risk assessed
- [ ] Output regurgitation filter in place (if needed)
- [ ] Privacy/PII scan completed
- [ ] Takedown/removal process documented
Common mistakes and self-check
- Assuming public equals free: Publicly viewable doesn't grant training rights. Self-check: Do you have explicit license permission?
- Ignoring Non-Commercial: NC content cannot be used for a paid product. Self-check: Is any money involved in the product or service?
- Forgetting attribution: CC BY requires attribution. Self-check: Is there an attribution page ready before launch?
- Overlooking output risks: Model may reproduce licensed text. Self-check: Do you have n-gram or similarity checks to limit verbatim output?
- Not recording licenses: If challenged, you need evidence. Self-check: Is the license snapshot stored with a timestamp?
Practical projects
- Create a small training set (500 to 1,000 items) from CC BY sources. Prepare an attribution file and demonstrate your output filter.
- Audit an existing dataset: label each source as allow/allow-with-conditions/reject and write a 1-page risk summary.
- Implement a takedown-ready pipeline: given a list of domains, remove matched samples and regenerate dataset stats.
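As a starting point for the takedown project, the sketch below removes samples by source domain and recomputes basic counts. It assumes each sample is a dict with a `source_url` key and matches domains exactly (no subdomain handling); `apply_takedown` and the stats keys are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Lowercased host of a URL, or an empty string if it has none."""
    return (urlsplit(url).hostname or "").lower()


def apply_takedown(samples: list[dict],
                   takedown_domains: set[str]) -> tuple[list[dict], dict]:
    """Drop samples from the listed domains and return (kept_samples, stats)."""
    kept = [s for s in samples if domain_of(s["source_url"]) not in takedown_domains]
    stats = {
        "total_before": len(samples),
        "total_after": len(kept),
        "removed": len(samples) - len(kept),
        "per_domain": dict(Counter(domain_of(s["source_url"]) for s in kept)),
    }
    return kept, stats
```

This only works if provenance was stored per sample at intake, which is why the workflow asks you to record the source before anything else.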
Exercises
Try each exercise on your own first, and check the solutions only if needed.
Exercise 1: Classify permitted uses
You have three sources: A) CC BY blog posts, B) Forum with CC BY-NC, C) Website ToS that prohibits scraping and model training. For each, state if you can: train, evaluate, distribute excerpts in a commercial product, and what conditions apply.
Exercise 2: Write a one-paragraph attribution
Draft an attribution paragraph covering Wikipedia (CC BY-SA 4.0) and an academic dataset (CC0), suitable for a product documentation page. Include license names and a general credit statement.
- [ ] I identified if training is allowed for each source.
- [ ] I noted commercial vs non-commercial restrictions.
- [ ] I wrote attribution text when required.
- [ ] I planned safeguards to limit verbatim outputs.
Mini challenge
You receive a CSV of 200k comments from various sites with unknown provenance. Outline, in 6 to 8 bullet points, how you'd decide what portion is safe to use for a commercial fine-tune next week without delaying the project.
Suggested approach
- Segment by domain; sample 100 per domain.
- Look up each domain's ToS; classify train/eval/distribute permissions.
- Keep only domains with clear commercial training permission or CC BY/CC0.
- Prepare attribution for CC BY; exclude share-alike and no-derivatives sources for now.
- Run PII filters and profanity policy checks.
- Add output regurgitation checks for any high-risk domains.
- Store license snapshots and a risk log; be ready for takedown.
- Document what was included and why.
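The first step of this approach (segment by domain, sample 100 per domain) might look like the sketch below. It assumes the CSV rows have been loaded as dicts with a `source_url` column, which the challenge does not guarantee; rows without a usable URL go into an `"unknown"` bucket so they can be excluded or investigated. `sample_per_domain` is a hypothetical helper.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def sample_per_domain(rows: list[dict], per_domain: int = 100,
                      seed: int = 0) -> dict[str, list[dict]]:
    """Group rows by source domain and draw up to per_domain rows from each."""
    by_domain: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        url = row.get("source_url") or ""
        host = (urlsplit(url).hostname or "unknown").lower()
        by_domain[host].append(row)
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    return {
        domain: rng.sample(items, min(per_domain, len(items)))
        for domain, items in by_domain.items()
    }
```

The domain keys of the result are also the list of sites whose ToS you need to look up in the next step.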
Next steps
- Integrate the checklist into your data intake README.
- Create a shared attribution page template for your team.
- Set up a domain blocklist and a takedown contact process.
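A shared attribution page template can start as a function that renders a plain-text list from your source records. This is a minimal sketch: `build_attribution_page`, the entry keys (`title`, `license`, `url`), and the heading sentence are assumptions, and some licenses may call for more detail (for example, author names) than is shown here.

```python
def build_attribution_page(sources: list[dict]) -> str:
    """Render a plain-text attribution page.

    Each entry needs "title" and "license"; "url" is optional.
    """
    lines = ["This product was built using the following sources:", ""]
    for src in sorted(sources, key=lambda s: s["title"].lower()):
        entry = f"- {src['title']}, licensed under {src['license']}"
        if src.get("url"):
            entry += f" ({src['url']})"
        lines.append(entry)
    return "\n".join(lines)
```

Generating the page from the same records you keep for licensing avoids the attribution list drifting out of sync with the data you actually used.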