# Ethics and Bias in Language Models: What Training Data Teaches Machines About Us
How training data bias, toxicity, and hallucination emerge in language models, what debiasing techniques exist, and why responsible deployment requires more than better algorithms.
## Terminology
| Term | Definition |
|---|---|
| Training Data Bias | Systematic skews in the data used to train a model, causing it to learn and reproduce societal stereotypes, underrepresentation, or factual distortions |
| Toxicity | Generated content that is offensive, harmful, threatening, or promotes violence, hate speech, or discrimination |
| Hallucination | When a model generates plausible-sounding but factually incorrect information, presenting fabricated claims as truth |
| Debiasing | Techniques to reduce unwanted biases in model outputs, applied at the data, training, or inference stage |
| RLHF (Reinforcement Learning from Human Feedback) | A training technique where human preferences guide model behavior, used to reduce harmful outputs and align models with human values |
| Representational Harm | When a system reinforces stereotypes or erases certain groups, even without directly causing material damage |
| Allocational Harm | When a biased system affects resource distribution: hiring decisions, loan approvals, or medical diagnoses that disadvantage certain groups |
| Red Teaming | Deliberately probing a model with adversarial inputs to discover failure modes, biases, and safety vulnerabilities before deployment |
| Model Card | A documentation framework (Mitchell et al., 2019) that describes a model's intended use, limitations, training data, evaluation results, and ethical considerations |
## What & Why
Language models learn from text written by humans, and humans are biased. If the training corpus contains more text associating "doctor" with "he" and "nurse" with "she," the model learns those associations and reproduces them. This is not a bug in the algorithm; it is a faithful reflection of the data. The problem is that deploying such a model amplifies existing biases at scale.
The stakes are high. Language models are used in hiring (resume screening), healthcare (clinical note analysis), law (case research), and education (essay grading). Biased outputs in these domains cause real harm to real people. A resume screener that associates technical competence with male names will systematically disadvantage female applicants, not because anyone programmed it to, but because the training data reflected historical hiring patterns.
Hallucination is a separate but related problem. Models generate confident, fluent text about things that never happened: fake citations, invented statistics, fabricated legal precedents. This is dangerous because the output looks authoritative. Users trust it, and the model provides no signal that it is making things up.
Understanding these issues matters because building NLP systems is not just an engineering problem. It is a sociotechnical problem where technical decisions (what data to train on, what metrics to optimize, what guardrails to deploy) have ethical consequences.
## How It Works
### Sources of Bias
Bias enters language models at multiple stages:
**Data collection:** Web scrapes overrepresent English, male voices, Western perspectives, and internet-active demographics. Underrepresented languages and communities are literally less visible to the model.

**Annotation:** Human labelers bring their own biases. Toxicity classifiers trained on annotator judgments inherit disagreements about what counts as harmful, often reflecting the annotators' cultural backgrounds.

**Objective function:** Models trained to predict the next token learn to reproduce whatever patterns are most frequent in the data, including stereotypes. Optimizing for fluency does not optimize for fairness.

**Deployment context:** A model that is harmless in a research setting can cause harm when deployed in a hiring pipeline or medical triage system, because the stakes and the affected populations change.
### Debiasing Techniques
**Data-level:** Curate training data to balance representation. Filter toxic content. Augment underrepresented groups. The challenge: filtering too aggressively can erase minority voices entirely.

**Training-level:** RLHF trains a reward model on human preferences for helpful, harmless, honest outputs, then fine-tunes the language model to maximize that reward. Constitutional AI (Anthropic) uses a written set of principles to guide model self-critique, reducing the need for human labels on every example.

**Inference-level:** Output classifiers detect and filter toxic or biased generations before they reach the user. Prompt engineering can steer models toward balanced outputs. The limitation: post-hoc filtering cannot fix what the model has already learned.

**Embedding-level:** For word embeddings, projection-based debiasing (Bolukbasi et al., 2016) removes the gender direction from the embedding space. This reduces "he:doctor :: she:nurse" analogies but can also remove legitimate gender information.
### Hallucination
Hallucination occurs because language models optimize for fluency, not factual accuracy. The model has no internal fact database; it generates text that is statistically plausible given the training distribution. When asked about a topic with sparse training data, it fills gaps with plausible-sounding fabrications.
Mitigation strategies include:
- Retrieval-augmented generation (RAG): ground outputs in retrieved documents
- Citation generation: train models to cite sources, enabling verification
- Confidence calibration: train models to express uncertainty ("I'm not sure") rather than fabricate
- Factual consistency checking: use a separate model to verify claims against a knowledge base
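The first mitigation, retrieval-augmented generation, can be sketched in a few lines. The sketch below is illustrative only: word-overlap ranking stands in for a real dense retriever, and the `generate` callable stands in for an actual LLM API call; all names are hypothetical.

```python
def rag_answer(question, corpus, generate, top_k=2):
    """Ground a generation in retrieved passages (retrieval-augmented generation)."""
    q_words = set(question.lower().split())
    # Rank passages by word overlap with the question --
    # a crude stand-in for a dense retriever such as a bi-encoder
    ranked = sorted(corpus,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    context = "\n".join(ranked[:top_k])
    prompt = (f"Answer using ONLY the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return generate(prompt)  # `generate` stands in for an LLM call

corpus = ["Paris is the capital of France.", "Bananas are yellow."]
# Echo the prompt back so we can inspect what the model would be grounded on
prompt = rag_answer("What is the capital of France?", corpus,
                    generate=lambda p: p, top_k=1)
```

The key design point is that the model is instructed to answer only from retrieved text, which gives users a verifiable grounding for each claim instead of an unsourced generation.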
## Complexity Analysis
| Technique | Compute Cost | Human Cost | Notes |
|---|---|---|---|
| Data filtering/curation | $O(N)$ | High | $N$ = corpus size; requires human judgment on filtering criteria |
| RLHF training | $O(M \cdot C_{\text{model}})$ | High | $M$ = preference pairs, $C_{\text{model}}$ = model forward/backward cost |
| Output toxicity classifier | $O(L \cdot d^2)$ | Low | $L$ = output length, $d$ = classifier hidden size; one forward pass per generated output |
| Embedding debiasing | $O(V \cdot d)$ | Low | One-time projection of all $V$ word vectors |
| Red teaming | Variable | Very high | Requires skilled adversarial testers; no fixed compute bound |
The fundamental challenge is that bias mitigation is not a one-time cost. Models must be continuously monitored, evaluated, and updated as societal norms evolve and new failure modes are discovered.
## Implementation
```
ALGORITHM MeasureEmbeddingBias(embeddings, maleWords, femaleWords, targetWords)
  INPUT:  embeddings: map of word -> vector
          maleWords: list of words (e.g., ["he", "man", "father"])
          femaleWords: list of words (e.g., ["she", "woman", "mother"])
          targetWords: list of profession/attribute words to measure
  OUTPUT: biasScores: map of word -> bias score (positive = male-associated)

  BEGIN
    // Compute gender direction as difference of centroids
    maleCentroid <- MEAN(embeddings[w] for w in maleWords)
    femaleCentroid <- MEAN(embeddings[w] for w in femaleWords)
    genderDirection <- maleCentroid - femaleCentroid
    genderDirection <- genderDirection / NORM(genderDirection)

    biasScores <- empty map
    FOR EACH word IN targetWords DO
      vec <- embeddings[word]
      // Project onto gender direction
      biasScores[word] <- DOT(vec, genderDirection)
    END FOR

    // Positive score = closer to male centroid
    // Negative score = closer to female centroid
    // Near zero      = gender-neutral
    RETURN biasScores
  END
```
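The pseudocode above translates directly into a runnable sketch. The version below assumes NumPy; the toy 2-D vectors are fabricated for illustration and are chosen so that "doctor" leans male and "nurse" leans female.

```python
import numpy as np

def measure_embedding_bias(embeddings, male_words, female_words, target_words):
    """Score each target word by its projection onto the gender direction.

    Positive = male-associated, negative = female-associated, near zero = neutral.
    """
    male_centroid = np.mean([embeddings[w] for w in male_words], axis=0)
    female_centroid = np.mean([embeddings[w] for w in female_words], axis=0)
    gender_direction = male_centroid - female_centroid
    gender_direction /= np.linalg.norm(gender_direction)
    return {w: float(np.dot(embeddings[w], gender_direction))
            for w in target_words}

# Toy 2-D embeddings (illustrative values, not real word vectors)
emb = {
    "he": np.array([1.0, 0.0]),  "man": np.array([0.9, 0.1]),
    "she": np.array([-1.0, 0.0]), "woman": np.array([-0.9, 0.1]),
    "doctor": np.array([0.5, 0.8]), "nurse": np.array([-0.5, 0.8]),
}
scores = measure_embedding_bias(emb, ["he", "man"], ["she", "woman"],
                                ["doctor", "nurse"])
```

With these toy vectors, `scores["doctor"]` is positive (male-associated) and `scores["nurse"]` negative, reproducing the stereotype the measurement is designed to expose.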
```
ALGORITHM DebiasEmbeddings(embeddings, genderDirection, genderSpecificWords)
  INPUT:  embeddings: map of word -> vector
          genderDirection: unit vector representing the gender axis
          genderSpecificWords: set of words that SHOULD retain gender
                               (e.g., "queen", "king")
  OUTPUT: debiased: map of word -> debiased vector

  BEGIN
    debiased <- empty map
    FOR EACH (word, vec) IN embeddings DO
      IF word IN genderSpecificWords THEN
        debiased[word] <- vec  // preserve gender for inherently gendered words
      ELSE
        // Remove the gender component via projection
        genderComponent <- DOT(vec, genderDirection) * genderDirection
        debiased[word] <- vec - genderComponent
        debiased[word] <- debiased[word] / NORM(debiased[word])  // renormalize
      END IF
    END FOR
    RETURN debiased
  END
```
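A runnable companion to the pseudocode above, again assuming NumPy and fabricated toy vectors. After projection, a debiased word has zero component along the gender axis, while words in the protected set keep their original vectors.

```python
import numpy as np

def debias_embeddings(embeddings, gender_direction, gender_specific_words):
    """Project the gender component out of every non-gendered word vector."""
    g = gender_direction / np.linalg.norm(gender_direction)
    debiased = {}
    for word, vec in embeddings.items():
        if word in gender_specific_words:
            debiased[word] = vec  # preserve gender for inherently gendered words
        else:
            v = vec - np.dot(vec, g) * g          # remove the gender component
            debiased[word] = v / np.linalg.norm(v)  # renormalize to unit length
    return debiased

# Toy example (illustrative values): "king" should keep its gender component
emb = {"doctor": np.array([0.5, 0.8]), "king": np.array([0.9, 0.2])}
g = np.array([1.0, 0.0])
out = debias_embeddings(emb, g, gender_specific_words={"king"})
```

Note the edge case the sketch does not handle: a vector lying entirely along the gender axis would have zero norm after projection, so production code would need a guard there.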
```
ALGORITHM ToxicityFilter(generatedText, toxicityClassifier, threshold)
  INPUT:  generatedText: string
          toxicityClassifier: model that outputs P(toxic | text)
          threshold: float (e.g., 0.5)
  OUTPUT: (isSafe: boolean, toxicityScore: float, filteredText: string)

  BEGIN
    tokens <- TOKENIZE(generatedText)
    toxicityScore <- toxicityClassifier.PREDICT(tokens)
    IF toxicityScore >= threshold THEN
      RETURN (false, toxicityScore, "[Content filtered due to safety policy]")
    ELSE
      RETURN (true, toxicityScore, generatedText)
    END IF
  END
```
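The filter above is a few lines of plain Python. In the sketch below, `stub_classifier` is a deliberately crude, hypothetical stand-in for a trained toxicity model; only the thresholding logic is the point.

```python
def toxicity_filter(generated_text, toxicity_classifier, threshold=0.5):
    """Return (is_safe, score, text), replacing unsafe text with a notice."""
    score = toxicity_classifier(generated_text)  # P(toxic | text)
    if score >= threshold:
        return False, score, "[Content filtered due to safety policy]"
    return True, score, generated_text

# Illustrative stand-in: a real classifier would be a trained model,
# not a keyword blocklist (blocklists miss context and over-trigger)
def stub_classifier(text):
    blocklist = {"hate", "slur"}
    return 0.9 if set(text.lower().split()) & blocklist else 0.1

safe, score, out = toxicity_filter("have a nice day", stub_classifier)
blocked, score2, filtered = toxicity_filter("pure hate speech", stub_classifier)
```

The threshold is a policy choice, not a technical one: lowering it filters more harmful content but also more benign content, which connects directly to the AAVE over-flagging problem discussed under Real-World Applications.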
## Real-World Applications
- Hiring and recruitment: Resume screening models trained on historical hiring data can perpetuate gender and racial biases, leading companies like Amazon to abandon biased AI recruiting tools
- Healthcare NLP: Clinical NLP models trained on biased medical records may underdiagnose conditions in underrepresented populations, affecting treatment recommendations
- Content moderation: Toxicity classifiers disproportionately flag African American Vernacular English (AAVE) as toxic, creating censorship bias against Black users on social platforms
- Search and recommendation: Biased language models in search engines can reinforce stereotypes by surfacing biased content for identity-related queries
- Legal AI: Models that hallucinate fake case citations (as in the Mata v. Avianca incident) demonstrate the real-world danger of deploying unverified LLM outputs in high-stakes domains
- Education: Automated essay grading systems may penalize non-standard English dialects, disadvantaging students from diverse linguistic backgrounds
## Key Takeaways
- Language models learn biases from training data: if the data associates professions with genders or races, the model will reproduce and amplify those associations at scale
- Bias enters at every stage (data collection, annotation, training objective, deployment context), and no single intervention eliminates it; defense in depth is required
- Hallucination is a fundamental property of autoregressive language models that optimize for fluency rather than factual accuracy; RAG and citation generation are the primary mitigations
- RLHF and constitutional AI are the current best practices for aligning model behavior with human values, but they depend on whose values are encoded in the reward model
- Responsible deployment requires model cards, red teaming, continuous monitoring, and human oversight, not just better algorithms