Document context vs. CRF decoding on ModernBERT for CoNLL-2003 NER
Named-entity recognition (NER) is the task of finding the proper nouns in a piece of text and labeling what kind of thing each one is: a person, an organization, or a place. It is one of the oldest useful things we ask a language model to do, and it quietly sits under a lot of software: the search box that knows Washington is a person in one sentence and a state in the next, the system that redacts patient names from a medical record, the pipeline that turns a news archive into a queryable knowledge graph.
NER is also a clean window into why encoder models still matter. Encoders and decoders both grew out of the transformer (Vaswani et al., 2017). A decoder model, in the GPT lineage (Radford et al., 2018), is trained to predict the next token, which makes it strong at generating text. An encoder model, in the BERT lineage that popularized masked-language-model pretraining for understanding tasks in 2018, is built to read a whole sequence at once and label each piece of it. For tagging spans you want the second kind: bidirectional context, one label per token, and an answer that is cheap, fast, and deterministic rather than a paragraph of prose.
Try it
The same task this note studies, running entirely in your browser. The model is the BERT-base CoNLL-2003 NER checkpoint quantized to int8 (not the ModernBERT model this note studies). It tags the four CoNLL entity types: people, organizations, locations, and miscellaneous names. Weights download only when you click, then run on-device; nothing is sent to a server.
The study
We ran a factorial ablation on CoNLL-2003 English NER (Sang & De Meulder, 2003), a classic benchmark built from Reuters newswire and labeled for people, organizations, locations, and miscellaneous names. The question: do document-level context and a CRF decoding head (Lafferty et al., 2001), a layer that scores an entire label sequence at once and enforces legal tag transitions, add anything on top of sentence-level ModernBERT (Warner et al., 2024)? Both ideas are old in NER: cross-sentence context, and structured transitions over BIO tags (Ramshaw & Marcus, 1995), the scheme that marks each token as the beginning, inside, or outside of an entity span. What is new is that ModernBERT’s 8k context window makes document packing practical without changing the benchmark’s sentence-level labels. The complete write-up (full methods, every run config, and all figures) is available to download as the full PDF.
Code and run configs: modernbert-ner-ablation.
BERT vs. ModernBERT
The two encoders in this study sit six years apart, and the gap is exactly what the ablation leans on.
| Dimension | BERT (2018) | ModernBERT (2024) |
|---|---|---|
| Tokenizer | WordPiece (splits rare words into known word fragments from a learned vocabulary) | BPE (repeatedly merges the most frequent character pairs, so rare words break into reusable pieces) |
| Max context | 512 tokens | 8,192 tokens |
| Core architecture | original encoder | RoPE (encodes token position by rotating attention vectors, which generalizes to longer sequences better than learned position embeddings), GeGLU (a gated feed-forward activation that tends to train better than a plain GELU or ReLU layer), unpadding, FlashAttention |
| Consequence here | sentence-level only | document packing is practical |
ModernBERT’s 8k window is the whole reason document-level context is even testable: you can pack an article’s neighboring sentences into one sequence where BERT would truncate. The flip side is the tokenizer change: WordPiece and BPE split words into different pieces, so the two models don’t even agree on where one token ends and the next begins. The BERT reference below is therefore a reference point, not a controlled comparison.
Setup
Data
CoNLL-2003 is the standard English NER benchmark, introduced as a shared task at the 2003 Conference on Computational Natural Language Learning. It is built from Reuters newswire collected in 1996 and labels four entity types (person, organization, location, and miscellaneous), and its original purpose was to compare language-independent NER systems of the era on a common English and German corpus. That date is worth sitting with: by the clock of modern machine learning a 2003 newswire benchmark is ancient, and almost none of what a model like ModernBERT was built to handle (long documents, code, web text, sheer scale) is present in it.
We used the original CoNLL files (eng.train, eng.testa, eng.testb) with -DOCSTART- markers preserved. HuggingFace’s copy drops those boundaries; without them you cannot segment by article.
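Segmenting by article from the raw files can be sketched as below. This is a minimal illustration, not the study's loader: the function name `read_conll_documents` is hypothetical, and it assumes the usual layout of whitespace-separated columns with the token first and the NER tag last, blank lines between sentences, and a `-DOCSTART-` line at the start of each article. The sample lines are made up for the example.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll_documents(lines: Iterable[str]) -> List[List[Sentence]]:
    """Group sentences into documents using -DOCSTART- boundary lines."""
    documents: List[List[Sentence]] = []
    sentence: Sentence = []

    def flush() -> None:
        # Close the sentence in progress and attach it to the current document.
        nonlocal sentence
        if sentence:
            if not documents:
                documents.append([])  # file without a leading marker
            documents[-1].append(sentence)
            sentence = []

    for raw in lines:
        line = raw.strip()
        if line.startswith("-DOCSTART-"):
            flush()
            documents.append([])  # a new article begins here
        elif not line:
            flush()  # a blank line ends the current sentence
        else:
            fields = line.split()
            sentence.append((fields[0], fields[-1]))
    flush()
    return documents


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER

-DOCSTART- -X- -X- O

Japan NNP B-NP B-LOC
""".splitlines()

docs = read_conll_documents(sample)
print(len(docs), [len(d) for d in docs])
```

Without the marker lines, every sentence would land in a single document and the article boundaries could not be recovered.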
Factorial design
Three seeds per cell; span-level F1 via seqeval. seqeval scores whole entity spans, not tokens: a prediction counts only if its type and its exact start/end boundaries match the gold span, which is stricter than per-token accuracy.
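To make the exact-match criterion concrete, here is a small pure-Python sketch of span-level micro-F1. It is not seqeval's implementation, only an illustration of the same idea; the helper names are hypothetical, and it assumes BIO tags where a stray `I-` tag simply opens a new span.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, int, str]  # (sentence index, start, end exclusive, type)


def extract_spans(tags: List[str], sent_idx: int = 0) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == kind
        if start is not None and not continues:
            spans.add((sent_idx, start, i, kind))
            start, kind = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, kind = i, label
    return spans


def micro_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Pool spans over all sentences, then compute precision, recall, and F1."""
    gold_spans: Set[Span] = set()
    pred_spans: Set[Span] = set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= extract_spans(g, idx)
        pred_spans |= extract_spans(p, idx)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]
print(micro_f1(gold, pred))  # 0.5
```

In this example three of four tokens are tagged correctly, but the truncated PER span earns no credit, so only one of two spans matches on each side.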
Document runs pack neighboring sentences from the same article into one sequence while supervising only the target sentence. Articles over the token budget would use sliding windows (128-token overlap, word-aligned). On this corpus, no training document exceeded the budget, so every document-context example was a single window: in-budget context only, not a stress test of long-document chunking.
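The packing step might look roughly like the sketch below. The expansion policy (alternating one sentence left, one sentence right until the budget is reached) is an assumption made for illustration, since the exact neighbor-selection rule is not spelled out here. For simplicity the budget counts words rather than subtokens, and the function name `pack_document_context` and the integer labels are hypothetical. The sliding-window case for oversized articles is left out.

```python
from typing import List, Tuple

IGNORE = -100  # label value the loss function skips


def pack_document_context(
    sentences: List[List[str]],
    labels: List[List[int]],
    target: int,
    budget: int,
) -> Tuple[List[str], List[int]]:
    """Pack neighbors of sentences[target] into one sequence; supervise only the target."""
    left, right = target, target  # inclusive window of sentence indices
    used = len(sentences[target])
    grew = True
    while grew:
        grew = False
        # Add one sentence on the left, then one on the right, while each fits.
        if left > 0 and used + len(sentences[left - 1]) <= budget:
            left -= 1
            used += len(sentences[left])
            grew = True
        if right < len(sentences) - 1 and used + len(sentences[right + 1]) <= budget:
            right += 1
            used += len(sentences[right])
            grew = True

    tokens: List[str] = []
    packed_labels: List[int] = []
    for i in range(left, right + 1):
        tokens.extend(sentences[i])
        if i == target:
            packed_labels.extend(labels[i])  # real labels for the target sentence
        else:
            packed_labels.extend([IGNORE] * len(sentences[i]))  # context only
    return tokens, packed_labels


article = [
    ["Smith", "spoke", "."],
    ["He", "left", "Paris", "."],
    ["Talks", "resume", "Monday", "."],
]
article_labels = [[1, 0, 0], [0, 0, 5, 0], [0, 0, 0, 0]]
tokens, packed = pack_document_context(article, article_labels, target=1, budget=8)
print(tokens)
print(packed)
```

With a budget of 8 words, the first sentence fits as left context and the third does not, so the packed sequence has seven positions and only the four belonging to the target sentence carry labels.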
Alignment
ModernBERT uses BPE; we align labels to the first subtoken per word (word_ids()). Softmax configs mask continuation subtokens with -100. CRF configs need a label at every unmasked position, so we use dense labels and collapse back to words for evaluation.
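The two labeling schemes can be sketched as follows. The input mirrors the list a fast tokenizer's `word_ids()` returns (one word index per subtoken, `None` for special tokens), but the functions are hypothetical helpers rather than the study's code. `None` in the output stands in for -100. How dense labels fill continuation subtokens is an assumption here: the sketch repeats the word's tag with `B-` turned into `I-`.

```python
from typing import List, Optional


def align_labels(
    word_ids: List[Optional[int]], word_tags: List[str], dense: bool
) -> List[Optional[str]]:
    """Map word-level tags to subtokens; None marks positions the loss ignores."""
    aligned: List[Optional[str]] = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(None)  # special token
        elif wid != previous:
            aligned.append(word_tags[wid])  # first subtoken carries the word's tag
        elif dense:
            tag = word_tags[wid]
            # Assumed convention: continuation pieces stay inside the same entity.
            aligned.append("I-" + tag[2:] if tag.startswith("B-") else tag)
        else:
            aligned.append(None)  # softmax setup masks continuation subtokens
        previous = wid
    return aligned


def collapse_to_words(
    word_ids: List[Optional[int]], token_tags: List[Optional[str]]
) -> List[Optional[str]]:
    """Keep the prediction at each word's first subtoken for word-level evaluation."""
    words: List[Optional[str]] = []
    previous = None
    for wid, tag in zip(word_ids, token_tags):
        if wid is not None and wid != previous:
            words.append(tag)
        previous = wid
    return words


word_ids = [None, 0, 0, 1, 2, None]  # word 0 is split into two subtokens
word_tags = ["B-PER", "I-PER", "O"]
sparse = align_labels(word_ids, word_tags, dense=False)
dense = align_labels(word_ids, word_tags, dense=True)
print(sparse)
print(dense)
print(collapse_to_words(word_ids, dense))
```

Either way, evaluation sees one tag per word, so the softmax and CRF configurations are scored on the same units.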
Hyperparameters were tuned per condition (sentence vs. document, softmax vs. CRF), not locked across the grid. That matches how you’d actually train each variant, but it weakens strict causal claims about interaction effects.
Results
Test micro-F1 on the held-out set (mean ± std over three seeds). Micro-F1 is the harmonic mean of precision and recall, pooled across all entity spans rather than averaged per class:
| Config | Micro F1 |
|---|---|
| Sent. | 90.12 ± 0.31 |
| Sent. + CRF | 90.15 ± 0.21 |
| Doc. | 91.61 ± 0.23 |
| Doc. + CRF | 90.12 ± 0.13 |
| BERT ref. (different tokenizer) | 91.37 ± 0.17 |
The best score in the column is Doc. at 91.61.
Document context without CRF was the best ModernBERT cell, about +1.5 F1 over the sentence baseline (90.12 → 91.61). Adding CRF on top of document context gave that gain back (91.61 → 90.12). At sentence level, CRF was essentially flat (90.12 → 90.15). A fine-tuned BERT reference (Devlin et al., 2019) reached 91.37 ± 0.17 with a different tokenizer.
Why would packing the article help at all? Cross-sentence coreference: a later mention (“he,” “the company,” “there”) often resolves against a name a sentence or two away, and document context puts both in the model’s view at once. And why doesn’t the CRF earn its keep? A strong bidirectional encoder already learns the legal BIO transitions (you cannot open an I-PER without a B-PER) that a CRF layer exists to enforce, so the extra structured-decoding machinery is mostly redundant here.
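One way to check the redundancy argument on a softmax model's output is to count how often its predictions break the BIO rule. The sketch below is a hypothetical diagnostic, not part of the study's code, and it assumes the strict reading of BIO in which an `I-X` tag is legal only directly after `B-X` or `I-X`.

```python
from typing import List


def is_legal(prev: str, curr: str) -> bool:
    """An I-X tag must continue a span of the same type; B-X and O are always legal."""
    if curr.startswith("I-"):
        entity = curr[2:]
        return prev in (f"B-{entity}", f"I-{entity}")
    return True


def count_illegal_transitions(tags: List[str]) -> int:
    """Count positions where a tag cannot legally follow its predecessor."""
    prev = "O"  # the start of a sentence behaves like an O tag
    illegal = 0
    for tag in tags:
        if not is_legal(prev, tag):
            illegal += 1
        prev = tag
    return illegal


predicted = ["O", "I-PER", "I-PER", "B-ORG", "I-LOC"]
print(count_illegal_transitions(predicted))  # 2
```

If this count is already near zero for the softmax configurations, there is little left for a transition-enforcing layer to repair, which is consistent with the flat CRF results.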
Per-entity F1 (same seeds). PER is names of people; ORG is companies, institutions, agencies, and teams; LOC is geographic and political places such as cities, countries, and regions; MISC is named entities that are not a person, organization, or location, such as nationalities, events, and products.
| Config | PER | ORG | LOC | MISC |
|---|---|---|---|---|
| Sent. | 95.71 | 87.12 | 92.22 | 79.93 |
| Sent. + CRF | 95.59 | 86.82 | 92.33 | 80.72 |
| Doc. | 97.88 | 89.46 | 92.83 | 79.81 |
| Doc. + CRF | 97.07 | 87.34 | 91.70 | 77.59 |
| BERT ref. | 96.12 | 89.60 | 93.09 | 80.93 |
The best score in each column: Doc. on PER (97.88), and the BERT reference on ORG (89.60), LOC (93.09), and MISC (80.93), three of the four.
Document-context ModernBERT led on PER; the BERT reference still led on ORG, LOC, and MISC. Gains look label-dependent: plausibly cross-sentence cues help person mentions in news text more than org/location/misc boundaries, though tokenizer differences (BPE vs. WordPiece; ~31% vs. ~17% multi-subword words) confound cross-model comparisons. The BiLSTM-CRF lineage (Lample et al., 2016) is the usual baseline for structured NER decoding before transformers took over: a recurrent network reads context in both directions, then a CRF decodes the best tag sequence.
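A multi-subword rate like the one quoted above can be measured from word-to-subtoken alignments. The sketch below assumes each sentence is given as the list a fast tokenizer's `word_ids()` would return; the function name and the toy inputs are hypothetical, and the exact counting rule behind the quoted percentages is not stated here.

```python
from collections import Counter
from typing import List, Optional


def multi_subword_fraction(batch_word_ids: List[List[Optional[int]]]) -> float:
    """Fraction of words that the tokenizer split into more than one subtoken."""
    total = 0
    multi = 0
    for word_ids in batch_word_ids:
        # Count subtokens per word, skipping special tokens (None).
        pieces = Counter(w for w in word_ids if w is not None)
        total += len(pieces)
        multi += sum(1 for n in pieces.values() if n > 1)
    return multi / total if total else 0.0


toy_batch = [
    [None, 0, 0, 1, 2, None],  # word 0 split in two
    [None, 0, 1, 1, 1, None],  # word 1 split in three
]
print(multi_subword_fraction(toy_batch))  # 0.4
```

Running the same count with each model's tokenizer over the same sentences gives a direct view of how differently the two vocabularies fragment the corpus.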
The genuinely interesting result is that reference row. A six-year-old BERT, with a different tokenizer, still edges out every ModernBERT configuration on three of the four entity types. Read carefully, that is less an indictment of ModernBERT than a comment on the benchmark. CoNLL-2003 is short, clean newswire that never exercises the long context, efficiency, and modern pretraining ModernBERT was built around; on a task this simple, those capabilities have little to bite on, and the newer architecture does not turn them into better spans. The defensible reading is narrow: document context helped a strong encoder on an easy dataset, and nothing here speaks to how the two models compare where ModernBERT’s strengths would actually be tested.
Takeaways
A note on reading these results: the aggregate win and the per-tag losses are both real, and they point the same way. CoNLL-2003 is not a hard enough benchmark to separate these models on their merits, so treat every number here as a statement about the dataset at least as much as the architecture.
- Context beat structure: packing the article helped; the CRF did not compound that win.
- Window never filled: the findings are about same-document neighbors in one chunk, not multi-window long-document NER.
- Not a uniform win: ModernBERT is strong overall with document context, but not best on every entity type.
Next: a hyperparameter-matched rerun and a corpus with genuinely long documents.