Document context vs. CRF decoding on ModernBERT for CoNLL-2003 NER

Named-entity recognition (NER) is the task of finding the proper nouns in a piece of text and labeling what kind of thing each one is: a person, an organization, or a place. It is one of the oldest useful things we ask a language model to do, and it quietly sits under a lot of software: the search box that knows Washington is a person in one sentence and a state in the next,¹ the system that redacts patient names from a medical record, the pipeline that turns a news archive into a queryable knowledge graph.

¹ Sports writing is the genre that breaks NER. “Washington beat Dallas” is two football teams, not a person and a city. The tagger has to read the genre before it can read the type.

NER is also a clean window into why encoder models still matter. Encoders and decoders both grew out of the transformer (Vaswani et al., 2017): a decoder model, the GPT lineage (Radford et al., 2018), is built to generate the next token; an encoder model, the BERT lineage (Devlin et al., 2019), is built to read a whole sequence at once and label each piece of it. For tagging spans you want the second kind: bidirectional context, one label per token, and an answer that is cheap, fast, and deterministic rather than a paragraph of prose.

Try it

The same task this note studies, running entirely in your browser. The model is the BERT-base CoNLL-2003 NER checkpoint (BERT-base-NER) quantized to int8, about 104 MB, and not the ModernBERT model this note studies. It tags the four CoNLL entity types: people, organizations, locations, and miscellaneous names. The weights are self-hosted and download only when you click, then run on-device; nothing is sent to a server.

The study

The question is whether giving a modern encoder more context changes how well it labels spans, and whether structured decoding still matters when the encoder is strong enough to learn the label rules on its own. Neither idea is new. Cross-sentence context and structured BIO transitions (Ramshaw & Marcus, 1995) were standard practice in NER long before transformers. What they add on top of a model like ModernBERT (Warner et al., 2024) is the open question.

So we ran a 2×2 ablation² on CoNLL-2003 English NER (Sang & De Meulder, 2003), crossing context window (sentence vs. document) with decoding head (softmax vs. a CRF; Lafferty et al., 2001). ModernBERT’s 8k context window is what makes the document half testable at all, since you can pack an article’s neighboring sentences into one sequence where BERT would truncate, all without touching the benchmark’s sentence-level labels.

² Everything behind this note: the full write-up as a PDF (methods, every run config, all figures), and the code and run configs on GitHub.

BERT vs. ModernBERT

The two encoders in this study sit six years apart, and the gap is exactly what the ablation leans on.

Dimension	BERT (2018)	ModernBERT (2024)
Tokenizer	WordPiece	BPE
Max context	512 tokens	8,192 tokens
Core architecture	original encoder	RoPE, GeGLU, unpadding, FlashAttention
Consequence here	sentence-level only	document packing is practical

The 8k window is what makes document context testable, but the tokenizer change³ is the catch: the BERT reference below is a reference, not a controlled comparison. The two models do not read the same input.

³ WordPiece and BPE split words into different pieces, so the two models don’t even agree on where one token ends and the next begins.

Setup

CoNLL-2003 is 1996 Reuters newswire labeled for four entity types. That date is worth sitting with: almost none of what ModernBERT was built to handle (long documents, code, web text, sheer scale) shows up in it. We used the original files with -DOCSTART- markers preserved, since the HuggingFace copy drops those boundaries and you cannot segment by article without them.
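As a rough sketch of that segmentation step, the reader below groups sentences into articles using the -DOCSTART- lines. It assumes the usual whitespace-separated CoNLL layout with the token in the first column and the NER tag in the last; the function name `read_conll_documents` and the toy sample are made up for illustration and are not taken from the study's code.

```python
def read_conll_documents(lines):
    """Group CoNLL-style lines into documents -> sentences -> (token, tag).

    Assumes whitespace-separated columns with the token first and the
    NER tag last, blank lines between sentences, and a -DOCSTART- line
    at the start of every article.
    """
    documents, sentence = [], []

    def flush_sentence():
        # Close the sentence in progress, if any, into the current document.
        if sentence:
            if not documents:
                documents.append([])  # tolerate a file with no leading -DOCSTART-
            documents[-1].append(list(sentence))
            sentence.clear()

    for raw in lines:
        line = raw.strip()
        if line.startswith("-DOCSTART-"):
            flush_sentence()
            documents.append([])      # a new article begins here
        elif not line:
            flush_sentence()          # a blank line ends the sentence
        else:
            columns = line.split()
            sentence.append((columns[0], columns[-1]))
    flush_sentence()                  # the file may not end with a blank line
    return [doc for doc in documents if doc]


# Toy sample in the assumed layout (tags shown in BIO form for readability).
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

-DOCSTART- -X- -X- O

BRUSSELS NNP B-NP B-LOC
""".splitlines()

docs = read_conll_documents(sample)
print(len(docs), [len(doc) for doc in docs])
```

Without the -DOCSTART- lines the same reader would return one long pseudo-document, which is why a copy of the data that drops them cannot be segmented by article.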

Three seeds per cell; span-level F1 via seqeval.⁴ Document runs pack neighboring sentences from the same article into one sequence while supervising only the target sentence. On this corpus no training document exceeded the budget, so every document-context run was a single window: same-article neighbors, not a stress test of long-document chunking.

⁴ seqeval scores whole entity spans, not tokens: a prediction counts only if its type and its exact start/end boundaries match the gold span, which is stricter than per-token accuracy.
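One way to picture the packing is the sketch below. It is an illustration under stated assumptions, not the study's implementation: it counts words rather than subwords, grows the window outward one neighbor at a time on alternating sides, and marks context positions with -100 (the value PyTorch's cross-entropy loss ignores by default). The names `pack_with_context`, `budget`, and `label2id` are hypothetical.

```python
IGNORE = -100  # positions with this label contribute nothing to the loss


def pack_with_context(doc, target, budget, label2id):
    """Pack same-article neighbors around one target sentence.

    doc    : list of sentences, each a list of (word, tag) pairs
    target : index of the sentence that is supervised
    budget : maximum number of words in the packed sequence
    """
    if len(doc[target]) > budget:
        raise ValueError("target sentence alone exceeds the budget")

    start, end = target, target + 1      # half-open range of sentence indices
    used = len(doc[target])
    grew = True
    while grew:                          # stop once neither side can grow
        grew = False
        if start > 0 and used + len(doc[start - 1]) <= budget:
            start -= 1
            used += len(doc[start])
            grew = True
        if end < len(doc) and used + len(doc[end]) <= budget:
            used += len(doc[end])
            end += 1
            grew = True

    words, labels = [], []
    for i in range(start, end):
        for word, tag in doc[i]:
            words.append(word)
            # Only the target sentence keeps its gold labels.
            labels.append(label2id[tag] if i == target else IGNORE)
    return words, labels


label2id = {"O": 0, "B-PER": 1, "I-PER": 2}
article = [
    [("Frank", "B-PER"), ("Beamer", "I-PER"), ("coached", "O")],
    [("Beamer", "B-PER"), ("won", "O")],
    [("He", "O"), ("retired", "O")],
]
words, labels = pack_with_context(article, target=1, budget=100, label2id=label2id)
print(words)
print(labels)
```

The model reads every word in the window, but the gold labels, and therefore the benchmark's sentence-level scoring, are untouched: the neighbors act as input only.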

Context	Softmax	+ CRF
Sentence · 512 tok	baseline	structured decoding only
Document · 8k tok	cross-sentence context	both modifications

Figure 1. The 2×2 design: context (rows) crossed with the decoder (columns). The cell that won is document context with softmax decoding.

Results

Figure 2. Micro-F1 on the CoNLL-2003 test set. Document context is the lone ModernBERT cell that clears the BERT reference; stacking a CRF on top gives the gain back. Error bars are ±1 SD over three seeds.

Test micro-F1 on the held-out set (mean ± std over three seeds):

Config	Micro F1
Sent.	90.12 ± 0.31
Sent. + CRF	90.15 ± 0.21
Doc.	91.61 ± 0.23
Doc. + CRF	90.12 ± 0.13
BERT ref. (different tokenizer)	91.37 ± 0.17

Best score in the column: Doc., at 91.61.

Document context won, and adding a CRF on top erased the win. That interaction is the interesting part. Document context alone lifted F1 by about 1.5 points over the sentence baseline; the CRF pulled it straight back down to sentence-level performance (91.61 → 90.12). At sentence level the CRF did essentially nothing (90.12 → 90.15). A fine-tuned BERT reference (Devlin et al., 2019) reached 91.37 with a different tokenizer.
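The interaction can be made explicit with a little arithmetic on the four ModernBERT cell means from the table above. The sketch below only reuses those reported means; the variable names are illustrative, and it ignores the seed-to-seed spread, so it describes the size of the effect rather than testing its significance.

```python
# Mean test micro-F1 for the four ModernBERT cells, copied from the table.
f1 = {
    ("sentence", "softmax"): 90.12,
    ("sentence", "crf"): 90.15,
    ("document", "softmax"): 91.61,
    ("document", "crf"): 90.12,
}

# Effect of document context, measured separately under each decoder.
context_gain_softmax = f1[("document", "softmax")] - f1[("sentence", "softmax")]
context_gain_crf = f1[("document", "crf")] - f1[("sentence", "crf")]

# Effect of the CRF, measured separately under each context setting.
crf_gain_sentence = f1[("sentence", "crf")] - f1[("sentence", "softmax")]
crf_gain_document = f1[("document", "crf")] - f1[("document", "softmax")]

# Interaction contrast: how much the context gain changes once a CRF is added.
interaction = context_gain_crf - context_gain_softmax

print(f"context gain with softmax: {context_gain_softmax:+.2f}")
print(f"context gain with CRF:     {context_gain_crf:+.2f}")
print(f"CRF gain at sentence level: {crf_gain_sentence:+.2f}")
print(f"CRF gain at document level: {crf_gain_document:+.2f}")
print(f"interaction:               {interaction:+.2f}")
```

If the two modifications were simply additive, the interaction term would sit near zero; here it is about as large as the context gain itself, with the opposite sign.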

The CRF result fits a pattern that has been running since transformers displaced the BiLSTM-CRF (Lample et al., 2016) lineage. CRF layers exist to enforce BIO transition rules that recurrent nets could not reliably learn on their own: you cannot open an I-PER without a B-PER, and a CRF scores the whole label sequence at once rather than each position alone. A strong bidirectional encoder can pick those rules up from the labels during fine-tuning. Stacking a CRF on top of document context⁵ adds optimization machinery without adding information, and something in that interaction costs you the gain.

⁵ The CRF also forced an annoying detail: it needs a real label at every unmasked position, so no -100 masking on continuation subtokens. Dense labels in, collapse back to words for scoring. A few hours of NaN loss before that clicked.
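To make "scores the whole label sequence at once" concrete, here is a small sketch that contrasts per-position argmax with a Viterbi search over BIO-valid sequences. It is a simplification, assumed for illustration: the emission scores are invented, and the transitions are hard rules (allowed or forbidden) rather than the learned transition scores a trained CRF would use.

```python
import math

TAGS = ["O", "B-PER", "I-PER"]


def allowed(prev, curr):
    """BIO rule: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def greedy_decode(emissions):
    """Pick the best tag at each position independently (softmax-style)."""
    return [max(TAGS, key=lambda tag: row[tag]) for row in emissions]


def viterbi_decode(emissions):
    """Best-scoring tag sequence among those that obey the BIO rule."""
    # best[tag] = (score, path) of the best valid path ending in `tag`.
    best = {
        tag: (-math.inf if tag.startswith("I-") else emissions[0][tag], [tag])
        for tag in TAGS  # a sequence cannot start on an I- tag
    }
    for row in emissions[1:]:
        step = {}
        for curr in TAGS:
            candidates = [
                (score + row[curr], path + [curr])
                for prev, (score, path) in best.items()
                if score > -math.inf and allowed(prev, curr)
            ]
            step[curr] = (
                max(candidates, key=lambda c: c[0])
                if candidates
                else (-math.inf, [curr])
            )
        best = step
    return max(best.values(), key=lambda c: c[0])[1]


# Invented scores for three tokens; position 2 slightly prefers I-PER to B-PER.
emissions = [
    {"O": 2.0, "B-PER": 0.5, "I-PER": 0.1},
    {"O": 0.2, "B-PER": 1.0, "I-PER": 1.2},
    {"O": 0.3, "B-PER": 0.1, "I-PER": 1.5},
]
print(greedy_decode(emissions))   # ['O', 'I-PER', 'I-PER'], an invalid span
print(viterbi_decode(emissions))  # ['O', 'B-PER', 'I-PER']
```

The two decoders only disagree when the per-position scores favor an invalid sequence. If the encoder's scores rarely do that, the structured search has little left to correct, which is consistent with the flat sentence-level CRF result.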

Per-entity F1 (same seeds):

Config	PER	ORG	LOC	MISC
Sent.	95.71	87.12	92.22	79.93
Sent. + CRF	95.59	86.82	92.33	80.72
Doc.	97.88	89.46	92.83	79.81
Doc. + CRF	97.07	87.34	91.70	77.59
BERT ref.	96.12	89.60	93.09	80.93

Best score in each column: Doc. on PER (97.88) and the BERT reference on ORG, LOC, and MISC, so the BERT reference takes three of the four.
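The per-type scores above are span-level, which is easy to misread as token accuracy. The sketch below is a simplified pure-Python stand-in for that kind of scoring, not seqeval itself: it extracts (type, start, end) spans from BIO tags and counts a prediction only on an exact match. Treating a stray I- tag as the start of a span is an assumption of this sketch.

```python
from collections import Counter


def extract_spans(tags):
    """Return the set of (type, start, end) spans in one BIO-tagged sentence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):    # sentinel closes a final span
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((kind, start, i))             # end index is exclusive
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans


def span_f1(gold_sentences, pred_sentences):
    """Micro F1 and per-type F1 over exact span matches."""
    hits, n_gold, n_pred = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        n_gold.update(kind for kind, _, _ in gold_spans)
        n_pred.update(kind for kind, _, _ in pred_spans)
        hits.update(kind for kind, _, _ in gold_spans & pred_spans)

    def f1(hit, gold_count, pred_count):
        # F1 = 2PR / (P + R) = 2 * hits / (gold + predicted)
        return 0.0 if hit == 0 else 2 * hit / (gold_count + pred_count)

    per_type = {
        kind: f1(hits[kind], n_gold[kind], n_pred[kind])
        for kind in set(n_gold) | set(n_pred)
    }
    micro = f1(sum(hits.values()), sum(n_gold.values()), sum(n_pred.values()))
    return micro, per_type


gold = [["B-PER", "I-PER", "O", "B-ORG"]]
pred = [["B-PER", "O", "O", "B-ORG"]]   # PER span cut one token short
micro, per_type = span_f1(gold, pred)
print(micro, per_type)
```

In this toy case three of four tokens are tagged correctly, yet the truncated PER span earns no credit at all, so micro-F1 is 0.5 and PER scores 0.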

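The breakdown tells a more specific story. Document context paid off most clearly on PER, where it cut the remaining error roughly in half (95.71 → 97.88) and put ModernBERT ahead of everything, including the BERT reference. ORG gained a similar amount in absolute terms (87.12 → 89.46) but from a much lower base, LOC moved little (92.22 → 92.83), and MISC stayed flat (79.93 → 79.81); the BERT reference still led all three.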
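Because the four types start from very different baselines, absolute F1 gains can mislead. One common way to compare them, sketched here using the Sent. and Doc. rows of the table, is relative error reduction. Treating 100 − F1 as the "error" is a convention assumed for this sketch, not something the study reports.

```python
# Per-entity F1 for the two softmax configurations, copied from the table.
sentence = {"PER": 95.71, "ORG": 87.12, "LOC": 92.22, "MISC": 79.93}
document = {"PER": 97.88, "ORG": 89.46, "LOC": 92.83, "MISC": 79.81}


def relative_error_reduction(before, after):
    """Fraction of the remaining error (100 - F1) removed by the change."""
    return (after - before) / (100.0 - before)


reduction = {
    kind: relative_error_reduction(sentence[kind], document[kind])
    for kind in sentence
}
for kind, value in reduction.items():
    gain = document[kind] - sentence[kind]
    print(f"{kind}: {gain:+.2f} F1, {value:+.1%} of remaining error")
```

On this measure PER stands apart: its gain removes about half of the error that was left, while the similar absolute gain on ORG removes under a fifth.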

Frank Beamer led the Hokies to an 11-0 regular season in 1999.

The coach swept eight national awards that year.

He took Virginia Tech to the BCS title game in New Orleans.

(They lost to Florida State. The tagger, at least, got every mention right.)

Figure 3. One person, three mentions. A sentence-level model sees only one line at a time, so “The coach” and “He” are near-impossible PER calls on their own. Document context keeps the name in view to anchor them.

The reason is in how people get named. A person shows up once by full name and then becomes a last name, a title, or a pronoun. Organizations and locations tend to be written out the same way every time. Cross-sentence context pays off most when the same entity keeps changing shape across sentences, and that is what people do in news prose more than places or companies do. It is a plausible read, not a confirmed one: tokenizer differences (BPE vs. WordPiece; ~31% vs. ~17% multi-subword words) confound the cross-model comparison, so the per-entity gap between models is hard to pin on architecture alone.
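The multi-subword figures quoted above can be computed with a few lines once a tokenizer is chosen. The helper below is a sketch: `multi_subword_fraction` is a hypothetical name, and the two toy tokenizers are stand-ins invented for this example, not real BPE or WordPiece. With real models you would pass each tokenizer's own word-splitting function, keeping in mind that BPE vocabularies often treat a leading space as part of the token.

```python
def multi_subword_fraction(words, tokenize):
    """Share of words that `tokenize` splits into more than one piece."""
    if not words:
        return 0.0
    split_count = sum(1 for word in words if len(tokenize(word)) > 1)
    return split_count / len(words)


def toy_whole_word(word):
    """Toy stand-in: never splits a word."""
    return [word]


def toy_four_char(word):
    """Toy stand-in: cuts every word into chunks of at most four characters."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


words = ["Frank", "Beamer", "led", "the", "Hokies"]
print(multi_subword_fraction(words, toy_whole_word))  # 0.0
print(multi_subword_fraction(words, toy_four_char))   # 0.6
```

The more often a word is split, the more often an entity boundary falls inside a word's pieces, which is one reason per-entity gaps between two models with different tokenizers are hard to attribute.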

The most interesting result is the BERT reference row. A six-year-old BERT, with a different tokenizer, still edges out every ModernBERT configuration on three of the four entity types. Read carefully, that is less an indictment of ModernBERT than a comment on the benchmark. CoNLL-2003 is short, clean newswire, and the 8k context window never came close to full.⁶ The efficiency, modern pretraining, and long-context capabilities ModernBERT was built around have nothing to bite on here, and a strong older model trained on the same kind of text is a hard thing to beat. The defensible reading is narrow: document context helped a strong encoder on an easy dataset, and nothing here speaks to how the two models compare where ModernBERT’s strengths would actually be tested.

⁶ Longest training document on eng.train: 2,735 ModernBERT subwords. Median: 261. None exceeded the 8,190-token content budget; the window barely activated.

Takeaways

We ran a modern encoder against a dated corpus that does not stress it in any meaningful way. CoNLL-2003 is short 1996 newswire: the 8k window barely activates, and the capabilities ModernBERT was built around never get a hard case to bite on. Treat every number here as a statement about benchmark fit at least as much as architecture. The guidance below is what we would actually do with these results, not a verdict on ModernBERT.

Pack neighbors before you add a CRF: If entities get referred to across sentences, give the model document context first. A strong bidirectional encoder already knows the BIO rules; the CRF added nothing at sentence level and erased the document-context gain.
Do not rank encoders on aggregate F1 here: The best ModernBERT cell had the higher aggregate (91.61 vs. 91.37), yet BERT still led on three of four entity types. That is benchmark fit, not proof that a 2018 model beats a 2024 one. CoNLL-2003 is exactly the kind of text BERT was trained and evaluated on.
Scope claims to what was actually tested: Document context here meant same-article neighbors in one chunk, not multi-window long-document NER. To gauge whether ModernBERT’s long context pays off, you need a corpus that actually fills the window.

References

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the Eighteenth International Conference on Machine Learning, 282–289.

Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural Architectures for Named Entity Recognition. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 260–270.

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI technical report.

Ramshaw, L. A., & Marcus, M. P. (1995). Text Chunking using Transformation-Based Learning. Proceedings of the Third Workshop on Very Large Corpora, 82–94.

Sang, E. F. T. K., & De Meulder, F. (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 142–147.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762

Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., Gallagher, A., Biswas, R., Ladhak, F., Aarsen, T., Cooper, N., Adams, G., Howard, J., & Poli, I. (2024). Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. https://arxiv.org/abs/2412.13663
