Document context vs. CRF decoding on ModernBERT for CoNLL-2003 NER

Named-entity recognition (NER) is the task of finding the proper nouns in a piece of text and labeling what kind of thing each one is: a person, an organization, or a place. It is one of the oldest useful things we ask a language model to do, and it quietly sits under a lot of software: the search box that knows Washington is a person in one sentence and a state in the next, the system that redacts patient names from a medical record, the pipeline that turns a news archive into a queryable knowledge graph.

NER is also a clean window into why encoder models still matter. Encoders and decoders both grew out of the transformer (Vaswani et al., 2017). A decoder model, in the GPT (Generative Pretrained Transformer) lineage (Radford et al., 2018), is trained to predict the next token, which makes it strong at generating text. An encoder model, in the lineage of BERT (Bidirectional Encoder Representations from Transformers), the 2018 encoder that popularized masked-language-model pretraining for understanding tasks, is built to read a whole sequence at once and label each piece of it. For tagging spans you want the second kind: bidirectional context, one label per token, and an answer that is cheap, fast, and deterministic rather than a paragraph of prose.

Try it

The same task this note studies, running entirely in your browser. The model is the BERT-base CoNLL-2003 NER checkpoint quantized to int8 (not the ModernBERT model this note studies). It tags the four CoNLL entity types: people, organizations, locations, and miscellaneous names. Weights download only when you click, then run on-device; nothing is sent to a server.


The study

We ran a factorial ablation on CoNLL-2003 English NER (Sang & De Meulder, 2003), a classic benchmark built from Reuters newswire and labeled for people, organizations, locations, and miscellaneous names. The question: do document-level context and a CRF (Conditional Random Field) decoding head (Lafferty et al., 2001), a layer that scores an entire label sequence at once and enforces legal tag transitions, add anything on top of sentence-level ModernBERT (Warner et al., 2024)? Both ideas are old in NER: cross-sentence context, and structured transitions over BIO (Begin, Inside, Outside) tags (Ramshaw & Marcus, 1995), the scheme that marks each token as the beginning, inside, or outside of an entity span. What is new is that ModernBERT’s 8k context window makes document packing practical without changing the benchmark’s sentence-level labels.

The complete write-up (full methods, every run config, and all figures) is available to download as the full PDF.

Code and run configs: modernbert-ner-ablation.

BERT vs. ModernBERT

The two encoders in this study sit six years apart, and the gap is exactly what the ablation leans on.

| Dimension | BERT (2018) | ModernBERT (2024) |
| --- | --- | --- |
| Tokenizer | WordPiece | BPE |
| Max context | 512 tokens | 8,192 tokens |
| Core architecture | original encoder | RoPE, GeGLU, unpadding, FlashAttention |
| Consequence here | sentence-level only | document packing is practical |

WordPiece is a subword tokenizer (used by BERT) that splits rare words into known word fragments from a learned vocabulary. BPE (Byte-Pair Encoding) is a subword tokenizer that repeatedly merges the most frequent character pairs, so rare words break into reusable pieces. RoPE (Rotary Position Embedding) encodes token position by rotating attention vectors, which generalizes to longer sequences better than learned position embeddings. GeGLU (GELU Gated Linear Unit) is a gated feed-forward activation that tends to train better than a plain GELU or ReLU layer.

ModernBERT’s 8k window is the whole reason document-level context is even testable: you can pack an article’s neighboring sentences into one sequence where BERT would truncate. The flip side is the tokenizer change: WordPiece and BPE split words into different pieces, so the two models don’t even agree on where one token ends and the next begins. The BERT reference below is therefore a reference, not a controlled comparison.

Setup

Data

CoNLL-2003 is the standard English NER benchmark, introduced as a shared task at the 2003 Conference on Computational Natural Language Learning. It is built from Reuters newswire collected in 1996 and labels four entity types (person, organization, location, and miscellaneous), and its original purpose was to compare language-independent NER systems of the era on a common English and German corpus. That date is worth sitting with: by the clock of modern machine learning a 2003 newswire benchmark is ancient, and almost none of what a model like ModernBERT was built to handle (long documents, code, web text, sheer scale) is present in it.

We used the original CoNLL files (eng.train, eng.testa, eng.testb) with -DOCSTART- markers preserved. The Hugging Face copy drops those boundaries; without them you cannot segment by article.
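To make the role of the -DOCSTART- markers concrete, here is a minimal sketch of grouping the raw file into articles. It assumes the usual four-column CoNLL-2003 layout (word, POS tag, chunk tag, NER tag) with blank lines between sentences; the function name `read_conll_documents` is hypothetical and this is not the project's actual loader.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (word, NER tag) pairs
Document = List[Sentence]


def read_conll_documents(lines: Iterable[str]) -> List[Document]:
    """Group CoNLL-2003 lines into documents, splitting on -DOCSTART- lines."""
    documents: List[Document] = []
    current_doc: Document = []
    current_sent: Sentence = []

    def flush_sentence() -> None:
        nonlocal current_sent
        if current_sent:
            current_doc.append(current_sent)
            current_sent = []

    def flush_document() -> None:
        nonlocal current_doc
        flush_sentence()
        if current_doc:
            documents.append(current_doc)
            current_doc = []

    for raw in lines:
        line = raw.strip()
        if line.startswith("-DOCSTART-"):
            flush_document()          # a marker closes the previous article
        elif not line:
            flush_sentence()          # a blank line closes the current sentence
        else:
            fields = line.split()
            # Assumed layout: first column is the word, last is the NER tag.
            current_sent.append((fields[0], fields[-1]))
    flush_document()
    return documents


sample = [
    "-DOCSTART- -X- O O", "",
    "EU NNP I-NP I-ORG", "rejects VBZ I-VP O", "",
    "Peter NNP I-NP I-PER", "",
    "-DOCSTART- -X- O O", "",
    "LONDON NNP I-NP I-LOC",
]
docs = read_conll_documents(sample)
print(len(docs), [len(d) for d in docs])  # 2 [2, 1]
```

If the marker lines are stripped upstream, the same loop returns one long run of sentences with no way to tell where an article ends, which is why the original files matter here.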

Factorial design

Three seeds per cell; span-level F1 via seqeval, which scores whole entity spans, not tokens: a prediction counts only if its type and its exact start/end boundaries match the gold span, which is stricter than per-token accuracy.
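The exact-match rule is easy to see in a few lines. The sketch below is a pure-Python stand-in for the idea, not seqeval itself: `bio_to_spans` and `span_micro_f1` are hypothetical names, and the lenient handling of a stray I- tag (treated as opening a span) is an assumption.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Extract entity spans from one sentence of BIO tags."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # Assumption: an I- tag with no open span is treated as a span start.
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, label
    return spans


def span_micro_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-F1 over exact (boundaries + type) span matches, pooled over sentences."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]
f1 = span_micro_f1(gold, pred)
print(f1)  # 0.5
```

In the example, three of four tokens are tagged correctly, yet the truncated person span earns no credit, so span-level F1 is 0.5.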

Figure 1. The 2×2 design: context (rows) crossed with the decoder (columns). Accent marks the cell that won: document context with softmax decoding.

Document runs pack neighboring sentences from the same article into one sequence while supervising only the target sentence. Articles over the token budget would use sliding windows (128-token overlap, word-aligned). On this corpus, no training document exceeded the budget, so every document-context example was a single window: in-budget context only, not a stress test of long-document chunking.
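One way to picture the packing step is sketched below. The text above does not say how neighbors are chosen, so the alternating left/right growth, the stopping rule, and the name `pack_with_context` are all assumptions; the sketch also does not cover the sliding-window case for a target that overflows the budget on its own.

```python
from typing import List, Tuple


def pack_with_context(
    sentences: List[List[str]], target: int, budget: int
) -> Tuple[List[str], List[bool]]:
    """Return (tokens, supervise_mask) for one target sentence plus neighbors.

    The packed range is always contiguous: sentences[left:right].
    """
    left, right = target, target + 1
    used = len(sentences[target])
    grew = True
    while grew:
        grew = False
        # Try to add the adjacent sentence on the left, then on the right.
        if left > 0 and used + len(sentences[left - 1]) <= budget:
            left -= 1
            used += len(sentences[left])
            grew = True
        if right < len(sentences) and used + len(sentences[right]) <= budget:
            used += len(sentences[right])
            right += 1
            grew = True
    tokens: List[str] = []
    mask: List[bool] = []
    for i in range(left, right):
        tokens.extend(sentences[i])
        # Only the target sentence's positions contribute to the loss.
        mask.extend([i == target] * len(sentences[i]))
    return tokens, mask


doc = [["a", "b"], ["c"], ["d", "e", "f"], ["g"]]
tokens, mask = pack_with_context(doc, target=2, budget=5)
print(tokens)  # ['c', 'd', 'e', 'f', 'g']
print(mask)    # [False, True, True, True, False]
```

The mask is what keeps the benchmark's sentence-level labels intact: context tokens are visible to the encoder but never scored.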

Alignment

ModernBERT uses BPE; we align labels to the first subtoken per word (word_ids()). Softmax configs mask continuation subtokens with -100. CRF configs need a label at every unmasked position, so we use dense labels and collapse back to words for evaluation.
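The two labeling schemes differ only in what continuation subtokens receive. The sketch below works on a plain `word_ids` list of the kind a fast tokenizer's `word_ids()` returns (None for special tokens). The helper names are hypothetical, and the dense rule used here (repeat the word's label on every subtoken) is an assumption; the original may use a different rule.

```python
from typing import List, Optional

IGNORE = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(
    word_ids: List[Optional[int]], word_labels: List[int], dense: bool
) -> List[int]:
    """Map word-level label ids onto subtoken positions."""
    aligned: List[int] = []
    previous = None
    for wid in word_ids:
        if wid is None:            # special tokens are never supervised
            aligned.append(IGNORE)
        elif wid != previous:      # first subtoken of a word
            aligned.append(word_labels[wid])
        else:                      # continuation subtoken
            # Assumption: dense mode repeats the word's label.
            aligned.append(word_labels[wid] if dense else IGNORE)
        previous = wid
    return aligned


def collapse_to_words(word_ids: List[Optional[int]], token_preds: List[int]) -> List[int]:
    """Keep the prediction at each word's first subtoken."""
    out: List[int] = []
    previous = None
    for wid, pred in zip(word_ids, token_preds):
        if wid is not None and wid != previous:
            out.append(pred)
        previous = wid
    return out


word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_labels = [1, 0, 3]
sparse = align_labels(word_ids, word_labels, dense=False)
dense = align_labels(word_ids, word_labels, dense=True)
print(sparse)  # [-100, 1, -100, 0, 3, -100, -100, -100]
print(dense)   # [-100, 1, 1, 0, 3, 3, 3, -100]
print(collapse_to_words(word_ids, dense))  # [1, 0, 3]
```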

Hyperparameters were tuned per condition (sentence vs. document, softmax vs. CRF), not locked across the grid. That matches how you’d actually train each variant, but it weakens strict causal claims about interaction effects.

Results

[Figure 2: bar chart of micro-F1 on CoNLL-2003 test for the four ModernBERT configurations against a BERT reference at 91.37; document context reaches 91.61, the only cell above the reference. Axis range: 89.5 to 92.0.]
Figure 2. micro-F1 on CoNLL-2003 test. Document context is the lone ModernBERT cell that clears the BERT reference; stacking a CRF on top gives the gain back. Bars are ±1 SD over three seeds.

Test micro-F1 (the harmonic mean of precision and recall, pooled across all entity spans rather than averaged per class) on the held-out set, mean ± std over three seeds:

| Config | Micro F1 |
| --- | --- |
| Sent. | 90.12 ± 0.31 |
| Sent. + CRF | 90.15 ± 0.21 |
| Doc. | 91.61 ± 0.23 |
| Doc. + CRF | 90.12 ± 0.13 |
| BERT ref. (different tokenizer) | 91.37 ± 0.17 |

The best score in the column is Doc. at 91.61.

Document context without CRF was the best ModernBERT cell, about +1.5 F1 over the sentence baseline. A fine-tuned BERT reference (Devlin et al., 2019) reached 91.37 ± 0.17 with a different tokenizer. Adding CRF on top of document context gave that gain back (91.61 → 90.12). At sentence level, CRF was essentially flat (90.12 → 90.15).

Why would packing the article help at all? The likely mechanism is cross-sentence evidence: a later, shorter mention (a bare surname, an abbreviated organization name) often resolves against a fuller name a sentence or two away, and document context puts both in the model’s view at once. And why doesn’t the CRF earn its keep? A strong bidirectional encoder plausibly already learns the legal BIO transitions (an I-PER can only follow a B-PER or another I-PER) that a CRF layer exists to enforce, so the extra structured-decoding machinery is mostly redundant here.
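To show what "legal BIO transitions" means in practice, here is a small hard-constraint decoder. It is a simplification, not a trained CRF: a real CRF learns soft transition scores, while this sketch simply forbids illegal moves and runs Viterbi over per-token scores. The function names and the toy scores are hypothetical.

```python
from typing import Dict, List


def is_legal(prev: str, curr: str) -> bool:
    """BIO rule: I-X may only follow B-X or I-X; every other move is allowed."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def constrained_viterbi(scores: List[Dict[str, float]]) -> List[str]:
    """Best-scoring tag path that never makes an illegal BIO transition."""
    tags = list(scores[0])
    # best[t][tag] = (score of the best legal path ending in tag, previous tag)
    best = [{t: (scores[0][t], None) for t in tags if is_legal("O", t)}]
    for step in scores[1:]:
        column = {}
        for curr in tags:
            candidates = [
                (total + step[curr], prev)
                for prev, (total, _) in best[-1].items()
                if is_legal(prev, curr)
            ]
            if candidates:
                column[curr] = max(candidates)
        best.append(column)
    # Backtrack from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy per-token scores (made up for illustration).
scores = [
    {"O": 1.5, "B-PER": 1.0, "I-PER": 0.0},
    {"O": 0.0, "B-PER": 1.0, "I-PER": 2.0},
    {"O": 0.0, "B-PER": 0.0, "I-PER": 2.0},
]
greedy = [max(step, key=step.get) for step in scores]
decoded = constrained_viterbi(scores)
print(greedy)   # ['O', 'I-PER', 'I-PER']
print(decoded)  # ['B-PER', 'I-PER', 'I-PER']
```

Greedy per-token argmax opens a span with I-PER; the constrained decode repairs it. If the encoder's own argmax rarely makes that kind of mistake, there is little left for a structured decoder to fix, which is consistent with the flat sentence-level result.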

Per-entity F1 (same seeds):

| Config | PER | ORG | LOC | MISC |
| --- | --- | --- | --- | --- |
| Sent. | 95.71 | 87.12 | 92.22 | 79.93 |
| Sent. + CRF | 95.59 | 86.82 | 92.33 | 80.72 |
| Doc. | 97.88 | 89.46 | 92.83 | 79.81 |
| Doc. + CRF | 97.07 | 87.34 | 91.70 | 77.59 |
| BERT ref. | 96.12 | 89.60 | 93.09 | 80.93 |

PER (Person): names of people. ORG (Organization): companies, institutions, agencies, and teams. LOC (Location): geographic and political places such as cities, countries, and regions. MISC (Miscellaneous): named entities that are not a person, organization, or location, such as nationalities, events, and products.

The BERT reference takes the best score in three of the four columns (ORG, LOC, MISC); Doc. takes PER.

Document-context ModernBERT led on PER; the BERT reference still led on ORG, LOC, and MISC. Gains look label-dependent: plausibly cross-sentence cues help person mentions in news text more than org/location/misc boundaries, though tokenizer differences (BPE vs. WordPiece; ~31% vs. ~17% multi-subword words) confound cross-model comparisons. For context, the BiLSTM-CRF lineage (Lample et al., 2016), in which a bidirectional LSTM reads context in both directions and a CRF layer then decodes the best tag sequence, was the standard for structured NER decoding before transformers took over.
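The multi-subword percentages quoted above are the kind of statistic a few lines can produce. The sketch below assumes the measure is the share of words that a tokenizer splits into more than one piece; how the original figures were computed is not stated. `multi_subword_fraction` and the toy tokenizer are hypothetical stand-ins, and in practice the `tokenize` argument would be a real tokenizer's word-level tokenize function.

```python
from typing import Callable, List


def multi_subword_fraction(
    words: List[str], tokenize: Callable[[str], List[str]]
) -> float:
    """Fraction of words that the tokenizer splits into more than one piece."""
    split = sum(1 for w in words if len(tokenize(w)) > 1)
    return split / len(words)


def toy_tokenize(word: str) -> List[str]:
    """Toy stand-in for a subword tokenizer: chop each word into 4-character pieces."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


words = ["EU", "rejects", "German", "call"]
fraction = multi_subword_fraction(words, toy_tokenize)
print(fraction)  # 0.5
```

Running the same function over the same word list with each model's tokenizer would give directly comparable fractions.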

The honestly interesting result is that reference column. A six-year-old BERT, with a different tokenizer, still edges out every ModernBERT configuration on three of the four entity types. Read carefully, that is less an indictment of ModernBERT than a comment on the benchmark. CoNLL-2003 is short, clean newswire that never exercises the long context, efficiency, and modern pretraining ModernBERT was built around; on a task this simple, those capabilities have little to bite on, and the newer architecture does not turn them into better spans. The defensible reading is narrow: document context helped a strong encoder on an easy dataset, and nothing here speaks to how the two models compare where ModernBERT’s strengths would actually be tested.

Takeaways

A note on reading these results: the aggregate win and the per-tag losses are both real, and they point the same way. CoNLL-2003 is not a hard enough benchmark to separate these models on their merits, so treat every number here as a statement about the dataset at least as much as the architecture.

Context beat structure: packing the article helped; the CRF did not compound that win.
Window never filled: the findings are about same-document neighbors in one chunk, not multi-window long-document NER.
Not a uniform win: ModernBERT is strong overall with document context, but not best on every entity type.

Next: a hyperparameter-matched rerun and a corpus with genuinely long documents.

References

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the Eighteenth International Conference on Machine Learning, 282–289.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural Architectures for Named Entity Recognition. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 260–270.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI technical report.
Ramshaw, L. A., & Marcus, M. P. (1995). Text Chunking using Transformation-Based Learning. Proceedings of the Third Workshop on Very Large Corpora, 82–94.
Sang, E. F. T. K., & De Meulder, F. (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 142–147.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., Gallagher, A., Biswas, R., Ladhak, F., Aarsen, T., Cooper, N., Adams, G., Howard, J., & Poli, I. (2024). Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. arXiv:2412.13663.