
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

A. Bochkov

2025-07-08 · Continual Learning · MMLU

Abstract

The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal "docking port," enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged post-training into a single, more capable Mixture-of-Experts (MoE) model with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks such as MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, in which a deep Transformer is "grown" by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
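The logit-averaging merge described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's released code: the function names (`average_logits`, `merged_next_token`), the toy experts, and the three-token vocabulary are all hypothetical, and the sketch assumes only what the abstract states, namely that the experts share one vocabulary (here via the same frozen embeddings) and are combined by an unweighted mean of their output logits.

```python
import math
from typing import Callable, List, Sequence

# An "expert" here is any callable mapping token ids to next-token logits.
# Real experts would be trained Transformers; these types are illustrative.
Expert = Callable[[Sequence[int]], List[float]]


def average_logits(expert_logits: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of per-expert logit vectors over a shared vocabulary."""
    if not expert_logits:
        raise ValueError("need at least one expert")
    vocab_size = len(expert_logits[0])
    if any(len(logits) != vocab_size for logits in expert_logits):
        raise ValueError("all experts must share the same vocabulary size")
    n = len(expert_logits)
    return [sum(column) / n for column in zip(*expert_logits)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax, used only to inspect the merged distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merged_next_token(experts: Sequence[Expert], token_ids: Sequence[int]) -> int:
    """Greedy next-token choice from the averaged logits of all experts."""
    merged = average_logits([expert(token_ids) for expert in experts])
    return max(range(len(merged)), key=merged.__getitem__)


# Hypothetical stand-ins for two specialist models (fixed logits for brevity).
def russian_expert(token_ids: Sequence[int]) -> List[float]:
    return [2.0, 0.0, 1.0]


def chinese_expert(token_ids: Sequence[int]) -> List[float]:
    return [0.0, 1.0, 3.0]


merged = average_logits([russian_expert([0]), chinese_expert([0])])
choice = merged_next_token([russian_expert, chinese_expert], [0])
```

With these toy values, the merged logits are `[1.0, 0.5, 2.0]` and the greedy choice is token index 2. Because the merge operates only on outputs, neither expert's weights or architecture are touched, which matches the abstract's claim of zero architectural modification.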

Related Papers

RegCL: Continual Adaptation of Segment Anything Model via Model Merging (2025-07-16)
Information-Theoretic Generalization Bounds of Replay-based Continual Learning (2025-07-16)
PROL : Rehearsal Free Continual Learning in Streaming Data via Prompt Online Learning (2025-07-16)
Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning (2025-07-16)
Fast Last-Iterate Convergence of SGD in the Smooth Interpolation Regime (2025-07-15)
A Neural Network Model of Complementary Learning Systems: Pattern Separation and Completion for Continual Learning (2025-07-15)
Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs (2025-07-15)
LifelongPR: Lifelong knowledge fusion for point cloud place recognition based on replay and prompt learning (2025-07-14)