<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="FeedCreator 1.8" -->
<?xml-stylesheet href="https://aether.snwy.me/lib/exe/css.php?s=feed" type="text/css"?>
<rdf:RDF
    xmlns="http://purl.org/rss/1.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel rdf:about="https://aether.snwy.me/feed.php">
        <title>Aether - concepts</title>
        <description></description>
        <link>https://aether.snwy.me/</link>
        <image rdf:resource="https://aether.snwy.me/_media/wiki:dokuwiki.svg" />
        <dc:date>2026-04-20T14:49:07+00:00</dc:date>
        <items>
            <rdf:Seq>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:backpropagation?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:block_attnres?rev=1776568745&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:chinchilla_scaling?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:expert_routing?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:feedforward_network?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:gradient_highway?rev=1776568746&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:hidden_state_growth?rev=1776568745&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:kimi_linear?rev=1776568745&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:layer_normalization?rev=1776587209&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:layer_pruning?rev=1776568746&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:linear_attention?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:llm?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:model_pruning?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:moe?rev=1776568745&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:multi_head_attention?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:neural_scaling?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:pipeline_communication?rev=1776587209&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:positional_encoding?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:postnorm?rev=1776587209&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:prenorm?rev=1776568745&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:residual_connections?rev=1776568745&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:rnn?rev=1776587209&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:scaled_dot_product_attention?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:scaling_laws?rev=1776568746&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:softmax_attention?rev=1776568746&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:start?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:transformer?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:vanishing_gradients?rev=1776587208&amp;do=diff"/>
            </rdf:Seq>
        </items>
    </channel>
    <image rdf:about="https://aether.snwy.me/_media/wiki:dokuwiki.svg">
        <title>Aether</title>
        <link>https://aether.snwy.me/</link>
        <url>https://aether.snwy.me/_media/wiki:dokuwiki.svg</url>
    </image>
    <item rdf:about="https://aether.snwy.me/concepts:backpropagation?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>backpropagation</title>
        <link>https://aether.snwy.me/concepts:backpropagation?rev=1776587224&amp;do=diff</link>
        <description>Backpropagation

The algorithm for computing gradients of a loss function with respect to network parameters by applying the chain rule layer by layer from output to input. Vanishing gradients occur when these signals decay excessively in deep networks. Residual connections create the gradient highway that preserves gradient flow.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:block_attnres?rev=1776568745&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:05+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>block_attnres</title>
        <link>https://aether.snwy.me/concepts:block_attnres?rev=1776568745&amp;do=diff</link>
        <description>Block AttnRes

A practical variant of Attention Residuals that partitions layers into N blocks. Within each block, standard residuals are used. At block boundaries, an AttnRes operation aggregates block-level representations, reducing memory from O(Ld) to O(Nd) while preserving most gains.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:chinchilla_scaling?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>chinchilla_scaling</title>
        <link>https://aether.snwy.me/concepts:chinchilla_scaling?rev=1776587208&amp;do=diff</link>
        <description>Chinchilla Scaling

The finding from Hoffmann et al. (2022) that LLMs should be trained on roughly 20 tokens per parameter for compute-optimal training. Implies many large models are significantly undertrained. Validates that Attention Residuals improvements hold under this regime.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:expert_routing?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>expert_routing</title>
        <link>https://aether.snwy.me/concepts:expert_routing?rev=1776587208&amp;do=diff</link>
        <description>Expert Routing

The mechanism in Mixture-of-Experts architectures that selects which experts process a given input. Typically implemented as a learned gating network producing a sparse distribution over experts. The quality of routing directly affects MoE efficiency and performance.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:feedforward_network?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>feedforward_network</title>
        <link>https://aether.snwy.me/concepts:feedforward_network?rev=1776587224&amp;do=diff</link>
        <description>Feed-Forward Network (FFN)

The position-wise MLP applied after each attention layer in a Transformer block. Typically two linear transformations with a nonlinearity: FFN(x) = W2 · act(W1 · x). In MoE architectures, the FFN is replaced by multiple expert FFNs with a routing function selecting among them.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:gradient_highway?rev=1776568746&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:06+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>gradient_highway</title>
        <link>https://aether.snwy.me/concepts:gradient_highway?rev=1776568746&amp;do=diff</link>
        <description>Gradient Highway

The property of residual connections that allows gradients to flow directly through the identity path during backpropagation, bypassing layer transformations. This enables stable training of very deep networks.

See also: residual_connections, vanishing_gradients, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:hidden_state_growth?rev=1776568745&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:05+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>hidden_state_growth</title>
        <link>https://aether.snwy.me/concepts:hidden_state_growth?rev=1776568745&amp;do=diff</link>
        <description>Hidden-State Growth

Under PreNorm residual connections, hidden-state magnitudes grow as O(L) with depth because each layer adds a roughly unit-magnitude output to the running sum. This progressively dilutes each layer&#039;s relative contribution and buries early-layer information.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:kimi_linear?rev=1776568745&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:05+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>kimi_linear</title>
        <link>https://aether.snwy.me/concepts:kimi_linear?rev=1776568745&amp;do=diff</link>
        <description>Kimi Linear

A Mixture-of-Experts architecture by Kimi Team with 48B total / 3B activated parameters. Uses MoE with linear attention. The paper Attention Residuals integrates AttnRes into this architecture, pre-training on 1.4T tokens.

See also: moe, linear_attention, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:layer_normalization?rev=1776587209&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:49+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>layer_normalization</title>
        <link>https://aether.snwy.me/concepts:layer_normalization?rev=1776587209&amp;do=diff</link>
        <description>Layer Normalization

Normalizing activations across the feature dimension to stabilize training. Applied either before (PreNorm) or after (PostNorm) the sublayer. PreNorm dominates modern LLMs but causes hidden-state growth.

See also: prenorm, postnorm, hidden_state_growth, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:layer_pruning?rev=1776568746&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:06+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>layer_pruning</title>
        <link>https://aether.snwy.me/concepts:layer_pruning?rev=1776568746&amp;do=diff</link>
        <description>Layer Pruning

Removing entire layers from a trained network. Under standard residual connections with PreNorm, many layers can be pruned with minimal performance loss because their contributions are heavily diluted. This motivates Attention Residuals, which gives each layer learned, content-dependent influence.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:linear_attention?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>linear_attention</title>
        <link>https://aether.snwy.me/concepts:linear_attention?rev=1776587208&amp;do=diff</link>
        <description>Linear Attention

Attention variants that replace the O(n²) softmax with a kernelized decomposition, reducing sequence-length cost to linear. Used in Kimi Linear to handle long contexts efficiently. Trades some expressiveness for computational savings.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:llm?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>llm</title>
        <link>https://aether.snwy.me/concepts:llm?rev=1776587208&amp;do=diff</link>
        <description>Large Language Model (LLM)

A neural network trained on large text corpora to model language. Modern LLMs use transformer architectures with PreNorm and residual connections, and increasingly MoE for efficiency. Scaling laws govern their performance gains.

See also: softmax_attention, moe, scaling_laws, prenorm</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:model_pruning?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>model_pruning</title>
        <link>https://aether.snwy.me/concepts:model_pruning?rev=1776587208&amp;do=diff</link>
        <description>Model Pruning

Removing parameters or structures from a trained network to reduce size or compute. Includes unstructured pruning (individual weights) and structured pruning (entire heads, layers, or experts). Layer pruning is a form of structured pruning that is surprisingly benign under standard residual connections with PreNorm.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:moe?rev=1776568745&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:05+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>moe</title>
        <link>https://aether.snwy.me/concepts:moe?rev=1776568745&amp;do=diff</link>
        <description>Mixture of Experts (MoE)

An architecture where only a subset of parameters (experts) are activated per input, determined by a routing function. Enables scaling total parameters while keeping compute per token low. Used in Kimi Linear and many modern LLMs.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:multi_head_attention?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>multi_head_attention</title>
        <link>https://aether.snwy.me/concepts:multi_head_attention?rev=1776587208&amp;do=diff</link>
        <description>Multi-Head Attention

Running multiple attention operations in parallel with separate learned projections, then concatenating results. Each head can attend to different positional or semantic relationships. Standard in all modern LLMs.

See also: scaled_dot_product_attention, softmax_attention, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:neural_scaling?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>neural_scaling</title>
        <link>https://aether.snwy.me/concepts:neural_scaling?rev=1776587208&amp;do=diff</link>
        <description>Neural Scaling

The broad phenomenon where model performance improves predictably with increases in parameters, data, or compute. Scaling laws formalize these relationships mathematically.

See also: scaling_laws, chinchilla_scaling, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:pipeline_communication?rev=1776587209&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:49+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>pipeline_communication</title>
        <link>https://aether.snwy.me/concepts:pipeline_communication?rev=1776587209&amp;do=diff</link>
        <description>Pipeline Communication

The overhead of passing activations between pipeline stages in pipeline-parallel training. Block AttnRes reduces this by aggregating representations at block boundaries rather than every layer, cutting communication volume proportionally.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:positional_encoding?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>positional_encoding</title>
        <link>https://aether.snwy.me/concepts:positional_encoding?rev=1776587224&amp;do=diff</link>
        <description>Positional Encoding

Since softmax attention is permutation-invariant, positional encodings inject sequence-order information into Transformer inputs. Variants include sinusoidal (original), learned, and rotary (RoPE) encodings. Not directly modified by Attention Residuals, but RoPE interacts with</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:postnorm?rev=1776587209&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:49+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>postnorm</title>
        <link>https://aether.snwy.me/concepts:postnorm?rev=1776587209&amp;do=diff</link>
        <description>PostNorm

Applying layer normalization after the sublayer and residual addition. Used in the original Transformer but largely replaced by PreNorm in modern LLMs due to training instability at depth. PostNorm does not exhibit hidden-state growth but is harder to train.

See also: prenorm, layer_normalization, hidden_state_growth, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:prenorm?rev=1776568745&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:05+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>prenorm</title>
        <link>https://aether.snwy.me/concepts:prenorm?rev=1776568745&amp;do=diff</link>
        <description>PreNorm

Applying layer normalization before the sublayer transformation (attention or FFN). Dominant in modern LLMs for training stability, but its unweighted accumulation causes hidden-state magnitudes to grow as O(L) with depth, diluting each layer&#039;s contribution.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:residual_connections?rev=1776568745&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:05+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>residual_connections</title>
        <link>https://aether.snwy.me/concepts:residual_connections?rev=1776568745&amp;do=diff</link>
        <description>Residual Connections

Skip connections that add a layer&#039;s input to its output: h_l = h_{l-1} + f(h_{l-1}). Enable gradient flow in deep networks but accumulate all prior outputs with fixed unit weights, causing dilution at depth.

See also: attention_residuals, prenorm, gradient_highway, hidden_state_growth, layer_pruning</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:rnn?rev=1776587209&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:49+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>rnn</title>
        <link>https://aether.snwy.me/concepts:rnn?rev=1776587209&amp;do=diff</link>
        <description>Recurrent Neural Network (RNN)

A network architecture that processes sequences by maintaining a hidden state updated at each time step. Attention Residuals has a recurrent interpretation: the softmax attention over all prior layer outputs can be viewed as a weighted recurrence, connecting transformer depth to recurrent computation.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:scaled_dot_product_attention?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>scaled_dot_product_attention</title>
        <link>https://aether.snwy.me/concepts:scaled_dot_product_attention?rev=1776587208&amp;do=diff</link>
        <description>Scaled Dot-Product Attention

The core computation behind softmax_attention: softmax(QK^T / sqrt(d_k)) V. The sqrt(d_k) scaling prevents dot products from growing large in high dimensions, which would push softmax into saturated regions with vanishing gradients.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:scaling_laws?rev=1776568746&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:06+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>scaling_laws</title>
        <link>https://aether.snwy.me/concepts:scaling_laws?rev=1776568746&amp;do=diff</link>
        <description>Scaling Laws

Empirical relationships describing how model performance (typically loss) scales with model size, data size, and compute. The paper Attention Residuals validates that its improvements hold consistently across model sizes via scaling law experiments.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:softmax_attention?rev=1776568746&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:06+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>softmax_attention</title>
        <link>https://aether.snwy.me/concepts:softmax_attention?rev=1776568746&amp;do=diff</link>
        <description>Softmax Attention

The standard attention mechanism: computes dot-product similarity between a query and keys, applies softmax to produce a probability distribution, then takes a weighted sum of values. In Attention Residuals, softmax attention is repurposed for depth-wise aggregation across layer outputs instead of sequence positions.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:start?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>start</title>
        <link>https://aether.snwy.me/concepts:start?rev=1776587224&amp;do=diff</link>
        <description>Concepts

Definitions and explanations of machine learning terms, organized as an interconnected graph.

Architecture

	*  LLM
	*  Softmax Attention
	*  Scaled Dot-Product Attention
	*  Multi-Head Attention
	*  Residual Connections
	*  Gradient Highway
	*  Mixture of Experts
	*  Expert Routing
	*  Linear Attention
	*  RNN

Normalization &amp; Stability

	*  Layer Normalization
	*  PreNorm
	*  PostNorm
	*  Hidden-State Growth
	*  Vanishing Gradients

Efficiency &amp; Pruning

	*  Model Pruning</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:transformer?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>transformer</title>
        <link>https://aether.snwy.me/concepts:transformer?rev=1776587224&amp;do=diff</link>
        <description>Transformer

The dominant architecture for LLMs, built from alternating self-attention and feed-forward layers with residual connections and layer normalization. The original design used PostNorm; modern variants use PreNorm. Attention Residuals modifies how the residual stream accumulates across layers.

See also: softmax_attention, multi_head_attention, prenorm, residual_connections</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:vanishing_gradients?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>vanishing_gradients</title>
        <link>https://aether.snwy.me/concepts:vanishing_gradients?rev=1776587208&amp;do=diff</link>
        <description>Vanishing Gradients

The problem where gradients shrink exponentially as they propagate through many layers during backpropagation, making deep networks difficult or impossible to train. Residual connections mitigate this by providing a direct gradient path — the gradient highway — that preserves signal across arbitrary depth.</description>
    </item>
</rdf:RDF>
