<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="FeedCreator 1.8" -->
<?xml-stylesheet href="https://aether.snwy.me/lib/exe/css.php?s=feed" type="text/css"?>
<rdf:RDF
    xmlns="http://purl.org/rss/1.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel rdf:about="https://aether.snwy.me/feed.php">
        <title>Aether</title>
        <description></description>
        <link>https://aether.snwy.me/</link>
        <image rdf:resource="https://aether.snwy.me/_media/wiki:dokuwiki.svg" />
        <dc:date>2026-06-17T15:06:12+00:00</dc:date>
        <items>
            <rdf:Seq>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:positional_encoding?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:backpropagation?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:feedforward_network?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:transformer?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/papers:start?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:start?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/start?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:rnn?rev=1776587209&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:postnorm?rev=1776587209&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:layer_normalization?rev=1776587209&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:pipeline_communication?rev=1776587209&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:linear_attention?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:model_pruning?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:multi_head_attention?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:scaled_dot_product_attention?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:llm?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:expert_routing?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:neural_scaling?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:chinchilla_scaling?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:vanishing_gradients?rev=1776587208&amp;do=diff"/>
            </rdf:Seq>
        </items>
    </channel>
    <image rdf:about="https://aether.snwy.me/_media/wiki:dokuwiki.svg">
        <title>Aether</title>
        <link>https://aether.snwy.me/</link>
        <url>https://aether.snwy.me/_media/wiki:dokuwiki.svg</url>
    </image>
    <item rdf:about="https://aether.snwy.me/concepts:positional_encoding?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>positional_encoding - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:positional_encoding?rev=1776587224&amp;do=diff</link>
        <description>Positional Encoding

Softmax attention has no built-in notion of token order (it is permutation-equivariant: permuting the inputs only permutes the outputs), so positional encodings inject sequence-order information into Transformer inputs. Variants include sinusoidal (original), learned, and rotary (RoPE) encodings. Not directly modified by Attention Residuals, but RoPE interacts with</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:backpropagation?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>backpropagation - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:backpropagation?rev=1776587224&amp;do=diff</link>
        <description>Backpropagation

The algorithm for computing gradients of a loss function with respect to network parameters by applying the chain rule layer by layer from output to input. Vanishing gradients occur when the backpropagated gradients decay excessively in deep networks. Residual connections create the gradient highway that preserves gradient flow.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:feedforward_network?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>feedforward_network - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:feedforward_network?rev=1776587224&amp;do=diff</link>
        <description>Feed-Forward Network (FFN)

The position-wise MLP applied after each attention layer in a Transformer block. Typically two linear transformations with a nonlinearity: FFN(x) = W2 · act(W1 · x). In MoE architectures, the FFN is replaced by multiple expert FFNs with</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:transformer?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>transformer - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:transformer?rev=1776587224&amp;do=diff</link>
        <description>Transformer

The dominant architecture for LLMs, built from alternating self-attention and feed-forward layers with residual connections and layer normalization. The original design used PostNorm; modern variants use PreNorm. Attention Residuals modifies how the residual stream accumulates across layers.

See also: softmax_attention, multi_head_attention, prenorm, residual_connections</description>
    </item>
    <item rdf:about="https://aether.snwy.me/papers:start?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>start - Synced from Aether</title>
        <link>https://aether.snwy.me/papers:start?rev=1776587224&amp;do=diff</link>
        <description>Papers

Notes and summaries of research papers.

	*  Attention Residuals — Kimi Team technique replacing fixed residual weights with learned softmax attention over layer outputs.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:start?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>start - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:start?rev=1776587224&amp;do=diff</link>
        <description>Concepts

Definitions and explanations of machine learning terms, organized as an interconnected graph.

Architecture

	*  LLM
	*  Softmax Attention
	*  Scaled Dot-Product Attention
	*  Multi-Head Attention
	*  Residual Connections
	*  Gradient Highway
	*  Mixture of Experts
	*  Expert Routing
	*  Linear Attention
	*  RNN

Normalization &amp; Stability

	*  Layer Normalization
	*  PreNorm
	*  PostNorm
	*  Hidden-State Growth
	*  Vanishing Gradients

Efficiency &amp; Pruning

	*  Model Pruning</description>
    </item>
    <item rdf:about="https://aether.snwy.me/start?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>start - Synced from Aether</title>
        <link>https://aether.snwy.me/start?rev=1776587224&amp;do=diff</link>
        <description>Aether

A personal knowledge base covering machine learning concepts and papers.

Namespaces

	*  Concepts — definitions and explanations of ML terms
	*  Papers — paper notes and summaries

Recent Additions

	*  vanishing_gradients
	*  chinchilla_scaling
	*  linear_attention
	*  postnorm
	*  rnn

About This Host

This host runs:</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:rnn?rev=1776587209&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:49+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>rnn - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:rnn?rev=1776587209&amp;do=diff</link>
        <description>Recurrent Neural Network (RNN)

A network architecture that processes sequences by maintaining a hidden state updated at each time step. Attention Residuals has a recurrent interpretation: the softmax attention over all prior layer outputs can be viewed as a weighted recurrence, connecting transformer depth to recurrent computation.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:postnorm?rev=1776587209&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:49+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>postnorm - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:postnorm?rev=1776587209&amp;do=diff</link>
        <description>PostNorm

Applying layer normalization after the sublayer and residual addition. Used in the original Transformer but largely replaced by PreNorm in modern LLMs due to training instability at depth. PostNorm does not exhibit hidden-state growth but is harder to train.

See also: prenorm, layer_normalization, hidden_state_growth, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:layer_normalization?rev=1776587209&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:49+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>layer_normalization - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:layer_normalization?rev=1776587209&amp;do=diff</link>
        <description>Layer Normalization

Normalizing activations across the feature dimension to stabilize training. Applied either before (PreNorm) or after (PostNorm) the sublayer. PreNorm dominates modern LLMs but causes hidden-state growth.
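
A minimal NumPy sketch of the normalization and the two placements described above. The learned gain and bias (gamma, beta), the eps value, and the toy sublayer are assumptions added for illustration, not taken from this page.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row of x across the feature (last) dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def prenorm_block(x, sublayer, gamma, beta):
    # PreNorm: normalize the sublayer input; the residual path stays unnormalized.
    return x + sublayer(layer_norm(x, gamma, beta))

def postnorm_block(x, sublayer, gamma, beta):
    # PostNorm: normalize after the sublayer output is added to the residual.
    return layer_norm(x + sublayer(x), gamma, beta)

# Toy inputs (hypothetical values, for illustration only).
x = np.array([[1.0, 2.0, 3.0, 4.0]])
gamma, beta = np.ones(4), np.zeros(4)
sublayer = lambda h: 0.5 * h  # stand-in for attention or an FFN

pre = prenorm_block(x, sublayer, gamma, beta)
post = postnorm_block(x, sublayer, gamma, beta)
print(pre.shape, post.shape)
```

In this sketch the PreNorm output keeps the raw residual x, so its scale can keep growing as blocks are stacked, while the PostNorm output is renormalized at every block.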

See also: prenorm, postnorm, hidden_state_growth, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:pipeline_communication?rev=1776587209&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:49+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>pipeline_communication - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:pipeline_communication?rev=1776587209&amp;do=diff</link>
        <description>Pipeline Communication

The overhead of passing activations between pipeline stages in pipeline-parallel training. Block AttnRes reduces this by aggregating representations at block boundaries rather than at every layer, cutting communication volume roughly in proportion to the number of layers per block.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:linear_attention?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>linear_attention - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:linear_attention?rev=1776587208&amp;do=diff</link>
        <description>Linear Attention

Attention variants that replace softmax attention, whose cost is O(n²) in sequence length n, with a kernelized decomposition, reducing the cost to linear in n. Used in Kimi Linear to handle long contexts efficiently. Trades some expressiveness for computational savings.
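
A minimal non-causal sketch of the kernelized decomposition, assuming the elu(x) + 1 feature map. This is a generic textbook form chosen for illustration and is not claimed to be the variant used in Kimi Linear; the function names are hypothetical.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, written as max(x, 0) + exp(min(x, 0)); always positive.
    # The choice of feature map is an assumption made for this sketch.
    return np.maximum(x, 0.0) + np.exp(np.minimum(x, 0.0))

def linear_attention(Q, K, V):
    """Non-causal kernelized attention with cost linear in sequence length n."""
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V          # (d, d_v): summarizes all keys and values once
    z = Kf.sum(axis=0)     # (d,): normalizer, also computed once
    return (Qf @ kv) / (Qf @ z)[:, None]

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 3))
out = linear_attention(Q, K, V)
print(out.shape)  # (6, 3)
```

The n-by-n score matrix is never formed: reordering the product as Qf (Kf^T V) is what removes the quadratic term.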

See also:</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:model_pruning?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>model_pruning - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:model_pruning?rev=1776587208&amp;do=diff</link>
        <description>Model Pruning

Removing parameters or structures from a trained network to reduce size or compute. Includes unstructured pruning (individual weights) and structured pruning (entire heads, layers, or experts). Layer pruning is a form of structured pruning that is surprisingly benign under</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:multi_head_attention?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>multi_head_attention - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:multi_head_attention?rev=1776587208&amp;do=diff</link>
        <description>Multi-Head Attention

Running multiple attention operations in parallel with separate learned projections, then concatenating results. Each head can attend to different positional or semantic relationships. Standard in all modern LLMs.
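
A minimal NumPy sketch of the parallel heads and the concatenation step. The projection matrices Wq, Wk, Wv and the equal split of the model dimension across heads are assumptions for illustration; the final output projection used in most implementations is omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, n_heads):
    """Self-attention over X of shape (n, d_model), split into n_heads heads."""
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) to (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    heads = softmax(scores) @ V                          # (n_heads, n, d_head)
    # Concatenate the heads back along the feature dimension.
    return heads.transpose(1, 0, 2).reshape(n, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = multi_head_attention(X, Wq, Wk, Wv, n_heads=2)
print(out.shape)  # (4, 8)
```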

See also: scaled_dot_product_attention, softmax_attention, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:scaled_dot_product_attention?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>scaled_dot_product_attention - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:scaled_dot_product_attention?rev=1776587208&amp;do=diff</link>
        <description>Scaled Dot-Product Attention

The core computation behind softmax_attention: softmax(QK^T / sqrt(d_k)) V. Dividing by sqrt(d_k) keeps the dot products from growing large in high dimensions, which would otherwise push softmax into saturated regions with vanishing gradients.
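
A minimal NumPy sketch of the formula for a single head with 2-D inputs. The function name, the shapes, and the max-subtraction used for numerical stability are assumptions for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D arrays."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n_queries, n_keys)
    # Subtract the row max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 4))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Each row of weights sums to 1, so every output row is a convex combination of the rows of V.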

See also:</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:llm?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>llm - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:llm?rev=1776587208&amp;do=diff</link>
        <description>Large Language Model (LLM)

A neural network trained on large text corpora to model language. Modern LLMs use transformer architectures with PreNorm and residual connections, and increasingly MoE for efficiency. Scaling laws govern their performance gains.

See also: softmax_attention, moe, scaling_laws, prenorm</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:expert_routing?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>expert_routing - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:expert_routing?rev=1776587208&amp;do=diff</link>
        <description>Expert Routing

The mechanism in Mixture-of-Experts architectures that selects which experts process a given input. Typically implemented as a learned gating network producing a sparse distribution over experts. The quality of routing directly affects MoE efficiency and performance.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:neural_scaling?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>neural_scaling - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:neural_scaling?rev=1776587208&amp;do=diff</link>
        <description>Neural Scaling

The broad phenomenon where model performance improves predictably with increases in parameters, data, or compute. Scaling laws formalize these relationships mathematically.

See also: scaling_laws, chinchilla_scaling, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:chinchilla_scaling?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>chinchilla_scaling - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:chinchilla_scaling?rev=1776587208&amp;do=diff</link>
        <description>Chinchilla Scaling

The finding from Hoffmann et al. (2022) that LLMs should be trained on roughly 20 tokens per parameter for compute-optimal training. Implies many large models are significantly undertrained. Used as the setting for validating that Attention Residuals improvements hold under compute-optimal training.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:vanishing_gradients?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>aethersync (aethersync@undisclosed.example.com)</dc:creator>
        <title>vanishing_gradients - Synced from Aether</title>
        <link>https://aether.snwy.me/concepts:vanishing_gradients?rev=1776587208&amp;do=diff</link>
        <description>Vanishing Gradients

The problem where gradients shrink exponentially as they propagate through many layers during backpropagation, making deep networks difficult or impossible to train. Residual connections mitigate this by providing a direct gradient path — the gradient highway — that preserves signal across arbitrary depth.</description>
    </item>
</rdf:RDF>
