<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="FeedCreator 1.8" -->
<?xml-stylesheet href="https://aether.snwy.me/lib/exe/css.php?s=feed" type="text/css"?>
<rdf:RDF
    xmlns="http://purl.org/rss/1.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel rdf:about="https://aether.snwy.me/feed.php">
        <title>Aether - concepts</title>
        <description></description>
        <link>https://aether.snwy.me/</link>
        <image rdf:resource="https://aether.snwy.me/_media/wiki:dokuwiki.svg" />
        <dc:date>2026-04-20T14:49:07+00:00</dc:date>
        <items>
            <rdf:Seq>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:backpropagation?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:block_attnres?rev=1776568745&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:chinchilla_scaling?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:expert_routing?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:feedforward_network?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:gradient_highway?rev=1776568746&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:hidden_state_growth?rev=1776568745&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:kimi_linear?rev=1776568745&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:layer_normalization?rev=1776587209&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:layer_pruning?rev=1776568746&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:linear_attention?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:llm?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:model_pruning?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:moe?rev=1776568745&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:multi_head_attention?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:neural_scaling?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:pipeline_communication?rev=1776587209&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:positional_encoding?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:postnorm?rev=1776587209&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:prenorm?rev=1776568745&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:residual_connections?rev=1776568745&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:rnn?rev=1776587209&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:scaled_dot_product_attention?rev=1776587208&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:scaling_laws?rev=1776568746&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:softmax_attention?rev=1776568746&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:start?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:transformer?rev=1776587224&amp;do=diff"/>
                <rdf:li rdf:resource="https://aether.snwy.me/concepts:vanishing_gradients?rev=1776587208&amp;do=diff"/>
            </rdf:Seq>
        </items>
    </channel>
    <image rdf:about="https://aether.snwy.me/_media/wiki:dokuwiki.svg">
        <title>Aether</title>
        <link>https://aether.snwy.me/</link>
        <url>https://aether.snwy.me/_media/wiki:dokuwiki.svg</url>
    </image>
    <item rdf:about="https://aether.snwy.me/concepts:backpropagation?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>backpropagation</title>
        <link>https://aether.snwy.me/concepts:backpropagation?rev=1776587224&amp;do=diff</link>
        <description>Backpropagation

The algorithm for computing gradients of a loss function with respect to network parameters by applying the chain rule layer by layer from output to input. Vanishing gradients occur when these signals decay excessively in deep networks. Residual connections create the gradient highway that preserves gradient flow.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:block_attnres?rev=1776568745&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:05+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>block_attnres</title>
        <link>https://aether.snwy.me/concepts:block_attnres?rev=1776568745&amp;do=diff</link>
        <description>Block AttnRes

A practical variant of Attention Residuals that partitions layers into N blocks. Within each block, standard residuals are used. At block boundaries, an AttnRes operation aggregates block-level representations, reducing memory from O(Ld) to O(Nd) while preserving most gains.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:chinchilla_scaling?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>chinchilla_scaling</title>
        <link>https://aether.snwy.me/concepts:chinchilla_scaling?rev=1776587208&amp;do=diff</link>
        <description>Chinchilla Scaling

The finding from Hoffmann et al. (2022) that LLMs should be trained on roughly 20 tokens per parameter for compute-optimal training. Implies many large models are significantly undertrained. Validates that Attention Residuals improvements hold under this regime.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:expert_routing?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>expert_routing</title>
        <link>https://aether.snwy.me/concepts:expert_routing?rev=1776587208&amp;do=diff</link>
        <description>Expert Routing

The mechanism in Mixture-of-Experts architectures that selects which experts process a given input. Typically implemented as a learned gating network producing a sparse distribution over experts. The quality of routing directly affects MoE efficiency and performance.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:feedforward_network?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>feedforward_network</title>
        <link>https://aether.snwy.me/concepts:feedforward_network?rev=1776587224&amp;do=diff</link>
        <description>Feed-Forward Network (FFN)

The position-wise MLP applied after each attention layer in a Transformer block. Typically two linear transformations with a nonlinearity: FFN(x) = W2 · act(W1 · x). In MoE architectures, the FFN is replaced by multiple expert FFNs with a routing function selecting among them.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:gradient_highway?rev=1776568746&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:06+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>gradient_highway</title>
        <link>https://aether.snwy.me/concepts:gradient_highway?rev=1776568746&amp;do=diff</link>
        <description>Gradient Highway

The property of residual connections that allows gradients to flow directly through the identity path during backpropagation, bypassing layer transformations. This enables stable training of very deep networks.

See also: residual_connections, vanishing_gradients, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:hidden_state_growth?rev=1776568745&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:05+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>hidden_state_growth</title>
        <link>https://aether.snwy.me/concepts:hidden_state_growth?rev=1776568745&amp;do=diff</link>
        <description>Hidden-State Growth

Under PreNorm residual connections, hidden-state magnitudes grow as O(L) with depth because each layer adds a roughly unit-magnitude output to the running sum. This progressively dilutes each layer&#039;s relative contribution and buries early-layer information.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:kimi_linear?rev=1776568745&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:05+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>kimi_linear</title>
        <link>https://aether.snwy.me/concepts:kimi_linear?rev=1776568745&amp;do=diff</link>
        <description>Kimi Linear

A Mixture-of-Experts architecture by Kimi Team with 48B total / 3B activated parameters. Uses MoE with linear attention. The paper Attention Residuals integrates AttnRes into this architecture, pre-training on 1.4T tokens.

See also: moe, linear_attention, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:layer_normalization?rev=1776587209&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:49+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>layer_normalization</title>
        <link>https://aether.snwy.me/concepts:layer_normalization?rev=1776587209&amp;do=diff</link>
        <description>Layer Normalization

Normalizing activations across the feature dimension to stabilize training. Applied either before (PreNorm) or after (PostNorm) the sublayer. PreNorm dominates modern LLMs but causes hidden-state growth.

See also: prenorm, postnorm, hidden_state_growth, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:layer_pruning?rev=1776568746&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:06+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>layer_pruning</title>
        <link>https://aether.snwy.me/concepts:layer_pruning?rev=1776568746&amp;do=diff</link>
        <description>Layer Pruning

Removing entire layers from a trained network. Under standard residual connections with PreNorm, many layers can be pruned with minimal performance loss because their contributions are heavily diluted. This motivates Attention Residuals, which gives each layer learned, content-dependent influence.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:linear_attention?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>linear_attention</title>
        <link>https://aether.snwy.me/concepts:linear_attention?rev=1776587208&amp;do=diff</link>
        <description>Linear Attention

Attention variants that replace the O(n²) softmax with a kernelized decomposition, reducing sequence-length cost to linear. Used in Kimi Linear to handle long contexts efficiently. Trades some expressiveness for computational savings.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:llm?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>llm</title>
        <link>https://aether.snwy.me/concepts:llm?rev=1776587208&amp;do=diff</link>
        <description>Large Language Model (LLM)

A neural network trained on large text corpora to model language. Modern LLMs use transformer architectures with PreNorm and residual connections, and increasingly MoE for efficiency. Scaling laws govern their performance gains.

See also: softmax_attention, moe, scaling_laws, prenorm</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:model_pruning?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>model_pruning</title>
        <link>https://aether.snwy.me/concepts:model_pruning?rev=1776587208&amp;do=diff</link>
        <description>Model Pruning

Removing parameters or structures from a trained network to reduce size or compute. Includes unstructured pruning (individual weights) and structured pruning (entire heads, layers, or experts). Layer pruning is a form of structured pruning that is surprisingly benign under standard residual connections with PreNorm.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:moe?rev=1776568745&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:05+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>moe</title>
        <link>https://aether.snwy.me/concepts:moe?rev=1776568745&amp;do=diff</link>
        <description>Mixture of Experts (MoE)

An architecture where only a subset of parameters (experts) are activated per input, determined by a routing function. Enables scaling total parameters while keeping compute per token low. Used in Kimi Linear and many modern LLMs.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:multi_head_attention?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>multi_head_attention</title>
        <link>https://aether.snwy.me/concepts:multi_head_attention?rev=1776587208&amp;do=diff</link>
        <description>Multi-Head Attention

Running multiple attention operations in parallel with separate learned projections, then concatenating results. Each head can attend to different positional or semantic relationships. Standard in all modern LLMs.

See also: scaled_dot_product_attention, softmax_attention, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:neural_scaling?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>neural_scaling</title>
        <link>https://aether.snwy.me/concepts:neural_scaling?rev=1776587208&amp;do=diff</link>
        <description>Neural Scaling

The broad phenomenon where model performance improves predictably with increases in parameters, data, or compute. Scaling laws formalize these relationships mathematically.

See also: scaling_laws, chinchilla_scaling, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:pipeline_communication?rev=1776587209&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:49+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>pipeline_communication</title>
        <link>https://aether.snwy.me/concepts:pipeline_communication?rev=1776587209&amp;do=diff</link>
        <description>Pipeline Communication

The overhead of passing activations between pipeline stages in pipeline-parallel training. Block AttnRes reduces this by aggregating representations at block boundaries rather than every layer, cutting communication volume proportionally.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:positional_encoding?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>positional_encoding</title>
        <link>https://aether.snwy.me/concepts:positional_encoding?rev=1776587224&amp;do=diff</link>
        <description>Positional Encoding

Since softmax attention is permutation-invariant, positional encodings inject sequence-order information into Transformer inputs. Variants include sinusoidal (original), learned, and rotary (RoPE) encodings. Not directly modified by Attention Residuals, but RoPE interacts with</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:postnorm?rev=1776587209&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:49+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>postnorm</title>
        <link>https://aether.snwy.me/concepts:postnorm?rev=1776587209&amp;do=diff</link>
        <description>PostNorm

Applying layer normalization after the sublayer and residual addition. Used in the original Transformer but largely replaced by PreNorm in modern LLMs due to training instability at depth. PostNorm does not exhibit hidden-state growth but is harder to train.

See also: prenorm, layer_normalization, hidden_state_growth, attention_residuals</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:prenorm?rev=1776568745&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:05+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>prenorm</title>
        <link>https://aether.snwy.me/concepts:prenorm?rev=1776568745&amp;do=diff</link>
        <description>PreNorm

Applying layer normalization before the sublayer transformation (attention or FFN). Dominant in modern LLMs for training stability, but its unweighted accumulation causes hidden-state magnitudes to grow as O(L) with depth, diluting each layer&#039;s contribution.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:residual_connections?rev=1776568745&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:05+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>residual_connections</title>
        <link>https://aether.snwy.me/concepts:residual_connections?rev=1776568745&amp;do=diff</link>
        <description>Residual Connections

Skip connections that add a layer&#039;s input to its output: h_l = h_{l-1} + f(h_{l-1}). Enable gradient flow in deep networks but accumulate all prior outputs with fixed unit weights, causing dilution at depth.

See also: attention_residuals, prenorm, gradient_highway, hidden_state_growth, layer_pruning</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:rnn?rev=1776587209&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:49+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>rnn</title>
        <link>https://aether.snwy.me/concepts:rnn?rev=1776587209&amp;do=diff</link>
        <description>Recurrent Neural Network (RNN)

A network architecture that processes sequences by maintaining a hidden state updated at each time step. Attention Residuals has a recurrent interpretation: the softmax attention over all prior layer outputs can be viewed as a weighted recurrence, connecting transformer depth to recurrent computation.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:scaled_dot_product_attention?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>scaled_dot_product_attention</title>
        <link>https://aether.snwy.me/concepts:scaled_dot_product_attention?rev=1776587208&amp;do=diff</link>
        <description>Scaled Dot-Product Attention

The core computation behind softmax_attention: softmax(QK^T / sqrt(d_k)) V. The sqrt(d_k) scaling prevents dot products from growing large in high dimensions, which would push softmax into saturated regions with vanishing gradients.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:scaling_laws?rev=1776568746&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:06+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>scaling_laws</title>
        <link>https://aether.snwy.me/concepts:scaling_laws?rev=1776568746&amp;do=diff</link>
        <description>Scaling Laws

Empirical relationships describing how model performance (typically loss) scales with model size, data size, and compute. The paper Attention Residuals validates that its improvements hold consistently across model sizes via scaling law experiments.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:softmax_attention?rev=1776568746&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T03:19:06+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>softmax_attention</title>
        <link>https://aether.snwy.me/concepts:softmax_attention?rev=1776568746&amp;do=diff</link>
        <description>Softmax Attention

The standard attention mechanism: computes dot-product similarity between a query and keys, applies softmax to produce a probability distribution, then takes a weighted sum of values. In Attention Residuals, softmax attention is repurposed for depth-wise aggregation across layer outputs instead of sequence positions.</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:start?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>start</title>
        <link>https://aether.snwy.me/concepts:start?rev=1776587224&amp;do=diff</link>
        <description>Concepts

Definitions and explanations of machine learning terms, organized as an interconnected graph.

Architecture

	*  LLM
	*  Softmax Attention
	*  Scaled Dot-Product Attention
	*  Multi-Head Attention
	*  Residual Connections
	*  Gradient Highway
	*  Mixture of Experts
	*  Expert Routing
	*  Linear Attention
	*  RNN

Normalization &amp; Stability

	*  Layer Normalization
	*  PreNorm
	*  PostNorm
	*  Hidden-State Growth
	*  Vanishing Gradients

Efficiency &amp; Pruning

	*  Model Pruning</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:transformer?rev=1776587224&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:27:04+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>transformer</title>
        <link>https://aether.snwy.me/concepts:transformer?rev=1776587224&amp;do=diff</link>
        <description>Transformer

The dominant architecture for LLMs, built from alternating self-attention and feed-forward layers with residual connections and layer normalization. The original design used PostNorm; modern variants use PreNorm. Attention Residuals modifies how the residual stream accumulates across layers.

See also: softmax_attention, multi_head_attention, prenorm, residual_connections</description>
    </item>
    <item rdf:about="https://aether.snwy.me/concepts:vanishing_gradients?rev=1776587208&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2026-04-19T08:26:48+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>vanishing_gradients</title>
        <link>https://aether.snwy.me/concepts:vanishing_gradients?rev=1776587208&amp;do=diff</link>
        <description>Vanishing Gradients

The problem where gradients shrink exponentially as they propagate through many layers during backpropagation, making deep networks difficult or impossible to train. Residual connections mitigate this by providing a direct gradient path — the gradient highway — that preserves signal across arbitrary depth.</description>
    </item>
</rdf:RDF>
