Press "Enter" to skip to content

How a Question Travels from Your Keyboard to a Large Language Model (LLM) and Back

A Comprehensive Step-by-Step Guide to the Journey of an AI Prompt

Simplified for Beginners with Technical Details

Introduction
1. Why This Journey Matters
Every time a person types a question to an AI assistant and presses Enter, a remarkably long and fast chain of events begins. What feels instantaneous — a reply appearing within one or two seconds — is in fact the product of dozens of discrete engineering steps spanning a personal keyboard, an operating system, a browser or app, the public internet, and a data centre packed with specialised computing hardware.
This article traces that chain from beginning to end in plain language, naming the real technologies, protocols and hardware involved at each stage, and giving realistic timing estimates so you can see exactly where the milliseconds go.
The thirteen stages at a glance

1. Keyboard event
2. Client app
3. Connection setup
4. Internet transit
5. Edge / load balancing
6. API gateway
7. Orchestration
8. Tokenization
9. Model inference
10. Hardware
11. Generation / streaming
12. Return journey
13. Rendering

Each is grouped into one of three phases: on the device (1–2), across the network (3–4, and again on the way back at 12), and inside the data centre (5–11), before finishing back on the device at stage 13.



Stage 1
2. The Keyboard Event
The journey begins at the most human part of the system: a physical or virtual keyboard. When a finger presses a key, the keyboard's internal controller detects a change in an electrical circuit (physical keyboards) or a touch coordinate (virtual keyboards), converting it into a scan code.

The keyboard controller detects the keypress and generates a scan code.
The scan code travels to the device over USB HID (Human Interface Device protocol) or Bluetooth.
The OS input driver — Windows, macOS, iOS, Android, or Linux — translates it into a character using the active keyboard layout.
The OS delivers the character as a software event to whichever app has focus: a browser (Chrome, Safari, Edge, Firefox) or a native app.

This entire sequence typically takes 5–30 milliseconds, dominated by keyboard polling rate (125 Hz–1000 Hz) and OS scheduling.



Stage 2
3. Inside the Client Application
Once the user submits the message, the client application — a web page's JavaScript engine or a native app — takes over.

Local validation: checking message length, attachment sizes, live Markdown rendering.
State assembly: packaging the message, conversation history, system instructions and settings into a structured JSON object.
Authentication attachment: a session or OAuth bearer token is attached to the request headers.
Serialization: the JSON object becomes a raw byte stream ready for transmission.

Web clients run on JavaScript engines such as V8 (Chrome, Node.js) or JavaScriptCore (Safari); native apps use Swift/UIKit (iOS) or Kotlin/Jetpack Compose (Android). This step completes in under 10 milliseconds.



Stage 3
4. Building the Network Connection
4.1 DNS resolution
The client asks a DNS resolver to translate the domain (e.g. api.anthropic.com) into an IP address — via a local cache, the ISP, or a public resolver like 1.1.1.1 or 8.8.8.8. Cached: under 5 ms. Fresh: 20–120 ms.
4.2 TCP handshake
The device opens a connection using TCP/IP, or increasingly QUIC (over UDP, underlying HTTP/3). The classic SYN / SYN-ACK / ACK handshake takes one round trip — 10–150 ms depending on distance.
4.3 TLS encryption setup
The device negotiates encryption via TLS 1.3, using Elliptic Curve Diffie-Hellman key exchange. TLS 1.3 needs just one round trip, adding 10–100 ms. Persistent connections skip this stage entirely on subsequent messages.



Stage 4
5. Crossing the Internet
The public internet is a mesh of independently owned networks: the ISP, transit/backbone providers, and finally the AI provider's own network or CDN.

The byte stream is broken into packets wrapped in IP headers.
Routers forward packets hop-by-hop using BGP (Border Gateway Protocol).
Data travels over fibre, coax, or mobile radio (4G/5G) to an exchange point, then long-haul fibre — often undersea cables — to the destination region.
Fibre signal speed is roughly two-thirds light speed: Johannesburg to the US West Coast is at least 260–320 ms round trip from distance alone.

This is why providers use regional data centres and CDN edge points — named providers include Cloudflare, Akamai, Fastly, and Amazon CloudFront.



Stage 5
6. Arriving at the Data Centre: Edge & Load Balancing

Anycast routing: the same IP is announced globally; BGP sends the user to the nearest healthy location.
DDoS mitigation: automated systems (Cloudflare, AWS Shield) drop malicious traffic floods.
Global load balancing: software like Envoy, NGINX, HAProxy, or a managed cloud load balancer (AWS/GCP/Azure) spreads traffic across regional clusters.
Local load balancing spreads requests across identical API server instances.

This hop adds only 1–20 ms since it runs on the provider's own high-speed internal network.



Stage 6
7. API Gateway, Authentication & Rate Limiting

Authentication: verifying the bearer token/API key against an identity service.
Authorization: checking plan and model permissions.
Rate limiting: checking request/token quotas, often via fast in-memory stores like Redis.
Input safety screening: a lightweight classifier may pre-check content policy compliance.
Logging: requests are recorded via pipelines like Kafka for billing and monitoring.

Typically adds 5–50 ms.



Stage 7
8. Request Queuing & Orchestration

Scheduling: a scheduler (conceptually like Kubernetes) assigns the request to an available inference server.
Batching: inference engines like NVIDIA Triton, vLLM, or custom schedulers group requests for parallel GPU processing.
Queue placement: usually a few milliseconds under normal load; longer at peak times.

Underlying infrastructure typically comes from AWS, Google Cloud, Microsoft Azure, Oracle Cloud Infrastructure, or private data centres.



Stage 8
9. Tokenization: Turning Words into Numbers
Neural networks operate purely on numbers. Tokenization converts raw text into a sequence of integers — tokens — each representing a word, sub-word, or punctuation mark.

Input: "Please simplify this"

Tokens: [5959, 39696, 420]

    Modern LLMs typically use Byte-Pair Encoding (BPE) or SentencePiece.
    Vocabularies commonly contain 50,000–100,000+ possible tokens.
    Common words become one token; rarer words split into sub-word pieces.

  Tokenization takes under 5 milliseconds even for long prompts, but token count drives both downstream compute time and billing cost.



  Stage 9
  10. Model Inference: Inside the Neural Network
  This is the computational heart of the journey. The token sequence is fed into a Transformer-based neural network — the architecture introduced in Google's 2017 paper "Attention Is All You Need," and the near-universal foundation of LLMs today.

    Embedding: each token becomes a high-dimensional vector encoding learned meaning.
    Positional encoding: adds information about token order.
    Self-attention layers: compute how strongly each token relates to every other token — the mechanism behind context understanding.
    Feed-forward layers: further transform the representation after each attention step.
    Layer stacking: these blocks repeat dozens to over a hundred times deep.

  Frontier models range from tens of billions to over a trillion parameters. Exact specifications of commercial models — including Claude — are not publicly disclosed, though the general Transformer mechanics are public research shared across Anthropic, OpenAI, Google DeepMind and Meta.
  This full pass to produce the first output token is called time to first token (TTFT) — one of the two headline industry speed metrics.



  Stage 10
  11. The Hardware: GPUs, Interconnects & Clusters
  Transformer math — mostly large matrix multiplications — runs on specialised parallel chips rather than a conventional CPU.

    ComponentNamed Examples
    GPUsNVIDIA A100, H100, H200, B100/B200/GB200; AMD MI300
    Custom AI chipsGoogle TPU (Tensor Processing Unit)
    MemoryHBM3 / HBM3e (High-Bandwidth Memory)
    Intra-server linkNVIDIA NVLink
    Inter-server linkInfiniBand, 400G+ Ethernet
    Software stackCUDA, PyTorch, TensorRT-LLM

  Because the largest models can't fit on one chip, tensor parallelism and pipeline parallelism split layers and operations across many GPUs, coordinated to return one coherent answer.



  Stage 11
  12. Autoregressive Generation & Streaming
  LLMs generate text one token at a time — autoregressive generation: predict the next token, append it, repeat.

    Sampling: temperature and top-p (nucleus) sampling balance coherence with natural variation.
    KV-caching: stores previously computed attention tensors to avoid redundant recomputation — a critical speed optimisation.
    Streaming: each token is sent as soon as produced, via Server-Sent Events (SSE) or WebSockets — why text appears word-by-word.

  The second headline metric, inter-token latency, is commonly 10–50 ms per token on well-optimised frontier systems.



  Stage 12
  13. The Return Journey: Server to Screen
  Each generated token retraces the path in reverse: orchestration layer → API gateway (output safety screening) → load balancer → edge network → public internet → ISP → device. Since the connection is already open and encrypted, no new DNS/TCP/TLS setup is needed — only physical transmission distance and congestion matter, typically adding a few milliseconds per chunk beyond generation time.



  Stage 13
  14. Client Rendering & Display
  The client receives each streamed chunk as JSON, appends it to the visible message, and applies Markdown rendering for headings, bullets, bold text, code blocks and tables. This happens on the device's own CPU/graphics stack in under 1 millisecond per chunk — the loop completes when the display's refresh cycle (60–120 Hz) paints the new pixels.



  Worked Example
  14A. One Question, Start to Finish
  A Johannesburg user types: "Summarise the causes of the 2008 financial crisis in three bullet points" and presses Enter.

    StepWhat HappensTime
    1Enter keypress captured by OS/browser~10 ms
    2Message bundled into JSON with session token~5 ms
    3Sent over already-open encrypted connection0 ms (no new handshake)
    4Transit to nearest serving region (EU/Middle East)~130–170 ms
    5Edge termination + load balancer routing~5 ms
    6Gateway auth, plan check, logging~15 ms
    7Orchestration assigns inference server~10 ms
    8Tokenize ~20 tokens~2 ms
    9Forward pass to first token~400–800 ms
    10First token streams back~130–170 ms
    11Remaining ~60-80 tokens stream at 15-30ms each~1–2 s

  Total: roughly two to three seconds end-to-end — over 80% of it spent on physical network transit or genuine neural computation, not avoidable inefficiency.



  Latency Budget
  15. End-to-End Latency Budget

    StageTypical RangeDominant Factor
    1. Keyboard event5–30 msHardware polling, OS scheduling
    2. Client-side processing1–10 msJavaScript/app execution
    3. DNS + TCP + TLS setup0 (cached) – 300 msConnection re-use vs. fresh handshake
    4. Internet transit20–300+ msPhysical distance, route quality
    5. Edge / load balancing1–20 msInternal network speed
    6. API gateway checks5–50 msAuth, rate limits, safety screen
    7. Queuing / orchestration1–100+ msCurrent system load
    8. Tokenization<1–5 msPrompt length
    9–11. Time to first token200–1500 msModel size, prompt length, load
    Inter-token generation10–50 ms/tokenModel size, hardware, batching
    12. Return transit (per chunk)20–300 ms (one-way)Physical distance
    13. Client rendering<1 msDevice rendering pipeline

  Illustrative ranges for a well-optimised system on broadband/mobile data. Same-region users sit toward the lower end.
  Two numbers matter most subjectively: time to first token (TTFT) and inter-token latency. A short prompt on an uncongested system commonly achieves TTFT around half a second to just over a second, streaming the rest at natural reading speed or faster.



  Reference
  16. Key Technologies and Their Names

    LayerRepresentative Named Technologies
    Input hardwareUSB HID protocol, Bluetooth LE, capacitive touchscreen controllers
    Operating systemWindows, macOS, iOS, Android, Linux input drivers
    Client runtimeV8, JavaScriptCore, Swift/UIKit, Kotlin
    Data formatJSON
    Naming / DNSDNS protocol; resolvers 1.1.1.1, 8.8.8.8
    TransportTCP/IP, UDP, QUIC
    SecurityTLS 1.3, HTTP/2, HTTP/3
    RoutingBGP, Anycast
    CDN / edgeCloudflare, Akamai, Fastly, Amazon CloudFront
    Load balancingEnvoy, NGINX, HAProxy; AWS/GCP/Azure managed LBs
    Caching / rate limitingRedis, Memcached
    Logging pipelinesApache Kafka
    OrchestrationKubernetes, Docker
    Inference servingNVIDIA Triton, vLLM, TensorRT-LLM
    Model frameworkPyTorch, CUDA
    GPUsNVIDIA A100/H100/H200/B200; AMD MI300
    Custom AI chipsGoogle TPU
    Chip interconnectNVLink, InfiniBand, 400G Ethernet
    MemoryHBM3 / HBM3e
    Model architectureTransformer, self-attention
    TokenizationByte-Pair Encoding (BPE), SentencePiece
    Response deliveryServer-Sent Events (SSE), WebSockets
    Cloud infrastructureAWS, Google Cloud, Microsoft Azure, Oracle Cloud Infrastructure




  Landscape
  16A. The Provider Landscape: Names You Will Encounter

    CompanyModel Family
    AnthropicClaude
    OpenAIGPT, o-series
    Google DeepMindGemini
    MetaLlama
    xAIGrok
    Mistral AIMistral, Mixtral
    AmazonNova
    MicrosoftAzure-hosted OpenAI models, Phi
    DeepSeekDeepSeek
    AlibabaQwen

  Exact architectures and parameter counts are, for almost every company, closely-held trade secrets. The Transformer mechanics described here are public research applying in broad terms across the industry.



  Infrastructure
  17. Data Centre Infrastructure
  Power
  High-density GPU racks can draw tens of kilowatts each. Facilities draw from multiple independent grid connections plus diesel generators and UPS battery systems for zero-interruption backup.
  Cooling
  Beyond traditional CRAH air-conditioning, many AI data centres now use direct liquid cooling (coolant on cold plates over chips) or immersion cooling to support much higher rack density.
  Redundancy
  Facilities are classified by Uptime Institute Tier standards (I–IV). Providers spread inference across multiple independent data centres and regions so a fault in one automatically reroutes traffic to another.



  Security
  18. Security and Privacy Along the Path

    Encryption in transit: TLS 1.3 keeps intercepted traffic unreadable.
    Encryption at rest: stored data typically uses AES-256.
    Authentication: bearer tokens, API keys, session cookies, multi-factor authentication.
    Automated safety systems: classifier models screen inputs and outputs for policy violations.
    Network isolation: production clusters run on private, segmented networks (VPCs) not directly reachable from the public internet.




  Speed Factors
  19. What Makes a Response Fast or Slow
  User-side factors

    Physical distance to the nearest data centre region — often the single largest controllable factor.
    Network quality: Wi-Fi congestion, mobile signal, saturated connections.
    Device performance (rarely the dominant factor).

  Provider-side factors

    Prompt length — longer inputs take proportionally longer to process.
    Model size and choice.
    Current system load / queue depth.
    Output length — autoregressive generation is token-by-token.
    Hardware generation — newer GPUs reduce both TTFT and inter-token latency.

  19A. A Business Note: Why Tokens Also Drive Cost
  Tokenization is also the standard billing unit industry-wide — providers price per million input and output tokens, with output priced higher (generation is more compute-intensive). Latency and cost are two views of the same quantity: trimming unnecessary prompt context, using prompt caching where supported, and choosing smaller model variants for simpler tasks can reduce both together.



  Reference
  20. Glossary
  API: a defined way for one piece of software to request services from another.
  Data centre: a facility housing servers plus power/cooling infrastructure.
  Inference: running a trained AI model to produce output, vs. training which teaches the model.
  Latency: the delay between an action and its observed effect.
  Parameter: a learned internal numeric value in a neural network; a rough measure of model size.
  Token: the basic unit of text an LLM processes.
  Transformer: the self-attention-based neural architecture underlying modern LLMs.
  Time to first token (TTFT): elapsed time from sending a request to the first visible response.
  Inter-token latency: time gap between subsequently streamed tokens.
  Load balancer: distributes incoming requests across multiple servers.
  GPU: a parallel-processing chip that is the standard hardware for AI inference.
  TLS: the cryptographic protocol encrypting client-server internet traffic.
  Autoregressive generation: predicting one token at a time, each conditioned on all previous ones.
  

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *