12. How does the Internet work?

To help you understand how the Internet works, we'll look at the things that happen when you do a typical Internet operation -- pointing a browser at the front page of this document at its home on the Web at the Linux Documentation Project. This document is

ttp://www.linuxdoc.org/HOWTO/Unix-and-Internet-Fundamentals-HOWTO/index.html

which means it lives in the file LDP/HOWTO/Unix-and-Internet-Fundamentals-HOWTO/index.html under the World Wide Web export directory of the host www.linuxdoc.org.

12.1. Names and locations

The first thing your browser has to do is to establish a network connection to the machine where the document lives. To do that, it first has to find the network location of the host www.linuxdoc.org (`host' is short for `host machine' or `network host'; www.linuxdoc.org is a typical hostname). The corresponding location is actually a number called an IP address (we'll explain the `IP' part of this term later).

To do this, your browser queries a program called a name server. The name server may live on your machine, but it's more likely to run on a service machine that yours talks to. When you sign up with an ISP, part of your setup procedure will almost certainly involve telling your Internet software the IP address of a nameserver on the ISP's network.

The name servers on different machines talk to each other, exchanging and keeping up to date all the information needed to resolve hostnames (map them to IP addresses). Your nameserver may query three or four different sites across the network in the process of resolving www.linuxdoc.org, but this usually happens very quickly (as in less than a second). We'll look at how nameservers detail in the next section.

The nameserver will tell your browser that www.linuxdoc.org's IP address is 152.19.254.81; knowing this, your machine will be able to exchange bits with www.linuxdoc.org directly.

12.2. The Domain Name System

The whole network of programs and databases that cooperates to translate hostnames to IP addresses is called `DNS' (Domain Name System). When you see references to a `DNS server', that means what we just called a nameserver. Now I'll explain how the overall system works.

Internet hostnames are composed of parts separated by dots. A domain is a collection of machines that share a common name suffix. Domains can live inside other domains. For example, the machine www.linuxdoc.org lives in the .linuxdoc.org subdomain of the .org domain.

Each domain is defined by an authoritative name server that knows the IP addresses of the other machines in the domain. The authoritative (or `primary') name server may have backups in case it goes down; if you see references to a secondary name server or (`secondary DNS') it's talking about one of those. These secondaries typically refresh their information from their primaries every few hours, so a change made to the hostname-to-IP mapping on the primary will automatically be propagated.

Now here's the important part. The nameservers for a domain do not have to know the locations of all the machines in other domains (including their own subdomains); they only have to know the location of the nameservers. In our example, the authoritative name server for the .org domain knows the IP address of the nameserver for .linuxdoc.org, but not the address of all the other machines in www.linuxdoc.org.

The domains in the DNS system are arranged like a big inverted tree. At the top are the root servers. Everybody knows the IP addresses of the root servers; they're wired into your DNS software. The root servers know the IP addresses of the nameservers for the top-level domains like .com and .org, but not the addresses of machines inside those domains. Each top-level domain server knows where the nameservers for the domains directly beneath it are, and so forth.

DNS is carefully designed so that each machine can get away with the minimum amount of knowledge it needs to have about the shape of the tree, and local changes to subtrees can be made simply by changing one authoritative server's database of name-to-IP-address mappings.

When you query for the IP address of www.linuxdoc.org, what actually happens is this: First, your nameserver asks a root server to tell it where it can find a nameserver for .org. Once it knows that, it then asks the .org server to tell it the IP address of a .linuxdoc.org nameserver. Once it has that, it asks the .linuxdoc.org nameserver to tell it the address of the host www.linuxdoc.org.

Most of the time, your nameserver doesn't actually have to work that hard. Nameservers do a lot of cacheing; when yours resolves a hostname, it keeps the association with the resulting IP address around in memory for a while. This is why, when you surf to a new website, you'll usually only see a message from your browser about "Looking up" the host for the first page you fetch. Eventually the name-to-address mapping expires and your DNS has to re-query — this is important so you don't have invalid information hanging around forever when a hostname changes addresses. Your cached IP address for a site is also thrown out if the host is unreachable.

12.3. Packets and routers

What the browser wants to do is send a command to the Web server on www.linuxdoc.org that looks like this:

GET /LDP/HOWTO/Fundamentals.html HTTP/1.0

Here's how that happens. The command is made into a packet, a block of bits like a telegram that is wrapped with three important things; the source address (the IP address of your machine), the destination address (152.19.254.81), and a service number or port number (80, in this case) that indicates that it's a World Wide Web request.

Your machine then ships the packet down the wire (your connection to your ISP, or local network) until it gets to a specialized machine called a router. The router has a map of the Internet in its memory -- not always a complete one, but one that completely describes your network neighborhood and knows how to get to the routers for other neighborhoods on the Internet.

Your packet may pass through several routers on the way to its destination. Routers are smart. They watch how long it takes for other routers to acknowledge having received a packet. They also use that information to direct traffic over fast links. They use it to notice when another routers (or a cable) have dropped off the network, and compensate if possible by finding another route.

There's an urban legend that the Internet was designed to survive nuclear war. This is not true, but the Internet's design is extremely good at getting reliable performance out of flaky hardware in an uncertain world. This is directly due to the fact that its intelligence is distributed through thousands of routers rather than concentrated in a few massive and vulnerable switches (like the phone network). This means that failures tend to be well localized and the network can route around them.

Once your packet gets to its destination machine, that machine uses the service number to feed the packet to the web server. The web server can tell where to reply to by looking at the command packet's source IP address. When the web server returns this document, it will be broken up into a number of packets. The size of the packets will vary according to the transmission media in the network and the type of service.

12.4. TCP and IP

To understand how multiple-packet transmissions are handled, you need to know that the Internet actually uses two protocols, stacked one on top of the other.

The lower level, IP (Internet Protocol), knows how to get individual packets from a source address to a destination address (this is why these are called IP addresses). However, IP is not reliable; if a packet gets lost or dropped, the source and destination machines may never know it. In network jargon, IP is a connectionless protocol; the sender just fires a packet at the receiver and doesn't expect an acknowledgement.

IP is fast and cheap, though. Sometimes fast, cheap and unreliable is OK. When you play networked Doom or Quake, each bullet is represented by an IP packet. If a few of those get lost, that's OK.

The upper level, TCP (Transmission Control Protocol), gives you reliability. When two machines negotiate a TCP connection (which they do using IP), the receiver knows to send acknowledgements of the packets it sees back to the sender. If the sender doesn't see an acknowledgement for a packet within some timeout period, it resends that packet. Furthermore, the sender gives each TCP packet a sequence number, which the receiver can use you reassemble packets in case they show up out of order. (This can easily happen if network links go up or down during a connection.)

TCP/IP packets also contain a checksum to enable detection of data corrupted by bad links. (The checksum is computed from the rest of the packet in such a way that if the either the rest of the packet or the checksum is corrupted, redoing the computation and comparing is very likely to indicate an error.) So, from the point of view of anyone using TCP/IP and nameservers, it looks like a reliable way to pass streams of bytes between hostname/service-number pairs. People who write network protocols almost never have to think about all the packetizing, packet reassembly, error checking, checksumming, and retransmission that goes on below that level.

12.5. HTTP, an application protocol

Now let's get back to our example. Web browsers and servers speak an application protocol that runs on top of TCP/IP, using it simply as a way to pass strings of bytes back and forth. This protocol is called HTTP (Hyper-Text Transfer Protocol) and we've already seen one command in it -- the GET shown above.

When the GET command goes to www.linuxdoc.org's webserver with service number 80, it will be dispatched to a server daemon listening on port 80. Most Internet services are implemented by server daemons that do nothing but wait on ports, watching for and executing incoming commands.

If the design of the Internet has one overall rule, it's that all the parts should be as simple and human-accessible as possible. HTTP, and its relatives (like the Simple Mail Transfer Protocol, SMTP, that is used to move electronic mail between hosts) tend to use simple printable-text commands that end with a carriage-return/line feed.

This is marginally inefficient; in some circumstances you could get more speed by using a tightly-coded binary protocol. But experience has shown that the benefits of having commands be easy for human beings to describe and understand outweigh any marginal gain in efficiency that you might get at the cost of making things tricky and opaque.

Therefore, what the server daemon ships back to you via TCP/IP is also text. The beginning of the response will look something like this (a few headers have been suppressed):

HTTP/1.1 200 OK
Date: Sat, 10 Oct 1998 18:43:35 GMT
Server: Apache/1.2.6 Red Hat
Last-Modified: Thu, 27 Aug 1998 17:55:15 GMT
Content-Length: 2982
Content-Type: text/html

These headers will be followed by a blank line and the text of the web page (after which the connection is dropped). Your browser just displays that page. The headers tell it how (in particular, the Content-Type header tells it the returned data is really HTML).