How Data flows through the Internet

sudan
Apr 17, 2022

In this article, we will understand how Data flows through the Internet to reach the final destination.

Before we get into the details, we need to understand certain Terminologies, various Devices, Channels, and Protocols involved in the process.

Terminologies

Data: Data is nothing but a message which needs to be transferred from a Source to a Target. Data is represented in the form of bits (0s and 1s).

Octet: An Octet represents a group of 8 bits.

Host: A Host is a computer that is capable of sending and receiving data over the Network. Ex: Laptops, Servers, Mobiles, etc…

IP Address: An IP Address is a unique identifier for each Host, similar to a postal address. An IPv4 Address is 32 bits long and is written as 4 Octets, each of which can take a value in the range 0–255 (for example, 192.168.1.10). The IP Address is often regarded as the Logical Address.
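To make the link between the dotted notation and the underlying 32 bits concrete, here is a minimal sketch using Python's standard `ipaddress` module. The address 192.168.1.10 is only an illustrative value and does not come from any real setup.

```python
import ipaddress

# Illustrative address (an assumption, not taken from a real network).
addr = ipaddress.IPv4Address("192.168.1.10")

as_integer = int(addr)          # the same address as one 32-bit number
octets = list(addr.packed)      # the 4 Octets, each in the range 0-255
bits = format(as_integer, "032b")

print(as_integer)   # 3232235786
print(octets)       # [192, 168, 1, 10]
print(len(bits))    # 32
```

The dotted form is just a human-friendly way of writing those 32 bits, one Octet at a time.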

Network: A Network is a group of interconnected Hosts that share data with each other using well-established Protocols.

Protocol: A Protocol is a set of rules that allows Hosts to transfer data between them.

MAC Address: A MAC (Media Access Control) Address is a unique identifier assigned to a NIC, used to identify the Host within its own Network. The MAC Address is assigned by the device manufacturer and is often regarded as the Physical Address.

Devices

Following are the important devices that facilitate moving data from source to destination.

NIC

A NIC (Network Interface Card) is a hardware component installed in every Host that provides the Host with Network connectivity.

Repeaters

A signal loses strength as it travels through a channel, which makes long-distance communication difficult.

A Repeater is an electronic device that receives the signal and retransmits it at a higher power, thus enabling long-distance transmission.

Hub

A Hub is a multi-port Repeater that boosts an incoming signal and retransmits it on all of its other ports. For all practical purposes, Hubs are obsolete and have been replaced by Switches.

Bridge

A Bridge has exactly two ports and is installed between two Hubs. A Bridge can learn which Hosts are on either side of it and forward traffic accordingly. Like Hubs, Bridges are largely obsolete and have been replaced by Switches.

Switch

A Switch is an electronic device that provides the functionality of both Hub and Bridge.

Switches are responsible for transferring data between Hosts within the same Network and this process is called Switching.

A Switch has multiple ports and each port can be connected to a Host. A Switch relies on one mapping table to transfer data between Hosts, while a second, related table lives on the devices around it.

  1. MAC Address Table: This table maintains a mapping of MAC Address to Port, which the Switch uses to look up the Port to which a frame should be forwarded for a given destination MAC Address.
  2. ARP Cache Table: This table maintains a mapping of IP Address to MAC Address. It is kept by Hosts and Routers (and by Layer 3 Switches) rather than by a plain Layer 2 Switch, and it is used to look up the destination MAC Address for a given IP Address.

A Switch performs three main operations.

  1. Learn: When a frame arrives, record its source MAC Address against the Port it arrived on, if that mapping is not already in the MAC Address Table.
  2. Flood: Send the frame out of every Port except the one it arrived on when the destination MAC Address is the broadcast address or is not yet in the MAC Address Table. When the destination Host replies, the Switch learns that Host's Port from the reply.
  3. Forward: Send the frame only to the matching Port when the destination MAC Address already exists in the MAC Address Table.
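The Learn, Flood, and Forward operations can be sketched in a few lines of Python. This is a simplified model, not how any real Switch firmware is written: the class name `LearningSwitch`, the port numbers, and the shortened MAC Addresses are all assumptions made for illustration, and the sketch assumes the Switch learns from the source MAC Address of each incoming frame and floods whenever the destination MAC Address is not yet in its table.

```python
class LearningSwitch:
    """Toy model of a Switch with a MAC Address Table (hypothetical names)."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC Address -> Port

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of Ports the frame is sent out of."""
        # Learn: remember which Port the sender is reachable on.
        self.mac_table[src_mac] = in_port

        # Forward: the destination is known, so use only that Port.
        out_port = self.mac_table.get(dst_mac)
        if out_port is not None:
            return [] if out_port == in_port else [out_port]

        # Flood: destination unknown, send everywhere except the arrival Port.
        return [port for port in self.ports if port != in_port]


switch = LearningSwitch(ports=[1, 2, 3])
print(switch.receive(1, "aa:aa", "bb:bb"))  # [2, 3]  flooded, bb:bb unknown
print(switch.receive(2, "bb:bb", "aa:aa"))  # [1]     forwarded, aa:aa learned
print(switch.receive(1, "aa:aa", "bb:bb"))  # [2]     forwarded, bb:bb learned
```

After the first exchange the Switch no longer needs to Flood, because both Hosts have shown up as senders.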

Router

A Router is an electronic device that is responsible for moving packets between different, possibly heterogeneous, Networks, and this process is called Routing. The Internet is essentially a vast collection of Networks interconnected by Routers.

A Router has multiple interfaces, each connected to a different Network, and each interface has its own IP Address and MAC Address. The IP Address of the interface facing a Network serves as the Default Gateway for all the Hosts within that Network.

A Router can also enforce various Access Control rules that protect the Network.

Each Router has two mapping tables that help transfer data between Networks.

  1. Routing Table: This table maintains a mapping of Routes to a particular Network. It primarily includes the Interface ID, Route Matching Policy, and the Next Hop if the destination IP matches the Route Matching Policy.
  2. ARP Cache Table: This table maintains a mapping of IP Address to MAC Address. Routers use this table to identify the destination MAC address to which the Data should be forwarded for a given IP Address.

When a Router receives a packet whose Destination IP Address matches no entry in the Routing Table (and there is no default route), the packet is dropped. Routers populate the Routing Table either through Static Mapping, configured by hand, or through Dynamic Mapping, using various Dynamic Routing Protocols.
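A Routing Table lookup can be sketched with Python's standard `ipaddress` module. The sketch below assumes the common rule that the most specific matching route (the longest prefix) wins. The prefixes, interface names, and next-hop addresses are made-up illustrative values.

```python
import ipaddress

# Hypothetical Routing Table: (destination network, interface, next hop).
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/8"), "eth1", "10.0.0.1"),
    (ipaddress.ip_network("10.1.0.0/16"), "eth2", "10.1.0.1"),
    (ipaddress.ip_network("0.0.0.0/0"), "eth0", "203.0.113.1"),  # default route
]


def lookup(destination, routes=ROUTES):
    """Return the most specific matching route, or None (packet dropped)."""
    dest = ipaddress.ip_address(destination)
    matches = [route for route in routes if dest in route[0]]
    if not matches:
        return None
    return max(matches, key=lambda route: route[0].prefixlen)


print(lookup("10.1.2.3")[1])           # eth2 (the /16 beats the /8)
print(lookup("10.9.9.9")[1])           # eth1
print(lookup("8.8.8.8")[1])            # eth0 (default route)
print(lookup("8.8.8.8", ROUTES[:2]))   # None (no default route, so dropped)
```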

Route Summarization is a mechanism to minimize the number of entries in the Routing Table by advertising a single route that covers a whole range of addresses, instead of one entry per Host or per smaller Network.
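As a small illustration of Route Summarization, the standard `ipaddress.collapse_addresses` function merges adjacent networks into the smallest equivalent set of prefixes. The four /24 networks below are assumed example values.

```python
import ipaddress

# Four contiguous example networks: 192.168.0.0/24 through 192.168.3.0/24.
networks = [ipaddress.ip_network(f"192.168.{i}.0/24") for i in range(4)]

# Collapse them into the smallest set of covering prefixes.
summary = list(ipaddress.collapse_addresses(networks))

print(summary)  # [IPv4Network('192.168.0.0/22')]
```

A Router advertising the single /22 route conveys the same reachability as advertising all four /24 routes.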

Modem

A Modem (Modulator and Demodulator) is an electronic device that connects our Home/Office networks to the Internet Service Provider (ISP).

Hosts work with digital (binary) data, while transmission lines such as telephone and cable lines carry Analog signals.

The Modem acts as a mediator, converting Digital data to Analog signals for outgoing traffic and vice versa for incoming traffic.

In a typical setup, a Modem is connected to the ISP and a Router sits behind the Modem.

Nowadays, most modern Modems come with a built-in Router and Switch as part of the same hardware providing all-in-one functionality.

Channels

Data has to be propagated through various Channels before reaching the destination.

Following are the important types of Channels used in data transmission.

DSL Cable

Digital Subscriber Line (DSL) is a type of transmission line that transmits data over a telephone network through a telephone cable. DSL is mainly used for small to medium distances and can transmit data at up to 6.1 Mbps in its basic form, with newer variants going faster.

Coaxial Cable

Coaxial cable internal structure

Coaxial Cable is a type of transmission line that carries high-frequency electromagnetic signals with low losses. It is mainly used for small to medium distances and can transmit data at up to 10 Mbps in classic coaxial Ethernet, while modern cable broadband runs much faster over the same kind of cable.

Coaxial Cable includes 4 layers.

  1. An innermost conductor made of copper wire, which carries the electromagnetic signals.
  2. An Insulator surrounding the copper wire.
  3. A shield of braided Metal Mesh surrounding the Insulator, which blocks electromagnetic interference and prevents crosstalk.
  4. An outermost plastic sheath that provides overall protection.

Fiber Optic Cable

Optic Fiber internal structure

Fiber Optic Cable is a type of transmission line that carries pulses of light. It is mainly used for long distances and can transmit data at 10 Gbps and beyond.

Fiber Optic Cable consists of 5 layers.

  1. An innermost Core made of silica glass, which is responsible for transmitting the light. Each fiber is about as thin as a human hair, and a single cable can bundle many such fibers.
  2. A Cladding layer that surrounds the Core. It has a lower Refractive Index than the Core, which enables Total Internal Reflection.
  3. A plastic Coating layer that surrounds the Cladding, acting as a shock absorber and protecting against excessive cable bends.
  4. Strength Members that provide additional protection against pulling and stretching.
  5. An outermost Plastic jacket that protects against environmental hazards.

Total Internal Reflection

When light travels from one medium to another, its speed changes, and consequently it bends. This phenomenon is called Refraction.

How much a medium slows light down is expressed as its Refractive Index (RI), which is calculated as follows. The higher the Refractive Index, the slower the speed of light in the medium.

Refractive Index = Speed of Light in Vacuum / Speed of Light in Medium

Fiber Optic Cables use a phenomenon called Total Internal Reflection, in which light waves, instead of getting refracted into the second medium, get completely reflected back into the first medium.

This phenomenon occurs when light travels in a medium with a higher RI than the surrounding medium (the Core versus the Cladding) and strikes the boundary at an angle of incidence greater than a certain limiting angle called the Critical Angle.

This phenomenon of total reflection enables light pulses to be carried over long distances in Fiber Optic Cables.
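The Refractive Index formula and the Critical Angle can be worked through numerically. The sketch below relies on Snell's law, from which the Critical Angle satisfies sin(critical angle) = RI of Cladding / RI of Core. That relation is standard optics rather than something stated above, and the two Refractive Index values are assumed, illustrative numbers, not measurements of any particular cable.

```python
import math

SPEED_OF_LIGHT_VACUUM = 299_792_458  # metres per second


def speed_in_medium(refractive_index):
    """Rearranged from: RI = speed in vacuum / speed in medium."""
    return SPEED_OF_LIGHT_VACUUM / refractive_index


def critical_angle_degrees(n_core, n_cladding):
    """Smallest angle of incidence that gives Total Internal Reflection."""
    if n_cladding >= n_core:
        raise ValueError("the Core must have a higher RI than the Cladding")
    return math.degrees(math.asin(n_cladding / n_core))


# Assumed illustrative values for a glass Core and its Cladding.
N_CORE, N_CLADDING = 1.48, 1.46

print(f"speed in Core: {speed_in_medium(N_CORE):,.0f} m/s")
print(f"critical angle: {critical_angle_degrees(N_CORE, N_CLADDING):.1f} degrees")
```

With these assumed values the Critical Angle comes out a little above 80 degrees, so only light travelling almost parallel to the fiber axis stays trapped in the Core.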

OSI Model

The OSI (Open Systems Interconnection) Model is a framework that describes how Hosts communicate over a Network.

It divides the entire functionality into 7 layers, each with its own responsibility.

The Presentation (L6) and Session (L5) Layers are not essential to this discussion, and the distinction between them is often vague, so they are skipped here.

Application Layer (L7 Layer)

The Application Layer receives User input and converts it into binary data before handing it to the Transport Layer.

Transport Layer (L4 Layer)

The Transport Layer receives data from the L7 Layer, breaks it into smaller chunks, and adds a source and destination Port to each chunk to create Segments.

This Layer, also called End to End delivery, is responsible for delivering Segments from the Source process running on a particular port to the Destination process running on a different port.

The source process picks a random (ephemeral) port on which it listens for the response. The prominent protocols operating at this layer are TCP and UDP.

Network Layer (L3 Layer)

The Network Layer receives Segments from the L4 Layer and adds Source and Destination IP Addresses to them to form Packets.

This Layer, also called Host to Host delivery, is responsible for delivering Packets from the Source Host to the Destination Host.

The devices involved in the L3 Layer include Routers, Hosts, etc…

Data Link Layer (L2 Layer)

The Data Link Layer receives Packets from the L3 Layer and adds Source and Destination MAC Addresses to them to form Frames.

This Layer, also called Hop to Hop delivery, is responsible for delivering Frames from one node to the next node within the Network.

The source MAC Address is taken from the NIC of the Host, while the destination MAC Address of the next hop is resolved using Address Resolution Protocol.

The devices involved in the L2 layer include NIC, Switches, etc…

Physical Layer (L1 Layer)

The Physical Layer (L1 Layer) is the final layer. It receives Frames from the L2 Layer in the form of bits and converts them into Analog/Digital signals that are propagated through one of the Channels.

The Devices involved in the L1 Layer include the different types of Cables as well as Repeaters and Hubs.

The exact reverse happens on the receiver end: each layer strips the addressing added by its counterpart and hands the rest up to the layer above.
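The layering described above can be sketched as nested data structures. This is a toy model that captures only the addressing each layer adds, not real TCP, IP, or Ethernet headers. The class names, field names, chunk size, ports, and addresses are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:          # L4: adds Ports
    src_port: int
    dst_port: int
    payload: bytes


@dataclass
class Packet:           # L3: adds IP Addresses
    src_ip: str
    dst_ip: str
    segment: Segment


@dataclass
class Frame:            # L2: adds MAC Addresses
    src_mac: str
    dst_mac: str
    packet: Packet


def encapsulate(data, chunk_size, ports, ips, macs):
    """Sender side: split data into chunks and wrap each one L4 -> L3 -> L2."""
    frames = []
    for start in range(0, len(data), chunk_size):
        segment = Segment(ports[0], ports[1], data[start:start + chunk_size])
        packet = Packet(ips[0], ips[1], segment)
        frames.append(Frame(macs[0], macs[1], packet))
    return frames


def decapsulate(frames):
    """Receiver side: unwrap L2 -> L3 -> L4 and reassemble the data."""
    return b"".join(frame.packet.segment.payload for frame in frames)


request = b"GET / HTTP/1.1"
frames = encapsulate(
    request,
    chunk_size=5,
    ports=(51000, 80),
    ips=("192.0.2.10", "198.51.100.7"),
    macs=("aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02"),
)
print(len(frames))                      # 3
print(decapsulate(frames) == request)   # True
```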

Domain Name System

The Domain Name System (DNS) is a naming system for resolving Domain Names like www.google.com into corresponding IP Addresses.

When the browser needs to resolve a domain name, it first checks its Browser Cache for an IP Address corresponding to the domain. If no mapping is found, it looks in the OS Cache.

If the internal lookup fails, then the search proceeds to the Recursor Server.

The Recursor (recursive resolver) is usually operated by the ISP that brings the Internet to the Host. It acts as an orchestrator across multiple servers to resolve the IP Address, and it also maintains a local cache of resolved addresses.

If the Recursor cannot resolve the IP Address from its cache, it orchestrates across 3 servers.

  1. Root NameServer: This server acts as a reference for all top-level domains like .com, .net, etc... The Root NameServer refers the Recursor to the TLD NameServer.
  2. Top-Level Domain NameServer (TLD NameServer): This server acts as a reference for a particular top-level domain, and it refers the Recursor to the Authoritative NameServer that can resolve the final IP Address.
  3. Authoritative NameServer: This server holds the actual records for the domain and returns the IP Address to the Recursor.

On the way back, domain to IP Address mapping is cached at various hops for subsequent requests.
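The lookup chain can be simulated with plain dictionaries standing in for the three kinds of NameServers. This is a toy model, not a real DNS client: it performs no network queries, the server names are made up, and the address 192.0.2.10 comes from a range reserved for documentation. It assumes two-label domains such as example.com.

```python
# Hypothetical stand-ins for the NameServer hierarchy.
ROOT = {"com": "tld-com"}
TLD = {"tld-com": {"example.com": "ns-example"}}
AUTHORITATIVE = {"ns-example": {"www.example.com": "192.0.2.10"}}


class Recursor:
    def __init__(self):
        self.cache = {}
        self.upstream_queries = 0

    def resolve(self, name):
        # Serve from the local cache when possible.
        if name in self.cache:
            return self.cache[name]

        labels = name.split(".")
        top_level = labels[-1]            # "com"
        domain = ".".join(labels[-2:])    # "example.com"

        tld_server = ROOT[top_level]                  # query 1: Root
        auth_server = TLD[tld_server][domain]         # query 2: TLD
        address = AUTHORITATIVE[auth_server][name]    # query 3: Authoritative
        self.upstream_queries += 3

        self.cache[name] = address  # cached for subsequent requests
        return address


recursor = Recursor()
print(recursor.resolve("www.example.com"))  # 192.0.2.10
print(recursor.resolve("www.example.com"))  # 192.0.2.10 (from cache)
print(recursor.upstream_queries)            # 3
```

The second lookup never leaves the Recursor, which is why caching at each hop speeds up subsequent requests.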

Every Host needs an IP Address, a Subnet Mask (which defines the range of IP Addresses within the Network), a Default Gateway (the Router’s IP Address), and a DNS Server for Internet Connectivity.

These are provided by a protocol called DHCP (Dynamic Host Configuration Protocol). The Host broadcasts a Discover request, and the DHCP Server replies with an Offer containing these details. The Host then confirms with a Request, and the Server completes the exchange with an Acknowledgement.

In a home setup, the DHCP Server typically runs on the Router built into the Modem, which enables Network connectivity for the Host.

Address Resolution Protocol (ARP)

Address Resolution Protocol is used to resolve the MAC Address of the next hop within the same Network.

The next hop is either the destination Host itself or the Router acting as the Default Gateway, depending on whether the destination is in the same Network or in a foreign one.

In this example, Host 1 wants to send a packet to Host 2 within the same Network.

Host 1 knows the IP Address of Host 2 but doesn’t know its MAC Address. Host 1 uses ARP to resolve the MAC Address of Host 2.

The same procedure applies even if the destination is on a foreign network, in which case Host 1 tries to identify the MAC Address of the Router.

  1. Host 1 goes through the OSI Layers and constructs Packets in the L3 Layer.
  2. Host 1 initiates ARP to identify the MAC Address of Host 2. The ARP Request includes the source and destination IP Addresses and the source MAC Address, and it is addressed to the broadcast MAC Address.
  3. When the request reaches the Switch, the Switch sees the broadcast destination and Floods the ARP Request to all the Hosts in the Network.
  4. Host 2 recognizes its own IP Address in the ARP Request and sends an ARP Response with its MAC Address to the Switch, while the other Hosts ignore the request.
  5. The Switch learns the Port of Host 2 from the response and forwards the response to Host 1.
  6. Host 1 learns the MAC Address of Host 2 and updates its ARP Cache.
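The ARP exchange in the steps above can be sketched as a small simulation. It is a simplification under stated assumptions: the Switch is reduced to a loop that delivers the broadcast to every other Host, and the `Host` class, the `arp_resolve` function, and all addresses are hypothetical names and values chosen for illustration.

```python
class Host:
    def __init__(self, ip, mac):
        self.ip = ip
        self.mac = mac
        self.arp_cache = {}  # IP Address -> MAC Address

    def handle_arp_request(self, sender_ip, sender_mac, target_ip):
        """Reply with our MAC Address only if the request is for our IP."""
        if target_ip != self.ip:
            return None  # other Hosts ignore the request
        self.arp_cache[sender_ip] = sender_mac  # remember who asked
        return self.mac


def arp_resolve(sender, target_ip, network):
    """Return the MAC Address for target_ip, using the cache or a broadcast."""
    if target_ip in sender.arp_cache:
        return sender.arp_cache[target_ip]
    for host in network:  # stands in for the Switch flooding the broadcast
        if host is sender:
            continue
        reply = host.handle_arp_request(sender.ip, sender.mac, target_ip)
        if reply is not None:
            sender.arp_cache[target_ip] = reply
            return reply
    return None  # nobody on this Network owns target_ip


host_1 = Host("10.0.0.1", "aa:aa:aa:aa:aa:01")
host_2 = Host("10.0.0.2", "aa:aa:aa:aa:aa:02")
host_3 = Host("10.0.0.3", "aa:aa:aa:aa:aa:03")
network = [host_1, host_2, host_3]

print(arp_resolve(host_1, "10.0.0.2", network))  # aa:aa:aa:aa:aa:02
print(host_1.arp_cache)  # {'10.0.0.2': 'aa:aa:aa:aa:aa:02'}
```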

Border Gateway Protocol

Border Gateway Protocols among Autonomous Systems

Border Gateway Protocol (BGP) is the routing protocol at the heart of the Internet, designed to exchange routing and reachability information among the various Autonomous Systems on the Internet.

An Autonomous System (AS) is a set of Internet routable IP prefixes belonging to a network or a collection of networks that are all managed by a single organization. Ex: an ISP

IP Addresses are grouped into network prefixes and network prefixes are grouped into Autonomous Systems. An Autonomous System is typically owned by an ISP or another large organization, and the ISP brings Internet connectivity to our home.

Each Autonomous System has one or more Edge Routers installed at its boundary, responsible for communication between different Autonomous Systems.

Each Autonomous System has multiple Core Routers that act as the network backbone routing data packets within the network linking all the devices.

The Core Routers communicate with each other through various Interior Gateway Protocols like RIP, OSPF, IS-IS, etc… which are designed for Speed. Edge Routers communicate among Autonomous Systems through BGP, which is designed for Scale.

Edge Routers in an Autonomous System are configured with pre-defined neighboring Autonomous Systems, with which they establish a TCP connection to learn Direct and Indirect Paths and Reachability information of other Autonomous Systems.

Neighboring Autonomous Systems are also referred to as Peers.

BGP uses 4 message types to achieve connectivity with other Autonomous Systems.

  1. Open: Sent once the TCP connection with a Peer is established, to start the BGP session and agree on its parameters.
  2. Update: Exchanges routing information with Peers. It includes new Routes to be advertised, existing Routes to be withdrawn, and, for each Route, the path of Autonomous Systems it traverses along with other attributes, so that the receiving Autonomous System can apply its preferred Route selection mechanism.
  3. Notification: Sent by a BGP Peer when an error is detected in the session, after which the session is closed.
  4. Keep Alive: Sent periodically to Peers to confirm that the session is still alive.
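To give a feel for how path information from Update messages can drive Route selection, here is a deliberately simplified sketch. It assumes only two rules: a Route whose path already contains the local Autonomous System is rejected as a loop, and among the remaining Routes the one with the shortest path wins. Real BGP Route selection weighs several more attributes before and after path length. The function name, the AS numbers (taken from the private range), and the prefix are illustrative assumptions.

```python
def best_routes(local_as, advertisements):
    """Pick one path per prefix from (prefix, as_path) advertisements."""
    best = {}
    for prefix, as_path in advertisements:
        if local_as in as_path:
            continue  # the Route already passed through us: reject the loop
        current = best.get(prefix)
        if current is None or len(as_path) < len(current):
            best[prefix] = as_path  # a shorter AS path is preferred
    return best


# Hypothetical advertisements received by AS 65001 from three Peers.
advertisements = [
    ("203.0.113.0/24", [65002, 65003, 65010]),
    ("203.0.113.0/24", [65004, 65010]),
    ("203.0.113.0/24", [65005, 65001, 65010]),  # contains 65001: a loop
]

print(best_routes(65001, advertisements))
# {'203.0.113.0/24': [65004, 65010]}
```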

Connecting the dots

Here is the oversimplified view of the data flow on the Internet.

Host Connectivity

The user purchases a computer with NIC built-in which provides a MAC Address (aecd). Let’s call this computer Host 1.

The user also subscribes to an Internet Service Provider to bring Internet connectivity to the House. ISPs provide a Modem to bring WiFi into the House.

The Modem is connected to the ISP through a Coaxial or Fiber Optic Cable.

The Modem includes a built-in Router acting as a Default Gateway, a Switch to create an internal network, and a DHCP server that provides the Host with an IP Address on lease (172.12.3.0), a Subnet Mask (255.0.0.0, written as /8 in prefix notation), the Router’s IP Address (172.12.3.5), and a DNS Server Address.

DNS resolution

Now that Internet connectivity is established for Host 1, the user opens the browser and navigates to http://www.google.com

The first step in data transmission is the resolution of the domain name to the IP Address.

The Host checks its Browser Cache and OS Cache for a cached mapping. If the mapping is not found, the request is forwarded to the Recursor.

The Recursor (ISP) checks its own cache before coordinating between the Root, TLD, and Authoritative NameServers to determine the IP Address. The mapping is cached for subsequent requests.

At the end of this exercise, the Host is aware of the destination IP Address.

Input Construction

Once the destination IP Address is known, the next step is the construction of an input request. The user input goes through the OSI Layers before being transmitted over the wire.

The Input request to download the http://www.google.com web page is converted into Binary in the L7 Layer.

The L4 Layer receives this Data, breaks it into multiple smaller chunks, and adds a Source Port and a Destination Port to each chunk to form Segments.

The L3 Layer receives Segments from the L4 Layer and adds a Source IP Address and a Destination IP Address to each Segment to create Packets.

The L2 Layer receives Packets from the L3 Layer and uses ARP to resolve the destination MAC Address. Because the destination is outside the Network, this is the MAC Address of the Router, looked up by the Router’s IP Address. Then the L2 Layer adds a Source MAC Address and a Destination MAC Address to each Packet to create Frames.

The L1 Layer receives Frames and is responsible for transmitting the data over the wire.

Data Transmission

Each of the individual Frames is transmitted independently to the Router, which is the Default Gateway for the Network.

The Host decides whether the destination is within the Network or outside it by applying its Subnet Mask to its own IP Address and to the destination IP Address and checking whether both fall in the same range.
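The local-or-remote decision can be sketched with the standard `ipaddress` module. The addresses below, including the /24 prefix and the gateway, are illustrative assumptions rather than the values from the walkthrough, and `next_hop` is a hypothetical helper name.

```python
import ipaddress


def next_hop(host_ip, prefix_length, gateway_ip, destination_ip):
    """Return the IP Address whose MAC Address the Host must resolve via ARP."""
    # The Host's own Network, derived from its IP Address and Subnet Mask.
    local_network = ipaddress.ip_interface(f"{host_ip}/{prefix_length}").network
    if ipaddress.ip_address(destination_ip) in local_network:
        return destination_ip   # same Network: deliver directly
    return gateway_ip           # foreign Network: hand over to the Router


print(next_hop("192.168.1.20", 24, "192.168.1.1", "192.168.1.42"))  # 192.168.1.42
print(next_hop("192.168.1.20", 24, "192.168.1.1", "198.51.100.7"))  # 192.168.1.1
```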

Once the Frames reach the Router, the Packets they carry are handed to the Modem, which converts the binary data into electromagnetic signals and sends them over the Coaxial Cable to the ISP, which is an Autonomous System (AS).

Every Host and Device that receives a Frame deconstructs it, extracting the MAC Address, IP Address, and Data in the reverse order of the sender's OSI Stack.

Similarly, every Device that forwards the data constructs new Frames, encapsulating the Data, IP Address, and MAC Address just as the sender did. The MAC Addresses change at every hop, since they only identify the next node.

The Edge Router of the Autonomous System receives the data and uses the Routes learned through BGP to determine the next Autonomous System on the way to the destination.

The Autonomous System internally forwards the Packets through various Core Routers using Interior Gateway Protocols before handing them to the Edge Router that connects to the next Autonomous System.

The data is converted into light pulses if the next link is a Fiber Optic Cable.

This process repeats in every Autonomous System, and the data may cross Long Haul Networks (Switching stations) that connect countries and continents through Fiber Optic cables laid across ocean beds and land.

The data finally reaches the Datacenter, which fetches the web page for http://www.google.com and constructs the output response, similar to how Host 1 constructed the input request in the OSI Layers.

The output response is transmitted all the way back to the destined Host (Host 1).

Along the way, Repeaters at various Stations and along the Fiber Optic Cables amplify the signal to restore its strength.

This completes the oversimplified flow of Data from source to destination.
