
5. Performance Tuning and Troubleshooting

5.1. Tuning

OK, now we are up and running, and we want to be running at warp factor nine. No such thing as too fast, right?

Linux networking is pretty robust, even in a default installation with no "tuning". You may well not need to do anything else. But if your connection is not performing up to what you think it should be, then possibly there is a problem somewhere. Tracking that down may be a more worthwhile approach than the pursuit of any magical "tweak".

A very rough guideline on what you might reasonably expect as a maximum sync rate, based on distance from DSLAM/CO:

Distance from DSLAM/CO           Maximum sync rate
0-12 K ft  (0-3.7 km)            2000 Kbps or more (8100 max for ADSL)
12-16 K ft (3.7-4.9 km)          1500 Kbps to 1000 Kbps
16-18 K ft (4.9-5.5 km)          1200 Kbps to 512 Kbps
18-?? K ft (5.5-?? km)           512 Kbps to 128 Kbps or less :(

There are many conceivable factors that could affect this one way or the other. Newer generations of DSL will surely improve this, as will related technologies like repeaters.

You will lose 10-20% of the modem's attainable sync rate to networking overheads (TCP, ATM, ethernet). So a 1500 Kbps connection is only going to realize about 1200-1350 Kbps or so of real world throughput. No tweaking is going to change the built-in protocol overheads. Also, if your service is capped at a lesser speed by your provider, then you can't get above that speed no matter what. And remember that there are numerous variables that can affect your loop/signal quality, and subsequently your speed (aka sync rate). Some of these may be beyond your control.
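To put numbers on the 10-20% overhead rule of thumb, here is a minimal shell sketch. The function name usable_range is made up for illustration, and the 80% and 90% factors simply encode the rule of thumb. They are not measurements of any particular line.

```shell
# usable_range SYNC_KBPS
# Print the throughput range left after 10-20% protocol overhead.
# Assumption: overhead is a flat percentage of the sync rate.
usable_range() {
    low=$(( $1 * 80 / 100 ))     # worst case: 20% overhead
    high=$(( $1 * 90 / 100 ))    # best case: 10% overhead
    echo "${low}-${high} Kbps"
}

usable_range 1500    # prints "1200-1350 Kbps"
```

If your measured throughput lands inside this range, there is probably nothing left to tune.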

But there are a few things that you might want to look at.

5.1.1. TCP Receive Window

For many of us, a default Linux installation is going to give something close to optimum performance. Windows 9x users often get a big boost by increasing their TCP Receive Window (RWIN). But this is because it is too small to start with. This is just not the case with Linux where the default value is 32KB.

The exception here is if you routinely have to deal with a high latency connection. For instance, if your provider has a satellite uplink that is consistently adding unusual latency (250ms or greater?), then a larger TCP Window will likely help. For more on TCP Receive Window and related issues, look at http://www.psc.edu/networking/perf_tune.html.

The Receive Window is a buffer that helps control the flow of data. If set too low, it can be a bottleneck and restrict throughput. The optimum value for this depends completely on your bandwidth and latency, where latency is the average roundtrip time (RTT) to your typical destinations under typical conditions. This can be determined with ping. For example, the Linux default of 32KB is acceptable up to speeds of 2 Mbps and a typical latency of 125ms or so, or 1.0 Mbps and latency of 250ms. Setting this value too high can also adversely affect throughput, so don't overdo it.

An example courtesy of Juha Saarinen of New Zealand:

The commonly used formula for working out the tcp buffer is the "bandwidth delay product" one:

      Buffer size (bytes) = Bandwidth (bits/s) / 8 * RTT (seconds)

In my case, I have roughly 8Mbps downstream, but the ATM network can only support ~3.5Mbps sustained. I'm far away from the rest of the world, so to squeeze in a sufficient amount of 1,500 byte packets, with average RTTs of 250ms, I should probably have a buffer of (3,500,000/8)*.25 = 106KB. (I've got 128KB at the moment, which works fine.)
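The same arithmetic can be done in the shell. This is only a sketch: bdp_bytes is a hypothetical helper name, it takes the RTT in milliseconds so that integer math is enough, and the bandwidth you feed it should be the sustained rate you actually see, not the advertised one.

```shell
# bdp_bytes BITS_PER_SEC RTT_MS
# Bandwidth delay product in bytes:
#   bits/s divided by 8 gives bytes/s,
#   RTT in ms divided by 1000 gives seconds.
bdp_bytes() {
    echo $(( $1 / 8 * $2 / 1000 ))
}

bdp_bytes 3500000 250    # the example above: prints 109375 (about 106KB)
bdp_bytes 2000000 125    # prints 31250, just under the 32KB default
```

Take the RTT from the average reported by ping to the sites you typically use.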

The Receive Window can be dynamically set in the /proc filesystem. This requires entering a value that is twice the desired buffer size:


 # echo 262144 > /proc/sys/net/core/rmem_default
 # echo 262144 > /proc/sys/net/core/rmem_max

 

The above example actually sets the value to 128K. The Send Window can also be set, but is not as likely to be a limiting factor on DSL connections as the Receive Window:

 
 # echo 262144 > /proc/sys/net/core/wmem_default
 # echo 262144 > /proc/sys/net/core/wmem_max

 

These values can also be set using the sysctl command. See the man page.
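As a rough sketch of the sysctl route, the helper below prints (it does not run) the commands for a given window size, applying the "twice the desired buffer size" rule stated above. The name print_rmem_sysctl is made up for this example. "sysctl -w" itself is the standard way to write a kernel parameter.

```shell
# print_rmem_sysctl WINDOW_KB
# Print the sysctl commands for a receive window of WINDOW_KB kilobytes.
# The value written is doubled, per the rule described in the text.
print_rmem_sysctl() {
    value=$(( $1 * 1024 * 2 ))
    echo "sysctl -w net.core.rmem_default=$value"
    echo "sysctl -w net.core.rmem_max=$value"
}

print_rmem_sysctl 128    # prints the two commands with the value 262144
```

Run the printed commands as root. Settings made this way are lost on reboot. Most distributions read /etc/sysctl.conf at boot, so the same "name = value" pairs can be placed there to make them permanent.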

Other suggested kernel options for those who want to squeeze every last bit out of that copper (selected entries only):


 # sysctl -a 
 net.ipv4.tcp_rfc1337 = 1
 net.ipv4.ip_no_pmtu_disc = 0
 net.ipv4.tcp_sack = 1
 net.ipv4.tcp_fack = 1
 net.ipv4.tcp_window_scaling = 1
 net.ipv4.tcp_timestamps = 1
 net.ipv4.tcp_ecn = 0

 

A brief description of these, and other, options may be found in /usr/src/linux/Documentation/networking/ip-sysctl.txt, in the kernel source directory.

5.1.2. Interleaving

"Interleaving" is an error control mechanism of ADSL with DMT line encoding. DMT is now the standard for ADSL, and is by far and away the most prevalent form of ADSL. Interleaving buffers the raw data and corrects errors on the fly at the DSLAM. This can significantly help marginal loops that may be prone to line errors. The downside is that this buffering also adds significant latency to the connection. So for those with reasonable quality lines, interleaving is of no real benefit, and may actually add unnecessary latency.

Interleaving is an adjustable parameter and can be turned on or off by the telco. Many telcos seem to like to have this on by default, since it probably reduces tech support calls in those cases where it does help stabilize a line. But everyone else pays a price.

How to know if your line is interleaved or not, and how to change it? Good question. Generally speaking, if your first hop or two on a traceroute is less than 25ms or so, you can pretty much figure that interleaving is off. But there may be other factors such as how far away those hops actually are. Unless your modem accurately reports this, the only other real way to know is to talk to someone at the telco. This may prove easier said than done.
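The first hop rule of thumb can be sketched as a small filter for traceroute output. Both function names are hypothetical, the 25ms threshold is the rough figure from the text, and using the fastest of the probes for hop 1 is an assumption of this sketch. Treat the answer as a hint only, since the distance to that hop matters too.

```shell
# first_hop_min_ms: read traceroute output on stdin and print the
# fastest probe time (in ms) reported for hop 1.
first_hop_min_ms() {
    awk '$1 == "1" {
        min = -1
        for (i = 2; i <= NF; i++) {
            t = ""
            if ($i ~ /^[0-9.]+ms$/) {                 # old style: "14.86ms"
                t = $i
                sub(/ms$/, "", t)
            } else if ($i ~ /^[0-9.]+$/ && $(i + 1) == "ms") {
                t = $i                                # new style: "14.860 ms"
            }
            if (t != "" && (min < 0 || t + 0 < min))
                min = t + 0
        }
        if (min >= 0)
            print min
        exit
    }'
}

# interleave_guess: read traceroute output on stdin and print a guess.
interleave_guess() {
    ms=$(first_hop_min_ms)
    if [ -z "$ms" ]; then
        echo "unknown"
        return
    fi
    awk -v ms="$ms" 'BEGIN {
        if (ms < 25) print "probably off"
        else print "possibly on"
    }'
}
```

Usage would be something like "traceroute -n some.site | interleave_guess".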

"FastPath" DMT is synonymous with "interleaving off". Again, this only applies to ADSL/DMT.

5.1.3. TCP Bottlenecks

DSL connections may suffer performance degradations under certain circumstances. Thankfully, Linux has very robust and flexible networking tools to help us deal with these.

One such common situation is where traffic bottlenecks are created whenever data from a fast network segment hits a slower one, such as ethernet hitting a DSL modem/router. This can cause short term traffic backlogs, known as "queues", in the device. Queuing can result in degraded performance, particularly for interactive protocols (like telnet or ssh) and streaming protocols (like RealAudio), and increased latency for ICMP and other network protocols. This is most evident when the upstream link is saturated (since downstream data is queued at the ISP's end and we can't do as much about that). The queued traffic is processed such that lower volume traffic protocols (like ssh) often get drowned out, so to speak, by the higher volume, bulk traffic (like http or ftp), as there isn't any special prioritizing in default usage.

And if the upstream queuing, or other factors, causes enough of a delay, it can even decrease downstream bandwidth utilization by slowing the ACKnowledgements (which are heading upstream), that are required to keep a download moving at optimal rates. So it is possible that an upload can hurt a simultaneous download.

Such effects can be largely mitigated with Linux's built-in traffic shaping abilities. The user space tool for manipulating the kernel's advanced traffic routing features is iproute, sometimes packaged as iproute2. This includes various tools that can classify and prioritize traffic with a considerable degree of flexibility. It also requires various kernel config options to be turned on, and is fairly close to Black Magic ;-) The definitive document on this is the Advanced Routing and Traffic Control HOWTO (http://tldp.org/HOWTO/Adv-Routing-HOWTO.html). Pay particular attention to the "Cookbook" Section #15, especially #15.8, "The Ultimate Traffic Conditioner: Low Latency, Fast Up & Downloads". A great read!
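To give a feel for what such a setup looks like, here is a much reduced sketch using tc with the HTB and SFQ queueing disciplines. It is not the script from the HOWTO. The helper name shape_uplink_cmds, the 90% figure, the even split between the two classes and the choice of matching on the "minimize delay" TOS value are all assumptions made for illustration. The idea is to shape upstream traffic to slightly below the real uplink rate, so the queue builds in Linux (where we can prioritize) and not in the modem. The function only prints the commands, so it is safe to try.

```shell
# shape_uplink_cmds DEV UPLINK_KBIT
# Print (not run) tc commands that cap upstream traffic on DEV at 90%
# of UPLINK_KBIT and put interactive traffic in a higher priority class.
shape_uplink_cmds() {
    dev=$1
    rate=$(( $2 * 90 / 100 ))    # assumption: leave 10% headroom
    half=$(( rate / 2 ))         # guaranteed share for each class
    cat <<EOF
tc qdisc add dev $dev root handle 1: htb default 20
tc class add dev $dev parent 1: classid 1:1 htb rate ${rate}kbit
tc class add dev $dev parent 1:1 classid 1:10 htb rate ${half}kbit ceil ${rate}kbit prio 1
tc class add dev $dev parent 1:1 classid 1:20 htb rate ${half}kbit ceil ${rate}kbit prio 2
tc qdisc add dev $dev parent 1:10 handle 10: sfq perturb 10
tc qdisc add dev $dev parent 1:20 handle 20: sfq perturb 10
tc filter add dev $dev parent 1: protocol ip prio 10 u32 match ip tos 0x10 0xff flowid 1:10
EOF
}

shape_uplink_cmds ppp0 256
```

Class 1:10 gets packets marked "minimize delay" (interactive ssh and telnet normally set this), and everything else falls into the default class 1:20. Review the output, then run the commands as root. "tc qdisc del dev ppp0 root" removes the whole setup again.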

5.2. Installation Problems

Read this section if you have no sync at all or are completely unable to connect. See your modem's owner's manual for interpreting the modem's LEDs. (Many will show a solid red (or orange) light if not in sync.)

5.2.1. No sync

The modem sync LED has never been green.

  • If doing a self-install, the DSL jack may be wired wrong, or the splitter may be wired wrong. Also, the modem may be wired differently than standard telco devices. See above.

  • Is the modem Linux compatible? If ethernet interfaced, this should not be a problem. But PCI or USB modems may require drivers just to achieve sync. This could be a show stopper since many PCI and USB modems are not Linux compatible.

  • Call your provider and make sure the line was provisioned. It is always possible someone dropped the ball. They may even be able to run a remote test on your line just to verify. There is also a remote possibility that the DSLAM is down. They should know this as well.

  • Disconnect the modem power cord and disconnect the DSL cord from the wall jack. Plug it into the test jack inside the NID (outside phone box), and run an extension cord if necessary for power. Temporarily disconnect the wiring to the inside phone circuit. This should effectively bypass any inside wiring and environmental issues. The ethernet cable to the NIC does not need to be connected to run this test (true only for ethernet modems). The modem will sync fine without it. (Easier said than done, I know.) But if possible, move enough of your system where you can view the modem's diagnostics (if available) and get the sync rate. If this works, there is probably something wired incorrectly inside, or a short in a connection somewhere, or there is severe electrical interference on the DSL line. Double check the splitter and wall jack connections. If a splitterless installation, look for bad wiring, bad (e.g. corroded) connections on all jacks, bad splices, or defective microfilters!

    If no sync on the above test, either the line was not readied, the modem is defective, or the DSLAM is down. Note that PCI and USB modems will need to load drivers before syncing, and thus make this test a little more complicated.

  • If you installed microfilters, remove these temporarily and unplug all telco devices, such as fax machines, etc. Possibly a microfilter is defective and shorting out the line.

5.2.2. Network Card (NIC) Problems

Symptoms here are: NIC is not recognized, modules won't load, or ifconfig shows the interface is not up, or is generating lots of errors, etc.

  • Turn off Plug 'n Pray in BIOS. This may be labeled as "non-Microsoft OS" or similar. An occasional symptom here is that the NIC is assigned IRQ 0. Or there may be an error message like "resource temporarily unavailable".

  • Check for IRQ conflicts with cat /proc/interrupts. If the NIC is sharing an IRQ, try moving cards around in slots, or tinker with BIOS IRQ settings. If an ISA card, you may need to get the setup utility from the manufacturer and use it to set IRQ, etc. This may require booting to DOS. Modern systems should theoretically be able to handle IRQ sharing, so it is not necessarily a problem in and of itself, only if something is misbehaving.

  • Possibly the wrong module is being loaded. Look through the kernel source documentation in /usr/src/linux/Documentation/* for your card or chipset. Also look for comments and update information in /usr/src/linux/drivers/net/*.c for your respective chipset. It is worth noting that there is more than one module for some card types. This seems to be true of tulip and 3Com cards. Check boot messages or use lspci -v to see how the kernel is identifying your card. You can use insmod, rmmod, and modprobe to test different modules. See the respective man pages for more information.

  • Check the manufacturer's web site for Linux documentation. Or look at Donald Becker's informative site at http://www.scyld.com/network/.

  • Some Linux NIC drivers reportedly work better as non-modular. In other words, compile them into the kernel instead of as a module.

  • It is also possible that the card is bad, or the drivers just aren't up to snuff. Try another card. And you don't need an expensive, high quality card necessarily either.
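The IRQ check mentioned in the list above can be sketched as a small filter. The name irq_sharing is made up for this example, and it assumes the usual /proc/interrupts layout where devices sharing a line are listed on the same row, separated by commas.

```shell
# irq_sharing IFACE
# Read /proc/interrupts style input on stdin and report whether the
# IRQ line used by IFACE is shared with another device.
irq_sharing() {
    awk -v ifc="$1" '
        index($0, ifc) && $1 ~ /^[0-9]+:$/ {
            irq = $1
            sub(/:$/, "", irq)
            status = ($0 ~ /,/) ? "shared" : "not shared"
            print "IRQ " irq ": " status
        }'
}
```

Usage would be "irq_sharing eth0 < /proc/interrupts". No output at all means the interface has no interrupt registered, which usually points back at the driver not being loaded.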

5.2.3. IP Connection Problems

Read this section if you are sure the modem is syncing, the NIC is recognized and seems to be working properly, client software is installed and running without error, but the connection to the ISP fails. Verify the modem is indeed syncing by the LED(s). An IP connection failure may be evidenced by ifconfig not showing an active eth0 interface (or ppp0 for PPPoX), or pinging gateway and other destinations generates 'network unreachable' or similar errors.

  • Make sure you know which protocol your ISP is using. Are they using DHCP? PPPoX? It is critical that you have this right. You may have to ask tech support.

  • If you are using DHCP, does the ISP require MAC address authentication, and if so, do they have the right address? Did they or you typo it? If the ISP requires hostname authentication, is your DHCP client passing the required hostname? This is done with the "-h" command line option.

  • Look at /var/log/messages and see if any useful clues are there. Also, run tcpdump while trying to initiate the connection. tcpdump output is fairly cryptic, but you should be able to determine if there is any response at all.

  • If PPPoX, is the ISP using username as an id, or username@isp.com?

  • CHAP, PAP, or other? I would set up both CHAP and PAP (see man pppd) just to be safe.

  • Try pinging the default gateway's address. Get this with 'route -n'. If you can ping by IP address (e.g. 111.222.333.444), but not by hostname, then likely nameservers are not correctly set up in /etc/resolv.conf. Whether your connection client (e.g. PPPoE) sets this up automatically or not is configurable. And different distributions may have their own way of setting this up, so check their documentation first. In a pinch, just add them manually to /etc/resolv.conf. pppd also has the "usepeerdns" option that can be enabled.

  • For rp-pppoe, let the PPPoE client bring up the ethernet interface. Do not have it come up on boot. Make sure there is no existing default route before starting PPPoE. For rp-pppoe, David Skoll recommends that /etc/ppp/options be left empty.

  • If running a firewall (e.g. with ipchains), try temporarily taking it down. Possibly this is misconfigured, and not allowing packets through.

  • Roaring Penguin has a very nice debug output with all kinds of system info, and even tips for correcting problems. See the docs for turning this well-done feature on.

  • If the modem was purchased from a source other than your ISP, it may be the wrong kind of modem. SDSL needs an SDSL modem, for instance. Also, for ADSL there are CAP and DMT encodings, and these are incompatible with each other.

    The modem may need to be configured for your ISP's service. All modems have configurations for VCI, VPI, encapsulation, etc. Call tech support for this information. Modem configuration is usually done by either telnetting or web browsing to the modem's IP address.
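The gateway and nameserver checks from the list above can be strung together roughly as follows. Both function names are hypothetical, and the sketch assumes the net-tools "route -n" output format, where the default route is the row with destination 0.0.0.0 and a G flag. Keep in mind that some gateways do not answer pings at all, so a failure here is a clue and not proof.

```shell
# default_gw: read "route -n" output on stdin, print the default gateway.
default_gw() {
    awk '$1 == "0.0.0.0" && $4 ~ /G/ { print $2; exit }'
}

# check_ip_then_dns HOSTNAME
# Ping the gateway by address first, then a host by name, to tell an
# IP level failure from a name lookup failure.
check_ip_then_dns() {
    gw=$(route -n | default_gw)
    if [ -z "$gw" ]; then
        echo "no default route found"
        return 1
    fi
    if ! ping -c 3 -n "$gw" >/dev/null 2>&1; then
        echo "gateway $gw does not answer"
        return 1
    fi
    if ! ping -c 3 "$1" >/dev/null 2>&1; then
        echo "gateway $gw answers, but $1 does not: check /etc/resolv.conf"
        return 1
    fi
    echo "gateway and name lookup both look fine"
}
```

Call it with any hostname you expect to be reachable, e.g. "check_ip_then_dns www.tldp.org".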

5.3. Sync Problems

Read this section if you have had a working connection, but now have lost sync, are intermittently losing sync, your sync rate has dropped significantly, or are getting a "sync/no surf" condition. (Better quality modems will have a way to report sync rate, usually via telnet or a web browser interface. See the owner's manual.)

A loss of sync indicates a problem with the DSLAM, your line (inside or outside) or your modem. DSLAMs typically have "shelves" with "cards". Alcatel DSLAM cards, just for instance, have a capacity of four connections each. If the card goes bad, at most four customers are affected. The point being that sync loss outages can be very isolated, unlike network outages that tend to affect large numbers of users. Sync outages are a telco problem, not an ISP problem. If your service agreement is with the ISP, you will need to contact them, who will in turn contact the telco.

Degraded sync rates, and disruption of the DSL signal, can cause various problems. Obviously, you will never get your maximum throughput under these conditions. But, the symptoms are not always obvious as to whether the problem is on your end or the provider's.

For instance, a poor inside wire connection may result in retransmissions of packets that have been dropped. This can really reduce throughput and slow a connection down. It is tempting to think of packet loss as a traditional networking problem, but with DSL it is possible to be the result of a bad line, impaired signal, or even the modem itself.

Some things to try:

  • Power cycle the modem. Turn off the power button/switch, and physically unplug the cable to the wall jack for 30 seconds or so. Turn back on, and re-attach to the wall jack. This will force a resync. Unfortunately, the only way to power down a PCI modem is to reboot. This may fix a "sync/no surf" condition that is caused by the modem, and maybe other conditions too.

  • See the above section on moving the modem lock, stock and barrel to the NID and thus bypassing all inside wiring. If the situation is improved there, then the problem is inside somewhere. If not, it is a telco problem.

  • RFI Bear-hunt: The DSL signal is fragile. There are a number of things that can degrade it. RFI, or Radio Frequency Interference, from sources in and around the home/office is one common source of reduced signal strength, intermittent sync loss, low sync rates and high line error rates that can cause retransmissions and slow things down. DSL frequencies just happen to be in a range that is susceptible to many potential RFI sources. Our test tool here is simply a portable AM radio. Tune it to any channel where you can get clear reception -- it makes no difference where. The AM radio will pick up RFI that is in the same frequency range as the DSL signal. It will sound like "frying bacon" type static. Put it against your computer's power supply. You should hear some static. Move it away and the static should fade pretty quickly. This will give you an idea of what RFI sounds like. A decent quality power supply should produce only weak RFI -- probably not enough to cause a problem. Use the radio like a Geiger counter and move it around your modem and DSL line. If you hear static, follow it to the source. Things to be suspicious of: power supplies, transformers, ballasts, electric motors, dimmer switches, high intensity lighting. Moving the modem, or rerouting cables is sometimes enough. Keeping the line between the modem and the wall jack as short as possible is a good idea too.

  • Chronic sync problems are often due to a line problem somewhere. Sometimes it is something as simple as a bad splice or corroded jack, and easily remedied if it can be found. Most such conditions can be isolated by a good telco tech. Check with your provider, and politely harass them if you have to. If you get the run-around, ask to go over their heads.

  • If you are near the distance limits of DSL, and having off and on sync problems, try the "Homerun" installation. See above. This can be effective in improving marginal signal/sync conditions.

  • If using a surge protector, try it without the surge protector. Some may interfere with the DSL signal.

Another possibility is a nearby AM radio station, or a bandit ham radio operator, disrupting the DSL signal, since they operate in a similar frequency range. These may only cause problems at certain times of day, like when the station boosts its signal at night. A good telco DSL tech may be able to help minimize the impact of this. YMMV.

5.4. Network and Throughput Problems

Read this section if your connection is up, but you are having throughput problems. In other words, your speed isn't what it should be based on your bit rate plan, and your distance from the CO. "Network" here is the WAN -- the ISP's gateway and local subnet/backbone, etc. Remember that a marginal line can cause a reduced sync rate, and this will impact throughput. See above.

The two factors we will be looking for are "latency" and "packet loss". Both are pretty easy to track down with the standard networking tools ping and traceroute. If either of these occurs in our path, it will impact performance. Latency means "responsiveness" or "lag time". Actually what we are interested in is abnormally high latency, since there is always some latency. Packet loss is when a packet of data gets dropped somewhere along the way. TCP/IP will know it's been "lost", and there will be a retransmission of the lost data. Enough of this can really slow things down. Ideally packet loss should be 0%.

What we really need to be concerned about is that part of the WAN route that we routinely traverse. If you do a traceroute to several different sites, you will probably see that the first few "hops" tend to be the same. These are your ISP's local backbone, and your ISP's upstream provider's gateway. A problem with any of these will affect everywhere you go and everything you do.

We can start looking for packet loss and latency by pinging two or three different sites, hopefully in at least a couple of different directions. We will be looking for packet loss and/or unusually high latency.


 $ ping -c 12 -n www.tldp.org
 PING www.tldp.org (152.19.254.81) : 56(84) bytes of data.
 64 bytes from 152.19.254.81: icmp_seq=0 ttl=242 time=62.1 ms
 64 bytes from 152.19.254.81: icmp_seq=1 ttl=242 time=60.8 ms
 64 bytes from 152.19.254.81: icmp_seq=2 ttl=242 time=59.9 ms
 64 bytes from 152.19.254.81: icmp_seq=3 ttl=242 time=61.8 ms
 64 bytes from 152.19.254.81: icmp_seq=4 ttl=242 time=64.1 ms
 64 bytes from 152.19.254.81: icmp_seq=5 ttl=242 time=62.8 ms
 64 bytes from 152.19.254.81: icmp_seq=6 ttl=242 time=62.6 ms
 64 bytes from 152.19.254.81: icmp_seq=7 ttl=242 time=60.3 ms
 64 bytes from 152.19.254.81: icmp_seq=8 ttl=242 time=61.1 ms
 64 bytes from 152.19.254.81: icmp_seq=9 ttl=242 time=60.9 ms
 64 bytes from 152.19.254.81: icmp_seq=10 ttl=242 time=62.4 ms
 64 bytes from 152.19.254.81: icmp_seq=11 ttl=242 time=63.0 ms
 
 --- www.tldp.org ping statistics ---
 12 packets transmitted, 12 packets received, 0% packet loss
 round-trip min/avg/max = 59.9/61.8/64.1 ms

 

The above example is pretty normal from here. (You probably have a very different route to this site, and your results may thus be quite different.) Apparently there are no serious underlying problems that would slow me down. The example below reveals a problem:


 $ ping -c 20 -n www.debian.org
 
 PING www.debian.org (198.186.203.20) : 56(84) bytes of data.
 64 bytes from 198.186.203.20: icmp_seq=0 ttl=241 time=404.9 ms
 64 bytes from 198.186.203.20: icmp_seq=1 ttl=241 time=394.9 ms
 64 bytes from 198.186.203.20: icmp_seq=2 ttl=241 time=402.1 ms
 64 bytes from 198.186.203.20: icmp_seq=4 ttl=241 time=2870.3 ms
 64 bytes from 198.186.203.20: icmp_seq=7 ttl=241 time=126.9 ms
 64 bytes from 198.186.203.20: icmp_seq=12 ttl=241 time=88.3 ms
 64 bytes from 198.186.203.20: icmp_seq=13 ttl=241 time=87.9 ms
 64 bytes from 198.186.203.20: icmp_seq=14 ttl=241 time=87.7 ms
 64 bytes from 198.186.203.20: icmp_seq=15 ttl=241 time=85.0 ms
 64 bytes from 198.186.203.20: icmp_seq=16 ttl=241 time=84.5 ms
 64 bytes from 198.186.203.20: icmp_seq=17 ttl=241 time=90.7 ms
 64 bytes from 198.186.203.20: icmp_seq=18 ttl=241 time=87.3 ms
 64 bytes from 198.186.203.20: icmp_seq=19 ttl=241 time=87.6 ms
 
 --- www.debian.org ping statistics ---
 20 packets transmitted, 13 packets received, 35% packet loss
 round-trip min/avg/max = 84.5/376.7/2870.3 ms

 

High packet loss at 35%, and some really slow roundtrip times in there as well. A little digging on this showed that it was a backbone router 13 hops into the traceroute that was the problem. While making this site really slow from here, it would only affect those routes that happen to hit that same router. Now what would really hurt us is if something similar happens with a router that we tend to go through consistently. Like our gateway, or maybe the second hop router too. Find these with traceroute, by just picking a random site:


 $ traceroute www.bellsouth.net
 
 traceroute to bellsouth.net (192.223.22.134), 30 hops max, 38 byte packets
  1  adsl-78-196-1.sdf.bellsouth.net (216.78.196.1)  14.86ms  7.96ms 12.59ms
  2  205.152.133.65 (205.152.133.65)                  7.90ms  8.12ms  7.73ms
  3  205.152.133.248 (205.152.133.248)                8.99ms  8.52ms  8.17ms
  4  Hssi4-1-0.GW1.IND1.ALTER.NET (157.130.100.153)  11.36ms 11.48ms 11.72ms
  5  125.ATM3-0.XR2.CHI4.ALTER.NET (146.188.208.106) 14.46ms 14.23ms 14.40ms
  6  194.at-1-0-0.TR2.CHI2.ALTER.NET (152.63.65.66)  16.48ms 15.69ms 16.37ms
  7  126.at-5-1-0.TR2.ATL5.ALTER.NET (152.63.0.213)  65.66ms 66.18ms 66.39ms
  8  296.ATM6-0.XR2.ATL1.ALTER.NET (152.63.81.37)    66.86ms 66.42ms 66.40ms
  9  194.ATM8-0.GW1.ATL3.ALTER.NET (146.188.233.53)  67.87ms 68.69ms 69.63ms
 10  IMVI-gw.customer.ALTER.NET (157.130.69.202)     69.88ms 69.25ms 69.35ms
 11  www.bellsouth.net (192.223.22.134)              68.74ms 69.06ms 68.05ms

 

The first hop is the gateway. In fact, for me the first two hops are always the same, and the first three or four are often the same. So a problem with any of these may cause a problem anywhere I go. (The specifics of your own situation may be a little different than this example.) A "normal" gateway ping (normal for me!):

 
 $ ping -c 12 -n 216.78.196.1
 
 PING 216.78.196.1 (216.78.196.1) : 56(84) bytes of data.
 64 bytes from 216.78.196.1: icmp_seq=0 ttl=64 time=14.6 ms
 64 bytes from 216.78.196.1: icmp_seq=1 ttl=64 time=15.4 ms
 64 bytes from 216.78.196.1: icmp_seq=2 ttl=64 time=15.0 ms
 64 bytes from 216.78.196.1: icmp_seq=3 ttl=64 time=15.2 ms
 64 bytes from 216.78.196.1: icmp_seq=4 ttl=64 time=14.9 ms
 64 bytes from 216.78.196.1: icmp_seq=5 ttl=64 time=15.3 ms
 64 bytes from 216.78.196.1: icmp_seq=6 ttl=64 time=15.4 ms
 64 bytes from 216.78.196.1: icmp_seq=7 ttl=64 time=15.0 ms
 64 bytes from 216.78.196.1: icmp_seq=8 ttl=64 time=14.7 ms
 64 bytes from 216.78.196.1: icmp_seq=9 ttl=64 time=14.9 ms
 64 bytes from 216.78.196.1: icmp_seq=10 ttl=64 time=16.2 ms
 64 bytes from 216.78.196.1: icmp_seq=11 ttl=64 time=14.8 ms

 --- 216.78.196.1 ping statistics ---
 12 packets transmitted, 12 packets received, 0% packet loss
 round-trip min/avg/max = 14.6/15.1/16.2 ms

 

And a problem with the same gateway on a different day:


 $ ping  -c 12 -n 216.78.196.1
 
 PING 216.78.196.1 (216.78.196.1) : 56(84) bytes of data.
 64 bytes from 216.78.196.1: icmp_seq=0 ttl=64 time=20.5 ms
 64 bytes from 216.78.196.1: icmp_seq=3 ttl=64 time=22.0 ms
 64 bytes from 216.78.196.1: icmp_seq=4 ttl=64 time=21.8 ms
 64 bytes from 216.78.196.1: icmp_seq=6 ttl=64 time=32.0 ms
 64 bytes from 216.78.196.1: icmp_seq=8 ttl=64 time=21.7 ms
 64 bytes from 216.78.196.1: icmp_seq=9 ttl=64 time=42.0 ms
 64 bytes from 216.78.196.1: icmp_seq=10 ttl=64 time=26.8 ms
 
 --- adsl-78-196-1.sdf.bellsouth.net ping statistics ---
 12 packets transmitted, 7 packets received, 41% packet loss
 round-trip min/avg/max = 20.5/25.6/42.0 ms

 

41% packet loss is very high, to the point where many services, like HTTP, come to a screeching halt. Those services that were working, were working very, very slowly.
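When checking several sites in a row, it can be handy to pull just the loss figure out of ping's summary line. A minimal sketch, where ping_loss_pct is a hypothetical helper name and the only assumption is that the summary line contains the words "packet loss" next to a percentage, as in the examples above:

```shell
# ping_loss_pct: read ping output on stdin, print the packet loss
# percentage from the summary line (without the % sign).
ping_loss_pct() {
    awk '/packet loss/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9.]+%$/) {
                sub(/%$/, "", $i)
                print $i
                exit
            }
        }
    }'
}
```

Usage would be "ping -c 12 -n 216.78.196.1 | ping_loss_pct". Anything other than 0 on the first hop or two is worth a closer look.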

It's a little tempting on this last real-life example to think this gateway router is acting up. But, as it turned out, this was the result of a problem in the DSLAM/ATM segment of the telco's network. So any first hop problem with packet loss or high latency may actually be the result of something occurring before the first hop. We just don't have the tools to isolate where it is starting well enough. Packet loss can be a telco problem, just as much as an ISP/NSP problem.

It is also quite possible for the modem itself to cause packet loss. The fix here is to power cycle the modem, and resync by unplugging the DSL cable from the wall jack for 30 seconds or so. In fact, any part of the connection can be a source of packet loss -- modem, DSLAM, ATM network, etc.

If you do find a problem within your ISP's network, it's time to report the problem to tech support.

5.4.1. Miscellaneous Network Problems

Some odds and ends:

  • Some Web pages won't load. For PPPoX users, the MTU value could be too high. This will cause packet fragmentation, and misbehaving routers that do not follow the Path MTU Discovery specs will likely fail to route your requests. The correct ppp0 device setting should be a maximum of 1492, but actually it needs to be 8 bytes less than the smallest MTU of any router you pass through on the way to the site. If a router somewhere is misconfigured, you could have problems. Try experimenting with lower MTU values. Any LAN hosts behind the connection may need to be set even lower -- 1452 or maybe even 1412. If ECN is enabled, it might also cause this problem. Cured with "echo 0 > /proc/sys/net/ipv4/tcp_ecn".

  • Ping by IP address works, but not hostname. The nameservers are not being set up correctly in /etc/resolv.conf. Check your client's (DHCP, PPPoX) documentation or enter these manually with a text editor. Get the correct DNS server addresses from your ISP.

  • PPPoX disconnects. Unfortunately, PPPoX is more likely to drop connections than routed or bridged networks. PPP can be sensitive to any line condition which results in a temporary interruption of the connection. This may not be completely solvable, depending on what and where the problem is. Check your client's docs for "LCP Keepalive" features. There generally is a timeout on each end of the connection if the other end does not respond. If worst comes to worst, set up a cron job to watch the connection, and re-establish if necessary.

    Some providers may also be enforcing idle timeout disconnects. This is a different issue altogether, since it is deliberate. The solution here is to switch providers if you can.

  • Interface or route goes down for no reason. If ifconfig and/or route show the interface and/or route has automagically disappeared, it may be due to a buggy NIC driver.

  • Sub-par performance, or errors with the interface (e.g. eth0), may possibly be caused by a duplex mismatch. This would be most apparent when maxing out the connection. Most DSL modems and routers typically are set to half duplex, and your NIC that interfaces with the modem should be set likewise.
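For the MTU item in the list above, the numbers can be worked out and tested with ping. This is a sketch under a few assumptions: the helper names are made up, the "-M do" option (set the Don't Fragment bit) is specific to the Linux iputils ping, and the 28 bytes are the 20 byte IP header plus the 8 byte ICMP header that ping adds to its payload.

```shell
# pppoe_mtu ETHERNET_MTU: PPPoE uses 8 bytes of every ethernet frame.
pppoe_mtu() {
    echo $(( $1 - 8 ))
}

# ping_payload MTU: largest ping payload that fits in one packet of MTU bytes.
ping_payload() {
    echo $(( $1 - 28 ))
}

# mtu_probe_cmd HOST MTU: print a ping command that tests whether packets
# of MTU bytes reach HOST without being fragmented.
mtu_probe_cmd() {
    echo "ping -c 3 -M do -s $(ping_payload "$2") $1"
}

pppoe_mtu 1500                      # prints 1492
mtu_probe_cmd www.tldp.org 1492     # prints: ping -c 3 -M do -s 1464 www.tldp.org
```

If the printed ping command fails with a "message too long" style error while a smaller size gets through, the smaller size plus 28 is closer to the MTU you should be using.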

5.5. Measuring Throughput

One of the first things most of us do is check our speeds to make sure we aren't getting short changed, and that our system is up to snuff. Doing this accurately is easier said than done however. First, remember you are losing 10-20% right off the top due to networking protocol overhead. Just how much is "lost" here depends on your provider's network architecture, where and how you are measuring this and other considerations. Most of us may wind up being closer to 20% than 10%.

Then, any time you hit the Internet, there is some slight degradation of performance with each hop you take. Now this may not amount to much, as long as you are not taking too many hops and all the components -- your system, your ISP's network, your ISP's upstream provider, and the destination itself -- are all working like well oiled machines. But there's the rub -- how do you really know with so many variables in the mix? One flaky interface, on one router, on one hop along the path, may cause misleading results.

Your absolute max speed is going to be at your point of connection to your ISP -- the ISP's gateway. It can only go downhill from there, not up! So the ideal test is as close to home as possible. This eliminates as many unknown variables as possible. If your ISP has a local ftp server, this is an excellent place to run your own tests. (Run a traceroute though just to see how local it really is.)

If your ISP does not have this, look for an ftp site that is close -- the fewer the hops, the better. And look for one that isn't too busy, or you will get misleading results. Find a large file -- like 10 Megs -- and time the download. Try this over several days, and at different times of day. The server, and the backbone, are going to be busier at certain times of day, which can skew results and you want to eliminate these variables as much as possible. Your provider cannot compensate for heavy backbone traffic, backbone bottlenecks, slow or busy servers, etc.

There are many test sites scattered around the web. Some are better than others, but take these with a grain of salt. There are just too many variables for these tests to reliably give you an accurate snapshot of your connection and throughput. They may give you a general picture of whether you are in the ballpark of where you think you should be or not. One good speed test is http://www.dslreports.com/stest/0. Another test is http://speedtest.mybc.com/ (both are Java). I find these to be better than some of the others out there.

Now keeping in mind that we are limited by the ~10-20% networking overhead rule, here is an example. My speed is capped at 1472 Kbps sync rate. Minus the ~15% is about 1250 Kbps. My sync rate is known to be good and my distance to the CO is about 11,000 ft, which is close enough that I should be able to hit my real world maximum throughput of about 1250 Kbps, or roughly 1.2-1.3 Mbps -- all other things being equal. From dslreports.com speed test:


 Test running..Downloaded 60900bytes in 5918ms
 Downloaded 696000bytes in 4914ms
 First guess is 1133kbps
 fairly fast line - now test 2mb
 Downloaded 1679100bytes in 11090ms
 Upload got ok 1 bytes uploaded
 Uploaded 1bytes in 211ms
 Upload got ok 1 bytes uploaded
 Uploaded 1bytes in 205ms
 Upload got ok 1 bytes uploaded
 Uploaded 1bytes in 207ms
 Upload got ok 50000 bytes uploaded
 Uploaded 50000bytes in 2065ms
 Upload got ok 100000 bytes uploaded
 Uploaded 100000bytes in 3911ms
 
 ** Speed 1211(down)/215(up) kbps **
 (At least 24 times faster than a 56k modem)
 Finish.

 

1.211 Mbps is probably about as good as I can realistically expect based on my service. There is no reason for me to go troubleshooting or looking for tweaks.
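The speed figure is just bytes transferred divided by elapsed time, and the same arithmetic works for a timed ftp download. A minimal sketch, where kbps is a hypothetical helper name and the calculation ignores connection setup time and latency (which a test site may well subtract, so upload figures in particular can differ a little):

```shell
# kbps BYTES MILLISECONDS
# Throughput in kilobits per second: bits divided by milliseconds.
kbps() {
    echo $(( $1 * 8 / $2 ))
}

kbps 1679100 11090    # the 2mb download in the test above: prints 1211
```

For a timed download, "kbps 10000000 64000" (a 10,000,000 byte file in 64 seconds) prints 1250.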

Big Caution: my ISP uses a caching proxy server for web pages. This is a big equalizer for these kinds of web based tests. Without that, I surely would have been significantly slower on this test. The effect of the proxy is that you are actually testing throughput from the proxy -- NOT the test site. Just FYI. Another note: at the same time I tried another test site and was consistently getting 600-700 Kbps. So YMMV with these tests. (Usually I get the same on each, more or less.) Timing a large ftp download from two different sites, I calculated about 1.25 Mbps.