
Squid Proxy

5th Dec 2010, 15:38:59

By James Stocks

This is everything I know about Squid all in one place.


The last time I wrote about Squid -- over five years ago -- it was at version 2.5. Much has changed since then and my setup looks very different these days. Now that bandwidth is not nearly so scarce as it was in 2005, I don't use Squid to cache anything to disk.

Here's how I set up the perfect Squid install for my purposes:

My platform of choice is Debian Linux. For my proxy setup I am using 'Squeeze', since it includes Squid 3.1. Squid version 3.1 has many enhancements, but most important for me is the inclusion of IPv6 support. You could just as easily use 'Lenny', though its squid3 package is the older 3.0 release, which lacks the IPv6 support.
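
If you want to check which Squid version your release will give you before installing, apt can tell you:

# apt-cache policy squid3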

Install the squid3 package rather than squid, unless you know you need the older Squid version 2.7.

# apt-get install squid3

The default squid.conf is very well commented, but it is overkill for a simple and efficient setup. It can serve as a useful resource for looking up what certain configuration directives do though, so we'll move it sideways:

# cd /etc/squid3
# mv squid.conf dist-squid.conf

In my view, this is the absolute minimal working Squid configuration one can have:

acl manager proto cache_object
acl localhost src 127.0.0.1/32
acl to_localhost dst 127.0.0.0/8
# These are our local networks which will have permission to access the cache
acl localnets src 172.16.0.0/24 2001:470:903f::/64
acl SSL_ports port 443
acl Safe_ports port 80		# http
acl Safe_ports port 21		# ftp
acl Safe_ports port 443		# https
acl Safe_ports port 70		# gopher
acl Safe_ports port 210		# wais
acl Safe_ports port 1025-65535	# unregistered ports
acl Safe_ports port 280		# http-mgmt
acl Safe_ports port 488		# gss-http
acl Safe_ports port 591		# filemaker
acl Safe_ports port 777		# multiling http
acl CONNECT method CONNECT
http_access allow manager localhost
http_access deny manager
http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
http_access allow localnets
http_access allow localhost
http_access deny all
http_port 3128
# Defaults to off because it interferes with bandwidth management and access logging.
# If access logging or traffic shaping like delay pools are needed, leave this off!
pipeline_prefetch on
coredump_dir /var/spool/squid3
# Prevent stale data being served from cgi scripts
# (probably does nothing in my setup because I don't cache, but can't hurt)
hierarchy_stoplist cgi-bin ?
refresh_pattern ^ftp:		1440	20%	10080
refresh_pattern ^gopher:	1440	0%	1440
refresh_pattern -i (/cgi-bin/|\?) 0	0%	0
refresh_pattern .		0	20%	4320
# I don't do any access logging for privacy and security reasons
access_log none
# Only needed for troubleshooting disk cache problems
cache_store_log none
cache_mgr trouble@toastputer.net
# default is 256 MB.  Controls amount of RAM to use as cache, not overall limit!
cache_mem 96 MB
visible_hostname proxy1.spruce.toastputer.net
# direct-site contains sites which don't seem to play nicely and I can't be bothered to fix
acl direct-site dstdomain .facebook.com
always_direct allow direct-site
# The following headers are useful for troubleshooting faults, but are really more of a risk to 
# privacy in my environment, so they are disabled
request_header_access Via deny all
request_header_access X-Forwarded-For deny all
request_header_access Proxy-Connection deny all

Using the above config, I have squid running comfortably in 256MB of RAM in a Xen paravirtualised virtual machine. If you just want a minimal Squid proxy, you can stop here.
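
Before pointing any clients at it, check that the config parses and that requests actually pass through. A quick sanity check (the hostname is from my setup, so substitute your own):

# squid3 -k parse
# /etc/init.d/squid3 restart
$ http_proxy=http://proxy1.spruce.toastputer.net:3128 wget -q -O /dev/null http://www.debian.org/ && echo OK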

Blocking Advertisements or Other Content

This is pretty easy and it doesn't even require a redirector script like adzapper any more. I just use the list at pgl.yoyo.org, since this blocks the most obnoxious adverts effectively enough for me.

I use the following script to fetch the list:

#!/bin/sh
# Fetch the list
/usr/bin/wget -O /etc/squid3/yoyo \
'http://pgl.yoyo.org/adservers/serverlist.php?hostformat=squid-dstdom-regex&showintro=0&mimetype=plaintext' \
|| { echo "wget failed"; exit 1; }

# Reload squid so it picks up the new list
squid3 -k reconfigure
exit 0
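
The crontab entry below assumes the script is saved as /usr/local/bin/getyoyolist; put it there, make it executable and run it once by hand to fetch the initial copy of the list:

# cp getyoyolist /usr/local/bin/getyoyolist
# chmod 755 /usr/local/bin/getyoyolist
# /usr/local/bin/getyoyolist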

We don't want to abuse the free service that this nice gentleman offers, so I have set a crontab entry to check for a new version once every eight days.

# m h  dom mon dow   command
00 04  */8  *   *    /usr/local/bin/getyoyolist > /dev/null 2>&1

Once the script has run, these lines can be added to squid.conf so that squid will use the yoyo blacklist.

http_port 8080
acl ads dstdom_regex "/etc/squid3/yoyo"
acl ad-filtered myport 3128
# Block ads for requests to dstdomains in 'ads' AND where the user is on port 3128.
# The 'ads' acl must be last so that it is the acl picked up by deny_info later.
http_access deny ad-filtered ads
# Where a request is blocked due to 'ads' acl, return an empty file not an error
deny_info http://adzapper.toastputer.net/zaps/empty ads

Now your Squid offers a filtered service on port 3128 and an unfiltered service on port 8080. I have set Squid to serve up an empty file in place of the adverts; whilst you're welcome to use mine, you should really point deny_info at a web server you control. If deny_info is not set, Squid will return an error page instead of the empty file, which may be desirable for troubleshooting when you need to confirm that an object is indeed being blocked.
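
To satisfy yourself that the filter is doing its job, request a known ad server through each port. ad.doubleclick.net has long appeared in the yoyo list, but check your copy of /etc/squid3/yoyo if in doubt:

$ http_proxy=http://proxy1.spruce.toastputer.net:3128 wget -q -O - http://ad.doubleclick.net/ | wc -c
$ http_proxy=http://proxy1.spruce.toastputer.net:8080 wget -q -O - http://ad.doubleclick.net/ | wc -c

The first request should come back as the empty file (zero bytes); the second, being unfiltered, should return the real page.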

Whilst this approach can be extended to block any content you wish simply by adding more ACLs, I recommend that you look at the following two products if your needs are more complex:

Both of these approaches will be slower and require more system resources than plain old Squid.

Logging

Don't do any logging unless you really need to or you are prepared to accept the performance penalty. You must also turn off pipeline_prefetch, since it is incompatible with access logging.

##is incompatible with access logging:
#pipeline_prefetch on 

#access_log none
cache_log /var/log/squid3/cache.log
access_log /var/log/squid3/access.log
cache_store_log none
#This can help troubleshooting, but leave commented out for production use - it degrades performance
#cache_store_log /var/log/squid3/store.log
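
If you do enable logging, make sure the logs are rotated. Debian's squid3 package normally ships a rotation job, but Squid can also rotate its own logs: the logfile_rotate directive sets how many old logs to keep, and a cron entry does the rest.

# in squid.conf - keep ten old log files:
logfile_rotate 10

# in crontab - rotate the logs nightly:
30 03 * * * /usr/sbin/squid3 -k rotate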

Caching

Consider carefully whether you really want to have a disk cache. The hit rate is very low (in my experience only about 3% of requests are served from the cache). Each object held in the cache requires a certain amount of RAM so that Squid can keep track of it; a large disk cache therefore either ties up a lot of RAM, or incurs a massive performance penalty if the server begins to hit swap space.

My Squid setup is configured to cache only in RAM. This means that the 'hottest' objects will be served quickly, but Squid doesn't eat through huge amounts of RAM trying to keep track of a large disk cache.

That said, if you have a large number of users who frequently request the same content, or you are so bandwidth limited that 3% is a big deal to you, of course you can cache. We must start with some tedious but important planning.

Firstly, we need to establish how much RAM Squid will require. On 64-bit architectures, Squid will use about 14MB of RAM per 1GB of disk cache. In this example, I'm using a 120GB partition, so I know that Squid will need about 1.6GB of RAM purely to keep track of its own cache. My server will have 4GB of RAM, so I know that I can spare this amount. Otherwise, I would need to reduce the size of my cache to match the available memory.
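
As a quick sanity check on that rule of thumb:

$ echo $((120 * 14))    # 120GB of cache * 14MB per GB = 1680MB, i.e. roughly 1.6GB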

A Squid cache is divided up into first level and second level directories. This is necessary because it would take Squid far too long to locate the files it needs if they were all in the same directory. So, the second consideration is to calculate how many level 1 directories are needed for our 120GB partition (the level 2 count can stay at the default of 256) using this formula:

(((x / y) / 256) / 256) * 2 = z

Let x be the size of the cache in kB. Let y be the average size of objects in the cache in kB (if you don't know this value, 13kB is considered to be a reasonable choice). z will equal the number of level 1 directories required.

Squid gets extremely upset if it runs out of space in its cache_dir, so I am going to leave plenty of headroom here! For starters, my '120GB' disk is actually more like 111GB when measured in base-2 rather than the base-10 manufacturers use. Squid will need some space to write swap and other temporary files, so I am going to allocate only 100GB, leaving 11GB free for these purposes. (100 * 1024) * 1024 = 104857600kB, so:

(((104857600 / 13) / 256) / 256) * 2 = 246.153846

At long last we have the proper values to plug into our cache_dir directive:

#             location          size in MB  L1  L2 
cache_dir ufs /var/spool/squid3 102400      247 256
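
Having added or changed a cache_dir line, Squid needs to create the directory structure before it can use it. Stop Squid first (paths are as per the Debian squid3 package):

# /etc/init.d/squid3 stop
# squid3 -z
# /etc/init.d/squid3 start

squid3 -z builds the level 1 and level 2 directory tree under /var/spool/squid3; it can take a little while on a cache this size.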

By default, Squid will only cache files 4MB or smaller. This is a good optimisation for performance, but bad if you are looking to save bandwidth. Squid can be instructed to cache more aggressively, for example:

# default is 4096kB
maximum_object_size 1 GB

# tarballs tend not to change without their filename changing to a different version number:
refresh_pattern -i \.gz$ 4320 100% 43200 reload-into-ims 
refresh_pattern -i \.bz2$ 4320 100% 43200 reload-into-ims
refresh_pattern -i \.dmg$ 4320 100% 43200 reload-into-ims
refresh_pattern -i \.bin$ 4320 100% 43200 reload-into-ims

# cache Windows updates for your Windows users:
refresh_pattern -i windowsupdate.com/.*\.(cab|exe) 4320 100% 43200 reload-into-ims
refresh_pattern -i download.microsoft.com/.*\.(cab|exe) 4320 100% 43200 reload-into-ims
refresh_pattern -i uk.download.windowsupdate.com/.*\.(cab|exe) 4320 100% 43200 reload-into-ims

# AVG updates:
refresh_pattern guru.avg.com/.*\.(bin) 4320 100% 43200 reload-into-ims

Bandwidth Restriction

Squid has a method of preventing a single user or small group of users from hogging all the bandwidth, or indeed to prevent your web users as a whole from swamping your Internet link. The feature is called 'delay pools'.

Important: Note carefully the difference between 'b' (one bit) and 'B' (one byte/eight bits). Squid uses only B (bytes) per second, whereas Internet links are normally talked about in terms of bits (b) per second. Things will get confusing very quickly if you mix them up!

If you want to impose an overall limit on Squid's bandwidth of, say, 6Mbps then this can be done very simply:

##is incompatible with delay pools:
#pipeline_prefetch on 
delay_pools 1
delay_class 1 1
delay_access 1 allow all
# 6Mbps is roughly 768,000 bytes per second (768kB/s)
delay_parameters 1 768000/768000

This is fine, but it will only limit bandwidth in a simplistic way. It's still possible for one user to hog all of that bandwidth to the detriment of other users. It's possible to prevent this, but it's necessary to have a more detailed knowledge of how Squid deals with bandwidth.

In the example below, the overall bandwidth available to Squid comes from a delay pool which holds 200MB. This pool refills at a rate of 20Mbps. This means that our users as a whole may download 200MB at a rate in excess of 20Mbps before any bandwidth controls activate. This helps Squid to respond to short spikes in demand of the sort that can occur after a network outage or similar event.

Each of our users has a bandwidth bucket with which they may dip into the pool. Each bandwidth bucket holds 20MB. An individual user can download a 20MB file at unrestricted speed, provided that there is sufficient bandwidth left in the delay pool. After this 20MB bucket is exhausted, or the delay pool becomes empty, the user will be limited to 2Mbps.

The result for the end user is that small file downloads will be very fast, so normal web browsing will be very responsive. Those who download large files all day will find their connection rate limited so that they won't be able to impinge on other users' bandwidth.

Here's how it looks in the Squid config after all bits have been converted to bytes:

##is incompatible with delay pools:
#pipeline_prefetch on 

# I have one delay pool
delay_pools 1

# It is a class two pool, designed for a class C (/24) network
delay_class 1 2

# Limits are expressed as: pool number, overall limits (fill rate/capacity),
# per host limits (fill/cap):
# Pool 1 fills at 20Mbps and holds 200MB.  Each host bucket fills at 2Mbps and holds 20MB.
#                        pool            bucket
delay_parameters 1 2621440/209715200 262144/20971520

# Pool 1 applies to all requests
delay_access 1 allow all
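
If you'd rather not take my arithmetic on trust, the conversions are easy to reproduce in a shell:

$ echo $((20 * 1024 * 1024 / 8))   # 20Mbps pool fill rate  = 2621440 bytes/sec
$ echo $((200 * 1024 * 1024))      # 200MB pool capacity    = 209715200 bytes
$ echo $((2 * 1024 * 1024 / 8))    # 2Mbps bucket fill rate = 262144 bytes/sec
$ echo $((20 * 1024 * 1024))       # 20MB bucket capacity   = 20971520 bytes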

I've chosen these numbers mainly because the maths is easy. An element of trial and error will be needed to make this work for you.

More Than One Squid

If your Squid proxy stops for any reason, you're likely to have lots of users complaining. You can guard against this by having multiple servers running Squid and using DNS to round robin between them. But what about the cache? If we don't tell each squid about the other, each will end up maintaining independent but similar caches. Squid has a mechanism to deal with this. Here's an example of how it would be configured on two Squids, proxy1 and proxy2.

On proxy1:

# Make squid listen for HTCP requests:
htcp_port 4827

# Tell it about the other Squid:
# proxy-only tells squid not to cache stuff it requests from this peer - that would be pointless
cache_peer proxy2.spruce.toastputer.net sibling 3128 4827 proxy-only htcp

# The other squid should only access stuff we have cached to avoid 'tromboning'.
acl othersquid src 172.16.0.8/32
miss_access deny othersquid

On proxy2:

# Make squid listen for HTCP requests:
htcp_port 4827

# Tell it about the other Squid:
# proxy-only tells squid not to cache stuff it requests from this peer - that would be pointless
cache_peer proxy1.spruce.toastputer.net sibling 3128 4827 proxy-only htcp

# The other squid should only access stuff we have cached to avoid 'tromboning'.
acl othersquid src 172.16.0.7/32
miss_access deny othersquid
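
As for the DNS round robin itself, the simplest arrangement is two A records sharing one name, so that clients are configured with a single proxy hostname. A sketch in BIND zone-file syntax, using the addresses from the miss_access examples above:

; clients use proxy.spruce.toastputer.net:3128
proxy   IN  A   172.16.0.7   ; proxy1
proxy   IN  A   172.16.0.8   ; proxy2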
