s3cmd sync enhancements and call for help

Coming soon, Fedora and EPEL users with virtual machines in Amazon (US East for starters) will have super-fast updates.  I’ve been hacking away in Fedora Infrastructure and the Fedora Cloud SIG to place a mirror in Amazon S3.  A little more testing, and I’ll flip the switch in MirrorManager, and all Amazon EC2 US East users will be automatically directed to the S3 mirror first.  Yea!  Once that looks good, if there’s enough demand, we can put mirrors in other regions too.

I hadn’t done a lot of uploading into S3 before.  It seems the common tool people use is s3cmd.  I like to think of ‘s3cmd sync’ as a replacement for rsync.  It’s not – but with a few patches, and your help, I think it can be made more usable.  So far I’ve patched in –exclude-from so that it doesn’t walk the entire local file system only to later prune and exclude files – a speedup of over 20x in the Fedora case.  I added a –delete-after option, because there’s no reason to delete files early in the case of S3 – you’ve got virtually unlimited storage.  And I added a –delay-updates option, to minimize the amount of time the S3 mirror yum repositories are in an inconsistent state (now down to a few seconds, and could be even better).  I’m waiting on upstream to accept/reject/modify my patches, but Fedora Infrastructure is using my enhancements in the meantime.

One feature I’d really like to see added is to honor hardlinks.  Fedora extensively uses hardlinks to cut down on the number of files, amount of storage, and time needed to upload content.  Some files in the Fedora tree have 6 hardlinks, and over three quarters of the files have at least one hardlink sibling.  Unfortunately, S3 doesn’t natively understand anything about hardlinks.  Lacking that support, I expect that S3 COPY commands would be the best way to go about duplicating the effect of hardlinks (reduced file upload time), even if we don’t get all the benefits.  However, I don’t have a lot more time available in the next few weeks to create such a patch myself – hence my lazyweb plea for help.  If this sounds like something you’d like to take on, please do!

MirrorManager automatic local mirror selection

MirrorManager 1.3.2 (plus a hotfix) is now running on all Fedora Infrastructure application servers.  This brings one new interesting feature – automatic mirror detection.  How’s that you say?

As you know, Internet routing uses BGP (Border Gateway Protocol), and Autonomous System Numbers (ASNs) to exchange IP prefixes (aa.bb.cc.dd/nn) and routing tables.  By grabbing a copy of the global BGP table a few times a day, MM can know the ASN of an incoming client request, and Hosts in the MM database have grown two new fields: ASN and “ASN Clients?”.  MM then looks to see if there is a mirror with the same ASN as each client, and offers it up earlier in the list.

I’ve pre-populated the MM database, for public servers only, with ASNs, and set “ASN Clients?” = True, meaning such will offer to serve all clients on the same ASN.  If you have a private server and wish to do likewise (remember, this doesn’t work for home systems or those behind NATs), you can fill in those fields yourself.  The Fedora wiki page on mirroring gives an example on how to look up your ASN.  I recommend this for all schools, research organizations, companies, and ISPs.

The mirrorlist lookup code now goes in preferential order:

  • same netblock
  • same ASN
  • both on Internet2
  • same country
  • same continent
  • global

For ISPs and schools, this should mean that most of the possible Fedora traffic will stay within your network – no transit costs.  And as netblocks change, MM will keep up with them automatically.

To see this in action, try a query as such, and look for the ‘Using ASN ####’ in the result comment line.

$ wget -O – ‘http://mirrors.fedoraproject.org/mirrorlist?repo=fedora-11&arch=i386′

# Using preferred netblock Using ASN XXXX country = US country = MX country = CA
your-local-mirror-here

I hope you enjoy this new feature.