maxmouchet a day ago

Another solution is to use an MMDB (“MaxMind DB”) file [1] which is essentially a binary tree + deduplicated values (same as idea 3.1).

There are several free ASN MMDBs [2,3] but you can also build your own MMDB files from any Prefix->Value mapping with the mmdbwriter library [4] or a CLI tool built on top of it like mmdbctl [5].

Assuming the ASN MMDB is fully loaded in memory, it would use around 60MB.

[1] https://maxmind.github.io/MaxMind-DB/

[2] https://dev.maxmind.com/geoip/docs/databases/asn/

[3] https://ipinfo.io/products/free-ip-data-downloads

[4] https://github.com/maxmind/mmdbwriter

[5] https://github.com/ipinfo/mmdbctl

(I work for IPinfo, but there are lots of other companies offering MMDB files).

tptacek 7 hours ago

Hey, Julia! I know this isn't really material, but: `hallpass`, our SSH server, probably shouldn't be using 40MB. We had a bug in our `init` that we fixed recently: `hallpass` is a Go program, and we were passing user environment variables to it, so users who were passing Go memory settings were also passing settings through to `hallpass`.

If you're still seeing `hallpass` using that much memory, let me know! We'll figure out what's going on.

zamadatix a day ago

If you want to get fun with it you can actually fit the entire IP range in a single 32 bit value. On the internet networks larger than /24s aren't considered valid advertisements so you only need to store 3 bytes (the first 24 bits) of the address to hold the start. If you store than in a uint32 you can use the remaining 8 bits to hold the netmask (you don't actually need all of them but it keeps things byte aligned), which lets you dynamically calculate the end address in just a few CPU cycles. The search still occurs on the base address portion so you don't ever have to calculate the end address until you've already found the right base address.

  • riobard 20 hours ago

    > fit the entire IP range in a single 32 bit value

    … only if you pretend IPv6 does not exist, which isn't the case in the original blogpost.

    But the idea can be extended to fit entire IP (both v4 and v6) subnets in an uint64, given that the minimal IPv6 prefix size to be globally routed is /48, you could use the first byte of the uint64 to mark the address type (e.g. 0x4 for IPv4 and 0x6 for IPv6), the remaining 3 (for IPv4) or 6 (for IPv6) bytes for actual prefix, and the last byte for prefix length.

  • teddyh 20 hours ago

    > On the internet networks larger than /24s aren't considered valid advertisements

    The number 24 is larger, but the network is smaller.

    • zamadatix 20 hours ago

      Good catch, thanks! This should have been "netmasks" instead of "networks" but it's too late for me to edit now.

      • AStonesThrow 11 hours ago

        A netmask is different and deprecated terminology.

        The "/24" you refer to is a "prefix length", i.e. a network prefix being advertised by BGP.

  • blibble 14 hours ago

    for ipv4 you can do with with just a single lookup, so real O(1)

    store the entire thing in one big array, indexed by the uint24 value of the /24

    so ASN for 0.0.0.0 stored in position 0, ASN for 0.0.1.0 stored in position 1, etc

    and you can drop 0.0.0.0/8, 10.0.0.0/8, 127.0.0.0/8, 224.0.0.0/4, 240.0.0.0/24 to save 35 /8s

    (256-16-16-3)256256*4 bytes needed -> 55mb

    mmap it if necessary

    • zamadatix 2 hours ago

      If you do it this way I think you lose all of the prefix size info I.e. every match will claim to be part of a /24 match. Depends if you care showing that portion or not if that matters though.

vlovich123 a day ago

Since this is fairly static data, preprocess the on-disk format to match your memory layout and just mmap the file so that your memory usage for the DB is effectively 0 and the OS kernel pages it in/out for you based on system-level needs?

wvh 3 hours ago

SQL is not going to be of much help when the dataset easily fits into memory, but I'm a bit surprised by the lack of performance of the trie. I guess the overhead is too much when the dataset has a lot of single IP addresses (only one byte difference), but I imagine it would be helpful for sparse IPv6 addresses.

Go has an excellent standard library, but the solutions in there rarely compete with others writing a dedicated library to solve a hard problem they really had to solve.

zamadatix a day ago

> As an aside – I’m storing the ASN in a uint32, is that right? I looked in the ip2asn file and the biggest one seems to be 401307, though there are a few lines that say 4294901931 which is much bigger and would not fit in a uint32. I decided to just ignore the gigantic ASNs and assume I could fit everything in a uint32.

Does anyone have a theory to how that 4294967296 might have made its way into the data? I'm getting 0 matches on the actual internet with:

    birdc show route protocol peer4 | grep 4294967296 -c
And ASNs are certainly not >32 bit.

Edit: actually, I'm not seeing it in the linked iptoasn data either?

  • maxmouchet 20 hours ago

    Bogus announcements are probably filtered by your upstream(s) (see [1] for a common list of filters).

    IP-to-ASN mappings are typically built from route collectors [2,3] that peer with various networks and receive their announcements. AFAIK route collectors don't filter anything and it's easy to find bogus announcements (e.g. private ASNs) in the data.

    I can't find 4294967296 from a quick glance at the latest RouteViews data but I can find other private ASNs. For example AS7594 - AS2764 - AS4294901866 for 210.10.189.0/24 seen by the route-views.perth collector.

    I don't know what kind of filtering iptoasn.com is doing but at work (ipinfo.io) we do filter bogus origins, as well as a bunch of other things like RPKI/IRR-invalid routes and hyper-specific prefixes (> /24 or /48) [4].

    [1] https://bgpfilterguide.nlnog.net

    [2] https://www.routeviews.org/routeviews/

    [3] https://www.ripe.net/analyse/internet-measurements/routing-i...

    [4] https://hyperspecifics.io

    • zamadatix 20 hours ago

      Actually 4294967296 couldn't ever appear as the maximum value you can fit in the protocol field is 1 less than that... my problem here was I couldn't manage to keep the 2 numbers I was comparing (the one in the article and 2^32) straight haha! This was mistake was noted by a commenter here https://news.ycombinator.com/item?id=41963745

      That said you're ultimately right that my upstream provider is filtering the 4294901866 value from the article as well anyways for the reasons you stated.

      • maxmouchet 20 hours ago

        Ah right haha. Thanks for the heads up, I should have checked ^^

  • rayhaanj a day ago

    Also 4_294_901_931 is actually just small enough to fit in a u32 whose max value is 4_294_967_295

    • astrobe_ 21 hours ago

      Perhaps hex notation helps to see things better: 0xFFFF00AB. It's actually nearly 64K below the maximum. 0xAB is 171.

    • zamadatix 20 hours ago

      Ha, this is exactly right! It's 100% in u32 range and part of the private use reservations so wouldn't appear in the internet table unless such values aren't filtered by the provider and/or it's being used locally.

      Originally I went to write your exact comment because it seemed the value in the article should fit at a glance and then I must have done the comparison check backwards because I started pasting the 2^32 value in the rest of my comment concluded it was actually too large when really I had just jumbled things about.

      Thanks for setting my mind straight!

  • tialaramex a day ago

    The ip2asn-v4-u32.tsv file has

        996520704       996520959       4294901931      Unknown AS4294901931
    
        3523770368      3523770623      4294901931      Unknown AS4294901931
    
        3523771136      3523771391      4294901931      Unknown AS4294901931
hnenjoyer a day ago

Would be nice to see revisited SQLite solution.

Rest of the described approaches relies on data being in memory rather on disk - not sure whether it was microsd card, SSD over USB, etc. but there's in memory option for SQLite.

Having that in mind it's time to optimize storage - instead of text just store using two 64 bit integers and it was already noticed two indexes are not very helpful - just create one that has both columns included - preferable as primary key so you don't waste space and already have rest of the select data included in single lookup.

Happy to see revisited results, but still not sure if that's a best approach for your problem - but seems like obvious improvements for generic SQL approach.

moontear 21 hours ago

Interesting read - but how cool is Mess with DNS! What a nice live learning tool.

pjmlp 17 hours ago

A good example on how to use better data structures, instead of the usual rewrite into something else.

gmuslera a day ago

The trie approach seemed to be a bit plain. Tries are particularly good for storing netblocks, and see if a particular IP belongs to some block stored there. But is better to use binary values instead of strings, and to use patricia or radix trees as they might compress strings of bits without branches with a single node. And, anyway, postgresql already have the netblock type with efficient indexes for them.

But if that is used for individual IPs, without worrying about blocks they belong to, probably won't get big gains in that area.

Timber-6539 a day ago

Had the same issue with restic taking up a lock in the event of an unsuccessful operation. I have a very simple workaround for it.

ExecStartPre=/usr/bin/restic unlock

ExecStart=/usr/bin/restic backup

  • LtdJorge 21 hours ago

    That has the benefit of Systemd not letting more than one instance run at the same time, which I'm guessing is one of the reasons the lock was needed in the first place.

    You just have to make sure that the type of Service used is correct, so that Systemd can track whether Restic has actually stopped running.

immibis 17 hours ago

You need an array of 64-bit values: a 32-bit end address, and a 32-bit ASN. Use 0 for unallocated ranges - it's an invalid ASN.

For IPv6 you need more. It just barely fits if you consider that all public Internet addresses start with 001 and only up to 48 bits can be a published prefix - that's 45 bits - and the public ASNs go up to 19 bits according to your article - that's exactly 64 in total (those high ASNs are bad data from wherever you're getting it from). But it won't work once the next 100000 ASNs get assigned, so you'd better go up to 96 or 128 bits and store the ASN properly in 32 bits.

For cache efficiency you should have one table of addresses and another parallel table of ASNs. Then every access to the address table that pulls in a cache line doesn't waste half the cache line with ASNs you aren't looking for. It won't affect the total memory use, however.

A separate table would hold the name and country for each ASN.